Difference between revisions of "SHORE v0.7 file formats"

From SHORE wiki
Jump to: navigation, search
Line 1: Line 1:
 
Any output generated by SHORE will usually be written to various text files that contain a number of tab-delimited columns.
 
Any output generated by SHORE will usually be written to various text files that contain a number of tab-delimited columns.
  
Typing ''shore fmt'' will display a quick reference on many of SHORE’s file formats.
+
Typing ''shore fmt'' will display a quick reference for many of SHORE’s file formats.
  
 
= Read file format =
 
= Read file format =

Revision as of 11:54, 19 July 2011

Any output generated by SHORE will usually be written to various text files that contain a number of tab-delimited columns.

Typing shore fmt will display a quick reference for many of SHORE’s file formats.

Read file format

Read files can be found in the LengthDirectories or PEDirectories. These files are created by shore import and are called ’reads_0.fl’.

The tab delimited entries are:

id A unique identifier for the read or read pair
sequence DNA sequence
pe Flag, ’1’ or ’2’, first read or second read of a read pair. ’0’ for a single read.
Sanger quality values String of sanger calibrated quality values

Encoding: ASCII 33 ('!', quality 0) to ASCII 73 ('I', quality 40); extended range ASCII 93 (']', quality 60)

[Chastity values]

(Optional column)

String of illumina chastity values (defined as <math>Intensity(max)/ (Intensity(max) + Intensity(second))</math>).

Encoding: ASCII 40 ('(', chastity of 0.5) to ASCII 90 ('Z', chastity of 1.0)

Note: Earlier versions of SHORE used a read file format featuring three different quality types per entry. This file format is still supported for reading, but is no longer written.


Alignment file format

SHORE alignment files are typically called map.list, map.list.1 or map.list.2. They can be found in the LengthDirectories or PEDirectories. The tab delimited entries are:

chr id Each chromosome has an internal id, simply enumerated according to their occurrence within the reference sequence file, starting from 1. Translation to the native chromosome name can be found in the *.shore.trans file in the IndexFolder.
pos Left-most position of the alignment relative to the reference sequence.
alignment String representation of the read alignment. Matches are reported as a single character, mismatches or gaps as two characters surrounded by brackets. The first character represents the reference base, the second character the sequenced base. Deleted nucleotides are represented as ’-’.
read id A unique identifier for the read or read pair
strand ’D’ for forward and ’P’ for reverse hits (direct and palindromic, respectively).
mismatches The number of mismatches+gaps in the alignment.
hits The total number of genomic positions the read is aligned to.
read length Length of the read.
reserved Reserved field for future use, should be parsed as a string
pe flag Paired-end information
  0: single read
  1: first read of a pair
  2: second read of a pair
  3: first read of a pair (concordant mapping)
  4: first read of a pair (discordant mapping)
  5: first read of a pair (orphan read)
  6: second read of a pair (concordant mapping)
  7: second read of a pair (discordant mapping)
  8: second read of a pair (orphan read)
A library ID <math>L</math> may be encoded in the pe flag for the flags with value <math>>2</math>; in this case the flag is calculated as <math>pe\_flag = pe\_flag + (L * 6)</math>
Sanger quality values String of sanger calibrated quality values.

Encoding: ASCII 33 ('!', quality 0) to ASCII 73 ('I', quality 40); extended range ASCII 93 (']', quality 60)

[Chastity values]

(Optional column)

String of illumina chastity values defined as the highest intensity divided by the sum of the highest and the second highest intensity of a single base.

Encoding: ASCII 40 ('(', chastity of 0.5) to ASCII 90 ('Z', chastity of 1.0)

Note: Earlier versions of SHORE used an alignment file format featuring three different quality types per entry. This file format is still supported for reading, but is no longer written.

TODO: Outdated Information

Consensus result file formats

Analyses of resequencing projects are performed by shore consensus. Multiple re- sult files are produced for SNPs, indels, CNVs, reference like bases and other types of predictions. All of them can be found in the ConsensusAnalysis folder. Currently ’SHORE consensus’ provides three types of consensus predictions: 1. Decision tree based approach as described in Ossowski et al. Genome Research 2009 and 2. Empiri- cal scoring matrix approach to be described in a future publications. 3. SNP prediction on transcriptomic or other quantitative data to be described in a future publication. The output files of the second approach have ’quality ” as name suffix. The third approach produces all files from the first two approaches but uses a slightly different scoring scheme. The following list gives an overview of all column types which can be found in the result file. Note that not all columns apply to each of the predictions:

<name> Project name or name of sequenced sample
<chr> Chromosome identifier
<pos> Position within the chromosome
<start> First position of a prediction within the chromosome (for predictions longer than just one base pair, e.g. indels)
<end> Last position of a prediction
<length> Length
<ref base> Reference base
<cons base> Consensus base (i.e. SNP call). Heterozygous SNPs and SNPs from pooled samples are divided into ’ma- jor allele’ and ’minor allele’
<seq> Sequence (i.e. inserted or deleted sequence)
<quality> Quality of a predicted feature (ranging from 0 to 40)
<read type> part of the reads that were used for prediction: repet- itive/nonrepetitive and core/complete
<support> Number of reads supporting a predicted feature. For heterozygous SNPs and SNPs from pooled samples the sequenced bases are divided into ’major support’ and ’minor support’
<concordance> Ratio of reads supporting a predicted feature to to- tal coverage (excluding quality masked bases). Het- erozygous SNPs and SNPs from pooled samples are divided into ’major concordance’ and ’minor concor- dance’
<max qual> Highest base quality supporting a prediction.
<avg hits> Average number of alignments of all reads covering this genomic position. (see section ’Repeat analysis based on short read alignment’ for details)
<repeat count> Number of repetitive positions in the range of the prediction (i.e. long deletion)
Number of ambiguous positions in the range of the prediction (’N’).
<exp cov> Expected coverage at a locus defined by repeat analysis and GC content. (Average expected coverage, if the prediction describes a range rather than a single location.)
<obs cov> Observed coverage at a locus as defined by the read alignment. (Average expected coverage, if the pre- diction describes a range rather than a single loca- tion.)
<obs/exp ratio> Ratio of observed to expected coverage is used to identify CNVs or highly over-sampled regions (often seen for rDNA clusters) in the genome.
<obs/exp ratio max> The maximum of <obs/exp ratio>.
<GC> Maximal GC content within the given range.
<cvp count> Copy variable positions (variation between instances of repeats) are an indication of duplications or CNVs.


Prediction file formats

The following listing describes the different prediction files and their columns. This description only reviews some of the major aspects to help to understand the files. It is important to note that all of them are predictions. Especially CNVs and duplica- tions are only indicating abnormalities from the observed mapping data compared to what would be expected under ideal sequencing circumstances. Moreover, CNV and duplication predictions have so far been implemented for single end data only, and should be carefully validated. More information on the prediction algorithms can be found in Ossowski et al. Genome Research 2008.


Homozygous SNP calls (decision tree approach)

(homozygous_snp.txt) All positions with a base call different to the reference. Base calls require a concor- dance of >= 80% and a support of at least three non-repetitive reads. Due to statistical sampling and sequencing biases, the accuracy of these call is affected by heterozygous SNPs as well when sequencing heterozygous samples.

<name> <chr> <pos> <ref base> <cons base> <read type> <support> <concordance> <max qual> <avg hits>


Homozygous small indels calls (decision tree approach)

(deletions.txt , insertions.txt) Deletions (length depends on the number of gaps allowed in the mapping process) called from the alignments. Parameters are identical to those from t he homozygous SNP predictions.

<name> <chr> <start> <end> <length> <seq> <read type> <support> <concordance> <avg hits>


Heterozygous SNP and small indel calls (decision tree approach)

(heterozygous_call.txt) All positions with at least 25% of the bases different to the majority call. This file includes minor alleles of indels.

<name> <chr> <pos> <ref base> <major allele> <major support> <major concordance> <minor allele> <minor support> <minor concordance> <unique prb> <avg hits>


SNP and small indel calls in pooled samples (decision tree approach)

(minor_allele_call.txt) All positions with either a homozygous SNP/Indel or a variant with a minor allele fre- quency of >= 2% are stored in this file. Due to the sequencing error of approximately 1% this can result in a high number of false positives. The minimum minor allele frequency should be adjusted according to the number of individuals in the sample. As a rule of thumb it should be greater than (100 / num samples / 2), i.e >= 10% for 5 pooled samples. Note that in case of homozygous variants the minor allele is ’X’ meaning no minor allele found. Positions with homozygous reference calls are not stored at all to save space.

<name> <chr> <pos> <ref base> <major allele> <major support> <major concordance> <minor allele> <minor support> <minor concordance> <avg hits>


Homozygous reference calls (decision tree approach)

(reference.txt) Reference like positions called from the alignments. Parameters are identical to those from the homozygous SNP prediction.

<name> <chr> <pos> <ref base> <cons base> <support type> <support> <concordance> <max qual> <unique prb> <avg hits>


Copy Variable Positions, CVPs

(copy_variable_position.txt) A duplication in the sequenced sample will map to the same locus in the reference sequence as the origin where it was generated from. If there is a (slight) difference between the original position and the duplication, there will be positions which look like het calls. These positions are called CVPs. Positions (so-called CVPs) with two different bases due to mislocated alignments of repetitive sequences are an indication of duplications. CVPs are classified according to their expected repetitiveness. If a position is expected to be unique but contains different bases this can indicate a du- plication of a former unique region.

<name> <chr> <start> <end> <length> <cvp count> <obs cov> <exp cov>


CNV

(CNV.txt) Copy number variation is predicted based on two major criterias. Requires strong skew between observed and expected coverage of at least 40bp of length and the existence of at least one CVP within this interval.

<name> <chr> <start> <end> <length> <cvp count> <obs cov> <exp cov>


Duplication

(duplication.txt) Duplications are predicted similar to CNVs described above, however the reference sequence has to be mostly unique within the duplicated interval and the length has to be greater than 250bp. Thus duplication predictions are more reliable than CNV predictions.

<name> <chr> <start> <end> <length> <cvp count> <obs cov> <exp cov>


Unsequenced regions

(unsequenced.txt, supplementary_data/unseq_cn.txt, supplementary_data/uns (supplementary\_data/unseq\_core.txt) Unsequenced regions are called, if a region of one or more bp is continuously uncov- ered by reads. However this does not necessarily mean that this is a deletion. It can indicate long deletions, insertions, (highly) polymorphic regions or a bias in the se- quencing coverage. To account for biases, namely the GC content influencing coverage, we report the average and maximum GC content and the expected coverage within the unsequenced interval. In addition the number of ’N’ in the interval is reported to indicate if an unsequenced region is solely due to unreliable reference sequence. Other predictions (SNPs, small indels) in the vicinity of unsequenced regions are unreliable due to an increased probability of alignment issues. In addition we provide two files showing regions with absence of non-repetitive reads and absence of ’core reads’.

<name> <chr> <start> <end> <length> <ambi count> <GC> <GC max> <repeat count> <exp cov>


Oversampled regions

(supplementary_data/oversampled.txt) Some highly repetitive genomic regions like rDNA or centromeric repeats are not an- notated correcly in the reference sequence. Often the repeat is only represented by one copy. This leads to unexpectedly high covergae in this regions, indicating that several copies of the repeat mapped to the same reference sequence. The total amount of re- peat instances can be estimated using the observed vs. expected coverage ratio. Other predictions (SNPs, small indels) in the vicinity of oversampled regions are unreliable due to an increased probability of erroneous alignments.

<name> <chr> <start> <end> <length> <ambi count> <GC> <GC max> <repeat count> <obs cov> <exp cov> <obs/exp cov ratio avg> <obs/exp cov ratio max>


SNPs and short indels predictions (scoring matrix approach)

(quality_variant.txt) We have developed an empirical scoring matrix with 12 features describing e.g. the quality of reads, the quality of alignments and the likeliness of wrong calls due to other features in the vicinity of a position. These features are used to calculate a quality value for each position/call of the genome ranging from 0 to 40. The scoring matrices can be adjusted by the user, however this is not recommended. There are predefined matrices for different prediction types (e.g. homozygous, heterozygous or pooled samples).

<name> <chr> <position> <ref base> <cons base> <quality> <support> <concordance> <avg hits>


Reference positions predictions (scoring matrix approach)

(quality_reference.txt) Same as above, just for reference like positions instead of SNPs and indels.

<name> <chr> <position> <ref base> <cons base> <quality> <support> <concordance> <avg hits>


Statistics

Running shore consensus with option -r creates a ConsensusStatistics folder. It contains the files run statistics.pdf and GC by Cov Statistics (if R was installed cor- rectly). These figures are useful to check for inconsistencies in the analysis.

Mismatches per Read

describes the fraction of reads which could be mapped without mismatches, with 1 mismatch, 2 mismatches and so on. Indels are counted as mismatches, an indel of length 2 will be counted as 2 mismatches. The black numbers above the orange spots are the absolute read counts, whereas the y-axis describe the percentage of reads. Mismatches can be created by sequence divergence and sequencing errors.

Number of hits in genome per read

This plot indicates the repetiveness of the reference sequence based on the number of alignments. It plots how often a certain number of alignments per read was observed. In this example the repeat has approx. 40 instances in the reference sequence.

Read Length Distribution

If read trimming was switched on in shore illumina2flat, there will be reads of different length. In Figure 1 read trimming was performed on reads of length 42 down to a minimal length of 36.

=== Errors in read positions === plots the percentage of sequencing errors per read posi- tion. In the example of Figure 1 the number of read errors decays to the read end. This might be surprising, but is a result of the read trimming. Read trimming clips the last bases if these are of bad quality. Therefore if the last bases are not trimmed they are of at least moderate quality and less error prone. The red line indicates the support (how often a certain read position was counted), also a result of read trimming. Errors are assessed in uniquely mapped regions, where consensus calling is possible. Read bases which contradict with the consensus are reported as errors.

Errors in mapped reads at uniquely mapped positions, excluding ambiguous calls - Quality

plots the Sanger scaled Illumina quality value (of a called base) versus the observed percentage of errors. You could refer to this as the probability for an error at a given quality value. Absolute numbers of observations are plotted as dashed red line, while they are written numerical above the orange bars. Note that it depends on the choice of quality (adjusted with shore illumina2flat) whether this plot shows qraw (prb) or qcal. If the choice was qcal you will not see any values below 0. It is somewhat surprising that in the example of the first plot in Figure 2 base calls with a quality value below 0 seem to perform better than those with a quality value of 0. Though we have observed this pattern quite often. The second plot on the second page shall help to adjust the quality cutoff for masking. Masking is a very effective way to reduce noise which affects variant/consensus calling. Though we have found that there is nothing like the optimal cutoff value for all runs, since the variance between Illumina data can be large. Therefore the cutoff for base masking by quality value might be re-adjusted. The plot can be read like this: check the green and orange bar at (let’s say) quality value (x-axis) equals 4. There you can see that the green bar is just above the second gray dashed line which indicates the 10% mark. This means setting the quality cutoff to 4 would have masked around 10% of correct bases. On the other hand, nearly 80% of the wrong bases would have been masked as well. This is indicated by the orange bar at quality value of 4. A much stricter quality cutoff would have been 32, which would have removed 80% of the correct data while it also would have removed 99% of the error prone bases. Obviously it is desirable to find a quality value (x-axis value) where the green bar is preferably small and the orange bar preferably large. It is a trade-off between strict masking (remove most of the errors) and weak masking (keeping the correct values). This decision depends on the application. Note, the erroneous and correct bases have been assessed on unique, well covered and reference like stretches of the sample.

Errors in mapped reads at uniquely mapped positions, excluding ambiguous calls - Chastity

The third page shows the same kind of plots as the second page, using this time the Chastity score as a base quality measure. The Chastity value was introduced by Illumina’s GAPipeline and is used to differentiate between clusters which are interfered by other clusters and those which are not. It is defined as the highest base intensity of a sequenced bases devided by the sum of the highest plus the second highest intensity of the specifc base. Therefore the Chastity score can reach values from 0.5 to 1, or in percent from 50 to 100. We find this value a useful support for the quality values. Again you can adjust the Chastity cutoff the same way as for the other two values.

Note, all these plots combine read 1 and read 2. It would be desirable to have them separately, and this will be implemented in one of the next releases of SHORE.

Coverage and GC content at uniquely mapped sites

After running shore consensus there will be a coverage analysis of various GC contents, found in GC by - Cov Statistics.pdf. The plot prints the average coverage (height of an orange bar) by local GC content (x-axis value). The local GC content is defined as the number of Gs and Cs divided by the number of As, Cs, Gs and Ts within a window which size is adjusted in shore preprocess. Ideally the average coverage is equal at all different GC contents, such a plot would have orange bars of the same height. This is not the case and it has been shown that average coverage scales with the GC content (for regions of low GC content) and has a reverse correlation for regions with high GC content. It is unclear where this comes from, and there has been Illumina data which is not influenced by such a bias at all, though correcting for the bias is a step which potentially can help in variant calling and is essential in calculating the expected coverage value. This value is essential for the dectection of CNVs. The red dashed line shows the support (occurrence) of a certain GC content window in the unique portion of the genome. Average coverage should be similar to the sequence depth.


Quality values