SHORE v0.7 file formats

From SHORE wiki
Revision as of 09:13, 29 August 2011 by Felo80 (Talk | contribs)

Jump to: navigation, search

Any output generated by SHORE will usually be written to various text files that contain a number of tab-delimited columns.

Typing shore fmt will display a quick reference for many of SHORE’s file formats.

Read file format

Read files can be found in the LengthFolders or ReadFolders. These files are created by shore import and are named reads_0.fl.

The tab delimited entries are:

id A unique identifier for the read or read pair
sequence DNA sequence
pe Flag, ’1’ or ’2’, first read or second read of a read pair. ’0’ for a single-end read.
Sanger quality values String of sanger calibrated quality values

Encoding: ASCII 33 ('!', quality 0) to ASCII 73 ('I', quality 40); extended range ASCII 93 (']', quality 60)

[Chastity values]

(Optional column)

String of illumina chastity values (defined as <math>Intensity(max)/ (Intensity(max) + Intensity(second))</math>).

Encoding: ASCII 40 ('(', chastity of 0.5) to ASCII 90 ('Z', chastity of 1.0)

Note: Earlier versions of SHORE used a read file format featuring three different quality types per entry. This file format is still supported for reading, but is no longer written.

Alignment file format

SHORE alignment files are typically named map.list, map.list.1 or map.list.2. They are stored in the LengthFolders or ReadFolders. The tab delimited entries are:

chr id Each chromosome has an internal id, simply enumerated according to their occurrence within the reference sequence file, starting from 1. Translation to the native chromosome name can be found in the *.shore.trans file in the IndexFolder or in the ref.txt files created by shore mapflowcell.
pos Left-most position of the alignment relative to the forward strand of the reference sequence. The first position of a chromosome is 1.
alignment String representation of the read alignment. The sequence is always reported with respect to the forward strand of the reference, i.e. the sequence of reads matching to the reverse strand is reverse complemented.
  • Matches are reported as a single IUPAC character
  • Mismatches or gaps as two characters surrounded by brackets. The first character represents the reference base, the second character the sequenced base. Deleted nucleotides are represented as ’-’.
    • Examples: [CT] (mismatch), [-T] (insertion), [C-] (deletion)
    • Long deletions with respect to the reference may be reported as the character L followed by the size of the deletion, e.g. [L100]
  • Unaligned sequence ('soft clip') may be reported in angle brackets, e.g. <TTTTTT>

Extensions, not supported by all tools:

  • Consecutive stretches of the same operation (mismatch, insertion, deletion) may be abbreviated, e.g. [CTT|---] instead of [C-][T-][T-]
  • F can be used to indicate a mapped part of a fragment with known size, but unknown sequence, e.g. [F100]
read id A unique identifier for the read or read pair
strand D for forward and P for reverse hits (direct and palindromic, respectively).
mismatches The number of mismatches+gaps in the alignment.
hits The total number of genomic positions the read is aligned to.
read length Length of the read ('soft clipped' nucleotides excluded).
reserved Reserved field for future use, should be parsed as a string
pe flag Paired-end information
  • 0: single read
  • 1: first read of a pair
  • 2: second read of a pair
  • 3: first read of a pair (concordant mapping)
  • 4: first read of a pair (discordant mapping)
  • 5: first read of a pair (orphan read)
  • 6: second read of a pair (concordant mapping)
  • 7: second read of a pair (discordant mapping)
  • 8: second read of a pair (orphan read)

A library ID <math>L</math> may be encoded in the pe flag for the flags with value <math>>2</math>; in this case the flag is calculated as <math>pe\_flag = pe\_flag + (L * 6)</math>

Sanger quality values String of sanger calibrated quality values.

Encoding: ASCII 33 ('!', quality 0) to ASCII 73 ('I', quality 40); extended range ASCII 93 (']', quality 60)

[Chastity values]

(Optional column)

String of illumina chastity values defined as the highest intensity divided by the sum of the highest and the second highest intensity of a single base.

Encoding: ASCII 40 ('(', chastity of 0.5) to ASCII 90 ('Z', chastity of 1.0)

Note: Earlier versions of SHORE used an alignment file format featuring three different quality types per entry. This file format is still supported for reading, but is no longer written.

SHORE consensus result files

Overview

Analyses of resequencing projects are performed by shore consensus. Multiple result files are produced for SNPs, indels, CNVs, reference like bases and other types of predictions. All of them can be found in the ConsensusAnalysis directory.

Currently shore consensus provides three types of consensus predictions:

  1. Decision tree based approach as described in Ossowski et al. Genome Research 2009
  2. Empirical scoring matrix approach to be described in a future publications.
  3. SNP prediction on transcriptomic or other quantitative data to be described in a future publication.

The output files of the second approach have quality_ as file name prefix. The third approach produces all files from the first two approaches but uses a slightly different scoring scheme. The following list gives an overview of the column types which can be found in any of the result files that are described in the next section. Note that not all columns apply to each of the predictions:

name Project name or name of sequenced sample
chr Chromosome identifier
pos Position within the chromosome
start First position of a prediction within the chromosome (for predictions longer than just one base pair, e.g. indels)
end Last position of a prediction
length Length
ref base Reference base
cons base Consensus base (i.e. SNP call). Heterozygous SNPs and SNPs from pooled samples are divided into major allele and minor allele
seq Sequence (i.e. inserted or deleted sequence)
quality Quality of a predicted feature (ranging from 0 to 40)
read type part of the reads that were used for prediction: repetitive/nonrepetitive and core/complete
support Number of reads supporting a predicted feature. For heterozygous SNPs and SNPs from pooled samples the sequenced bases are divided into major support and minor support
concordance Ratio of reads supporting a predicted feature to total coverage (excluding quality masked bases). Heterozygous SNPs and SNPs from pooled samples are divided into major concordance and minor concordance
max qual Highest base quality supporting a prediction.
avg hits Average number of alignments of all reads covering this genomic position.
repeat count Number of repetitive positions in the range of the prediction (i.e. long deletion)
ambi count Number of ambiguous positions in the range of the prediction (’N’).
exp cov Expected coverage at a locus defined by repeat analysis and GC content. (Average expected coverage, if the prediction describes a range rather than a single location.)
obs cov Observed coverage at a locus as defined by the read alignment. (Average expected coverage, if the prediction describes a range rather than a single location.)
obs/exp ratio Ratio of observed to expected coverage is used to identify CNVs or highly over-sampled regions (often seen for rDNA clusters) in the genome.
obs/exp ratio max The maximum of obs/exp ratio.
GC Maximal GC content within the given range.
cvp count Copy variable positions (variation between instances of repeats) are an indication of duplications or CNVs.


Prediction file formats

The following listing describes the different prediction files and their columns.

This description only reviews some of the major aspects to help to understand the files. It is important to note that all of them are predictions; especially CNVs and duplications are only indicating abnormalities from the observed mapping data compared to what would be expected under ideal sequencing circumstances. Moreover, CNV and duplication predictions have so far been implemented for single end data only, and should be carefully validated.

More information on the prediction algorithms can be found in Ossowski et al. Genome Research 2008.

Homozygous SNP calls (decision tree approach)

homozygous_snp.txt

All positions with a base call different to the reference. Base calls require a concordance of >= 80% and a support of at least three non-repetitive reads. Due to statistical sampling and sequencing biases, the accuracy of these calls is affected by heterozygous SNPs as well when sequencing heterozygous samples.

<name> <chr> <pos> <ref base> <cons base> <read type> <support> <concordance> <max qual> <avg hits>

Homozygous small indels calls (decision tree approach)

deletions.txt, insertions.txt

Deletions (length depends on the number of gaps allowed in the mapping process) called from the alignments. Parameters are identical to those from the homozygous SNP predictions.

<name> <chr> <start> <end> <length> <seq> <read type> <support> <concordance> <avg hits>

Heterozygous SNP and small indel calls (decision tree approach)

heterozygous_position.txt

All positions with at least 25% of the bases different to the majority call. This file includes minor alleles of indels.

<name> <chr> <pos> <ref base> <major allele> <major support> <major concordance> <minor allele> <minor support> <minor concordance> <avg hits>

SNP and small indel calls in pooled samples (decision tree approach)

minor_allele_position.txt

All positions with either a homozygous SNP/indel or a variant with a minor allele frequency of >= 2% are stored in this file. Due to the sequencing error of approximately 1% this can result in a high number of false positives.

The minimum minor allele frequency should be adjusted according to the number of individuals in the sample. As a rule of thumb it should be greater than (100 / num samples / 2), i.e >= 10% for 5 pooled samples. Note that in case of homozygous variants the minor allele is ’X’, meaning no minor allele found. Positions with homozygous reference calls are not stored at all to save space.

<name> <chr> <pos> <ref base> <major allele> <major support> <major concordance> <minor allele> <minor support> <minor concordance> <avg hits>

Homozygous reference calls (decision tree approach)

reference.txt

Reference like positions called from the alignments. Parameters are identical to those from the homozygous SNP prediction.

<name> <chr> <pos> <ref base> <cons base> <read type> <support> <concordance> <max qual> <avg hits>

Copy Variable Positions, CVPs

copy_variable_position.txt

A duplication in the sequenced sample will map to the same locus in the reference sequence as the origin where it was generated from. If there is a (slight) difference between the original position and the duplication, there will be positions which look like het calls. These positions are called CVPs. Positions (so-called CVPs) with two different bases due to mislocated alignments of repetitive sequences are an indication of duplications. CVPs are classified according to their expected repetitiveness. If a position is expected to be unique but contains different bases this can indicate a duplication of a former unique region.

<name> <chr> <pos> <ref base> <major allele> <major support> <major concordance> <minor allele> <minor support> <minor concordance> <avg hits>

CNV

CNV.txt

Copy number variation is predicted based on two major criterias. Requires strong skew between observed and expected coverage of at least 40bp of length and the existence of at least one CVP within this interval.

<name> <chr> <start> <end> <length> <cvp count> <obs cov> <exp cov>

Duplication

duplication.txt

Duplications are predicted similar to CNVs described above, however the reference sequence has to be mostly unique within the duplicated interval and the length has to be greater than 250bp. Thus duplication predictions are more reliable than CNV predictions.

<name> <chr> <start> <end> <length> <cvp count> <obs cov> <exp cov>

Unsequenced regions

unsequenced.txt, supplementary_data/unseq_cn.txt, supplementary_data/unseq_core.txt

Unsequenced regions are called, if a region of one or more base pairs is continuously uncovered by reads.

However this does not necessarily mean that this is a deletion. It can indicate long deletions, insertions, (highly) polymorphic regions or a bias in sequencing coverage. To account for biases, namely the GC content influencing coverage, we report the average and maximum GC content and the expected coverage within the unsequenced interval. In addition the number of ’N’ in the interval is reported to indicate if an unsequenced region is solely due to unreliable reference sequence. Other predictions (SNPs, small indels) in the vicinity of unsequenced regions are unreliable due to an increased probability of alignment issues. In addition we provide two files showing regions with absence of non-repetitive reads and absence of ’core reads’.

<name> <chr> <start> <end> <length> <ambi count> <GC> <GC max> <repeat count> <exp cov>

Oversampled regions

supplementary_data/oversampled.txt

Some highly repetitive genomic regions like rDNA or centromeric repeats are not annotated correcly in the reference sequence. Often the repeat is only represented by one copy. This leads to unexpectedly high coverage in these regions, indicating that several copies of the repeat mapped to the same location in the reference sequence. The total number of repeat instances can be estimated using the observed vs. expected coverage ratio. Other predictions (SNPs, small indels) in the vicinity of oversampled regions are unreliable due to an increased probability of erroneous alignments.

<name> <chr> <start> <end> <length> <ambi count> <GC> <GC max> <repeat count> <obs cov> <exp cov> <obs/exp cov ratio avg> <obs/exp cov ratio max>

SNPs and short indels predictions (scoring matrix approach)

quality_variant.txt

We have developed an empirical scoring matrix with 12 features describing e.g. the quality of reads, the quality of alignments and the likeliness of wrong calls due to other features in the vicinity of a position. These features are used to calculate a quality value for each position/call of the genome ranging from 0 to 40. The scoring matrices can be adjusted by the user, however this is not recommended. There are predefined matrices for different prediction types (e.g. homozygous, heterozygous or pooled samples).

<name> <chr> <position> <ref base> <cons base> <quality> <support> <concordance> <avg hits>


Reference positions predictions (scoring matrix approach)

quality_reference.txt

Same as above, just for reference like positions instead of SNPs and indels.

<name> <chr> <position> <ref base> <cons base> <quality> <support> <concordance> <avg hits>

Statistics

Running shore consensus with option -r creates a ConsensusStatistics directory. It contains the files run_statistics.pdf and GC_by_Cov_Statistics.pdf. These figures are useful to check for inconsistencies in the analysis.

Mismatches per Read

describes the fraction of reads which could be mapped without mismatches, with 1 mismatch, 2 mismatches and so on. Indels are counted as mismatches, an indel of length 2 will be counted as 2 mismatches. The black numbers above the orange spots are the absolute read counts, whereas the y-axis describe the percentage of reads. Mismatches can be created by sequence divergence and sequencing errors.

Number of hits in genome per read

This plot indicates the repetiveness of the reference sequence based on the number of alignments. It plots how often a certain number of alignments per read was observed. In this example the repeat has approx. 40 instances in the reference sequence.

Read Length Distribution

If read trimming was switched on in shore import, there will be reads of different lengths. In Figure 1 read trimming was performed on reads of length 42 down to a minimal length of 36.

Errors in read positions

plots the percentage of sequencing errors per read position. In the example of Figure 1 the number of read errors decays to the read end. This might be surprising, but is a result of the read trimming. Read trimming clips the last bases if these are of bad quality. Therefore if the last bases are not trimmed they are of at least moderate quality and less error prone. The red line indicates the support (how often a certain read position was counted), also a result of read trimming. Errors are assessed in uniquely mapped regions, where consensus calling is possible. Read bases which contradict with the consensus are reported as errors.

Errors in mapped reads at uniquely mapped positions, excluding ambiguous calls - Quality

plots the Sanger scaled Illumina quality value (of a called base) versus the observed percentage of errors. You could refer to this as the probability for an error at a given quality value. Absolute numbers of observations are plotted as dashed red line, while they are written numerical above the orange bars. Note that it depends on the choice of quality (adjusted with shore illumina2flat) whether this plot shows qraw (prb) or qcal. If the choice was qcal you will not see any values below 0. It is somewhat surprising that in the example of the first plot in Figure 2 base calls with a quality value below 0 seem to perform better than those with a quality value of 0. Though we have observed this pattern quite often.

The second plot on the second page shall help to adjust the quality cutoff for masking. Masking is a very effective way to reduce noise which affects variant/consensus calling. Though we have found that there is nothing like the optimal cutoff value for all runs, since the variance between Illumina data can be large. Therefore the cutoff for base masking by quality value might be re-adjusted. The plot can be read like this: check the green and orange bar at (let’s say) quality value (x-axis) equals 4. There you can see that the green bar is just above the second gray dashed line which indicates the 10% mark. This means setting the quality cutoff to 4 would have masked around 10% of correct bases. On the other hand, nearly 80% of the wrong bases would have been masked as well. This is indicated by the orange bar at quality value of 4. A much stricter quality cutoff would have been 32, which would have removed 80% of the correct data while it also would have removed 99% of the error prone bases. Obviously it is desirable to find a quality value (x-axis value) where the green bar is preferably small and the orange bar preferably large. It is a trade-off between strict masking (remove most of the errors) and weak masking (keeping the correct values). This decision depends on the application. Note, the erroneous and correct bases have been assessed on unique, well covered and reference like stretches of the sample.

Errors in mapped reads at uniquely mapped positions, excluding ambiguous calls - Chastity

The third page shows the same kind of plots as the second page, using this time the Chastity score as a base quality measure. The Chastity value was introduced by Illumina’s GAPipeline and is used to differentiate between clusters which are interfered by other clusters and those which are not. It is defined as the highest base intensity of a sequenced bases devided by the sum of the highest plus the second highest intensity of the specifc base. Therefore the Chastity score can reach values from 0.5 to 1, or in percent from 50 to 100. We find this value a useful support for the quality values. Again you can adjust the Chastity cutoff the same way as for the other two values.

Note, all these plots combine read 1 and read 2. It would be desirable to have them separately, and this will be implemented in one of the next releases of SHORE.

Coverage and GC content at uniquely mapped sites

After running shore consensus there will be a coverage analysis of various GC contents, found in GC by - Cov Statistics.pdf. The plot prints the average coverage (height of an orange bar) by local GC content (x-axis value). The local GC content is defined as the number of Gs and Cs divided by the number of As, Cs, Gs and Ts within a window which size is adjusted in shore preprocess. Ideally the average coverage is equal at all different GC contents, such a plot would have orange bars of the same height. This is not the case and it has been shown that average coverage scales with the GC content (for regions of low GC content) and has a reverse correlation for regions with high GC content. It is unclear where this comes from, and there has been Illumina data which is not influenced by such a bias at all, though correcting for the bias is a step which potentially can help in variant calling and is essential in calculating the expected coverage value. This value is essential for the dectection of CNVs. The red dashed line shows the support (occurrence) of a certain GC content window in the unique portion of the genome. Average coverage should be similar to the sequence depth.

SHORE peak result files

The main result file produced by shore peak is named SUMMARY.txt:

id An arbitrary numerical ID for the peak region
chr Sequence / chromosome ID
pos Left-most position of the peak region on the reference sequence
size Size of the peak region
p_rank1 Replicate 1 rank of the P-value of the peak
fdr_bh_q1 Replicate 1 Benjamini-Hochberg adjusted FDR of the peak
rc_chip1 Replicate 1 number of reads contributing to the peak in the sample
rc_ctrl1 Replicate 1 number of reads in the same region of the control
pbexcess1 Replicate 1 per-base-excess: mean_coverage_sample(peak) - (mean_coverage_control(peak) * normalization_constant)
fc_score1 Replicate 1 fold change score: 4 * atan(mean_coverage_sample(peak) / mean_coverage_control(peak) * normalization_constant) / PI - 1.0
height_excess1 Replicate 1 peak height excess:
frfc_score1 Replicate 1 forward-reverse fold change score: Calculated like fc_score, but compares the sample forward strand and reverse strand coverage
cog_xshift1 Replicate 1 forward-reverse peak shift
overlap_names Identifiers of the genes overlapping with the center of the peak region (only preset when the option -a was specified)
overlap_types Parts of the genes that overlap (exon, 5' UTR etc.) (only present when the option -a was specified)
up_names Identifiers of the closest genes 'to the left' from the center of the peak (only present when the option -a was specified)
up_dist Distance of the closest genes 'to the left' from the center of the peak (only present when the option -a was specified)
up_strands Strands of the closest genes 'to the left' from the center of the peak (only present when the option -a was specified)
down_names Identifiers of the closest genes 'to the right' from the center of the peak (only present when the option -a was specified)
down_dist Distance of the closest genes 'to the right' from the center of the peak (only present when the option -a was specified)
down_strands Strands of the closest genes 'to the right' from the center of the peak (only present when the option -a was specified)