Shore consensus

From SHORE wiki
Revision as of 16:54, 23 September 2011 by Felo80 (Talk | contribs)

Jump to: navigation, search

shore consensus is being replaced by shore qvar (but still a requirement for SHOREmap analysis).

The common output from whole genome re-sequencing projects are lists of all identified polymorphisms (e.g. SNPs, indels, CNVs) as well as reference-like positions. In addition a consensus sequence or contigs can be generated by combining all high quality predictions. shore consensus provides this functionality by sequentially scanning an alignment to gather all read information available at a specific locus (i.e. called bases, base qualities, coverage, repetitiveness, alignment quality). This information is subsequently used to predict differences to the reference sequence.

shore consensus can also be used to identify minor alleles (SNPs or short indels) in pooled samples. In addition shore consensus estimates several characteristics of a run ahead of the actual consensus calling. This includes min and max read length, min and max mismatches, sequencing depth, observed local repetitiveness and GC content bias. Consensus also provides multiple project statistics regarding sequencing error rate, correlation of quality values to observed errors and coverage biases due to local GC content, which can be used to optimize further analysis (e.g. deletions should not be called in low GC content regions if a strong GC bias is observed).

Note: shore consensus can also be applied to sRNA-seq, mRNA-seq and ChIP-seq data. However, SHORE provides more appropriate tools for those purposes (coverage and peak).


Command line options

Usage: shore consensus [OPTIONS]

Mandatory
-n STRING Name (any of species, strain, accession, project or any other ID)
-f STRING Reference genome sequence from the IndexFolder, *.shore file
-o STRING AnalysisFolder, will be created
-i STRING[,...] Shore directories or map.list file(s)
-g INT Core offset - do not trust the first and last -g positions of the alignment. default: max MM's

Short read alignments are prone to misalignments towards the ends of the read. This value specifies how many bases at each end will be regarded as suspicious.

Quality threshold
-q INT (Default: 5) Cutoff for base masking using Sanger calibrated qualities
-c INT Cutoff for base masking using chastity values
Basecalling (scoring matrix approach)
-a STRING Scoring matrix file (recommended, activates new basecalling approach)
-b FLOAT (Default: 0.2) Minimum allele frequency of alternative base call
Basecalling (decision tree approach)
-x INT (Default: 3) Minimum coverage threshold
-m INT (Default: 3) Maximum observed to expected coverage
-e FLOAT (Default: 0.1) Minimum observed to expected coverage
-y FLOAT (Default: 0.8) Minimum concordance of homozygous SNPs (0 to 1)
-d FLOAT (Default: 0.67) Minimum concordance of homozygous Indels
-t FLOAT (Default: 0.25) Minimum frequency for heterozygous pos (0 to 1)
-u FLOAT (Default: 0.02) Minimum frequency for minor allele pos (0 to 1)
-z INT (Default: 10) Quality threshold, max base quality
Optional
-R INT Allow base calling in highly repetitive regions
-s INT Consensus analysis using transcriptome (mRNA-seq) reads. Turns off CNV analysis
-S INT (Default: 0) Ignore position with transcriptome coverage not above threshold
-w INT Use graph based map.list format (only genomemapper)
-v INT Create additional output files containing all intermediate data (required for subsequent SHOREmap analysis)
-r INT Graphical output of statistics using R
-N INT Turn off calculation of long deletions, duplications and any other CNVs

SHORE consensus result files

Overview

Analyses of resequencing projects are performed by shore consensus. Multiple result files are produced for SNPs, indels, CNVs, reference like bases and other types of predictions. All of them can be found in the ConsensusAnalysis directory.

Currently shore consensus provides three types of consensus predictions:

  1. Decision tree based approach as described in Ossowski et al. Genome Research 2009
  2. Empirical scoring matrix approach to be described in a future publications.
  3. SNP prediction on transcriptomic or other quantitative data to be described in a future publication.

The output files of the second approach have quality_ as file name prefix. The third approach produces all files from the first two approaches but uses a slightly different scoring scheme. The following list gives an overview of the column types which can be found in any of the result files that are described in the next section. Note that not all columns apply to each of the predictions:

name Project name or name of sequenced sample
chr Chromosome identifier
pos Position within the chromosome
start First position of a prediction within the chromosome (for predictions longer than just one base pair, e.g. indels)
end Last position of a prediction
length Length
ref base Reference base
cons base Consensus base (i.e. SNP call). Heterozygous SNPs and SNPs from pooled samples are divided into major allele and minor allele
seq Sequence (i.e. inserted or deleted sequence)
quality Quality of a predicted feature (ranging from 0 to 40)
read type part of the reads that were used for prediction: repetitive/nonrepetitive and core/complete
support Number of reads supporting a predicted feature. For heterozygous SNPs and SNPs from pooled samples the sequenced bases are divided into major support and minor support
concordance Ratio of reads supporting a predicted feature to total coverage (excluding quality masked bases). Heterozygous SNPs and SNPs from pooled samples are divided into major concordance and minor concordance
max qual Highest base quality supporting a prediction.
avg hits Average number of alignments of all reads covering this genomic position.
repeat count Number of repetitive positions in the range of the prediction (i.e. long deletion)
ambi count Number of ambiguous positions in the range of the prediction (’N’).
exp cov Expected coverage at a locus defined by repeat analysis and GC content. (Average expected coverage, if the prediction describes a range rather than a single location.)
obs cov Observed coverage at a locus as defined by the read alignment. (Average expected coverage, if the prediction describes a range rather than a single location.)
obs/exp ratio Ratio of observed to expected coverage is used to identify CNVs or highly over-sampled regions (often seen for rDNA clusters) in the genome.
obs/exp ratio max The maximum of obs/exp ratio.
GC Maximal GC content within the given range.
cvp count Copy variable positions (variation between instances of repeats) are an indication of duplications or CNVs.

Prediction file formats

The following listing describes the different prediction files and their columns.

This description only reviews some of the major aspects to help to understand the files. It is important to note that all of them are predictions; especially CNVs and duplications are only indicating abnormalities from the observed mapping data compared to what would be expected under ideal sequencing circumstances. Moreover, CNV and duplication predictions have so far been implemented for single end data only, and should be carefully validated.

More information on the prediction algorithms can be found in Ossowski et al. Genome Research 2008.

Homozygous SNP calls (decision tree approach)

homozygous_snp.txt

All positions with a base call different to the reference. Base calls require a concordance of >= 80% and a support of at least three non-repetitive reads. Due to statistical sampling and sequencing biases, the accuracy of these calls is affected by heterozygous SNPs as well when sequencing heterozygous samples.

<name> <chr> <pos> <ref base> <cons base> <read type> <support> <concordance> <max qual> <avg hits>

Homozygous small indels calls (decision tree approach)

deletions.txt, insertions.txt

Deletions (length depends on the number of gaps allowed in the mapping process) called from the alignments. Parameters are identical to those from the homozygous SNP predictions.

<name> <chr> <start> <end> <length> <seq> <read type> <support> <concordance> <avg hits>

Heterozygous SNP and small indel calls (decision tree approach)

heterozygous_position.txt

All positions with at least 25% of the bases different to the majority call. This file includes minor alleles of indels.

<name> <chr> <pos> <ref base> <major allele> <major support> <major concordance> <minor allele> <minor support> <minor concordance> <avg hits>

SNP and small indel calls in pooled samples (decision tree approach)

minor_allele_position.txt

All positions with either a homozygous SNP/indel or a variant with a minor allele frequency of >= 2% are stored in this file. Due to the sequencing error of approximately 1% this can result in a high number of false positives.

The minimum minor allele frequency should be adjusted according to the number of individuals in the sample. As a rule of thumb it should be greater than (100 / num samples / 2), i.e >= 10% for 5 pooled samples. Note that in case of homozygous variants the minor allele is ’X’, meaning no minor allele found. Positions with homozygous reference calls are not stored at all to save space.

<name> <chr> <pos> <ref base> <major allele> <major support> <major concordance> <minor allele> <minor support> <minor concordance> <avg hits>

Homozygous reference calls (decision tree approach)

reference.txt

Reference like positions called from the alignments. Parameters are identical to those from the homozygous SNP prediction.

<name> <chr> <pos> <ref base> <cons base> <read type> <support> <concordance> <max qual> <avg hits>

Copy Variable Positions, CVPs

copy_variable_position.txt

A duplication in the sequenced sample will map to the same locus in the reference sequence as the origin where it was generated from. If there is a (slight) difference between the original position and the duplication, there will be positions which look like het calls. These positions are called CVPs. Positions (so-called CVPs) with two different bases due to mislocated alignments of repetitive sequences are an indication of duplications. CVPs are classified according to their expected repetitiveness. If a position is expected to be unique but contains different bases this can indicate a duplication of a former unique region.

<name> <chr> <pos> <ref base> <major allele> <major support> <major concordance> <minor allele> <minor support> <minor concordance> <avg hits>

CNV

CNV.txt

Copy number variation is predicted based on two major criterias. Requires strong skew between observed and expected coverage of at least 40bp of length and the existence of at least one CVP within this interval.

<name> <chr> <start> <end> <length> <cvp count> <obs cov> <exp cov>

Duplication

duplication.txt

Duplications are predicted similar to CNVs described above, however the reference sequence has to be mostly unique within the duplicated interval and the length has to be greater than 250bp. Thus duplication predictions are more reliable than CNV predictions.

<name> <chr> <start> <end> <length> <cvp count> <obs cov> <exp cov>

Unsequenced regions

unsequenced.txt, supplementary_data/unseq_cn.txt, supplementary_data/unseq_core.txt

Unsequenced regions are called, if a region of one or more base pairs is continuously uncovered by reads.

However this does not necessarily mean that this is a deletion. It can indicate long deletions, insertions, (highly) polymorphic regions or a bias in sequencing coverage. To account for biases, namely the GC content influencing coverage, we report the average and maximum GC content and the expected coverage within the unsequenced interval. In addition the number of ’N’ in the interval is reported to indicate if an unsequenced region is solely due to unreliable reference sequence. Other predictions (SNPs, small indels) in the vicinity of unsequenced regions are unreliable due to an increased probability of alignment issues. In addition we provide two files showing regions with absence of non-repetitive reads and absence of ’core reads’.

<name> <chr> <start> <end> <length> <ambi count> <GC> <GC max> <repeat count> <exp cov>

Oversampled regions

supplementary_data/oversampled.txt

Some highly repetitive genomic regions like rDNA or centromeric repeats are not annotated correcly in the reference sequence. Often the repeat is only represented by one copy. This leads to unexpectedly high coverage in these regions, indicating that several copies of the repeat mapped to the same location in the reference sequence. The total number of repeat instances can be estimated using the observed vs. expected coverage ratio. Other predictions (SNPs, small indels) in the vicinity of oversampled regions are unreliable due to an increased probability of erroneous alignments.

<name> <chr> <start> <end> <length> <ambi count> <GC> <GC max> <repeat count> <obs cov> <exp cov> <obs/exp cov ratio avg> <obs/exp cov ratio max>

SNPs and short indels predictions (scoring matrix approach)

quality_variant.txt

We have developed an empirical scoring matrix with 12 features describing e.g. the quality of reads, the quality of alignments and the likeliness of wrong calls due to other features in the vicinity of a position. These features are used to calculate a quality value for each position/call of the genome ranging from 0 to 40. The scoring matrices can be adjusted by the user, however this is not recommended. There are predefined matrices for different prediction types (e.g. homozygous, heterozygous or pooled samples).

<name> <chr> <position> <ref base> <cons base> <quality> <support> <concordance> <avg hits>


Reference positions predictions (scoring matrix approach)

quality_reference.txt

Same as above, just for reference like positions instead of SNPs and indels.

<name> <chr> <position> <ref base> <cons base> <quality> <support> <concordance> <avg hits>

Statistics

Running shore consensus with option -r creates a ConsensusStatistics directory. It contains the files run_statistics.pdf and GC_by_Cov_Statistics.pdf. These figures are useful to check for inconsistencies in the analysis.

Mismatches per Read

describes the fraction of reads which could be mapped without mismatches, with 1 mismatch, 2 mismatches and so on. Indels are counted as mismatches, an indel of length 2 will be counted as 2 mismatches. The black numbers above the orange spots are the absolute read counts, whereas the y-axis describe the percentage of reads. Mismatches can be created by sequence divergence and sequencing errors.

Number of hits in genome per read

This plot indicates the repetiveness of the reference sequence based on the number of alignments. It plots how often a certain number of alignments per read was observed. In this example the repeat has approx. 40 instances in the reference sequence.

Read Length Distribution

If read trimming was switched on in shore import, there will be reads of different lengths. In Figure 1 read trimming was performed on reads of length 42 down to a minimal length of 36.

Errors in read positions

plots the percentage of sequencing errors per read position. In the example of Figure 1 the number of read errors decays to the read end. This might be surprising, but is a result of the read trimming. Read trimming clips the last bases if these are of bad quality. Therefore if the last bases are not trimmed they are of at least moderate quality and less error prone. The red line indicates the support (how often a certain read position was counted), also a result of read trimming. Errors are assessed in uniquely mapped regions, where consensus calling is possible. Read bases which contradict with the consensus are reported as errors.

Errors in mapped reads at uniquely mapped positions, excluding ambiguous calls - Quality

plots the Sanger scaled Illumina quality value (of a called base) versus the observed percentage of errors. You could refer to this as the probability for an error at a given quality value. Absolute numbers of observations are plotted as dashed red line, while they are written numerical above the orange bars. Note that it depends on the choice of quality (adjusted with shore illumina2flat) whether this plot shows qraw (prb) or qcal. If the choice was qcal you will not see any values below 0. It is somewhat surprising that in the example of the first plot in Figure 2 base calls with a quality value below 0 seem to perform better than those with a quality value of 0. Though we have observed this pattern quite often.

The second plot on the second page shall help to adjust the quality cutoff for masking. Masking is a very effective way to reduce noise which affects variant/consensus calling. Though we have found that there is nothing like the optimal cutoff value for all runs, since the variance between Illumina data can be large. Therefore the cutoff for base masking by quality value might be re-adjusted. The plot can be read like this: check the green and orange bar at (let’s say) quality value (x-axis) equals 4. There you can see that the green bar is just above the second gray dashed line which indicates the 10% mark. This means setting the quality cutoff to 4 would have masked around 10% of correct bases. On the other hand, nearly 80% of the wrong bases would have been masked as well. This is indicated by the orange bar at quality value of 4. A much stricter quality cutoff would have been 32, which would have removed 80% of the correct data while it also would have removed 99% of the error prone bases. Obviously it is desirable to find a quality value (x-axis value) where the green bar is preferably small and the orange bar preferably large. It is a trade-off between strict masking (remove most of the errors) and weak masking (keeping the correct values). This decision depends on the application. Note, the erroneous and correct bases have been assessed on unique, well covered and reference like stretches of the sample.

Errors in mapped reads at uniquely mapped positions, excluding ambiguous calls - Chastity

The third page shows the same kind of plots as the second page, using this time the Chastity score as a base quality measure. The Chastity value was introduced by Illumina’s GAPipeline and is used to differentiate between clusters which are interfered by other clusters and those which are not. It is defined as the highest base intensity of a sequenced bases devided by the sum of the highest plus the second highest intensity of the specifc base. Therefore the Chastity score can reach values from 0.5 to 1, or in percent from 50 to 100. We find this value a useful support for the quality values. Again you can adjust the Chastity cutoff the same way as for the other two values.

Note, all these plots combine read 1 and read 2. It would be desirable to have them separately, and this will be implemented in one of the next releases of SHORE.

Coverage and GC content at uniquely mapped sites

After running shore consensus there will be a coverage analysis of various GC contents, found in GC by - Cov Statistics.pdf. The plot prints the average coverage (height of an orange bar) by local GC content (x-axis value). The local GC content is defined as the number of Gs and Cs divided by the number of As, Cs, Gs and Ts within a window which size is adjusted in shore preprocess. Ideally the average coverage is equal at all different GC contents, such a plot would have orange bars of the same height. This is not the case and it has been shown that average coverage scales with the GC content (for regions of low GC content) and has a reverse correlation for regions with high GC content. It is unclear where this comes from, and there has been Illumina data which is not influenced by such a bias at all, though correcting for the bias is a step which potentially can help in variant calling and is essential in calculating the expected coverage value. This value is essential for the dectection of CNVs. The red dashed line shows the support (occurrence) of a certain GC content window in the unique portion of the genome. Average coverage should be similar to the sequence depth.