Difference between revisions of "SHORE Overview"

From SHORE wiki

Revision as of 15:27, 8 April 2011

What is SHORE and what is SHORE for?

SHORE is a data analysis and management application for short DNA/RNA reads produced by the Illumina Genome Analyzer or Life Technologies SOLiD platforms. It is developed at the Max Planck Institute for Developmental Biology, Tübingen, Germany. SHORE is designed to support different sequencing applications including genomic re-sequencing, ChIP-Seq, mRNA-Seq, sRNA-Seq and BS-Seq. The SHORE pipeline comprises all necessary steps for sequence analysis, including quality filtering of raw reads, adapter clipping for sRNA-Seq, mapping of reads against a reference genome, repeat analysis, correction for read pair information, and quantitative analysis of ChIP or transcriptome data. Several file format converters ensure the compatibility of SHORE with other widely used file formats like fastq, qual, sam, bed and gff. SHORE was developed for applications in Arabidopsis thaliana but has been successfully used with other genomes, including human, D. melanogaster, C. elegans, maize and several bacterial genomes.


Overview of SHORE’s data structure

SHORE has a predefined structure of folders and files. One folder, referred to as IndexFolder, stores all information about the reference sequence including the indices (created and used by mapping tools), GC content and low complexity sequence analysis. A second folder, referred to as ProjectFolder, contains all information about a certain sequencing project.

The IndexFolder is kept separate from the ProjectFolder, because one IndexFolder can be used for multiple ProjectFolders, e.g., different re-sequencing projects (ProjectFolders) can refer to the same reference sequence (IndexFolder). The IndexFolder is populated using shore preprocess. It will store a fasta file named <original fasta file>.shore. This file name is used as prefix to find all index files needed by SHORE. It is not necessary to keep the original fasta file.

The ProjectFolder has a predefined folder structure. All SHORE programs require this structure, though this should not be too confusing, as all folders are created automatically by SHORE. All files in the ProjectFolder are ASCII formatted files and can be viewed, moved and manipulated. For example, it is possible to use SHORE just for read mapping and then convert the SHORE alignments into a different file format (e.g. SAM for SNP detection with SAMtools).

The ProjectFolder and the read data

Folder structure of the ProjectFolder for read storage:

ProjectFolder/FlowcellFolder/LaneFolder/ReadFolder/LengthFolder
ProjectFolder/ This is the home folder for a sequencing project.
FlowcellFolder/ A project starts with the sequencing of one or more flowcells. Each flowcell is stored in a separate folder in the ProjectFolder, the so-called FlowcellFolders.
LaneFolder/ Each flowcell consists of one to eight lanes, each of them represented as a single folder, named 1-8. An additional folder called bad quality will be created containing all reads which did not pass the quality filter.
ReadFolder/ The read folders separate the reads of a pair in case of paired-end or mate-pair data. Folder "1" holds the reads from the first sequencing run and folder "2" the reads from the second. Folder "single" holds reads from either run which lost their read partner in the quality filtering process. In a single-end run, all reads passing the quality filter are stored in "single".
LengthFolder/ SHORE can trim reads before mapping according to their base quality at the read ends, and therefore reads can be of different lengths. Each length has its own sub-folder whose name is equal to the length, prefixed by "length_".

For example, a folder could be named like:

phiX/EAS67_0018_FC12398AAXX/5/1/length_80
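The folder levels described above can be read back from such a path. The following is a minimal sketch, not part of SHORE: the names ReadLocation and parse_read_path are made up for illustration, and it assumes the path is given relative to the parent of the ProjectFolder and has exactly five levels.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass
class ReadLocation:
    """One LengthFolder inside a ProjectFolder (hypothetical helper type)."""
    project: str
    flowcell: str
    lane: str
    read: str      # "1", "2" or "single"
    length: int    # read length encoded in the LengthFolder name


def parse_read_path(path: str) -> ReadLocation:
    """Split ProjectFolder/FlowcellFolder/LaneFolder/ReadFolder/LengthFolder."""
    parts = PurePosixPath(path).parts
    if len(parts) != 5:
        raise ValueError(f"expected 5 folder levels, got {len(parts)}: {path}")
    project, flowcell, lane, read, length_dir = parts
    prefix = "length_"
    if not length_dir.startswith(prefix):
        raise ValueError(f"LengthFolder must start with {prefix!r}: {length_dir}")
    return ReadLocation(project, flowcell, lane, read, int(length_dir[len(prefix):]))


loc = parse_read_path("phiX/EAS67_0018_FC12398AAXX/5/1/length_80")
print(loc.flowcell, loc.lane, loc.read, loc.length)
```

The lane is kept as a string because the LaneFolder level can also hold a folder that is not a lane number.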


The ProjectFolder and the alignment data

After the reads are mapped, the resulting alignments (mappings) can be merged into the so-called AlignmentFolder. This folder stores all information and results of the alignment analysis:

ProjectFolder/AlignmentFolder/AnalysisFolder/ResultFolder
ProjectFolder/ This is the home folder for a sequencing project.
AlignmentFolder/ The name of the folder is user specified. It hosts the merged results of the read mapping, which are used as input for shore consensus, shore coverage and shore peak. Merging the alignments introduces redundancy: the alignments are stored twice, so it is recommended to delete the merged file after it has been analyzed.
AnalysisFolder/ Will be created by shore consensus, shore coverage and shore peak. Several AnalysisFolders can be created to store data from different analysis runs.
ResultFolder/ AnalysisFolders have two different subfolders, the ResultFolders. In case of 'shore consensus' the two ResultFolders are 'ConsensusAnalysis', which stores the alignment analysis results (e.g. SNP calls, peaks ...), and 'ConsensusStatistics', which stores general project statistics, like read quality.


It is possible to store the AlignmentFolder or the AnalysisFolders in a different location than the ProjectFolder.


SHORE’s file formats

Any output generated by SHORE will usually be written to various text files that contain a number of tab-delimited columns. Typing shore fmt will display a quick reference on SHORE’s file formats.

Read file format

Read files can be found in the LengthFolders in the ProjectFolder. They are called 'reads_0.fl' and are created by shore import. The tab-delimited entries are:


<id>

The ID is built up of the run id (4 characters), lane (1 character), tile (3 characters), x value within the image (5 characters) and y value (5 characters). In contrast to the GAPipeline, the numbers are concatenated and the run id is added at the beginning. The run id makes reads unique between flowcells within one project. Run ids must not start with zero.

<sequence>

DNA sequence

<pe>

Flag, ’1’ or ’2’, first read or second read of a read pair. ’0’ for a single read.

<Sanger quality values>

Sanger calibrated quality values described in a later section

[<Chastity values>]

(Optional column) Illumina chastity values defined as Intensity(max) / (Intensity(max) + Intensity(second)).

Note: Earlier versions of SHORE used a read file format featuring three different quality types per entry. This file format is still supported for reading, but is no longer written.
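As an illustration of the read file layout, the sketch below splits one tab-delimited read entry and decodes the ID. It is not SHORE code: the function names and the example ID are invented, and it assumes that IDs always have the fixed field widths listed above (4 + 1 + 3 + 5 + 5 = 18 characters).

```python
from typing import NamedTuple, Optional


class ReadId(NamedTuple):
    run: str   # 4 characters, must not start with zero
    lane: int  # 1 character
    tile: int  # 3 characters
    x: int     # 5 characters
    y: int     # 5 characters


def parse_read_id(read_id: str) -> ReadId:
    """Decode a concatenated read ID, assuming the fixed widths given in the text."""
    if len(read_id) != 18:
        raise ValueError(f"expected 18 characters, got {len(read_id)}")
    return ReadId(
        run=read_id[0:4],
        lane=int(read_id[4]),
        tile=int(read_id[5:8]),
        x=int(read_id[8:13]),
        y=int(read_id[13:18]),
    )


def parse_read_line(line: str) -> dict:
    """Split one read entry: id, sequence, pe flag, qualities, optional chastity."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) not in (4, 5):
        raise ValueError(f"expected 4 or 5 columns, got {len(fields)}")
    chastity: Optional[str] = fields[4] if len(fields) == 5 else None
    return {
        "id": parse_read_id(fields[0]),
        "sequence": fields[1],
        "pe": int(fields[2]),
        "quality": fields[3],
        "chastity": chastity,
    }


# Hypothetical entry, made up for this example.
record = parse_read_line("123450670123400987\tACGT\t0\tIIII\n")
print(record["id"])
```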


Alignment file format

SHORE alignment files are typically called map.list, map.list.1 or map.list.2. They can be found in the LengthFolders or, for paired-end data, in the ReadFolders after correcting for paired-end information (shore correct4pe). The tab-delimited entries are:

<chr id>

Each chromosome has an internal id. Chromosomes are simply numbered in order of their occurrence within the reference sequence file, starting from 1. The translation to the native chromosome name can be found in the *.shore.trans file in the IndexFolder.

<pos>

Left-most position of the mapping relative to the reference sequence.

<alignment>

Matches are reported as a single character, mismatches are represented by two characters surrounded by brackets. The first character represents the reference base, the second character the sequenced base. Deletions are represented as '-'.

<read id>

See ’Read file format’ section.

<strand>

’D’ for forward and ’P’ for reverse hits (direct and palindromic, respectively).

<mismatches>

The number of mismatches in the alignment.

<hits>

The total number of genomic positions the read is aligned to.

<read length>

Length of the read.

<offset>

Number of bases at the beginning of a read that have not been aligned.

<pe flag>

1 or 2 for paired reads, 0 for singletons. Other values describe the state of paired reads after correcting for pe information.

<Sanger quality values>

Sanger calibrated quality values described in a later section.

[<Chastity values>]

(Optional column) Illumina chastity values defined as the highest intensity divided by the sum of the highest and the second highest intensity of a single base.

Note: Earlier versions of SHORE used an alignment file format featuring three different quality types per entry. This file format is still supported for reading, but is no longer written.
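The alignment string of the <alignment> column can be expanded into a reference string and a read string. The following sketch is an illustration only: the function names are invented, and it assumes that gaps appear inside brackets (for example "[A-]" for a reference base missing from the read), which the description above does not spell out.

```python
import re

# Either a bracketed pair "[<reference><read>]" or a single matching character.
TOKEN = re.compile(r"\[(.)(.)\]|([^\[\]])")


def split_alignment(alignment: str) -> tuple:
    """Return (reference, read) strings encoded by a SHORE-style alignment string."""
    ref, read = [], []
    pos = 0
    for match in TOKEN.finditer(alignment):
        if match.start() != pos:  # something was skipped, so the string is malformed
            raise ValueError(f"unexpected character at offset {pos}")
        pos = match.end()
        if match.group(3) is not None:   # plain match: same base on both sides
            ref.append(match.group(3))
            read.append(match.group(3))
        else:                            # bracketed pair: reference base, read base
            ref.append(match.group(1))
            read.append(match.group(2))
    if pos != len(alignment):
        raise ValueError(f"unexpected character at offset {pos}")
    return "".join(ref), "".join(read)


def count_differences(alignment: str) -> int:
    """Count bracketed columns in which reference and read differ."""
    ref, read = split_alignment(alignment)
    return sum(1 for r, s in zip(ref, read) if r != s)


print(split_alignment("AC[GT]TA"))
print(count_differences("AC[GT]TA"))
```

Whether the <mismatches> column counts gap columns in the same way is not stated in the text, so count_differences should not be taken as a reimplementation of that column.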


Consensus result file formats

Analyses of resequencing projects are performed by shore consensus. Multiple result files are produced for SNPs, indels, CNVs, reference-like bases and other types of predictions. All of them can be found in the ConsensusAnalysis folder. Currently shore consensus provides three types of consensus predictions:

1. A decision tree based approach as described in Ossowski et al. Genome Research 2008.
2. An empirical scoring matrix approach, to be described in a future publication.
3. SNP prediction on transcriptomic or other quantitative data, to be described in a future publication.

The output files of the second approach have 'quality_' as name prefix. The third approach produces all files from the first two approaches but uses a slightly different scoring scheme. The following list gives an overview of all column types which can be found in the result files. Note that not all columns apply to each of the predictions:

<name> Project name or name of sequenced sample.
<chr> Chromosome identifier.
<pos> Position within the chromosome.
<start> First position of a prediction within the chromosome (for predictions longer than just one base pair, e.g. indels).
<end> Last position of a prediction.
<length> Length.
<ref base> Reference base.
<cons base> Consensus base (i.e. SNP call). Heterozygous SNPs and SNPs from pooled samples are divided into 'major allele' and 'minor allele'.
<seq> Sequence (i.e. inserted or deleted sequence).
<quality> Quality of a predicted feature (ranging from 0 to 40).
<read type> Part of the reads that were used for prediction: repetitive/nonrepetitive and core/complete.
<support> Number of reads supporting a predicted feature. For heterozygous SNPs and SNPs from pooled samples the sequenced bases are divided into 'major support' and 'minor support'.
<concordance> Ratio of reads supporting a predicted feature to total coverage (excluding quality masked bases). Heterozygous SNPs and SNPs from pooled samples are divided into 'major concordance' and 'minor concordance'.
<max qual> Highest base quality supporting a prediction.
<avg hits> Average number of alignments of all reads covering this genomic position (see section 'Repeat analysis based on short read alignment' for details).
<repeat count> Number of repetitive positions in the range of the prediction (e.g. long deletion).
<ambi count> Number of ambiguous positions in the range of the prediction ('N').
<exp cov> Expected coverage at a locus defined by repeat analysis and GC content. (Average expected coverage, if the prediction describes a range rather than a single location.)
<obs cov> Observed coverage at a locus as defined by the read alignment. (Average observed coverage, if the prediction describes a range rather than a single location.)
<obs/exp ratio> Ratio of observed to expected coverage. It is used to identify CNVs or highly oversampled regions (often seen for rDNA clusters) in the genome.
<obs/exp ratio max> The maximum of <obs/exp ratio>.
<GC> Average GC content within the given range.
<GC max> Maximal GC content within the given range.
<cvp count> Number of copy variable positions (variation between instances of repeats), which are an indication of duplications or CNVs.


Prediction file formats

The following listing describes the different prediction files and their columns. This description only reviews some of the major aspects to help in understanding the files. It is important to note that all of them are predictions. CNVs and duplications in particular only indicate deviations of the observed mapping data from what would be expected under ideal sequencing circumstances. Moreover, CNV and duplication predictions have so far been implemented for single-end data only, and should be carefully validated. More information on the prediction algorithms can be found in Ossowski et al. Genome Research 2008.


Homozygous SNP calls (decision tree approach)

(homozygous_snp.txt) All positions with a base call different from the reference. Base calls require a concordance of >= 80% and a support of at least three non-repetitive reads. Due to statistical sampling and sequencing biases, the accuracy of these calls is also affected by heterozygous SNPs when sequencing heterozygous samples.

<name> <chr> <pos> <ref base> <cons base> <read type> <support> <concordance> <max qual> <avg hits>
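The two thresholds named above (concordance >= 80%, support of at least three reads) can be re-applied to the file, for example to tighten them. This is a minimal sketch under several assumptions: the file has no header line, <concordance> is stored as a fraction between 0 and 1, <support> is an integer, and the sample rows (including the read type value) are invented for illustration.

```python
import csv
import io

# Column order as listed for homozygous_snp.txt.
SNP_COLUMNS = [
    "name", "chr", "pos", "ref_base", "cons_base",
    "read_type", "support", "concordance", "max_qual", "avg_hits",
]


def read_snp_calls(handle):
    """Yield one dict per tab-delimited row, converting the numeric columns used below."""
    reader = csv.DictReader(handle, fieldnames=SNP_COLUMNS, delimiter="\t")
    for row in reader:
        row["pos"] = int(row["pos"])
        row["support"] = int(row["support"])
        row["concordance"] = float(row["concordance"])
        yield row


def passes_thresholds(row, min_support=3, min_concordance=0.8):
    """Apply the support and concordance cut-offs described in the text."""
    return row["support"] >= min_support and row["concordance"] >= min_concordance


# Hypothetical rows, not real SHORE output.
SAMPLE = (
    "sample1\t1\t1042\tA\tG\tnonrep\t12\t0.92\t40\t1.0\n"
    "sample1\t1\t2077\tC\tT\tnonrep\t2\t1.00\t38\t1.0\n"
)

kept = [row for row in read_snp_calls(io.StringIO(SAMPLE)) if passes_thresholds(row)]
print([row["pos"] for row in kept])
```

For a real file, the io.StringIO object would be replaced by an open file handle.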


Homozygous small indels calls (decision tree approach)

(deletions.txt, insertions.txt) Deletions and insertions (length depends on the number of gaps allowed in the mapping process) called from the alignments. Parameters are identical to those of the homozygous SNP predictions.

<name> <chr> <start> <end> <length> <seq> <read type> <support> <concordance> <avg hits>


Heterozygous SNP and small indel calls (decision tree approach)

(heterozygous_call.txt) All positions with at least 25% of the bases different from the majority call. This file includes minor alleles of indels.

<name> <chr> <pos> <ref base> <major allele> <major support> <major concordance> <minor allele> <minor support> <minor concordance> <unique prb> <avg hits>


SNP and small indel calls in pooled samples (decision tree approach)

(minor_allele_call.txt) All positions with either a homozygous SNP/indel or a variant with a minor allele frequency of >= 2% are stored in this file. Due to the sequencing error of approximately 1%, this can result in a high number of false positives. The minimum minor allele frequency should be adjusted according to the number of individuals in the sample. As a rule of thumb it should be greater than (100 / num samples / 2), e.g. >= 10% for 5 pooled samples. Note that in case of homozygous variants the minor allele is 'X', meaning no minor allele was found. Positions with homozygous reference calls are not stored at all to save space.

<name> <chr> <pos> <ref base> <major allele> <major support> <major concordance> <minor allele> <minor support> <minor concordance> <avg hits>
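The rule of thumb for the minimum minor allele frequency can be written as a small helper. This is only a restatement of the formula above; the function name is invented for illustration.

```python
def min_minor_allele_frequency(num_samples: int) -> float:
    """Rule-of-thumb lower bound in percent: 100 / num_samples / 2."""
    if num_samples < 1:
        raise ValueError("num_samples must be at least 1")
    return 100.0 / num_samples / 2


for n in (1, 5, 25):
    print(n, min_minor_allele_frequency(n))
```

With 25 pooled samples the bound reaches the 2% default, which is already close to the approximate sequencing error of 1% mentioned above.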


Homozygous reference calls (decision tree approach)

(reference.txt) Reference-like positions called from the alignments. Parameters are identical to those of the homozygous SNP prediction.

<name> <chr> <pos> <ref base> <cons base> <support type> <support> <concordance> <max qual> <unique prb> <avg hits>


Copy Variable Positions, CVPs

(copy_variable_position.txt) A duplication in the sequenced sample will map to the same locus in the reference sequence as the original copy from which it was generated. If there is a (slight) difference between the original and the duplicate, there will be positions which look like heterozygous calls. These positions, which show two different bases due to mislocated alignments of repetitive sequences, are called CVPs and are an indication of duplications. CVPs are classified according to their expected repetitiveness. If a position is expected to be unique but contains different bases, this can indicate a duplication of a formerly unique region.

<name> <chr> <start> <end> <length> <cvp count> <obs cov> <exp cov>


CNV

(CNV.txt) Copy number variation is predicted based on two major criteria: a strong skew between observed and expected coverage over an interval of at least 40 bp in length, and the existence of at least one CVP within this interval.

<name> <chr> <start> <end> <length> <cvp count> <obs cov> <exp cov>


Duplication

(duplication.txt) Duplications are predicted similarly to the CNVs described above; however, the reference sequence has to be mostly unique within the duplicated interval and the length has to be greater than 250 bp. Thus duplication predictions are more reliable than CNV predictions.

<name> <chr> <start> <end> <length> <cvp count> <obs cov> <exp cov>
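The CNV and duplication criteria described above can be summarised as a small decision function. This is a sketch of the stated rules only, not SHORE's implementation: the text does not say how "strong skew" or "mostly unique" are measured, so both are passed in as flags that the caller would have to compute, and the function name is invented.

```python
from typing import Optional


def classify_interval(length: int, cvp_count: int,
                      strong_skew: bool, mostly_unique: bool) -> Optional[str]:
    """Label an interval using the length and CVP rules given in the text.

    strong_skew and mostly_unique are assumptions of this sketch; how SHORE
    derives them is not described here.
    """
    # Both prediction types need skewed coverage over >= 40 bp and at least one CVP.
    if not strong_skew or length < 40 or cvp_count < 1:
        return None
    # Duplications additionally need a mostly unique reference and length > 250 bp.
    if mostly_unique and length > 250:
        return "duplication"
    return "CNV"


print(classify_interval(length=300, cvp_count=2, strong_skew=True, mostly_unique=True))
print(classify_interval(length=60, cvp_count=1, strong_skew=True, mostly_unique=True))
```

Whether SHORE also reports a predicted duplication in CNV.txt is not stated, so the single label returned here is a simplification.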


Unsequenced regions

(unsequenced.txt, supplementary_data/unseq_cn.txt, supplementary_data/unseq_core.txt) Unsequenced regions are called if a region of one or more bp is continuously uncovered by reads. However, this does not necessarily mean that the region is a deletion. It can indicate long deletions, insertions, (highly) polymorphic regions or a bias in the sequencing coverage. To account for biases, namely the GC content influencing coverage, we report the average and maximum GC content and the expected coverage within the unsequenced interval. In addition, the number of 'N' in the interval is reported to indicate whether an unsequenced region is solely due to unreliable reference sequence. Other predictions (SNPs, small indels) in the vicinity of unsequenced regions are unreliable due to an increased probability of alignment issues. We also provide two files showing regions with absence of non-repetitive reads and absence of 'core reads'.

<name> <chr> <start> <end> <length> <ambi count> <GC> <GC max> <repeat count> <exp cov>


Oversampled regions

(supplementary_data/oversampled.txt) Some highly repetitive genomic regions like rDNA or centromeric repeats are not annotated correctly in the reference sequence. Often the repeat is only represented by one copy. This leads to unexpectedly high coverage in these regions, indicating that several copies of the repeat mapped to the same reference sequence. The total number of repeat instances can be estimated using the observed vs. expected coverage ratio. Other predictions (SNPs, small indels) in the vicinity of oversampled regions are unreliable due to an increased probability of erroneous alignments.

<name> <chr> <start> <end> <length> <ambi count> <GC> <GC max> <repeat count> <obs cov> <exp cov> <obs/exp cov ratio avg> <obs/exp cov ratio max>
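The copy number estimate mentioned above follows directly from the <obs cov> and <exp cov> columns. The sketch below is an illustration only: the function name and the reference_copies parameter are invented, and it assumes that coverage scales linearly with the number of repeat copies.

```python
def estimate_repeat_copies(obs_cov: float, exp_cov: float,
                           reference_copies: int = 1) -> float:
    """Estimate repeat instances in the sample from the observed/expected coverage ratio.

    reference_copies is the number of copies present in the reference sequence
    (one in the case described in the text).
    """
    if exp_cov <= 0:
        raise ValueError("expected coverage must be positive")
    return obs_cov / exp_cov * reference_copies


# Hypothetical values: 300x observed where 30x was expected.
print(estimate_repeat_copies(300.0, 30.0))
```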


SNPs and short indels predictions (scoring matrix approach)

(quality_variant.txt) We have developed an empirical scoring matrix with 12 features describing e.g. the quality of reads, the quality of alignments and the likelihood of wrong calls due to other features in the vicinity of a position. These features are used to calculate a quality value, ranging from 0 to 40, for each position/call of the genome. The scoring matrices can be adjusted by the user; however, this is not recommended. There are predefined matrices for different prediction types (e.g. homozygous, heterozygous or pooled samples).

<name> <chr> <position> <ref base> <cons base> <quality> <support> <concordance> <avg hits>


Reference positions predictions (scoring matrix approach)

(quality_reference.txt) Same as above, but for reference-like positions instead of SNPs and indels.

<name> <chr> <position> <ref base> <cons base> <quality> <support> <concordance> <avg hits>