SHORE Overview

From SHORE wiki
Revision as of 15:18, 8 April 2011 by Joffreyfitz

What is SHORE and what is SHORE for?

SHORE is a data analysis and management application for short DNA/RNA reads produced by the Illumina Genome Analyzer or Life Technologies SOLiD platforms. It is developed at the Max Planck Institute for Developmental Biology, Tübingen, Germany. SHORE is designed to support different sequencing applications, including genomic re-sequencing, ChIP-Seq, mRNA-Seq, sRNA-Seq and BS-Seq. The SHORE pipeline comprises all necessary steps for sequence analysis, including quality filtering of raw reads, adapter clipping for sRNA-Seq, mapping of reads against a reference genome, repeat analysis, correction for read pair information, and quantitative analysis of ChIP or transcriptome data. Several file format converters ensure compatibility with other widely used file formats such as fastq, qual, sam, bed and gff. SHORE was developed for applications in Arabidopsis thaliana but has been successfully used with other genomes, including human, D. melanogaster, C. elegans, maize and several bacterial genomes.


Overview of SHORE’s data structure

SHORE has a predefined structure of folders and files. One folder, referred to as the IndexFolder, stores all information about the reference sequence, including the indices (created and used by mapping tools), GC content and low complexity sequence analysis. A second folder, referred to as the ProjectFolder, contains all information about a particular sequencing project. The IndexFolder is kept separate from the ProjectFolder because one IndexFolder can be used for multiple ProjectFolders, e.g., different re-sequencing projects (ProjectFolders) can refer to the same reference sequence (IndexFolder).

The IndexFolder is populated using shore preprocess. It will store a fasta file named <original fasta file>.shore. The name of this file serves as the prefix for locating all index files SHORE needs. It is not necessary to keep the original fasta file.

The ProjectFolder has a predefined folder structure. All SHORE programs require this structure, but all folders are created automatically by SHORE, so it does not have to be set up by hand. All files in the ProjectFolder are ASCII formatted files and can be viewed, moved and manipulated. For example, it is possible to use SHORE just for read mapping and then convert the SHORE alignments into a different file format (e.g. SAM for SNP detection with SAMtools).

The ProjectFolder and the read data

Folder structure of the ProjectFolder for read storage:

ProjectFolder/FlowcellFolder/LaneFolder/ReadFolder/LengthFolder
ProjectFolder/ This is the home folder for a sequencing project.
FlowcellFolder/ A project starts with the sequencing of one or more flowcells. Each flowcell is stored in a separate folder in the ProjectFolder, the so-called FlowcellFolders.
LaneFolder/ Each flowcell consists of one to eight lanes, each of them represented as a single folder, named 1-8. An additional folder called bad quality will be created, containing all reads that did not pass the quality filter.
ReadFolder/ The read folders separate the read pairs in the case of paired-end or mate-pair data. Folder "1" holds the reads from the first sequencing run and folder "2" the reads from the second. Folder "single" holds reads from either run that lost their read partner in the quality filtering process. In a single-end run, all reads passing the quality filter are stored in "single".
LengthFolder/ SHORE can trim reads before mapping according to their base quality at the read ends, so reads can be of different lengths. Each length has its own sub-folder, whose name is the length prefixed by "length_".

For example, the path to a LengthFolder could look like this:

phiX/EAS67_0018_FC12398AAXX/5/1/length_80
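
The five folder levels can be recovered from such a path mechanically. The following is a minimal Python sketch, not part of SHORE: the names ReadLocation and parse_length_folder are hypothetical, and the code assumes only the layout described above (five levels, with the last one named "length_<n>").

```python
from pathlib import PurePosixPath
from typing import NamedTuple


class ReadLocation(NamedTuple):
    """The five levels of the read storage layout (hypothetical helper type)."""
    project: str
    flowcell: str
    lane: str
    read: str     # "1", "2" or "single"
    length: int   # read length taken from the "length_<n>" folder name


def parse_length_folder(path: str) -> ReadLocation:
    """Split a LengthFolder path into its five levels."""
    parts = PurePosixPath(path).parts
    if len(parts) < 5:
        raise ValueError(f"expected at least 5 path components, got {len(parts)}")
    # Only the last five components matter, so absolute paths work as well.
    project, flowcell, lane, read, length_dir = parts[-5:]
    prefix = "length_"
    if not length_dir.startswith(prefix):
        raise ValueError(f"not a LengthFolder: {length_dir!r}")
    return ReadLocation(project, flowcell, lane, read, int(length_dir[len(prefix):]))


loc = parse_length_folder("phiX/EAS67_0018_FC12398AAXX/5/1/length_80")
print(loc)
```

Here the project is phiX, the flowcell is EAS67_0018_FC12398AAXX, the lane is 5, the ReadFolder is 1 (first read of a pair) and the read length is 80.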


The ProjectFolder and the alignment data

After the reads are mapped, the resulting alignments (mappings) can be merged into the so-called AlignmentFolder. This folder stores all information and results of the alignment analysis:

ProjectFolder/AlignmentFolder/AnalysisFolder/ResultFolder
ProjectFolder/ This is the home folder for a sequencing project.
AlignmentFolder/ The name of the folder is user specified. It hosts the merged results of the read mapping, which are used as input for shore consensus, shore coverage and shore peak. Merging the alignments introduces redundancy because the alignments are then stored twice; it is recommended to delete the merged file after it has been analyzed.
AnalysisFolder/ Will be created by shore consensus, shore coverage and shore peak. Several AnalysisFolders can be created to store data from different analysis runs.
ResultFolder/ AnalysisFolders have two different subfolders, the ResultFolders. In the case of 'shore consensus' these are 'ConsensusAnalysis', which stores the alignment analysis results (e.g. SNP calls, peaks ...), and 'ConsensusStatistics', which stores general project statistics, like read quality.


It is possible to store the AlignmentFolder or the AnalysisFolders in a different location than the ProjectFolder.


SHORE’s file formats

Any output generated by SHORE will usually be written to various text files that contain a number of tab-delimited columns. Typing shore fmt will display a quick reference on SHORE’s file formats.

Read file format

Read files can be found in the LengthFolders in the ProjectFolder. They are called 'reads_0.fl' and are created by shore import. The tab-delimited entries are:


<id>

The ID is built up of the run id (4 characters), lane (1 character), tile (3 characters), x value within the image (5 characters) and y value (5 characters). In contrast to the GAPipeline, the numbers are concatenated and the run id is added at the beginning. The run id makes reads unique between flowcells within one project. Run ids must not start with zero.

<sequence>

DNA sequence

<pe>

Flag, ’1’ or ’2’, first read or second read of a read pair. ’0’ for a single read.

<Sanger quality values>

Sanger calibrated quality values described in a later section.

[<Chastity values>]

(Optional column) Illumina chastity values defined as Intensity(max) / (Intensity(max) + Intensity(second)).

Note: Earlier versions of SHORE used a read file format featuring three different quality types per entry. This file format is still supported for reading, but is no longer written.
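
The column layout and the fixed-width read ID described above can be illustrated with a short Python sketch. It is not SHORE code: the names split_read_id, parse_read_line and chastity_value are hypothetical, the example ID and line are made up, and the sketch assumes the current format with four or five columns (not the older three-quality format) and IDs of exactly 4 + 1 + 3 + 5 + 5 = 18 characters.

```python
from typing import NamedTuple, Optional

# Field widths of the read ID as listed above: run id, lane, tile, x, y.
ID_WIDTHS = (("run", 4), ("lane", 1), ("tile", 3), ("x", 5), ("y", 5))


def split_read_id(read_id: str) -> dict:
    """Cut a concatenated read ID into its fixed-width parts (assumed layout)."""
    expected = sum(width for _, width in ID_WIDTHS)
    if len(read_id) != expected:
        raise ValueError(f"expected {expected} characters, got {len(read_id)}")
    parts, start = {}, 0
    for name, width in ID_WIDTHS:
        parts[name] = read_id[start:start + width]
        start += width
    return parts


class Read(NamedTuple):
    read_id: str
    sequence: str
    pe: int                      # 1 or 2 for paired reads, 0 for a single read
    sanger_quality: str
    chastity: Optional[str]      # optional fifth column, kept as raw text


def parse_read_line(line: str) -> Read:
    """Parse one tab-delimited line of a read file."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) not in (4, 5):
        raise ValueError(f"expected 4 or 5 columns, got {len(fields)}")
    chastity_col = fields[4] if len(fields) == 5 else None
    return Read(fields[0], fields[1], int(fields[2]), fields[3], chastity_col)


def chastity_value(max_intensity: float, second_intensity: float) -> float:
    """Chastity as defined above: Intensity(max) / (Intensity(max) + Intensity(second))."""
    return max_intensity / (max_intensity + second_intensity)


# Made-up example data, for illustration only.
read = parse_read_line("123450420012300456\tACGT\t1\tIIII\n")
id_parts = split_read_id(read.read_id)
print(id_parts)
print(chastity_value(300.0, 100.0))
```

The chastity column is kept as raw text because its per-base encoding is not specified in this section.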


Alignment file format

SHORE alignment files are typically called map.list, map.list.1 or map.list.2. They can be found in the LengthFolders or, for paired-end data, in the ReadFolders after correcting for paired-end information (shore correct4pe). The tab-delimited entries are:

<chr id>

Each chromosome has an internal id; chromosomes are simply numbered in the order in which they occur in the reference sequence file, starting from 1. The translation to the native chromosome name can be found in the *.shore.trans file in the IndexFolder.

<pos>

Left-most position of the mapping relative to the reference sequence.

<alignment>

Matches are reported as a single character; mismatches are represented by two characters surrounded by brackets. The first character represents the reference base, the second character the sequenced base. Deletions are represented as '-'.

<read id>

See ’Read file format’ section.

<strand>

’D’ for forward and ’P’ for reverse hits (direct and palindromic, respectively).

<mismatches>

The number of mismatches in the alignment.

<hits>

The total number of genomic positions the read is aligned to.

<read length>

Length of the read.

<offset>

Number of bases at the beginning of a read that have not been aligned.

<pe flag>

1 or 2 for paired reads, 0 for singletons. Other values describe the state of paired reads after correcting for paired-end information.

<Sanger quality values>

Sanger calibrated quality values described in a later section.

[<Chastity values>]

(Optional column) Illumina chastity values defined as the highest intensity divided by the sum of the highest and the second highest intensity of a single base.

Note: Earlier versions of SHORE used an alignment file format featuring three different quality types per entry. This file format is still supported for reading, but is no longer written.
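
As an illustration of the columns and of the bracket notation in the <alignment> column, the following Python sketch parses one line and decodes the alignment string. It is not SHORE code: the names Alignment, parse_alignment_line and decode_alignment are hypothetical, the example line is made up, and the sketch assumes the current format with 11 or 12 columns. It also assumes that a gap character '-' only ever appears as one of the two characters inside a bracket pair.

```python
from typing import NamedTuple, Optional, Tuple


class Alignment(NamedTuple):
    chr_id: int
    pos: int
    alignment: str
    read_id: str
    strand: str                  # "D" (forward) or "P" (reverse)
    mismatches: int
    hits: int
    read_length: int
    offset: int
    pe_flag: int
    sanger_quality: str
    chastity: Optional[str]      # optional last column, kept as raw text


def parse_alignment_line(line: str) -> Alignment:
    """Parse one tab-delimited line of an alignment file such as map.list."""
    f = line.rstrip("\n").split("\t")
    if len(f) not in (11, 12):
        raise ValueError(f"expected 11 or 12 columns, got {len(f)}")
    return Alignment(int(f[0]), int(f[1]), f[2], f[3], f[4], int(f[5]),
                     int(f[6]), int(f[7]), int(f[8]), int(f[9]), f[10],
                     f[11] if len(f) == 12 else None)


def decode_alignment(aln: str) -> Tuple[str, str, int]:
    """Return (reference bases, read bases, number of bracket pairs)."""
    ref, read, edits = [], [], 0
    i = 0
    while i < len(aln):
        if aln[i] == "[":
            # A bracket pair has the form [XY]: X = reference, Y = sequenced base.
            if i + 3 >= len(aln) or aln[i + 3] != "]":
                raise ValueError(f"malformed bracket pair at offset {i}")
            ref_base, read_base = aln[i + 1], aln[i + 2]
            if ref_base != "-":      # assumed: '-' on the reference side is a gap
                ref.append(ref_base)
            if read_base != "-":     # assumed: '-' on the read side is a gap
                read.append(read_base)
            edits += 1
            i += 4
        else:
            # A plain character is a match and belongs to both sequences.
            ref.append(aln[i])
            read.append(aln[i])
            i += 1
    return "".join(ref), "".join(read), edits


# Made-up example line, for illustration only.
hit = parse_alignment_line(
    "1\t1042\tAC[GT]TA\t123450420012300456\tD\t1\t1\t5\t0\t0\tIIIII\n")
ref_seq, read_seq, n_edits = decode_alignment(hit.alignment)
print(ref_seq, read_seq, n_edits)
```

In the example, the single bracket pair [GT] means that the reference has G where the read has T, which agrees with the value 1 in the <mismatches> column.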


Consensus result file formats

Analyses of resequencing projects are performed by shore consensus. Multiple result files are produced for SNPs, indels, CNVs, reference-like bases and other types of predictions. All of them can be found in the ConsensusAnalysis folder. Currently shore consensus provides three types of consensus predictions:

1. A decision tree based approach as described in Ossowski et al. Genome Research 2008.
2. An empirical scoring matrix approach, to be described in a future publication.
3. SNP prediction on transcriptomic or other quantitative data, to be described in a future publication.

The output files of the second approach have 'quality_' as name suffix. The third approach produces all files from the first two approaches but uses a slightly different scoring scheme. The following list gives an overview of all column types that can be found in the result files. Note that not all columns apply to each of the predictions:

<name> Project name or name of sequenced sample
<chr> Chromosome identifier
<pos> Position within the chromosome
<start> First position of a prediction within the chromosome (for predictions longer than just one base pair, e.g. indels)
<end> Last position of a prediction
<length> Length
<ref base> Reference base
<cons base> Consensus base (i.e. SNP call). Heterozygous SNPs and SNPs from pooled samples are divided into 'major allele' and 'minor allele'
<seq> Sequence (i.e. inserted or deleted sequence)
<quality> Quality of a predicted feature (ranging from 0 to 40)
<read type> Part of the reads that were used for prediction: repetitive/nonrepetitive and core/complete
<support> Number of reads supporting a predicted feature. For heterozygous SNPs and SNPs from pooled samples the sequenced bases are divided into ’major support’ and ’minor support’
<concordance> Ratio of reads supporting a predicted feature to total coverage (excluding quality masked bases). Heterozygous SNPs and SNPs from pooled samples are divided into 'major concordance' and 'minor concordance'
<max qual> Highest base quality supporting a prediction.
<avg hits> Average number of alignments of all reads covering this genomic position. (see section ’Repeat analysis based on short read alignment’ for details)
<repeat count> Number of repetitive positions in the range of the prediction (i.e. long deletion)
Number of ambiguous positions in the range of the prediction (’N’).
<exp cov> Expected coverage at a locus defined by repeat analysis and GC content. (Average expected coverage, if the prediction describes a range rather than a single location.)
<obs cov> Observed coverage at a locus as defined by the read alignment. (Average observed coverage, if the prediction describes a range rather than a single location.)
<obs/exp ratio> Ratio of observed to expected coverage is used to identify CNVs or highly over-sampled regions (often seen for rDNA clusters) in the genome.
<obs/exp ratio max> The maximum of <obs/exp ratio>.
<GC> Maximal GC content within the given range.
<cvp count> Copy variable positions (variation between instances of repeats) are an indication of duplications or CNVs.
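
The two ratio columns in the list above follow directly from their definitions. The Python sketch below only restates that arithmetic. The function names and the example numbers are hypothetical, and no cutoff for calling a SNP or a CNV is implied, since none is given here.

```python
def concordance(support: int, coverage: int) -> float:
    """<concordance>: supporting reads divided by total coverage.

    `coverage` is assumed to already exclude quality masked bases.
    """
    if coverage <= 0:
        raise ValueError("coverage must be positive")
    return support / coverage


def obs_exp_ratio(obs_cov: float, exp_cov: float) -> float:
    """<obs/exp ratio>: observed coverage divided by expected coverage."""
    if exp_cov <= 0:
        raise ValueError("expected coverage must be positive")
    return obs_cov / exp_cov


# Made-up numbers: 18 of 20 unmasked reads support the call, and the locus
# is covered twice as deeply as repeat analysis and GC content would predict.
print(concordance(18, 20))
print(obs_exp_ratio(60.0, 30.0))
```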


The following listing describes the different prediction files and their columns. This description only reviews some of the major aspects to help in understanding the files. It is important to note that all of them are predictions. CNVs and duplications in particular only indicate deviations of the observed mapping data from what would be expected under ideal sequencing circumstances. Moreover, CNV and duplication predictions have so far been implemented for single-end data only, and should be carefully validated. More information on the prediction algorithms can be found in Ossowski et al. Genome Research 2008.