Difference between revisions of "SHORE Overview"

From SHORE wiki

Revision as of 18:02, 5 April 2011

SHORE overview

What is SHORE and what is SHORE for?

SHORE is a data analysis and management application for short DNA/RNA reads produced by the Illumina Genome Analyzer or Life Technologies SOLiD platforms. It is developed at the Max Planck Institute for Developmental Biology, Tübingen, Germany. SHORE is designed to support different sequencing applications, including genomic re-sequencing, ChIP-Seq, mRNA-Seq, sRNA-Seq and BS-Seq. The SHORE pipeline comprises all necessary steps for sequence analysis, including quality filtering of raw reads, adapter clipping for sRNA-Seq, mapping of reads against a reference genome, repeat analysis, correction for read pair information, and quantitative analysis of ChIP or transcriptome data. Several file format converters ensure the compatibility of SHORE with other widely used file formats such as fastq, qual, sam, bed and gff. SHORE was developed for applications in Arabidopsis thaliana but has been successfully used with other genomes, including human, D. melanogaster, C. elegans, maize and several bacterial genomes.


Overview of SHORE’s data structure

SHORE has a predefined structure of folders and files. One folder, referred to as the IndexFolder, stores all information about the reference sequence, including the indices (created and used by mapping tools), GC content and low-complexity sequence analysis. A second folder, referred to as the ProjectFolder, contains all information about a particular sequencing project. The IndexFolder is kept separate from the ProjectFolder because one IndexFolder can be used for multiple ProjectFolders; e.g., different re-sequencing projects (ProjectFolders) can refer to the same reference sequence (IndexFolder).

The IndexFolder is populated using shore preprocess. It will store a fasta file named <original fasta file>.shore. This file name is used as the prefix for locating all index files that SHORE needs. It is not necessary to keep the original fasta file.

The ProjectFolder has a predefined folder structure. All SHORE programs require this structure, though this should not be too confusing, as all folders are created automatically by SHORE. All files in the ProjectFolder are ASCII-formatted and can be viewed, moved and manipulated. For example, it is possible to use SHORE just for read mapping and then convert the SHORE alignments into a different file format (e.g. SAM for SNP detection with SAMtools).

The ProjectFolder and the read data

Folder structure of the ProjectFolder for read storage:

ProjectFolder/FlowcellFolder/LaneFolder/ReadFolder/LengthFolder

ProjectFolder/ This is the home folder for a sequencing project.
FlowcellFolder/ A project starts with the sequencing of one or more flowcells. Each flowcell is stored in a separate folder in the ProjectFolder, the so-called FlowcellFolder.
LaneFolder/ Each flowcell consists of one to eight lanes, each of them represented as a single folder, named 1-8. An additional folder called bad quality will be created containing all reads which did not pass the quality filter.
ReadFolder/ The ReadFolders separate the reads of a pair in the case of paired-end or mate-pair data. Folder "1" holds the reads from the first sequencing run and folder "2" the reads from the second. Folder "single" holds reads from either run that lost their read partner in the quality filtering process. In a single-end run, all reads passing the quality filter will be stored in "single".
LengthFolder/ SHORE can trim reads before mapping according to their base quality at the read ends, so reads can be of different lengths. Each length has its own sub-folder whose name is the length prefixed by "length_".

For example, the full path to a LengthFolder could look like this:

phiX/EAS67_0018_FC12398AAXX/5/1/length_80
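
As an illustration of the folder layout above, the following minimal sketch splits a LengthFolder path into its components. It is not part of SHORE; the names ReadLocation and parse_length_folder are hypothetical helpers, and the sketch assumes that the last five path components follow the ProjectFolder/FlowcellFolder/LaneFolder/ReadFolder/LengthFolder order and that the LengthFolder name starts with "length_", as in the example path.

```python
from pathlib import PurePosixPath
from typing import NamedTuple


class ReadLocation(NamedTuple):
    """Components of a LengthFolder path (hypothetical helper type)."""
    project: str
    flowcell: str
    lane: str
    read_folder: str
    read_length: int


def parse_length_folder(path: str) -> ReadLocation:
    """Split a path ending in Project/Flowcell/Lane/Read/length_N."""
    parts = PurePosixPath(path).parts
    if len(parts) < 5:
        raise ValueError(f"expected at least 5 path components: {path!r}")
    # Only the last five components are interpreted, so the ProjectFolder
    # may itself live anywhere in the file system.
    project, flowcell, lane, read_folder, length_folder = parts[-5:]
    prefix = "length_"  # assumed prefix, taken from the example path
    if not length_folder.startswith(prefix):
        raise ValueError(f"not a LengthFolder name: {length_folder!r}")
    read_length = int(length_folder[len(prefix):])
    return ReadLocation(project, flowcell, lane, read_folder, read_length)


loc = parse_length_folder("phiX/EAS67_0018_FC12398AAXX/5/1/length_80")
print(loc.flowcell, loc.lane, loc.read_folder, loc.read_length)
```

The lane and ReadFolder are kept as strings because, according to the description above, they may also carry non-numeric names such as "single".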


The ProjectFolder and the alignment data

After the reads are mapped, the resulting alignments (mappings) can be merged into the so-called AlignmentFolder. This folder stores all information and results of the alignment analysis:

ProjectFolder/AlignmentFolder/AnalysisFolder/ResultFolder

ProjectFolder/ This is the home folder for a sequencing project.
AlignmentFolder/ The name of this folder is specified by the user. It hosts the merged results of the read mapping, which are used as input for shore consensus, shore coverage and shore peakup. Merging the alignments introduces redundancy because the alignments are then stored twice; it is recommended to delete the merged file after it has been analyzed.
AnalysisFolder/ Will be created by shore consensus, shore coverage and shore peak. Several AnalysisFolders can be created to store data from different analysis runs.
ResultFolder/ Each AnalysisFolder has two different subfolders, the ResultFolders. In the case of 'shore consensus', the two ResultFolders are 'ConsensusAnalysis', which stores the alignment analysis results (e.g. SNP calls, peaks ...), and 'ConsensusStatistics', which stores general project statistics such as read quality.


It is possible to store the AlignmentFolder or the AnalysisFolders in a different location than the ProjectFolder.


SHORE’s file formats

SHORE usually writes its output to text files that contain a number of tab-delimited columns. Typing shore fmt will display a quick reference on SHORE's file formats.

Read file format

Read files can be found in the LengthFolders in the ProjectFolder. They are called 'reads_0.fl' and are created by shore import. The tab-delimited entries are:


<id>

The ID is built up of the run id (4 characters), lane (1 character), tile (3 characters), x value within the image (5 characters) and y value (5 characters). In contrast to the GAPipeline, the numbers are concatenated and the run id is added at the beginning. The run id makes reads unique between flowcells within one project. Run ids must not start with zero.

<sequence>

DNA sequence

<pe>

Flag: '1' or '2' for the first or second read of a read pair, '0' for a single read.

<Sanger quality values>

Sanger calibrated quality values, described in a later section.

[<Chastity values>]

(Optional column) Illumina chastity values, defined as Intensity(max) / (Intensity(max) + Intensity(second)).

Note: Earlier versions of SHORE used a read file format featuring three different quality types per entry. This file format is still supported for reading, but is no longer written.
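
The read file layout described in this section can be sketched in a few lines of Python. This is an illustration only, not SHORE code: the names ReadId, ReadRecord, parse_read_id, parse_read_line and chastity are hypothetical, the column order is assumed to be id, sequence, pe, Sanger quality values and then the optional chastity column, and the example ID and line are made up. The fixed field widths (4, 1, 3, 5 and 5 characters) are taken from the ID description.

```python
from typing import NamedTuple, Optional


class ReadId(NamedTuple):
    run_id: str
    lane: int
    tile: int
    x: int
    y: int


class ReadRecord(NamedTuple):
    read_id: ReadId
    sequence: str
    pe: int                  # 0 = single read, 1 or 2 = member of a pair
    quality: str             # Sanger quality string, kept undecoded
    chastity: Optional[str]  # optional fifth column, kept undecoded


def parse_read_id(read_id: str) -> ReadId:
    """Split a concatenated ID using the widths 4 + 1 + 3 + 5 + 5."""
    if len(read_id) != 18:
        raise ValueError(f"expected 18 characters, got {len(read_id)}")
    return ReadId(
        run_id=read_id[0:4],
        lane=int(read_id[4:5]),
        tile=int(read_id[5:8]),
        x=int(read_id[8:13]),
        y=int(read_id[13:18]),
    )


def parse_read_line(line: str) -> ReadRecord:
    """Parse one tab-delimited read line (assumed column order)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) not in (4, 5):
        raise ValueError(f"expected 4 or 5 columns, got {len(fields)}")
    pe = int(fields[2])
    if pe not in (0, 1, 2):
        raise ValueError(f"invalid pe flag: {pe}")
    chastity_col = fields[4] if len(fields) == 5 else None
    return ReadRecord(parse_read_id(fields[0]), fields[1], pe,
                      fields[3], chastity_col)


def chastity(intensity_max: float, intensity_second: float) -> float:
    """Chastity as defined above: Int(max) / (Int(max) + Int(second))."""
    return intensity_max / (intensity_max + intensity_second)


# Made-up example: run 1234, lane 5, tile 12, x = 345, y = 1234.
record = parse_read_line("123450120034501234\tACGT\t1\tIIII\n")
print(record.read_id, record.sequence, record.pe)
print(chastity(300.0, 100.0))
```

The quality and chastity columns are deliberately left as raw strings, since their encoding is described elsewhere and is not assumed here.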