Difference between revisions of "SHORE Overview"

From SHORE wiki
<!--COMMENT
 
== The ProjectFolder and the alignment data ==
 
After the reads are mapped, the resulting alignments (mappings) can be merged into the so-called AlignmentFolder. This folder stores all information and results of the alignment analysis:
 
 
ProjectFolder/AlignmentFolder/AnalysisFolder/ResultFolder
 
 
{|
 
|ProjectFolder/ || This is the home folder for a sequencing project.
 
|-
 
|AlignmentFolder/ || The name of the folder is user specified. It hosts the merged results of the read mapping, which are used as input for shore consensus, shore coverage and shore peakup. Merging the alignments introduces redundancy: the alignments are stored twice, so it is recommended to delete the merged file after it has been analyzed.
 
|-
 
| AnalysisFolder/ || Will be created by shore consensus, shore coverage and shore peak. Several AnalysisFolders can be created to store data from different analysis runs.
 
|-
 
| ResultFolder/ || AnalysisFolders have two different subfolders, the ResultFolders. In the case of 'shore consensus' the two ResultFolders are called 'ConsensusAnalysis', which stores the alignment analysis results (e.g. SNP calls, peaks ...), and 'ConsensusStatistics', which stores general project statistics such as read quality.
 
|}
 
 
 
It is possible to store the AlignmentFolder or the AnalysisFolders in a different location than the ProjectFolder.
 
COMMENT-->
 

Revision as of 10:20, 27 September 2011

What is SHORE and what is SHORE for?

SHORE is a data analysis and management application for short DNA/RNA reads produced by contemporary sequencing platforms. It is developed at the Max Planck Institute for Developmental Biology, Tübingen, Germany. SHORE is designed to support different sequencing applications, including genomic re-sequencing, ChIP-Seq, mRNA-Seq, sRNA-Seq and BS-Seq. The SHORE pipeline comprises all necessary steps for sequence analysis, including quality filtering of raw reads, adapter clipping for sRNA-Seq, mapping of reads against a reference genome, correction for read pair information, and quantitative analysis of ChIP or transcriptome data. Several file format converters ensure compatibility with other widely used file formats such as fastq, qual, sam/bam, bed and gff. SHORE was developed for applications in Arabidopsis thaliana but has been used successfully with other genomes, including human, mouse, D. melanogaster, C. elegans, maize and several bacterial genomes.

Overview of SHORE’s data structure

SHORE structures directories and files in several predefined ways. One central directory, referred to as the IndexFolder, stores all information about a reference sequence, including the indexes (created and used by the mapping tools) and the results of the GC content and low complexity sequence analysis. A second directory, referred to as the ProjectFolder, contains all information about a particular sequencing project. The IndexFolder is kept separate from the ProjectFolder because one IndexFolder can be reused for multiple projects; e.g., different re-sequencing projects (ProjectFolders) can refer to the same reference sequence (IndexFolder). The IndexFolder is populated using shore preprocess. It stores a fasta file named <original_fasta_file>.shore, and the name of this file serves as the prefix by which SHORE finds all required index files. It is not necessary to keep the original fasta file.

The ProjectFolder contains one or more RunFolders, which are also organized as a predefined directory hierarchy. This is where the actual read and alignment data are stored. Although many SHORE subprograms can also be run stand-alone on single files, this partitioning is required to take full advantage of all SHORE features. The hierarchy does not have to be set up by hand, as all directories are created automatically by SHORE. All files in the RunFolders are ASCII-formatted and can be viewed, moved and manipulated. It is also possible to use SHORE for only part of an analysis, e.g. just for read mapping, and then convert the SHORE alignments into a different file format (e.g. SAM for SNP detection with SAMtools).

The RunFolder and the read data

Directory structure of the RunFolder for read storage:

RunFolder/LaneFolder[/SampleFolder]/ReadFolder[/LengthFolder]
Note: Many SHORE subprograms accept shore directories as input; this term is used to refer to any of RunFolder, LaneFolder, SampleFolder and ReadFolder.

The SampleFolder and LengthFolder levels are optional and will be automatically created where required.

RunFolder/ A project starts with the sequencing of one or more flowcells. Each flowcell is stored in a separate directory, which is named by the user.
LaneFolder/ An Illumina flowcell consists of one to eight lanes, each of them represented by a single directory, named by a single digit. For technologies other than Illumina, this will always be called 1. An additional directory called filtered will be created containing all reads which did not pass the quality filter.
SampleFolder/ This directory level will only be created when using bar codes for de-multiplexing multiple samples. Directories are named sample_<user_defined_label>/.
ReadFolder/ The ReadFolders separate the reads of a pair in the case of paired-end or mate-pair data. Directory 1 contains the reads from the first sequencing run and directory 2 the reads from the second. Directory single contains reads from either run that lost their read partner in the quality filtering process. For single-end runs, all reads passing the quality filter are stored in single.
LengthFolder/ SHORE can optionally sort the reads according to their length, e.g. when trimming away bad quality sequence or sequencing adapters. These directories are named length_<read_length>/. They are created by default for small RNA data only.
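
The naming rules above can be sketched as a small helper that composes a read directory path from its components. This is only an illustration and not part of SHORE: the function name read_dir and its parameters are hypothetical, the lane check assumes the one-to-eight numbering described for Illumina flowcells, and the filtered directory is not handled.

```python
from pathlib import PurePosixPath
from typing import Optional

# Hypothetical helper, not part of SHORE: composes
# RunFolder/LaneFolder[/SampleFolder]/ReadFolder[/LengthFolder]
# from the naming rules described in the text.
READ_FOLDERS = ("1", "2", "single")


def read_dir(run: str, lane: int, read: str,
             sample: Optional[str] = None,
             length: Optional[int] = None) -> PurePosixPath:
    if not 1 <= lane <= 8:
        # Assumption: lanes are numbered with a single digit from 1 to 8.
        raise ValueError(f"lane must be between 1 and 8, got {lane}")
    if read not in READ_FOLDERS:
        raise ValueError(f"read folder must be one of {READ_FOLDERS}")

    parts = [run, str(lane)]
    if sample is not None:
        # Optional level, only present for de-multiplexed samples.
        parts.append(f"sample_{sample}")
    parts.append(read)
    if length is not None:
        # Optional level, only present when reads are sorted by length.
        parts.append(f"length_{length}")
    return PurePosixPath(*parts)
```

Leaving out sample and length yields the shortest form of the hierarchy, RunFolder/LaneFolder/ReadFolder.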

For example, a complete read directory path could look like this:

EAS67_0018_FC12398AAXX/5/sample_phiX/single/length_80
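
Going the other way, a path like the one above can be split back into its levels. The following is a minimal sketch, assuming the path is given relative to the ProjectFolder (so it starts with the RunFolder) and follows the naming conventions described in this section. The names ShoreReadPath and parse_read_path are hypothetical and not part of SHORE.

```python
from pathlib import PurePosixPath
from typing import NamedTuple, Optional


class ShoreReadPath(NamedTuple):
    run: str
    lane: str
    sample: Optional[str]   # None if there is no SampleFolder level
    read: str               # "1", "2" or "single"
    length: Optional[int]   # None if there is no LengthFolder level


def parse_read_path(path: str) -> ShoreReadPath:
    """Split RunFolder/LaneFolder[/SampleFolder]/ReadFolder[/LengthFolder]."""
    parts = list(PurePosixPath(path).parts)
    if len(parts) < 3:
        raise ValueError("expected at least RunFolder/LaneFolder/ReadFolder")
    run, lane, rest = parts[0], parts[1], parts[2:]

    sample = None
    if rest[0].startswith("sample_"):
        sample = rest.pop(0)[len("sample_"):]

    if not rest or rest[0] not in ("1", "2", "single"):
        raise ValueError("missing or unknown ReadFolder")
    read = rest.pop(0)

    length = None
    if rest:
        if len(rest) != 1 or not rest[0].startswith("length_"):
            raise ValueError("unexpected trailing path components")
        length = int(rest[0][len("length_"):])

    return ShoreReadPath(run, lane, sample, read, length)
```

Applied to the example path, the sketch reports run EAS67_0018_FC12398AAXX, lane 5, sample phiX, read folder single and read length 80.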