SHORE Overview

From SHORE wiki
Revision as of 17:54, 5 April 2011 by Joffreyfitz (Talk | contribs)

What is SHORE and what is SHORE for?

SHORE is a data analysis and management application for short DNA/RNA reads produced by the Illumina Genome Analyzer or Life Technologies SOLiD platforms. It is developed at the Max Planck Institute for Developmental Biology, Tübingen, Germany. SHORE is designed to support different sequencing applications, including genomic re-sequencing, ChIP-Seq, mRNA-Seq, sRNA-Seq and BS-Seq. The SHORE pipeline comprises all steps needed for sequence analysis: quality filtering of raw reads, adapter clipping for sRNA-Seq, mapping of reads against a reference genome, repeat analysis, correction for read pair information, and quantitative analysis of ChIP or transcriptome data. Several file format converters keep SHORE compatible with other widely used file formats such as fastq, qual, sam, bed and gff. SHORE was developed for applications in Arabidopsis thaliana but has been used successfully with other genomes, including human, D. melanogaster, C. elegans, maize and several bacterial genomes.


Overview of SHORE’s data structure

SHORE has a predefined structure of folders and files. One folder, referred to as the IndexFolder, stores all information about the reference sequence, including the indices (created and used by the mapping tools), GC content and low-complexity sequence analysis. A second folder, referred to as the ProjectFolder, contains all information about one sequencing project. The IndexFolder is kept separate from the ProjectFolder because one IndexFolder can serve multiple ProjectFolders; for example, different re-sequencing projects (ProjectFolders) can refer to the same reference sequence (IndexFolder).

The IndexFolder is populated using shore preprocess, which stores a fasta file named <original fasta file>.shore. The name of this file is used as the prefix for locating all index files SHORE needs. It is not necessary to keep the original fasta file.

The ProjectFolder has a predefined folder structure that all SHORE programs require. This should not cause confusion, because SHORE creates all folders automatically. All files in the ProjectFolder are ASCII-formatted and can be viewed, moved and manipulated. For example, it is possible to use SHORE just for read mapping and then convert the SHORE alignments into a different file format (e.g. SAM for SNP detection with SAMtools).
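
The prefix convention described above can be illustrated with a short Python sketch. It is not part of SHORE: the function name `find_index_files` is hypothetical, and the sketch assumes that the IndexFolder contains exactly one file ending in `.shore` and that the index files sit next to it with names starting with that file's name. The actual index file names and suffixes are not specified here.

```python
from pathlib import Path


def find_index_files(index_folder):
    """Return (prefix_file, companions) for an IndexFolder.

    prefix_file is the single '*.shore' fasta copy. companions are the other
    files in the same folder whose names start with that file's name
    (an assumption about how the prefix is applied).
    """
    folder = Path(index_folder)
    candidates = sorted(
        p for p in folder.iterdir() if p.is_file() and p.name.endswith(".shore")
    )
    if len(candidates) != 1:
        raise ValueError(
            f"expected exactly one *.shore file in {folder}, found {len(candidates)}"
        )
    prefix_file = candidates[0]
    companions = sorted(
        p
        for p in folder.iterdir()
        if p.is_file() and p != prefix_file and p.name.startswith(prefix_file.name)
    )
    return prefix_file, companions
```

Because the lookup depends only on the IndexFolder, several ProjectFolders could call such a helper with the same path and share one reference.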

The ProjectFolder and the read data

Folder structure of the ProjectFolder for read storage:

ProjectFolder/FlowcellFolder/LaneFolder/ReadFolder/LengthFolder
ProjectFolder/ This is the home folder for a sequencing project.
FlowcellFolder/ A project starts with the sequencing of one or more flowcells. Each flowcell is stored in a separate folder in the ProjectFolder, the so-called FlowcellFolders.
LaneFolder/ Each flowcell consists of one to eight lanes, each represented by a single folder named 1 to 8. An additional folder called bad quality is created to hold all reads that did not pass the quality filter.
ReadFolder/ For paired-end or mate-pair data, the read folders separate the two reads of each pair. Folder "1" holds the reads from the first sequencing run and folder "2" the reads from the second. Folder "single" holds reads from either run that lost their read partner during quality filtering. In a single-end run, all reads passing the quality filter are stored in "single".
LengthFolder/ SHORE can trim reads before mapping according to the base quality at the read ends, so reads can have different lengths. Each length has its own sub-folder, whose name is the read length prefixed by "length ".
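
The nesting described above can be walked with a few lines of Python. The following is only a sketch, not a SHORE tool: `LengthFolder` and `iter_length_folders` are hypothetical names, and the length-folder prefix is passed as a parameter with an assumed default of `length_`, because the exact separator after "length" is not clear from the text above. Lane folders other than 1 to 8 (such as the folder for reads that failed the quality filter) are skipped.

```python
from dataclasses import dataclass
from pathlib import Path

READ_FOLDERS = ("1", "2", "single")          # read folder names from the text
LANE_NAMES = {str(n) for n in range(1, 9)}   # lanes are named 1 to 8


@dataclass(frozen=True)
class LengthFolder:
    flowcell: str
    lane: str
    read: str
    length: int
    path: Path


def iter_length_folders(project_folder, length_prefix="length_"):
    """Yield one LengthFolder per ProjectFolder/Flowcell/Lane/Read/Length dir.

    length_prefix is an assumption; adjust it to the naming actually used.
    """
    project = Path(project_folder)
    for path in sorted(project.glob("*/*/*/*")):
        if not path.is_dir():
            continue
        flowcell, lane, read, name = path.relative_to(project).parts
        if lane not in LANE_NAMES or read not in READ_FOLDERS:
            continue
        if not name.startswith(length_prefix):
            continue
        suffix = name[len(length_prefix):]
        if not suffix.isdigit():
            continue
        yield LengthFolder(flowcell, lane, read, int(suffix), path)
```

Iterating over the result gives, for each flowcell, lane and read folder, the read lengths present after trimming, for example to check that folders "1" and "2" of a paired-end lane cover the same lengths.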