SHORE Overview
What is SHORE and what is SHORE for?
SHORE is a data analysis and management application for short DNA/RNA reads produced by contemporary sequencing platforms. It is developed at the Max Planck Institute for Developmental Biology, Tübingen, Germany. SHORE is designed to support different sequencing applications, including genomic re-sequencing, ChIP-Seq, mRNA-Seq, sRNA-Seq and BS-Seq. The SHORE pipeline comprises all necessary steps for sequence analysis: quality filtering of raw reads, adapter clipping for sRNA-Seq, mapping of reads against a reference genome, correction for read pair information, and quantitative analysis of ChIP or transcriptome data. Several file format converters make SHORE compatible with other widely used file formats such as fastq, qual, sam/bam, bed and gff. SHORE was developed for applications in Arabidopsis thaliana but has been used successfully with other genomes, including human, mouse, D. melanogaster, C. elegans, maize and several bacterial genomes.
Overview of SHORE’s data structure
SHORE organizes directories and files in several predefined ways. One central directory, referred to as IndexFolder, stores all information about a reference sequence, including the indexes (created and used by the mapping tools), GC content and low complexity sequence analysis. A second directory, referred to as ProjectFolder, contains all information about a particular sequencing project. The IndexFolder is kept separate from the ProjectFolder because one IndexFolder can be reused for multiple projects, e.g., different re-sequencing projects (ProjectFolders) can refer to the same reference sequence (IndexFolder). The IndexFolder is populated using shore preprocess, which stores a fasta file named <original_fasta_file>.shore in it. The path of this file serves as the prefix by which SHORE locates all required index files. It is not necessary to keep the original fasta file.
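The naming rule for the reference copy can be sketched in a few lines of Python. This is only an illustration of the <original_fasta_file>.shore convention described above: the function name shore_index_prefix and the example paths are hypothetical, and the sketch assumes that ".shore" is appended to the complete original file name, including its extension.

```python
from pathlib import Path


def shore_index_prefix(index_folder, original_fasta):
    """Return the expected path of the .shore fasta copy in an IndexFolder.

    Assumption: ".shore" is appended to the full original file name,
    e.g. genome.fa -> genome.fa.shore.
    """
    fasta_name = Path(original_fasta).name  # drop the source directory
    return Path(index_folder) / (fasta_name + ".shore")


# Hypothetical paths, for illustration only.
print(shore_index_prefix("IndexFolder", "downloads/genome.fa").as_posix())
```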
The ProjectFolder contains one or more RunFolders, which are also organized as a predefined directory hierarchy. This is where the actual read and alignment data are stored. Although many SHORE subprograms can also be run stand-alone on single files, this partitioning is required to take full advantage of all SHORE features. The hierarchy does not have to be set up by hand, as SHORE creates all directories automatically. All files in the RunFolders are ASCII formatted and can be viewed, moved and manipulated. It is also possible to use SHORE for a single step only, e.g. read mapping, and then convert the SHORE alignments into a different file format (e.g. SAM for SNP detection with SAMtools).
The RunFolder and the read data
Directory structure of the RunFolder for read storage:
RunFolder/LaneFolder[/SampleFolder]/ReadFolder[/LengthFolder]
The SampleFolder and LengthFolder levels are optional and will be automatically created where required.
RunFolder/ | A project starts with the sequencing of one or more flowcells. Each flowcell is stored in a separate directory, which is named by the user. |
LaneFolder/ | An Illumina flowcell consists of one to eight lanes, each of them represented by a single directory, named by a single digit. For technologies other than Illumina, this will always be called 1. An additional directory called filtered will be created containing all reads which did not pass the quality filter. |
SampleFolder/ | This directory level will only be created when using bar codes for de-multiplexing multiple samples. Directories are named sample_<user_defined_label>/. |
ReadFolder/ | The ReadFolders separate the read pairs in the case of paired-end or mate-pair data. Directory 1 contains the reads from the first sequencing run, and directory 2 the reads from the second. Directory single contains reads from the first or the second run that lost their read partner in the quality filtering process. For single-end runs, all reads passing the quality filter are stored in single. |
LengthFolder/ | SHORE can optionally sort the reads according to their length, e.g. when trimming away bad quality sequence or sequencing adapters. These directories are named length_<read_length>/. They are created by default for small RNA data only. |
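The way reads are distributed over the directories 1, 2, single and filtered can be summarized as a small decision function. The following Python sketch only restates the rules from the table above; it is not SHORE code, and the function name read_destinations and its parameters are hypothetical.

```python
def read_destinations(paired, pass_first, pass_second=None):
    """Return the target directory name for each read of one fragment.

    paired      -- True for paired-end or mate-pair data
    pass_first  -- True if the first read passed the quality filter
    pass_second -- same for the second read (ignored for single-end data)
    """
    if not paired:
        # Single-end: passing reads go to "single", the rest to "filtered".
        return ["single" if pass_first else "filtered"]
    if pass_first and pass_second:
        # Both mates survive: they stay paired in directories 1 and 2.
        return ["1", "2"]
    # At least one mate failed: a surviving mate has lost its partner
    # and goes to "single"; failing reads go to "filtered".
    return [
        "single" if pass_first else "filtered",
        "single" if pass_second else "filtered",
    ]


print(read_destinations(True, True, False))
```

For the call shown, the first read lands in single and the second in filtered.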
For example, a read directory could be named:
EAS67_0018_FC12398AAXX/5/sample_phiX/single/length_80
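To make the optional levels explicit, the example path can be split back into its components. The Python sketch below assumes a path that is relative to the ProjectFolder, so that the RunFolder is its first component; the function parse_read_path and the dictionary keys are hypothetical names chosen for illustration and are not part of SHORE.

```python
def parse_read_path(path):
    """Split RunFolder/LaneFolder[/SampleFolder]/ReadFolder[/LengthFolder].

    Assumes the path is relative to the ProjectFolder, i.e. the RunFolder
    is the first component. Optional levels are reported as None.
    """
    parts = [p for p in path.split("/") if p]
    if len(parts) < 3:
        raise ValueError("expected at least RunFolder/LaneFolder/ReadFolder")
    layout = {"run": parts[0], "lane": parts[1],
              "sample": None, "read": None, "length": None}
    rest = parts[2:]
    # Optional SampleFolder, named sample_<user_defined_label>.
    if rest[0].startswith("sample_"):
        layout["sample"] = rest.pop(0)[len("sample_"):]
    # Mandatory ReadFolder: 1, 2 or single.
    if not rest or rest[0] not in ("1", "2", "single"):
        raise ValueError("missing or unknown ReadFolder")
    layout["read"] = rest.pop(0)
    # Optional LengthFolder, named length_<read_length>.
    if rest and rest[0].startswith("length_"):
        layout["length"] = int(rest.pop(0)[len("length_"):])
    if rest:
        raise ValueError("unexpected trailing components: " + "/".join(rest))
    return layout


print(parse_read_path("EAS67_0018_FC12398AAXX/5/sample_phiX/single/length_80"))
```

Applied to the example above, this yields the run EAS67_0018_FC12398AAXX, lane 5, sample label phiX, ReadFolder single and read length 80.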