SHORE Overview


What is SHORE and what is SHORE for?

SHORE is a data analysis and management application for short DNA/RNA reads produced by the various contemporary sequencing platforms. It is developed at the Max Planck Institute for Developmental Biology, Tübingen, Germany.

SHORE is designed to support different sequencing applications including genomic re-sequencing, ChIP-Seq, mRNA-Seq, sRNA-Seq and BS-Seq. It aims to provide an end-to-end pipeline for sequencing data analysis, with modules for every step from raw read data processing up to primary analysis results.

Available modules include quality filtering of raw reads, adapter clipping for sRNA-Seq, mapping of reads against a reference genome, correction for read pair information and quantitative analysis of ChIP or transcriptome data.

Several file format converters ensure the compatibility of SHORE with other widely used file formats like fastq, qual, SAM/BAM, BED and GFF. SHORE was developed for applications in Arabidopsis thaliana but has been successfully used with other genomes, including human, mouse, D. melanogaster, C. elegans, maize and several bacterial genomes.

The SHORE main program

SHORE is a command line application consisting of a variety of subprograms which are invoked through the top-level shore command. For example, the command shore import calls the initial read data import and filtering subprogram.

When called from a terminal without further command line arguments, both shore and its subordinate commands display a basic help page. For many subprograms, more detailed information is available through the command shore help <command>, e.g. shore help import.

Help page and command line options

The help page displayed by commands is a three-column table listing the respective command line option, the option's default value enclosed in parentheses, and a description. For example,

-a, --application=STRING             (=genomic)         Applications: genomic, mRNA, ChIPseq or sRNA

indicates that the option may be specified either as -a <value> or as --application=<value>. If the user does not specify the option, the default value genomic is used, as shown in the middle column; further values accepted as arguments to the option are mRNA, ChIPseq and sRNA. The special values (not set) and (auto) in the default value column indicate that the option has no default value, or that its default value depends on the specification of further options, respectively.
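
The three-column layout described above can be split mechanically. The following is a minimal sketch in Python, not part of SHORE: the names HELP_LINE, HelpEntry and parse_help_line are hypothetical, and the pattern assumes an option that has both a short and a long form and takes an argument, as in the examples on this page.

```python
import re
from typing import NamedTuple, Optional

# Assumed layout: "-a, --application=STRING   (=genomic)   description"
HELP_LINE = re.compile(
    r"^\s*(?P<short>-\w),\s+(?P<long>--[\w-]+)=(?P<argument>\S+)"
    r"\s+\((?P<default>[^)]*)\)\s+(?P<description>.*)$"
)


class HelpEntry(NamedTuple):
    short: str
    long: str
    argument: str
    default_kind: str          # "value", "not set" or "auto"
    default: Optional[str]     # only filled in when default_kind == "value"
    description: str


def parse_help_line(line: str) -> HelpEntry:
    """Split one help page line into its three columns."""
    match = HELP_LINE.match(line)
    if match is None:
        raise ValueError(f"not a recognised help line: {line!r}")
    raw = match["default"]
    if raw.startswith("="):
        kind, value = "value", raw[1:]
    elif raw in ("not set", "auto"):
        kind, value = raw, None
    else:
        raise ValueError(f"unexpected default column: {raw!r}")
    return HelpEntry(match["short"], match["long"], match["argument"],
                     kind, value, match["description"].strip())
```

Applied to the --application line above, the sketch reports the default kind "value" with the value genomic; a (not set) or (auto) column yields no value.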

Many command line options accept a list of arguments, indicated by suffixes [,...] or [:...][,...] to the help page's first column:

 -P, --input-paths=STRING[,...]       (not set)          Import shore directories or arbitrary read files (*)
 -x, --reads-fastq=STRING[:...][,...] (not set)          List(s) of fastq files

The --input-paths option, for example, may be specified either multiple times or with a comma-separated list of files:

shore import -P batch1.fl -P batch2.fl ...
shore import -P batch1.fl,batch2.fl ...

The suffix [:...][,...] indicates that the option's argument list is allowed to be 2-dimensional, with the inner grouping indicated by colon-separated values, e.g.:

shore import -x batch1_1.fq:batch1_2.fq:batch1_3.fq -x batch2_1.fq:batch2_2.fq:batch2_3.fq ...
shore import -x batch1_1.fq:batch1_2.fq:batch1_3.fq,batch2_1.fq:batch2_2.fq:batch2_3.fq ...
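
To illustrate why the two invocations above are equivalent, here is a minimal Python sketch of how such a 2-dimensional argument list could be interpreted. It is an assumption about the semantics described on this page, not SHORE's actual parser, and the function name parse_list_option is hypothetical.

```python
from typing import List


def parse_list_option(occurrences: List[str]) -> List[List[str]]:
    """Turn every occurrence of an option into a list of colon-separated groups.

    `occurrences` holds the raw argument of each time the option was given.
    Commas separate groups within one occurrence, colons separate the
    members of a group.
    """
    groups: List[List[str]] = []
    for argument in occurrences:
        for group in argument.split(","):
            groups.append(group.split(":"))
    return groups


# Option given twice, one group per occurrence.
repeated = parse_list_option([
    "batch1_1.fq:batch1_2.fq:batch1_3.fq",
    "batch2_1.fq:batch2_2.fq:batch2_3.fq",
])

# Option given once, groups separated by a comma.
combined = parse_list_option([
    "batch1_1.fq:batch1_2.fq:batch1_3.fq,batch2_1.fq:batch2_2.fq:batch2_3.fq",
])
```

Under this reading, both forms produce the same two groups of three files each.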

Global command line options

Several parameters applicable to all SHORE subprograms can be set globally through options to the main program. These main program options must be specified before the respective subordinate command, e.g. shore --tmpdir=/large_tmp mapflowcell .... The following options may be set globally:

-u, --umask=STRING                   (auto)             Force a different file creation mask (octal value)
-T, --tmpdir=STRING                  (auto)             Temporary directory
-Z, --compression=STRING             (=xz)              Set the default compression format (plain or gzip or xz)
-G, --random-access=STRING           (=fast)            Random access granularity for compressed output files. Slower random access allows for better compression: fast or medium or slow
-C, --config=STRING                  (=col-0)           Load an alternate set of default values

The --umask option may be used to grant other users access to all output files generated by SHORE. For the meaning of the three-digit values, see the system manual page of the chmod command by typing man chmod. By default, the option's value is determined by your system configuration.
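
As a rough illustration of what an octal mask value means, the following Python sketch computes the permissions a newly created file would receive under a given mask. It assumes the common Unix convention that programs request mode 666 for regular files and the mask removes bits from that request; whether SHORE requests exactly this mode is an assumption. The function name file_mode_under_umask is hypothetical.

```python
import stat


def file_mode_under_umask(umask_text: str, requested: int = 0o666) -> str:
    """Return the permission string of a new file created under the given mask.

    `umask_text` is an octal string such as "022". `requested` is the mode
    the creating program asks for (0o666 is the usual value for regular
    files and is an assumption here).
    """
    mask = int(umask_text, 8)
    effective = requested & ~mask
    # S_IFREG marks a regular file so that filemode() prints a leading "-".
    return stat.filemode(stat.S_IFREG | effective)
```

For instance, a mask of 022 leaves files writable by the owner only, 002 additionally grants write access to the group, and 077 removes all access for other users.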

Many SHORE commands may create large files with temporary data that are removed after the command completes. The directory that these temporary files are created in may be specified explicitly through the --tmpdir option.

The option --compression may be used to specify the compression format for output files created by SHORE. By default, output files are compressed using the XZ file format. Specifying the value gzip will use GZIP compression for created files, resulting in faster write performance, but larger output files. The value plain may be used to completely disable file compression.

With the --random-access option, users may adjust the trade-off between random access performance on generated files and compression ratio. By default, SHORE's output files are optimized to allow fast retrieval of data from any offset into the file, e.g. through the query facilities provided by the shore sort and shore 2dex utilities. The cost is a slight increase in output file size. Users may trade random access performance for better compression using --random-access=medium or --random-access=slow.
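
The trade-off can be demonstrated in general terms. The Python sketch below does not reproduce SHORE's file format; it assumes a simple scheme in which data is compressed in independent XZ blocks, so that a reader only has to decompress the block containing the requested offset. Smaller blocks mean less data to decompress per lookup but more per-block overhead. The function names and the synthetic input are hypothetical.

```python
import lzma
import random


def synthetic_reads(length: int = 200_000, seed: int = 0) -> bytes:
    """Generate a reproducible pseudo-random nucleotide string as test input."""
    rng = random.Random(seed)
    return "".join(rng.choice("ACGT") for _ in range(length)).encode("ascii")


def blockwise_size(data: bytes, block_size: int) -> int:
    """Total compressed size when each block is compressed independently."""
    return sum(
        len(lzma.compress(data[start:start + block_size]))
        for start in range(0, len(data), block_size)
    )


data = synthetic_reads()
fine_grained = blockwise_size(data, 1024)        # fast random access
coarse_grained = blockwise_size(data, 64 * 1024)  # slower random access
```

On this input the fine-grained variant produces the larger total, which is the effect the fast, medium and slow settings let users balance.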

The --config option allows specifying a complete set of option default values, as described in the following.

Configuring option default values

For convenience, default values for the shore main program and subordinate commands may be pre-configured via a configuration file. Default values are configured by creating the file

$HOME/.config/shore/default.cfg

The program or programs that a section of the configuration file refers to must be specified in square brackets; options are specified by their name as displayed on the help page without the introductory minuses:

[shore]
# Use the 'medium' setting for the --random-access option instead of the default 'fast'.
random-access=medium
# Create temporary data in the directory '/large_tmp' by default.
tmpdir=/large_tmp

[shore mapflowcell]
# Use TAIR10 as the default reference sequence for shore mapflowcell.
index-file=~/Genomes/SHORE/TAIR10/TAIR10.fa.shore
# Always generate the statistics plot.
rplot=yes

[shore consensus,shore qVar]
# Command line options without a --option variant must be enclosed in single quotes.
'f'=~/Genomes/SHORE/TAIR10/TAIR10.fa.shore
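
The structure of such a file can be explored with a generic INI reader. The following is a minimal sketch using Python's configparser module; it is not how SHORE itself reads the file, and the function name load_defaults is hypothetical. It expands comma-separated section headers into one entry per program and strips the single quotes from short option names.

```python
import configparser
from typing import Dict

SAMPLE = """\
[shore]
# Use the 'medium' setting for the --random-access option.
random-access=medium
tmpdir=/large_tmp

[shore mapflowcell]
index-file=~/Genomes/SHORE/TAIR10/TAIR10.fa.shore
rplot=yes

[shore consensus,shore qVar]
'f'=~/Genomes/SHORE/TAIR10/TAIR10.fa.shore
"""


def load_defaults(text: str) -> Dict[str, Dict[str, str]]:
    """Map each program name to its configured option defaults."""
    parser = configparser.ConfigParser(
        interpolation=None,        # keep values such as paths untouched
        delimiters=("=",),
        comment_prefixes=("#",),
    )
    parser.optionxform = str       # preserve the case of option names
    parser.read_string(text)

    defaults: Dict[str, Dict[str, str]] = {}
    for section in parser.sections():
        for program in section.split(","):
            options = defaults.setdefault(program.strip(), {})
            for key, value in parser.items(section):
                options[key.strip("'")] = value
    return defaults


defaults = load_defaults(SAMPLE)
```

In the resulting mapping, shore consensus and shore qVar each receive the same f default, while the [shore] section holds the main program options.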

The --config option to the SHORE main program allows specification of an alternate set of default values. E.g.

shore -C col-0 mapflowcell ...

will attempt to use option default values specified in a file

$HOME/.config/shore/col-0.cfg

These defaults apply only for the duration of the respective subprogram run.
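
The naming rule for configuration files can be summarised in a few lines of Python. This is a sketch of the convention stated above, with a hypothetical function name config_path; the home parameter exists only to make the example reproducible.

```python
from pathlib import Path
from typing import Optional


def config_path(name: str = "default", home: Optional[str] = None) -> Path:
    """Return the file a configuration name is expected to resolve to.

    "default" corresponds to the file used when --config is not given;
    any other name corresponds to the argument passed to -C/--config.
    """
    base = Path(home) if home is not None else Path.home()
    return base / ".config" / "shore" / f"{name}.cfg"
```

For a user whose home directory is /home/alice, config_path("col-0", home="/home/alice") points to /home/alice/.config/shore/col-0.cfg.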

Overview of SHORE’s data structure

SHORE structures directories and files in several predefined ways. One central directory, referred to as IndexFolder, stores all information about a reference sequence, including the indexes (created and used by mapping tools), GC content and low complexity sequence analysis. A second directory, referred to as ProjectFolder, contains all information about a particular sequencing project. The IndexFolder is kept separate from the ProjectFolder because one IndexFolder can be reused for multiple projects: for example, different re-sequencing projects (ProjectFolders) can refer to the same reference sequence (IndexFolder). The IndexFolder is populated using shore preprocess, which stores a fasta file named <original_fasta_file>.shore in it. The name of this file serves as the prefix by which SHORE finds all required index files. It is not necessary to keep the original fasta file.

The ProjectFolder contains one or more RunFolders, which are also organized as a predefined directory hierarchy. This is where the actual read and alignment data are stored. Although many SHORE subprograms can also be run stand-alone on single files, this partitioning is required to take full advantage of all SHORE features. The hierarchy does not have to be set up by hand, as all directories are created automatically by SHORE.

All files in the RunFolders are ASCII formatted text files and can easily be viewed, moved and manipulated.

It is also possible to use SHORE for a single step only, e.g. read mapping, and then convert the SHORE alignments into a different file format (e.g. SAM for SNP detection with SAMtools).

The RunFolder and the read data

Directory structure of the RunFolder for read storage:

RunFolder/LaneFolder[/SampleFolder]/ReadFolder[/LengthFolder]
Note: Many SHORE subprograms accept shore directories as input; this term is used to refer to any of RunFolder, LaneFolder, SampleFolder and ReadFolder.

The SampleFolder and LengthFolder levels are optional and will be automatically created where required.

RunFolder/ A project starts with the sequencing of one or more flowcells. Each flowcell is stored in a separate directory, which is named by the user.
LaneFolder/ An Illumina flowcell consists of one to eight lanes, each of them represented by a single directory, named by a single digit. For technologies other than Illumina, this will always be called 1. An additional directory called filtered will be created containing all reads which did not pass the quality filter.
SampleFolder/ This directory level will only be created when using bar codes for de-multiplexing multiple samples. Directories are named sample_<user_defined_label>/.
ReadFolder/ The ReadFolders separate the reads of each pair in the case of paired-end or mate-pair data. Directory 1 contains the reads from the first sequencing run and directory 2 the reads from the second. Directory single contains reads from either run that lost their read partner in the quality filtering process. For single-end runs, all reads passing the quality filter are stored in single.
LengthFolder/ SHORE can optionally sort the reads according to their length, e.g. when trimming away bad quality sequence or sequencing adapters. These directories are named length_<read_length>/. They are created by default for small RNA data only.

For example, a directory could be named:

EAS67_0018_FC12398AAXX/5/sample_phiX/single/length_80
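
The hierarchy can be made concrete by splitting such a path into its levels. The following Python sketch follows the naming rules listed above (sample_<label> for SampleFolders, 1, 2 or single for ReadFolders, length_<n> for LengthFolders); the names ReadLocation and parse_read_path are hypothetical and not part of SHORE.

```python
from typing import NamedTuple, Optional


class ReadLocation(NamedTuple):
    run: str
    lane: str
    sample: Optional[str]   # None when no SampleFolder level is present
    read: str               # "1", "2" or "single"
    length: Optional[int]   # None when no LengthFolder level is present


def parse_read_path(path: str) -> ReadLocation:
    """Split RunFolder/LaneFolder[/SampleFolder]/ReadFolder[/LengthFolder]."""
    parts = [part for part in path.split("/") if part]
    if len(parts) < 3:
        raise ValueError(f"path too short: {path!r}")
    run, lane, rest = parts[0], parts[1], parts[2:]

    sample = None
    if rest[0].startswith("sample_"):
        sample = rest.pop(0)[len("sample_"):]

    if not rest or rest[0] not in ("1", "2", "single"):
        raise ValueError(f"missing or invalid ReadFolder in {path!r}")
    read = rest.pop(0)

    length = None
    if rest:
        if not rest[0].startswith("length_"):
            raise ValueError(f"unexpected directory level in {path!r}")
        length = int(rest.pop(0)[len("length_"):])
    if rest:
        raise ValueError(f"trailing components in {path!r}")

    return ReadLocation(run, lane, sample, read, length)
```

For the example directory above, this yields run EAS67_0018_FC12398AAXX, lane 5, sample phiX, ReadFolder single and read length 80.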