SHORE Subprograms
Contents
- 1 shore preprocess
- 2 shore import
- 3 shore mapflowcell
- 4 shore correct4pe
- 5 shore merge
- 6 shore mapdisp
- 7 shore consensus
- 8 shore qVar
- 9 shore structure
- 10 shore methyl
- 11 shore peak
- 12 shore srna
- 13 shore coverage
- 14 shore mg
- 15 shore count
- 16 shore tagstats
- 17 shore binom_test
- 18 shore mtc
- 19 shore annotate_region
- 20 shore convert
- 21 shore sort
- 22 shore compress
- 23 shore 2dex
- 24 shore idtrans
shore preprocess
shore preprocess creates the mapping indices, calculates local GC content and sequence complexity. In addition, SHORE will create a new copy of the fasta file of the reference sequence featuring adjusted chromosome/contig ids and write all files to the IndexFolder.
Usage: shore preprocess [OPTIONS] [FASTA_FILES]
Mandatory | ||
-f, --fastafile=<arg[,...]> | Fasta file(s) containing all reference sequences | |
-i, --indexfolder=<arg> | IndexFolder. Output folder for all SHORE relevant files. | |
Mapping indices | ||
-s, --seed=<arg[,...]> | (Default: 12) | Seedlength(s) for mapping indices (5-13) |
-b, --bowtie | Activate Bowtie support | |
-n, --novo | Activate Novocraft novoalign support | |
-e, --eland | Activate Eland support | |
-p, --gsnap | Activate GSNAP support | |
-t, --blat | Activate blat support | |
-W, --bwa | Activate BWA support | |
--bwa-construction-algorithm=<arg> | (Default: is) | BWA construction algorithm. Possible options are: is (up to 2GB databases), bwtsw (databases larger than 10MB). For further details see 'bwa index'. |
-U, --ssuff | Activate internal suffix array indexing | |
-g, --no-genomemapper | Deactivate GenomeMapper | |
BS-Seq (bisulfite treated DNA) | ||
-B, --bsseq | Turns on indexing for BS-seq experiments. Only genomemapper and novo are supported. BS-seq indices are calculated in addition to the normal indices for genomemapper and novo. | |
Sequence complexity | ||
-c, --complexity=<arg> | (Default: 9) | Window size in bp for sequence complexity analysis |
-w, --gccontent=<arg> | (Default: 101) | Odd window size in bp for GC content measuring |
SOLiD support | ||
-C, --build-colorspace-index | Build color-space index | |
GenomeMapper Graph version support | ||
-V, --variation-files=<arg> | Variation file(s) (comma separated absolute paths). Turns on graph version | |
Other Options | ||
--maxsize=<arg> | Split into multiple indices if the sequences exceed this size in megabytes | |
--headers | Include the complete fasta headers in the *.trans file (default is first word) |
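The -w option above sets an odd window size for measuring local GC content. The following is a minimal sketch of what a centred, odd-sized window computation can look like. It is not SHORE's implementation: the function name and the way windows are truncated at the sequence ends are assumptions made for illustration.

```python
def gc_content_windows(seq, window=101):
    """Return the GC fraction of an odd-sized window centred on each base.

    Hypothetical helper for illustration; windows are truncated at the
    sequence ends (an assumption, SHORE may treat the ends differently).
    """
    if window % 2 == 0:
        raise ValueError("window size must be odd")
    half = window // 2
    seq = seq.upper()

    # Prefix sums of G/C counts allow each window to be evaluated in O(1).
    prefix = [0]
    for base in seq:
        prefix.append(prefix[-1] + (1 if base in "GC" else 0))

    fractions = []
    for i in range(len(seq)):
        start = max(0, i - half)
        end = min(len(seq), i + half + 1)
        fractions.append((prefix[end] - prefix[start]) / (end - start))
    return fractions


gc = gc_content_windows("GGCCAATT", window=3)
```

An odd window size guarantees that each window has a single central base, which is why an even value is rejected here.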
shore import
This program converts Illumina GAPipeline BUSTARD directories, FASTQ files or SOLiD csfasta files into SHORE format. shore import will create the necessary files and directory structure.
Input formats of the importer are specified using option -v. Available importers are:
- Bustard: Input generated by the GAPipeline (bustard/goat) or SCS programs.
- Fastq: FastQ files. Some users prefer Illumina fastq files as standard output from the GAPipeline.
- Solid: SOLiD F3 and R3 csfasta and (optionally) QV files.
- Shore: SHORE reads_0.fl files. This importer can be used to re-filter or trim reads which are already in SHORE format. In addition, 454 SFF files will also be accepted by this importer.
Usage: shore import [OPTIONS]
Mandatory | ||
-v, --importer=<arg> | (Default: Bustard) | Importers: Bustard, Fastq, Shore, Solid |
-e, --exporter=<arg> | (Default: Shore) | Exporters: Shore, Console |
-a, --application=<arg> | Applications: genomic, mRNA, ChIPseq, sRNA | |
Bustard importer | ||
-b, --bustard-folder=<arg> | Bustard folder, *_qseq.txt files | |
-l, --lanes=<arg[,...]> | (Default: 1,2,3,4,5,6,7,8) | Lanes |
Fastq importer | ||
-Q, --quality-type=<arg> | (Default: sanger) | Quality type provided in fastq files (either "sanger" or "illumina" (illumina prior to CASAVA 1.8)) |
-x, --read1-fastq=<arg[,...]> | List of fastq files for the first run | |
-y, --read2-fastq=<arg[,...]> | List of fastq files for the second run, required for paired-end runs [NOTE: Same file order as in the -x option required] | |
Shore importer | ||
--input=<arg[,...]> | Input reads_0.fl files or flowcell folders | |
Solid importer | ||
-F, --F3prefix=<arg> | Prefix of F3 csfasta and _QV.qual file | |
-R, --R3prefix=<arg> | Prefix of R3 csfasta and _QV.qual file | |
Shore exporter | ||
-o, --flowcell-folder=<arg> | Flowcell folder, will be created | |
-B, --batch-size=<arg> | Divides the length folders into batches that contain <batch-size> reads | |
--no-read-compression | Don't compress read files | |
--no-filtered-compression | Do not compress trash files | |
--rplot | Graphical output of statistics using R | |
--nondestructive-trim | Do not truncate the ends of trimmed or clipped reads | |
-L, --lengthdirs | Always create length_ directories (created by default for sRNA only) | |
Read filtering | ||
-D, --disable-illumina-filter | Start with unfiltered reads, override the GAPipeline filter and other external filters | |
-n, --max-Ns=<arg> | (Default: 100%) | Maximum number of ambiguous base calls per read (percentage of trimmed read length or absolute) |
-g, --lowcomplexity | Turn on low complexity filter | |
-c, --shore-filter | Use custom shore filter (implies '-D' if sig2 files are provided) | |
-C, --chastity-violation=<arg> | (Default: 57) | Threshold for chastity violations (in percent) |
-V, --quality-violation=<arg> | (Default: 3) | Threshold for quality violations (0 to 40) |
--filter-ranges=<arg[,...]> | (Default: 12:2,25:5) | Filter setup for custom shore filter |
Read trimming | ||
-m, --max-length=<arg[,...]> | Maximum read length(s) (read length including barcode) | |
-k, --minimal-length=<arg[,...]> | Minimal read length (switches on read trimming; read length without barcode) | |
-q, --quality-cutoff=<arg> | (Default: 5) | Quality cutoff for read trimming |
--discard-trim-failures | Filter reads trimmed beyond minimal length. | |
Read barcoding | ||
-r, --barcodes=<arg> | File with barcodes (line separated, optional second column is sample name) | |
-h, --barcode-mismatches=<arg> | (Default: 0) | Allowed number of mismatches in the barcodes |
-w, --two-sided-barcodes | Barcode is at both sides of the clone | |
Adapter clipping (454 or application = sRNA) | ||
-d, --adapter-sequence=<arg> | Adapter sequence (please specify first 12 bp) | |
-s, --smallest-sRNA=<arg> | Minimum length of sRNA to report | |
-t, --largest-sRNA=<arg> | Maximum length of sRNA to report | |
-p, --permit-missing-adapter | Permit reads where the adapter cannot be found | |
--linker=<arg> | Specify linker sequence for separation of 454 PE reads |
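The -Q option of the Fastq importer distinguishes "sanger" from "illumina" quality strings. The two encodings differ in their ASCII offset (33 for Sanger, 64 for Illumina pipelines prior to CASAVA 1.8). The sketch below shows the decoding step only; the function and dictionary names are hypothetical, and it does not reproduce how shore import itself reads the files.

```python
# ASCII offsets of the two Phred encodings named by the -Q option.
QUALITY_OFFSETS = {"sanger": 33, "illumina": 64}


def decode_qualities(quality_string, quality_type="sanger"):
    """Convert a FASTQ quality string into a list of Phred scores.

    Hypothetical helper for illustration only.
    """
    offset = QUALITY_OFFSETS[quality_type]
    return [ord(char) - offset for char in quality_string]


sanger_scores = decode_qualities("!5I", "sanger")
illumina_scores = decode_qualities("@Th", "illumina")
```

Both calls yield the same scores, which illustrates why choosing the wrong quality type shifts every base quality by 31.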
shore mapflowcell
This program performs the actual read alignments to a reference genome.
SHORE supports several mapping tools so that a suitable option is available for each application. The default tool, GenomeMapper, is extensively tested. The other available options are Novocraft novoalign, Bowtie, Eland, BWA, GSNAP and BLAT.
Usage: shore mapflowcell [OPTIONS] [READ_PATHS]
Mandatory | ||
-f, --files=<arg[,...]> | Shore directories (run, lane, pe or sample) or read files | |
-i, --index-file=<arg> | Fasta file in IndexFolder, *.shore file | |
Mapping tools | ||
-v, --mapper=<arg> | (Default: genomemapper) | <genomemapper> <novo> <bowtie> <eland> <bwa> <gsnap> <blat> |
-C, --color | BWA & Bowtie: Reads and index are in colorspace | |
-B, --HSO | GenomeMapper: Turn on alignment of BS-seq reads. (EXPERIMENTAL) | |
Alignment parameters | ||
-n, --edit-distance=<arg> | (Default: 0) | Maximum edit distance (read length percentage or absolute value) |
-g, --maxgaps=<arg> | (Default: 0) | GenomeMapper, BWA, GSnap, Novoalign: Maximum number of gaps (0-n) (read length percentage or absolute value) |
-e, --gapextension=<arg> | GSnap & BWA: Maximum gap extension (0-n). -g defines max gap openings and -e max extensions per gap opening (read length percentage or absolute value) | |
-q, --hamming=<arg> | Bowtie & Novoalign: Quality-weighted hamming distance as defined by MAQ. Overwrites -n and -g. Permitted values: 0 to inf. | |
-l, --seed=<arg> | Bowtie & BWA: Seed length (default: 28), the number of bases at the beginning of the read that are required to match. GenomeMapper: Discard hits smaller than this seed length (read length percentage or absolute value) | |
-s, --seed-threshold=<arg> | GenomeMapper: Discard seeds with the number of hits above this threshold | |
--restrict-ED=<arg> | (Default: off) | Automatically limit edit distance according to the seed lemma (off or on or strict) |
Parallelization | ||
-c, --cores=<arg> | (Default: 1) | Number of processors/cores |
-b, --batch-size=<arg> | (Default: 50000) | Number of reads per thread |
-M, --native-cores=<arg> | (Default: 1) | Use the alignment tool's internal parallelization |
Mapping strategy | ||
-R, --report=<arg> | Maximum reported alignments. Recommended for single-end only! | |
-r, --suboptimal=<arg> | BWA: Stop searching for suboptimal alignments when there are >INT equally best hits. GSnap: Report all hits with the best score plus the suboptimal score. Default: no suboptimal alignments | |
-a, --all-hit-strategy | GenomeMapper & Bowtie: Map against all locations within the specified alignment parameters | |
-2, --best2-strategy | GenomeMapper: Report the best and the second best hit | |
--select-seeds=<arg[,...]> | Select from multiple seed lengths if available in index directory | |
-P, --upgrade=<arg> | (Default: off) | Upgrade a previous mapflowcell run (off or replace or leftovers or full) |
Spliced alignment for mRNA-seq | ||
-S, --spliced | BLAT & GSnap: Perform spliced alignments | |
-D, --maxintron=<arg> | (Default: 1000) | Max. intron length considered for spliced alignment (equals --localsplicedist in GSnap). |
-L, --minhit=<arg> | (Default: 17) | Minimum length of hit on either side of spliced read |
Paired end sequencing | ||
-p, --PE | Paired-end mode, generate output suitable for correct4pe or for realignment using --upgrade=full | |
Output | ||
-Z, --nocompress-maplist | Do not compress mapping files | |
-Y, --nocompress-leftover | Do not compress leftover files | |
--rplot | Graphical output of statistics using R |
shore correct4pe
shore correct4pe finds the most likely mapping of repetitive reads by utilizing paired-end information. While in paired read mapping each read is aligned separately, read pair information can be used to increase the likelihood of an alignment by selecting the paired alignment based on the most likely distance between the pairs.
shore correct4pe starts by estimating the insert size distribution. The upper bound of this distribution is usually very sharp (clones longer than expected seem to be very rare), whereas the lower bound is more blurred, and very small clones can be observed as well. The insert size distribution is then translated into a probability distribution for observing a given distance within a pairing, where a pairing is defined as the combination of one of the mappings of read 1 with one of the mappings of read 2. All possible combinations of the mappings of both reads of a pair are compared, and all pairings with a probability of zero are dismissed. Mappings that are not part of any pairing with a probability above zero are deleted, which removes the repetitive mappings that resulted from repeats. If a mapping of one read can be paired with two different mappings of the other read, the more likely pairing is kept. If all pairings have zero probability, all mappings of both reads are kept. These are the discordant (unhappy) read pairs, which are typically used to predict structural variants.
shore correct4pe will plot the insert size distribution using R if -p is specified. In this case R has to be installed and included in the PATH environment variable.
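The pairing logic described above can be sketched in a few lines. This is an illustration under simplifying assumptions, not the correct4pe implementation: mappings are plain (chromosome, position) tuples, read orientation is ignored, the probability is a raw empirical frequency of observed distances, and ties between competing pairings are resolved greedily. All function names are hypothetical.

```python
from collections import Counter
from itertools import product


def make_distance_probability(observed_distances):
    """Turn observed insert sizes into an empirical probability function."""
    counts = Counter(observed_distances)
    total = sum(counts.values())

    def probability(distance):
        return counts.get(distance, 0) / total

    return probability


def select_pairings(mappings1, mappings2, distance_probability):
    """Return (kept mappings of read 1, kept mappings of read 2, concordant)."""
    scored = []
    for m1, m2 in product(mappings1, mappings2):
        if m1[0] != m2[0]:
            continue  # assumption: different chromosomes count as probability zero
        p = distance_probability(abs(m2[1] - m1[1]))
        if p > 0:
            scored.append((p, m1, m2))

    if not scored:
        # Every pairing has zero probability: keep all mappings (discordant pair).
        return list(mappings1), list(mappings2), False

    # When a mapping takes part in several pairings, keep the more likely one.
    scored.sort(key=lambda item: item[0], reverse=True)
    used1, used2, kept = set(), set(), []
    for _, m1, m2 in scored:
        if m1 in used1 or m2 in used2:
            continue
        used1.add(m1)
        used2.add(m2)
        kept.append((m1, m2))
    return [m1 for m1, _ in kept], [m2 for _, m2 in kept], True


prob = make_distance_probability([190, 200, 200, 210])
result = select_pairings(
    [("chr1", 1000)],
    [("chr1", 1200), ("chr1", 5000), ("chr2", 1200)],
    prob,
)
```

In this example only the mapping at position 1200 forms a pairing with non-zero probability, so the two other mappings of read 2 are dropped.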
shore merge
Merges and filters alignment files
shore mapdisp
Text-based alignment visualization
shore consensus
shore consensus has been replaced by shore qVar.
The common output of whole-genome re-sequencing projects is a list of all identified polymorphisms (e.g. SNPs, indels, CNVs) as well as reference-like positions. In addition, a consensus sequence or contigs can be generated by combining all high-quality predictions. shore consensus provides this functionality by sequentially scanning an alignment to gather all read information available at a specific locus (i.e. called bases, base qualities, coverage, repetitiveness, alignment quality). This information is subsequently used to predict differences to the reference sequence.
shore consensus can also be used to identify minor alleles (SNPs or short indels) in pooled samples. In addition shore consensus estimates several characteristics of a run ahead of the actual consensus calling. This includes min and max read length, min and max mismatches, sequencing depth, observed local repetitiveness and GC content bias. Consensus also provides multiple project statistics regarding sequencing error rate, correlation of quality values to observed errors and coverage biases due to local GC content, which can be used to optimize further analysis (e.g. deletions should not be called in low GC content regions if a strong GC bias is observed).
Note: shore consensus can also be applied to sRNA-seq, mRNA-seq and ChIP-seq data. However, SHORE provides more appropriate tools for those purposes (coverage and peak).
shore qVar
Computes consensus sequence, SNPs, indels and CNVs from alignments
shore structure
shore structure enables the detection of diverged regions through clustering of mate pair alignments with an unexpected distance and/or orientation to each other. Typically the recall is very good for deletions, but insertions longer than the insert size cannot be revealed. In addition, shore structure calls inversions. It currently works only for homozygous changes.
shore methyl
Quantify methylated and unmethylated cytosines from BS-seq alignments (only genomemapper)
shore peak
shore peak provides enriched region prediction for ChIP-Seq experiments. Significance of the predicted regions is assessed by comparison to the specified control samples.
Replicate experiments may be processed simultaneously by specifying multiple experiment and control paths. While the significance of each peak region is then tested independently for each replicate, the region prediction itself is performed jointly for all experiments to obtain results that are immediately comparable.
shore srna
The purpose of shore srna is to facilitate the analysis of small RNA sequencing data. It scans the genome for regions where significant amounts of small RNAs are expressed and annotates these loci with read counts as well as the predominant sRNA size.
shore coverage
For analysis of expression levels of mRNAs and small RNAs or for detection of unknown transcripts it is typically required to generate a coverage graph and to define expressed segments based on consecutive coverage.
shore coverage generates a coverage graph by sequentially scanning the alignment and basically counting reads.
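As a rough illustration of the idea, the sketch below derives per-base coverage from a list of alignments and then extracts segments of consecutive coverage. It is not the shore coverage algorithm: the input representation (1-based start and length on a single chromosome), the function names and the min_depth threshold are assumptions.

```python
def coverage_graph(alignments, chrom_length):
    """Per-base read depth for positions 1..chrom_length.

    alignments: iterable of (start, length) with 1-based start positions.
    """
    # Difference array: +1 where a read starts, -1 just after it ends.
    diff = [0] * (chrom_length + 2)
    for start, length in alignments:
        diff[start] += 1
        diff[min(start + length, chrom_length + 1)] -= 1

    coverage, depth = [], 0
    for pos in range(1, chrom_length + 1):
        depth += diff[pos]
        coverage.append(depth)
    return coverage


def expressed_segments(coverage, min_depth=1):
    """Return (start, end) of runs where depth stays at or above min_depth."""
    segments, start = [], None
    for pos, depth in enumerate(coverage, start=1):
        if depth >= min_depth and start is None:
            start = pos
        elif depth < min_depth and start is not None:
            segments.append((start, pos - 1))
            start = None
    if start is not None:
        segments.append((start, len(coverage)))
    return segments


cov = coverage_graph([(1, 3), (2, 3), (8, 2)], chrom_length=10)
segs = expressed_segments(cov)
```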
shore mg
Primitive metagenomic analysis
shore count
shore count calculates the read count as well as other properties for regions in the genome that have already been defined by some other means. It may be used to analyze either fixed-size jumping windows over the genome or regions defined in an input file, e.g. to analyze annotated coding regions or to manually re-analyze regions defined by the segmentation algorithms of shore coverage, shore peak or shore srna.
Accepted input files are tab-delimited plain text files with a header specifying the columns chr, pos, size and optionally strand.
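The following sketch shows one way to read a region file with the columns described above and to count reads per region or per jumping window. The column names chr, pos, size and strand come from the text; everything else (1-based positions, the overlap rule, the function names, the read representation) is an assumption for illustration and may differ from what shore count does.

```python
import csv
import io


def read_regions(handle):
    """Parse a tab-delimited region file with a chr/pos/size[/strand] header."""
    regions = []
    for row in csv.DictReader(handle, delimiter="\t"):
        regions.append({
            "chr": row["chr"],
            "pos": int(row["pos"]),
            "size": int(row["size"]),
            "strand": row.get("strand"),  # None when the column is absent
        })
    return regions


def jumping_windows(chrom, chrom_length, window_size):
    """Fixed-size, non-overlapping windows; the last one may be shorter."""
    return [
        {"chr": chrom, "pos": pos,
         "size": min(window_size, chrom_length - pos + 1), "strand": None}
        for pos in range(1, chrom_length + 1, window_size)
    ]


def count_reads(regions, reads):
    """Count reads (chr, start, length) overlapping each region by >= 1 base."""
    counts = []
    for region in regions:
        first = region["pos"]
        last = first + region["size"] - 1
        counts.append(sum(
            1 for chrom, start, length in reads
            if chrom == region["chr"] and start <= last and start + length - 1 >= first
        ))
    return counts


regions = read_regions(io.StringIO("chr\tpos\tsize\n1\t100\t50\n1\t500\t10\n"))
reads = [("1", 90, 20), ("1", 149, 5), ("1", 150, 5), ("2", 100, 20)]
counts = count_reads(regions, reads)
```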
shore tagstats
Gather read statistics for multiple samples.
shore binom_test
shore binom_test can be used to evaluate two sets of count data against each other using a binomial test.
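To make the idea concrete, here is a minimal sketch of an exact two-sided binomial test applied to a pair of counts. It is not taken from shore binom_test: the normalisation by library totals, the two-sided definition (summing all outcomes no more likely than the observed one) and the function names are assumptions.

```python
from math import comb


def binomial_two_sided(successes, trials, p=0.5):
    """Exact two-sided binomial p-value.

    Sums the probabilities of all outcomes that are no more likely than
    the observed one. Suitable for small counts only, since it enumerates
    every possible outcome.
    """
    probs = [comb(trials, k) * p**k * (1 - p) ** (trials - k)
             for k in range(trials + 1)]
    observed = probs[successes]
    # Small relative tolerance guards against floating-point ties.
    return min(1.0, sum(q for q in probs if q <= observed * (1 + 1e-7)))


def compare_counts(count_a, count_b, total_a, total_b):
    """Test count_a against count_b given the total counts of both samples."""
    expected_share = total_a / (total_a + total_b)
    return binomial_two_sided(count_a, count_a + count_b, expected_share)


p_value = compare_counts(0, 10, total_a=100, total_b=100)
```

With equal totals the expected share of sample A is 0.5, so observing 0 versus 10 reads gives a p-value of 2/1024.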
shore mtc
The subprogram shore mtc implements various multiple testing correction methods. The expected input is a tab-delimited text file with a header, and the column containing the p-values to be adjusted must be named raw_p.
Implemented methods include
- Benjamini-Hochberg false discovery rate control (fdr_bh)
- Bonferroni familywise error rate control (fwer_bonferroni)
- Holm familywise error rate control (fwer_holm)
- Hochberg familywise error rate control (fwer_hochberg)
- Sidak singlestep familywise error rate control (fwer_sidak_ss)
- Sidak stepdown familywise error rate control (fwer_sidak_sd)
- Benjamini-Yekutieli false discovery rate control (fdr_by).
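Three of the listed corrections are sketched below using their textbook definitions, together with a helper that reads the raw_p column from a tab-delimited file. The Python function names are hypothetical, and the code does not claim to reproduce the output format of shore mtc.

```python
import csv
import io


def read_raw_p(handle):
    """Read the raw_p column of a tab-delimited file with a header line."""
    return [float(row["raw_p"]) for row in csv.DictReader(handle, delimiter="\t")]


def bonferroni(p_values):
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def holm(p_values):
    """Holm step-down adjustment."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted, running_max = [0.0] * m, 0.0
    for rank, i in enumerate(order):
        running_max = max(running_max, min(1.0, (m - rank) * p_values[i]))
        adjusted[i] = running_max
    return adjusted


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg step-up adjustment."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i], reverse=True)
    adjusted, running_min = [0.0] * m, 1.0
    for position, i in enumerate(order):
        rank = m - position  # rank of this p-value in ascending order
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


table = io.StringIO("id\traw_p\na\t0.01\nb\t0.04\nc\t0.03\nd\t0.5\n")
raw_p = read_raw_p(table)
adjusted_bh = benjamini_hochberg(raw_p)
```

The adjusted values are returned in the input order, so they can be appended to the original table as an additional column.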
shore annotate_region
shore annotate_region can be used to annotate previously defined genomic regions with the overlapping or nearest genes present in an annotation file. Only the central base of each region will be annotated. The annotation file must be in standard GFF format.
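The sketch below illustrates annotation by the central base of a region. It relies on the standard nine-column GFF layout (seqid, source, type, start, end, score, strand, phase, attributes). The rule for even-sized regions, the restriction to features of type "gene", and the function names are assumptions, not documented behaviour of shore annotate_region.

```python
def parse_gff_genes(lines):
    """Collect (seqid, start, end, attributes) for features of type 'gene'."""
    genes = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        seqid, _source, feature, start, end = fields[:5]
        if feature == "gene":
            genes.append((seqid, int(start), int(end), fields[8]))
    return genes


def annotate_region(chrom, pos, size, genes):
    """Return (attributes, distance) of the gene overlapping or nearest to
    the central base of the region, or None if the chromosome has no genes."""
    centre = pos + (size - 1) // 2  # assumption: left of centre for even sizes

    def distance(gene):
        _, start, end, _ = gene
        if start <= centre <= end:
            return 0
        return min(abs(centre - start), abs(centre - end))

    candidates = [gene for gene in genes if gene[0] == chrom]
    if not candidates:
        return None
    best = min(candidates, key=distance)
    return best[3], distance(best)


gff = [
    "##gff-version 3\n",
    "1\tsrc\tgene\t100\t200\t.\t+\t.\tID=geneA\n",
    "1\tsrc\tgene\t500\t600\t.\t-\t.\tID=geneB\n",
    "1\tsrc\texon\t100\t150\t.\t+\t.\tParent=geneA\n",
]
genes = parse_gff_genes(gff)
overlapping = annotate_region("1", 150, 11, genes)
nearest = annotate_region("1", 380, 21, genes)
```

A distance of 0 marks an overlapping gene, while a positive distance marks the nearest gene on the same chromosome.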
shore convert
Convert SHORE files into common file formats, and vice versa
shore sort
Sort / merge tab-delimited text files
shore compress
Compress files to indexed gzip format
shore 2dex
Range-indexing and query for tab-delimited text files
shore idtrans
Translate SHORE sequence IDs into sequence names, and vice versa