SHORE Subprograms

From SHORE wiki
Revision as of 13:51, 23 September 2011 by Felo80 (Talk | contribs)

shore preprocess

shore preprocess creates the mapping indices and calculates local GC content and sequence complexity. In addition, it creates a new copy of the reference sequence fasta file with adjusted chromosome/contig ids and writes all files to the IndexFolder.


Usage: shore preprocess [OPTIONS] [FASTA_FILES]

Mandatory
-f, --fastafile=<arg[,...]> Fasta file(s) containing all reference sequences
-i, --indexfolder=<arg> IndexFolder. Output folder for all SHORE relevant files.
Mapping indices
-s, --seed=<arg[,...]> (Default: 12) Seedlength(s) for mapping indices (5-13)
-b, --bowtie Activate Bowtie support
-n, --novo Activate Novocraft novoalign support
-e, --eland Activate Eland support
-p, --gsnap Activate GSNAP support
-t, --blat Activate blat support
-W, --bwa Activate BWA support
--bwa-construction-algorithm=<arg> (Default: is) BWA construction algorithm. Possible options are: is (up to 2GB databases), bwtsw (databases larger than 10MB). For further details see 'bwa index'.
-U, --ssuff Activate internal suffix array indexing
-g, --no-genomemapper Inactivates GenomeMapper
BS-Seq (bisulfite treated DNA)
-B, --bsseq Turns on indexing for BS-seq experiments. Only genomemapper and novo are supported. BS-seq indices are calculated in addition to the normal indices for genomemapper and novo.
Sequence complexity
-c, --complexity=<arg> (Default: 9) Window size in bp for sequence complexity analysis
-w, --gccontent=<arg> (Default: 101) Odd window size in bp for GC content measuring
SOLiD support
-C, --build-colorspace-index Build color-space index
GenomeMapper Graph version support
-V, --variation-files=<arg> Variation file(s) (comma separated absolute paths). Turns on graph version
Other Options
--maxsize=<arg> Split into multiple indices if the sequences exceed this size in megabytes
--headers Include the complete fasta headers in the *.trans file (default is first word)
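
To illustrate what a windowed GC content measurement with an odd window size (option -w) can look like, here is a minimal Python sketch. It is not SHORE's implementation: the function name local_gc_content and the handling of sequence ends (windows are simply truncated) are assumptions made for this example.

```python
def local_gc_content(seq, window=101):
    """Return the GC fraction in a window centred on each base of seq."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")
    half = window // 2
    # Prefix sums of G/C counts give each window total in constant time.
    prefix = [0]
    for base in seq.upper():
        prefix.append(prefix[-1] + (base in "GC"))
    n = len(seq)
    result = []
    for i in range(n):
        # Assumption: windows are truncated at the sequence ends.
        lo, hi = max(0, i - half), min(n, i + half + 1)
        result.append((prefix[hi] - prefix[lo]) / (hi - lo))
    return result


if __name__ == "__main__":
    print(local_gc_content("GGATCCGGAATT", window=5))
```

An odd window size guarantees that the window extends equally far to both sides of the position being measured.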

shore import

This program converts Illumina GAPipeline BUSTARD directories, FASTQ files or SOLiD csfasta files into SHORE format. shore import will create the necessary files and the RunFolder directory structure.

Input formats of the importer are specified using option -v. Available importers are:

  • Bustard: Input generated by the GAPipeline (bustard/goat) or SCS programs.
  • Fastq: FastQ files. Some users prefer Illumina fastq files as standard output from the GAPipeline.
  • Solid: SOLiD F3 and R3 csfasta and (optionally) QV files.
  • Shore: SHORE reads_0.fl files. This importer can be used to re-filter or trim reads which are already in SHORE format. In addition, 454 SFF files will also be accepted by this importer.


Usage: shore import [OPTIONS]

Mandatory
-v, --importer=<arg> (Default: Bustard) Importers: Bustard, Fastq, Shore, Solid
-e, --exporter=<arg> (Default: Shore) Exporters: Shore, Console
-a, --application=<arg> Applications: genomic, mRNA, ChIPseq, sRNA
Bustard importer
-b, --bustard-folder=<arg> Bustard directory, *_qseq.txt files
-l, --lanes=<arg[,...]> (Default: 1,2,3,4,5,6,7,8) Lanes
Fastq importer
-Q, --quality-type=<arg> (Default: sanger) Quality type provided in fastq files (either sanger (ASCII offset 33) or illumina (ASCII offset 64, illumina prior to CASAVA 1.8))
-x, --read1-fastq=<arg[,...]> List of fastq files for the first run
-y, --read2-fastq=<arg[,...]> List of fastq files for the second run, required for paired-end runs [NOTE: Same file order as in the -x option required]
Shore importer
--input=<arg[,...]> Input reads_0.fl files or RunFolders.
Note: If a complete RunFolder is specified, the raw data will be recovered, and previous filtering will be undone.
Solid importer
-F, --F3prefix=<arg> Prefix of F3 csfasta and _QV.qual file
-R, --R3prefix=<arg> Prefix of R3 csfasta and _QV.qual file
Shore exporter
-o, --flowcell-folder=<arg> RunFolder, will be created
-B, --batch-size=<arg> Divides the LengthFolders into batches that contain <batch-size> reads
--no-read-compression Don't compress read files
--no-filtered-compression Do not compress trash files
--rplot Graphical output of statistics using R
--nondestructive-trim Do not truncate the ends of trimmed or clipped reads
-L, --lengthdirs Always create length_ directories (created by default for sRNA only)
Read filtering
-D, --disable-illumina-filter Start with unfiltered reads, override the GAPipeline filter and other external filters
-n, --max-Ns=<arg> (Default: 100%) Maximum number of ambiguous base calls per read (percentage of trimmed read length or absolute)
-g, --lowcomplexity Turn on low complexity filter
-c, --shore-filter Use custom shore filter (implies '-D' if sig2 files are provided)
-C, --chastity-violation=<arg> (Default: 57) Threshold for chastity violations (in percent)
-V, --quality-violation=<arg> (Default: 3) Threshold for quality violations (0 to 40)
--filter-ranges=<arg[,...]> (Default: 12:2,25:5) Filter setup for custom shore filter
Read trimming
-m, --max-length=<arg[,...]> Maximum read length(s) (read length including barcode)
-k, --minimal-length=<arg[,...]> Minimal read length (switches on read trimming; read length without barcode)
-q, --quality-cutoff=<arg> (Default: 5) Quality cutoff for read trimming
--discard-trim-failures Filter reads trimmed beyond minimal length.
Read barcoding
-r, --barcodes=<arg> File with barcodes (line separated, optional second column is sample name)
-h, --barcode-mismatches=<arg> (Default: 0) Allowed number of mismatches in the barcodes
-w, --two-sided-barcodes Barcode is at both sides of the clone
Adapter clipping (454 or application = sRNA)
-d, --adapter-sequence=<arg> Adapter sequence (please specify first 12 bp)
-s, --smallest-sRNA=<arg> Minimum length of sRNA to report
-t, --largest-sRNA=<arg> Maximum length of sRNA to report
-p, --permit-missing-adapter Permit reads where the adapter cannot be found
--linker=<arg> Specify linker sequence for separation of 454 PE reads
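
The two quality types accepted by -Q differ only in the ASCII offset used to encode the scores. The following Python sketch shows the decoding; the names QUALITY_OFFSETS and decode_qualities are made up for this example and are not part of SHORE.

```python
# ASCII offsets as stated for the -Q option.
QUALITY_OFFSETS = {"sanger": 33, "illumina": 64}


def decode_qualities(quality_string, quality_type="sanger"):
    """Convert a fastq quality string into a list of integer scores."""
    offset = QUALITY_OFFSETS[quality_type]
    return [ord(ch) - offset for ch in quality_string]


if __name__ == "__main__":
    # The same scores written in both encodings.
    print(decode_qualities("II5", "sanger"))
    print(decode_qualities("hhT", "illumina"))
```

Choosing the wrong quality type shifts every score by 31, so low characters such as '5' would decode to negative values under the illumina offset.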

shore mapflowcell

This program performs the actual read alignments to a reference genome.

SHORE supports several alignment tools so that a suitable option is available for different applications. The default tool, GenomeMapper, is extensively tested. The other tools currently selectable with -v are BWA, Bowtie, Novocraft novoalign, Eland, GSNAP and BLAT.

SHORE mapflowcell will create an alignment file named map.list corresponding to each of the reads_0.fl files in the input directories.


Usage: shore mapflowcell [OPTIONS] [READ_PATHS]

Mandatory
-f, --files=STRING[,...] Shore directories (run, lane, pe or sample) or read files
-i, --index-file=STRING Fasta file in IndexFolder, *.shore file
Mapping tools
-v, --mapper=STRING (Default: genomemapper) <genomemapper> <novo> <bowtie> <eland> <bwa> <gsnap> <blat>
-C, --color BWA & Bowtie: Reads and index are in colorspace
-B, --HSO GenomeMapper: Turn on alignment of BS-seq reads. (EXPERIMENTAL)
Alignment parameters
-n, --edit-distance=INT[%] (Default: 0) Maximum edit distance (read length percentage or absolute value)
-g, --maxgaps=INT[%] (Default: 0) GenomeMapper, BWA, GSnap, Novoalign: Maximum number of gaps (0-n) (read length percentage or absolute value)
-e, --gapextension=INT[%] GSnap & BWA: Maximum gap extension (0-n). -g defines max gap openings and -e max extensions per gap opening (read length percentage or absolute value)
-q, --hamming=INT[%] Bowtie & Novoalign: Quality-weighted hamming distance as defined by MAQ. Overwrites -n and -g. Permitted values: 0 to inf.
-l, --seed=INT[%] Bowtie & BWA: Seed length (default: 28), the number of bases at the beginning of the read that are required to match. GenomeMapper: Discard hits smaller than this seed length (read length percentage or absolute value)
-s, --seed-threshold=INT GenomeMapper: Discard seeds with the number of hits above this threshold
--restrict-ED=STRING (Default: off) Automatically limit edit distance according to the seed lemma (off or on or strict)
Parallelization
-c, --cores=INT (Default: 1) Number of processors/cores
-b, --batch-size=INT (Default: 50000) Number of reads per thread
-M, --native-cores=INT (Default: 1) Use the alignment tool's internal parallelization
Mapping strategy
-R, --report=INT Maximum reported alignments. Recommended for single-end only!
-r, --suboptimal=INT BWA: Stop searching for suboptimal alignments when there are >INT equally best hits. GSnap: Report all hits with the best score plus suboptimal-score (default: no suboptimal alignments)
-a, --all-hit-strategy GenomeMapper & Bowtie: Map against all locations within the specified alignment parameters
-2, --best2-strategy GenomeMapper: Report the best and the second best hit
--select-seeds=INT[,...] Select from multiple seed lengths if available in the index directory
-P, --upgrade=STRING (Default: off) Upgrade a previous mapflowcell run (off or replace or leftovers or full)

The leftovers and full modes allow re-alignment to the same reference sequence as in a previous pass using a different alignment tool or different parameters, as well as additional mapping of the reads to a different reference sequence.

  • off: normal operation, will skip directories already featuring map.list, left_over.fl or ref.txt files
  • replace: replace all previous alignment files
  • leftovers: only align the left over reads from a previous pass of mapflowcell; the results will be merged into the existing map.list files
  • full: try to find more alignment locations or alignments with fewer mismatches for all previously aligned (and unaligned) reads
Spliced alignment for mRNA-seq
-S, --spliced BLAT & GSnap: Perform spliced alignments
-D, --maxintron=INT (Default: 1000) Max. intron length considered for spliced alignment (equals --localsplicedist in GSnap).
-L, --minhit=INT (Default: 17) Minimum length of hit on either side of spliced read
Paired end sequencing
-p, --PE Paired-end mode, generate output suitable for correct4pe or for realignment using --upgrade=full
Output
-Z, --nocompress-maplist Do not compress mapping files
-Y, --nocompress-leftover Do not compress leftover files
--rplot Graphical output of statistics using R
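
Several alignment parameters accept either an absolute value or a percentage of the read length (INT[%]), and --restrict-ED refers to the seed lemma. The Python sketch below illustrates both ideas. It is an illustration only: the function names are hypothetical, rounding percentages down is an assumption, and how SHORE applies the seed lemma internally is not documented here. The lemma itself is the pigeonhole argument that a read of length L with k errors contains an error-free stretch of at least floor(L / (k + 1)) bases.

```python
def resolve_threshold(value, read_length):
    """Turn an INT[%] argument into an absolute number of bases."""
    text = str(value).strip()
    if text.endswith("%"):
        # Assumption: percentages are rounded down to whole bases.
        return read_length * int(text[:-1]) // 100
    return int(text)


def seed_lemma_max_edit_distance(read_length, seed_length):
    """Largest k for which an exact seed of seed_length is guaranteed."""
    if seed_length < 1 or seed_length > read_length:
        raise ValueError("seed length must be between 1 and the read length")
    # k errors split the read into k + 1 pieces; one piece is error-free.
    return read_length // seed_length - 1


if __name__ == "__main__":
    print(resolve_threshold("10%", 36), seed_lemma_max_edit_distance(36, 12))
```

With 36 bp reads and a seed length of 12, alignments with more than two edits could be missed by a seed-based search, which is the kind of limit an automatic restriction would enforce.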

shore correct4pe

shore correct4pe finds the most likely mapping of repetitive reads by utilizing paired-end information. While in paired read mapping each read is aligned separately, read pair information can be used to increase the likelihood of an alignment by selecting the paired alignment based on the most likely distance between the pairs.

shore correct4pe starts by estimating the insert size distribution. The upper bound of this distribution is usually very sharp (clones longer than expected seem to be very rare), whereas the lower bound is more blurred, and very small clones can be observed as well. The insert size distribution is then translated into a probability distribution for observing a given distance within a pairing, where a pairing is the combination of one mapping of read 1 with one mapping of read 2. All possible pairings of the mappings of both reads are compared, and all pairings with a probability of zero are dismissed. Mappings that do not take part in any pairing with a probability above zero are deleted, which removes the spurious mappings caused by repeats. If a mapping of one read can be paired with two different mappings of the other read, the more likely pairing is kept. If all pairings have zero probability, all mappings of both reads are kept. These are the discordant (unhappy) read pairs, which are typically used to predict structural variants.
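
The selection rules described above can be sketched in Python as follows. This is one possible reading of the description, not the actual correct4pe code: the function select_pairings, the representation of mappings as plain positions on one chromosome, and the greedy resolution of competing pairings are all assumptions made for the example.

```python
from itertools import product


def select_pairings(hits1, hits2, distance_probability):
    """Return (kept_hits1, kept_hits2, is_discordant) for one read pair.

    hits1 and hits2 are lists of mapping positions of read 1 and read 2;
    distance_probability maps a distance to its probability under the
    estimated insert size distribution.
    """
    scored = [
        (distance_probability(abs(pos2 - pos1)), pos1, pos2)
        for pos1, pos2 in product(hits1, hits2)
    ]
    # Pairings with zero probability are dismissed.
    concordant = sorted((s for s in scored if s[0] > 0), reverse=True)
    if not concordant:
        # No plausible pairing: keep every mapping (discordant pair).
        return sorted(hits1), sorted(hits2), True
    kept1, kept2 = set(), set()
    for _, pos1, pos2 in concordant:
        # A mapping competing in two pairings keeps the more likely one.
        if pos1 not in kept1 and pos2 not in kept2:
            kept1.add(pos1)
            kept2.add(pos2)
    return sorted(kept1), sorted(kept2), False


if __name__ == "__main__":
    # Toy distribution: inserts between 100 and 300 bp, peak at 200 bp.
    prob = lambda d: max(0.0, 1 - abs(d - 200) / 100)
    print(select_pairings([1000, 50000], [1210, 90000], prob))
```

In the usage example, the repetitive mappings at positions 50000 and 90000 are dropped because neither forms a pairing with non-zero probability.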

shore correct4pe will plot the insert size distribution using R if -p is specified. In this case R has to be installed and included in the PATH environment variable.


Usage: shore correct4pe [OPTIONS]

Mandatory
-l STRING[,...] Lane or sample directories (comma separated)
-x INT Expected insert size, has to be larger than 0
-e INT Library identifier, defines name space of the read identifiers (>=1)
Optional
-r INT (Default: 10000) Maximum number of hits per read-pair
-s INT SOLiD reads
-m INT Mate pair library instead of Paired-end library
-i STRING Insert distribution file (e.g. when re-running correct4pe)
-d INT Delete uncorrected map.list files
-p INT Plot insert dist

shore merge

Merges and filters alignment files


Usage: shore merge [OPTIONS] [ALIGNMENT_PATHS]

Input
-m, --infiles=<arg[,...]> Alignment files or shore directories (run, lane, read or sample; comma-separated; defaults to <outdir>)
-I, --idordered Input alignments are sorted by read ID and not by coordinate
Main output options
-o, --outdir=<arg> (Default: AlignmentFolder) Set output directory (will be created if necessary)
-A, --no-alignments Deactivate processing of alignment files (only merge insert size distributions or left over files)
-l, --leftovers Activate merging of leftover reads
-p, --rplot Graphical output of statistics using R
Ancillary output options
-s, --subsamples=<arg[,...]> Generate random subsamples of <arg> reads
--diff Write all reads that only occur in a single alignment file
--phasesplit=<arg> Split reads by <arg>-mer phasing
--combine Combine alignments (discard duplicate entries and alignments that are not the best hit)
-q, --stats-only Statistics only, do not write the merged data
-t, --no-stats Disable alignment statistics
-Z, --nocompress Do not compress output files
-S, --stdout Don't create an output directory, write alignments to standard output (implies -t, -Z)
Alignment filter
-H, --hits-range=<arg,arg> Set the allowed range of repetitiveness ('1,1' = nonrep reads)
-M, --mm-range=<arg,arg> Set the allowed range of mismatches
-R, --region=<arg> Only use reads that overlap with the range [chr1:pos1..[chr2:]pos2]
--assume-length=<arg> (Default: 400) Assume maximal alignment length <arg>, enables fast range queries
-X, --p3fix=<arg> Set the 3' end to a fixed distance from the 5' end
-N, --read-lengths=<arg[,...]> Use only reads of the given length(s)
-T, --strand=<arg> Use only reads from the given strand
-B, --duplicates=<arg> Report at maximum <arg> reads with the 5' end at the same position on the same strand
--wpoiss=<arg> Window size for adaptive duplicate read filtering
--sam-ref=<arg> Reference sequence for SAM file parsing
--peflags=<arg[,...]> Use only reads with the given PE flag(s)
Leftover filter
--badqual=<arg> (Default: 10) Quality threshold
--maxbadbases=<arg> (Default: 8) Max. number low quality bases
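
As an illustration of the -B, --duplicates filter, the following Python sketch keeps at most a given number of reads whose 5' ends fall on the same position and strand. The tuple layout (chromosome, start, end, strand) and the function name limit_duplicates are assumptions for this example; SHORE's alignment records contain more fields.

```python
from collections import Counter


def limit_duplicates(alignments, max_per_start):
    """Keep at most max_per_start reads per 5' position and strand."""
    seen = Counter()
    kept = []
    for aln in alignments:
        chrom, start, end, strand = aln
        # On the reverse strand the 5' end is the rightmost coordinate.
        five_prime = start if strand == "+" else end
        key = (chrom, five_prime, strand)
        if seen[key] < max_per_start:
            seen[key] += 1
            kept.append(aln)
    return kept


if __name__ == "__main__":
    reads = [("chr1", 100, 135, "+"), ("chr1", 100, 135, "+"),
             ("chr1", 100, 135, "-")]
    print(limit_duplicates(reads, 1))
```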

shore mapdisp

shore mapdisp provides text-based alignment visualization in the terminal. Many more sophisticated programs exist for viewing short read alignments; this tool is instead designed for extremely rapid inspection of the alignments in a specific region of the genome.


Usage: shore mapdisp [OPTIONS] SHORE_PATHS

Output
--outfile=<arg> Write to output file instead of using less
Display
-d, --no-color Do not use color terminal
Read filtering
-H, --hits-range=<arg,arg> Set the allowed range of repetitiveness ('1,1' = nonrep reads)
-M, --mm-range=<arg,arg> Set the allowed range of mismatches
-R, --region=<arg> Only use reads that overlap with the range [chr1:pos1..[chr2:]pos2]
--assume-length=<arg> (Default: 400) Assume maximal alignment length <arg>, enables fast range queries
-X, --p3fix=<arg> Set the 3' end to a fixed distance from the 5' end
-N, --read-lengths=<arg[,...]> Use only reads of the given length(s)
-T, --strand=<arg> Use only reads from the given strand
-B, --duplicates=<arg> Report at maximum <arg> reads with the 5' end at the same position on the same strand
--wpoiss=<arg> Window size for adaptive duplicate read filtering
--sam-ref=<arg> Reference sequence for SAM file parsing
--peflags=<arg[,...]> Use only reads with the given PE flag(s)

shore consensus

shore consensus is being replaced by shore qVar, but is still required for SHOREmap analysis.

The common output from whole genome re-sequencing projects are lists of all identified polymorphisms (e.g. SNPs, indels, CNVs) as well as reference-like positions. In addition a consensus sequence or contigs can be generated by combining all high quality predictions. shore consensus provides this functionality by sequentially scanning an alignment to gather all read information available at a specific locus (i.e. called bases, base qualities, coverage, repetitiveness, alignment quality). This information is subsequently used to predict differences to the reference sequence.

shore consensus can also be used to identify minor alleles (SNPs or short indels) in pooled samples. In addition shore consensus estimates several characteristics of a run ahead of the actual consensus calling. This includes min and max read length, min and max mismatches, sequencing depth, observed local repetitiveness and GC content bias. Consensus also provides multiple project statistics regarding sequencing error rate, correlation of quality values to observed errors and coverage biases due to local GC content, which can be used to optimize further analysis (e.g. deletions should not be called in low GC content regions if a strong GC bias is observed).

Note: shore consensus can also be applied to sRNA-seq, mRNA-seq and ChIP-seq data. However, SHORE provides more appropriate tools for those purposes (coverage and peak).

The output generated by shore consensus is described in SHORE consensus result files.


Usage: shore consensus [OPTIONS]

Mandatory
-n STRING Name (any of species, strain, accession, project or any other ID)
-f STRING Reference genome sequence from the IndexFolder, *.shore file
-o STRING AnalysisFolder, will be created
-i STRING[,...] Shore directories or map.list file(s)
-g INT Core offset - do not trust the first and last -g positions of the alignment. default: max MM's
Quality threshold
-q INT (Default: 5) Cutoff for base masking using Sanger calibrated qualities
-c INT Cutoff for base masking using chastity values
Basecalling (scoring matrix approach)
-a STRING Scoring matrix file (recommended, activates new basecalling approach)
-b FLOAT (Default: 0.2) Minimum allele frequency of alternative base call
Basecalling (decision tree approach)
-x INT (Default: 3) Minimum coverage threshold
-m INT (Default: 3) Maximum observed to expected coverage
-e FLOAT (Default: 0.1) Minimum observed to expected coverage
-y FLOAT (Default: 0.8) Minimum concordance of homozygous SNPs (0 to 1)
-d FLOAT (Default: 0.67) Minimum concordance of homozygous Indels
-t FLOAT (Default: 0.25) Minimum frequency for heterozygous pos (0 to 1)
-u FLOAT (Default: 0.02) Minimum frequency for minor allele pos (0 to 1)
-z INT (Default: 10) Quality threshold, max base quality
Optional
-R INT Allow base calling in highly repetitive regions
-s INT Consensus analysis using transcriptome (mRNA-seq) reads. Turns off CNV analysis
-S INT (Default: 0) Ignore position with transcriptome coverage not above threshold
-w INT Use graph based map.list format (only genomemapper)
-v INT Create additional output files containing all intermediate data (required for subsequent SHOREmap analysis)
-r INT Graphical output of statistics using R
-N INT Turn off calculation of long deletions, duplications and any other CNVs

shore qVar

Computes consensus sequence, SNPs, indels and CNVs from alignments

shore structure

shore structure enables the detection of diverged regions through clustering of mate pair alignments with an unexpected distance and/or orientation to each other. Typically the recall is very good for deletions, but insertions longer than the insert size cannot be revealed. In addition, shore structure calls inversions. It currently only works for homozygous changes.

shore methyl

Quantify methylated and unmethylated cytosines from BS-seq alignments (only genomemapper)

shore peak

shore peak provides enriched region prediction for ChIP-Seq experiments. Significance of the predicted regions is assessed by comparison to the specified control samples.

Replicate experiments may be processed simultaneously by specifying multiple experiment and control paths. While the significance of each peak region is then tested independently for each replicate, the region prediction itself is performed jointly for all experiments to obtain results that are immediately comparable.

The output generated by shore peak is described in SHORE peak result files.

Usage: shore peak [OPTIONS]

Mandatory
-o, --outfolder=<arg> (Default: PeakAnalysis) Output directory (will be created)
-i, --chip-paths=<arg[:...][,...]> ChIP experiment alignment files or shore directories (replicates)
-c, --ctrl-paths=<arg[:...][,...]> Control experiment alignment files or shore directories
Segmentation
-S, --window-size=<arg> (Default: 2000) Sliding window size for dynamic segmentation. Note that this value presents an upper bound for the size of the peaks that can be detected.
-P, --poisson-threshold=<arg> (Default: 0.05) Poisson probability threshold [<=] for dynamic segmentation
-V, --probation=<arg> (Default: 0) Allow a mitigated threshold for at most <arg> base pairs inside a segment
-Q, --mitigator=<arg> (Default: 1) Modifier for calculation of the mitigated threshold, value in [0,1]
-J, --minsize=<arg> (Default: 131) Segment size threshold [>=]
Normalization
-b, --binsize=<arg> (Default: 4000) Size of read bins for normalization
-q, --rankmaxquant-ubound=<arg> (Default: 1) Quantile upper bound for the rank maxima of the bins used
Read filter
-H, --hits-range=<arg,arg> Set the allowed range of repetitiveness ('1,1' = nonrep reads)
-M, --mm-range=<arg,arg> Set the allowed range of mismatches
-R, --region=<arg> Only use reads that overlap with the range [chr1:pos1..[chr2:]pos2]
--assume-length=<arg> (Default: 400) Assume maximal alignment length <arg>, enables fast range queries
-X, --p3fix=<arg> (Default: 130) Set the 3' end to a fixed distance from the 5' end (set to 0 to disable)
-N, --read-lengths=<arg[,...]> Use only reads of the given length(s)
-B, --duplicates=<arg> Report at maximum <arg> reads with the 5' end at the same position on the same strand
--sam-ref=<arg> Reference sequence for SAM file parsing
--peflags=<arg[,...]> Use only reads with the given PE flag(s)
-F, --poissonifier-width=<arg> (Default: 13) Set the window size for the adaptive duplicate read filter (set to zero to disable)
Peak filtering
-n, --nsigma=<arg> (Default: 6) Allow the mean segment coverage of any control sample to be at most <arg> std. deviations higher than the median before discarding the segment
--min-xshift=<arg> (Default: 10) Require a certain shift for the reverse strand peak in at least one experiment
--min-foldchange=<arg> (Default: 2) Require a minimum normalized fold change of <arg> for experiment vs. control for at least one experiment
Other
--non-directional Assume that any clone may be sequenced from both ends (calculates a more conservative FDR)
-d, --rankproduct=<arg> (Default: 10000) Number of simulations for rankproduct PFP estimation (set to zero to disable PFP estimation)
--rplot=<arg> (Default: 100) Plot the first <arg> peaks using R
-r, --index-file=<arg> Extract sequence information for each segment from *.shore index file
-a, --annotation-file=<arg> Annotation file, GFF3 file in sequence ontology compliant format
-O, --chr-ordering=<arg[,...]> Specify the order of chromosome entries in the annotation file
--so-filter=<arg[,...]> (Default: gene,transposable_element_gene) Only parse toplevel annotation features of the given SO types
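
The -X, --p3fix option replaces the observed 3' end of each read with one at a fixed distance from its 5' end, which approximates the sequenced fragment. A minimal Python sketch of that coordinate change is shown below. The function name, the 1-based inclusive coordinates and the interpretation of the distance as the resulting interval length are assumptions for this example.

```python
def fix_three_prime(start, end, strand, fragment_length=130):
    """Return (start, end) with the 3' end moved to a fixed distance."""
    if strand == "+":
        # Forward strand: 5' end is the start, extend to the right.
        return start, start + fragment_length - 1
    # Reverse strand: 5' end is the end, extend to the left.
    # Assumption: coordinates are clamped at position 1.
    return max(1, end - fragment_length + 1), end


if __name__ == "__main__":
    print(fix_three_prime(1000, 1035, "+"))
    print(fix_three_prime(1000, 1035, "-"))
```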

shore srna

shore srna facilitates the analysis of small RNA sequencing data. It scans the genome for regions where significant amounts of small RNAs are expressed and annotates these loci with read counts as well as the predominant sRNA size.


Usage: shore srna [OPTIONS] [SAMPLE_PATHS]

Mandatory
-s, --samples=<arg[:...][,...]> Shore directories (comma-separated; colon-separated items will be treated as a single assay)
-o, --outfolder=<arg> (Default: SrnaAnalysis) Output directory
Segmentation
-j, --joint-seg Apply segmentation threshold to the joint coverage instead of per-sample coverage
-C, --static-threshold=<arg> (Default: 10) Coverage threshold [>]
-J, --minsize=<arg> (Default: 15) Segment size threshold [>=]
-V, --probation=<arg> (Default: 0) Allow a mitigated threshold for at most <arg> base pairs inside a segment
-Q, --mitigator=<arg> (Default: 1) Modifier for calculation of the mitigated threshold, value in [0,1]
-v, --overlap=<arg> (Default: 1) Required overlap for merging segments (may be negative to allow gaps)
Alignment filter
-H, --hits-range=<arg,arg> Set the allowed range of repetitiveness ('1,1' = nonrep reads)
-M, --mm-range=<arg,arg> Set the allowed range of mismatches
-R, --region=<arg> Only use reads that overlap with the range [chr1:pos1..[chr2:]pos2]
--assume-length=<arg> (Default: 400) Assume maximal alignment length <arg>, enables fast range queries
-B, --duplicates=<arg> Report at maximum <arg> reads with the 5' end at the same position on the same strand
--sam-ref=<arg> Reference sequence for SAM file parsing
--peflags=<arg[,...]> Use only reads with the given PE flag(s)

shore coverage

For analysis of expression levels of mRNAs and small RNAs or for detection of unknown transcripts it is typically required to generate a coverage graph and to define expressed segments based on consecutive coverage.

shore coverage generates a coverage graph by sequentially scanning the alignment and counting the reads that cover each position.
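
A coverage graph of this kind can be sketched in a few lines of Python. The example below is an illustration, not SHORE code: the tuple layout (start, end, hits) with 1-based inclusive coordinates is hypothetical, and it assumes that the "divide" weighting of the -W option means each alignment of a read with n hits contributes 1/n. The "multiply" mode is left out because its exact meaning is not described here.

```python
def coverage_track(alignments, chrom_length, weight_repetitive="divide"):
    """Per-base coverage for one chromosome (index 0 is position 1)."""
    if weight_repetitive not in ("divide", "const"):
        raise ValueError("only 'divide' and 'const' are sketched here")
    # Difference array: add the weight at start, remove it after end.
    diff = [0.0] * (chrom_length + 2)
    for start, end, hits in alignments:
        weight = 1.0 / hits if weight_repetitive == "divide" else 1.0
        diff[start] += weight
        diff[end + 1] -= weight
    coverage, running = [], 0.0
    for pos in range(1, chrom_length + 1):
        running += diff[pos]
        coverage.append(running)
    return coverage


if __name__ == "__main__":
    # One unique read and one read with two genomic hits.
    print(coverage_track([(1, 3, 1), (2, 4, 2)], chrom_length=5))
```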


Usage: shore coverage [OPTIONS] [MAPFILES]

Input
-m, --mapfiles=<arg[:...][,...]> Alignment files or shore directories (flowcell, lane, pe or barcode; comma-separated; colon-separated items will be treated as single assay)
-n, --merge-input Merge all input files
Output
-o, --output-directory=<arg> (Default: CoverageAnalysis) Output directory (will be created)
-s, --segmentation Write segmentation files
-t, --merge-segments=<arg> Overlap in base pairs for merging segment files (may be negative to allow gaps); if unspecified, segments will not be merged
-q, --no-coverage Do not write coverage files
-z, --compress Compress output files
--rplot Plot the specified range using R
--ylim=<arg> Set y axis limit for plots (default: auto)
--phasing=<arg> Visualize <arg>-mer phasing
Alignment filter
-H, --hits-range=<arg,arg> Set the allowed range of repetitiveness ('1,1' = nonrep reads)
-M, --mm-range=<arg,arg> Set the allowed range of mismatches
-R, --region=<arg> Only use reads that overlap with the range [chr1:pos1..[chr2:]pos2]
--assume-length=<arg> (Default: 400) Assume maximal alignment length <arg>, enables fast range queries
-X, --p3fix=<arg> Set the 3' end to a fixed distance from the 5' end
-N, --read-lengths=<arg[,...]> Use only reads of the given length(s)
-T, --strand=<arg> Use only reads from the given strand
-B, --duplicates=<arg> Report at maximum <arg> reads with the 5' end at the same position on the same strand
--wpoiss=<arg> Window size for adaptive duplicate read filtering
--sam-ref=<arg> Reference sequence for SAM file parsing
--peflags=<arg[,...]> Use only reads with the given PE flag(s)
Coverage
-W, --weight-repetitive=<arg> (Default: divide) How to weight repetitive hits (divide or multiply or const)
Segmentation
-C, --static-threshold=<arg> (Default: 10) Coverage threshold [>] for static segmentation
-J, --minsize=<arg> (Default: 20) Segment size threshold [>=] for static or dynamic segmentation
-V, --probation=<arg> (Default: 0) Allow a mitigated threshold for at most <arg> base pairs inside a segment
-Q, --mitigator=<arg> (Default: 1) Modifier for calculation of the mitigated threshold, value in [0,1]
-D, --dynamic Switches to dynamic segmentation
-S, --window-size=<arg> (Default: 2000) Sliding window size for dynamic segmentation
-P, --poisson-threshold=<arg> (Default: 0.05) Poisson probability threshold [<=] for dynamic segmentation
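
Static segmentation with a coverage threshold (-C, strictly greater) and a minimum segment size (-J, greater or equal) can be illustrated with the Python sketch below. The function name and the 1-based inclusive output coordinates are assumptions, and the probation and mitigator options are not modelled.

```python
def static_segments(coverage, threshold=10, min_size=20):
    """Return (start, end) of runs with coverage > threshold."""
    segments = []
    seg_start = None
    for index, depth in enumerate(coverage):
        if depth > threshold:
            if seg_start is None:
                seg_start = index
        elif seg_start is not None:
            # Run ended at index - 1; keep it if it is long enough.
            if index - seg_start >= min_size:
                segments.append((seg_start + 1, index))
            seg_start = None
    # Close a run that extends to the end of the coverage track.
    if seg_start is not None and len(coverage) - seg_start >= min_size:
        segments.append((seg_start + 1, len(coverage)))
    return segments


if __name__ == "__main__":
    track = [0, 11, 12, 13, 10, 11, 11]
    print(static_segments(track, threshold=10, min_size=2))
```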

shore mg

Primitive metagenomic analysis


Usage: shore mg [OPTIONS] [MAP_PATHS]

Allowed options
-f, --mappaths=<arg[,...]> Input directories or files
-o, --outfolder=<arg> (Default: Mg) Output directory, will be created
--collapse=<arg[:...][,...]> Collapse any sequence ID not listed here to the next smaller one in the list
--power Initialize the ID combinations for collapse with the power set of all IDs
--autocollapse=<arg> Specify *.trans or ref.txt file to automatically collapse to the species level; preprocess must have been run with the --fullheader option; 2nd & 3rd word of fasta headers are taken to be the species name
--make-unique Make alignments unique before processing
Read filter
-H, --hits-range=<arg,arg> Set the allowed range of repetitiveness ('1,1' = nonrep reads)
-M, --mm-range=<arg,arg> Set the allowed range of mismatches
-N, --read-lengths=<arg[,...]> Use only reads of the given length(s)
-T, --strand=<arg> Use only reads from the given strand
--sam-ref=<arg> Reference sequence for SAM file parsing
--peflags=<arg[,...]> Use only reads with the given PE flag(s)

shore count

shore count calculates the read count as well as other properties for regions in the genome that have already been defined by some other means. It may be used to analyze either fixed-size jumping windows over the genome or regions defined in an input file, e.g. to analyze annotated coding regions or to manually re-analyze regions defined by the segmentation algorithms of shore coverage, shore peak or shore srna.

Accepted input files are tab-delimited plain text files with a header specifying the columns chr, pos, size and optionally strand.


Usage: shore count [OPTIONS] [MAPFILES]

Mandatory
-m, --mapfiles=<arg[:...][,...]> Alignment files or shore directories (flowcell, lane, pe or barcode; comma-separated; colon-separated items will be treated as single assay)
-o, --output-folder=<arg> (Default: SegmentAnalysis) Output directory, will be created
Variable size
-f, --segment-file=<arg> Set file with segment information (expects a sorted file with columns chr, pos, size, strand)
Fixed size
-s, --segment-size=<arg> Use segments of fixed size <arg> instead of a file
-j, --segment-distance=<arg> Distance of fixed size segments (defaults to segment size)
-t, --strand-specific Count both strands separately
Output
-k, --rpkm Also calculate reads per kilobase & million (RPKM) values (totals calculated without applying the alignment filter)
--totals-file=<arg> Read totals for RPKM calculation from a file
-a, --fasta-file=<arg> If a fasta file is provided, the sequence will be reported for each segment
Counting
-O, --overlap=<arg> (Default: 50%) Required amount of overlap between read and feature (percentage or absolute)
-W, --weight-repetitive=<arg> (Default: divide) How to weight repetitive hits (divide or multiply or const)
Alignment filter
-H, --hits-range=<arg,arg> Set the allowed range of repetitiveness ('1,1' = nonrep reads)
-M, --mm-range=<arg,arg> Set the allowed range of mismatches
-R, --region=<arg> Only use reads that overlap with the range [chr1:pos1..[chr2:]pos2]
--assume-length=<arg> (Default: 400) Assume maximal alignment length <arg>, enables fast range queries
-X, --p3fix=<arg> Set the 3' end to a fixed distance from the 5' end
-N, --read-lengths=<arg[,...]> Use only reads of the given length(s)
-T, --strand=<arg> Use only reads from the given strand
-B, --duplicates=<arg> Report at maximum <arg> reads with the 5' end at the same position on the same strand
--wpoiss=<arg> Window size for adaptive duplicate read filtering
--sam-ref=<arg> Reference sequence for SAM file parsing
--peflags=<arg[,...]> Use only reads with the given PE flag(s)
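
The overlap requirement (-O) and the RPKM values (-k) can be illustrated with a short Python sketch. RPKM is the standard "reads per kilobase of segment per million reads" normalization. The helper names and the choice to measure the overlap percentage relative to the read length are assumptions for this example.

```python
def overlap_bases(read_start, read_end, seg_start, seg_end):
    """Number of shared bases of two 1-based inclusive intervals."""
    return max(0, min(read_end, seg_end) - max(read_start, seg_start) + 1)


def counts_towards_segment(read, segment, min_fraction=0.5):
    """True if the read overlaps the segment by the required fraction."""
    read_length = read[1] - read[0] + 1
    # Assumption: the percentage refers to the read length.
    return overlap_bases(*read, *segment) >= min_fraction * read_length


def rpkm(read_count, segment_length, total_reads):
    """Reads per kilobase of segment and per million reads in total."""
    return read_count * 1e9 / (segment_length * total_reads)


if __name__ == "__main__":
    print(counts_towards_segment((95, 130), (100, 200)))
    print(rpkm(500, 2000, 10_000_000))
```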

shore tagstats

Gather read statistics for multiple samples. This is mainly intended for small RNA sequencing when no reference is available.


Usage: shore tagstats [OPTIONS] [PATHS]

Allowed options
-i, --readpaths=<arg[:...][,...]> SHORE directories or read file paths
-o, --outdir=<arg> (Default: ReadAnalysis) Output directory
-r, --report=<arg> (Default: 1) Only report a sequence if it's represented at least <arg> times
-p, --pseudo=<arg> (Default: 0) Add a pseudocount of <arg> to each read count

shore binom_test

shore binom_test can be used to evaluate two sets of count data against each other using a binomial test.


Usage: shore binom_test [OPTIONS]

Allowed options
-i, --input-file=<arg> (Default: stdin) Read count input file
-o, --output-file=<arg> (Default: stdout) Output file
-p, --distribution-p=<arg> (Default: 0.5) Parameter p of the binomial distribution
-n, --normalization-file=<arg> File with scaling factors for each column (overrides '-p')
-a, --alternative=<arg> (Default: less) Specifies the alternative hypothesis for the test (less or greater or twosided)
-1, --first-column=<arg> Name of the first read count column
-2, --2nd-column=<arg> Name of the second read count column (tested vs. column 1)
--global-scaling=<arg> (Default: 1) Scaling constant by which all read counts are multiplied
-j, --input-header=<arg[,...]> Specify header for input file, if not available
-f, --fold-change Report fold enrichment values
--fdr-bh Calculate Benjamini-Hochberg FDR
--sort Sort output
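
The kind of test performed for a pair of counts can be sketched in Python as follows. This is an illustration under stated assumptions, not SHORE's implementation: the first count is treated as the number of successes out of the sum of both counts, and the two-sided p-value sums all outcomes that are no more likely than the observed one. The function names are hypothetical.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n trials with success rate p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def binom_test(k, n, p=0.5, alternative="less"):
    """P-value for observing k successes out of n under Binomial(n, p)."""
    pmf = [binom_pmf(i, n, p) for i in range(n + 1)]
    if alternative == "less":
        p_value = sum(pmf[: k + 1])   # P(X <= k)
    elif alternative == "greater":
        p_value = sum(pmf[k:])        # P(X >= k)
    elif alternative == "twosided":
        # Assumed definition: outcomes at most as likely as the observed one.
        threshold = pmf[k] * (1 + 1e-7)
        p_value = sum(x for x in pmf if x <= threshold)
    else:
        raise ValueError("alternative must be less, greater or twosided")
    return min(1.0, p_value)


# Two read counts for the same feature (assumed roles: count_1 = successes)
count_1, count_2 = 2, 8
for alt in ("less", "greater", "twosided"):
    print(alt, binom_test(count_1, count_1 + count_2, p=0.5, alternative=alt))
```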

shore mtc

The subprogram shore mtc implements various multiple testing correction methods. The expected input is a tab-delimited text file with a header, and the column containing the p-values to be adjusted must be named raw_p.

Implemented methods include

  • Benjamini-Hochberg false discovery rate control (fdr_bh)
  • Bonferroni familywise error rate control (fwer_bonferroni)
  • Holm familywise error rate control (fwer_holm)
  • Hochberg familywise error rate control (fwer_hochberg)
  • Sidak singlestep familywise error rate control (fwer_sidak_ss)
  • Sidak stepdown familywise error rate control (fwer_sidak_sd)
  • Benjamini-Yekutieli false discovery rate control (fdr_by).
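
Three of the corrections listed above can be sketched in a few lines of Python using their textbook definitions. This is an illustration of what the adjusted values mean, not the SHORE code; the function names are hypothetical.

```python
def bonferroni(pvals):
    """Single-step FWER control: multiply each p-value by the number of tests."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]


def holm(pvals):
    """Step-down FWER control: the smallest p-value gets the largest multiplier."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # Enforce monotonicity: an adjusted value never drops below an earlier one.
        running_max = max(running_max, min(1.0, (m - rank) * pvals[idx]))
        adjusted[idx] = running_max
    return adjusted


def benjamini_hochberg(pvals):
    """Step-up FDR control: q-values in the original input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m - 1, -1, -1):  # walk from the largest p-value down
        idx = order[rank]
        running_min = min(running_min, pvals[idx] * m / (rank + 1))
        adjusted[idx] = running_min
    return adjusted


raw_p = [0.01, 0.04, 0.03, 0.005]
print(bonferroni(raw_p))
print(holm(raw_p))
print(benjamini_hochberg(raw_p))
```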


Usage: shore mtc [OPTIONS]

Mandatory
-m, --method=<arg[,...]> Select correction method(s), out of: fdr_bh, fwer_bonferroni, fwer_holm, fwer_hochberg, fwer_sidak_ss, fwer_sidak_sd, fdr_by
-i, --input-file=<arg> (Default: stdin) The file the raw p-values are read from (expects a column 'raw_p')
-o, --output-file=<arg> (Default: stdout) Output file
Output
-u, --fdr-max=<arg> (Default: 1) Maximum q-value to report
-e, --echo-comments Echo all comments read from input files to stdout
-q, --quiet Do not print input, only report the q-values
Other
-j, --input-header=<arg[,...]> Use arg as input file header

shore annotate_region

shore annotate_region can be used to annotate previously defined genomic regions with the overlapping or nearest genes present in an annotation file. Only the central base of each region will be annotated. The annotation file must be in standard GFF format.


Usage: shore annotate_region [OPTIONS]

Mandatory
-a, --annotation-file=<arg> Annotation file
-f, --feature-file=<arg> File with the features to be annotated. This file must contain a header specifying the columns 'chr', 'pos' and optionally 'size' or 'end'
-o, --outfile=<arg> (Default: stdout) Output file
Optional
--header=<arg[,...]> Header for the feature file
--range Use the real regions and not just the central base
--gff Write output in GFF format
--so-filter=<arg[,...]> (Default: gene,transposable_element_gene) Only parse toplevel features of the given Sequence Ontology (SO) types
--print Just print the annotation tree
--query-pos=<arg> Query annotation for the given position

shore convert

Convert SHORE files into common file formats, and vice versa.

Available converters:

  • Alignment2ALN
  • Alignment2BED
  • Alignment2GFF
  • Alignment2Maplist
  • Alignment2SAM
  • ColorFlat2Fastq
  • Contig2AFG
  • Eland2Maplist
  • ExpandTabs
  • FlatPair2Fastq
  • Maplist2Eland
  • Reads2Fasta
  • Reads2Fastq
  • Reads2Flat
  • Reads2Qual
  • Solid2Fastq
  • Solid2Flat
  • Variant2GFF
  • Variant2VCF

Alignment2... converters can convert

  • SHORE map.list files (default)
  • SAM files (*.sam)
  • BAM files (*.bam)

Reads2... converters can convert

  • SHORE reads_0.fl files (default)
  • FastQ files (*.fq, *.fastq)
  • 454 Standard Flowgram Format SFF (*.sff)
  • Illumina QSEQ files (*.qseq, *_qseq.txt)
  • SHORE map.list files (*.list) (discards alignment information and only keeps the read information; input files must be sorted by read ID)
By default, the SHORE file formats (map.list and reads_0.fl, respectively) are expected as input.
All other file types must have the correct file extensions to be recognized (an additional .gz is allowed for compressed files).

Additionally, the special file names stdin and stdout may be used for reading from standard input and for writing to standard output, respectively.

For stdin, map.list format is expected for Alignment2... conversions and reads_0.fl format for Reads2... conversions. To convert different formats from standard input, use e.g. stdin.sam, stdin.fastq.gz, etc.
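
The extension rules for Reads2... converters described above can be summarized in a short Python sketch. The function name and the returned labels are hypothetical; only the extensions themselves come from the lists above, and the actual matching logic in SHORE (for example with respect to letter case) is not documented here.

```python
def detect_read_format(filename):
    """Guess the input format of a Reads2... converter from the file name."""
    name = filename
    if name.endswith(".gz"):
        name = name[:-3]                      # an additional .gz is allowed
    if name.endswith((".qseq", "_qseq.txt")):
        return "qseq"
    if name.endswith((".fq", ".fastq")):
        return "fastq"
    if name.endswith(".sff"):
        return "sff"
    if name.endswith(".list"):
        return "maplist"
    return "reads_0.fl"                       # default SHORE read format


for example in ("reads_0.fl", "lane1.fastq.gz", "s_1_1_0001_qseq.txt", "stdin", "stdin.sff"):
    print(example, "->", detect_read_format(example))
```

The same idea applies to Alignment2... converters, with map.list as the default and *.sam and *.bam as the recognized extensions.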

shore sort

Sort / merge tab-delimited text files


Usage: shore sort [OPTIONS] [TEXT_FILES]

Allowed options
-i, --infiles=<arg[,...]> A comma-separated list of plain-text input files
-o, --outfile=<arg> (Default: stdout) Output file
-p, --preset=<arg> Automatically select sort keys for the file type specified. Supported values:
  • maplist: map.list format sorted by genomic coordinate
  • maplist_id: map.list format sorted by read ID
  • reads0: reads flat file format sorted by read ID
  • gff: GFF format sorted by position
-k, --keystring=<arg> Concatenation of column ids (counted from 1) and key types. Valid key types: t (text), i (integer) and f (float); capital letters reverse the sort order - e.g. '-k 1i5t3i7I'.
-I, --inplace Output file is the same as the input file
-t, --tmpdir=<arg> Temporary file directory (defaults to $TMPDIR or /tmp)
-B, --blocksize=<arg> (Default: 2048) Block size in megabytes
-m, --nur-merge Merge already sorted files
-u, --unique Output only the first line of each run of lines with equal keys
-c, --check Only test if the files are sorted
-b, --upper-bound=<arg[,...]> Returns byte offset (counted from 0) and text of the first line in a sorted file that compares greater than the keys given in <arg> (provide comma-separated values in order of key priority)
-T, --tail=<arg[,...]> Print all lines in a sorted file that compare greater than the keys given in <arg> (provide comma-separated values in order of key priority)
-C, --no-comments Do not treat line comments and empty lines specially
-v, --verbose Be more verbose
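
To make the -k key string syntax concrete, the following Python sketch parses a key string such as '1i5t3i7I' and sorts rows accordingly. It only illustrates the syntax described above (1-based column ids, key types t, i and f, capital letters for reversed order); the function names are hypothetical and this is not how shore sort is implemented.

```python
import re

KEY_TYPES = {"t": str, "i": int, "f": float}


def parse_keystring(keystring):
    """Split e.g. '1i5t3i7I' into (column, converter, reverse) tuples."""
    keys = []
    pos = 0
    for match in re.finditer(r"(\d+)([tifTIF])", keystring):
        if match.start() != pos:
            break                              # unparsable characters in between
        pos = match.end()
        column, letter = int(match.group(1)), match.group(2)
        keys.append((column, KEY_TYPES[letter.lower()], letter.isupper()))
    if not keys or pos != len(keystring):
        raise ValueError(f"invalid key string: {keystring!r}")
    return keys


def sort_rows(rows, keystring):
    """Sort lists of string fields by the keys, first key having top priority."""
    rows = list(rows)
    # Stable sorts applied from the lowest-priority key to the highest.
    for column, convert, reverse in reversed(parse_keystring(keystring)):
        rows.sort(key=lambda row: convert(row[column - 1]), reverse=reverse)
    return rows


rows = [["chr2", "10"], ["chr1", "30"], ["chr1", "5"]]
# Column 1 as text ascending, then column 2 as integer descending
for row in sort_rows(rows, "1t2I"):
    print("\t".join(row))
```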

shore compress

Compress files to indexed gzip format


Usage: shore compress [OPTIONS] FILES

Allowed options
--outfile=<arg> Write to the file <arg> instead of <infile>.gz
--replace Remove original files after compression. If the input file is already compressed it will be recompressed and replaced
--tail=<arg> Instead of compressing files, dump the last <arg> bytes of a seekable file
--dumpgzx Print out the index for each file

shore 2dex

Range-indexing and query for tab-delimited text files


Usage: shore 2dex [OPTIONS] [TEXT_FILES]

Mandatory
-i, --infiles=<arg[,...]> A comma-separated list of tab-delimited plain-text input files (can also be any SHORE directory when -f MAPLIST is set)
Format Options
-f, --format=<arg> Provide file type for automatic settings, valid file types: MAPLIST, GFF, SAM
-c, --chr-column=<arg> Column with chromosome or sequence name, provide the column name or @<column_number>
-p, --pos-column=<arg> Column with start position, provide the column name or @<column_number>
-s, --size-column=<arg> Column with feature size, provide the column name or @<column_number>
-e, --end-column=<arg> Column with end position (inclusive), provide the column name or @<column_number>
-x, --xend-column=<arg> Column with end position (exclusive), provide the column name or @<column_number>
-C, --commentchar=<arg> Comment line symbol
Index Options
-B, --blocksize=<arg> (Default: 131072) Block size determining the index resolution in bytes
-G, --maxgap=<arg> (Default: 131072) Maximum sequence gap in a block
Query Options
-q, --query=<arg> A range to query; prints all overlapping records. Valid ranges: 'SEQ:POS~SIZE', 'SEQ:POS..END', 'SEQ1:POS..SEQ2:END', 'SEQ:POS...XEND', 'SEQ1:POS...SEQ2:XEND' (END: inclusive, XEND: exclusive)
Other
-v, --verbose Be more verbose
-Q, --quiet Be less verbose
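
The range syntax accepted by -q can be illustrated with a small Python parser. This is a sketch of the notation only: the function name is hypothetical, and returning a half-open range (exclusive end) is a choice made here for illustration, not a statement about how shore 2dex stores ranges internally.

```python
import re

QUERY_RE = re.compile(
    r"^(?P<seq1>[^:]+):(?P<pos>\d+)"
    r"(?P<op>~|\.\.\.|\.\.)"
    r"(?:(?P<seq2>[^:]+):)?(?P<value>\d+)$"
)


def parse_query(query):
    """Return (seq1, start, seq2, exclusive_end) for a -q style range string."""
    match = QUERY_RE.match(query)
    if match is None:
        raise ValueError(f"invalid range: {query!r}")
    seq1, pos = match.group("seq1"), int(match.group("pos"))
    seq2 = match.group("seq2") or seq1
    op, value = match.group("op"), int(match.group("value"))
    if op == "~":                     # SEQ:POS~SIZE
        if match.group("seq2"):
            raise ValueError("a size range cannot name a second sequence")
        xend = pos + value
    elif op == "..":                  # END is inclusive
        xend = value + 1
    else:                             # '...': XEND is already exclusive
        xend = value
    return seq1, pos, seq2, xend


for q in ("chr1:100~50", "chr1:100..149", "chr1:100...150", "chr1:100..chr2:20"):
    print(q, "->", parse_query(q))
```

The first three queries in the example describe the same 50 positions on chr1, which shows how the inclusive and exclusive end notations relate to each other.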

shore idtrans

SHORE uses numerical identifiers for all sequences of the reference. shore idtrans simplifies translating these numbers in some of the result files back into chromosome names as specified in the reference fasta file (and vice versa).

Either a *.trans file, which shore preprocess stores in the IndexFolder, or a ref.txt file generated by shore mapflowcell is required.


Usage: shore idtrans [OPTIONS] FILES

Allowed options
-t, --transfile=<arg> *.trans file from IndexFolder
-r, --reffile=<arg> ref.txt file generated by mapflowcell
-o, --outfile=<arg> Output file (default: <infile>.idtrans)
-c, --columns=<arg[,...]> (Default: chr) Columns to be translated (column names or @<column_number>)
--name2id Translate names to IDs (default: translate IDs to names)
--nocompress Do not compress output files
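
The translation step itself can be sketched in Python as follows. The layout of the *.trans and ref.txt files is not described on this page, so the sketch takes the ID-to-name mapping as a plain dictionary; the function names are hypothetical, and treating @<column_number> as 1-based is an assumption.

```python
def resolve_column(spec, header):
    """Turn a column name or '@<column_number>' into a 0-based index."""
    if spec.startswith("@"):
        return int(spec[1:]) - 1              # assumption: numbers count from 1
    return header.index(spec)


def translate_column(lines, mapping, column="chr"):
    """Yield tab-delimited lines with one column translated via mapping."""
    lines = iter(lines)
    header_line = next(lines, None)
    if header_line is None:
        return                                # empty input, nothing to translate
    header = header_line.rstrip("\n").split("\t")
    index = resolve_column(column, header)
    yield "\t".join(header)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Values without a mapping entry are passed through unchanged.
        fields[index] = mapping.get(fields[index], fields[index])
        yield "\t".join(fields)


id_to_name = {"1": "Chr1", "2": "Chr2"}       # hypothetical mapping
table = ["chr\tpos", "1\t100", "2\t7"]
print("\n".join(translate_column(table, id_to_name)))

# The reverse direction (as with --name2id) uses the inverted mapping.
name_to_id = {name: seq_id for seq_id, name in id_to_name.items()}
```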