Shore consensus

From SHORE wiki
Revision as of 14:22, 23 September 2011 by Felo80 (Talk | contribs)

shore consensus is being replaced by shore qvar, but it is still required for SHOREmap analysis.

The common output of whole-genome resequencing projects is a list of all identified polymorphisms (e.g. SNPs, indels, CNVs) together with the reference-like positions. In addition, a consensus sequence or contigs can be generated by combining all high-quality predictions. shore consensus provides this functionality by sequentially scanning an alignment and gathering all read information available at each locus (i.e. called bases, base qualities, coverage, repetitiveness, alignment quality). This information is subsequently used to predict differences from the reference sequence.
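The per-locus scan described above can be illustrated with a small Python sketch. It is not SHORE's implementation: the `reads` tuples (start, sequence, qualities) are a hypothetical simplification of ungapped alignments, not the map.list format, and treating `min_quality` as an inclusive lower bound is an assumption that only loosely mirrors the -q cutoff.

```python
from collections import Counter


def pileup_at(reads, locus, min_quality=5):
    """Count the called bases covering a 0-based reference position.

    `reads` is a hypothetical list of (start, sequence, qualities) tuples
    describing ungapped alignments. Bases whose quality is below
    `min_quality` are masked, i.e. left out of the counts (assumed rule).
    """
    counts = Counter()
    for start, sequence, qualities in reads:
        offset = locus - start
        if 0 <= offset < len(sequence) and qualities[offset] >= min_quality:
            counts[sequence[offset]] += 1
    return counts


# Three made-up reads aligned to a made-up reference.
reads = [
    (0, "ACGTA", [30, 30, 30, 30, 30]),
    (2, "GTACC", [30, 30, 4, 30, 30]),   # third base has low quality
    (3, "GACCT", [30, 30, 30, 30, 30]),
]

counts = pileup_at(reads, 3)
coverage = sum(counts.values())
# Frequency of each observed base at the locus.
frequencies = {base: n / coverage for base, n in counts.items()}
print(coverage, frequencies)
```

Collecting counts per locus first and deriving coverage and allele frequencies afterwards keeps the scan itself independent of whichever calling thresholds are applied later.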

shore consensus can also be used to identify minor alleles (SNPs or short indels) in pooled samples. In addition, shore consensus estimates several characteristics of a run ahead of the actual consensus calling. These include minimum and maximum read length, minimum and maximum number of mismatches, sequencing depth, observed local repetitiveness and GC content bias. shore consensus also reports project statistics on the sequencing error rate, the correlation of quality values with observed errors, and coverage biases due to local GC content. These statistics can be used to optimize further analysis (e.g. deletions should not be called in low-GC regions if a strong GC bias is observed).
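To make the idea of a GC-related coverage bias concrete, the following Python sketch groups the mean depth of fixed-size windows by their GC fraction. The window size, the rounding into bins and the toy data are assumptions chosen for illustration; how shore consensus actually computes its GC statistics is not described here.

```python
def gc_fraction(sequence):
    """Return the fraction of G and C among unambiguous bases (N is ignored)."""
    sequence = sequence.upper()
    gc = sum(sequence.count(base) for base in "GC")
    at = sum(sequence.count(base) for base in "AT")
    total = gc + at
    return gc / total if total else 0.0


def coverage_by_gc(reference, depth, window=4):
    """Average the mean depth of non-overlapping windows per GC bin.

    GC fractions are rounded to one decimal place to form the bins.
    A trailing partial window is skipped.
    """
    bins = {}
    for start in range(0, len(reference) - window + 1, window):
        gc = round(gc_fraction(reference[start:start + window]), 1)
        mean_depth = sum(depth[start:start + window]) / window
        bins.setdefault(gc, []).append(mean_depth)
    return {gc: sum(values) / len(values) for gc, values in sorted(bins.items())}


# Made-up reference and per-position depth with lower coverage in AT-rich windows.
reference = "GGCCATATGCGCAATT"
depth = [8, 8, 8, 8, 2, 2, 2, 2, 8, 8, 8, 8, 2, 2, 2, 2]
print(coverage_by_gc(reference, depth))
```

If the low-GC bins show much lower depth than the rest, as in this toy data, a drop in coverage in such a region is weak evidence for a deletion.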

Note: shore consensus can also be applied to sRNA-seq, mRNA-seq and ChIP-seq data. However, SHORE provides more appropriate tools for those purposes (coverage and peak).

The output generated by shore consensus is described in SHORE consensus result files.


Usage: shore consensus [OPTIONS]

Mandatory
-n STRING Name (any of species, strain, accession, project or any other ID)
-f STRING Reference genome sequence from the IndexFolder, *.shore file
-o STRING AnalysisFolder, will be created
-i STRING[,...] Shore directories or map.list file(s)
-g INT Core offset: do not trust the first and last -g positions of the alignment (Default: maximum number of mismatches)
Quality threshold
-q INT (Default: 5) Cutoff for base masking using Sanger calibrated qualities
-c INT Cutoff for base masking using chastity values
Basecalling (scoring matrix approach)
-a STRING Scoring matrix file (recommended, activates new basecalling approach)
-b FLOAT (Default: 0.2) Minimum allele frequency of alternative base call
Basecalling (decision tree approach)
-x INT (Default: 3) Minimum coverage threshold
-m INT (Default: 3) Maximum observed to expected coverage
-e FLOAT (Default: 0.1) Minimum observed to expected coverage
-y FLOAT (Default: 0.8) Minimum concordance of homozygous SNPs (0 to 1)
-d FLOAT (Default: 0.67) Minimum concordance of homozygous Indels
-t FLOAT (Default: 0.25) Minimum frequency for heterozygous positions (0 to 1)
-u FLOAT (Default: 0.02) Minimum frequency for minor allele positions (0 to 1)
-z INT (Default: 10) Quality threshold, max base quality
Optional
-R INT Allow base calling in highly repetitive regions
-s INT Consensus analysis using transcriptome (mRNA-seq) reads. Turns off CNV analysis
-S INT (Default: 0) Ignore position with transcriptome coverage not above threshold
-w INT Use graph based map.list format (only genomemapper)
-v INT Create additional output files containing all intermediate data (required for subsequent SHOREmap analysis)
-r INT Graphical output of statistics using R
-N INT Turn off calculation of long deletions, duplications and any other CNVs
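
As a usage illustration, the Python sketch below assembles a shore consensus call from the options listed above. Only flags that appear in this listing are used; the helper name `build_consensus_command`, the sample name and all paths are hypothetical, and the command is printed rather than executed.

```python
import shlex


def build_consensus_command(name, reference, output_dir, inputs, **options):
    """Assemble a `shore consensus` argument list.

    -n, -f, -o and -i come from the positional arguments. Further options
    are given as single-letter keyword arguments, e.g. q=5 becomes "-q 5".
    """
    command = [
        "shore", "consensus",
        "-n", name,
        "-f", reference,
        "-o", output_dir,
        "-i", ",".join(inputs),   # -i takes a comma-separated list
    ]
    for flag, value in options.items():
        command += ["-" + flag, str(value)]
    return command


command = build_consensus_command(
    "sample1",                            # hypothetical name
    "index/reference.fa.shore",           # hypothetical *.shore reference
    "analysis/sample1",                   # hypothetical AnalysisFolder
    ["run1/map.list", "run2/map.list"],   # hypothetical map.list files
    q=5, x=3, y=0.8,
)
print(shlex.join(command))
```

Building the call as a list keeps each argument separate, so the same list can be handed to `subprocess.run` without shell quoting issues.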