Shore srna

From SHORE wiki
Revision as of 16:24, 26 September 2011 by Felo80 (Talk | contribs)

shore srna facilitates the analysis of small RNA sequencing data. It scans the genome for regions where significant amounts of small RNAs are expressed and annotates these loci with read counts and the predominant sRNA size.

Command line options

Usage: shore srna [OPTIONS] [SAMPLE_PATHS]

Input
-s, --samples=STRING[:...][,...] Shore folders (comma-separated; colon-separated items will be treated as a single assay)
Output
-o, --outdir=STRING (Default: SrnaAnalysis) Output directory (will be created)
--rpkm Report counts normalized as 'reads per kilobase per million' instead of 'reads per million'
Coverage
-W, --weight-repetitive=STRING (Default: divide) How to weight repetitive hits (divide, multiply or const)
  • divide: each alignment has a score of 1/number_of_hits
  • multiply: each alignment has a score of number_of_hits (only useful for repeat analysis; do not use otherwise)
  • const: each alignment is counted as 1
Segmentation
-j, --joint-seg Apply segmentation threshold to the joint coverage instead of per-sample coverage
-C, --static-threshold=FLOAT (Default: 10) Coverage threshold [>]
-J, --minsize=INT (Default: 15) Segment size threshold [>=]
-V, --probation=INT (Default: 0) Allow a mitigated threshold for at most <arg> base pairs inside a segment
-Q, --mitigator=FLOAT (Default: 1) Modifier for calculation of the mitigated threshold, value in [0,1]
-v, --overlap=INT (Default: 1) Required overlap for merging segments (may be negative to allow gaps)
Alignment filter
-H, --hits-range=INT,INT Set the allowed range of repetitiveness ('1,1' = nonrep reads)
-M, --mm-range=INT,INT Set the allowed range of mismatches
-R, --region=STRING Only use reads that overlap with the range [chr1:pos1..[chr2:]pos2]
--assume-length=INT (Default: 400) Assume maximal alignment length <arg>, enables fast range queries
-N, --read-lengths=INT[,...] Use only reads of the given length(s)
-B, --duplicates=FLOAT Report at most <arg> reads with the 5' end at the same position on the same strand
--sam-ref=STRING Reference sequence for SAM file parsing
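
The interplay of the Coverage and Segmentation options can be illustrated with a small Python sketch. It is not SHORE's implementation: the function names, the 0-based half-open coordinates and the input tuples are assumptions made for illustration. Only the three -W weighting modes, the "> threshold" rule of -C and the ">= size" rule of -J are taken from the option descriptions above. The probation, mitigator and overlap options are left out, and a single strand of a single sample is assumed.

```python
from typing import List, Tuple

# Hypothetical alignment record: (start, end, number_of_hits), 0-based, end exclusive.
Alignment = Tuple[int, int, int]


def hit_weight(number_of_hits: int, mode: str = "divide") -> float:
    """Score of one alignment of a read that has number_of_hits alignments."""
    if number_of_hits < 1:
        raise ValueError("number_of_hits must be >= 1")
    if mode == "divide":
        return 1.0 / number_of_hits
    if mode == "multiply":
        return float(number_of_hits)
    if mode == "const":
        return 1.0
    raise ValueError(f"unknown weighting mode: {mode}")


def coverage(alignments: List[Alignment], length: int, mode: str = "divide") -> List[float]:
    """Sum the alignment scores at every position of a reference of the given length."""
    cov = [0.0] * length
    for start, end, hits in alignments:
        weight = hit_weight(hits, mode)
        for pos in range(max(start, 0), min(end, length)):
            cov[pos] += weight
    return cov


def segments(cov: List[float], threshold: float = 10.0, minsize: int = 15) -> List[Tuple[int, int]]:
    """Return (start, size) for every run with coverage > threshold and size >= minsize."""
    result = []
    run_start = None
    # The trailing sentinel closes a run that reaches the end of the reference.
    for pos, value in enumerate(cov + [float("-inf")]):
        if value > threshold:
            if run_start is None:
                run_start = pos
        elif run_start is not None:
            if pos - run_start >= minsize:
                result.append((run_start, pos - run_start))
            run_start = None
    return result
```

With the defaults, twelve unique 20 bp reads stacked at the same position form one segment. If each of those reads had four hits, the divide mode would reduce the coverage to 3 and no segment would be reported.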

SHORE srna result files

The main result file produced by shore srna is named seg.txt. It contains the following columns:

chr sequence / chromosome ID
pos left-most position of the expressed locus on the reference sequence
size size of the expressed locus
strand strand of the expressed locus; each strand is processed completely independently
kmer_maxofs offset into the locus where the kmer with size lmax is most strongly expressed; useful for locating the exact position of mature miRNA.
agree fraction of samples where lmax is the size of the most frequent kmer at the locus
disagree fraction of samples where lmax is not the size of the most frequent kmer at the locus
lmax size of the most frequent kmer (calculated from the RPM-normalized read counts) at the locus across all samples
cmax RP[K]M-normalized count of the most frequent kmer at this locus across all samples
ctotal RP[K]M-normalized total read count across all samples at this locus
cpure kmer "purity": cmax/ctotal
cchas kmer "chastity": cmax/(cmax+cmax2), where cmax2 is the normalized read count of the 2nd-most frequent kmer at the locus; cchas is always >=0.5
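
The relationship between lmax, cmax, ctotal, cpure and cchas can be illustrated with a short Python sketch. It is a sketch under stated assumptions, not SHORE's code. The function names and the input dictionary (kmer size mapped to an already normalized count for one locus) are hypothetical, and how SHORE combines counts across samples is not specified here. The normalization helper follows the usual definition of reads per million and reads per kilobase per million. Which read total SHORE uses as the denominator, and that the kilobase term is the locus size, are assumptions.

```python
from typing import Dict


def normalize(count: float, total_reads: int, locus_size: int = 0, rpkm: bool = False) -> float:
    """Reads per million, or reads per kilobase per million when rpkm is set."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    value = count * 1e6 / total_reads
    if rpkm:
        if locus_size <= 0:
            raise ValueError("locus_size must be positive for RPKM")
        value = value * 1000.0 / locus_size
    return value


def locus_stats(counts_by_size: Dict[int, float]) -> Dict[str, float]:
    """Derive the seg.txt style statistics from normalized counts per kmer size."""
    ctotal = sum(counts_by_size.values())
    if ctotal <= 0:
        raise ValueError("locus has no reads")
    # Rank kmer sizes by normalized count, most frequent first.
    ranked = sorted(counts_by_size.items(), key=lambda item: item[1], reverse=True)
    lmax, cmax = ranked[0]
    cmax2 = ranked[1][1] if len(ranked) > 1 else 0.0
    return {
        "lmax": lmax,
        "cmax": cmax,
        "ctotal": ctotal,
        "cpure": cmax / ctotal,           # purity: share of all reads
        "cchas": cmax / (cmax + cmax2),   # chastity: lead over the runner-up
    }
```

For a locus with normalized counts of 60 for 21-mers, 30 for 24-mers and 10 for 22-mers, this gives lmax = 21, cpure = 0.6 and cchas = 60/90. Since cmax2 can never exceed cmax, cchas cannot fall below 0.5.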