SHORE v0.7 file formats

From SHORE wiki
Revision as of 15:03, 23 September 2011 by Felo80 (Talk | contribs)

SHORE usually writes its output to text files containing tab-delimited columns.

Typing shore fmt will display a quick reference for many of SHORE’s file formats.

Read file format

Read files can be found in the LengthFolders or ReadFolders. These files are created by shore import and are named reads_0.fl.

reads_0.fl files are usually sorted on the id field in numerical order.

The tab delimited entries are:

id A unique identifier for the read or read pair
sequence DNA sequence
pe Flag: ’1’ for the first read and ’2’ for the second read of a read pair; ’0’ for a single-end read.
Sanger quality values String of Sanger-calibrated quality values

Encoding: ASCII 33 ('!', quality 0) to ASCII 73 ('I', quality 40); extended range ASCII 93 (']', quality 60)

[Chastity values]

(Optional column)

String of Illumina chastity values (defined as <math>Intensity(max) / (Intensity(max) + Intensity(second))</math>).

Encoding: ASCII 40 ('(', chastity of 0.5) to ASCII 90 ('Z', chastity of 1.0)

Note: Earlier versions of SHORE used a read file format featuring three different quality types per entry. This file format is still supported for reading, but is no longer written.
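
The read file columns described above can be decoded with a few lines of Python. This is only a minimal sketch, not part of SHORE: the names ReadRecord, decode_sanger, decode_chastity and parse_read_line are hypothetical, the column order is assumed to follow the listing above, and the chastity scale is assumed to be linear between the two documented endpoints (0.01 per ASCII step).

```python
from typing import List, NamedTuple, Optional


class ReadRecord(NamedTuple):
    read_id: str
    sequence: str
    pe: int                          # 0 = single-end, 1 or 2 = read of a pair
    qualities: List[int]
    chastity: Optional[List[float]]  # None when the optional column is absent


def decode_sanger(qual: str) -> List[int]:
    """ASCII 33 ('!') encodes quality 0, so subtract 33 from each code."""
    return [ord(c) - 33 for c in qual]


def decode_chastity(chas: str) -> List[float]:
    """Assumption: linear scale from '(' (0.5) to 'Z' (1.0), 0.01 per step."""
    return [0.5 + (ord(c) - 40) / 100 for c in chas]


def parse_read_line(line: str) -> ReadRecord:
    """Split one tab-delimited read line into its documented fields."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) < 4:
        raise ValueError("expected at least 4 tab-delimited columns")
    read_id, sequence, pe, qual = fields[:4]
    chastity = decode_chastity(fields[4]) if len(fields) > 4 else None
    return ReadRecord(read_id, sequence, int(pe), decode_sanger(qual), chastity)
```

For example, parse_read_line("17\tACGT\t0\tIIII") yields a single-end record with four quality values of 40 and no chastity values. The older three-quality read format mentioned in the note is not handled by this sketch.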

Alignment file format

SHORE alignment files are typically named map.list, map.list.1 or map.list.2. They are stored in the LengthFolders or ReadFolders.

map.list files are sorted in numerical order, either on the fields chr id and pos, or on the field read id.

The tab delimited entries are:

chr id Chromosomes are assigned internal ids, enumerated in order of their occurrence within the reference sequence file, starting from 1. The translation to the native chromosome name can be found in the *.shore.ref and *.shore.trans files in the IndexFolder, or in the ref.txt files created by shore mapflowcell.
pos Left-most position of the alignment relative to the forward strand of the reference sequence. The first position of a chromosome is 1.
alignment String representation of the read alignment. The sequence is always reported with respect to the forward strand of the reference, i.e. the sequence of reads matching to the reverse strand is reverse complemented.
  • Matches are reported as a single IUPAC character
  • Mismatches and gaps are reported as two characters surrounded by brackets. The first character represents the reference base, the second character the sequenced base. A nucleotide missing from either the reference or the read is represented as ’-’.
    • Examples: [CT] (mismatch), [-T] (insertion), [C-] (deletion)
    • Long deletions with respect to the reference may be reported as the character L followed by the size of the deletion, e.g. [L100]
  • Unaligned sequence ('soft clip') may be reported in angle brackets, e.g. <TTTTTT>

Extensions, not supported by all tools:

  • Consecutive stretches of the same operation (mismatch, insertion, deletion) may be abbreviated, e.g. [CTT|---] instead of [C-][T-][T-]
  • F can be used to indicate a mapped part of a fragment with known size, but unknown sequence, e.g. [F100]
read id A unique identifier for the read or read pair
strand D for forward and P for reverse hits (direct and palindromic, respectively).
mismatches The number of mismatches+gaps in the alignment.
hits The total number of genomic positions the read is aligned to.
read length Length of the read ('soft clipped' nucleotides excluded).
reserved Reserved field for future use, should be parsed as a string
pe flag Paired-end information
  • 0: single read
  • 1: first read of a pair
  • 2: second read of a pair
  • 3: first read of a pair (concordant mapping)
  • 4: first read of a pair (discordant mapping)
  • 5: first read of a pair (orphan read)
  • 6: second read of a pair (concordant mapping)
  • 7: second read of a pair (discordant mapping)
  • 8: second read of a pair (orphan read)

A library ID <math>L</math> may be encoded in the pe flag for flags with a value <math>>2</math>; in this case the stored flag is calculated as <math>stored\_flag = pe\_flag + (L * 6)</math>

Sanger quality values String of Sanger-calibrated quality values.

Encoding: ASCII 33 ('!', quality 0) to ASCII 73 ('I', quality 40); extended range ASCII 93 (']', quality 60)

[Chastity values]

(Optional column)

String of Illumina chastity values defined as the highest intensity divided by the sum of the highest and the second highest intensity of a single base.

Encoding: ASCII 40 ('(', chastity of 0.5) to ASCII 90 ('Z', chastity of 1.0)
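
The alignment string described in the alignment column can be split into individual operations with a small tokenizer. The following is a sketch under stated assumptions and is not taken from SHORE: Op, parse_alignment and read_sequence are hypothetical names, and for the abbreviated form it is assumed that the characters before the | are reference bases and those after it are the sequenced bases, as the example [CTT|---] suggests.

```python
import re
from dataclasses import dataclass
from typing import List

# One token is a bracketed operation, a soft clip, or a single matching base.
_TOKEN = re.compile(r"\[([^\]]+)\]|<([^>]*)>|([A-Za-z])")


@dataclass(frozen=True)
class Op:
    kind: str       # match, mismatch, insertion, deletion, long_deletion, fragment, clip
    ref: str = ""   # reference base ('-' for an insertion)
    read: str = ""  # sequenced base(s) ('-' for a deletion)
    size: int = 0   # only used by long_deletion and fragment


def _pair(ref: str, read: str) -> Op:
    """Classify a two-character [ref read] operation."""
    if ref == "-":
        return Op("insertion", ref, read)
    if read == "-":
        return Op("deletion", ref, read)
    return Op("mismatch", ref, read)


def parse_alignment(aln: str) -> List[Op]:
    ops: List[Op] = []
    pos = 0
    for m in _TOKEN.finditer(aln):
        if m.start() != pos:  # finditer skips unmatched text, so check for gaps
            raise ValueError(f"unexpected text at offset {pos}")
        pos = m.end()
        bracket, clip, base = m.groups()
        if base is not None:
            ops.append(Op("match", base, base))
        elif clip is not None:
            ops.append(Op("clip", read=clip))
        elif bracket[0] in "LF" and bracket[1:].isdigit():
            kind = "long_deletion" if bracket[0] == "L" else "fragment"
            ops.append(Op(kind, size=int(bracket[1:])))
        elif "|" in bracket:  # abbreviated form, e.g. [CTT|---]
            ref, read = bracket.split("|")
            if len(ref) != len(read):
                raise ValueError(f"unbalanced operation [{bracket}]")
            ops.extend(_pair(r, q) for r, q in zip(ref, read))
        elif len(bracket) == 2:
            ops.append(_pair(bracket[0], bracket[1]))
        else:
            raise ValueError(f"unrecognized operation [{bracket}]")
    if pos != len(aln):
        raise ValueError(f"unexpected text at offset {pos}")
    return ops


def read_sequence(ops: List[Op]) -> str:
    """Sequenced bases in forward-strand orientation, soft clips included.
    Fragment operations carry no sequence and therefore contribute nothing."""
    kinds = ("match", "mismatch", "insertion", "clip")
    return "".join(op.read for op in ops if op.kind in kinds)
```

With this sketch, parse_alignment("[CTT|---]") gives the same operations as parse_alignment("[C-][T-][T-]"), so tools that only understand the basic notation could be fed the expanded form.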

Note: Earlier versions of SHORE used an alignment file format featuring three different quality types per entry. This file format is still supported for reading, but is no longer written.
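
The pe flag arithmetic described for this format (a library ID added in steps of 6 to flags greater than 2) can be inverted as shown in the following sketch. It assumes that the flag before adding the library offset is always one of the values 3 to 8, which is what makes the decoding unambiguous; decode_pe_flag, encode_pe_flag and PeFlag are hypothetical names, not SHORE functions.

```python
from typing import NamedTuple, Optional

_MEANING = {
    0: "single read",
    1: "first read of a pair",
    2: "second read of a pair",
    3: "first read of a pair (concordant mapping)",
    4: "first read of a pair (discordant mapping)",
    5: "first read of a pair (orphan read)",
    6: "second read of a pair (concordant mapping)",
    7: "second read of a pair (discordant mapping)",
    8: "second read of a pair (orphan read)",
}


class PeFlag(NamedTuple):
    base_flag: int           # value between 0 and 8 as listed in the table
    library: Optional[int]   # None for flags 0..2, which carry no library ID
    meaning: str


def decode_pe_flag(flag: int) -> PeFlag:
    if flag < 0:
        raise ValueError("pe flag must not be negative")
    if flag <= 2:
        return PeFlag(flag, None, _MEANING[flag])
    # Flags 3..8 are shifted by 6 per library: flag = base + 6 * L
    library, offset = divmod(flag - 3, 6)
    base = offset + 3
    return PeFlag(base, library, _MEANING[base])


def encode_pe_flag(base_flag: int, library: int) -> int:
    if not 3 <= base_flag <= 8:
        raise ValueError("a library ID can only be encoded for flags 3..8")
    return base_flag + library * 6
```

Under these assumptions a stored flag of 14 decodes to base flag 8 (second read of a pair, orphan read) in library 1. Note that a library ID of 0 cannot be distinguished from a flag without any library information.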

SHORE peak result files

The main result file produced by shore peak is named SUMMARY.txt:

id An arbitrary numerical ID for the peak region
chr Sequence / chromosome ID
pos Left-most position of the peak region on the reference sequence
size Size of the peak region
p_rank1 Replicate 1 rank of the P-value of the peak
fdr_bh_q1 Replicate 1 Benjamini-Hochberg adjusted FDR of the peak
rc_chip1 Replicate 1 number of reads contributing to the peak in the sample
rc_ctrl1 Replicate 1 number of reads in the same region of the control
pbexcess1 Replicate 1 per-base-excess: mean_coverage_sample(peak) - (mean_coverage_control(peak) * normalization_constant)
fc_score1 Replicate 1 fold change score: 4 * atan(mean_coverage_sample(peak) / mean_coverage_control(peak) * normalization_constant) / PI - 1.0
height_excess1 Replicate 1 peak height excess:
frfc_score1 Replicate 1 forward-reverse fold change score: Calculated like fc_score, but compares the sample forward strand and reverse strand coverage
cog_xshift1 Replicate 1 forward-reverse peak shift
overlap_names Identifiers of the genes overlapping with the center of the peak region (only present when the option -a was specified)
overlap_types Parts of the genes that overlap (exon, 5' UTR etc.) (only present when the option -a was specified)
up_names Identifiers of the closest genes 'to the left' from the center of the peak (only present when the option -a was specified)
up_dist Distance of the closest genes 'to the left' from the center of the peak (only present when the option -a was specified)
up_strands Strands of the closest genes 'to the left' from the center of the peak (only present when the option -a was specified)
down_names Identifiers of the closest genes 'to the right' from the center of the peak (only present when the option -a was specified)
down_dist Distance of the closest genes 'to the right' from the center of the peak (only present when the option -a was specified)
down_strands Strands of the closest genes 'to the right' from the center of the peak (only present when the option -a was specified)
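
The pbexcess and fc_score formulas given for SUMMARY.txt can be written out as follows. This is an illustrative sketch only, not the code used by shore peak: the function and parameter names are hypothetical, and the fc_score expression is evaluated with ordinary operator precedence exactly as it is written above, i.e. (sample / control) * normalization_constant. If the intended grouping is sample / (control * normalization_constant), the ratio line has to be changed accordingly.

```python
import math


def pbexcess(mean_cov_sample: float, mean_cov_control: float, norm: float) -> float:
    """Per-base excess: sample coverage minus normalized control coverage."""
    return mean_cov_sample - (mean_cov_control * norm)


def fc_score(mean_cov_sample: float, mean_cov_control: float, norm: float) -> float:
    """Fold change score mapped into the range -1 to 1 by the arctangent.

    Assumption: the ratio is evaluated left to right as written in the
    formula. A control coverage of zero is not handled in this sketch.
    """
    ratio = mean_cov_sample / mean_cov_control * norm
    return 4.0 * math.atan(ratio) / math.pi - 1.0
```

Because atan(1) equals PI / 4, a ratio of 1 gives a score of 0; larger ratios approach 1 and a ratio of 0 gives -1.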