SHORE v0.7 file formats
Revision as of 17:26, 26 September 2011
SHORE usually writes its output to text files made up of tab-delimited columns.
Typing shore fmt will display a quick reference for many of SHORE’s file formats.
This page only describes SHORE's read and alignment file formats; other file formats are described on the page of the subprogram that generates them.
Read file format
Read files can be found in the LengthFolders or ReadFolders. These files are created by shore import and are named reads_0.fl.
reads_0.fl files are usually sorted on the id field in numerical order.
The tab-delimited entries are:
id | A unique identifier for the read or read pair |
sequence | DNA sequence |
pe | Flag: ’1’ for the first and ’2’ for the second read of a read pair; ’0’ for a single-end read. |
Sanger quality values | String of Sanger-calibrated quality values.
Encoding: ASCII 33 ('!', quality 0) to ASCII 73 ('I', quality 40); extended range ASCII 93 (']', quality 60) |
[Chastity values]
(Optional column) |
String of Illumina chastity values (defined as <math>Intensity(max) / (Intensity(max) + Intensity(second))</math>).
Encoding: ASCII 40 ('(', chastity of 0.5) to ASCII 90 ('Z', chastity of 1.0) |
Note: Earlier versions of SHORE used a read file format featuring three different quality types per entry. This file format is still supported for reading, but is no longer written.
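The column layout and the two ASCII encodings above can be sketched in Python as follows. This is only an illustration, not part of SHORE: the names `ReadRecord`, `parse_read_line`, `decode_sanger` and `decode_chastity` are hypothetical, the sample line is made up, and the chastity decoding assumes that the characters between '(' and 'Z' map linearly onto 0.5 to 1.0 in steps of 0.01, which the table suggests but does not state.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ReadRecord:
    read_id: str
    sequence: str
    pe: int                          # 0 = single-end, 1 = first read, 2 = second read
    qualities: List[int]             # decoded Sanger quality values
    chastity: Optional[List[float]]  # None when the optional column is absent


def decode_sanger(qual: str) -> List[int]:
    # ASCII 33 ('!') encodes quality 0, so the offset is 33.
    return [ord(c) - 33 for c in qual]


def decode_chastity(chas: str) -> List[float]:
    # Assumption: linear steps of 0.01 from '(' (ASCII 40, 0.5) to 'Z' (ASCII 90, 1.0).
    return [round(0.5 + (ord(c) - 40) / 100, 2) for c in chas]


def parse_read_line(line: str) -> ReadRecord:
    fields = line.rstrip("\n").split("\t")
    if len(fields) not in (4, 5):
        raise ValueError(f"expected 4 or 5 columns, got {len(fields)}")
    read_id, sequence, pe, qual = fields[:4]
    chastity = decode_chastity(fields[4]) if len(fields) == 5 else None
    return ReadRecord(read_id, sequence, int(pe), decode_sanger(qual), chastity)


# Made-up example line in the column order given above.
record = parse_read_line("17\tACGT\t1\t!I]5\t(ZAA\n")
print(record.qualities, record.chastity)
```

The read id is kept as a string here, because the page only says that files are usually sorted on it in numerical order.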
Alignment file format
SHORE alignment files are typically named map.list, map.list.1 or map.list.2. They are stored in the LengthFolders or ReadFolders.
map.list files are sorted in numerical order, either on the fields chr id and pos, or on the field read id.
The tab-delimited entries are:
chr id | Each chromosome has an internal id. The ids are assigned in the order in which the chromosomes occur within the reference sequence file, starting from 1. The translation to the native chromosome name can be found in the *.shore.ref and *.shore.trans files in the IndexFolder, or in the ref.txt files created by shore mapflowcell. |
pos | Left-most position of the alignment relative to the forward strand of the reference sequence. The first position of a chromosome is 1. |
alignment | String representation of the read alignment. The sequence is always reported with respect to the forward strand of the reference, i.e. the sequence of a read that matches the reverse strand is reverse-complemented.
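Extensions, not supported by all tools:
* Consecutive stretches of the same operation (mismatch, insertion, deletion) may be abbreviated, e.g. [CTT|---] instead of [C-][T-][T-]
* F can be used to indicate a mapped part of a fragment with known size but unknown sequence, e.g. [F100]
|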
read id | A unique identifier for the read or read pair |
strand | D for forward and P for reverse hits (direct and palindromic, respectively). |
mismatches | The number of mismatches plus gaps in the alignment. |
hits | The total number of genomic positions the read is aligned to. |
read length | Length of the read ('soft clipped' nucleotides excluded). |
reserved | Reserved field for future use, should be parsed as a string |
pe flag | Paired-end information, e.g. 8 for the second read of a pair (orphan read).
A library ID <math>L</math> may be encoded in the pe flag for flags with a value <math>>2</math>; in this case the stored value is calculated as <math>pe\_flag + (L \cdot 6)</math>. |
Sanger quality values | String of Sanger-calibrated quality values.
Encoding: ASCII 33 ('!', quality 0) to ASCII 73 ('I', quality 40); extended range ASCII 93 (']', quality 60) |
[Chastity values]
(Optional column) |
String of Illumina chastity values, defined as the highest intensity divided by the sum of the highest and the second-highest intensity of a single base.
Encoding: ASCII 40 ('(', chastity of 0.5) to ASCII 90 ('Z', chastity of 1.0) |
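As a rough illustration of the column order above, the following Python sketch splits one map.list line into named fields and separates a library ID from the pe flag. It is not SHORE code: `Alignment`, `parse_map_line` and `split_pe_flag` are hypothetical names and the sample line is invented. The pe flag decoding assumes that the flags able to carry a library ID are exactly the six values 3 to 8, which fits the multiplier of 6 and the rule that only flags greater than 2 are affected, but is not spelled out on this page.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Alignment:
    chr_id: int
    pos: int                 # 1-based, left-most position on the forward strand
    alignment: str
    read_id: str
    strand: str              # 'D' (forward) or 'P' (reverse)
    mismatches: int
    hits: int
    read_length: int
    reserved: str            # kept as a string, as the table recommends
    pe_flag: int
    quality: str
    chastity: Optional[str] = None  # optional twelfth column


def parse_map_line(line: str) -> Alignment:
    f = line.rstrip("\n").split("\t")
    if len(f) not in (11, 12):
        raise ValueError(f"expected 11 or 12 columns, got {len(f)}")
    if f[4] not in ("D", "P"):
        raise ValueError(f"unexpected strand value: {f[4]!r}")
    return Alignment(
        int(f[0]), int(f[1]), f[2], f[3], f[4],
        int(f[5]), int(f[6]), int(f[7]), f[8], int(f[9]),
        f[10], f[11] if len(f) == 12 else None,
    )


def split_pe_flag(flag: int) -> Tuple[int, int]:
    """Return (base_flag, library_id) for a stored pe flag."""
    if flag <= 2:
        # Flags up to 2 never carry a library ID.
        return flag, 0
    # Assumption: base flags 3..8, so stored = base + 6 * library_id.
    library_id, offset = divmod(flag - 3, 6)
    return offset + 3, library_id


# Invented example line: 11 columns, no chastity column.
aln = parse_map_line("1\t1042\tACGTACG\tread_7\tD\t0\t1\t7\t0\t14\tIIIIIII\n")
print(aln.chr_id, aln.pos, split_pe_flag(aln.pe_flag))
```

Under the stated assumption, a stored flag of 14 decodes to base flag 8 with library ID 1.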
Note: Earlier versions of SHORE used an alignment file format featuring three different quality types per entry. This file format is still supported for reading, but is no longer written.
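The alignment column has an extended notation in which consecutive stretches of the same operation are abbreviated, e.g. [CTT|---] instead of [C-][T-][T-]. A tool that does not support this extension could normalise the string first. The Python sketch below shows one possible way to do so. `expand_stretches` is a hypothetical helper, and it assumes that both sides of the `|` always have the same length and pair up position by position, as in that example.

```python
import re

# Matches an abbreviated stretch such as [CTT|---]; plain pairs like [C-]
# and fragment markers like [F100] contain no '|' and are left untouched.
_ABBREVIATED = re.compile(r"\[([^\[\]|]+)\|([^\[\]|]+)\]")


def expand_stretches(alignment: str) -> str:
    def repl(match: "re.Match[str]") -> str:
        left, right = match.group(1), match.group(2)
        if len(left) != len(right):
            raise ValueError(f"unbalanced stretch: {match.group(0)}")
        # Pair the i-th character of each side into its own bracket.
        return "".join(f"[{a}{b}]" for a, b in zip(left, right))

    return _ABBREVIATED.sub(repl, alignment)


print(expand_stretches("AC[CTT|---]GA"))
```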