Shore correct4pe

From SHORE wiki
Revision as of 09:19, 11 April 2012 by Felo80 (Talk | contribs)

shore correct4pe finds the most likely mapping of repetitive reads by using paired-end information. During paired-read mapping each read is aligned separately; read pair information can then be used to choose among alternative alignments by selecting the pairing whose distance between the two partners is most likely.

shore correct4pe starts by estimating the insert size distribution. The upper bound of this distribution is usually very sharp, since clones longer than expected are rare, whereas the lower bound is more blurred and very small clones can be observed as well.
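As an illustration of this estimation step, the following minimal sketch builds an empirical insert size distribution from a list of observed pair distances. It assumes that the distances come from pairs in which both reads map uniquely; how shore correct4pe actually collects and smooths its estimate is not described here, and the function name and sample values are hypothetical.

```python
from collections import Counter

def estimate_insert_distribution(distances):
    """Return {distance: relative frequency} for the observed distances.

    Assumption: `distances` holds one positive integer per read pair whose
    two reads both map uniquely. This is an illustrative estimator, not
    the one implemented in SHORE.
    """
    counts = Counter(d for d in distances if d > 0)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {d: c / total for d, c in counts.items()}

# Hypothetical observations: most clones near 300 bp, one very short clone.
observed = [280, 290, 300, 300, 300, 310, 150]
dist = estimate_insert_distribution(observed)
```

A distance that was never observed, such as a clone far longer than expected, is absent from the dictionary and can be treated as having probability zero.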

The insert size distribution is then translated into a probability distribution over pairing distances, where a pairing is the combination of one mapping of read 1 with one mapping of read 2. All possible combinations of the mappings of both reads of a pair are compared, and all pairings with a probability of zero are dismissed. Mappings that do not take part in any pairing with a probability above zero are deleted, which removes the spurious mappings caused by repeats. If one mapping of a read forms pairings with two different mappings of the other read, the more likely pairing is kept.
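The filtering described above can be sketched roughly as follows. This is a simplified illustration, not the SHORE implementation: mappings are reduced to (chromosome, position) tuples, strand orientation is ignored, pairings across chromosomes are assumed to have probability zero, and conflicts are resolved greedily by taking the most likely pairing first. The function name and the toy probabilities are hypothetical.

```python
from itertools import product

def filter_pairings(maps1, maps2, prob):
    """Keep only mappings that take part in a pairing with probability > 0.

    maps1, maps2: lists of (chromosome, position) tuples for read 1 and read 2.
    prob: function mapping a distance (int) to a probability.
    Returns (kept1, kept2). If no pairing is supported, both lists are
    returned unchanged.
    """
    scored = []
    for m1, m2 in product(maps1, maps2):
        if m1[0] != m2[0]:
            continue  # assumption: pairings across chromosomes have probability 0
        p = prob(abs(m2[1] - m1[1]))
        if p > 0:
            scored.append((p, m1, m2))
    if not scored:
        return list(maps1), list(maps2)

    # Assumption: conflicts are resolved greedily, most likely pairing first,
    # so each mapping ends up in at most one pairing.
    scored.sort(key=lambda t: t[0], reverse=True)
    kept1, kept2 = [], []
    for _, m1, m2 in scored:
        if m1 in kept1 or m2 in kept2:
            continue
        kept1.append(m1)
        kept2.append(m2)
    return kept1, kept2

# Toy distance probabilities (hypothetical values).
toy = {290: 0.2, 300: 0.6, 310: 0.2}
prob = lambda d: toy.get(d, 0.0)

# Read 1 hits a repeat (two mappings); read 2 maps uniquely.
kept1, kept2 = filter_pairings(
    [("chr1", 1000), ("chr1", 50000)], [("chr1", 1300)], prob
)
```

In this toy case only the read 1 mapping at position 1000 survives, because the pairing with the mapping at position 50000 has a distance with probability zero.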

If all pairings have zero probability, all mappings of both reads are kept. These are the discordant (unhappy) read pairs, which are typically used to predict structural variants. Reads with an unmapped partner are marked as orphans (cf. map.list format).
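The three outcomes for a read pair can be summarised in a small sketch. The label strings are hypothetical and do not reflect how the map.list format encodes these states; the "unmapped" case for pairs with no mappings at all is an added assumption, and mappings are again simplified to (chromosome, position) tuples.

```python
def classify_pair(maps1, maps2, prob):
    """Classify a read pair from the mappings of its two reads.

    maps1, maps2: lists of (chromosome, position) tuples.
    prob: function mapping a distance (int) to a probability.
    The returned labels are illustrative only.
    """
    if not maps1 and not maps2:
        return "unmapped"  # assumption: neither read has a mapping
    if not maps1 or not maps2:
        return "orphan"  # one partner is unmapped
    for chrom1, pos1 in maps1:
        for chrom2, pos2 in maps2:
            if chrom1 == chrom2 and prob(abs(pos2 - pos1)) > 0:
                return "concordant"  # at least one supported pairing
    return "discordant"  # every pairing has probability zero

# Toy distance probabilities (hypothetical values).
toy = {290: 0.2, 300: 0.6, 310: 0.2}
prob = lambda d: toy.get(d, 0.0)

label = classify_pair([("chr1", 1000)], [("chr1", 90000)], prob)
```

Here the only possible pairing spans 89000 bp, which has probability zero under the toy distribution, so the pair is reported as discordant and both mappings would be kept.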

shore correct4pe will plot the insert size distribution using R if the -p option is specified. In this case R has to be installed and included in the PATH environment variable.

Command line options

Usage: shore correct4pe [OPTIONS]

Mandatory
-l STRING[,...] Lane or sample directories (comma-separated)
-x INT Expected insert size; must be larger than 0
-e INT Library identifier; defines the name space of the read identifiers (>=1)
Optional
-r INT Maximum number of hits per read pair (default: 10000)
-s SOLiD reads
-m Mate-pair library instead of paired-end library
-i STRING Insert size distribution file (e.g. when re-running correct4pe)
-d Delete uncorrected map.list files
-p Plot the insert size distribution (requires R)
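
As a usage illustration, the following sketch assembles a correct4pe command line from the options listed above and checks the documented constraints on -x and -e before the tool is called. The helper function and the lane directory names are hypothetical; only the flags themselves are taken from the option list.

```python
def build_correct4pe_command(lanes, insert_size, library_id,
                             max_hits=None, plot=False):
    """Return the argument list for a shore correct4pe call.

    lanes: list of lane or sample directories (hypothetical paths).
    insert_size: expected insert size (-x), must be larger than 0.
    library_id: library identifier (-e), must be at least 1.
    """
    if not lanes:
        raise ValueError("at least one lane or sample directory is required")
    if insert_size <= 0:
        raise ValueError("expected insert size must be larger than 0")
    if library_id < 1:
        raise ValueError("library identifier must be >= 1")
    cmd = ["shore", "correct4pe",
           "-l", ",".join(lanes),
           "-x", str(insert_size),
           "-e", str(library_id)]
    if max_hits is not None:
        cmd += ["-r", str(max_hits)]  # maximum number of hits per read pair
    if plot:
        cmd.append("-p")  # plotting requires R in the PATH
    return cmd

# Hypothetical lane directories for a 300 bp paired-end library.
cmd = build_correct4pe_command(["run_1/1", "run_1/2"], 300, 1, plot=True)
print(" ".join(cmd))
```

The resulting list can be passed to subprocess.run to start the tool.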