Part II Index Data Structures
5. ? gram Index
6.1 Related work
In the last years a variety of tools was designed and developed speci ically for the purpose of mapping short reads. In Table 6.1 we compare the algorithmic key features of a subset of mappers, i.e. Bowtie [Langmead et al., 2009], Bowtie 2 [Langmead and Salzberg, 2012], BWA [Li and Durbin, 2009], Eland [Cox, 2006], Hobbes [Ahmadi et al., 2012], Maq [Li et al., 2008a], mrFAST [Alkan et al., 2009], SeqMap [Jiang and Wong, 2008], SHRiMP 2 [David
et al., 2011], Soap [Li et al., 2008b], Soap 2 [Li et al., 2009b], Zoom [Lin et al., 2008], and
RazerS [Weese et al., 2009, 2012]. For a thorough algorithmic comparison we refer the reader to [Li and Homer, 2010; Fonseca et al., 2012].
All existing read mapping approaches are based on a substring index. Depending on the type of index, they can be divided into two classes: 1) π-gram based or 2) pre ix-trie based read mappers.
π-gram based read mappers. The π-gram based approaches use a two-step strategy. First, a iltration algorithm reduces the search space by iltering regions that cannot con- tain a match. This includes building a π-gram index, either on the set of reads, the refer- ence sequence, or both to ef iciently ind common π-grams. The remaining candidate re-
π-gram based pre ix trie based
Eland Maq Soap SeqMap Zoom SHRiMP
2 RazerS Hobbes mrF A S T Bo wtie Bo wtie 2 B W A S oap 2 inde x index of reads β’ β’ - β’ β’ - β’ - β’ - - - - index of reference - - β’ - - β’ - β’ β’ β’ β’ β’ β’ ilt ering seed-based β’ β’ β’ β’ β’ - β’ β’ β’ β’ π-gram counting - - - β’ β’ - - - supports indels - - - β’ - β’ β’ β’ β’ β’
full sensitivity mode β’ β’ - β’ β’ - β’ β’ β’ -
controllable loss rate - - - β’ - - -
mapping
supports indels - - - β’ β’ β’ β’ β’ β’ - β’ β’ -
max. read length [bp] 32 127 60 - 240 - - 100 1000 - - - -
max. error 2 3 2 5 - - - 2
multiple (suboptimal) hits - - - β’ β’ β’ β’ β’ β’ β’ β’ β’ -
Table 6.1:Read mapping tools and their characteristics.
full sensitive up to 2 mismatches full sensitive up to 3 mismatches
depends on settings, no switch to guarantees full sensitivity full sensitive up to 5 errors
no help for parameter choice, default will be lossy for most settings limited to one gap
step. The iltration algorithms again can be divided into seed-based or π-gram counting ilters.
Most of the seed-based ilters are based on the pigeonhole principle. It states that, given two sequences within distance π, any partition of the irst sequence into π + 1 parts contains one part that can be found without errors in the other sequence [Navarro and Raf inot, 2002]. The shorter this seed, the more likely it is to encounter random matches, and therefore the lower the speci icity of the ilter. This strategy is thus rather limited to a small number of errors. To increase the seed length and the iltration speci icity, Eland was the irst to extend this strategy and divides one sequence into π + 2 parts. Now at least two of such parts will occur in the other sequence. These two parts retain their relative positions as long as no indels occur in between. Eland, Maq, and Soap make use of this observation but are therefore limited to Hamming distance. Furthermore, Eland and Soap always use a 4-segment partition and Maq at most a 5-segment partition and can therefore not guarantee full sensitivity for π > 2 or π > 3, respectively. SeqMap extends the two-seed pigeonhole strategy to edit distance and searches the two parts varying the gap length by βπ, β¦ , π nucleotides. Not based on the pigeonhole lemma but also seed- based ilters, where a common (gapped) π-gram triggers a veri ication, are BLAT [Kent, 2002], PatternHunter [Ma et al., 2002], PatternHunter II [Li et al., 2003], and Zoom.
common π-grams every π-error match must have (see Lemma 5.1 on page 90). To ind match candidates, SHRiMP 2 scans the reference and for every read of length π reports dot plot rectangles of size π Γ (π + π) with at least π‘ common π-grams as candidates. For SHRiMP 2, however, π and π‘ must be set manually and the default con iguration is not guaranteed to be lossless.
Veri ication methods encompass approximate matching algorithms for Hamming or edit distance [Sellers, 1980], local-alignment algorithms [Smith and Waterman, 1981], e.g. in SHRiMP 2, or alignment algorithms that minimize the sum of base-call qualities at mismatching bases, e.g. in Maq. In current implementations one has to carefully distin- guish whether both steps, the iltration and the veri ication step, are adequate for the distance chosen (Hamming or edit distance). Some implementations, e.g. Maq, verify matches using base-call qualities but ilter the candidate regions using a ixed Hamming or edit distance.
Some π-gram based read mappers primarily target sensitivity and support to output all (including suboptimal) matches of a read up to a certain distance. This set of so-called
all-mappers includes Hobbes, SeqMap, SHRiMP 2, Hobbes, mrFAST, Zoom, and RazerS.
Pre ix-trie based mappers. Another approach to read mapping utilizes the Burrows-
Wheeler transform [Burrows and Wheeler, 1994] which allows for the compression of a
text and its suf ix array πππΏππΊπ». Ferragina et al. [2004] complemented the functionality of this data structure so that one can count and locate all exact text occurrences of a pat- tern π without prior uncompressing either of the two tables. This self-index, also called
FM index, uses a backward search to determine the interval πππΏππΊπ»[π..π) of occurrences in
πͺ(|π|)time. Lam et al. [2008] extended this idea to ind all approximate pattern occur- rences. For each read, their approach recursively descends a pre ix trie, i.e. the suf ix trie of the reversed text, analogously to Algorithm 4.9 on page 79.
This idea of backward backtracking combined with heuristics to prune the search space is used by Bowtie, Bowtie 2, BWA, and Soap 2. Whereas most of the approaches search whole reads and recursively enumerate mismatches (Bowtie, Soap 2) or also in- dels (BWA), Bowtie 2 uses the FM index to search exact seeds and veri ies candidates with an approximate matching algorithm, as described for seed-based ilters above. In general, the backtracking approach is not limited to pre ix-tries and was successfully adapted to suf ix trees in [Hoffmann et al., 2009].
Recursive approaches are usually designed for the fast search of one or a few loca- tions where reads map with low error rates. These search algorithms are mostly based on heuristics and optimized for speed instead of sensitivity. As they aim at directly ind- ing the βbestβ location for mapping a read, they are called best-mappers. If a read has multiple mapping locations, the best location is randomly chosen either error-based or quality-based, where the irst strategy minimizes the absolute number of errors and the second prefers alignments where mismatches correspond to low base-call qualities. We will show that both strategies have limitations in the presence of repeats or SNPs and that best-mappers are not applicable to reliably detect sequence variations (Section 6.10.5).
reference genome reads ο¬ltra on veriο¬ca on duplicate removal extracted matches
Figure 6.1:Match extraction in RazerS. The reference is scanned by a π-gram based ilter
which outputs candidate parallelograms that possibly contain a match. To ver- ify whether the parallelograms contain true matches, they are searched with an approximate string matching algorithm. At last, duplicate matches, a result of overlapping parallelograms, are removed.