

8. Conclusion and Future Work

In this thesis, we presented data structures and algorithms with applications in the analysis of high-throughput sequencing data. We developed a uniform framework for constructing and accessing different substring indices of a single string or multiple strings in main or external memory and showed its applicability to indexing multiple whole mammalian genomes. Moreover, we provided algorithms for typical index-based applications, e.g. exact and approximate pattern matching and repeat search, and in the last chapters introduced high-throughput sequencing applications based on two of the proposed indices. To make our framework and tools freely accessible to the research and user community, we implemented them as part of SeqAn [Döring et al., 2008], a platform-independent generic C++ template library for sequence analysis. Owing to its modularity, our framework could easily be integrated into other alignment tools [Rausch et al., 2008; Langmead et al., 2009; Emde et al., 2010; Kehr et al., 2011; Emde et al., 2012; Siragusa et al., 2013a].

In Chapter 6, we presented RazerS, an efficient read mapping tool that is guaranteed to find all reads within a user-defined Hamming or edit distance. In addition, a fixed error model and a user-defined loss rate can be used to find the reads at higher speed with controlled sensitivity. RazerS thus offers a user-controlled tradeoff between sensitivity and running time. Our tool can also handle paired-end reads as well as an arbitrary number of errors and arbitrary read lengths, which makes it usable for new or improved technologies that will provide longer reads. The latter two features are unique among the current implementations. To provide a shared-memory parallelization, we used OpenMP and dynamic load balancing. Compared to other state-of-the-art read mappers, RazerS shows the highest sensitivity at comparable performance. It is the preferable tool for applications that require high sensitivity even in the presence of repeats, e.g. variation detection pipelines. The novel algorithmic ideas used in RazerS, e.g. lossy filtering with sensitivity control or a banded adaptation of Myers' algorithm for efficient bit-parallel verification, can also be applied to improve existing read mappers that follow a similar filtration-verification approach. Many algorithmic components of RazerS were integrated into SeqAn, and the whole algorithm was the basis of similar tools for local read alignment [Hauswedell, 2009], the alignment of miRNA [Emde et al., 2010], and the alignment of split reads [Emde et al., 2012].
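To illustrate the bit-parallel verification idea mentioned above, the following is a minimal sketch of the classic (unbanded) Myers bit-vector algorithm for approximate string matching under edit distance. It is not the banded adaptation used in RazerS; the function name `myers_search` and its interface are assumptions made for this example only.

```python
def myers_search(pattern, text, k):
    """Return (end_position, errors) for every text position at which the
    pattern matches a substring ending there with at most k edit errors.

    Illustrative, unbanded Myers bit-vector algorithm; Python integers stand
    in for machine words, so results are masked to m bits explicitly.
    """
    m = len(pattern)
    if m == 0:
        raise ValueError("pattern must be non-empty")
    mask = (1 << m) - 1          # keeps only the m pattern bits
    top = 1 << (m - 1)           # bit of the last pattern row

    # peq[c] has bit i set iff pattern[i] == c
    peq = {}
    for i, c in enumerate(pattern):
        peq[c] = peq.get(c, 0) | (1 << i)

    pv, mv = mask, 0             # vertical deltas: +1 everywhere in column 0
    score = m                    # value of the last DP row in column 0
    hits = []
    for j, c in enumerate(text):
        eq = peq.get(c, 0)
        xv = eq | mv
        xh = ((((eq & pv) + pv) ^ pv) | eq) & mask
        ph = (mv | ~(xh | pv)) & mask    # horizontal delta +1
        mh = pv & xh                     # horizontal delta -1
        if ph & top:
            score += 1
        elif mh & top:
            score -= 1
        # Shift without setting bit 0: the top DP row is all zeros, so a
        # match may start at any text position (semi-global alignment).
        ph = (ph << 1) & mask
        mh = (mh << 1) & mask
        pv = (mh | ~(xv | ph)) & mask
        mv = ph & xv
        if score <= k:
            hits.append((j, score))
    return hits
```

Each text character is processed with a constant number of word operations as long as the pattern fits into one machine word, which is what makes this kind of verification fast for short reads.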

In Chapter 7, we presented a new approach to constraint-based string mining that outperforms the best-known algorithms by Fischer et al. [2006, 2008] and Kügel and Ohlebusch [2008] in running time, as our experiments show. The better running time can be attributed to various factors. Most importantly, the optimal monotonic hull of a frequency predicate is incorporated to prune the search space to a minimum, resulting in the deferred frequency index (DFI). Moreover, the frequency information is extracted as a constant-time byproduct during the suffix tree construction. Our algorithm inherits the good cache locality of the lazy suffix tree if expanded in a depth-first search fashion [Giegerich et al., 2003]. We used the notion of entropy from information theory and introduced a symmetric, discriminatory predicate that generalizes the emerging substring mining problem to more than two databases. In an experiment with the proteomes of four species, we showed that it can be used to mine parts of protein domains that belong to species-specific protein families. Generally, the DFI is the preferable algorithm for frequency-based string mining. For huge datasets that the DFI cannot process in main memory, space-efficient variants of the FHK algorithm [Fischer et al., 2006] should be considered. For conjunctive predicates, the KO algorithm [Kügel and Ohlebusch, 2008] is the next best alternative. For non-conjunctive predicates, the FMV algorithm [Fischer et al., 2008] can reduce the memory consumption at the price of a large increase in running time.
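As a rough illustration of how an entropy-based predicate can discriminate between several databases, the sketch below computes the Shannon entropy of a substring's normalized supports. This is only one plausible formulation and is not claimed to be the predicate defined in Chapter 7; the function names, the normalization, and the naive substring counting are all assumptions made for this example.

```python
import math


def support(substring, database):
    """Fraction of strings in the database that contain the substring."""
    return sum(substring in s for s in database) / len(database)


def support_entropy(substring, databases):
    """Shannon entropy (in bits) of the substring's supports, normalized
    to a probability distribution over the databases.

    Returns None if the substring occurs in no database. An entropy of 0
    means the substring is confined to a single database; the maximum,
    log2(len(databases)), means equal support everywhere.
    """
    supports = [support(substring, db) for db in databases]
    total = sum(supports)
    if total == 0:
        return None
    probs = [s / total for s in supports]
    return -sum(p * math.log2(p) for p in probs if p > 0)


def is_discriminatory(substring, databases, max_entropy, min_support):
    """Hypothetical predicate: frequent in some database, low entropy overall."""
    entropy = support_entropy(substring, databases)
    if entropy is None:
        return False
    best = max(support(substring, db) for db in databases)
    return best >= min_support and entropy <= max_entropy
```

The measure is symmetric in the databases, since permuting them does not change the entropy, which is the property that lets such a predicate extend beyond the two-database setting.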

Future Work. The work presented in this thesis can be complemented in several aspects of future research. First, different compressed indices could be provided to enable larger texts to be processed in main memory, with a focus on generic approaches that are efficient in practice. In [Grossi et al., 2003; Sadakane, 2003; Navarro and Mäkinen, 2007], the authors devise compressed indices that are based on succinct representations of the suffix array or the lcp table. In conjunction with a data structure for constant-time range-minimum queries as proposed in [Fischer and Heun, 2006], a compressed variant of the enhanced suffix array could be integrated into our framework as proposed in [Fischer et al., 2008] and extended to multiple strings. Another memory improvement for small alphabets dispenses with the lcp and child tables altogether and instead uses a binary search to determine the children of a suffix tree node [Navarro and Baeza-Yates, 2000]. We implemented a prototype of the FM index [Ferragina et al., 2004], an index that has proved its applicability to high-throughput sequencing in different read mapping applications [Li and Durbin, 2009; Langmead et al., 2009; Langmead and Salzberg, 2012] and that allows traversing the prefix trie of a text. Currently, we are integrating it into our framework and providing prefix trie iterators to ease the development of FM-index-based algorithms. Another interesting direction is dynamic indexing [Salson et al., 2009, 2010], i.e. updating an index according to text changes. This approach not only saves the time required for constructing an index from scratch, it could also be used to determine and efficiently represent the changes that a set of similar texts would induce on a reference index. We are developing a data structure that, instead of applying these changes directly, allows access to the (virtual) index of each text.
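The prefix trie traversal offered by an FM index rests on backward search, which extends a pattern one character to the left per step. The following is a minimal, uncompressed sketch of that mechanism; it is not the prototype mentioned above, and the function names, the naive suffix sorting, and the full occurrence table are simplifying assumptions for illustration.

```python
def build_fm_index(text):
    """Build a toy FM index: suffix array, C table, and occurrence table.

    Assumes '$' does not occur in the text and sorts before every text
    character. Construction is naive and only suitable for small inputs.
    """
    text += "$"
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)  # i == 0 wraps around to '$'

    counts = {}
    for ch in text:
        counts[ch] = counts.get(ch, 0) + 1

    # c_table[ch] = number of characters in the text smaller than ch
    c_table, total = {}, 0
    for ch in sorted(counts):
        c_table[ch] = total
        total += counts[ch]

    # occ[ch][i] = number of occurrences of ch in bwt[:i]
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, ch in enumerate(bwt):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (c == ch)
    return sa, c_table, occ


def backward_search(pattern, c_table, occ, n):
    """Return the half-open suffix array interval [lo, hi) of the pattern."""
    lo, hi = 0, n
    for ch in reversed(pattern):      # extend the match to the left
        if ch not in c_table:
            return 0, 0
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return 0, 0
    return lo, hi


def find_occurrences(pattern, text):
    """Sorted start positions of all exact occurrences of the pattern."""
    sa, c_table, occ = build_fm_index(text)
    lo, hi = backward_search(pattern, c_table, occ, len(sa))
    return sorted(sa[i] for i in range(lo, hi))
```

Each intermediate interval corresponds to one node of the prefix trie, so an iterator over that trie only needs to store the current interval and apply one backward step per edge.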

Our read mapping approach is, with slight modifications, also applicable to the dinucleotide-based ABI/SOLiD sequencing technology. To this end, the reference sequence must be converted into color space, i.e. into a sequence over 4 dinucleotide colors instead of the 4 DNA bases, and the semi-global alignment of color-space reads could be adapted as proposed in [Rumble et al., 2009]. Additionally, base-call qualities could be used not only for sensitivity control but also to optionally rank the read alignments by their plausibility instead of the number of errors [Li et al., 2008a]. We also plan to use SIMD extensions [Intel, 2011] and hardware accelerators, e.g. GPUs and FPGAs, to massively parallelize the verification of candidate regions.
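The color-space conversion described above can be sketched as follows. The example uses the standard SOLiD dinucleotide encoding, in which the color of two adjacent bases equals the XOR of their 2-bit codes (A=0, C=1, G=2, T=3); the function names are assumptions for this illustration, and ambiguous bases such as N are not handled.

```python
_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
_BASES = "ACGT"


def to_color_space(seq):
    """Convert a DNA sequence of length n into its n - 1 dinucleotide colors."""
    codes = [_CODE[base] for base in seq.upper()]
    # Each color encodes the transition between two adjacent bases.
    return [a ^ b for a, b in zip(codes, codes[1:])]


def from_color_space(first_base, colors):
    """Decode a color sequence back to DNA, given the first base."""
    code = _CODE[first_base.upper()]
    bases = [_BASES[code]]
    for color in colors:
        code ^= color          # XOR is its own inverse
        bases.append(_BASES[code])
    return "".join(bases)
```

In this encoding a single base substitution changes two adjacent colors, whereas a single color miscall changes only one, which is why color-space alignment needs an adapted error model.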

Depending on the problem at hand, the implementation of our algorithm for frequency-based string mining could be improved. If the DFI is only used to output the result of Th(pred), the memory consumption of the algorithm could be further reduced. As each node is visited at most once, at any time only the nodes of the suffix tree on the path from the root to the current node need to be stored. A small alphabet (e.g. DNA) leads to a dense suffix tree with many branching nodes at the top, as observed by Kurtz [1999]. In that case, an improvement in running time could be expected from replacing the top of the suffix tree with a q-gram index and traversing multiple q-gram buckets in parallel. In this way, the memory consumption could be improved by keeping only the traversed subtree in memory. Considering additional constraints during the mining process will play an important role in further algorithmic development, e.g. reducing the solution space of any mining approach to a succinct but representative set is one of the open challenges, as mentioned by Han et al. [2007]. For example, Kobyliński and Walczak [2009] aggregate all minimal jumping emerging substrings to train discriminative image classifiers. The top-down construction of the DFI could be limited to right-minimal jumping emerging substrings. Checking for left minimality would require either the use of suffix links [Ukkonen, 1995] or an additional post-processing step. Another avenue is to combine the framework of frequency-based string mining with probabilistic automata that can be used to classify sequences, e.g. to build discriminative models as presented by Slonim et al. [2003]. Owing to the efficiency of the presented approach, it is now possible to construct probabilistic automata for a set of databases in expected linear time as an extension of our previous work [Schulz et al., 2008b].
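The idea of replacing the top of the suffix tree with a q-gram index can be sketched as follows: suffixes are grouped into buckets by their first q characters, and each bucket then corresponds to one subtree at string depth q that can be processed independently. This is a toy illustration with hypothetical function names, not the thesis implementation; suffixes shorter than q are simply omitted.

```python
from collections import defaultdict


def build_qgram_buckets(text, q):
    """Map every q-gram to the start positions of the suffixes it prefixes."""
    buckets = defaultdict(list)
    for i in range(len(text) - q + 1):
        buckets[text[i:i + q]].append(i)
    return dict(buckets)


def suffixes_by_bucket(text, q):
    """Yield (q-gram, sorted suffix positions) bucket by bucket.

    Each bucket is sorted on its own, so only one bucket has to be held
    in memory at a time, and separate buckets could be handled by
    separate workers.
    """
    buckets = build_qgram_buckets(text, q)
    for qgram in sorted(buckets):
        yield qgram, sorted(buckets[qgram], key=lambda i: text[i:])
```

Concatenating the buckets in lexicographic q-gram order yields the suffix array restricted to suffixes of length at least q, which shows that no information about the deeper subtrees is lost by the partitioning.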