
CHAPTER 3. Repeat-aware Sequencing Error Detection and Correction

3.5 Discussion

We have presented a method for error detection that accommodates genomic repeats by considering the multiple potential sources of kmer counts, including both faithfully read matching strings and incorrectly read nonmatching strings in the genome. The method produces estimates of the expected number of attempts T_l to read kmer x_l, with or without errors. Kmers with small T_l identify strings x_l with low coverage in the dataset. Under the assumption of uniform coverage, kmers with very low T_l are likely to be absent from the genome. The same logic is used to justify discarding kmers with small Y_l, but it is rarely recognized that Y_l is not directly proportional to genome occurrence α_l except in the absence of errors. Relying on the overlapping erroneous kmers, we correct errors in the reads.

Choosing a threshold is a challenge for all error detection methods. In practice, the threshold can be trained on comparable data where errors are directly observable, perhaps by comparing with a known reference genome. Alternatively, analytical calculations can justify a threshold that minimizes some measure of incorrect predictions [Schröder et al., 2009]. A very simple calculation is available with some knowledge of genome length |G|, because then it is possible to estimate the FP rate, P(Y_l < M | α_l > 0), assuming uniform coverage. Our datasets have expected k-length string coverage N(L−k+1)/(|G|−L+1) around 60X, which implies expected FP rates ranging from 1 × 10⁻¹⁴ at threshold M = 10 to 1 × 10⁻³ at threshold M = 35. Our own experiments, however, find that analytically calculated thresholds may actually introduce more errors than were originally present in the dataset. Given these challenges, it is a positive feature of the proposed method that, in almost all cases, a broader range of thresholds achieved nearly the same low number of wrong predictions (see Fig. 3.2). Under these conditions, success of the method becomes less dependent on the choice of threshold. In addition, the leftward shift of our wrong prediction curves tends to favor our method if thresholds are chosen conservatively.
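The simple calculation described above can be sketched as follows. This is an illustration only: it assumes that the count Y_l of a kmer occurring once in the genome is Poisson distributed with mean equal to the expected coverage, which is our assumption and is not stated in the text. The read count, read length, k, and genome length below are hypothetical values chosen to give roughly 60X coverage, so the resulting rates need not reproduce the figures quoted above.

```python
import math

def kmer_coverage(n_reads, read_len, k, genome_len):
    """Expected k-length string coverage N(L-k+1)/(|G|-L+1)."""
    return n_reads * (read_len - k + 1) / (genome_len - read_len + 1)

def fp_rate(threshold, coverage):
    """P(Y < threshold) for Y ~ Poisson(coverage).

    Assumption: a kmer with one genome occurrence has Poisson counts
    with mean `coverage` (uniform coverage, errors ignored).
    Terms are evaluated in log space to avoid overflow in the factorial.
    """
    total = 0.0
    for y in range(threshold):
        log_term = -coverage + y * math.log(coverage) - math.lgamma(y + 1)
        total += math.exp(log_term)
    return total

# Hypothetical dataset parameters (not taken from the text).
coverage = kmer_coverage(n_reads=11_500_000, read_len=36, k=13,
                         genome_len=4_600_000)
fp_low = fp_rate(10, coverage)   # conservative threshold M = 10
fp_high = fp_rate(35, coverage)  # aggressive threshold M = 35
print(f"coverage = {coverage:.2f}X")
print(f"FP rate at M = 10: {fp_low:.2e}")
print(f"FP rate at M = 35: {fp_high:.2e}")
```

Raising M can only increase the estimated FP rate under this model, which is why conservative (small) thresholds keep FP low.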

Assemblers that incorporate error kmer detection tend to be conservative, reducing gaps in the assembled genome by holding FP low with small thresholds M [Butler et al., 2008, Jackson et al., 2010, Zerbino and Birney, 2008]. Although our method achieved fewer wrong predictions overall in this range, we note that our FP error rates are higher than those achieved with Y thresholding, barely so in the absence of repetition and with a correct error distribution, but notably so otherwise. Lower FN at the cost of higher FP reflects our ability to reduce errors by detecting kmers that do not exist in the genome, but have neighbors that are highly repetitive in the genome. Thus, identifying errors with our method reduces the complexity of assembly by removing FN that cause tangles [Pevzner et al., 2001], but increases FP that cause contig breaks. The resulting trade-off may not be appropriate for all applications. FP kmers tend to be kmers that occur once in the genome in a region with low coverage. Our method could be improved by incorporating variable coverage effects or scanning for kmers possibly discarded due to low coverage.
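The FP/FN trade-off discussed above can be made concrete with a small counting sketch. Following the usage in this section, a FP is a kmer that exists in the genome (α_l > 0) but falls below the threshold, and a FN is a kmer that does not exist in the genome but is retained. The function name and the toy scores and genome occurrences are hypothetical and serve only as an illustration.

```python
def count_wrong_predictions(scores, alpha, threshold):
    """Count FP and FN when kmers scoring below `threshold` are called errors.

    scores: dict kmer -> score (Y_l or T_l)
    alpha:  dict kmer -> genome occurrence; missing kmers have alpha = 0
    FP: kmer exists in the genome but is called an error.
    FN: kmer is absent from the genome but is retained as valid.
    """
    fp = fn = 0
    for kmer, score in scores.items():
        in_genome = alpha.get(kmer, 0) > 0
        called_error = score < threshold
        if called_error and in_genome:
            fp += 1
        elif not called_error and not in_genome:
            fn += 1
    return fp, fn

# Toy data (made up): AAT is a genuine kmer from a low-coverage region,
# AAG is an error kmer inflated by a repetitive neighbor.
toy_scores = {"AAA": 40, "AAC": 3, "AAG": 12, "AAT": 8}
toy_alpha = {"AAA": 2, "AAT": 1}

for m in (5, 10, 13):
    fp, fn = count_wrong_predictions(toy_scores, toy_alpha, m)
    print(f"M = {m}: FP = {fp}, FN = {fn}, wrong = {fp + fn}")
```

In this toy case no single threshold removes both kinds of wrong prediction, which mirrors the trade-off between tangles and contig breaks described above.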

Our estimated error distributions, tIEP, wIEP, tUEP, and wUEP, do not match the error distribution in the simulated reads and especially do not match the errors in the true reads of the E. coli dataset SRX000429. All error models are approximations, and the error distribution we can estimate is a combination of read errors, errors introduced by the mapping software, and errors in the assembly. Either as a consequence of this mismatch of model and reality, the longer genome, or both, Fig. 3.2 shows that prediction errors were substantially higher in the E. coli dataset for all methods. The error distributions of datasets from E. coli and A. sp. ADP1 differed notably (Table 3.2), which may imply that error rates vary substantially across organisms and sample preparations. For all these reasons, error estimation will continue to be a challenge in the future. Ideally, errors should be estimated along with T_l, but a less parameter-rich model is required, perhaps along the lines of the mixture models used to estimate genome length [Li and Waterman, 2003].

Choosing kmer length k and maximum Hamming distance d_max are additional challenges with few easy answers. We chose k such that the average non-repetitive kmer has one genome occurrence. This choice relies on having some idea of genome repetitiveness. In general, choosing a large value of k results in low read occurrences Y_l and little information for inference of T_l. Yet choosing k too small populates kmer neighborhoods with counts from kmers occurring in largely incompatible contexts. In both cases, the noise of too little data or the noise of irrelevant data interferes with the useful signal needed to detect possible error reads. Choice of k is connected to choice of d_max. All results reported here are for d_max = 1; results for d_max = 2 changed little. However, as k increases, d_max = 1 may not adequately capture the true neighbor context of kmers.
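To illustrate how the neighborhood of a kmer grows with d_max, the following sketch enumerates all strings within a given Hamming distance of a kmer over the DNA alphabet. The function name is hypothetical and the enumeration is a generic one, not necessarily how REDEEM builds neighborhoods.

```python
from itertools import combinations, product

ALPHABET = "ACGT"

def hamming_neighbors(kmer, d_max=1):
    """Return all strings at Hamming distance 1..d_max from `kmer`."""
    neighbors = set()
    for d in range(1, d_max + 1):
        # Choose which d positions are substituted.
        for positions in combinations(range(len(kmer)), d):
            # At each chosen position, try the three other bases.
            choices = [[b for b in ALPHABET if b != kmer[p]]
                       for p in positions]
            for substitution in product(*choices):
                chars = list(kmer)
                for p, base in zip(positions, substitution):
                    chars[p] = base
                neighbors.add("".join(chars))
    return neighbors

near = hamming_neighbors("ACGT", d_max=1)
wider = hamming_neighbors("ACGT", d_max=2)
print(len(near), len(wider))
```

A kmer of length k has 3k neighbors at distance 1 and 9·k(k−1)/2 more at distance 2, so the neighborhood grows quickly with both k and d_max.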

Our estimation method uses the EM algorithm to compute the expected number of kmer misreads Y_ml, which makes it possible not only to detect error kmers, as we have done, but also to identify errors in valid kmers. Specifically, the computed Y_ml could be used to identify valid kmers that are over- or under-counted. For example, if Y_l is high because error count Y_ml is large due to some repetitive x_m, then the prevalence of kmer x_l in the genome is overestimated by Y_l. Indeed, T_l can be used to estimate genome length and repetition [Li and Waterman, 2003].
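The idea of apportioning an observed count Y_l among its possible sources x_m can be sketched with a toy EM iteration. This is a minimal illustration under strong assumptions of our own, not the REDEEM implementation: errors are independent with a uniform per-base rate, only observed kmers are considered as sources, and counts are treated as Poisson. All function names and the example counts are hypothetical.

```python
def hamming(a, b):
    """Number of mismatching positions between equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def misread_prob(source, target, error_rate):
    """P(read target | attempted source), assuming a uniform,
    independent per-base substitution error rate (an assumption)."""
    d = hamming(source, target)
    k = len(source)
    return (1 - error_rate) ** (k - d) * (error_rate / 3) ** d

def em_attempts(counts, error_rate=0.05, n_iter=200):
    """Toy EM: estimate attempts T_m from observed counts Y_l.

    E-step: Y_ml = Y_l * T_m p(m->l) / sum_m' T_m' p(m'->l)
    M-step: T_m  = sum_l Y_ml
    """
    kmers = list(counts)
    p = {(m, l): misread_prob(m, l, error_rate)
         for m in kmers for l in kmers}
    T = {m: float(counts[m]) for m in kmers}  # start from observed counts
    Y = {}
    for _ in range(n_iter):
        for l in kmers:
            denom = sum(T[m] * p[m, l] for m in kmers)
            for m in kmers:
                Y[m, l] = counts[l] * T[m] * p[m, l] / denom
        T = {m: sum(Y[m, l] for l in kmers) for m in kmers}
    return T, Y

# Made-up counts: ACGTT is a rare neighbor of the abundant ACGTA.
toy_counts = {"ACGTA": 100, "ACGTT": 2, "GGCCA": 50}
T_est, Y_est = em_attempts(toy_counts)
for kmer in toy_counts:
    print(kmer, toy_counts[kmer], round(T_est[kmer], 2))
```

In this toy example most of the two observations of ACGTT are attributed to misreads of its abundant neighbor, so its estimated T falls below its raw count, while the estimate for ACGTA rises above its raw count.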

There have been some attempts to formally characterize repeats in genomes [Haubold and Wiehe, 2006], but generally, the term “repeat” is used loosely in the literature, with meaning varying by context. In this chapter, we consider kmer x_l a repeat when its genomic occurrence α_l is higher than what is expected in a random genomic sequence. Because genomes are not random, all genomes display some degree of repetition. Perhaps such cryptic repetition explains why we can achieve lower false prediction rates at optimal thresholds even on non-repetitive genomes, like E. coli. In fact, E. coli is somewhat non-repetitive according to the I_r measure of Haubold and Wiehe [2006].

3.6 Our Contributions

We develop a statistical model and a computational method for error detection and correction in the presence of genomic repeats. We propose a method to infer genomic frequencies of kmers from their observed frequencies by analyzing the misread relationships among observed kmers. We also propose a method to estimate the threshold useful for validating kmers whose estimated genomic frequency exceeds the threshold. We demonstrate that superior error detection is achieved using these methods. Furthermore, we break away from the common assumption of uniformly distributed errors within a read, and provide a framework to model position-dependent error occurrence frequencies common to many short read platforms. Lastly, we achieve better error correction in genomes with high repeat content.

Figure 3.3 Histogram of estimated T_l for E. coli dataset (x-axis: T_l; y-axis: frequency).

The proposed method is made available through the software package REDEEM (Read Error DEtection and Correction via Expectation Maximization) under GNU GPL3 license and Boost Software V1.0 license at “http://aluru-sun.ece.iastate.edu/doku.php?id=redeem”.

In document Error correction and clustering algorithms for next generation sequencing (Page 74-78)