
CHAPTER 7: BWT-BASED WEB TOOLS

7.4 k-mer based query tools

The BWT and FM-index allow k-mers of arbitrary length to be queried in O(k) steps. Additionally, a string can be retrieved from the BWT in O(L) steps, where L is the length of the read being retrieved. The first two web tools use these two functions to search for a user-provided k-mer and then retrieve all reads containing that k-mer.
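The O(k) query can be illustrated with a minimal backward-search sketch. This is not the implementation behind the web tools: the BWT is built naively from sorted rotations of a single short text, and the function names (`build_bwt`, `build_fm_index`, `backward_search`) are hypothetical. It only shows why each base of the k-mer costs one step.

```python
def build_bwt(text):
    """Naive BWT via sorted rotations; text must end with a unique '$'."""
    n = len(text)
    order = sorted(range(n), key=lambda i: text[i:] + text[:i])
    # The BWT symbol of a rotation is the character that precedes it.
    return "".join(text[i - 1] for i in order)


def build_fm_index(bwt):
    """Return (C, occ) for the given BWT.

    C[c]      = number of symbols in the BWT that sort before c
    occ[c][i] = number of occurrences of c in bwt[:i]
    """
    alphabet = sorted(set(bwt))
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, symbol in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if c == symbol else 0)
    C, total = {}, 0
    for c in alphabet:
        C[c] = total
        total += occ[c][len(bwt)]
    return C, occ


def backward_search(kmer, C, occ, n):
    """Return the half-open row range [lo, hi) whose rotations start with kmer.

    One table lookup pair per base, so k steps for a k-mer.
    """
    lo, hi = 0, n
    for c in reversed(kmer):
        if c not in C:
            return 0, 0
        lo = C[c] + occ[c][lo]
        hi = C[c] + occ[c][hi]
        if lo >= hi:
            return 0, 0
    return lo, hi


bwt = build_bwt("ACGTACGA$")
C, occ = build_fm_index(bwt)
lo, hi = backward_search("ACG", C, occ, len(bwt))
print(hi - lo)  # number of occurrences of "ACG" in the text
```

The full occurrence table used here costs memory proportional to the text length times the alphabet size; a production index would typically sample it instead.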

With either k-mer tool, the user selects one or more datasets and provides the k-mer to search for. The server then opens the BWT for each dataset and performs the k-mer search. This returns a range of indices in the BWT that can then be used to retrieve each read containing the k-mer. Once all of the reads for a dataset are extracted, they are sent back to the client for visualization.
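The request flow above (search each dataset, obtain an index range, map the range back to reads) can be sketched as follows. As a loudly stated simplification, a sorted list of read suffixes stands in for the BWT, so the "range of indices" is a range in that list rather than in a real FM-index, and the index is rebuilt per query instead of being opened from disk. All names (`build_index`, `find_range`, `query_datasets`) are hypothetical.

```python
from bisect import bisect_left


def build_index(reads):
    """Sorted suffixes of all reads, plus the (read_id, offset) of each suffix."""
    entries = sorted(
        (read[i:], read_id, i)
        for read_id, read in enumerate(reads)
        for i in range(len(read))
    )
    suffixes = [suffix for suffix, _, _ in entries]
    locations = [(read_id, offset) for _, read_id, offset in entries]
    return suffixes, locations


def find_range(suffixes, kmer):
    """Half-open range [lo, hi) of sorted suffixes that start with kmer."""
    lo = bisect_left(suffixes, kmer)
    hi = lo
    while hi < len(suffixes) and suffixes[hi].startswith(kmer):
        hi += 1
    return lo, hi


def query_datasets(datasets, kmer):
    """datasets maps a dataset name to its list of reads.

    Returns, per dataset, a list of (read, offset_of_kmer_in_read).
    """
    results = {}
    for name, reads in datasets.items():
        # Stand-in for "open the BWT for this dataset".
        suffixes, locations = build_index(reads)
        lo, hi = find_range(suffixes, kmer)
        results[name] = [(reads[rid], off) for rid, off in locations[lo:hi]]
    return results


datasets = {"sample_a": ["ACGTAC", "TTACGA"], "sample_b": ["GGGG"]}
print(query_datasets(datasets, "ACG"))
```

A read that contains the k-mer more than once appears once per occurrence in this sketch; whether the actual tools deduplicate such reads is not stated in the text.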

The first tool simply displays the results of the k-mer query for the user. In Figure 7.1, an example query on an inbred mouse DNA-seq dataset is shown. Each retrieved read is shown as a row in the visualization. These reads have been horizontally shifted such that the shared k-mer from every read is aligned. If the k-mer is red, it indicates that the read contained the k-mer exactly on the forward version of the read. If the k-mer is blue, it indicates that the read contained the reverse-complement k-mer, so the entire read was reverse-complemented for visualization. At the bottom of the display is a bolded sequence representing the consensus sequence of all retrieved reads. The consensus is generated by selecting the most prevalent base at each position. Bases from individual reads that do not match the consensus sequence are highlighted in yellow for easier identification by the user. Using this approach, random sequencing errors appear as sporadic highlighted bases, whereas homologous sequences appear as consistent highlighted bases in a column in which consensus and alternate states are typically found at similar frequencies. Finally, the interface allows the user to select any contiguous sequence present in the results and then perform the query on that new k-mer.

Figure 7.1: K-mer lookup tool. This is a visualization of all reads with a particular k-mer in an inbred mouse DNA-seq dataset. Every read is shown as a row with the exact matching k-mer shown in either red or blue. A red k-mer indicates that the k-mer was found exactly on a read. A blue k-mer indicates that the reverse complement sequence of the k-mer was found exactly on the read, so the entire read was reverse-complemented prior to displaying it. The reads are aligned such that the shared exact matching k-mers are in the same position horizontally. At the bottom is a bold sequence with the shared k-mer in green that represents the consensus sequence of all the reads. A base in this consensus is chosen programmatically by selecting the base with the highest frequency in the column. Any bases that do not match the consensus are highlighted in yellow. In this way, random errors show up as sporadic highlighted bases. In contrast, heterozygosity shows up as consistent highlighted bases in a column.
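The orientation, alignment, and consensus steps of the lookup display can be sketched in a few lines. This is an illustrative reconstruction from the description, not the tool's code; the function names are hypothetical, reads are assumed to contain only A, C, G, and T, and ties in the per-column vote are broken by whichever base is seen first, which the text does not specify.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(_COMPLEMENT)[::-1]


def orient(read, kmer):
    """Return (read_as_displayed, kmer_offset, color), or None if no match.

    "red"  = k-mer found on the read as sequenced
    "blue" = reverse-complement k-mer found, so the read is flipped
    """
    offset = read.find(kmer)
    if offset >= 0:
        return read, offset, "red"
    flipped = reverse_complement(read)
    offset = flipped.find(kmer)
    if offset >= 0:
        return flipped, offset, "blue"
    return None


def align_and_summarize(reads, kmer):
    """Shift reads so the k-mer lines up, then vote a consensus per column."""
    hits = [h for h in (orient(r, kmer) for r in reads) if h is not None]
    if not hits:
        return [], "", []
    shift = max(offset for _, offset, _ in hits)
    rows = [" " * (shift - offset) + read for read, offset, _ in hits]
    width = max(len(row) for row in rows)
    rows = [row.ljust(width) for row in rows]
    consensus = "".join(
        Counter(row[col] for row in rows if row[col] != " ").most_common(1)[0][0]
        for col in range(width)
    )
    # (row, column) pairs that the display would highlight in yellow.
    mismatches = [
        (r, c)
        for r, row in enumerate(rows)
        for c, base in enumerate(row)
        if base != " " and base != consensus[c]
    ]
    return rows, consensus, mismatches


rows, consensus, mismatches = align_and_summarize(
    ["AGGAC", "TAGGAC", "GTCCT", "AGGAT"], "GGA"
)
print("\n".join(rows))
print(consensus, mismatches)
```

In this toy input the third read matches only after reverse-complementing, and the last base of the fourth read is the single highlighted mismatch.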

The second tool is an adaptation of the first, modified to aid in identifying the different sequence versions, or alleles, that share a k-mer query. Figure 7.2 shows a single k-mer query where there is evidence of multiple alleles in the bases both before and after the query. This suggests that the chosen query is part of two distinct sequences that both occur within the inbred organism.

Figure 7.2: K-mer lookup tool with multiple alleles. This is a visualization of all reads with a particular k-mer in an inbred mouse DNA-seq dataset. Every read is shown as a row with the exact matching k-mer shown in either red or blue. A red k-mer indicates that the k-mer was found exactly on a read. A blue k-mer indicates that the reverse complement sequence of the k-mer was found exactly on the read, so the entire read was reverse-complemented prior to displaying it. The reads are aligned such that the shared exact matching k-mers are in the same position horizontally. At the bottom is a bold sequence with the shared k-mer in green that represents the consensus sequence of all the reads. A base in this consensus is chosen programmatically by selecting the base with the highest frequency in the column. Any bases that do not match the consensus are highlighted in yellow. In this way, random errors show up as sporadic highlighted bases. In contrast, heterozygosity shows up as consistent highlighted bases in a column. In this example, there are many consistently highlighted bases in several reads near the bottom of the visualization, suggesting that this k-mer occurs in two distinct sequences from the inbred organism.

Figure 7.3 demonstrates the output of the allele-finding tool on the same dataset and k-mer query from Figure 7.2. After retrieving the reads, pairs of reads are scored against each other based on sequence overlap and number of matching bases. The closest pairs are then repeatedly merged into larger clusters such that all reads in a cluster exactly match their consensus. In other words, there is no disagreement on what the consensus should be for a particular cluster of reads. All clusters that have few reads are then merged into a remainder cluster where the reads vote on the consensus but are not required to match it. At the top of the visualization, a summary of the consensus sequences is displayed along with the number of reads in the corresponding cluster. Differences between each consensus and the largest cluster's consensus are highlighted in yellow. Below the summary, the reads of each cluster are displayed using the same visualization style as the original k-mer search.

These two tools enable arbitrary k-mer searches into multiple datasets and present the results to the user. By highlighting differences from the consensus sequence, the tools enable identification of random sequencing errors and/or consistent differences that may indicate genomic variation. Additionally, the allele-based tool helps the user to identify these alleles by clustering the reads prior to visualization. In both tools, the user has to know something about the sequence content of the datasets before performing the search. In particular, the choice of k-mer query is important because a read is retrieved only if it contains the query exactly. As a result, variants within the query sequence may make it difficult to find reads in a dataset.
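The clustering step of the allele-finding tool can be approximated with a greedy merge. The sketch below makes several assumptions that the text does not confirm: reads are already aligned and padded to equal width with spaces, the pair score is simply the number of overlapping matching bases, and `min_size` is a hypothetical threshold for what counts as a cluster with "few reads". The voting consensus of the remainder cluster is omitted.

```python
def merge_profiles(a, b):
    """Combine two aligned sequences (space = no coverage).

    Returns None if they disagree at any column both cover, which keeps
    every read in a cluster in exact agreement with the cluster consensus.
    """
    merged = []
    for x, y in zip(a, b):
        if x != " " and y != " " and x != y:
            return None
        merged.append(x if x != " " else y)
    return "".join(merged)


def overlap_score(a, b):
    """Number of columns where both sequences are covered and agree."""
    return sum(1 for x, y in zip(a, b) if x != " " and x == y)


def cluster_reads(rows, min_size=2):
    """Greedily merge the best-scoring compatible pair until none remain.

    Returns (clusters, remainder): clusters is a list of
    (member_row_indices, consensus) sorted by size, largest first;
    remainder lists the row indices from clusters smaller than min_size.
    """
    clusters = [([i], row) for i, row in enumerate(rows)]
    while True:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                merged = merge_profiles(clusters[i][1], clusters[j][1])
                if merged is None:
                    continue
                score = overlap_score(clusters[i][1], clusters[j][1])
                if best is None or score > best[0]:
                    best = (score, i, j, merged)
        if best is None:
            break
        _, i, j, merged = best
        clusters[i] = (clusters[i][0] + clusters[j][0], merged)
        del clusters[j]
    kept = [c for c in clusters if len(c[0]) >= min_size]
    kept.sort(key=lambda c: -len(c[0]))
    remainder = [i for members, _ in clusters if len(members) < min_size
                 for i in members]
    return kept, remainder


rows = ["ACGTA", "ACGTA", "ACGTA", "TCGTC", "TCGTC", "ACGTG"]
print(cluster_reads(rows))
```

Each round rescans all pairs, so this sketch is only practical for the modest number of reads returned by a single k-mer query.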

To account for these off-target variants, another tool allows k-mer searches with a user-specified edit distance, e. When an edit-distance-based query is requested, the server uses a branch-and-bound search routine to find any k-mers that are within e base changes, insertions, or deletions of the target sequence. Since this is a branch-and-bound search over an alphabet of size 4, the time to find all the k-mers within a particular edit distance is O(4^e * k). Once identified, the tool returns the k-mers and the reads corresponding to each k-mer. As with the first tool, these reads are then visually organized and a consensus sequence is generated for each edited k-mer.
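A branch-and-bound search of this kind can be sketched as follows. This is one possible formulation, not necessarily the server's routine: candidates are grown one base at a time, a branch is abandoned when the candidate no longer occurs in any read or when no extension could come within e edits of the query, and a plain substring test stands in for the BWT range lookup. The function name and the choice to extend candidates left to right are assumptions.

```python
def edit_distance_search(reads, query, max_edits):
    """Find substrings of the reads within max_edits of query.

    Returns a dict mapping each matching sequence to its edit distance
    (substitutions, insertions, and deletions each cost 1).
    """
    def occurs(candidate):
        # Stand-in for "the BWT range for this candidate is non-empty".
        return any(candidate in read for read in reads)

    found = {}

    def extend(candidate, row):
        # row[j] is the edit distance between candidate and query[:j].
        if candidate and row[-1] <= max_edits:
            found[candidate] = row[-1]
        # Bound: the row minimum never decreases as the candidate grows,
        # so no extension of this branch can get back within max_edits.
        if min(row) > max_edits:
            return
        for base in "ACGT":
            longer = candidate + base
            if not occurs(longer):
                continue
            new_row = [row[0] + 1]
            for j in range(1, len(query) + 1):
                cost = 0 if query[j - 1] == base else 1
                new_row.append(min(new_row[j - 1] + 1,   # base missing from candidate
                                   row[j] + 1,           # extra base in candidate
                                   row[j - 1] + cost))   # match or substitution
            extend(longer, new_row)

    extend("", list(range(len(query) + 1)))
    return found


reads = ["TTACGTAA", "GGACCTAA"]
print(edit_distance_search(reads, "ACGT", 1))
```

Because insertions and deletions are allowed, the matches can be one base shorter or longer than the query when e = 1, so a single exact occurrence also yields its trimmed and extended neighbors.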
