
CHAPTER 7: BWT-BASED WEB TOOLS

7.5 Batch query tools

The k-mer based visualizations are useful for accessing raw sequence data surrounding a specific k-mer. However, there are often cases where the actual read context is not required. Instead, the user may only be concerned with the k-mer frequencies associated with a batch of queries. Typically, these queries are pre-annotated and represent biologically relevant sequences such as single nucleotide polymorphisms, insertions, deletions, splice junctions, or copy number.

Figure 7.3: K-mer allele tool. This is the allele-based visualization of the same k-mer from Figure 7.2 in an inbred mouse DNA-seq dataset. After retrieving all reads with an exact matching k-mer, each read is compared in its entirety against every other read, and the reads are grouped into exact matching clusters. Each cluster has a consensus sequence that every read in the cluster matches. Any reads that don’t exactly match a cluster with at least five reads are allocated to the remainder consensus for the query. All consensus sequences and the remainder consensus are shown as a summary table at the top of the visualization. Differences in the consensus sequences are highlighted in yellow. Additionally, each consensus has a numerical value indicating the number of reads that exactly match that consensus. Below the summary, each cluster of reads is shown with its matching consensus in the same visualization style as Figure 7.2.

The batch query tools are designed to provide an easy way to input a batch of k-mers and query all of them simultaneously through the web interface. The first version of this tool allows users to select a single dataset to query. Users then upload a CSV file (or manually type the queries into the web client), specify which column of the CSV file contains the k-mers, and tell the client to execute the queries. The client then requests k-mer counts from the server for each of the k-mers provided by the user. The output is another CSV file where all columns from the input are preserved and two new columns are added corresponding to the forward and reverse-complement k-mer frequencies. The user can then copy this information or download the output as a CSV file for analysis. An example input and output for this tool is shown in Figure 7.4. As a final note, this tool also has a version that allows for querying k-mers within an edit distance. Any discovered k-mers within the specified edit distance are returned as extra columns in the output.
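The single-dataset workflow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the tool's implementation: the function names (`reverse_complement`, `count_kmer`, `annotate_csv`) and the output column names are hypothetical, and the server's BWT-backed count request is replaced by a naive substring count over an in-memory list of reads.

```python
import csv
import io

# Translation table for complementing DNA bases.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def count_kmer(reads, kmer):
    """Hypothetical stand-in for the server's count query.

    Counts (possibly overlapping) exact occurrences of kmer in the reads
    with a naive scan; the real tools answer this from a BWT index.
    """
    total = 0
    for read in reads:
        start = read.find(kmer)
        while start != -1:
            total += 1
            start = read.find(kmer, start + 1)
    return total


def annotate_csv(csv_text, kmer_column, reads, has_header=True):
    """Copy every input column and append forward and reverse-complement counts."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    if has_header and rows:
        # Column names here are assumptions, not the tool's actual headers.
        writer.writerow(rows[0] + ["forward_count", "revcomp_count"])
        rows = rows[1:]
    for row in rows:
        kmer = row[kmer_column]
        writer.writerow(
            row
            + [count_kmer(reads, kmer),
               count_kmer(reads, reverse_complement(kmer))]
        )
    return out.getvalue()
```

Calling `annotate_csv(csv_text, 1, reads)` treats the second column (index 1) as the k-mer column, mirroring the column selection the user makes in the web client.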

The second version of this tool allows for multiple datasets to be queried and saved to the output file. After selecting which datasets to query, the users again upload a CSV file, specify a column with labels and a column with queries, and tell the client to execute the queries. For each selected dataset, the client then requests k-mer counts from the server. Then, the output is formatted as a new CSV table where columns are the queries and rows are the datasets. For each dataset and k-mer, the corresponding cell stores the frequency of the k-mer in that dataset. An example output for this tool is shown in Figure 7.5.
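The multi-dataset output layout, with one row per dataset and one column per labeled query, can be sketched as follows. As before, this is a sketch under assumptions: `count_kmer` and `batch_table` are hypothetical names, the datasets are small in-memory read lists rather than server-side BWT indexes, and each cell holds only the count of the query as given.

```python
import csv
import io


def count_kmer(reads, kmer):
    """Hypothetical stand-in for the server's count query (naive exact matching)."""
    k = len(kmer)
    return sum(
        1
        for read in reads
        for i in range(len(read) - k + 1)
        if read[i:i + k] == kmer
    )


def batch_table(datasets, labeled_queries):
    """Build a CSV table: rows are datasets, columns are labeled k-mer queries.

    datasets: dict mapping a dataset name to its list of reads.
    labeled_queries: list of (label, kmer) pairs taken from the input CSV.
    """
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    # Header row: a dataset column followed by one column per query label.
    writer.writerow(["dataset"] + [label for label, _ in labeled_queries])
    for name, reads in datasets.items():
        writer.writerow(
            [name] + [count_kmer(reads, kmer) for _, kmer in labeled_queries]
        )
    return out.getvalue()
```

With two toy "inbred" datasets that each carry only one allele, the table shows a nonzero count in exactly one of the two query columns per row, which is the pattern described for Figure 7.5.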

In contrast to the k-mer based visualizations, these tools are better suited to gathering count statistics for many different k-mer patterns than to exploring the sequence data surrounding a particular pattern. In particular, this class of tool enables users to gather k-mer frequency information in multiple datasets without needing to download the datasets and count them using a locally installed program. Most use cases for these tools involve first identifying k-mer patterns that specifically target a biologically meaningful sequence. For example, the pattern may uniquely identify a particular splice junction or genomic variant within an organism. In either case, the sequence must be known beforehand, and off-target variants may affect the results.

Figure 7.4: Mass query tool. These images are screenshots of the mass query tool when run on a small set of short k-mers. The top image shows a CSV input for a selection of k-mer queries where the input has an identifier, a k-mer query, and some metadata associated with the query. The user informs the client that a header line is present and that the k-mer queries are in the second column of the input. The client then requests the specific k-mer counts from the server. In the output, the original CSV input is copied and two new columns are added corresponding to the forward and reverse-complement counts for the k-mer queries. This output can then be copied or downloaded in CSV format for further analysis. In this example, the chosen sample is an inbred organism, so only one version (either “ref” or “alt”) of each allele has counts greater than zero in the output.

Figure 7.5: Batch query tool. This image is a screenshot of the batch query tool when run on two probe k-mers for multiple datasets. For this example, eight different datasets were selected by the user. The client requests k-mer counts for each dataset and outputs the result in a new CSV file format where columns represent k-mer queries and rows are the datasets being queried. The resulting CSV file can be copied or downloaded for future analysis. In this example, all eight datasets are inbred datasets, so only one of the two alleles has counts greater than zero for each dataset.

In document Holt_unc_0153D_16498.pdf (Page 106-110)