• No results found

Conclusion & Future work

In conclusion, we have proposed a motif-finding framework that, given a set of gene upstream regions, performs their all-to-all pairwise comparison and identifies all the motifs of length k (k-mers) that are common to any pair of upstream regions or differ in at most d characters. Our framework stores the k-mers found in each gene in a multi-level hash table. Our hash table design aims to optimize hash table comparison (rather than hash table insertion or lookup), is highly parallelizable and can be easily mapped onto GPU. We have proposed four GPU implementations of pairwise hash table comparison, each leveraging a distinct parallelization approach. We have per- formed an extensive experimental evaluation on an average-size yeast genome. Our results have shown that, for the considered genome, GPU implementations designed to optimize task parallelism perform better than kernels designed to limit warp diver- gence through synchronizations, and outperform an 8-thread CPU implementation by a factor of 23 to 52x.

In the future, we plan to use our framework on different genomes (with larger upstreams) and perform a comparative intra- and inter-genome analysis with the goal of identifying genes that are regulated by the same transcription factors and thus share biological functions. From the computational standpoint, we expect that, on genes with longer upstream regions leading to larger hash tables, the Thread to Bucket version will be indispensable when coupled with shared memory chunking of both upstreams.

Acknowledgment

This work has been supported by NSF award CCF-1421765 and CNS 1429294 and by equipment donations from Nvidia Corporation.

Chapter 4

PolyHomology: A Comparative

Genomics Approach to Regulation

Pattern Discovery

4.1

Introduction

Eukaryotic genomes are notoriously complex and can contain tens to hundreds of thousands of genes [149][43]. Even for a well-studied model organism such as the budding yeast (Sacchromyces cerevisiae), where the number of genes is estimated to be only about 5,500, 28% the protein-coding genes are still unannotated according to PANTHER version 10 [86][16]. Understanding when and where these genes are expressed can give insights into what biological processes or molecular pathways they act in. The rules governing the expression of a gene are complex and incompletely understood. However, special proteins called transcription factors (TFs) play a criti- cal role in binding to the DNA of a gene and enhancing or suppressing its expression (e.g., transcription [148]). This process is driven by the specific binding of those TFs to short sequences of DNA known as motifs. While genes can be identified with relatively high confidence across species, regulatory motifs identification tends to be

much more challenging. Even with a relatively good algorithm, the rates of false negatives and false positives are excessive [40]. A major reason for this difficulty is that the motifs are both short and degenerate (e.g., some positions in the motif can vary without altering binding), making it difficult to generate statistical models that detect them with high confidence [144]. Thus, the big picture of understanding what sets of genes are co-regulated is still largely unsolved, especially when one looks beyond model organisms to the ever growing list of sequenced genomes [39] [1]. At the same time, the problem of understanding gene expression would be considerably simplified if we knew, for each gene, all of the other genes with which it shared a common TF.

One promising avenue that has been opened by the advent of inexpensive se- quencing is the use of the comparative method, one of the historically most successful approaches to biology [41] [32] [37] [157]. In particular, we hypothesize that very large-scale comparative genomics approaches, using dozens to hundreds of genomes from distinct species, might be able to identify shared gene regulation on a global scale using even fairly simply homology measures. In this model, we assume that important regulatory pathways (e.g., which TFs bind to which genes) have been conserved over evolutionary time [99]. Indeed, there is evidence suggesting that evolutionary pres- sures to maintain important biological functions and pathways lead to corresponding orthologs being conserved over time [140] [95]. Our approach assumes that both the binding motifs and their recognition elements in the TF proteins may drift over time, but that selection acts to maintain the binding of the TF to its target genes. Given this assumption, it follows that shared similarity in upstream regions between co- regulated genes, if also present, albeit with different motifs, in orthologous genes in other taxa, is evidence of co-regulation. Coupled with rapidly falling costs for DNA sequencing, we propose that it may be possible to identify co-regulated genes through the comparison of all upstream regions against each other in each genome and then

integrating these potential regulatory sites across a large number of related genomes using the principle of orthology among the genes analyzed. From this premise, we here propose PolyHomology, a method that combines information from homologous genes across species, to find such highly co-regulated genes using a big-data approach. In a sense, this approach is a simplification of the phylogenetic footprinting technique [21, 14]. However, we differ with traditional footprinting technique in two significant ways: First, we forego the estimation of the regulatory motifs themselves in an effort to im- prove the prediction of co-regulated genes. Second, we do not require a target gene of interest. Hence, while some of the yeast genomes analyzed here were previously used for a footprinting analysis [59], that analyses focused only on closely related species where motif conservation was expected, whereas we use a much more diverged set of genomes. As such, the goal of this work is not the prediction of co-regulated genes per se but rather assessing if such a prediction would be possible in a future world with extremely densely sampled genomes. Because analyses based on such dense sampling presents formidable computational challenges, we developed our algorithms to exploit the parallelism inherent in these data by using GPU co-processors for the most computationally demanding part of the analyses.