GENOME EVOLUTION

Evolutionary tools for sequence analysis

Evolution of transcription regulation

Jacky Hess, Ari Löytynoja, Emeric Sevin, Martin Taylor, Nick Goldman

It has been widely speculated, but as yet not convincingly demonstrated, that changes in the regulation of gene expression underlie much of the divergence between species (King & Wilson, 1975) and the phenotypic variation evident within populations. Following our earlier work investigating a transcriptional regulatory network that controls cellular differentiation (Suzuki et al., 2009), we have been studying how the regulation of gene expression has diverged between mouse and human. This work uses novel, high-quality gene expression and promoter usage data from both mouse and human primary cells, assayed over a time course following a stimulatory signal. The experimental data was generated as part of an international collaboration. Most genes show significant constraint in their expression profile, but up to 23% have significantly diverged in their regulation. We see a significant correlation between the selective pressure on protein-coding sequence and expression divergence, suggesting that adaptation in coding sequence often goes hand-in-hand with adaptation in the expression of a gene. However, several lines of evidence have led us to conclude that the majority of expression divergence is a consequence of trans regulation – changes in the upstream regulatory network that could simultaneously affect multiple genes, rather than mutations in cis-regulatory sequences such as the core promoter. Despite the apparent dominance of trans effects we have found a handful of non-coding regulatory sequences that show compelling evidence for diversifying selection with consequent impacts on cis regulation of the neighbouring gene. In the course of this analysis we have found yet more evidence for highly localised variation in mutation rates, in agreement with other recent work from the group (Taylor et al., 2008; Washietl et al., 2008; Semple & Taylor, 2009). 
As all of these results are based on data from a very well studied cell type, we are able to directly relate many of the observed differences in gene regulation to cellular and even organism-level differences between human and mouse, supporting the ideas first advanced over 30 years ago by King and Wilson.

Transcription factors (TFs) are obvious effectors of trans regulation, and variation in their repertoire and sequences is thus bound to have consequences for the downstream expression of genes. In an ongoing study of TF repertoires in yeasts, we are aiming to address the role of the protein-coding components of transcriptional regulatory circuits in their evolution.

We have collected data from 17 species of hemiascomycetous yeasts comprising a total of 48 families of DNA-binding proteins. The number of TFs per genome ranges from 170 in Ashbya gossypii to 247 in Debaryomyces hansenii. Generally, TFs were found to comprise between 3.2% and 4.4% of the annotated protein-coding genes. We found the main contributors to yeast TF repertoires to be classical C2H2 and binuclear cluster zinc finger proteins, often in combination with a fungal-specific TF domain. These make up between 46% and 62% of the collected TF repertoires. TFs often contain a highly conserved DNA binding region surrounded by fast evolving sequence which makes both alignment and subsequent phylogenetic analysis challenging. To overcome some of those difficulties, we have developed an anchored alignment approach based on PRANK (Löytynoja & Goldman, 2005) that guides the alignment based on annotated structural domains and thereby improves the alignment surrounding those. Furthermore, we are currently investigating the effects of different strategies for the inference of duplication and losses along a gene tree when the phylogenetic signal in the data is limited in order to confidently identify groups of orthologous genes for in-depth evolutionary analysis.

Although trans effects may dominate as a source of change in regulation, the evolution of cis-regulatory elements is nonetheless known to have a significant impact on phenotypic variation (Wray, 2007). While the evolutionary process of specific, well-characterised systems has been studied before, we are currently working on giving a broader view of the evolutionary dynamics of transcription factor binding sites (TFBS) in Drosophila, such as their gain and loss in promoters through time (‘turnover’).

Research in 2009 – The Goldman Group

From the 16,209 non-redundant promoters in the genome of Drosophila melanogaster, we produced multiple alignments with their orthologues in seven other drosophilids thanks to PRANK (Löytynoja & Goldman, 2005), using ad hoc tree topology selection to tackle reported issues of incomplete lineage sorting. Ancestral sequences were also reconstructed using maximum likelihood methods and included in the alignments. Fifteen known TFBS motifs were mapped in the promoters (in collaboration with Jüri Reimand, BIIT, University of Tartu, Estonia), leading to a filtered set of 591,865 putative binding positions.
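
As an illustration only, the idea of reading TFBS gain and loss off an alignment that includes a reconstructed ancestor can be sketched as follows. The function names, the use of a regular expression as a stand-in for a motif model, and the rule that two hits are "the same site" when they start in the same alignment column are all assumptions made for this sketch; they are not the mapping or filtering procedure used in the study.

```python
import re


def motif_columns(aligned_seq, motif_regex):
    """Return the alignment columns at which a motif match starts.

    Gaps ('-') are removed before matching, and each hit is mapped back
    to the column of its first base in the gapped alignment row.
    """
    columns = [i for i, ch in enumerate(aligned_seq) if ch != "-"]
    ungapped = aligned_seq.replace("-", "").upper()
    # A lookahead is used so that overlapping matches are also reported.
    pattern = re.compile("(?=(" + motif_regex + "))")
    return {columns[m.start()] for m in pattern.finditer(ungapped)}


def classify_turnover(ancestor, descendant, motif_regex):
    """Compare motif hits in two rows of the same alignment.

    Assumption for this sketch: hits starting in the same column are
    treated as one conserved site; all others count as a gain or a loss.
    """
    if len(ancestor) != len(descendant):
        raise ValueError("rows must come from the same alignment")
    anc = motif_columns(ancestor, motif_regex)
    desc = motif_columns(descendant, motif_regex)
    return {
        "conserved": sorted(anc & desc),
        "lost": sorted(anc - desc),
        "gained": sorted(desc - anc),
    }


# Toy example with a hypothetical TATA-like motif.
ancestor = "TATAAAGGTATAAACC------"
descendant = "TATAAAGGTACAAACCTATAAA"
print(classify_turnover(ancestor, descendant, "TATAAA"))
# {'conserved': [0], 'lost': [8], 'gained': [16]}
```

Matching by start column is deliberately strict. A site that is lost and replaced by a nearby gain, as in the toy example, shows up as one loss plus one gain, which is the kind of pattern a compensatory turnover event would produce.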

We assessed general constraints over the set of promoters by looking at the average substitution rates at each position relative to the transcription start site. This revealed a strong correlation with nucleotide composition, whereby regions with a higher concentration of adenine (A) and thymine (T) bases seem to change faster. For some TFBS, we were also able to infer spatial clustering constraints on well-conserved sites, as well as inter-TFBS distance preferences. With this in mind, we are investigating the general mechanisms underlying turnover events, looking at the conservation patterns of the original and substituted sites in compensatory cases, and also estimating the local context of selection of TFBS depending on their position in the promoters.
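
A minimal sketch of the position-wise comparison described above, under simplifying assumptions: each promoter is represented by an (ancestral, descendant) pair of aligned rows, all pairs share the same coordinates relative to the transcription start site, and the "rate" is simply the proportion of differing bases per column rather than a model-based substitution rate. The function names and the toy data are hypothetical.

```python
from math import sqrt


def per_position_stats(pairs):
    """Per alignment column: (column, proportion changed, A/T fraction).

    `pairs` is a list of (ancestral, descendant) aligned strings of equal
    length. Columns with no comparable A/C/G/T bases are skipped.
    """
    length = len(pairs[0][0])
    stats = []
    for col in range(length):
        compared = changed = at_count = 0
        for anc, desc in pairs:
            a, d = anc[col].upper(), desc[col].upper()
            if a not in "ACGT" or d not in "ACGT":
                continue  # gap or ambiguous base: not comparable
            compared += 1
            if a != d:
                changed += 1
            if a in "AT":
                at_count += 1
        if compared:
            stats.append((col, changed / compared, at_count / compared))
    return stats


def pearson(xs, ys):
    """Pearson correlation coefficient (undefined if either input is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Toy data: three promoters, five aligned positions.
pairs = [("AATGC", "ATTGC"), ("ATCGC", "AACGC"), ("TAGGC", "TAGGC")]
stats = per_position_stats(pairs)
rates = [rate for _, rate, _ in stats]
at_fractions = [at for _, _, at in stats]
print(pearson(rates, at_fractions))
```

With real data, a positive coefficient would correspond to the observation that A/T-rich positions change faster. The sketch makes no attempt to correct for multiple hits or for differences in branch length.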

Selective pressure analysis

Greg Jordan, Martin Taylor, Nick Goldman

Working with data from the Mammalian Genome Project led by the Broad Institute at MIT, we undertook an analysis of selective pressures in mammals using a greatly increased amount of data and resolution when compared to previous studies (Figure 2). Using comparative genomics data from the Ensembl Compara database (Vilella et al., 2009) and two methods from the Goldman group, namely Tim Massingham’s Sitewise Likelihood Ratio (Massingham & Goldman, 2005) and Ari Löytynoja’s PRANK (Löytynoja & Goldman, 2005), we: 1) generated the first ever distribution of site-wise selective pressures in mammals; 2) identified new classes of proteins subject to positive selection in mammalian clades; and 3) began to investigate the dynamics of positive and purifying selection in three mammalian sub-clades. Of particular interest in this study has been the ability to precisely identify the location of evidence for positive selection within mammalian gene families. Previous genome-wide studies focused on identifying entire genes showing evidence for positive selection; our site-wise analysis allows such evidence to be localised to individual residues. Correlating positively-selected sites with various protein-level annotations yields new insights into the biological processes and structural motifs which are most often subject to positive selection in mammals. Gene ontology (GO; Gene Ontology Consortium, 2008) terms such as ‘olfactory receptor activity’ and ‘electron transport’ were enriched in genes under strong evolutionary constraint but with a small number of sites showing evidence for positive selection, while Pfam (Finn et al., 2008) domains such as ‘protein kinase domain’ and ‘ion transport protein’ showed strong purifying constraint but a disproportionately large number of positively-selected sites.
Future work along these lines will involve analysis to better understand the accuracy and sensitivity of site-wise analyses and correlating site-wise selection pressures with protein structures and population genetic datasets.
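
Figure 2 shows site-wise dN/dS estimates with 95% confidence intervals against the neutral value of 1. One simple way to turn such intervals into the three categories used there is sketched below. The decision rule (comparing the whole interval with 1) and the function names are assumptions made for illustration; this is not the Sitewise Likelihood Ratio test itself, which assesses each site with a likelihood ratio statistic.

```python
def classify_site(ci_lower, ci_upper, neutral=1.0):
    """Label a site from the confidence interval of its dN/dS estimate.

    Assumed rule for this sketch: an interval entirely below the neutral
    value suggests purifying selection, one entirely above it suggests
    positive selection, and anything overlapping it is left as neutral.
    """
    if ci_lower > ci_upper:
        raise ValueError("lower bound exceeds upper bound")
    if ci_upper < neutral:
        return "purifying"
    if ci_lower > neutral:
        return "positive"
    return "neutral"


def summarise(intervals):
    """Count how many sites fall into each category."""
    counts = {"purifying": 0, "positive": 0, "neutral": 0}
    for lower, upper in intervals:
        counts[classify_site(lower, upper)] += 1
    return counts


# Hypothetical intervals for four alignment columns.
print(summarise([(0.05, 0.40), (0.02, 0.30), (0.60, 1.80), (1.30, 4.00)]))
# {'purifying': 2, 'positive': 1, 'neutral': 1}
```

Counting categories per gene in this way gives the kind of summary needed to distinguish genes under overall constraint with a few positively selected sites from genes with many such sites.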

Figure 2. A long protein with very few indels, cytochrome B (ENSG00000165168, CYBB) shows strong purifying selection with two pockets of positively selected sites (bottom; Mammals). The overall dN/dS pattern is similar in the three sub-trees (from top: Laurasiatheria, Primates, Glires), although the exact sites showing evidence of positive selection vary between sub-trees. The protein sequence alignment is coloured according to the Taylor (1997) colouring scheme except for those bases missing from low-coverage genomes (black). Vertical bars above the alignment represent the 95% confidence intervals of the site-wise dN/dS estimates, plotted on a log scale. Colour represents the strength of evidence for purifying (blue) or positive (red) selection; sites with little evidence for non-neutral selection are coloured grey. Horizontal lines are drawn at the neutral dN/dS value of 1.


In document Annual Scientific Report European Bioinformatics Institute (Page 121-123)