Evolutionary tools for sequence analysis
PHYLOGENETIC METHODOLOGY
Multiple sequence alignment
Ari Löytynoja, Nick Goldman
We have developed webPRANK, an easy-to-use web interface to the PRANK phylogeny-aware sequence alignment algorithm (Löytynoja & Goldman, 2005). In addition to standard DNA and amino acid alignments, webPRANK can align protein-coding DNA data as codon or amino acid sequences and then back-translate the resulting alignment to DNA. It can also align DNA sequences using evolutionary models of sequence structure (such as fast versus slowly evolving non-coding regions, or non-coding versus coding regions) and infer the sequence structure along the alignment (Löytynoja & Goldman, 2008a). webPRANK includes a powerful alignment browser with features similar to those found in our PRANKSTER stand-alone program. The browser allows for visual inspection of the results with site-wise estimates of alignment reliability and post-processing of results by removal of the most uncertain alignment sites (figure 1). The webPRANK server can be found at www.ebi.ac.uk/goldman-srv/webprank.
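The back-translation step mentioned above can be illustrated with a minimal Python sketch. It is not webPRANK's own code: the function name `back_translate`, the gap character and the toy sequences are assumptions made for illustration. The idea is simply to thread each unaligned coding sequence onto its aligned protein, emitting one codon per residue and three gap characters per protein gap.

```python
def back_translate(aligned_protein: str, dna: str, gap: str = "-") -> str:
    """Thread an unaligned coding DNA sequence onto its aligned protein.

    Hypothetical helper: assumes the DNA contains exactly one codon per
    residue and no stop codon or partial codon at the end.
    """
    n_residues = sum(1 for ch in aligned_protein if ch != gap)
    if len(dna) != 3 * n_residues:
        raise ValueError("DNA length must be three times the residue count")
    # Cut the DNA into consecutive codons.
    codons = (dna[i:i + 3] for i in range(0, len(dna), 3))
    # Each residue consumes the next codon; each protein gap becomes three gaps.
    return "".join(gap * 3 if ch == gap else next(codons)
                   for ch in aligned_protein)


# Toy data (invented for this example).
protein_alignment = {"seq1": "MK-L", "seq2": "MKAL"}
coding_dna = {"seq1": "ATGAAACTG", "seq2": "ATGAAGGCTCTG"}

dna_alignment = {
    name: back_translate(protein_alignment[name], coding_dna[name])
    for name in protein_alignment
}
for name, row in dna_alignment.items():
    print(name, row)
```

Because gaps are always inserted in multiples of three, the reading frame of every sequence is preserved in the resulting DNA alignment.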
The phylogeny-aware sequence alignment algorithm implemented in the PRANK software package was shown to improve the alignment of sequences with insertions (Löytynoja & Goldman, 2008b). The algorithm is, however, greedy and may perform poorly if the phylogeny of the sequences is incorrect or changes across sites. Because it relies on a correct phylogeny, the original algorithm’s performance may paradoxically suffer from increasingly dense sampling of sequences, since recombination and incomplete lineage sorting alter the underlying phylogeny across sites. Modelling sequences as graphs, where nodes represent characters and edges connect adjacent characters, allows a more flexible description of the uncertainty of site presence/absence in ancestral sequences. Furthermore, the graph edges can be described with a probabilistic model that accounts for their phylogenetic history, and thus allow a less greedy inference of insertion/deletion events. We are therefore currently re-implementing our phylogeny-aware sequence alignment as a graph-based alignment, which should largely resolve these limitations.
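As a rough illustration of the graph representation described above, the following Python sketch encodes an ancestral sequence in which the presence of one character is uncertain. It does not reproduce the data structures of PRANK or of the planned re-implementation: the node numbering, the edge weights and the function `enumerate_paths` are all hypothetical. An edge that skips a node expresses that the skipped character may be absent.

```python
from typing import Dict, Iterator, List, Tuple

START, END = 0, 4

# Node index -> character. START and END are empty sentinel nodes.
chars: Dict[int, str] = {0: "", 1: "A", 2: "C", 3: "G", 4: ""}

# Edges connect adjacent characters; weights are invented support values.
# The edge 1 -> 3 bypasses node 2, so "C" may be present or absent.
edges: Dict[int, List[Tuple[int, float]]] = {
    0: [(1, 1.0)],
    1: [(2, 0.7), (3, 0.3)],
    2: [(3, 1.0)],
    3: [(4, 1.0)],
}


def enumerate_paths(node: int = START, prefix: str = "",
                    weight: float = 1.0) -> Iterator[Tuple[str, float]]:
    """Yield (sequence, weight) for every path from `node` to END."""
    if node == END:
        yield prefix, weight
        return
    for nxt, w in edges.get(node, []):
        yield from enumerate_paths(nxt, prefix + chars[nxt], weight * w)


# Each path through the graph is one candidate ancestral sequence.
paths = dict(enumerate_paths())
for seq, w in sorted(paths.items()):
    print(seq, round(w, 3))
```

A linear sequence is the special case in which every node has exactly one outgoing edge. Keeping both alternatives in the graph defers the decision on whether the character is an insertion or a deletion, instead of committing to it greedily.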
Assessing multiple patterns of heterogeneity in phylogenetic models of evolution
Samuel Blanquart, Nick Goldman
Mixture models in phylogenetics make the assumption that a given (Markovian) process of substitution persists throughout a sequence site’s history. It has, however, been shown that this assumption does not hold systematically. For example, site rates in the Rates Across Sites model (RAS; Yang, 1994) are not always constant over time. They often vary over the site’s history, e.g. switching from fast to slow. This is denoted as Single Site Variation (SSV) of rates, or heterotachy (Galtier, 2001). Other works have demonstrated that SSV phenomena also apply to other features of molecular evolution, for example biochemical constraints on amino acid sites (e.g. hydrophilic/hydrophobic SSV, Holmes & Rubin, 2002; Blackburne et al., 2008), or on coding DNA sequences (Whelan, 2008), introducing more complex SSV patterns. This also concerns the selective pressures (dN/dS ratio) applied at the codon level (positive/negative selection SSV; Guindon et al., 2004).
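To make the notion of rate switching concrete, the sketch below builds a small continuous-time rate matrix in which a nucleotide site can move between a fast and a slow rate class as well as between nucleotides. It is an illustrative toy, not the model of any of the cited studies: the Jukes-Cantor base process, the rate multipliers in `RATES`, the switching rate `SWITCH` and the function names are assumptions. Simultaneous changes of class and nucleotide are given rate zero, which is also an assumption of this sketch.

```python
from itertools import product

NUCS = "ACGT"
RATES = {"fast": 2.0, "slow": 0.1}  # hypothetical rate multipliers
SWITCH = 0.5                        # hypothetical class-switching rate


def base_rate(a: str, b: str) -> float:
    """Off-diagonal rate of the base process (Jukes-Cantor, assumed)."""
    return 1.0 / 3.0


def build_generator():
    """Return (states, q) where q[s][t] is the rate from state s to t.

    A state is a pair (rate class, nucleotide).
    """
    states = list(product(RATES, NUCS))
    q = {s: {t: 0.0 for t in states} for s in states}
    for (c1, a), (c2, b) in product(states, repeat=2):
        if c1 == c2 and a != b:
            # Substitution within a class, scaled by the class rate.
            q[(c1, a)][(c2, b)] = RATES[c1] * base_rate(a, b)
        elif c1 != c2 and a == b:
            # Switch of rate class; the nucleotide is unchanged.
            q[(c1, a)][(c2, b)] = SWITCH
    for s in states:
        # Diagonal entry makes each row sum to zero.
        q[s][s] = -sum(q[s].values())
    return states, q


states, q = build_generator()
print(len(states), "states")
print("exit rate from (fast, A):", -q[("fast", "A")][("fast", "A")])
print("exit rate from (slow, A):", -q[("slow", "A")][("slow", "A")])
```

With only one rate class the matrix reduces to the base substitution process, and with `SWITCH` set to zero each site stays in its class forever, which corresponds to a rate that is constant over the site's history.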
The previously mentioned studies investigating occurrences of SSV phenomena during molecular evolution all use the Markov-Modulated Markov (MMM) formalism. An MMM substitution process involves a set of ‘classical’ stochastic substitution processes, plus an additional stochastic process allowing switches from each of the classical substitution processes to the others (Galtier, 2001). Our work in this area focuses on the design of a general framework for SSV models involving MMM formalisms. This will generalise the more sophisticated approach of Whelan (2008) by allowing virtually all parameters of a classical phylogenetic model (rates, stationary probabilities, exchange rates etc.) to exhibit SSV behaviour. While the new model will first be applied to coding DNA sequences with the aim of reproducing the results and of improving the fit of the Whelan model, future applications will investigate SSV phenomena for amino acid and codon sequences, and provide new model combinations which have never been tested before.
Comparative prediction of protein-coding genes
Stefan Washietl, Nick Goldman
The detection of protein-coding genes in genomic DNA is a classical problem in computational biology. Using machine learning techniques, sophisticated models of genes have been built that can be used to annotate whole genomes. However, new types of high-throughput data, such as genome-wide transcription maps and massive comparative sequencing, have led to new challenges beyond classical gene finding. Many transcripts have been found that do not overlap known or predicted genes and statistical methods are necessary to assess the coding potential of this ‘black matter’ transcription. Similarly, comparative sequencing has revealed a plethora of evolutionarily conserved regions without annotation. A reliable analysis of their coding potential is an essential step preceding any further analysis. We have developed RNAcode, a new program to detect coding regions in multiple sequence alignments. RNAcode analyses typical evolutionary patterns such as synonymous/conservative amino acid substitutions, conservation of
Research in 2009 – The Goldman Group
Figure 1. The TSPAN6 dataset is aligned using the webPRANK alignment server and the result is displayed in a web browser window. In addition to automatic colour coding, display of the evolutionary tree and horizontal scrolling, the alignment browser allows for post-processing of the results. Here, alignment columns with low reliability score (lighter shades in the track at the bottom) are selected and shaded light grey. The filtered set of columns can be exported in several different alignment formats for further analyses.