HaMStR Algorithm - Addressing biological questions with massive sequence data

7.2.1 Step 1: Defining a Gene Set for the Ortholog Search

7.2.1.1 Generation of Core-Orthologs

As input we introduce a set of primer-taxa, each of which is completely sequenced and whose phylogeny is undisputed. Standard orthology prediction tools such as InParanoid (Remm et al. (2001)) or OMA (Roth et al. (2008)) can be applied to identify genes with orthologs present in all primer-taxa, the so-called core-ortholog groups. We compute orthologs for each pair of taxa with InParanoid (Remm et al. (2001)) or use pre-compiled orthology assignments provided by the InParanoid database (Berglund et al. (2007)). The pairwise orthology predictions are subsequently extended to include all primer-taxa by using a criterion of transitive closure (InParanoid-TC). This approach has the advantage of generating ortholog clusters with only one sequence per taxon. Such clusters can then be used directly in standard downstream approaches to phylogeny reconstruction. In brief, we order the n primer-taxa such that taxon i + 1 is the closest relative of taxon i in the species tree, for i = 1, 2, ..., n − 1, where ties are broken randomly. For each protein in taxon 1 we carry out the following loop:

a We identify the corresponding InParanoid-orthologs in taxon 2. If more than one co-ortholog exists, we choose the one with the highest InParanoid-score.

b With the protein from taxon 2 we identify the ortholog with the highest InParanoid-score in taxon 3. This procedure continues until the ortholog-pair for taxon n − 1 and n has been determined.


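[Figure 7.1 (image): species tree of Species A, Species B and Species C along a time axis, marking Speciation 1, Speciation 2 and the Gene Duplication (X/Y) after the common origin O; gene labels Xa, Ya, Xb, Yb, Xc, Yc, with internal nodes X’, X’’, Y’, Y’’.]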

Figure 7.1: Concept of Orthology and Paralogy. The black tree illustrates the evolutionary relationships of three species A, B and C. Within each species, two genes are present ((Xa,Ya), (Xb,Yb) and (Xc,Yc)), whose lineages can be traced back to a common origin (O) in the ancestor of all three species. All splits in the green gene tree, connecting Xa, Xb and Xc, are caused by speciation events. Xa, Xb and Xc are therefore called orthologs. The same holds true for the genes that are connected by the red gene tree (Ya, Yb and Yc), which are orthologs as well. In contrast, the shared origin of both gene groups is the gene duplication (X/Y). This makes each gene of the green lineage paralogous to each gene of the red lineage and vice versa. If genes from both groups are mixed for a tree reconstruction, the obtained tree topology may not be congruent with the species tree. For example, a tree based on the sequences of Xa, Yb and Yc will support a grouping of species B and C to the exclusion of species A, because the evolutionary distance between Yb and Yc is shorter than that between Yb and Xa.

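[Figure 7.2 (image): the same species and gene tree as in Figure 7.1 (Species A, B and C along a time axis; Speciation 1, Speciation 2, Gene Duplication (X/Y), origin O; genes Xa, Ya, Xb, Yb, Xc, Yc), with only the available genes connected by solid lines.]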

Figure 7.2: Orthology assignment with incomplete sequence data. In this scenario, the sequence data for species A and B are incomplete. Only the genes connected by solid lines are available because, for example, too few ESTs have been generated so far, or the genome from which the individual genes have been predicted is still in draft status. A search for the sequence most similar to Xa in species B will result in Yb. A search for the sequence most similar to Yb in species A will confirm that Xa and Yb are each other's most similar sequences. This suggests that Xa and Yb are orthologs, although they are not. A similar situation occurs if the genomes of species A and B are completely sequenced, but genes Ya and Xb were lost during evolution. In this case the reciprocal BLAST strategy will also wrongly predict Xa and Yb to be orthologs. This phenomenon is known as hidden paralogy.


c The circle is then closed by identifying the ortholog to the protein from taxon n in taxon 1.

d If the two proteins in taxon 1 at the beginning and the end of the round trip are identical, we keep the set of orthologs and call it core-ortholog. Else, we discard the proteins.
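The loop in steps a to d can be sketched in Python as follows. This is a minimal illustration, not the original implementation: the `pairwise` dictionary layout, the function names and the toy identifiers are assumptions made for this example, and real InParanoid output would first have to be parsed into this form.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical layout: pairwise[(taxon_a, taxon_b)][protein_in_a] is a list of
# (protein_in_b, inparanoid_score) co-orthologs predicted for that taxon pair.
Pairwise = Dict[Tuple[str, str], Dict[str, List[Tuple[str, float]]]]


def best_ortholog(pairwise: Pairwise, src: str, dst: str, protein: str) -> Optional[str]:
    """Return the co-ortholog of `protein` in `dst` with the highest score, if any."""
    candidates = pairwise.get((src, dst), {}).get(protein, [])
    if not candidates:
        return None
    return max(candidates, key=lambda c: c[1])[0]


def core_ortholog(pairwise: Pairwise, ordered_taxa: List[str], seed: str) -> Optional[Dict[str, str]]:
    """Walk from taxon 1 to taxon n (steps a, b), close the circle (step c),
    and keep the cluster only if the round trip ends at the seed (step d)."""
    chain = [seed]
    for src, dst in zip(ordered_taxa, ordered_taxa[1:]):
        nxt = best_ortholog(pairwise, src, dst, chain[-1])
        if nxt is None:
            return None  # no ortholog in the next taxon: no core-ortholog
        chain.append(nxt)
    back = best_ortholog(pairwise, ordered_taxa[-1], ordered_taxa[0], chain[-1])
    if back != seed:
        return None  # round trip does not close: discard the proteins
    return dict(zip(ordered_taxa, chain))


# Toy data with three primer-taxa A, B, C (already ordered by relatedness).
taxa = ["A", "B", "C"]
pairwise = {
    ("A", "B"): {"a1": [("b1", 1.0), ("b2", 0.4)], "a2": [("b3", 1.0)]},
    ("B", "C"): {"b1": [("c1", 1.0)], "b3": [("c2", 1.0)]},
    ("C", "A"): {"c1": [("a1", 1.0)], "c2": [("a9", 1.0)]},
}
kept = core_ortholog(pairwise, taxa, "a1")
dropped = core_ortholog(pairwise, taxa, "a2")
```

In the toy data, the walk starting at `a1` returns to `a1` and is kept, whereas the walk starting at `a2` ends at a different protein in taxon A and is discarded, which mirrors the closure test in step d.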

We end up with a collection of core-orthologs representing genes present in all primer-taxa. In each individual core-ortholog each taxon is represented exactly once. From the initial ordering of the primer-taxa it follows that the newly added sequence in each step is an ortholog to all other sequences already in the candidate cluster. The final closure step (c) excludes pathological cases resulting from hidden paralogy (Scannell et al. (2006), Fig. 7.2).

InParanoid-TC has one clear advantage over other orthology prediction programs: the pairwise ortholog predictions from InParanoid can be pre-computed and stored for a set of taxa. From these data, core-orthologs can be rapidly generated for any subset of primer-taxa without further computation.

7.2.1.2 Generation of Profile Hidden Markov Models

For each sequence cluster in the core-orthologs, the sequences are aligned using MAFFT (Katoh et al. (2005)) with the options --maxiterate 1000 and --localpair. The resulting multiple sequence alignments, comprising the n sequences from the primer-taxa, are then converted into a profile Hidden Markov Model (pHMM) (Durbin (1998b)). The programs hmmbuild and hmmcalibrate from the HMMER package¹ are used for building, training and calibrating the pHMMs. Each core-ortholog is now represented by a pHMM.
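As a rough sketch, the three program calls for one core-ortholog could be assembled as shown below. Only the MAFFT options are taken from the text. The file naming, the helper name `build_phmm_commands` and the argument order for hmmbuild and hmmcalibrate (which follows HMMER 2 conventions, where hmmcalibrate exists) are assumptions for this example; depending on the HMMER version, the MAFFT output may also need to be converted to an alignment format that hmmbuild accepts.

```python
from pathlib import Path
from typing import List


def build_phmm_commands(fasta: str, workdir: str = ".") -> List[List[str]]:
    """Return the command lines for aligning one cluster and building its pHMM.

    The commands are only assembled, not executed. MAFFT writes the alignment
    to standard output, so the caller is expected to redirect it to `aln`.
    """
    stem = Path(fasta).stem
    aln = str(Path(workdir) / f"{stem}.aln")  # hypothetical alignment file name
    hmm = str(Path(workdir) / f"{stem}.hmm")  # hypothetical pHMM file name
    return [
        # Options as given in the text; stdout must be redirected to `aln`.
        ["mafft", "--maxiterate", "1000", "--localpair", fasta],
        # Assumed HMMER 2 usage: hmmbuild <hmmfile> <alignment file>
        ["hmmbuild", hmm, aln],
        # Assumed HMMER 2 usage: hmmcalibrate <hmmfile>
        ["hmmcalibrate", hmm],
    ]


commands = build_phmm_commands("og1.fa")
```

Each inner list can be passed to `subprocess.run`, with the first call given a `stdout` file handle so that the alignment is stored before hmmbuild reads it.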

7.2.2 Step 2: Extension of Core-Orthologs

We now extend the core-orthologs with data from additional taxa, the query-taxa. Suitable data are translated ESTs or protein sequences inferred from either completely or partially sequenced genomes.

7.2.2.1 pHMM Search

We use a fast implementation of the hmmsearch algorithm² to search protein sequence data from the query-taxon for matches to the individual pHMMs. If ESTs are the data for the query-taxa, the individual ESTs are translated in all six reading frames prior to the search.

¹ http://hmmer.janelia.org
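The six-frame translation of an EST can be illustrated with a short Python sketch. It assumes the standard genetic code and writes `X` for codons containing ambiguous bases; the function names are made up for this example and are not part of any tool mentioned in the text.

```python
from itertools import product
from typing import List

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq: str) -> str:
    """Translate complete codons only; unknown codons become 'X', stops '*'."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3))


def six_frames(est: str) -> List[str]:
    """Return translations for the three forward and three reverse frames."""
    seq = est.upper().replace("U", "T")
    frames = []
    for strand in (seq, reverse_complement(seq)):
        for offset in range(3):
            frames.append(translate(strand[offset:]))
    return frames


frames = six_frames("ATGGCCTAA")
# frames == ['MA*', 'WP', 'GL', 'LGH', '*A', 'RP']
```

Each of the six translations would then be treated as a separate protein sequence in the pHMM search.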

7.2.2.2 Re-BLAST and Orthology Prediction

To determine the orthology status of the hmmsearch hits, we use a reciprocity criterion (Fig. 7.3). Each hit is compared by BLASTP (Altschul et al. (1997)) to the proteome of one of the primer-taxa, the so-called reference-taxon (Proteome F in Fig. 7.3). Ideally, the reference-taxon should be the closest related primer-taxon to the query-taxon. If the protein of the reference taxon that contributed to the pHMM provides the best BLASTP hit, then the hmmsearch hit is added to the corresponding core-ortholog. Otherwise, it is discarded. We note that the reciprocity criterion is also fulfilled when the reference protein is among the lower ranking BLASTP hits, but has the same score as the top listed hit in the BLASTP output.
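The reciprocity criterion, including the tie rule from the last sentence, might be expressed as in the following sketch. It assumes that the BLASTP output has already been parsed into (protein identifier, score) pairs; the function name and the identifiers in the example are hypothetical.

```python
from typing import List, Tuple


def passes_reciprocity(blast_hits: List[Tuple[str, float]], reference_protein: str) -> bool:
    """Return True if the reference protein achieves the best BLASTP score.

    `blast_hits` holds (protein_id, score) pairs from searching one hmmsearch
    hit against the proteome of the reference-taxon. The reference protein does
    not need to be listed first; it only needs to share the top score.
    """
    if not blast_hits:
        return False
    top_score = max(score for _, score in blast_hits)
    return any(pid == reference_protein and score == top_score
               for pid, score in blast_hits)


# The reference protein F_p3 is listed second but ties with the top hit.
hits = [("F_p7", 250.0), ("F_p3", 250.0), ("F_p9", 120.0)]
accepted = passes_reciprocity(hits, "F_p3")
rejected = passes_reciprocity(hits, "F_p9")
```

Comparing against the maximum score rather than the first listed hit is what allows a lower-ranking hit with an identical score to fulfil the criterion.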

7.2.2.3 Post-processing of ESTs

To account for possible frame shifts caused by sequencing errors in ESTs, we use GeneWise (Birney et al. (2004)) to generate a codon-alignment of the EST and the protein sequence of the reference-taxon. This alignment determines the coding part of the EST together with its reading frame.

7.3 Exploring the Potential of Compiling Phylogenetic
