Application of the Data - Addressing biological questions with massive sequence data

libraries. They have all been generated by the same sequencing center as those derived from the contaminated library. We therefore repeated the analysis with the remaining ESTs. At least one further cDNA library seems to be contaminated with genetic material from Capitella (Tab. 6.4). The few suspicious ESTs found in the CAXZ and CAXA libraries are most probably false positives. They can be explained by highly conserved genes with a low sequence divergence between Helobdella and Capitella. If the corresponding gene is not included in the draft assembly of the Helobdella genome, the EST will match the equivalent region in the Capitella genome and yet pass the filtering. However, with more than 800 suspicious ESTs found in the CAWX library, this explanation is not plausible for that library. We therefore locked access to the Helobdella EST project.

6.6 Application of the Data

Currently, access to the data hosted at dbDMP is granted only to members of the Deep Metazoan Phylogeny project. Within this small community, dbDMP has already been of great benefit, and the included sequence data formed the foundation of multiple studies. Table 6.5 contains a small collection of publications that incorporated sequence data from dbDMP. Further studies have already been submitted for publication or are currently in progress.

Table 6.3: ML distances between the three data sets per gene. The first column gives our internal gene ID. The second column gives the Maximum Likelihood distance (MLd) between the sequences of the Helobdella EST set and the Helobdella genome set. The third column contains the MLd between the sequences of the Helobdella ESTs and the Capitella genome. The last column gives the MLd between both genome sets.

Gene-ID   Hel_EST <-> Hel_Ge   Hel_EST <-> Cap_Ge   Hel_Ge <-> Cap_Ge

21884     0.00620              0.36059              0.35194
22001     0.36658              0                    0.36658
22055     0.80213              0                    0.80213
22083     0.00691              0.18248              0.20912
22285     0.00581              0.33577              0.38988
22296     0.08106              0                    0.07928
22451     0.08648              0                    0.08648
22468     0.27080              0                    0.28054
22490     0                    0.18226              0.18093
22551     0                    0.23933              0.24622
22560     0                    0.32084              0.31834
22568     0.39673              0                    0.39673
22583     0.00373              0.30311              0.30964
22603     0.16587              0.00426              0.15984
22638     0.05049              0                    0.05032
22664     0.15740              0                    0.15740
22679     0.35132              0.00537              0.36461
22736     0.21343              0                    0.21842
22853     0.11895              0                    0.13804
22910     0.31946              0                    0.31627
22979     0.25542              0                    0.26230
23035     0.26165              0                    0.26165
23170     0.35491              0                    0.35330
23221     0.31637              0                    0.32603
23273     0.15063              0                    0.14985
23285     0.57582              0                    0.56921
23290     0.06987              0                    0.06987
23444     0.39401              0                    0.39401
23477     0.20284              0                    0.21544
23495     0.13392              0                    0.13186
23513     0.46554              0                    0.45867
23526     0.22674              0                    0.22674
23553     0                    0.09677              0.14896
23599     0.14366              0                    0.14366
23680     0.35097              0                    0.34088
23758     0.31041              0                    0.30921
23824     0.49478              0                    0.48811
23888     0.26257              0                    0.26229
23909     0.25191              0                    0.25796
23950     0.29353              0                    0.30035
24038     0.37151              0                    0.37784
24074     0.36823              0                    0.37736
24115     0.00508              0.45738              0.48354
24116     0.11945              0                    0.11945
24143     0.18201              0                    0.17999
24170     0.50364              0                    0.50364
24212     0.18550              0                    0.18550
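One way to read Table 6.3 is to attribute each EST-derived sequence to the genome set it is closest to. The following minimal sketch assumes this simple "smaller distance wins" rule; the function name and the handling of ties are illustrative assumptions and are not part of the original analysis. The distance values are copied from four rows of the table.

```python
# Hypothetical helper for reading Table 6.3; the name and the tie rule are
# illustrative assumptions, not the procedure used in the thesis.
def closest_genome(d_est_hel, d_est_cap):
    """Return the genome set with the smaller ML distance to the EST sequence."""
    if d_est_hel < d_est_cap:
        return "Helobdella"
    if d_est_cap < d_est_hel:
        return "Capitella"
    return "undecided"


# (gene ID, MLd EST vs. Helobdella genome, MLd EST vs. Capitella genome),
# copied from Table 6.3
rows = [
    (21884, 0.00620, 0.36059),
    (22001, 0.36658, 0.0),
    (22603, 0.16587, 0.00426),
    (23553, 0.0, 0.09677),
]

for gene_id, d_hel, d_cap in rows:
    print(gene_id, closest_genome(d_hel, d_cap))
```

Under this rule, a gene such as 22001, whose EST sequence is identical to the Capitella sequence but distant from the Helobdella genome sequence, would be attributed to Capitella.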


Table 6.4: Contamination in the Helobdella EST collections. Column 1 lists the names of the cDNA libraries that gave rise to all publicly available Helobdella ESTs. Column 2 contains the total number of ESTs from each cDNA library present in dbDMP. Column 3 gives the number of ESTs we could unequivocally assign to either Helobdella robusta or Capitella sp. via a BLAT search against the genome sequences. The last column shows the number of ESTs whose best BLAT hit was in the Capitella sp. genome sequence and which are therefore contaminations of the Helobdella robusta cDNA libraries.

cDNA Library   No. of total ESTs   Assignable   Best hit to Capitella

CAWX           33,118              26,817       802
CAWY           15,350              11,712       4,011
CAXZ           25,208              21,200       17
CAXA           27,683              22,587       8
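To put the counts of Table 6.4 on a common scale, the share of Capitella hits per library can be computed. The sketch below is illustrative only; it assumes that the number of assignable ESTs is the appropriate denominator, which is a choice made here and not one stated in the text.

```python
# Counts copied from Table 6.4: library -> (assignable ESTs, best hit to Capitella)
libraries = {
    "CAWX": (26817, 802),
    "CAWY": (11712, 4011),
    "CAXZ": (21200, 17),
    "CAXA": (22587, 8),
}


def capitella_fraction(assignable, capitella_hits):
    """Fraction of assignable ESTs whose best BLAT hit is in the Capitella genome.

    Using the assignable ESTs as denominator is an assumption of this sketch.
    """
    return capitella_hits / assignable


for name, (assignable, hits) in libraries.items():
    print(f"{name}: {100 * capitella_fraction(assignable, hits):.2f}%")
```

With these numbers, roughly 3% of the assignable CAWX ESTs and roughly 34% of the assignable CAWY ESTs have their best hit in Capitella, compared with less than 0.1% for CAXZ and CAXA.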

Table 6.5: Studies based on data from dbDMP. This table gives a brief overview of published studies that incorporated sequence data from dbDMP. The first column gives the reference, the second column the subject of the paper.

Reference                    Topic

Simon et al. (2009)          Phylogeny of basal Pterygota (winged insects)
Ebersberger et al. (2009a)   Phylogeny of fungi
Witek et al. (2008)          Phylogeny of Syndermata (Rotifera and Acanthocephala)
Roeding et al. (2007)        Phylogeny of metazoa
Ebersberger et al. (2009b)   Phylogeny of fungi
Bleidorn et al. (2009)       Placement of Myzostomida within the metazoan species tree
Struck and Fisse (2008)      Placement of Nemertea within the metazoan species tree
Helmkampf et al. (2008)      Phylogeny of Lophotrochozoa
Philippe et al. (2009)       Phylogeny of basal metazoa
Hausdorf et al. (2007)       Placement of Bryozoa

7 Orthology Assignment

7.1 Introduction

The reconstruction of the evolutionary relationships of species, their phylogeny, is based on a simple idea. Markers, such as morphological characters or DNA sequences, are used as representatives of whole species. The evolutionary history of these markers can be reconstructed by comparing the character states of each marker in different organisms in the light of a chosen model of evolution. To infer the evolutionary history of species from the history of a marker, it is crucial that both histories are tied together: a split in the lineages of the markers must coincide with a split in the lineages of the species (speciation). If this requirement is not met, false conclusions about the relationships of the species are drawn.

Genes whose lineages split due to a speciation event are called orthologs (Fitch (1970)). Their evolutionary history is thus congruent with that of the species they are found in, and their sequences can therefore be used as markers for phylogeny reconstruction. In contrast, genes that arose by a gene duplication event within a common ancestor, called paralogs (Fitch (1970)), must not be used (Fig. 7.1). Identifying orthologs in EST data is challenging because the generation of ESTs is not directed towards preselected genes but is a random process (see 3.3.2). As a result, in most cases the sequences themselves are the only information available.

Several approaches have been developed that identify orthologs using only the information provided by the sequences themselves, for example by comparing pairwise sequence similarities. A widely used strategy is to perform a bidirectional BLAST search (Altschul et al. (1997)), also referred to as reciprocal BLAST. First, the best hit for every sequence of a species A in another species B is determined. Afterwards, each best-hit sequence is used as query for a BLAST search with species A as target. Sequence pairs that are each other's best BLAST hit are assumed to be orthologs. This strategy has been extended to deal with more complex gene families or to define groups of orthologous sequences from more than two species (e.g. Remm et al. (2001); Li et al. (2003)). However, all these methods require that sequence data is available for all genes in all species under consideration. Otherwise, the orthology of reciprocal best BLAST hits is not guaranteed (Fig. 7.2). Consequently, orthology prediction methods based on the reciprocal best BLAST hit criterion should not be used for EST data.
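The reciprocal best hit criterion, and the way it can fail on incomplete data, can be illustrated with a minimal sketch. It operates on precomputed hit lists instead of calling BLAST; the gene names, scores and helper functions are hypothetical. In the toy data, a1/b1 and a2/b2 are taken to be the true ortholog pairs, while a1/b2 and a2/b1 are paralogous.

```python
def best_hits(hits):
    """Map each query to its highest-scoring target.

    hits: iterable of (query, target, score) tuples.
    """
    best = {}
    for query, target, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (target, score)
    return {query: target for query, (target, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return the sorted pairs (a, b) that are each other's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


def restrict(hits, present):
    """Keep only hits whose query and target are both in the sampled set."""
    return [(q, t, s) for q, t, s in hits if q in present and t in present]


# Hypothetical scores for two species with two paralogous genes each
a_vs_b = [("a1", "b1", 900), ("a1", "b2", 400),
          ("a2", "b2", 850), ("a2", "b1", 380)]
b_vs_a = [("b1", "a1", 900), ("b1", "a2", 380),
          ("b2", "a2", 850), ("b2", "a1", 400)]

# Complete data: the ortholog pairs are recovered
print(reciprocal_best_hits(a_vs_b, b_vs_a))

# Incomplete data: only a1 and b2 were sampled, e.g. in an EST collection
sampled = {"a1", "b2"}
print(reciprocal_best_hits(restrict(a_vs_b, sampled), restrict(b_vs_a, sampled)))
```

With the complete gene sets the sketch returns the pairs (a1, b1) and (a2, b2). When only a1 and b2 are sampled, the paralogs a1 and b2 become each other's best hit and would wrongly be reported as orthologs.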

Another approach to identifying orthologs at the sequence level is the use of phylogeny-based methods (e.g. Zmasek and Eddy (2002)). Here, sequences are assumed to be orthologous when their phylogenetic tree is congruent with the tree of the species they are derived from. These methods can be applied even if the sequence data is only partial, but they require knowledge of the true species tree. This disqualifies them for phylogenetic studies, since revealing the species tree is the very aim of the analysis.
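The congruence criterion can be sketched for the simplest case, in which the gene tree contains exactly one sequence per species and both trees are rooted. The code below is a deliberately simplified stand-in and not the method of Zmasek and Eddy (2002): trees are written as nested tuples of species names, and two trees are called congruent when they contain the same clades. All names are illustrative.

```python
def clades(tree):
    """Collect the leaf sets of all internal nodes of a rooted tree.

    A tree is either a species name (leaf) or a tuple of subtrees.
    """
    found = set()

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        members = frozenset().union(*(leaves(child) for child in node))
        found.add(members)
        return members

    leaves(tree)
    return found


def congruent(gene_tree, species_tree):
    """True if both rooted trees imply the same set of clades."""
    return clades(gene_tree) == clades(species_tree)


# Hypothetical example: the species tree has to be known in advance
species_tree = (("human", "mouse"), "fly")

print(congruent((("mouse", "human"), "fly"), species_tree))
print(congruent((("human", "fly"), "mouse"), species_tree))
```

The first gene tree groups human and mouse and is congruent with the assumed species tree; the second groups human and fly and is not. The need to supply `species_tree` up front is exactly the limitation described above.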

To be able to incorporate EST data into phylogeny reconstructions, we developed a method, HaMStR, that reliably identifies sequences orthologous to a predefined set of genes within an EST collection or a proteome. HaMStR has been described and evaluated in detail in Ebersberger et al. (2009b). In the following, we explain the algorithm.
