eggNOG: automated construction and annotation
of orthologous groups of genes
Lars Juhl Jensen
1, Philippe Julien
1, Michael Kuhn
1, Christian von Mering
2,
Jean Muller
1, Tobias Doerks
1and Peer Bork
1,3,*
1
European Molecular Biology Laboratory, Meyerhofstrasse 1, 69117 Heidelberg, Germany,2University of Zurich
and Swiss Institute of Bioinformatics, Winterthurerstrasse 190, 8057 Zurich, Switzerland and 3
Max-Delbru¨ck-Centre for Molecular Medicine, Robert-Ro¨ssle-Strrasse 10, 13092 Berlin, Germany Received August 14, 2007; Revised September 14, 2007; Accepted September 17, 2007
ABSTRACT
The identification of orthologous genes forms the basis for most comparative genomics studies. Existing approaches either lack functional annota-tion of the identified orthologous groups, hampering the interpretation of subsequent results, or are manually annotated and thus lag behind the rapid sequencing of new genomes. Here we present the eggNOG database (‘evolutionary genealogy of
genes: Non-supervised Orthologous Groups’),
which contains orthologous groups constructed from Smith–Waterman alignments through identifi-cation of reciprocal best matches and triangular linkage clustering. Applying this procedure to 312 bacterial, 26 archaeal and 35 eukaryotic genomes yielded 43 582 course-grained orthologous groups of which 9724 are extended versions of those from the original COG/KOG database. We also constructed more fine-grained groups for selected subsets of organisms, such as the 19 914
mamma-lian orthologous groups. We automatically
annotated our non-supervised orthologous groups with functional descriptions, which were derived by identifying common denominators for the genes based on their individual textual descriptions, annotated functional categories, and predicted
protein domains. The orthologous groups in
eggNOG contain 1 241 751 genes and provide at least a broad functional description for 77% of them. Users can query the resource for individual genes via a web interface or download the complete set of orthologous groups at http://eggnog.embl.de.
INTRODUCTION
The vast majority of the functionally annotated genes in genomes or metagenomes are derived by comparative analysis and inference from existing functional knowledge via homology. With the sequencing of entire genomes, it became possible to increase the resolution of the functional transfer by distinguishing between orthologs and paralogs, that is gene pairs that trace back to speciation and gene duplication events, respectively (1). These concepts have since been extended and refined to include orthologous groups (2), in-paralogs and out-paralogs (3,4), but the identification and classification of homologous genes remains very difficult. In contrast to the definition of orthology, the classification of genes into orthologous groups is always with respect to a taxonomic position: two paralogous genes from human and mouse may be orthologs of the same gene in fruit fly and will belong to either the same or different orthologous groups depending on whether these are defined with respect to the last common ancestor of metazoans or mammals. This is further complicated by evolutionary processes such as gene fusion and domain shuffling, due to which each domain of a multi-domain protein is not guaranteed to have evolved through the same series of speciation and duplication events. Finally, because we do not know how each gene evolved, one in practice always relies on operational definitions rather than the evolutionary definitions given above.
Numerous methods have been developed to derive orthologs and orthologous groups, ranging from the simple reciprocal-best-hit approach, via InParanoid (5), MultiParanoid (6), identification best-hit triangles (2,7,8) and clustering-based approaches (9), to tree-based meth-ods (10–13). By contrast, there has been only one major
*To whom correspondence should be addressed. Tel: +49 6221 387 526; Fax: +49 6221 387 517; Email: bork@embl.de The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors.
ß2007 The Author(s)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/ by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
effort to provide functionally annotated orthologous groups, namely the COG/KOG database (2,8), but it lacks phylogenetic resolution and is not regularly updated due to the manual labor required. There is thus a need for a hierarchical system of orthology classification with function annotation.
Here, we provide such a system, eggNOG, which (1) can be updated without the requirement for manual curation, (2) covers more genes and genomes than existing databases, (3) contains a hierarchy of orthologous groups to balance phylogenetic coverage and resolution and (4) provides automatic function annotation of similar quality to that obtained through manual inspection.
CONSTRUCTION OF HIERARCHICAL ORTHOLOGOUS GROUPS
We assemble proteins into orthologous groups using an automated procedure similar to the original COG/KOG approach (2,8). When constructing coarse-grained ortho-logous groups across all three domains of life or for all eukaryotes, we first assign the proteins encoded by the genomes in eggNOG to the respective COGs or KOGs based on best hits to the manually assigned sequences in the COG/KOG database. In case of multiple hits to the same part of the sequence, only the best hit was considered. The many proteins that cannot be assigned to existing COGs or KOGs are subsequently assembled into non-supervised orthologous groups using the proce-dure described below. When constructing more fine-grained orthologous groups, this initial step is skipped.
Briefly, we first compute all-against-all
Smith–Waterman similarities among all proteins in eggNOG. We then group recently duplicated sequences into in-paralogous groups, which are subsequently treated as single units to ensure that they will be assigned to the same orthologous groups. To form the in-paralogous groups, we first assemble highly related genomes into clades, usually encompassing all sequenced strains of a particular species in a single clade, but also other close pairs such as human and chimpanzee. In these clades, we join into in-paralogous groups all proteins that are more similar to each other (within the clade), than to any other protein outside the clade. For this, there is no fixed cutoff in similarity, but instead we start with a stringent similarity cutoff and relax it a step-wise fashion until all in-paralogous proteins are joined, requiring that all members of a group must align to each other with at least 20 residues.
After grouping in-paralogous proteins, we start assign-ing orthology between proteins, by joinassign-ing triangles of reciprocal best hits involving three different species (here, in-paralogous groups are represented by their best-matching member). Again, we start with a stringent similarity cutoff and relax it to identify groups of proteins that all align to each other by at least 20 residues. This procedure occasionally causes an orthologous group to be split in two; such cases are identified by an abundance of reciprocal best hits between groups, which are then joined. Next, we relax the triangle criterion
and allow remaining unassigned proteins to join a group by simple bidirectional best hits. Finally, we automatically identify gene fusion events by searching for proteins that bridge otherwise unrelated orthologous groups. In these cases, the different parts of the fusion protein are assigned to their respective orthologous groups. This step is a distinguishing feature of our approach and is crucial for the analysis of eukaryotic multi-domain proteins, as these would otherwise cause unrelated orthologous groups to be fused.
To construct a hierarchy of orthologous groups, the procedure described above was applied to several subsets of organisms. To make a set of course-grained orthologous groups across all three domains of life, we constructed non-supervised orthologous groups (NOGs) from the genes that could not be mapped to a COG or KOG. Focusing on eukaryotic genes, we constructed more fine-grained eukaryotic NOGs (euNOGs) from the genes that could not be mapped to a KOG. Finally, we build sets of NOGs of increasing resolution for five eukaryotic clades, namely fungi (fuNOGs), metazoans (meNOGs), insects (inNOGs), vertebrates (veNOGs) and mammals (maNOGs).
AUTOMATIC ANNOTATION OF PROTEIN FUNCTION
An important feature of eggNOG is that it provides functional annotations for the orthologous groups. These annotations are produced by a pipeline, which summarizes the available functional information on the proteins in each cluster: (1) the textual annotation for these proteins, (2) their annotated Gene Ontology (GO) terms (14), (3) their membership to KEGG path-ways (15) and (4) the presence of protein domains from SMART (16) and Pfam (17). As the textual descriptions allow for the most fine-grained annotation of protein function, we first use Ukkonen’s algorithm (18) to identify the longest common subsequence (LCS) between the description lines of any two proteins within a cluster. We then score each LCS based on the number of protein descriptions matched within the cluster, the number of occurrences of each word of the LCS in these descriptions, and the presence of words such as ‘hypothetical’, ‘putative’ or ‘unknown’. These scores are finally normal-ized against a score distribution based on randomnormal-ized clusters of the same size, and the highest scoring LCS is chosen, provided that it scores above a threshold.
For each orthologous group, our pipeline also searches for overrepresented GO terms, KEGG pathways or protein domains. To find terms that are sufficiently specific and at the same time are likely to describe the entire orthologous group, we devised a scoring function that takes into account term frequency within the group, background frequency, and the ratio of the two (i.e. the fold overrepresentation). In case no satisfactory LCS was found, a description line is constructed based on the highest scoring GO term or KEGG pathway. As a single domain may not properly reflect the function of a complete protein, description lines are constructed
Nanoarc haeum equitans Pyrobaculum aerop hilum Aeropy rum pernix Sulf olobus sol fataricus Sulfolo bus acid ocaldar ius Sulfolo bus tokodaii Picrophilus t orridus Thermopl asm a a cidophilum Thermoplas ma volcanium Ther mococ cus koda kare nsis Pyr ococcus furios us Pyr ococcus aby ssi Pyrococcus horikoshii Meth anoca ldococcus janna schii Meth anococcus maripalud is Meth anopyrus kandle ri Methanosp haer a stadtmanae Methanoth ermobacter thermautotr. Arch aeoglobu s fulgidu s Halobacterium sp. NR C−1 Natronomonas pharaonisHaloa rcula m arism ortu i Meth anoco ccoi des bur tonii Methanosarcina b arker i Methanosa rci na acetivor ans Methanosa rcina maze i Giardi a lam blia Thalassiosira pseudonana Plasmodiu m falciparum Cryp tosp oridium homini s Ara bido psis thaliana Cyan idioschyzon m erolae Dictyosteli um di scoideum Encephal ito zoon cu niculi Filobasidie lla n eofo rma ns Sch izosaccha romyc es pombe Asp ergil lus fu miga tus Yarrowia lipoly tica Debary omyc es hanseni i Eremoth eci um go ss ypi i Can dida glabrat a Klu yveromyces lactis Sac charo myces cerevisia e Cae norhabditis ele gans Apis m elli fera Dro soph ila m elanogaste r Ano pheles gam bia e Ciona in testina lis Dan io rerio Tak ifug u ru bripes Tetr aodon ni groviridis Xen opus tro pic ali s Gallus ga llus Monodelphis do m esti ca Bos tau rus Can is fa miliar is Rat tus norvegicus Mus mu
sculus Macaca mulatt a Pan troglody tes Homo sap iens Fusoba cterium nucleatum Aqu ifex aeoli cus Therm oto ga ma riti ma
Dehalococcoides sp. CBDB1Dehalococcoides ethenogenesThermus
the rmophilus HB27 The rmus thermo phil us HB8 Deinococcu s geothermali s Dei nococcus radiod urans Lept ospi ra interr ogan s Copenh . Leptospi ra interrogans Lai Tre pone ma dent icola Tre pone ma pallidum Borr elia burg dorferi Borrelia ga rin ii Chlorobium tepidu m Pelodictyon luteo lum Chlorobium chlo rochromatii Salinibacter ruber Porphyrom onas gingivalis Bacteroides thetai otaom icron
Bacteroides fragilis YCH46Bacteroide s fragilis 9343
Rhodopirellula bal tica
Protochlamydia amoebophila Chlamydia muridarumChlamydia trachomatis A/HAR−13 Chlam ydia trachom atis D /UW −3/CX Chlamydophila fel is Chlamydophila abortus Chlam ydop hila ca viae
Chlamydophila pneum. AR39 Chlamydophila pn eum. J138 Chlamydoph ila pneum. TW−183 Chlamydophila pneum. CWL029 Symbiobacterium thermophilum Corynebacterium jeikeium Corynebacterium diphtheriae Corynebacterium efficiens Corynebacterium glut amicum Nocardia farc inica Mycobacterium avium Mycobacterium leprae T N Mycobacterium bovis Mycobact. tuberc ulosis CDC1551 Mycobact. tuberculosis H37Rv Streptom yces coelicolor Streptomyce s avermitilis Frankia sp. CcI3 Therm obifida fusca Propionibacterium acne s Bifidobacterium longum Leifsonia xyli
Tropheryma whipplei Twist Tropheryma whipplei T W0827 G loeobacter violaceus Cyanobact erium Yellowstone B Cyanoba cterium Y ellowston e A Synecho coccus elongatus BP1 Synecho cystis sp. PCC6803 Anabaen a variabilis N osto c sp. P CC712 0 Syne chococcus elonga tus PCC 6301 Synecho coccus e longatus PCC794 2 Prochloroco ccus m ari nus M IT9312 Prochloroco ccus m arin us C CMP198 6 Prochlorococc us ma rinu s NATL2A Prochloroco ccus marinus CC MP1 375 Prochloroco cc us marinus MIT9313 Synecho coccus sp. CC9902 Synechococ cus sp. WH810 2 Synechococcus s p. CC9605 Thermoanaero bacter tengc ongensis Carboxydo thermus hydrogenoform. Desulfi tobacterium
hafnie nse
Moore lla thermo
acetica
Clostridium per fringens
Clostridium acetobut ylicum
Clostridium te tani
Onion yellows ph ytoplasm a Aster yellows ph ytoplasma Mesopla sma florum Mycoplas ma capric olum Mycoplasma myc oides Mycoplasm a mobile
Mycoplasma pulmonis
Mycoplasma synoviae Mycoplasma hyopneumo niae 232 Mycoplasma hyopneumoniae 7448 Mycoplasma hyopneumoniae J Ureaplasma pa rvum Mycoplasma penetrans Mycoplasma gall isepticum
Mycoplasma pneumoni
ae
Mycoplasma genitalium
Lactobacillus salivarius Lactobacillus plantarum Lactobacillus sakei Lactobacillus acidophilus Lactobacillus johnsonii Staphyl ococcu s saprop hyticu s Staphylococcu s haemol yticus Staphyl oco ccus epidermi dis RP62 A Staphylococcus epiderm idis 1222 8 Staphyl ococcus aureus C OL Staphyl ococcus aureus 8325 Sta phylococcu s aureu s USA30 0 Staphylococcus aureus RF122 Staphyl ococcus aureu s MSS A47 6 Staphyl ococcus a ureus MW2 Staphyloco ccus aureus MRSA 252 Sta phyl oco ccus aureus N315 Staphyloco
ccus aureus Mu50 Listeria monocyto genes EGD −e Listeria monocytog enes F 2365 Listeria innocua Ocean obaci llus iheye nsis Bacill us clau sii Bacillus halod urans Bac illus liche niform is Bac illus su btil is Geob acillus kaustophilus Bac illus cere us 10987 Bac illus cere us 14579 Bacill us cere us E33L Bac illus thur ingiensi s Bacillus anthracis 0581 Bac illus anthraci s Ster ne Bacill us anthra cis P orton Enterococcus faecalis Lactococcus lactis Streptococcus pneumoniae TIGR4
Streptococcus pneumoniae R6 Streptococcus mutans UA159 Streptococcus thermoph. CNRZ1066 Streptococcus thermoph. LMG18311 Streptococcus agalactiae NEM316 Streptococcus ag alactiae A909 Streptococcus agalactiae 2603VR Streptococcus pyogenes M GAS10270 Streptococcus pyogenes M
GAS6180 Streptococcus pyog
enes 700294 Streptococcus pyogenes MGAS
5005 Streptococcus pyog
enes M GAS9429 Streptoc occus pyogenes M GAS2096 Streptoc occus pyog enes MGAS31 5 Streptococcus pyogene s SSI1 Streptococcus pyog enes M GAS 10750 Streptococcus pyog enes M GAS10394
Streptococcus pyogenes MGAS8232 Aci dobacter ium sp. Ellin345 Solibac ter u sitatus Bdellovibrio b acterio vor us Ana ero myxobacte r dehalo genans Desu lfotalea psychro phil a Lawsoni a intra cellular is Desulfovibrio de sulfu ric ans Desu lfovibrio vulgaris Syntroph us aci ditr ophicus Pelobacte r carbinoli cus Geob acter meta llire ducens Geobacter sul furred ucens Thi om icrospira denitrificans Cam pyl obac ter jejuni R M1221 Cam pyl oba cter jejun i 11168 Wol inella succi nogene s Helic obacter hepati cu s Helic obacte r p ylori 26 69 5 Helic obacter pyl ori J 99 Pelagiba cte r ub iqu e Ricke ttsia bell ii Ricke ttsia feli s Ricke ttsia cono rii Ricke ttsia pro wazeki i Ricke ttsia typh i Neori ckettsia senn etsu W olbachia endos ymbiont W olbachi a sp. wM el Ana plas m a phagoc ytoph ilu m Ana plasma marg ina le Ehr lichia cani s Ehrlichi a rumi nan tium Garde l Ehr lichi a rumina ntium Welgev . Gluconoba cter oxyda ns M agnetos piril lum m agneticum Rhodo spirillum rubrum Zymo monas mobi lis Sph ingo pyx is alas kensi s Novos phi ngo bium aromatic ivoran s Ery thr obac ter lito ral is Rho dobac ter spha eroides Jann asc hia sp. C CS1 Siliciba cter pomero yi Silicibacter sp. T M10 40 Cau lobacter crescent us
Rhodopseudom. palustris BisB18
Rho dop seud om. pa lus tris CGA 009 Rhodops eudom. p alust ris BisB5 Rhod ops eudo m. p alust ris Ha A2 Bradyr hizobium jap onic um Nitrobac ter ha mburg ensis Nitr obac ter win ogradskyi Sino rhizobium meli loti Rhi zobi um etli Agrobact erium tumefaci ens Meso rhizo bium loti Bartonell a q uintana Bartonell a h ense lae Brucell a s uis Brucell a me litensi s 16 M Brucell a m elitensis abortus Brucella a bortu s Methyloba cillus flagellatu s Chro mob acte rium vio lace um Neisser ia gonorr hoea e Neisseria meningitidis Z2491 Neis seria m eningi tidis M C58 Thiobacill us d enitr ific ans Nitr osomonas europae a Nitr ososp ira multi formis Dechlor omo nas aromatic a Azoarcu s sp. E bN1 Pola romo nas sp. JS 66 6 Rhodoferax ferrired ucens Bordete lla pertussis Bordetella parap ertussis 1282 2 Bordetell a bronchiseptica Ralstonia solanacearum Ralstonia metallidurans Ralstonia eutropha Burkholderia xenov orans Burkholderia sp. 383 Burkholderia m allei Burkholderia pseudomallei K 96243 Burkholderia thailandensis Burkholderia ps eudomalle i 1710b Xylella fa stidiosa Tem ecula 1
Xylella fastidiosa 9a5c
Xanthomon as campestris 33913 Xanthom onas ca mpestris 8004 Xanthomonas ory zae KACC10331 Xanthomo nas campe stris ve sicat. Xanthomonas axonop odis Nitrosococcus oceani Methylococ cus capsulatu s Coxiella burneti i Legionella pneumoph ila Lens Legionell a pneum ophila P aris Legionell a pneum ophila P hilad. Thiomicrospira cr unogen a
Francisella tularensis SCHU S4 Francisella tu larensis holarctica Acinetobacter sp. ADP1 Psychrobacter arcticus Psychrobacter cryohalo lentis Saccharophagus d egrad ans Chromohalobacter s alexigens Hahella ch ejuensis Pseudom onas aeruginosa Pseudomonas putida Pseudomonas fluorescens Pf5 Pseudomonas fluorescens P fO1 Pseudomonas syringae DC3000 Pseudomonas syringae B728a Pseudomonas syringae 1448A
Idiomarina loihiensis Pseudoalteromonas haloplanktis5
Colwellia psychr
erythraea Shewan ella den itrificans Shewanella oneide nsis Vibrio fisc heri Photobacterium profundum Vibrio ch olerae Vibrio parahaem olyticus Vibrio vulnificu s CMCP6 Vibrio vulni ficus YJ016
Haemophilus ducreyi
Pasteurella multoci
da Mannheimi
a succiniciproducens
Haemophilus influenzae Rd KW
20
Haemophilus influenzae 8
6−028NP Sodalis glossinidius Baumannia cicadell inicola Buchnera aphidicola Bp Buchnera ap hidicola Sg
Buchnera aphidicola APS
Wigglesworthia glossinidia Blochmannia pennsylvanicus Blochmannia floridanus Photorhabdus luminescens Yersinia pseudotuberculosis Yersinia pestis 91001 Yersinia pestis KIM Yersinia pestis CO92Erwinia carotovora
Salmonella typhimurium Salmonella enterica SCB67 Salmonella enterica 9150Salmonella typhi CT18 Salmonella enterica Ty2Shigella sonnei Shigella boydi i
Shigella dysenteriae Escherichia coli
W3110 Escherichia coli UTI8 9 Escherich ia coli CFT0 73 Escher ichia coli O6 Escherichia coli
K12 Escherichia coli O
157H7 EDL933 Escher
ichia coli O15 7H7 Shigella flexne
ri 301
Shigella flexne ri 2457T
Figure 1. Statistics on the content of the eggNOG database. The eggNOG assignments for 373 complete genomes [19] were mapped onto the tree of life. The stacked bar charts outside the tree show the proportion of genes from each genome that can be assigned to a functionally annotated orthologous group (green), to an unannotated orthologous group (orange) or to no orthologous group (grey). The length of each bar is proportional to the logarithm of the number of genes in the respective genome. The pie charts inside the tree show the fractions of orthologous groups at each level in the hierarchy that could be annotated with a function description (green for NOGs, light green for extended COGs and KOGs) and that could not be functionally annotated (orange for NOGs, light orange for extended COGs and KOGs). The areas of the pie charts are proportional to the number of orthologous groups at the phylogenetic level in question. This figure was made using iTOL [20].
based on overrepresented domains only if all other options have been exhausted.
QUALITY ASSESSMENT AND SUMMARY STATISTICS
To assess the quality of the function annotations provided by our automated pipeline, we manually checked a random sample of 100 NOGs and 100 euNOGs and classified their annotations into three categories: 87.5% were correct (i.e. they describe a function that the proteins have in common), 12.5% were uninformative (i.e. they do not describe a function) and, due to our stringent rule set, no wrong functions were assigned. Uninformative annota-tions of orthologous groups are in many cases due to a lack of functional knowledge on the corresponding proteins.
Our function annotation pipeline enables us to provide description lines for 6583 of the 33 858 (19%) coarse-grained NOGs. Combined with the 9724 COGs and KOGs, this yields 43 582 global orthologous groups of which 14 356 (33%) have an annotated function. In addition, eggNOG contains 94 240 more fine-grained orthologous groups of which 55 753 (59%) could be functionally annotated. This enables us to assign 1 241 751 of 1 513 782 genes (82% of the genes in the analyzed genomes) to an orthologous group and to provide at least
a broad functional description of 951 918 of them (77% of the genes that could be assigned to an orthologous group). The corresponding numbers for each set of orthologous groups as well as for each individual genome are summarized in Figure 1.
USING eggNOG
The eggNOG resource is accessible via a web interface at http://eggnog.embl.de. The main page allows the user to input the names of one or more genes or orthologous groups and to optionally select the organism of interest. Alternatively, the user can choose to upload a set of protein sequences to be searched against the full-length sequences in eggNOG. In case of ambiguous names or query sequences with multiple hits, the user is prompted to disambiguate the input.
Figure 2 shows the result of a query for the three
G1-type cyclins in budding yeast, which belong to two
distinct fungal orthologous groups. Function descriptions are displayed for both the orthologous groups and for the individual genes. The web interface enables the user to view the complete set of genes that belong to each orthologous group and provides external links to addi-tional information on the protein products.
By default, eggNOG shows the most fine-grained ortho-logous groups that are possible given the input: just like
Figure 2. Screenshot of the main results page. The eggNOG database was queried for the threeG1-type cyclins in budding yeast, namely Cln1–Cln3.
These have been correctly assigned to two fungal orthologous groups. The navigation tree at the top of the page allows the user to change the view to more coarse-grained orthologous groups, for example the eukaryotic orthologous groups in which these cyclins are all grouped together.
entering a set of genes from budding yeast results in fungal orthologous groups being shown, a set of human genes will yield mammalian orthologous groups, whereas a combination of human and fruit fly genes will yield metazoan orthologous groups. A navigation tree at the top of the page (Figure 2) allows the user to select more coarse-grained orthologous groups if desired; for example, selecting ‘eukaryotes’ reveals that the three budding yeast cyclins all belong to the same eukaryotic orthologous group. This key feature enables the user to choose the balance between phylogenetic coverage and resolution within our hierarchy of orthologous groups.
Whereas the web interface is convenient for small-scale studies, users interested in genome-wide analyses will be better served by downloading the complete content of the underlying relational database. For this reason, the orthologous groups, functional annotations and protein sequences are all available from the eggNOG download page under the Creative Commons Attribution 3.0 License.
ACKNOWLEDGEMENTS
The authors thank Eugene Koonin for comments on
the manuscript. This work was supported by
Bundesministerium fu¨r Bildung und Forschung
(Nationales Genomforschungsnetz grant 01GR0454) as well as through the GeneFun Specific Targeted Research Project, contract number LSHG-CT-2004-503567, and through the BioSapiens Network of Excellence, contract number LSHG-CT-2003-503265, both funded by the European Commission FP6 Programme. Funding to pay the Open Acces publication charges this article was provided by the European Molecular Biology Laboratory.
Conflict of interest statement. None declared.
REFERENCES
1. Fitch,W.M. (1970) Distinguishing homologous from analogous
proteins.J. Biol. Chem.,19, 99–113.
2. Tatusov,R.L., Koonin,E.V. and Lipman,D.J. (1997) A genomic
perspective on protein families.Science,278, 631–637.
3. Sonnhammer,E.L. and Koonin,E.V. (2002) Orthology, paralogy
and proposed classification for paralog subtypes.Trends Genet.,18,
619–620.
4. Koonin,E.V. (2005) Orthologs, paralogs, and evolutionary
genomics.Annu. Rev. Genet., 39, 309–338.
5. O’Brien,K.P., Remm,M. and Sonnhammer,E.L.L. (2005) Inparanoid: a comprehensive database of eukaryotic orthologs.
Nucleic Acids Res., 33, D476–D480.
6. Alexeyenko,A., Tamas,I., Liu,G. and Sonnhammer,E.L.L. (2006) Automatic clustering of orthologs and inparalogs shared by
multiple proteomes.Bioinformatics,14, e9–e15.
7. Lee,Y., Sultana,R., Pertea,G., Cho,J., Karamycheva,S., Tsai,J.,
Parvizi,B., Cheung,F., Antonescu,V.et al. (2002) Cross-referencing
eukaryotic genomes: TIGR orthologous gene assignments (TOGA).
Genome Res.,12, 493–502.
8. Tatusov,R.L., Fedorova,N.D., Jackson,J.D., Jacobs,A.R., Kiryutin,B., Koonin,E.V., Krylov,D.M., Mazumder,R.,
Mekhedov,S.L.et al. (2003) The COG database: an updated version
includes eukaryotes.BMC Bioinformatics,4, 41.
9. Li,L., Stoeckert,C.J., Jr. and Roos,D.S. (2003) OrthoMCL: identification of orthologous groups for eukaryotic genomes.
Genome Res.,13, 2178–2189.
10. Li,H., Coghlan,A., Ruan,J., Coin,L.J., He´riche´,J.-K.,
Osmotherly,L., Li,R., Liu,T., Zhang,Z.et al. (2006) TreeFam: a
curated database of phylogenetic trees of animal gene families.
Nucleic Acids Res., 34, D572–D580.
11. van der Heijden,R.T.J.M., Snel,B., van Noort,V. and Huynen,M.A. (2007) Orthology prediction at scalable resolution by phylogenetic
tree analysis.BMC Bioinformatics,8, 83.
12. Hubbard,T.J.P., Aken,B.L., Beal,K., Ballester,B., Caccamo,M.,
Chen,Y., Clarke,L., Coates,G., Cunningham,F.et al. (2007)
Ensembl 2007.Nucleic Acids Res.,35, D610–D617.
13. Wapinski,I., Pfeffer,A., Friedman,N. and Regev,A. (2007) Automatic genome-wide reconstruction of phylogenetic trees.
Bioinformatics,23, i549–i558.
14. Ashburner,M., Ball,C.A., Blake,J.A., Botstein,D., Butler,H.,
Cherry,J.M., Davis,A.P., Dolinski,K., Dwight,S.S.et al. (2000)
Gene Ontology: tool for the unification of biology.Nature Genet.,
25, 25–29.
15. Kanehisa,M., Goto,S., Hattori,M., Aoki-Kinoshita,K.F., Itoh,M., Kawashima,S., Katayama,T., Araki,M. and Hirakawa,M. (2006) From genomics to chemical genomics: new developments in KEGG.
Nucleic Acids Res., 34, D354–D357.
16. Letunic,I., Copley,R.R., Schmidt,S., Ciccarelli,F., Doerks,T., Schultz,J., Ponting,C.P. and Bork,P. (2004) SMART 4.0: towards
genomic data integration.Nucleic Acids Res., 32, D142–D144.
17. Finn,R.D., Mistry,J., Schuster-Bo¨ckler,B., Griffiths-Jones,S.,
Hollich,V., Lassmann,T., Moxon,S., Marshall,M., Khanna,A.et al.
(2006) Pfam: clans, web tools and services.Nucleic Acids Res.,34,
D247–D251.
18. Ukkonen,E. (1995) On-line construction of suffix trees.
Algorithmica,14, 249–260.
19. von Mering,C., Jensen,L.J., Kuhn,M., Chaffron,S., Doerks,T., Kru¨ger,B., Snel,B. and Bork,P. (2007) STRING 7—recent devel-opments in the integration and prediction of protein interactions.
Nucleic Acids Res., 35, D358–D362.
20. Letunic,I. and Bork,P. (2007) Interactive Tree Of Life (iTOL): an online tool for phylogenetic tree display and annotation.