2.2 Ontologies, Text mining and WSD
2.2.1 Ontologies in the Life Sciences
Taxonomies, ontologies, and thesauri are all background knowledge resources. They are related, but differ in their degree of expressiveness and in their support for reasoning.
A taxonomy is a form of classification scheme, arranged in a hierarchical structure and organized by supertype-subtype relationships, also called parent-child relationships. In a taxonomy, children (subtypes) inherit all the properties and constraints of their parents (supertypes) and can have additional ones. One of the best known taxonomies is the “Linnaean taxonomy”, a biological classification of organisms.
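The inheritance behaviour described above can be illustrated with a small sketch. The class name `Taxon`, its fields, and the example properties are assumptions made for illustration only; they are not taken from any particular taxonomy resource.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Taxon:
    """A node in a taxonomy; `parent` is the supertype (None for the root)."""
    name: str
    parent: Optional["Taxon"] = None
    own_properties: Dict[str, str] = field(default_factory=dict)

    def properties(self) -> Dict[str, str]:
        """Return the inherited properties, overlaid with this node's own."""
        inherited = self.parent.properties() if self.parent else {}
        return {**inherited, **self.own_properties}

    def is_a(self, other: "Taxon") -> bool:
        """True if `other` is this node or one of its ancestors."""
        node: Optional[Taxon] = self
        while node is not None:
            if node is other:
                return True
            node = node.parent
        return False


# Illustrative fragment of a Linnaean-style hierarchy (property values are made up).
animalia = Taxon("Animalia", own_properties={"multicellular": "yes"})
mammalia = Taxon("Mammalia", animalia, {"body_covering": "hair"})
mus_musculus = Taxon("Mus musculus", mammalia, {"common_name": "house mouse"})

print(mus_musculus.properties())
print(mus_musculus.is_a(animalia))
```

Each subtype sees every property of its supertypes and may add its own, which is the behaviour that distinguishes a taxonomy from a flat list of categories.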
A thesaurus is a type of controlled vocabulary used mainly for indexing or tagging purposes. A thesaurus groups together terms that are semantically close to each other. The relationships between the terms in a thesaurus can be hierarchical (‘broader than’, ‘narrower than’), equivalent (to connect synonyms and near-synonyms, e.g., ‘used for’) and associative, used to connect related terms whose relationship is neither hierarchical nor equivalent (e.g., ‘related to’).
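A minimal sketch of the three kinds of thesaurus relationship follows. The class `Thesaurus`, the `INVERSE` table, and the example terms are illustrative assumptions; they do not reproduce the data model of any specific thesaurus.

```python
from collections import defaultdict
from typing import DefaultDict, Dict, Set

# Each relation label is stored together with its inverse, so that a link
# entered in one direction can also be looked up from the other term.
INVERSE: Dict[str, str] = {
    "broader than": "narrower than",   # hierarchical
    "narrower than": "broader than",   # hierarchical
    "used for": "use",                 # equivalence: preferred -> non-preferred
    "use": "used for",                 # equivalence: non-preferred -> preferred
    "related to": "related to",        # associative (symmetric)
}


class Thesaurus:
    def __init__(self) -> None:
        # term -> relation label -> set of linked terms
        self._links: DefaultDict[str, Dict[str, Set[str]]] = defaultdict(dict)

    def add(self, term: str, relation: str, other: str) -> None:
        """Record '<term> <relation> <other>' and the inverse statement."""
        self._links[term].setdefault(relation, set()).add(other)
        self._links[other].setdefault(INVERSE[relation], set()).add(term)

    def linked(self, term: str, relation: str) -> Set[str]:
        """Return every X such that '<term> <relation> X' holds."""
        return set(self._links.get(term, {}).get(relation, set()))


# Made-up example entries, for illustration only.
thesaurus = Thesaurus()
thesaurus.add("Neoplasms", "broader than", "Carcinoma")
thesaurus.add("Neoplasms", "used for", "Tumors")
thesaurus.add("Neoplasms", "related to", "Oncogenes")

print(thesaurus.linked("Carcinoma", "narrower than"))
print(thesaurus.linked("Tumors", "use"))
```

Storing the inverse of every link is one possible design choice; it lets an indexer move from a non-preferred term to the preferred one without scanning the whole vocabulary.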
A common definition of an ontology is “a formal explicit specification of a shared conceptualization” (Gruber, 1993). According to Tim Berners-Lee, “an ontology is a document or file that formally defines the relations among terms. The most typical kind of ontology for the Web has a taxonomy and a set of inference rules” (Berners-Lee et al., 2001). An ontology is a formal representation of a set of concepts within a specific domain and the relationships between them. Within an ontology, the relations between concepts are not limited to simple supertype-subtype links; ontologies are therefore broader and more flexible than taxonomies and thesauri. Ontologies provide dynamic, controlled vocabularies of concepts to help manage the interoperability between data sources. A typical ontology is a hierarchical structure of concepts (classes), definitions for these concepts, and associations between concepts. Additional logical axioms serve as further constraints among these entities. In a state-of-the-art setting, an agent queries the ontology together with a knowledge base that is built on it. By exploiting the structure of the ontology, specific and reliable retrieval becomes possible.
At present, the field of biology also faces the problem of large amounts of data without any associated semantics. Biologists therefore spend a great deal of time and effort searching for all of the available information about each small area of research. The search is hampered further by the wide variations in terminology that may be in common usage at any given time, which inhibit effective searching by computers as well as by people.
In recent years, various ontologies and knowledge bases have been developed to facilitate biomedical research. For example, the Gene Ontology (GO) project is a collaborative effort to address the need for consistent descriptions of gene products in different databases (Ashburner et al., 2000). Another widely used system, developed by the United States National Library of Medicine, is the Unified Medical Language System (UMLS), a consolidated repository of medical terms and their relationships, spread across multiple languages and disciplines (chemistry, biology, etc.) (Bodenreider, 2004). Medical Subject Headings (MeSH) is a controlled vocabulary maintained by the U.S. National Library of Medicine¹², mainly used for annotating and indexing articles from PubMed (Nelson et al., 2001). Moreover, several specialised databases for various aspects of biology have been developed, providing rich vocabularies that include many synonyms. For example, the UniProt/Swiss-Prot Knowledge base¹³ is an annotated protein sequence database.
Gene Ontology
A renowned ontology in the life sciences, especially in biology and bioinformatics, is the Gene Ontology (GO) (Ashburner et al., 2000). It provides a controlled vocabulary to annotate gene products according to the biological processes in which they participate, the molecular functions they perform, and the cellular location in which they act. Because each resource describes its gene products in a common form, this shared vocabulary, together with the structure provided by the relationships between terms in the GO, makes querying within and across resources possible. A section of the GO graph is given in Figure 2.7.
Although the Gene Ontology was created for the express purpose of providing a common terminology for functional annotation of genes and gene products in biological databases towards the goal of database interoperability, it has since been widely used for a variety of purposes, including analyses of experimental data, predictions of experimental results, and document retrieval.
GO is the most prominent ontology of the Open Biomedical Ontologies (OBO)¹⁴, a collection of biological ontologies that are open in that they can be used by all without constraint, so long as the sources are acknowledged and the ontologies are not edited and re-distributed under the same names. In addition to the taxonomies of GO, the OBO ontologies deal with anatomies of humans and of various model organisms, biochemical substances, and sequence types, among others. Over the past years, GO has developed into the main ontology in molecular biology. Today, the Gene Ontology comprises over 19,000 terms organised in three sub-ontologies (biological process, molecular function, cellular component). The terms are linked by three relations: ‘is–a’, ‘part–of’, and ‘is–synonym’.
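Because a GO term may have more than one parent (see the caption of Figure 2.7), the graph is a directed acyclic graph rather than a tree, and collecting all ancestors of a term requires a traversal that follows typed edges. The following is a minimal sketch under stated assumptions: the `EDGES` table is a hand-made toy fragment, the two parents of ‘transcription factor activity’ are placeholders because the text does not name them, and the function name `ancestors` is hypothetical.

```python
from typing import Dict, List, Set, Tuple

# child term -> list of (relation, parent term).
# Only 'cell part' part-of 'cell' and the fact that 'transcription factor
# activity' has two parents come from the Figure 2.7 caption; every other
# edge and the "parent term A/B" names are illustrative placeholders.
EDGES: Dict[str, List[Tuple[str, str]]] = {
    "cell part": [("part-of", "cell")],
    "cell": [("is-a", "cellular component")],
    "transcription factor activity": [
        ("is-a", "parent term A"),
        ("is-a", "parent term B"),
    ],
    "parent term A": [("is-a", "molecular function")],
    "parent term B": [("is-a", "molecular function")],
}


def ancestors(term: str, relations: Tuple[str, ...] = ("is-a", "part-of")) -> Set[str]:
    """Collect every term reachable from `term` via the given relation types."""
    seen: Set[str] = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for relation, parent in EDGES.get(current, []):
            # `seen` guards against visiting a shared ancestor twice,
            # which happens whenever two paths converge in the DAG.
            if relation in relations and parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


print(ancestors("transcription factor activity"))
print(ancestors("cell part", relations=("is-a",)))
```

Restricting the traversal to a subset of relation types, as in the second call, is useful when only subsumption (‘is–a’) and not parthood should count towards an annotation.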
Medical Subject Headings (MeSH)
Medical Subject Headings (MeSH) is a controlled vocabulary maintained by the U.S. National Library of Medicine¹⁵ (Nelson et al., 2001). It is mainly used for annotating and indexing articles from PubMed. The MeSH terminology provides a consistent way to retrieve information that may use different terminology in different articles for the same concepts. The 2009 MeSH contains 25,186 descriptors and, in addition, more than 160,000 entry terms. These entry terms assist in finding the most appropriate MeSH Heading; for example, “Vitamin C” is an entry term to “Ascorbic Acid”. In addition to these headings, there are more than 180,000 headings called Supplementary Concept Records within a separate thesaurus. MeSH is organized in a tree, with concepts such as anatomy and diseases, but also geographic locations, at the top level. The MeSH vocabulary is used for indexing journal articles for Index Medicus and MEDLINE and also for cataloguing books and audiovisuals. PubMed contains links to full-text articles at participating publishers’ websites as well as links to other third party sites. It also provides access and links to the integrated molecular biology databases maintained by the National Center for Biotechnology Information. Table 2.2 shows the main differences between GO and MeSH.
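The role of entry terms in finding the appropriate heading can be sketched as a simple lookup. This is an illustration only: the dictionary `HEADINGS` contains just the single example given in the text, and the helper names `build_index` and `to_heading` are hypothetical, not part of any MeSH software.

```python
from typing import Dict, List, Optional

# MeSH heading -> entry terms pointing to it.
# Only the pair "Vitamin C" -> "Ascorbic Acid" is taken from the text.
HEADINGS: Dict[str, List[str]] = {
    "Ascorbic Acid": ["Vitamin C"],
}


def normalise(text: str) -> str:
    """Collapse whitespace and ignore case so that lookups are forgiving."""
    return " ".join(text.split()).casefold()


def build_index(headings: Dict[str, List[str]]) -> Dict[str, str]:
    """Map every heading and every entry term to its preferred heading."""
    index: Dict[str, str] = {}
    for heading, entry_terms in headings.items():
        index[normalise(heading)] = heading
        for entry_term in entry_terms:
            index[normalise(entry_term)] = heading
    return index


def to_heading(query: str, index: Dict[str, str]) -> Optional[str]:
    """Return the preferred heading for `query`, or None if it is unknown."""
    return index.get(normalise(query))


INDEX = build_index(HEADINGS)
print(to_heading("vitamin c", INDEX))
print(to_heading("Ascorbic acid", INDEX))
```

Mapping both the heading itself and its entry terms to the same preferred form is what allows articles that use different wording for one concept to be retrieved together.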
12 See http://www.nlm.nih.gov/mesh/meshhome.html
13 See http://www.uniprot.org/
14 See OBO Foundry, http://www.obofoundry.org/
Fig. 2.7: Section of the GO graph showing the three aspects (molecular function, biological process, and cellular component) and some of their descendant terms. The fact that GO is a directed acyclic graph (DAG) rather than a tree is illustrated by the term ‘transcription factor activity’, which has two parents. An example of a ‘part–of’ relationship is also shown between the terms ‘cell part’ and ‘cell’.
                     Gene Ontology (GO)                 Medical Subject Headings (MeSH)
Primary purpose      gene product annotation            annotation and indexing of
                     (biological process, molecular     biomedical articles (Index
                     function, cellular component)      Medicus, MEDLINE)
Number of concepts   >19,000 terms                      25,186 descriptors,
                                                        >160,000 entry terms
Type of relations    ‘is–a’, ‘part–of’,                 A ‘narrower than’ B (so that users
                     ‘is–synonym’                       interested in Bs are given the
                                                        option to look at As); associative
                                                        relationships (see
                                                        http://www.nlm.nih.gov/mesh/intro_entry.html)
Unified Medical Language System (UMLS)
The Unified Medical Language System¹⁶ (UMLS) is an attempt to make a collection of clinical and medical terminologies interoperable (Bodenreider, 2004). The UMLS Metathesaurus is a large, multi-purpose, and multilingual vocabulary database that contains information about biomedical and health related concepts, their various names, and the relationships among them. It is built from the electronic versions of many different thesauri, classifications, code sets, and lists of controlled terms used in patient care, health services billing, public health statistics, indexing and cataloging biomedical literature, and/or basic, clinical, and health services research. The 2009 release of the UMLS Metathesaurus¹⁷ comprises 150 biomedical vocabularies, including the Gene Ontology, the Medical Subject Headings, the Foundational Model of Anatomy (FMA) (Rosse and Mejino, 2003), the Systematized Nomenclature of Medicine (SNOMED) (Spackman, 2004), and others. The Metathesaurus does not represent a comprehensive ontology of biomedicine or a single consistent view of the world (except at the high level of the semantic types assigned to all its concepts). It preserves the many views of the world present in its source vocabularies, because these different views may be useful for different tasks.
Ontologies for Anatomy
While most biological databases contain information at the molecular level, there is a growing need to link this information to concepts about the global structure of organisms, that is, to their anatomy. There are two main reasons for this development.
First, a central question in genetics is which genes influence the development of which parts of an organism, and which genetic mutations cause which deviations from the standard phenotype. Researchers tackle this question by exploring which genes are expressed at which stage of development in which tissues of an organism. To make such findings generally accessible, a standardized vocabulary about developmental stages and tissues is needed for annotations. Second, biological image data are increasingly being published on the Web. To describe in a uniform way what tissue an image shows, one has to resort to an anatomical vocabulary.
Much data is collected on a variety of organisms, and very often represented in structured, thus queryable, databases:
• Mouse Genome Informatics (MGI)¹⁸ gives integrated access to various types of genetic and genomic data on the mouse (Ringwald et al., 2001).
• Wormbase¹⁹ has information on the worm C. elegans and other nematodes (Stein et al., 2001).
• Wormatlas²⁰ is another resource for C. elegans and provides an anatomy handbook (Altun and Hall, 2006).
• FlyBase²¹ collects genomic information on the fruit fly Drosophila (Consortium, 1998).
• Saccharomyces Genome Database (SGD)²² collects genomic information on the baker’s yeast, S. cerevisiae.
• Zebrafish Information Network (ZFIN)²³ makes gene expression, mutant, and other genomic data on the zebrafish available (Sprague et al., 2006).
16 See http://www.nlm.nih.gov/research/umls/
17 See http://www.nlm.nih.gov/research/umls/knowledge_sources/metathesaurus/release/source_vocabularies.html
18 http://www.informatics.jax.org/
19 http://www.wormbase.org/
20 http://www.wormatlas.org/
21 http://flybase.bio.indiana.edu/
22 http://www.yeastgenome.org/
23 http://zfin.org/
Anatomy ontologies can be sizable. The mouse anatomy, for instance, comprises more than 8,000 terms. Anatomies can be integral parts of larger ontologies or controlled vocabularies, as in the Medical Subject Headings (MeSH) system described earlier. MeSH contains mostly terms for human anatomy, but also some that relate to various mammal species. MeSH terms are used to annotate entries in large bibliographical databases.
The Edinburgh Mouse Atlas Project (EMAP) provides a resource that combines an anatomy ontology with a three-dimensional spatial model of the mouse embryo to give access to gene expression data (Baldock et al., 2003). Anatomical terms are linked to regions in the spatial model and vice versa.
The Mouse Atlas is based on the same anatomy as Jackson Lab’s MGI, but has been enriched to represent groupings between tissues, such as the “skin” group, which comprises tissues in many different locations (Bard et al., 1998). We will elaborate on ontologies for anatomy in Chapter 5, when describing MousePubMed, a system for mining scientific literature on mouse anatomy.