
Desiderata for ontologies to be used in semantic annotation of biomedical documents



Michael Bada, Lawrence Hunter

Department of Pharmacology, University of Colorado Denver, MS 8303, RC-1 South, 12801 East 17th Avenue, L18-6400, P.O. Box 6511, Aurora, CO 80045, USA

Article info

Article history: Received 22 November 2009; available online 26 October 2010

Keywords: Ontologies; Annotation; Desiderata; Corpus; NLP; Terminologies; OBO; Markup

Abstract

A wealth of knowledge valuable to the translational research scientist is contained within the vast biomedical literature, but this knowledge is typically in the form of natural language. Sophisticated natural-language-processing systems are needed to translate text into unambiguous formal representations grounded in high-quality consensus ontologies, and these systems in turn rely on gold-standard corpora of annotated documents for training and testing. To this end, we are constructing the Colorado Richly Annotated Full-Text (CRAFT) Corpus, a collection of 97 full-text biomedical journal articles that are being manually annotated with the entire sets of terms from select vocabularies, predominantly from the Open Biomedical Ontologies (OBO) library. Our efforts in building this corpus have illuminated infelicities of these ontologies with respect to the semantic annotation of biomedical documents, and we propose desiderata whose implementation could substantially improve their utility in this task; these include the integration of overlapping terms across OBOs, the resolution of OBO-specific ambiguities, the integration of the BFO with the OBOs and the use of mid-level ontologies, the inclusion of noncanonical instances, and the expansion of relations and realizable entities.

© 2010 Elsevier Inc. All rights reserved.

1. Introduction

Ontological annotation of genes and gene products is widely used as the basis for high-throughput data analysis, especially in calculations of enrichment of Gene Ontology (GO) terms in sets of differentially expressed genes [1–3]. Indeed, the GO was “intended to make possible, in a flexible and dynamic way, the annotation of homologous gene and protein sequences in multiple organisms using a common vocabulary that results in the ability to query and retrieve genes and proteins based on their shared biology” [4]. More sophisticated uses of formal knowledge representations in data analysis are beginning to be published [5], and there has been a strong movement toward formal representation of biomedical knowledge in the community-driven approach of the members of the Open Biomedical Ontologies (OBO) consortium [6].

At the same time, semantic annotation of biomedical documents toward formal representation of their encoded knowledge is of growing importance to the biomedical research community [7–9]. A wealth of knowledge valuable to the translational research scientist is contained within these documents, but this knowledge is in the form of natural language, which is far more difficult for computational systems to process than formal representations.

Substantial progress has been made in biomedical natural-language processing (NLP), much of it in automated methods for mapping text to terms from ontologies and other controlled vocabularies (e.g., the unique identifiers of records of gene and gene-product databases) [10–15]. The transformation of biomedical texts with their abundant synonymy, polysemy, and complexity into unambiguous representations grounded in high-quality consensus ontologies opens the potential application of powerful computer-science methods to advance biomedical research, and the OBOs present themselves as attractive resources for this task.

A critical prerequisite to the development of sophisticated NLP systems able to perform this transformation is the availability of “gold-standard” annotated corpora, which are composed of documents that have been manually marked up with formal terms and relations by expert human annotators and that can subsequently be used to train and test these systems [16,17]. To this end, we are constructing the Colorado Richly Annotated Full-Text (CRAFT) Corpus, a collection of 97 manually annotated full-text biomedical journal articles, primarily focused on the laboratory mouse. There have been other efforts to create systematically annotated corpora of (typically limited aspects of) the biomedical literature [18–25]. However, while most other gold-standard corpora are limited to sentences or abstracts, we are annotating the entirety of these 97 full-text articles, excluding only references, resulting in a corpus with a total of more than 750,000 words of text. With the eight ontologies and terminologies that are being employed in the CRAFT Corpus, the range of biomedical concepts being marked up is much wider than in other

1532-0464/$ - see front matter © 2010 Elsevier Inc. All rights reserved. doi:10.1016/j.jbi.2010.10.002

Corresponding author. Fax: +1 303 724 3655.

E-mail addresses: mike.bada@ucdenver.edu (M. Bada), larry.hunter@ucdenver.edu (L. Hunter).


Journal of Biomedical Informatics

Journal homepage: www.elsevier.com/locate/yjbin


gold-standard corpora; the types of entities being currently identified include (roughly in order of increasing granularity) subatomic particles, atoms, molecules and their parts, biomacromolecular sequences (along with associated attributes and operations), cellular components, cells, organisms, molecular functions, and biological processes. Additionally, we are using OBOs and other controlled terminologies ranging from approximately one thousand to millions of terms in their entirety, thus affording a very high degree of semantic richness, as opposed to using only relatively small schemata, as most annotated corpora do. These ontologies and terminologies are continually under development by biomedical researchers and knowledge engineers and are widely used throughout the biomedical field, as opposed to many other annotation schemata that are often idiosyncratic and not likely reusable for other tasks. Furthermore, although these OBOs have been frequently used in a variety of NLP tasks, they have not been used in their entirety toward gold-standard semantic markup of text.

Thus far, we have primarily completed the annotation of mentions of cells (using the OBO Cell Type Ontology (CL) [26]), cellular components (using the cellular-component (CC) subontology of the OBO Gene Ontology (GO)), chemicals, chemical groups, atoms, and subatomic particles (using the OBO Chemical Entities of Biological Interest (ChEBI) ontology [27]), organisms (using the NCBI Taxonomy [28]), biomacromolecular sequences and their associated attributes and operations (using the OBO Sequence Ontology (SO) [29]), and biological processes and molecular functions (using the BP and MF subontologies of the GO) in the 97 full texts. Additionally, we are still annotating genes and gene products (using the unique identifiers of the records of the Entrez Gene database [28]). Of these eight ontologies and terminologies, four (GO BP, CC, MF, and ChEBI) are official OBO Foundry members. Another three (CL, SO, and NCBI Taxonomy) appear in the official OBO Foundry paper (in Table 1) as “initial foundry ontologies”; it appears that these are considered candidate OBO Foundry members and, since they are the most prominent OBOs within their respective domains, they are likely to become official members of the OBO Foundry. The eighth, the set of unique identifiers of the records of the Entrez Gene database (which is not an ontology but can be considered a controlled vocabulary), is being used because genes are a critical type of entity to identify in the literature, and, since there is no OBO of species-specific genes, we have resorted to the most prominent, widely used resource of species-specific gene information.

Realizing the potential for the extraction of knowledge from biomedical documents and the application of powerful reasoning methods to this knowledge depends in part on the adequacy of the formal representations used to annotate the documents. The OBOs were not originally developed as resources for the semantic annotation of documents, and although the development of the OBOs has been influenced by other use cases over the last decade, semantic annotation of biomedical texts toward formal representation of their encoded knowledge has not been a prominent one. Our efforts in building the CRAFT Corpus have illuminated some areas in which the existing ontological resources have infelicities with respect to the semantic annotation of biomedical documents, and we propose changes to the OBOs that could substantially improve their utility in this task. Though there have been publications of domain-independent ontology desiderata [30–34], we are not aware of any focusing on desiderata for ontologies (and especially OBOs) toward the facilitation of semantic markup of natural-language documents. In this paper, we particularly focus on the use of the terms of these ontologies and terminologies toward semantic annotation, as this constitutes the bulk of the annotation we have completed thus far. (However, we are actively working toward assertional annotation using relations as well.) We assert that implementation of these desiderata would not hamper the use of OBOs toward their primary purpose of annotating entities in biomedical databases. Furthermore, rather than distorting the OBOs specifically for biomedical NLP research, the implementation of these desiderata would also improve the overall quality of the collective OBOs themselves, thus enabling their more effective use toward other biomedical applications.

2. Annotation methodology

Semantic markup is being manually performed by three annotators with Ph.D.s in the biological sciences, each working with one terminology at a time (with the exception of the GO BP and MF subontologies, which are being annotated simultaneously by one annotator). These annotators follow guidelines developed by one of the authors (MB, referred to as the lead) to reflect his extensive experience with the construction and use of ontologies, and they are interactively trained by the lead to mark up the text with the controlled vocabularies.

An initial automatic annotation pass was performed for most of the terminological passes, marking up exact matches to the term names, their exact synonyms, and the plurals of both. The annotators review each of these programmatically created annotations, deleting or changing them as needed and creating any missed annotations. The lead then reviews the annotators’ markup, correcting errors and adding missed annotations. Each annotator meets with the lead approximately weekly to come to consensus over the sections of text that they had annotated differently in the previous annotation time period.
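As a rough illustration of what such a dictionary-based pre-annotation pass might look like, the following minimal Python sketch marks up exact, whole-word matches to term names, exact synonyms, and their plurals. It is not the tooling actually used for the CRAFT Corpus: the term identifiers (`EX:0001`, `EX:0002`), the naive "append s" plural rule, and the longest-match-first policy are all assumptions made for the example.

```python
import re


def build_lexicon(terms):
    """Map each lowercased surface form to a term ID.

    `terms` maps a term ID to its name plus exact synonyms.
    Plurals are generated by appending "s" (a simplifying assumption).
    """
    lexicon = {}
    for term_id, names in terms.items():
        for name in names:
            form = name.lower()
            lexicon[form] = term_id
            lexicon[form + "s"] = term_id
    return lexicon


def pre_annotate(text, lexicon):
    """Return sorted (start, end, term_id) tuples for exact whole-word matches.

    Longer surface forms are tried first so that a plural or multiword
    form is not shadowed by a shorter match inside it.
    """
    annotations = []
    taken = [False] * len(text)  # characters already covered by an annotation
    for form in sorted(lexicon, key=len, reverse=True):
        pattern = r"\b" + re.escape(form) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            start, end = match.start(), match.end()
            if not any(taken[start:end]):
                annotations.append((start, end, lexicon[form]))
                for i in range(start, end):
                    taken[i] = True
    return sorted(annotations)


# Hypothetical two-term vocabulary, for illustration only.
EXAMPLE_TERMS = {
    "EX:0001": ["B cell", "B lymphocyte"],
    "EX:0002": ["neuron"],
}
```

Annotations produced this way would only be a starting point; as described above, every automatic annotation is still reviewed, corrected, or deleted by a human annotator.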

All semantic annotation work is performed in Knowtator [35], a tool developed for markup of text for NLP tasks and implemented as a plugin to Protege-Frames [36]. OBOs were automatically translated from their native OBO Format [37] into a frame-based representation, which is required for Knowtator. Single-blind IAA statistics between the markup submitted by the annotator and that resulting from review by the lead were generated through Knowtator.

The articles of the corpus are also being syntactically annotated in terms of sentence segmentation, tokenization, part-of-speech tagging, and treebanking. Furthermore, the nouns and noun phrases of the articles are being coreferentially annotated, and sentences of the articles that are the bases for annotations of genes and gene products of the laboratory mouse are being marked up and correspondingly linked. However, further discussion of these branches of the markup of the CRAFT Corpus is beyond the scope of this paper.

Table 1
Counts of annotations and articles completed thus far for the terminological annotation passes for the CRAFT Corpus. The asterisk for Entrez Gene indicates that this is an ongoing pass with 57 articles completed; all other counts are sums of the respective annotations of all 97 articles of the corpus.

Terminology       # Annotations
ChEBI             15,310
CL                8289
Entrez Gene*      12,544
GO BP             23,588
GO CC             7241
GO MF             5853
NCBI Taxonomy     11,201
SO                32,498
Total             116,524


3. Annotation results

As an indication of the progress of the semantic annotation of the CRAFT Corpus, we list the counts of annotations and articles thus far completed for each of the terminologies we are using in Table 1. To date, more than 116,000 annotations have been made using terms from these terminologies in this corpus of 97 articles and nearly 600,000 words. As a measure of the quality of these annotations, we have graphed the IAA statistics of the semantic-annotation projects over time, expressed as the F-score; Fig. 1 shows those for the five primarily completed projects (the CL, CC, ChEBI, NCBI Taxonomy, and SO projects), and Fig. 2 presents those for the GO BP and MF project (which has been graphed separately to accommodate the more wide-ranging values along the x- and y-axes, corresponding to timespan and F-scores of the project, respectively). (IAA is not being calculated for the Entrez Gene pass since this pass primarily entails the mapping of the mentions of genes and gene products to database identifiers; the annotator is using as input the results of a previous manual annotation pass in which all of these mentions were annotated, and we felt this would unfairly bias the statistics.) As can be seen from these statistics, our annotators routinely achieve 90+% agreement with the project lead once they become familiar with the corresponding terminology and the thorniest annotation issues are resolved, the sole exception being the very challenging GO BP and MF pass (which is discussed below). Oscillations in these figures, particularly in Fig. 2, are largely explained by the fact that an annotator may make the same error many times (or not) in a given article, which can strongly affect IAA statistics. This is compounded by the fact that a given article often has many mentions of some terms of the terminology being used, and two annotators might consistently annotate many mentions of a given set of terms differently, leading to a significant drop in IAA. One straightforward example of this is that in one article, the lead annotated all 305 mentions of the word “olfactory” with the GO BP term sensory perception of smell (since our annotation guidelines call for annotating adjectival forms of terms, “olfactory” being the adjectival form of “olfaction”, an exact synonym of sensory perception of smell), while the annotator did not; this one very repetitive type of error alone accounted for a decrease in the IAA of 4.5 percentage points. Given that there are many concepts that may be frequently mentioned within a given article, this effect can be cumulative, leading to large variations in IAA.

Our IAA statistics are calculated using the results of our single-blind annotation methodology (in that the lead can see the markup of the initial annotator); these numbers are likely higher than those that would result from true double-blind annotation. To assess this bias, IAA was calculated for the SO pass of three articles in a double-blind fashion (with the regular SO annotator and lead). The double-blind IAA of 89.9% is very close to the single-blind IAA of 90.4% for the previous SO annotation batch, suggesting that the lead is very consistent and that the single-blind IAAs are unlikely to be significantly biased. Also, note that our criteria for annotation matches are very strict, requiring exact matches for the selected text spans and for the terms used to semantically annotate these spans. Many of the mismatches (which lower the IAA) are actually minor, due to small differences in the span of text selected or in the chosen class.
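To make the effect of this strict matching criterion concrete, the following minimal Python sketch computes an agreement F-score between two sets of annotations, counting a match only when the text span and the term are both identical. The text above specifies only that IAA is expressed as an F-score, so the use of the balanced F1 measure with the lead's markup as the reference is an assumption, and the function is not the Knowtator implementation.

```python
def iaa_f_score(annotator, lead):
    """Strict-match agreement between two annotation sets.

    Each annotation is a (start, end, term_id) tuple; two annotations
    match only if all three fields are identical. The lead's set is
    treated as the reference (an assumption made for this sketch).
    """
    annotator, lead = set(annotator), set(lead)
    if not annotator or not lead:
        return 0.0
    matches = len(annotator & lead)
    if matches == 0:
        return 0.0
    precision = matches / len(annotator)
    recall = matches / len(lead)
    # Balanced F-score: harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)
```

Under this measure, an annotation whose span is off by a single character counts as a full mismatch for both parties, which is why many minor disagreements lower the reported IAA.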

4. Desiderata for the Open Biomedical Ontologies for semantic annotation

The OBOs have not been specifically developed for the semantic annotation of natural-language biomedical documents, and in our construction of the CRAFT Corpus, we have encountered many issues in using them for this task. These range from issues specific to one or a few terms from a given ontology to those that span across several ontologies or are domain-independent. We have previously presented lessons learned in using large terminologies to semantically annotate concept¹ mentions in natural-language documents [40], concept-annotation guidelines [41], and issues we have encountered in using the GO in particular for this purpose [42]. Here we focus on our use of the OBOs and present six high-level desiderata that we assert would make their use in semantic annotation easier and more effective. These desiderata are empirically derived from our substantial effort to use these terminologies in their entirety to semantically mark up full-text journal articles, resulting in over 116,000 annotations of mentioned ontological concepts, and they are a distillation of what we have found to be most difficult in using these terminologies for this task. If implemented, these recommendations would ensure that future development of the ontologies meets the needs of not only annotation of genes and gene products in databases but also of semantic annotation of biomedical natural-language documents in the pursuit of knowledge extraction and its formal representation.

First, several caveats. The OBOs are works in progress, and many of them have limited or no funding to support their development. The observations that follow should not be construed as criticisms of the existing work but, as much as possible, as desiderata for future improvements. Most of the ontologies and terminologies within the OBO Library are not official members of the OBO Foundry (which is the natural locus for coordinating some desired changes), and each OBO development team generally retains its independence, so adoption of these suggestions may be uneven. Finally, although these issues were uncovered by the use of the OBOs in the semantic annotation of natural-language text, the suggestions are intended to logically strengthen the collective ontologies and generally heighten their utility; nothing we propose below entails changes that we believe would be logically or ontologically incorrect.

Fig. 1. IAA statistics over time for the annotation passes using the SO, GO CC, CL, NCBI Taxonomy, and ChEBI for the CRAFT Corpus. The x-axis is annotation time period (which is generally weekly), and the y-axis is F-score.

Fig. 2. IAA statistics over time for the annotation passes using the GO BP and MF (which are simultaneously being done by one annotator) for the CRAFT Corpus. The x-axis is annotation time period (which is generally weekly), and the y-axis is F-score.

¹ Throughout this article, we have written in terms of concepts. We are aware of published work that emphasizes that ontologies represent reality rather than concepts [38,39], and for those readers who subscribe to this view, “concept” may be substituted with “universal” or “type” throughout.

4.1. Integrate overlapping terms across OBOs

Although one of the principles of the OBO Foundry is that member ontologies should be orthogonal to each other [6], a number of OBOs do contain terms that overlap with terms in other OBOs. A simple, obvious example is GO:cell and CL:cell, which presumably refer to the same entity but are not connected (e.g., with the OWL equivalentClass construct) and have different textual definitions. (The CL developers intend to revise the definition of CL:cell to be consistent with the GO and are advocating that OBOs that currently contain their own cell terms instead refer to CL terms (Alex Diehl, personal communication).) Likewise, ChEBI has a set of terms representing biological macromolecules, which overlap with but are not integrated with the extensive protein hierarchy of the Protein Ontology (PRO) [43] nor with the macromolecular terms of the SO. In Uberon [44], which is an OBO effort to unify the various taxon-specific anatomical OBOs (e.g., those for mice, humans, amphibians, teleost fish) into a general anatomical ontology, the general terms (e.g., UBERON:uterine cervix) have formal cross-references using the OBO Flat File Format tag xref to the analogous taxon-specific terms (e.g., FMA:Cervix of uterus for humans, MA:uterine cervix for mice); however, this tag is used to refer to an analogous term in another vocabulary [37], and such an assertion appears to be underspecified. Some effort to properly model this relationship would be valuable. For example, MA:uterine cervix could be formally linked to UBERON:cervix via is_a, thus stating that a mouse cervix is a type of cervix, along with a link to Mus in the NCBI Taxonomy to represent the organismal source; alternately, perhaps a more complex representation of asserted homology among the anatomical entities represented by the terms of Uberon and those of the taxon-specific anatomical ontologies would be suitable.
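The difference between an underspecified cross-reference and a formally typed link can be sketched in a few lines of Python. This is only an illustration of the idea: the `Link` class, the helper functions, and the `in_taxon` relation name are hypothetical and are not part of the OBO Format or of any OBO tool; the term labels are those used in the example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    """A directed, typed link between two terms (hypothetical structure)."""
    subject: str
    relation: str
    target: str


def underspecified(links):
    """Return links whose relation says only 'analogous term elsewhere'."""
    return [link for link in links if link.relation == "xref"]


def ancestors(term, links):
    """Collect every term reachable from `term` by following is_a links."""
    parents = {}
    for link in links:
        if link.relation == "is_a":
            parents.setdefault(link.subject, set()).add(link.target)
    seen, stack = set(), [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Cross-reference only: nothing can be inferred about the mouse term.
XREF_ONLY = [Link("UBERON:uterine cervix", "xref", "MA:uterine cervix")]

# Typed alternative suggested in the text; "in_taxon" is a made-up relation name.
TYPED = [
    Link("MA:uterine cervix", "is_a", "UBERON:cervix"),
    Link("MA:uterine cervix", "in_taxon", "NCBITaxon:Mus"),
]
```

With the typed links, a simple traversal can conclude that a mention annotated with the mouse term is also a mention of the general term, whereas the xref alone supports no such inference.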

This lack of clearly specified relationships among semantically identical and related terms across different OBOs is problematic for a variety of reasons. In semantic annotation of biomedical documents, ambiguities result from overlapping but not explicitly related term sets, which makes this task more difficult. For example, in ChEBI, protein polypeptide chain is a child of polypeptide, but in the SO, the term polypeptide seems to be synonymous with CHEBI:protein polypeptide chain. (The comment for SO:polypeptide states that it has been merged with “protein”, and “protein” is an exact synonym.) To make things more difficult, ChEBI also has a term named protein, which seems to encompass both single protein chains (corresponding to CHEBI:protein polypeptide chain, SO:polypeptide, and presumably to PRO:protein) and also protein complexes (corresponding to GO:protein complex). Even though orthogonality is emphasized among the OBOs, there will probably always be some overlap of concepts; what we are advocating is a formalization of the relationships among these overlapping concepts. The OBO Foundry principle of orthogonality should be amended to require explicit relationships among terms across multiple ontologies when overlap is deemed desirable; a mechanism such as MIREOT could be employed to avoid issues in importing and reasoning over multiple large ontologies [45]. Formally specifying identity as well as other types of relationships among related terms from different ontologies would not only make it easier to semantically annotate documents with these terms but would also help to reduce ambiguity for human users and facilitate computational use of multiple OBOs in a logically consistent way.

4.2. Avoid and remove general terms with context-specific meanings

Some OBOs contain terms that refer to more general concepts than the domain of the ontology itself. A good example of this appears in the SO, which contains a subgraph with epistemological terms such as independently_known, predicted, and validated. These terms were created in the SO to be used in the necessary and sufficient definitions of more complex SO terms (e.g., predicted_gene is a gene that has_quality predicted), and at the time of their creation, there was neither an appropriate ontology of such terms nor an effective way to refer to terms from other OBOs (Karen Eilbeck, personal communication). Within the SO, these terms denote epistemic properties of sequence features; for example, the textual definition of predicted is an “attribute describing an unverified region”, which may be appropriate to describe a predicted sequence feature but is not a good general definition for the term predicted.

A context-specific definition such as this is especially problematic for semantic annotation of text. Annotating only sequence-specific mentions of the epistemic SO terms in the biomedical documents of our corpus turns out to be quite difficult. We have come across many ambiguous mentions of these epistemic concepts, such as those that are applied to sequences but comment on their functionalities, which are largely (but not entirely) outside of the domain of the SO. Trying to decide whether an obvious mention of the more general concept (e.g., “predicted”) falls within a context-specific definition is both difficult and, with a well-designed system of ontologies, unnecessary. A general ontology of these epistemological terms could obviously be applied in many areas outside of biological sequences, and there might be no need to specify SO-specific subclasses. Therefore, the epistemic terms in the SO should likely be extracted and reconstituted in an orthogonal, more general epistemological ontology, perhaps merged with the OBO Evidence Code Ontology [46]. These redefined terms could then be more widely applied, and their use in semantic annotation would become significantly easier, as neither a human nor a computational system would need to resolve these ambiguous cases.

4.3. Resolve ontology-specific ambiguities

The use of an ontology in semantic annotation can also point out semantic ambiguities within an ontology. For example, in the GO, the nature of the MF terms (i.e., whether they represent molecule-level occurrents or realizable entities) and the relationship of the MF terms to the BP terms appear ambiguous, and there are many corresponding terms in these subontologies that we have found extremely difficult to consistently differentiate given a textual mention, even using their definitions and taking context into account (e.g., BP signal transduction and MF signal transducer activity, BP regulation of transcription and MF transcription regulator activity). (There is an additional complication as to the uncertain nature of the relationship between certain MF terms and corresponding ChEBI role terms (e.g., GO:enzyme inhibitor activity and CHEBI:enzyme inhibitor), which also involves the first desideratum.) There are those within the OBO Consortium who advocate for the MF terms to represent realizable entities, but the large majority of terms are currently defined as molecule-level occurrents. Furthermore, for the first two pairs of terms presented in this section, the MF terms are now linked to the BP terms via part_of (e.g., signal transducer activity part_of signal transduction), from which it can be inferred that the former represent occurrents, since the latter are obviously occurrents and only occurrents can be parts of other occurrents. However, consistently distinguishing between a regulation-of-transcription occurrent and a transcription-regulator-activity occurrent is at the heart of the problem. To maintain high interannotator agreement, we have had to suboptimally restrict annotation with the MF terms to specific lexical forms. We assert that it is important to be able to specify whether a gene or gene product has a specific functionality or is merely involved in a higher-level process, but we also assert that these overlapping terms are problematic and make semantic annotation of text very difficult. In this case, we suggest that a good approach would be to create a taxonomy of participant classes of independent continuants (e.g., signal transducer, transcriptional regulator). Most directly, the MF terms could be transformed into these participant terms. Alternately, the participant terms could be created in ChEBI, and the MF terms could be transformed into true realizable-entity terms (e.g., potential to transduce a signal, potential to regulate transcription). This would enable the former to be defined in terms of the latter (e.g., CHEBI:signal transducer has_realizable_entity GO:potential to transduce a signal), and the BP terms to formalize their participants (e.g., GO:signal transduction has_participant CHEBI:signal transducer). The separation of occurrents, their participants, and realizable entities possessed by the participants would be made clear.
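The proposed separation of occurrents, participants, and realizable entities can be illustrated with a small Python sketch over subject-relation-object triples. The triples restate the examples given above and in the same hypothetical form (the participant and realizable-entity terms are proposals, not existing ChEBI or GO terms); the triple list, the category table, and the helper functions are assumptions made purely for illustration.

```python
# Proposed (not existing) assertions, taken from the examples in the text.
TRIPLES = [
    ("GO:signal transduction", "has_participant", "CHEBI:signal transducer"),
    ("CHEBI:signal transducer", "has_realizable_entity",
     "GO:potential to transduce a signal"),
]

# Upper-level category of each term under the proposed modeling.
CATEGORY = {
    "GO:signal transduction": "occurrent",
    "CHEBI:signal transducer": "independent continuant",
    "GO:potential to transduce a signal": "realizable entity",
}


def objects(subject, relation, triples):
    """Return all objects linked from `subject` by `relation`, in order."""
    return [o for s, r, o in triples if s == subject and r == relation]


def describe(process, triples):
    """Map each participant of a process to the realizable entities it bears."""
    return {
        participant: objects(participant, "has_realizable_entity", triples)
        for participant in objects(process, "has_participant", triples)
    }
```

Calling `describe("GO:signal transduction", TRIPLES)` walks from the process to its participant class and on to the potential that the participant bears, so each of the three kinds of entity is represented by its own term and an annotator would no longer need to choose between two occurrent terms that are difficult to tell apart.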

An ambiguity in the use of an ontological term or term set can sometimes also be resolved through consultation with one or more developers of the ontology, which may not necessarily entail a change in the structure of the ontology but rather a clarification or changed definition. For example, we found it difficult to consistently annotate our documents with the GO BP term induction: although a given textual mention might refer to the induction of a biological process (e.g., disease, mutation, enzyme activity, gene expression), this GO term represents a more specific sense of biological induction (i.e., “[s]ignaling at short range between cells or tissues of different ancestry and developmental potential that results in one cell or tissue effecting a developmental change in the other”), and we found this distinction between the former and the latter in the textual mentions frequently unclear; furthermore, there are other GO terms with “induction” as a head noun (and not subsumed by induction) that use it in the more general sense (e.g., induction of apoptosis). A discussion with one of the GO developers, which included evaluation of a number of examples with which we were having trouble, significantly helped toward clearer, more consistent annotation with this term. (The developers of the GO have also since renamed induction to developmental induction.) Many such specific ambiguities become apparent when using an ontology for semantic annotation of biomedical documents. Communication with ontology developers, resulting in structural changes in the ontology and/or definitions or simply clarifications in the usage of terms, is required before apparently ambiguous concepts can be reliably identified in biomedical documents. Generally, the developers of these ontologies have been quite responsive to such queries.

4.4. Integrate the BFO and OBOs and use or create mid-level ontologies

There is already a wealth of domain-specific ontologies relevant to biomedical research that are associated with the OBO Library of ontologies; its Web page (http://obofoundry.org) lists 8 OBO Foundry ontologies and another 86 OBO Foundry candidate ontologies and other ontologies of interest, including those covering taxon-specific anatomies and developmental stages, phenotypes, cells and cellular components, environments, diseases and pathologies, chemical substances, macromolecular sequences, organisms, and concepts relevant to biomedical experiments. On the other end of the spectrum, the OBO Foundry has committed to using the Basic Formal Ontology (BFO) as its upper ontology; this general ontology first divides everything into continuants and occurrents and progressively subdivides these, but all concepts represented in the BFO are intended to be applicable to the representation of any domain [47]. However, linkage of the domain-specific ontologies to the upper ontology is generally lacking.

Integrating the OBOs with the BFO can be accomplished at a base level through formal is_a links between high-level terms of the OBOs and the BFO (or other OBO terms, as discussed in the first desideratum). One project that has implemented this is BioTop, which is a proposed top-level biomedical ontology serving to link domain-specific OBOs to upper ontologies (BFO and DOLCE) [48]. However, some of these high-level terms are semantically distant from even the most specific terms of the BFO. For example, the BP terms of the GO suggest a mid-level (i.e., more general than the GO, but more specific than the BFO) set of processes. GO:cell proliferation, for instance, could be defined in terms of a more general “proliferation” term (subsumed by BFO:occurrent) in which the entity that is proliferating is a population (i.e., BFO:object_aggregate) of cells (i.e., GO:cell/CL:cell). The verbs and their nominalizations that appear in many BP terms (e.g., transport, regulation, detection) are natural candidates for promotion to this mid-level ontology. Many of the BP terms have subtleties that may be challenging to represent precisely, so this would likely not be a straightforward process. There are many mentions of relatively general concepts in biomedical texts, suggesting that these concepts of intermediate abstraction are important in biomedical reasoning.

We have found it necessary to use context to annotate text consistently [40]. For example, if the word ‘‘differentiate’’ refers to GO:cell differentiation, it is annotated with this term even though the word in the absence of context might be read in its general-English sense. If we did not use context and relied only on the text itself, an unacceptably large number of mentions of biomedical concepts would not be semantically annotated. The BP term GO:biological regulation would never be used, for example, as this phrase never appears in our corpus of 97 biomedical documents, despite the fact that lexical variants of ‘‘regulate’’ appear 951 times in the corpus (making this one of the most common lexemes) and that the overwhelming majority of these do refer to biological regulation (and have therefore been annotated as such, taking context into account). However, since we consistently take context into account, general words are sometimes annotated with relatively specific terms; for example, ‘‘destabilizing’’ (i.e., this word by itself) has been annotated with GO:RNA destabilization, since this is the correct sense of the word in its context and there is no more general destabilization term in the GO. It would be better to be able to annotate a mention of such a general word with a general term, provided that the biomedical sense of the word is subsumed by the general sense of the word (as we would expect GO:RNA destabilization to be subsumed by a more general destabilization term). We assert that such a methodology, in addition to being more intuitive, would benefit text-mining research, as it would be easier to train systems to recognize that ‘‘destabilizing’’ should be annotated with a general destabilization term than to recognize that it should be annotated with GO:RNA destabilization only in the correct places (again, assuming that this term would be subsumed by the more general term).

Additionally, this type of integration would aid information extraction in that additional assertions involving concepts more general than those represented by OBO terms may be detected. For example, for the phrase ‘‘carbonic anhydrase...facilitates this secretion’’, we have annotated ‘‘carbonic anhydrase’’ with GO:carbonate dehydratase activity (also specifying that it is a continuant) and ‘‘secretion’’ with GO:secretion. However, ‘‘facilitates’’ is a general word that is not represented by any OBO term and is thus left unannotated; we have therefore identified mentions of the concepts carbonate dehydratase and secretion in this phrase, but the assertion that carbonate dehydratase facilitates secretion is lost. If the OBOs were integrated into a more general ontology with a term representing facilitation, ‘‘facilitates’’ could be annotated and this assertion could be extracted. With the expanded set of types of assertions capable of being reliably extracted, integration of OBOs with the BFO and with mid-level ontologies would also afford powerful reasoning between these levels of abstraction.
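The loss of the assertion in this example can be made concrete with a small Python sketch. It assumes a simplified model in which an assertion can be extracted only when the subject, predicate, and object mentions all carry an ontology term; the Annotation type, the extract_assertion function, and the term MID:facilitation are hypothetical and serve only as illustration.

```python
from typing import NamedTuple, Optional


class Annotation(NamedTuple):
    text: str
    term: Optional[str]  # None when no ontology term covers the mention


def extract_assertion(subject, predicate, obj):
    """Return a (subject, relation, object) triple of terms,
    or None if any of the three mentions lacks a term."""
    parts = (subject, predicate, obj)
    if any(part.term is None for part in parts):
        return None
    return (subject.term, predicate.term, obj.term)


enzyme = Annotation("carbonic anhydrase", "GO:carbonate dehydratase activity")
secretion = Annotation("secretion", "GO:secretion")

# With OBO terms alone, "facilitates" has no term, so nothing is extracted.
unannotated = Annotation("facilitates", None)

# "MID:facilitation" is a hypothetical mid-level term, not an existing identifier.
annotated = Annotation("facilitates", "MID:facilitation")
```

In this sketch, extract_assertion(enzyme, unannotated, secretion) yields None, whereas supplying the hypothetical facilitation term yields a complete triple.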

4.5. Include noncanonical instances

Most of the OBOs are charged with representing canonical continuants and/or occurrents within their respective domains. In using the GO to semantically annotate our corpus of biomedical documents, we at first attempted to limit annotation to mentions of canonical concepts, but this turned out to be a deceptively difficult task. Noncanonicality is especially frequent in biomedical journal articles because most of these articles present and discuss experiments with organisms or components of organisms that are subjected to substances, procedures, and environmental conditions that they would not normally encounter. Often, noncanonicality can only be inferred from a very careful reading and comprehension of the text, and many other times it is not clear at all whether a given concept mention is canonical or not. Furthermore, both continuants and occurrents can be noncanonical in any number of ways, and Rector has questioned whether any real structure can be completely characterized as ‘‘normal’’ [49].

Our solution has been to annotate all mentions that match the definitions of terms from the OBOs we are using, even those mentions that are explicitly noncanonical. For example, a mention of a cellular proliferation is annotated even if it is explicitly noncanonical in some way, e.g., pathological, artificially induced, or occurring outside of an organism; thus, the ‘‘proliferation’’ part of ‘‘hyper-proliferation’’ would be annotated with the term GO:cell proliferation (assuming it is a hyper-proliferation of cells) even though this is clearly a noncanonical proliferation. This guideline has significantly eased our task of semantic annotation, which had been greatly complicated by trying to determine the boundaries of canonicity. Although it does not require any but the most subtle change on the part of the ontology developers, we suggest that terms be considered to apply to both canonical and noncanonical instances that meet the term definition (except in those cases where such a distinction is explicit, of course, e.g., PATO:mislocalised). This is consistent with the representation in the Foundational Model of Anatomy, whose developers emphasize that their ontology is one of canonical anatomy, which can then be used to represent instantiated anatomy that differs in one or more aspects from the idealized anatomy [50]. Similarly, anatomical structures in the Ontology of Biomedical Reality are subdivided into canonical anatomical structures and variant anatomical structures (setting aside the question as to whether a structure can be entirely canonical, as previously mentioned) [51].

In addition to sparing (human or computational) annotators from having to make difficult and subjective decisions with regard to canonicity, this recommendation accommodates those who plan to use ontological terms to represent noncanonical instances, for example, in automated reasoning systems. Rather than identifying an instance as simply noncanonical, a knowledge representation should explicitly identify which ontological assertions are violated by the instance. For example, in the GO, GO:membrane is asserted to be part_of GO:cell. However, a membrane can also be artificially separated from the cell and continue to exist as a membrane. Such a membrane would be noncanonical in that it is not part of a cell, but it still could be canonical in other respects (e.g., molecular composition, polarization). This approach is essentially a reframing of default reasoning. Incorporating defaults and exceptions into ontologies and knowledge bases and reasoning with them requires reasoning beyond standard deduction and is a challenging endeavor, as researched by Rector et al. [52,53]. One system that treats assertions of canonical ontologies as default knowledge and enables revocation of this default knowledge through the use of a proposed class of relations has been implemented by Hoehndorf et al. [54]. The GO is developed in an annotation-driven way in that its terms and the assertions among them are continually edited to conform as much as possible with the annotations in model-organism databases created using these terms [55]; however, for those annotations that remain inconsistent with the ontology, representing and reasoning with defaults and exceptions would be a mechanism to formally relate them to the ontology. Note that this approach does not require developers of canonical ontologies such as the GO to include explicitly noncanonical terms or alter in any way canonical definitions of terms or assertions of relationships between these terms; however, it does warn them that users of these ontologies might create representations of concepts that are noncanonical in one or more ways with respect to canonically represented concepts but still are subsumed by them.
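The idea of identifying which canonical assertions an instance violates, rather than labeling it simply as noncanonical, can be sketched in Python as follows. This is a deliberately simplified illustration and not the system of Hoehndorf et al.: the DEFAULTS table and the violated_defaults function are hypothetical, and any fact not stated about an instance is treated as false (a closed-world simplification).

```python
# Default (canonical) assertions, keyed by class: (relation, target class).
DEFAULTS = {
    "GO:membrane": [("part_of", "GO:cell")],
}


def violated_defaults(instance_class, instance_facts, defaults=DEFAULTS):
    """List the canonical assertions of `instance_class` that do not hold
    for an instance whose known facts are `instance_facts`.

    Assumption: facts absent from `instance_facts` are taken to be false.
    """
    return [
        assertion
        for assertion in defaults.get(instance_class, [])
        if assertion not in instance_facts
    ]


# A membrane that is still part of a cell satisfies every default.
in_situ_membrane = {("part_of", "GO:cell")}

# An artificially isolated membrane is still a GO:membrane,
# but it violates the default part_of assertion.
isolated_membrane = set()
```

Here the isolated membrane remains an instance of GO:membrane; the representation records only that the part_of GO:cell default has been revoked for it.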

4.6. Expand relations and realizable entities

The previous desiderata deal principally with concept annotation, i.e., marking up mentions of ontological concepts with their corresponding terms. To extract knowledge from biomedical documents, these annotated concept mentions must also be linked via relations to create formally represented assertions, e.g., the anatomical parts in which genes are expressed, the noncanonical biological processes that result in particular phenotypes, and diseases and their treatments. In contrast to the OBOs, which offer large numbers of terms for use in the annotation of these articles, the ontology of relations is much sparser.

The OBO Relation Ontology (RO) [56] currently contains 26 basic relations, and another of the principles of the OBO Foundry is that member ontologies use relations that are unambiguously defined in the manner of those in the RO. Some of the member ontologies (as well as many of the other ontologies and terminologies of interest listed on the OBO Foundry site) already use relations outside of the RO within their respective ontologies, and most of these unofficial relations are not well-defined. In some cases, these unofficial relations used in assertions linking terms could be replaced with existing RO relations; for example, develops_from, which is used in several of the anatomical developmental ontologies, could likely be replaced in many cases with the official RO:transformation_of, the relation linking entities that change their classifications but retain their identities. In other cases, new relations will likely be needed for the assertions in these ontologies.

A particular area in which we (and others) have found the RO insufficient is in the precise representation of participants in occurrents in the OBO cross-products, which are an effort to create formal necessary and sufficient definitions of OBO terms using other OBO terms and in so doing link these OBO terms [57]. For example, the GO BP term GO:sulfur amino acid transport is formally defined as GO:transport that results_in_transport_of CHEBI:sulfur-containing amino acids, thus formally linking the previously unlinked GO:sulfur amino acid transport and CHEBI:sulfur-containing amino acids. The participants in many of the GO BP occurrents are represented provisionally using specific relations, e.g., results_in_transport_of for GO:transport and its subclasses (as above) and results_in_acquisition_of_features_of for GO:cell differentiation and its subclasses. (Though a relation such as results_in_transport_of may seem odd, it would simply be a subrelation of has_participant, an official OBO relation; the latter links a subject occurrent to an object continuant, while the former links a subtype of occurrent (a transport) to a continuant.) If occurrents and their participants are to be directly linked in this way (i.e., with the occurrent as the subject term and the participant as the object term, directly linked with one relation), then these specific relations are often needed: has_participant cannot be used to indicate the specific roles of these participants, as many biomedical occurrents (particularly higher-level ones, such as those at the cellular or anatomical level) may have any number of participants. Thus, while using has_participant would result in a necessary assertion, it would not be sufficient, since a given continuant may participate in any number of occurrents. Therefore, the cross-products defining GO BP terms in terms of other OBO participant terms are currently using a number of specific relations outside of the RO.
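The distinction between a necessary and a sufficient assertion can be illustrated with a minimal Python sketch. The data structures and function names (SUBRELATION_OF, CROSS_PRODUCTS, generalize, classify) are hypothetical and do not reflect the actual cross-product files; the single definition shown is the one given in the text.

```python
# Subrelation links, as described in the text:
# results_in_transport_of would be a subrelation of has_participant.
SUBRELATION_OF = {
    "results_in_transport_of": "has_participant",
}

# Genus-differentia definitions: term -> (genus, relation, filler).
CROSS_PRODUCTS = {
    "GO:sulfur amino acid transport": (
        "GO:transport",
        "results_in_transport_of",
        "CHEBI:sulfur-containing amino acids",
    ),
}


def generalize(relation):
    """Follow subrelation links up to the most general relation."""
    while relation in SUBRELATION_OF:
        relation = SUBRELATION_OF[relation]
    return relation


def classify(genus, relation, filler):
    """Return the terms whose definition is matched exactly by the given
    genus, relation, and filler (i.e., the sufficient condition is met)."""
    return [
        term
        for term, definition in CROSS_PRODUCTS.items()
        if definition == (genus, relation, filler)
    ]
```

In this toy model, a transport described with the specific relation is classified as GO:sulfur amino acid transport, whereas the same description using only has_participant matches no definition: the general assertion is implied by the specific one but does not suffice to identify the term.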

Another option in defining occurrents in terms of their participants is through the use of has_participant assertions that link the occurrent to each input participant in which the participant realizes some BFO:realizable_entity. In this approach, there would likely be a need for only a few new relations linking occurrents, their participants, and their realizable entities. More specific classes of realizable entities could be used as well, since the BFO already divides these into BFO:disposition, BFO:function, and BFO:role; however, we have had considerable difficulty in distinguishing among realizable entities. The specificity in this representation is pushed from the relations to the realizable entities, so there would be a need instead for specific (named or anonymous) classes of realizable entities.

We assert that the precise representation of participant continuants in occurrents is crucial for the annotation of concept mentions and knowledge extraction from biomedical documents, since there are so many mentions of occurrents and their participants in text, ranging from the molecular level (e.g., alanine biosynthesis) to the cellular-component level (e.g., membrane budding) to the cellular level (e.g., cell cycle) to the organ/organ-system level (e.g., lung development) to the organismal level (e.g., behavior), each with its own set of participants. In addition to these canonical occurrents, biomedical documents contain frequent mentions of noncanonical/abnormal occurrents (e.g., increased growth) and pathological occurrents (e.g., tumorigenesis) that are often of significant interest in translational research and that have their own sets of participants. However, even though we have focused on occurrents and their participants here, we assert there is a need for additional relations and realizable entities in other aspects as well.

5. Conclusions

We have briefly presented our ongoing efforts in building the CRAFT Corpus, a collection of full-text biomedical journal articles that are being manually annotated with the entire sets of terms of select OBOs; this corpus is intended to serve as a gold standard in biomedical text-mining research. We have also shown preliminary IAA statistics as measures of the quality of our efforts. From an analysis of the various difficulties we have encountered in using these OBOs for semantic annotation of the documents of our corpus, we have compiled and presented six high-level desiderata for ontologies, and particularly for OBOs, that we assert would significantly ameliorate these difficulties and would therefore lead to easier and more effective semantic annotation of and knowledge extraction from biomedical documents for the translational research scientist. In addition, we assert that the implementation of these desiderata would improve the collective ontological quality of the OBOs themselves, thus enabling their more effective use toward other biomedical applications. Implementing these desiderata would likely require long-term, challenging projects, but we assert that they are well worth the effort.

Acknowledgments

The authors are grateful for helpful comments provided by Barry Smith of the OBO Foundry, Judith Blake of the Mouse Genome Informatics Group and the Gene Ontology Consortium, Alex Diehl of the CL project, Karen Eilbeck of the SO project, and Janna Hastings of the ChEBI project. This work is supported by NIH 5G08M009639 and 5T15 LM009451.

References

[1] Khatri P, Drăghici S. Ontological analysis of gene expression data: current tools, limitations, and open problems. Bioinformatics 2005;21(18):3587–95.

[2] Curtis RK, Orešič M, Vidal-Puig A. Pathways to the analysis of microarray data. Trends Biotech 2005;23(8):429–35.

[3] Huang DW, Sherman BT, Lempicki R. Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucl Acids Res 2009;37(1):1–13.

[4] Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, et al. Gene ontology: tool for the unification of biology. Nat Genet 2000;25:25–9.

[5] Leach SM, Tipney H, Feng W, Baumgartner WA, Kasliwal P, Schuyler RP, et al. Biomedical discovery acceleration, with applications to craniofacial development. PLoS Comput Biol 2009;5(3):e1000215.

[6] Smith B, Ashburner M, Rosse C, Bard C, Bug W, Ceusters W, et al. The OBO foundry: coordinated evolution of ontologies to support biomedical data integration. Nat Biotech 2007;25:1251–5.

[7] Ananiadou S, Kell DB, Tsujii J. Text mining and its potential applications in systems biology. Trends Biotech 2006;24(12):571–9.

[8] Wilbur WJ, Rzhetsky A, Shatkay H. New directions in biomedical text annotation: definitions, guidelines and corpus construction. BMC Bioinform 2006;7:356.

[9] Bodenreider O. Biomedical ontologies in action: role in knowledge management, data integration and decision support. Yearb Med Inform 2008:67–79.

[10] de Bruijn B, Martin J. Getting to the (c)ore of knowledge: mining of biomedical literature. Int J Med Inform 2002;67(1–3):7–18.

[11] Mack R, Hehenberger M. Text-based knowledge discovery: search and mining of life-sciences documents. Inform Technol 2002;7(11):S89–98.

[12] Cohen AM, Hersh WR. A survey of current work in biomedical text mining. Briefings Bioinform 2005;6(1):57–71.

[13] Erhardt RAA, Schneider R, Blaschke C. Status of text-mining techniques applied to text. Drug Discovery Today 2006;11(7–8):315–25.

[14] Hunter L, Cohen KB. Biomedical language processing: what’s beyond PubMed? Mol Cell 2006;21(5):589–94.

[15] Zweigenbaum P, Demner-Fushman D, Yu H, Cohen KB. Frontiers of biomedical text mining: current progress. Briefings Bioinform 2007;8(5):358–75.

[16] Tsuruoka Y, Tateishi JD, Ohta T, McNaught J, Ananiadou S, Tsujii J. Developing a robust part-of-speech tagger for biomedical text. In: Proc 10th Panhellenic Conf on Informat 2005. p. 382–92.

[17] Lease M, Charniak E. Parsing biomedical literature. Natural language processing. Berlin/Heidelberg: Springer; 2005. p. 58–69.

[18] Kim J-D, Ohta T, Tateisi Y, Tsujii J. GENIA corpus – a semantically annotated corpus for bio-text mining. Bioinform 2003;19(Suppl. 1):i180–2.

[19] Kulick S, Bies A, Liberman M, Mandel M, McDonald R, Palmer M, et al. Integrated annotation for biomedical information extraction. Hum Lang Tech Conf/N Am Chapter of the Assoc for Comp Ling Annual Meeting (HLT/NAACL), Biolink Workshop; 2004. p. 61–8.

[20] Tanabe L, Xie N, Thom LH, Matten W, Wilber J. GENETAG: a tagged corpus for gene/protein named entity recognition. BMC Bioinform 2005;6(Suppl. 1):S3.

[21] Pyysalo S, Ginter F, Heimonen J, Björne J, Boberg J, Järvinen J, et al. BioInfer: a corpus for information extraction in the biomedical domain. BMC Bioinform 2007;8:50.

[22] Roberts A, Gaizauskas R, Hepple M, Davis N, Demetriou G, Guo Y, et al. The CLEF corpus: semantic annotation of clinical text. Proc Am Med Inform Assoc 2007:625–9.

[23] http://fetchprot.sics.se/.

[24] Kim J-D, Ohta T, Tsujii J. Corpus annotation for mining biomedical events from literature. BMC Bioinform 2008;9:10.

[25] Thompson P, Iqbal SA, McNaught J, Ananiadou S. Construction of an annotated corpus to support biomedical information extraction. BMC Bioinform 2009;10:349.

[26] Bard J, Rhee SY, Ashburner M. An ontology for cell types. Genome Biol 2005;6(2):R21.

[27] Degtyarenko K, de Matos P, Ennis M, Hastings J, Zbinden M, McNaught A, et al. ChEBI: a database and ontology for chemical entities of biological interest. Nucl Acids Res 2008;36(Database Issue):D344–50.

[28] Sayers EW, Barrett T, Benson DA, Bryant SH, Canese K, Chetvernin V, et al. Database resources of the National Center for Biotechnology Information. Nucl Acids Res 2009;37(Database Issue):D5.

[29] Eilbeck K, Lewis SE, Mungall CJ, Yandell M, Stein L, Durbin R, et al. The sequence ontology: a tool for the unification of genome annotations. Genome Biol 2005;6:R44.

[30] Gruber TR. Toward principles for the design of ontologies used for knowledge sharing. Int J Human-Comp Stud 1995;43(5–6):907–28.

[31] Swartout B, Ramesh P, Knight K, Russ T. Toward distributed use of large-scale ontologies. AAAI Symp Ontological Eng, 1997.

[32] Cimino JJ. Desiderata for controlled medical vocabularies in the twenty-first century. Methods Inf Med 1998;37(4–5):394–403.

[33] Burgun A. Desiderata for domain reference ontologies in biomedicine. J Biomed Inform 2006;39(3):307–13.

[34] Wang X, Almeida JS, Oliveira AL. Ontology design principles and normalization techniques in the web. In: Proceedings of the international workshop on data integration in the life sciences, 2008.

[35] Ogren PV. Knowtator: A plug-in for creating training and evaluation data sets for biomedical natural language systems. In: Proceedings of the ninth international protege conference, 2006.

[36] Gennari JH, Musen MA, Fergerson RW, Grosso WE, Crubézy M, Eriksson H, et al. The evolution of protégé: an environment for knowledge-based systems development. Int J Human-Comp Stud 2003;58(1):89–123.

[37] http://www.geneontology.org/GO.format.obo-1_2.shtml.

[38] Smith B. Beyond concepts: ontology as reality representation. In: Proceedings of the internat conf on formal ontology in information systems (FOIS) workshop on the potential of cognitive semantics for ontologies, 2004.

[39] Smith B. From concepts to clinical reality: an essay on the benchmarking of biomedical terminologies. J Biomed Inform 2006;39(3):299–306.

[40] Bada M, Hunter L. Using large terminologies to semantically annotate concept mentions in natural-language documents. Proceedings of the semantic authoring, annotation and knowledge markup workshop (SAAKM) 2009.

[41] Bada M, Eckert M, Palmer M, Hunter LE. An overview of the CRAFT concept annotation guidelines. In: Proc Assoc Comp Ling (ACL) Ling Annotation Workshop (LAW) IV, 2010.

[42] Bada M, Hunter L. Using the gene ontology to annotate biomedical journal articles. In: Proc Int Conf Biomed Ontology (ICBO), 2009.

[43] Natale DA, Arighi CN, Barker WC, Blake J, Chang TC, Hu Z, et al. Framework for a protein ontology. BMC Bioinform 2007;8(Suppl. 9):S1.

[44] Haendel MA, Gkoutos GV, Lewis SE, Mungall CJ. Uberon: towards a comprehensive multi-species anatomy ontology. Nat Precedings, 2009.

[45] Courtot M, Gibson F, Lister AL, Malone J, Schober D, Brinkman RR, et al. MIREOT: the minimum information to reference an external ontology term. Nat Precedings 2009.

[46] http://www.obofoundry.org/cgi-bin/detail.cgi?id=evidence_code.

[47] Grenon P, Smith B, Goldberg L. Biodynamic ontology: applying BFO in the biomedical domain. In: Pisanelli DM, editor. Ontologies in medicine. Amsterdam: IOS Press; 2004. p. 20–38.

[48] Beisswanger E, Schulz S, Stenzhorn H, Hahn U. BioTop: an upper domain ontology for the life sciences: a description of its current structure, contents and interfaces to OBO ontologies. Appl Ontol 2008;3(4):205–12.

[49] Rector AL. Anatomy for clinical terminology. In: Burger A, Davidson D, Baldock R, editors. Anatomy ontologies for bioinformatics, principles and practice. Springer; 2008.

[50] Rosse C, Mejino Jr JLV. The foundational model of anatomy ontology. Ibid.

[51] Smith B, Kumar A, Ceusters W, Rosse C. On carcinomas and other pathological entities. Comp Func Genom 2005;6:379–87.

[52] Rector AL, Wroe C, Roger J, Roberts A. Untangling taxonomies and relationships: personal and practical problems in loosely coupled development of large ontologies. Proc Knowledge Capture (KCAP) 2001:139–46.

[53] Rector A. Defaults, context, and knowledge: Alternatives for OWL-indexed knowledge bases. Proc Pacific Symp Biocomput (PSB) 2004;9:226–37.

[54] Hoehndorf R, Loebe F, Kelso J, Herre H. Representing default knowledge in biomedical ontologies: application to the integration of anatomy and phenotype ontologies. BMC Bioinform 2007;8:377.

[55] Hill DP, Smith B, McAndrews-Hill MS, Blake JA. Gene ontology annotations: what they mean and where they come from. BMC Bioinform 2008;9(Suppl. 5):S2.

[56] Smith B, Ceusters W, Klagges B, Köhler J, Kumar A, Lomax J, et al. Relations in biomedical ontologies. Genome Biol 2005;6:R46.

[57] Mungall CJ, Bada M, Berardini TZ, Deegan J, Ireland A, Harris MA, et al. Cross-product extensions of the gene ontology. J Biomed Informatics 2011;44(1):80–6.
