
Ontology-based tools in the life sciences

Biological knowledge is growing very fast, and ontologies are being developed to describe it. The massive amount of information now available makes it difficult for scientists to stay up to date, and various ontology-based tools have been developed to support biologists in that task. Ontologies, as described in section 2.2.3, can be used in a number of ways. Rubin et al. [233] discuss functionalities such as searching biomedical data, exchanging data between programs, integrating information, knowledge representation, computer reasoning, and natural language processing or textmining. A further functionality, structured searching in domain knowledge, is discussed in chapter 4. This section discusses examples of current state-of-the-art ontology-based tools.

2.5.1

Aligning and Merging Ontologies

Many bio-ontologies have already been developed, and many of them contain overlapping information. Users often want to work with several ontologies at once. For instance, companies may want to use community standard ontologies together with their own ontologies, and ontology developers may want to use existing ontologies as the basis for new ones. Tools such as SAMBO [152] align ontologies by defining relations between terms in different ontologies. The aligned ontologies can then be merged into a new ontology.

An approach to ontology alignment that is of particular interest in the context of this work was proposed by Lambrix et al. [155]. The authors suggest using a literature corpus classified with ontology concepts for the alignment of two ontologies. The idea is that a similarity measure between concepts in different ontologies can be defined based on relationships between the documents in which they are used.
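The idea of a document-based similarity measure can be illustrated with a minimal sketch. It assumes that each concept is associated with the set of documents classified with it and uses the Jaccard overlap of these sets as the similarity. The Jaccard measure, the threshold, and the toy data are assumptions chosen for illustration and are not necessarily the measure used by Lambrix et al. [155].

```python
def jaccard(docs_a, docs_b):
    """Jaccard overlap of two document-id sets; 0.0 if both are empty."""
    union = docs_a | docs_b
    return len(docs_a & docs_b) / len(union) if union else 0.0


def align(corpus_a, corpus_b, threshold=0.5):
    """Suggest concept pairs whose document sets overlap strongly.

    corpus_a and corpus_b map concept names to sets of document ids.
    Returns (concept_a, concept_b, score) tuples, best score first.
    """
    pairs = []
    for concept_a, docs_a in corpus_a.items():
        for concept_b, docs_b in corpus_b.items():
            score = jaccard(docs_a, docs_b)
            if score >= threshold:
                pairs.append((concept_a, concept_b, score))
    return sorted(pairs, key=lambda pair: -pair[2])


# Hypothetical toy data: document ids classified with each concept.
onto_a = {"apoptosis": {1, 2, 3, 4}, "cell cycle": {5, 6}}
onto_b = {"programmed cell death": {2, 3, 4, 7}, "mitosis": {6, 8}}

suggestions = align(onto_a, onto_b)
for concept_a, concept_b, score in suggestions:
    print(f"{concept_a} ~ {concept_b}: {score:.2f}")
```

With this toy data only the pair "apoptosis" and "programmed cell death" exceeds the threshold, because three of the five documents in the union are shared.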

2.5.2

Search in biomedical data

Some ontologies were created primarily to annotate biomedical data. The Gene Ontology (GO) [98] is used for the functional annotation of known genes. The AmiGO browser [13] allows users to search the GO database for genes and gene products annotated with GO terms. The GO is also used in numerous web-based [71, 171] and standalone tools [291, 293] for analysing microarray experimental results, where a list of differentially expressed genes is given as input and a functional description of these genes is the output. Khatri and Drăghici [141] compare 14 such tools and discuss their current limitations.
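One common way to turn a gene list into a functional description is an over-representation test. The following sketch assumes a one-sided hypergeometric test per term; the function names, the term labels, and the gene identifiers are hypothetical, and the individual tools may use other statistics and multiple-testing corrections.

```python
from math import comb


def hypergeom_tail(total, annotated, drawn, hits):
    """P(X >= hits) when drawing `drawn` genes out of `total`,
    of which `annotated` carry the term."""
    upper = min(annotated, drawn)
    favourable = sum(
        comb(annotated, i) * comb(total - annotated, drawn - i)
        for i in range(hits, upper + 1)
    )
    return favourable / comb(total, drawn)


def enrich(selected, annotations, background):
    """Return {term: p-value} for terms hit by the selected genes."""
    background = set(background)
    selected = set(selected) & background
    result = {}
    for term, genes in annotations.items():
        genes = set(genes) & background
        hits = len(selected & genes)
        if hits:
            result[term] = hypergeom_tail(
                len(background), len(genes), len(selected), hits
            )
    # Most significant terms first.
    return dict(sorted(result.items(), key=lambda item: item[1]))


# Hypothetical toy data: ten genes, two annotation terms.
background = [f"g{i}" for i in range(10)]
annotations = {
    "term_A": {"g0", "g1", "g2"},
    "term_B": {"g2", "g3", "g4", "g5", "g6", "g7", "g8"},
}
p_values = enrich({"g0", "g1", "g2"}, annotations, background)
```

In this toy case all three selected genes carry term_A, so term_A receives the smallest p-value (1/120), while term_B is hit only once and is not over-represented.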

The NCI Thesaurus [114] indexes cancer-related research data with categories such as findings, drugs, therapies, anatomy, genes, pathways, cellular and subcellular processes, proteins, and experimental organisms.

BioPrompt [56] is an ontology-based clustering tool for searching in biological databases. BioPrompt defines a document as a biological sequence plus its associated meta-data. Several ontology-based hierarchical clustering strategies offer different views on large sets of database entries.

2.5.3

Data exchange

The MGED Ontology is used to describe microarray experiments. MAGE-ML is an XML grammar used for the interchange of microarray experimental data between researchers and programs [18]. BioPAX is a collaborative effort to create a data exchange format for biological pathway data. Pathway databases such as KEGG [138], BioCyc [139] and Reactome [135] export their data in BioPAX format. Pathways in BioPAX format can be visualized with tools such as PATIKA [63], a web-based integrated environment dealing with pathways, and Cytoscape [244], a visualization software for graphs and networks.

2.5.4

Information integration

The TAMBIS ontology [259] can be used to formulate high-level queries over biological entities stored in different databases. The system translates such a query into queries against the source models, which are mapped to the concept model, and executes them on the selected databases. A wrapper for each data source was developed manually. Information integration using ontologies in biology and medicine is a wide field. Pérez-Rey et al. [201] discuss methods of information integration. Among the approaches involving ontologies are single conceptual schemes, where one global conceptualization covers all information. This is the approach of TAMBIS. Any change in the system might require a change of the global conceptualization. In contrast, multiple conceptual schemes describe the semantics of different databases separately. Here the difficulty is moved to the mapping between the schemes, since it is not trivial to map concepts with similar or equal meaning. OBSERVER [175] is a system which maps concepts of different domain ontologies. Hybrid approaches develop independent domain ontologies based on a common ontology defining base concepts. Ontologies of the OBO Foundry [253] are based on the common Relation Ontology [251].
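The single conceptual scheme approach can be sketched as a small mediator. In the sketch, one global mapping translates a query on a concept attribute into field queries against each source, and a wrapper per source executes them. All class, field, and source names are hypothetical, and the sketch does not reproduce the actual TAMBIS implementation.

```python
class DictWrapper:
    """Stand-in for a hand-written wrapper around one data source."""

    def __init__(self, records):
        self.records = records

    def query(self, field, value):
        """Return all records of this source whose field equals value."""
        return [r for r in self.records if r.get(field) == value]


# Hypothetical global conceptualization: each concept attribute is
# mapped to the (source, field) pairs that store it.
MAPPING = {
    "Protein.name": [("source_a", "prot_name"), ("source_b", "label")],
}


def global_query(concept_attr, value, wrappers):
    """Translate one global query into source queries and merge the hits."""
    hits = []
    for source, field in MAPPING[concept_attr]:
        for record in wrappers[source].query(field, value):
            hits.append((source, record))
    return hits


wrappers = {
    "source_a": DictWrapper([{"prot_name": "P1", "length": 120}]),
    "source_b": DictWrapper([
        {"label": "P1", "organism": "mouse"},
        {"label": "P2", "organism": "human"},
    ]),
}
results = global_query("Protein.name", "P1", wrappers)
```

The sketch also shows the drawback named above: adding a source or renaming a field requires a change to the shared mapping.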

2.5.5

Knowledge representation

The Foundational Model of Anatomy ontology (FMA) is a representation of the phenotypic structure of the human body [231]. It comprises 75,000 classes. In contrast to the Gene Ontology, which instantiates only two types of relations, the FMA defines 168 types of relations with 2.1 million instances. Mainly macroscopic anatomical structures such as cells, tissues and organs are represented. The FMA is developed using Protégé and can be viewed on the web with the Foundational Model Explorer. Zhang and Bodenreider [292] used the FMA as one of two large ontologies for the development of lexical methods to map large ontologies to one another, taking into account their semantic structure as well as their terms.

The Edinburgh Mouse Atlas Project (EMAP) [38] is another effort to build an anatomy ontology. To the symbolic representation, a structured collection of terms each corresponding to a particular anatomical concept, it adds an iconic representation of mouse embryos. 3D voxel representations of 26 Theiler stages of mouse development were created, as well as a mapping between the textual anatomical classes and the volumes in the 3D representation. Transition relations between structures of subsequent stages allow the development of organs and tissues to be followed. The modeling of knowledge is a prerequisite for reasoning over it.

2.5.6

Computer reasoning

Computer reasoning uses methods to infer new facts from knowledge stored in ontologies and from asserted facts. Rubin et al. [232] showed that the FMA can be useful as a reference knowledge source to predict the anatomic consequences of penetrating injury. HyBrow [215] is a system that tests the consistency of hypotheses with observed data and prior knowledge by applying constraints and rules.
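A very simple form of such inference is the transitive closure of a relation. The sketch below assumes a transitive part_of relation and derives the facts that follow from a few asserted ones. The anatomical terms are toy examples, and the sketch is not the method used by Rubin et al. [232] or HyBrow.

```python
def transitive_closure(edges):
    """Return all (a, c) pairs implied by a transitive relation."""
    closure = set(edges)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                # a rel b and b rel d imply a rel d.
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


# Hypothetical asserted facts: (part, whole).
asserted = {
    ("ventricle", "heart"),
    ("heart", "cardiovascular system"),
}
inferred = transitive_closure(asserted) - asserted
print(inferred)
```

Here the only new fact is that the ventricle is part of the cardiovascular system. It was never asserted but follows from the two asserted facts.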

2.5.7

Textmining and Natural Language Processing

Ontology-based textmining provides the methodological basis of the work described in this thesis. Some other tools in the life sciences also employ ontologies for the processing of natural language. Textpresso, introduced in section 2.4.2, is an ontology-based search engine built on the scientific literature on C. elegans and selected other domains. Its ontology is a flat list of 101 concepts, and Textpresso maintains a list of regular expressions to identify them. Each concept in the ontology has its own identification algorithm.
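The principle of identifying the concepts of a flat ontology with regular expressions can be sketched as follows. The two concepts, their patterns, and the example sentence are invented for illustration and are not taken from the Textpresso ontology.

```python
import re

# Hypothetical flat list of concepts, one regular expression each.
CONCEPT_PATTERNS = {
    # Gene names written as three or four letters, hyphen, number.
    "gene": re.compile(r"\b[a-z]{3,4}-\d+\b"),
    # A few regulation verbs in any inflected form.
    "regulation": re.compile(r"\b(?:activat|inhibit|repress)\w*\b", re.IGNORECASE),
}


def tag(text):
    """Return (offset, matched text, concept) triples in text order."""
    found = [
        (match.start(), match.group(), concept)
        for concept, pattern in CONCEPT_PATTERNS.items()
        for match in pattern.finditer(text)
    ]
    return sorted(found)


# Invented sentence with made-up gene names.
tags = tag("abc-1 inhibits xyz-22 expression.")
for offset, word, concept in tags:
    print(offset, word, concept)
```

Because every concept carries its own pattern, concepts can be added or refined independently of each other.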

Whatizit [Kirsch et al.] is a web service which processes any free text or list of PubMed abstracts. The user can select between textmining pipelines. The most relevant pipeline with respect to ontology-based textmining tools is the “whatizitGo” pipeline. Any text given to this pipeline is marked up with Gene Ontology cross-references. The XML output is translated into HTML containing hyperlinks to a copy of the original databases. Whatizit provides four categories of modules: (1) basic NLP modules for syntactical information, e.g. sentence splitting and part-of-speech tagging, (2) modules matching controlled vocabularies, e.g. protein names from Uniprot and GO terms, (3) syntax pattern matching modules, e.g. identification of abbreviations, definitions and mutations, and (4) modules for shallow parsing based on cascaded patterns.

Modules identify entities using a finite state automaton built from a large set of regular expressions. The protein name “col1a1” was added to the list of accepted expressions by “(COL1A1|[cC]ol1a1)”. The patterns “the X protein”, “the protein X”, “T domain of NP” and “NP is a protein” are used to detect explicit mentions of proteins. Mutations are detected in the text by the pattern “AA [0-9]+ AA” where AA denotes all variants of an amino acid or nucleic acid. Protein-protein interactions are detected by finding two noun phrases connected by a verb phrase containing one of 21 predefined verbs. At least one of the noun phrases must name a protein or gene. The modules process an XML stream in the manner of a UNIX pipe. Each module processes the appropriate elements and passes the rest unchanged. Elements may not overlap and may not recursively contain elements of the same type. The Java components implement the Runnable interface and must read an InputStream and write an OutputStream.
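The pipe-like chaining of pattern-based modules can be sketched in a few lines. The sketch reuses the protein expression quoted above and the mutation pattern “AA [0-9]+ AA”, with AA restricted to three-letter amino acid codes as a simplifying assumption. The tag names, function names, and example sentence are hypothetical. The sketch works on plain strings rather than on an XML stream and does not enforce the rule that elements may not overlap.

```python
import re
from functools import reduce

# Protein expression as quoted in the text.
PROTEIN = re.compile(r"\b(COL1A1|[cC]ol1a1)\b")

# Assumption: AA covers only the three-letter amino acid codes.
AA = (
    r"(?:Ala|Arg|Asn|Asp|Cys|Gln|Glu|Gly|His|Ile|"
    r"Leu|Lys|Met|Phe|Pro|Ser|Thr|Trp|Tyr|Val)"
)
MUTATION = re.compile(rf"\b{AA} [0-9]+ {AA}\b")


def tag_proteins(text):
    """Wrap protein names in a (hypothetical) protein element."""
    return PROTEIN.sub(r"<protein>\1</protein>", text)


def tag_mutations(text):
    """Wrap mutation mentions in a (hypothetical) mutation element."""
    return MUTATION.sub(lambda m: f"<mutation>{m.group(0)}</mutation>", text)


def pipeline(text, modules):
    """Feed the output of each module into the next, like a UNIX pipe."""
    return reduce(lambda stream, module: module(stream), modules, text)


# Invented example sentence.
marked = pipeline(
    "The Gly 85 Arg variant of col1a1 was studied.",
    [tag_proteins, tag_mutations],
)
print(marked)
```

Each module marks up only what it recognizes and passes the remaining text through unchanged, so modules can be added or reordered without changing the others.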