
3.5 Knowledge Acquisition in a system

3.5.1 Practical Considerations

Even though most of this dissertation describes the theory and practice behind the implementation of a knowledge-acquisition system, it was important to lay out an epistemological background for knowledge acquisition within a system. On the one hand, it is necessary to be aware that information is not the same as knowledge. More importantly, this background makes a clear case for constant verification of the statements that a knowledge-based system operates upon.

This chapter thus laid a background for a methodology that makes it possible to call the acquired statements knowledge, given that the requirements of belief, justification and truth are met. However, these requirements need to be attainable within the system. Hence, I make the following assumptions for a working system:

• Belief - The system holds knowledge in the form of formal statements.

• Justification - Validation measures are carried out in a manner that allows justification. Thus, many people have to agree on a statement for it to be considered in the first place. Further, the algorithms used for information extraction need to have a high degree of precision. This provides a high degree of confidence in the correctness of extracted statements.

• Truth - It is assumed that the information sources used in this work contain mostly correct information. It thus needs to be shown that the aggregation mechanisms that are used are truth-promoting. There is also a (somewhat overoptimistic) notion that one of the first principles of human nature is “a propensity to speak the truth” (Reid, 1764), which would indicate that humans tend to speak the truth more often than not. However, even if this notion is only statistically correct, an aggregation of statements will likely yield a correct outcome. This idea also underlies the “Wisdom of Crowds” (Surowiecki, 2005) paradigm, which gives good indications that a statement that has been asserted by many independent agents and/or validated by independent agents is likely to be correct.
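The statistical argument behind the truth assumption can be illustrated with a small calculation. The sketch below is not part of the system described in this dissertation; it assumes the idealized setting of the Condorcet jury theorem, in which every agent is independent and asserts the truth with the same probability. The function name and the value 0.6 are chosen purely for illustration.

```python
from math import comb


def majority_correct_probability(n_agents: int, p_correct: float) -> float:
    """Probability that a strict majority of independent agents is correct.

    Assumption (not from the source): all agents are independent and each
    one is correct with the same probability p_correct.
    """
    if n_agents % 2 == 0:
        raise ValueError("use an odd number of agents to avoid ties")
    majority = n_agents // 2 + 1
    # Sum the binomial probabilities of every outcome with a correct majority.
    return sum(
        comb(n_agents, k) * p_correct**k * (1 - p_correct) ** (n_agents - k)
        for k in range(majority, n_agents + 1)
    )


# Agents that are only slightly better than chance (hypothetical p = 0.6).
for n in (1, 11, 101):
    print(n, round(majority_correct_probability(n, 0.6), 3))
```

Under these assumptions the probability of a correct majority grows with the number of agents whenever each agent is right more often than not, and shrinks when each agent is wrong more often than not. Real sources are rarely fully independent, so the figures should be read as an upper-bound intuition rather than as a guarantee.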

Chapter Four

Knowledge Engineering - Based Domain Model Creation

The world is the totality of facts, not of things. (Wittgenstein, 1922)

This chapter describes a knowledge engineering approach to knowledge acquisition in a tightly coupled system. It provides a contrasting view to the loosely coupled knowledge acquisition system that is presented in chapter 5. Here I demonstrate how a domain definition and part of the domain description can be created manually, based on expert agreements about the knowledge in a domain. In addition, an automated domain description algorithm is used to add structural and factual knowledge.

This chapter is meant to contrast knowledge acquisition for explicit knowledge with an attempt to encode a combination of tacit and explicit knowledge that constitutes a deep understanding of the domain at hand. Many of the triples that encode knowledge in the ontology described here could not be extracted using the general-purpose methods described in sections 5.3 and 5.4.

At the time the research underlying this chapter was conducted, the Web Ontology Language (OWL) was still in its first iteration. The next iteration, OWL 2, introduced some new features, such as punning, that could potentially change some of the formalizations in the ontology.

4.1 Introduction

The field of bioinformatics has seen a dramatic increase in available ontologies for many of the life sciences domains. The ontologies in the OBO project¹, especially the Gene Ontology (GO) (Ashburner et al., 2000) with its comprehensive schema and thousands of instances, take leading roles. As a broad lexicon or dictionary, GO serves one of the major purposes of ontologies: facilitating agreement. However, it is not designed for extensive computational use, so the amount of inference that can be done with the knowledge is limited. Only two types of relationships between the entities in the ontology are formalized: “is a” and “part of”.
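To make the limits of such a model concrete, the following sketch shows the kind of inference that two relationship types allow: following chains of “is a” and “part of” links upward. It is an illustration under stated assumptions, not GO tooling. The three edges are toy examples rather than entries copied from GO, and the composition rule used here (a chain of “is a” links yields “is a”, while any chain containing a “part of” link yields “part of”) is assumed for the example.

```python
from collections import defaultdict

# Toy edges (child, relation, parent); illustrative only, not taken from GO.
EDGES = [
    ("nucleolus", "part_of", "nucleus"),
    ("nucleus", "is_a", "organelle"),
    ("organelle", "part_of", "cell"),
]


def ancestors(term, edges):
    """Return {ancestor: derived relation} for every term reachable from term.

    Assumed composition rule: only is_a links give is_a; a chain that
    contains at least one part_of link gives part_of.
    """
    graph = defaultdict(list)
    for child, rel, parent in edges:
        graph[child].append((rel, parent))

    result = {}
    stack = [(term, "is_a")]  # an empty chain behaves like is_a
    while stack:
        node, rel_so_far = stack.pop()
        for rel, parent in graph[node]:
            combined = "part_of" if "part_of" in (rel_so_far, rel) else "is_a"
            previous = result.get(parent)
            # Skip if nothing new is learned; is_a is kept as the stronger result.
            if previous == combined or previous == "is_a":
                continue
            result[parent] = combined
            stack.append((parent, combined))
    return result


print(ancestors("nucleus", EDGES))
# {'organelle': 'is_a', 'cell': 'part_of'}
```

Everything such a model can derive is a variation of this upward traversal. Statements such as cardinality constraints or restrictions on which entities may be linked cannot be expressed with these two relationships alone.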

An ontology that provides rich, machine-accessible relationships must be rigidly formalized. Knowledge modeling languages such as KIF (Genesereth et al., 1992), RDF (Klyne et al., 2004) or the W3C-recommended Web Ontology Language OWL (Horrocks et al., 2003) allow such formalizations with different degrees of expressiveness. OWL promises to be a good compromise between expressiveness and computational complexity on the one hand and versatility and simplicity on the other.

This chapter focuses on issues related to representation, expressiveness, granularity and instance population in the development of the Glycan Structure Ontology GlycO. It is one of the ontologies designed as part of a suite of web-accessible ontologies for the glycoproteomics domain, alongside the enzyme function ontology EnzyO and the provenance ontology ProPreO (Sahoo et al., 2006). The goal of this suite is to provide a basis for description, annotation and reasoning, such that every step from experimental setup through experimental conduct and analysis to the acquisition of hypotheses and theories can be formalized. This work was conducted in the context of the “BioInformatics for Glycan Expression” core of the NCRR Integrated Resource for Biomedical Glycomics project at the Complex Carbohydrate Research Center (CCRC) of the University of Georgia.

¹ OBO: Open Biomedical Ontologies

Glycans are complex carbohydrate structures that play key roles in the development and maintenance of living cells. Glycans are built from simpler monosaccharide residues (such as mannose and glucose), which constitute the nodes of tree structures whose edges are the chemical bonds between the residues. The synthesis of these glycans in organisms is an intricate process that can be modeled as a collection of biosynthetic pathways. At each step in such a pathway, an enzyme-catalyzed reaction “adds” a new residue as a leaf to an existing structure or “moves” a whole sub-tree to a different parent.
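The tree view of glycans and the two kinds of pathway steps can be sketched as a small data structure. This is a minimal illustration and not the GlycO representation: the class `Residue`, the functions `move_subtree` and `render`, and the linkage labels such as "b1-4" are hypothetical names, and the example structure is a toy that makes no claim to biochemical accuracy.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)  # compare residues by identity, not by field values
class Residue:
    name: str
    linkage: Optional[str] = None  # bond to the parent; None for the root
    children: List["Residue"] = field(default_factory=list)

    def add(self, name: str, linkage: str) -> "Residue":
        """Pathway step 'add': attach a new residue as a leaf."""
        child = Residue(name, linkage)
        self.children.append(child)
        return child


def contains(root: Residue, node: Residue) -> bool:
    """True if node is root itself or lies anywhere below root."""
    return root is node or any(contains(c, node) for c in root.children)


def move_subtree(old_parent: Residue, subtree: Residue,
                 new_parent: Residue, linkage: str) -> None:
    """Pathway step 'move': re-attach a whole sub-tree to a different parent."""
    if contains(subtree, new_parent):
        raise ValueError("cannot move a sub-tree beneath itself")
    old_parent.children.remove(subtree)
    subtree.linkage = linkage
    new_parent.children.append(subtree)


def render(node: Residue) -> str:
    """Write the tree as nested parentheses, e.g. root(child(leaf))."""
    if not node.children:
        return node.name
    return node.name + "(" + ",".join(render(c) for c in node.children) + ")"


root = Residue("glucose")
branch = root.add("mannose", "b1-4")
leaf_a = branch.add("mannose", "a1-3")
leaf_b = branch.add("glucose", "a1-6")
print(render(root))  # glucose(mannose(mannose,glucose))

move_subtree(branch, leaf_b, leaf_a, "a1-2")
print(render(root))  # glucose(mannose(mannose(glucose)))
```

Storing the linkage on the child keeps each edge attached to exactly one node, which matches the description of bonds as edges between a residue and its parent.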

It is well established that, alongside genes and proteins, glycans play a major role in cell functions. The aim of glycoproteomics is to understand cellular processes that are mediated by the interaction of proteins, the genes that encode them, and the glycans that are attached to them. The goal in developing GlycO has been to assess the extent to which knowledge in this domain can be logically formalized to facilitate the discovery and specification of relationships between the glycan structures, their metabolism, and their functions. Among the challenges faced were the limited expressiveness of the chosen OWL-DL standard and mereological issues of granularity.

The main contributions of this work include:

• Creating a more meaningful domain model by

– Building a Domain Definition that captures the richness of the domain using expressive language, especially restrictions

– Supporting modeling of molecular structures that are important for domain scientists

– Rigorous modeling with contextual archetypal instances used as building blocks

• Creating a Domain Description for the ontology by extracting and disambiguating instance information from multiple heterogeneous sources

• Allowing for more meaningful queries by formalizing knowledge that is usually inferred in database models

• Addressing granularity issues

Following this introduction, section 4.2 will describe the conceptualization and formalization of the glycoproteomics domain in GlycO. Section 4.3 will detail the sources and algorithms used for the automatic population; section 4.4 will evaluate GlycO and discuss the impact it can have on biochemical applications. Finally, section 4.5 concludes the chapter.
