Scope - Knowledge Acquisition in a System

This dissertation will outline a theory and practice of knowledge acquisition in collaborative environments or systems. It will juxtapose two different styles of knowledge acquisition that are each appropriate in different types of systems: a Knowledge Engineering-centric approach that can be most successful in a smaller, tightly coupled environment, and an Information Extraction-centric approach that is suitable within the scope of a large system such as the World Wide Web.

Chapter 2 gives a summary of the conceptual and technical contributions of the dissertation. Chapter 3 introduces epistemological concepts, namely knowledge, justification, belief, and truth, and describes conceptual differences in the acquisition of knowledge between individuals, groups and systems. Chapter 4 describes the design of an ontology in the biochemical domain. It demonstrates the need for complex representations in specialized domains and introduces domain-dependent ways to automatically increase the amount of knowledge, in contrast to the domain-independent knowledge acquisition in the following chapters. Chapter 5 describes efforts to automatically create taxonomies or hierarchies of domains of interest by connecting concepts in a hierarchy with semantic relations. Because automatic extraction does not produce infallible results, Chapter 6 shows how to verify the automatically extracted information using community efforts and human computation. Finally, Chapter 7 concludes the dissertation.

Chapter Two

Overview

This dissertation takes on some issues discussed in Welty and Murdock (2006). In their discussion of the integration of Knowledge Engineering and Information Extraction the authors identify five dimensions of interoperability problems:

1. Precision

2. Recall

3. Relationships

4. Annotation vs. Entities

5. Scalability

The discussion brings up some important points. Precision in IE is to date never reliably perfect, which means that for a Knowledge Acquisition task, IE techniques by themselves are not adequate. Recall is problematic, because in most cases, a human reader would get more information from a document than any generic automated method can.

However, the assumption in this dissertation is that conceptual knowledge is not extracted from a single document, but from the entire available document collection. Extraction that aims at getting the most information out of a single document should be seen in the context of text summarization rather than knowledge acquisition. The hypothesis of this dissertation also states that a single mention of a statement is not reliable evidence for its factual character.

Relationships (i.e. n-ary predicates with n ≥ 2) are harder to extract than type/class assignments, which are unary predicates. In free text, however, unary predicates are usually expressed in the same general ways as binary predicates. We seem to think about a type relationship as is_a(entity, type) rather than type(entity), for example is_a(Opus, Penguin) instead of Penguin(Opus).
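The rewriting of a unary type predicate into a binary one can be illustrated with a minimal sketch. The `Statement` tuple, the `unary_to_binary` helper and the `eats` example are hypothetical names introduced here for illustration only; they are not part of the work described in this dissertation.

```python
from typing import NamedTuple


class Statement(NamedTuple):
    """A binary statement predicate(subject, obj). Hypothetical helper type."""
    predicate: str
    subject: str
    obj: str


def unary_to_binary(type_name: str, entity: str) -> Statement:
    """Rewrite the unary predicate type_name(entity) as is_a(entity, type_name)."""
    return Statement("is_a", entity, type_name)


# Penguin(Opus) becomes is_a(Opus, Penguin) ...
type_statement = unary_to_binary("Penguin", "Opus")

# ... which has the same shape as any other binary relationship
# (the 'eats' statement is an invented example).
other_statement = Statement("eats", "Opus", "Herring")

print(type_statement)
print(other_statement)
```

Once type assignments have this shape, the same extraction machinery can in principle treat them like any other binary relationship.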

The issue of annotation vs. entities brings up an important point, even though I believe that the term ‘annotation’ is poorly chosen and should rather be ‘entity mention’ or ‘entity reference’, because the entity mention does not present a remark about the entity, but rather refers to it. The point remains, however, that text only refers to entities; it does not “contain” them. Even disambiguation and coreference resolution by themselves do not solve this problem, because there is still no grounding of the concepts that are mentioned in the text. This dissertation therefore starts with a grounded representation of concepts and attempts to find the referring concept mentions in text. The issue of coreference resolution is sidestepped by assuming that in a large enough corpus, facts will be expressed multiple times, possibly with different concept designators/labels for different occurrences. It is then the collection of factual expressions, rather than a single occurrence, that gives the algorithms confidence in the formal statements that are extracted from these expressions.
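The idea of pooling evidence over many mentions can be sketched as follows. This is an illustration under stated assumptions, not the algorithm used in this dissertation: the label table `LABELS`, the helpers `ground`, `count_support` and `accepted`, and the simple count threshold `min_mentions` are all hypothetical, and a plain count stands in for whatever confidence measure an actual system would use.

```python
from collections import Counter

# Hypothetical grounding table: surface labels mapped to grounded concepts.
LABELS = {
    "opus": "Opus",
    "opus the penguin": "Opus",
    "penguin": "Penguin",
    "penguins": "Penguin",
}


def ground(label):
    """Map a surface label to its grounded concept, or None if unknown."""
    return LABELS.get(label.strip().lower())


def count_support(mentions):
    """Count how often each grounded (subject, relation, object) is expressed."""
    support = Counter()
    for subj_label, relation, obj_label in mentions:
        subj, obj = ground(subj_label), ground(obj_label)
        if subj is not None and obj is not None:  # skip ungrounded mentions
            support[(subj, relation, obj)] += 1
    return support


def accepted(support, min_mentions=2):
    """Keep statements expressed at least min_mentions times (assumed threshold)."""
    return {stmt for stmt, n in support.items() if n >= min_mentions}


# Invented mentions using different labels for the same concepts.
mentions = [
    ("Opus", "is_a", "penguin"),
    ("Opus the Penguin", "is_a", "penguins"),
    ("Opus", "is_a", "dog"),      # 'dog' is not grounded, so this is skipped
    ("penguin", "is_a", "Opus"),  # expressed only once
]
support = count_support(mentions)
print(accepted(support))
```

Because both labels of each concept map to the same grounded identifier, the two differently worded mentions reinforce one statement without any coreference resolution step.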

The interoperability that is highlighted by Welty and Murdock (2006) raises the question of how much of each technique should be used in an application. I will address these issues from two different directions, exemplified by two Knowledge Acquisition projects: one that is more Knowledge Engineering oriented and one that is more Information Extraction oriented (see Figure 2.1).

Figure 2.1: Classification of the work in this dissertation in terms of Knowledge Engineering vs. Information Extraction

The circle of knowledge as an abstract guideline for Knowledge Acquisition assumes a current state of knowledge that is then improved by learning and validating new knowledge items, i.e. concepts and facts involving both the learned and the already known concepts. Before the validation phase, facts are merely statements that are waiting to be verified. This dissertation offers two approaches to knowledge acquisition within this framework. They differ in their assumptions: in some cases, background knowledge is mostly tacit and needs to be formalized by domain experts and knowledge engineers in a discourse, whereas in other cases, the background knowledge needed is available in the form of formal statements. In either case, the learning phase is split into the acquisition of a Domain Definition and a Domain Description. This reflects the idea that a) the presence and the definition of concepts is usually less contentious than their properties and b) formal concept designators are more abundantly available and can more easily be automatically extracted. In a highly axiomatized system, validation can be done deductively using automated reasoning techniques, whereas in systems that lack a clear axiomatic underpinning, validation requires extensive human involvement.

Depending on the application of the knowledge, it is acceptable to have a shallower representation or a more in-depth representation of the domain knowledge. Ontologies or domain models for Information Retrieval applications do not need to be highly expressive in order to be useful. Rather, they should contain a fair number of entities that are important for a domain and ideally have alternative terms/synonyms for the entities and concepts. In applications that require reasoning over data, it is important to have a highly expressive ontology that is able to deduce or to be queried for complex relationships between concepts or entities.

The work on GlycO (Chapter 4) is an example of the latter. It mostly uses Knowledge Engineering techniques to create a highly expressive T-Box as well as a set of so-called “Archetypal Instances” that encode domain experts’ knowledge about complex carbohydrate structures and interactions. However, it also has an information extraction part that uses the knowledge in the Archetypal Instances to extract formal descriptions of complex carbohydrate structures from text or carbohydrate databases. The archetypal instances encode tacit domain knowledge in addition to explicit knowledge, i.e. properties that a real-world structure corresponding to the instance has, and that are well known but rarely expressed by experts, are made explicit in the formal description. Chapter 4 will describe this interoperation of strong KE and weak IE in depth.

The work on Doozer/Doozer++ (Chapter 5) is an example of the former. Even though its mechanism to create ontologies or domain models is fully automated, it takes advantage of weak Knowledge Engineering techniques in the form of socially constructed knowledge bases, such as Wikipedia and DBpedia (Bizer et al., 2009b).

2.1 Terminology

The terminology used in this work will mostly be familiar. However, I will define some terms here to ensure a consistent understanding.

Terms denoting the work that underlies this dissertation

GlycO Complex Carbohydrate (Glycan) domain ontology.

Doozer Domain Hierarchy creation application.

Doozer++ Current evolution of Doozer, including fact extraction.

Technical terminology used throughout this dissertation

Concept Generally defined as a unit of knowledge. Here it refers to an individual, a class or a property type in an ontology or a domain model.

Term Single noun or compound noun phrase that denotes a concept.

Entity A thing with a separate, self-contained existence. Here it refers to an individual in an ontology or a domain model.

Class An entity type.

Category Structure on Wikipedia to organize articles into a hierarchy.

Statement An assertion of a relationship involving concepts.

Fact Actual state of affairs, manifested in a validated statement.

Pattern A set of things (events, objects, etc.) that occur and repeat in a predictable manner. In this dissertation, a pattern is always a sequence of textual tokens that represent a binary relationship.
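To make the definition of a pattern concrete, the following minimal sketch treats a pattern as a list of tokens and reports the tokens on either side of each occurrence as the two arguments of the binary relationship. The function `find_pattern` is a hypothetical name, and restricting each argument to a single token and splitting on whitespace are simplifying assumptions that do not reflect how terms are actually recognized in this work.

```python
def find_pattern(tokens, pattern):
    """Return (left, right) token pairs surrounding each occurrence of pattern.

    Simplifying assumption: each argument of the relationship is one token.
    """
    hits = []
    n = len(pattern)
    # i is the start of the pattern; a token must exist on both sides of it.
    for i in range(1, len(tokens) - n):
        if tokens[i:i + n] == pattern:
            hits.append((tokens[i - 1], tokens[i + n]))
    return hits


# Invented example sentence; whitespace tokenization is assumed.
sentence = "Opus is a penguin".split()
pairs = find_pattern(sentence, ["is", "a"])
print(pairs)
```

Each returned pair is only a candidate statement in the sense defined above; it becomes a fact once it has been validated.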

2.2 Knowledge Engineering oriented knowledge
