2.3 Information Extraction oriented knowledge acquisition
2.3.2 Automatic Domain Model Creation
The conceptual separation of Domain Definition and Domain Description is reflected in the use of different techniques for the two tasks. If the domain is defined by extracting known concepts from a corpus that already contains clearly delineated concepts with unique identifiers, ambiguities can be avoided and the description step can work on the basis of concepts rather than just terms. Conversely, in many NLP-based approaches, ontology learning is based on promoting phrases to predicates. In these cases, no clear denotation and no clear designator are given, because only concept mentions are extracted. Thus the statements in the ontology often fail to refer to actual entities, events or states of affairs. If we have information sources that give us an idea of the identity of concepts and specific types of relationships, we know that the extracted concepts and relations do actually refer.
To put the idea of combining top-down and bottom-up approaches in context, I will point out a main difference from previous approaches. Buitelaar et al. (2005) present an ontology learning layer cake with terms on the bottom and rules on top (Figure 2.2). The layer cake suggests that the learner goes from terms to synonyms to concepts and concept hierarchies, before adding relationships and rules. This approach can and should be taken when the only information available is in the form of raw text. However, the steps from terms and synonyms to concepts are error-prone when done automatically. When conceptual knowledge is available in the form of taxonomies, encyclopedias or thesauri, this knowledge should be harvested. Humans are much better at intuitively identifying concepts than machines are, and this capability is reflected in these knowledge sources. For this reason, the approach taken here starts with concepts rather than terms. Instead of extracting previously unknown concepts from text, concepts are assumed to be available, and their existence merely needs to be verified in the text corpus. Once evidence for the existence of a concept is found in the text corpus, further knowledge about the concept can be gained through descriptions in the form of relationships (see Figure 2.3).
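The concept-first step described above can be illustrated with a minimal sketch. It assumes that concepts arrive with known designators (for example, article titles and their synonyms) and that "verification" simply means finding at least one mention in the corpus. The function name `verify_concepts`, the `min_docs` threshold and the toy data are hypothetical and are not taken from the system described in this work.

```python
import re
from collections import defaultdict


def verify_concepts(concepts, corpus, min_docs=1):
    """Return {concept_id: set of document indices} for concepts with corpus evidence.

    concepts: dict mapping a concept identifier to a list of designators (labels).
    corpus:   list of document strings.
    min_docs: assumed threshold for how many documents must mention a concept.
    """
    evidence = defaultdict(set)
    for concept_id, labels in concepts.items():
        # One case-insensitive, word-bounded pattern per concept.
        pattern = re.compile(
            r"\b(?:" + "|".join(re.escape(label) for label in labels) + r")\b",
            re.IGNORECASE,
        )
        for doc_index, text in enumerate(corpus):
            if pattern.search(text):
                evidence[concept_id].add(doc_index)
    return {cid: docs for cid, docs in evidence.items() if len(docs) >= min_docs}


# Toy data: concepts are given up front; the corpus only confirms them.
concepts = {
    "Aspirin": ["aspirin", "acetylsalicylic acid"],
    "Ibuprofen": ["ibuprofen"],
    "Warfarin": ["warfarin"],
}
corpus = [
    "Aspirin inhibits platelet aggregation.",
    "Patients taking acetylsalicylic acid or ibuprofen reported fewer symptoms.",
]
verified = verify_concepts(concepts, corpus)
print(verified)  # "Warfarin" is dropped because no document mentions it
```

Because the identifiers come from the knowledge source rather than from the text, two different surface forms ("aspirin", "acetylsalicylic acid") count as evidence for the same concept, which is the ambiguity-avoiding property the paragraph argues for.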
Figure 2.2: Traditional Ontology Learning Layer Cake
Figure 2.3: Doozer++ Ontology Learning Layer Cake
To summarize the conceptual considerations, this work is built on the following premises:
1. Humans succeed at identifying, defining and describing concepts.
2. Ontologies represent a human conceptualization and abstraction of the world.
3. Unambiguous (semi-)formal concept designators and identifiers are available in greater abundance than concept descriptors and relations.
(a) Encyclopedias, glossaries and vocabularies provide concept designators
(b) Community-created or peer-reviewed encyclopedias, glossaries and vocabularies express a shared view of a domain
⇒ Extract domain definition top-down from such corpora
4. Concept descriptions such as attributes and relationships are plentiful in informal text.
5. Concept descriptions are manifested in multiple documents.
6. An aggregation of multiple statements about a concept yields a more accurate description.
(a) A macro-reading-based (Mitchell et al., 2009) Information Extraction approach aggregates distributed information
(b) Pattern-based IE inherently conforms to an aggregative process
⇒ Extract/improve domain description bottom-up from free text
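Premises 5 and 6 can be made concrete with a small sketch of aggregative, pattern-based extraction. It assumes two hand-written surface patterns and uses the number of distinct supporting documents as a crude stand-in for a confidence measure. The patterns, the function name `aggregate_statements` and the support threshold of 2 are illustrative assumptions, not the extraction method used in this work.

```python
import re
from collections import defaultdict

# Hypothetical surface patterns; each one maps to a relation name.
PATTERNS = {
    "treats": re.compile(r"(\w+) (?:treats|is used to treat) (\w+)", re.IGNORECASE),
    "causes": re.compile(r"(\w+) (?:causes|can cause) (\w+)", re.IGNORECASE),
}


def aggregate_statements(corpus):
    """Map each extracted (subject, relation, object) triple to its document support."""
    support = defaultdict(set)
    for doc_index, text in enumerate(corpus):
        for relation, pattern in PATTERNS.items():
            for subject, obj in pattern.findall(text):
                # Record which document supports the triple, not how often it occurs.
                support[(subject.lower(), relation, obj.lower())].add(doc_index)
    return {triple: len(docs) for triple, docs in support.items()}


corpus = [
    "Aspirin treats headache.",
    "Clinicians note that aspirin is used to treat headache in adults.",
    "Aspirin can cause nausea.",
    "One forum post claims that aspirin treats insomnia.",
]
counts = aggregate_statements(corpus)

# Keep only statements that are supported by at least two documents.
reliable = {triple for triple, n in counts.items() if n >= 2}
print(reliable)  # {('aspirin', 'treats', 'headache')}
```

A statement seen in a single document is not discarded as false; it simply has too little aggregated evidence to be trusted yet, which is the sense in which aggregation yields a more accurate description.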
Domain Definition A focused set of concepts that define the scope of a domain provides a grounding and a contextualization of the knowledge acquisition task. Domain Definition is accomplished by restricting existing structured or semi-structured sources to contain only concepts pertinent to a focus domain. In this work, Wikipedia is used as a knowledge source, under the assumption that most concepts and entities that most users are interested in are represented by articles in Wikipedia. Domain Definition can be compared to engineering a T-Box of an ontology, which specifies the types of concepts and relationships/attributes that can exist in a domain. However, a T-Box usually does not contain definitions of entities, whereas the Domain Definition incorporates entities that are of interest in the domain.
Domain Description According to the procedural definition of knowledge, the step from information to knowledge is from the “know-what” to the “know-how”. With a domain definition, it is already known which concepts are of interest. In this next step, the “know-how” is acquired by finding facts involving the domain concepts in order to have a description of the concept interactions and dependencies. Facts put the mere definitions of concepts into perspective by relating them to other concepts or endowing them with attributes. By definition, facts are verified statements that refer to actual states of affairs. Hence it is important to have a measure of confidence as well as rigorous testing of extracted statements in order to assure correctness.
Evaluation and Validation in Use The idea of extrinsic evaluation, or evaluating an algorithm in use, i.e., analyzing the user’s interaction with the software, has been of interest mostly in the design community. Bannon (1996) realized early that evaluation should be an integral part of an application. His work is concerned with design applications, but the principle applies to IR, IE or Knowledge Acquisition as well.
My goal is to integrate the validation seamlessly into the application of the knowledge.
In IR research, great strides have been made, notably by Agichtein et al. (2006a) and Bian et al. (2008), to tune Web search ranking to selection preferences. However, finding more relevant results closer to the top of search results is mostly an issue of convenience rather than of absolute correctness. For Knowledge Acquisition, the stakes are higher in terms of the accuracy required of facts that a user interaction should provide. To be absolutely certain about the correctness of a fact, it has to be given directly to a panel of experts who approve or disapprove it. Given the number of assertions that can be extracted using automated methods, though, this is not feasible for all facts. User interaction with extracted statements can point to those that are most likely correct. In Chapter 6 I will discuss the methodology behind the envisioned “Validation in Use”.
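As a rough illustration of how user interaction might point to likely-correct statements while reserving expert review for the uncertain remainder, consider the following sketch. It is not the methodology of Chapter 6. The `Feedback` class, the Laplace-smoothed approval rate and the thresholds `accept_at` and `reject_at` are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Feedback:
    """Implicit or explicit user signals collected for one extracted statement."""
    approvals: int = 0
    rejections: int = 0

    def confidence(self) -> float:
        # Laplace-smoothed approval rate; equals 0.5 when no feedback exists yet.
        return (self.approvals + 1) / (self.approvals + self.rejections + 2)


def triage(feedback_by_statement, accept_at=0.8, reject_at=0.2):
    """Split statements into accepted, rejected and those left for expert review."""
    accepted, rejected, for_experts = [], [], []
    for statement, feedback in feedback_by_statement.items():
        score = feedback.confidence()
        if score >= accept_at:
            accepted.append(statement)
        elif score <= reject_at:
            rejected.append(statement)
        else:
            for_experts.append(statement)
    return accepted, rejected, for_experts


feedback_by_statement = {
    ("aspirin", "treats", "headache"): Feedback(approvals=8, rejections=0),
    ("aspirin", "treats", "insomnia"): Feedback(approvals=0, rejections=6),
    ("aspirin", "causes", "nausea"): Feedback(approvals=1, rejections=1),
}
accepted, rejected, for_experts = triage(feedback_by_statement)
print(for_experts)  # [('aspirin', 'causes', 'nausea')]
```

Under these assumptions the expert panel only sees statements on which user feedback is sparse or divided, which keeps the expensive form of validation within feasible bounds.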
Chapter Three
Epistemological Foundations
We never are definitely right, we can only be sure we are wrong.
(Richard Feynman (1967), The Character of Physical Law, p. 152.)
3.1 Introduction
This chapter outlines an epistemological background to knowledge representation, acquisition and propagation as it pertains to this dissertation. It will discuss different definitions of the concept Knowledge and will justify the hermeneutic approach to knowledge acquisition that the dissertation is built upon. Further, it will discuss the change in the concept of knowledge when it is seen in light of a system rather than an individual, i.e., the notions of subjective and objective knowledge.
In the introduction I gave a functional definition of knowledge in the context of the DIKW hierarchy (Rowley, 2007; Ackoff, 1989). This definition was as follows:
Knowledge is know-how, and is what makes possible the transformation of information into instructions. Knowledge can be obtained either by transmission from another who has it, by instruction, or by extracting it from experience. (Rowley, 2007)
Other functional definitions from the more recent literature go in the same direction. Bird (2010), for example, sees knowledge as an input to deliberation and action. In these functional views, knowledge usually guides and enables action (Hawthorne and Stanley, 2008).
A functional view of knowledge becomes important in the application of the knowledge models that are the focus of this work. However, this view also fits the term actionable information and may thus encompass pieces of information that cannot formally be called knowledge according to more epistemologically rigid definitions. For example, the above definition does not require a statement to be correct in order to qualify as knowledge. In a normative sense, actions and instructions should be based on correct information that fits a narrower definition of knowledge. In the following section I will give a categorical definition that identifies knowledge as “justified true belief” and thus implies correctness.