Developing A Knowledge Based Index to Distributed, Disparate Data Sources

Kathleen Lossau¹, Israel Mayk², Christopher York¹, Erik Eilerts¹

¹Austin Info Systems (AIS), Austin, Texas

²U.S. Army Communications-Electronics Command (CECOM), Fort Monmouth, NJ

Abstract

In this paper we discuss our approach, progress, and results in developing a set of prototypes that extract, organize, and retrieve relevant doctrinal material for C2 applications. The tool under development is called the “Knowledge-Based Discovery Tool” (KBDT*). KBDT is currently being expanded to support multiple user-defined domains instead of concentrating solely on the military doctrine domain. KBDT extracts content from documents and builds knowledge indexes based on meaning rather than keywords. Searching the KBDT indexes provides more meaningful and accurate retrieval of information than traditional document retrieval technology. KBDT also includes visualization technology to display relevant results that can be integrated into existing applications. This paper addresses the components of KBDT that create the knowledge-based indexes, extend the domain ontology, and search the knowledge index for relevant information.

Overview

The KBDT knowledge base is built using a suite of components that improve the indexing of and subsequent search for on-line information. This suite consists of the following components:

✦ Knowledge Extractor - a user interface that parses on-line information (e.g., HTML, PDF, and Microsoft Word documents) and extracts content (not keywords) from actual or summarized text.

✦ Ontology - concepts / definitions that are used for content extraction. Concepts are collections of synonymous words (cf. Fellbaum, 1998), with links between concepts drawn from a fixed set of relationships such as "has-children", "has-members", and "has-parts".

✦ Knowledge Assertions - an assertion is a collection of concepts defined by a set of related words or synonyms. As content is extracted from documents, those concepts that semantically belong together are tagged as a ‘knowledge assertion’. Sets of assertions are used as keys in the knowledge indexes. Concepts point to the set of assertions that reference them. Each assertion points to the document that it was acquired from.

✦ Search Interface - an API that searches the knowledge base using a context-based query and returns a set of assertions and their related URL information.

* This work is supported in part by a CECOM C2SID Phase II SBIR Contract number: DAAB07-98-C-D027.
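The pointer structure described in the component list (concepts point to the assertions that reference them, and each assertion points to its source document) can be sketched as a small in-memory index. This is only an illustration under stated assumptions: the class names, field names, and the `documents_for` helper are hypothetical and are not taken from the KBDT implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """A set of synonymous words plus a description of the sense (hypothetical layout)."""
    concept_id: str
    synonyms: tuple
    description: str
    assertion_ids: set = field(default_factory=set)  # back-pointers to assertions


@dataclass
class Assertion:
    """Concepts that belong together semantically, tied to one source document."""
    assertion_id: str
    concept_ids: tuple
    document_url: str


class KnowledgeBase:
    def __init__(self):
        self.concepts = {}
        self.assertions = {}

    def add_concept(self, concept):
        self.concepts[concept.concept_id] = concept

    def add_assertion(self, assertion):
        # An assertion may only reference concepts already in the ontology.
        for cid in assertion.concept_ids:
            if cid not in self.concepts:
                raise KeyError(f"unknown concept: {cid}")
        self.assertions[assertion.assertion_id] = assertion
        # Each concept records which assertions reference it.
        for cid in assertion.concept_ids:
            self.concepts[cid].assertion_ids.add(assertion.assertion_id)

    def documents_for(self, concept_id):
        """URLs of every document with an assertion that references the concept."""
        return {self.assertions[aid].document_url
                for aid in self.concepts[concept_id].assertion_ids}
```

Keeping the back-pointers on the concept means a lookup by concept never has to scan the full set of assertions.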

Figure 1 shows a diagram of the knowledge extraction process. A set of assertions is extracted from each document based on known concepts in the ontology. In addition, new concepts that are not yet in the ontology (potential concepts) are also extracted. These concepts can later be processed by a knowledge engineer to expand the existing ontology.

Knowledge Extraction

There are two aspects to knowledge extraction: 1) identifying new concepts and extending the ontology; 2) identifying and grouping concepts from a document to create new knowledge assertions. In both cases, concepts can be proper nouns or noun phrases (actors), verbs (actions), locations, times, or other defined words.

Figure 1. Overview for building a knowledge index

During knowledge extraction, the concepts are extracted from a given set of texts and evaluated against the current ontology. Noun phrases receive special handling since multiple concepts may be embedded within a noun phrase (e.g., "army tank gun turret" is parsed as "army tank" and "gun turret"). If acceptable combinations are not found, the system will either integrate the separate nouns or prompt the user to determine the final breakdown of the noun phrase.
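One simple way to break a noun phrase into embedded concepts is a greedy longest match against the phrases already known to the ontology. The sketch below assumes that strategy; the paper does not state which matching method KBDT uses, and the function name, the `max_len` parameter, and the sample phrase set are hypothetical.

```python
def split_noun_phrase(words, known_phrases, max_len=3):
    """Split a noun phrase into known concepts by greedy longest match.

    Returns (matched, unmatched): phrases found in known_phrases, and
    single words that matched nothing and would need user review.
    """
    matched, unmatched = [], []
    i = 0
    while i < len(words):
        # Try the longest candidate first, then progressively shorter ones.
        for size in range(min(max_len, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + size])
            if candidate in known_phrases:
                matched.append(candidate)
                i += size
                break
        else:
            # No known phrase starts at this word.
            unmatched.append(words[i])
            i += 1
    return matched, unmatched


known = {"army tank", "gun turret", "tank", "gun"}
print(split_noun_phrase("army tank gun turret".split(), known))
```

Words returned in the second list correspond to the case where no acceptable combination is found and the user is asked to decide.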

The next step is to prepare the extracted information for insertion into the knowledge base. Any words or phrases that did not produce suitable matches are stored for later processing by a knowledge engineer.

Matching concepts are grouped semantically to form collections for sentences, paragraphs, chapters, or the overall document.

Lastly, assertions are entered into the knowledge base to connect concepts with their smallest representational collection, usually a sentence. This structure gives the search engine extra power since it provides several additional metrics for evaluating the strength of a hit. These include the semantic use of a word or phrase, rather than just a simple keyword, and the semantic distance between concepts, rather than the relative distance between keywords.

Ontology

The ontology is the core of the knowledge base and contains definitions for all of the stored knowledge.

The initial ontology was derived from the WordNet database (Fellbaum 1998) and may be expanded through the process of identifying new concepts from documents as they are indexed.

The primary component of the ontology is a "concept". In the knowledge base, a concept denotes a collection of synonyms plus a description that indicates the concept’s usage.

The following example shows two concepts that are defined in the current ontology:

✦ tank, army tank - a military tank

✦ tank, storage tank - a container that holds gases or liquids.

The word “tank” has at least two different senses, making it impossible to specify its usage using the word “tank” alone. By combining the word “tank” with its synonyms and a description, a common meaning can be determined. Information in the knowledge base is stored in terms of synonym collections, rather than as single words. The synonym collections are called concepts, and are organized into "actors", "actions", "location" and

from nouns so that they can be reasoned over during the search or analysis process.

Concepts greatly improve the representative power of the knowledge base by allowing information to be attached to a word’s usage, rather than just to the individual word.

Each concept contains a description, a list of connections between the concept and other concepts, and a value indicating how popular this definition of the word is, so that the system can guess which sense of the word is most likely being used. Figure 2 shows the description of the tank, army tank, armored combat vehicle concept.
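The role of the popularity value can be illustrated with a minimal sketch that, for a given word, picks the candidate sense with the highest popularity. The `Sense` class, the numeric scale, and the sample popularity values are assumptions made for illustration only; the paper does not give the actual representation or values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sense:
    synonyms: tuple
    description: str
    popularity: float  # hypothetical scale: higher means more commonly used


def candidate_senses(word, ontology):
    """All senses whose synonym collection contains the word."""
    return [s for s in ontology if word in s.synonyms]


def guess_sense(word, ontology):
    """Guess the most likely sense of a word from popularity alone."""
    candidates = candidate_senses(word, ontology)
    if not candidates:
        return None  # unknown word: left for a knowledge engineer
    return max(candidates, key=lambda s: s.popularity)


# Illustrative values only; not taken from the KBDT ontology.
ontology = [
    Sense(("tank", "army tank"), "a military tank", 0.3),
    Sense(("tank", "storage tank"), "a container that holds gases or liquids", 0.7),
]
print(guess_sense("tank", ontology).description)
```

In a military doctrine domain the popularity values would presumably be adjusted so that the military sense ranks first.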

Although WordNet provided an acceptable seed for the knowledge base, each domain has its own set of concepts that are usually well-known noun phrases.

The ontology is constantly being extended with newly discovered and described concepts. The extension includes both new senses for existing words and new concepts for previously unknown words. In a military domain the word “division” is “an army unit large enough to sustain combat”. A more common definition for “division” in a scientific (biology) domain might be: “a group of organisms forming a subdivision of a larger category.” The usage of a word may also evolve over time (e.g., the most common definition may change, and new definitions may appear).

Figure 2. A concept

Figure 3. User is involved in building the ontology

The process of extending the ontology requires user involvement. While the KBDT system can track and identify new noun phrases and unknown words, it cannot define them or create relationships between them and existing concepts. Further, when KBDT encounters a known word used in a new sense, it cannot generate new definitions (senses) for the word automatically. Both of these functions require user intervention. The process of building up the ontology is made easier with KBDT ontology extension tools.

These tools help identify new concepts by looking for unknown word phrases that occur frequently in the documents being acquired. The phrases are later analyzed by a knowledge engineer and added to the knowledge base. After the concepts are updated or word phrases are converted into new concepts, the knowledge assertions that link the concepts to the location of the document or summary are created.
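The idea of surfacing frequent unknown phrases for review can be sketched with a simple counter. The sketch assumes that candidate phrases have already been extracted from each document; the function name and the `min_count` threshold are hypothetical and not part of the KBDT tools.

```python
from collections import Counter


def frequent_unknown_phrases(documents, known_phrases, min_count=2):
    """Count phrases not yet in the ontology across the acquired documents.

    documents: one list of candidate phrases per document.
    Returns (phrase, count) pairs, most frequent first.
    """
    counts = Counter()
    for phrases in documents:
        for phrase in phrases:
            if phrase not in known_phrases:
                counts[phrase] += 1
    return [(p, n) for p, n in counts.most_common() if n >= min_count]


docs = [
    ["blueprint for peace", "army tank"],
    ["blueprint for peace", "cease fire"],
    ["cease fire", "blueprint for peace"],
]
for phrase, count in frequent_unknown_phrases(docs, {"army tank"}):
    print(phrase, count)
```

The resulting list is only a queue for the knowledge engineer; defining each phrase and linking it to existing concepts remains a manual step.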

Knowledge Assertions

Knowledge assertions are created by extracting concepts from documents. This process adds knowledge indexes to the knowledge base, and these indexes can then be queried to produce references to the original documents. While this is the primary use of the indexes, the information is stored in knowledge base format, so intelligent reasoning components can also reason over the assertions themselves to find direct answers rather than presenting all possible solutions.

A knowledge assertion is a set of concepts that are grouped semantically. The concepts are identified when the text is extracted and parsed. The most appropriate concept is automatically selected for that word given a defined context (for the document), and the concept is then referenced in the new assertion.

The assertion contains pointers to the concepts, and each concept contains an index of all the assertions that reference it. Each assertion contains a list of concepts that identify the actor, object acted upon, action, location, and time implied by a given textual fragment.

For example, suppose we extract the following paragraph from a document:

The Organization of African Unity (OAU) on Monday called for an immediate end to the fighting between Ethiopia and Eritrea, now that both sides have accepted its blueprint for peace. OAU officials were expected to meet members of the Ethiopian government Monday before heading for Asmara, where authorities on Monday said clashes were still taking place.

In our example, the knowledge assertions generated would be:

Assertion 1:

A: OAU, end, fighting, side, blueprint, peace
P: call, accept
O: immediate, both
T: Monday
L: Ethiopia, Eritrea

Assertion 2:

A: OAU, official, member, government, authority, clash
P: be, expect, meet, head, say, take place
O: Ethiopia, still
T: Monday
L: Asmara

Potential Concept:

A: blueprint for peace

Key: A - Actor, P - Process or Action, O - Other, T - Time, L - Location. The term "actor" may later be divided into two categories: actor and recipient.

Notice that the ontology did not contain a concept for ‘blueprint for peace’, but the knowledge extractor identified it as a potential concept. The assertion can only reference known concepts, but if the user later goes in and adds a concept for ‘blueprint for peace’, then Assertion 1 is modified to reflect the new reference.

Assertions contain the pointers to the concepts, rather than storing the words themselves. Assertions also contain a pointer to either the original document, a local copy of the document, or a document summary.
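The A/P/O/T/L layout of Assertion 1, and the later promotion of a potential concept, can be sketched with plain dictionaries. This is an illustration under stated assumptions: the dictionary layout, the function names, storing the potential concept alongside the assertion, and the example URL are all hypothetical. For readability, concept names stand in for the pointers that the knowledge base actually stores.

```python
ROLE_CODES = "APOTL"  # Actor, Process or Action, Other, Time, Location


def new_assertion(source, roles, potential=()):
    """Build an assertion record; roles maps a role code to concept names."""
    unknown = set(roles) - set(ROLE_CODES)
    if unknown:
        raise ValueError(f"unknown role codes: {sorted(unknown)}")
    return {
        "source": source,  # pointer to the document, a local copy, or a summary
        "roles": {code: list(roles.get(code, [])) for code in ROLE_CODES},
        "potential": list(potential),  # phrases not yet in the ontology
    }


def promote_potential_concept(assertion, phrase, role="A"):
    """After the user defines `phrase` as a concept, reference it from the assertion."""
    if phrase not in assertion["potential"]:
        return False
    assertion["potential"].remove(phrase)
    assertion["roles"][role].append(phrase)
    return True


assertion_1 = new_assertion(
    "http://example.org/oau-story",  # hypothetical URL
    {
        "A": ["OAU", "end", "fighting", "side", "blueprint", "peace"],
        "P": ["call", "accept"],
        "O": ["immediate", "both"],
        "T": ["Monday"],
        "L": ["Ethiopia", "Eritrea"],
    },
    potential=["blueprint for peace"],
)
promote_potential_concept(assertion_1, "blueprint for peace")
print(assertion_1["roles"]["A"])
```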

Searching the Knowledge Base

The search engine uses the same knowledge extraction component to parse the user’s query into concepts. The search engine then looks up each known concept and finds the joined list of assertions that reference those concepts.

This ensures that the best hit is the one in which the same concepts were referenced in a single semantic grouping. If no hits are found, or the user wants additional hits, then the search engine looks for a grouping of assertions that contain related concepts, determined by following the original concepts’ parent/child, composition, and membership relationships. The resulting hits are ranked by the semantic distance between the assertions. This ensures that hits that appear within the same context, such as a paragraph or topic, are ranked higher.
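The two-stage lookup described above can be sketched as a set intersection over the per-concept assertion lists, followed by a fallback that also accepts related concepts. This is a simplified sketch, not the KBDT search engine: the function name and data layout are hypothetical, the "fighting"/"clash" relationship is an assumed example, and ranking by semantic distance is omitted.

```python
def search(query_concepts, index, related):
    """Find assertions that reference every query concept.

    index:   concept -> set of assertion ids that reference it
    related: concept -> set of related concepts (parent/child, parts, members)
    """
    if not query_concepts:
        return []
    # Stage 1: assertions in which all query concepts appear together.
    exact = set.intersection(*(index.get(c, set()) for c in query_concepts))
    if exact:
        return sorted(exact)
    # Stage 2: let each query concept also match through a related concept.
    expanded = []
    for concept in query_concepts:
        hits = set(index.get(concept, set()))
        for other in related.get(concept, set()):
            hits |= index.get(other, set())
        expanded.append(hits)
    return sorted(set.intersection(*expanded))


index = {
    "fighting": {"assertion-1"},
    "Ethiopia": {"assertion-1", "assertion-2"},
    "clash": {"assertion-2"},
    "Asmara": {"assertion-2"},
}
related = {"fighting": {"clash"}}  # assumed relationship, for illustration
print(search(["fighting", "Ethiopia"], index, related))
print(search(["fighting", "Asmara"], index, related))
```

The first query is satisfied by a single assertion directly; the second has no direct hit and only succeeds through the related concept.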

Figure 4 shows how the search for fighting in Ethiopia results in a hit on the document that produced Assertion 1, used in our previous example.

Maintaining Current Information

Ensuring that the links between knowledge assertions and the documents that produced them remain valid is a critical function for any searchable index. KBDT provides a mechanism to easily validate and update links, ensuring that the content in the knowledge base is current.

As shown in Figure 5, there are three parts to the knowledge base: the ontology, the assertions, and the documents (links, copies, or summaries). A separate KBDT process is dedicated to checking the validity of the list of links. If this process determines that a link is no longer valid, then all assertions that point to the document are removed and their references are deleted from the concepts in the ontology. If the process determines that a link is out of date, one of two things happens: (1) if the changes to the document are minor, only the assertions derived from the altered areas are removed and recreated; otherwise, (2) all assertions are removed and the document is reprocessed. As in the initial acquisition process, a knowledge engineer may be required to revise the ontology and produce the correct assertions.
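The cleanup for a link that is no longer valid can be sketched as follows: every assertion that points to the document is removed, and its id is deleted from each concept that referenced it. The function name and the dictionary layout are hypothetical; how KBDT detects an invalid or outdated link is not covered here.

```python
def remove_document(url, assertions, concept_index):
    """Remove all assertions acquired from `url` and their concept references.

    assertions:    assertion id -> {"concepts": [...], "source": url}
    concept_index: concept -> set of assertion ids that reference it
    Returns the ids of the removed assertions.
    """
    stale = [aid for aid, a in assertions.items() if a["source"] == url]
    for aid in stale:
        # Delete the back-references held by the concepts first.
        for concept in assertions[aid]["concepts"]:
            concept_index.get(concept, set()).discard(aid)
        del assertions[aid]
    return stale


assertions = {
    "a1": {"concepts": ["OAU", "peace"], "source": "http://example.org/a"},
    "a2": {"concepts": ["OAU"], "source": "http://example.org/b"},
}
concept_index = {"OAU": {"a1", "a2"}, "peace": {"a1"}}
print(remove_document("http://example.org/a", assertions, concept_index))
```

Reprocessing an out-of-date document could reuse the same removal step before the document is run through extraction again.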

Future Work

This paper described early prototypes of the KBDT effort. These prototypes are part of a larger project devoted to knowledge discovery and intelligent

Figure 4. Search for assertions with defined concepts

that concepts provide a useful foundation for discovering relevant information from a wide variety of sources. These early successes encourage future project development in this area.

References

Fellbaum, C., ed. 1998. WordNet: An Electronic Lexical Database. Cambridge, Mass.: MIT Press.

Mayk, et al. 1998. A Knowledge Based Doctrine Tool for Command and Control. Proceedings of the Command and Control Research and Technology Symposium.