
REASONING AS EXTENDED RETRIEVAL

7.3 REASONING AS QUERY-INVOKED MEMORY RE-ORGANIZATION

7.3.3 DOCUMENT STORAGE AND RETRIEVAL THROUGH RELATIONAL DATABASE OPERATIONS

7.3.3.1 Conversion of documents into unstructured databases

The model was implemented in Prolog under some restrictions; for example, documents must be written following a small set of grammatical rules. These restrictions are imposed to avoid technical difficulties while still demonstrating the usefulness of our model. The parser converts these documents into items to be stored in the knowledge base.

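Figure 7.1 The COGMIR model for storage and retrieval (figure not reproduced; its labels include fd, fc, Document Base, Document Description List, Knowledge Base, and Conceptual Memory)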

There are two basic constructs in the knowledge base: objects and relationships. Relationships indicate how objects are associated, and each object records which relationships it takes part in. All data structures are represented as Prolog lists. Lists can be nested, and the contents of a list are enclosed in square brackets. The conversion (i.e., parsing) from a document to document stems (consisting of objects and relationships) is done by the system function f, which works like a parser. In the following, we discuss only the representation issue.

Representing objects. An object can be attached to an attribute list as well as to other associated lists. Each object is represented by a tuple written in list form, with the following format:

[L, [N], [A], [RL]],

where L is the unique location of the object in the knowledge base, N is the name of the object, A is an attribute list, and RL represents the locations of the related relationships in the knowledge base (notice that an object may be associated with several relationships).

Representing relationships. We consider binary relationships only. (If a relationship is not binary, the parser first converts it into several binary relationships. However, we do not address this issue in this book.) A relationship name is a verb or verb phrase. Each relationship takes the form of a tuple with the following format:

[L, [N], [A], [Ar, Ae]],

where L is the unique location of the relationship in the knowledge base, N is the name of the relationship, A is an attribute list, Ar is the location of the first object associated with the relationship, and Ae is the location of the second object associated with the relationship.
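The two tuple formats can be mirrored outside Prolog as nested lists. The following is a minimal Python sketch, not the code of the original system: Prolog atoms are written as strings, and the helper names `make_object` and `make_relationship` are hypothetical. It also shows how an object and a relationship point at each other through their locations.

```python
# Illustrative sketch only: Python lists stand in for the Prolog lists,
# and atoms are written as strings. The helper names are hypothetical.

def make_object(loc, name, attributes, rel_locs):
    """Build an object tuple in the format [L, [N], [A], [RL]]."""
    return [loc, [name], list(attributes), list(rel_locs)]

def make_relationship(loc, name, attributes, first_loc, second_loc):
    """Build a relationship tuple in the format [L, [N], [A], [Ar, Ae]]."""
    return [loc, [name], list(attributes), [first_loc, second_loc]]

# "a bat emits sound, the sound is inaudible"
bat = make_object(109, "bat", [], [110, 114])
sound = make_object(111, "sound", ["inaudible"], [110, 112])
emits = make_relationship(110, "emits", [], 109, 111)

# The cross-references run in both directions:
# the relationship names its two objects by location ...
first_loc, second_loc = emits[3]
# ... and each object lists the location of the relationship.
linked = emits[0] in bat[3] and emits[0] in sound[3]
```

Because every cross-reference is a location number rather than a nested structure, each tuple keeps a fixed number of fields, which is what later allows the lists to be treated as relations.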


For instance, the sentence (in restricted English) “the scientist discovers capillaries” can be represented by two sublists in the object list and one tuple in the relationship list. In the object list, we have

[[115, [scientist], [], [116]],

[117, [capillaries], [], [116, 118, 120]]]

The meaning of the first item is: an object, ‘scientist,’ is stored at memory location 115, does not have its own attributes, and is associated with a relationship stored at memory location 116. The second item represents an object, ‘capillaries,’ which is stored at location 117, does not have its own attributes, and is associated with relationships stored at locations 116, 118, and 120. In the relationship list, we have

[116, [discovers], [ ], [115,117]]

This item represents a relationship, ‘discovers,’ which is stored at memory location 116. This relationship represents an action taken by the object stored at location 115, with the object stored at location 117 as the receiver of this action.
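Reading the example back amounts to following the two locations held by the relationship tuple. The sketch below does this in Python rather than Prolog, with atoms written as strings; the function names `name_at` and `read_relationship` are hypothetical and serve only as an illustration.

```python
# Illustrative sketch: the example tuples as Python lists (atoms as strings).
objects = [
    [115, ["scientist"], [], [116]],
    [117, ["capillaries"], [], [116, 118, 120]],
]
relationships = [
    [116, ["discovers"], [], [115, 117]],
]

def name_at(object_list, loc):
    """Return the name of the object stored at location loc."""
    for obj_loc, (name,), _attrs, _rel_locs in object_list:
        if obj_loc == loc:
            return name
    raise KeyError(loc)

def read_relationship(rel, object_list):
    """Turn a relationship tuple into (first object, verb, second object)."""
    _loc, (verb,), _attrs, (first, second) = rel
    return (name_at(object_list, first), verb, name_at(object_list, second))

triple = read_relationship(relationships[0], objects)
print(triple)  # ('scientist', 'discovers', 'capillaries')
```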

These lists can be viewed as relational databases with fixed fields of attributes; the underlying structure of the knowledge base therefore resembles that of a relational database. Each sublist in the object list can be considered a tuple in a relation. All the object tuples form a relation, object, which has the fixed fields L, N, A, RL. One difference must be noted: the sublists are ordered according to the location numbers assigned to objects or relationships, whereas the tuples of a relational database are not ordered. This ordering merely imposes some additional restrictions on the relations, and standard relational operations such as select or union can be adopted with only minor revisions (as explained in the next subsection). Each sublist in the relationship list can likewise be considered a tuple in a relational database. All the relationship tuples form another relation, relationship, which has the fixed fields L, N, A, Ar, Ae. Notice that, unlike the relations discussed in Chapter 4, these relations are schema-free, because they represent unstructured data [Motro 1986]. In these relations, the key of a relation should be interpreted as the location of an object or a relationship in the knowledge base.
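To make the remark about ordering concrete, here is a minimal Python sketch of select and union over location-ordered tuple lists. It is not the document algebra of the next subsection: the function names are hypothetical, and the assumption that the "minor revision" consists of keeping results sorted by location is ours, made only for illustration.

```python
# Illustrative sketch, not the book's document algebra: select and union
# over tuple lists kept sorted by location (the first field, used as key).

def select(relation, predicate):
    """Keep the tuples satisfying predicate; the input order is preserved."""
    return [t for t in relation if predicate(t)]

def union(rel_a, rel_b):
    """Merge two relations, keeping one tuple per location and
    returning the result ordered by location."""
    by_loc = {t[0]: t for t in rel_a}
    for t in rel_b:
        by_loc.setdefault(t[0], t)
    return [by_loc[loc] for loc in sorted(by_loc)]

# A fragment of the relationship relation (fields L, N, A, [Ar, Ae]).
relationship = [
    [110, ["emits"], [], [109, 111]],
    [112, ["reflects"], [], [113, 111]],
    [114, ["detects"], [], [109, 113]],
]

# Relationships whose first object is stored at location 109 (the bat).
by_bat = select(relationship, lambda t: t[3][0] == 109)
# Relationships involving location 111 (the sound) in either position.
on_sound = select(relationship, lambda t: 111 in t[3])
# The union drops the duplicate tuple 110 and stays ordered by location.
merged = union(by_bat, on_sound)
```

Select preserves the location order for free, since it only filters; union is the operation that needs the extra sorting step under this assumption.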

Representing document stems. A document stem consists of object tuples along with some relationships between these objects. In other words, a document stem is the collection of related object tuples and relationship tuples. A document stem has the following format:

[O, R],

where O is its object list and R is its relationship list.

Therefore, a document stem is implemented as a relation. Similarly, other system components, such as the concept memory and the document description list, are implemented as relations. In our approach, however, a list form is used instead of a tabular form, because of the Prolog implementation. By mapping the input documents into a frame-like list representation, which is much more regular than the original documents, we gain the power of manipulating a regular, homogeneous structure, as demonstrated in relational databases.
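A document stem [O, R] is only useful if the locations in O and R refer to each other correctly. The following Python sketch builds one stem and checks that every referenced location exists. The check itself is our illustration and is not described in the source; the function name `dangling_references` is hypothetical.

```python
# Illustrative sketch: a document stem [O, R] as a pair of Python lists,
# with a hypothetical consistency check over the stored locations.

objects = [
    [109, ["bat"], [], [110, 114]],
    [111, ["sound"], ["inaudible"], [110, 112]],
    [113, ["obstacle"], ["invisible"], [112, 114]],
]
relationships = [
    [110, ["emits"], [], [109, 111]],
    [112, ["reflects"], [], [113, 111]],
    [114, ["detects"], [], [109, 113]],
]
stem = [objects, relationships]

def dangling_references(stem):
    """Return (holder location, missing location) pairs for every
    reference in the stem that points at a tuple not in the stem."""
    object_list, relationship_list = stem
    object_locs = {o[0] for o in object_list}
    rel_locs = {r[0] for r in relationship_list}
    dangling = []
    for loc, _name, _attrs, rel_refs in object_list:
        dangling += [(loc, r) for r in rel_refs if r not in rel_locs]
    for loc, _name, _attrs, ends in relationship_list:
        dangling += [(loc, e) for e in ends if e not in object_locs]
    return dangling

missing = dangling_references(stem)  # empty list: the stem is self-contained
```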

A document stem with O and R as its object list and relationship list, respectively, can then be expressed as (δ^(O), δ^(R)), or δ^(O,R) for short (the superscripts denote its associated object list and relationship list). We will also use the notation r^(O), r^(R), and r^(O,R) to denote the relations that implement δ^(O), δ^(R), and δ^(O,R), respectively.

As a comprehensive example, consider the documents (written in restricted English) in Figure 7.2(a). The corresponding object list and relationship list of their document stems are shown in Figure 7.2(b); each can be viewed as a relation, and each row in a list can be viewed as a tuple in that relation.

_______________________________________________________________

1. the arteries carry blood from the heart, the veins carry blood to the heart.

2. a bat emits sound, the sound is inaudible. an obstacle reflects the sound, the obstacle is invisible. the bat detects the obstacle.

3. a scientist discovers the capillaries. the capillaries connect the arteries. the capillaries connect the veins.

_______________________________________________________________

(a) Several documents

Object list:

[[101, [arteries], [], [102]],
[103, [blood], [], [102]],
[105, [heart], [], [104]],
[106, [veins], [], [107]],
[109, [bat], [], [110, 114]],
[111, [sound], [inaudible], [110, 112]],
[113, [obstacle], [invisible], [112, 114]],
[115, [scientist], [], [116]],
[117, [capillaries], [], [118, 120]],
[119, [arteries], [], [118]],
[121, [veins], [], [120]]].

Relationship list:

[[102, [carry], [], [101, 103]],
[104, [to], [], [103, 105]],
[107, [carry], [], [106, 103]],
[108, [from], [], [103, 105]],
[110, [emits], [], [109, 111]],
[112, [reflects], [], [113, 111]],
[114, [detects], [], [109, 113]],
[116, [discovers], [], [115, 117]],
[118, [connect], [], [117, 119]],
[120, [connect], [], [117, 121]]].

(b) The corresponding object list and relationship list

7.3.3.2 Document algebra: an algebra on document stems and relations

In document Computational Intelligence For Decision Support Chen pdf (Page 185-189)