
REASONING AS EXTENDED RETRIEVAL

7.3 REASONING AS QUERY-INVOKED MEMORY RE-ORGANIZATION

7.3.3 DOCUMENT STORAGE AND RETRIEVAL THROUGH RELATIONAL DATABASE OPERATIONS

7.3.3.2 Document algebra: an algebra on document stems and relations

Document storage and retrieval in such a system is done through the various components of the system. In the following, we define a document algebra to handle documents that are operated on as relations. Essentially, document algebra is an extension of the relational algebra discussed in Chapter 4. But first, we should notice that there is a need to distinguish operations at two levels: the higher level of document stems, and the lower level of relations.

Operations on document stems. We start with the following remark. Since the relations used in the COGMIR model are schema-free, some operations originally defined on conventional relations become meaningless. Unlike conventional relational databases, we have only two relations to handle: the object relation and the relationship relation. Therefore, the join operation as defined in conventional relational databases no longer makes sense. In addition, since both the object relation and the relationship relation have fixed attributes, the project operation as defined in conventional relational databases is not of interest. Based on these considerations, we do not include join and project as operations on relations in our implementation. On the other hand, the concept of a document stem is new, and no operations have previously been defined on document stems. Therefore, we borrow the names project and join and redefine these two operations on document stems (rather than on relations).

Project. That a document (or its corresponding document stem) is relevant to a query does not necessarily imply that the entire document is relevant to that query. The operation $\pi_q(\delta)$, that is, the project operation over a query $q$ for a document stem $\delta$, excludes from that document stem those object tuples and relationship tuples that are not relevant to $q$.

Join. The join operation connects two document stems if they share an object with the same name. The join of two document stems $\delta_i$ and $\delta_j$ (here $i$, $j$ are two document identifiers and $i < j$) is denoted as $\delta_i \bowtie \delta_j$. This operation is very important in our model, because separately acquired information stored in different document stems can be connected. Note the role of the document identifiers in the join of document stems: they are used as time stamps of the corresponding documents, because they are assigned according to the order in which the documents are acquired by the system. Although better methods may be desirable, in the current implementation we assume that documents acquired earlier record things which happened earlier. In the current implementation, $\delta_i \bowtie \delta_j$ is not defined if $i > j$.

Operations on relations. Operations defined on document stems are conceptual operations; the purpose of defining them is to provide a convenient way to envision the system behavior. These operations are actually performed through operations defined on relational databases, because the object list and the relationship list resemble relations. We will use the following operations on schema-free relations.

Select. The select operation $\sigma_q(R)$ selects tuples (rows) relevant to query $q$ from the schema-free relation $R$. This operation is much like the standard select operation; the only difference is that the selected tuples are kept in their original order. For instance, we can select one or more rows (i.e. tuples) from either of the two lists in Figure 7.2(b).

Union. The union of two schema-free relations $R_1$ and $R_2$ is denoted as $R_1 \cup R_2$. This operation is much like the standard union operation on relations, except for the requirement of keeping the original order of tuples in the two relations. Therefore, in general, $R_1 \cup R_2$ cannot be replaced by $R_2 \cup R_1$. For instance, we can perform the union operation on two or more rows (tuples) selected by the above operation.
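The two operations on schema-free relations can be sketched as follows. This is only an illustration, not the system's implementation: the tuple layout `(location, name, modifiers, links)` is read off the figures, and the relevance test is passed in as a hypothetical predicate `is_relevant`, since the text does not spell out how relevance to a query is decided.

```python
from typing import Callable, List, Tuple

# Assumed tuple layout: (location, name, modifiers, links).
Row = Tuple[int, str, Tuple[str, ...], Tuple[int, ...]]


def select(relation: List[Row], is_relevant: Callable[[Row], bool]) -> List[Row]:
    """Order-preserving select: keep the relevant rows in their original order."""
    return [row for row in relation if is_relevant(row)]


def union(r1: List[Row], r2: List[Row]) -> List[Row]:
    """Order-preserving union: all rows of r1, then rows of r2 not already in r1."""
    result = list(r1)
    seen = set(r1)
    for row in r2:
        if row not in seen:
            result.append(row)
            seen.add(row)
    return result


r1 = [(101, "arteries", (), (102,)), (103, "blood", (), (102,))]
r2 = [(117, "capillaries", (), (118, 120)), (119, "arteries", (), (118,))]

# Because order is kept, the union is not commutative.
print(union(r1, r2) == union(r2, r1))  # False
```

Representing a relation as a list rather than a set is what makes the order requirement, and hence the non-commutativity, visible.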

The major step of performing retrieval from several documents can be expressed as follows. As already stated, we use the notation $\delta(O, R)$ to denote a document stem with object list $O$ and relationship list $R$. In the following, we also use the notation $\delta_q(O, R)$ to denote a document stem with object list $O$ and relationship list $R$ in answering query $q$. If $q_s$ and $q_t$ are two subqueries of the original query $q$ with $q_s \cup q_t = q$ (where $\cup$ denotes set union), and if we define the answer for the union of two queries as the union of the answers of these two queries, namely,

$$\delta_{(q_s \cup q_t)}(O) = \delta_{q_s}(O) \bowtie \delta_{q_t}(O),$$

then we have

$$\begin{aligned}
\delta_q(O) &= \delta_{(q_s \cup q_t)}(O) \\
&= \delta_{q_s}(O) \bowtie \delta_{q_t}(O) \\
&= \pi_{q_s}(\delta(O)) \bowtie \pi_{q_t}(\delta(O))
\end{aligned}$$

(due to the definition of projection on document stems and the meaning of query), and

$$\delta_q(O) = \sigma_{q_s}(r(O)) \cup \sigma_{q_t}(r(O))$$

(due to the definitions of selection and union on unstructured relations). Similarly, we have

$$\begin{aligned}
\delta_q(R) &= \delta_{q_s}(R) \bowtie \delta_{q_t}(R) \\
&= \pi_{q_s}(\delta(R)) \bowtie \pi_{q_t}(\delta(R)) \\
&= \sigma_{q_s}(r(R)) \cup \sigma_{q_t}(r(R)).
\end{aligned}$$

The above two formulas can be combined to

$$\delta_q(O, R) = \delta_{q_s}(O, R) \bowtie \delta_{q_t}(O, R).$$

Note that the join ($\bowtie$) of document stems $\delta_{q_s}$ and $\delta_{q_t}$ is realized through the union of tuples in schema-free relational databases. The union operation will be performed only when $\delta_{q_s}(O)$ and $\delta_{q_t}(O)$ share some common object name. In our implementation, an auxiliary list of object names is constructed for each involved document stem ($\delta_{q_s}(O)$ and $\delta_{q_t}(O)$); the intersection of these two lists is then checked to determine whether any object name is shared.
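The shared-name check and the realization of join as a union of tuples might look like the sketch below. The class `DocumentStem`, the helper `object_names`, and the tuple layout are all hypothetical names introduced for illustration. The sketch also assumes that tuples from different documents carry distinct locations, so the order-preserving union reduces to concatenation, and it treats equal identifiers as undefined along with $i > j$.

```python
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple

# Assumed tuple layout: (location, name, modifiers, links).
Row = Tuple[int, str, Tuple[str, ...], Tuple[int, ...]]


@dataclass
class DocumentStem:
    doc_id: int                # identifier, also used as acquisition time stamp
    objects: List[Row]         # object list O
    relationships: List[Row]   # relationship list R


def object_names(stem: DocumentStem) -> Set[str]:
    """Auxiliary collection of the object names in one document stem."""
    return {row[1] for row in stem.objects}


def join(a: DocumentStem, b: DocumentStem) -> Optional[Tuple[List[Row], List[Row]]]:
    """Return the joined (object list, relationship list), or None if no name is shared."""
    if a.doc_id >= b.doc_id:
        raise ValueError("join is defined only when the left identifier is smaller")
    if not object_names(a) & object_names(b):
        return None
    # Earlier document first, so later knowledge always appears later.
    return a.objects + b.objects, a.relationships + b.relationships


d1 = DocumentStem(1, [(101, "arteries", (), (102,))], [(102, "carry", (), (101, 103))])
d3 = DocumentStem(3, [(119, "arteries", (), (118,))], [(118, "connect", (), (117, 119))])

objs, rels = join(d1, d3)
print([row[0] for row in objs])  # [101, 119]
```

Returning the two lists rather than a new `DocumentStem` avoids having to decide which identifier the joined stem should carry, a point the text leaves open.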

In summary, when dealing with a retrieval, first, relevant document stems are identified through L and C. The remaining steps are carried out by using the following four formulae:

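$$\begin{aligned}
\delta_{q_s}(O) &= \sigma_{q_s}(r(O)), && (1) \\
\delta_{q_s}(R) &= \sigma_{q_s}(r(R)), && (2) \\
\delta_{(q_s \cup q_t)}(O) &= \delta_{q_s}(O) \bowtie \delta_{q_t}(O) = \sigma_{q_s}(r(O)) \cup \sigma_{q_t}(r(O)), && (3) \\
\delta_{(q_s \cup q_t)}(R) &= \delta_{q_s}(R) \bowtie \delta_{q_t}(R) = \sigma_{q_s}(r(R)) \cup \sigma_{q_t}(r(R)). && (4)
\end{aligned}$$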

The meaning of these formulae can be shown in the following example. Consider the knowledge base depicted in Figure 7.2(b), and suppose the objects specified in a query are a set $q$ = {capillaries, heart}. Figure 7.3(a) is basically a duplication of Figure 7.2(b), with rows (tuples) from different documents separated by blank lines. (Identifying tuples from different documents is handled by the document description list, but the details will not be addressed in this article.) Document description list $L$ and conceptual memory $C$ determine that only document stems $\delta_1$ and $\delta_3$ are relevant to the query $q$. Figure 7.3(b) depicts the relevant document stems (which include only those tuples which are relevant to the current query $q$). The result of performing a select operation on relations is shown in Figure 7.3(c), where tuples that are in the relevant documents but not directly related to the current query (namely, tuples that correspond to “a scientist discovers the capillaries”) are excluded. This is done by applying formulae (1) and (2). Starting from Figure 7.3(c), we now apply formulae (3) and (4), where $q_s$ = {heart} while $q_t$ = {capillaries}. The object names involved in the two document stems are {arteries, blood, heart, veins} and {capillaries, arteries, veins}, respectively. Since these two document stems share the object names {arteries, veins}, the union of tuples can be performed, resulting in Figure 7.3(d). From this resulting document stem, a fact in restricted English can be reconstructed. (Here the term fact is used in the same sense as in the term “fact retrieval” in the IR literature, where it refers to a part of the contents.)

Notice that the two occurrences (with two different memory location numbers, 106 and 121) of the object ‘veins’ as it appears in two documents are treated as one thing, and so are the two occurrences of the object ‘arteries’ (with two location numbers, 101 and 119). Sharing object names is a necessary condition for performing the join operation. As a result, the following fact can be constructed to answer the query {heart, capillaries}:

arteries carry blood to heart. vein carry blood to heart. capillaries connect arteries. capillaries connect veins.

Note that, although this short paragraph looks like a document, it is not. Instead, it is generated from the document stems that contain contents relevant to the query.
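The shared-name step of this example can be traced in a few lines. The sketch below copies only the locations and names of the object tuples from Figure 7.3(c); modifiers and links are omitted, and the variable names are illustrative.

```python
# Object tuples of the two selected stems as (location, name) pairs.
stem_1_objects = [(101, "arteries"), (103, "blood"), (105, "heart"), (106, "veins")]
stem_3_objects = [(117, "capillaries"), (119, "arteries"), (121, "veins")]

names_1 = {name for _, name in stem_1_objects}
names_3 = {name for _, name in stem_3_objects}
shared = names_1 & names_3
print(sorted(shared))  # ['arteries', 'veins']

# The join is realized as a union only because `shared` is non-empty.
joined_objects = stem_1_objects + stem_3_objects if shared else None

# Locations carrying the same shared name are treated as one object.
same_object = {
    name: [loc for loc, n in joined_objects if n == name] for name in sorted(shared)
}
print(same_object)  # {'arteries': [101, 119], 'veins': [106, 121]}
```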

Object list:
[[101, [arteries], [], [102]],
 [103, [blood], [], [102]],
 [105, [heart], [], [104]],
 [106, [veins], [], [107]],

 [109, [bat], [], [110, 114]],
 [111, [sound], [inaudible], [110, 112]],
 [113, [obstacle], [invisible], [112, 114]],

 [115, [scientist], [], [116]],
 [117, [capillaries], [], [118, 120]],
 [119, [arteries], [], [118]],
 [121, [veins], [], [120]]].

Relationship list:
[[102, [carry], [], [101, 103]],
 [104, [to], [], [103, 105]],
 [107, [carry], [], [106, 103]],
 [108, [from], [], [103, 105]],

 [110, [emits], [], [109, 111]],
 [112, [reflects], [], [113, 111]],
 [114, [detects], [], [109, 113]],

 [116, [discovers], [], [115, 117]],
 [118, [connect], [], [117, 119]],
 [120, [connect], [], [117, 121]]].

Figure 7.3 (a) Document stems (knowledge base containing $\delta_1$, $\delta_2$, $\delta_3$)

Object list:
[[101, [arteries], [], [102]],
 [103, [blood], [], [102]],
 [105, [heart], [], [104]],
 [106, [veins], [], [107]],

 [115, [scientist], [], [116]],
 [117, [capillaries], [], [118, 120]],
 [119, [arteries], [], [118]],
 [121, [veins], [], [120]]].

Relationship list:
[[102, [carry], [], [101, 103]],
 [104, [to], [], [103, 105]],
 [107, [carry], [], [106, 103]],
 [108, [from], [], [103, 105]],

 [116, [discovers], [], [115, 117]],
 [118, [connect], [], [117, 119]],
 [120, [connect], [], [117, 121]]].

Figure 7.3 (b) Relevant document stems ($\delta_1$ and $\delta_3$)

Object list:
[[101, [arteries], [], [102]],
 [103, [blood], [], [102]],
 [105, [heart], [], [104]],
 [106, [veins], [], [107]],

 [117, [capillaries], [], [118, 120]],
 [119, [arteries], [], [118]],
 [121, [veins], [], [120]]].

Relationship list:
[[102, [carry], [], [101, 103]],
 [104, [to], [], [103, 105]],
 [107, [carry], [], [106, 103]],
 [108, [from], [], [103, 105]],

 [118, [connect], [], [117, 119]],
 [120, [connect], [], [117, 121]]].

Figure 7.3 (c) Document stems after projection through select operation on relations [using formulae (1) and (2)].

Object list:
[[101, [arteries], [], [102]],
 [103, [blood], [], [102]],
 [105, [heart], [], [104]],
 [106, [veins], [], [107]],
 [117, [capillaries], [], [118, 120]],
 [119, [arteries], [], [118]],
 [121, [veins], [], [120]]].

Relationship list:
[[102, [carry], [], [101, 103]],
 [104, [to], [], [103, 105]],
 [107, [carry], [], [106, 103]],
 [108, [from], [], [103, 105]],
 [118, [connect], [], [117, 119]],
 [120, [connect], [], [117, 121]]].

Figure 7.3 (d) Join of document stems through union operation on relations [using formulae (3) and (4)]

From the above simple example, we have seen that the operations are not trivial. In order to see why these nontrivial operations are needed, let us summarize some features of the overall system by associating system components (other than the knowledge base) with the retrieval process. First, let us recall that conceptual memory identifies the relevant document stems. The boundary consists of only part of the object names; starting from the boundary, the interior of a document stem can be examined by processing a portion of the object list and relationship list.

Implementing the knowledge base as an unstructured database with the necessary operations defined on it has some significant merits over the use of a plain “sentence base” consisting of all sentences acquired from documents. This is partly because, in our model, the conceptual memory identifies the document stems which are relevant to the current query; only a portion of the knowledge base (instead of the entire “sentence base”) is searched. This represents a significant saving when the number of document stems becomes large. Storage using object tuples and relationship tuples actually implements a net-like structure.

This net-like structure clearly indicates which relationships are related to an object. Determining the connection between different document stems through objects is thus much easier than directly checking the sentences one by one, particularly when the number of sentences that need be checked becomes large. In addition, due to the net-like structure (implemented as tuples), our system is able to generate new documents through structure mapping (using information concurrently available in the knowledge base), thus realizing a kind of analogical reasoning. Document generation through structure mapping might be a more difficult problem if a “sentence base” (rather than the knowledge base consisting of object and relationship tuples) is maintained.
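To make the net-like structure concrete, the sketch below builds an index from object names to the relationship tuples that touch them, using the tuples of Figure 7.3(d). Reading the two links of a relationship tuple as the locations of the objects it connects is an assumption of this sketch, as are all variable names.

```python
from collections import defaultdict

# Object locations and names, and relationships as (location, name, linked objects).
objects = {101: "arteries", 103: "blood", 105: "heart", 106: "veins",
           117: "capillaries", 119: "arteries", 121: "veins"}
relationships = [
    (102, "carry", (101, 103)),
    (104, "to", (103, 105)),
    (107, "carry", (106, 103)),
    (108, "from", (103, 105)),
    (118, "connect", (117, 119)),
    (120, "connect", (117, 121)),
]

# Index: object name -> locations of relationships touching an object of that name.
touching = defaultdict(list)
for loc, _, ends in relationships:
    for end in ends:
        touching[objects[end]].append(loc)

print(touching["arteries"])  # [102, 118]
print(touching["veins"])     # [107, 120]
```

With such an index, finding the relationships that connect two document stems through a shared object name is a lookup rather than a scan over sentences.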

This implementation also rests on some simplifying assumptions: (i) documents are acquired in the order in which the events they describe occur; (ii) documents with larger identifiers are acquired later and thus contain more up-to-date information; (iii) the event or information contained in a document with a larger identifier may update that in a document with a smaller identifier, but may not be inconsistent with (namely, contradict) it.

These assumptions have caused some limitations on our experimental system. Some limitations related to the management of the unstructured databases used in the experimental system are listed below.

1. Order requirement of join. As stated earlier, two document stems $\delta_i$ and $\delta_j$ can only be joined to form a resulting document stem $\delta_i \bowtie \delta_j$ if they share at least one object name and $i < j$. In this case, $\delta_j \bowtie \delta_i$ is not defined, and consequently, the relationship $\delta_i \bowtie \delta_j = \delta_j \bowtie \delta_i$ does not hold. This requirement follows from assumption (i); consequently, in the result after join, knowledge from the document acquired later always appears later, even though it may concern some earlier event.

2. Simplified treatment for partially redundant documents. Due to the above assumptions (i), (ii), and (iii), if two document stems contain redundant information, the information contained in the document with the larger identifier will always be used.

3. The need for dealing with inconsistency. The current implementation simply assumes that inconsistency does not exist. Therefore, even if a document contains information which contradicts an existing one, the implemented system cannot detect it.

All these limitations can be removed or reduced to a lesser degree, although the tasks may not be trivial. For instance, in order to remove limitation 3, we may add an independent component which employs an advanced computational technique (such as the approach described in [Baral, Kraus and Minker 1991]) so that only consistent information will be included in the final result.