
3.3 Situation Theory Aboutness for XML Retrieval

3.3.1 Structure as part of Aboutness

In a Situation Theory formal representation, we capture topical information in documents with situations as aggregations of infons. In general, we need to define how the information is aggregated for each retrieval approach. We show how this can be done in our analysis of actual retrieval models in Chapter 5.

For XML retrieval, documents are aggregations of document components differentiated according to XML element types. Document components do not simply describe smaller documents, but are structured information units and replace documents as the targeted information carrier in XML retrieval. Document components can be small, but what really distinguishes them from documents is that they add structure to the aboutness decision and therefore increase its reasoning complexity. Structures are new properties of documents and add information to the aboutness decision.

In order to specify the nature of this aboutness reasoning using the interaction of structure and content, we now discuss three different structural document paradigms: passage retrieval, hypertext retrieval and XML retrieval. We suggest embedding their reasoning in the structured document retrieval paradigms of an IR model developed in [Chiaramella, 2001]. As seen in Section 3.1.1.1, ‘embedding’ describes a theoretical evaluation approach in IR [Huibers, 1996] that formalises a model in order to describe other models.

In order to further analyse the influence of structure in passage, hypermedia and XML retrieval, [Chiaramella, 2001] has presented an algorithm that represents XML retrieval in particular by dividing it into a ‘fetch’ and a ‘browse’ phase. In the fetch phase, a pre-selection of document components takes place, which is narrowed down in the browse phase to retrieve the best document component with regard to structural constraints. We would like to extend this paradigm into a generic mechanism for describing the retrieval of document components. To this end, we divide the reasoning process for structured document retrieval into two analytical phases. In the first phase aboutness is decided, while in the second phase aboutness is specified with the help of structural hints.

In the fetch phase, we would like to evaluate a pre-selection based mostly on the general relevance of document components, while in the browse phase we consider the structure to better define aboutness in the retrieval process. In the next section, we use the fetch and browse paradigm to describe different structured document retrieval approaches. For each of the three approaches we are able to specify the general aboutness relation. We can explain differences in the aboutness behaviour as differences in how structure is considered in the browse phase: whether it is not considered at all, as in passage retrieval; whether it is considered as an independent constant, as in hypermedia retrieval; or whether it is seen as an integral part of the content of a document, as in XML retrieval.
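The two analytical phases can be illustrated with a minimal sketch. It assumes a toy term-overlap score for the fetch phase and an optional, pluggable browse function; the record layout (`id`, `text`, `depth`) and the `prefer_deeper` hint are hypothetical and are not taken from [Chiaramella, 2001].

```python
from typing import Callable, List, Optional, Tuple

Component = dict  # hypothetical record: {"id": str, "text": str, "depth": int}
Scored = Tuple[Component, int]


def fetch(components: List[Component], query: str) -> List[Scored]:
    """Fetch phase: pre-select components by term overlap with the query."""
    query_terms = set(query.lower().split())
    scored = []
    for comp in components:
        overlap = len(query_terms & set(comp["text"].lower().split()))
        if overlap > 0:
            scored.append((comp, overlap))
    # Highest overlap first; ties keep their input order (sorted is stable).
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def fetch_and_browse(
    components: List[Component],
    query: str,
    browse: Optional[Callable[[List[Scored]], List[Scored]]] = None,
) -> List[Scored]:
    """Run the fetch phase, then an optional browse phase that uses structure."""
    candidates = fetch(components, query)
    return browse(candidates) if browse is not None else candidates


def prefer_deeper(candidates: List[Scored]) -> List[Scored]:
    """Illustrative browse step (an assumption): prefer more deeply nested components."""
    return sorted(candidates, key=lambda pair: pair[0]["depth"], reverse=True)


components = [
    {"id": "article", "text": "XML retrieval and aboutness", "depth": 0},
    {"id": "section", "text": "aboutness for XML retrieval", "depth": 1},
    {"id": "footnote", "text": "typesetting notes", "depth": 2},
]
fetched_only = [c["id"] for c, _ in fetch_and_browse(components, "XML retrieval")]
with_browse = [c["id"] for c, _ in fetch_and_browse(components, "XML retrieval", prefer_deeper)]
```

Leaving `browse` empty corresponds to an approach that ignores structure at retrieval time, while passing a browse function lets structural hints reorder or narrow the pre-selection.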

Passage retrieval, hypertext retrieval and XML retrieval are all examples of developments in IR [Chiaramella, 2001] that assume that structure can be used to further describe the topicality of a document and therefore improve the determination of aboutness. Here, they are taken as paradigmatic examples of structured document retrieval and analysed in two steps. Firstly, they are mapped on to the model of fetch and browse. Chiaramella’s model is used to clearly distinguish structure and content aspects of the retrieval process. Secondly, the aboutness relation of the retrieval paradigm is related to the one of flat document retrieval: If D describes the document and Q the query, then D Q describes how D answers the query Q. Table 3.1 summarises our findings. We define these relations more formally in Section 4.3.

3.3.1.1 Passage Retrieval

Passage retrieval [Mittendorf and Schäuble, 1994] is one of the earlier approaches to structured document retrieval. It is based on the assumption that a more focussed discussion of information can be found in the passages of a document rather than the complete document. The targeted document components are passages and the document is seen as a sequence of passages. Passages only contain textual data and form a linear structure to represent aspects of the document. Passages can be of fixed or variable length. The indexing process creates the passage document component and either uses the existing document structure or a fixed number of words for each passage [Mittendorf and Schäuble, 1994].

Most importantly, in passage retrieval passages are not regarded as being topically interlinked. Each passage forms a distinct discourse, and each document component is independent. Thus, in passage retrieval structure is only used during indexing and not for retrieval. In terms of the fetch and browse paradigm, passages Di are retrieved in the fetch phase, and no browsing or focusing of the results takes place. Therefore, passage retrieval is expressed by the aboutness relation Di Q with D ≡ D1 ⊗ ... ⊗ Dn, where ⊗ stands for the composition of document components. The problem with passage retrieval is that structure is considered only during indexing and not in the other parts of the retrieval process. Moreover, passage indexing does not necessarily try to reflect the discourse of a document.
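A minimal sketch of this behaviour, assuming fixed-length passages and a toy term-count score (both are illustrative choices, not the method of [Mittendorf and Schäuble, 1994]), shows that structure enters only when the passages are created and that each passage is then scored on its own:

```python
from typing import List, Tuple


def split_into_passages(text: str, window: int = 5) -> List[str]:
    """Indexing: cut the document into consecutive passages of `window` words."""
    words = text.split()
    return [" ".join(words[i:i + window]) for i in range(0, len(words), window)]


def score_passage(passage: str, query: str) -> int:
    """Count how many passage words are query terms (toy relevance score)."""
    query_terms = set(query.lower().split())
    return sum(1 for word in passage.lower().split() if word in query_terms)


def retrieve_passages(text: str, query: str, window: int = 5) -> List[Tuple[int, int]]:
    """Fetch only: score every passage independently; no browse step follows."""
    passages = split_into_passages(text, window)
    hits = [(index, score_passage(p, query)) for index, p in enumerate(passages)]
    hits = [(index, score) for index, score in hits if score > 0]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


document = (
    "situation theory models information flow "
    "xml retrieval returns xml elements "
    "cooking recipes need fresh basil"
)
passages = split_into_passages(document)
results = retrieve_passages(document, "xml retrieval")
```

Here `results` lists (passage index, score) pairs; neighbouring passages never influence each other's score, which mirrors the independence of the components Di.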

Table 3.1: Structured document retrieval paradigms

Structured Document Retrieval Paradigm    Nature of aboutness relation
Passage retrieval                         Di Q with D ≡ D1 ⊗ ... ⊗ Dn
Hypermedia retrieval                      D pr (D c Q)
XML retrieval                             R(D, Q) = F (Q (D Q))

3.3.1.2 Hypermedia Retrieval

Hypermedia retrieval is our second structured document retrieval paradigm. So-called links and hyperdocuments together form a space of document components that are clustered via internal and external hyperlinks [Huibers, 1996]. Hypermedia documents are the basis of the World Wide Web. The retrieval of such documents uses the additional information of those hyperlinks to confirm the relevance of document components. A hyperstructure does not divide the individual document into smaller components, but clusters documents according to hyperlinks.

Most successful for everyday use was the two-step strategy of the original PageRank algorithm [Page et al., 1998]. In a simplified view of PageRank, a query Q is first evaluated against hyperdocuments D using conventional retrieval techniques: D c Q. This step can be called the fetch phase in the generic fetch and browse algorithm. After the fetch, the browse step considers the structure of the hyperlinks. The result list of the first step is sorted in descending order of the pages' so-called PageRank (pr), a value calculated on the basis of the link authority of the page. The pages are displayed in this order.
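The simplified two-step view can be sketched as follows. The sketch assumes that authority scores have already been computed from the link structure; the values in `authority`, the page texts and the function names are invented for illustration, and the PageRank computation itself is not reproduced.

```python
from typing import Dict, List


def fetch_by_content(pages: Dict[str, str], query: str) -> List[str]:
    """Fetch phase: keep pages whose text shares at least one term with the query."""
    query_terms = set(query.lower().split())
    return [
        page_id
        for page_id, text in pages.items()
        if query_terms & set(text.lower().split())
    ]


def browse_by_authority(fetched: List[str], authority: Dict[str, float]) -> List[str]:
    """Browse phase: order fetched pages by a precomputed, content-independent score."""
    return sorted(fetched, key=lambda page_id: authority[page_id], reverse=True)


# Hypothetical collection and hypothetical link-authority values.
pages = {
    "a": "xml retrieval tutorial",
    "b": "xml retrieval evaluation",
    "c": "gardening tips",
}
authority = {"a": 0.2, "b": 0.5, "c": 0.3}

ranking = browse_by_authority(fetch_by_content(pages, "xml retrieval"), authority)
```

Page `c` has a higher authority value than page `a` but never appears in `ranking`, because the structural score only reorders what the content-based fetch has already selected.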

Overall, two different and independent aboutness relations are calculated to determine aboutness: F [D c Q|D pr Q]. F is a function representing the complete retrieval process, which pushes the results of the first retrieval stage into the arguments of the second: D pr (D c Q). Aboutness is therefore based firstly on the topical relatedness of documents and query and secondly on the authority of the hyperdocument, a value entirely derived from structure. Hypermedia retrieval with such strategies lacks a combined attempt to use structure and content. Fetch and browse follow two independent aboutness relations. In the case of the original PageRank hypermedia retrieval algorithm, the authority values used in the browse step are even calculated independently of content and before the fetch step.

3.3.1.3 XML Retrieval

Out of these three structural retrieval approaches, only XML retrieval fulfils the full paradigm of fetch and browse by integrating structure and content fully. As seen in Section 2.2, XML specifies the discourse in documents by giving a formal representation of their division into document components. As presented, XML documents form a tree of information by using a recursive definition of document content. The advantage of the hierarchical structure is clearly that many information carriers from texts and websites to multimedia documents are commonly presented in a hierarchical structure. The discourse in most texts is structurally organised in sections, subsections, titles, etc., all of which can be easily represented given the flexibility of XML, by creating corresponding XML elements.

As defined in Section 3.2, we consider documents and queries to be situations with infons representing the collection of information in them. The structure of the XML representation of information allows us to focus the document situation on specific topics in predefined components:

1. XML elements are the atomic information units in XML. We translate this into our Situation Theory framework by stating that each XML element is an XML situation. For hypertext retrieval, on the other hand, there is no need to change the basic information unit, as the scope of the retrieval was still the full document.

2. Two or more such atomic information units can be linked. A link between two XML elements is called an edge. The semantic content of two linked units is never independent. Generally, XML elements have to be parents or children of other XML elements. A second and special case are XML attributes that offer either information about the specific element they are linked to or about the complete document tree. In passage retrieval, on the contrary, passages were informationally independent and therefore did not have relational infons. In hypertext retrieval the information flow was strictly separated in a structure and a content flow.
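The two points above can be made concrete with a small sketch that maps every XML element to a record standing in for an XML situation and stores the parent and child edges next to the element's own terms. The sample document, the dictionary layout and the identifier scheme (`tag#visit-order`) are assumptions made for illustration; they are not the formal representation defined later in the thesis.

```python
import xml.etree.ElementTree as ET

XML_SOURCE = """
<article author="Smith">
  <title>Aboutness</title>
  <section>
    <title>Fetch and browse</title>
    <paragraph>Structure adds information</paragraph>
  </section>
</article>
"""


def build_situations(element, parent_id=None, situations=None):
    """Create one record per XML element, linked to its parent and children."""
    if situations is None:
        situations = {}
    # Hypothetical identifier: tag name plus the order in which elements are visited.
    element_id = f"{element.tag}#{len(situations)}"
    situations[element_id] = {
        "terms": (element.text or "").split(),   # content carried by the element itself
        "parent": parent_id,                     # edge to the parent element
        "children": [],                          # edges to child elements
        "attributes": dict(element.attrib),      # XML attributes kept separately
    }
    if parent_id is not None:
        situations[parent_id]["children"].append(element_id)
    for child in element:
        build_situations(child, element_id, situations)
    return situations


situations = build_situations(ET.fromstring(XML_SOURCE.strip()))
```

Each record is reachable from its neighbours through the stored edges, in contrast to passages, which carry no such relational information.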

XML attributes are special in that they are not simply children, but properties of other XML elements [Gövert et al., 2006]. Furthermore, they might be informationally related not just to the XML element they are properties of: e.g., an author attribute might be part of an article element. This does not mean, however, that subelements of this article do not have the same author. Unless otherwise specified they do. This example demonstrates that for attributes at least the information in an XML tree is not just aggregated bottom up or ascending. It depends on how the attribute is propagated [Chiaramella, 2001].

This propagation of an attribute’s information can be descending, as just demonstrated, or ascending, as, e.g., in an edited book where the overall author is the sum of all authors of all book sections. If two different information units have two different authors, then their parent will have both as authors. Chiaramella calls attributes that only apply to their specific element static [Chiaramella, 2001]. XML element names are examples of such static attributes of structured information units. A title element name only declares its content to be a title. It fully depends on the power of the indexing model whether this kind of distinction is translated into the information units representing the document components. Our Situation Theory framework has to be expressive enough to consider all three structural meanings of attributes.
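The three structural meanings of attributes can be contrasted in a short sketch. The sample book, the attribute name `author` and the three function names are hypothetical; the sketch only illustrates descending, ascending and static readings of the same attribute and does not reproduce Chiaramella's formal definitions.

```python
import xml.etree.ElementTree as ET

BOOK = (
    '<book>'
    '<chapter author="Ann"><section/><section author="Bob"/></chapter>'
    '<chapter author="Cy"/>'
    '</book>'
)


def descending(element, name, inherited=None, out=None):
    """Descending: an element without its own value inherits the nearest ancestor's."""
    if out is None:
        out = []
    value = element.get(name, inherited)
    out.append((element.tag, value))
    for child in element:
        descending(child, name, value, out)
    return out


def ascending(element, name):
    """Ascending: an element carries its own value plus the values of all descendants."""
    values = set()
    if name in element.attrib:
        values.add(element.attrib[name])
    for child in element:
        values |= ascending(child, name)
    return values


def static(element, name):
    """Static: the value applies to this element only and is not propagated."""
    return element.get(name)


root = ET.fromstring(BOOK)
inherited_authors = descending(root, "author")
book_authors = ascending(root, "author")
book_static_author = static(root, "author")
```

Under the descending reading the first section inherits its chapter's author, under the ascending reading the book collects all three authors, and under the static reading the book element itself has no author at all.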

Clearly, attributes are special in so far as they do not aggregate information of their context XML elements. They can make an answer to an information need more focussed by providing additional information, but this focus does not necessarily specify information in the surrounding XML elements. Apart from the special case of attributes, the ‘natural’ information flow between XML elements indicates a hierarchy of information in XML documents. We will discuss this in more detail when we look at hierarchical inclusion for XML retrieval in Section 4.7.1.

According to Chiaramella’s definition, articles in INEX are informationally ‘maximal’, as they are exhaustive, while the lowest-level paragraphs are ‘minimal’ and very specific in their information return [Chiaramella, 2001]. For Chiaramella, the aim of XML retrieval is to avoid both maximal and minimal information units as answers to information needs [Chiaramella, 2001]. The maximal unit is the document if not the complete document collection, as the complete document collection can be regarded as one large (virtual) tree of XML elements. A user most satisfied with the complete document can, however, hardly be imagined. At the same time, the average user most likely requires more information than given in just one paragraph. She needs to know more about the context by possibly looking at surrounding paragraphs or by looking at information in other more distant paragraphs. Users have to ‘browse’ around. Only a combination of fetch and browse gives the best results, and XML retrieval integrates both.
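One crude way to picture the avoidance of maximal and minimal units is sketched below. It assumes that a matching leaf element (a minimal unit) is replaced by its parent and that the root (the maximal unit) is never returned; this rule, the sample article and the `id` attributes are simplifications invented for illustration, not an algorithm from [Chiaramella, 2001].

```python
import xml.etree.ElementTree as ET

ARTICLE = (
    '<article>'
    '<section id="s1"><p>xml retrieval models</p><p>fetch and browse</p></section>'
    '<section id="s2"><p>cooking recipes</p></section>'
    '<section id="s3"><p>evaluating xml</p></section>'
    '</article>'
)


def intermediate_answers(root, query):
    """Fetch matching leaves, then browse up one level to return their parents."""
    query_terms = set(query.lower().split())
    answers = []
    for parent in root.iter():
        if parent is root:
            continue  # never return the maximal unit
        for child in parent:
            is_minimal = len(child) == 0  # a leaf element has no children
            text_terms = set((child.text or "").lower().split())
            if is_minimal and query_terms & text_terms:
                answers.append(parent.get("id"))  # assumes parents carry an id
                break  # one matching leaf is enough to select the parent
    return answers


answers = intermediate_answers(ET.fromstring(ARTICLE), "xml")
```

The returned sections give the user the surrounding paragraphs as context without handing back the whole article.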

Using this fetch and browse analysis of XML retrieval, we are now able to define XML retrieval aboutness.
