INEX Evaluation Scales - Theoretical evaluation of XML retrieval

As seen in Chapter 2, for INEX the aim of XML retrieval is to retrieve not only relevant document components, but those at the right level of granularity, i.e. those that specifically answer a query. To evaluate how effective XML retrieval approaches are, it is necessary to consider whether the ‘right’ level is correctly identified. For this purpose, two evaluation criteria have been the basis for INEX to consider the structure when evaluating XML retrieval effectiveness, which we now want to look at in more detail.

As seen in Section 2.3, INEX has two evaluation dimensions:

• Topical exhaustivity reflects the extent to which the information contained in a document component satisfies the information need.

• Component specificity reflects the extent to which a document component focuses on the information need.

Specificity and exhaustivity were first used in the IR literature to describe properties of the set of indexing terms assigned to a document [Kazai and Lalmas, July 2005]. INEX uses them more in an aboutness sense, to name properties of document components. The history of the evaluation criteria and of INEX in general is described in [Kazai and Lalmas, 2006].

As discussed in Chapter 2, we use INEX 2005 as a baseline and refer to INEX 2004 results in this part only to explain INEX 2005. That is why we need to discriminate between exhaustivity and specificity. Since 2005, specificity has become the focus of INEX evaluations. It was found to reflect the requirements of XML retrieval better.

In order to capture varying degrees of exhaustivity and specificity, INEX has modelled them using graded scales, following an investigation by Kekäläinen and Järvelin [Järvelin and Kekäläinen, 2002]. Some advantages of such a scale are discussed in [Gövert et al., 2006]. In particular, using two measures of relevance allows us to discuss various degrees of exhaustivity against various degrees of specificity. In addition, a document component can be compared to its subcomponents: it might be seen to be more exhaustive than its children.

Prior to 2005, INEX developed a four-point ordinal scale for exhaustivity:

1. Not exhaustive (0): The document component information is not about the topic of request (query).

2. Marginally exhaustive (1): The topic of request according to the query is mentioned, but no more than in passing.

3. Fairly exhaustive (2): The document component discusses many aspects about the topic of request of the query, but not all. This includes those requests which have several subtopics and only some of them are considered.

4. Highly exhaustive (3): The document component is fully about the aspects of the query.

For specificity the same principles apply. XML retrieval systems should be rewarded if they deliver focussed document components. A retrieval system that locates the exact relevant paragraph in a document is likely to trigger higher user satisfaction than one that returns a component that is too large. Again, a binary scale was seen as insufficient, so a four-point scale is used:

1. Not specific (0): The topic as suggested in the query is not about a theme of the document component.

2. Marginally specific (1): The topic (query) is only a minor theme of the document component.

3. Fairly specific (2): The topic is mostly covered in the document component.

4. Highly specific (3): The topic is about the document component.

These are the evaluation scales for INEX 2004. INEX 2005 continues to use degrees of exhaustivity and specificity during the evaluation process, but not on an ordinal scale such as the one above. In Section 6.5.3, we discuss the implications of the changes in how the values for exhaustivity and specificity are derived for INEX 2005. Before that, we elaborate on the relationship between the evaluation scales of exhaustivity and specificity.

In order to show the relationship between exhaustivity and specificity, we employ the idea of an ideal concept space developed in [Wong and Yao, 1995]. Concepts are the elements in such a concept space, and document components and topics containing concepts are subsets of that concept space. Following this approach in [Gövert et al., 2006], a so-called component coverage matrix is developed that symbolises the differing degrees of overlap in concepts between topic and component for exhaustivity and specificity. This visualisation is very close to aboutness determination, as it treats information represented by a number of concepts as properties of document component and query (topic). The relationship of such concepts in query and documents allows us to determine specificity and exhaustivity.

[Kazai and Lalmas, July 2005] explain specificity and exhaustivity with the ideal concept space. Exhaustivity and specificity can be interpreted with the following formulas. Let T be a topic, C a component, and |.| a measure of size or a counting measure, as van Rijsbergen calls it [van Rijsbergen, 2004] (e.g., the total number of words in a document). Then:

$$exh = \frac{|C \cap T|}{|T|}, \qquad spec = \frac{|T \cap C|}{|C|}$$

Please note the difference between $C \cap T$ and $T \cap C$, which reflects the difference between exhaustivity and specificity according to Chiaramella’s fetch and browse paradigm, as explained in Section 3.3.2.
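The two ratios can be illustrated with a minimal sketch. It assumes that topic and component are modelled as plain sets of concepts, that the counting measure |.| is set cardinality, and that exhaustivity normalises the overlap by the size of the topic while specificity normalises it by the size of the component. The function names and the example concepts are hypothetical and serve only as illustration.

```python
def exhaustivity(component: set, topic: set) -> float:
    """Overlap of component and topic, normalised by the topic size."""
    if not topic:
        raise ValueError("topic must contain at least one concept")
    return len(component & topic) / len(topic)


def specificity(component: set, topic: set) -> float:
    """Overlap of topic and component, normalised by the component size."""
    if not component:
        raise ValueError("component must contain at least one concept")
    return len(topic & component) / len(component)


# Hypothetical concept sets (assumption: concepts are simple labels).
topic = {"xml", "retrieval", "evaluation", "metrics"}
component = {"xml", "retrieval", "indexing"}

print(exhaustivity(component, topic))  # 2 of 4 topic concepts are covered
print(specificity(component, topic))   # 2 of 3 component concepts are on topic
```

In this example the component covers half of the topic but is itself two-thirds about the topic, so the two dimensions take different values for the same overlap.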

The concept matrix is a powerful abstraction. It lacks, however, the means to represent relationships between the concepts. We would therefore like to reinterpret it as an ideal infon space. As convincing as the abstraction of a concept space seems to be for traditional IR, concepts themselves are not able to express the relationships among them. For XML retrieval this is not satisfactory, as structure cannot be represented. The relations between the concepts are neglected in favour of a simplified semantic model. A Situation Theory framework is more powerful. We suggest using infons instead of concepts in order to include relational infons and therefore structure. With Situation Theory, there is no need to assume independence of the elementary elements.

Figure 6.2: Infon coverage matrix with INEX 2004 scale

In the infon coverage matrix in Figure 6.2, the upper left square of each entry represents the document component situation, whereas the bottom right square represents the query situation. Together they form an abstract visualisation of an aboutness relation between a query and a document component situation. The shaded area symbolises the existence of aboutness. The larger the shaded area, the higher the corresponding specificity or exhaustivity value. For example, a (3,3) combination leads to a full shading, while (2,1) and (2,2) differ in that for (2,1) larger parts of the query situation are not covered by the document component.

Exhaustivity is measured by the size of the overlap of query and document component information in the shaded grey areas. Specificity, on the other hand, is determined by counting the rest of the information in the component that is not about the query. The less additional, non-useful information the component contains, the higher the specificity value. Thus, specificity measures the relation of relevant to non-relevant content within a single document component.

Table 6.1: Quantisations in INEX 2004

Quantisation   Function f(e, s)                                       User model

Strict4        f(e, s) = 1     if e = 3 and s = 3                     UU
                         0     otherwise

Gen4           f(e, s) = 1     if (e, s) = (3,3)                      EU
                         0.75  if (e, s) ∈ {(2,3), (3,2), (3,1)}
                         0.5   if (e, s) ∈ {(1,3), (2,2), (2,1)}
                         0.25  if (e, s) ∈ {(1,2), (1,1)}
                         0     if (e, s) = (0,0)

SOG            f(e, s) = 1     if (e, s) = (3,3)                      SU
                         0.9   if (e, s) = (2,3)
                         0.75  if (e, s) ∈ {(1,3), (3,2)}
                         0.5   if (e, s) = (2,2)
                         0.25  if (e, s) ∈ {(1,2), (3,1)}
                         0.1   if (e, s) ∈ {(2,1), (1,1)}
                         0     if (e, s) = (0,0)

AnyRel         f(e, s) = 0     if (e, s) = (0,0)                      TU
                         1     otherwise

For the INEX scales, all possible combinations of query and document component situations on the basis of an ideal infon space are shown in Figure 6.2. Each square represents a situation within a two-dimensional space spanned by the exhaustivity and specificity values of the four-point scales. Of the 16 possible combinations, we therefore have 10 defined positions in this space, as we can discard any combination of (0, i) or (i, 0) with i ∈ {1, 2, 3}: there can be no specificity without exhaustivity and vice versa.
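The count of 10 defined positions can be checked by enumeration. The following is a minimal sketch, assuming both scales take the values 0 to 3 and that a combination is defined only if exhaustivity and specificity are either both zero or both non-zero. The name `is_defined` is hypothetical.

```python
from itertools import product

SCALE = range(4)  # the four-point scale: 0, 1, 2, 3


def is_defined(e: int, s: int) -> bool:
    """A pair is defined if e and s are both zero or both non-zero."""
    return (e == 0) == (s == 0)


# Enumerate all 16 (e, s) pairs and keep only the defined ones.
positions = [(e, s) for e, s in product(SCALE, SCALE) if is_defined(e, s)]

print(len(positions))
print(positions)
```

The six discarded pairs are exactly those of the form (0, i) or (i, 0) with a non-zero i; the pair (0, 0) is kept as the non-relevant position.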

In Figure 6.2 we can see that the discrimination between scale values 1 and 2 for exhaustivity and specificity is based on the relatively larger parts that are not shaded. It is therefore a quantitative difference in degree. We suspect that this discrimination does not add value to an approach investigating aboutness, as such an approach looks at qualitative properties. We shall investigate this in Section 6.5.2.

Figure 6.2 visualises the relationship of the INEX specificity and exhaustivity scales. This chapter considers the changes in the scales used in INEX 2004 and 2005 from a theoretical point of view. Section 6.4 relates them to models for agent reasoning, as they are expressed in the so-called INEX quantisation functions which map the graded scales onto scalar values. Quantisations in INEX reflect the importance attached to exhaustivity and specificity as well as user standpoints as to what constitutes a relevant component [Gövert et al., 2006]. For example, the strict quantisation functions evaluate whether a given retrieval method is capable of retrieving highly exhaustive and highly specific document components.
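The mapping from graded (e, s) pairs onto scalar values can be sketched in code. The following is a minimal illustration that transcribes the values of Table 6.1 into lookup tables. The function names (`strict4`, `gen4`, `sog`, `any_rel`) and the choice to raise an error for undefined combinations such as (0, 2) are assumptions made for this sketch, not part of the INEX specification.

```python
# Lookup tables transcribed from Table 6.1 (keys are (e, s) pairs).
GEN4 = {
    (3, 3): 1.0,
    (2, 3): 0.75, (3, 2): 0.75, (3, 1): 0.75,
    (1, 3): 0.5, (2, 2): 0.5, (2, 1): 0.5,
    (1, 2): 0.25, (1, 1): 0.25,
    (0, 0): 0.0,
}
SOG = {
    (3, 3): 1.0,
    (2, 3): 0.9,
    (1, 3): 0.75, (3, 2): 0.75,
    (2, 2): 0.5,
    (1, 2): 0.25, (3, 1): 0.25,
    (2, 1): 0.1, (1, 1): 0.1,
    (0, 0): 0.0,
}


def _check(e: int, s: int) -> None:
    """Reject pairs outside the 10 defined positions (assumed behaviour)."""
    if (e, s) not in GEN4:
        raise ValueError(f"undefined (e, s) combination: {(e, s)}")


def strict4(e: int, s: int) -> float:
    _check(e, s)
    return 1.0 if (e, s) == (3, 3) else 0.0


def gen4(e: int, s: int) -> float:
    _check(e, s)
    return GEN4[(e, s)]


def sog(e: int, s: int) -> float:
    _check(e, s)
    return SOG[(e, s)]


def any_rel(e: int, s: int) -> float:
    _check(e, s)
    return 0.0 if (e, s) == (0, 0) else 1.0


# A fairly exhaustive, highly specific component under each quantisation.
for f in (strict4, gen4, sog, any_rel):
    print(f.__name__, f(2, 3))
```

Comparing the outputs for the same pair shows how each quantisation encodes a different view of what counts as a relevant component: the strict function rewards only (3, 3), whereas AnyRel treats every non-zero pair alike.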

By representing the agent reasoning in a formal logical framework, we will be able to relate it to exhaustivity and specificity. As shown in [Huibers, 1996], rational agents, whether computer systems or humans, have the ability to gather information and to reason about this gathered information. In Section 6.5, we analyse the aboutness decisions behind the graded scales for INEX 2004 and 2005. We demonstrate how to reason about the changes in the graded scales within our theoretical logic-based framework.
