Evaluation - Interactive Information Retrieval with Structured Documents

One can distinguish between four types of evaluations: 1) system-oriented, 2) user-based, 3) hybrid-approach, and 4) operational. Each type aims to evaluate different aspects of IR systems.

2.5.1 System-driven evaluation

System-oriented evaluations are based on the Cranfield model, which tests the quality of IR systems using test collections. The main aim of such evaluations is to assess algorithms: How good are the indexing techniques? How good is the ranking algorithm? How good is the relevance feedback? This type of evaluation does not require the involvement of users and can be performed in the laboratory under controlled settings.

Test collections comprise three components: 1) a set of documents, ranging from a few thousand titles to terabytes of text, 2) queries, usually created by the collection creators and occasionally derived from real queries, and 3) relevance judgements that record which documents are relevant or irrelevant to each query. Relevance judgements are obtained in different ways for different collections, sometimes by recruiting assessors and sometimes through collaborative efforts.

Most collections are too large to be assessed completely in order to find all relevant documents. Thus, pooling is performed before the relevance judgements for each topic are obtained. The main idea is to concentrate only on those documents that are most likely to be relevant. Multiple IR systems run the same topic, each producing a ranked list of documents. A fixed number of top-ranking documents is taken from each run, and these are merged into one pool. Assessors then read each document in the pool and rate its relevance.
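The pooling step can be illustrated with a minimal sketch. The function name `build_pool`, the run names and the document identifiers are hypothetical and chosen only for illustration; real pooling procedures may differ in details such as the pool depth per run.

```python
from typing import Dict, List, Set


def build_pool(runs: Dict[str, List[str]], depth: int) -> Set[str]:
    """Merge the top-`depth` documents of every run into one pool.

    `runs` maps a (hypothetical) system name to its ranked list of
    document identifiers for a single topic.
    """
    pool: Set[str] = set()
    for ranking in runs.values():
        # Only the first `depth` documents of each run enter the pool.
        pool.update(ranking[:depth])
    return pool


# Three illustrative runs for the same topic (made-up identifiers).
runs = {
    "system_a": ["d3", "d1", "d7", "d2"],
    "system_b": ["d1", "d5", "d3", "d9"],
    "system_c": ["d8", "d3", "d4", "d1"],
}

pool = build_pool(runs, depth=2)
print(sorted(pool))
```

Documents that appear in several runs enter the pool only once, so the assessors' workload grows more slowly than the number of participating systems.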

To evaluate the performance of a specific algorithm, two measures are used: precision and recall. Precision is the proportion of retrieved documents that are relevant, and recall is the proportion of relevant documents that are retrieved. High recall means retrieving everything relevant, possibly at low precision; high precision means retrieving a (possibly small) set of highly relevant documents. Systems are normally evaluated at various levels of recall. The F-measure (Equation 2.1) combines precision and recall into one number.

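The parameter α tunes the measure according to the relative interest in precision and recall: α = 1 weights both equally, α > 1 emphasises recall, and α < 1 emphasises precision.

\[
F_\alpha = \frac{(1 + \alpha) \cdot P \cdot R}{\alpha \cdot P + R} \qquad (2.1)
\]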
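As a small worked example of precision, recall and the F-measure of Equation 2.1, consider the following sketch. The function names and the sample document identifiers are assumptions made for illustration only.

```python
from typing import Iterable, Set


def precision(retrieved: Iterable[str], relevant: Set[str]) -> float:
    """Proportion of retrieved documents that are relevant."""
    retrieved = set(retrieved)
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0


def recall(retrieved: Iterable[str], relevant: Set[str]) -> float:
    """Proportion of relevant documents that are retrieved."""
    retrieved = set(retrieved)
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0


def f_measure(p: float, r: float, alpha: float = 1.0) -> float:
    """F-measure as in Equation 2.1: (1 + alpha) * P * R / (alpha * P + R)."""
    denominator = alpha * p + r
    return (1 + alpha) * p * r / denominator if denominator else 0.0


# Made-up example: 4 documents retrieved, 5 relevant overall, 2 in common.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3", "d5", "d6", "d7"}

p = precision(retrieved, relevant)  # 2 / 4
r = recall(retrieved, relevant)     # 2 / 5
print(p, r, f_measure(p, r), f_measure(p, r, alpha=4.0))
```

With α = 1 the value lies between precision and recall; with α = 4 it moves towards the recall value, which reflects the stronger weight on recall.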

The assumptions of the Cranfield approach are often criticised because 1) relevant documents are assumed to be independent of each other, 2) all documents are treated as equally important, 3) high recall is emphasised, and 4) interaction is ignored.

2.5.2 User-centred evaluation

User-oriented evaluations assess systems as a whole, including algorithms and interfaces. The integral parts of such evaluations are the experimental subjects, the search tasks, the system and the collections. Such evaluations are performed in relatively controlled environments. Control is imposed on the task, the time allowed for each task, instructions, training and help, and by permuting the order in which tasks are performed.
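One common way to permute task order across participants is a rotation-based Latin square, sketched below. This is an assumption made for illustration: the text only states that the task order is permuted and does not name a particular scheme. The function names and the participant and task labels are hypothetical.

```python
from typing import Dict, List


def latin_square(tasks: List[str]) -> List[List[str]]:
    """Return a rotation-based Latin square of task orders.

    Every task occurs exactly once in every row and in every position.
    """
    n = len(tasks)
    return [[tasks[(row + col) % n] for col in range(n)] for row in range(n)]


def assign_orders(participants: List[str], tasks: List[str]) -> Dict[str, List[str]]:
    """Assign each participant one row of the square, cycling if needed."""
    square = latin_square(tasks)
    return {p: square[i % len(square)] for i, p in enumerate(participants)}


orders = assign_orders(["P1", "P2", "P3", "P4"], ["T1", "T2", "T3"])
for participant, order in orders.items():
    print(participant, order)
```

A simple rotation balances the position of each task but not the sequence effects between neighbouring tasks; balancing those as well would require a different design.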

Both qualitative and quantitative analyses are performed to present the results. Qualitative data is gathered through questionnaires (users’ characteristics, task-level standing before and after performing each task), think-alouds, semi-structured interviews and open discussions. Quantitative data is gathered from system logs and video recordings. Results are analysed using statistical significance tests such as the Mann-Whitney test, the t-test and the chi-square test.

The TREC interactive track was set up to develop better methodologies for the evaluation of IIR systems. The methodology employed by the track was criticised for adopting system-driven conditions for the execution and evaluation of interactive experiments. For instance, interactive TREC does not deal with information needs but with pre-constructed information requests, binary relevance assessments, etc. [Borlund, 2000b].

2.5.3 Hybrid evaluation

[Borlund, 2003] proposed the hybrid approach for the evaluation of interactive retrieval systems, which takes into account the searcher, the dynamic nature of information needs and relevance, and experimental control. She proposed the measures Ranked Half-Life and Relative Relevance to measure the effectiveness of an IR system. The measures are based on subjective and objective types of relevance.

2.5.4 Operational evaluation

The fourth type is operational evaluation, in which the whole system is used in real situations without any controlled settings. Searchers work on their own tasks, decide for themselves when to stop, and search without any training. The results are difficult to interpret, but they are more realistic. Longitudinal evaluations have some similarities with this type; there, an information problem is assumed to persist over a longer period of time, such as days, weeks, months or even years. Some studies along these lines have focused on information seeking behaviour [Ellis, 1989, Kuhlthau, 1991, Kelly, 2004].

In this chapter, we introduce the search system DAFFODIL and describe its architecture and design details.

DAFFODIL (Distributed Agents for User-Friendly Access of Digital Libraries) provides user-oriented access to federated digital libraries and offers a rich set of functionalities across a heterogeneous set of digital libraries. The current prototype gives access to 10 digital libraries in the area of computer science. From iTrack 2005 onwards, a modified version of DAFFODIL was used as the user interface for XML retrieval. Thus the basic features of DAFFODIL are described in this chapter.

3.1 Functionality of a federated digital library system

DAFFODIL is aimed at providing high-level search functions, in contrast to conventional search engines, which mostly offer only simple, basic search operations. The concept of high-level search activities for strategic support is based on Bates’s ideas [Bates, 1990]. She distinguishes four levels of search activities on the basis of empirical studies of the information seeking behaviour of experienced library users. Whereas typical information systems support only low-level search functions (so-called moves), Bates introduced three additional levels of strategic search functions:

• A Move is a simple act, such as typing terms into a search form or submitting a query. (In DAFFODIL, wrappers connect to the various DLs at this level. The heterogeneity problem is addressed by DL-specific translation of the submitted query or by mapping the returned data into a homogeneous XML metadata format.)

• A Tactic is a combination of moves. For example, breaking down a complex information need into subproblems, or broadening or narrowing a query, are frequently applied tactics.

• A Stratagem is a complex set of actions, comprising different moves and/or tactics, exercised on a single domain. (DAFFODIL provides domain-specific depth-search functionality by applying tactics to a set of similar items, e.g. subject search, journal runs or citation search.)

• A Strategy is a complete plan for satisfying an information need. Typically, it consists of more than one stratagem. Strategies are not yet supported automatically by DAFFODIL. Instead, the user is enabled to work in a much more strategy-oriented way by applying the high-level functions of stratagems and tactics.
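The layering of moves, tactics and stratagems described in the list above can be sketched as a small data model. This is only an illustration of how the levels build on one another; the representation of a query as a tuple of terms and all class and function names are assumptions, not DAFFODIL's actual design.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Query = Tuple[str, ...]          # assumption: a query is a tuple of terms
Move = Callable[[Query], Query]  # a move is one elementary change to a query


def add_term(term: str) -> Move:
    """Move: add a term to the query (narrows a conjunctive query)."""
    return lambda q: q + (term,) if term not in q else q


def drop_term(term: str) -> Move:
    """Move: remove a term from the query (broadens a conjunctive query)."""
    return lambda q: tuple(t for t in q if t != term)


@dataclass
class Tactic:
    """A tactic is a combination of moves, applied in sequence."""
    name: str
    moves: List[Move]

    def apply(self, query: Query) -> Query:
        for move in self.moves:
            query = move(query)
        return query


@dataclass
class Stratagem:
    """A stratagem applies a tactic to a set of similar items."""
    name: str
    tactic: Tactic

    def apply(self, queries: List[Query]) -> List[Query]:
        return [self.tactic.apply(q) for q in queries]


narrow = Tactic("narrow", [add_term("xml"), add_term("retrieval")])
broaden = Tactic("broaden", [drop_term("retrieval")])

narrowed = narrow.apply(("evaluation",))
print(narrowed)
print(broaden.apply(narrowed))
print(Stratagem("narrow all", narrow).apply([("fuhr",), ("bates",)]))
```

A strategy would then be a plan that sequences several stratagems; it is left out here because, as stated above, strategies are left to the user.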

3.2 The WOB model

The graphical user interface design of DAFFODIL is based on the WOB model [Krause, 1995]. WOB is a German acronym for “object oriented directly manipulative graphical user interface based on the tool metaphor”. It attempts to resolve the inherent contradictions in the interface design process, such as that between flexible dialogue control and conversational prompting, using a set of co-ordinated ergonomic techniques. It tries to fill the conceptual gap between interface style guides (e.g. the Java Look and Feel Guidelines) and generic international standards (e.g. ISO 13407: Human-centred design processes for interactive systems).

The general software ergonomic principles of the WOB model, as described in [Fuhr et al., 2002c], are:

Strict Object Orientation and Interpretability of Tools: Strongly related functionality of the system is encapsulated in tools that are displayed as icons (not as menus). The tools open views, which are ‘normal’ dialogue windows. Due to well-defined dialogue guidelines, the chain of views a user is working on can be interpreted as a set of forms to be filled. In contrast, experienced users will prefer the tool view, which enables them to perform tasks more quickly; however, this view is cognitively more complex, and it is not required for interpretation. The user can manipulate objects on the surface in a direct manipulative manner. It is essential that consistency is guaranteed for the direction of the manipulation. Thus, the model requires an object-on-object interaction style with a clear direction and semantics. The generally recommended interaction style is as follows: to apply a function to an item, the item has to be dragged to a tool.

Dynamic Adaptivity: The interface always adapts its layout and content to the current state and context. This is mostly used to reduce complexity in non-trivial domains, such as browsing several relevant hierarchies simultaneously. For example, the user may set the relevant context by choosing a classification entry; when the journal catalogue is activated as the next step, the journals are filtered according to the valid classification context.

Context Sensitive Permeability: When known information is reusable in other contexts, it will automatically be reused.

Figure 3.1: DAFFODIL in use

Dialogue Guidelines: The views of the tools are functionally connected, e.g. by means of action buttons, hypertext links or rules which are triggered by plan recognition. A tool can also open its view proactively if the user may need its function in a given situation.

Intelligent Components: Tools and controls in the interface have access to context and state in order to decide whether their function is valuable for the user. If applicable, they interact proactively with the user or the shared environment (the desktop).

3.3 Agent-based Architecture

In order to implement high-level search activities, an agent-based architecture (ABA) was chosen (see e.g. [Wooldridge and Jennings, 1995]). The following features of agents are relevant for IR applications [Fuhr et al., 2000]:

Autonomy: An agent is a process of its own, and thus it can operate independently of other agents.

Intelligence: An agent is able to process knowledge and to draw inferences; in our case of an IR application, an agent should be capable of uncertain reasoning.

Reactiveness: An agent reacts when prompted by another agent.

Proactiveness: An agent is able to take the initiative itself, e.g. when it detects changes in its environment that require action.

Adaptiveness: An agent can adapt its behaviour to the application it is being used for.

Communication: An agent is able to communicate with other agents peer-to-peer.

For our DL application, communication and the control flow (including autonomy, reactiveness and proactiveness) are the most relevant features.

Wrappers are responsible for the communication with digital libraries. The wrappers provide access to a variety of heterogeneous data sources, among them locally available databases, remote web services and Internet sites that work with enquiry forms. For the iTrack version of DAFFODIL, three wrappers were set up, for the collections IEEE-CS, Wikipedia and Lonely Planet. The wrappers have a common query language so that the client can distribute queries uniformly to the wrappers. The agents communicate with one another over CORBA, as shown in Figure 3.2.
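The role of the wrappers can be sketched as follows: each wrapper accepts the same common query and translates it into the syntax of its own source. Everything in this sketch is hypothetical (the class names, the two target syntaxes, the source names and the representation of the common query as a dictionary); it does not reproduce DAFFODIL's query language or its CORBA interfaces.

```python
from abc import ABC, abstractmethod
from typing import Dict
from urllib.parse import quote_plus

CommonQuery = Dict[str, str]  # assumption: field name -> search string


class Wrapper(ABC):
    """Hypothetical wrapper interface: common query in, source syntax out."""

    @abstractmethod
    def translate(self, query: CommonQuery) -> str:
        ...


class FieldedWrapper(Wrapper):
    """Source with a fielded Boolean syntax, e.g. title:(xml retrieval)."""

    def translate(self, query: CommonQuery) -> str:
        return " AND ".join(f"{f}:({v})" for f, v in sorted(query.items()))


class FormWrapper(Wrapper):
    """Source that is queried through an enquiry form (URL parameters)."""

    def translate(self, query: CommonQuery) -> str:
        return "&".join(f"{f}={quote_plus(v)}" for f, v in sorted(query.items()))


def distribute(query: CommonQuery, wrappers: Dict[str, Wrapper]) -> Dict[str, str]:
    """Send the same common query to every wrapper and collect the results."""
    return {name: wrapper.translate(query) for name, wrapper in wrappers.items()}


wrappers = {"source_a": FieldedWrapper(), "source_b": FormWrapper()}
translated = distribute({"title": "xml retrieval", "author": "fuhr"}, wrappers)
for name, text in translated.items():
    print(name, "->", text)
```

Because the client only talks to the `Wrapper` interface, adding a further source means adding one class, without changing the code that distributes the query.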

The middleware agents (so-called services) offer functions and data that are necessary for the realisation of stratagems and tactics. For example, there is a service for merging the metadata of a document from different wrappers, and there are also specialised author, journal and conference services. For iTrack DAFFODIL, there are services for fetching document/element details, contexts of related terms, etc.
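A service that merges the metadata of one document from different wrappers could, in its simplest form, look like the sketch below. The merge rule used here (the first non-empty value for each field wins) and the field names are assumptions for illustration; the text does not describe how DAFFODIL resolves conflicting values.

```python
from typing import Dict, List, Optional

Record = Dict[str, Optional[str]]


def merge_metadata(records: List[Record]) -> Dict[str, str]:
    """Merge records describing the same document.

    Assumed rule: for every field, keep the first non-empty value in
    the order in which the records are given.
    """
    merged: Dict[str, str] = {}
    for record in records:
        for field, value in record.items():
            if value and field not in merged:
                merged[field] = value
    return merged


# Made-up records for one document as returned by two wrappers.
from_wrapper_a = {"title": "An Example Title", "year": None, "pages": "1-10"}
from_wrapper_b = {"title": "An example title.", "year": "2005", "abstract": "..."}

merged = merge_metadata([from_wrapper_a, from_wrapper_b])
print(merged)
```

The order of the input list acts as a priority ranking of the sources, which is one simple way to make the result deterministic.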

Figure 3.2: DAFFODIL architecture

The event-based message architecture connecting the user interface tools also uses the cross-system message structure, via the Message Transfer Agent (MTA). Internal events, which relate to ASK or TELL events, are transformed into messages and sent via HTTP to the corresponding service. The answer is then delivered to the original sender in the GUI.
