Evaluating Information Retrieval Systems

2.6 Evaluation Measures

2.6.2 Evaluating Information Retrieval Systems

There are several ways to evaluate an information retrieval system. For instance, one can distinguish between evaluations in batch mode, where a single query results in a particular system answer, and evaluations in interactive sessions (Baeza-Yates and Ribeiro-Neto, 1999: p.74). While the latter require the analysis of user behavior over a series of interactive steps with the system, evaluations in batch mode can be regarded as laboratory experiments and are repeatable (Baeza-Yates and Ribeiro-Neto, 1999: p.74). Furthermore, there are evaluation measures for unranked and for ranked retrieval scenarios (Manning et al., 2008: p.155). In unranked retrieval scenarios, the retrieved documents are treated as an unordered set, while in ranked scenarios the ordering provided by the system comes into play.

As in almost all academic research, only batch-style evaluations are considered in this work. In the following, evaluation methods for such experiments are surveyed. After briefly explaining measures for unranked retrieval, we present further measures for ranked retrieval.

Unranked Retrieval Evaluation Measures: Precision, Recall, and F-score

Similar to information extraction, unranked information retrieval can be considered a binary classification task (Manning et al., 2008: p.152). Given a collection of documents and an information need expressed by a query, each document is either relevant or non-relevant, and the system either retrieves it for the query or does not. Thus, the decisions of a system can again be classified as true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). As shown in Table 2.4, instead of “positive” and “negative”, one uses the terms “relevant”, “non-relevant”, “retrieved”, and “not retrieved” when evaluating information retrieval systems (Manning et al., 2008: p.155).

                      document collection (ground truth)
system prediction     relevant     non-relevant
retrieved             TP           FP
not retrieved         FN           TN

Table 2.4: Confusion matrix in the context of information retrieval.

Based on the categories in the confusion matrix, one can calculate precision and recall as already detailed in Equation 2.1 and Equation 2.2, respectively. However, in the context of information retrieval, it is more intuitive to formulate precision as the ratio of retrieved relevant items to all retrieved items, and recall as the ratio of retrieved relevant items to all relevant items, as done in Equation 2.7 and Equation 2.8, respectively (Manning et al., 2008: p.155).

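\[
\text{precision} \quad p = \frac{TP}{TP + FP} = \frac{\#(\text{relevant items retrieved})}{\#(\text{retrieved items})} \tag{2.7}
\]

\[
\text{recall} \quad r = \frac{TP}{TP + FN} = \frac{\#(\text{relevant items retrieved})}{\#(\text{relevant items})} \tag{2.8}
\]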
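To make the set-based view of Equations 2.7 and 2.8 concrete, the following is a minimal sketch for a single query. The function names, the document identifiers, and the convention of returning 0.0 for an empty denominator are assumptions made for illustration and are not taken from the cited sources.

```python
def confusion_counts(retrieved, relevant, collection):
    """Return (TP, FP, FN, TN) for one query, following Table 2.4."""
    retrieved, relevant, collection = set(retrieved), set(relevant), set(collection)
    tp = len(retrieved & relevant)               # retrieved and relevant
    fp = len(retrieved - relevant)               # retrieved but non-relevant
    fn = len(relevant - retrieved)               # relevant but not retrieved
    tn = len(collection - retrieved - relevant)  # neither retrieved nor relevant
    return tp, fp, fn, tn


def precision(tp, fp):
    # Equation 2.7; returning 0.0 when nothing was retrieved is an assumption.
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fn):
    # Equation 2.8; returning 0.0 when nothing is relevant is an assumption.
    return tp / (tp + fn) if tp + fn else 0.0


# Hypothetical toy collection of eight documents.
collection = {"d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8"}
relevant = {"d1", "d2", "d3", "d4"}
retrieved = {"d1", "d2", "d5"}

tp, fp, fn, tn = confusion_counts(retrieved, relevant, collection)
p = precision(tp, fp)  # 2 of 3 retrieved items are relevant
r = recall(tp, fn)     # 2 of 4 relevant items were retrieved
```

In this toy example, two of the three retrieved documents are relevant (precision 2/3), and two of the four relevant documents are retrieved (recall 0.5).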

For a meaningful combination of precision and recall, the f-score can again be calculated (Equation 2.4, page 29). However, in the presented form, precision, recall, and f-score can only be used to evaluate systems in unranked information retrieval scenarios. To evaluate systems in ranked retrieval scenarios, they have to be adapted or other evaluation measures have to be used, as will be detailed next.

Ranked Retrieval Evaluation Measures: Precision at k, Average Precision at k, nDCG at k

When evaluating a system in a ranked retrieval scenario, “appropriate sets of retrieved documents are naturally given by the top k retrieved documents” (Manning et al., 2008: p.158). Thus, a simple evaluation measure in ranked retrieval scenarios is precision at k (Equation 2.9).

\[
\text{precision at } k \quad p@k = \frac{\#(\text{retrieved relevant items ranked} \le k)}{\#(\text{retrieved items ranked} \le k)} \tag{2.9}
\]

A shortcoming of precision at k is that it does not consider the ranking within the set of the top k documents. For instance, two systems A and B both have a precision at k of 0.5 if each retrieves k/2 relevant documents among its top k. However, system A might have placed the relevant documents at ranks 1 to k/2, while system B might have placed them at ranks k/2 + 1 to k. Clearly, one would prefer system A over system B since its ranking is much better.
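The shortcoming can be illustrated with a small sketch of Equation 2.9. The representation of a result list as 0/1 relevance flags in rank order and the two example rankings are assumptions chosen for illustration.

```python
def precision_at_k(ranked_relevance, k):
    """Equation 2.9 for a list of 0/1 flags in rank order (index 0 = rank 1)."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_relevance[:k]
    # The denominator counts the retrieved items ranked <= k, which may be
    # fewer than k if the system returned a short list.
    return sum(top_k) / len(top_k) if top_k else 0.0


k = 6
system_a = [1, 1, 1, 0, 0, 0]  # relevant documents at ranks 1 to k/2
system_b = [0, 0, 0, 1, 1, 1]  # relevant documents at ranks k/2 + 1 to k

p_a = precision_at_k(system_a, k)
p_b = precision_at_k(system_b, k)
```

Both systems obtain a precision at 6 of 0.5, although system A clearly provides the better ranking.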

An evaluation measure taking into account the ranking within the set of the top k documents is average precision, since it “aggregates many precision numbers into one evaluation figure” (Manning and Schütze, 2003: p.535). Usually, “precision at relevant documents that are not in the returned set is assumed to be zero” (Manning and Schütze, 2003: p.536).

However, instead of considering all relevant documents for a query (which requires that each document in a document collection is annotated as being relevant or non-relevant), average precision can also be used with a fixed cut-off level, i.e., as average precision at k. As formulated in Equation 2.10 (following Yue et al., 2007), it is calculated as the sum of the precision scores (p@j) at each rank j at which the system retrieved a relevant document (p_j = 1), for ranks less than or equal to k. This sum is then divided by either the number of relevant documents for the query (rel) or by k (if there are more than k relevant documents).

\[
\text{average precision at } k \quad ap@k = \frac{1}{\min(rel, k)} \sum_{j:\, p_j = 1} p@j \tag{2.10}
\]
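A minimal sketch of Equation 2.10 shows how average precision at k separates two rankings that precision at k cannot distinguish. The function name, the 0/1 flag representation, and the example values (k = 6, rel = 3) are assumptions for illustration.

```python
def average_precision_at_k(ranked_relevance, k, num_relevant):
    """Equation 2.10 for 0/1 flags in rank order; num_relevant is rel."""
    if k <= 0 or num_relevant <= 0:
        return 0.0  # assumed convention for degenerate inputs
    hits = 0
    precision_sum = 0.0
    for j, is_relevant in enumerate(ranked_relevance[:k], start=1):
        if is_relevant:            # p_j = 1
            hits += 1
            precision_sum += hits / j  # p@j at the rank of a relevant document
    return precision_sum / min(num_relevant, k)


k, rel = 6, 3
system_a = [1, 1, 1, 0, 0, 0]  # relevant documents at ranks 1 to 3
system_b = [0, 0, 0, 1, 1, 1]  # relevant documents at ranks 4 to 6

ap_a = average_precision_at_k(system_a, k, rel)  # (1/1 + 2/2 + 3/3) / 3
ap_b = average_precision_at_k(system_b, k, rel)  # (1/4 + 2/5 + 3/6) / 3
```

Here system A reaches an average precision at 6 of 1.0, whereas system B reaches only about 0.38, which reflects the preference for system A.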

In some evaluation scenarios, documents are not only classified in a binary fashion as relevant or non-relevant but are graded according to how relevant they are. While average precision at k takes the ranking of documents into account, it only considers whether a document is relevant and does not distinguish between different grades of relevance. Intuitively, highly relevant documents are more valuable than marginally relevant documents (Järvelin and Kekäläinen, 2002) and should be ranked higher than lower graded ones. Thus, an evaluation measure suitable for graded relevance judgments should penalize rankings in which highly graded documents appear at lower ranks.

An evaluation measure taking non-binary relevance judgments into account is (normalized) discounted cumulative gain, as depicted in Equation 2.11 (Järvelin and Kekäläinen, 2002). Because the relevance judgment of every retrieved document with a rank i > 1 is divided by log2(i), the maximal contribution a document can make to the score becomes lower the higher its rank i. Thus, an optimal ranking has to order the documents by their relevance scores (see footnote 17).

\[
\text{discounted cumulative gain at } k \quad DCG@k = rel_1 + \sum_{i=2}^{k} \frac{rel_i}{\log_2(i)} \tag{2.11}
\]

An alternative calculation of discounted cumulative gain at k is formulated in Equation 2.12 (see, e.g., Manning et al., 2008: p.163). Note that Equations 2.11 and 2.12 do not result in identical scores. However, both penalize rankings in which highly relevant documents are placed at lower ranks.

\[
\text{discounted cumulative gain at } k \text{ (alternative)} \quad DCG@k = \sum_{i=1}^{k} \frac{2^{rel_i} - 1}{\log_2(i + 1)} \tag{2.12}
\]
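The difference between the two formulations can be seen in a short sketch. The function names and the graded relevance values (on an assumed scale from 0 to 3) are hypothetical and only serve as an illustration.

```python
import math


def dcg_at_k(relevances, k):
    """Equation 2.11: rel_1 + sum over i = 2..k of rel_i / log2(i)."""
    top_k = relevances[:k]
    if not top_k:
        return 0.0
    discounted = sum(rel / math.log2(i) for i, rel in enumerate(top_k[1:], start=2))
    return top_k[0] + discounted


def dcg_at_k_alt(relevances, k):
    """Equation 2.12: sum over i = 1..k of (2^rel_i - 1) / log2(i + 1)."""
    return sum(
        (2 ** rel - 1) / math.log2(i + 1)
        for i, rel in enumerate(relevances[:k], start=1)
    )


good_ranking = [3, 2, 0, 1]  # highly relevant document at rank 1
poor_ranking = [1, 2, 0, 3]  # highly relevant document at rank 4

dcg_good = dcg_at_k(good_ranking, 4)      # 3 + 2/1 + 0 + 1/2 = 5.5
dcg_poor = dcg_at_k(poor_ranking, 4)      # 1 + 2/1 + 0 + 3/2 = 4.5
alt_good = dcg_at_k_alt(good_ranking, 4)  # roughly 9.32
alt_poor = dcg_at_k_alt(poor_ranking, 4)
```

The absolute scores of the two variants differ, but under both the ranking that places the highly relevant document first scores higher. Note also that under Equation 2.11 the documents at ranks 1 and 2 are both undiscounted, since log2(2) = 1.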

Regardless of whether Equation 2.11 or Equation 2.12 is used, the DCG@k measure can be normalized so that a perfect ranking results in a score of 1. For this, DCG@k is divided by the ideal discounted cumulative gain at position k (IDCG@k), i.e., by the score of a perfect ranking.

\[
\text{normalized discounted cumulative gain at } k \quad nDCG@k = \frac{DCG@k}{IDCG@k} \tag{2.13}
\]
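The normalization in Equation 2.13 can be sketched as follows, here using the DCG formulation of Equation 2.11. The function names, the parameter judged_relevances (the grades of all judged documents for the query, from which the ideal ranking is built), and the example grades are assumptions for illustration; how the ideal ranking is obtained in a concrete evaluation depends on the available judgments.

```python
import math


def dcg_at_k(relevances, k):
    """Equation 2.11: rel_1 + sum over i = 2..k of rel_i / log2(i)."""
    top_k = relevances[:k]
    if not top_k:
        return 0.0
    return top_k[0] + sum(
        rel / math.log2(i) for i, rel in enumerate(top_k[1:], start=2)
    )


def ndcg_at_k(ranked_relevances, judged_relevances, k):
    """Equation 2.13: DCG@k divided by the DCG@k of a perfect ranking."""
    # Assumption: the perfect ranking sorts all judged grades in descending order.
    ideal = sorted(judged_relevances, reverse=True)
    idcg = dcg_at_k(ideal, k)
    # Returning 0.0 when no judged document is relevant is an assumed convention.
    return dcg_at_k(ranked_relevances, k) / idcg if idcg > 0 else 0.0


judged = [3, 2, 1, 0]      # hypothetical grades of all judged documents
perfect = [3, 2, 1, 0]     # system ranking sorted by grade
reversed_ = [0, 1, 2, 3]   # worst possible ordering of the same documents

ndcg_perfect = ndcg_at_k(perfect, judged, 4)
ndcg_reversed = ndcg_at_k(reversed_, judged, 4)
```

The perfect ranking yields an nDCG at 4 of 1, while the reversed ranking yields a score strictly between 0 and 1.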

17 Note that it is neither necessary to use the logarithm base 2 nor to use a logarithmic discount at all. However, using the base 2 logarithm results in a smooth reduction (Järvelin and Kekäläinen, 2002).

In document Domain-sensitive Temporal Tagging for Event-centric Information Retrieval (Page 46-49)