Evaluating information retrieval experiments

Information retrieval methods are usually compared based on the number of relevant documents that they find for the queries. Relevant documents are those that satisfy the user’s information need expressed by the query.

Unless otherwise stated, we used the Lemur toolkit (www.lemurproject.org) for all experiments reported in this thesis.

$$\mathrm{Cosine}(q, d) = \frac{\sum_{t} w_{t,q} \times w_{t,d}}{\sqrt{\sum_{t} w_{t,q}^{2}} \times \sqrt{\sum_{t} w_{t,d}^{2}}} \tag{2.1}$$

$$\mathrm{KL}(q, d) = -\sum_{t \in q} p(t \mid \hat{\theta}_q) \log p(t \mid \hat{\theta}_d) \tag{2.2}$$

$$\mathrm{BM25}(q, d) = \sum_{t \in q} \log\left(\frac{V_c - df_{t,c} + 0.5}{df_{t,c} + 0.5}\right) \times \frac{f_{t,d}\,(k_1 + 1.0)}{f_{t,d} + k_1\left(1.0 - b + b \times \frac{|d|}{\overline{|d|}}\right)} \times \frac{f_{t,q}\,(k_3 + 1.0)}{f_{t,q} + k_3} \tag{2.3}$$

$$\mathrm{INQUERY}(q, d) = 0.4 + 0.6 \times \frac{f_{t,d}}{f_{t,d} + 0.5 + 1.5 \cdot \frac{|d|}{\overline{|d|}}} \times \frac{\log\left(\frac{|c| + 0.5}{df_{t,c}}\right)}{\log(|c| + 1)} \tag{2.4}$$

where the definition of each quantity is listed below:

| Symbol | Definition |
|---|---|
| $b$ | an Okapi BM25 constant parameter, range $[0, 1]$ |
| $\lvert c \rvert$ | the number of documents in collection $c$ |
| $\lvert d \rvert$ | the number of terms in document $d$ (document length) |
| $\overline{\lvert d \rvert}$ | the average number of terms in a document |
| $df_{t,c}$ | the number of documents in collection $c$ that contain $t$ |
| $f_{t,c}$, $f_{t,d}$, $f_{t,q}$ | the frequency of term $t$ in collection $c$, document $d$, and query $q$ |
| $k_1$, $k_3$ | Okapi BM25 constant parameters, range $[0, \infty)$ |
| $q$ | the query as a set of terms |
| $\lvert q \rvert$ | the number of terms in the query |
| $V_c$ | the total number of distinct terms in collection $c$ |
| $w_{t,q}$ | $\ln\left(1 + \frac{V_c}{df_{t,c}}\right)$ |
| $w_{t,d}$ | $1 + \ln f_{t,d}$ |
| $\hat{\theta}_d$, $\hat{\theta}_q$ | the language models of document $d$ and query $q$ |

Figure 2.3: Document ranking models that are currently in use and have been used in our experiments in this thesis. From top to bottom: a variation of the Cosine metric [Salton and McGill, 1983; Salton, 1989], KL-Divergence language modeling [Lafferty and Zhai, 2001], Okapi BM25 [Robertson et al., 1992; Sparck-Jones et al., 2000], and INQUERY [Callan et al., 1992; 1997; Allan et al., 2000].

Precision and recall are metrics that are commonly used for evaluating the effectiveness of information retrieval systems. They are both computed according to the number of relevant documents that the system has returned for a query. Precision is the fraction of retrieved documents that are relevant, and recall is the proportion of relevant documents that are retrieved. According to Salton and McGill [1983]:

“Recall measures the ability of the system to retrieve useful documents while precision conversely measures the ability to reject useless materials.”
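As an illustration of these two definitions, the following is a minimal Python sketch, not taken from the experiments in this thesis. The function name `precision_recall` and the document identifiers are hypothetical, and the retrieved results are treated as an unordered set.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query.

    retrieved: document ids returned by the system (hypothetical ids).
    relevant:  document ids judged relevant to the query.
    """
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = len(retrieved_set & relevant_set)
    # Precision: fraction of retrieved documents that are relevant.
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    # Recall: fraction of relevant documents that are retrieved.
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Hypothetical query: 4 documents returned, 3 of them relevant,
# out of 6 relevant documents in total.
p, r = precision_recall(["d1", "d2", "d3", "d4"],
                        ["d1", "d2", "d4", "d7", "d8", "d9"])
print(p, r)  # 0.75 0.5
```

Note that recall needs the complete set of relevant documents as input, whereas precision only needs judgements for the documents that were returned.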

To calculate the recall value for the results returned for a query, the total number of relevant documents needs to be known, which is rarely practical for large collections such as the web. In addition, web users tend to click on only a few of the top-ranked documents returned by search engines [Joachims et al., 2005]. Therefore, most recent studies in information retrieval pay special attention to the relevance of the top-ranked documents, using metrics such as those listed below:

Precision at n. This is also denoted as P@n and simply shows the proportion of the top n documents returned by a retrieval system that are relevant to the query [Baeza-Yates and Ribeiro-Neto, 1999; Rijsbergen, 1979]. For instance, P@10 represents the fraction of the top 10 returned documents that are relevant to the query.

Average precision. This is perhaps the most commonly used metric for evaluating information retrieval experiments. Average precision reflects both precision and recall: it rewards systems that rank relevant documents highly (precision), and it accounts for recall by normalizing by the number of relevant documents for a topic. The average precision value for a ranked list returned for a query is the mean of the precision values computed at the rank of each relevant document in the list [Buckley and Voorhees, 2000]. When systems are evaluated with more than one query, the mean average precision (MAP) over the queries is used.

R-precision. This metric calculates the precision value at rank r, where r is equal to the total number of relevant documents for a query [Baeza-Yates and Ribeiro-Neto, 1999]. Hence, if there are 10 documents relevant to a query, then the R-precision value is the same as the precision at 10 (P@10).

Bpref. Buckley and Voorhees [2004] proposed Bpref as an evaluation method particularly suitable for testbeds with incomplete relevance judgements. Bpref shows how many times judged nonrelevant documents are returned before judged relevant documents. Therefore, documents that are not judged as being relevant or irrelevant do not have any impact on the evaluation results.

Figure 2.4: The results of the query “federated search” returned from the Metacrawler [Selberg and Etzioni, 1997a] metasearch engine. It can be seen that the results are merged from different sources such as the Google, Yahoo! and Ask search engines.

Reciprocal rank. The reciprocal rank value is calculated according to the position of the first relevant document in a ranked list. If the rank of the first relevant document is r, then the reciprocal rank value is computed as 1/r. When systems are evaluated for more than one query, the average reciprocal rank over the queries is used. This is also known as the mean reciprocal rank (MRR) [Shah and Croft, 2004]. Mean reciprocal rank is mainly used to evaluate information retrieval systems for queries with only a few relevant answers.
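The rank-based metrics listed above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions rather than the evaluation code used for the experiments: all function names and document identifiers are hypothetical, a ranking is a list of document ids in rank order, and relevance judgements are given as sets. The text above describes Bpref only informally, so the `bpref` function assumes the formulation from Buckley and Voorhees [2004], in which each retrieved relevant document contributes 1 minus the number of judged nonrelevant documents ranked above it (capped at R) divided by R, where R is the number of judged relevant documents.

```python
def precision_at(ranking, relevant, n):
    """P@n: fraction of the top n returned documents that are relevant."""
    return sum(1 for doc in ranking[:n] if doc in relevant) / n


def average_precision(ranking, relevant):
    """Sum of the precision values at the rank of each retrieved relevant
    document, divided by the total number of relevant documents."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """MAP over a non-empty list of (ranking, relevant) pairs, one per query."""
    return sum(average_precision(rk, rel) for rk, rel in runs) / len(runs)


def r_precision(ranking, relevant):
    """Precision at rank r, where r is the number of relevant documents."""
    r = len(relevant)
    return precision_at(ranking, relevant, r) if r else 0.0


def reciprocal_rank(ranking, relevant):
    """1/r for the rank r of the first relevant document, or 0 if none."""
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def bpref(ranking, relevant, nonrelevant):
    """Bpref (assumed formulation, see the lead-in paragraph).
    Documents in neither judgement set are unjudged and are ignored."""
    r = len(relevant)
    if r == 0:
        return 0.0
    nonrel_above, score = 0, 0.0
    for doc in ranking:
        if doc in relevant:
            score += 1.0 - min(nonrel_above, r) / r
        elif doc in nonrelevant:
            nonrel_above += 1
    return score / r


# Hypothetical judgements: d9 is unjudged; d4 and d6 are relevant but not retrieved.
ranking = ["d3", "d1", "d9", "d2", "d5"]
relevant = {"d1", "d2", "d4", "d6"}
nonrelevant = {"d3", "d5"}

print(precision_at(ranking, relevant, 5))     # 0.4
print(average_precision(ranking, relevant))   # 0.25
print(r_precision(ranking, relevant))         # 0.5
print(reciprocal_rank(ranking, relevant))     # 0.5
print(bpref(ranking, relevant, nonrelevant))  # 0.375
```

In this example the unjudged document d9 lowers P@5 and average precision, because those metrics treat it as nonrelevant, but it has no effect on Bpref.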
