Evaluation - News vertical search using user-generated content

to retrieve the final ranking of documents from the target corpus. As with traditional pseudo-relevance feedback approaches, the idea is that the expanded query will lead to better retrieval performance than the initial query.

We use both pseudo-relevance feedback query expansion and collection enrichment in this thesis to improve retrieval performance when retrieving documents from user-generated content sources. In particular, we use collection enrichment in Chapter 6 to improve the representation of each news story that we rank. We use pseudo-relevance feedback to improve the ranking of tweets for display to the user, in Chapter 9.

2.4 Evaluation

IR has a long history of improvement through experimentation. Evaluation of an IR system typically involves measuring the effectiveness of the system on a query for which some relevant documents are known. Success can then be measured in terms of how many of these documents were retrieved, and at what ranks. This is repeated over multiple queries, and an aggregate function is then used to determine the final effectiveness measures for the system.

2.4.1 Evaluation Measures

A large number of evaluation measures have been proposed to evaluate IR tasks. Below we describe the main evaluation measures used in this work.

2.4.1.1 Precision and Recall

To evaluate a system, we first need a metric with which to measure performance. In IR, two different but related measures are used: precision and recall. Precision is the proportion of retrieved documents that are relevant to the query, while recall is the proportion of the documents relevant to the query that were successfully retrieved. In particular, precision measures how ‘good’ our returned documents are, calculated (using Table 2.5) as [1]/([1] + [3]). Recall, on the other hand, measures how many of the correct documents we returned in comparison to how many there were, calculated (using Table 2.5) as [1]/([1] + [2]). Precision and recall are also used in a classification context as well as for ranking. In this case, they are calculated in the same manner as above, but the meaning of each entry in the Table 2.5 confusion matrix differs. In particular, [1] refers to the number of true positives, [2] the number of false negatives, [3] the number of false positives, and [4] the number of true negatives.


                Returned    Not-Retrieved
Relevant          [1]           [2]
Not-Relevant      [3]           [4]

Table 2.5: Document ranking/classification confusion table for calculating precision and recall.

Furthermore, metrics combining both precision and recall have also been proposed. In this thesis, we use one of these metrics, namely F1 (Rijsbergen, 1979), to report overall precision and recall in a classification context. F1 is calculated as follows:

\[ F_1 = 2 \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} \tag{2.18} \]
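As a minimal sketch of how precision, recall and F1 follow from the four cells of Table 2.5, the Python function below takes the cell counts directly. The function name, the argument names and the convention of returning 0.0 when a denominator is zero are assumptions made for illustration, not part of the original definitions.

```python
def precision_recall_f1(c1, c2, c3, c4=0):
    """Compute precision, recall and F1 from the cells of Table 2.5.

    c1: relevant and returned          (cell [1])
    c2: relevant but not retrieved     (cell [2])
    c3: not relevant but returned      (cell [3])
    c4: not relevant and not retrieved (cell [4], unused by these measures)
    """
    returned = c1 + c3   # everything the system returned
    relevant = c1 + c2   # everything that is relevant

    # Assumption: define a measure as 0.0 when its denominator is zero.
    precision = c1 / returned if returned else 0.0
    recall = c1 / relevant if relevant else 0.0

    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts: 6 relevant returned, 2 relevant missed, 4 non-relevant returned.
precision, recall, f1 = precision_recall_f1(6, 2, 4, 88)
print(precision, recall, f1)
```

With these hypothetical counts, 6 of the 10 returned documents are relevant and 6 of the 8 relevant documents are returned, so precision is lower than recall and F1 lies between the two.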

2.4.1.2 Mean Average Precision

One popular evaluation measure in an IR setting is the Mean Average Precision (MAP) measure. This measure calculates the mean of the average precision (AP) values for all queries. The AP for a query is the average of the precision values calculated after each relevant document is retrieved (Voorhees, 2003). It is notable that MAP is a top-heavy measure, i.e. documents ranked correctly near the top of the ranking contribute more to MAP performance than documents ranked near the bottom. MAP is calculated as follows:

\[ \mathit{MAP} = \frac{\sum_{q=1}^{|Q|} \sum_{k=1}^{n} \mathit{Precision}(R(q), k) \cdot \mathit{Recall}_{\delta}(R(q), k)}{|Q|} \tag{2.19} \]

where |Q| is the number of queries, n is the number of retrieved documents, k is the rank within the retrieved documents R(q), Precision(R(q), k) is the precision at cut-off k and Recall_δ(R(q), k) is the change in recall between ranks k − 1 and k.
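The MAP definition above can be sketched in Python as follows. This is an illustrative sketch only: the function names and the data layout (a ranked list of document identifiers per query, and a set of relevant identifiers per query) are assumptions, and each ranking is assumed to contain no duplicate documents. Because the change in recall is zero at ranks holding non-relevant documents, only ranks holding relevant documents contribute to the sum.

```python
def average_precision(ranking, relevant):
    """AP for one query.

    ranking:  list of document ids in rank order (R(q))
    relevant: set of document ids judged relevant for the query
    """
    if not relevant:
        return 0.0  # assumption: AP is 0 when no relevant documents are known

    hits = 0
    ap = 0.0
    for k, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precision_at_k = hits / k          # Precision(R(q), k)
            recall_delta = 1 / len(relevant)   # change in recall at rank k
            ap += precision_at_k * recall_delta
    return ap


def mean_average_precision(runs, qrels):
    """MAP over all queries.

    runs:  dict mapping query id -> ranked list of document ids
    qrels: dict mapping query id -> set of relevant document ids
    """
    total = sum(average_precision(runs[q], qrels.get(q, set())) for q in runs)
    return total / len(runs)


# Hypothetical two-query example.
runs = {"q1": ["d1", "d2", "d3", "d4"], "q2": ["d5", "d6"]}
qrels = {"q1": {"d1", "d3"}, "q2": {"d6", "d7"}}
print(mean_average_precision(runs, qrels))
```

In this example, a relevant document that is never retrieved (d7 for q2) still counts in the recall denominator, so it lowers the AP for that query.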

2.4.1.3 (Normalised) Discounted Cumulative Gain

Mean Average Precision (MAP) has been used for many years to measure the effectiveness of IR systems (Voorhees, 2003). However, as it is built upon the precision metric, which is binary in nature, i.e. each document is considered either relevant or not, it is not useful when documents are evaluated with respect to multiple relevance grades. For example, a document might be assessed on a 3-grade scale: highly relevant, relevant, or not relevant. To evaluate tasks that use more granular evaluation labels, Discounted Cumulative Gain (DCG) metrics were proposed (Järvelin & Kekäläinen, 2002). These approaches assume that documents with higher relevance grades are more relevant than those with lower grades and that highly relevant documents are most useful when returned in the top ranks. Hence, DCG measures are compatible with multi-graded assessments and are top-heavy in nature. DCG is calculated as follows:

\[ \mathit{DCG} = rel_1 + \sum_{i=2}^{p} \frac{rel_i}{\log_2 i} \tag{2.20} \]

where rel_i is the relevance of the document at rank i and p is the rank up to which the gain is accumulated.

However, DCG is not sufficient for a typical IR evaluation, because not all result sets for a query are of the same length, i.e. fewer documents than the rank cut-off may be retrieved (Järvelin & Kekäläinen, 2002). To account for this, Normalised Discounted Cumulative Gain (nDCG) was proposed, which normalises the cumulative gain across queries. This is achieved by dividing DCG by the ideal DCG, i.e. that which would have been achieved by the perfect ranking according to the relevance assessments.
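A minimal Python sketch of DCG (Equation 2.20) and its normalised form is given below. The function names and argument layout are assumptions for illustration. In particular, the ideal ranking is assumed here to be built by sorting all relevance grades judged for the query in descending order and truncating to the cut-off, and nDCG is taken to be 0.0 when the ideal DCG is zero.

```python
import math


def dcg(rels):
    """DCG of a list of relevance grades given in rank order (rank 1 first)."""
    if not rels:
        return 0.0
    # The first rank is not discounted; ranks i >= 2 are divided by log2(i).
    return rels[0] + sum(rel / math.log2(i)
                         for i, rel in enumerate(rels[1:], start=2))


def ndcg(rels, judged_rels, p=None):
    """nDCG at cut-off p.

    rels:        grades of the retrieved documents, in rank order
    judged_rels: grades of all documents judged for the query (any order)
    p:           rank cut-off; defaults to the length of the ranking
    """
    p = len(rels) if p is None else p
    ideal = sorted(judged_rels, reverse=True)[:p]  # best achievable ranking
    ideal_dcg = dcg(ideal)
    return dcg(rels[:p]) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical 3-grade example: 2 = highly relevant, 1 = relevant, 0 = not relevant.
print(dcg([2, 0, 1]))
print(ndcg([2, 0, 1], judged_rels=[2, 1, 0, 0]))
```

A ranking that already orders documents by descending grade matches the ideal ranking and therefore obtains an nDCG of 1.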

2.4.2 Cranfield Paradigm

Evaluation in IR is driven by the need to provide better rankings of documents and hence focuses on how to distinguish between different document rankings for the same queries. Classical IR evaluation has centred around human relevance judgements. This is where a human looks at a document given a query and marks it as relevant or not to that query. Many of these judgements are then combined to form a test collection, which consists of the relevance judgements, the corpus of documents and the queries used. These test collections can then be used by multiple systems to generate a ranking for each query that can be compared to the relevance judgements. This approach was pioneered by the Cranfield experiments. It is important to note that the Cranfield experiments used full relevance judgements, where each document was assessed for each query (Cleverdon, 1991). Since then, the corpora of documents that are used have become sufficiently large that full relevance judgements are not feasible. For example, the largest test collection currently available is the ClueWeb09 corpus, containing 1.2 billion documents (see Table 2.4). Indeed, even for a set of 50 queries, where an assessor is assigned a minute per document, it would take over 95,000 years to finish assessing. Instead, incomplete assessments are used, where only a very small proportion of the corpus is judged. However, that small portion is carefully selected to contain as many relevant documents as possible.

Typically, a strategy known as pooling is used to select the documents to be judged (Sparck-Jones & van Rijsbergen, 1975). The aim of pooling is to create a high-recall sample of the collection. Under pooling, different IR systems rank documents for the query topics. The individual rankings from each system are combined to create a single ‘pool’ of candidate documents to be assessed. Ideally, all of the documents within this pool should be judged. However, the pool size may still exceed the resources available to judge it. In this case, the pooled documents are then ranked in some manner and then assessed in rank order until the assessment resources are exhausted or a rank cut-off has been reached.
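The pooling procedure can be sketched in Python as below. This is an illustrative sketch under stated assumptions: the function names, the default pool depth and the data layout are hypothetical, and since the text does not specify how the pooled documents are ranked, the ordering used here (documents contributed by more systems are assessed first, with ties broken by document identifier) is only one possible choice.

```python
from collections import Counter


def build_pool(runs, depth=100):
    """Merge the top `depth` documents of every run for one query.

    runs: dict mapping system name -> ranked list of document ids
    Returns a Counter mapping each pooled document id to the number
    of systems that contributed it.
    """
    votes = Counter()
    for ranking in runs.values():
        # set() so a document counts at most once per system
        votes.update(set(ranking[:depth]))
    return votes


def assessment_order(votes, budget=None):
    """Order pooled documents for judging, optionally truncated to a budget.

    Assumption: documents found by more systems are judged first.
    """
    ordered = sorted(votes, key=lambda doc: (-votes[doc], doc))
    return ordered if budget is None else ordered[:budget]


# Hypothetical runs from three systems for a single query, pooled to depth 2.
runs = {
    "system_a": ["d1", "d2", "d3"],
    "system_b": ["d2", "d4", "d1"],
    "system_c": ["d2", "d5", "d6"],
}
pool = build_pool(runs, depth=2)
print(assessment_order(pool))            # every pooled document
print(assessment_order(pool, budget=2))  # only what the budget allows
```

Documents below the pool depth in every run (d3 and d6 here) never enter the pool and so remain unjudged.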

2.4.3 TREC

The Text REtrieval Conference (TREC) is a collection of IR workshops sponsored by the National Institute of Standards and Technology (NIST) and the Disruptive Technology Office of the U.S. Department of Defence. Unlike standard conferences in IR, however, TREC was designed to encourage evaluation within the IR community by providing the infrastructure necessary for large-scale evaluation. Each year, TREC runs a number of tracks, each of which represents a topic of research. A single track contains one or more tasks that IR groups can participate in, each supported by a test collection, as described above, that is suitable for evaluating that task. At TREC, the test collections are created using pooling (Sparck-Jones & van Rijsbergen, 1975). All the participating groups provide initial document rankings, known as runs, using the corpus and queries. The top n (normally 100) documents from each of these runs are merged into a pool, which is then judged by human assessors to create the relevance judgements. The idea is that by using runs from multiple diverse IR systems, the final relevance judgements will not be biased toward any single system or algorithm. Also, these judgements should be complete enough to judge systems that did not submit runs during the initial phase, since (hopefully) most of the relevant documents will have been identified and assessed. The assumption is that by using multiple IR systems, the probability of selecting the most relevant documents will be high.

2.4.4 Relevance Assessment

To assess approaches for various tasks, it is important to have a ground truth that can be compared against. Typically, this involves employing human assessors to judge documents from the pool to determine their true relevance to the query for which each was retrieved. However, assessing large numbers of documents is time-consuming and expensive. For example, if a document takes 30 seconds to assess (Voorhees et al., 2005), then judging each of the 19,381 pooled documents for the TREC 2011 Web track will take in excess of 161 man-hours, or roughly 23 working days for a single assessor (assuming a 7-hour working day). Indeed, assuming a national minimum wage of $7.25 (US dollars) per hour, the cost of recreating the TREC Web track relevance assessments totals $1,170.88. TREC, sponsored by NIST, has traditionally paid a group of specialist assessors to judge documents for the participants (Voorhees et al., 2005).
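The assessment effort quoted above can be checked with a few lines of Python. The function and parameter names are assumptions for illustration; the inputs (19,381 pooled documents, 30 seconds per document and $7.25 per hour) are the figures given in the text.

```python
def assessment_cost(num_docs, seconds_per_doc=30, hourly_wage=7.25):
    """Return (assessor hours, cost in US dollars) for judging a pool."""
    hours = num_docs * seconds_per_doc / 3600  # seconds -> hours
    return hours, hours * hourly_wage


hours, cost = assessment_cost(19_381)
print(f"{hours:.1f} hours, ${cost:,.2f}")
```

The computed total is a few cents higher than the quoted $1,170.88, which appears to result from rounding the effort to 161.5 hours before multiplying by the hourly wage.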

However, NIST has a limited amount of funds to support the tracks that it runs. Indeed, in rare cases, TREC tracks have been known to have used the participants to judge documents if NIST could not sup-
