Judgment Set Selection - Test Collection-Based Evaluation

2.4 Test Collection-Based Evaluation

2.4.1 Judgment Set Selection

In principle, every document in the collection should be judged, as was done for a few early TREC tasks. However, as document collections have grown, exhaustive judging has become too expensive, even for a single collection and a reasonable range of topics. Strategies are therefore needed to select the documents to be judged within a judgment budget.

Shared Evaluation Tasks. Most of the widely used experimental test collections, including their judgments, come from TREC, which has been held since 1992. When TREC began, there were two tasks: an Ad Hoc task and a Routing task. In the Ad Hoc task, a set of topics is provided along with the document collection, and systems are required to find the documents that are most relevant to each topic. The Routing task is different: the set of information needs is static but the documents are dynamic, and the goal of a system is to identify documents similar to the ones provided [39]. Since 2002, additional tracks have been added to TREC, each focusing on the challenges of a particular area. There are domain-specific tracks such as Genomics and Legal; tracks that focus on particular techniques, such as Spam and QA; content-based tracks that go beyond text retrieval, such as the Video track [118]; and the Web [40] and TeraByte [25] tracks, which allow researchers to consider building large-scale retrieval systems that target the World Wide Web (WWW) environment. Among all of these tracks, we focus on the ad hoc retrieval tasks, for which newswire or web-based document collections may be adopted. A detailed description of the topics and documents is presented in Section 2.5. Other shared tasks such as NTCIR and CLEF have a similar range of tracks but may differ in focus, for example by including more cross-lingual tasks. All of these shared tasks deliver publicly available test collections. TREC is the most influential of them, and the next several paragraphs discuss its process for building test collections.

5. http://trec.nist.gov/

6. http://www.clef-initiative.eu/

[Figure 2.6 (image not reproduced): a matrix of entries s_{k,i} with one column per system S1, S2, S3, ..., Sn−1, Sn, Sn+1 and one row per rank 1, 2, 3, ..., d, ..., L, shown next to an example filled with document identifiers such as D1, D2 and D3.]

Figure 2.6: The top-d pooling process. Among all n + 1 participants, there are n contributing systems. Each row is the vector of documents returned by the systems at rank k, and each column is the ranked list returned by one system.

The Fixed Depth Pooling Method. One solution to the limited budget problem is to make use of the ranked lists submitted by contributing systems. Among all participants, some are chosen as contributing systems, from which the documents to be judged are sourced. Since only a limited labeling budget is available, only a subset of the documents returned by the contributing systems can be considered. Any IR system aims to return a ranked list sorted in descending order of the likelihood that each document is relevant to the topic. Building on this, the TREC top-d pooling method was proposed and is still in use today. Intuitively, if enough high-quality IR systems contribute to a judgment pool, most of the relevant documents will be identified. This idea is core to the top-d pooling method, shown in Figure 2.6, where n systems are the sources of the judgment pool. Each column in the left part of Figure 2.6 represents a ranked list submitted by one of the systems taking part in the joint experiments. Documents falling in the rectangle are then selected as candidates to be judged. Note that although the figure labels systems in order, the original pooling process has no such requirement: documents from all systems are selected in parallel, regardless of which ranked list they appeared in, as indicated by the rectangle. As d is the maximum rank of a document to be judged, it is likely to have a large impact on both judgment cost and quality. A small d may miss relevant documents, while a large d may require too much judgment effort. In the TREC newswire test collections the pooling depth is 100, but for recent large document collections it is around 20.
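The pooling step described above can be sketched in a few lines of Python. This is a minimal illustration rather than the TREC implementation: the function name `top_d_pool`, the run identifiers and the toy document lists are all hypothetical, and each run is assumed to be a list of document identifiers in rank order.

```python
from typing import Dict, List, Set


def top_d_pool(runs: Dict[str, List[str]], contributing: Set[str], d: int) -> Set[str]:
    """Return the union of the top-d documents of every contributing run."""
    pool: Set[str] = set()
    for system, ranking in runs.items():
        if system in contributing:
            # Ranks 1..d; a list shorter than d contributes what it has.
            pool.update(ranking[:d])
    return pool


# Hypothetical toy data: three contributing systems and one new system.
pool_runs = {
    "S1": ["D1", "D3", "D2", "D6"],
    "S2": ["D2", "D1", "D6", "D3"],
    "S3": ["D3", "D7", "D2", "D5"],
    "S4": ["D9", "D1", "D8", "D2"],  # evaluated later, does not contribute
}
pool = top_d_pool(pool_runs, contributing={"S1", "S2", "S3"}, d=2)
print(sorted(pool))
```

Because documents retrieved by several systems are pooled only once, the pool holds at most d × n documents and usually fewer. Documents returned only by the non-contributing system remain unjudged.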

Concerns have been raised regarding shallow pooling depths. One argument is that a small judgment pool cannot identify the majority of relevant documents. If unjudged documents are assumed to be not relevant, there is a high risk that system performance cannot be evaluated precisely. As pointed out by Voorhees [116], organizers of other tasks, such as NTCIR, may use additional manual runs to complement the pool. In general, however, the missing documents have little impact on early test collections such as TREC-5 [134]. The leave-one-out experiments conducted by Zobel [134] also showed that new systems, such as Sn+1 in Figure 2.6, can often be fairly evaluated, because the majority of relevant documents were found by the pooling process. That is, the assumption that unjudged documents are not relevant is reasonable for the early newswire test collections. Nonetheless, as pointed out by Buckley et al. [16], a shallow judgment pool results in biased system comparisons, especially when a new system is evaluated. The trade-off between judgment budget and test collection quality has been referred to as the problem of building reusable test collections [21].
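A leave-one-out check of the kind mentioned above can be sketched as follows. This is a simplified illustration, not the protocol of Zobel [134]: it assumes precision at k as the metric, treats unjudged documents as not relevant, and uses hypothetical toy runs and judgments.

```python
from typing import Dict, List, Set, Tuple


def precision_at_k(ranking: List[str], relevant: Set[str], k: int) -> float:
    """P@k, counting unjudged documents as not relevant."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k


def leave_one_out(
    runs: Dict[str, List[str]], relevant: Set[str], d: int, k: int
) -> Dict[str, Tuple[float, float]]:
    """For each run, return (P@k with the full pool, P@k when the run is held out)."""
    results: Dict[str, Tuple[float, float]] = {}
    for held_out, ranking in runs.items():
        # Rebuild the depth-d pool from every run except the held-out one.
        reduced_pool: Set[str] = set()
        for system, other in runs.items():
            if system != held_out:
                reduced_pool.update(other[:d])
        # Relevant documents found only by the held-out run become unjudged.
        reduced_relevant = relevant & reduced_pool
        results[held_out] = (
            precision_at_k(ranking, relevant, k),
            precision_at_k(ranking, reduced_relevant, k),
        )
    return results


# Hypothetical toy data; the relevant set lies inside the full depth-2 pool.
loo_runs = {
    "S1": ["D1", "D3", "D2"],
    "S2": ["D2", "D1", "D6"],
    "S3": ["D7", "D3", "D2"],
}
loo_results = leave_one_out(loo_runs, relevant={"D1", "D7"}, d=2, k=2)
for system, (full, reduced) in loo_results.items():
    print(system, full, reduced)
```

A run whose score drops when it is held out contributed relevant documents that no other run found, which is the situation a new, unpooled system may be in.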

Sampling-Based Selection. A second option is to sample the documents to be judged from the candidate pool. Some of the earliest work was done by Cormack et al. [29], who showed that a judgment set constructed by sampling can be used to estimate the precision of a retrieval system. One simple and straightforward method is to sample uniformly up to the judgment budget [127]. However, uniform sampling ignores the fact that retrieval systems tend to put documents that are likely to be relevant at top-ranked positions. Selecting documents uniformly may produce a judgment pool with few relevant documents, leading to biased estimates of system effectiveness. To mitigate this problem, stratified sampling is often employed in practice. For example, the test collection of the TREC 2009 Ad Hoc task was constructed using stratified sampling [26], with a shallow initial pooling depth of d = 12. Voorhees [117] recently studied how different sampling strategies and the design of the strata affect evaluation results, assessing the quality of each sampling method by measuring the accuracy of the total number of relevant documents found. Guided by a judgment budget and particular evaluation metrics, more cost-effective collection construction methods have also been studied [4, 5, 21]. Apart from their budget benefits, these methods are often tailored to a single evaluation metric. Moreover, the extent to which they provide unbiased evaluation results remains unclear.
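As a rough illustration of stratified selection, the sketch below assigns each document to a stratum by the best rank it reaches in any run and samples each stratum at its own rate. The strata boundaries, the sampling rates, the per-document (Bernoulli) sampling and the inverse-probability estimate of the number of relevant documents are all assumptions made for this example; they are not the design used in TREC 2009 [26] or in the cited studies.

```python
import random
from typing import Dict, List, Tuple


def best_ranks(runs: Dict[str, List[str]]) -> Dict[str, int]:
    """Map each document to the best (smallest) 1-based rank it reaches in any run."""
    best: Dict[str, int] = {}
    for ranking in runs.values():
        for rank, doc in enumerate(ranking, start=1):
            best[doc] = min(rank, best.get(doc, rank))
    return best


def stratified_sample(
    runs: Dict[str, List[str]], strata: List[Tuple[int, float]], seed: int = 0
) -> Dict[str, float]:
    """Return {doc: inclusion probability} for the sampled documents.

    strata holds (max_rank, rate) pairs in increasing max_rank order; a
    document falls in the first stratum whose max_rank covers its best rank.
    Documents deeper than the last stratum are never sampled.
    """
    rng = random.Random(seed)
    sampled: Dict[str, float] = {}
    for doc, rank in sorted(best_ranks(runs).items()):
        for max_rank, rate in strata:
            if rank <= max_rank:
                if rng.random() < rate:  # a rate of 1.0 always includes the document
                    sampled[doc] = rate
                break
    return sampled


def estimate_num_relevant(sampled: Dict[str, float], judgments: Dict[str, bool]) -> float:
    """Inverse-probability estimate of the number of relevant documents in the strata."""
    return sum(1.0 / prob for doc, prob in sampled.items() if judgments.get(doc, False))


# Hypothetical toy data: ranks 1-2 are fully judged, ranks 3-4 are sampled at 50%.
sample_runs = {
    "S1": ["D1", "D2", "D3", "D4"],
    "S2": ["D2", "D5", "D6", "D7"],
}
sample = stratified_sample(sample_runs, strata=[(2, 1.0), (4, 0.5)], seed=7)
print(sample)
```

Setting the first stratum's rate to 1.0 reproduces a shallow top-d pool, and a single stratum with one rate reduces to uniform sampling.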

Interactive Searching and Judging. The top-d pooling method treats all documents in the d × n range (shown in Figure 2.6) equally when selecting documents to be judged. A further improvement is to prioritize documents that have higher estimated probabilities of being relevant. This intuition is often referred to as the Move-to-Front (MTF) heuristic. The Interactive Searching and Judging (ISJ) method introduced by Cormack et al. [28] makes use of MTF. It selects documents by considering the likelihood of relevance for all documents in the document collection. The documents in each ranked list are known to be sorted in descending order of the probability of relevance as estimated by that system, but how best to aggregate across multiple ranked lists remains challenging, because there is no indication of the quality of each ranked list. To compensate for this, Cormack et al. estimated the quality of each ranked list using a “human-in-the-loop” approach. For each ranked list, a set of documents is judged according to their positions in the list. Based on these judged documents, each system is then evaluated and assigned a priority. In later rounds of the judgment process, documents are selected by considering run quality first and then their positions in the ranked lists. The “pool-to-d” method adopted by TREC ignores differences between topics, even though these differences are an important factor affecting both judgment cost and quality. Cormack et al. also considered applying ISJ on a per-topic basis, in which case the judgment budget is set for the set of topics as a whole rather than for each topic individually. Although the ISJ method may work well, it is difficult to apply in practice. One reason is that the process requires a human assessor to stay in the loop. Another is that the quality of each ranked list must be estimated, which makes the choice of evaluation metric a key issue.
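The two-stage prioritization described above can be sketched as follows. This is a deliberately simplified illustration and not the procedure of Cormack et al. [28]: run quality is estimated only once, as the precision over each run's judged seed documents, and the human assessor is stood in for by a hypothetical `judge` callback.

```python
from typing import Callable, Dict, List


def prioritized_judging(
    runs: Dict[str, List[str]],
    judge: Callable[[str], bool],
    seed_depth: int,
    budget: int,
) -> Dict[str, bool]:
    """Judge a seed set from every run, then spend the rest of the budget on the best runs first."""
    judgments: Dict[str, bool] = {}

    def assess(doc: str) -> None:
        # Each document is judged at most once and never beyond the budget.
        if doc not in judgments and len(judgments) < budget:
            judgments[doc] = judge(doc)

    # Round 1: judge the top seed_depth documents of every run.
    for ranking in runs.values():
        for doc in ranking[:seed_depth]:
            assess(doc)

    def quality(system: str) -> float:
        # Estimated run quality: precision over the judged seed documents.
        seen = [doc for doc in runs[system][:seed_depth] if doc in judgments]
        return sum(judgments[doc] for doc in seen) / len(seen) if seen else 0.0

    # Later rounds: run quality first, then position within the ranked list.
    for system in sorted(runs, key=lambda s: (-quality(s), s)):
        for doc in runs[system][seed_depth:]:
            assess(doc)
    return judgments


# Hypothetical toy data; the lambda plays the role of the human assessor.
isj_runs = {"S1": ["D1", "D2", "D3"], "S2": ["D4", "D5", "D6"]}
isj_relevant = {"D4", "D5"}
isj_judgments = prioritized_judging(
    isj_runs, judge=lambda doc: doc in isj_relevant, seed_depth=1, budget=3
)
print(isj_judgments)
```

In this toy case the remaining budget goes to S2, whose seed document was relevant, rather than being spread evenly across both runs as a fixed-depth pool would do.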

Targeted Relevance Judgments. A third method also considers the probability of different documents being relevant to the topic. The targeted relevance judgment method aims to reduce residuals against certain optimization criteria. Moffat et al. [75] proposed three methods to select documents based on the evaluation metric Rank-Biased Precision (RBP) [74]. Considering the n ranked lists submitted by contributing systems in Figure 2.6, it is clear that a document may appear multiple times across the n lists, at different ranks. While both top-d pooling and ISJ ignore these duplicates, they are a feature that suggests the importance of a document. The first method proposed by Moffat et al. [75] makes use of this information and computes the total contribution of a document by summing its contributions over all rank positions. Since all ranked lists are sorted, each rank is given a different weight that reflects its importance. Let $W_M(i)$ be the weighting function of rank $i$ and let $k_{D,i}$ ($1 \le i \le n$) be the rank of document $D$ given by the $i$-th contributing system. Then the contribution of $D$ is calculated as:

$$\mathrm{Contrib}(D) = \sum_{i=1}^{n} W_M(k_{D,i}).$$
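The summed contribution can be sketched as below, taking the RBP weight $W_M(i) = (1 - p)\,p^{i-1}$ as the weighting function. The persistence value p, the toy runs, and the convention that a document missing from a list contributes zero for that system are assumptions of this example.

```python
from collections import defaultdict
from typing import Dict, List


def rbp_weight(rank: int, p: float = 0.8) -> float:
    """RBP weight of a 1-based rank: (1 - p) * p^(rank - 1)."""
    return (1.0 - p) * p ** (rank - 1)


def summed_contributions(runs: Dict[str, List[str]], p: float = 0.8) -> Dict[str, float]:
    """Sum each document's rank weight over all runs that retrieved it."""
    contrib: Dict[str, float] = defaultdict(float)
    for ranking in runs.values():
        # Assumes a document appears at most once in each ranked list.
        for rank, doc in enumerate(ranking, start=1):
            contrib[doc] += rbp_weight(rank, p)
    return dict(contrib)


# Hypothetical toy data with p = 0.5, so rank 1 weighs 0.5 and rank 2 weighs 0.25.
contrib_runs = {"S1": ["D1", "D2"], "S2": ["D2", "D3"]}
contrib = summed_contributions(contrib_runs, p=0.5)
priority = sorted(contrib, key=lambda doc: (-contrib[doc], doc))
print(priority)
```

Here D2 outranks D1 even though D1 holds a rank-1 position, because D2 is retrieved by both systems.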

Method A proposed by Moffat et al. selects the $|J|$ documents with the highest contribution, where $|J|$ is the judgment budget. This method is designed to find the most popular documents in the candidate set, since they have a high likelihood of being relevant. Method B proposed by Moffat et al. leverages information about the unjudged documents in a system instead of relying only on the set of judged documents. Let $J$ be the set of judged documents and $s_{j,i}$ be the document returned by system $i$ at position $j$. Then Method B scores the contribution of a document as:

$$\mathrm{Contrib}(D) = \sum_{i=1}^{n} W_M(k_{D,i}) \cdot \Delta_i, \qquad \Delta_i = \sum_{\substack{j=1 \\ s_{j,i} \notin J}}^{\infty} W_M(j),$$

where $\Delta_i$ accounts for the total weight of the unjudged documents in system $i$. By doing so, documents are selected not only to maximize the number of relevant documents found, but also to minimize the uncertainty in system evaluations. A third method proposed by Moffat et al. requires some knowledge of the relevance of the selected documents, which is similar to the idea of Cormack et al. [28]. The major difference between Method C [75] and ISJ [28] is that Method C estimates an overall system quality:

$$\mathrm{Contrib}(D) = \sum_{i=1}^{n} W_M(k_{D,i}) \cdot \Delta_i \cdot (\mathrm{Base}_i + \Delta_i)^{3/2}, \qquad \mathrm{Base}_i = \sum_{\substack{j=1 \\ s_{j,i} \in J}}^{\infty} r_j \cdot W_M(j), \tag{2.14}$$

where $r_j$ is the relevance of the document at rank $j$. The additional component in Method C estimates the system quality, so this last method weights a document using all three aspects: rank weight, residual and system quality. A recent study conducted by Lipani et al. [58] compared the three collection construction methods with the pool-to-d method. The experimental results show that Method A and Method C perform best in constructing reusable test collections. However, Method C is difficult to implement in a single pass, although it is possible with several judgment rounds.

Summary. Table 2.3 summarizes the widely discussed methods of selecting documents to be judged. A top-d pool is required in most sampling-based approaches: for example, the first stratum in the stratified sampling process is a shallow top-d pool, as it is in the approach proposed by Yilmaz et al. The column “Prioritizing” indicates whether an approach assigns document weights according to their contributions when considering the ranked lists from all contributing systems. Both ISJ [28] and the methods of Moffat et al. [75] do so, since all documents are prioritized according to some weighting scheme. ISJ and Method C also require human assessors to continue judging documents during document selection, which is indicated as “Assessors”. While most document selection processes are independent of the evaluation process, the methods proposed by Yilmaz and Aslam [127], Yilmaz et al. [130], Aslam et al. [5] and Carterette et al. [20] are not, and are marked as “Metric Dep.”. Test collections constructed using these methods must be evaluated with specific metrics.
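Returning to Methods B and C of Moffat et al. [75], the residual- and base-weighted contributions can be sketched as follows. This follows the formulas as stated in this section, with the RBP weight $W_M(j) = (1 - p)\,p^{j-1}$. The value of p, the toy data, the restriction to unjudged candidates, and the choice to count the unretrieved tail beyond the end of each list as unjudged (weight $p^{L}$ for a list of length $L$) are assumptions of this example.

```python
from typing import Dict, List


def rbp_weight(rank: int, p: float) -> float:
    """RBP weight of a 1-based rank: (1 - p) * p^(rank - 1)."""
    return (1.0 - p) * p ** (rank - 1)


def residual(ranking: List[str], judgments: Dict[str, float], p: float) -> float:
    """Delta_i: total weight at unjudged positions, including the tail past the list."""
    unjudged = sum(
        rbp_weight(rank, p)
        for rank, doc in enumerate(ranking, start=1)
        if doc not in judgments
    )
    return unjudged + p ** len(ranking)


def base(ranking: List[str], judgments: Dict[str, float], p: float) -> float:
    """Base_i: relevance-weighted sum over the judged positions of one run."""
    return sum(
        judgments[doc] * rbp_weight(rank, p)
        for rank, doc in enumerate(ranking, start=1)
        if doc in judgments
    )


def targeted_contributions(
    runs: Dict[str, List[str]], judgments: Dict[str, float], p: float, method: str
) -> Dict[str, float]:
    """Score every unjudged document with Method "B" or Method "C"."""
    contrib: Dict[str, float] = {}
    for ranking in runs.values():
        delta = residual(ranking, judgments, p)
        factor = delta
        if method == "C":
            factor *= (base(ranking, judgments, p) + delta) ** 1.5
        for rank, doc in enumerate(ranking, start=1):
            if doc not in judgments:
                contrib[doc] = contrib.get(doc, 0.0) + rbp_weight(rank, p) * factor
    return contrib


# Hypothetical toy data: D1 was judged relevant and D3 not relevant.
bc_runs = {"S1": ["D1", "D2"], "S2": ["D3", "D2", "D4"]}
bc_judgments = {"D1": 1.0, "D3": 0.0}
method_b = targeted_contributions(bc_runs, bc_judgments, p=0.5, method="B")
method_c = targeted_contributions(bc_runs, bc_judgments, p=0.5, method="C")
print(method_b, method_c)
```

Both runs have the same residual in this toy case, so Method B weights them equally, while Method C favors documents from S1 because its judged document was relevant. The scores change after every judgment, which is why Method C needs several judgment rounds.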

In document Efficient and effective retrieval using Higher-Order proximity models (Page 40-44)