

6.4 Experiments with the Estimation Framework

6.4.3 Estimating Relevance on ClueWeb

To further explore the merit of our approach, we examine the CW09A-2010 collection. It has a shallow pool depth (d = 20), meaning that validation is not possible, as there is no deep-pool reference ordering. Instead, we compute the normalized τ distance between each pair of estimation methods, and simply record how much the rankings differ, as shown in Figure 6.12. The UB estimator assumes that all unjudged documents are relevant. As a reference point, we also compute the same values for TB05, at two depths, d′ = 20 and d′ = 60. At the latter depth all estimation approaches tend to agree with each other. On TB05, all of the estimation results,

[Figure 6.11 plot omitted: two panels, one showing distance and one showing 1 − Kendall's τ, for the collections TREC9, TREC10, Rob04, TB04, and TB05 and the methods InfAP, InfRBP, and LLab.]

Figure 6.11: System ordering comparisons on a two-strata sampled judgment set, repeated ten times. Judgments are to depth d′ = 10, plus a 10% random sample of remaining documents to depth 100 to form the second stratum. Note the logarithmic vertical scales.

[Figure 6.12 plot omitted: three panels (d′ = 20 on CW10, d′ = 20 on TB05, d′ = 60 on TB05), each comparing the estimation methods LB, Lin, La, Lb, RM, and UB against one another, with τdist values on a scale from 0.000 to 0.125.]

Figure 6.12: Normalized τ distance between system orderings generated by different estimation methods using a pool of depth d′ = 20 on CW09A-2010 (CW10), and on TB05 with pool depths of d′ = 20 and d′ = 60.

including UB, tend to agree on the system ordering. However, on CW09A-2010, there is clear uncertainty, confirming that d′ = 60 is a more robust pool depth for TB05 than is d′ = 20 on either TB05 or CW09A-2010 when seeking to apply RBP(0.95) as an evaluation metric. Great caution should be exercised when the d = 20 judgments are used for anything other than shallow metrics.
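The pairwise comparison described above can be illustrated with a minimal sketch. It assumes that the normalized τ distance is the plain fraction of system pairs that two methods order differently, with tied pairs counted as agreement; the exact definition used for Figure 6.12 may differ. The function names, method labels, system names, and scores below are hypothetical and are not taken from the experiments.

```python
from itertools import combinations


def normalized_tau_distance(scores_a, scores_b):
    """Fraction of system pairs that two score assignments order differently.

    Both arguments map a system name to its (estimated) effectiveness score.
    Tied pairs are counted as agreement, which is an assumption of this sketch.
    """
    if set(scores_a) != set(scores_b):
        raise ValueError("both methods must score the same systems")
    pairs = list(combinations(sorted(scores_a), 2))
    if not pairs:
        return 0.0
    discordant = 0
    for s, t in pairs:
        diff_a = scores_a[s] - scores_a[t]
        diff_b = scores_b[s] - scores_b[t]
        if diff_a * diff_b < 0:  # the two methods disagree on this pair
            discordant += 1
    return discordant / len(pairs)


def pairwise_distances(estimates):
    """Distance between every pair of estimation methods."""
    return {
        (m, n): normalized_tau_distance(estimates[m], estimates[n])
        for m, n in combinations(sorted(estimates), 2)
    }


# Hypothetical scores for three systems under two estimation methods.
estimates = {
    "LB": {"sysA": 0.30, "sysB": 0.25, "sysC": 0.20},
    "UB": {"sysA": 0.50, "sysB": 0.55, "sysC": 0.40},
}
distances = pairwise_distances(estimates)
print(distances)
```

In this toy input the two methods disagree only on the ordering of sysA and sysB, so one of the three system pairs is discordant.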

6.5 Conclusions

We have presented a new estimation framework to improve system comparisons in batch IR evaluation, with the key idea being to predict a gain value for each unjudged document. The entire process consists of two components: a set of rank-level estimators and a two-stage estimator formulated as an optimization problem. The two-stage estimation is based on the rank-level estimators and provides a unified global gain for each unjudged document, whereas the rank-level estimators only consider a local gain for each document, ignoring the fact that a document may be associated with multiple different retrieval ranks.

We show that estimation is a viable technique to predict scores for deep evaluation metrics when limited judgments are available, including the case when the judgments are obtained using stratified sampling rather than pooling. One important aspect of our approach is deciding when to adjust topics, instead of treating all topics equally. By making use of the coefficient of covariance (γ) [22], we verified our proposed hypothesis that whether a topic requires score estimation can be predicted by considering the sample coverage, measured using γ. We show that adjusting a subset of topics on earlier TREC test collections can improve the estimation accuracy, which confirms the earlier statement that, on these test collections, a majority of relevant documents can be identified.

A secondary contribution is the development of a new technique to compare system orderings more precisely. By focusing on swaps that are conclusive, our weighted rank correlation coefficient dist can be used to measure the stability of a variety of estimation techniques. Our experimental results show that focusing solely on the significant swaps can give a more robust comparison of system orderings, and that treating all swaps equally may make the results too sensitive to be conclusive. Using dist, we show that estimation improves our ability to score and compare systems using limited judgments.

It must be noted, however, that the estimation is built on the m rank-level fitted models, each of which requires that, when constructing the judgment set, documents up to some rank d′ be fully judged. This means that for some sampling-based judgment approaches, the proposed method is not applicable. Second, while we show that our estimation methods can also account for system bias to some extent, outcomes might be further improved by introducing more randomization into the optimization framework. Overall, there remain many evaluation challenges in this area, despite the gains that have been achieved here.

Chapter 7 Conclusions and Future Work

The focus of IR studies is to develop efficient and effective retrieval systems. The strategy for efficient query evaluation, the model for effective retrieval, and the approaches for evaluating current systems are all important and closely related to each other. Many questions arise in the process of delivering a better retrieval system. One is whether a retrieval system can provide a high-quality ranked list to the user, for which a key factor is the set of heuristics or features employed within the model. In the long history of developing IR systems, many features have been examined, among which proximity (or term dependency) has been shown to be important.

However, sophisticated models usually require more computing effort, and this is also true of term-dependency models that make use of higher-order proximity features. How can we efficiently compute higher-order proximity statistics? What are the possible efficiency and effectiveness trade-offs when using higher-order term-dependency models? As mentioned, examining efficiency and effectiveness requires large-scale evaluation, which must be carried out with incomplete judgments. Therefore, several further questions arise. How reliable are the evaluation results on large-scale web test collections with limited judgment effort? How can we draw reliable conclusions when using current evaluation configurations? This thesis has contributed answers to these questions: the eager and lazy versions of PLANESWEEP for finding all higher-order proximity features in one pass (Chapter 3); the use of distance-based local statistics as a trade-off between effectiveness and efficiency when considering higher-order proximity features (Chapter 4); the finding that weighted-precision metrics considering only top-ranked documents, such as ERR or RBP(0.8), should be employed on large test collections with a shallow pooling depth (Chapter 5); and a score estimation framework to adjust evaluation results for deep effectiveness metrics on shallow pooled test collections (Chapter 6).

In the process of developing new techniques for better satisfying users’ information needs, there remain many open problems. We must continually seek answers to new questions that arise across the entire development life-cycle of IR systems, which encompasses everything from algorithm design to evaluation and testing feedback. With the vast amounts of data these systems must now ingest, process, and organize, we are faced with the task of finding new solutions to old problems.

7.1 Thesis Outcomes

The thesis began by proposing two PLANESWEEP variants for the efficient computation of higher-order term-dependency statistics of all subqueries (Chapter 3). Among all possible retrieval heuristics, proximity features, which assume a certain level of dependency between terms, are known to be effective. It is also known that the computational cost is high when considering the term dependencies of all subqueries of length two or more. Therefore, we empirically studied the effectiveness of higher-order proximity features in MRF-based term-dependency models [73], especially considering the difference between phrases and unordered windows. As our results showed, using either type of proximity statistic may help to improve retrieval effectiveness, which further motivated us to find efficient ways of computing these features. Sadakane and Imai [92] propose the PLANESWEEP algorithm, which computes proximity statistics for each subquery consisting of more than two query terms. However, the exponential growth in the number of subqueries leads to increased overhead, since the extraction process is repeated across all subqueries. Our proposed method overcomes this weakness and performs the extraction by evaluating the original query only once for each document.
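To make the per-subquery cost concrete, the following is a minimal sketch of a basic plane sweep over the term position lists of a single document, for one query only. It is written in the spirit of the sweep attributed to Sadakane and Imai [92] and is not the eager or lazy variant proposed in Chapter 3. The function name, the input layout (a mapping from term to sorted positions), and the window-width convention in the usage lines are assumptions made for illustration.

```python
from collections import Counter


def minimal_intervals(positions):
    """Return all minimal intervals (start, end) that contain every term.

    `positions` maps each query term to the sorted list of its positions in
    one document. An interval is minimal if no shorter interval inside it
    still contains all terms.
    """
    k = len(positions)
    # Merge the per-term position lists into one left-to-right event stream.
    events = sorted((p, t) for t, plist in positions.items() for p in plist)
    counts = Counter()  # occurrences of each term inside the current window
    covered = 0         # number of distinct terms inside the current window
    left = 0            # index of the leftmost event in the window
    result = []
    for pos, term in events:
        counts[term] += 1
        if counts[term] == 1:
            covered += 1
        if covered < k:
            continue
        # Drop leading events whose term occurs again later in the window.
        while counts[events[left][1]] > 1:
            counts[events[left][1]] -= 1
            left += 1
        # The window is minimal only if its right end is also indispensable.
        if counts[term] == 1:
            result.append((events[left][0], pos))
    return result


# Hypothetical position lists for a three-term query in one document.
doc_positions = {"a": [1, 5], "b": [2, 8], "c": [3]}
intervals = minimal_intervals(doc_positions)
print(intervals)

# One possible unordered-window statistic: minimal intervals of width <= 4,
# where width is taken as end - start + 1 (an assumption of this sketch).
window_count = sum(1 for start, end in intervals if end - start + 1 <= 4)
print(window_count)
```

Running this sweep separately for every subquery repeats the merge step each time, which is the overhead that the paragraph above describes.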

A second bottleneck when computing higher-order term-dependency models is the requirement for global statistics, implying a two-pass evaluation on the index structure. We demonstrate in Chapter 4 that a variant of an existing MRF-based term-dependency model can be adopted, and that local proximity statistics can be used as a surrogate for global proximity statistics. By doing so, we show that although the effectiveness may be somewhat degraded, avoiding an unnecessary second evaluation on the index helps to reduce the computational cost.

We empirically examined the effectiveness of higher-order proximity models, and demonstrated their applicability on large test collections. Nevertheless, the incompleteness of these large test collections raises concerns about the reliability of the evaluation results. Shallow pooling depths lead to a high level of uncertainty in system comparison when deep metrics are used to evaluate systems. Different evaluation configurations are possible, and may vary in their conclusions about system performance. In Chapter 5, we experimented with various test collections, differing in size and in number of judged documents. We showed that the configuration, consisting of pooling depth, evaluation depth, and metric, affects the final conclusions drawn from the evaluation process. More specifically, if only a shallow pooling depth is available on large test collections such as web-based collections, there are potential risks arising from the failure to identify enough relevant documents. Missing a relatively large fraction of the relevant documents may cause an imprecise evaluation of system performance, especially if recall-based or deep metrics are employed. We concluded that it is important to set a proper evaluation depth and to choose metrics that suit the test collection. In particular, on large test collections with limited pooling depth, a shallow weighted-precision metric with residual information is recommended.
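As an illustration of a weighted-precision metric with residual information, the sketch below computes rank-biased precision (RBP) together with its residual, using the standard definition in which the document at rank i has weight (1 − p)·p^(i−1). The function name, document identifiers, and judgments are hypothetical. The base score treats unjudged documents as non-relevant, and the base plus the residual is the score that would result if every unjudged or unseen document were relevant.

```python
def rbp_with_residual(ranking, judgments, p=0.8):
    """Return (base, residual) for one ranked list under RBP(p).

    `ranking` is a list of document ids in rank order; `judgments` maps a
    judged document id to its gain in [0, 1]. Documents missing from
    `judgments` are unjudged and contribute to the residual.
    """
    base = 0.0
    residual = 0.0
    weight = 1.0 - p  # weight of rank 1
    for doc in ranking:
        if doc in judgments:
            base += weight * judgments[doc]
        else:
            residual += weight
        weight *= p  # weight of the next rank
    # Everything below the end of the ranking is unseen, with total weight p^n.
    residual += p ** len(ranking)
    return base, residual


# Hypothetical run of three documents, one of them unjudged.
base, residual = rbp_with_residual(
    ["d1", "d2", "d3"], {"d1": 1, "d3": 0}, p=0.5
)
print(base, residual, base + residual)
```

A large residual relative to the base score signals that the judgments are too shallow for the chosen p, which is the situation the paragraph above warns about for deep metrics.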

If there are situations where deep metrics must be used, estimation of the effectiveness score is required. We considered this problem in Chapter 6 and proposed an estimation framework as a possible solution. The estimation framework contains two components: one that models effectiveness scores as a function of retrieval ranks and another that gives the estimates using a two-step optimization. We concluded that it is possible to estimate system performance using our proposed framework. However, the question of how to reduce evaluation bias when only a limited number of judgments is available remains an open research direction.