5.7.3 FairPairs Preference Test

Next, to confirm the correctness of data generated with FairPairs directly, consider the difference between the bottom click probabilities when two results are swapped. The left side of Figure 5.4 shows that reversing a top-five result and the 50th result within a pair behaves as the theory tells us it should. We see that when the fiftieth result is at the bottom of a pair, it is significantly less likely to be clicked on than when an original top-five result is at the bottom of the pair. On the right side of the figure, we see the click probability on the bottom result for pairs of the form 1-2 and for pairs of the form 2-1. In fact, summing the counts for the top two pairs (1-2 and 2-3), the difference in click probability is statistically significant. This shows that on average the top three results returned by the search engine are ranked in the correct order.
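As an illustration of how such a significance check might be carried out, the following is a minimal sketch of a two-sided Fisher exact test (the test named in the caption of Figure 5.5) applied to bottom-click counts for the two orderings of a pair. The function name and all counts are hypothetical and are not taken from the experiments; the test actually used for Figure 5.4 is not stated in this section.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Rows: the two orderings of a pair. Columns: bottom result clicked / not clicked.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Sum the probabilities of all tables at least as unlikely as the observed one.
    # The small tolerance guards against floating-point ties.
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))

# Hypothetical counts (assumed for illustration only): the same two documents
# shown in both orders, 1000 impressions each.
clicks_ab, shown_ab = 30, 1000   # order A-B: clicks on bottom result B
clicks_ba, shown_ba = 60, 1000   # order B-A: clicks on bottom result A
p_value = fisher_exact_two_sided(
    clicks_ab, shown_ab - clicks_ab,
    clicks_ba, shown_ba - clicks_ba,
)
print(f"two-sided p-value: {p_value:.4f}")
```

With counts like these, a p-value below 0.05 would correspond to the 95% confidence threshold used in the figures.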

[Figure 5.5 plot. Vertical axis: probability of user clicking (0% to 35%). Horizontal axis: pair with bottom result clicked (1-2*, 2-3*, 3-4, 4-5*, 5-6*, 6-7*, 7-8, 8-9, 9-10). Curves: rel(Top) > rel(Bot) and rel(Top) < rel(Bot).]

Figure 5.5: Probability of user clicking only on the bottom result of a pair as a function of the pair. The two curves are for when the document immediately above the document clicked was judged strictly more relevant or strictly less relevant by expert human judges. * indicates the difference is statistically significant with 95% confidence using a Fisher Exact test.

[Figure 5.6 plot. Vertical axis: probability of user clicking on bottom result (0% to 30%). Horizontal axis: results in the pair considered with FairPairs (1,# through 8,#). Curves: pairs 1-#, 2-#, ... and pairs #-1, #-2, ...]

Figure 5.6: Probability of user clicking only on the bottom result of a pair as a function of the pair for all queries generating at least one user click. The two curves are the cases where the document that was originally ranked fiftieth is the top or the bottom document in the pair. The error bars indicate 95% confidence intervals.

We also evaluated our approach in a situation where we have the true relative relevance of documents as assessed by human judges. Using the results of the eye tracking study by Joachims et al. (2005) and described in Chapter 2, we computed the probability of a participant in the user study clicking on the bottom result of a pair of results when the top result was judged strictly more relevant or strictly less relevant by expert human judges. Figure 5.5 shows that although FairPairs was not performed on the results in the study, the data supports the FairPairs premise that the probability of a user clicking on a document $d_i$ at rank $i$ is higher if $\mathrm{rel}(d_{i-1}) < \mathrm{rel}(d_i)$ than if $\mathrm{rel}(d_{i-1}) > \mathrm{rel}(d_i)$.
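The premise above suggests a simple way to read a preference out of click logs: compare how often each document of a pair is clicked when it is shown at the bottom of that pair. The following is a minimal sketch under assumed conventions. The log format, the function names, and the data are hypothetical, and no significance test is applied, so this is not the thesis's procedure.

```python
from collections import defaultdict

def bottom_click_rates(log):
    """Estimate, for each ordered pair (top, bottom), the fraction of
    impressions in which the bottom result was clicked.

    `log` is an iterable of (top_doc, bottom_doc, bottom_clicked) tuples
    (a hypothetical format chosen for this sketch).
    """
    shown = defaultdict(int)
    clicked = defaultdict(int)
    for top, bottom, bottom_clicked in log:
        shown[(top, bottom)] += 1
        if bottom_clicked:
            clicked[(top, bottom)] += 1
    return {pair: clicked[pair] / shown[pair] for pair in shown}

def preferred(rates, a, b):
    """Return whichever of a, b attracts more bottom clicks when placed
    below the other, or None if the rates tie or an ordering is missing."""
    a_below_b = rates.get((b, a))  # a shown at the bottom
    b_below_a = rates.get((a, b))  # b shown at the bottom
    if a_below_b is None or b_below_a is None or a_below_b == b_below_a:
        return None
    return a if a_below_b > b_below_a else b

# Hypothetical log: d1 is clicked more often at the bottom than d2 is.
log = (
    [("d1", "d2", True)] * 1 + [("d1", "d2", False)] * 9
    + [("d2", "d1", True)] * 3 + [("d2", "d1", False)] * 7
)
rates = bottom_click_rates(log)
print(preferred(rates, "d1", "d2"))
```

Because both documents are compared in the same position (the bottom of the pair), the comparison is not affected by one position simply attracting more clicks than the other. In practice the raw comparison would be paired with a significance test before a preference is recorded.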

Figure 5.6 shows the equivalent curve for the arXiv search engine, in effect providing a more detailed view of Figure 5.4. We again considered all queries that generated at least one click and exploited symmetries in our experiment design to obtain the maximal amount of data for this figure. It shows that if the fiftieth ranked document is displayed in a pair with a top-eight document, the FairPairs data collected is in agreement with our hypothesis that the fiftieth ranked document is less relevant than any from the top eight. In particular, the first five differences in click probabilities are statistically significant. For lower ranks the curves appear to proceed in a similar manner. This includes result pairs below the sixth, which usually are not visible without users scrolling.

5.8 Summary

In this chapter we introduced FairPairs, a method to modify the presentation of search engine results with the purpose of collecting more reliable relevance feedback from normal user behavior. We showed that under reasonable assumptions the data gathered is provably unaffected by presentation bias. We also showed that given sufficient clickthrough data, training data generated with FairPairs will allow a learning algorithm to converge to the ideal ranking. We performed real world experiments that evaluated the assumptions and conclusions in practice. Given bias-free training data generated in this way, existing methods for learning to rank can be used without further modifications to compensate for presentation bias.

CHAPTER 6

ACTIVE METHODS FOR OPTIMIZING DATA COLLECTION

The analysis in this thesis has, thus far, assumed that clickthrough data is collected passively, or at best with minimal intervention as in the previous chapter. In effect, we simply infer relevance judgments from recorded interactions that take place anyway. We now describe techniques to guide users, in order to combat evaluation bias and provide more useful training data for a learning search engine. This research was originally published in (Radlinski & Joachims, 2007).

6.1 Introduction

When learning to rank, we have seen that two alternatives for obtaining training data are expert relevance judgments or relevance judgments collected implicitly by observing user behavior. Assuming that we wish to avoid the difficulties associated with collecting judgments from experts, as described in Chapter 1, consider once more the properties of user behavior described in Chapter 2.

We saw that users usually execute a query, and then perhaps consider the first two or three results presented by the search engine (Granka et al., 2004). The feedback (clicks) on these results can be recorded and used to infer relevance judgments. These judgments can then be used to train a learning algorithm such as a Ranking Support Vector Machine, as described in Chapter 4. In particular, the eye tracking study showed that users very rarely even look at results beyond the first few. Similarly, other researchers have previously noticed that users click predominantly on search results at high ranks (for example, see Agichtein et al. (2006)).

Hence clickthrough data is strongly biased toward documents already ranked highly. Highly relevant results that are not initially ranked highly for any query may never be observed and evaluated. This means that if the ranking function used by a search engine initially performs poorly for some class of queries, training examples that identify truly relevant results for these queries may never be observed. This would make it difficult for a learned ranking to ever converge to an optimal ranking.
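To give a sense of scale for this effect, the sketch below assumes a hypothetical geometric position model in which each rank is examined a fixed fraction as often as the rank above it. The model, the decay value, and the function names are illustrative assumptions only; they are not estimates from the eye tracking study.

```python
def examination_probability(rank, decay=0.5):
    """Assumed probability that a user examines the result at `rank`
    (rank 1 is always examined; each later rank is `decay` times as likely)."""
    return decay ** (rank - 1)

def expected_impressions_until_examined(rank, decay=0.5):
    """Expected number of impressions before the result at `rank` is examined
    once, treating impressions as independent (mean of a geometric distribution)."""
    return 1.0 / examination_probability(rank, decay)

for rank in (1, 3, 10, 50):
    n = expected_impressions_until_examined(rank)
    print(f"rank {rank:2d}: about {n:.3g} impressions before first examination")
```

Under this assumed model, a relevant document that starts at rank 50 would need an impractically large number of impressions before receiving any feedback, which is the sense in which it may never be observed.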

To avoid this evaluation bias in which documents are evaluated, this chapter presents a new formulation for learning to rank, where the ranking presented to users is optimized to obtain useful data rather than strictly in terms of estimated document relevance. The goal this formulation addresses is to minimize the total loss from presenting poor rankings over all time.

There are many approaches by which more useful training data could potentially be collected. For example, one possibility would be to intentionally present unevaluated documents in the top few positions of search engine results, aiming to collect more feedback on them. However, such an ad-hoc approach is unlikely to be useful in the long run, and would hurt user satisfaction substantially in the short run by often presenting suboptimal results. We instead introduce principled modifications that can be made to the rankings presented. These changes, which do not substantially reduce the quality of the ranking shown to users, produce much more informative training data and quickly lead to higher quality rankings being shown to users. In contrast with previous work by Chu and Ghahramani (2005a), we do not simply ask which relevance judgments should be obtained to reduce uncertainty in establishing which is the correct ranking. Rather, we consider how to obtain training data that will quickly improve the quality of rankings using metrics suitable for measuring search engine performance.

We will now formalize the learning problem as an optimization task, present a suitable Bayesian probabilistic model and discuss inference and learning. Following this, we present strategies to modify the rankings shown to users so that performance of learned rankings improves rapidly over time. An evaluation of this approach is then presented, using both synthetic data and TREC-10 Web data. In particular, we see the improvements using our exploration strategies are much faster than with passive or random data collection.