The unlimited number of possible instantiations of each component of xQuAD precludes exhaustive experimentation. To conduct a thorough yet feasible investigation, we adopt a fractional factorial design (Box et al., 2005), evaluating a limited number of instantiations (factor levels) of each framework component (factor) that are both potentially effective and feasible for a practical deployment. As part of the validation of xQuAD in this chapter, Section 5.2 investigates alternative instantiations of the document relevance component. In turn, Chapter 6 will investigate multiple instantiations of the sub-query generation and importance components, while Chapter 7 will investigate the document coverage and novelty components. A deeper look into the role of the novelty component will be the focus of Chapter 8. Lastly, Chapter 9 will investigate alternative regimes for estimating the diversification trade-off.

While different chapters of this thesis have different experimental setups, in the remainder of this section, we describe the basic experimental methodology that is common to all these chapters. In particular, Section 5.1.1 describes the test collections used in our experiments, including their associated document corpus, queries, and relevance assessments, while Section 5.1.2 describes the procedures for training and evaluating all approaches investigated in this thesis.
5.1.1 Test Collections
Our experiments are based on the evaluation paradigm provided by the TREC 2009, 2010, and 2011 Web tracks (Clarke et al., 2009a, 2010, 2011b), henceforth denoted WT09, WT10, and WT11, respectively. The TREC Web track provides test collections for the assessment of adhoc and diversity search approaches in a web setting. As a document corpus, it uses the ClueWeb09 dataset, a web crawl comprising over 1.2 billion documents in different languages. In our experiments, we use two subsets of ClueWeb09, as used in TREC: the ClueWeb09 A corpus (CW09A), comprising the English portion of ClueWeb09, with 500 million documents; and the ClueWeb09 B corpus (CW09B), a subset of CW09A with 50 million documents, intended to represent the first tier of a commercial search engine index (Santos et al., 2011b). We index these corpora using the Terrier IR platform² (Ounis et al., 2006; Santos et al., 2011g; Macdonald et al., 2012a), after applying Porter's weak stemmer (Porter, 1980) and without removing stopwords.

As of 2011,³ the TREC Web track provides a total of 150 queries, sampled from the query log of a commercial search engine. In our experiments, we discard the queries numbered 20, 95, 100, 112, and 143, as none of them has a document in the ClueWeb09 B corpus judged relevant for either the adhoc or the diversity task. The statistics of the resulting test collections, with a total of 145 queries, are provided in Table 5.1. As described in Section 3.4.1, for each query, TREC assessors identified multiple sub-topics, representing different aspects of the query, with relevance assessments conducted at the sub-topic level (Clarke et al., 2009a, 2010, 2011b). In some of our experiments, these sub-topics will be used as an oracle aspect representation. While alternative representations will be proposed and investigated in both Section 5.2 and Chapter 6, this oracle provides a controlled environment for evaluating the effectiveness of different diversification approaches while isolating the impact of any particular aspect representation.
Table 5.1: Statistics of the test collections used in this thesis. Relevance assessment figures are broken down by corpus (CW09A or CW09B) and task (adhoc or diversity).

                                WT09     WT10     WT11
#queries                          49       48       48
#sub-topics                      228      194      158
CW09A  adhoc      #judged     23,205   23,898   18,362
CW09A  adhoc      #relevant    6,858    5,233    3,157
CW09A  diversity  #judged     25,833   46,553   19,381
CW09A  diversity  #relevant    4,895    6,553    5,030
CW09B  adhoc      #judged     12,859   15,130   12,132
CW09B  adhoc      #relevant    4,002    3,090    1,662
CW09B  diversity  #judged     14,951   43,960   12,599
CW09B  diversity  #relevant    3,026    3,960    2,764

² http://terrier.org
³ The TREC 2012 Web track is ongoing at the time of writing.
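The per-year query counts in Table 5.1 follow directly from the discarded query numbers. The sketch below reproduces this bookkeeping under one assumption that is not stated in the text above: that the Web track numbers its queries 1 to 50 for WT09, 51 to 100 for WT10, and 101 to 150 for WT11. The names `YEAR_RANGES`, `DISCARDED`, and `queries_by_year` are hypothetical and serve only as an illustration.

```python
# Assumed TREC Web track query numbering per year (not stated in the text).
YEAR_RANGES = {
    "WT09": range(1, 51),
    "WT10": range(51, 101),
    "WT11": range(101, 151),
}

# Queries with no CW09B document judged relevant for either task (from the text).
DISCARDED = {20, 95, 100, 112, 143}


def queries_by_year(year_ranges=YEAR_RANGES, discarded=DISCARDED):
    """Return the retained query numbers for each Web track year."""
    return {
        year: [q for q in numbers if q not in discarded]
        for year, numbers in year_ranges.items()
    }


counts = {year: len(qs) for year, qs in queries_by_year().items()}
print(counts)                 # {'WT09': 49, 'WT10': 48, 'WT11': 48}
print(sum(counts.values()))   # 145
```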
5.1.2 Training and Evaluation
Several supervised machine learning approaches, including learning to rank, classification, and regression approaches, are deployed in this thesis, all of which require some form of training data. A natural direction for producing training examples from the test collections described in Section 5.1.1 is to partition the available queries into training and test sets. In our experiments, two alternative regimes are considered. In particular, the experiments in this chapter as well as those in Chapters 6 and 9 deploy a cross-validation regime, mixing together the available queries and randomly splitting these queries into multiple folds. In each of the cross-validation rounds, we organise the available queries into training (60%), validation (20%), and test (20%) queries. As discussed in Section 2.2.3.1, the use of validation data reduces the possibility that the learned parameters are overfitted to the training data. Our second training regime is used for the experiments in Chapters 7 and 8, where we deploy a cross-year validation, with the available queries split into year-oriented folds, as opposed to randomly. To ensure a fair evaluation with a complete separation between training and test data, all results in this thesis are reported as an average across the test queries from the different cross-validation (or cross-year) rounds. A breakdown of the corpus, queries, and training regime used in each experimental chapter is provided in Table 5.2.
Table 5.2: Corpus, queries, and training regime used in each chapter.

            Corpus   Queries            Training
Chapter 5   CW09B    WT09, WT10, WT11   5-fold cross-validation
Chapter 6   CW09A    WT09, WT10, WT11   5-fold cross-validation
Chapter 7   CW09B    WT09, WT10         2-fold cross-year
Chapter 8   CW09B    WT09, WT10         2-fold cross-year
Chapter 9   CW09B    WT09               5-fold cross-validation
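As a rough illustration of the two training regimes, the following sketch generates the query splits for each round. It is a minimal sketch under stated assumptions rather than the exact procedure used in our experiments: the rotation that picks the validation fold, the fixed random seed, and the function names `cross_validation_rounds` and `cross_year_rounds` are all hypothetical. The cross-year variant only separates training from test queries, since the handling of validation data in that regime is not specified above.

```python
import random


def cross_validation_rounds(query_ids, n_folds=5, seed=0):
    """Yield (train, validation, test) query lists, one triple per round.

    With five folds, each round uses three folds for training (60%),
    one for validation (20%), and one for testing (20%).
    """
    ids = list(query_ids)
    random.Random(seed).shuffle(ids)      # mix queries from all years together
    folds = [ids[i::n_folds] for i in range(n_folds)]
    for r in range(n_folds):
        v = (r + 1) % n_folds             # assumed choice of validation fold
        train = [q for i, fold in enumerate(folds) if i not in (r, v) for q in fold]
        yield train, folds[v], folds[r]


def cross_year_rounds(queries_by_year):
    """Yield (train, test) query lists, holding out one year per round."""
    years = sorted(queries_by_year)
    for held_out in years:
        train = [q for year in years if year != held_out
                 for q in queries_by_year[year]]
        yield train, list(queries_by_year[held_out])


# The 145 queries retained in Section 5.1.1.
queries = [q for q in range(1, 151) if q not in {20, 95, 100, 112, 143}]
for train, validation, test in cross_validation_rounds(queries):
    print(len(train), len(validation), len(test))   # 87 29 29 in every round
```

Because every query falls into the test fold of exactly one round, averaging over the test queries of all rounds covers each query once.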
To evaluate the various approaches investigated in this thesis, we deploy the two primary metrics used in the diversity task of the TREC Web track (Clarke et al., 2009a, 2010, 2011b): ERR-IA (Equation (3.28)) and α-nDCG (Equation (3.29)). As discussed in Section 3.4.2, these metrics implement a cascade model (Craswell et al., 2008), which penalises redundancy across multiple query aspects by assuming a diminishing probability that users will continue to examine the ranking once they find relevant information (Clarke et al., 2011a). Following the standard TREC setting, both metrics are reported at rank 20, reflecting web searchers' interest in documents at early ranks (Jansen et al., 1998).

Lastly, in order to ensure that our findings are not a mere reflection of chance, all results reported in this thesis are validated statistically. As a statistical hypothesis test, we use Student's t-test to contrast pairs of ranking approaches (Sanderson & Zobel, 2005; Smucker et al., 2007). In particular, throughout this thesis, we use the symbols △ (▽) and ▲ (▼) to denote a statistically significant increase (decrease) at the p < 0.05 and p < 0.01 levels, respectively, while the symbol ◦ is used to denote no significant difference. The baseline against which significance is reported will be made clear in each case. In addition, we report the number of queries negatively affected (−), positively affected (+), and unaffected (=) by each tested approach compared to this baseline.
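The significance notation and the per-query counts can be sketched as follows. This is an illustrative sketch, not the tooling used in our experiments: it assumes a paired, two-sided t-test over per-query metric scores (the text above does not specify sidedness), and it obtains the p-value by numerically integrating the Student's t density with the standard library rather than calling a statistics package. All function names and the tolerance `eps` are hypothetical.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(baseline, system):
    """Return (t, degrees of freedom) for per-query score differences."""
    diffs = [s - b for b, s in zip(baseline, system)]
    m, sd = mean(diffs), stdev(diffs)
    if sd == 0:                            # all differences identical
        t = 0.0 if m == 0 else math.copysign(math.inf, m)
    else:
        t = m / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


def two_sided_p(t, df, steps=2000):
    """Two-sided p-value via Simpson integration of the t density on [0, |t|]."""
    if math.isinf(t):
        return 0.0
    a = abs(t)
    if a == 0:
        return 1.0
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))

    def pdf(x):
        return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

    h = a / steps                          # steps must be even for Simpson's rule
    total = pdf(0.0) + pdf(a)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    return max(0.0, 1.0 - 2.0 * total * h / 3)


def significance_symbol(baseline, system):
    """Map the test outcome to the symbols used in this thesis."""
    t, df = paired_t_statistic(baseline, system)
    p = two_sided_p(t, df)
    if p < 0.01:
        return "▲" if t > 0 else "▼"
    if p < 0.05:
        return "△" if t > 0 else "▽"
    return "◦"


def affected_counts(baseline, system, eps=1e-12):
    """Return the numbers of queries (negatively, positively, un-)affected."""
    diffs = [s - b for b, s in zip(baseline, system)]
    neg = sum(1 for d in diffs if d < -eps)
    pos = sum(1 for d in diffs if d > eps)
    return neg, pos, len(diffs) - neg - pos


baseline = [0.30, 0.25, 0.40, 0.10, 0.50]   # made-up per-query scores
system = [0.35, 0.25, 0.38, 0.20, 0.58]
print(significance_symbol(baseline, system), affected_counts(baseline, system))
# ◦ (1, 3, 1)
```

In this toy example, three of five queries improve, yet the difference is not significant, which shows why the counts are reported alongside the test outcome.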