3.1.2 Pooling Paradigm
The pooling technique has been widely used in TREC tracks to build and assess test collections before they are released as benchmarks for IR researchers (TREC, 2016a). The pooling paradigm starts by crawling the web, or a specific web domain, to create the document set of the test collection. The TREC organising committee then uses its IR system on the crawled document set, together with expert human annotators, to produce the query set and its relevance judgements. First, a retrieval method of the IR system retrieves documents in response to the queries (topics) created by the expert annotators. Then, the top-k retrieved documents (where k is the pool depth) are judged by the TREC expert annotators to determine which documents are relevant and which are irrelevant to each query. Finally, the test collection is made available for TREC track competitions, in which multiple IR systems are used to validate the test collection and the research outcomes.
Buckley et al. (2007) argued that the size of the test collection affects the degree of bias in the relevance judgements: if the test collection is large, the pool should be correspondingly large to provide accurate relevance judgements for each query. This issue motivated various statistical analyses of the results of different retrieval methods on pooled benchmarks such as TREC Disks 4 and 5 (Soboroff, 2007; Sanderson, 2010).
The TREC Disks 4 and 5, .GOV, ClueWeb09 and ClueWeb12 test collections are well-known document collections used in various TREC and SIGIR tracks (TREC, 2016b; Soboroff et al., 2003; Habernal et al., 2016). TREC and SIGIR are the best-known international conferences that produce standardised test collections for IR research. The most widely used standardised pooled collection from TREC and SIGIR is TREC Disks 4 and 5, while ClueWeb12 is the newest textual pooled test collection.
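The pool construction step described above can be sketched in a few lines. This is only an illustrative sketch, not the TREC organisers' actual tooling: the function name `build_pool`, the run names and the document identifiers are all hypothetical, and the sketch assumes the pool for a query is the union of the top-k documents over one or more retrieval runs.

```python
from collections import defaultdict


def build_pool(runs, depth):
    """Return, per query, the set of documents to send to the assessors.

    runs:  dict mapping a run name to {query_id: ranked list of doc ids}
           (hypothetical in-memory layout, assumed for illustration).
    depth: the pool depth k; only the top-k documents of each run are pooled.
    """
    pool = defaultdict(set)
    for ranking_by_query in runs.values():
        for query_id, ranked_docs in ranking_by_query.items():
            # Only the first `depth` documents of each ranking are judged.
            pool[query_id].update(ranked_docs[:depth])
    return dict(pool)


# Toy example with made-up document identifiers.
runs = {
    "system_a": {"q1": ["d1", "d2", "d3", "d4"]},
    "system_b": {"q1": ["d3", "d5", "d1", "d6"]},
}
pool = build_pool(runs, depth=2)
print(sorted(pool["q1"]))
```

Documents outside the pool (here `d4` and `d6`) are never shown to an assessor, which is the source of the bias discussed in the rest of this section.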
TREC Disks 4 and 5 contain about half a million documents, while the sets of queries and their relevance judgements vary between TREC tracks (TREC, 2016b,c). The document set was crawled from news and broadcast websites; for example, the FT document set comes from the Financial Times website. The documents have been combined with multiple query sets and relevance judgement sets for different IR research purposes in TREC tracks such as TREC 1 to 8, Robust 2003 and Robust 2004 (TREC, 2016b,c). ClueWeb12 was used in TREC 2014 and contains 733,019,372 pages, which act as documents and occupy 27.3 terabytes of disk storage (TREC, 2016d; Lemur, 2016a). A comparison of TREC Disks 4 and 5 with ClueWeb12 shows that the TREC Disks 4 and 5 test collections are more accurate for evaluating IR research than ClueWeb12 (Urbano, 2016). The common research issues that limit ClueWeb12 as a realistic IR test collection are as follows:
• ClueWeb12 is a pooled judged collection with only 50 queries in its query set. Moreover, as discussed above, a pool that is small relative to the collection size biases the collection towards the retrieval method used to build the relevance judgements. The total number of unique judged (relevant or irrelevant) documents in the relevance judgements for ClueWeb12 in Web Track 2014 is only 5666, while the document set contains 733,019,372 documents. This confirms the limitation and bias of evaluating with such a small pool. There may therefore be many unjudged relevant documents in the collection that correspond to queries in the query set but do not appear in the relevance judgement file. These unjudged relevant documents are treated as irrelevant when a new IR model is evaluated, which makes the evaluation inaccurate. In contrast, TREC Disks 4 and 5 were judged and evaluated for various TREC tracks by multiple IR systems, and the pool sizes for these tracks are reasonable compared to the size of the document set. According to the statistical analysis of results produced by multiple IR systems in Urbano (2016), Disks 4 and 5 with the relevance judgements produced in the Robust 2004 track form the most stable and accurate pooled test collection for assessing and comparing IR research.
Table 3.2: Characteristics of the Pooling Test Collections Used in this Thesis.
ID                                 Description                   No. of Docs  No. of Queries
TREC Disks 4&5 (Robust 2004)       News and Broadcast Web Pages  472525       230
TREC Disks 4&5 (Crowdsource 2012)  News and Broadcast Web Pages  18260        10
• The ClueWeb12 document set was collected using five instances of the Internet Archive's Heritrix web crawler running on five Dell PowerEdge R410 machines with 64GB RAM (Lemur, 2016b). Furthermore, a huge computational cost would be required to turn ClueWeb12 into a fully judged collection for a large number of queries.
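The first issue in the list above can be made concrete with a short calculation. The sketch below is illustrative only: the function names, the toy ranking and the toy judgements are hypothetical, and precision at k is used merely as an example metric, assuming (as stated above) that unjudged documents are scored as irrelevant. Only the two counts, 5666 and 733,019,372, are taken from the text.

```python
def judged_fraction(num_judged, collection_size):
    """Fraction of the document set that has any relevance judgement."""
    return num_judged / collection_size


def precision_at_k(ranked_docs, qrels, k):
    """Precision at rank k, counting unjudged documents as irrelevant.

    qrels: {doc_id: relevance label} for one query; a document missing
           from qrels was never judged and is treated as label 0.
    """
    top_k = ranked_docs[:k]
    relevant = sum(1 for doc in top_k if qrels.get(doc, 0) > 0)
    return relevant / k


# Figures quoted in the text for ClueWeb12 (Web Track 2014).
coverage = judged_fraction(5666, 733_019_372)
print(f"judged fraction: {coverage:.2e}")

# Toy query: d9 and d8 were never pooled, so they count as irrelevant
# even if an assessor would have marked them relevant.
qrels = {"d1": 1, "d2": 0}
ranking = ["d1", "d9", "d2", "d8"]
p_at_4 = precision_at_k(ranking, qrels, k=4)
```

With the quoted figures, fewer than one document in every hundred thousand has been judged, so a new retrieval model that surfaces documents outside the original pool can be penalised regardless of their true relevance.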
In this thesis, the TREC Robust 2004 and Crowdsource 2012 relevance judgements for Disks 4 and 5, together with Cranfield paradigm test collections, were used in Chapters 5 and 6 (TREC, 2004; Smucker et al., 2012; Hersh et al., 1994; Glassgow, 2014). The reason is that their relevance judgements are stable and accurate across various IR systems, as validated by Urbano (2016), TREC tracks (TREC, 2016b,c) and previous IR research (Cleverdon, 1960; Sanderson, 2010). In addition, the Cranfield paradigm collections are the most suitable test collections for simulating real expert user feedback without bias. Moreover, they can be processed by IR systems on a standard PC. The characteristics of the pooled test collections used in this thesis are shown in Table 3.2. On the other hand, we identified a limitation of the TREC Disks 4 and 5 and Cranfield paradigm test collections as supervised EML datasets; this limitation is discussed in Section 3.1.3. These collections can only simulate the early stage of an IR test collection, before extensive relevance feedback from user interactions is available, which suits unsupervised learning techniques. At a later stage, IR systems can use supervised EML techniques once a fully judged test collection is available from extensive historical user interactions. Fully judged collections can be used to extract fully judged query-document pairs for applying supervised Learning to Rank and creating ranking models for query auto-completion (Kharitonov, 2016; Liu et al., 2007; Qin et al., 2010).
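As a rough illustration of how judged query-document pairs could be extracted from a relevance judgement file, the sketch below parses lines in the standard four-column TREC qrels layout (topic, iteration, document number, relevance). The function name `parse_qrels` and the sample lines with their document identifiers are hypothetical and are not taken from any of the collections used in this thesis.

```python
def parse_qrels(lines):
    """Turn qrels lines into (topic, docno, relevance) triples.

    Each well-formed line has four whitespace-separated fields:
    topic id, iteration (ignored here), document number, relevance label.
    Lines that do not have exactly four fields are skipped.
    """
    pairs = []
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            continue
        topic, _iteration, docno, relevance = fields
        pairs.append((topic, docno, int(relevance)))
    return pairs


# Made-up sample lines; real qrels files use the collection's own doc ids.
sample = [
    "301 0 DOC-A 1",
    "301 0 DOC-B 0",
    "",
    "302 0 DOC-A 0",
]
pairs = parse_qrels(sample)
relevant_pairs = [(t, d) for t, d, rel in pairs if rel > 0]
```

Each triple is one labelled training instance in the Learning to Rank sense; in a partially judged pooled collection, any query-document pair absent from the file simply has no label.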