The field of information retrieval is both vast and varied, with a great number of significant contributions in recent years. Such advances can be found in fundamental principles, such as the similarity metrics used, and in search system augmentations, such as query difficulty prediction algorithms and prior-based result enhancement. Improvements in disk capacity and processor performance, together with the vast quantities of data that are becoming available, have contributed to these advances. However, these factors also continue to drive the demand for faster and more accurate search.
The studies of query logs discussed in Section 2.1.1 have shown that users have high expectations of search systems. Users provide on average fewer than three query terms and rarely examine more than two pages of search results, so search engines must make the most of the limited information provided to produce high-quality results. These studies also suggest a high volume of redundancy that can be taken advantage of at various phases of the query evaluation process.
As the scale of collections grows, demand for access to disk-based inverted lists, vocabulary and document mapping table entries, and collection files also increases. Previous caching techniques have focused on inverted lists and result sets, but have not considered other data structures. Index organisations that allow dynamic query-time pruning have the potential to further improve inverted list caching, yet to date have not considered the history of queries that users pose to the search system. Similarly, while the most successful techniques to predict query difficulty have relied on the relationship between the query and the collection, they have yet to consider past queries.
In subsequent chapters we build upon the background work presented in this chapter. We examine the repetition in query logs and propose techniques to reorganise the index for more efficient query evaluation. In this regard we explore query evaluation optimisation techniques such as static and dynamic list pruning, index compression and collection reordering. Continuing our exploration of search engine efficiency, we examine the effects of search engine caching and propose a model to measure the impact of differing caching techniques at search time using real query logs and collections. Finally, we derive information from query logs for use as external evidence, and consider this first as a form of document prior and second for the task of difficulty prediction.
Access-Ordering
In this chapter, we investigate whether indexes can be reorganised based on past query usage patterns so that they better support future queries. Our aim in developing this approach is to explore whether past queries can be used to identify documents that are likely to be relevant responses to future queries, and whether this can be used to organise an index for faster or more accurate query evaluation.
We begin by presenting observations on the access frequencies of documents in Section 3.1. In Section 3.2, we propose an index design that takes advantage of document access frequency. In Section 3.3, we examine alternate approaches that can be used to derive the index structure. List pruning techniques that allow efficient evaluation of user queries are discussed in Section 3.4. Compression techniques are discussed in Section 3.5. We present experimental evaluations of our methods in Section 3.6.
3.1 Document Access Frequencies
The frequency distributions of many items in web retrieval are often very skew. For example, the frequency of word occurrences in large newswire collections is such that the commonest word, “the”, occurs twice as often as the second most common word, “of”, which in turn occurs 1.1 times as often as “to”, and so on; the ratios appear to vary slightly between collections, but the trends are the same [Williams and Zobel, 2005]. Similar behaviours are seen in the rate at which new words occur as a collection grows, where the number of distinct words grows sublinearly with the collection size [Williams and Zobel, 2005]; in the use of query terms, where 9% of all query terms are drawn from a set of only 0.05% of unique terms [Spink et al., 2001]; and in the caching behaviour of web documents [Breslau et al., 1999].
[Plot: x-axis Documents (0 to 1,500,000); y-axis Access Count (0 to 10,000); series n = 1000, n = 100, n = 10]
Figure 3.1: A plot showing the number of times each document in the WT10g collection is ranked in the top 10, 100 and 1,000 responses for around 1.9 million Excite queries. Documents are sorted and labeled in access count order.
To investigate whether this skew distribution extends to documents that are returned in response to queries, we examined more than 1.9 million ranked queries on a collection of around 1.6 million documents. The queries were taken from the Excite 1997 and 1999 query logs after removing any queries that contained terms that were deemed to be offensive, while the collection used was the TREC WT10g collection (described in Table 2.3, page 19).
For each query, we evaluated the similarity of the query to the documents in the collection using the Okapi BM25 formulation described in Section 2.6.2 (page 30), and counted the number of times each document appeared in the top n documents of any query (we used n = 10, n = 100, and n = 1,000). We refer to this count as the access count of each document.
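The counting step can be illustrated with a minimal sketch. It assumes that a ranked list of document identifiers is already available for each query (for example, from a BM25 ranker, which is not shown here); the function name `access_counts` and the toy identifiers are hypothetical and are not taken from the experimental system.

```python
from collections import Counter
from typing import Iterable, List


def access_counts(rankings: Iterable[List[str]], n: int) -> Counter:
    """Count how often each document appears in the top n results of any query.

    `rankings` holds one ranked list of document identifiers per query,
    best match first. This layout is an assumption made for illustration.
    """
    counts: Counter = Counter()
    for ranked_docs in rankings:
        # set() ensures a document is counted at most once per query,
        # even if a ranking were to list it twice.
        counts.update(set(ranked_docs[:n]))
    return counts


# Toy rankings for three hypothetical queries.
toy_rankings = [
    ["d1", "d2", "d3"],
    ["d2", "d1", "d4"],
    ["d2", "d5", "d1"],
]

counts_top1 = access_counts(toy_rankings, n=1)
counts_top2 = access_counts(toy_rankings, n=2)
```

Documents that never appear in a top-n list are simply absent from the counter, and a `Counter` reports a count of zero for them when looked up.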
A plot of the document access counts for the query set is shown in Figure 3.1. On the y-axis, the graph shows the access count (the frequency with which the document appeared in the top n results). On the x-axis, the documents are organised by decreasing rank of frequency so that the most-frequently retrieved document (with the highest access count) is shown at 0 and the least-frequently retrieved (with the lowest access count) at 1.6 million.
[Plot: x-axis Documents (0 to 1,500,000); y-axis Access Count (0 to 10,000)]
Figure 3.2: A plot showing the number of times each document in the WT10g collection is ranked in the top 1,000 responses for around 1.9 million Excite queries. Points on the line are documents that have been judged as relevant for any of the TREC-9 queries used in our experiments.
Not surprisingly, Figure 3.1 shows that a small fraction of the documents in the collection are frequently ranked highly in response to the queries. At n = 1,000, the most frequently accessed document (TREC document WTX030-B33-366) has an access count of 76,740, that is, it appears in the top 1,000 results for around 4% of the queries. Further, the 10% most frequently accessed documents account for almost 40% of all accesses.
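The two summary statistics quoted above can be sketched as follows. The function `share_of_accesses` and the toy counts are hypothetical and serve only to show how such a concentration figure might be computed; the final line reproduces the 4% figure from the numbers given in the text, taking the query count as approximately 1.9 million.

```python
from typing import Sequence


def share_of_accesses(counts: Sequence[int], top_fraction: float) -> float:
    """Fraction of all accesses that fall on the most-accessed documents.

    `top_fraction` is the proportion of documents to include,
    for example 0.10 for the 10% most frequently accessed documents.
    """
    if not 0.0 < top_fraction <= 1.0:
        raise ValueError("top_fraction must be in (0, 1]")
    total = sum(counts)
    if total == 0:
        return 0.0
    ordered = sorted(counts, reverse=True)
    # Always include at least one document so tiny collections still work.
    cutoff = max(1, int(len(ordered) * top_fraction))
    return sum(ordered[:cutoff]) / total


# Toy access counts for ten hypothetical documents (100 accesses in total).
toy_counts = [50, 20, 10, 5, 5, 4, 3, 2, 1, 0]
top_share = share_of_accesses(toy_counts, 0.10)

# Share of queries that retrieve the most-accessed document at n = 1,000,
# using the figures in the text (query count is approximate).
top_doc_query_share = 76_740 / 1_900_000
```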
Many documents in the collection are accessed infrequently or not at all. For example, even for n = 1,000, 13% of the documents in the collection were accessed by 100 queries or fewer, that is, by around 0.005% of queries or less, while 1.8% of the collection is not accessed at all. With smaller values of n, these effects are more pronounced: for n = 10, almost 40% of the collection is not accessed, and for n = 100 around 8% of documents are never retrieved.
The line plotted in Figure 3.2 does not show all documents from the collection. Instead, we show only the 2,605 documents that have been judged as relevant to any of the fifty TREC-9 queries that are used to compare the behaviour of web retrieval techniques; the TREC framework and queries are discussed in Section 2.2.4 (page 18). This plot illustrates that documents that have been judged as relevant to user information needs tend also to be those with high access counts: around 39% of the relevant documents are among the 10% of documents with the highest access counts, and only 7% of the relevant documents are in the half of the collection with the lowest access counts. Note that due to the pooling technique used to judge relevant documents in the TREC environment, a total of 59,720 unique documents are judged over the 50 topics, with the rest of the collection unjudged.
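A comparison of this kind between relevance judgements and access counts could be computed along the following lines. This is a sketch under stated assumptions: the function `relevant_share_in_top`, the dictionary layout and the toy data are hypothetical, and ties in access count are broken arbitrarily by the sort.

```python
from typing import Dict, Set


def relevant_share_in_top(
    counts: Dict[str, int], relevant: Set[str], top_fraction: float
) -> float:
    """Fraction of relevant documents found among the most-accessed documents.

    `counts` maps every document in the collection to its access count.
    Relevant documents missing from `counts` are treated as not in the top set.
    """
    if not relevant:
        return 0.0
    ranked = sorted(counts, key=lambda doc: counts[doc], reverse=True)
    cutoff = max(1, int(len(ranked) * top_fraction))
    top_docs = set(ranked[:cutoff])
    return len(relevant & top_docs) / len(relevant)


# Toy collection of ten documents with distinct access counts:
# d0 has count 100, d1 has 90, and so on down to d9 with 10.
toy_access = {f"d{i}": 100 - 10 * i for i in range(10)}
toy_relevant = {"d0", "d1", "d5", "d9"}

share_in_top_fifth = relevant_share_in_top(toy_access, toy_relevant, 0.20)
```

In the toy data, two of the four relevant documents fall within the two most-accessed documents, so the function returns 0.5 for a top fraction of 0.20.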
We conclude from this initial investigation that document accesses are highly skew when averaged over a large number of queries. In addition, we empirically observed that there is a correlation between the access count of a document and its likelihood of being relevant to the user. With this motivation, we propose access-ordered indexes in the next section.