To efficiently service user queries, a text search engine makes use of several components [Arasu et al., 2001; Brin and Page, 1998; Zobel and Moffat, 2006], in particular a vocabulary structure, inverted lists, a document map, and documents. The vocabulary structure contains the terms that appear in the indexed collection of documents, and maintains statistics about each term such as the number of documents in which it appears. For each term in the vocabulary, an inverted list of postings is stored, identifying the locations where that term appears in the collection. The document map holds, for each document, statistics such as document length, the number of distinct terms in the document, and the location of the document on disk.
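As an illustration of how these components relate, the following minimal Python sketch builds a vocabulary, inverted lists, and a document map for a tiny in-memory collection. The names `DocMapEntry` and `build_index`, the whitespace tokenisation, and the byte offsets standing in for disk locations are all assumptions made for this example; they are not taken from any particular search engine.

```python
from dataclasses import dataclass


@dataclass
class DocMapEntry:
    length: int          # number of term occurrences in the document
    distinct_terms: int  # number of distinct terms in the document
    disk_offset: int     # hypothetical stand-in for the document's location on disk


def build_index(docs):
    """Build (vocabulary, inverted_lists, doc_map) from a list of strings."""
    vocabulary = {}      # term -> number of documents containing the term
    inverted_lists = {}  # term -> list of (doc_id, [word positions])
    doc_map = {}         # doc_id -> DocMapEntry
    offset = 0
    for doc_id, text in enumerate(docs):
        terms = text.lower().split()  # naive tokenisation, for illustration only
        positions = {}
        for pos, term in enumerate(terms):
            positions.setdefault(term, []).append(pos)
        for term, pos_list in positions.items():
            vocabulary[term] = vocabulary.get(term, 0) + 1
            inverted_lists.setdefault(term, []).append((doc_id, pos_list))
        doc_map[doc_id] = DocMapEntry(len(terms), len(positions), offset)
        offset += len(text.encode("utf-8"))
    return vocabulary, inverted_lists, doc_map


vocabulary, inverted_lists, doc_map = build_index(
    ["the cat sat", "the dog sat on the mat"]
)
```

In this sketch the vocabulary and the inverted lists are kept in separate dictionaries to mirror the separation described above, where the vocabulary holds per-term statistics and the (much larger) inverted lists are stored apart from it.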
At search time, a similarity metric makes use of statistics from the vocabulary structure, inverted lists, and document map to compute a score between a query and the documents in the collection. Result pages composed of the top-scoring documents are presented to the user once a query has been processed; typically a summary or snippet of each document is presented with the results [Tombros and Sanderson, 1998]. Each of these components must be either held in memory or fetched from disk.
For small collections, the vocabulary structure could plausibly be held in main memory, but, as collection size increases, the rate at which new terms are encountered remains more or less constant [Williams and Zobel, 2005], so the vocabulary continues to grow with the collection. Thus for large collections the greater part of the vocabulary must reside on disk. Inverted lists form the largest part of an index, and typically exceed memory size by orders of magnitude. The lists for individual terms vary in size, with the lists for common terms such as “the”, “and”, and “as” comprising a significant part of the total index. The document map, with a few bytes per document, is directly proportional in size to the number of documents in the collection and can reside either on disk or in main memory. Finally, generation of result summaries of matching documents requires access to the document collection itself. The document collection is significantly larger than the index and must be stored on disk [Brin and Page, 1998].
Regardless of the similarity metric employed, query evaluation involves the following stages. First, for each query term the associated entry is retrieved from the vocabulary. Second, the term’s inverted list is loaded from disk. Third, for each posting in the inverted list a partial similarity score is calculated between the document to which the posting refers and the query. If the document that the posting refers to has not been previously encountered, an accumulator is created, initialized with the partial score for the document. If the document has been previously encountered, its accumulator is updated with the partial score. Fourth, after all query terms have been processed the accumulator values are adjusted using document-based statistics that reside in memory. Finally, the documents are partially sorted by accumulator value, and the top R documents are accessed through the document map, summarized and returned to the user. The search engine query evaluation process is discussed further in Section 2.5 (page 24).
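The stages above can be sketched in a few lines of Python. The similarity metric used here (a logarithmic TF-IDF weight with square-root length normalisation) is an assumption chosen only to make the example concrete, since the stages themselves are independent of the metric. The function name `evaluate` and the in-memory dictionaries standing in for the on-disk structures are likewise hypothetical.

```python
import heapq
import math


def evaluate(query_terms, vocabulary, inverted_lists, doc_lengths, top_r=10):
    """Accumulator-based query evaluation (illustrative sketch).

    vocabulary:      term -> document frequency
    inverted_lists:  term -> list of (doc_id, term frequency in document)
    doc_lengths:     doc_id -> document length (the document-based statistic)
    """
    accumulators = {}
    n_docs = len(doc_lengths)
    for term in query_terms:
        doc_freq = vocabulary.get(term)        # stage 1: vocabulary lookup
        if doc_freq is None:
            continue                           # term not in the collection
        postings = inverted_lists[term]        # stage 2: fetch the inverted list
        idf = math.log(1 + n_docs / doc_freq)
        for doc_id, tf in postings:            # stage 3: partial scores
            partial = (1 + math.log(tf)) * idf
            # Create the accumulator on first encounter, otherwise update it.
            accumulators[doc_id] = accumulators.get(doc_id, 0.0) + partial
    for doc_id in accumulators:                # stage 4: document-based adjustment
        accumulators[doc_id] /= math.sqrt(doc_lengths[doc_id])
    # Final stage: partial sort, keeping only the top R accumulators.
    return heapq.nlargest(top_r, accumulators.items(), key=lambda kv: kv[1])


vocabulary = {"cat": 1, "sat": 2}
inverted_lists = {"cat": [(0, 1)], "sat": [(0, 1), (1, 1)]}
doc_lengths = {0: 3, 1: 6}
results = evaluate(["cat", "sat"], vocabulary, inverted_lists, doc_lengths)
```

Using `heapq.nlargest` rather than a full sort reflects the partial sort in the final stage: only the top R accumulators need to be ordered. Fetching and summarising the selected documents through the document map is omitted from the sketch.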
Table 6.1 shows the estimated disk cost per query for each component, averaged over our test collections; the estimates are described in further detail in Section 6.4. The cost of accessing disk can be broken up into two distinct components. First, the disk head must seek to the location of the data to be read from the disk. Second, the data must be physically read from the disk. Yiannis [2005], in work on external sorting optimizations, presents an overview of the costs associated with disk access in a retrieval environment.

Table 6.1: The sequence of components accessed during query evaluation. Estimated disk costs are in average bytes read and average disk seeks per query for our test collections and query logs. Times are in milliseconds, using disk specifications from our test machine. We assume ten documents per result set.

Query → Vocabulary Entries → Inverted Lists → Accum. → Doc. Map Entries → Documents → Results

                  Vocabulary     Inverted                 Doc. Map
                  Entries        Lists        Accum.      Entries        Documents
Disk seeks        2.03           2.03         n/a         10.00          10.00
Bytes read        89.10          1,687,667    n/a         440.00         114,517
Seek time (ms)    8.42           8.42         n/a         41.60          41.60
Read time (ms)    2.83 × 10^-3   5.36         n/a         0.14 × 10^-3   0.36

These figures suggest that disk seek time during query evaluation is dominated by access to the documents during summarisation, while disk read time is dominated by access to the inverted lists.
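The two-part disk cost described above (seek, then read) can be written as a small function. The sketch below is illustrative only: the per-component seek and byte counts are copied from Table 6.1, but `SEEK_MS` and `BYTES_PER_MS` are placeholder drive parameters assumed for this example, not the specifications of the test machine, so the printed times are not expected to match the table.

```python
def disk_cost_ms(seeks, bytes_read, seek_ms, bytes_per_ms):
    """Return (seek_time_ms, read_time_ms) for one component's disk accesses."""
    return seeks * seek_ms, bytes_read / bytes_per_ms


# Per-query averages from Table 6.1: (disk seeks, bytes read).
COMPONENTS = {
    "vocabulary entries": (2.03, 89.10),
    "inverted lists": (2.03, 1_687_667),
    "document map entries": (10.00, 440.00),
    "documents": (10.00, 114_517),
}

SEEK_MS = 4.0           # assumed average seek time per access, in milliseconds
BYTES_PER_MS = 100_000  # assumed sustained transfer rate, in bytes per millisecond

costs = {
    name: disk_cost_ms(seeks, nbytes, SEEK_MS, BYTES_PER_MS)
    for name, (seeks, nbytes) in COMPONENTS.items()
}

for name, (seek_t, read_t) in costs.items():
    print(f"{name:22s} seek {seek_t:8.2f} ms   read {read_t:10.4f} ms")
```

Because the same drive parameters are applied to every component, the component with the most bytes read (the inverted lists) has the largest read time whatever values are chosen, while seek time scales directly with the number of seeks.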
We speculate that, in a distributed environment, the situation does not differ significantly. Seek time is akin to establishing a connection with, or initiating a request from, a peer or server, while read time is bound by the bandwidth of the network. The sizes of inverted lists and documents remain the same, and access to each is limited by their distribution within the network.
As discussed in Section 2.11 (page 43), various works have explored search engine caching strategies. Some have focused on the caching of inverted lists [Brown et al., 1994], others on result pages [Markatos, 2001; Lempel and Moran, 2003], and others on combinations of the two [Saraiva et al., 2001; Long and Suel, 2005]. However, to date, no work has considered a single heterogeneous cache capable of dealing with all the components utilised during the query evaluation process. Further, there has been no exploration of the costs associated with maintaining such a cache.
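To make the idea of a single heterogeneous cache concrete, the following Python sketch stores items of different kinds (for example inverted lists, documents, and result pages) under one shared byte budget with least-recently-used eviction. It is purely illustrative: the class name `HeterogeneousCache`, the `(kind, key)` addressing scheme, and the choice of LRU replacement are assumptions made for this example, not the design examined in this chapter.

```python
from collections import OrderedDict


class HeterogeneousCache:
    """One byte-budgeted LRU cache shared by several component types."""

    def __init__(self, capacity_bytes):
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0
        self._items = OrderedDict()  # (kind, key) -> (value, size_bytes)

    def get(self, kind, key):
        """Return the cached value, or None on a miss."""
        item = self._items.get((kind, key))
        if item is None:
            return None
        self._items.move_to_end((kind, key))  # mark as most recently used
        return item[0]

    def put(self, kind, key, value, size_bytes):
        """Insert an item, evicting least recently used items as needed."""
        if size_bytes > self.capacity_bytes:
            return False  # item can never fit, so do not cache it
        old = self._items.pop((kind, key), None)
        if old is not None:
            self.used_bytes -= old[1]
        while self.used_bytes + size_bytes > self.capacity_bytes:
            _, (_, evicted_size) = self._items.popitem(last=False)
            self.used_bytes -= evicted_size
        self._items[(kind, key)] = (value, size_bytes)
        self.used_bytes += size_bytes
        return True


cache = HeterogeneousCache(capacity_bytes=100)
cache.put("inverted_list", "the", [1, 2], size_bytes=60)
cache.put("result_page", "q1", "page", size_bytes=30)
cache.get("inverted_list", "the")                # touch: now most recently used
cache.put("document", 7, "text", size_bytes=40)  # evicts the result page
```

Accounting in bytes rather than in item counts matters here because the components differ in size by orders of magnitude: a single inverted list can displace many vocabulary entries or document map entries.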
In this chapter we present a structured approach to search engine caching. We examine the effects of caching the various components used throughout the query evaluation process and provide cost models that can be adjusted to the underlying system architecture to estimate the savings that can be gained by effective caching.