Caching complements changes to the composition of inverted lists as a way of improving efficiency. For many applications, retaining frequently accessed data from disk in memory reduces the cost of disk access. Search engine query optimisation using an in-memory cache has been explored by several authors. Xie and O’Hallaron [2002] examine several search engine query logs and highlight the repetition and locality of the queries within their data. While they do not propose an approach to caching, they suggest that any caching approach for search engines should account for the popularity of queries and the locality of repetition.
Work on search engine caching falls into three broad categories. Some work focuses on the benefits of caching individual data structures used during the search process. These structures can include items such as inverted lists, or result pages. A second branch of research builds on the above by combining the data structures that are considered by the cache. Finally, a third branch of research has investigated caching policies that best manage the different components that can be cached by a search engine.
2.11.1 Single Structure Caching
An early contribution to search engine caching research is that of Brown et al. [1994], who cache inverted lists with a least recently used (LRU) replacement policy. In their work they maintained three manually chosen fixed-size caches for small, medium, and large lists respectively. They found that by tuning the separate cache sizes correctly they could achieve speed improvements of up to 25% on their test collections, which they attributed to reduced disk usage costs. The approach of Brown et al. relies on manually tuning the list sizes that qualify for each of the small, medium, and large caches, and is unlikely to work for dynamic collections, where the distribution of list sizes changes over time.
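The size-partitioned arrangement described above can be illustrated with a minimal sketch. It is not the implementation of Brown et al.; the class names, the length thresholds (`small_max`, `medium_max`), and the choice to measure capacity in number of lists rather than bytes are all assumptions made for illustration.

```python
from collections import OrderedDict


class LRUCache:
    """Fixed-capacity cache that evicts the least recently used entry."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()

    def get(self, key):
        if key not in self.entries:
            return None
        self.entries.move_to_end(key)  # mark as most recently used
        return self.entries[key]

    def put(self, key, value):
        self.entries[key] = value
        self.entries.move_to_end(key)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used


class SizePartitionedListCache:
    """Three LRU caches for small, medium, and large inverted lists.

    The thresholds and capacities are hypothetical, hand-chosen values,
    standing in for the manual tuning described in the text.
    """

    def __init__(self, small_max=4, medium_max=16, capacities=(2, 2, 2)):
        self.small_max = small_max
        self.medium_max = medium_max
        self.caches = [LRUCache(c) for c in capacities]

    def _partition(self, postings):
        n = len(postings)
        if n <= self.small_max:
            return self.caches[0]
        if n <= self.medium_max:
            return self.caches[1]
        return self.caches[2]

    def get(self, term):
        # The list length is unknown before the lookup, so probe each cache.
        for cache in self.caches:
            postings = cache.get(term)
            if postings is not None:
                return postings
        return None

    def put(self, term, postings):
        self._partition(postings).put(term, postings)
```

Because each partition evicts independently, a burst of large lists cannot displace small ones. The fixed thresholds also show the weakness noted above: if list lengths drift as the collection grows, the partitions no longer match the data.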
Jónsson et al. [1998] take a lower-level approach. They propose a buffer management technique that caches the pages of inverted lists, and favours pages from the inverted lists that contain high-weight postings. At query time, inverted lists are processed in an order that minimises the number of disk reads, by first fetching those lists that have the most pages in the buffer. Using their approach, they show a 70% reduction in disk reads relative to a system without caching. However, as their approach approximates results, a variance of ±5% in effectiveness is observed.
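The query-time ordering step can be sketched as follows. This is only an illustration of the ordering idea, under the assumption that each list is described by the identifiers of its pages and that the buffer is a set of resident page identifiers; the function name and both data structures are hypothetical, and the weight-based replacement policy is not modelled.

```python
def order_lists_by_buffered_pages(query_terms, list_pages, buffer):
    """Order query terms so that lists with the most buffered pages come first.

    list_pages: dict mapping a term to the page ids of its inverted list.
    buffer:     set of page ids currently held in memory.
    """

    def buffered_count(term):
        return sum(1 for page in list_pages[term] if page in buffer)

    # sorted() is stable, so ties keep their original query order.
    return sorted(query_terms, key=buffered_count, reverse=True)
```

For example, with lists `a` (pages 1, 2), `b` (pages 3, 4, 5), and `c` (page 6), and pages 1, 3, and 4 in the buffer, the terms would be processed in the order `b`, `a`, `c`.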
Inspired by the caching of result pages by web proxy servers [Markatos, 1996], Markatos [2001] explored the caching of search engine result pages. Markatos compared caches composed of fixed, pre-calculated result pages with dynamic caches that adapt to the stream of queries posed to the search system. The findings suggest that, for small cache sizes, a fixed cache of the most frequently requested result pages outperformed a dynamic cache. However, as the cache size was increased, the performance of dynamic caching improved. This outcome was attributed to the fact that, for small caches, dynamic policies were not able to maintain popular pages within the cache long enough for them to be reused but, as the cache size grew, the probability of retaining a popular result page in the dynamic cache increased.
In similar work, Luo et al. [2000] cache result pages, but assume a Boolean environment where the results of individual queries can be combined to form the results for larger queries. They label this active caching and report a two-fold improvement in cache hit ratios over the results of Markatos. However, this approach does not account for the complexity of most modern similarity ranking functions, where the weight of query terms has a significant influence on their contribution to the final results.
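The idea of composing cached results in a Boolean setting can be sketched as below. This assumes purely conjunctive (AND) queries and a cache keyed by single terms, which is a simplification chosen for illustration and not a description of the method of Luo et al.; the function name and cache layout are hypothetical.

```python
def answer_conjunction(terms, result_cache):
    """Answer an AND query from cached per-term result sets, if possible.

    result_cache: dict mapping a term to the set of matching document ids.
    Returns the intersection of the cached sets, or None when any term is
    missing from the cache and the query must be evaluated normally.
    """
    cached_sets = []
    for term in terms:
        cached = result_cache.get(term)
        if cached is None:
            return None  # cache miss: fall back to full evaluation
        cached_sets.append(cached)
    if not cached_sets:
        return set()
    return set.intersection(*cached_sets)
```

The sketch also makes the limitation noted above concrete: set intersection carries no term weights, so it cannot reproduce the ordering produced by a similarity ranking function.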
2.11.2 Multi Structure Caching
Saraiva et al. [2001] combined the above approaches, proposing a two-level cache for search engines that at one level caches result sets, and at a second level caches inverted lists. The caching of result sets allows query processing for cached queries without accessing disk, while caching inverted lists reduces the I/O costs associated with query processing for queries that contain popular terms, but do not have a result set in the cache. In their work they found an increase in query throughput of up to a factor of three.
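The lookup order of a two-level cache can be sketched as follows. This is a simplified illustration and not the system of Saraiva et al.: the class name, the `read_list_from_disk` and `evaluate` callbacks, and the treatment of a query as an unordered set of terms are assumptions, and eviction is omitted so that both levels grow without bound.

```python
class TwoLevelCache:
    """Result cache in front of an inverted-list cache (no eviction)."""

    def __init__(self, read_list_from_disk, evaluate):
        self.results = {}  # level 1: query -> result set
        self.lists = {}    # level 2: term -> inverted list
        self.read_list_from_disk = read_list_from_disk  # hypothetical callback
        self.evaluate = evaluate                        # hypothetical callback
        self.disk_reads = 0

    def query(self, terms):
        key = tuple(sorted(terms))  # assumption: term order is irrelevant
        if key in self.results:
            return self.results[key]  # answered without touching disk

        postings = []
        for term in key:
            if term not in self.lists:
                # Only terms missing from the list cache cost a disk read.
                self.lists[term] = self.read_list_from_disk(term)
                self.disk_reads += 1
            postings.append(self.lists[term])

        result = self.evaluate(postings)
        self.results[key] = result
        return result
```

A repeated query is served from the first level, while a new query that shares popular terms with earlier ones still avoids part of its disk cost through the second level.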
Baeza-Yates and Saint-Jean [2003] propose a similar two-level cache with both inverted lists and pre-computed answer sets in memory. In their approach, lists are selected such that the largest set of the most commonly queried terms is placed in the cache. Result sets are also retained in the cache, and the proportion of space allocated to each data type is trained. Their results show a 7% reduction in query evaluation time.
Long and Suel [2005] extended the work of Saraiva et al. [2001] and proposed a three-level cache, in which result sets and inverted lists are cached in memory, and a representation of the merged inverted lists computed during query evaluation is stored on disk. For combinations of query terms where a pre-computed merged list is in the on-disk cache, a single disk access fetches the required information, rather than one fetch for the inverted list of each term in the query. They report a 75% reduction in disk blocks read at query time, with a cache that is 2.5% of the size of the index.
2.11.3 Advanced Cache Management Policies
Focusing on result caching, Lempel and Moran [2003] propose a more specialised cache replacement policy targeted at result pages. Their policy is based on the probability of a result page being requested, which is a function of the history of the result page’s prior access and the number of users that are currently viewing result pages for the same topic. Further, Lempel and Moran introduce the pre-fetching of subsequent result pages. Working with a query log from AltaVista, the authors measured the number of queries that were satisfied by the results stored in the cache. They found that their approach as much as doubled the cache hit ratio relative to a cache using a least recently used queue.
Fagni et al. also propose a result-page-oriented cache that is divided into static and dynamic components [Fagni et al., 2004; 2006]. The static part of the cache is populated with the results for the most common queries observed, while the dynamic component caches results for queries that do not appear in the static component. Further, the dynamic cache may be managed by any cache management policy. In their work, Fagni et al. experiment with least recently used, segmented least recently used, frequency-based replacement, two-queue, and probability-driven cache policies [Karedla et al., 1994; Robinson and Devarakonda, 1990; Johnson and Shasha, 1994; Lempel and Moran, 2003]. While the optimal proportion of cache allocated to the static and dynamic caches varied with the data-set, the authors found that their combined approach outperformed an individual static or individual dynamic cache. Further improvements were obtained by pre-fetching query results; however, the authors suggest that pre-fetching only a few pages of results per query is optimal (three pages are shown to work well).
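The static and dynamic split can be sketched as below. This is a minimal illustration and not the implementation of Fagni et al.: it assumes the static part is filled once from a training query log, uses LRU as the dynamic policy (any of the policies listed above could be substituted), and omits pre-fetching. The class name and the `compute` callback are hypothetical.

```python
from collections import Counter, OrderedDict


class StaticDynamicCache:
    """Result cache with a read-only static part and an LRU dynamic part."""

    def __init__(self, training_queries, static_size, dynamic_size, compute):
        # Static part: results for the most frequent training queries.
        top = Counter(training_queries).most_common(static_size)
        self.static = {query: compute(query) for query, _ in top}
        self.dynamic = OrderedDict()
        self.dynamic_size = dynamic_size
        self.compute = compute  # hypothetical callback that evaluates a query
        self.hits = 0
        self.misses = 0

    def lookup(self, query):
        if query in self.static:
            self.hits += 1
            return self.static[query]
        if query in self.dynamic:
            self.hits += 1
            self.dynamic.move_to_end(query)  # mark as most recently used
            return self.dynamic[query]

        # Miss in both parts: evaluate and admit to the dynamic part only.
        self.misses += 1
        result = self.compute(query)
        self.dynamic[query] = result
        if len(self.dynamic) > self.dynamic_size:
            self.dynamic.popitem(last=False)  # evict least recently used
        return result
```

Entries in the static part are never evicted, so consistently popular queries stay resident regardless of bursts in the query stream, while the dynamic part absorbs queries whose popularity is recent or short-lived.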
Other work has examined distributed caching. In some cases the caching of results is suggested on the client side [Alonso et al., 1990], in others within web proxy servers [Meira Jr. et al., 1999]. However, such work is beyond the scope of this thesis.
2.11.4 Limitations of Caching
To date, search engine caching research has focused on the caching of result sets and inverted lists. The caches for inverted lists and results have been maintained independently, and the ratio between the two has been manually tuned with regard to the query log and collection in use. To our knowledge, there has been no examination of the costs of caching each data type with respect to the underlying system architecture, or of how to tune cache sizes in practice. Further, there has been no exploration of the use of a single heterogeneous cache for all the various kinds of data used during query processing. We address these issues in Chapter 6.