
4.6 Site-Based Dynamic Pruning for Query Processing

4.6.3 Discussions

We present a dynamic query pruning technique based on incremental-CBR that eliminates relatively less promising sites (and Web pages) during retrieval. The results are encouraging in that the top-k results returned by the site-based pruning strategy exhibit strong similarity to those of the no-pruning case, while the proposed strategy achieves significant reductions in processing times.

As mentioned before, there are other efficient dynamic pruning techniques for FS, such as those based on the quit-continue approach [79] and impact-sorted lists [14]. In Section 4.5.2.3, we have shown that incremental-CBR also outperforms one of these approaches. Nevertheless, both FS and our approach may benefit from such earlier techniques. For instance, impact-based pruning may be coupled with our site-based pruning for further improvements in efficiency (e.g., the postings for each site in CS-IIS can be sorted with respect to impacts). Exploring such possibilities is left as future work. Another direction for future work involves exploiting the URL hierarchy to obtain (possibly) more coherent groups of Web pages.

4.7 Conclusions and Future Work

We introduce an incremental-CBR strategy and an enhanced CS-IIS for ranking-queries. The new file organization incorporates both cluster membership and centroid information, along with the usual document information, into a single inverted index. In the incremental-CBR strategy, for each query term, the computations required for selecting the best-clusters and selecting the best-documents of such clusters are performed in an interleaved manner. The proposed strategy is essentially introduced for providing efficient CBR in compressed environments. We adapt multiple posting list compression parameters and a cluster-based document id reassignment technique that best fit the features of CS-IIS. We experimentally show that the proposed strategy is superior to FS for a retrieval scenario using automatically clustered datasets. Furthermore, we show that the incremental-CBR strategy can also serve as a dynamic pruning technique for FS in a site-based pruning scenario.

The future research possibilities, among others, include the following. In this thesis, we concentrated on the term-at-a-time query processing mode. It is also possible to use another efficient alternative, the document-at-a-time processing mode, along with the proposed strategy. The proposed skip structure provides interesting data fusion [84] opportunities (i.e., merging FS and CBR results), since both of these processes can be carried out at the same time. Another interesting direction is making the proposed system adaptive to query characteristics; during query evaluation, the number of best-clusters to be selected and the centroid term weighting schemes can be determined according to the query length or the weight distributions of the query terms. Clearly, updating our data structure is an interesting challenge. We can apply a “distributed free space” technique for future additions to posting lists. Then, given an incremental clustering algorithm (e.g., the incremental version of C3M [35]), the complexity of updating CS-IIS is not much higher than the complexity of a typical IIS update. Yet another possible direction for improving storage and efficiency is using skips only in “longer lists”, but not in the lists of only a few words. Finally, the caching of posting lists is another topic that currently receives serious attention [18] and can also be investigated in our framework.

Chapter 5 Static Index Pruning with Query Views

Static index pruning techniques permanently remove a presumably redundant part of an inverted file to reduce the file size and query processing time. In this chapter, we propose using query views in the static pruning strategies for Web search engines to improve the quality of the top-ranked results, as measured against the results from the original index. The query view based strategies avoid pruning those postings that associate a term with a document if this document has appeared among the top results of a previous query including that particular term. We incorporate query views into a number of static pruning strategies, namely the term-centric, document-centric and access-based approaches, and show that the new strategies considerably outperform their counterparts, especially for the higher levels of pruning and for both disjunctive and conjunctive query processing.

The rest of this chapter is organized as follows. In the next section, we provide the motivation for our research. In Section 5.2, we review the related work in the literature. In Section 5.3, we first describe the baseline pruning algorithms for this work, as discussed in [29, 43]. Next, we present an adaptive variant of the access-based pruning algorithm [58], and also propose a document-centric version. Section 5.4 introduces the new pruning strategies that exploit the query views. Section 5.5 provides an experimental evaluation of all strategies in terms of top-ranked result quality. Finally, we conclude and point to future research directions in Section 5.6.

5.1 Introduction

An inverted index is the state-of-the-art data structure for query processing in large scale information retrieval systems and Web search engines (WSEs) [122]. In the last decades, several optimizations have been proposed to store and access inverted index files efficiently, while keeping the quality of the search relatively stable (see Chapter 3.2.1). One particular method is static index pruning, which aims to reduce the storage space and query execution time.

The purpose of a static pruning strategy is to stay faithful to the original ranking of the underlying search system for most queries, while reducing the index size to the greatest extent possible. This is a non-trivial task, as it would be impossible to generate exactly the same results as produced by an unpruned index for all possible queries. Most pruning strategies attempt to provide quality guarantees only for the top-ranked results, and try to keep in the pruned index those terms or documents that are the most important according to some measure, in the hope that they will contribute the most to future query outputs. The static pruning strategies are distinguished by the heuristics and measures they use for deciding which items should be kept in the index and which should be pruned. Many proposals in the literature are based solely on the features of the collection and the search system. For instance, in one of the pioneering works, Carmel et al. sort the postings in each term’s list with respect to the search system’s scoring function and remove those postings with scores under a threshold [43]. This is said to be a term-centric approach. In an alternative document-centric strategy, instead of considering posting lists, pruning is carried out for each document [29]. These two strategies, as well as some others reviewed in the next section, essentially take into account collection-wide features (such as term frequency) and search system features (such as scoring functions).
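To make the contrast between the two views concrete, the following is a minimal Python sketch, not the algorithms of [43] or [29] as published. It assumes an in-memory index that maps each term to a list of (document id, score) postings. The threshold rule (a fraction of the highest score in the list) and the per-document budget `keep_per_doc` are hypothetical simplifications chosen only for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Posting = Tuple[int, float]  # (doc_id, score of this posting under the scoring function)
Index = Dict[str, List[Posting]]


def term_centric_prune(index: Index, fraction: float) -> Index:
    """Per posting list: keep postings scoring at least `fraction` of the list's top score.

    The threshold rule is an assumption made for this sketch.
    """
    pruned: Index = {}
    for term, postings in index.items():
        if not postings:
            pruned[term] = []
            continue
        ranked = sorted(postings, key=lambda p: p[1], reverse=True)
        threshold = fraction * ranked[0][1]
        pruned[term] = [p for p in ranked if p[1] >= threshold]
    return pruned


def document_centric_prune(index: Index, keep_per_doc: int) -> Index:
    """Per document: keep only the `keep_per_doc` highest-scoring terms of that document."""
    by_doc = defaultdict(list)
    for term, postings in index.items():
        for doc_id, score in postings:
            by_doc[doc_id].append((term, score))
    pruned: Index = {term: [] for term in index}
    for doc_id, entries in by_doc.items():
        entries.sort(key=lambda e: e[1], reverse=True)
        for term, score in entries[:keep_per_doc]:
            pruned[term].append((doc_id, score))
    return pruned


toy_index = {
    "web": [(1, 4.0), (2, 1.0), (3, 2.5)],
    "search": [(1, 3.0), (2, 2.0)],
}
by_term = term_centric_prune(toy_index, fraction=0.5)
by_doc = document_centric_prune(toy_index, keep_per_doc=1)
```

In the toy index, the term-centric variant decides list by list, whereas the document-centric variant decides document by document, so the same posting can survive under one strategy and be removed under the other.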

However, in the case of Web search, additional sources of information are also available that may enhance the pruning process and the final result quality, which is the most crucial issue for search engines. In this sense, query logs serve as an invaluable source of information: in a world of (theoretically) infinitely many combinations of possible query terms, the query logs highlight those terms and combinations that were important enough to have been searched in the past. Thus, these logs can provide further insight and evidence on which terms or documents should be kept in a pruned index to answer future queries.

In a recent pruning strategy that explicitly makes use of previous query logs [58, 59, 60], the notion of access frequency is employed. That is, the pruning strategy is guided by the number of appearances of a document in the query outputs. In this work, we propose a new pruning heuristic that exploits query views. That is, the pruning process is also guided by the actual query terms that access the documents.
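The access frequency notion can be illustrated with a short Python sketch. It is not the algorithm of [58]; it only assumes that a query log is available as (query, top results) pairs and that each posting list is a plain list of document ids. The function names and the `keep_fraction` parameter are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def access_counts(query_results: Iterable[Tuple[str, List[int]]]) -> Counter:
    """Count, for each document, the number of logged queries that returned it."""
    counts: Counter = Counter()
    for _query, top_docs in query_results:
        counts.update(set(top_docs))  # a document counts once per query
    return counts


def access_based_prune(
    index: Dict[str, List[int]], counts: Counter, keep_fraction: float
) -> Dict[str, List[int]]:
    """Keep, in each list, the `keep_fraction` of documents with the highest access counts."""
    pruned = {}
    for term, doc_ids in index.items():
        # Most accessed first; ties are broken by document id for determinism.
        ranked = sorted(doc_ids, key=lambda d: (-counts[d], d))
        n_keep = math.ceil(keep_fraction * len(ranked))
        pruned[term] = sorted(ranked[:n_keep])
    return pruned


log = [
    ("web search", [1, 2]),
    ("web crawler", [1, 3]),
    ("search engine", [2]),
]
counts = access_counts(log)
pruned = access_based_prune({"web": [1, 2, 3, 4]}, counts, keep_fraction=0.5)
```

Note that this sketch looks only at how often a document was returned, not at which query terms returned it; the latter is the additional evidence that query views contribute.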

In the literature, the idea of using query terms to represent a document is known as a query view [44]. In the scope of our work, all queries that rank a particular document among their top-ranked results constitute the query view of that document. For static pruning purposes, we exploit the query views in the following sense. We envision that, for a given document d and a term t in d, the appearance of t in d’s query view is major evidence of its importance for d; i.e., it implies that t is a preferred way of accessing document d in the search system. Thus, any pruning strategy should avoid pruning the index entry d from the posting list of term t to the greatest extent possible.
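A minimal Python sketch of this heuristic is given below, under stated assumptions: the query log is a list of (query, top results) pairs, queries are tokenized by whitespace, and the underlying pruning rule is a simple score threshold relative to the best posting in the list. All function names and the threshold rule are hypothetical and serve only to show where the query view check enters the pruning decision.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

Posting = Tuple[int, float]  # (doc_id, score)


def build_query_views(
    query_results: Iterable[Tuple[str, List[int]]]
) -> Dict[int, Set[str]]:
    """Map each document to the set of terms of the queries that returned it."""
    views: Dict[int, Set[str]] = defaultdict(set)
    for query, top_docs in query_results:
        terms = query.lower().split()  # assumed tokenization
        for doc_id in top_docs:
            views[doc_id].update(terms)
    return views


def prune_with_query_views(
    index: Dict[str, List[Posting]], views: Dict[int, Set[str]], fraction: float
) -> Dict[str, List[Posting]]:
    """Threshold-based pruning that never removes a posting (t, d) if t is in d's query view."""
    pruned = {}
    for term, postings in index.items():
        if not postings:
            pruned[term] = []
            continue
        threshold = fraction * max(score for _doc, score in postings)
        pruned[term] = [
            (doc_id, score)
            for doc_id, score in postings
            if score >= threshold or term in views.get(doc_id, set())
        ]
    return pruned


views = build_query_views([("web search", [1, 2])])
index = {"web": [(1, 4.0), (2, 1.0), (3, 2.5), (4, 0.5)]}
pruned = prune_with_query_views(index, views, fraction=0.5)
```

In this toy example, document 2 has a low score for the term "web" and would be pruned by the threshold alone, but it is kept because "web" appears in its query view; document 4 has neither a sufficient score nor query view support and is removed.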

In this work, our goal is to improve the quality of the results obtained from a pruned index, which is vitally important for WSEs in a competitive market. To this end, we introduce new pruning approaches that incorporate the query view idea into the term-centric [43], document-centric [29] and access-based [58] strategies in the literature. We show that the pruning strategies with query views significantly improve the quality of the top-ranked results, especially at the higher levels of pruning. More concretely, our contributions in this chapter are as follows:

• First, we fully explore the potential of a previous strategy, namely access-based pruning, that also makes use of the query logs in the static index pruning context. To this end, we provide an adaptive version of the term-centric pruning algorithm provided in [58]. We also introduce a new document-centric version of the access-based algorithm, and show that the latter outperforms its term-centric counterpart.

• Second, we provide an effectiveness comparison of these access-based approaches to the term-centric approach [43] and document-centric approach [29], for their best performing setups reported in the literature. Our experimental findings reveal that, although the access-based methods are inferior to the latter strategies for disjunctive query processing (as shown in the literature [58]), they turn out to be the most effective strategies when the queries are processed in the conjunctive mode. This is a new result that has not been reported before. Furthermore, the document-centric version of the access-based strategy as described here is found to be superior to all other strategies for conjunctive query processing, which is of utmost importance for WSEs.

• Finally, the main contribution of this chapter is exploiting query views to tailor more effective static index pruning strategies for both disjunctive and conjunctive query processing; i.e., the most common query processing modes in WSEs [55]. More specifically, the terms of a document that appear in the query view of this particular document are considered to be privileged and preserved in the index to the greatest possible extent during the static pruning. The query view heuristic is coupled with all three pruning approaches in the literature (term- and document-centric approaches as proposed in [29, 43], and the access-based term-centric method adapted from [58]) as well as the document-centric version of the access-based method that is introduced here.

Our findings reveal that, for both disjunctive and conjunctive query processing, the query view based pruning strategies exhibit excellent performance in terms of the similarity of the top-ranked results to the original results (i.e., those obtained by using the original index) and significantly outperform their counterparts without query views. The gains are especially pronounced at the higher levels of pruning. We also verify our findings using training logs of varying numbers of queries and a very large test set including 100,000 queries.

Furthermore, the improvements provided by the query view based strategies also apply to the cases where the pruned index is not used to replace the original index, but rather used as a list cache (as in the ResIn framework [104]) for efficiency purposes. In the latter setup, the essential requirement for a pruning strategy is being able to provide a correctness guarantee (i.e., producing exactly the same results as the main index) for the highest number of queries. We show that our query view based pruning strategies can output the correct result for considerably more queries than the baseline algorithms; i.e., those without query views. This means that pruned index files that are created using the query view based strategies can either replace the original index, say, at the back-end servers, or serve as a front-end cache in WSEs.
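The two evaluation viewpoints mentioned above, similarity of the top-ranked results and exact agreement with the main index, can be sketched in Python as follows. The overlap measure used here is a simple set-based one chosen for illustration and is not necessarily the measure used in the experiments of this chapter; the function names and the toy result lists are hypothetical.

```python
from typing import Dict, List


def topk_overlap(original: List[int], pruned: List[int], k: int) -> float:
    """Fraction of the original top-k documents that also appear in the pruned top-k."""
    if k <= 0:
        raise ValueError("k must be positive")
    return len(set(original[:k]) & set(pruned[:k])) / k


def fraction_identical(
    original_runs: Dict[str, List[int]], pruned_runs: Dict[str, List[int]], k: int
) -> float:
    """Fraction of queries whose top-k list from the pruned index equals the original one."""
    if not original_runs:
        raise ValueError("no queries to evaluate")
    matches = sum(
        1
        for query, docs in original_runs.items()
        if docs[:k] == pruned_runs.get(query, [])[:k]
    )
    return matches / len(original_runs)


original_runs = {"q1": [1, 2, 3], "q2": [4, 5, 6]}
pruned_runs = {"q1": [1, 2, 3], "q2": [4, 6, 7]}
overlap_q2 = topk_overlap(original_runs["q2"], pruned_runs["q2"], k=3)
identical = fraction_identical(original_runs, pruned_runs, k=3)
```

The first function is relevant when the pruned index replaces the original one, where a high overlap may be acceptable; the second corresponds to the cache setup, where only queries with identical results can be answered from the pruned index.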

5.2 Related Work
