Early Termination Heuristics - Search engine optimisation using past queries

To take advantage of an access-ordered index, a strategy is required to heuristically abandon postings list processing before entire lists are processed. Perhaps the simplest approach is to process only a fixed number of postings P from each list, with the motivation that short lists, which may have a larger impact on the ranking function (because of smaller $f_t$ values), will be processed in their entirety, while longer lists will be only partially processed. We refer to this scheme as maxpost.

For example, consider a threshold value of P = 2. For the single-word query “storm” from Figure 3.4, only the postings for documents 386 and 408 would be processed. The computational saving is that the posting for document 22 is not decoded and processed, but the disadvantage of a fixed threshold is that document 22 can never be returned in response to the query; we return to this discussion in Section 3.6.4.
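A minimal sketch of maxpost on the example above may help. The posting layout (document id, within-document frequency), the frequency values, and the `weight` callback are assumptions made for illustration only; they are not taken from Figure 3.4.

```python
from collections import defaultdict

def maxpost(postings_lists, weight, P):
    """Process at most the first P postings of each access-ordered list."""
    accumulators = defaultdict(float)
    for term, postings in postings_lists.items():
        for doc_id, f_dt in postings[:P]:  # postings beyond P are never decoded
            accumulators[doc_id] += weight(term, f_dt)
    return dict(accumulators)

# Hypothetical access-ordered list for "storm"; the frequencies are invented.
index = {"storm": [(386, 2), (408, 1), (22, 3)]}

# A stand-in weighting function; a real system would use its similarity metric.
scores = maxpost(index, lambda term, f_dt: float(f_dt), P=2)
```

With P = 2, `scores` holds accumulators only for documents 386 and 408, so document 22 cannot be ranked however large its contribution would have been.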

There are several other possible approaches to partial list processing. These include:

• minaccess: Processing only those postings with an access count greater than a threshold value M. This scheme favours postings with a high access count. Like the maxpost scheme, this approach has the disadvantage that postings with an access count below the specified threshold cannot be returned in response to their respective terms.

• avgaccum: Calculating an average accumulator contribution for the postings that have been processed in a list, and stopping when this falls below a specified threshold A. Although the lists are not directly ordered by accumulator contribution, we assume that for documents to be highly ranked in an access-ordering, their postings on average contribute more to accumulators than do postings with low access counts. We report results with this avgaccum scheme in Section 3.6.4.

• avgrollaccum: Extending avgaccum so that the average is computed over a moving window of D documents, with the aim of detecting more local change in accumulator contributions. We call this avgrollaccum.

• two-phase: Setting two threshold values based on the access count of the next posting processed. The first threshold determines when list processing will stop adding new accumulators. The second determines when to stop updating existing accumulators. The two-phase scheme is described in detail below.
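The first three stopping rules in the list above can be sketched as follows. This is an illustrative reading rather than the implementation used in the experiments: the posting tuple (document id, access count, accumulator contribution), the sample values, and the choice to process a posting before testing the average are all assumptions.

```python
from collections import deque

def minaccess(postings, M):
    """Keep postings whose access count is greater than M."""
    kept = []
    for doc_id, access, contrib in postings:
        if access <= M:          # list is access-ordered, so we can stop here
            break
        kept.append(doc_id)
    return kept

def avgaccum(postings, A):
    """Stop once the mean contribution of all processed postings drops below A."""
    kept, total = [], 0.0
    for n, (doc_id, access, contrib) in enumerate(postings, start=1):
        total += contrib
        kept.append(doc_id)
        if total / n < A:
            break
    return kept

def avgrollaccum(postings, A, D):
    """As avgaccum, but the mean is taken over the last D postings only."""
    kept, window = [], deque(maxlen=D)
    for doc_id, access, contrib in postings:
        window.append(contrib)
        kept.append(doc_id)
        if sum(window) / len(window) < A:
            break
    return kept

# Hypothetical access-ordered list: (doc id, access count, contribution).
postings = [(1, 9, 4.0), (2, 7, 3.0), (3, 4, 0.5), (4, 2, 0.4), (5, 1, 0.3)]

by_access = minaccess(postings, M=3)
by_avg = avgaccum(postings, A=2.0)
by_rolling = avgrollaccum(postings, A=2.0, D=2)
```

On this sample, the rolling average reacts to the drop in contributions one posting earlier than the cumulative average, which is the kind of local change avgrollaccum is intended to detect.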

Based on the work of Persin et al. [1996], described in Section 2.9.1 (page 36), we propose a modified version of the frequency-ordered termination heuristic that can be applied to access-ordered indexes. Using this approach, values of $c_i$ and $c_a$ are chosen based on access counts. However, we propose two minor variations to the threshold calculation. First, the initial work on frequency-ordering was based on long-topic style queries drawn from early TREC experiments, where the within-query frequency of a term, $f_{q,t}$, can have a significant effect on the final similarity score. In our work, we focus on short web queries that typically contain one to three terms, and we therefore assume a value of $f_{q,t}$ equal to one. Second, the formulation for the term weight $w_t$ varies with the selected similarity metric. In our work, we use the Okapi BM25 function described in Section 2.6.2, which can result in zero-valued term weights, and so, to prevent divide-by-zero errors, we add a constant value of 1 to the denominator of the threshold calculation.

Figure 3.5: two-phase pruning scheme on a three term query: king and magic. Terms are processed in ascending IDF order. Prior to processing each inverted list, the threshold values $a_a$ and $a_i$ are updated. As each term is processed, the threshold values increase and further restrict the postings to be processed. The plot above each inverted list shows the access count distribution of the postings in that list, and the effect of the increasing thresholds on the processing of that list.

The results of adapting the termination heuristics are the following functions:

$$a_i = \frac{c_i \cdot S_m}{w_t^2 + 1}, \qquad a_a = \frac{c_a \cdot S_m}{w_t^2 + 1},$$

where $c_i$ and $c_a$ are the tunable constants that determine when to cease accumulator creation and when to cease accumulator updates, respectively, $S_m$ is the maximum accumulator value seen so far, and $w_t$ is the term weight.
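As a worked illustration, the two thresholds can be computed as below. The sketch reads each threshold as the constant multiplied by the maximum accumulator value and divided by the squared term weight plus one, following the description of the added constant in the denominator. The function name and all numeric values are hypothetical.

```python
def thresholds(c_i, c_a, s_m, w_t):
    """Return (a_i, a_a) for one query term.

    c_i, c_a : tunable constants for accumulator creation and update
    s_m      : maximum accumulator value seen so far
    w_t      : term weight (may be zero under BM25)
    """
    denom = w_t ** 2 + 1.0   # the +1 keeps the denominator non-zero
    return c_i * s_m / denom, c_a * s_m / denom

# Invented values: S_m = 12, w_t = 2, so the denominator is 5.
a_i, a_a = thresholds(c_i=0.5, c_a=0.1, s_m=12.0, w_t=2.0)

# A zero-valued term weight no longer causes a division by zero.
zero_weight = thresholds(c_i=0.5, c_a=0.1, s_m=12.0, w_t=0.0)
```

With these values the insertion threshold is 1.2 and the update threshold is 0.24. With a zero term weight the thresholds reduce to the constants multiplied by the maximum accumulator value.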

We refer to this early termination heuristic as two-phase pruning. Query processing proceeds as follows. First, query terms are ordered by inverse document frequency, the value $S_m$ is initialised to 0, and the first query term is selected to be processed. Second, for the selected query term, the two threshold values $a_i$ and $a_a$ are calculated. Third, the postings list for the term is processed, with accumulators being created while the access count of each posting remains greater than or equal to $a_i$, and updated while it remains greater than or equal to $a_a$; processing of the current list ceases when an access count is less than $a_a$. Last, after processing a list, the thresholds are recalculated for the next term and the process is repeated from the second step for that term. If, during the processing of the postings, the accumulator contribution of a posting is greater than $S_m$, the value of $S_m$ is updated to reflect the new highest accumulator contribution. This process is illustrated in Figure 3.5. Although the threshold values $a_a$ and $a_i$ increase as more terms are processed, the proportion of postings processed in each individual list varies, depending on the access count distribution of the postings therein.
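The steps above can be sketched in a few lines. This is a simplified illustration under several assumptions, not the implementation evaluated in this chapter: terms are passed in already ordered, each posting carries a precomputed contribution, the thresholds are fixed for the duration of a list, and the index contents and constants are invented.

```python
def two_phase(terms, index, c_i, c_a):
    """Two-phase pruning over access-ordered lists (illustrative sketch).

    terms : query terms, already in processing order
    index : term -> (w_t, postings), where postings are
            (doc_id, access_count, contribution) tuples sorted by
            decreasing access count
    """
    accumulators = {}
    s_m = 0.0  # highest posting contribution seen so far
    for term in terms:
        w_t, postings = index[term]
        denom = w_t ** 2 + 1.0
        a_i = c_i * s_m / denom  # below this, no new accumulators are created
        a_a = c_a * s_m / denom  # below this, processing of the list stops
        for doc_id, access, contrib in postings:
            if access < a_a:
                break
            if doc_id in accumulators:
                accumulators[doc_id] += contrib
            elif access >= a_i:
                accumulators[doc_id] = contrib
            else:
                continue  # update-only phase and no existing accumulator
            s_m = max(s_m, contrib)
    return accumulators

# Hypothetical two-term index; all numbers are invented.
index = {
    "magic": (2.0, [(7, 10, 3.0), (3, 6, 2.0), (9, 1, 1.0)]),
    "king": (1.0, [(3, 8, 1.5), (5, 5, 1.0), (7, 2, 0.75), (4, 1, 0.5)]),
}
result = two_phase(["magic", "king"], index, c_i=4.0, c_a=1.0)
```

In this run the first list is processed in full because $S_m$ starts at 0. For the second term the thresholds become 6.0 and 1.5, so document 5 is not given a new accumulator, document 7 is still updated, and processing stops before document 4.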

The continue scheme described in Section 2.8 (page 34) is an effective technique for saving memory and reducing computational cost. Anh and Moffat [2002b] experimented with the combination of various pruning techniques for impact-ordered indexes. They reported that a combination of their impact-order prune schemes with continue produced the best overall compromise in results between accuracy and efficiency.

To restrict the memory consumption of query evaluation, we limit the number of accumulators initialised at query time by using the continue scheme in combination with our pruning approaches. When the accumulator limit has been reached under any of the proposed processing schemes, postings are processed in update-only mode, which allows existing accumulators to have their scores incremented but does not allow new accumulators to be added.
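The accumulator limit can be illustrated with a short sketch. The function name, the (document id, contribution) posting layout, and the sample values are assumptions made for the example; any of the pruning rules described earlier would be applied in addition to this limit.

```python
def process_with_continue(postings_lists, limit):
    """Create at most `limit` accumulators, then switch to update-only mode."""
    accumulators = {}
    for postings in postings_lists:
        for doc_id, contrib in postings:
            if doc_id in accumulators:
                accumulators[doc_id] += contrib   # updates are always allowed
            elif len(accumulators) < limit:
                accumulators[doc_id] = contrib    # room left for a new accumulator
            # otherwise: update-only mode, the posting is skipped
    return accumulators

# Hypothetical lists for two query terms: (doc id, contribution).
lists = [[(1, 1.0), (2, 0.5)], [(3, 2.0), (1, 0.25)]]
limited = process_with_continue(lists, limit=2)
```

Here the limit of two accumulators is reached after the first list, so document 3 is never scored even though its contribution is the largest, while document 1 continues to be updated.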

In their impact-ordered work, Anh and Moffat [2002b] proposed several index pruning schemes that build on the quit and continue strategies of Moffat and Zobel [1996]. They defined a block-fine approach where the product of transformed document and query impacts is sequentially penalised until the accumulator contribution of the processed postings block is zero, at which point processing of the list is stopped. As the ordering of our access-ordered indexes is not directly related to the effect of a posting on the similarity function, this approach cannot be applied to our index. However, the avgaccum and avgrollaccum schemes allow for a similar granularity of pruning by terminating the processing of a list when the average contribution of preceding postings falls below a given threshold. Anh and Moffat also defined a variation of the block-fine scheme, term-fine, where a penalty is applied to each sequential term (as opposed to each block in a single list), therefore increasing the probability of pruning postings lists for terms that are processed later during the query evaluation process. We have not applied a penalty at a term level as in the term-fine approach, and leave this open as an area of further work.

Recently, Lester et al. [2005a] showed that the continue scheme is biased towards documents that appear early in the collection. They proposed an accumulator management strategy that reduces this bias and limits the number of accumulators initialised at query time. In their approach, an accumulator threshold is determined by sampling the contribution of postings in each inverted list, and is updated as postings are processed. When accumulators are encountered that do not exceed the current threshold, they are removed from the accumulator set. For an access-ordered index, such an approach is unlikely to work. As postings are ordered by their likelihood of occurrence in the result set, a correlation between accumulator contribution and position in the list is present. Therefore, any estimate of the expected contribution of postings based on a sequential segment of the inverted list will be skewed and not representative of the entire list. As such, applying the scheme proposed by Lester et al. would likely result in a regular reduction in the adaptive threshold (as lower contributing postings are processed), and a bias against the frequently accessed documents (which are likely to be skipped by an artificially high threshold early in the list processing).
