Index Compression - Search engine optimisation using past queries

We now consider the techniques used to compress postings lists and propose techniques for the effective storage of access-ordered indexes. Section 2.4.1 (page 22) describes standard approaches to inverted list compression. We use a variable-byte technique based on the work of Scholer et al. [2002], which has been shown to provide effective compression with the benefit of rapid decompression.

Integer compression schemes are applied to postings lists after the postings have been organised using document-, frequency-, or impact-ordering, and compaction techniques have been applied. For example, Persin et al. [1996] proposed grouping postings in each list with the same fd,t. This has two advantages for index compaction: first, grouping postings into blocks of postings with the same fd,t permits storing each fd,t value once; and, second, postings within a block can be sorted by document identifier to permit document differences to be stored within each block. However, overall, the largest contribution to index compaction in frequency-ordered indexes is the elimination of the redundant storage of fd,t values. With the organisation in place and compaction applied, variable-byte integers can be used to represent the list.
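As an illustration of how a document-ordered list is reduced to small integers and then byte-coded, the following is a minimal sketch. The byte layout (seven payload bits per byte, with the high bit set on the final byte of each integer) is one common convention and is an assumption here; the layout used by Scholer et al. [2002] may differ. The function names are hypothetical.

```python
def vbyte_encode(n: int) -> bytes:
    """Encode a non-negative integer using 7 payload bits per byte.

    Assumed convention: low-order bits first, high bit set on the last byte.
    """
    if n < 0:
        raise ValueError("vbyte_encode expects a non-negative integer")
    out = bytearray()
    while n >= 128:
        out.append(n & 0x7F)   # continuation byte, high bit clear
        n >>= 7
    out.append(n | 0x80)       # final byte, high bit set
    return bytes(out)


def vbyte_decode(data: bytes) -> list:
    """Decode a byte string produced by repeated vbyte_encode calls."""
    values, n, shift = [], 0, 0
    for b in data:
        if b & 0x80:                       # final byte of this integer
            values.append(n | ((b & 0x7F) << shift))
            n, shift = 0, 0
        else:
            n |= b << shift
            shift += 7
    return values


def encode_doc_ordered(postings) -> bytes:
    """Store each posting as (document gap, f_dt), both variable-byte coded.

    `postings` is a list of (doc_id, f_dt) pairs sorted by doc_id.
    """
    out = bytearray()
    prev = 0
    for doc_id, f_dt in postings:
        out += vbyte_encode(doc_id - prev)  # difference from previous identifier
        out += vbyte_encode(f_dt)
        prev = doc_id
    return bytes(out)
```

Under this convention any value below 128 occupies a single byte, which is why small differences between adjacent document identifiers matter for the final list size.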

Access frequencies are unrelated to both the position of a document in a collection and the frequency of the term in each document. Therefore, since the postings are not ordered by a property stored in the lists, taking differences between adjacent values does not yield a list of small integers. Accordingly, as we show in Section 3.6.6, without additional compaction, access-ordered indexes are around 160% of the size of a conventional document-ordered index. This large size affects index space requirements on disk, query evaluation speed, and memory caching. We therefore need to consider approaches to reducing index size, based on storing postings in blocks where document order can be maintained.

3.5.1 Access-Block Compaction

We now consider techniques for compaction of access-ordered indexes. Each scheme we describe aims to form blocks of postings, and then to organise the postings within the block by document order to permit differences to be stored. We report results with the schemes in Section 3.6.6.

Basic access-block compaction. Our first approach to compacting access-ordered indexes is motivated by the work of Persin et al. [1996]. It is likely that many documents within a collection share the same access count — particularly for low access counts — and so postings can be blocked together by that access count. Then, within each block, postings can be sorted by document identifier and differences taken. However, unlike blocking by fd,t values, this does not reduce the information that is required to be stored per block. Indeed, an additional integer must be stored per block that indicates the number of postings in the block; we refer to this value as fb.

Consider an example: the term “fish” has the following document-ordered postings list:

⟨283, 1⟩, ⟨386, 6⟩, ⟨430, 1⟩, ⟨436, 1⟩, ⟨480, 2⟩, ⟨750, 1⟩.

Assuming that the access counts in the collection are 2, 14, 1, 2, 1 and 0 for documents 283, 386, 430, 436, 480 and 750 respectively, a basic access-block compacted inverted list for the term “fish” would appear as:

[1]⟨386, 6⟩, [2]⟨283, 1⟩, ⟨153, 1⟩, [2]⟨430, 1⟩, ⟨50, 2⟩, [1]⟨750, 1⟩.

Each block begins with a block-size value fb shown in square brackets and, where the block size is greater than one, the differences between adjacent document identifiers are stored.

For short lists, the net benefit to index compaction through blocking is outweighed by the requirement to store the extra fb value per block.
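The basic scheme can be sketched in a few lines. This is an illustrative sketch, not the implementation used in the experiments: the function name, the in-memory representation of a block as a pair (fb, entries), and the choice to store the first document identifier of each block in full are all assumptions.

```python
from itertools import groupby


def basic_access_blocks(postings, access):
    """Group postings by access count and take document gaps within each block.

    postings: list of (doc_id, f_dt) pairs.
    access:   dict mapping doc_id to its access count.
    Returns a list of (f_b, entries) blocks, highest access count first, where
    entries holds (doc_id or gap, f_dt) pairs in document order.
    """
    # Decreasing access count, then increasing document identifier.
    ordered = sorted(postings, key=lambda p: (-access[p[0]], p[0]))
    blocks = []
    for _, group in groupby(ordered, key=lambda p: access[p[0]]):
        prev = 0
        entries = []
        for doc_id, f_dt in group:
            entries.append((doc_id - prev, f_dt))  # first entry keeps the full id
            prev = doc_id
        blocks.append((len(entries), entries))     # len(entries) is f_b
    return blocks


fish = [(283, 1), (386, 6), (430, 1), (436, 1), (480, 2), (750, 1)]
access = {283: 2, 386: 14, 430: 1, 436: 2, 480: 1, 750: 0}
for f_b, entries in basic_access_blocks(fish, access):
    print(f_b, entries)
```

Run on the “fish” example, the sketch yields four blocks with fb values 1, 2, 2 and 1, matching the compacted list shown above.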

Fixed access-block compaction. A simple extension of the basic access-blocked technique is to use a constant block size, that is, to set fb = k for a block size k. This avoids the requirement to store fb for each block, since all blocks (except, in most cases, the last in each list) contain fb postings. However, it relies on careful choice of k: when k = 1, exact access-ordering is maintained and no compaction achieved, while when k is set to the length of the largest list in the index, strict document-ordering is enforced.

Returning to our example list for the term “fish”, assuming k = 3, and access count values as above, the fixed access-block inverted list becomes:

[⟨283, 1⟩, ⟨103, 6⟩, ⟨50, 1⟩], [⟨430, 1⟩, ⟨50, 2⟩, ⟨270, 1⟩].

Each block is shown within square brackets. Fixed access-block compaction does not guarantee that all postings with the same access count are stored in the same block. Similarly, postings in a block may have different access counts; this is particularly likely for the postings with high access counts in each list or for short lists. Because of this, early termination heuristics must be carefully considered.
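A minimal sketch of fixed access-block compaction follows. The function name is hypothetical, and breaking ties between equal access counts by document identifier is an assumption made for the illustration.

```python
def fixed_access_blocks(postings, access, k):
    """Cut the access-ordered list into blocks of k postings and
    store document gaps within each block.

    postings: list of (doc_id, f_dt) pairs.
    access:   dict mapping doc_id to its access count.
    Returns a list of blocks; each block is a list of (doc_id or gap, f_dt).
    """
    if k < 1:
        raise ValueError("block size k must be at least 1")
    # Decreasing access count; ties broken by document identifier (assumption).
    ordered = sorted(postings, key=lambda p: (-access[p[0]], p[0]))
    blocks = []
    for start in range(0, len(ordered), k):
        block = sorted(ordered[start:start + k])    # document order inside the block
        prev = 0
        entries = []
        for doc_id, f_dt in block:
            entries.append((doc_id - prev, f_dt))   # first entry keeps the full id
            prev = doc_id
        blocks.append(entries)
    return blocks
```

With k = 1 every block holds a single posting with its full document identifier, so no differences are taken; with k at least the list length the whole list becomes one document-ordered block. These are the two extremes described above.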

The two-phase and minaccess pruning approaches rely on an absolute ordering of postings that have decreasing access count values. However, several of the compaction schemes proposed in this section rely on breaking this ordering. For these schemes, pruning requires that the thresholds are no longer compared to the access count of each posting, but instead are compared to the highest access count value in the block of postings.
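The block-level comparison can be sketched as follows. This is an assumption-laden illustration: it supposes a pruning rule that stops processing a list once access counts fall below a threshold, and it supposes that access counts are available at query time. The function name and the (doc_id, f_dt, access_count) representation are hypothetical.

```python
def blocks_to_process(blocks, threshold):
    """Return the leading blocks whose highest access count meets the threshold.

    blocks: list of blocks in list order; each block is a list of
            (doc_id, f_dt, access_count) triples.
    Blocks are assumed to be cut from an access-ordered sequence, so the
    per-block maximum never increases along the list and processing can
    stop at the first block that fails the test.
    """
    kept = []
    for block in blocks:
        block_max = max(a for _, _, a in block)  # compare the block, not each posting
        if block_max < threshold:
            break
        kept.append(block)
    return kept
```

A consequence of testing the block maximum is that some postings whose own access count is below the threshold are still processed whenever they share a block with a posting that meets it.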

In Section 3.6.6, we investigate choices of k, and show the effect of this compaction technique when combined with the early termination heuristics proposed above.

Exponential access-block compaction. Another approach to block-based compaction is to allocate block sizes based on a function with zero or more parameters. We have experimented with one approach in this class, though there are many options that can be considered. As we have shown, document access counts follow an inverse power law distribution: this has the effect that few documents have unique access counts, some share access counts with a few other documents, and the majority share low access counts with most others. Therefore, to permit blocks to contain most postings with identical access counts and for identical access counts not to span blocks, we propose a global scheme, where the first block in each list has fb = 1 postings, the second fb = 2, the third fb = 4, and so on, with each block storing twice as many postings as its predecessor.

For the example list “fish”, the exponential access-block compacted list is as follows:

⟨386, 6⟩, ⟨283, 1⟩, ⟨153, 1⟩, ⟨430, 1⟩, ⟨50, 2⟩, ⟨270, 1⟩.

This approach permits early termination between blocks, with reduced likelihood that postings with identical access counts span blocks. The drawback is that compaction will be most effective for long inverted lists, while, for short lists, compaction will be limited by small sized blocks. Compaction of short lists can be improved by setting an initial block size greater than 1. However, in experiments not reported here, we were unable to further reduce index size significantly by increasing the initial block size.
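The doubling scheme, including the optional larger initial block size, can be sketched as below. The function name and the `initial` parameter name are hypothetical, and ties between equal access counts are broken by document identifier as an assumption.

```python
def exponential_access_blocks(postings, access, initial=1):
    """Cut the access-ordered list into blocks of size initial, 2*initial,
    4*initial, ... and store document gaps within each block.

    postings: list of (doc_id, f_dt) pairs.
    access:   dict mapping doc_id to its access count.
    Returns a list of blocks; each block is a list of (doc_id or gap, f_dt).
    """
    if initial < 1:
        raise ValueError("initial block size must be at least 1")
    # Decreasing access count; ties broken by document identifier (assumption).
    ordered = sorted(postings, key=lambda p: (-access[p[0]], p[0]))
    blocks = []
    start, size = 0, initial
    while start < len(ordered):
        block = sorted(ordered[start:start + size])  # last block may be short
        prev = 0
        entries = []
        for doc_id, f_dt in block:
            entries.append((doc_id - prev, f_dt))    # first entry keeps the full id
            prev = doc_id
        blocks.append(entries)
        start += size
        size *= 2                                    # each block doubles in size
    return blocks


fish = [(283, 1), (386, 6), (430, 1), (436, 1), (480, 2), (750, 1)]
access = {283: 2, 386: 14, 430: 1, 436: 2, 480: 1, 750: 0}
print(exponential_access_blocks(fish, access))
```

For the six-posting “fish” list the blocks hold one, two and three postings (the last block is cut short by the end of the list), reproducing the compacted list given above.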

3.5.2 Other approaches

We have considered other approaches to achieving compaction. For example, we experimented with schemes where the postings at the beginning of each list (those with access counts greater than a threshold) are access-ordered and the remaining postings are document-ordered. We did this based on the observation that absolute access-ordering permits careful application of termination heuristics for postings with high access counts, while compaction can still be achieved for the majority of postings with low access counts. Unfortunately, such approaches work well only for long lists, achieving low compaction overall for reasonable parameter choices. We therefore report experiments with only the schemes described previously in this section.
