The CTS Snippet Engine - Document representation for efficient search engines

The above baseline can be optimised in several ways. The first is to use a word-based semi-static compression scheme over the entire collection, which allows fast decompression with minimal loss in compression effectiveness [Witten et al., 1999]. Using a semi-static approach involves mapping each word and punctuation sequence produced in the parsing stage to a single integer code. Words and non-words strictly alternate in the compressed file, which always begins with a punctuation symbol. This leading symbol stores the capitalisation of the first word in the document. A detailed description of how the capitalisation code is stored can be found in Section 3.4.

Encoding a text collection using a word-based semi-static model requires two passes over the text. During the first pass, two models are constructed: one for the set of words and another for the non-words (punctuation). A model in this instance is a dictionary of unique symbols (each either a punctuation sequence or a word) that also records the frequency of each symbol in the collection. Symbols in a dictionary are sorted in decreasing order of frequency. Each symbol is then assigned an integer codeword, which is its ordinal number in the corresponding dictionary. Frequent symbols receive small integer codewords, while rare symbols receive larger integer values.
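The first pass can be sketched as follows. This is a minimal illustration and not the CTS implementation: the regular-expression tokeniser, the function names, the alphabetical tie-break, and the choice to start codewords at 1 are all assumptions made for the example.

```python
import re
from collections import Counter

# Assumed tokeniser: runs of ASCII letters/digits are words, and
# everything between two words is a single non-word symbol.
TOKEN_RE = re.compile(r"(?P<word>[A-Za-z0-9]+)|(?P<nonword>[^A-Za-z0-9]+)")


def count_symbols(documents):
    """First pass: collect word and non-word frequencies over the collection."""
    word_freq, nonword_freq = Counter(), Counter()
    for text in documents:
        for match in TOKEN_RE.finditer(text):
            if match.lastgroup == "word":
                word_freq[match.group()] += 1
            else:
                nonword_freq[match.group()] += 1
    return word_freq, nonword_freq


def assign_codewords(freq):
    """Sort symbols by decreasing frequency and number them from 1.

    Ties are broken alphabetically so the result is deterministic
    (an assumption; the text does not specify a tie-break).
    """
    ordered = sorted(freq.items(), key=lambda item: (-item[1], item[0]))
    return {symbol: code for code, (symbol, _) in enumerate(ordered, start=1)}


def build_models(documents):
    word_freq, nonword_freq = count_symbols(documents)
    return assign_codewords(word_freq), assign_codewords(nonword_freq)
```

For example, `build_models(["the cat and the dog", "the end"])` gives `the` the smallest word codeword because it is the most frequent word, and the single space the smallest non-word codeword.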

The process of computing the semi-static model is complicated by the high number of words and non-words appearing in large web collections, a large proportion of which occur infrequently. For instance, analysis of two large web collections, discussed in Section 3.4, revealed that words that occur once or twice across a collection constitute just under 80% of the entries in the words model. Discarding such words is not practical, as they may be required to reconstruct the original text when generating snippets. On the other hand, if we were to store all words and punctuation appearing in the collection, together with their associated frequencies, many gigabytes of memory or a B-tree or similar on-disk structure would be required [Williams and Zobel, 2005].

Moffat et al. [1997] examined schemes for pruning models during compression using large sets of symbols. They conclude that rarely occurring words need not reside in the model. Instead, they can be spelt out in the final compressed file, with a special word token (an escape symbol) signalling their occurrence. The obvious practical advantage of limiting the number of symbols is that it reduces the size of the model, perhaps even to the extent that it fits in RAM. On the other hand, as terms not in the model are now spelt out, and therefore potentially assigned long codes, the size of the compressed collection can be expected to increase. We examine the trade-off between model pruning, compression effectiveness, and decompression speed in Section 3.8.

Figure 3.3: A diagram illustrating the components of the CTS system. Gray rectangles indicate the output produced.
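The pruning trade-off can be made concrete with a small sketch. The frequency-threshold policy and the names `prune_model` and `min_freq` are assumptions chosen for illustration; the text does not state which pruning rule is used.

```python
from collections import Counter


def prune_model(freq, min_freq):
    """Keep symbols seen at least `min_freq` times (hypothetical policy).

    Returns the pruned frequency table and the number of token
    occurrences that would have to be spelt out behind an escape symbol.
    """
    kept = {symbol: count for symbol, count in freq.items() if count >= min_freq}
    escaped = sum(count for count in freq.values() if count < min_freq)
    return kept, escaped


# Toy frequencies, invented for the example.
freq = Counter({"the": 1000, "snippet": 40, "qwertz": 2, "zyzzyva": 1})
kept, escaped = prune_model(freq, min_freq=3)
```

Here half of the model entries are dropped, yet only 3 of the 1043 token occurrences need to be spelt out, which is the effect that makes pruning attractive when most entries are rare.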

In the second pass, the encoder replaces each symbol in the collection with its integer codeword. Where a word does not occur in the model, a sequence consisting of the escape symbol, the length of the word, and its ASCII representation is used instead. Similarly, each non-word sequence is replaced with its codeword, or with the codeword for a single space character if it is not in the model. The sequence of integers in each document is then coded using the variable-byte (v-byte) scheme [Williams and Zobel, 1999].
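A minimal sketch of the second pass is given below. Several details are assumptions rather than facts about CTS: codeword 0 is reserved for the escape symbol, spelt words are stored as raw ASCII bytes after a v-byte length, and the v-byte layout places the low-order 7 bits first with the high bit marking a continuation (other v-byte conventions exist).

```python
ESCAPE = 0  # assumed: codeword 0 is reserved for the escape symbol


def vbyte_encode(n):
    """Encode a non-negative integer, 7 bits per byte, low-order bits first.

    The high bit of a byte is set when more bytes follow (assumed layout).
    """
    out = bytearray()
    while n >= 128:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)


def vbyte_decode(data, pos=0):
    """Decode one integer starting at `pos`; return (value, next_pos)."""
    value, shift = 0, 0
    while True:
        byte = data[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if byte < 128:
            return value, pos
        shift += 7


def encode_word(word, word_codes):
    """Emit the codeword, or escape + length + ASCII bytes if not in the model."""
    code = word_codes.get(word)
    if code is not None:
        return vbyte_encode(code)
    raw = word.encode("ascii")
    return vbyte_encode(ESCAPE) + vbyte_encode(len(raw)) + raw


def encode_nonword(nonword, nonword_codes):
    """Emit the codeword, falling back to the code for a single space."""
    return vbyte_encode(nonword_codes.get(nonword, nonword_codes[" "]))
```

With this layout any codeword below 128 occupies a single byte, so the frequency-ordered numbering of the model translates directly into short codes for common symbols.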

While the use of the v-byte scheme should allow faster decompression than the baseline, the intermediate representation of words and punctuation as integers also permits fast search for snippet sentences. The baseline system searches for query terms in a document by performing exhaustive character-by-character matching. CTS, on the other hand, requires a single test to determine whether an unspelt word in a document matches a given query term.

Another optimisation that can be applied to the baseline is to eliminate the overhead of opening and closing documents stored in individual files. The CTS system stores all documents contiguously in one file, together with an auxiliary table of 64-bit integers indicating the start offset of each document in the file. Where the snippet generator is part of a search engine, we envisage that the offsets can be stored in the Document Map data structure. Loading a document into memory for summarisation, using the single-file approach, entails a disk seek and a read operation.
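The single-file layout can be sketched as follows. The function names, the little-endian encoding of the 64-bit offsets, and the extra sentinel offset marking the end of the last document are assumptions made for this example. In-memory buffers stand in for the two on-disk files.

```python
import io
import struct

OFFSET_FORMAT = "<Q"  # assumed: unsigned 64-bit, little-endian


def write_collection(docs, data_file, offsets_file):
    """Append compressed documents contiguously and record their start offsets."""
    offset = 0
    for doc in docs:
        offsets_file.write(struct.pack(OFFSET_FORMAT, offset))
        data_file.write(doc)
        offset += len(doc)
    # Sentinel entry so the length of the last document is known (assumption).
    offsets_file.write(struct.pack(OFFSET_FORMAT, offset))


def load_offsets(offsets_file):
    """Read the whole offset table into a list of integers."""
    return [value for (value,) in struct.iter_unpack(OFFSET_FORMAT, offsets_file.read())]


def read_document(data_file, offsets, doc_id):
    """Fetch one document with a single seek and a single read."""
    start, end = offsets[doc_id], offsets[doc_id + 1]
    data_file.seek(start)
    return data_file.read(end - start)


# Usage with in-memory buffers in place of real files.
data, table = io.BytesIO(), io.BytesIO()
write_collection([b"alpha", b"be", b"gamma"], data, table)
table.seek(0)
offsets = load_offsets(table)
second = read_document(data, offsets, 1)
```

Keeping the offsets in a flat table means that locating a document costs one array lookup, after which the file system is touched only by the seek and the read.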

Further, when constructing a caption, the snippet generator must have access to the model of words. This allows query terms to be mapped to their correct integer codes, and to recover those words not spelt in the document back to their string form. Figure 3.3 provides a high-level representation of the components used, and outputs produced, when compressing a collection with the CTS system.

At snippet generation time, the query terms — encoded as integers or spelt — and the offsets of the documents to be summarised are passed to the CTS engine. The CTS engine then locates those documents on disk and fetches their content. Each document is uncompressed to recover its integer tokens or spelt representation. A linear search of the tokens in the documents is carried out to locate sentences that contain query terms, and a score is then assigned to each sentence as per Algorithm 1. While scanning through the document we also maintain the location of the EOS markers, to determine sentence boundaries.
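The linear scan over a decompressed document can be sketched as below. This is an illustration under stated assumptions: tokens are shown as a flat list of word codewords, `EOS` is a hypothetical reserved value for the end-of-sentence marker, spelt (escaped) words are not handled, and the hit count is only a stand-in for the sentence score of Algorithm 1, which is not reproduced here.

```python
EOS = -1  # hypothetical reserved value marking the end of a sentence


def find_query_sentences(tokens, query_codes):
    """Return (start, end, hits) for every sentence containing a query term.

    `start` and `end` are token positions (end exclusive). `hits` is the
    number of query-term occurrences in the sentence.
    """
    query = set(query_codes)
    results = []
    start, hits = 0, 0
    for position, token in enumerate(tokens):
        if token == EOS:
            if hits:
                results.append((start, position, hits))
            start, hits = position + 1, 0
        elif token in query:  # a single integer test per word
            hits += 1
    if hits:  # final sentence without a trailing EOS marker
        results.append((start, len(tokens), hits))
    return results
```

Because query terms and document words are both integers, matching a word against the query is one set-membership test rather than a character-by-character comparison.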
