The Inverted Index - Search engine optimisation using past queries

For search systems to service requests efficiently, they must have rapid access to candidate results. An inverted index is a data structure that provides direct access to the locations at which terms appear within a document collection. Akin to the index found in a textbook, an inverted index records, for each term in the collection, the documents in which the term occurs. Inverted indexes are an essential component of all web search engines and text retrieval systems [Zobel and Moffat, 2006].

An inverted index has two components: first, a vocabulary of the terms that occur in the searchable collection, and second, a set of postings that list the locations of occurrence of the search terms. Typically, searchable terms are words extracted from the text [Williams and Zobel, 2005].

For each term, there is one posting for each document that contains the term. Zobel and Moffat [1998] proposed a notation to formally describe such postings: each term $t$ has postings $\langle d, f_{d,t} \rangle$, where $f_{d,t}$ is the frequency of $t$ in document $d$. An inverted list is the set of postings for a single term of the collection. Inverted lists take the form:

$$\langle d_1, f_{d_1,t} \rangle, \langle d_2, f_{d_2,t} \rangle, \ldots, \langle d_{n-1}, f_{d_{n-1},t} \rangle, \langle d_n, f_{d_n,t} \rangle,$$

where $d_n$ is the largest document identifier in the set of postings.
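The structure described above can be sketched in a few lines of Python. This is a minimal illustration, not the indexer used in this work: the function name `build_document_level_index`, the three-document collection `toy_docs`, and the whitespace tokenisation are all assumptions made for the example.

```python
from collections import Counter, defaultdict

def build_document_level_index(docs):
    """Map each term to a list of (d, f_dt) postings sorted by document id.

    `docs` maps an integer document identifier to its text. Tokenisation is a
    plain lowercase whitespace split, an assumption made for brevity.
    """
    index = defaultdict(list)
    for doc_id in sorted(docs):              # ascending ids keep lists ordered
        counts = Counter(docs[doc_id].lower().split())
        for term, freq in counts.items():
            index[term].append((doc_id, freq))
    return dict(index)

# Hypothetical three-document collection, for illustration only.
toy_docs = {
    1: "a storm at sea",
    2: "the storm and the sea",
    3: "calm sea",
}
toy_index = build_document_level_index(toy_docs)
print(toy_index["storm"])   # [(1, 1), (2, 1)]
print(toy_index["the"])     # [(2, 2)]
```

The keys of the returned dictionary play the role of the vocabulary, and each value is one inverted list.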

Although other data structures, such as signature files [Faloutsos, 1985], have been proposed for search systems in the past, inverted indexes have been shown to be more functional and efficient. Zobel et al. [1998] compare signature files with inverted indexes, and present results showing the benefits of the latter for most search-related tasks.

Figure 2.2 shows a fraction of the inverted index for the Tempest collection from Section 2.3. In the inverted list for the term “storm”, we can see that the term appears once in document 22, thrice in document 386, and twice in document 408.

The internal document identifiers assigned by the search system can be resolved into disk locations using a document mapping table. The document mapping table is constructed as the collection is parsed at indexing time.
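As a rough sketch of how such a table might be built while parsing, the following Python assigns sequential identifiers and records a byte offset and length for each document. The layout (documents stored back to back in one UTF-8 file), the function name `build_document_map`, and the document names are hypothetical; the source does not specify the table's actual format.

```python
def build_document_map(documents):
    """Assign sequential ids and record where each document would sit on disk.

    `documents` is an iterable of (name, text) pairs, read in collection order.
    Returns {doc_id: (name, byte_offset, byte_length)}. The layout is an
    assumption: documents stored back to back in a single UTF-8 file.
    """
    doc_map = {}
    offset = 0
    for doc_id, (name, text) in enumerate(documents, start=1):
        length = len(text.encode("utf-8"))
        doc_map[doc_id] = (name, offset, length)
        offset += length
    return doc_map

# Hypothetical two-document collection.
toy_map = build_document_map([("scene-1", "a storm at sea"), ("scene-2", "calm sea")])
print(toy_map[2])   # ('scene-2', 14, 8)
```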

This basic document-level inverted index structure is sufficient to support the popular ranked and less-popular Boolean query evaluation models. However, a document-level index cannot support query types such as phrase queries, where the ordering and adjacency of words determine which documents are matches. Nor can it support ranking methods based on term proximity.

fairy: $\langle 642, 2 \rangle \langle 7, 1 \rangle$
farewell: $\langle 47, 2 \rangle \langle 387, 3 \rangle \langle 30, 1 \rangle$
fish: $\langle 283, 1 \rangle \langle 103, 6 \rangle \langle 44, 1 \rangle \langle 6, 1 \rangle \langle 44, 2 \rangle \langle 270, 1 \rangle$
state: $\langle 77, 2 \rangle \langle 2, 1 \rangle \langle 392, 1 \rangle \langle 135, 1 \rangle$
storm: $\langle 22, 1 \rangle \langle 364, 3 \rangle \langle 22, 2 \rangle$
swords: $\langle 565, 1 \rangle \langle 1, 2 \rangle$
sycorax: $\langle 128, 1 \rangle \langle 4, 1 \rangle \langle 4, 1 \rangle \langle 23, 2 \rangle \langle 352, 2 \rangle$

Figure 2.4: Selected inverted lists in a document-level index from the Tempest collection with difference compaction.

To support such fine-grained measures, word position information is also stored in the postings lists. Extending the notation of Zobel and Moffat, each posting becomes a triple of the form:

$$\langle d, f_{d,t}, [o_1, \ldots, o_{f_{d,t}}] \rangle,$$

where d, t, and f are as previously described, and the o values are the positions in d at which t occurs. We refer to an inverted index containing term offset information as a word-level index.

Figure 2.3 shows a fraction of the word-level inverted index for the Tempest collection. We can now see that the term “storm” appears once in document 22, at the 19th position, thrice in document 386, at the 16th, 217th and 254th positions, and finally twice in document 408, at the 29th and 44th positions.
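To illustrate why offsets matter, the sketch below builds a small word-level index and uses it to answer a two-word phrase query. It is an illustrative assumption throughout: the function names, the toy collection, the whitespace tokenisation, and the 1-based offsets are chosen for the example and are not taken from the system described here.

```python
from collections import defaultdict

def build_word_level_index(docs):
    """Map each term to postings (d, f_dt, [o_1, ..., o_f]) with 1-based offsets."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        positions = defaultdict(list)
        for offset, term in enumerate(docs[doc_id].lower().split(), start=1):
            positions[term].append(offset)
        for term, offsets in positions.items():
            index[term].append((doc_id, len(offsets), offsets))
    return dict(index)

def phrase_match(index, first, second):
    """Return ids of documents in which `second` directly follows `first`."""
    second_offsets = {d: set(offs) for d, _, offs in index.get(second, [])}
    matches = []
    for doc_id, _, offsets in index.get(first, []):
        following = second_offsets.get(doc_id, set())
        if any(o + 1 in following for o in offsets):
            matches.append(doc_id)
    return matches

# Hypothetical collection: documents 1, 2 and 3 all contain both "the" and
# "storm", but only 1 and 3 contain them as the adjacent phrase "the storm".
toy_docs = {
    1: "the storm at sea",
    2: "sea storm and the calm sea",
    3: "at sea the storm the storm",
}
toy_index = build_word_level_index(toy_docs)
print(toy_index["storm"])                        # [(1, 1, [2]), (2, 1, [2]), (3, 2, [4, 6])]
print(phrase_match(toy_index, "the", "storm"))   # [1, 3]
```

A document-level index would report all three documents as containing both terms; only the stored offsets allow document 2 to be rejected.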

For the majority of terms, the number of postings is small. It is not uncommon for a large collection to have a large proportion of terms that occur only once. In the Tempest collection, 1,935 terms, that is 59.9% of the unique terms in the collection, appear only once. Conversely, for common terms the number of postings in each inverted list can be very large. Returning to the Tempest collection, the five most common terms (“and”, “i”, “the”, “to” and “a”) each appear in 26.1% to 35.1% of the documents in the collection, and individually account for 1.8% to 3.0% of all term occurrences.

2.4.1 Inverted List Compression

The efficiency of ranked querying is highly dependent on the effective organisation and storage of postings lists. Conventionally, postings lists are organised so that the postings are sorted by increasing document order. As postings are processed sequentially, this permits differences to be taken between adjacent document identifiers prior to storage on disk. Storing differences between document numbers improves the ability to compress lists by reducing both the magnitude and range of values stored [Zobel and Moffat, 2006]. An inverted list with difference compaction takes the form:

$$\langle d_1, f_{d_1,t} \rangle, \langle d_2 - d_1, f_{d_2,t} \rangle, \ldots, \langle d_{n-1} - d_{n-2}, f_{d_{n-1},t} \rangle, \langle d_n - d_{n-1}, f_{d_n,t} \rangle.$$

An original document identifier is restored by summing its stored difference with all of the differences that precede it in the list. Figure 2.4 shows document-level lists for the Tempest collection, with difference compaction. For example, for the term “storm”, the document identifier of the second posting is obtained by taking the sum of 364 and 22, which gives the original value of 386.
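The taking and undoing of differences can be sketched as follows, using the document identifiers of the “storm” list as input. The function names `to_gaps` and `from_gaps` are illustrative choices, not names from the source.

```python
from itertools import accumulate

def to_gaps(doc_ids):
    """Replace each sorted document id with its difference from the previous one."""
    if not doc_ids:
        return []
    return [doc_ids[0]] + [cur - prev for prev, cur in zip(doc_ids, doc_ids[1:])]

def from_gaps(gaps):
    """Restore the original ids as running sums of the stored differences."""
    return list(accumulate(gaps))

storm_doc_ids = [22, 386, 408]
storm_gaps = to_gaps(storm_doc_ids)
print(storm_gaps)             # [22, 364, 22]
print(from_gaps(storm_gaps))  # [22, 386, 408]
```

Because restoring the k-th identifier needs every difference before it, a list stored this way is naturally read from the front.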

A similar approach can be followed for the offsets used in a word-level index:

$$\langle d, f_{d,t}, [o_1, o_2 - o_1, o_3 - o_2, \ldots, o_{f_{d,t}-1} - o_{f_{d,t}-2}, o_{f_{d,t}} - o_{f_{d,t}-1}] \rangle.$$

The word-level list for the term “storm” then becomes:

$$\langle 22, 1, [19] \rangle, \langle 364, 3, [16, 201, 37] \rangle, \langle 22, 2, [29, 15] \rangle.$$

This indicates that, for example, the term occurs in document 386 at word positions 16, 217 and 254.
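Decoding such a list means taking running sums at two levels: across postings for the document identifiers, and within each posting for the offsets. The sketch below applies this to the compacted “storm” list; the function name `decode_word_level_list` and the tuple representation are assumptions made for the example.

```python
from itertools import accumulate

def decode_word_level_list(compacted):
    """Undo difference compaction on postings of the form (d_gap, f, offset_gaps)."""
    decoded = []
    doc_id = 0
    for doc_gap, freq, offset_gaps in compacted:
        doc_id += doc_gap                     # running sum over document gaps
        decoded.append((doc_id, freq, list(accumulate(offset_gaps))))
    return decoded

storm_compacted = [(22, 1, [19]), (364, 3, [16, 201, 37]), (22, 2, [29, 15])]
for posting in decode_word_level_list(storm_compacted):
    print(posting)
# (22, 1, [19])
# (386, 3, [16, 217, 254])
# (408, 2, [29, 44])
```

Note that offset differences restart within each posting, whereas document differences carry over from one posting to the next.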

With differences taken, various coding techniques can be employed to compress the integers that compose the index. Traditional bitwise coding techniques [Golomb, 1966; Elias, 1975] have been shown to be effective, but less efficient [Trotman, 2003]. We use a purpose-built variable-byte integer compression scheme that is designed for very fast decoding [Williams and Zobel, 1999; Scholer et al., 2002]. More recently, Anh and Moffat [2005a] have proposed word-aligned codes that combine the benefits of bitwise compression with fixed-width alignment, allowing for rapid access through the encoded data. While we do not use this more recent coding technique, we note that it is compatible with the work reported here.
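For readers unfamiliar with variable-byte coding, the following sketch shows one common formulation: seven data bits per byte, low-order group first, with the high bit set on every byte except the last of each integer. This is an assumption for illustration; the exact byte layout of the purpose-built scheme cited above is not given in this section and may differ.

```python
def vbyte_encode(numbers):
    """Encode non-negative integers with 7 data bits per byte.

    Assumed layout: low-order group first, high bit set on all but the
    final byte of each integer.
    """
    out = bytearray()
    for n in numbers:
        if n < 0:
            raise ValueError("variable-byte coding needs non-negative integers")
        while n >= 128:
            out.append((n & 0x7F) | 0x80)   # more bytes follow
            n >>= 7
        out.append(n)                        # final byte, high bit clear
    return bytes(out)

def vbyte_decode(data):
    """Invert vbyte_encode, returning the list of integers."""
    numbers, value, shift = [], 0, 0
    for byte in data:
        value |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            numbers.append(value)
            value, shift = 0, 0
    return numbers

# The compacted document-level "storm" list, flattened as (d-gap, f) pairs.
storm_values = [22, 1, 364, 3, 22, 2]
encoded = vbyte_encode(storm_values)
print(len(encoded))            # 7
print(vbyte_decode(encoded))   # [22, 1, 364, 3, 22, 2]
```

Only the value 364 needs a second byte here, which shows why small differences are cheaper to store than raw document identifiers.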

Better compression improves query evaluation times for two reasons: first, since disk bandwidth is a bottleneck in modern retrieval engines, compression allows more data to be transferred per disk read than when data is uncompressed; and second, compression permits more data to be stored in main memory, improving caching effects. Indeed, Scholer et al. [2002] show that compressing postings lists reduces average query evaluation times to around one-third of those of an uncompressed representation. However, a disadvantage of document-ordered lists is that each must be decoded in its entirety in response to a query. As document identifiers are assigned incrementally as documents are observed, the position of a document in the postings list has little correlation with its similarity to most queries.

[Figure 2.5: Interaction of components within a search system. Labelled components: Documents, Parser, Indexer, Vocabulary, Inverted Lists, Document Map, Replicated Collection, Queries, Query Processor, Process Query Terms, Accumulators, Normalise by Doc. Length, Top Ranked Accumulators, Create Summaries for Top Ranked Docs., Query Results with Document Summaries.]
