Processing
2.2.1 Information Retrieval Models
Several models have been proposed to address different information retrieval needs, and the choice of a retrieval model depends heavily on the application. Commonly used models include the vector space model [147], the Okapi BM25 model [104], and the language model [100].
The vector space model represents a set of documents as vectors in a common vector space. Both documents and queries are represented as vectors in which each dimension corresponds to a distinct term. If a term is present in a document, its value along the corresponding dimension, referred to as the term weight, can be computed using different schemes. One of the most commonly used schemes is the TF·IDF metric [147]. The term frequency TF(d, t) counts how many times the term t appears in a document d, and the inverse document frequency $IDF(O, t) = \log \frac{|O|}{|\{d \in O : TF(d, t) > 0\}|}$ measures the importance of t with respect to all of the documents in an object collection O. The text similarity between a document d and a query q is calculated as:
$$TS(d, q) = \sum_{t \in q \cap d} TF(d, t) \cdot IDF(O, t)$$
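The TF·IDF scoring described above can be sketched as follows. This is a minimal illustration that assumes the similarity is the plain sum of TF·IDF weights over the terms shared by the query and the document, with no length normalisation; the function names and the toy collection are hypothetical.

```python
import math
from collections import Counter


def tf(doc_terms, term):
    """TF(d, t): number of times `term` occurs in the document."""
    return Counter(doc_terms)[term]


def idf(collection, term):
    """IDF(O, t) = log(|O| / number of documents that contain t)."""
    df = sum(1 for doc in collection if term in doc)
    # A term that occurs nowhere contributes nothing (a choice made for this sketch).
    return math.log(len(collection) / df) if df else 0.0


def text_similarity(doc_terms, query_terms, collection):
    """Sum TF * IDF over the terms that occur in both the query and the document."""
    shared = set(query_terms) & set(doc_terms)
    return sum(tf(doc_terms, t) * idf(collection, t) for t in shared)


# Toy collection: each document is a list of terms.
docs = [
    ["red", "car", "red", "door"],
    ["blue", "car"],
    ["green", "door"],
    ["red", "sky"],
]
query = ["red", "car"]
scores = [text_similarity(d, query, docs) for d in docs]
```

In this toy collection the first document matches both query terms and contains 'red' twice, so it receives the highest score, while the third document shares no term with the query and scores zero.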
In information retrieval, the language model is a query likelihood model. Given a query, the documents are ranked by the probability that each document’s language model would generate the terms of the query. Smoothing is usually necessary to assign a non-zero probability to query terms that do not appear in a document. Finally, the Okapi BM25 model [104] takes the document length into account when computing the text relevance. Unlike the TF·IDF metric, the IDF component of BM25 may produce negative scores for terms that appear in a large fraction of the documents.
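Both models can be illustrated with a short sketch. The text does not fix a particular formulation, so the code below makes several assumptions: the language model uses Jelinek-Mercer smoothing with a hypothetical mixing weight `lam`, and BM25 uses the classic Robertson-Sparck Jones IDF with the commonly quoted defaults `k1 = 1.2` and `b = 0.75`. With that IDF variant, the score turns negative once a term occurs in more than half of the documents.

```python
import math
from collections import Counter


def query_log_likelihood(query_terms, doc_terms, collection_terms, lam=0.5):
    """log P(q | d) with Jelinek-Mercer smoothing (an assumed smoothing scheme).

    `doc_terms` and `collection_terms` are non-empty lists of terms.
    """
    doc_counts = Counter(doc_terms)
    coll_counts = Counter(collection_terms)
    score = 0.0
    for t in query_terms:
        p_doc = doc_counts[t] / len(doc_terms)
        p_coll = coll_counts[t] / len(collection_terms)
        p = (1 - lam) * p_doc + lam * p_coll
        if p == 0.0:  # term unseen even in the collection
            return float("-inf")
        score += math.log(p)
    return score


def bm25_idf(num_docs, doc_freq):
    """Robertson-Sparck Jones IDF; negative when doc_freq > num_docs / 2."""
    return math.log((num_docs - doc_freq + 0.5) / (doc_freq + 0.5))


def bm25_term_score(tf, doc_len, avg_doc_len, num_docs, doc_freq, k1=1.2, b=0.75):
    """Contribution of one query term to the BM25 score of one document."""
    length_norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return bm25_idf(num_docs, doc_freq) * tf * (k1 + 1) / (tf + length_norm)


collection = ["red", "car", "blue", "sky"]
doc = ["red", "car"]
seen = query_log_likelihood(["red"], doc, collection)
unseen = query_log_likelihood(["blue"], doc, collection)  # finite thanks to smoothing
```

The length normalisation in `bm25_term_score` means that, for the same term frequency, a document shorter than average receives a higher score than a longer one.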
2.2.2 Text Indexing
The inverted file [147] is widely used in state-of-the-art large-scale information retrieval systems, such as web search engines. An inverted file consists of a vocabulary of all distinct terms in a collection of documents. Each term is associated with an inverted list, which is a sequence of postings. Each posting contains the identifier of a document d that contains the term and, usually, the frequency of the term in d. In general, the postings in each inverted list are sorted by document ID, and the lists are compressed.
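A minimal in-memory sketch of this structure is given below. The text does not specify a compression scheme, so the `gaps` helper is an assumption: it shows d-gap encoding, a common first step in which each document ID is stored as the difference from its predecessor so that the resulting small integers compress well. All names are hypothetical.

```python
from collections import Counter, defaultdict


def build_inverted_file(docs):
    """Build term -> inverted list from a dict of doc_id -> list of terms.

    Each posting is (doc_id, term frequency); lists are sorted by doc_id.
    """
    index = defaultdict(list)
    for doc_id in sorted(docs):  # visiting IDs in order keeps every list sorted
        for term, freq in Counter(docs[doc_id]).items():
            index[term].append((doc_id, freq))
    return dict(index)


def gaps(postings):
    """Replace document IDs by differences between consecutive IDs (d-gaps)."""
    ids = [doc_id for doc_id, _ in postings]
    if not ids:
        return []
    return [ids[0]] + [b - a for a, b in zip(ids, ids[1:])]


docs = {
    3: ["red", "car"],
    1: ["red", "red", "door"],
    7: ["car"],
}
index = build_inverted_file(docs)
red_list = index["red"]        # [(1, 2), (3, 1)]
car_gaps = gaps(index["car"])  # [3, 4]
```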
The signature file [37] and the bitmap are two other text indexes that can be used for document indexing. The key idea of signature files is to create a quick filter that returns all documents matching the query, although some additional documents that do not match may also pass the filter as ‘false hits’. For this purpose, a signature, typically a hash-coded representation of the document, is generated for each document. A bitmap in which each bit represents the occurrence of a keyword can be viewed as a special type of signature file. As demonstrated by the empirical results of Zobel et al. [148], signature files are not competitive with the inverted file in terms of efficiency and space, especially for large datasets. However, a more recent signature-file approach, BitFunnel [48], uses Bloom filters [6] to represent the set of terms in each document as a fixed-length sequence of bits in order to address these limitations.
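The filtering idea can be sketched with a generic superimposed-coding scheme. This is not the BitFunnel design; the signature width `SIG_BITS`, the number of hash functions `NUM_HASHES`, and the use of SHA-256 are all assumptions chosen for illustration. A document passes the filter when every bit set in the query signature is also set in the document signature, and a verification step then removes the false hits.

```python
import hashlib

SIG_BITS = 64    # assumed signature width
NUM_HASHES = 3   # assumed number of bits set per term


def term_bits(term):
    """Map a term to a bit mask with up to NUM_HASHES bits set."""
    mask = 0
    for i in range(NUM_HASHES):
        digest = hashlib.sha256(f"{i}:{term}".encode()).digest()
        mask |= 1 << (int.from_bytes(digest[:8], "big") % SIG_BITS)
    return mask


def signature(terms):
    """Superimpose (bitwise OR) the masks of all terms."""
    sig = 0
    for t in terms:
        sig |= term_bits(t)
    return sig


def search(docs, query_terms):
    """Return (documents passing the filter, documents that truly match)."""
    q_sig = signature(query_terms)
    candidates = [d for d, terms in docs.items()
                  if (signature(terms) & q_sig) == q_sig]
    # Verification removes false hits; the filter never drops a true match.
    matches = [d for d in candidates if set(query_terms) <= set(docs[d])]
    return candidates, matches


docs = {
    1: ["red", "car"],
    2: ["blue", "sky"],
    3: ["red", "sky", "car"],
}
candidates, matches = search(docs, ["red", "car"])
```

In a real signature file the document signatures would be computed once at indexing time and stored, rather than recomputed per query as in this sketch.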
2.2.3 Text Query Processing
The two most commonly used traversal techniques for inverted files are: (i) Document-At-A-Time (Daat) and (ii) Term-At-A-Time (Taat) processing [119]. In Daat, each inverted list has a cursor that points to the ‘current’ posting in the list, and the cursors are moved forward as the query is processed. The total score of the current qualifying document over all query terms is computed before proceeding to the next document.
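A minimal sketch of exhaustive Daat traversal follows. It assumes that each posting already carries the term's score contribution for that document, so that postings have the hypothetical form `(doc_id, score)`, and it keeps the k best documents in a min-heap.

```python
import heapq


def daat_top_k(postings, query_terms, k):
    """Document-at-a-time top-k over lists of (doc_id, score) sorted by doc_id."""
    lists = [postings[t] for t in query_terms if t in postings]
    cursors = [0] * len(lists)  # one cursor per inverted list
    heap = []                   # min-heap of (score, doc_id)
    while True:
        current = [lst[c][0] for lst, c in zip(lists, cursors) if c < len(lst)]
        if not current:
            break  # every list is exhausted
        doc_id = min(current)  # next document in ID order
        score = 0.0
        for i, lst in enumerate(lists):
            c = cursors[i]
            if c < len(lst) and lst[c][0] == doc_id:
                score += lst[c][1]
                cursors[i] += 1  # move this cursor past the scored document
        if len(heap) < k:
            heapq.heappush(heap, (score, doc_id))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, doc_id))
    ranked = sorted(heap, key=lambda entry: (-entry[0], entry[1]))
    return [(doc_id, score) for score, doc_id in ranked]


postings = {
    "red": [(1, 1.0), (3, 2.0), (5, 0.5)],
    "car": [(3, 1.5), (4, 1.0)],
}
top2 = daat_top_k(postings, ["red", "car"], 2)
```

Because a document's score is complete as soon as the cursors move past it, only the current top-k results need to be kept in memory.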
[Figure 2.3: An example of the Wand traversal. Posting lists ordered by the document ID under each pointer: black (174, max = 2.3), blue (212, max = 1.8), red (316, max = 3.3), green (722, max = 4.3).]
In Taat, the scores of all qualifying documents are accumulated one query term at a time. The entire inverted list of the query term that is rarest in the collection is usually processed first, followed by the list of the next rarest term, and so on. An accumulator data structure keeps track of the partial scores of the documents. When the lists of all of the query terms have been processed, the k documents with the highest scores are returned. Daat traversal requires the inverted lists to be stored in document ID order, whereas Taat allows other orderings of the lists. While both processing regimes have advantages and disadvantages, Daat processing tends to be favoured in current IR systems because non-textual features can be integrated into the scoring process more easily. We next present Wand [8], a Daat traversal algorithm.
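Taat processing with accumulators can be sketched as follows. As an assumption of this sketch, postings have the hypothetical form `(doc_id, score)`, and the rarest term is taken to be the one with the shortest inverted list.

```python
import heapq
from collections import defaultdict


def taat_top_k(postings, query_terms, k):
    """Term-at-a-time top-k over lists of (doc_id, score)."""
    # Process the rarest term (shortest list) first.
    terms = sorted((t for t in query_terms if t in postings),
                   key=lambda t: len(postings[t]))
    accumulators = defaultdict(float)  # doc_id -> partial score
    for t in terms:
        for doc_id, score in postings[t]:  # the whole list is read for each term
            accumulators[doc_id] += score
    # Highest score first; ties broken by smaller document ID.
    return heapq.nsmallest(k, accumulators.items(),
                           key=lambda item: (-item[1], item[0]))


postings = {
    "red": [(1, 1.0), (3, 2.0), (5, 0.5)],
    "car": [(3, 1.5), (4, 1.0)],
}
top2 = taat_top_k(postings, ["red", "car"], 2)
```

The traversal never compares document IDs across lists, which is why Taat does not depend on the lists being stored in document ID order; the cost is one accumulator for every document that matches any query term.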
WAND algorithm. In Wand, objects (text documents) are sorted in ascending order of their IDs in each posting list. For each query term, the algorithm maintains a pointer that identifies the next candidate document that might need to be scored. In each iteration, the posting lists are considered in ascending order of the document IDs under their pointers, and the maximum textual similarity scores of the lists are summed in that order until the sum becomes greater than or equal to a threshold Rk(q) for the query q. Here, Rk(q) is the lowest score among the current top-k documents found so far. The term at which this happens is called the “pivot term”, and the document ID under the corresponding pointer is called the “pivot”. For example, in Figure 2.3, the query terms are ‘black, blue, red, green’ and the posting lists are sorted by the document IDs under their pointers. Let the current threshold be Rk(q) = 2.6. Summing the maximum textual similarity scores from the top, the sum first reaches the threshold at the term ‘blue’ (2.3 + 1.8 = 4.1 ≥ 2.6). The pivot term is therefore ‘blue’, and the pivot is ‘212’.
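The pivot selection step can be sketched on the values of Figure 2.3. The pairing of terms with pointer positions and maximum scores is read off the figure in the order listed, which is consistent with the worked example but remains an assumption; the function name and tuple layout are hypothetical.

```python
def find_pivot(cursors, threshold):
    """Return (pivot term, pivot document ID), or None if no pivot exists.

    `cursors` is a list of (term, current_doc_id, max_score) tuples.
    """
    # Consider the lists in ascending order of the document ID under each pointer.
    ordered = sorted(cursors, key=lambda c: c[1])
    upper_bound = 0.0
    for term, doc_id, max_score in ordered:
        upper_bound += max_score
        if upper_bound >= threshold:
            return term, doc_id
    return None  # no remaining document can reach the threshold


cursors = [
    ("red", 316, 3.3),
    ("black", 174, 2.3),
    ("green", 722, 4.3),
    ("blue", 212, 1.8),
]
pivot = find_pivot(cursors, threshold=2.6)
```

With a threshold of 2.6, the running sum is 2.3 after ‘black’ and 4.1 after ‘blue’, so the function returns the term ‘blue’ and the document ID 212.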
The crucial observation is that, since the pivot is determined using the maximum textual similarity scores, which are upper bounds on the score that any query-document pair can achieve, the pivot is the smallest document ID that might be a candidate. Thus, no unscored document with an ID smaller than the current pivot can be a top-k result, and all such documents can be safely skipped. As a query is being processed, Wand moves the pointers according to a document-skipping strategy based on the pivot. At any point, it is guaranteed that the documents to the left of the pointers have already been processed or safely skipped. If the pivot document is a candidate, the total score of that document is computed for the query. If the total score is greater than the threshold, the current top-k documents and the threshold are updated accordingly. As more documents are processed, the threshold becomes tighter, and more of the remaining documents are likely to be skipped.
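The complete traversal can be sketched as below. This is a simplified illustration rather than the published algorithm in full detail: postings are assumed to have the hypothetical form `(doc_id, score)`, the threshold is taken to be 0 until k documents have been scored, and pointers are advanced by a linear scan where a practical implementation would typically use skip pointers.

```python
import heapq


def wand_top_k(postings, max_scores, query_terms, k):
    """Top-k retrieval with pivot-based skipping.

    postings:   term -> list of (doc_id, score), sorted by doc_id
    max_scores: term -> upper bound on any score in that term's list
    """
    terms = [t for t in query_terms if postings.get(t)]
    pos = {t: 0 for t in terms}  # pointer into each posting list
    heap = []                    # min-heap of (score, doc_id)

    def current(t):
        return postings[t][pos[t]][0]

    while True:
        active = sorted((t for t in terms if pos[t] < len(postings[t])),
                        key=current)
        if not active:
            break
        threshold = heap[0][0] if len(heap) == k else 0.0
        # Pivot: first list at which the summed upper bounds reach the threshold.
        upper, pivot_doc = 0.0, None
        for t in active:
            upper += max_scores[t]
            if upper >= threshold:
                pivot_doc = current(t)
                break
        if pivot_doc is None:
            break  # no remaining document can enter the top-k
        if current(active[0]) == pivot_doc:
            # All preceding pointers sit on the pivot: score the document fully.
            score = 0.0
            for t in active:
                if current(t) != pivot_doc:
                    break
                score += postings[t][pos[t]][1]
                pos[t] += 1
            if len(heap) < k:
                heapq.heappush(heap, (score, pivot_doc))
            elif score > heap[0][0]:
                heapq.heapreplace(heap, (score, pivot_doc))
        else:
            # Skip: move the lagging pointer forward to the pivot or beyond.
            t = active[0]
            while pos[t] < len(postings[t]) and current(t) < pivot_doc:
                pos[t] += 1
    ranked = sorted(heap, key=lambda entry: (-entry[0], entry[1]))
    return [(doc_id, score) for score, doc_id in ranked]


postings = {
    "black": [(174, 2.25), (300, 1.0)],
    "blue": [(212, 1.75), (316, 0.5)],
    "red": [(316, 3.25), (722, 1.0)],
    "green": [(722, 4.25)],
}
max_scores = {t: max(s for _, s in lst) for t, lst in postings.items()}
query = ["black", "blue", "red", "green"]
top2 = wand_top_k(postings, max_scores, query, 2)
top1 = wand_top_k(postings, max_scores, query, 1)
```

With k = 1 on this toy data, document 212 is never scored: once document 174 sets the threshold to 2.25, the upper bound of ‘blue’ alone (1.75) cannot reach it, so the pivot moves to document 300 and the ‘blue’ pointer skips ahead.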