Practical considerations - Statistical Machine Learning For Information Retrieval Adam Berger

Conventional high-performance retrieval systems typically decompose the task of ranking documents by relevance to a query q = {q1, q2, ..., qn} into a retrieval stage and a query expansion stage. For instance, in the automatic relevance feedback approach, the system first ranks just those documents whose content overlaps with q, and assumes the other members of D are not relevant to the query. In part because there may exist relevant documents which have no words in common with the query, the system then expands q to include a set of words which appeared frequently among the top-ranked documents in the first step, and ranks documents whose content overlaps with this expanded query.

The crucial aspect of this two-step process is that in each step, the ranking algorithm can disregard documents containing none of the words in the query. This approximation, which ignores the potentially large subset of the document collection with no words in common with the query, makes the difference between a usable and an impractically slow ranking algorithm.

A retrieval system that only considers documents containing words in the query can organize the collection efficiently into an inverted index, which lists for each word the documents containing that word. Processing a query with an inverted index is then a simple matter of visiting just those lists in the inverted index corresponding to words in the query. Inverted indices are a nearly ubiquitous component of large-scale document retrieval systems.
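As a minimal sketch of the idea, assuming whitespace tokenization and a toy three-document collection (the function names `build_inverted_index` and `candidate_documents` are illustrative, not taken from the text), an inverted index and the lookup it supports might look as follows:

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each word to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def candidate_documents(index, query):
    """Return ids of documents sharing at least one word with the query."""
    candidates = set()
    for word in query.lower().split():
        # Only the posting lists of the query words are visited.
        candidates |= index.get(word, set())
    return candidates


# Toy collection, invented for illustration.
docs = {
    "d1": "statistical models for retrieval",
    "d2": "cooking with cast iron",
    "d3": "retrieval of legal documents",
}
index = build_inverted_index(docs)
hits = candidate_documents(index, "document retrieval")
```

Here `d2` is never touched, since it shares no word with the query. The query word "document" also fails to match "documents" in `d3`; that document is found only through "retrieval", which is the kind of vocabulary mismatch that motivates the models discussed next.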

At first glance, it would seem that models which capture the semantic proximity between words are incompatible with the use of an inverted index. After all, when using Model 1 or Model 1’, all documents are “in play” for a given query: a document not containing a query word might, after all, still generate that word with high probability. But forfeiting the use of an inverted index entirely and explicitly computing a relevance score for every document, as NaiveRank does, is too inefficient. Calculating the relevance of a document to a query using Model 1 (equation (3.13)) requires time proportional to |q| × |d|: the product of the size of the query and the size of the document. In practice, it appears that NaiveRank can require an hour or more per query for a TREC-sized document collection of a few hundred thousand documents, on a modern-day workstation.
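The |q| × |d| cost can be made concrete with a small sketch. Equations (3.13) and (3.16) are not reproduced in this section, so the code below assumes the smoothed form suggested by (3.20) with the sum taken over every word of the document, and takes l(w | d) to be the relative frequency of w in d. The function name, the value α = 0.5, and all probabilities are illustrative assumptions.

```python
from collections import Counter


def naive_score(query, doc_words, sigma, p_background, alpha=0.5):
    """Score one document as the product over query words of p_alpha(q | d)."""
    counts = Counter(doc_words)
    total = sum(counts.values())
    score = 1.0
    for q in query:  # |q| iterations
        # The inner sum visits every distinct word of the document,
        # so the work per document grows as |q| x |d|.
        p_q_d = sum(
            (c / total) * sigma.get((q, w), 0.0) for w, c in counts.items()
        )
        score *= alpha * p_background.get(q, 0.0) + (1 - alpha) * p_q_d
    return score


# Hypothetical word-relation probabilities sigma(q | w), keyed as (q, w).
sigma = {("car", "car"): 0.8, ("car", "automobile"): 0.6}
p_background = {"car": 0.01}

related = naive_score(["car"], ["automobile", "dealer"], sigma, p_background)
unrelated = naive_score(["car"], ["cooking", "pot"], sigma, p_background)
```

The first document never mentions "car" yet scores far above the second, which is why every document is "in play" and an ordinary inverted index cannot prune the collection.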

The remainder of this section presents a set of heuristics to allow efficient (albeit approximate) ranking of documents by their probability pα(q | d) of generating the query in a distillation process. This section also shows how, by using a data structure similar to an inverted index, one can achieve near real-time retrieval performance.

The key observation is that the ranking algorithm offers a time-space tradeoff. Rather than calculating the sum in (3.13) during ranking, one can precompute p(q | d) for every known word q and each document d ∈ D, and store the results in a matrix, illustrated in Figure 3.15. Denote this “inverted matrix” (similar to an inverted index, but containing an entry for every (d, q) pair) by the symbol I. (As a reminder: p(q | d) is just one component of the smoothed probability pα(q | d) of q given d. By inspection of (3.16), one can also see a contribution from the document-wide language model p(q | D).)

[Figure: the matrix I. Rows are indexed by query words q1, q2, q3, ...; columns are indexed by documents d1, d2, d3, .... The cell for query word qi and document d̂ holds p(qi | d̂) = Σ_w l(w | d̂) σ(qi | w).]

Figure 3.15: NaiveRank computes pα(qi | d) according to (3.16) for each word qi in the query q = {q1, q2, ..., qm}. Avoiding this costly process is not difficult: just precompute, once and for all, p(q | d) for all words q and documents d. Calculating pα(q | d) is then a matter of multiplying the precomputed p(qi | d) together, factoring in the smoothing terms p(q | D) along the way. This figure depicts a data structure I which stores these precomputed values.

Precomputing the cells of I and then using these values in NaiveRank reduces the cost of ranking from |D| × |q| × |d| to |D| × |q| operations.
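A rough sketch of this tradeoff follows, assuming I is held as a nested dictionary and that each cell stores Σ_w l(w | d) σ(q | w) with l(w | d) taken as relative frequency. The helper names, the value α = 0.5, and the toy probabilities are assumptions made for illustration only.

```python
from collections import Counter


def precompute_matrix(docs, vocabulary, sigma):
    """Return I[q][d] = sum_w l(w | d) * sigma(q | w) for every (q, d) pair."""
    matrix = {q: {} for q in vocabulary}
    for doc_id, words in docs.items():
        counts = Counter(words)
        total = sum(counts.values())
        for q in vocabulary:
            matrix[q][doc_id] = sum(
                (c / total) * sigma.get((q, w), 0.0) for w, c in counts.items()
            )
    return matrix


def rank_with_matrix(query, matrix, p_background, doc_ids, alpha=0.5):
    """Score each document with one table lookup per (query word, document)."""
    scores = {}
    for d in doc_ids:
        score = 1.0
        for q in query:
            score *= alpha * p_background[q] + (1 - alpha) * matrix[q][d]
        scores[d] = score
    return scores


# Toy data, invented for illustration.
docs = {"d1": ["automobile", "dealer"], "d2": ["cooking", "pot"]}
sigma = {("car", "car"): 0.8, ("car", "automobile"): 0.6}
p_background = {"car": 0.01}

matrix = precompute_matrix(docs, ["car"], sigma)
scores = rank_with_matrix(["car"], matrix, p_background, list(docs))
```

The document-length factor is paid once, inside `precompute_matrix`; at query time `rank_with_matrix` performs only |D| × |q| lookups and multiplications.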

Unfortunately, the matrix I, with as many columns as documents in the collection and as many rows as there are distinct words recognized by the system, can be prohibitively expensive to compute and store. A 100,000 document collection and 100,000 word vocabulary would require a matrix 400 GB in size. One can therefore make the following approximation to (3.16):

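$$p_\alpha(q \mid d) \approx \alpha\, p(q \mid D) + (1-\alpha) \sum_{w \in T_n(q)} l(w \mid d)\, \sigma(q \mid w) \qquad (3.20)$$

where $T_n(q) \stackrel{\text{def}}{=} \{\, w : \sigma(q \mid w) \text{ is among the } n \text{ largest } \sigma\text{-values for any } w \,\}$.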

Roughly speaking, Tn(q) is the set of n words most likely to map to q. In other words, (3.20) assumes that each document covers at most n concepts. In the performed experiments, n was set to 25. Making this approximation results in most values p(q | d) dropping to zero, yielding a sparse I matrix that is easy to store and precompute using conventional sparse-matrix techniques.
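The truncation can be sketched as below, under the assumption that σ is available as a dictionary keyed by (q, w) pairs and that I is stored as a dictionary of sparse rows. The names `top_n_sources` and `sparse_matrix`, the choice n = 2, and the probabilities are illustrative, not values from the experiments.

```python
import heapq
from collections import Counter, defaultdict


def top_n_sources(sigma, n):
    """T_n(q): for each q, the n words w with the largest sigma(q | w)."""
    by_target = defaultdict(list)
    for (q, w), prob in sigma.items():
        by_target[q].append((prob, w))
    return {
        q: {w for _, w in heapq.nlargest(n, pairs)}
        for q, pairs in by_target.items()
    }


def sparse_matrix(docs, sigma, n):
    """Store I[q][d] only where some word of d lies in T_n(q)."""
    t_n = top_n_sources(sigma, n)
    matrix = defaultdict(dict)
    for doc_id, words in docs.items():
        counts = Counter(words)
        total = sum(counts.values())
        for q, sources in t_n.items():
            value = sum(
                (counts[w] / total) * sigma[(q, w)]
                for w in sources
                if w in counts
            )
            if value > 0.0:  # zero cells are simply not stored
                matrix[q][doc_id] = value
    return matrix


# Hypothetical sigma(q | w) values, keyed as (q, w).
sigma = {
    ("car", "car"): 0.8,
    ("car", "automobile"): 0.6,
    ("car", "vehicle"): 0.3,
    ("car", "road"): 0.05,
}
docs = {
    "d1": ["automobile", "dealer"],
    "d2": ["road", "trip"],
    "d3": ["cooking", "pot"],
}
sparse_i = sparse_matrix(docs, sigma, n=2)
```

With n = 2, the weakly related word "road" falls outside T_n("car"), so `d2` loses its small exact contribution and only `d1` keeps an entry in the row for "car". This is the accuracy given up in exchange for sparsity.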

Of course, any approximation runs the risk of gaining speed at the cost of accuracy. To address this, the new ranking algorithm rescores (and reranks) the top-scoring documents according to (3.16).

Algorithm 6: “FastRank”: efficient document ranking

Input:  Query q = {q1, q2, ..., qm};
        Collection of documents D = {d1, d2, ..., dN};
        Word-relation probability σ(q | w) for all word pairs q, w;
        Inverted mapping ψ from words to documents
Output: Relevance score ρq(d) for each document d

1. Do for each document d ∈ D in the collection
2.     Set ρq(d) ← 1
3. Do for each query word q ∈ q
4.     Do for each document d ∈ D
5.         Set pα(q | d) ← α p(q | D) (precomputed)
6.         If d ∈ I(q) then pα(q | d) ← pα(q | d) + (1−α) p(q | d) (precomputed)
7.         Set ρq(d) ← ρq(d) × pα(q | d)
8. Rescore the top-ranking documents according to (3.16).
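The following is one possible reading of the algorithm in code, not the implementation used in the experiments. It assumes the sparse matrix I is given as a dictionary of rows, that the per-word values are multiplied into the running score, and that exact rescoring is supplied by the caller as a function; `fast_rank`, `rescore`, `top_k`, and all numeric values are hypothetical.

```python
def fast_rank(query, doc_ids, sparse_i, p_background, alpha=0.5,
              rescore=None, top_k=10):
    """Approximate ranking from precomputed sparse entries, with optional
    exact rescoring of the top_k documents."""
    scores = {d: 1.0 for d in doc_ids}           # initialise every score to 1
    for q in query:
        row = sparse_i.get(q, {})                # nonzero cells of I for q
        for d in doc_ids:
            p = alpha * p_background[q]          # smoothing term
            if d in row:                         # document has a stored cell
                p += (1 - alpha) * row[d]
            scores[d] *= p                       # accumulate the product
    ranked = sorted(doc_ids, key=lambda d: scores[d], reverse=True)
    if rescore is not None:
        head = ranked[:top_k]
        for d in head:                           # exact score for the leaders
            scores[d] = rescore(query, d)
        head.sort(key=lambda d: scores[d], reverse=True)
        ranked = head + ranked[top_k:]
    return ranked, scores


# Toy inputs, invented for illustration.
doc_ids = ["d1", "d2", "d3"]
sparse_i = {"car": {"d1": 0.3}, "dealer": {"d1": 0.2, "d3": 0.1}}
p_background = {"car": 0.01, "dealer": 0.02}

ranked, scores = fast_rank(["car", "dealer"], doc_ids, sparse_i, p_background)
```

Passing a `rescore` callable that evaluates the untruncated model on a single document reorders only the first `top_k` positions, so the exact computation is confined to a small, fixed number of documents per query.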

Figure 3.16 shows that, on the AP subset of the TREC dataset, the precision/recall performance of the fast, approximate algorithm is essentially indistinguishable from the naive, exact algorithm. But the former algorithm is considerably faster: on a 266 MHz workstation with 1.5 GB of physical memory, NaiveRank required over an hour per query while FastRank required an average of only 12 seconds per query.
