Okapi (Term Weighting Based on Two-Poisson Model)

7.4.5 Okapi (Term Weighting Based on Two-Poisson Model)

Robertson et al. [SIGIR ’94] have developed a term weighting scheme based on the Poisson distribution. This scheme was first presented in the City University of London Okapi system. As it has proved to be one of the most successful weighting schemes in TREC competitions, it has been adopted by other TREC participants, and is generally identified by the system in which it was introduced, as Okapi weighting.

The Okapi approach starts with the view of a document as a random stream of term occurrences. Each term occurrence is a binary event with respect to a given term t. That is, there is a (typically small) probability p that the event will be an occurrence of t, and a probability q = 1-p that the event is not an occurrence of t. Then, the probability of x occurrences (commonly called “successes”) of t in n terms is given by the binomial distribution [Hoel, 1971]. For very small p and very large n, the binomial distribution is well approximated by the Poisson distribution:

$$p(x) = \frac{e^{-\mu}\,\mu^{x}}{x!}$$

where µ is the mean of the distribution. To incorporate within-document term frequency tf, Robertson makes the fundamental assumption that the term frequency of a given term t is also given by a Poisson distribution, but that the mean of this distribution is different depending on whether the document is “about” t or not. It is assumed that each term t represents some “concept,” and that any document in which t occurs can be said to be either about or not about the given concept. Documents that are about t are said to be “elite” for t. Hence, Robertson assumes that there are two Poisson distributions for a given term t, one for the set of documents that are elite for t, the other for documents that are not elite for t. (This is why the Okapi weighting is said to be a 2-Poisson model.) The Poisson distribution for a given term t becomes:

$$p(tf) = \frac{e^{-m}\,m^{tf}}{(tf)!}$$

where m, the mean of the distribution, is either µ or λ depending on whether the distribution is for documents elite for t (mean = µ), or documents that are not elite for t (mean = λ). Note that these two Poisson distributions give the probability of a given term frequency for a given term t in terms of document eliteness for t, not in terms of relevance to a given query. A query can contain multiple terms. A document contains many terms, and may be about multiple concepts. The usual assumptions about term independence, or Cooper’s “linked dependence,” are extended to eliteness; that is, the eliteness properties of any term t_i are assumed to be independent of those for any other term t_j.

Robertson defines the weight w for a given term t in terms of a logodds function:

$$w = \log\frac{p_{tf}\,q_{0}}{q_{tf}\,p_{0}}$$

where p_tf is the probability of t being present with frequency tf given that the document is relevant to a given query, and q_tf is the probability of t being present with frequency tf given that the document is non-relevant to the given query. The p₀ and q₀ are the corresponding probabilities with t absent. Hence, p_tf/p₀ is not the odds of t being present in a relevant document as before, but the odds of t being present with a given tf as compared to not being present in a relevant document at all. (And similarly for q_tf/q₀ with respect to non-relevant documents.) When the Poisson distributions of t relative to document eliteness/non-eliteness given above are incorporated into this logodds function of t relative to document relevance/non-relevance, the result is a rather complex function in terms of four difficult-to-estimate variables: p’, q’, µ and λ. Here, p’ is the probability that a given document is elite for t, given that it is relevant, i.e., P(document elite for t | R). Similarly, q’ = P(document elite for t | not R).
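The following is a minimal Python sketch of the quantities described above, not Robertson's own code. It assumes that p_tf and q_tf are mixtures of the elite and non-elite Poisson distributions, weighted by p’ and q’ respectively, which is how the four variables p’, q’, µ and λ enter the logodds weight. The function names and the parameter values in the usage note are hypothetical, chosen only for illustration.

```python
import math


def binomial_pmf(x: int, n: int, p: float) -> float:
    """Probability of exactly x occurrences of t in a stream of n terms."""
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)


def poisson_pmf(x: int, mean: float) -> float:
    """Poisson probability of x occurrences for a distribution with the given mean."""
    return math.exp(-mean) * mean ** x / math.factorial(x)


def two_poisson_prob(tf: int, p_elite: float, mu: float, lam: float) -> float:
    """Assumed mixture: elite documents have mean mu, non-elite documents mean lam."""
    return p_elite * poisson_pmf(tf, mu) + (1 - p_elite) * poisson_pmf(tf, lam)


def two_poisson_weight(tf: int, p_prime: float, q_prime: float,
                       mu: float, lam: float) -> float:
    """Logodds weight w = log(p_tf * q_0 / (q_tf * p_0))."""
    p_tf = two_poisson_prob(tf, p_prime, mu, lam)  # relevant documents
    q_tf = two_poisson_prob(tf, q_prime, mu, lam)  # non-relevant documents
    p_0 = two_poisson_prob(0, p_prime, mu, lam)
    q_0 = two_poisson_prob(0, q_prime, mu, lam)
    return math.log((p_tf * q_0) / (q_tf * p_0))
```

With illustrative values such as p’ = 0.8, q’ = 0.1, µ = 5 and λ = 0.5, the weight is zero at tf = 0 and rises with tf toward a ceiling. This is the shape that the simpler approximation is meant to reproduce.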

Robertson converts this difficult-to-compute term weight function into a more practical function. His basic strategy is to replace complex functions by much simpler functions of term frequency that have approximately the same shape, e.g., the same behavior at tf=0, the same behavior as tf increases and grows large, etc. His approximation starts with the traditional logodds function for presence/absence of t, as derived from the relevance/non-relevance contingency table in 7.4.1 (Binary independence). This is multiplied (in effect, “corrected” or “improved”) by a simple approximation function for term weight in a document as a function of tf, a function that approximates the shape of the true 2-Poisson function. The approximation contains a “tuning constant,” k₁, in the denominator, whose value (determined by experimentation) influences the shape of the curve. Then, the weight function is multiplied by a similar approximation function for the query, i.e., a function of within-query term frequency, qtf. This function also contains a tuning constant, k₃.
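The shape of these two approximation functions can be sketched in Python as follows. The sketch assumes the commonly published forms (k₁+1)tf/(k₁+tf) and (k₃+1)qtf/(k₃+qtf); the function names are hypothetical, and the default values are the ones quoted later in this section.

```python
def tf_factor(tf: float, k1: float = 1.2) -> float:
    """Document-side factor: 0 at tf=0, 1 at tf=1, approaches k1+1 as tf grows."""
    return (k1 + 1) * tf / (k1 + tf)


def qtf_factor(qtf: float, k3: float = 7.0) -> float:
    """Query-side factor with the same shape, controlled by k3."""
    return (k3 + 1) * qtf / (k3 + qtf)
```

A small k₁ makes the curve flatten quickly, so repeated occurrences of a term add little. A very large k₃ makes the query-side factor nearly linear in qtf.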

To improve the approximation further, Robertson takes document length into account. He offers two broad hypotheses to account for variation in document length. The “Verbosity hypothesis” is that longer documents simply cover the same material as corresponding shorter documents but with more words, or (more fairly) cover the same topic in more detail. (This is the hypothesis that underlies most document vector normalization schemes discussed above.) The “Scope hypothesis” is that longer documents deal with more topics than shorter documents. (This is the hypothesis that underlies most work with document “passages.”) Obviously, each hypothesis can be correct in some cases, and in other cases both may be correct, i.e., a document may be longer than another both because it uses more words to discuss a given topic, and because it discusses a greater number of topics.

Hence, Robertson refines his approximation to allow the user to take either or both hypotheses into account, as appropriate. First, on the basis of the Verbosity hypothesis, he wants the weight function to be independent of document length. On the simple common assumption that term frequency is proportional to document length, he multiplies k₁ by dl_i, the length of the i-th document D_i under consideration, so that all terms of the function increase proportionally with document length, and the weight function remains unchanged. Then, on the assumption that the value of k₁ has been chosen for the average document, he further normalizes k₁, dividing it by dl_avg, the average document length for the collection under consideration. Then, he modifies this normalization factor with another tuning constant, b, into a composite constant K = k₁((1-b) + b(dl_i/dl_avg)). The constant b, also determined by experiment, controls the extent to which the Verbosity hypothesis applies: fully (b=1) or not at all (b=0).
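A short Python sketch of the composite constant K follows. The function name is hypothetical, and the tf/(K + tf) shape used in the usage note is an assumed form of the term-frequency approximation.

```python
def length_normalized_k(k1: float, b: float, dl: float, avdl: float) -> float:
    """Composite constant K = k1 * ((1 - b) + b * dl / avdl)."""
    return k1 * ((1 - b) + b * dl / avdl)
```

For a document of average length, K equals k₁ for any value of b. With b = 1, doubling both the document length and tf leaves tf/(K + tf) unchanged, which is the length independence the Verbosity hypothesis calls for. With b = 0, K is simply k₁ and document length is ignored.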

To compute document-query similarity for a given document, D_i, the term weights determined by the above approximation function are added together for all query terms that match terms in D_i. Finally, to this sum Robertson adds a “global correction term” that depends only on the terms in the query, and not at all on whether they match terms in D_i. This correction term reflects the influence of document length variation, i.e., departure from the average length, with respect to the weight of each query term. The correction term contains yet another tuning constant, k₂.

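The final result, first used in TREC-3 [Robertson et al., 1995] and TREC-4 [Robertson et al., 1996], is called BM25; the BM stands for “Best Match” and the 25 is the version number, reflecting the evolution of this term weighting scheme. The BM25 function for computing the similarity between a query Q and a document D_i is:

$$\mathrm{BM25}(Q, D_i) = \sum_{t \in Q} \log\frac{(r+0.5)/(R-r+0.5)}{(n-r+0.5)/(N-n-R+r+0.5)} \cdot \frac{(k_1+1)\,tf}{K+tf} \cdot \frac{(k_3+1)\,qtf}{k_3+qtf} \;+\; k_2\,|Q|\,\frac{avdl-dl}{avdl+dl}$$

where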

Summation is over all terms t in query Q

r = number of documents relevant to Q containing term t

R = number of documents relevant to Q

n = number of documents containing t

N = number of documents in the given collection

tf = frequency (number of occurrences) of t in D_i

qtf = frequency of t in Q

avdl = average document length in the given collection

dl = length of D_i, e.g., the number of terms, or the number of indexed terms, in D_i

|Q| = number of terms in Q

k₁, k₂, and k₃ are tuning constants as described above.

K = k₁((1-b) + b(dl/avdl)), where b is another tuning constant; dl and avdl are the quantities written dl_i and dl_avg earlier in this section.
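The definitions above can be combined into a scoring function, sketched here in Python. The sketch assumes the commonly published form of BM25: a logodds weight with 0.5 corrections, multiplied by the tf and qtf factors, plus the k₂ correction term. It also assumes that no relevance information is available (r = R = 0), and it takes |Q| to be the number of term occurrences in the query. The names `rsj_weight`, `bm25_score` and `doc_freq` are hypothetical.

```python
import math
from collections import Counter


def rsj_weight(n: int, N: int, r: int = 0, R: int = 0) -> float:
    """Logodds weight from the relevance contingency table, with 0.5 corrections."""
    return math.log(((r + 0.5) / (R - r + 0.5)) /
                    ((n - r + 0.5) / (N - n - R + r + 0.5)))


def bm25_score(query_terms, doc_terms, doc_freq, N, avdl,
               k1=1.2, b=0.75, k2=0.0, k3=7.0):
    """Score one document (a list of terms) against a query (a list of terms).

    doc_freq maps each term to n, the number of documents containing it.
    """
    tf_counts = Counter(doc_terms)
    qtf_counts = Counter(query_terms)
    dl = len(doc_terms)
    K = k1 * ((1 - b) + b * dl / avdl)

    score = 0.0
    for term, qtf in qtf_counts.items():
        tf = tf_counts.get(term, 0)
        if tf == 0:
            continue  # a non-matching query term contributes nothing to the sum
        w1 = rsj_weight(doc_freq.get(term, 0), N)
        score += w1 * ((k1 + 1) * tf / (K + tf)) * ((k3 + 1) * qtf / (k3 + qtf))

    # Global correction term: depends on the query size and the document length only.
    score += k2 * len(query_terms) * (avdl - dl) / (avdl + dl)
    return score
```

For example, `bm25_score(["okapi", "ranking"], ["okapi", "bm25", "okapi", "ranking"], {"okapi": 10, "ranking": 200}, N=1000, avdl=5)` scores a short document that matches both query terms. A term that occurs in more than half of the collection gets a negative weight from `rsj_weight` when r = R = 0, so an implementation may choose to clamp or replace that weight.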

Varieties of Okapi BM25 have continued to be used through TREC-9, both by its originators [Robertson et al., 2000] and by others, because of its effectiveness. According to Robertson et al., “k₁ and b default to 1.2 and 0.75 respectively, but smaller values of b are sometimes advantageous; in long queries k₃ is often set to 7 or 1000 (effectively infinite).” k₂ has often been set to 0, e.g., in TREC-4 and TREC-9.

Source: Information Retrieval: A Survey (pages 68-71)