7. Probabilistic Approach
7.4 Bayesian Probability Models
7.4.5 Okapi (Term Weighting Based on Two-Poisson Model)
Robertson et al. [SIGIR ‘94] have developed a term weighting scheme based on the Poisson distribution. This scheme was first presented in the City University of London Okapi system. As it has proved to be one of the most successful weighting schemes in TREC competitions, it has been adopted by other TREC participants, and is generally identified by the system in which it was introduced, as Okapi weighting.
The Okapi approach starts with the view of a document as a random stream of term occurrences. Each term occurrence is a binary event with respect to a given term t. That is, there is a (typically small) probability p that the event will be an occurrence of t, and a probability q = 1 - p that the event is not an occurrence of t. Then, the probability of x occurrences (commonly called “successes”) of t in n terms is given by the binomial distribution [Hoel, 1971]. For very small p and very large n, the binomial distribution is well approximated by the Poisson distribution:
$$p(x) = \frac{e^{-\mu} \mu^{x}}{x!}$$
where µ is the mean of the distribution. To incorporate within-document term frequency tf, Robertson makes the fundamental assumption that the term frequency of a given term t is also given by a Poisson distribution, but that the mean of this distribution is different depending on whether the document is “about” t or not. It is assumed that each term t represents some “concept,” and that any document in which t occurs can be said to be either about or not about the given concept. Documents that are about t are said to be “elite” for t. Hence, Robertson assumes that there are two Poisson distributions for a given term t, one for the set of documents that are elite for t, the other for documents that are not elite for t. (This is why the Okapi weighting is said to be a 2-Poisson model.) The Poisson distribution for a given term t becomes:
$$p(tf) = \frac{e^{-m} m^{tf}}{tf!}$$
where m, the mean of the distribution, is either µ or λ, depending on whether the distribution is for documents elite for t (mean = µ) or documents that are not elite for t (mean = λ). Note that these two Poisson distributions give the probability of a given term frequency for a given term t in terms of document eliteness to t, not in terms of relevance to a given query. A query can contain multiple terms. A document contains many terms, and may be about multiple concepts. The usual assumptions about term independence, or Cooper’s “linked dependence,” are extended to eliteness; that is, the eliteness properties of any term t_i are assumed to be independent of those for any other term t_j.
Robertson defines the weight w for a given term t in terms of a log-odds function:
$$w = \log \frac{p_{tf}\, q_{0}}{q_{tf}\, p_{0}}$$
where p_tf is the probability of t being present with frequency tf given that the document is relevant to a given query, and q_tf is the probability of t being present with frequency tf given that the document is non-relevant to the given query. The p_0 and q_0 are the corresponding probabilities with t absent. Hence, p_tf/p_0 is not the odds of t being present in a relevant document as before, but the odds of t being present with a given tf as compared to not being present in a relevant document at all. (And similarly for q_tf/q_0 with respect to non-relevant documents.) When the Poisson distributions of t relative to document eliteness/non-eliteness given above are incorporated into this log-odds function of t relative to document relevance/non-relevance, the result is a rather complex function in terms of four difficult-to-estimate variables: p’, q’, µ and λ. Here, p’ is the probability that a given document is elite for t, given that it is relevant, i.e., P(document elite for t | R). Similarly, q’ = P(document elite for t | not R).
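The substitution of the two Poisson distributions into the log-odds weight can be sketched numerically. The Python sketch below assumes that, once eliteness is known, term frequency does not depend further on relevance, so that p_tf = p’ P(tf; µ) + (1 - p’) P(tf; λ), and likewise for q_tf with q’. The function names and the parameter values are hypothetical, chosen only for illustration, and are not estimates from any collection.

```python
import math

def poisson(tf: int, mean: float) -> float:
    """Poisson probability of observing `tf` occurrences given `mean`."""
    return math.exp(-mean) * mean ** tf / math.factorial(tf)

def mixture(tf: int, p_elite: float, mu: float, lam: float) -> float:
    """P(tf) as a mix of the elite (mean mu) and non-elite (mean lam) Poissons.

    Assumption: given eliteness, tf is independent of relevance.
    """
    return p_elite * poisson(tf, mu) + (1.0 - p_elite) * poisson(tf, lam)

def two_poisson_weight(tf: int, p_prime: float, q_prime: float,
                       mu: float, lam: float) -> float:
    """w = log((p_tf * q_0) / (q_tf * p_0)) with mixture probabilities."""
    p_tf = mixture(tf, p_prime, mu, lam)   # relevant documents
    q_tf = mixture(tf, q_prime, mu, lam)   # non-relevant documents
    p_0 = mixture(0, p_prime, mu, lam)
    q_0 = mixture(0, q_prime, mu, lam)
    return math.log((p_tf * q_0) / (q_tf * p_0))

# Hypothetical parameters: p' = 0.8, q' = 0.1, mu = 5.0, lam = 0.5.
P_PRIME, Q_PRIME, MU, LAM = 0.8, 0.1, 5.0, 0.5
weights = [two_poisson_weight(tf, P_PRIME, Q_PRIME, MU, LAM) for tf in range(9)]

# Limit of the weight as tf grows without bound (when mu > lam).
asymptote = math.log(
    (P_PRIME * mixture(0, Q_PRIME, MU, LAM))
    / (Q_PRIME * mixture(0, P_PRIME, MU, LAM))
)

for tf, w in enumerate(weights):
    print(tf, round(w, 4))
print("asymptote", round(asymptote, 4))
```

With p’ > q’ and µ > λ, the weight is zero at tf = 0, rises with tf, and levels off toward a finite limit. This is the shape that the simpler approximation functions are meant to imitate.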
Robertson converts this difficult-to-compute term weight function into a more practical function. His basic strategy is to replace complex functions by much simpler functions of term frequency that have approximately the same shape, e.g., the same behavior at tf = 0, the same behavior as tf increases and grows large, etc. His approximation starts with the traditional log-odds function for presence/absence of t, as derived from the relevance/non-relevance contingency table in 7.4.1 (Binary independence). This is multiplied (in effect, “corrected” or “improved”) by a simple approximation function for term weight in a document as a function of tf, a function that approximates the shape of the true 2-Poisson function. The approximation contains a “tuning constant,” k1, in the denominator, whose value (determined by experimentation) influences the shape of the curve. Then, the weight function is multiplied by a similar approximation function for the query, i.e., a function of within-query term frequency, qtf. This function also contains a tuning constant, k3.
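The shape of these two approximation functions can be illustrated with a short Python sketch. It assumes the forms commonly published for BM25, (k1 + 1) tf / (k1 + tf) for the document and (k3 + 1) qtf / (k3 + qtf) for the query, before any document length correction is applied. The function names are hypothetical, and the default constants are the values quoted from Robertson et al. later in this section.

```python
def tf_factor(tf: float, k1: float = 1.2) -> float:
    """Document-side factor: 0 at tf = 0, 1 at tf = 1, tends to k1 + 1."""
    return (k1 + 1.0) * tf / (k1 + tf)

def qtf_factor(qtf: float, k3: float = 7.0) -> float:
    """Query-side factor with the same saturating shape, tuned by k3."""
    return (k3 + 1.0) * qtf / (k3 + qtf)

# Each extra occurrence adds less than the one before it.
for tf in (0, 1, 2, 5, 20, 100):
    print(tf, round(tf_factor(tf), 3), round(qtf_factor(tf), 3))
```

A small k1 makes the curve flatten quickly, so that the mere presence of a term dominates. A very large k3 makes the query factor nearly linear in qtf over the range of realistic query term frequencies.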
To improve the approximation further, Robertson takes document length into account. He offers two broad hypotheses to account for variation in document length. The “Verbosity hypothesis” is the hypothesis that longer documents simply cover the same material as corresponding shorter documents but with more words, or (more fairly) cover the same topic in more detail. (This is the hypothesis that underlies most document vector normalization schemes discussed above.) The “Scope hypothesis” is the hypothesis that longer documents deal with more topics than shorter documents. (This is the hypothesis that underlies most work with document “passages.”) Obviously, each hypothesis can be correct in some cases, and indeed, in other cases, both hypotheses may be correct, i.e., a document may be longer than another both because it uses more words to discuss a given topic, and because it discusses a greater number of topics. Hence, Robertson refines his approximation to allow the user to take either or both hypotheses into account, as appropriate. First, on the basis of the Verbosity hypothesis, he wants the weight function to be independent of document length. On the simple common assumption that term frequency is proportional to document length, he multiplies k1 by dl_i, the length of the document D_i under consideration, so that all terms will increase proportionally with document length, and the weight function will remain unchanged. Then, on the assumption that the value of k1 has been chosen for the average document, he further normalizes k1, dividing it by dl_avg, the average document length for the collection under consideration. Then, he modifies this normalization factor with another tuning constant, b, into a composite constant K = k1((1 - b) + b(dl_i/dl_avg)). The constant b, also determined by experiment, controls the extent to which the Verbosity hypothesis applies (b = 1) or does not apply (b = 0).
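The effect of b can be checked with a small Python sketch. It assumes the document-side factor takes the form (k1 + 1) tf / (K + tf), as in the commonly published BM25 formula. The function names and the two example documents are hypothetical.

```python
import math

def composite_k(k1: float, b: float, dl: float, avdl: float) -> float:
    """K = k1 * ((1 - b) + b * dl / avdl)."""
    return k1 * ((1.0 - b) + b * dl / avdl)

def tf_factor(tf: float, dl: float, avdl: float,
              k1: float = 1.2, b: float = 0.75) -> float:
    """Length-normalized document factor (k1 + 1) * tf / (K + tf)."""
    return (k1 + 1.0) * tf / (composite_k(k1, b, dl, avdl) + tf)

AVDL = 100.0
short_doc = {"tf": 3, "dl": 100.0}    # average-length document
verbose_doc = {"tf": 6, "dl": 200.0}  # same content, twice the words

for b in (1.0, 0.75, 0.0):
    short = tf_factor(short_doc["tf"], short_doc["dl"], AVDL, b=b)
    verbose = tf_factor(verbose_doc["tf"], verbose_doc["dl"], AVDL, b=b)
    print(b, round(short, 4), round(verbose, 4))

# With b = 1 the doubled tf and doubled length cancel exactly.
same = math.isclose(
    tf_factor(3, 100.0, AVDL, b=1.0), tf_factor(6, 200.0, AVDL, b=1.0)
)
```

With b = 1, a document that says the same thing in twice as many words receives the same factor as the shorter one. With b = 0, document length is ignored and the longer document is rewarded for its higher raw term frequency.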
To compute document-query similarity for a given document, D_i, the term weights determined by the above approximation function are added together for all query terms that match terms in D_i. Finally, to this sum Robertson adds a “global correction term” that depends only on the terms in the query, and not at all on whether they match terms in D_i. This correction term reflects the influence of document length variation, i.e., departure from the average length, on the weight of each query term. The correction term contains yet another tuning constant, k2.
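The final result, first used in TREC-3 [Robertson et al., 1995] and TREC-4 [Robertson et al., 1996], is called BM25; the BM stands for “Best Match” and the 25 is the version number, reflecting the evolution of this term weighting scheme. The BM25 function for computing the similarity between a query Q and a document D_i is:
$$\mathrm{SIM}(Q, D_i) = \sum_{t \in Q} \log \frac{(r + 0.5)/(R - r + 0.5)}{(n - r + 0.5)/(N - n - R + r + 0.5)} \cdot \frac{(k_1 + 1)\, tf}{K + tf} \cdot \frac{(k_3 + 1)\, qtf}{k_3 + qtf} \; + \; k_2 \cdot |Q| \cdot \frac{avdl - dl}{avdl + dl}$$
where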
Summation is over all terms t in query Q
r = number of documents relevant to Q containing term t
R = number of documents relevant to Q
n = number of documents containing t
N = number of documents in the given collection
tf = frequency (number of occurrences) of t in D_i
qtf = frequency of t in Q
avdl = average document length in the given collection
dl = length of D_i, e.g., the number of terms, or the number of indexed terms, in D_i
|Q| = number of terms in Q
k1, k2, k3, and K are tuning constants as described above.
K = k1((1 - b) + b(dl/avdl)), where b is another tuning parameter (dl and avdl are the document length and average document length defined in this list).
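The quantities in this list can be tied together in a minimal Python sketch of BM25 scoring. It assumes the commonly published form of the function: a log-odds relevance weight with 0.5 added to each count, multiplied by the document and query term frequency factors, plus the k2 correction term. The function names, the choice to count |Q| as the number of query tokens, and the collection statistics in the example are hypothetical. When no relevance information is available, r and R are taken to be 0.

```python
import math
from collections import Counter

def relevance_weight(n: int, N: int, r: int = 0, R: int = 0) -> float:
    """log(((r+0.5)/(R-r+0.5)) / ((n-r+0.5)/(N-n-R+r+0.5)))."""
    return math.log(
        ((r + 0.5) / (R - r + 0.5)) / ((n - r + 0.5) / (N - n - R + r + 0.5))
    )

def bm25(query, doc, doc_freq, N, avdl,
         k1=1.2, b=0.75, k2=0.0, k3=7.0, rel_freq=None, R=0):
    """Score one document (list of tokens) against a query (list of tokens).

    doc_freq maps term -> n; rel_freq optionally maps term -> r.
    """
    rel_freq = rel_freq or {}
    tf_counts, qtf_counts = Counter(doc), Counter(query)
    dl = len(doc)
    K = k1 * ((1.0 - b) + b * dl / avdl)

    score = 0.0
    for term, qtf in qtf_counts.items():
        tf = tf_counts[term]
        if tf == 0:
            continue  # only query terms that match the document contribute
        w = relevance_weight(doc_freq[term], N, rel_freq.get(term, 0), R)
        score += w * ((k1 + 1.0) * tf / (K + tf)) * ((k3 + 1.0) * qtf / (k3 + qtf))

    # Global correction term: depends on the query length and the document
    # length, not on which terms match.
    score += k2 * len(query) * (avdl - dl) / (avdl + dl)
    return score

# Hypothetical collection statistics, for illustration only.
doc_freq = {"okapi": 10, "weighting": 100}
query = ["okapi", "weighting"]
doc_a = ["okapi", "weighting", "scheme", "okapi"]
doc_b = ["poisson", "model", "two", "means"]

score_a = bm25(query, doc_a, doc_freq, N=1000, avdl=4.0)
score_b = bm25(query, doc_b, doc_freq, N=1000, avdl=4.0)
print(score_a, score_b)
```

In this sketch a document that matches no query term scores 0 when k2 = 0, and the rarer term ("okapi", with the smaller n) contributes more per occurrence than the more common one.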
Varieties of Okapi BM25 have continued to be used down through TREC-9, both by its originators [Robertson et al., 2000] and others, due to its effectiveness. According to Robertson et al., “k1 and b default to 1.2 and 0.75 respectively, but smaller values of b are sometimes advantageous; in long queries k3 is often set to 7 or 1000 (effectively infinite).” k2 has often been set to 0, e.g., in TREC-4 and TREC-9.