2.3 Document Retrieval
2.3.1 Probabilistic Information Retrieval
In this section we will first review basic probability theory and then give an overview of the most influential ideas in probabilistic IR. Generally speaking, the probabilistic model is theoretically sound and can therefore be a viable alternative to vector space approaches.
Probability Theory Revisited
Many of the later chapters rely on probabilistic models and language modelling techniques. We therefore begin with a short introduction to basic probability theory as a foundation for the more advanced concepts described later on.
In general, probability theory is concerned with events and how probable they are to occur. An event can be represented by a variable A, and its likelihood or probability⁵ is denoted by P(A), with 0 ≤ P(A) ≤ 1.
In the more interesting case of two or more events (let us say A and B), the probability of both occurring is given by their joint probability P(A, B). The conditional probability P(A|B), on the other hand, denotes the probability of event A given that event B occurred. The relationship between the two is given by the chain rule:
$$P(A, B) = P(A \cap B) = P(A \mid B)\,P(B) = P(B \mid A)\,P(A). \tag{2.14}$$
The probability of a joint event can be expressed as the probability of one of the events multiplied by the conditional probability of the other. Another important rule is the partition rule, which states that the probability of an event B is the sum of the probabilities of its disjoint sub-cases, here B occurring together with A and B occurring without A:⁶
$$P(B) = P(A, B) + P(\bar{A}, B). \tag{2.15}$$
⁵ The terms likelihood and probability are used as synonyms in this context. Likelihood usually denotes an observed set of events on which probability calculations are based.
These rules suffice to derive Bayes’ rule for inverting conditional probabilities:
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}. \tag{2.16}$$
The two most important concepts here are prior probability and posterior probability. The prior probability P(A) is the probability of an event in the absence of any other information. The posterior probability P(A|B) denotes the probability of the event after having seen the evidence B.
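As a small numerical check of Eqs. (2.14) to (2.16), the following Python sketch computes a prior and a posterior from a joint distribution over two binary events. The joint probabilities and the function names are illustrative assumptions and do not come from the text.

```python
# Illustrative joint distribution over two binary events A and B.
# Keys are (a, b) outcomes; the numbers are made up and sum to 1.
JOINT = {
    (True, True): 0.2,    # P(A, B)
    (True, False): 0.1,   # P(A, not B)
    (False, True): 0.2,   # P(not A, B)
    (False, False): 0.5,  # P(not A, not B)
}


def prior_a(joint):
    """P(A): sum the joint over both outcomes of B."""
    return joint[(True, True)] + joint[(True, False)]


def evidence_b(joint):
    """P(B) via the partition rule (2.15): P(A, B) + P(not A, B)."""
    return joint[(True, True)] + joint[(False, True)]


def posterior_a_given_b(joint):
    """P(A|B) via Bayes' rule (2.16)."""
    p_a = prior_a(joint)
    p_b_given_a = joint[(True, True)] / p_a  # chain rule (2.14), rearranged
    return p_b_given_a * p_a / evidence_b(joint)


print("prior P(A)       =", prior_a(JOINT))
print("posterior P(A|B) =", posterior_a_given_b(JOINT))
```

With these numbers the prior P(A) is 0.3, and observing B raises it to a posterior of 0.5 (up to floating-point rounding).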
Probability Ranking
In the context of documents, collections and queries, we can define a random variable $R_{d,q}$ representing the relevance of a document d given a query q. In the binary case, R has a value of 0 if a document is not relevant and a value of 1 if it is. The main idea behind probabilistic ranking is to rank documents according to their estimated probability of relevance P(R = 1|d, q).
The Binary Independence Model (BIM)
The binary independence model introduces assumptions that make the estimation of P(R|d, q) feasible. Both documents and queries are represented by Boolean vectors, with 1 indicating that a term is present in a document or query and 0 indicating that it is not. In addition, a term independence assumption is made: the model does not recognise associations or relations between terms. This assumption is far from correct but simplifies the process and provides satisfactory results in other applications such as probabilistic classification. The term independence assumption also corresponds to how documents are modelled in the vector space model. The BIM is formalised as:
$$P(R = 1 \mid x, q) = \frac{P(x \mid R = 1, q)\,P(R = 1 \mid q)}{P(x \mid q)}$$
$$P(R = 0 \mid x, q) = \frac{P(x \mid R = 0, q)\,P(R = 0 \mid q)}{P(x \mid q)} \tag{2.17}$$
where x denotes the binary term vector of a document and q the binary term vector of the query. P(x|R = 1, q) is the probability that a relevant document retrieved for the query has the representation x (and P(x|R = 0, q) is the corresponding probability for a non-relevant document). P(R = 1|q) and P(R = 0|q) denote the prior probabilities of retrieving a relevant and a non-relevant document for query q, respectively. Since a document is either relevant or non-relevant, these two priors sum to 1.
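To make Eq. (2.17) concrete, the following Python sketch evaluates the posterior probability of relevance for a binary document vector. It assumes, as one way of applying the term independence assumption, that P(x|R, q) factorises into one Bernoulli factor per query term. The per-term probabilities, the prior and all function names are hypothetical values chosen for illustration; the text does not say how they would be estimated.

```python
from math import prod


def likelihood(x, term_probs):
    """P(x | R, q) under term independence.

    x          -- binary vector, 1 if the term occurs in the document
    term_probs -- assumed probability of each term occurring, given R
    """
    return prod(p if present else 1.0 - p for present, p in zip(x, term_probs))


def relevance_posterior(x, p_rel, p_nonrel, prior_rel):
    """P(R = 1 | x, q) following Eq. (2.17)."""
    num_rel = likelihood(x, p_rel) * prior_rel
    num_nonrel = likelihood(x, p_nonrel) * (1.0 - prior_rel)
    # P(x | q) is the sum of both numerators (partition rule, Eq. 2.15).
    return num_rel / (num_rel + num_nonrel)


# Hypothetical parameters for a three-term query.
P_REL = [0.8, 0.3, 0.6]     # P(term present | R = 1, q)
P_NONREL = [0.2, 0.3, 0.1]  # P(term present | R = 0, q)
PRIOR_REL = 0.1             # P(R = 1 | q)

for doc_vector in ([1, 0, 1], [0, 0, 0]):
    print(doc_vector, relevance_posterior(doc_vector, P_REL, P_NONREL, PRIOR_REL))
```

Ranking documents by this posterior places the document containing the first and third query term above the document containing none of them.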
OKAPI BM25
The binary independence model was initially developed to rank documents of rather consistent length (i.e., the individual documents vary little in size). However, this assumption cannot be made in many contexts, and document lengths should then be taken into account in the retrieval model. BM25 is designed to handle varying document lengths. A comprehensive overview of the probabilistic relevance framework and BM25 methods is presented in [94]. BM25 uses the average document length as a component when scoring documents, which distinguishes it from classic tf-idf scoring. In addition, two internal parameters are needed, b and k1.
We assume that no relevance information is available. In that case, a close approximation of the classical idf is commonly used, with the constant 0.5 compensating for zero values:
$$\mathrm{idf}_t = \log \frac{N - \mathrm{df}_t + 0.5}{\mathrm{df}_t + 0.5}. \tag{2.18}$$
Additionally, soft document normalisation is introduced:
$$B := (1 - b) + b\,\frac{\mathrm{dl}}{\mathrm{avgdl}}, \quad 0 \le b \le 1, \tag{2.19}$$
taking into account the relative length of the current document (the dl/avgdl component, i.e., the ratio between the document length dl and the average document length avgdl). Setting b = 0 disables length normalisation entirely; b = 1 performs full document length normalisation.
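Substituting B from Eq. (2.19), we use the sum over all query terms t to score document d as follows:
$$\mathrm{score}_{\mathrm{BM25}}(q, d) = \sum_{t \in q} \frac{\mathrm{tf}_{t,d}}{k_1 \left(1 - b + b\,\frac{\mathrm{dl}}{\mathrm{avgdl}}\right) + \mathrm{tf}_{t,d}}\,\mathrm{idf}_t, \tag{2.20}$$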
where k1 is a tuning parameter that calibrates the scaling of the document term frequency (k1 = 0 corresponds to a binary model, whereas large values of k1 correspond to using raw term frequencies).
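The scoring function of Eqs. (2.18) to (2.20) can be sketched in a few lines of Python. This is a minimal illustration, not a tuned implementation: documents are assumed to be pre-tokenised lists of terms, the defaults k1 = 1.2 and b = 0.75 are commonly used starting values rather than values taken from the text, and the function name and toy collection are hypothetical.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every tokenised document in docs against the tokenised query."""
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(t for d in docs for t in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        norm = (1 - b) + b * len(d) / avgdl  # B from Eq. (2.19)
        score = 0.0
        for t in dict.fromkeys(query):  # each distinct query term once
            if tf[t] == 0:
                continue  # absent terms contribute nothing to the sum
            idf = math.log((n_docs - df[t] + 0.5) / (df[t] + 0.5))  # Eq. (2.18)
            score += tf[t] / (k1 * norm + tf[t]) * idf  # Eq. (2.20)
        scores.append(score)
    return scores


# Toy collection, made up for illustration.
TOY_DOCS = [
    ["okapi", "bm25", "ranking"],
    ["vector", "space", "model", "ranking", "ranking", "ranking"],
    ["language", "model"],
    ["boolean", "retrieval"],
    ["probabilistic", "retrieval"],
]
print(bm25_scores(["bm25", "ranking"], TOY_DOCS))
```

Note that the idf of Eq. (2.18) becomes negative for terms that occur in more than half of the documents, so a practical system may want to handle such terms separately.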
BM25 has a clear advantage over tf-idf-based methods because it takes document length into account. Its disadvantage is that the additional parameters k1 and b make it more difficult to tune.
Language Models for Information Retrieval
Language models for information retrieval have become a popular and competitive way of performing document ranking. The underlying idea is that a document is a good match for a query if this document is likely to generate the query. The idea of “generating a query” stems from the ability of a finite automaton to generate language. While this terminology is largely a matter of tradition, similarities exist between the two concepts. A language model places a probability measure over strings drawn from some vocabulary (single terms in the case of unigram language models). The resultant language model is a probability distribution, i.e., its probabilities add up to 1. It should be noted that the event space, i.e., the set of all possible word sequences, is infinite since the terms of a natural language can occur in any order. Each term in a document is assigned a probability, and the probability of a sequence of terms can be computed simply by multiplying these individual probabilities (assuming a unigram language model, which ignores the context of terms):
$$P_{\mathrm{uni}}(t_1 t_2 t_3) = P(t_1)\,P(t_2)\,P(t_3). \tag{2.21}$$
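A unigram model of this kind can be sketched as follows. The sketch assumes that term probabilities are estimated as relative frequencies in a token list (a maximum-likelihood estimate, which the text does not prescribe at this point); the function names and the example sentence are illustrative.

```python
from collections import Counter
from math import prod


def unigram_model(tokens):
    """Estimate P(t) as the relative frequency of t in tokens."""
    counts = Counter(tokens)
    total = len(tokens)
    return {t: c / total for t, c in counts.items()}


def sequence_probability(model, sequence):
    """P_uni(t1 ... tn) as the product of term probabilities, Eq. (2.21)."""
    # Terms the model has never seen get probability 0 in this sketch.
    return prod(model.get(t, 0.0) for t in sequence)


MODEL = unigram_model("the cat sat on the mat".split())
print(sequence_probability(MODEL, ["the", "cat"]))
print(sequence_probability(MODEL, ["cat", "the"]))  # same value: order is ignored
print(sequence_probability(MODEL, ["the", "dog"]))  # unseen term gives 0
```

Because the model ignores context, reordering a sequence does not change its probability, and a single unseen term drives the whole product to zero.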
The query likelihood language model is only one instance of the family of language modelling approaches. In this model, we construct a language model $\theta_d$ for each document in the collection. The main goal is to rank documents by P(d|q), where the probability of a document is interpreted as the likelihood that it is relevant to the query. Using Bayes’ rule from Eq. (2.16), we have:
$$P(d \mid q) = \frac{P(q \mid d)\,P(d)}{P(q)}. \tag{2.22}$$
P(q) is the same for all documents; it therefore cannot influence the ranking and can be dropped. The prior probability of a document P(d) is often treated as uniform over all documents, but could instead be assigned per document to reflect the importance of individual documents. With a uniform prior, documents are ranked by the query likelihood alone, which can now be computed as
$$P(q \mid \theta_d) = \prod_{t \in q} P(t \mid \theta_d)^{\mathrm{tf}_{t,q}}, \tag{2.23}$$
where $\mathrm{tf}_{t,q}$ denotes the (raw) frequency of term t in the query.⁷ The main open question now is how to estimate the probability of a term given a document language model, $P(t \mid \theta_d)$. It is important to point out that this probability must never be 0, since a single zero factor would make the total probability 0 (which in turn would make partial matches impossible). This is where smoothing comes into play: it guarantees that we never encounter zero probabilities. We employ Bayesian smoothing using Dirichlet priors, which has been shown to achieve superior performance on a variety of tasks and collections [125]:
$$P(t \mid \theta_d) = \frac{\mathrm{tf}_{t,d} + \mu\,P(t \mid \theta_c)}{|d| + \mu}, \tag{2.24}$$
where $\mathrm{tf}_{t,d}$ is the raw frequency of term t in document d and |d| is the size of the document, i.e., $\sum_t \mathrm{tf}_{t,d}$. $P(t \mid \theta_c)$ represents the global term probability, e.g., the probability of the term occurring in the collection, i.e., $\sum_d \mathrm{tf}_{t,d} / \sum_d |d|$. The smoothing parameter μ is subject to optimisation, but setting it to the average document length is usually a good starting point [59]. In practice, all terms are smoothed by their probability in the whole collection, and zero values are therefore avoided.⁸
⁷ For short keyword queries this virtually always equals 1 since they hardly ever contain the
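Putting Eqs. (2.23) and (2.24) together, query likelihood ranking with Dirichlet smoothing can be sketched as below. The sketch assumes pre-tokenised documents and a uniform document prior, and it works with logarithms to avoid numerical underflow when many small probabilities are multiplied. Skipping query terms that occur nowhere in the collection is a simplification of this sketch, and the function name and toy documents are hypothetical.

```python
import math
from collections import Counter


def rank_by_query_likelihood(query, docs, mu=None):
    """Return (doc_index, log P(q | theta_d)) pairs, best match first."""
    collection = Counter(t for d in docs for t in d)
    collection_size = sum(collection.values())
    if mu is None:
        mu = collection_size / len(docs)  # average document length as starting point
    query_tf = Counter(query)
    results = []
    for i, d in enumerate(docs):
        tf = Counter(d)
        log_p = 0.0
        for t, tf_q in query_tf.items():
            if collection[t] == 0:
                continue  # assumption: ignore terms unseen in the whole collection
            p_c = collection[t] / collection_size     # P(t | theta_c)
            p_t = (tf[t] + mu * p_c) / (len(d) + mu)  # Eq. (2.24)
            log_p += tf_q * math.log(p_t)             # log of Eq. (2.23)
        results.append((i, log_p))
    return sorted(results, key=lambda r: r[1], reverse=True)


# Toy collection, made up for illustration.
LM_DOCS = [
    ["language", "model", "retrieval"],
    ["vector", "space", "model"],
    ["language", "language", "model", "smoothing"],
]
for doc_index, log_p in rank_by_query_likelihood(["language", "retrieval"], LM_DOCS):
    print(doc_index, round(log_p, 3))
```

The second toy document contains neither query term, yet it still receives a finite score because both terms have non-zero probability in the collection model.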
Language Models vs. Classical Approaches
The language modelling approach has gained much interest and is theoretically sound and computationally tractable. On the other hand, the relations to traditional tf-idf models are significant [48]. The effect of smoothing by collection frequency can be compared to idf weighting. Even though LMs provide good performance, comparable to or even better than other weightings such as BM25, it is not established that this holds in general, or that proper parameter tuning could not bring traditional models to similar performance.