
4.2 Explicit Query Aspect Diversification

4.2.1 Probabilistic Objective

As limited representations of information needs and information items, respectively, queries and documents naturally introduce uncertainty into the estimation of relevance. Because it must also account for multiple information needs, the estimation of diversity exacerbates this problem. To provide an appropriate groundwork for reasoning under uncertainty, we devise a ranking objective for search result diversification in light of probability theory (Good, 1950).

Recalling the greedy approximation in Algorithm 3.1, given a query q and a ranking Rq of documents retrieved for this query, our goal is to iteratively build a new ranking Dq, with |Dq| ≤ τ, by selecting, at each iteration, the highest scored document d ∈ Rq \ Dq. To this end, we devise xQuAD’s scoring function according to the following probability mixture model:

\[
f_{\mathrm{xQuAD}}(q, d, D_q) = (1 - \lambda)\, p(d \mid q) + \lambda\, p(d, \bar{D}_q \mid q), \tag{4.1}
\]

where p(d|q) models the probability of observing the document d given the query q, and p(d, ¯Dq|q) models the probability of observing d but none of the documents already in Dq, selected in previous iterations, given q. These probabilities can be interpreted as estimations of the relevance and the diversity of d, respectively, with the parameter λ controlling the balance between the two.
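The effect of λ in Equation (4.1) can be illustrated with a minimal sketch. The function name `xquad_score` and the candidate probabilities below are hypothetical values chosen for illustration; they are not taken from the text.

```python
def xquad_score(p_rel: float, p_div: float, lam: float) -> float:
    """Mix a relevance estimate p(d|q) and a diversity estimate
    p(d, not-Dq | q) as in Equation (4.1)."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return (1.0 - lam) * p_rel + lam * p_div


# Hypothetical (relevance, diversity) estimates for two candidate documents.
candidates = {"d1": (0.9, 0.1), "d2": (0.6, 0.5)}

# For each value of lambda, record which candidate scores highest.
best_by_lambda = {
    lam: max(candidates, key=lambda d: xquad_score(*candidates[d], lam))
    for lam in (0.0, 0.5, 1.0)
}
print(best_by_lambda)
```

With λ = 0 the selection depends on relevance alone, with λ = 1 on diversity alone; intermediate values trade one off against the other.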

The probability of relevance, p(d|q), is defined in general terms, without any assumption regarding the underlying statistical mechanism used for estimation. In fact, any ranking approach can be used for this estimation, including the probabilistic ranking approaches as well as the machine-learned approaches introduced in Section 2.2, provided that they produce probabilistic scores. In turn, the probability of diversity, p(d, ¯Dq|q), models the contribution of a document d towards answering the query q, when d is provided jointly with the already selected documents in Dq, which are assumed to be non-relevant. In practice, this formulation models the marginal utility of the document d in light of the documents Dq, selected in the previous iterations of the greedy algorithm. As a result, maximising the probability p(d, ¯Dq|q) increases the chance that at least one relevant document is retrieved in response to the query, even when different users have different perceptions of this relevance (Sanner et al., 2011).

While estimating p(d|q) is comparatively simpler, the estimation of p(d, ¯Dq|q) requires further development. To this end, it is useful to consider a sample space comprising features (e.g., terms) representing the information carried by the documents in Rq, initially retrieved for q. As a result, d, Dq, and q can be seen as sets of such features or, equivalently, events in this sample space. In order to derive p(d, ¯Dq|q), we further partition the sample space into a set of pairwise disjoint sub-queries Sq = {s1, s2, · · · , sk}, with each sub-query s ∈ Sq representing one of the possible information needs underlying q. The resulting probability space is illustrated by the Venn diagram in Figure 4.2 for k = 4 sub-queries.

[Figure: Venn diagram over Rq with regions labelled q, d, and Dq, partitioned into the sub-queries s1, s2, s3, and s4.]

Figure 4.2: Sample space partitioned by sub-queries.

In Figure 4.2, we can identify the three events of interest, denoting the observation of the query q, the document d, and the already selected documents in Dq. The thicker line in the figure restricts the sample space given the observation of q. As a result, the intersection between this region and the region covered by the observation of a document can be seen as a measure of the probability that the document is relevant to the query. In particular, the intersection between the events d and q is highlighted in different shades: the darkest shade denotes the information represented by d that is also covered by the documents already selected in Dq; the lighter shades denote the novel information covered by document d, split across the considered sub-queries. Our goal is then to estimate the probability associated with the event (d \ Dq) ∩ q or, equivalently, p(d, ¯Dq|q).
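The set-based view of this sample space can be made concrete with a toy sketch. The feature names, the four sub-query blocks, and the use of a uniform probability over features are all assumptions made here for illustration; the text does not prescribe a particular measure.

```python
from fractions import Fraction

# Hypothetical feature universe, partitioned into k = 4 disjoint sub-queries.
sub_queries = {
    "s1": {"a", "b", "c"},
    "s2": {"d", "e"},
    "s3": {"f", "g", "h"},
    "s4": {"i", "j"},
}

# Hypothetical events, each a set of features.
q = {"a", "b", "d", "e", "f", "g", "i"}   # the query
d = {"a", "b", "d", "h"}                  # the candidate document
selected = {"b", "e", "f"}                # features covered by Dq


def p(event: set, given: set) -> Fraction:
    """Conditional probability under an assumed uniform measure over features."""
    return Fraction(len(event & given), len(given))


# Target event: (d \ Dq) ∩ q, i.e. novel information in d that matches q.
target = p(d - selected, q)

# The same mass, split across the sub-queries that partition the space.
per_sub_query = {s: p((d - selected) & feats, q) for s, feats in sub_queries.items()}

print(target, per_sub_query)
```

Because the sub-queries are pairwise disjoint and cover the space, the per-sub-query masses add up to the target probability exactly in this toy setting.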

After defining our target probability space, we can derive the probability of diversity, p(d, ¯Dq|q), in a series of steps, according to:

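\begin{align}
p(d, \bar{D}_q \mid q) &= \sum_{s \in S_q} p(d, \bar{D}_q, s \mid q) \tag{4.2} \\
&= \sum_{s \in S_q} p(s \mid q)\, p(d, \bar{D}_q \mid q, s) \tag{4.3} \\
&\approx \sum_{s \in S_q} p(s \mid q)\, p(d \mid q, s)\, p(\bar{D}_q \mid q, s) \tag{4.4} \\
&\approx \sum_{s \in S_q} p(s \mid q)\, p(d \mid q, s) \prod_{d_j \in D_q} p(\bar{d}_j \mid q, s) \tag{4.5} \\
&= \sum_{s \in S_q} p(s \mid q)\, p(d \mid q, s) \prod_{d_j \in D_q} \bigl(1 - p(d_j \mid q, s)\bigr). \tag{4.6}
\end{align}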

In order to derive Equation (4.2), we apply the sum rule and marginalise the probability p(d, ¯Dq|q) over the sub-queries s ∈ Sq. Equation (4.3) follows trivially from the product rule (Good, 1950). The resulting probability p(s|q) can be seen as modelling the importance of the sub-query s with respect to the other sub-queries in Sq. This notion could reflect, for instance, users’ preferences or the context of their search (Clarke et al., 2008; Agrawal et al., 2009).

In order to derive p(d, ¯Dq|q, s) in Equation (4.3), we assume that the observation of the document d is independent of the observation of the documents already selected in Dq (and, by extension, of ¯Dq), conditioned on the observation of the query q and the sub-query s. While this assumption is also present in the formulation of other diversification approaches in the literature (e.g., Agrawal et al., 2009; Carterette & Chandar, 2009), in reality, the knowledge of the documents that have already been selected affects the selection of the next document. On the other hand, this knowledge affects all candidate documents d ∈ Rq \ Dq equally, since Dq is fixed at each iteration. As a result, it seems plausible to refactor the probability p(d, ¯Dq|q, s) into a more tractable form. Note, however, that such a refactoring does not at all imply that redundancy is ignored in our formulation. Instead, it results in separate models of the coverage of each document d with respect to the sub-query s, i.e., p(d|q, s), and its novelty in light of how poorly this sub-query is covered by the already selected documents in Dq, i.e., p(¯Dq|q, s).

The conditional independence assumption in Equation (4.4) has a subtle but important implication: it turns the computation of novelty from a direct comparison between documents into an estimation of the marginal utility of any document satisfying each sub-query. In other words, instead of comparing a document d to all documents already selected in Dq, as implicit novelty-based diversification approaches would do (see Section 3.3.1), we estimate the utility of any document satisfying the sub-query s, as the probability that none of the already selected documents in Dq satisfy this sub-query. Although we achieve the same goal of promoting novelty, we do so in a much more efficient way. In particular, our approach does not require looking up all the terms contained in all documents from the initial ranking Rq, so as to enable their direct comparison. Instead, we just need to update the novelty estimation of a given sub-query, based on the estimation of how much this sub-query is already covered by the documents in Dq. In contrast to implicit approaches, this estimation only incurs a few additional inverted file lookups for the documents matching each of the sub-query terms.
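One way to realise this per-sub-query update is to keep a running novelty factor for each sub-query and multiply it by 1 − p(dj|q, s) whenever a document dj is selected. The sketch below assumes this bookkeeping scheme; the function name `update_novelty` and the probabilities are hypothetical.

```python
from typing import Dict


def update_novelty(novelty: Dict[str, float], p_selected: Dict[str, float]) -> None:
    """Fold a newly selected document into the per-sub-query novelty factors.

    novelty[s] holds the product of (1 - p(dj|q, s)) over the documents
    selected so far; p_selected[s] is p(dj|q, s) for the new document.
    Sub-queries the document does not cover are treated as probability 0.
    """
    for s in novelty:
        novelty[s] *= 1.0 - p_selected.get(s, 0.0)


# Before any selection, every sub-query is fully uncovered.
novelty = {"s1": 1.0, "s2": 1.0}

# Two hypothetical selections, given as coverage estimates p(dj|q, s).
update_novelty(novelty, {"s1": 0.8})
update_novelty(novelty, {"s1": 0.5, "s2": 0.25})

print(novelty)
```

Each update touches only the k sub-queries, so its cost does not depend on the number or length of the documents already in Dq.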

In order to derive p(¯Dq|q, s) in Equation (4.4), we make a second conditional independence assumption. In particular, we assume that the documents already selected in Dq are independently relevant to the sub-query s. This assumption seems reasonable, since novelty is estimated as the probability of the entire set Dq (as opposed to any particular document in Dq) not satisfying s. Lastly, for convenience, Equation (4.5) is rewritten as Equation (4.6), by replacing p(¯dj|q, s) with the complement of p(dj|q, s), i.e., 1 − p(dj|q, s). It is interesting to observe that this simple algebraic transformation emphasises the similarity of the probabilities p(d|q, s) and p(dj|q, s), which must be estimated as part of the computation of each document’s coverage and novelty, respectively.

The derivation of xQuAD’s relevance and diversity components in Equation (4.1) is further illustrated by the graphical models in Figures 4.3(a) and (b), respectively. Finally, by substituting Equation (4.6) into Equation (4.1), the final diversification objective of xQuAD can be expressed according to:

\[
f_{\mathrm{xQuAD}}(q, d, D_q) = (1 - \lambda)\, p(d \mid q) + \lambda \sum_{s \in S_q} p(s \mid q)\, p(d \mid q, s) \prod_{d_j \in D_q} \bigl(1 - p(d_j \mid q, s)\bigr). \tag{4.7}
\]
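Equation (4.7) can be plugged into a greedy selection loop along the following lines. This is a minimal sketch, not a transcription of Algorithm 3.1: the function name `xquad_rerank`, the dictionary-based inputs, the alphabetical tie-breaking, and all probability values are assumptions made for illustration.

```python
from typing import Dict, List


def xquad_rerank(
    p_rel: Dict[str, float],             # p(d|q) for each document in Rq
    p_sub: Dict[str, float],             # p(s|q) for each sub-query in Sq
    p_cov: Dict[str, Dict[str, float]],  # p(d|q, s); missing entries count as 0
    lam: float,
    tau: int,
) -> List[str]:
    """Greedily build Dq by maximising Equation (4.7) at each iteration."""
    novelty = {s: 1.0 for s in p_sub}    # product of (1 - p(dj|q, s)) over Dq
    remaining = set(p_rel)
    selected: List[str] = []

    def score(d: str) -> float:
        diversity = sum(
            p_sub[s] * p_cov.get(d, {}).get(s, 0.0) * novelty[s] for s in p_sub
        )
        return (1.0 - lam) * p_rel[d] + lam * diversity

    while remaining and len(selected) < tau:
        # Sorting first makes ties resolve alphabetically (an arbitrary choice).
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
        for s in p_sub:                  # update novelty for the next iteration
            novelty[s] *= 1.0 - p_cov.get(best, {}).get(s, 0.0)
    return selected


# Hypothetical estimates: d1 and d2 both cover s1, while d3 covers s2.
p_rel = {"d1": 0.9, "d2": 0.8, "d3": 0.5}
p_sub = {"s1": 0.6, "s2": 0.4}
p_cov = {"d1": {"s1": 0.9}, "d2": {"s1": 0.8}, "d3": {"s2": 0.7}}

print(xquad_rerank(p_rel, p_sub, p_cov, lam=0.0, tau=3))
print(xquad_rerank(p_rel, p_sub, p_cov, lam=0.8, tau=3))
```

In this toy setting, once d1 has been selected the novelty of s1 drops sharply, so with λ = 0.8 the less relevant d3, which covers the still uncovered s2, is promoted above d2.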

[Figure: two graphical models. Panel (a), p(d|q): nodes q and d, labelled relevance. Panel (b), Σ_{s∈Sq} p(s|q) p(d|q, s) Π_{dj∈Dq} p(¯dj|q, s): nodes q, s, d, and dj, with indices k and i − 1, labelled importance, coverage, and novelty.]

Figure 4.3: xQuAD’s graphical models of (a) relevance and (b) diversity, which are mixed for the selection of a document d ∈ Rq \ Dq at the i-th iteration of Algorithm 3.1.