Model Learning - Learning Intent-aware Ranking Models

7.2 Intent-aware Search Result Diversification

7.2.3 Learning Intent-aware Ranking Models

7.2.3.1 Model Learning

In order to produce an intent-aware model p(d|q, s, ι) for each intent ι underlying the sub-query s, we once again resort to machine learning. In particular, we deploy a large set of document features, and leave it to a learning to rank algorithm to generate ranking models optimised for different intents. To achieve this goal, each model is learned using the entire feature set, but with a different training set of queries for each target intent. Given the intents considered in our investigation (i.e., informational and navigational), we use two intent-targeted query sets from the TREC 2009 Million Query track (Carterette et al., 2009b). The first set contains 70 informational queries and the second set contains 70 navigational queries, as judged by TREC assessors. As a learning algorithm, we use AFS (Metzler, 2007), as described in Section 2.2.3.2. In our experiments, it is deployed to optimise mean average precision (MAP; Equation (2.50)).
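As an illustration of the training objective, the following is a minimal sketch of how MAP could be computed over one intent-targeted query set. It assumes the standard binary-relevance definition of average precision, which may differ in detail from Equation (2.50). The function names and toy data are hypothetical and not taken from the original experiments.

```python
from typing import Dict, List, Set


def average_precision(ranking: List[str], relevant: Set[str]) -> float:
    """Average precision of one ranked list against a set of relevant doc ids."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant document
    return precision_sum / len(relevant)


def mean_average_precision(rankings: Dict[str, List[str]],
                           qrels: Dict[str, Set[str]]) -> float:
    """Mean of the per-query average precision over all queries in `rankings`."""
    if not rankings:
        return 0.0
    total = sum(average_precision(docs, qrels.get(qid, set()))
                for qid, docs in rankings.items())
    return total / len(rankings)


# Toy data (hypothetical): two queries with their rankings and judgements.
toy_rankings = {"q1": ["d1", "d2", "d3"], "q2": ["d4", "d5"]}
toy_qrels = {"q1": {"d1", "d3"}, "q2": {"d5"}}
toy_map = mean_average_precision(toy_rankings, toy_qrels)
```

Under this sketch, each intent-aware model would be tuned against the MAP of its own query set (informational or navigational), with the feature set held fixed across the two.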

7.2.3.2 Document Features

To enable the generation of effective intent-aware ranking models, we deploy a total of 60 document features, summarised in Table 7.2. Besides the query-dependent features previously described in Table 5.3, we include field-based extensions of BM25 and PL2, namely, BM25F (Zaragoza et al., 2004) and PL2F (Macdonald et al., 2006). As additional query-independent features, we include two URL features, UD and UW, denoting the number of digits in the URL of the document and whether this URL comes from Wikipedia, respectively. We also include two link analysis features: ER (Becchetti et al., 2006), denoting the likelihood that the outlinks of the document are reciprocated, and the score produced by the Absorbing Model (AM; Plachouras et al., 2005), a link analysis algorithm based on absorbing Markov chains (Kemeny & Snell, 1960). Each feature is computed for a sample of 5000 documents retrieved by DPH (Equation (2.31)).

Table 7.2: Document features used for learning intent-aware ranking models.

Feature  Description                                        Equation  Total

Query-dependent
CLM      Full and per-field CLM score                       (2.5)     5
BM25     Full, per-field, and field-based BM25 score        (2.13)    6
LM       Full and per-field LM score                        (2.25)    5
MRF      Full MRF score                                     (2.20)    8
PL2      Full, per-field, and field-based PL2 score         (2.29)    6
DPH      Full and per-field DPH score                       (2.31)    5
pBiL     Full pBiL score                                    (2.32)    8

Query-independent
UC       Presence of host, domain, path, and query string   (2.33)    4
UL       Length of URL host, path, and query string         (2.35)    3
UD       Number of digits in the host and domain            -         2
UW       Whether the URL is from Wikipedia                  -         1
HL       Ham (non-spam) likelihood                          (2.42)    1
ID       Indegree                                           (2.43)    1
OD       Outdegree                                          (2.44)    1
PR       Original and transposed PageRank score             (2.45)    2
AM       Absorbing Model score                              -         1
ER       Edge reciprocity score                             -         1

Grand total                                                           60
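To make the URL-based features more concrete, the following is a rough sketch of how UL, UD, and UW might be extracted from a document URL. The function name, the dictionary keys, and the Wikipedia host test are assumptions made for illustration. The domain-level variant of UD is omitted because it requires a rule for separating the registered domain from the host, which is not specified here.

```python
from urllib.parse import urlsplit


def url_features(url: str) -> dict:
    """Illustrative URL features; key names are hypothetical."""
    parts = urlsplit(url)
    host = parts.hostname or ""  # lower-cased host, or "" if absent
    return {
        # UL: length of the URL host, path, and query string.
        "UL_host": len(host),
        "UL_path": len(parts.path),
        "UL_query": len(parts.query),
        # UD: number of digits, here counted in the host only.
        "UD_host": sum(ch.isdigit() for ch in host),
        # UW: 1 if the host is wikipedia.org or one of its subdomains
        # (an assumed test), 0 otherwise.
        "UW": int(host == "wikipedia.org" or host.endswith(".wikipedia.org")),
    }
```

For example, `url_features("http://en.wikipedia.org/wiki/Obama?x=1")` yields a host length of 16, a path length of 11, a query length of 3, no digits in the host, and UW set to 1.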

Table 7.3 lists the top 10 features as they were selected by AFS for each of our produced intent-aware models. For each feature, we show its attained performance in terms of MAP when combined with the features selected before it. From the table, we observe that the top features are generally intuitive. For instance, DPH (which is used to generate the learning sample) is the top feature for both models. Likewise, as expected, various URL and link analysis features (e.g., UW, UL, AM, PR, IL) are ranked high in the navigational model. Besides producing intuitive intent-aware models, we believe that our data-driven approach based on a large set of features provides a more robust alternative to hand-picking features traditionally associated with each intent. Lastly, it is worth noting that, although the choice of appropriate feature sets naturally depends on how learning instances (i.e., sub-queries) and labels (i.e., intents) are represented, our approach is agnostic to these representations. Indeed, while instantiating it for a different aspect representation or a different set of intents may require devising different features, no modification to the approach itself would be necessary.
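The incremental reading of Table 7.3, where each feature is scored in combination with those selected before it, resembles a greedy forward selection. The sketch below illustrates that general idea only. It is not the AFS algorithm of Metzler (2007): how a feature combination is weighted and scored is hidden behind a caller-supplied `evaluate` function, and all names and toy gains are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple


def greedy_forward_selection(
    candidates: Sequence[str],
    evaluate: Callable[[List[str]], float],
    max_features: int = 10,
) -> List[Tuple[str, float]]:
    """Repeatedly add the feature that most improves `evaluate`.

    Returns the selected features in order, each paired with the score
    attained when combined with the features selected before it.
    """
    selected: List[str] = []
    trace: List[Tuple[str, float]] = []
    remaining = list(candidates)
    best_so_far = float("-inf")
    while remaining and len(selected) < max_features:
        # Score every remaining feature on top of the current selection.
        scored = [(evaluate(selected + [f]), f) for f in remaining]
        score, feature = max(scored, key=lambda pair: pair[0])
        if score <= best_so_far:
            break  # no remaining feature improves the objective
        selected.append(feature)
        remaining.remove(feature)
        best_so_far = score
        trace.append((feature, score))
    return trace


# Toy objective (hypothetical): each feature contributes a fixed gain.
toy_gains = {"a": 0.20, "b": 0.05, "c": 0.0}
toy_trace = greedy_forward_selection(
    list(toy_gains), lambda feats: sum(toy_gains[f] for f in feats)
)
```

In this toy run, feature `c` adds nothing and is never selected, so the trace contains only `a` and `b`, in that order.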

Table 7.3: Top 10 document features in the informational and navigational models.

     Informational                Navigational
     Feature              MAP     Feature           MAP
 1   DPH                  0.261   DPH               0.211
 2   UD                   0.275   MRF (body)        0.227
 3   PL2 (title)          0.282   BM25 (title)      0.241
 4   BM25 (field-based)   0.291   UW                0.252
 5   pBiL (body)          0.296   CLM               0.259
 6   pBiL (anchor)        0.298   UL                0.263
 7   ER                   0.300   AM                0.267
 8   LM (title)           0.301   PR (transposed)   0.269
 9   CLM (body)           0.302   IL                0.272
10   CLM                  0.303   pBiL (body)       0.274

7.3 Experimental Evaluation

In this section, we address the third claim from our thesis statement: “By maximising the relevance of the retrieved documents to multiple sub-queries, a high coverage of these sub-queries can be achieved.”

To address this claim, we evaluate the effectiveness of our intent-aware approach to improve the coverage estimates leveraged by the xQuAD framework.⁴

In particular, we aim to answer the following research questions:

Q1. Can we improve diversification performance with our model selection regime?
Q2. Can we improve diversification performance with our model merging regime?

In the following, Section 7.3.1 details the experimental setup that supports the investigation of these questions, including the test collections, the diversification baselines, and the classification approaches used by the two regimes, as well as the procedure carried out for training and evaluating all approaches. The results of this investigation are discussed in Section 7.3.2.

⁴ While the estimated relevance of a document with respect to a sub-query also impacts xQuAD's estimation of novelty, we leave the analysis of this component to Chapter 8.
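To show where the coverage and novelty estimates enter the ranking, the following is a minimal sketch of xQuAD's greedy re-ranking, assuming the usual formulation in which a candidate document d is scored as (1 − λ) p(d|q) + λ Σ_s p(s|q) p(d|q, s) Π_{d_j ∈ S} (1 − p(d_j|q, s)), with S the set of already selected documents. The function and parameter names, the default value of λ, and the toy probabilities are hypothetical.

```python
from typing import Dict, List


def xquad_rerank(
    rel: Dict[str, float],                 # p(d|q) for each candidate document
    sub_importance: Dict[str, float],      # p(s|q) for each sub-query
    sub_rel: Dict[str, Dict[str, float]],  # p(d|q,s), keyed by s and then d
    lam: float = 0.5,                      # relevance/diversity trade-off
    k: int = 10,
) -> List[str]:
    """Greedily select up to k documents, balancing relevance and diversity."""
    selected: List[str] = []
    # Novelty: probability that sub-query s is still unsatisfied by `selected`.
    not_covered = {s: 1.0 for s in sub_importance}
    candidates = set(rel)
    while candidates and len(selected) < k:

        def score(d: str) -> float:
            diversity = sum(
                sub_importance[s] * sub_rel.get(s, {}).get(d, 0.0) * not_covered[s]
                for s in sub_importance
            )
            return (1.0 - lam) * rel[d] + lam * diversity

        # Sorting makes tie-breaking deterministic.
        best = max(sorted(candidates), key=score)
        selected.append(best)
        candidates.remove(best)
        for s in sub_importance:
            not_covered[s] *= 1.0 - sub_rel.get(s, {}).get(best, 0.0)
    return selected


# Toy example (hypothetical): d1 and d2 cover s1, whereas d3 covers s2.
toy_order = xquad_rerank(
    rel={"d1": 0.9, "d2": 0.8, "d3": 0.3},
    sub_importance={"s1": 0.5, "s2": 0.5},
    sub_rel={"s1": {"d1": 0.9, "d2": 0.8}, "s2": {"d3": 0.9}},
    lam=0.8,
)
```

In the toy example, d3 is promoted above the more relevant d2 because d1 has already largely satisfied s1. In this sketch, an intent-aware model would only change the values supplied as `sub_rel`.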

7.3.1 Experimental Setup

In this section, we describe the specific setup that supports our investigation in Section 7.3.2, as an extension of the general methodology described in Section 5.1.

7.3.1.1 Test Collections

Our analysis is based on the WT09 and WT10 test collections, described in Table 5.1, comprising 49 and 48 queries from the diversity task of the TREC 2009 and 2010 Web tracks (Clarke et al., 2009a, 2010), respectively. For each of these 97 queries, we consider both the TREC Web track sub-topics (WT) as well as query suggestions provided by Bing (BS) as alternative sub-query sets. Both the WT and BS sub-query sets are described in Section 5.2.1.3. In particular, as discussed in Section 7.2.2.2, the WT sub-query set provides judged intent labels for each sub-query, which can be contrasted to our performance-oriented labelling of training data. Finally, as a document corpus, we consider the category B portion of ClueWeb09, as described in Section 5.1.1.

7.3.1.2 Diversification Baselines

As diversification baselines for the experiments in Section 7.3.2, we consider two deployments of our xQuAD framework. Each of these deployments uniformly applies one of the informational (Uni(inf)) or the navigational (Uni(nav)) models described in Section 7.2.3.1 for all sub-queries, regardless of the intent of each sub-query. Using either the Uni(inf) or the Uni(nav) model, xQuAD is deployed to diversify the top 1000 documents retrieved by the DPH ranking model (Equation (2.31)), which serves itself as a non-diversification, relevance-only baseline.

7.3.1.3 Classification Approaches

In Section 7.2.2, we introduced two regimes for exploiting the inferred intents of different sub-queries: model selection and model merging. The model selection regime builds upon a hard classification of intents. To enable a thorough evaluation, we consider variants of this regime of the form Sel(c,l), where c and l denote a classifier and a set of classification training labels, respectively. In particular, c can be one of three classifiers: an oracle (ora), which simulates a perfect classification of the intent of each sub-query, a support vector machine (svm) classifier with a polynomial kernel (Platt, 1998), and a multinomial logistic regression (log) with a ridge estimator (le Cessie & van Houwelingen, 1992). Regarding the classification labels l, as described in Section 7.2.2.2, we consider both human judgements (judg) as well as the selection with best diversification performance (perf) on the training data. In all cases, the single most likely intent is chosen for each sub-query, in a typical selective fashion.

To enable our second regime, model merging (Mrg(c,l)), we fit the output of the SVM classifier to a logistic regression model, hence obtaining a full probability distribution over intents for each aspect underlying the query (Witten & Frank, 2005). To cope with the high dimensionality of our sub-query feature set, classification is performed after a dimensionality reduction via principal component analysis (Pearson, 1901). All classification tasks are performed using Weka.
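The difference between the two regimes can be sketched as follows, under the assumption that model selection uses only the score of the single most likely intent, whereas model merging takes a mixture of the intent-aware scores weighted by the classifier's probability for each intent. The precise formulation is given in Section 7.2.2, so the mixture form used here, along with all names and toy values, should be read as an illustrative assumption.

```python
from typing import Dict


def select_model_score(intent_probs: Dict[str, float],
                       model_scores: Dict[str, float]) -> float:
    """Model selection: use the score of the single most likely intent."""
    best_intent = max(intent_probs, key=intent_probs.get)
    return model_scores[best_intent]


def merge_model_scores(intent_probs: Dict[str, float],
                       model_scores: Dict[str, float]) -> float:
    """Model merging (assumed form): probability-weighted mixture of scores."""
    total = sum(intent_probs.values())
    if total == 0.0:
        return 0.0
    # Normalise so that the intent probabilities sum to one.
    return sum(p / total * model_scores[intent]
               for intent, p in intent_probs.items())


# Toy sub-query (hypothetical): mostly informational, partly navigational.
toy_probs = {"inf": 0.7, "nav": 0.3}
toy_scores = {"inf": 0.2, "nav": 0.8}  # p(d|q,s,intent) from each model
toy_selected = select_model_score(toy_probs, toy_scores)
toy_merged = merge_model_scores(toy_probs, toy_scores)
```

In the toy case, selection ignores the navigational model altogether, while merging lets its high score raise the combined estimate in proportion to the classifier's confidence.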

In document Explicit web search result diversification (Page 176-180)