Automatic Reformulation with Syntactic Operators

PART I Modeling Ambiguous Search Intent

CHAPTER 5 QUERY REFORMULATION WITH SYNTACTIC OPERATORS

5.3 Automatic Reformulation with Syntactic Operators

We formulate the automatic syntactic reformulation of keyword queries as a supervised learning problem. Specifically, we cast it as predicting the performance benefit of each candidate syntax query for a given keyword query.

Formally, given a keyword query $q$, a syntactic operator $op$ and a target performance metric $M$, our goal is to find a list of syntactic reformulations of $q$ with $op$, denoted $S_{op}(q) = q_1, q_2, \ldots, q_n$, ranked by $M$ in descending order: $M(q_1) > M(q_2) > \ldots > M(q_n)$. When the top $m$ suggestions are required, the system responds with the list $q_1, q_2, \ldots, q_m$. When a syntactic reformulation is required to directly optimize the search results, the system outputs the top-ranked query $q_1$ if $M(q_1) > M(q)$, and the original query $q$ otherwise. Training learns a function $f$ that takes a set of features of a syntax query $q_i$ and predicts the performance improvement $M'(q_i) = M(q_i) - M(q)$.
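The two output modes described above can be sketched as follows. This is a minimal illustration, not the system's implementation: the function names are hypothetical, and the gains are computed from the NDCG@10 values for query 313 in Table 5.2, standing in for the predictions of the learned function. It relies on the fact that $M(q_1) > M(q)$ is equivalent to a positive improvement for $q_1$.

```python
from typing import Dict, List


def top_suggestions(predicted_gain: Dict[str, float], m: int) -> List[str]:
    """Return the m candidates with the largest predicted improvement."""
    ranked = sorted(predicted_gain, key=predicted_gain.get, reverse=True)
    return ranked[:m]


def direct_reformulation(original: str, predicted_gain: Dict[str, float]) -> str:
    """Return the top-ranked candidate if it is predicted to beat the
    original query (positive improvement); otherwise keep the original."""
    if not predicted_gain:
        return original
    best = max(predicted_gain, key=predicted_gain.get)
    return best if predicted_gain[best] > 0 else original


# Illustrative gains: NDCG@10 of each candidate minus NDCG@10 of query 313
# (0.081), taken from Table 5.2 in place of real model predictions.
gains_313 = {
    "+commercial uses magnetic levitation": 0.077 - 0.081,
    "commercial uses +magnetic levitation": 0.081 - 0.081,
    "commercial uses magnetic +levitation": 0.333 - 0.081,
}

suggestions = top_suggestions(gains_313, 2)
chosen = direct_reformulation("commercial uses magnetic levitation", gains_313)
```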

Since the number of possible syntax queries is exponential for each syntactic operator, we limit the system to a single appearance of each operator. Although we explore only this limited use of the operators, the proposed methodology is general and potentially applicable to other filtering-based operators and to multiple uses of each operator.
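Under the single-appearance restriction, the candidate set stays linear in the query length. The sketch below enumerates candidates for the necessity and phrase operators. The function names are hypothetical, and two choices are assumptions rather than details given in the text: queries are tokenized on whitespace, and a phrase candidate quotes exactly two adjacent terms.

```python
from typing import List


def necessity_candidates(query: str) -> List[str]:
    """One candidate per term, with that single term marked as required (+)."""
    terms = query.split()
    return [
        " ".join(terms[:i] + ["+" + terms[i]] + terms[i + 1:])
        for i in range(len(terms))
    ]


def phrase_candidates(query: str) -> List[str]:
    """One candidate per pair of adjacent terms, with that pair quoted."""
    terms = query.split()
    candidates = []
    for i in range(len(terms) - 1):
        phrase = '"' + terms[i] + " " + terms[i + 1] + '"'
        candidates.append(" ".join(terms[:i] + [phrase] + terms[i + 2:]))
    return candidates


necessity = necessity_candidates("Oscar winner selection")
phrases = phrase_candidates("Oscar winner selection")
```

A query of n terms yields n necessity candidates and n - 1 phrase candidates under these assumptions.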

A natural way to solve the learning problem is to use a regression model, which learns to predict the performance of any possible syntax query. One problem with this method is that the performances of syntax queries reformulated from different keyword queries are usually not directly comparable. For instance, in the training data shown in Table 5.2, query 302 and query 313 perform very differently because of their intrinsic difficulties. To circumvent the problem, we adopt the learning-to-rank framework. In this setting, the syntax queries generated from each keyword query are considered as a group, and the loss function is defined on the ranking order of the members of each group instead of on the absolute value of the performance. This loss function is more natural in our problem setting and avoids the issue of incomparability. We then use the learned model to predict the potential performance benefit of each candidate syntactic reformulation. Specifically, we use Ranking SVM [41] as the learning method in our study.

Table 5.2: Examples of Training Samples for Syntactic Reformulation with Necessity Operator

QID     Syntax Query                                          NDCG@10
302     disease poliomyelitis polio under control world       0.163
302#1   +disease poliomyelitis polio under control world      0.105
302#2   disease poliomyelitis +polio under control world      0.203
302#3   disease poliomyelitis polio under +control world      0
313     commercial uses magnetic levitation                   0.081
313#1   +commercial uses magnetic levitation                  0.077
313#2   commercial uses +magnetic levitation                  0.081
313#3   commercial uses magnetic +levitation                  0.333
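The group-wise ranking setup can be illustrated with the pairwise transformation commonly used to train a Ranking SVM: preference pairs are formed only between candidates of the same keyword query, so scores from different queries are never compared. The sketch below is an illustration under assumptions, not the thesis code. The function name is hypothetical, the two-dimensional feature vectors are made-up placeholders, and only the NDCG@10 labels come from Table 5.2.

```python
from itertools import combinations
from typing import Dict, List, Tuple

Vector = List[float]


def pairwise_examples(
    groups: Dict[str, List[Tuple[Vector, float]]]
) -> List[Tuple[Vector, int]]:
    """Build (feature difference, preference) examples within each group.

    A label of +1 means the first candidate of the pair performed better,
    -1 means the second did. Ties are skipped because they express no
    preference. No pair ever spans two different keyword queries.
    """
    examples = []
    for members in groups.values():
        for (xa, ya), (xb, yb) in combinations(members, 2):
            if ya == yb:
                continue
            diff = [a - b for a, b in zip(xa, xb)]
            examples.append((diff, 1 if ya > yb else -1))
    return examples


# Placeholder feature vectors (hypothetical values); labels are the
# NDCG@10 scores of the candidates listed in Table 5.2.
training_groups = {
    "302": [([0.2, 1.1], 0.105), ([0.6, 2.3], 0.203), ([0.1, 0.7], 0.0)],
    "313": [([0.3, 0.9], 0.077), ([0.4, 1.0], 0.081), ([0.9, 2.8], 0.333)],
}

pairs = pairwise_examples(training_groups)
```

A linear classifier trained on these difference vectors then yields a scoring function whose ordering within each group is what the ranking loss evaluates.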

5.3.1 Features

We propose three types of features: difficulty, distinguishability and negativity. All three are defined in a general way so that they can be computed for any type of filtering-based operator. As in the previous discussion, we denote by $q$ the keyword query, by $op$ the target operator and by $q_x$ a syntactic reformulation of $q$ with operator $op$.

Difficulty. The difficulty feature aims to measure the intrinsic difficulty of a syntax query. Query difficulty prediction has been widely studied in recent years. We modify the clarity feature proposed by Cronen-Townsend et al. [20] slightly to make it applicable to syntax queries:

\[
\mathrm{Clarity}(q_x) = \mathrm{KL}(\theta_m \,\|\, \theta_C) = \sum_{w \in V} p(w \mid \theta_m) \log \frac{p(w \mid \theta_m)}{p(w \mid \theta_C)} \tag{5.1}
\]

where $\theta_m$ is the language model estimated from the set $S_m$ of documents matched by $q_x$, and $\theta_C$ is the language model estimated from the entire collection. Queries with high clarity are more likely to work well.
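As a rough illustration of Equation 5.1, the sketch below computes clarity from tokenized documents. It makes several assumptions that the text does not specify: both language models are unsmoothed maximum-likelihood unigram estimates, the logarithm is natural, and the matched documents are a subset of the collection (so every word with nonzero probability under $\theta_m$ also has nonzero probability under $\theta_C$). The function names and the toy collection are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def language_model(docs: Sequence[List[str]]) -> Dict[str, float]:
    """Maximum-likelihood unigram model over a set of tokenized documents."""
    counts = Counter(w for doc in docs for w in doc)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {w: c / total for w, c in counts.items()}


def clarity(matched_docs: Sequence[List[str]],
            collection_docs: Sequence[List[str]]) -> float:
    """KL divergence between the matched-set model and the collection model.

    Assumes matched_docs is a subset of collection_docs, so theta_c[w] is
    defined and positive for every word seen in the matched set.
    """
    theta_m = language_model(matched_docs)
    theta_c = language_model(collection_docs)
    return sum(p * math.log(p / theta_c[w]) for w, p in theta_m.items())


# Toy collection (hypothetical): three short tokenized documents.
collection = [
    ["oscar", "winner"],
    ["winner", "selection"],
    ["tax", "selection"],
]
focused = clarity([collection[0]], collection)
unfocused = clarity(collection, collection)
```

A matched set identical to the collection has clarity zero, while a narrower matched set diverges from the collection model and scores higher.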

In addition to clarity, we propose another difficulty feature, which is inspired by the inverse document frequency (IDF). We generalize this concept to compute the specificity of a syntax query:

\[
\mathrm{GIDF}(q_x) = \log \frac{N_c - N_{S_m} + 0.5}{N_{S_m} + 0.5} \tag{5.2}
\]

where $N_c$ and $N_{S_m}$ are the numbers of documents in the entire collection and in the matched set, respectively. Intuitively, a query matching fewer documents is more specific and meaningful.

To understand how these two features work, consider an example. Suppose we are given the query Oscar winner selection. To reformulate it with the necessity operator, we evaluate the queries +Oscar winner selection, Oscar +winner selection and Oscar winner +selection. We can expect the first query to have much higher clarity and GIDF, as the word “Oscar” is associated with fewer documents, which are more likely to focus on a particular topic. In reformulation with the phrase operator, we consider the candidates “Oscar winner” selection and Oscar “winner selection”. Similarly, we would expect the first query to have higher clarity and GIDF, as the documents matching “Oscar winner” are far fewer and more specific than the documents matching “winner selection”.
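Equation 5.2 is straightforward to evaluate once the two document counts are known. The sketch below is a direct transcription using the natural logarithm (the base is an assumption); the function name and the counts in the example are hypothetical, chosen only to mirror the intuition that a candidate matching fewer documents scores higher.

```python
import math


def gidf(n_collection: int, n_matched: int) -> float:
    """Generalized IDF of a syntax query, from the number of documents in
    the collection and the number of documents the query matches."""
    return math.log((n_collection - n_matched + 0.5) / (n_matched + 0.5))


# Hypothetical counts in a collection of 100,000 documents.
specific = gidf(100_000, 1_200)   # a candidate matching few documents
broad = gidf(100_000, 45_000)     # a candidate matching many documents
```

The score is positive while the query matches fewer than half of the collection and crosses zero at exactly half.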

Distinguishability. The distinguishability feature quantifies the changes a syntax query brings to the original query. For this purpose, we define the cross clarity between $\theta_m$ and $\theta_q$, the language model estimated from the search results of $q$:

\[
\mathrm{CrossClarity}(q_x, q) = \mathrm{KL}(\theta_m \,\|\, \theta_q) = \sum_{w \in V} p(w \mid \theta_m) \log \frac{p(w \mid \theta_m)}{p(w \mid \theta_q)} \tag{5.3}
\]

This feature measures the change in topical composition between the syntax query $q_x$ and the original query $q$.
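Cross clarity can be sketched in the same way as clarity, with one practical difference: a word may occur in the matched documents but not in the search results of the original query, which would make the ratio in Equation 5.3 undefined. The sketch below avoids this with additive smoothing over the joint vocabulary. The smoothing scheme, the value of `eps`, the natural logarithm and the function names are all assumptions, not details from the text.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence, Set


def smoothed_model(docs: Sequence[List[str]], vocab: Set[str],
                   eps: float) -> Dict[str, float]:
    """Unigram model with additive smoothing, defined on every vocab word."""
    counts = Counter(w for doc in docs for w in doc)
    total = sum(counts.values()) + eps * len(vocab)
    return {w: (counts[w] + eps) / total for w in vocab}


def cross_clarity(matched_docs: Sequence[List[str]],
                  original_docs: Sequence[List[str]],
                  eps: float = 1e-3) -> float:
    """KL divergence from the matched-set model to the model estimated
    from the original query's search results."""
    vocab = {w for doc in list(matched_docs) + list(original_docs) for w in doc}
    theta_m = smoothed_model(matched_docs, vocab, eps)
    theta_q = smoothed_model(original_docs, vocab, eps)
    return sum(theta_m[w] * math.log(theta_m[w] / theta_q[w]) for w in vocab)


# Toy tokenized results (hypothetical).
original_results = [["oscar", "winner", "list"], ["oscar", "selection", "rules"]]
close_candidate = [["oscar", "winner", "list"], ["oscar", "selection", "rules"]]
distant_candidate = [["lottery", "winner", "selection"]]

same = cross_clarity(close_candidate, original_results)
shifted = cross_clarity(distant_candidate, original_results)
```

A candidate whose matched documents mirror the original results scores zero, and the score grows as the matched set drifts to other vocabulary.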

Besides measuring content changes, we use another feature to measure the change in the ranking of documents. Specifically, we measure the correlation between the document rankings of $q$ and $q_x$:

\[
\mathrm{Cor}(q_x, q) = \frac{\#\mathrm{concord}(q_x, q) - \#\mathrm{discord}(q_x, q)}{\frac{1}{2} N_q (N_q - 1)} \tag{5.4}
\]

where $\mathrm{concord}(q_x, q)$ and $\mathrm{discord}(q_x, q)$ are the sets of concordant and discordant pairs between the two ranked lists of search results of $q_x$ and $q$, and $\#$ denotes the size of a set. The correlation feature quantifies the changes $q_x$ brings to the original ranking of $q$.
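Equation 5.4 has the form of a Kendall rank correlation, and a small sketch makes the pair counting concrete. Because a filtering operator may remove documents from the result list, the sketch restricts the comparison to documents present in both rankings and takes $N_q$ to be the number of such documents. That interpretation of $N_q$, the handling of lists with fewer than two shared documents, and the function name are assumptions.

```python
from itertools import combinations
from typing import List


def ranking_correlation(ranking_x: List[str], ranking_q: List[str]) -> float:
    """(#concordant - #discordant) / (N (N - 1) / 2) over shared documents.

    Each ranking is a list of distinct document ids, best first.
    """
    pos_x = {doc: i for i, doc in enumerate(ranking_x)}
    shared = [doc for doc in ranking_q if doc in pos_x]  # kept in q's order
    n = len(shared)
    if n < 2:
        return 0.0  # no pairs to compare; returning 0 is an assumption
    concord = discord = 0
    for a, b in combinations(shared, 2):  # a is ranked above b by q
        if pos_x[a] < pos_x[b]:
            concord += 1
        else:
            discord += 1
    return (concord - discord) / (0.5 * n * (n - 1))


unchanged = ranking_correlation(["d1", "d2", "d3"], ["d1", "d2", "d3"])
reversed_order = ranking_correlation(["d3", "d2", "d1"], ["d1", "d2", "d3"])
one_swap = ranking_correlation(["d1", "d2", "d3"], ["d1", "d3", "d2"])
```

The value ranges from -1 for a fully reversed ranking to 1 for an unchanged one.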

In the example of reformulating Oscar winner selection with the phrase operator, we have two candidate queries, “Oscar winner” selection and Oscar “winner selection”. The first query will have lower cross clarity and higher correlation with the original keyword query, as it captures the essential topic of the query. It is therefore the more suitable suggestion.

Negativity. The negativity feature measures the similarity between the syntax query $q_x$ and the negative documents (e.g., documents skipped during browsing). A query less similar to the negative documents naturally has a higher chance of working well.

Again, we define the content-based negativity feature as the cross clarity between the language model estimated from the matched documents $S_m$ and the language model estimated from the negative feedback documents $S_n$.

In addition, we define another negativity feature, the generalized inverse negative frequency (GINF):

\[
\mathrm{GINF}(S_m, S_n) = \log \frac{N_{S_n} + 0.5}{N_{S_m \cap S_n} + 0.5} \tag{5.5}
\]

In effect, this feature indicates the necessity of requiring a particular operator in the original keyword query.
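The GINF feature of Equation 5.5 needs only the size of the negative set and the size of its overlap with the matched set. The sketch below represents both sets by document ids and uses the natural logarithm; the function name, the id-based representation and the example ids are assumptions made for illustration.

```python
import math
from typing import Set


def ginf(matched_ids: Set[str], negative_ids: Set[str]) -> float:
    """Generalized inverse negative frequency: high when the syntax query
    filters out most of the negative documents."""
    overlap = len(matched_ids & negative_ids)
    return math.log((len(negative_ids) + 0.5) / (overlap + 0.5))


# Hypothetical ids of documents the user skipped for the original query.
negatives = {"d1", "d2", "d3", "d4"}

keeps_all_negatives = ginf({"d1", "d2", "d3", "d4", "d9"}, negatives)
drops_all_negatives = ginf({"d7", "d8", "d9"}, negatives)
```

A candidate that still matches every negative document scores zero, and the score rises as the operator excludes more of them.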

By computing negativity features with different operators, we are in effect “diagnosing” why the original query does not work well. For instance, if the negative documents of the query Oscar winner selection discuss “Oscar winner” at length but rarely mention “selection”, then the user probably intends to focus on the “selection” of the “Oscar winner”, while the system is biased towards other popular topics on “Oscar winner”. In this case, the query Oscar winner +selection might correct the mistake. Alternatively, if the negative documents mention all three keywords frequently but the phrase “Oscar winner” seldom occurs, the retrieval system has probably overlooked the fact that “Oscar winner” must be matched as a phrase to preserve its meaning. The query “Oscar winner” selection might therefore serve the user's purpose well.

5.3.2 Combining Operators in Prediction

Our proposed algorithm not only works for predicting each operator separately, but can also be applied to predict different operators jointly, as filters can be applied additively. We refer to this joint prediction method as Operator-Combination.

An alternative method is Result-Combination, in which we predict each operator separately and select the reformulation with the best predicted performance. In practice, we find this method both more effective and more efficient than Operator-Combination, as it considers far fewer candidates in prediction.
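Result-Combination can be sketched as a selection over the per-operator predictions. In the sketch below, the function name and the predicted improvements are hypothetical, and falling back to the original query when no candidate has a positive predicted improvement is an assumption carried over from the direct-optimization rule in Section 5.3.

```python
from typing import Dict


def result_combination(original: str,
                       gains_by_operator: Dict[str, Dict[str, float]]) -> str:
    """Pick the candidate with the best predicted improvement across all
    operators, each of which was scored separately."""
    best_query, best_gain = original, 0.0
    for gains in gains_by_operator.values():
        for candidate, gain in gains.items():
            if gain > best_gain:
                best_query, best_gain = candidate, gain
    return best_query


# Hypothetical predicted improvements for each operator's candidates.
predicted = {
    "necessity": {
        "+Oscar winner selection": -0.01,
        "Oscar winner +selection": 0.04,
    },
    "phrase": {
        '"Oscar winner" selection': 0.07,
        'Oscar "winner selection"': -0.03,
    },
}

choice = result_combination("Oscar winner selection", predicted)
```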

In document Intent modeling and automatic query reformulation for search engine systems (Page 67-71)