Query-dependent Multiple Kernel Learning - Learning-To-Rank with Semantic Kernels

7.2 Learning-To-Rank with Semantic Kernels

7.2.4 Query-dependent Multiple Kernel Learning

In Sec. 7.1.3, we utilize each clause kernel as a base kernel and, given a hypothesis $H = \{c_1, \dots, c_n\}$ as a set of clauses, we treat the learning problem as multiple kernel learning (MKL), where the combined kernel is specified as:

$$K(x_i, x_j) = \sum_{c \in H} d_c \, K_c(x_i, x_j)$$

Here $d_c$ denotes the weight of clause $c$, which reflects the importance of the clause in the combination of multiple kernels and is optimized in the learning process. Adapting this MKL method to our LTR domain introduces a difficulty: the weights $d_c$ are identical for all data points, so the diversity among the queries is not taken into consideration and the method cannot be applied directly to our case. Most LTR approaches share this problem: they employ a single unified function as the ranking model for the documents retrieved for all queries, without considering the diversity among the queries. Queries can differ from each other in their association with and closeness to the various clauses, so a unified model can be biased toward some queries. Instead, we aim to capture the semantic information contained in the queries through the weights they assign to every clause, and thereby to reveal the characteristics of different queries. Recalling our example above, one query can be more related to movie actors while another gives more weight to location information. This is in line with recent LTR proposals [128, 129], which point out that the key to addressing the LTR problem is to look at it from the viewpoint of the query. Thus query-level stability and generalization are needed, unlike in most LTR approaches, which treat every data point in the training set as independent and identically distributed.
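The fixed-weight combination $K(x_i, x_j) = \sum_{c \in H} d_c K_c(x_i, x_j)$ from the start of this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two clause kernels (`linear`, `overlap`), the clause names and the weight values are hypothetical stand-ins and are not taken from our experiments.

```python
from typing import Callable, Dict, Sequence

Features = Sequence[float]
Kernel = Callable[[Features, Features], float]


def combined_kernel(x_i: Features, x_j: Features,
                    kernels: Dict[str, Kernel],
                    d: Dict[str, float]) -> float:
    """K(x_i, x_j) = sum over clauses c of d_c * K_c(x_i, x_j)."""
    return sum(d[c] * k(x_i, x_j) for c, k in kernels.items())


# Hypothetical clause kernels standing in for K_c (assumption for illustration).
def linear(a: Features, b: Features) -> float:
    return sum(p * q for p, q in zip(a, b))


def overlap(a: Features, b: Features) -> float:
    return float(sum(1 for p, q in zip(a, b) if p == q))


kernels = {"c1": linear, "c2": overlap}
d = {"c1": 0.75, "c2": 0.25}  # one weight per clause, shared by all queries
value = combined_kernel([1.0, 2.0], [3.0, 2.0], kernels, d)
```

Note that the single dictionary `d` is applied to every pair of data points regardless of the query they belong to, which is exactly the limitation discussed above.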

For this purpose we localize the weights of every clause kernel by making them query-specific. This results in a localized multiple kernel learning, as introduced by Gönen et al. [83]: instead of specifying a weight vector $d$ that is identical for all data points, it uses a gating function to assign different weights to each base kernel for every particular data point. Formally, we define a query-dependent weighting function $\eta_c : Q \to \mathbb{R}$, which assigns to every query $q_i \in Q$ a value for clause $c$. That is to say, a query $q$ may prefer one clause over another, and this preference is reflected in the weights $\{\eta_c(q)\}_{c \in H}$. Therefore, the multiple kernel in our LTR problem is defined as:

$$K(x_i, x_j) = \sum_{c \in H} \eta_c(q_i) \, \eta_c(q_j) \, K_c(x_i, x_j) \quad \text{s.t.} \quad \sum_{c \in H} \eta_c(q_i) = 1, \;\; \forall q_i \in Q$$

where $x_i, x_j \in X$ have the form $x_i = \langle q_i, d_k, d_l \rangle$, $x_j = \langle q_j, d_p, d_q \rangle$, and $K_c$ is a pairwise clause kernel of hypothesis $H$. Here the input data are grouped by query, and $\eta_c(q_i)$ is the query-dependent weight function, which assigns the same query weight to all input data belonging to the query group $q_i$. In this way, the query-dependent multiple kernel $K(x_i, x_j)$ takes into consideration not only the generality of different queries but also the specialty of each query: the generality refers to the relevant features, i.e. clauses, extracted in the learning process, whereas the specialty refers to the learned query weights that reflect the importance of each query as a combination of multiple kernels. The query-dependent weight function $\eta_c(q_i)$, $q_i \in Q$, can be expressed as:

$$\eta_c(q_i) = \frac{\exp(v_{ci})}{\sum_{c' \in H} \exp(v_{c'i})} \qquad (7.8)$$

where $c'$ ranges over all clauses in $H$, including clause $c$, and $v_{ci}$ denotes the weight value for clause $c$ w.r.t. query $q_i$.
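To make Eq. 7.8 and the query-dependent kernel concrete, the following minimal Python sketch computes the softmax weights of a query and then combines the clause kernels for two data points from different query groups. The data layout (a point as a pair of query id and feature vector), the clause kernels `linear` and `overlap`, and all numeric values are assumptions made for illustration only.

```python
import math
from typing import Callable, Dict, Sequence, Tuple

Features = Sequence[float]
Point = Tuple[str, Features]  # (query id, pair features); hypothetical layout
Kernel = Callable[[Features, Features], float]


def query_weights(v_q: Dict[str, float]) -> Dict[str, float]:
    """Eq. 7.8: softmax over the per-clause values v_{ci} of one query."""
    m = max(v_q.values())  # subtracting the maximum avoids overflow in exp
    exps = {c: math.exp(val - m) for c, val in v_q.items()}
    z = sum(exps.values())
    return {c: e / z for c, e in exps.items()}


def query_dependent_kernel(x_i: Point, x_j: Point,
                           kernels: Dict[str, Kernel],
                           v: Dict[str, Dict[str, float]]) -> float:
    """K(x_i, x_j) = sum_c eta_c(q_i) * eta_c(q_j) * K_c(x_i, x_j)."""
    eta_i = query_weights(v[x_i[0]])
    eta_j = query_weights(v[x_j[0]])
    return sum(eta_i[c] * eta_j[c] * k(x_i[1], x_j[1])
               for c, k in kernels.items())


# Hypothetical clause kernels and per-query values (assumptions).
def linear(a: Features, b: Features) -> float:
    return sum(p * q for p, q in zip(a, b))


def overlap(a: Features, b: Features) -> float:
    return float(sum(1 for p, q in zip(a, b) if p == q))


kernels = {"c1": linear, "c2": overlap}
v = {"q1": {"c1": 0.0, "c2": 0.0},             # q1 weights both clauses equally
     "q2": {"c1": math.log(3.0), "c2": 0.0}}   # q2 prefers clause c1
eta_q2 = query_weights(v["q2"])
k_value = query_dependent_kernel(("q1", (1.0, 2.0)), ("q2", (3.0, 2.0)),
                                 kernels, v)
```

Because the weights come from a softmax, they are positive and sum to one for every query, so the constraint $\sum_{c \in H} \eta_c(q_i) = 1$ holds by construction and only the unconstrained values $v_{ci}$ need to be learned.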

To solve the query-dependent multiple kernel problem, the two-step alternating optimization algorithm specified in Sec. 7.1.3 can still be used. In the first step, the SVM coefficients $\{\alpha_i\}_{i=1}^{n}$ are optimized with $\eta_c(q_i)$ fixed. In the second step, the query-dependent function $\eta_c(q_i)$ is updated with $\{\alpha_i\}_{i=1}^{n}$ fixed.

The dual form of this problem for the first step is:

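$$\max_{\alpha} \; -\frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j) + \sum_{i=1}^{n} \alpha_i \quad \text{s.t.} \quad \sum_{i=1}^{n} \alpha_i y_i = 0, \;\; C \ge \alpha_i \ge 0$$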
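As a small illustration of the first step, the sketch below evaluates the dual objective and checks the two constraints for a given coefficient vector and a precomputed Gram matrix. It does not solve the optimization problem; in practice a standard SVM solver operating on the precomputed kernel matrix would be used. The function names and the toy values are assumptions.

```python
from typing import Sequence


def dual_objective(alpha: Sequence[float], y: Sequence[int],
                   K: Sequence[Sequence[float]]) -> float:
    """-1/2 * sum_ij alpha_i alpha_j y_i y_j K_ij + sum_i alpha_i."""
    n = len(alpha)
    quad = sum(alpha[i] * alpha[j] * y[i] * y[j] * K[i][j]
               for i in range(n) for j in range(n))
    return -0.5 * quad + sum(alpha)


def is_feasible(alpha: Sequence[float], y: Sequence[int],
                C: float, tol: float = 1e-9) -> bool:
    """Check sum_i alpha_i y_i = 0 and 0 <= alpha_i <= C (up to tol)."""
    balanced = abs(sum(a * t for a, t in zip(alpha, y))) <= tol
    boxed = all(-tol <= a <= C + tol for a in alpha)
    return balanced and boxed


# Toy example with two training pairs and an identity Gram matrix (assumption).
K = [[1.0, 0.0], [0.0, 1.0]]
y = [1, -1]
alpha = [0.5, 0.5]
j_value = dual_objective(alpha, y, K)
feasible = is_feasible(alpha, y, C=1.0)
```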

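In the second step we replace the clause kernel weights $d_c$ with $\eta_c(q_i)$ and optimize the following:

$$\max_{\eta_c(q_i)} \; -\frac{1}{2} \sum_{i,j=1}^{n} \alpha^*_i \alpha^*_j y_i y_j K(x_i, x_j) + \sum_{i=1}^{n} \alpha^*_i \quad \text{s.t.} \quad \sum_{c \in H} \eta_c(q_i) = 1, \;\; \eta_c(q_i) \ge 0$$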

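where $\alpha^*$ are the coefficients obtained from the first optimization step. In the gradient descent optimization process, we differentiate the dual function $J$ w.r.t. every $v_{ci}$:

$$\frac{\partial J}{\partial v_{ci}} = -\frac{1}{2} \sum_{k,l=1}^{n} \sum_{c' \in H} \alpha^*_k \alpha^*_l y_k y_l \, \eta_{c'}(q_k) \, \eta_{c'}(q_l) \, K_{c'}(x_k, x_l) \left( 2\delta(c = c') - \eta_c(q_k) - \eta_c(q_l) \right)$$

where $\delta(c = c')$ is 1 if $c = c'$ and 0 otherwise, and apply an updating scheme such as $v_{ci} \leftarrow v_{ci} + \gamma \frac{\partial J}{\partial v_{ci}}$ with step size $\gamma$.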

Since we build the model from pairwise training data whose label only indicates whether the first document is preferred over the second, our final prediction function has the same form: it can only predict the preference between two documents of a given query in the test data. In other words, given a test data point $x_t = \langle q_t, d_{t1}, d_{t2} \rangle$, the function

$$f(x_t) = \sum_{i=1}^{n} \alpha_i y_i \sum_{c \in H} \eta_c(q_i) \, \eta_c(q_t) \, K_c(x_i, x_t) \qquad (7.9)$$

returns a positive value if $d_{t1}$ is preferred over $d_{t2}$ and a negative value otherwise. However, in our LTR task we need to produce a list of documents ranked in descending order of relevance, aiming to find as many relevant documents as possible. To achieve this, we develop an algorithm (Alg. 5) that uses Eq. 7.9 to generate a ranked list of documents. It is a greedy ordering algorithm that generates an approximate ordering in which every document is assigned a relevance value. After every document in the test data has been assigned a relevance value, a ranked list is created according to the final scores.
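The pairwise prediction function of Eq. 7.9 can be sketched as follows. The sketch assumes that the query weights $\eta_c(q)$ are already available as a dictionary for both the training queries and the test query, that a data point is stored as a pair of query id and feature vector, and that a single hypothetical clause kernel `linear` is used; none of these choices is prescribed by the method itself.

```python
from typing import Callable, Dict, Sequence, Tuple

Features = Sequence[float]
Point = Tuple[str, Features]  # (query id, pair features); hypothetical layout
Kernel = Callable[[Features, Features], float]


def preference_score(x_t: Point,
                     train: Sequence[Point],
                     alpha: Sequence[float],
                     y: Sequence[int],
                     eta: Dict[str, Dict[str, float]],
                     kernels: Dict[str, Kernel]) -> float:
    """Eq. 7.9: f(x_t) = sum_i alpha_i y_i sum_c eta_c(q_i) eta_c(q_t) K_c(x_i, x_t)."""
    q_t, feats_t = x_t
    total = 0.0
    for (q_i, feats_i), a_i, y_i in zip(train, alpha, y):
        k = sum(eta[q_i][c] * eta[q_t][c] * kern(feats_i, feats_t)
                for c, kern in kernels.items())
        total += a_i * y_i * k
    return total


def linear(a: Features, b: Features) -> float:
    return sum(p * q for p, q in zip(a, b))


# Toy model: one query, one clause, two support vectors (all values assumed).
train = [("q1", (1.0, 0.0)), ("q1", (-1.0, 0.0))]
alpha = [0.5, 0.5]
y = [1, -1]
eta = {"q1": {"c1": 1.0}}
kernels = {"c1": linear}

score_pos = preference_score(("q1", (2.0, 0.0)), train, alpha, y, eta, kernels)
score_neg = preference_score(("q1", (-2.0, 0.0)), train, alpha, y, eta, kernels)
```

A positive score is read as "the first document of the pair is preferred", a negative score as the opposite.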

7.3 Experiments

We divide the experiments into two parts. In the first part, we assess the classification performance of the SKM approach on two different datasets. In the second part, we evaluate LTR with semantic kernels by training a discriminative ranking function in an IR evaluation setting.

Algorithm 5: Order-by-Preference algorithm

Input: $T$ // test data set of documents

for each $d_i, d_j \in T$ with $d_i \neq d_j$ do
    $\pi(d_i) \leftarrow \sum_{d_j \in V} PREF(d_i, d_j) - \sum_{d_j \in V} PREF(d_j, d_i)$
    Score.add($d_i$, $\pi(d_i)$)
    bubble(Score)
end for

Output: ordered list of Score
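A minimal Python sketch of the ordering idea in Algorithm 5 is given below. It makes several assumptions: the preference function `pref` is passed in as a callable (in our setting it would be derived from Eq. 7.9), the set over which preferences are summed is taken to be all other test documents, and the incremental `bubble` step is replaced by a single sort at the end. The toy relevance values are hypothetical.

```python
from typing import Callable, Hashable, List, Sequence, Tuple


def order_by_preference(docs: Sequence[Hashable],
                        pref: Callable[[Hashable, Hashable], float]
                        ) -> List[Tuple[Hashable, float]]:
    """Score each document by outgoing minus incoming preference, then sort."""
    scores = []
    for d_i in docs:
        # pi(d_i) = sum_j PREF(d_i, d_j) - sum_j PREF(d_j, d_i), over d_j != d_i
        pi = sum(pref(d_i, d_j) - pref(d_j, d_i)
                 for d_j in docs if d_j != d_i)
        scores.append((d_i, pi))
    # Highest score first; stands in for the bubble(Score) step (assumption).
    scores.sort(key=lambda item: item[1], reverse=True)
    return scores


# Toy preference function built from hypothetical relevance grades.
relevance = {"a": 3, "b": 1, "c": 2}


def toy_pref(u: str, v: str) -> float:
    return 1.0 if relevance[u] > relevance[v] else 0.0


ranked = order_by_preference(["a", "b", "c"], toy_pref)
ranked_ids = [doc for doc, _ in ranked]
```

With a consistent preference function such as `toy_pref`, the resulting order matches the underlying relevance grades; with a learned, possibly inconsistent preference function the order is only an approximation.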

In document Search Relevance based on the Semantic Web (Page 152-154)