In this section, primal-dual methods are employed to solve the optimisation problem (5.4). Primal-dual methods are based on the theory of Fenchel duality (Corollary 1), which shows that the primal objective is always an upper bound on the dual objective. FenchelRank follows a generic algorithmic framework to solve ranking optimisation problems of the form
$$f^*(-\Phi\omega) + g^*(\omega)$$
where $f^*$ is a convex loss function and $g^*$ is the regularization term, such as the $\ell_1$ constraint $\|\omega\|_1 \le 1$. In this paper, we extend it to a more general form:
$$f^*(-\Phi\omega) + \sum_i g_i^*(\omega)$$
We show that this general form also satisfies the theory of Fenchel duality. The details can be found in Theorem 1 and Appendix A.
The proposed sparse cross-domain learning-to-rank algorithm, referred to as TFRank, is described in Algorithm 1.
Algorithm 1 TFRank algorithm
Input: pairwise data matrix $\Phi$, desired accuracy $\epsilon$, maximal iteration number $T$, the radius $\rho$ of the $\ell_1$ ball, the source model $\omega_s$.
Output: linear ranking predictor $\omega$.
Initialize: = , =
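For $t = 1, \dots, T$
  (1) Check the early stopping criterion
      Let $\sigma_t = \partial f^*(-\Phi\omega_t)$
      If $\|\Phi^T\sigma_t\|_\infty + \langle \sigma_t, -\Phi\omega_t \rangle \le \epsilon$
          return $\omega_t$ as ranking predictor
      End If
  (2) Greedily choose a feature to update
      Choose $j_t = \arg\max_j |(\Phi^T\sigma_t)_j|$
      Find $\eta_t = \arg\max_{0 \le \eta \le 1,\ \|\omega_{t+1} - \omega_s\|_1 \le 1} \Gamma\big((1-\eta)\omega_t + \eta\,\mathrm{sign}((\Phi^T\sigma_t)_{j_t})\,e^{j_t}\big)$
      where $\omega_{t+1} = (1-\eta_t)\omega_t + \eta_t\,\mathrm{sign}((\Phi^T\sigma_t)_{j_t})\,e^{j_t}$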
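  (3) Select a new feature according to the source model
      Update $\sigma_t = \partial f^*(-\Phi\omega_{t+1})$
      Choose $j_t = \arg\max_j |(\Phi^T\sigma_t)_j| \times |\omega_s(j)|$
      Find $\eta_t = \arg\max_{\eta} \Gamma\big((1-\eta)\omega_{t+1} + \eta\,\mathrm{sign}((\Phi^T\sigma_t)_{j_t})\,e^{j_t}\big)$
      where $\omega_{t+1} \leftarrow (1-\eta_t)\omega_{t+1} + \eta_t\,\mathrm{sign}((\Phi^T\sigma_t)_{j_t})\,e^{j_t}$
      If $\omega_s(j_t)$ exceeds its threshold
          Reduce $\omega_s(j_t)$ to weaken the impact of the selected element
      End If
End For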
The input of the algorithm includes a pairwise data matrix $\Phi$, a desired accuracy $\epsilon$, the maximal number of iterations $T$, the radius $\rho$ of the $\ell_1$-ball and the source model $\omega_s$. The dual solution $\omega$ is initialized to be the zero vector, and the algorithm stops when the desired accuracy is met or the maximal iteration number is reached.
In the $t$-th iteration, two variables, the dual solution $\omega_t$ and the primal solution $\sigma_t$, are updated to find the solution. Specifically, our algorithm has three main steps: (1) Check the early stopping criterion. (2) Greedily choose a feature to update. (3) Select a new feature according to the source model. Note that Step (1) is identical to the corresponding step in the FenchelRank algorithm, and Step (2) is similar to that of FenchelRank, except that we also require the new constraint $\|\omega - \omega_s\|_1 \le 1$. Step (3) transfers the most confident prior knowledge from the source domain: if a feature is important in the source domain, it should have a high probability of being selected in the target domain. We discuss these steps in the following subsections.
5.5.1 Checking the early stopping criterion
Theorem 4 establishes the stopping criterion for our algorithm. It shows that the difference between the best solution and the solution obtained at iteration $t$ is less than $\|\Phi^T\sigma_t\|_\infty + \langle \sigma_t, -\Phi\omega_t \rangle$. Hence, if $\|\Phi^T\sigma_t\|_\infty + \langle \sigma_t, -\Phi\omega_t \rangle < \epsilon$, we obtain an $\epsilon$-accurate solution. This is also verified in Lemma 11 of [123].
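As a rough illustration of this check, the sketch below evaluates the bound for given iterates. It assumes that the norm in the criterion is the $\ell_\infty$ norm (as in FenchelRank) and that $\Phi$ is stored as a list of rows; the function names `gap_bound` and `should_stop` are hypothetical and are not part of TFRank.

```python
def gap_bound(Phi, omega, sigma):
    """Return ||Phi^T sigma||_inf + <sigma, -Phi omega> for plain-list inputs."""
    # Margins (Phi omega)_i, one per pairwise example (row of Phi).
    margins = [sum(a * w for a, w in zip(row, omega)) for row in Phi]
    # Edges (Phi^T sigma)_j, one per feature (column of Phi).
    edges = [sum(row[j] * s for row, s in zip(Phi, sigma))
             for j in range(len(omega))]
    linf = max(abs(e) for e in edges)  # assumed l_inf norm
    inner = sum(s * (-m) for s, m in zip(sigma, margins))
    return linf + inner


def should_stop(Phi, omega, sigma, eps):
    """Early-stopping test of Step (1): stop once the bound is at most eps."""
    return gap_bound(Phi, omega, sigma) <= eps
```

In a full loop, `should_stop` would be called at the start of every iteration with the current $\omega_t$ and $\sigma_t$.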
5.5.2 Greedily choose a feature to update
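In this subsection, we describe how to choose a feature and compute an appropriate step size to update the weight of this feature. At iteration $t$, given the dual variables $\omega_t$, we compute the primal vector $\sigma_t$ as follows:
$$\sigma_t(i) = \partial f^*\big((-\Phi\omega_t)_i\big) = \begin{cases} \dfrac{2\rho^2}{p}\left(\dfrac{1}{\rho} - (\Phi\omega_t)_i\right) & \text{if } \dfrac{1}{\rho} - (\Phi\omega_t)_i \ge 0 \\ 0 & \text{otherwise} \end{cases} \tag{5.18}$$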
where $\sigma_t(i)$ denotes the $i$-th coordinate of $\sigma_t$. Since $\sigma_t = \partial f^*(-\Phi\omega_t)$, we have $f^*(-\Phi\omega_t) + f(\sigma_t) = \langle \sigma_t, -\Phi\omega_t \rangle$ (Lemma 1). Then, the algorithm selects the feature $j_t$ with the largest absolute value of $(\Phi^T\sigma_t)_j$ as a weak learner. This is a common way to obtain a weak learner in boosting algorithms [51], which choose the most violated feature to update.
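The computation of the primal vector in (5.18) and the greedy feature choice can be sketched as follows. This is a minimal illustration, assuming that $p$ is the number of rows of $\Phi$ (the number of preference pairs) and that $\Phi$ is a list of rows; the names `primal_vector` and `select_feature` are hypothetical.

```python
def primal_vector(Phi, omega, rho):
    """sigma_i = (2 rho^2 / p) * (1/rho - (Phi omega)_i) when that slack is >= 0, else 0."""
    p = len(Phi)  # assumption: p counts the pairwise examples
    scale = 2.0 * rho ** 2 / p
    sigma = []
    for row in Phi:
        slack = 1.0 / rho - sum(a * w for a, w in zip(row, omega))
        sigma.append(scale * slack if slack >= 0 else 0.0)
    return sigma


def select_feature(Phi, sigma):
    """Return (j, edge_j) for the feature with the largest |(Phi^T sigma)_j|."""
    n_features = len(Phi[0])
    edges = [sum(row[j] * s for row, s in zip(Phi, sigma))
             for j in range(n_features)]
    j = max(range(n_features), key=lambda k: abs(edges[k]))
    return j, edges[j]
```

The sign of the returned edge is what the update step uses to decide the direction in which the selected coordinate is moved.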
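Given the selected feature, we set $\omega_{t+1}$ to be the convex combination of $\omega_t$ and the selected feature:
$$\omega_{t+1} = (1-\eta_t)\,\omega_t + \eta_t\,\mathrm{sign}\big((\Phi^T\sigma_t)_{j_t}\big)\,e^{j_t} \tag{5.19}$$
where the sign function is $\mathrm{sign}(x) = 1$ if $x \ge 0$ and $\mathrm{sign}(x) = -1$ otherwise, and $e^{i}$ denotes the vector of all zeros except that its $i$-th element is 1.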
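The coefficient $\eta_t$ is calculated so as to maximize the increase of
$$\Gamma(\omega_{t+1}) - \Gamma(\omega_t) = \Gamma\big((1-\eta_t)\,\omega_t + \eta_t\,\mathrm{sign}((\Phi^T\sigma_t)_{j_t})\,e^{j_t}\big) - \Gamma(\omega_t).$$
Denoting $q_t = \Phi\big(\mathrm{sign}((\Phi^T\sigma_t)_{j_t})\,e^{j_t} - \omega_t\big)$ and $s_t = \frac{1}{\rho} - \Phi\omega_t$, the problem can be simplified as
$$\eta_t = \arg\min_{0 \le \eta \le 1,\ \|\omega_{t+1} - \omega_s\|_1 \le 1} \ \sum_i \max\big(0,\ (s_t)_i - \eta\,(q_t)_i\big)^2$$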
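The one-dimensional step-size problem can also be approximated numerically. The sketch below is an assumption-laden illustration rather than the authors' procedure: it takes the objective to be the squared-hinge sum implied by (5.18), with $s = \frac{1}{\rho} - \Phi\omega_t$ and $q = \Phi(\mathrm{sign}((\Phi^T\sigma_t)_{j_t})\,e^{j_t} - \omega_t)$ passed in as plain lists, it ignores the additional constraint involving the source model, and it substitutes a ternary search for a closed-form solution. The name `line_search_step` is hypothetical.

```python
def line_search_step(s, q, iters=100):
    """Approximate argmin over eta in [0, 1] of sum_i max(0, s_i - eta * q_i)^2."""
    def loss(eta):
        return sum(max(0.0, si - eta * qi) ** 2 for si, qi in zip(s, q))

    lo, hi = 0.0, 1.0
    # The loss is convex in eta, so ternary search keeps a minimiser in [lo, hi].
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if loss(m1) <= loss(m2):
            hi = m2
        else:
            lo = m1
    return 0.5 * (lo + hi)
```

Because the search interval is $[0, 1]$, the returned coefficient always yields a convex combination in the update (5.19).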
This one-dimensional problem can be solved analytically, as shown in FenchelRank.
5.5.3 Select a new feature according to the source model
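In the following update step, we select a feature with the help of the model parameters learned in the source domain. We set $\sigma_t = \partial f^*(-\Phi\omega_{t+1})$. The feature is chosen by the best-weighted edge, $j_t = \arg\max_j |(\Phi^T\sigma_t)_j| \times |\omega_s(j)|$. Note that a feature with a large value in the source model has the largest chance of being selected. After that, we use a similar method to compute the step size, $\eta_t = \arg\max_{\eta} \Gamma\big((1-\eta)\,\omega_{t+1} + \eta\,\mathrm{sign}((\Phi^T\sigma_t)_{j_t})\,e^{j_t}\big)$, and to update $\omega_{t+1} \leftarrow (1-\eta_t)\,\omega_{t+1} + \eta_t\,\mathrm{sign}((\Phi^T\sigma_t)_{j_t})\,e^{j_t}$. Finally, $\omega_s$ is adjusted to weaken the impact of the selected element $\omega_s(j_t)$. This strategy guarantees that the ranking model can approach $\omega_s$ in each round of the training process.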
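The source-guided selection can be illustrated with a few lines of code. This sketch assumes that the score of feature $j$ is the product of the magnitude of its edge and the magnitude of its source-model weight; the use of absolute values on the source weights and the name `select_with_source` are assumptions made for illustration only.

```python
def select_with_source(edges, omega_s):
    """Return the index j maximising |edge_j| * |omega_s[j]|.

    edges   : list of (Phi^T sigma)_j values, one per feature
    omega_s : list of source-model weights, one per feature
    """
    scores = [abs(e) * abs(w) for e, w in zip(edges, omega_s)]
    return max(range(len(scores)), key=scores.__getitem__)
```

With uniform source weights the rule falls back to the purely greedy choice of Step (2); non-uniform weights shift the choice toward features that the source model considers important.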
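We show that the above update rule satisfies the constraints $\|\omega\|_1 \le 1$ and $\|\omega - \omega_s\|_1 \le 1$. Since we initialize the weight vector to be the zero vector and restrict the coefficient $\eta_t$ to the range $[0, 1]$, it is easy to verify that for any $t$,
$$\|\omega_{t+1}\|_1 = \big\|(1-\eta_t)\,\omega_t + \eta_t\,e^{j_t}\big\|_1 \le (1-\eta_t)\,\|\omega_t\|_1 + \eta_t\,\|e^{j_t}\|_1 \le 1.$$
Since $\|\omega_{t+1}\|_1 \le 1$ after Step (2), the update in Step (3) gives
$$\|\omega_{t+1}\|_1 = \big\|(1-\eta_t)\,\omega_{t+1} + \eta_t\,\mathrm{sign}((\Phi^T\sigma_t)_{j_t})\,e^{j_t}\big\|_1 \le 1 - \eta_t + \eta_t = 1.$$
Furthermore,
$$\|\omega_{t+1} - \omega_s\|_1 \le \big\|(1-\eta_t)\,\omega_t + \eta_t\,\mathrm{sign}((\Phi^T\sigma_t)_{j_t})\,e^{j_t} - \omega_s\big\|_1 \le 1 - \eta_t + \eta_t = 1.$$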
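The first of these bounds, that a convex-combination update started from the zero vector stays inside the unit $\ell_1$ ball, can be checked numerically. The sketch below covers only the constraint $\|\omega\|_1 \le 1$, not the constraint involving the source model; the helper names `l1_norm` and `convex_update` and the sequence of updates are hypothetical.

```python
def l1_norm(v):
    return sum(abs(x) for x in v)


def convex_update(omega, j, sign, eta):
    """Return (1 - eta) * omega + eta * sign * e^j, with eta in [0, 1] and sign in {-1, +1}."""
    if not 0.0 <= eta <= 1.0:
        raise ValueError("eta must lie in [0, 1]")
    updated = [(1.0 - eta) * w for w in omega]
    updated[j] += eta * sign
    return updated


# Start from the zero vector and apply a few arbitrary updates.
omega = [0.0, 0.0, 0.0]
for j, sign, eta in [(0, 1, 0.7), (2, -1, 0.4), (1, 1, 1.0), (0, -1, 0.25)]:
    omega = convex_update(omega, j, sign, eta)
    assert l1_norm(omega) <= 1.0 + 1e-12  # the iterate never leaves the unit ball
```

Each update is a convex combination of a point in the ball and a signed unit vector, which is why the norm cannot exceed 1.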
5.5.4 Discussion
Compared to the FenchelRank algorithm, our approach differs in two respects. 1) FenchelRank utilised the Fenchel-dual inequality as a tool to establish the upper bound of the algorithm. However, the Fenchel-dual inequality involves only two terms, which is not sufficient for our proposed framework with three terms. Here, we propose a more general form of the Fenchel-dual inequality and prove its correctness in this paper. Based on the new Fenchel-dual inequality, we confirm the upper bound of our proposed algorithm. 2) Since an additional term learned from the source domain is added to the framework, the proposed algorithm has to be adjusted relative to FenchelRank. Accordingly, the convergence rate of our proposed algorithm needs to be analysed.