91
∑ ( , ; ) + ( ) (5.1) where ( , ; ) is the ranking loss function and is a parameter to control the trade-off between training error and model complexity.
To utilise the knowledge in the source domain, model adaptation has shown great success. It is reasonable to assume that the target model and the source model should have similar shapes in the function space. For example, the Ranking Adaptation SVM (RA-SVM) models the similarity as:
∑ ( , ; ) + ⁄2‖ ‖ + (1 − ) 2‖ −⁄ ‖ (5.2) A new adaptation regularisation term ‖ − ‖ is added into the objective function, which make sure the distance between the target model and source model are close.
RA_SVM focuses on the optimisation of norm adaptation regularization. In contrast, we adopt ‖ ‖ + ‖ − ‖ as the regularizer. Sparse regularization has been proved to be effective in many applications. In this paper, interests lie in the sparse adaptation regularization. By replacing the norm distance, the optimization problem can be stated as:
∑ ( , ; ) . . ‖ ‖ ≤ ; ‖ − ‖ ≤ (5.3) It is well known that norm usually leads to a sparse solution while norrm does not. Hence the optimal solution of E.q (5.2) and E.q (5.3) are expected to be different:
We obtain a sparse model while RA_SVM has a dense model.
We only require most of the ranking weights in the source domain are similar to that of the target domain and a few of them can be far away. While RA_SVM requires all the ranking weights between the source model and target model are close.
92
The above two observations show that when there are a lot of noise in the data and large different distribution between two models, our method shows superior to RA_SVM. The set
{ |‖ ‖ ≤ ∩ ‖ − ‖ ≤ } should not be empty so that the solution of the problem 3 can be obtained. We set = = for simplicity in this paper, and in order to analyze the optimization problem from the Fenchel primal dual view, we select the square hinge loss as the loss function and rewrite the Eq.(5.3) as
( ) = − ∑ 0, − ( ) ‖ ‖ ( ) − ‖ ‖ ( )
(5.4)
The primal problem of Eq.(5.4) can be written as ( ) = 4 ‖ ‖ − 1 ‖ ‖ + ‖ ‖ +1 2⟨ | ⟩ (5.5)
More details about Fenchel primal and dual are presented in 5.4.1. 5.4.1 Algorithm Analysis Statement
Some concepts and lemmas which will be used for the analysis of our algorithm are presented here. Fenchel primal-dual view of sparse cross-domain learning to rank
A function is convex if for all , ∈ ( ), the domain of is denoted as ( ).0 ≤ δ ≤ 1 and (δ + (1 − δ) ≤ δ ( ) + (1 − δ) ( ). A vector is a sub-gradient of a function at if
∀ , 〈 , − 〉 ≤ ( ) − ( ) (5.6) We define the set of subgradients of function at as ∂ ( ). Note that if is convex and differentiable at , and then ∂ ( ) consists of only a single vector, the gradient of at is denoted by ∇ ( ).An important concept used in this paper is the Fenchel conjugate.
93
Definition 1: The Fenchel conjugate of a function f is defined as:
f∗(θ) = max
∈ ( ), 〈θ, θ 〉 − f(θ ) (5.7)
Note that if f(θ ) is a convex and closed function, then the Fenchel conjugate of f∗is f itself. Lemma 1: (Proposition 3.3.4[20], Fenchel-Young inequality) Any points : ∈ (ℎ) and
: ∈ (ℎ∗) satisfy the inequality:
ℎ(θ ) + ℎ∗(θ) ≥ 〈θ , θ 〉 (5.8) Equality holds if and only if ∈ ∂ℎ( )
Lemma 2 (Theorem 3.1.8[20]) If the function : ℛ → (−∞, +∞] is convex, then any points in core ( ( )) and any direction in ℛ satisfy
( ; ) = {〈 , 〉| ∈ ( )} (5.9) In particular, the sub-differential ∂ ( )is noneempty.
Note that the core of a set C(written ( ( ))) is the set of points in ( ) such that for any direction in ℛ , + δθ lies in ( ) for all small real δ.
Lemma 3 (Lemma 3.2.6[20]) If the function ℎ: → (−∞, +∞] is convex and some point x in core ( ℎ) satisfies ℎ( ) > −∞, then ℎ never takes the value −∞
The next theorem is an important property for interpretation of the multi-Fenchel primal- dual framework and plays an important role in our analysis.
Theorem 1For function : ℛ → (−∞, +∞], : ℛ → (−∞, +∞], = 1, … , be a closed and convex functions. Φ is a ℝ × matrix, then
94
sup − f∗(−Φω) − ∑ g∗(ω) ≤ inf f(σ) + ∑ g (Φ σ/ )
The equality holds when
0 ∈ ⋂ ( ( ( ) − ( ))) Proof. The weak duality inequality holds immediately from the Fenchel-Yong inequality (Lemma 1). To prove the equality, we define a function ψ: ℛ → (−∞, +∞] by
( ) =
∈ℛ
( ) + ∑
It is easy to check is convex and ( ) = ( ) − ( ). If the condition0 ∈
⋂ ( ( ( ) − ( )))holds and ( ) + ∑ is finite, then it can observed from Lemma 2 and Lemma 3 that there exists a sub-gradient − ∈ (0).
We have:
(0) ≤ ( ) + 〈− , 〉, ∈ , ≤ ( ) + + + 〈− , 〉,
∈ , ∈ = { ( ) − 〈− , 〉} + + − 〈− , + 〉
Taking the infimum over all points v, and then over all points gives the inequalities
(0) ≤ − ∗(− ) − ∑ ∗( ) ≤ − ∗(− ) − ∑ ∗( ) ≤ ( ) +
∑ = (0).
95 When
0 ∈ ( ( ( ) − ( )))
When = , Theorem 1 can be stated as
Corollary 1Let : (−∞, +∞], , : (−∞, +∞] be closed and convex functions Φ is a
ℝ × matrix, then
− ∗(− ) − ∗( ) − ∗( ) ≤ ( ) + ( ⁄ ) +2 ( ⁄ ) 2 (5.10)
The equality holds when
0 ∈ 2 ( ) − ( ) ∩ 2 ( ) − ( )
When = 1, Theorem 1 can be stated as
Corollary 2 Let : (−∞, +∞], : (−∞, +∞]be closed and convex functions. Φ is a
ℝ × matrix, then
− ∗(− ) − ∗( ) ≤ ( ) + ( ) (5.11)
The equality holds when
0 ∈ ( ) − ( )
Through Corollary 1, we can get the dual presentation of the problem (5.4) in the following manner. We denote ∗( ) = I (ω) and g∗(ω) = I (ω) as the regularization terms of
Equation (5.4), where = { |‖ ‖ ≤ 1}, = { |‖ − ‖ ≤ 1}.
Denoting
∗( ) = ( (0,1 + )) = |1 + | ,
96
∗(− ) = (0,1 − ( ) ) ,
Which is the loss term in (4).
The problem (4) can be rewritten in its dual form:
( ) = − ∗(− ) − ∗( ) − ∗( ) (5.12)
Accordingly, we can also define the primal problem of Eq. (4). The following theorems state the definitions of ( ), ( ) and ( ).
Theorem 2(Lemma 3 [80])Let ∗(− ) = ∑ (0, − ( ) ) , then the Fenchel conjugate of
∗(−Φω) is ( ) = ∑ ( − + ( )).
Theorem 3 The Fenchel conjugates of ∗( ) = ‖ ‖ ( ) and ∗( ) = ‖ ‖ ( ) are
( ⁄ ) = ‖2 ‖ and ( ⁄ ) = ‖2 ‖ + 〈 , 〉 respectively. Proof. The Fenchel conjugate of ∗( ) is
( ⁄ ) =2 〈 ⁄ , 〉 −2 ∗( ) = ‖ ‖ 〈 ⁄ , 〉2 =1 2 | | =1 2‖ ‖
And the Fenchel conjugate of ∗( ) is
97 = ‖ ‖ 〈 ⁄ , 〉2 = ‖ ‖ 〈 ⁄ , 〉 + 〈2 ⁄ ,2 〉 = |( ⁄ ) | + 〈2 ⁄ ,2 〉 = 1 2‖ ‖ + 〈 , 〉.
Combining Theorem 2 and Theorem 3, the primal problem of Eq. (4) can be written as
( ) = ( ) + ( ⁄ ) + 2 ( ⁄ ) =2 ‖ ‖ − ‖ ‖ + ‖ ‖ + 〈 , 〉 (5.13)
According to Theorem 1, the Fenchel primal objective is always the upper bound of the dual’s:
( ) ≤ ( ) (5.14) We analyse the convergence properties of the proposed algorithm here. We set ∗ =
( ) as the best solution of the problem (4). At each iterationt, we denote = ( ∗) − ( ) as the difference between the best solution and the solution obtained at
iterationt. Theorem 4 establishes the stopping criterion for our algorithm.
Theorem 4: For all t, we have ≤ ‖ ‖ + 〈 , − 〉.Theorem 5 establishes the upper bound of required iterations to obtain a -accurate solution.
Theorem 5: The TFRank algorithm terminates after at most 16 ⁄ − 1 iterations and returns a
ϵ-accurate solution, where ≥ 0.125.Proof. Since ( ) > ( ), we have
− = ( ∗) − ( ) − ( ∗) − ( )
= ( ) − ( ) ≥ − ( )
98
= − (5.15) The proof of the convergence rate about FenchelRank (Theorem 1 [80]) shows that
− ≥ ⁄16 (5.16)
Combining equation (5.15) and equation (5.16), we can obtain
− ≥ ⁄16 (5.17)
According to Eq. (19), the result ϵ ≤ holds from the proof of the convergence rate about FenchelRank [80]. In summary, when the iteration t ≥ − 1, TFRank algorithm can obtain a
-accurate solution. This concludes our proof.