Self-tuning of Transfer Parameters - Multiclass Incremental Transfer Learning

7.3 Multiclass Incremental Transfer Learning


\[
M = \begin{bmatrix} X^\top X + \lambda I & \mathbf{1} \\ \mathbf{1}^\top & 0 \end{bmatrix}^{-1}. \tag{7.4}
\]

The solution of the transfer learning problem is completely defined once we set the parameters β. In the next section we describe how to automatically tune these parameters.

7.3.3 Self-tuning of Transfer Parameters

Our goal is to tune the transfer coefficients β to improve the performance of the linear model for the new (K+1)-th class by exploiting only the relevant source models while preventing negative transfer. We optimize the coefficients β automatically using an objective based on the Leave-One-Out (LOO) error, which is an almost unbiased estimator of the generalization error of a classifier [20]. An advantage of RLS, which forms the basis of our approach, over other methods is that it allows the LOO error to be computed efficiently in analytic form. Specifically, we cast the optimization of β as the minimization of a convex upper bound on the LOO error. The LOO predictions for the entire training set with respect to the hyperplane w_k are given by (the derivation is available in the appendix)

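\[
\begin{aligned}
y^{\mathrm{loo}}_k &= y_k - (M \circ I)^{-1} \left( a_k - a^{\mathrm{src}}_k \right) \quad \forall k \in [K], \\
y^{\mathrm{loo}}_{K+1}(\beta) &= y_{K+1} - (M \circ I)^{-1} \left( a_{K+1} - A^{\mathrm{src}} \beta \right).
\end{aligned}
\tag{7.5}
\]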
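To make the structure of (7.5) concrete, the following minimal Python sketch evaluates the LOO predictions entrywise, using the fact that (M∘I)^{-1} is a diagonal matrix with entries 1/M_ii. The function names, the list-based data layout, and the toy numbers are assumptions made for illustration only; they do not come from the original text.

```python
def loo_predictions(y, coeffs, m_diag):
    """Entrywise y - (M∘I)^{-1} coeffs, i.e. y[i] - coeffs[i] / M_ii.

    For a source class k, `coeffs` stands for a_k - a_k^src (assumed precomputed).
    """
    return [yi - ci / mii for yi, ci, mii in zip(y, coeffs, m_diag)]


def loo_target(y_new, a_new, a_src_rows, beta, m_diag):
    """LOO predictions of the (K+1)-th class as a function of beta.

    a_src_rows[i] is the i-th row of A^src (hypothetical layout).
    """
    coeffs = [
        ai - sum(r * b for r, b in zip(row, beta))
        for ai, row in zip(a_new, a_src_rows)
    ]
    return loo_predictions(y_new, coeffs, m_diag)


# Toy data (made up): two training examples, two source models.
m_diag = [2.0, 4.0]
y_new = [1.0, -1.0]
a_new = [0.5, -0.2]
a_src_rows = [[0.1, 0.3], [0.2, -0.1]]

without_transfer = loo_target(y_new, a_new, a_src_rows, [0.0, 0.0], m_diag)
with_transfer = loo_target(y_new, a_new, a_src_rows, [1.0, 0.0], m_diag)
```

Since β enters only through the product A^src β, each prediction changes linearly when a single coefficient is varied, which is what the convexity of the tuning objective relies on.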

We stress that (7.5) is a linear function of β. We now need a convex multiclass loss to measure the LOO errors. A fairly standard choice would be a convex multiclass loss as in [28], which keeps samples of different classes at a unit margin from each other. Slightly abusing the notation of (7.5), such a multiclass loss function would look like

\[
\ell^{\mathrm{mc}}_i(\beta) = \max_{r \neq y_i} \left[ 1 + y^{\mathrm{loo}}_{r,i}(\beta) - y^{\mathrm{loo}}_{y_i,i}(\beta) \right]_+ . \tag{7.6}
\]

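However, observe from (7.5) that changing β only changes the score of the target (K+1)-th class. Thus, when using this loss, almost all examples are neglected during optimization with respect to β: an example contributes to the subgradient only when the (K+1)-th class is either its true label or the class attaining the maximum in (7.6). We address this issue by proposing a modified version of (7.6),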

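\[
\ell^{\mathrm{mc\text{-}mod}}_i(\beta) =
\begin{cases}
\left[ 1 + y^{\mathrm{loo}}_{K+1,i}(\beta) - y^{\mathrm{loo}}_{y_i,i}(\beta) \right]_+ & : \mathrm{label}_i \neq K+1, \\[4pt]
\max\limits_{r \neq y_i} \left[ 1 + y^{\mathrm{loo}}_{r,i}(\beta) - y^{\mathrm{loo}}_{y_i,i}(\beta) \right]_+ & : \mathrm{label}_i = K+1.
\end{cases}
\]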
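The difference between the standard loss (7.6) and its modified version can be checked on a single example. The Python sketch below is illustrative only: the function names, the convention that the target (K+1)-th class is stored at the last index of `scores`, and the toy scores are all assumptions.

```python
def hinge(z):
    """Positive part [z]_+."""
    return max(0.0, z)


def loss_mc(scores, label):
    """Standard multiclass hinge loss (7.6) for one example.

    scores[r] plays the role of the LOO score of class r on this example.
    """
    return max(
        hinge(1.0 + s - scores[label])
        for r, s in enumerate(scores)
        if r != label
    )


def loss_mc_mod(scores, label):
    """Modified loss; the target class is assumed to be the last index."""
    target = len(scores) - 1
    if label != target:
        # Only the margin against the target class is penalized.
        return hinge(1.0 + scores[target] - scores[label])
    return loss_mc(scores, label)


# Example labelled with source class 0; classes are (source 0, source 1, target).
scores = [2.0, 1.75, 1.5]
standard = loss_mc(scores, 0)      # maximum attained by source class 1
modified = loss_mc_mod(scores, 0)  # margin against the target class
```

In this toy case the standard loss is determined by a source class, whose score does not depend on β, whereas the modified loss is determined by the target score and therefore still yields a nonzero subgradient with respect to β.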

The rationale behind this loss is to enforce a margin of 1 between the target (K+1)-th class and the correct one, even when the (K+1)-th class does not have the highest score. This has the advantage of forcing the use of all examples during the tuning of β. Given the analytic form of the LOO predictions (7.5) and the multiclass loss function above, we can obtain β by solving the convex regularized problem

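\[
\min_{\beta \in \Omega} \left\{ \frac{1}{m} \sum_{i=1}^{m} \ell^{\mathrm{mc\text{-}mod}}_i(\beta) \right\}, \quad \text{with } \Omega = \left\{ x \mid \|x\|_2 \le 1 \wedge x \succeq 0 \right\}. \tag{7.7}
\]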

Constraining β to lie within a unit L2 ball is a form of regularization imposed on β, which prevents overfitting, as was shown in theoretical works on HTL [77, 76]. This optimization problem can be solved with projected subgradient descent, which is not affected by the fact that the objective function in (7.7) is not differentiable everywhere. The pseudocode of the optimization algorithm is summarized in Algorithm 2.
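The projection step onto Ω can be sketched as follows. For the intersection of the non-negative orthant and the unit L2 ball, clipping the negative entries to zero and then rescaling the result if its norm exceeds one gives the Euclidean projection. The function name `project_onto_omega` is hypothetical and the snippet is only a minimal illustration.

```python
import math


def project_onto_omega(beta):
    """Project beta onto {x : ||x||_2 <= 1 and x >= 0}."""
    # Step 1: clip negative coefficients to zero.
    clipped = [max(0.0, b) for b in beta]
    # Step 2: rescale onto the unit ball only if the norm exceeds 1.
    norm = math.sqrt(sum(b * b for b in clipped))
    if norm > 1.0:
        clipped = [b / norm for b in clipped]
    return clipped


outside = project_onto_omega([3.0, -1.0, 4.0])  # clipped, then rescaled
inside = project_onto_omega([0.3, -0.2])        # clipped only
```

Because both steps are elementwise or a single rescaling, the cost of the projection is linear in the number of transfer coefficients.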

Finally, we make a few comments on the computational complexity of the entire approach. The computational complexity of obtaining A, A^src, and M is in O(m^3 + m^2 (K+1)), which comes from the matrix operations (7.3)-(7.4). Algorithm 2 is in O(mK(T+1)), where we assume that most terms in (7.5) are precomputed. Each iteration of the algorithm is efficient since it depends linearly on both the training set size and the number of classes.

7.4 Experiments

We present here a series of experiments designed to investigate the behavior of our algorithm when (a) the source classes and the target class are related/unrelated, and when (b) the overall number of classes increases. All experiments were conducted on subsets of two different public datasets, and the results were benchmarked against several baselines. In the rest of the section we first describe our experimental setup (section 7.4.1), then we describe the chosen baselines (section 7.4.2). Section 7.4.3 reports our findings.

Chapter 7. Class-incremental Hypothesis Transfer Learning

Algorithm 2 Projected subgradient descent to find β

Input: M, Y, A, A^src, T
Output: β

y^loo_k ← y_k − (M∘I)^{-1} (a_k − a^src_k)  ∀k ∈ [K]
β_1 ← 0
for t ∈ [T] do    ▷ Iterations of subgradient descent.
    y^loo_{K+1} ← y_{K+1} − (M∘I)^{-1} (a_{K+1} − A^src β_t)
    Δ ← 0
    for i ∈ [m] do    ▷ Passing through the training set.
        if label(y_i) ≠ K+1 then
            if 1 + y^loo_{K+1,i} − y^loo_{y_i,i} > 0 then
                Δ ← Δ + diag(M)^{-1}_i a^src_i
            end if
        else if max_{r≠y_i} (1 + y^loo_{r,i} − y^loo_{y_i,i}) > 0 then
            Δ ← Δ − diag(M)^{-1}_i a^src_i
        end if
    end for
    β ← β_t − Δ / (m √t)
    β ← [β]_+
    if ‖β‖_2 > 1 then
        β ← β / ‖β‖_2
    end if
    β_{t+1} ← β
end for
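A minimal pure-Python sketch of Algorithm 2 is given below for illustration. It is not the author's implementation: the function names, the list-based layout of the inputs, the 0-based class indices (with the target class at index K), and the toy data are assumptions. The source-class LOO scores are assumed to be precomputed, and the step is taken to be Δ/(m√t), which is our reading of the update, consistent with the 1/m factor in (7.7).

```python
import math


def project_onto_omega(beta):
    """Clip to the non-negative orthant, then rescale into the unit L2 ball."""
    clipped = [max(0.0, b) for b in beta]
    norm = math.sqrt(sum(b * b for b in clipped))
    return [b / norm for b in clipped] if norm > 1.0 else clipped


def tune_beta(m_diag, labels, y_loo_src, y_new, a_new, a_src, T):
    """Projected subgradient descent on the modified LOO loss.

    m_diag[i]       : diagonal entry M_ii
    labels[i]       : class index in 0..K, where K denotes the new target class
    y_loo_src[k][i] : precomputed LOO score of source class k on example i
    y_new, a_new    : label vector and coefficient vector a_{K+1} of the target
    a_src[i]        : i-th row of A^src
    """
    m, K, n = len(labels), len(y_loo_src), len(a_src[0])
    beta = [0.0] * n
    for t in range(1, T + 1):
        # LOO scores of the target class for the current beta, as in (7.5).
        y_loo_new = [
            y_new[i]
            - (a_new[i] - sum(a * b for a, b in zip(a_src[i], beta))) / m_diag[i]
            for i in range(m)
        ]
        delta = [0.0] * n
        for i in range(m):
            if labels[i] != K:
                # Margin of the target class against the true source class.
                active = 1.0 + y_loo_new[i] - y_loo_src[labels[i]][i] > 0
                sign = 1.0
            else:
                # Largest margin violation among the source classes.
                worst = max(y_loo_src[r][i] for r in range(K))
                active = 1.0 + worst - y_loo_new[i] > 0
                sign = -1.0
            if active:
                for j in range(n):
                    delta[j] += sign * a_src[i][j] / m_diag[i]
        step = m * math.sqrt(t)  # assumed step: Delta / (m * sqrt(t))
        beta = project_onto_omega([b - d / step for b, d in zip(beta, delta)])
    return beta


# Toy problem (made up): one source class, two examples, one coefficient.
beta_hat = tune_beta(
    m_diag=[1.0, 1.0],
    labels=[0, 1],
    y_loo_src=[[0.5, 0.5]],
    y_new=[-1.0, 1.0],
    a_new=[0.0, 0.0],
    a_src=[[-1.0], [1.0]],
    T=2,
)
```

In the toy problem only the target-class example violates its margin at β = 0, so the first step raises the coefficient; after that the margin is met and the second iteration leaves β unchanged.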


In document Theory and Algorithms for Hypothesis Transfer Learning (Page 84-87)