7.3 Multiclass Incremental Transfer Learning

$$M = \begin{bmatrix} X^\top X + \lambda I & \mathbf{1} \\ \mathbf{1}^\top & 0 \end{bmatrix}^{-1}. \tag{7.4}$$
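As a rough illustration, the matrix in (7.4) could be assembled in NumPy as sketched below. The sketch assumes that the columns of X are the m training samples, so that the Gram matrix X^T X is m × m, and that λ > 0; the function name `build_M` is hypothetical and not part of the original text.

```python
import numpy as np

def build_M(X, lam):
    """Return the inverse of the bordered matrix [[X^T X + lam*I, 1], [1^T, 0]].

    Assumption: X has shape (d, m), one training sample per column.
    """
    m = X.shape[1]
    ones = np.ones((m, 1))
    top = np.hstack([X.T @ X + lam * np.eye(m), ones])
    bottom = np.hstack([ones.T, np.zeros((1, 1))])
    return np.linalg.inv(np.vstack([top, bottom]))
```

An explicit inverse is used here only because the diagonal of M is needed later; it is a design choice of this sketch, not a claim about the original implementation.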

The solution of the transfer learning problem is completely defined once we set the parameters β. In the next section we describe how to automatically tune these parameters.

7.3.3 Self-tuning of Transfer Parameters

Our goal is to tune the transfer coefficients β to improve the performance of the linear model for the new (K+1)-th class by exploiting only the relevant source models while preventing negative transfer. We optimize the coefficients β automatically using an objective based on the Leave-One-Out (LOO) error, which is an almost unbiased estimator of the generalization error of a classifier [20]. An advantage of RLS, on which our approach is based, over other methods is that it allows the LOO error to be computed efficiently in analytic form. Specifically, we cast the optimization of β as the minimization of a convex upper bound of the LOO error. The LOO predictions for the entire training set with respect to the hyperplane $w_k$ are given by (the derivation is available in the appendix)

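$$y^{\mathrm{loo}}_k = y_k - (M \circ I)^{-1}\,(a_k - a^{\mathrm{src}}_k) \quad \forall k \in [K], \tag{7.5}$$

$$y^{\mathrm{loo}}_{K+1}(\beta) = y_{K+1} - (M \circ I)^{-1}\,(a_{K+1} - A^{\mathrm{src}}\beta).$$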
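The LOO predictions in (7.5) could be computed along the following lines. This is a minimal sketch under assumed conventions that the text does not state: Y and A are stored as m × (K+1) arrays with one column per class, A_src is m × K, and only the first m diagonal entries of M are used, so that multiplying by (M ∘ I)^{-1} becomes a row-wise division. The name `loo_predictions` is hypothetical.

```python
import numpy as np

def loo_predictions(Y, M, A, A_src, beta):
    """LOO scores for all K+1 classes, following (7.5).

    Assumed shapes: Y and A are (m, K+1), A_src is (m, K), beta is (K,).
    """
    m, K = A_src.shape
    d = np.diag(M)[:m]                      # diagonal of M, one entry per sample
    Y_loo = np.empty((m, K + 1))
    # Source classes: independent of beta.
    Y_loo[:, :K] = Y[:, :K] - (A[:, :K] - A_src) / d[:, None]
    # Target class: the only column that depends on beta.
    Y_loo[:, K] = Y[:, K] - (A[:, K] - A_src @ beta) / d
    return Y_loo
```

Only the last column changes with β, and it does so affinely, which is what keeps the resulting tuning problem convex.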

We stress that (7.5) is a linear function of β. We now need a convex multiclass loss to measure the LOO errors. A fairly standard choice would be a convex multiclass loss as in [28], which keeps samples of different classes at the unit marginal distance. Slightly abusing notation in (7.5), such a multiclass loss function would look like

$$\ell^{\mathrm{mc}}_i(\beta) = \max_{r \neq y_i} \left[ 1 + y^{\mathrm{loo}}_{r,i}(\beta) - y^{\mathrm{loo}}_{y_i,i}(\beta) \right]_+ . \tag{7.6}$$

However, from (7.5) observe that changing β will only change the score of the target (K+1)-th class. Thus, when using this loss, almost all examples are neglected during optimization with respect to β. We address this issue by proposing a modified version of (7.6),

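$$\ell^{\mathrm{mc\text{-}mod}}_i(\beta) = \begin{cases} \left[ 1 + y^{\mathrm{loo}}_{K+1,i}(\beta) - y^{\mathrm{loo}}_{y_i,i}(\beta) \right]_+ & : \ \mathrm{label}_i \neq K+1 \\ \max\limits_{r \neq y_i} \left[ 1 + y^{\mathrm{loo}}_{r,i}(\beta) - y^{\mathrm{loo}}_{y_i,i}(\beta) \right]_+ & : \ \mathrm{label}_i = K+1 \end{cases}$$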

The rationale behind this loss is to enforce a margin of 1 between the target (K+1)-th class and the correct one, even when the (K+1)-th class does not have the highest score. This has the advantage of forcing the use of all examples during the tuning of β. Given the analytic form of LOO predictions (7.5) and the multiclass loss function above, we can obtain β by solving the convex regularized problem
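To make the difference between the two losses concrete, the sketch below evaluates both on a single sample. It assumes zero-based class indices, with the last index standing for the new (K+1)-th class; the function names and the score values are illustrative only.

```python
def hinge(z):
    """Positive part [z]_+."""
    return max(0.0, z)

def mc_loss(scores, label):
    """Standard multiclass hinge loss (7.6) for one sample."""
    return max(hinge(1.0 + s - scores[label])
               for r, s in enumerate(scores) if r != label)

def mc_mod_loss(scores, label, target):
    """Modified loss: samples of other classes are always compared
    against the target class; target samples fall back to (7.6)."""
    if label != target:
        return hinge(1.0 + scores[target] - scores[label])
    return mc_loss(scores, label)

# A sample of class 0; class 2 plays the role of the new class.
before = [2.0, 2.5, 1.5]
after = [2.0, 2.5, 1.25]   # only the target-class score has changed
```

In this example the standard loss is dominated by class 1, so `mc_loss(before, 0)` and `mc_loss(after, 0)` coincide and the sample carries no information about β, whereas `mc_mod_loss(before, 0, 2)` and `mc_mod_loss(after, 0, 2)` differ.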

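$$\min_{\beta \in \Omega} \left\{ \frac{1}{m} \sum_{i=1}^{m} \ell^{\mathrm{mc\text{-}mod}}_i(\beta) \right\}, \quad \text{with } \Omega = \{ x \mid \|x\|_2 \le 1 \wedge x \succeq 0 \}. \tag{7.7}$$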

Constraining β within a unit $L_2$ ball is a form of regularization imposed on β, which prevents overfitting, as was shown in theoretical works on HTL [77, 76]. This optimization procedure can be implemented elegantly using projected subgradient descent, which is not affected by the fact that the objective function in (7.7) is not differentiable everywhere. The pseudocode of the optimization algorithm is summarized in Algorithm 2.
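The projection step onto Ω can be sketched as follows: negative coefficients are clipped to zero, and the result is rescaled whenever it leaves the unit ball. The function name `project_onto_omega` is hypothetical.

```python
import numpy as np

def project_onto_omega(beta):
    """Map beta into Omega = {x : ||x||_2 <= 1, x >= 0}."""
    beta = np.maximum(beta, 0.0)      # clip negative coefficients to zero
    norm = np.linalg.norm(beta)
    if norm > 1.0:
        beta = beta / norm            # rescale onto the unit L2 ball
    return beta
```

For instance, `project_onto_omega(np.array([3.0, -4.0]))` yields `[1, 0]`, while a point already inside Ω, such as `[0.3, 0.4]`, is returned unchanged.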

Finally, we make a few comments on the computational complexity of the entire approach. The computational complexity of obtaining $A$, $A^{\mathrm{src}}$, and $M$ is in $O(m^3 + m^2(K+1))$, which comes from the matrix operations (7.3)-(7.4). Algorithm 2 is in $O(mK(T+1))$, where we assume that most terms in (7.5) are precomputed. Each iteration of the algorithm is efficient since it depends linearly on both the training set size and the number of classes.

7.4 Experiments

We present here a series of experiments designed to investigate the behavior of our algorithm when (a) the source classes and the target class are related/unrelated, and when (b) the overall number of classes increases. All experiments were conducted on subsets of two different public datasets, and the results were benchmarked against several baselines. In the rest of the section we first describe our experimental setup (section 7.4.1), then we describe the chosen baselines (section 7.4.2). Section 7.4.3 reports our findings.

Chapter 7. Class-incremental Hypothesis Transfer Learning

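Algorithm 2 Projected subgradient descent to find β

Input: $M$, $Y$, $A$, $A^{\mathrm{src}}$, $T$
Output: β

$y^{\mathrm{loo}}_k \leftarrow y_k - (M \circ I)^{-1}(a_k - a^{\mathrm{src}}_k) \quad \forall k \in [K]$
$\beta_1 \leftarrow 0$
for $t \in [T]$ do    ▷ Iterations of subgradient descent.
    $y^{\mathrm{loo}}_{K+1} \leftarrow y_{K+1} - (M \circ I)^{-1}(a_{K+1} - A^{\mathrm{src}}\beta_t)$
    $\Delta \leftarrow 0$
    for $i \in [m]$ do    ▷ Passing through the training set.
        if $\mathrm{label}(y_i) \neq K+1$ then
            if $1 + y^{\mathrm{loo}}_{K+1,i} - y^{\mathrm{loo}}_{y_i,i} > 0$ then
                $\Delta \leftarrow \Delta + \mathrm{diag}(M)^{-1}_i\, a^{\mathrm{src}}_i$
            end if
        else if $\max_{r \neq y_i} (1 + y^{\mathrm{loo}}_{r,i} - y^{\mathrm{loo}}_{y_i,i}) > 0$ then
            $\Delta \leftarrow \Delta - \mathrm{diag}(M)^{-1}_i\, a^{\mathrm{src}}_i$
        end if
    end for
    $\beta \leftarrow \beta_t - \frac{\Delta}{m\sqrt{t}}$
    $\beta \leftarrow [\beta]_+$
    if $\|\beta\|_2 > 1$ then
        $\beta \leftarrow \beta / \|\beta\|_2$
    end if
    $\beta_{t+1} \leftarrow \beta$
end for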
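For readers who prefer executable code, the procedure of Algorithm 2 could be written in NumPy roughly as below. This is a sketch under several assumptions not fixed by the text: Y and A are m × (K+1) arrays, A_src is m × K, `labels` is an integer array of zero-based class indices with K denoting the new class, only the first m diagonal entries of M are used, and the update divides the accumulated subgradient by m√t. The name `tune_beta` is hypothetical, and the inner loop over samples is vectorized.

```python
import numpy as np

def tune_beta(M, Y, A, A_src, labels, T=100):
    """Projected subgradient descent for beta, in the spirit of Algorithm 2."""
    m, K = A_src.shape
    d = np.diag(M)[:m]
    rows = np.arange(m)
    is_target = labels == K
    Y_loo = np.empty((m, K + 1))
    # LOO scores of the K source classes do not depend on beta.
    Y_loo[:, :K] = Y[:, :K] - (A[:, :K] - A_src) / d[:, None]
    G = A_src / d[:, None]        # row i: gradient of sample i's target score
    beta = np.zeros(K)
    for t in range(1, T + 1):
        Y_loo[:, K] = Y[:, K] - (A[:, K] - A_src @ beta) / d
        true_score = Y_loo[rows, labels]
        # Non-target samples: margin between the target and the true class.
        viol_other = ~is_target & (1.0 + Y_loo[:, K] - true_score > 0)
        # Target samples: worst margin violation among the source classes.
        viol_target = is_target & (1.0 + Y_loo[:, :K].max(axis=1) - true_score > 0)
        delta = G[viol_other].sum(axis=0) - G[viol_target].sum(axis=0)
        beta = beta - delta / (m * np.sqrt(t))   # assumed step size
        beta = np.maximum(beta, 0.0)             # keep beta non-negative
        norm = np.linalg.norm(beta)
        if norm > 1.0:
            beta = beta / norm                   # back onto the unit L2 ball
    return beta
```

Since every iterate is clipped and rescaled, the returned β always lies in Ω regardless of the number of iterations T.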
