
Representation Function Fixed by Source Task

Suppose samples from source task $S$ are abundant, samples from target task $T$ are scarce, and there exist some $f$, $g_S$, $g_T$ such that $g_S \circ f$ and $g_T \circ f$ are accurate hypotheses for tasks $S$ and $T$ respectively. A natural approach to leveraging the source data is to learn $\hat{g}_S \circ \hat{f} \in \mathcal{H}$ using data from task $S$, from which we assume we may recover $\hat{f} \in \mathcal{F}$, then perform empirical risk minimization over $\mathcal{G}_{\hat{f}} := \{g \circ \hat{f} : g \in \mathcal{G}\}$ on $T$, yielding $\hat{g}_T \circ \hat{f}$. While in general we cannot recover $\hat{f}$ with knowledge of $\hat{g}_S \circ \hat{f}$ alone, in the case of feedforward neural networks, which we focus on, knowing the weights learned on $S$ is sufficient for recovering $\hat{f}$.

Theorem 3.1 upper-bounds $R_T(\hat{g}_T \circ \hat{f})$ using four terms:

1. a function $\omega$ measuring a transferability property obtained analytically from the problem setting;

2. the source task empirical risk $\hat{R}_S(\hat{g}_S \circ \hat{f})$;

3. the generalization error of an hypothesis in $\mathcal{H}$ learned from $m_S$ samples; and

4. the generalization error of an hypothesis in $\mathcal{G}$ learned from $m_T$ samples.

Note that we do not settle for bounding $R_T(\hat{g}_T \circ \hat{f})$ in terms of $\hat{R}_T(\hat{g}_T \circ \hat{f})$, which may be large.

² Pentina and Lampert [2014] extend this analysis to stochastic hypotheses (i.e. distributions over deterministic hypotheses), where for each task we learn a posterior given a prior and training data. The quality of the prior affects the learner's performance. The study proposes using source tasks to learn a 'hyperposterior', a distribution over priors which is sampled to give a prior for each task. Such a hyperposterior may focus the learner on a representation function shared across tasks. The study gives a PAC-Bayes bound on the expected risk of using a hyperposterior to learn a new task drawn from the environment, in terms of the average empirical risk obtained using the hyperposterior to learn the source tasks.

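Theorem 3.1. Let $\omega : \mathbb{R} \to \mathbb{R}$ be a non-decreasing function. Suppose $\mu_{XY}$, $\mu'_{XY}$, $\hat{f}$ and $\mathcal{G}$ have the property that

\[
\forall \hat{g}_S \in \mathcal{G}, \quad \min_{g \in \mathcal{G}} R_T(g \circ \hat{f}) \le \omega\big(R_S(\hat{g}_S \circ \hat{f})\big). \tag{3.2}
\]

Let $\hat{g}_T := \arg\min_{g \in \mathcal{G}} \hat{R}_T(g \circ \hat{f})$. Then with probability at least $1-\delta$ over pairs of training sets for tasks $S$ and $T$,

\[
R_T(\hat{g}_T \circ \hat{f}) \le \omega\!\left(\hat{R}_S(\hat{g}_S \circ \hat{f}) + 2\sqrt{\frac{2\,\mathrm{VC}(\mathcal{H})\log\big(2em_S/\mathrm{VC}(\mathcal{H})\big) + 2\log(8/\delta)}{m_S}}\right) + 4\sqrt{\frac{2\,\mathrm{VC}(\mathcal{G})\log\big(2em_T/\mathrm{VC}(\mathcal{G})\big) + 2\log(8/\delta)}{m_T}}.
\]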
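To get a feel for the magnitudes involved, the right-hand side of Theorem 3.1 can be evaluated numerically. The following is a minimal sketch, not part of the original analysis: the function names and all numeric inputs (VC-dimensions, sample sizes, empirical risk, and the identity choice of $\omega$) are hypothetical values chosen only for illustration.

```python
import math


def vc_term(vc_dim: int, m: int, delta: float) -> float:
    """Square-root term of Theorem 3.1:
    sqrt((2*VC*log(2*e*m/VC) + 2*log(8/delta)) / m)."""
    return math.sqrt(
        (2 * vc_dim * math.log(2 * math.e * m / vc_dim) + 2 * math.log(8 / delta)) / m
    )


def transfer_bound(omega, emp_risk_S, vc_H, m_S, vc_G, m_T, delta):
    """Right-hand side of Theorem 3.1 for a user-supplied non-decreasing omega."""
    source_side = omega(emp_risk_S + 2 * vc_term(vc_H, m_S, delta))
    target_side = 4 * vc_term(vc_G, m_T, delta)
    return source_side + target_side


# Hypothetical setting: identity omega, abundant source data, scarce target data.
bound = transfer_bound(
    omega=lambda r: r,
    emp_risk_S=0.02,
    vc_H=50,
    m_S=1_000_000,
    vc_G=5,
    m_T=2_000,
    delta=0.05,
)
print(f"Theorem 3.1 upper bound on target risk: {bound:.4f}")
```

In this hypothetical setting the target-side term dominates the bound, which is consistent with the target sample being the scarce resource.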

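Proof. Let $g^*_T := \arg\min_{g \in \mathcal{G}} R_T(g \circ \hat{f})$. With probability at least $1-\delta$,

\[
\begin{aligned}
R_T(\hat{g}_T \circ \hat{f})
&\le \hat{R}_T(\hat{g}_T \circ \hat{f}) + 2\sqrt{\frac{2\,\mathrm{VC}(\mathcal{G})\log\big(2em_T/\mathrm{VC}(\mathcal{G})\big) + 2\log(8/\delta)}{m_T}} && (3.3)\\
&\le \hat{R}_T(g^*_T \circ \hat{f}) + 2\sqrt{\frac{2\,\mathrm{VC}(\mathcal{G})\log\big(2em_T/\mathrm{VC}(\mathcal{G})\big) + 2\log(8/\delta)}{m_T}} && (3.4)\\
&\le R_T(g^*_T \circ \hat{f}) + 4\sqrt{\frac{2\,\mathrm{VC}(\mathcal{G})\log\big(2em_T/\mathrm{VC}(\mathcal{G})\big) + 2\log(8/\delta)}{m_T}} && (3.5)\\
&\le \omega\big(R_S(\hat{g}_S \circ \hat{f})\big) + 4\sqrt{\frac{2\,\mathrm{VC}(\mathcal{G})\log\big(2em_T/\mathrm{VC}(\mathcal{G})\big) + 2\log(8/\delta)}{m_T}} && (3.6)\\
&\le \omega\!\left(\hat{R}_S(\hat{g}_S \circ \hat{f}) + 2\sqrt{\frac{2\,\mathrm{VC}(\mathcal{H})\log\big(2em_S/\mathrm{VC}(\mathcal{H})\big) + 2\log(8/\delta)}{m_S}}\right) + 4\sqrt{\frac{2\,\mathrm{VC}(\mathcal{G})\log\big(2em_T/\mathrm{VC}(\mathcal{G})\big) + 2\log(8/\delta)}{m_T}}. && (3.7)
\end{aligned}
\]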

38 Risk Bounds for Transferring Representations With and Without Fine-Tuning

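Using $m$ training points and an hypothesis class of VC-dimension $\mathrm{VC}(\cdot)$, with probability at least $1-\delta$, for all hypotheses $h \in \mathcal{H}$ simultaneously, the risk $R(h)$ and empirical risk $\hat{R}(h)$ satisfy

\[
|R(h) - \hat{R}(h)| \le 2\sqrt{\frac{2\,\mathrm{VC}(\cdot)\log\big(2em/\mathrm{VC}(\cdot)\big) + 2\log(4/\delta)}{m}} \tag{3.8}
\]

[Mohri et al., 2012]. Applying (3.8) to $\mathcal{G}$ with $\delta/2$ in place of $\delta$ (so that $\log(4/\delta)$ becomes $\log(8/\delta)$) yields (3.3) and (3.5) with probability at least $1 - \delta/2$. Applying (3.8) to $\mathcal{H}$ in the same way, and using the fact that $\omega$ is non-decreasing, yields (3.7) with probability at least $1 - \delta/2$. (3.4) holds by the definition of $\hat{g}_T$ and (3.6) follows from the assumption (3.2). Applying the union bound achieves the result.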

While we refer to $\omega$ in a general form, we give an example in Section 3.3.1 and expect that others exist. We define $\omega$ by relating $R_S(\hat{g}_S \circ \hat{f})$ to $\min_{g \in \mathcal{G}} R_T(g \circ \hat{f})$, since we expect this may be feasible analytically as in our example in Section 3.3.1. However, because we only observe $\hat{R}_S(\hat{g}_S \circ \hat{f})$, in Theorem 3.1 we use this to bound $R_S(\hat{g}_S \circ \hat{f})$ and then apply $\omega$.

It is instructive to compare Theorem 3.1 to a standard VC-dimension based bound on the target task risk of an hypothesis $h$ drawn from $\mathcal{H}$ learned using $m_T$ training points [Mohri et al., 2012]: with probability at least $1-\delta$, for all hypotheses $h$ simultaneously,

\[
R_T(h) \le \hat{R}_T(h) + 2\sqrt{\frac{2\,\mathrm{VC}(\mathcal{H})\log\big(2em_T/\mathrm{VC}(\mathcal{H})\big) + 2\log(4/\delta)}{m_T}}. \tag{3.9}
\]

We conclude that if $\omega(R) = O(R)$; $\hat{R}_S(\hat{g}_S \circ \hat{f})$ is a small constant; $m_S \gg m_T$, i.e. labeled source task data is abundant while labeled target task data is scarce; and $\mathrm{VC}(\mathcal{H}) \gg \mathrm{VC}(\mathcal{G})$, i.e. transferring the representation function $\hat{f}$ simplifies target task learning by virtue of the smaller hypothesis space it induces compared to searching $\mathcal{F}$; then the VC-dimension-based upper bound on target task risk is smaller when transferring $\hat{f}$ from $S$ than when learning $T$ from scratch using $\mathcal{H}$.
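The comparison between (3.9) and Theorem 3.1 can be made concrete with a small numerical sketch. Everything below is an illustrative assumption rather than a result from the text: $\omega$ is taken to be the identity, the learner from scratch is granted zero empirical risk on $T$ (the most favourable case for (3.9)), and the VC-dimensions and sample sizes are hypothetical.

```python
import math


def vc_gap(vc_dim: int, m: int, delta: float) -> float:
    """Right-hand side of (3.8): uniform gap between risk and empirical risk."""
    return 2 * math.sqrt(
        (2 * vc_dim * math.log(2 * math.e * m / vc_dim) + 2 * math.log(4 / delta)) / m
    )


def scratch_bound(emp_risk_T, vc_H, m_T, delta):
    """Bound (3.9): learn T from scratch over the full class H."""
    return emp_risk_T + vc_gap(vc_H, m_T, delta)


def transfer_bound_identity(emp_risk_S, vc_H, m_S, vc_G, m_T, delta):
    """Theorem 3.1 with the assumed choice omega(R) = R.
    Each class is given confidence delta/2, which produces the log(8/delta) terms."""
    return (
        emp_risk_S
        + vc_gap(vc_H, m_S, delta / 2)
        + 2 * vc_gap(vc_G, m_T, delta / 2)
    )


# Hypothetical regime: m_S >> m_T and VC(H) >> VC(G).
scratch = scratch_bound(emp_risk_T=0.0, vc_H=50, m_T=2_000, delta=0.05)
transfer = transfer_bound_identity(
    emp_risk_S=0.02, vc_H=50, m_S=1_000_000, vc_G=5, m_T=2_000, delta=0.05
)
print(f"from scratch (3.9): {scratch:.3f}")
print(f"transfer (Thm 3.1): {transfer:.3f}")
```

With these hypothetical numbers the from-scratch bound exceeds 1, which is vacuous for a risk taking values in $[0, 1]$, while the transfer bound stays below 1.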

We observe that a smaller upper bound on risk does not imply smaller risk; indeed, since $\mathcal{G}_{\hat{f}} \subseteq \mathcal{H}$, it follows that

\[
\min_{h \in \mathcal{H}} R_T(h) \le \min_{g \in \mathcal{G}} R_T(g \circ \hat{f})
\]

and hence we may 'get lucky' and find a low risk hypothesis $h$ learning just with samples from task $T$. In this case, however, we may not be able to verify that the hypothesis is low risk, given the scarcity of samples from $T$ and the expressiveness of $\mathcal{H}$. Conversely, by transferring $\hat{f}$ from $S$ and applying Theorem 3.1, we may more tightly bound target task risk with high probability. We observe that Theorem 3.1 can be used to select a source task $S$ from several options by picking the task corresponding to the lowest risk upper bound.
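Source task selection by lowest bound can be sketched as follows. This is a hypothetical illustration: it assumes every candidate source shares the same classes $\mathcal{H}$ and $\mathcal{G}$ and the same target sample, so the target-side term of Theorem 3.1 is common to all candidates and only the source-side term $\omega(\hat{R}_S + \ldots)$ needs to be compared. The `SourceCandidate` type, the per-task $\omega$ functions and all numbers are invented for the example.

```python
import math
from typing import Callable, NamedTuple, Sequence


class SourceCandidate(NamedTuple):
    name: str
    omega: Callable[[float], float]  # assumed known analytically for this task pair
    emp_risk: float                  # empirical risk of g_S_hat composed with f_hat on S
    m_S: int                         # number of source samples


def source_term(c: SourceCandidate, vc_H: int, delta: float) -> float:
    """Source-side term of Theorem 3.1: omega(R_hat_S + 2*sqrt(...))."""
    gap = 2 * math.sqrt(
        (2 * vc_H * math.log(2 * math.e * c.m_S / vc_H) + 2 * math.log(8 / delta)) / c.m_S
    )
    return c.omega(c.emp_risk + gap)


def select_source(cands: Sequence[SourceCandidate], vc_H: int, delta: float) -> SourceCandidate:
    """Pick the candidate with the smallest source-side term, hence the smallest bound."""
    return min(cands, key=lambda c: source_term(c, vc_H, delta))


candidates = [
    SourceCandidate("A", omega=lambda r: 3 * r, emp_risk=0.01, m_S=1_000_000),
    SourceCandidate("B", omega=lambda r: r, emp_risk=0.05, m_S=200_000),
]
best = select_source(candidates, vc_H=50, delta=0.05)
print("selected source task:", best.name)
```

In this made-up example the task with the lower empirical risk and more data is not selected, because its assumed $\omega$ grows faster; the ranking depends on $\omega$ as much as on the observed source risk.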

[Figure 3.2 (image not reproduced). Left panel: "Learn $T$ from scratch". Right panels: "Learn $\hat{g}_S \circ \hat{f}$ on $S$", then "Transfer $\hat{f}$ from $S$, learn $\hat{g}_T$ on $T$".]

Figure 3.2: Neural network example learning $T$ from scratch (left) and with weights transferred from $S$ (right). Thin blue and thick red lines show weights trained on $S$ and $T$ respectively. Under certain assumptions, weight transfer yields low risk on $T$.

3.3.1 Neural Network Example with Fixed Representation

In Theorem 3.5, we give an example of the property required by Theorem 3.1, which is specific to a particular problem setting. We consider a feedforward neural network with a single hidden layer (see Figure 3.2). We propose transferring the lower-level weights (corresponding to $\hat{f}$) learned on $S$, so that only the upper-level weights (corresponding to $\mathcal{G}$) have to be learned on $T$. We want to show $\hat{f}$ is also useful for $T$, i.e. that for some $g \in \mathcal{G}$ we have small $R_T(g \circ \hat{f})$.

We introduce several assumptions required to show Theorem 3.5. Assumption 3.4 requires that some lower-level weights perform well on both tasks, which is clearly a necessary condition for the specific $\hat{f}$ we are transferring to perform well on both tasks. Our other two assumptions together guarantee that a point $x \in \mathcal{X}$ for which $\hat{f}(x)$ contributes to the risk on $T$ cannot be 'hidden' from the risk of using $\hat{f}$ on $S$, either through low magnitude upper-level weights (prevented by Assumption 3.2) or low $\mu'_X(x)$ (prevented by Assumption 3.3). Hence $R_S(\hat{g}_S \circ \hat{f})$ reliably indicates the usefulness of $\hat{f}$ on $T$.

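Assumption 3.2 (Restricted class of feedforward neural networks). Let $\mathcal{X} = \mathbb{R}^n$, $\mathcal{Z} = \mathbb{R}^k$ and $a : \mathbb{R} \to \mathbb{R}$ be a fixed activation function satisfying

\[
a(-x) = -a(x), \tag{3.10}
\]

i.e. $a$ is an odd function (examples include tanh, sign and identity). Let

\[
\mathcal{F} := \{f : \mathcal{X} \to \mathcal{Z} : f(x) = [a(w_1 \cdot x), \ldots, a(w_k \cdot x)],\ w_i \in \mathbb{R}^n \text{ for } 1 \le i \le k\} \tag{3.11}
\]

and

\[
\mathcal{G} := \{g : \mathcal{Z} \to \mathcal{Y} : g(z) = \mathrm{sign}(v \cdot z),\ v \in \{-1, 1\}^k\}, \tag{3.12}
\]

where the symbol $\cdot$ denotes the dot product.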
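The restricted class of Assumption 3.2 is small enough on the upper level to search exhaustively, since $\mathcal{G}$ contains only $2^k$ sign vectors. The sketch below illustrates the transfer procedure for this class: the representation is held fixed and empirical risk minimization over $\mathcal{G}$ is done by brute force. The function names, the random weights standing in for a learned $\hat{f}$, and the synthetic target labels are all hypothetical; brute-force enumeration is one possible way to carry out the minimization, not a method prescribed by the text.

```python
import itertools

import numpy as np


def representation(W, X, activation=np.tanh):
    """f(x) = [a(w_1 . x), ..., a(w_k . x)] for each row x of X; W has shape (k, n)."""
    return activation(X @ W.T)


def erm_over_G(Z, y):
    """Brute-force empirical risk minimization over G = {sign(v . z) : v in {-1, 1}^k}.
    Returns the first sign vector attaining the lowest empirical 0-1 risk."""
    best_v, best_risk = None, float("inf")
    for signs in itertools.product((-1, 1), repeat=Z.shape[1]):
        v = np.array(signs)
        risk = float(np.mean(np.sign(Z @ v) != y))
        if risk < best_risk:
            best_v, best_risk = v, risk
    return best_v, best_risk


# Hypothetical data: random weights stand in for f_hat learned on S,
# and target labels are generated by some member of G on top of f_hat.
rng = np.random.default_rng(0)
W_hat = rng.normal(size=(3, 4))      # k = 3 hidden units, n = 4 inputs
X_T = rng.normal(size=(200, 4))      # target task inputs
Z_T = representation(W_hat, X_T)     # fixed, transferred representation
v_true = np.array([1, -1, 1])
y_T = np.sign(Z_T @ v_true)

v_hat, emp_risk_T = erm_over_G(Z_T, y_T)
print("selected upper-level weights:", v_hat, "empirical risk on T:", emp_risk_T)
```

Because the labels here are generated by a member of $\mathcal{G}$ on top of the fixed representation, the search attains zero empirical risk; the enumeration cost grows as $2^k$, so this is only practical for small $k$.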


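Assumption 3.3 (Relative rotation invariance between source and target unlabeled distributions). Let $\hat{w}_i \in \mathbb{R}^n$ for $1 \le i \le k$ and let $\hat{f} \in \mathcal{F}$ be defined as

\[
\hat{f}(x) := [a(\hat{w}_1 \cdot x), \ldots, a(\hat{w}_k \cdot x)]. \tag{3.13}
\]

Suppose there exist finite nonzero constants $c$, $\alpha_1, \ldots, \alpha_k$ and $\beta_1, \ldots, \beta_k$ such that

\[
\|w_i\| = \|\alpha_i \hat{w}_i - \beta_i w_i\|, \tag{3.14}
\]

\[
w_i \cdot (\alpha_i \hat{w}_i - \beta_i w_i) = 0, \tag{3.15}
\]

the $2k \times n$ matrix

\[
M := \begin{bmatrix} w_1 \\ \alpha_1 \hat{w}_1 - \beta_1 w_1 \\ \vdots \\ w_k \\ \alpha_k \hat{w}_k - \beta_k w_k \end{bmatrix} \tag{3.16}
\]

is full rank,³ and

\[
\forall x_1, x_2 \in \mathcal{X} \text{ such that } \|Mx_1\| = \|Mx_2\|, \quad \mu_X(x_1) \le c\,\mu'_X(x_2), \tag{3.17}
\]

which we call relative rotation invariance and implies $\mu'_X$ and $\mu_X$ have the same support.⁴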
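The algebraic conditions (3.14) to (3.16) can be checked numerically for given weights and constants. The following sketch does so with NumPy; the function names and the example weights are hypothetical, and the density condition (3.17) is not checked because it depends on the unknown distributions $\mu_X$ and $\mu'_X$.

```python
import numpy as np


def build_M(W, W_hat, alpha, beta):
    """Stack rows w_1, a_1*w_hat_1 - b_1*w_1, ..., w_k, a_k*w_hat_k - b_k*w_k as in (3.16)."""
    U = alpha[:, None] * W_hat - beta[:, None] * W
    M = np.empty((2 * W.shape[0], W.shape[1]))
    M[0::2] = W   # even rows: w_i
    M[1::2] = U   # odd rows: alpha_i * w_hat_i - beta_i * w_i
    return M


def check_conditions(W, W_hat, alpha, beta, tol=1e-9):
    """Check (3.14) equal norms, (3.15) orthogonality and full rank of M from (3.16)."""
    U = alpha[:, None] * W_hat - beta[:, None] * W
    M = build_M(W, W_hat, alpha, beta)
    return {
        "equal_norms": bool(
            np.allclose(np.linalg.norm(W, axis=1), np.linalg.norm(U, axis=1), atol=tol)
        ),
        "orthogonal": bool(np.allclose(np.sum(W * U, axis=1), 0.0, atol=tol)),
        "full_rank": bool(np.linalg.matrix_rank(M) == min(M.shape)),
    }


# Hypothetical example with n = 4, k = 2: each w_hat_i is w_i plus an orthogonal unit vector.
W = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.0]])
W_hat = np.array([[1.0, 1.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0, 1.0]])
alpha = np.array([1.0, 1.0])
beta = np.array([1.0, 1.0])

result = check_conditions(W, W_hat, alpha, beta)
print(result)
```

In this example $M$ is the $4 \times 4$ identity matrix, so all three checks pass; setting $\hat{w}_i = w_i$ with the same constants would make the second row of each pair zero and violate both the norm and rank conditions.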

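Assumption 3.4 (Shared representation exists). Suppose there exist some $f \in \mathcal{F} : f(x) := [a(w_1 \cdot x), \ldots, a(w_k \cdot x)]$, $g_S \in \mathcal{G} : g_S(z) := \mathrm{sign}(v_S \cdot z)$, $g_T \in \mathcal{G} : g_T(z) := \mathrm{sign}(v_T \cdot z)$, and $\epsilon \ge 0$ such that

\[
\max\big[R_S(g_S \circ f),\, R_T(g_T \circ f)\big] \le \epsilon. \tag{3.18}
\]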

We now state the target task risk bound for transferring representations in our neural network example.

³ To see that this condition is necessary, consider the following example where $M$ is not full rank. Let $n = 4$, $k = 2$, $y = \mathrm{sign}(x_1)$ under $\mu'_{XY}$ and $y = \mathrm{sign}(x_2)$ under $\mu_{XY}$. For $f(x) = [x_1 + x_2,\ x_1 - x_2]$, $g_S(z) = \mathrm{sign}(z_1 + z_2)$ and $g_T(z) = \mathrm{sign}(z_1 - z_2)$, we have $R_S(g_S \circ f) = R_T(g_T \circ f) = 0$. On $S$ we learn $\hat{f}(x) = [x_1 + x_3,\ x_1 - x_3]$ and $\hat{g}_S(z) = \mathrm{sign}(z_1 + z_2)$, so that $R_S(\hat{g}_S \circ \hat{f}) = 0$, but in general $\min_{g \in \mathcal{G}} R_T(g \circ \hat{f}) > 0$ since $\hat{f}$ ignores $x_2$.

⁴ If $M$ is an orthogonal matrix then $\forall x_1, x_2 \in \mathcal{X}$ such that $\|x_1\| = \|x_2\|$, $\mu_X(x_1) \le c\,\mu'_X(x_2)$. For example, this equation is satisfied if $\mu_X$ and $\mu'_X$ are spherical Gaussians. Note that a zero-mean multivariate Gaussian distribution can be converted to a spherical Gaussian by the whitening transformation $x \to \Lambda^{-1/2} U^T x$, where the columns of $U$ and entries of the diagonal matrix $\Lambda$ are the eigenvectors and