
Bounds for Linear Multi-Task Learning

Andreas Maurer ANDREASMAURER@COMPUSERVE.COM

Adalbertstrasse 55

D-80799 München, Germany

Editor: Nello Cristianini

Abstract

We give dimension-free and data-dependent bounds for linear multi-task learning where a common linear operator is chosen to preprocess data for a vector of task-specific linear-thresholding classifiers. The complexity penalty of multi-task learning is bounded by a simple expression involving the margins of the task-specific classifiers, the Hilbert-Schmidt norm of the selected preprocessor and the Hilbert-Schmidt norm of the covariance operator for the total mixture of all task distributions, or, alternatively, the Frobenius norm of the total Gramian matrix for the data-dependent version. The results can be compared to state-of-the-art results on linear single-task learning.

Keywords: learning to learn, transfer learning, multi-task learning

1. Introduction

Simultaneous learning of different tasks under some common constraint, often called multi-task learning, has been tested in practice with good results under a variety of different circumstances (see Baxter 1998, Caruana 1998, Thrun 1998, Ando and Zhang 2005). The technique has been analyzed theoretically and in some generality by Baxter (2000) and Ando and Zhang (2005). The latter reference appears to be the first to use Rademacher averages in this context. The purpose of this paper is to improve some of these theoretical results in a special case of practical importance, when input data are represented in a linear, potentially infinite dimensional space, and the common constraint is a linear preprocessor.

Finite systems provide simple examples illustrating the potential advantages of multi-task learning. Consider agnostic learning with an input space $\mathcal{X}$ and a finite set $\mathcal{F}$ of hypotheses $f:\mathcal{X}\to\{0,1\}$. For a hypothesis $f\in\mathcal{F}$ let $\mathrm{er}(f)$ be the expected error and $\widehat{\mathrm{er}}(f)$ the empirical error on a training sample $S$ of size $n$ (drawn iid from the underlying task distribution), respectively. Combining Hoeffding's inequality with a union bound one shows (see e.g. Anthony and Bartlett 1999) that with probability greater than $1-\delta$ we have for every $f\in\mathcal{F}$ the error bound
$$\mathrm{er}(f)\le\widehat{\mathrm{er}}(f)+\frac{1}{\sqrt{2n}}\sqrt{\ln|\mathcal{F}|+\ln(1/\delta)}. \tag{1}$$

Suppose now that there are a set $\mathcal{Y}$, a finite but large set $\mathcal{G}$ of preprocessors $g:\mathcal{X}\to\mathcal{Y}$, and another set $\mathcal{H}$ of classifiers $h:\mathcal{Y}\to\{0,1\}$ with $|\mathcal{H}|\ll|\mathcal{F}|$. For a cleverly chosen preprocessor $g\in\mathcal{G}$ it will likely be the case that we find some $h\in\mathcal{H}$ such that $h\circ g$ has the same or even a smaller empirical error than the best $f\in\mathcal{F}$. But this will lead to an improvement of the bound above (replacing $|\mathcal{F}|$ by $|\mathcal{H}|$) only if we choose $g$ before seeing the data; otherwise we incur a large estimation penalty.

The situation is improved if we have a set of $m$ different learning tasks with corresponding task distributions and samples $S_1,\dots,S_m$, each of size $n$ and drawn iid from the corresponding distributions. We now consider solutions $h_1\circ g,\dots,h_m\circ g$ for the $m$ tasks, where the preprocessing map $g\in\mathcal{G}$ is constrained to be the same for all tasks and only the $h_l\in\mathcal{H}$ specialize to each task $l$ at hand. Again Hoeffding's inequality and a union bound imply that with probability greater than $1-\delta$ we have for all $(h_1,\dots,h_m)\in\mathcal{H}^m$ and every $g\in\mathcal{G}$
$$\frac{1}{m}\sum_{l=1}^{m}\mathrm{er}_l(h_l\circ g)\le\frac{1}{m}\sum_{l=1}^{m}\widehat{\mathrm{er}}_l(h_l\circ g)+\frac{1}{\sqrt{2n}}\sqrt{\ln|\mathcal{H}|+\frac{\ln|\mathcal{G}|+\ln(1/\delta)}{m}}. \tag{2}$$

Here $\mathrm{er}_l(f)$ and $\widehat{\mathrm{er}}_l(f)$ denote the expected error in task $l$ and the empirical error on training sample $S_l$, respectively. The left hand side above is an average of the expected errors, so the guarantee implied by the bound is a little weaker than the usual PAC guarantees (but see Ben-David, 2003, for bounds on the individual errors). The first term on the right is the average empirical error, which a multi-task learning algorithm seeks to minimize. We can take it as an operational definition of task-relatedness relative to $(\mathcal{H},\mathcal{G})$ that we are able to obtain a very small value for this term. The remaining term, which bounds the estimation error, now exhibits the advantage of multi-task learning: sharing the preprocessor implies sharing its cost of estimation, and the entropy contribution arising from the selection of $g\in\mathcal{G}$ decreases with the number of learning tasks. Since by assumption $|\mathcal{H}|\ll|\mathcal{F}|$, the estimation error in the multi-task bound (2) can become much smaller than in the single task case (1) if the number $m$ of tasks becomes large.
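A small numerical illustration of this comparison (the set sizes, sample size, and confidence level below are arbitrary assumptions, not values from the text) can be sketched as follows:

```python
import math

# Illustrative sizes (assumptions): a large joint class F, a small
# specialist class H, and a large preprocessor class G.
F_size, H_size, G_size = 10**6, 10, 10**5
n, delta = 1000, 0.05

def single_task_term(n, F_size, delta):
    # Estimation term of bound (1): sqrt((ln|F| + ln(1/delta)) / (2n)).
    return math.sqrt((math.log(F_size) + math.log(1 / delta)) / (2 * n))

def multi_task_term(n, m, H_size, G_size, delta):
    # Estimation term of bound (2): the entropy cost ln|G| of the shared
    # preprocessor is divided by the number of tasks m.
    return math.sqrt(
        (math.log(H_size) + (math.log(G_size) + math.log(1 / delta)) / m)
        / (2 * n)
    )

single = single_task_term(n, F_size, delta)
multi = multi_task_term(n, 100, H_size, G_size, delta)
# With m = 100 tasks the shared cost of selecting g is amortized and
# the multi-task estimation term falls below the single-task one.
```

With a single task ($m=1$) the multi-task term pays the full cost $\ln|\mathcal{G}|$ and exceeds the $m=100$ value, which is exactly the amortization effect described above.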

The choice of the preprocessor $g\in\mathcal{G}$ can also be viewed as the selection of the hypothesis space $\mathcal{H}\circ g$. This leads to an alternative formulation of multi-task learning, where the common object is a hypothesis space chosen from a class of hypothesis spaces (in this case $\{\mathcal{H}\circ g: g\in\mathcal{G}\}$), and the classifiers for the individual tasks are all chosen from the selected hypothesis space. Here we prefer the functional formulation of selecting a preprocessor instead of a hypothesis space, because it is more intuitive and sufficient in the situations which we consider.

The arguments leading to (2) can be refined and extended to certain infinite classes to give general bounds for multi-task learning (Baxter 2000, Ando and Zhang 2005). In this paper we concentrate on the case where the input space $\mathcal{X}$ is a subset of the unit ball in a Hilbert space $H$, the class $\mathcal{G}$ of preprocessors is a set $\mathcal{A}$ of bounded symmetric linear operators on $H$, and the class $\mathcal{H}$ is the set of classifiers $h_v$ obtained by 0-thresholding linear functionals $v$ in $H$ with $\|v\|\le B$, that is
$$h_v(x)=\mathrm{sign}(\langle x,v\rangle)\quad\text{and}\quad h_v\circ T(x)=\mathrm{sign}(\langle Tx,v\rangle),\qquad x\in H,\ T\in\mathcal{A},\ \|v\|\le B.$$

The learner now searches for a multi-classifier $h_{\mathbf{v}}\circ T=(h_{v_1}\circ T,\dots,h_{v_m}\circ T)$, where the preprocessing operator $T\in\mathcal{A}$ is the same for all tasks and only the vectors $v_l$ specialize to each task $l$ at hand. The desired multi-classifier $h_{\mathbf{v}}\circ T$ should have a small value of the average error
$$\mathrm{er}(h_{\mathbf{v}}\circ T)=\frac{1}{m}\sum_{l=1}^{m}\mathrm{er}_l(h_{v_l}\circ T)=\frac{1}{m}\sum_{l=1}^{m}\Pr\left\{\mathrm{sign}\langle TX^l,v_l\rangle\ne Y^l\right\},$$
where $X^l$ and $Y^l$ are the random variables modeling input-values and labels for the $l$-th task. To guide this search we look for bounds on $\mathrm{er}(h_{\mathbf{v}}\circ T)$ in terms of the total observed data for all tasks, valid uniformly for all $\mathbf{v}=(v_1,\dots,v_m)$ with $\|v_l\|\le 1$ and all admissible operators $T$.

Theorem 1 Let $\delta\in(0,1)$. With probability greater than $1-\delta$ it holds for all $\mathbf{v}=(v_1,\dots,v_m)\in H^m$ with $\|v_l\|\le 1$, all bounded symmetric operators $T$ on $H$ with $\|T\|_{HS}\ge 1$, and all $\gamma\in(0,1)$ that
$$\mathrm{er}(h_{\mathbf{v}}\circ T)\le\widehat{\mathrm{er}}_\gamma(\mathbf{v}\circ T)+\frac{8\|T\|_{HS}}{\gamma\sqrt{n}}\sqrt{\|C\|_{HS}+\frac{1}{m}}+\sqrt{\frac{\ln\frac{4\|T\|_{HS}}{\delta\gamma}}{2nm}}.$$

Here $\widehat{\mathrm{er}}_\gamma(\mathbf{v}\circ T)$ is a margin-based empirical error estimate, bounded by the relative number of examples $(X_i^l,Y_i^l)$ in the total training sample for all tasks $l$ for which $Y_i^l\langle TX_i^l,v_l\rangle<\gamma$ (see Section 4). The quantity $\|T\|_{HS}$ is the Hilbert-Schmidt norm of $T$, defined for symmetric $T$ by
$$\|T\|_{HS}=\Big(\sum_i\lambda_i^2\Big)^{1/2},$$
where $(\lambda_i)$ is the sequence of eigenvalues of $T$ (counting multiplicities, see Section 2).

$C$ is the total covariance operator corresponding to the mixture of all the task-input-distributions in $H$. Since data are constrained to the unit ball in $H$ we always have $\|C\|_{HS}\le 1$ (see Section 3).

The above theorem is the simplest, but not the tightest or most general form of our results. For example the factor 8 on the right hand side can be decreased to be arbitrarily close to 2, thereby incurring only a logarithmic penalty in the last term.

A special case results from restricting the set of candidate preprocessors to $\mathcal{P}_d$, the set of orthogonal projections in $H$ with $d$-dimensional range. In this case learning amounts to the selection of a $d$-dimensional subspace $M$ of $H$ and of an $m$-tuple of vectors $v_l$ in $M$ (components of $v_l$ orthogonal to $M$ are irrelevant to the projected data). All operators $T\in\mathcal{P}_d$ satisfy $\|T\|_{HS}=\sqrt{d}$, which can then be substituted in the above bound. Identifying such a projection with the structural parameter $\Theta$, this corresponds to the case considered by Ando and Zhang (2005), where a practical algorithm for this type of multi-task learning is presented. The identity $\|\Theta\|_{HS}=\sqrt{d}$ then expresses the regularization condition mentioned in Ando and Zhang (2005).
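The identity $\|P\|_{HS}=\sqrt{d}$ for a $d$-dimensional orthogonal projection is easy to verify numerically in a finite-dimensional stand-in for $H$ (the dimensions and seed below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
D, d = 12, 4                 # ambient and range dimension (arbitrary choices)

# Orthonormal basis of a random d-dimensional subspace, via QR.
Q, _ = np.linalg.qr(rng.standard_normal((D, d)))
P = Q @ Q.T                  # orthogonal projection onto that subspace

hs_norm = np.linalg.norm(P)  # Frobenius norm = Hilbert-Schmidt norm
# P is symmetric, idempotent (P @ P = P), and ||P||_HS = sqrt(d).
```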

The bound in the above theorem is dimension free: it does not require the data distribution in $H$ to be confined to a finite dimensional subspace. Almost to the contrary: suppose that the input data are distributed uniformly on $M\cap S_1$, where $M$ is a $k$-dimensional subspace in $H$ and $S_1$ is the sphere consisting of vectors with unit norm in $H$. Then $C$ has the $k$-fold eigenvalue $1/k$, the remaining eigenvalues being zero. Therefore $\|C\|_{HS}=1/\sqrt{k}$, so part of the bound above decreases to zero as the dimensionality of the data-distribution increases, in contrast to the bound in Ando and Zhang (2005), which increases linearly in $k$. The fact that our bounds are dimension free allows their general use for multi-task learning in kernel-induced Hilbert spaces (see Cristianini and Shawe-Taylor 2000).

If we compare the second term on the right hand side to the estimation error bound in (2), we can recognize a certain similarity: loosely speaking, we can identify $\|T\|_{HS}^2/m$ with the cost of estimating the operator $T$, and $\|T\|_{HS}^2\|C\|_{HS}$ with the cost of finding the linear classifiers $v_1,\dots,v_m$. The order of dependence on the number of tasks $m$ is the same in Theorem 1 as in (2).

In the limit $m\to\infty$ it is preferable to use a different bound (see Theorems 13 and 15), at the expense of slower convergence in $m$. The main inequality of the theorem then becomes
$$\mathrm{er}(h_{\mathbf{v}}\circ T)\le\widehat{\mathrm{er}}_\gamma(\mathbf{v}\circ T)+\frac{2\|T^2\|_{HS}^{1/2}}{(1-\varepsilon)^2\gamma\sqrt{n}}\left(\|C\|_{HS}^2+\frac{3}{m}\right)^{1/4}+\sqrt{\frac{\ln\frac{\|T^2\|_{HS}^{1/2}}{\delta\gamma\varepsilon^2}}{2nm}} \tag{3}$$
for some very small $\varepsilon>0$ to be fixed in advance. If $T$ is an orthogonal projection with $d$-dimensional range then $\|T^2\|_{HS}^{1/2}=d^{1/4}$, so for a large number of tasks $m$ the bound on the estimation error becomes approximately
$$\frac{2d^{1/4}\|C\|_{HS}^{1/2}}{\gamma\sqrt{n}}.$$
One of the best dimension-free bounds for linear single-task learning (see e.g. Bartlett and Mendelson 2002, or Lemma 11 below) would give $2/(\gamma\sqrt{n})$ for this term, if all data are constrained to unit vectors. We therefore expect superior estimation for multi-task learning of $d$-dimensional projections with large $m$ whenever $d^{1/4}\|C\|_{HS}^{1/2}\ll 1$. If we again assume the data-distribution to be uniform on $M\cap S_1$ with $M$ a $k$-dimensional subspace, this is the case whenever $d\ll k$, that is, qualitatively speaking, whenever the dimension of the utilizable part of the data is considerably smaller than the dimension of the total data distribution.

The results stated above give some insights, but they have the practical disadvantage of being unobservable, because they depend on properties of the covariance operator $C$, which in turn depends on the unknown data distribution. One way to solve this problem is to use the fact that finite-sample approximations to the covariance operator have good concentration properties (see Theorem 8 below). The corresponding result is:

Theorem 2 With probability greater than $1-\delta$ in the sample $\mathbf{X}$ it holds for all $v_1,\dots,v_m\in H$ with $\|v_l\|\le 1$, all bounded symmetric operators $T$ on $H$ with $\|T\|_{HS}\ge 1$, and all $\gamma\in(0,1)$ that
$$\mathrm{er}(h_{\mathbf{v}}\circ T)\le\widehat{\mathrm{er}}_\gamma(\mathbf{v}\circ T)+\frac{8\|T\|_{HS}}{\gamma\sqrt{n}}\sqrt{\frac{1}{mn}\big\|\hat{C}(\mathbf{X})\big\|_{Fr}+\frac{1}{m}}+\sqrt{\frac{9\ln\frac{8\|T\|_{HS}}{\delta\gamma}}{2nm}},$$
where $\|\hat{C}(\mathbf{X})\|_{Fr}$ is the Frobenius norm of the Gramian.

By definition
$$\big\|\hat{C}(\mathbf{X})\big\|_{Fr}=\Big(\sum_{l,r,i,j}\langle X_i^l,X_j^r\rangle^2\Big)^{1/2}.$$
Here $X_i^l$ is the random variable describing the $i$-th data point in the sample corresponding to the $l$-th task. The corresponding label $Y_i^l$ enters only in the empirical margin error. The quantity $(mn)^{-1}\|\hat{C}(\mathbf{X})\|_{Fr}$ can be regarded as an approximation to $\|C\|_{HS}$, valid with high probability, so that Theorem 2 is a sample-based version of Theorem 1.

… the construction of an example to demonstrate the advantages of multi-task learning. The appendix contains the missing proofs and a convenient reference table for the notation and definitions introduced in the paper.

2. Hilbert-Schmidt Operators

For a fixed real, separable Hilbert space $H$, with inner product $\langle\cdot,\cdot\rangle$ and norm $\|\cdot\|$, we define a second real, separable Hilbert space consisting of Hilbert-Schmidt operators. With $HS$ we denote the real vector space of operators $T$ on $H$ satisfying $\sum_{i=1}^\infty\|Te_i\|^2<\infty$ for every orthonormal basis $(e_i)_{i=1}^\infty$ of $H$. Every $T\in HS$ is bounded. For $S,T\in HS$ and an orthonormal basis $(e_i)$ the series $\sum_i\langle Se_i,Te_i\rangle$ is absolutely summable and independent of the chosen basis. The number $\langle S,T\rangle_{HS}=\sum_i\langle Se_i,Te_i\rangle$ defines an inner product on $HS$, making it into a Hilbert space. We denote the corresponding norm with $\|\cdot\|_{HS}$, in contrast to the usual operator norm $\|\cdot\|$. (See Reed and Simon (1980) for background on functional analysis.) We use $HS^*$ to denote the set of symmetric Hilbert-Schmidt operators. For every member of $HS^*$ there is a complete orthonormal basis of eigenvectors, and for $T\in HS^*$ the norm $\|T\|_{HS}$ is just the $\ell^2$-norm of its sequence of eigenvalues. With $HS^+$ we denote the members of $HS^*$ with only nonnegative eigenvalues.

We use two simple maps from $H$ or $H^2$ to $HS$ to relate the geometry of objects in $H$ to the geometry in $HS$.

Definition 3 Let $x,y\in H$. We define two operators $Q_x$ and $G_{x,y}$ on $H$ by
$$Q_x z=\langle z,x\rangle x,\quad z\in H$$
$$G_{x,y}z=\langle x,z\rangle y,\quad z\in H.$$

We will frequently use parts of the following lemma, the proof of which is very easy.

Lemma 4 Let $x,y,x',y'\in H$ and $T\in HS$. Then
(i) $Q_x\in HS^+$ and $\|Q_x\|_{HS}=\|x\|^2$.
(ii) $\langle Q_x,Q_y\rangle_{HS}=\langle x,y\rangle^2$.
(iii) $\langle T,Q_x\rangle_{HS}=\langle Tx,x\rangle$.
(iv) $\langle T^*T,Q_v\rangle_{HS}=\|Tv\|^2$.
(v) $Q_yQ_x=\langle x,y\rangle G_{x,y}$.
(vi) $G_{x,y}\in HS$ and $\|G_{x,y}\|_{HS}=\|x\|\,\|y\|$.
(vii) $\langle G_{x,y},G_{x',y'}\rangle_{HS}=\langle x,x'\rangle\langle y,y'\rangle$.
(viii) $\langle T,G_{x,y}\rangle_{HS}=\langle Tx,y\rangle$.
(ix) For $\alpha\in\mathbb{R}$, $Q_{\alpha x}=\alpha^2 Q_x$.

Proof For $x=0$, (iii) is obvious. For $x\ne 0$ choose an orthonormal basis $(e_i)_{i=1}^\infty$ so that $e_1=x/\|x\|$. Then $e_1$ is the only eigenvector of $Q_x$ with a nonzero eigenvalue, namely $\|x\|^2>0$. Also
$$\langle T,Q_x\rangle_{HS}=\sum_i\langle Te_i,Q_xe_i\rangle=\langle Tx,Q_xx\rangle/\|x\|^2=\langle Tx,x\rangle,$$
which gives (iii). (ii), (i) and (iv) follow from substitution of $Q_y$, $Q_x$ and $T^*T$ respectively for $T$.

To prove (viii), choose an orthonormal basis $(e_k)$. Then $x=\sum_k\langle x,e_k\rangle e_k$, so by boundedness of $T$
$$\langle Tx,y\rangle=\Big\langle T\sum_k\langle x,e_k\rangle e_k,\,y\Big\rangle=\sum_k\langle Te_k,\langle x,e_k\rangle y\rangle=\sum_k\langle Te_k,G_{x,y}e_k\rangle=\langle T,G_{x,y}\rangle_{HS},$$
which is (viii). Similarly
$$\langle G_{x,y},G_{x',y'}\rangle_{HS}=\sum_k\big\langle\langle x,e_k\rangle y,\langle x',e_k\rangle y'\big\rangle=\langle y,y'\rangle\sum_k\langle x,e_k\rangle\langle x',e_k\rangle=\langle x,x'\rangle\langle y,y'\rangle,$$
which gives (vii) and (vi). (ix) is obvious.
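In a finite-dimensional stand-in for $H$ the operators of Definition 3 are the matrices $Q_x=xx^\top$ and $G_{x,y}=yx^\top$, and the identities of Lemma 4 can be checked directly (a sketch with arbitrary random vectors and dimension):

```python
import numpy as np

rng = np.random.default_rng(1)
dim = 5
x, y = rng.standard_normal(dim), rng.standard_normal(dim)

Qx, Qy = np.outer(x, x), np.outer(y, y)   # Q_x z = <z,x> x  is  x x^T
Gxy = np.outer(y, x)                      # G_{x,y} z = <x,z> y  is  y x^T

M = rng.standard_normal((dim, dim))
T = M + M.T                               # a symmetric operator

hs = lambda S, R: float(np.sum(S * R))    # Hilbert-Schmidt inner product

err_i    = abs(np.linalg.norm(Qx) - x @ x)    # (i)   ||Q_x||_HS = ||x||^2
err_ii   = abs(hs(Qx, Qy) - (x @ y) ** 2)     # (ii)  <Q_x,Q_y> = <x,y>^2
err_iii  = abs(hs(T, Qx) - (T @ x) @ x)       # (iii) <T,Q_x> = <Tx,x>
err_viii = abs(hs(T, Gxy) - (T @ x) @ y)      # (viii) <T,G_{x,y}> = <Tx,y>
```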

The following application of Lemma 4 is the key to our main results.

Lemma 5 Let $T\in HS$ and let $w_1,\dots,w_m$ and $v_1,\dots,v_m$ be vectors in $H$ with $\|v_i\|\le B$. Then
$$\sum_{l=1}^m\langle Tw_l,v_l\rangle\le B\|T\|_{HS}\Big(\sum_{l,r}|\langle w_l,w_r\rangle|\Big)^{1/2}$$
and
$$\sum_{l=1}^m\langle Tw_l,v_l\rangle\le Bm^{1/2}\|T^*T\|_{HS}^{1/2}\Big(\sum_{l,r}\langle w_l,w_r\rangle^2\Big)^{1/4}.$$

Proof Without loss of generality assume $B=1$. Using Lemma 4 (viii), Schwarz' inequality in $HS$ and Lemma 4 (vii) we have
$$\sum_{l=1}^m\langle Tw_l,v_l\rangle=\Big\langle T,\sum_{l=1}^m G_{w_l,v_l}\Big\rangle_{HS}\le\|T\|_{HS}\Big\|\sum_{l=1}^m G_{w_l,v_l}\Big\|_{HS}=\|T\|_{HS}\Big(\sum_{l,r}\langle w_l,w_r\rangle\langle v_l,v_r\rangle\Big)^{1/2}\le\|T\|_{HS}\Big(\sum_{l,r}|\langle w_l,w_r\rangle|\Big)^{1/2}.$$
This proves the first inequality. Also, using Schwarz' inequality in $H$ and in $\mathbb{R}^m$, Lemma 4 (iv) and Schwarz' inequality in $HS$,
$$\sum_{l=1}^m\langle Tw_l,v_l\rangle\le\Big(\sum_{l=1}^m\|v_l\|^2\Big)^{1/2}\Big(\sum_{l=1}^m\|Tw_l\|^2\Big)^{1/2}\le\sqrt{m}\,\Big\langle T^*T,\sum_{l=1}^m Q_{w_l}\Big\rangle_{HS}^{1/2}\le\sqrt{m}\,\|T^*T\|_{HS}^{1/2}\Big\|\sum_{l=1}^m Q_{w_l}\Big\|_{HS}^{1/2}=\sqrt{m}\,\|T^*T\|_{HS}^{1/2}\Big(\sum_{l,r}\langle w_l,w_r\rangle^2\Big)^{1/4}.$$
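Both inequalities of Lemma 5 can be checked numerically in finite dimension (random data; the sizes and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
dim, m, B = 6, 4, 1.0

M = rng.standard_normal((dim, dim))
T = M + M.T                                           # symmetric operator
w = rng.standard_normal((m, dim))                     # w_1, ..., w_m
v = rng.standard_normal((m, dim))
v = B * v / np.linalg.norm(v, axis=1, keepdims=True)  # ||v_l|| = B

lhs = sum((T @ w[l]) @ v[l] for l in range(m))
G = w @ w.T                                           # G[l, r] = <w_l, w_r>

# First inequality:  B ||T||_HS (sum |<w_l,w_r>|)^{1/2}
bound1 = B * np.linalg.norm(T) * np.sqrt(np.abs(G).sum())
# Second inequality: B sqrt(m) ||T*T||_HS^{1/2} (sum <w_l,w_r>^2)^{1/4}
bound2 = B * np.sqrt(m) * np.linalg.norm(T.T @ T) ** 0.5 * np.sum(G ** 2) ** 0.25
```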

The set of $d$-dimensional orthogonal projections in $H$ is denoted with $\mathcal{P}_d$. We have $\mathcal{P}_d\subset HS$, and if $P\in\mathcal{P}_d$ then $\|P\|_{HS}=\sqrt{d}$ and $P^2=P$.

An operator $T$ is called trace-class if $\sum_{i=1}^\infty\langle Te_i,e_i\rangle$ is an absolutely convergent series for every orthonormal basis $(e_i)_{i=1}^\infty$ of $H$. In this case the number $\mathrm{tr}(T)=\sum_{i=1}^\infty\langle Te_i,e_i\rangle$ is called the trace of $T$, and it is independent of the chosen basis.

If $\mathcal{A}\subseteq HS^*$ is a set of symmetric and bounded operators on $H$ we use the notation
$$\|\mathcal{A}\|_{HS}=\sup\{\|T\|_{HS}: T\in\mathcal{A}\}\quad\text{and}\quad\mathcal{A}^2=\{T^2: T\in\mathcal{A}\}.$$

3. Vector- and Operator-Valued Random Variables

Let $(\Omega,\Sigma,\mu)$ be a probability space with expectation $E[F]=\int F\,d\mu$ for a random variable $F:\Omega\to\mathbb{R}$. Let $X$ be a random variable with values in $H$ such that $E[\|X\|]<\infty$. The linear functional $v\in H\mapsto E[\langle X,v\rangle]$ is bounded by $E[\|X\|]$ and thus defines (by the Riesz lemma) a unique vector $E[X]\in H$ such that $E[\langle X,v\rangle]=\langle E[X],v\rangle$ for all $v\in H$, with $\|E[X]\|\le E[\|X\|]$.

We now look at the random variable $Q_X$, with values in $HS$. Suppose that $E[\|X\|^2]<\infty$. Passing to the space $HS$ of Hilbert-Schmidt operators, the above construction can be carried out again: by Lemma 4 (i), $E[\|Q_X\|_{HS}]=E[\|X\|^2]<\infty$, so there is a unique operator $E[Q_X]\in HS$ such that $E[\langle Q_X,T\rangle_{HS}]=\langle E[Q_X],T\rangle_{HS}$ for all $T\in HS$.

Definition 6 The operator $E[Q_X]$ is called the covariance operator of $X$.

Lemma 7 The covariance operator $E[Q_X]\in HS^+$ has the following properties:
(i) $\|E[Q_X]\|_{HS}\le E[\|Q_X\|_{HS}]$.
(ii) $\langle E[Q_X]y,z\rangle=E[\langle y,X\rangle\langle z,X\rangle]$ for all $y,z\in H$.
(iii) $\mathrm{tr}(E[Q_X])=E\big[\|X\|^2\big]$.
(iv) For $H$-valued independent $X_1$ and $X_2$ with $E[\|X_i\|^2]<\infty$ we have
$$\langle E[Q_{X_1}],E[Q_{X_2}]\rangle_{HS}=E\big[\langle X_1,X_2\rangle^2\big].$$
(v) Under the same hypotheses, if $E[X_2]=0$ then
$$E[Q_{X_1+X_2}]=E[Q_{X_1}]+E[Q_{X_2}].$$

Proof (i) follows directly from the construction, and (iv) from the identity $\langle E[Q_{X_1}],E[Q_{X_2}]\rangle_{HS}=E\big[\langle Q_{X_1},Q_{X_2}\rangle_{HS}\big]$ together with Lemma 4 (ii). Let $y,z\in H$. Then using Lemma 4 (viii) we get
$$\langle E[Q_X]y,z\rangle=\langle E[Q_X],G_{y,z}\rangle_{HS}=E\big[\langle Q_X,G_{y,z}\rangle_{HS}\big]=E[\langle Q_Xy,z\rangle]=E[\langle y,X\rangle\langle z,X\rangle]$$
and hence (ii). With an orthonormal basis $(e_k)_{k=1}^\infty$ and using (ii),
$$\mathrm{tr}(E[Q_X])=\sum_k\langle E[Q_X]e_k,e_k\rangle=\sum_k E[\langle e_k,X\rangle\langle e_k,X\rangle]=E\big[\|X\|^2\big],$$
which gives (iii). Substitution of an eigenvector $v$ for both $y$ and $z$ in (ii) shows that the corresponding eigenvalue must be nonnegative, so $E[Q_X]\in HS^+$.

Finally, (v) holds because for any $y,z\in H$ we have, using independence and the mean-zero condition for $X_2$,
$$\langle E[Q_{X_1+X_2}]y,z\rangle=E[\langle y,X_1+X_2\rangle\langle X_1+X_2,z\rangle]$$
$$=E[\langle y,X_1\rangle\langle X_1,z\rangle]+E[\langle y,X_2\rangle\langle X_2,z\rangle]+E[\langle y,X_1\rangle\langle X_2,z\rangle]+E[\langle y,X_2\rangle\langle X_1,z\rangle]$$
$$=\langle(E[Q_{X_1}]+E[Q_{X_2}])y,z\rangle+\langle y,E[X_1]\rangle\langle E[X_2],z\rangle+\langle y,E[X_2]\rangle\langle E[X_1],z\rangle$$
$$=\langle(E[Q_{X_1}]+E[Q_{X_2}])y,z\rangle.$$

Property (ii) above is sometimes taken as the defining property of the covariance operator (see Baxendale 1976).

If $X$ is distributed uniformly on $M\cap S_1$, where $M$ is a $k$-dimensional subspace and $S_1$ the unit sphere in $H$, then $E[\langle X,y\rangle^2]=\langle E[Q_X]y,y\rangle$ is zero if and only if $y\perp M$, so the range of $E[Q_X]$ is $M$, and there are exactly $k$ eigenvectors corresponding to non-zero eigenvalues of $E[Q_X]$. By symmetry these eigenvalues must all be equal, and by (iii) above they sum to 1, so $E[Q_X]$ has the $k$-fold eigenvalue $1/k$, with zero as the only other eigenvalue. It follows that $\|E[Q_X]\|_{HS}=1/\sqrt{k}$. We have given this derivation to illustrate the tendency of the Hilbert-Schmidt norm of the covariance operator of a distribution concentrated on unit vectors to decrease with the effective dimensionality of the distribution. This idea is relevant to the interpretation of our results.
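The derivation above can be checked by simulation in a finite-dimensional stand-in for $H$ (the ambient dimension, subspace dimension, and sample size below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(3)
D, k, N = 10, 5, 200_000          # ambient dim, subspace dim, sample size

# Sample uniformly from the unit sphere of a k-dimensional subspace M
# (here the span of the first k coordinates), embedded in R^D.
Z = rng.standard_normal((N, k))
Z /= np.linalg.norm(Z, axis=1, keepdims=True)
X = np.hstack([Z, np.zeros((N, D - k))])

C_hat = X.T @ X / N               # empirical covariance (1/N) sum_i Q_{X_i}

trace = float(np.trace(C_hat))          # = E||X||^2 = 1 for unit vectors
hs_norm = float(np.linalg.norm(C_hat))  # approaches 1/sqrt(k)
```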

The fact that $HS$ is a separable Hilbert space just as $H$ allows us to define an operator $E[T]$ whenever $T$ is a random variable with values in $HS$ and $E[\|T\|_{HS}]<\infty$. Also, any result valid in $H$ has a corresponding analogue valid in $HS$. We quote a corresponding operator-version of a theorem of Cristianini and Shawe-Taylor (2004) on the concentration of independent vector-valued random variables.

Theorem 8 Suppose that $T_1,\dots,T_m$ are independent random variables in $HS$ with $\|T_i\|_{HS}\le 1$. Then for all $\delta>0$, with probability greater than $1-\delta$, we have
$$\Big\|\frac{1}{m}\sum_{i=1}^m E[T_i]-\frac{1}{m}\sum_{i=1}^m T_i\Big\|_{HS}\le\frac{2}{\sqrt{m}}\left(1+\sqrt{\frac{\ln(1/\delta)}{2}}\right).$$

Apply this with $T_i=Q_{X_i}$, where the $X_i$ are iid $H$-valued with $\|X_i\|\le 1$. The theorem then shows that the covariance operator $E[Q_X]$ can be approximated in $HS$-norm, with high probability, by the empirical estimates $(1/m)\sum_i Q_{X_i}$. The quantity
$$\Big\|\sum_i Q_{X_i}\Big\|_{HS}=\Big(\sum_{i,j}\langle X_i,X_j\rangle^2\Big)^{1/2}$$
is the Frobenius norm of the Gramian (or kernel-) matrix $\hat{C}(\mathbf{X})_{ij}=\langle X_i,X_j\rangle$, denoted $\|\hat{C}(\mathbf{X})\|_{Fr}$. An immediate corollary to the above is that $(1/m)\|\hat{C}(\mathbf{X})\|_{Fr}$ is with high probability a good approximation to $\|E[Q_X]\|_{HS}$.
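The identity $\|\sum_i Q_{X_i}\|_{HS}=\|\hat{C}(\mathbf{X})\|_{Fr}$ is exact, since in matrix terms $\sum_i Q_{X_i}=X^\top X$ while the Gramian is $XX^\top$ for a data matrix $X$, and the two share their nonzero singular values. A quick check with arbitrary random data:

```python
import numpy as np

rng = np.random.default_rng(4)
m, D = 8, 6
X = rng.standard_normal((m, D))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # ||X_i|| = 1

sum_Q = X.T @ X                 # sum_i Q_{X_i} as a D x D matrix
gram = X @ X.T                  # Gramian C_hat(X)_ij = <X_i, X_j>

lhs = np.linalg.norm(sum_Q)     # ||sum_i Q_{X_i}||_HS
rhs = np.linalg.norm(gram)      # Frobenius norm of the Gramian
# lhs == rhs because X^T X and X X^T have the same nonzero eigenvalues.
```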


4. Multi-Task Problems and General Bounds

For our discussion of multi-task learning we concentrate on binary labeled data. Let $(\Omega,\Sigma,\mu)$ be a probability space. We assume that there is a multi-task problem modeled by $m$ independent random variables $Z^l=(X^l,Y^l):\Omega\to\mathcal{X}\times\{-1,1\}$, where

$l\in\{1,\dots,m\}$ identifies one of the $m$ learning tasks,
$X^l$ models the input data of the $l$-th task, distributed in a set $\mathcal{X}$, called the input space,
$Y^l\in\{-1,1\}$ models the output- or label-data of the $l$-th task.

For each $l\in\{1,\dots,m\}$ there is an $n$-tuple of independent random variables $(Z_i^l)_{i=1}^n=(X_i^l,Y_i^l)_{i=1}^n$, where each $Z_i^l$ is identically distributed to $Z^l$.

The random variable $\mathbf{Z}=(Z_i^l)_{(i,l)=(1,1)}^{(n,m)}$ is called the training sample or training data. We also write $\mathbf{X}=(X_i^l)_{(i,l)=(1,1)}^{(n,m)}$. We use the superscript $l$ to identify learning tasks, running from 1 to $m$, and the subscript $i$ to identify data points in the sample, running from 1 to $n$. We will use the notations $\mathbf{x}=(x_i^l)_{(i,l)=(1,1)}^{(n,m)}$ for generic members of $(\mathcal{X}^n)^m$ and $\mathbf{z}=(z_i^l)=(\mathbf{x},\mathbf{y})=(x_i^l,y_i^l)_{(i,l)=(1,1)}^{(n,m)}$ for generic members of $((\mathcal{X}\times\{-1,1\})^n)^m$.

A multiclassifier is a map $h:\mathcal{X}\to\{-1,1\}^m$. We write $h=(h^1,\dots,h^m)$ and interpret $h^l(x)$ as the label assigned to the vector $x$ when the task is known to be $l$. The average error of a multiclassifier $h$ is the quantity
$$\mathrm{er}(h)=\frac{1}{m}\sum_{l=1}^m\Pr\big\{h^l(X^l)\ne Y^l\big\},$$
which is just the misclassification probability averaged over all tasks. Typically a classifier is chosen from some candidate set by minimizing some error estimate based on the training data $\mathbf{Z}$. Here we consider zero-threshold classifiers $h_f$ which arise as follows:

Suppose that $\mathcal{F}$ is a class of vector-valued functions $f:\mathcal{X}\to\mathbb{R}^m$ with $f=(f^1,\dots,f^m)$. A function $f\in\mathcal{F}$ defines a multi-classifier $h_f=(h_f^1,\dots,h_f^m)$ through $h_f^l(x)=\mathrm{sign}(f^l(x))$. To give uniform error bounds for such classifiers in terms of empirical estimates, we define for $\gamma>0$ the margin function
$$\varphi_\gamma(t)=\begin{cases}1 & \text{if } t\le 0\\ 1-t/\gamma & \text{if } 0<t\le\gamma\\ 0 & \text{if } \gamma\le t,\end{cases}$$
and for $f\in\mathcal{F}$ the random variable
$$\widehat{\mathrm{er}}_\gamma(f)=\frac{1}{mn}\sum_{l=1}^m\sum_{i=1}^n\varphi_\gamma\big(Y_i^l f^l(X_i^l)\big).$$
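The margin function $\varphi_\gamma$ and the empirical margin error $\widehat{\mathrm{er}}_\gamma$ translate directly into code (a sketch; the array-shape convention `[task, sample]` is ours):

```python
import numpy as np

def phi(t, gamma):
    # Margin function: 1 for t <= 0, linear ramp on (0, gamma], 0 for t >= gamma.
    return np.clip(1.0 - t / gamma, 0.0, 1.0)

def empirical_margin_error(scores, labels, gamma):
    # scores[l, i] = f^l(X_i^l), labels[l, i] = Y_i^l in {-1, +1};
    # returns (1/(mn)) sum_{l,i} phi_gamma(Y_i^l * f^l(X_i^l)).
    return float(phi(labels * scores, gamma).mean())

# One point classified with margin >= gamma contributes 0, one
# misclassified point contributes 1, so the average below is 0.5.
example = empirical_margin_error(
    np.array([[2.0, -2.0]]), np.array([[1.0, 1.0]]), gamma=0.5
)
```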

Theorem 9 Let $\varepsilon,\delta\in(0,1)$.
(i) With probability greater than $1-\delta$ it holds for all $f\in\mathcal{F}$ and all $\gamma\in(0,1)$ that
$$\mathrm{er}(h_f)\le\widehat{\mathrm{er}}_\gamma(f)+\frac{1}{\gamma(1-\varepsilon)}E\big[\hat{R}_n^m(\mathcal{F})(\mathbf{X})\big]+\sqrt{\frac{\ln(1/(\delta\gamma\varepsilon))}{2nm}}.$$
(ii) With probability greater than $1-\delta$ it holds for all $f\in\mathcal{F}$ and all $\gamma\in(0,1)$ that
$$\mathrm{er}(h_f)\le\widehat{\mathrm{er}}_\gamma(f)+\frac{1}{\gamma(1-\varepsilon)}\hat{R}_n^m(\mathcal{F})(\mathbf{X})+\sqrt{\frac{9\ln(2/(\delta\gamma\varepsilon))}{2nm}}.$$

Here $\hat{R}_n^m(\mathcal{F})$ is the empirical Rademacher complexity in the sense of the following definition.

Definition 10 Let $\{\sigma_i^l: l\in\{1,\dots,m\},\ i\in\{1,\dots,n\}\}$ be a collection of independent random variables, distributed uniformly in $\{-1,1\}$. The empirical Rademacher complexity of a class $\mathcal{F}$ of functions $f:\mathcal{X}\to\mathbb{R}^m$ is the function $\hat{R}_n^m(\mathcal{F})$ defined on $(\mathcal{X}^n)^m$ by
$$\hat{R}_n^m(\mathcal{F})(\mathbf{x})=E_\sigma\left[\sup_{f\in\mathcal{F}}\frac{2}{mn}\sum_{l=1}^m\sum_{i=1}^n\sigma_i^l f^l(x_i^l)\right].$$

For the reader's convenience we give a proof of Theorem 9 in the appendix.

The bounds in the theorem each involve three terms. The last one expresses the dependence of the estimation error on the confidence parameter $\delta$ and a model-selection penalty $\ln(1/(\gamma\varepsilon))$ for the choice of the margin $\gamma$. Note that it generally decreases as $1/\sqrt{nm}$. This is not an a priori advantage of multi-task learning, but a trivial consequence of the fact that we estimate an average of $m$ probabilities (in contrast to Ben-David, 2003, where bounds are valid for each individual task, of course under more restrictive assumptions). The $1/\sqrt{nm}$ decay however implies that even for moderate values of $m$ and $n$ the parameter $\varepsilon$ in Theorem 9 can be chosen very small, so that the factor $1/(1-\varepsilon)$ in the second term on the right of the two bounds is very close to unity.

The second term involves the complexity of the function class $\mathcal{F}$, either as measured in terms of the distribution of the random variable $\mathbf{X}$ or in terms of the observed sample. Since the distribution of $\mathbf{X}$ is unobservable in practice, the bound (i) is primarily of theoretical importance, while (ii) can be used to drive an algorithm which selects the multi-classifier $h_{f^*}$, where $(f^*,\gamma)\in\mathcal{F}\times(0,1)$ are chosen to minimize the right side of the bound with given $\delta$, $\varepsilon$. It is questionable if minimizing upper bounds is a good strategy, but it can serve as a motivating guideline.

Of key importance in the analysis of these algorithms is the empirical Rademacher complexity $\hat{R}_n^m(\mathcal{F})(\mathbf{X})$, as observed on the sample $\mathbf{X}$, and its expectation, measuring respectively the sample- and distribution-dependent complexities of the function class $\mathcal{F}$. Bounds on these quantities can be substituted in Theorem 9 to give corresponding error bounds.

5. The Rademacher Complexity of Linear Multi-Task Learning

Throughout this section the input space $\mathcal{X}$ is a subset of the unit ball of the Hilbert space $H$ (see the introduction), so we require $\|X^l\|\le 1$. The case $\|X^l\|=1$, where the data are constrained to the unit sphere in $H$, is of particular interest, corresponding to a class of radial basis function kernels.

We write $C^l$ for the covariance operator $E[Q_{X^l}]$ and $C$ for the total covariance operator $C=(1/m)\sum_l C^l$, corresponding to a uniform mixture of distributions. By Lemma 7 we have
$$\|C^l\|_{HS}\le\mathrm{tr}(C^l)=E\big[\|X^l\|^2\big]\le 1.$$

Let $B>0$, let $T$ be a fixed symmetric, bounded linear operator on $H$ with $\|T\|\le 1$, and let $\mathcal{A}$ be a set of symmetric, bounded linear operators on $H$, all satisfying $\|T\|\le 1$. We will consider the vector-valued function classes
$$\mathcal{F}_B=\big\{x\in H\mapsto(v_1,\dots,v_m)(x):=(\langle x,v_1\rangle,\dots,\langle x,v_m\rangle):\|v_i\|\le B\big\}$$
$$\mathcal{F}_B\circ T=\big\{x\in H\mapsto(v_1,\dots,v_m)\circ T(x):=(\langle Tx,v_1\rangle,\dots,\langle Tx,v_m\rangle):\|v_i\|\le B\big\}$$
$$\mathcal{F}_B\circ\mathcal{A}=\big\{x\in H\mapsto(v_1,\dots,v_m)\circ T(x):\|v_i\|\le B,\ T\in\mathcal{A}\big\}.$$

The algorithms which choose from $\mathcal{F}_B$ and $\mathcal{F}_B\circ T$ are essentially trivial extensions of linear single-task learning, where the tasks do not interact in the selection of the individual classifiers $v_i$, which are chosen independently. In the case of $\mathcal{F}_B\circ T$ the preprocessing operator $T$ is chosen before seeing the training data. Since $\|T\|\le 1$ we have $\mathcal{F}_B\circ T\subseteq\mathcal{F}_B$, so that we can expect a reduced complexity for $\mathcal{F}_B\circ T$, and the key question becomes whether the choice of $T$ (possibly based on experience with other data) was lucky enough to allow for a sufficiently low empirical error.

The non-interacting classes $\mathcal{F}_B$ and $\mathcal{F}_B\circ T$ are important for comparison to $\mathcal{F}_B\circ\mathcal{A}$, which represents proper multi-task learning. Here the preprocessing operator $T$ is selected from $\mathcal{A}$ in response to the data. The constraint that $T$ be the same for all tasks forces an interaction of tasks in the choice of $T$ and $(v_1,\dots,v_m)$, deliberately aiming for a low empirical error. At the same time we also have $\mathcal{F}_B\circ\mathcal{A}\subseteq\mathcal{F}_B$, so that again a reduced complexity is to be expected, giving a smaller contribution to the estimation error. The promise of multi-task learning is based on the combination of these two ideas: aiming for a low empirical error, using a function class of reduced complexity.

We first look at the complexity of the function class $\mathcal{F}_B$. The proof of the following lemma is essentially the same as the proof of Lemma 22 in Bartlett and Mendelson (2002).

Lemma 11 We have
$$\hat{R}_n^m(\mathcal{F}_B)(\mathbf{x})\le\frac{2B}{nm}\sum_{l=1}^m\Big(\sum_{i=1}^n\|x_i^l\|^2\Big)^{1/2}$$
$$E\big[\hat{R}_n^m(\mathcal{F}_B)(\mathbf{X})\big]\le\frac{2B}{\sqrt{n}}\,\frac{1}{m}\sum_{l=1}^m E\big[\|X^l\|^2\big]^{1/2}=\frac{2B}{\sqrt{n}}\,\frac{1}{m}\sum_{l=1}^m\mathrm{tr}(C^l)^{1/2}.$$

Proof Using Schwarz' and Jensen's inequalities and the independence of the $\sigma_i^l$ we get
$$\hat{R}_n^m(\mathcal{F}_B)(\mathbf{x})=E_\sigma\left[\sup_{v_1,\dots,v_m,\|v_l\|\le B}\frac{2}{nm}\sum_{l=1}^m\Big\langle\sum_{i=1}^n\sigma_i^l x_i^l,\,v_l\Big\rangle\right]\le B\,E_\sigma\left[\frac{2}{nm}\sum_{l=1}^m\Big\|\sum_{i=1}^n\sigma_i^l x_i^l\Big\|\right]$$
$$\le\frac{2B}{nm}\sum_{l=1}^m E_\sigma\left[\Big\|\sum_{i=1}^n\sigma_i^l x_i^l\Big\|^2\right]^{1/2}=\frac{2B}{nm}\sum_{l=1}^m\Big(\sum_{i=1}^n\|x_i^l\|^2\Big)^{1/2}.$$

Jensen’s inequality then gives the second conclusion

The first bound in the lemma is just the average of the bounds given by Bartlett and Mendelson on the empirical complexities for the various task-components of the sample. For inputs constrained to the unit sphere in $H$, when $\|X^l\|=1$, both bounds become $2B/\sqrt{n}$, which sets the mark for comparison with the interacting case $\mathcal{F}_B\circ\mathcal{A}$. For motivation we next look at the case $\mathcal{F}_B\circ T$, working with a fixed linear preprocessor $T$ of operator norm bounded by 1. Using the above bound we obtain
$$\hat{R}_n^m(\mathcal{F}_B\circ T)(\mathbf{x})=\hat{R}_n^m(\mathcal{F}_B)(T\mathbf{x})\le\frac{2B}{nm}\sum_{l=1}^m\Big(\sum_{i=1}^n\|Tx_i^l\|^2\Big)^{1/2}, \tag{4}$$
which is always bounded by $2B/\sqrt{n}$, because $\|Tx\|\le\|x\|$ for all $x$. Using Lemma 4 (iv) we can rewrite the right side above as
$$\frac{2B}{\sqrt{n}}\,\frac{1}{m}\sum_{l=1}^m\Big\langle T^2,\frac{1}{n}\sum_{i=1}^n Q_{x_i^l}\Big\rangle_{HS}^{1/2}.$$
Taking the expectation and using the concavity of the root function gives, with two applications of Jensen's inequality and an application of Schwarz' inequality (in $HS$),
$$E\big[\hat{R}_n^m(\mathcal{F}_B\circ T)(\mathbf{X})\big]\le\frac{2B}{\sqrt{n}}\|T^2\|_{HS}^{1/2}\|C\|_{HS}^{1/2},$$
which can be significantly smaller than $2B/\sqrt{n}$, for example if $T$ is a $d$-dimensional projection and the data-distribution is spread well over a much more than $d$-dimensional submanifold of the unit ball in $H$, as explained in the introduction and Section 3. If we substitute the bound above in Theorem 9 we obtain an inequality which looks like (3) in the limit $m\to\infty$.

We now consider the case where $T$ is chosen from some set $\mathcal{A}$ of (symmetric, bounded) candidate operators on the basis of the same sample $\mathbf{X}$, simultaneously with the determination of the classification vectors $v_1,\dots,v_m$. We give two bounds each for the Rademacher complexity and its expectation.

Theorem 12 The following inequalities hold:
$$\hat{R}_n^m(\mathcal{F}_B\circ\mathcal{A})(\mathbf{x})\le\frac{2B\|\mathcal{A}\|_{HS}}{\sqrt{n}}\sqrt{\frac{1}{mn}\big\|\hat{C}(\mathbf{x})\big\|_{Fr}+\frac{1}{m}} \tag{5}$$
$$\hat{R}_n^m(\mathcal{F}_B\circ\mathcal{A})(\mathbf{x})\le\frac{2B\|\mathcal{A}^2\|_{HS}^{1/2}}{\sqrt{n}}\left(\Big(\frac{1}{mn}\big\|\hat{C}(\mathbf{x})\big\|_{Fr}\Big)^2+\frac{2}{m}\right)^{1/4} \tag{6}$$
$$E\big[\hat{R}_n^m(\mathcal{F}_B\circ\mathcal{A})(\mathbf{X})\big]\le\frac{2B\|\mathcal{A}\|_{HS}}{\sqrt{n}}\sqrt{\|C\|_{HS}+\frac{1}{m}} \tag{7}$$
$$E\big[\hat{R}_n^m(\mathcal{F}_B\circ\mathcal{A})(\mathbf{X})\big]\le\frac{2B\|\mathcal{A}^2\|_{HS}^{1/2}}{\sqrt{n}}\left(\|C\|_{HS}^2+\frac{3}{m}\right)^{1/4}. \tag{8}$$

Proof Fix $\mathbf{x}$ and define vectors $w_l=w_l(\sigma)=\sum_{i=1}^n\sigma_i^l x_i^l$, depending on the Rademacher variables $\sigma_i^l$. Then by Lemma 5 and Jensen's inequality
$$\hat{R}_n^m(\mathcal{F}_B\circ\mathcal{A})(\mathbf{x})=E_\sigma\left[\sup_{T\in\mathcal{A}}\sup_{v_1,\dots,v_m,\|v_i\|\le B}\frac{2}{nm}\sum_{l=1}^m\langle Tw_l,v_l\rangle\right] \tag{9}$$
$$\le\frac{2B\|\mathcal{A}\|_{HS}}{nm}E_\sigma\left[\Big(\sum_{l,r}|\langle w_l,w_r\rangle|\Big)^{1/2}\right]\le\frac{2B\|\mathcal{A}\|_{HS}}{nm}\Big(\sum_{l,r}E_\sigma[|\langle w_l,w_r\rangle|]\Big)^{1/2}.$$

Now we have
$$E_\sigma\big[\|w_l\|^2\big]=\sum_{i=1}^n\sum_{j=1}^n E_\sigma[\sigma_i^l\sigma_j^l]\,\langle x_i^l,x_j^l\rangle=\sum_{i=1}^n\|x_i^l\|^2. \tag{10}$$
Also, for $l\ne r$, we get, using Jensen's inequality and independence of the Rademacher variables,
$$\big(E_\sigma[|\langle w_l,w_r\rangle|]\big)^2\le E_\sigma\big[\langle w_l,w_r\rangle^2\big]=\sum_{i,j,i',j'=1}^n E_\sigma[\sigma_i^l\sigma_j^r\sigma_{i'}^l\sigma_{j'}^r]\,\langle x_i^l,x_j^r\rangle\langle x_{i'}^l,x_{j'}^r\rangle=\sum_{i,j=1}^n\langle x_i^l,x_j^r\rangle^2. \tag{11}$$
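Identities (10) and (11) are exact expectations over the Rademacher signs, and for small $n$ they can be verified by enumerating all $2^n$ sign patterns (arbitrary random data):

```python
import itertools
import numpy as np

rng = np.random.default_rng(6)
n, D = 3, 4
xl = rng.standard_normal((n, D))   # data of task l
xr = rng.standard_normal((n, D))   # data of a different task r

signs = [np.array(s) for s in itertools.product([-1.0, 1.0], repeat=n)]

# Identity (10): exact expectation of ||w_l||^2 over all 2^n patterns.
E_norm_sq = np.mean([(s @ xl) @ (s @ xl) for s in signs])
rhs_10 = np.sum(xl ** 2)           # sum_i ||x_i^l||^2

# Identity (11): exact expectation of <w_l, w_r>^2 over all sign pairs.
E_cross_sq = np.mean([((sl @ xl) @ (sr @ xr)) ** 2
                      for sl in signs for sr in signs])
rhs_11 = np.sum((xl @ xr.T) ** 2)  # sum_{i,j} <x_i^l, x_j^r>^2
```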

Taking the square-root and inserting it together with (10) in (9) we obtain the following intermediate bound:
$$\hat{R}_n^m(\mathcal{F}_B\circ\mathcal{A})(\mathbf{x})\le\frac{2B\|\mathcal{A}\|_{HS}}{nm}\left(\sum_{l=1}^m\sum_{i=1}^n\|x_i^l\|^2+\sum_{l\ne r}\Big(\sum_{i,j=1}^n\langle x_i^l,x_j^r\rangle^2\Big)^{1/2}\right)^{1/2}. \tag{12}$$
By Jensen's inequality we have
$$\frac{1}{m^2}\sum_{l\ne r}\Big(\sum_{i,j=1}^n\langle x_i^l,x_j^r\rangle^2\Big)^{1/2}\le\left(\frac{1}{m^2}\sum_{l,r=1}^m\sum_{i,j=1}^n\langle x_i^l,x_j^r\rangle^2\right)^{1/2}=\frac{1}{m}\big\|\hat{C}(\mathbf{x})\big\|_{Fr},$$

which together with (12) and $\|x_i^l\|\le 1$ implies (5).

To prove (6) first use the second part of Lemma 5 and Jensen's inequality to get
$$\hat{R}_n^m(\mathcal{F}_B\circ\mathcal{A})(\mathbf{x})\le\frac{2B\|\mathcal{A}^2\|_{HS}^{1/2}}{n\sqrt{m}}\left(\sum_{l,r}E_\sigma\big[\langle w_l,w_r\rangle^2\big]\right)^{1/4}. \tag{13}$$

Now we have $E_\sigma[\sigma_i^l\sigma_j^l\sigma_{i'}^l\sigma_{j'}^l]\le\delta_{ij}\delta_{i'j'}+\delta_{ii'}\delta_{jj'}+\delta_{ij'}\delta_{ji'}$, so
$$E_\sigma\big[\langle w_l,w_l\rangle^2\big]\le\sum_{i,j=1}^n\Big(\|x_i^l\|^2\|x_j^l\|^2+2\langle x_i^l,x_j^l\rangle^2\Big)\le 2\Big(\sum_{i=1}^n\|x_i^l\|^2\Big)^2+\sum_{i,j=1}^n\langle x_i^l,x_j^l\rangle^2\le 2n^2+\sum_{i,j=1}^n\langle x_i^l,x_j^l\rangle^2,$$

where we used $\|x_i^l\|\le 1$. Inserting this together with (11) in (13) gives
$$\hat{R}_n^m(\mathcal{F}_B\circ\mathcal{A})(\mathbf{x})\le\frac{2B\|\mathcal{A}^2\|_{HS}^{1/2}}{n\sqrt{m}}\left(\sum_{l\ne r}E_\sigma\big[\langle w_l,w_r\rangle^2\big]+\sum_{l=1}^m E_\sigma\big[\langle w_l,w_l\rangle^2\big]\right)^{1/4}\le\frac{2B\|\mathcal{A}^2\|_{HS}^{1/2}}{n\sqrt{m}}\left(\sum_{l,r=1}^m\sum_{i,j=1}^n\langle x_i^l,x_j^r\rangle^2+2mn^2\right)^{1/4}, \tag{14}$$
which is (6).

Taking the expectation of (12), using Jensen's inequality, $\|X^l\|\le 1$, independence of $X^l$ and $X^r$ for $l\ne r$, and Jensen's inequality again, we get
$$E\big[\hat{R}_n^m(\mathcal{F}_B\circ\mathcal{A})(\mathbf{X})\big]\le\frac{2B\|\mathcal{A}\|_{HS}}{nm}\left(nm+\sum_{l\ne r}E\Big[\sum_{i,j=1}^n\langle X_i^l,X_j^r\rangle^2\Big]^{1/2}\right)^{1/2}$$
$$=\frac{2B\|\mathcal{A}\|_{HS}}{nm}\left(\sum_{l\ne r}E\Big[\Big\langle\sum_{i=1}^n Q_{X_i^l},\sum_{j=1}^n Q_{X_j^r}\Big\rangle_{HS}\Big]^{1/2}+nm\right)^{1/2}$$
$$=\frac{2B\|\mathcal{A}\|_{HS}}{\sqrt{n}}\left(\frac{1}{m^2}\sum_{l\ne r}\big\langle E[Q_{X^l}],E[Q_{X^r}]\big\rangle_{HS}^{1/2}+\frac{1}{m}\right)^{1/2}$$
$$\le\frac{2B\|\mathcal{A}\|_{HS}}{\sqrt{n}}\left(\Big\langle\frac{1}{m}\sum_{l=1}^m E[Q_{X^l}],\frac{1}{m}\sum_{r=1}^m E[Q_{X^r}]\Big\rangle_{HS}^{1/2}+\frac{1}{m}\right)^{1/2}=\frac{2B\|\mathcal{A}\|_{HS}}{\sqrt{n}}\left(\|C\|_{HS}+\frac{1}{m}\right)^{1/2},$$

which gives (7). In a similar way we obtain from (14)
$$E\big[\hat{R}_n^m(\mathcal{F}_B\circ\mathcal{A})(\mathbf{X})\big]\le\frac{2B\|\mathcal{A}^2\|_{HS}^{1/2}}{n\sqrt{m}}\left(mn^2+\sum_{l\ne r}\sum_{i,j=1}^n\big\langle E[Q_{X_i^l}],E[Q_{X_j^r}]\big\rangle_{HS}+2mn^2\right)^{1/4}$$
$$\le\frac{2B\|\mathcal{A}^2\|_{HS}^{1/2}}{n\sqrt{m}}\left(m^2n^2\,\Big\|E\Big[\frac{1}{mn}\sum_{l=1}^m\sum_{i=1}^n Q_{X_i^l}\Big]\Big\|_{HS}^2+3mn^2\right)^{1/4},$$
which gives (8).

6. Bounds for Linear Multi-Task Learning

Inserting the bounds of Theorem 12 in Theorem 9 immediately gives

Theorem 13 Let $\mathcal{A}$ be a set of bounded, symmetric operators on $H$ and $\varepsilon,\delta\in(0,1)$.
(i) With probability greater than $1-\delta$ it holds for all $f=(v_1,\dots,v_m)\circ T\in\mathcal{F}_B\circ\mathcal{A}$ and all $\gamma\in(0,1)$ that
$$\mathrm{er}(h_f)\le\widehat{\mathrm{er}}_\gamma(f)+\frac{1}{\gamma(1-\varepsilon)}A+\sqrt{\frac{\ln(1/(\delta\gamma\varepsilon))}{2nm}},$$
where $A$ is either
$$A=\frac{2B\|\mathcal{A}\|_{HS}}{\sqrt{n}}\sqrt{\|C\|_{HS}+\frac{1}{m}} \tag{15}$$
or
$$A=\frac{2B\|\mathcal{A}^2\|_{HS}^{1/2}}{\sqrt{n}}\left(\|C\|_{HS}^2+\frac{3}{m}\right)^{1/4}. \tag{16}$$
(ii) With probability greater than $1-\delta$ it holds for all $f=(v_1,\dots,v_m)\circ T\in\mathcal{F}_B\circ\mathcal{A}$ and all $\gamma\in(0,1)$ that
$$\mathrm{er}(h_f)\le\widehat{\mathrm{er}}_\gamma(f)+\frac{1}{\gamma(1-\varepsilon)}A(\mathbf{X})+\sqrt{\frac{9\ln(2/(\delta\gamma\varepsilon))}{2nm}},$$
where the random variable $A(\mathbf{X})$ is either
$$A(\mathbf{X})=\frac{2B\|\mathcal{A}\|_{HS}}{\sqrt{n}}\sqrt{\frac{1}{mn}\big\|\hat{C}(\mathbf{X})\big\|_{Fr}+\frac{1}{m}}$$
or
$$A(\mathbf{X})=\frac{2B\|\mathcal{A}^2\|_{HS}^{1/2}}{\sqrt{n}}\left(\Big(\frac{1}{mn}\big\|\hat{C}(\mathbf{X})\big\|_{Fr}\Big)^2+\frac{2}{m}\right)^{1/4}.$$

(16)
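The data-dependent quantity $A(X)$ of part (ii) is cheap to compute from a sample: the total Gramian $\hat{C}(X)$ is just the matrix of inner products of all $mn$ data points. A minimal sketch of the first form, where the array layout and the constants $B$ and $\|\mathcal{A}\|_{HS}$ are illustrative assumptions:

```python
import numpy as np

def penalty_A(X, B=1.0, A_hs=1.0):
    """First form of A(X) in Theorem 13(ii), as a sketch.
    X: (m, n, dim) array -- m tasks with n points each (synthetic layout).
    A_hs stands in for the bound ||A||_HS on the preprocessor set."""
    m, n, dim = X.shape
    flat = X.reshape(m * n, dim)
    gram = flat @ flat.T                       # total Gramian \hat{C}(X)
    fro = np.linalg.norm(gram, 'fro')          # Frobenius norm
    return (2.0 * B * A_hs / np.sqrt(n)) * np.sqrt(fro / (m * n) + 1.0 / m)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 10, 200))
X /= np.linalg.norm(X, axis=2, keepdims=True)  # enforce ||x|| <= 1
print(penalty_A(X))
```

For near-orthogonal high-dimensional data the Gramian is close to the identity, so $\|\hat{C}\|_{Fr}\approx\sqrt{mn}$ and the penalty shrinks as $m$ grows.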

Lemma 14 Suppose $\{F(\alpha_1,\alpha_2,\delta):0<\alpha_1,\alpha_2,\delta\le 1\}$ is a set of events such that:

(i) For all $0<\alpha\le 1$ and $0<\delta\le 1$, $\Pr\{F(\alpha,\alpha,\delta)\}\le\delta$.

(ii) For all $0<\alpha_1\le\alpha\le\alpha_2\le 1$ and $0<\delta_1\le\delta\le 1$, $F(\alpha_1,\alpha_2,\delta_1)\subseteq F(\alpha,\alpha,\delta)$.

Then for $0<a,\delta<1$,

\[
\Pr\left\{\bigcup_{\alpha\in(0,1]}F\left(\alpha a,\alpha,\delta\alpha(1-a)\right)\right\}\le\delta.
\]

Applications of this lemma follow a standard pattern, as explained in detail in Anthony and Bartlett (1999). Let $\varepsilon,\delta,B$ be as in the previous theorem. For $\alpha\in(0,1]$ set $\mathcal{A}(\alpha)=\{T\in HS^*:\|T\|_{HS}\le 1/\alpha\}$ and consider the events

\[
F(\alpha_1,\alpha_2,\delta)=\left\{\exists f\in F_B^{\mathcal{A}(\alpha_2)}\text{ such that }
er(h_f)>\hat{er}_\gamma(f)+\frac{2B}{\alpha_1\gamma(1-\varepsilon)\sqrt{n}}\sqrt{\|C\|_{HS}+\frac{1}{m}}+\sqrt{\frac{\ln(1/(\delta\gamma\varepsilon))}{2nm}}\right\}.
\]

By the first conclusion of Theorem 13 the events $F(\alpha_1,\alpha_2,\delta)$ satisfy hypothesis (i) of Lemma 14, and it is easy to see that (ii) also holds. If we set $a=1-\varepsilon$ and replace $\alpha$ by $1/\|T\|_{HS}$, then the conclusion of Lemma 14 reads as follows:

With probability greater than $1-\delta$ it holds for every $f=(v_1,...,v_m)\circ T$ with $(v_1,...,v_m)\in F_B$ and $T\in HS^*$ with $\|T\|_{HS}\ge 1$, and all $\gamma\in(0,1)$, that

\[
er(h_f)\le\hat{er}_\gamma(f)+\frac{2B\|T\|_{HS}}{\gamma(1-\varepsilon)^2\sqrt{n}}\sqrt{\|C\|_{HS}+\frac{1}{m}}+\sqrt{\frac{\ln\left(\|T\|_{HS}/(\delta\gamma\varepsilon^2)\right)}{2nm}}.
\]
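For concreteness, the right-hand side of the stratified bound above is easy to evaluate numerically; the helper below is a sketch that just assembles the two penalty terms (all parameter values are illustrative):

```python
import numpy as np

def stratified_penalty(T_hs, B, gamma, eps, n, m, C_hs, delta):
    """Capacity plus confidence term of the stratified bound (sketch).
    T_hs is ||T||_HS (assumed >= 1), C_hs is ||C||_HS."""
    capacity = (2.0 * B * T_hs / (gamma * (1 - eps) ** 2 * np.sqrt(n))) \
        * np.sqrt(C_hs + 1.0 / m)
    confidence = np.sqrt(np.log(T_hs / (delta * gamma * eps ** 2)) / (2 * n * m))
    return capacity + confidence

print(stratified_penalty(T_hs=2.0, B=1.0, gamma=0.1, eps=0.5, n=1000,
                         m=50, C_hs=0.1, delta=0.05))
```

Note how the penalty grows with $\|T\|_{HS}$, both through the capacity term and, logarithmically, through the confidence term.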

Applying the same technique to the other conclusions of Theorem 13 gives the following result, which we state in abbreviated fashion:

Theorem 15 Theorem 13 holds with the following modifications:

• The class $F_B^{\mathcal{A}}$ is replaced by the class of all $f=(v_1,...,v_m)\circ T\in F_B^{HS^*}$ with $\|T\|_{HS}\ge 1$ (or $\|T^2\|_{HS}\ge 1$).

• $\|\mathcal{A}\|_{HS}$ (or $\|\mathcal{A}^2\|_{HS}$) is replaced by $\|T\|_{HS}$ (or $\|T^2\|_{HS}$).

• $(1-\varepsilon)$ and $1/(\delta\gamma\varepsilon)$ are replaced by $(1-\varepsilon)^2$ and $\|T\|_{HS}/(\delta\gamma\varepsilon^2)$ (or $\|T^2\|_{HS}^{1/2}/(\delta\gamma\varepsilon^2)$) respectively.

The requirement $\|T\|_{HS}\ge 1$ (or $\|T^2\|_{HS}\ge 1$) is an artifact introduced by the stratification lemma.

7. An Example

We conclude with an example illustrating the behavior of our bounds when learning from noisy data. We start with a fairly generic situation and subsequently introduce several idealized assumptions.

Suppose that $(X^l,Y^l)$, $l=1,...,m$, are random variables modelling a multi-task problem as above. In the following let $M$ be the smallest closed subspace $M\subseteq H$ such that $X^l\in M$ a.s.

We now mix the data variables $X^l$ with noise, modeled by a random variable $X_N$ with values in $H$ such that $\|X_N\|\le 1$ and $\mathbb{E}[X_N]=0$. The mixture is controlled by a parameter $s\in(0,1]$ and replaces the original data variables $X^l$ with the contaminated random variables $\hat{X}^l=sX^l+(1-s)X_N$. We now make the assumption that

$X_N$ is independent of $X^l$ and $Y^l$ for all $l$.

Let us call $s$ the signal amplitude and $1-s$ the noise amplitude. The case $s=1$ corresponds to the original multi-task problem. Decreasing $s$, and thus adding more noise, clearly makes learning more difficult, up to the case $s=0$ (which we exclude), where the data variables become independent of the labels and learning becomes impossible. We will look at the behavior of both of our bounds for non-interacting (single-task) and interacting (multi-task) learners as we decrease the signal amplitude $s$. The bounds which we use are implied by Lemma 11 and Theorem 9 for the non-interacting case and by Theorem 13 for the interacting case, and state that each of the following two statements holds with probability at least $1-\delta$:

1. Non-interacting bound. For all $v\in F_1$ and all $\gamma$,

\[
er(h_v)\le\hat{er}_\gamma(v)+\frac{2}{\gamma(1-\varepsilon)\sqrt{n}}+\sqrt{\frac{\ln(1/(\delta\gamma\varepsilon))}{2nm}}.
\]

2. Interacting bound. For all $v\circ T\in F_1^{\mathcal{A}}$ and all $\gamma$,

\[
er(h_{v\circ T})\le\hat{er}_\gamma(v\circ T)+\frac{2\|\mathcal{A}\|_{HS}}{\gamma(1-\varepsilon)\sqrt{n}}\sqrt{\|C\|_{HS}+\frac{1}{m}}+\sqrt{\frac{\ln(1/(\delta\gamma\varepsilon))}{2nm}}.
\]
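The capacity terms of the two bounds can be compared directly; a small helper, with all parameter values purely illustrative:

```python
import numpy as np

def noninteracting_term(gamma, eps, n):
    # capacity term of the non-interacting (single-task) bound
    return 2.0 / (gamma * (1 - eps) * np.sqrt(n))

def interacting_term(gamma, eps, n, A_hs, C_hs, m):
    # capacity term of the interacting (multi-task) bound;
    # A_hs = ||A||_HS, C_hs = ||C||_HS
    return (2.0 * A_hs / (gamma * (1 - eps) * np.sqrt(n))) \
        * np.sqrt(C_hs + 1.0 / m)

# with many tasks and small ||C||_HS the multi-task term is the smaller one:
print(noninteracting_term(0.1, 0.5, 1000))
print(interacting_term(0.1, 0.5, 1000, A_hs=2.0, C_hs=0.05, m=100))
```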

The first damage done by decreasing $s$ is that the margin $\gamma$ must also be decreased to $s\gamma$ to obtain a comparable empirical margin error for the mixed problem as for the original problem. Replacing $\gamma$ by $s\gamma$ is very crude and normally insufficient, because of interfering noise, but the replacement can be justified if one is willing to accept the orthogonality assumption:

$X_N\perp M$ a.s.

The assumption that the noise to be mixed in is orthogonal to the signal is somewhat artificial. We will later assume that the dimension $d$ of the signal space $M$ is small and that $X_N$ is distributed homogeneously on a high-dimensional sphere. This implies weak orthogonality in the sense that $\langle X^l,X_N\rangle^2$ is small with high probability, a statement which could also be used, but at the expense of considerable complications. To immediately free us from consideration of the empirical term (and only for this purpose), we make the orthogonality assumption. By projecting to $M$ we can then find, for any sample $(X_i^l,Y_i^l)$, any preprocessor $T$ and any multi-classifying vector $v$, some $T'$ and $v'$ such that the empirical margin errors $\hat{er}_{s\gamma}(v')$ and $\hat{er}_{s\gamma}(v'\circ T')$ for the mixed sample $(\hat{X}_i^l,Y_i^l)$ coincide with the corresponding margin errors of $v$ and $v\circ T$ at margin $\gamma$ on the original sample.

This will cause a logarithmic penalty in the last term depending on the confidence parameter $\delta$, but we will neglect this entire term on the grounds of its rapid decay with the product $nm$. The remaining term, which depends on $s$ and is different for the two bounds, is then

\[
\frac{2}{\gamma s(1-\varepsilon)\sqrt{n}}\tag{17}
\]

for the non-interacting and

\[
\frac{2\|\mathcal{A}\|_{HS}}{\gamma s(1-\varepsilon)\sqrt{n}}\sqrt{\|C_s\|_{HS}+\frac{1}{m}}\tag{18}
\]

for the interacting case. Here $C_s=(1/m)\sum_l\mathbb{E}\left[Q_{sX^l+(1-s)X_N}\right]$ is the total covariance operator for the noisy mixture problem. By the independence of $X_N$ and $X^l$ and the mean-zero assumption for $X_N$ we obtain from Lemmas 4 and 7 that

\[
C_s=\frac{1}{m}\sum_l\left(s^2\,\mathbb{E}[Q_{X^l}]+(1-s)^2\,\mathbb{E}[Q_{X_N}]\right)=s^2C+(1-s)^2\,\mathbb{E}[Q_{X_N}],
\]

where $C$ would be the total covariance operator for the original problem. To bound the HS-norm of this operator we now introduce a simplifying assumption of homogeneity for the noise distribution:

$X_N$ is distributed uniformly on a $k$-dimensional unit sphere centered at the origin.

This implies that $\|\mathbb{E}[Q_{X_N}]\|_{HS}=1/\sqrt{k}$, so that

\[
\|C_s\|_{HS}\le s^2\|C\|_{HS}+\|\mathbb{E}[Q_{X_N}]\|_{HS}\le s^2\|C\|_{HS}+1/\sqrt{k},
\]

and substitution in (18) gives the new term

\[
\frac{2\|\mathcal{A}\|_{HS}}{\gamma(1-\varepsilon)\sqrt{n}}\sqrt{\|C\|_{HS}+\frac{1}{s^2}\left(\frac{1}{\sqrt{k}}+\frac{1}{m}\right)}\tag{19}
\]

for the interacting bound. The dependence on the inverse signal amplitude in the first factor has disappeared, and as $k$ and $m$ increase, the bound for the noisy problem tends to the same limiting value

\[
\frac{2\|\mathcal{A}\|_{HS}\|C\|_{HS}^{1/2}}{\gamma(1-\varepsilon)\sqrt{n}}
\]

as the bound for the original 'noise-free' problem, for any fixed positive value of $s$. This contrasts with the behavior of bounds which depend linearly on the dimension of the input space and therefore diverge as $k\to\infty$.
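The identity $\|\mathbb{E}[Q_{X_N}]\|_{HS}=1/\sqrt{k}$ used above follows from $\mathbb{E}[X_N X_N^\top]=I/k$ for the uniform distribution on the unit sphere, and is easy to confirm by simulation:

```python
import numpy as np

rng = np.random.default_rng(0)
k, samples = 30, 200000

# uniform samples on the unit sphere in R^k
x = rng.normal(size=(samples, k))
x /= np.linalg.norm(x, axis=1, keepdims=True)

Q = x.T @ x / samples              # estimate of E[Q_{X_N}] = E[x x^T]
fro = np.linalg.norm(Q, 'fro')     # Hilbert-Schmidt (Frobenius) norm
print(fro, 1 / np.sqrt(k))         # both approximately 1/sqrt(30)
```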

The quotient of (19) to the non-interacting term (17) is

\[
\|\mathcal{A}\|_{HS}\sqrt{s^2\|C\|_{HS}+\frac{1}{\sqrt{k}}+\frac{1}{m}}.
\]

An intuitive explanation of the fact that, for multi-task learning, a large dimension $k$ of the noise distribution has a positive effect on the bound is that for large $k$ a sample of homogeneously distributed random unit vectors is less likely to lie in a common low-dimensional subspace, a circumstance which could mislead the multi-task learner.

Of course there are many situations where multi-task learning doesn't give any advantage over single-task learning. To make a quantitative comparison we make three more simplifying assumptions on the data distribution of the $X^l$:

• $\dim(M)=d<\infty$. The signal space $M$ is of course unknown to the learner, but we assume that we know its dimension $d$. Multi-task learning can then select from an economically chosen set $\mathcal{A}\subseteq HS^*$ of preprocessors such that $\mathcal{A}$ contains the set $\mathcal{P}_d$ of $d$-dimensional projections and $\|\mathcal{A}\|_{HS}=\sqrt{d}$. We assume knowledge of $d$ mainly for simplicity; without it we could invoke Theorem 15 instead of Theorem 13 above, causing some complications which we seek to avoid.

• The mixture of the distributions of the $X^l$ is homogeneous on $S_1\cap M$. This implies $\|C\|_{HS}=1/\sqrt{d}$, and, with $\|\mathcal{A}\|_{HS}=\sqrt{d}$, the multi-task bound will improve over the non-interacting bound if

\[
\sqrt{d}\,s^2+\frac{d}{\sqrt{k}}+\frac{d}{m}<1.
\]

From this condition we conclude with four cook-book rules to decide when it is worthwhile to go through the computational trouble of multi-task learning instead of simpler single-task learning:

1. The problem is very noisy ($s$ is expected to be small).

2. The noise is high-dimensional ($k$ is expected to be large).

3. There are many learning tasks ($m$ is large).

4. We suspect that the relevant information for all $m$ tasks lies in a low-dimensional subspace ($d$ is small).

If one believes these criteria to be met, then one can use an algorithm such as the one developed in Ando and Zhang (2005) to minimize the interacting bound above, with $\mathcal{A}=\mathcal{P}_d$.
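Plugging the example's values $\|\mathcal{A}\|_{HS}=\sqrt{d}$ and $\|C\|_{HS}=1/\sqrt{d}$ into the quotient of the two bounds turns the four rules into a one-line check (the numbers below are illustrative):

```python
import numpy as np

def multitask_advantage(d, s, k, m):
    """Quotient of the interacting to the non-interacting bound under
    the example's assumptions; a value below 1 favours multi-task learning."""
    return np.sqrt(np.sqrt(d) * s ** 2 + d / np.sqrt(k) + d / m)

# noisy signal, high-dimensional noise, many tasks, small d:
print(multitask_advantage(d=5, s=0.2, k=10000, m=100))   # well below 1
# strong signal, few tasks:
print(multitask_advantage(d=5, s=1.0, k=10000, m=5))     # above 1
```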

Appendix

Theorem 16 Let $F$ be a $[0,1]^m$-valued function class on a space $\mathcal{X}$, and $X=\left(X_i^l\right)_{(l,i)=(1,1)}^{(m,n)}$ a vector of $\mathcal{X}$-valued independent random variables, where for fixed $l$ and varying $i$ all the $X_i^l$ are identically distributed. Fix $\delta>0$. Then with probability greater than $1-\delta$ we have for all $f=(f_1,...,f_m)\in F$ that

\[
\frac{1}{m}\sum_{l=1}^m\mathbb{E}\left[f_l\left(X_1^l\right)\right]
\le\frac{1}{mn}\sum_{l=1}^m\sum_{i=1}^n f_l\left(X_i^l\right)
+R_{nm}(F)+\sqrt{\frac{\ln(1/\delta)}{2mn}}.
\]

We also have with probability greater than $1-\delta$ for all $f=(f_1,...,f_m)\in F$ that

\[
\frac{1}{m}\sum_{l=1}^m\mathbb{E}\left[f_l\left(X_1^l\right)\right]
\le\frac{1}{mn}\sum_{l=1}^m\sum_{i=1}^n f_l\left(X_i^l\right)
+\hat{R}_{nm}(F)(X)+\sqrt{\frac{9\ln(2/\delta)}{2mn}}.
\]

Proof Let $\Psi$ be the function on $\mathcal{X}^{mn}$ given by

\[
\Psi(x)=\sup_{f\in F}\frac{1}{m}\sum_{l=1}^m\left(\mathbb{E}\left[f_l\left(X_1^l\right)\right]-\frac{1}{n}\sum_{i=1}^n f_l\left(x_i^l\right)\right)
\]

and let $X'$ be an iid copy of the $\mathcal{X}^{mn}$-valued random variable $X$. Then

\[
\mathbb{E}[\Psi(X)]
=\mathbb{E}_X\left[\sup_{f\in F}\frac{1}{mn}\,\mathbb{E}_{X'}\left[\sum_{l=1}^m\sum_{i=1}^n\left(f_l\left(X_i^{l\prime}\right)-f_l\left(X_i^l\right)\right)\right]\right]
\le\mathbb{E}_{XX'}\left[\sup_{f\in F}\frac{1}{mn}\sum_{l=1}^m\sum_{i=1}^n\left(f_l\left(X_i^{l\prime}\right)-f_l\left(X_i^l\right)\right)\right]
\]
\[
=\mathbb{E}_{XX'}\left[\sup_{f\in F}\frac{1}{mn}\sum_{l=1}^m\sum_{i=1}^n\sigma_i^l\left(f_l\left(X_i^{l\prime}\right)-f_l\left(X_i^l\right)\right)\right],
\]

for any realization $\sigma=\left(\sigma_i^l\right)$ of the Rademacher variables, because the expectation $\mathbb{E}_{XX'}$ is symmetric under the exchange $X_i^{l\prime}\leftrightarrow X_i^l$. Hence

\[
\mathbb{E}[\Psi(X)]\le\mathbb{E}_X\mathbb{E}_\sigma\left[\sup_{f\in F}\frac{2}{mn}\sum_{l=1}^m\sum_{i=1}^n\sigma_i^l f_l\left(X_i^l\right)\right]=R_{nm}(F).
\]

Now fix $x\in\mathcal{X}^{mn}$ and let $x'\in\mathcal{X}^{mn}$ be as $x$, except for one modified coordinate $x_i^{l\prime}$. Since each $f_l$ has values in $[0,1]$ we have $|\Psi(x)-\Psi(x')|\le 1/(mn)$. So by the one-sided version of the bounded difference inequality (see McDiarmid, 1998)

\[
\Pr\left\{\Psi(X)>\mathbb{E}[\Psi(X)]+\sqrt{\frac{\ln(1/\delta)}{2mn}}\right\}\le\delta.
\]

Together with the above bound on $\mathbb{E}[\Psi(X)]$ and the definition of $\Psi$ this gives the first conclusion. With $x$ and $x'$ as above we have $\left|\hat{R}_{nm}(F)(x)-\hat{R}_{nm}(F)(x')\right|\le 2/(mn)$, so by the other tail of the bounded difference inequality

\[
\Pr\left\{R_{nm}(F)<\hat{R}_{nm}(F)(X)+\sqrt{\frac{4\ln(1/\delta)}{2mn}}\right\}\ge 1-\delta.
\]
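The empirical Rademacher average $\hat{R}_{nm}(F)(x)$ appearing in Theorem 16 is itself easy to estimate by Monte Carlo when $F$ is a small finite class; the toy class below is entirely synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, n_funcs, trials = 3, 50, 10, 5000

# values f_l(x_i^l) in [0,1] for a synthetic finite class of 10 functions
f_vals = rng.uniform(size=(n_funcs, m, n))

# hat{R}_nm(F)(x) = E_sigma sup_f (2/(mn)) sum_{l,i} sigma_i^l f_l(x_i^l)
sigma = rng.choice([-1.0, 1.0], size=(trials, m, n))
inner = np.einsum('tli,fli->tf', sigma, f_vals)
r_hat = (2.0 / (m * n)) * inner.max(axis=1).mean()
print(r_hat)   # small and positive for this toy class
```

As the theorem suggests, the estimate is of order $1/\sqrt{mn}$ for a fixed small class.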

References
