• No results found

Optimal learning rates for least squares regularized regression with unbounded sampling

N/A
N/A
Protected

Academic year: 2021

Share "Optimal learning rates for least squares regularized regression with unbounded sampling"

Copied!
13
0
0

Loading.... (view fulltext now)

Full text

(1)

Contents lists available atScienceDirect

Journal of Complexity

journal homepage:www.elsevier.com/locate/jco

Optimal learning rates for least squares regularized

regression with unbounded sampling

Cheng Wang

a

, Ding-Xuan Zhou

b,∗

aCollege of Mathematical Sciences, Guangxi Normal University, Guilin, Guangxi 541004, PR China bDepartment of Mathematics, City University of Hong Kong, Kowloon, Hong Kong, China

a r t i c l e i n f o

Article history: Received 26 April 2010 Accepted 28 September 2010 Available online 14 October 2010 Keywords:

Learning theory Least squares regression

Regularization in reproducing kernel Hilbert spaces

Covering number

a b s t r a c t

A standard assumption in theoretical study of learning algorithms for regression is uniform boundedness of output sample values. This excludes the common case with Gaussian noise. In this paper we investigate the learning algorithm for regression generated by the least squares regularization scheme in reproducing kernel Hilbert spaces without the assumption of uniform boundedness for sampling. By imposing some incremental conditions on moments of the output variable, we derive learning rates in terms of regularity of the regression function and capacity of the hypothesis space. The novelty of our analysis is a new covering number argument for bounding the sample error.

©2010 Elsevier Inc. All rights reserved.

1. Introduction

Learning algorithms produce approximations of functions from samples. Efficiency of algorithms relies on models relating the approximated functions on a metric spaceXand samples inY

=

R. Here we take a model with a Borel probability measure

ρ

onZ

:=

X

×

Y. We assume that a sample z

= {

(

xi

,

yi

)

}

mi=1

Zmis drawn independently from

ρ

and the approximated function is theregression functionof

ρ

defined by

fρ

(

x

)

=

Y

yd

ρ(

y

|

x

),

x

X

.

(1.1)

The work described in this paper is supported partially by the Research Grants Council of Hong Kong [Project No. CityU

103508 ] and the National Science Fund for Distinguished Young Scholars of China [Project No. 10529101 ]. ∗Corresponding author.

E-mail addresses:[email protected](C. Wang),[email protected](D.-X. Zhou). 0885-064X/$ – see front matter©2010 Elsevier Inc. All rights reserved.

(2)

Here

ρ(

·|

x

)

is the conditional distribution of

ρ

atx

X. One measurement for the efficiency of a learning algorithm is the distance between the approximant produced by the algorithm andfρin the spaceL2

ρXwith norm

f

LX

=

(

X

|

f

(

x

)

|

2d

ρ

X

)

1/2where

ρ

Xis the marginal distribution of

ρ

onX.

In this paper we consider alearning algorithmgenerated by a least squares regularization scheme in areproducing kernel Hilbert space(RKHS). LetK

:

X

×

X

Rbe a bounded, symmetric, and positive semi-definite function. The RKHSHKassociated with the kernelKis the completion of the linear span

of functions

{

Kx

:=

K

(

x

,

·

),

x

X

}

with the inner product given by

Kx

,

Ky

HK

= ⟨

Kx

,

Ky

K

=

K

(

x

,

y

)

.

Then the learning algorithm for the regression problem is given by the regularization scheme

fz

=

fz

=

arg min f∈HK

1 m m

i=1

(

f

(

xi

)

yi

)

2

+

λ

f

2K

,

(1.2)

where

λ >

0 is a regularization parameter which may depend on the sample size

λ

=

λ(

m

)

with limm→∞

λ(

m

)

=

0.

There has been a large learning theory literature on error analysis for learning algorithm(1.2); see e.g. [4,17,13,6,2,5,11]. Most obtained error bounds are presented under the standard assumption that

|

y

| ≤

Malmost surely for some constantM

>

0, i.e.,

ρ(

·|

x

)

is supported on

[−

M

,

M

]

for almost every x

X. This standard assumption is abandoned in [2]. There the authors consider a general setting satisfying the condition

Y

exp

|

y

fH

(

x

)

|

2 M

|

y

fH

(

x

)

|

M

1

d

ρ(

y

|

x

)

Σ 2 2M2 (1.3)

for

ρ

X-almost everyx

Xand some constantsM

,

Σ

>

0, wherefHis the orthogonal projection offρ

onto the closure ofHK inL2ρX. Bounds for the error

fz

fρ

L2

ρX

are established under the assumption thatfHactually lies inHK. How to relax the assumptionfH

HK in the error analysis with

|

y

| ≤

M

is investigated in [11,5].

The main purpose of this paper is to conduct error analysis in another general setting satisfying the following moment hypothesis concerning unbounded outputs.

Moment hypothesis: There exist constantsM

>

0 andC

>

0 such that

Y

|

y

|

ℓd

ρ(

y

|

x

)

C

!

M

N, x

X

.

(1.4)

Remark 1. The moment hypothesis is a natural generalization of condition(1.3)to cases without the restrictionfH

HK orfH

L

(

X

)

. In fact, they are equivalent (with different constants) in the case fH

L

(

X

)

. To see this, we notice from the Taylor expansion that the left-hand side of(1.3)equals

ℓ=2

Y

|

y

fH

(

x

)

|

ℓd

ρ(

y

|

x

)

!

M

.

Then hypothesis(1.4)implies condition(1.3)withMreplaced by 3M

+

2

fH

∞. Conversely, when (1.3)is valid, we know that for 2

N,

Y

|

y

fH

(

x

)

|

ℓd

ρ(

y

|

x

)

Σ2

!

Mℓ−2

/

2. Hence(1.4)holds

true withC

=

Σ2

2M2

+

1 andMreplaced by max

{

2

fH

,

2M

}

.

A simple computation verifies(1.4)for Gaussian noise.

Example 1. LetB

>

0 andB0

>

0. If for eachx

X,

|

fρ

(

x

)

| ≤

Band the condition distribution

ρ(

·|

x

)

is

a normal distribution with variance

σ

x2bounded byB0, then(1.4)is satisfied withM

=

max

{

2B0

,

B

}

andC

=

4.

To show some ideas of our error analysis, we first state learning rates of(1.2)in the special case whenfρ

HKandKisC∞onX

Rn.

(3)

Theorem 1. Under the moment hypothesis and fρ

HK, if X

Rnand K is Con X

×

X , then for any 0

< ϵ <

1and0

< δ <

1, by taking

λ

=

mϵ−1, with confidence1

δ

, we have

fz

fρ

2L2 ρX

≤ ˜

Cϵmϵ−1

log4

δ

4ϵ+2

,

whereC

˜

ϵis a constant independent of m or

δ

.

Theorem 1is a corollary of our main result presented in the next section. 2. Main result

Our main result is about learning rates of(1.2)stated under conditions on the approximation ability ofHK with respect tofρand capacity ofHK.

The approximation ability of the hypothesis spaceHKwith respect tofρin the spaceL2ρXis reflected

by approximation error.

Definition 1. The approximation error of the triple

(

HK

,

fρ

, ρ

X

)

is defined as

D

(λ)

=

min

f∈HK

{‖

f

fρ

2

LX

+

λ

f

2K

}

, λ >

0

.

(2.1)

We shall assume that for some 0

< β

1 andCβ

>

0,

D

(λ)

Cβ

λ

β

λ >

0

.

(2.2)

Remark 2. Our analysis applies whenfρis replaced byfHwhich would implyD

(λ)

0 as

λ

0. So (2.2)is a natural assumption. Note [9] thatD

(λ)

=

o

(λ)

would implyfρ

0. So

β

=

1 in(2.2)is the best we can expect. This case is equivalent tofρ

HKwhenHKis dense inL2ρX. See [9]. Assumption

(2.2)with 0

< β <

1 can be characterized in terms of interpolation spaces [7].

For a general kernel on a general metric spaceX, we need the capacity ofHK to quantitatively

understand influence of the complexity of the hypothesis space to learning ability of algorithm(1.2). Here we use covering numbers to measure the capacity.

Definition 2. For a subsetF of a metric space and

η >

0, the covering numberN

(

F

, η)

is defined to be the minimal integer

such that there exist

disks with radius

η

coveringF.

We shall use this notion for ballsBR

= {

f

HK

: ‖

f

K

R

}

as subsets ofL

(

X

)

.

Definition 3. We sayHK has polynomial complexity exponents

>

0 if for some constantC0

>

0,

logN

(

B1

, η)

C0

1

η

s

,

η >

0

.

(2.3)

Remark 3. WhenXis a bounded domain inRnandK

Cτ

(

X

×

X

)

, it is known [18] that(2.3)holds

true withs

=

2τn. In particular, ifK

C

(

X

×

X

)

, condition(2.3)is valid for an arbitrarily smalls

>

0. It would be interesting to extend this covering number bound to unbounded input spacesX. Some ideas from [19] might help.

Now we can state our general result on learning rates for algorithm(1.2).

Theorem 2. Assume the moment hypothesis(1.4)and condition(2.2)with0

< β

1. If HK has

polynomial complexity exponent s

>

0and0

< ϵ <

s+1β , then by taking

λ

=

mβϵ−s+11

(4)

with confidence1

δ

, we have

fz

fρ

2L2 ρX

≤ ˜

Cϵmϵ−s+β1

log4

δ

β((s+1+1β)+2

,

whereC

˜

ϵis a constant depending on

ϵ

but not on m or

δ

.

Theorem 2establishes learning rates for unbounded sampling processes satisfying the moment hypothesis, which generalizes results in the classical case of uniformly bounded outputs (e.g. [13]). In addition, we do not require the sample sizemto be sufficiently large, a restriction imposed as m

mδ,ϵin [13].

Theorem 2provides a confidence-based estimate for the least squares error of the learning algorithm. The dependence of the estimate on the confidence (variance) is in the form of log

(

4

/δ)

which is mild.

Theorem 2will be proved in Section5and the constantC

˜

ϵ will be given explicitly. The proof is mainly based on our novel approach to handle unbounded sampling with a new covering number argument which will be presented in Section4. Note that when

β

=

1 andsis small enough, the learning rate stated inTheorem 2can be arbitrarily close to 1, hence is optimal [2,11,5,3].

To illustrateTheorem 2, we consider the example ofHKbeing a Sobolev spaceHτ

(

X

)

which consists

of functions onX

Rnwith all derivatives of order up ton2

< τ

Nlying inL2

(

X

)

. TakeHK

=

Hτ

(

X

)

.

Then(2.3)holds true withs

=

2τn. If

ρ

Xis the uniform measure andfρ

Hr

(

X

)

for some 0

<

r

< τ

,

then we know [7] that condition(2.2)is valid with

β

=

τr. So the conclusion in the following example is a corollary ofTheorem 2.

Example 2. LetXbe a bounded domain inRnand

ρ

Xbe the uniform measure. Assume the moment

hypothesis(1.4). IfHK

=

Hτ

(

X

)

withn2

< τ

N,fρ

Hr

(

X

)

for some 0

<

r

< τ

, and 0

< ϵ <

2nr+τ,

then by taking

λ

=

mτϵr −2nτ+τ, for any 0

< δ <

1, with confidence 1

δ

, we have

fz

fρ

2L2 ρX

≤ ˜

Cϵmϵ−2nr

log4

δ

τ(r2(rn++τ)τ)ϵ+2

.

Now let us describe two kinds of approaches for error analysis of algorithm(1.2)and compare our learning rates with those in the literature.

The first family of approaches aims at bounding

sup f∈Fλ

Z

(

f

(

x

)

y

)

2d

ρ

1 m m

i=1

(

f

(

xi

)

yi

)

2

(2.4) with a properly chosen function classFλ and then applying some uniform law of large numbers. Such an approach leads to capacity dependent error bounds for various learning algorithms stated in terms of various quantities measuring capacity ofHKsuch as VC-dimension,Vγ-dimension, covering

number, and empirical covering number (e.g. [5,10,14]).

A typical optimal learning rate stated in terms of covering numbers can be found in [13]. It asserts under the conditions ofTheorem 1that for 0

< ϵ <

1 andm

mδ,ϵ, with confidence 1

δ

, we have

fz

fρ

2L2

ρX

≤ ˜

C

(

log2δ

)

mϵ−1. A minor improvement of ourTheorem 1is to determine the restriction m

mδ,ϵ specifically in addition to our main contribution of removing the uniform boundedness assumption of

|

y

| ≤

M.

Another typical optimal learning rate is stated in terms of conditions on eigenvalues

{

λ

i

}

of the integral operatorLKX

:

L 2 ρX

L 2 ρXdefined by LKXf

(

x

)

=

X K

(

x

,

u

)

f

(

u

)

d

ρ

X

(

u

),

x

X

,

f

L2ρX

.

(5)

When the eigenvalues satisfya1ib

λ

i

a2ibfor alli

N, some 1

<

b

<

and 0

<

a1

,

a2

<

,

under the assumptionfH

HK, it was proved in [2] that with confidence 1

δ

,E

(

fz

)

E

(

fH

)

˜

C

(

log6δ

)

2mb+b1, whereE

(

f

)

is the generalization error defined forf

:

X

Ras E

(

f

)

=

Z

(

f

(

x

)

y

)

2d

ρ.

In the special case offρ

HKandb

ϵ1

1, the optimal learning rateO

(

mϵ−1

)

is achieved though the

eigenvalue condition is difficult to check. Optimal learning rates are also discussed in [5] for algorithm (1.2)with the penalty

f

qK for some 0

<

q

<

1 where the conditionfH

HK is replaced by

the uniform boundedness of the eigenvectors ofLKX or more generally by the norm comparison

assumption

f

C

f

sK

f

1−sLX

f

HK (2.5)

with some constantsC

>

0 and 0

s

1. In [11] the lower bound condition for the eigenvalues in [2] is removed and the restrictionfH

HK is replaced by approximation error condition(2.2)with

0

< β

1. When

(

HK

, ρ

X

)

satisfies(2.5)withs

=

1b, it was shown in [11] that when

|

y

| ≤

M, with

confidence 1

δ

,

π

M

(

fz

)

fρ

2L2

ρX

≤ ˜

C

(

log3δ

)

m

bβ

bβ+1, where

π

M

(

f

)

is the projection operator [3,13]

defined by

π

M

(

f

)(

x

)

=

M

,

iff

(

x

) >

M

,

f

(

x

),

if

M

f

(

x

)

M

,

M

,

iff

(

x

) <

M

.

This general result yields optimal learning rate in the special case

β

=

1. The upper bound condition for the eigenvalues and the norm comparison assumption(2.5)used in [2,11,5] can be easily verified in some common situations [9], and they have the advantage of applying to bounded input spacesX. It would be interesting to combine advantages of methods from [2,11,5] and our approach. In particular, the following two questions would lead to further study on error analysis:

1. Is it possible to have some criteria for checking the eigenvalue condition and(2.5)for general marginal distributions

ρ

Xwhich could be used to proveTheorem 1?

2. Can we extend the covering number approach to unbounded input spaces which can be used to recover results in [11]?

The second family of approaches for error analysis of the least squares algorithm(1.2)is to make full use of the linear nature of the algorithm for bounding the error betweenfzandfλ. In [17], a leave-one-out technique was used to obtain

E(

fz

fρ

2L2 ρX

)

D

λ

2

+

E

(

fρ

)

+

D

λ

2



4

κ

2 m

λ

+

2

κ

2 m

λ

2

,

where

κ

=

supxX

K

(

x

,

x

)

. Whenfρ

HKandE

(

fρ

) >

0, the learning rate would be

fz

fρ

2L2

ρX

C

δ

(

m1

)

1

2 with confidence 1

δ

, corresponding to the choice

λ

=

√1 m.

In [4], a functional analysis approach was employed to show that

fz

fρ

2L2

ρX

C

(

m1log2δ

)

25

with confidence 1

δ

whenfρ

=

XK

(

·

,

u

)

g

(

u

)

d

ρ

X

(

u

)

for someg

L 2

ρX. An integral operator

approach was applied in [6] to prove that under the same condition, with confidence 1

δ

, there holds

fz

fρ

2L2

ρX

C

(

log4δ

)

2m −23

(6)

3. Error decomposition

The error analysis of algorithm(1.2)will be conducted by an error decomposition procedure. The idea of error decomposition has been used in the analysis of regularization schemes [15,3,13,14,8,16]. But the previous approaches for bounding(2.4)cannot be applied here because of the unboundedness of the sampling outputs. We shall adjust the error decomposition technique by bounding the outputs with confidence and then applying a novel covering number argument for a finite set of functions (an

η

-net) instead of the ballBR(an infinite set of functions). The detailed procedure is described in

Section4.

Observe that the regression functionfρis a minimizer of the generalization errorE

(

f

)

and actually we have

f

fρ

2

LX

=

E

(

f

)

E

(

fρ

).

(3.1)

If we define the empirical errorEz

(

f

)

as Ez

(

f

)

=

1 m m

i=1

(

f

(

xi

)

yi

)

2

,

then the following error decomposition follows from the relationEz

(

fz

)

+

λ

fz

K2

Ez

(

fλ

)

+

λ

fλ

2 K,

as shown in [13].

Lemma 1. Let fλbe a minimizer of (2.1). Then

E

(

fz

)

E

(

fρ

)

+

λ

fz

2K

S1

(

z

)

+

S2

(

z

)

+

D

(λ),

(3.2)

where

S1

(

z

)

=

(

E

(

fz

)

E

(

fρ

))

(

Ez

(

fz

)

Ez

(

fρ

)),

S2

(

z

)

=

(

Ez

(

fλ

)

Ez

(

fρ

))

(

E

(

fλ

)

E

(

fρ

)).

With decomposition(3.2), the error

fz

fρ

2L2

ρX

=

E

(

fz

)

E

(

fρ

)

can be bounded by estimating the two quantitiesS1

(

z

)

andS2

(

z

)

. While the second quantity can be easily analyzed by applying

probability inequalities to the random variable

(

fλ

(

x

)

y

)

2

(

f

ρ

(

x

)

y

)

2on the space

(

Z

, ρ)

, the first

quantityS1

(

z

)

is the main task of error analysis for algorithm(1.2): thoughS1

(

z

)

can be expressed as

Z

ξ

1

(

z

)

d

ρ

m1

m

i=1

ξ

1

(

zi

)

with

ξ

1

(

z

)

=

(

fz

(

x

)

y

)

2

(

fρ

(

x

)

y

)

2, the major challenge is that

ξ

1is

not a single random variable and it depends on the samplezitself. Our approach to tackleS1

(

z

)

is a

novel covering number argument presented in Section4.

Our error analysis relies on the following probability inequality for random variables without uniform boundedness [1].

Lemma 2. Let X1

,

X2

, . . . ,

Xmbe independent random variables withEXi

=

0. If for some constants

M

, v >

0, the boundE

|

Xi

|

12

!

Mℓ−2

v

holds for every2

N, then Prob

m

i=1 Xi

ε

exp

ε

2 2

(

m

v

+

M

ε)

−1

ε >

0

.

In our setting we applyLemma 2to random variablesXi

=

Eg

g

(

zi

)

for a functiongonZwhere

zi

=

(

xi

,

yi

)

andEg

=

Zg

(

z

)

d

ρ

. Lemma 3. DenoteEzg

=

m1

m

i=1g

(

zi

)

for a measurable function g on Z . If for some M

, v >

0, the bound

E

|

g

Eg

|

12

!

M

−2

v

holds for2

N, then there holds ProbzZm

{

Eg

Ezg

ε

} ≤

exp

m

ε

2 2

(v

+

M

ε)

ε >

0

.

In particular, the second quantityS2

(

z

)

in(3.2)can be bounded easily.

(7)

Lemma 4. Under the moment hypothesis, with confidence at least1

δ 2, we have S2

(

z

)

240

{

+

1

)

2D

(λ)/λ

+

2

(

C

+

1

)

2M2

}

m log 2

δ

+

18D

(λ).

Proof. Consider the functiongon the space

(

Z

, ρ)

given withz

=

(

x

,

y

)

byg

(

z

)

=

(

fρ

(

x

)

y

)

2

(

fλ

(

x

)

y

)

2. It satisfies for 2

N

|

g

(

z

)

|

= |

fρ

(

x

)

fλ

(

x

)

|

|

fρ

(

x

)

+

fλ

(

x

)

2y

|

3ℓ

(

|

fλ

(

x

)

|

+ |

fρ

(

x

)

|

+

2ℓ

|

y

|

)

2ℓ−2

(

|

fλ

(

x

)

|

ℓ−2

+ |

fρ

(

x

)

|

ℓ−2

)

|

fλ

(

x

)

fρ

(

x

)

|

2

.

The moment hypothesis yields for almost everyx

Xthat

Y

|

y

|

ℓd

ρ(

y

|

x

)

C

!

Mℓ and

|

fρ

(

x

)

| =

Y yd

ρ(

y

|

x

)

CM

.

The reproducing property ofHKmeans

f

,

Kx

K

=

f

(

x

),

x

X

,

f

HK

.

(3.3)

It follows that

|

f

(

x

)

| ≤

κ

f

K

f

HK

,

x

X

.

(3.4)

Note thatD

(λ)

=

E

(

fλ

)

E

(

fρ

)

+

λ

fλ

2Kimplies

fλ

K

D

(λ)/λ

. So

|

fλ

(

x

)

| ≤

κ

D

(λ)/λ

for eachx

X. Hence E

|

g

|

=

X

Y

|

g

(

z

)

|

ℓd

ρ(

y

|

x

)

d

ρ

X

(

x

)

3ℓ

{

κ

(

D

(λ)/λ)

ℓ/2

+

CM

+

2ℓC

!

M

}

×

2ℓ−2

{

κ

ℓ−2

(

D

(λ)/λ)

(ℓ−2)/2

+

Cℓ−2Mℓ−2

}

X

|

fλ

(

x

)

fρ

(

x

)

|

2d

ρ

X

(

x

)

!

6ℓ

{

+

1

)

2D

(λ)/λ

+

2

(

C

+

1

)

2M2

}

ℓ−1D

(λ).

It follows that E

|

g

Eg

|

2ℓ+1E

|

g

|

2ℓ+1

!

6ℓ

{

+

1

)

2D

(λ)/λ

+

2

(

C

+

1

)

2M2

}

ℓ−1D

(λ).

DenoteM1,λ

=

12

{

+

1

)

2D

(λ)/λ

+

2

(

C

+

1

)

2M2

}

and

v

1,λ

=

242M1,λD

(λ)

. We find that

E

|

g

Eg

|

1 2

!

M

ℓ−2 1,λ

v

1,λ

.

Then we applyLemma 3and see that for any

ε >

0, ProbzZm

{

Eg

Ezg

ε

} ≤

exp

m

ε

2 2

(v

1,λ

+

M1,λ

ε)

.

Consider the quadratic equation by setting the probability on the right-hand side to be2δ. The positive solution is

ε

=

1 m

M1,λlog 2

δ

+

M12log22

δ

+

2m

v

1,λlog 2

δ

1 m

2M1,λlog 2

δ

+

24

2mM1,λlog 2

δ

D

(λ)

20M1,λ m log 2

δ

+

18D

(λ).

(8)

Then with confidence at least 1

δ 2, we have Eg

Ezg

20M1,λ m log 2

δ

+

18D

(λ).

So our desired bound follows from the identityEg

Ezg

=

S2

(

z

)

sinceEg

= −

(

E

(

fλ

)

E

(

fρ

))

and Ezg

= −

(

Ez

(

fλ

)

Ez

(

fρ

))

.

4. Novelty dealing with unboundedness

In this section we present our novelty in the error analysis to deal with the error termS1

(

z

)

in (3.2)for algorithm(1.2)before provingTheorem 2in the next section. ThoughS1

(

z

)

can be written

as

ξ

1

(

z

)

d

ρ

m1

m

i=1

ξ

1

(

zi

)

with

ξ

1

(

z

)

=

(

fz

(

x

)

y

)

2

(

fρ

(

x

)

y

)

2, the function

ξ

1is not really a

random variable: the functionfzalso depends on the samplez. We would follow the covering number approach in [13] to handle this term. However, the lack of uniform boundedness for sample function values causes serious difficulty and the approach in [13] of estimating quantity(2.4)is not applicable directly: to estimate(2.4), we would have to bound

|

Ez

(

f

)

Ez

(

fj

)

|

, which would lead to bounding

1 m

m

i=1

|

yi

|

for everyf

Fλ. We shall deal with the difficulty in three steps. Our key point is to use a

new novel covering number argument. 4.1. Bounding sample values with confidence

The first step in our approach is to boundywith confidence.

Proposition 1. Under the moment hypothesis there is a subset Zδof Zmwith measure at least1

δ 4such that 1 m m

i=1

|

yi

| ≤

Mδ

:=

CM

+

4M

(

1

+

2C

)

log

(

4

/δ)

m

z

Zδ

.

Proof. Letgbe the function onZgiven byg

(

z

)

= −|

y

|

. Then for 2

N, we have E

|

g

Eg

|

2ℓ+1E

|

y

|

2ℓ+1C

!

M

1 2

!

(

2M

)

ℓ−2C

(

4M

)

2

.

So we can applyLemma 3to obtain for

ε >

0, ProbzZm

{

Eg

Ezg

ε

} ≤

exp

m

ε

2 2

(

C

(

4M

)

2

+

2M

ε)

.

Setting the right-hand side to be 4δ and bounding the solution to the corresponding quadratic equation, we know that with confidence at least 1

δ

4, there holds Eg

Ezg

4M m log 4

δ

{

1

+

2Cm

}

.

Note that

Ezg

=

m1

mi=1

|

yi

|

and

Eg

=

E

|

y

| ≤

CM. Then with confidence 1

δ4, we have 1 m m

i=1

|

yi

| ≤

CM

+

4M

(

1

+

2C

)

log

(

4

/δ)

m

=

Mδ

.

This means that there is a subsetZδofZmwith measure at least 1

δ

4, such that for everyz

Zδ, we

havem1

m

(9)

4.2. Bounding error difference for a finite function set

The second step in our approach is to bound the error difference

[

E

(

f

)

E

(

fρ

)

] − [

Ez

(

f

)

Ez

(

fρ

)

]

for a finite set of functions. For this purpose, we need the following lemma which is a corollary of Lemma 3by taking

ε

as

ε

ε

+ |

Eg

|

.

Lemma 5. Let g be a measurable function on Z . If for some M

,

cv

>

0, the boundE

|

g

Eg

|

1 2

!

M

−2c

v

|

Eg

|

holds for2

N, then ProbzZm

{

Eg

Ezg

ε

ε

+ |

Eg

|} ≤

exp

m

ε

2

(

cv

+

M

)

ε >

0

.

In the following lemma,

{

fj

}

Nj=1is a fixed set of functions which will be chosen as an

η

-net of the

setBRin our error analysis conducted inLemma 7.

Lemma 6. Let0

< δ <

1, R

M and

{

fj

}

Nj=1be a set of functions in BR. Then there exists a subset Zδof

Zmwith measure at least1

δ

4such that E

(

fj

)

E

(

fρ

)

− [

Ez

(

fj

)

Ez

(

fρ

)

] ≤

ε

m,δ,N,R

+

1 2

{

E

(

fj

)

E

(

fρ

)

}

j

∈ {

1

, . . . ,

N

}

,

z

Z ′ δ

,

where

ε

m,δ,N,R

=

520

+

C

+

2

(

C

+

1

))

2 R2 mlog

(

4N

/δ).

Proof. Fixj

∈ {

1

, . . . ,

N

}

. Letgbe the function onZdefined byg

(

z

)

=

(

fj

(

x

)

y

)

2

(

fρ

(

x

)

y

)

2. By

the moment hypothesis, for 2

N, we have

E

|

g

Eg

|

2ℓ+1E

|

g

|

=

2ℓ+1E

{|

fj

(

x

)

fρ

(

x

)

|

· |

fj

(

x

)

+

fρ

(

x

)

2y

|

}

22ℓ+1E

{|

fj

(

x

)

fρ

(

x

)

|

2

R

+

CM

)

ℓ−2

((κ

R

+

CM

)

+

2ℓ

|

y

|

)

}

22ℓ+1

R

+

CM

)

ℓ−2

((κ

R

+

CM

)

+

2ℓC

!

M

)

X

|

fj

(

x

)

fρ

(

x

)

|

2d

ρ

X

.

But

X

|

fj

(

x

)

fρ

(

x

)

|

2d

ρ

X

=

E

(

fj

)

E

(

fρ

)

=

Eg

= |

Eg

|

. So we know that E

|

g

Eg

|

1 2

!

M

ℓ−2 3,R

v

3,R

|

Eg

|

withM3,R

=

4

+

C

+

2

(

C

+

1

))

2R2and

v

3,R

=

43M3,R. The constants are independent ofj. Thus we

can applyLemma 5and get ProbzZm

E

(

fj

)

E

(

fρ

)

(

Ez

(

fj

)

Ez

(

fρ

))

ε

+

E

(

fj

)

E

(

fρ

)

ε

exp

m

ε

2

(v

3,R

+

M3,R

)

.

Now we take all these events withj

∈ {

1

, . . . ,

N

}

and see that

ProbzZm

max 1≤j≤N E

(

fj

)

E

(

fρ

)

(

Ez

(

fj

)

Ez

(

fρ

))

ε

+

E

(

fj

)

E

(

fρ

)

ε

Nexp

m

ε

2

(v

3,R

+

M3,R

)

.

Setting the right-hand side to beδ4, we choose

ε

=

2

(v

3,R

+

M3,R

)

m log

(

4N

/δ)

520

+

C

+

2

(

C

+

1

))

2R2

(10)

Then we conclude that with confidence at least 1

δ 4, there holds E

(

fj

)

E

(

fρ

)

− [

Ez

(

fj

)

Ez

(

fρ

)

] ≤

ε

m,δ,N,R

ε

m,δ,N,R

+

E

(

fj

)

E

(

fρ

)

ε

m,δ,N,R

+

1 2

{

E

(

fj

)

E

(

fρ

)

}

j

∈ {

1

, . . . ,

N

}

.

This proves our statements.

4.3. A new covering number argument

Now we can describe our new covering number argument. It is based on the observation thatfzis only one function (though it changes with the samplez) and can be very close to one of the functions in the net

{

fj

}

Nj=1. Hence we can boundS1

(

z

)

with a single (but varying) functionfjinstead of the quantity

supf∈Fλ

|

E

(

f

)

Ez

(

f

)

|

in(2.4)concerning the whole function setFλ. Denote

W

(

R

)

= {

z

Zm

: ‖

fz

K

R

}

.

Lemma 7. Let0

< δ <

1and R

M. Take Zδ inProposition1and ZδinLemma6. Then for every z

Zδ

Z′ δ

W

(

R

)

we have

[

E

(

fz

)

E

(

fρ

)

] − [

Ez

(

fz

)

Ez

(

fρ

)

] ≤

C1R2m2(s1+1) log4

δ

+

1 2

{

E

(

fz

)

E

(

fρ

)

}

,

where C1is the constant given by

C1

=

6

κ

+

6C

+

8

(

1

+

2C

)/

M

+

520

+

C

+

2

(

C

+

1

))

2

(

C0

+

1

).

Proof. Letz

Zδ

Z′ δ

W

(

Rδ

)

. It satisfiesm1

m i=1

|

yi

| ≤

Mδ. Moreover, E

(

fj

)

E

(

fρ

)

− [

Ez

(

fj

)

Ez

(

fρ

)

] ≤

ε

m,δ,N,R

+

1 2

{

E

(

fj

)

E

(

fρ

)

}

,

j

∈ {

1

, . . . ,

N

}

.

Let

η

=

Rms+11 and

{

fj

}

jN=1withN

=

N

(

BR

, η)

be an

η

-net of the setBRmeaning that for any

f

BR, there exists somej

∈ {

1

, . . . ,

N

}

such that

f

fj

η

. Sincez

W

(

R

)

, we know that fz

BRand there is somejz

∈ {

1

, . . . ,

N

}

such that

fz

fjz

η

. Then

|

E

(

fz

)

E

(

fjz

)

| =

Z

[

fz

(

x

)

fjz

(

x

)

][

fz

(

x

)

+

fjz

(

x

)

2y

]

d

ρ

η

Z

(

|

fz

(

x

)

| + |

fjz

(

x

)

| +

2

|

y

|

)

d

ρ

η(

2

κ

R

+

2E

|

y

|

)

2

η(κ

R

+

CM

).

From the boundm1

m

i=1

|

yi

| ≤

Mδ, we find

|

Ez

(

fz

)

Ez

(

fjz

)

| =

1 m m

i=1

[

fz

(

xi

)

fjz

(

xi

)

][

fz

(

xi

)

+

fjz

(

xi

)

2yi

]

η

1 m m

i=1

(

|

fz

(

xi

)

| + |

fjz

(

xi

)

| +

2

|

yi

|

)

2

η(κ

R

+

Mδ

).

The above two bounds together withLemma 6tell us that

[

E

(

fz

)

E

(

fρ

)

] − [

Ez

(

fz

)

Ez

(

fρ

)

]

= [

E

(

fz

)

E

(

fjz

)

] + {[

E

(

fjz

)

E

(

fρ

)

] − [

Ez

(

fjz

)

Ez

(

fρ

)

]} + [

Ez

(

fjz

)

Ez

(

fz

)

]

2

η(

2

κ

R

+

CM

+

Mδ

)

+

ε

m,δ,N,R

+

1

(11)

But

|

E

(

fz

)

E

(

fjz

)

| ≤

2

η(κ

R

+

CM

)

. So the above expression is bounded by 2

η(

3

κ

R

+

2CM

+

Mδ

)

+

ε

m,δ,N,R

+

1

2

{

E

(

fz

)

E

(

fρ

)

}

.

By the covering number condition(2.3),

ε

m,δ,N,Rcan be bounded as

ε

m,δ,N,R

520

+

C

+

2

(

C

+

1

))

2 R2 m

(

log

(

4

/δ)

+

C0

(

R

/η)

s

)

520

+

C

+

2

(

C

+

1

))

2C 0R2+s m

η

s

+

520

+

C

+

2

(

C

+

1

))

2 m log 4

δ

R2

.

Putting the choice of

η

and the expression ofMδ, we see that

[

E

(

fz

)

E

(

fρ

)

] − [

Ez

(

fz

)

Ez

(

fρ

)

] ≤

1 2

{

E

(

fz

)

E

(

fρ

)

} + {

6

κ

+

6C

+

8

(

1

+

2C

)/

M

+

520

+

C

+

2

(

C

+

1

))

2

(

C0

+

1

)

}

R2m− 1 s+1log4

δ

.

This proves our statement.

CombiningLemmas 4and7, we get from(3.2)the following bound forE

(

fz

)

E

(

fρ

)

+

λ

fz

2K.

Proposition 2. Let0

< δ <

1and R

M. There exists a subset VRof Zmwith measure at most

δ

such

that for everyz

W

(

R

)

\

VR,

E

(

fz

)

E

(

fρ

)

+

λ

fz

2K

38D

(λ)

+

2

[

C1

+

322

(

C

+

1

)

2

]

R2m− 1 s+1log4

δ

+

480

+

1

)

2D

(λ)

λ

m log 2

δ

.

5. Deriving error bounds by iteration

We use an iteration technique [12,13,9] to derive our error bounds.

Proof of Theorem 2. From the bound for

fz

K obtained inProposition 2, we see that forR

Mand

z

W

(

R

)

\

VR, we have

fz

K

amR

+

bm,δ, where am

=

2

[

C1

+

322

(

C

+

1

)

2

]

1

λ

m2(s1+1)

log

(

4

/δ)

and bm

=

38D

(λ)/λ

+

480

+

1

)

2D

(λ)

λ

2m log

(

2

/δ)

+

M

.

It tells us thatW

(

R

)

W

(

amR

+

bm

)

VR.

Let us first derive a rough bound for

fz

K. From the definition offz, we see that

Ez

(

fz

)

+

λ

fz

2K

Ez

(

0

)

+

λ

0

2K

=

1 m m

i=1 y2i

.

It implies that

λ

fz

2K

1 m m

i=1

{

y2i

(

fz

(

xi

)

yi

)

2

} =

1 m m

i=1

{

fz

(

xi

)

[

2yi

fz

(

xi

)

]}

=

1 m m

i=1

{

fz

(

xi

)

2yi

} −

1 m m

i=1

{

fz

(

xi

)

2

} ≤

2 m m

i=1

{

fz

(

xi

)

yi

} ≤

2 m m

i=1 yi

κ

fz

K

.

(12)

It follows that

fz

K

mλ

m

i=1yiand we see fromProposition 1that forz

Zδ,

fz

K

2

κ

Mδ

λ

.

It means thatZδ

W

(

R(δ0)

)

whereR(δ0)

=

Mδ

λ

+

M. Define a sequence

{

R( j) δ

}

j=0by R(δj+1)

=

amR( j) δ

+

bm

.

We know that W

(

R(δ0)

)

W

(

R(δ1)

)

V R(δ0)

⊆ · · · ⊆

W

(

R (J) δ

)

J−1

j=0 V R(δj)

.

Since each setV

R(δj)has measure at most

δ

, the setW

(

R

(J)

δ

)

has measure at least 1

(

J

+

1

. By the definition of the sequence

{

R(δj)

}

,R(δJ)

=

aJmR(δ0)

+

bm

J−1

i=0aim,δ. Now we take

λ

=

m 2ϵ−s+11 with 0

< ϵ <

2(s1+1). Then am

C2m−ϵ

log

(

4

/δ),

whereC2

=

2

[

C1

+

322

(

C

+

1

)

2

]

. PuttingD

(λ)

Cβ

λ

βintobm,δ, we see

bm

≤ {

38Cβ

+

+

1

)

480Cβ

+

M

}

mβ−21(2ϵ− 1 s+1)

log

(

2

/δ).

Thus withC3

=

38Cβ

+

+

1

)

480Cβ

+

M, we have R(δJ)

C2J

log4

δ

2J+1 M

[

2

κ(

C

+

(

1

+

2

2C

))

+

1

]

ms+11−(J+2)ϵ

+

C3m(β −1)ϵ+21(s+β1)

log

(

2

/δ)

Jmax

{

1

, (

C2m−ϵ

log

(

4

/δ))

J−1

}

.

ChooseJϵto be the smallest integer satisfyingJϵ

1+β

2ϵ(s+1)

2. ThenJϵ

1+β 2ϵ(s+1)

1

1 ϵ(s+1)

1 and R(δJϵ)

M

[

2

κ(

C

+

(

1

+

2

2C

))

+

1

]

C2Jϵ

log4

δ

J2ϵ+1

+

C3JϵC2Jϵ−1

(

log

(

4

/δ))

Jϵ 2

m 1−β 2(s+1)

C4 1

ϵ(

s

+

1

)

C 1 ϵ(s+1) 2

(

log

(

4

/δ))

Jϵ 2+1m 1−β 2(s+1)

,

where C4

=

M

(

2

κ(

C

+

(

1

+

2

2C

))

+

1

)

+

C3

.

Since the setW

(

R(δJϵ)

)

has measure at least 1

(

Jϵ

+

1

, applyingProposition 2withR

=

R(δJϵ), we conclude that with confidence at least 1

(

Jϵ

+

2

,

E

(

fz

)

E

(

fρ

)

38Cβm2βϵ− β s+1

+

2

(

C 1

+

322

(

C

+

1

)

2

)

C42

1

ϵ(

s

+

1

)

2

×

C 2 ϵ(s+1) 2 ms+β1

(

log

(

4

/δ))

Jϵ+3

+

480

+

1

)

2Cβm−1+1s−+β1−2ϵ(1−β)log2

δ

.

By setting

δ

=

(

Jϵ

+

2

and

ϵ

=

2

βϵ

, we know that with confidence at least 1

δ

,

E

(

fz

)

E

(

fρ

)

C5

ϵ

2C 4β ϵ(s+1) 2 mϵ −s+β1

log

(

4

/

δ)

+

log

2

ϵ(

s

+

1

)

+

1



β(1+β) ϵ(s+1) +2

,

(13)

where C5

=

38Cβ

+

2

(

C1

+

322

(

C

+

1

)

2

)

C42

2 s

+

1

2

+

480

+

1

)

2Cβ

.

This proves the conclusion ofTheorem 2by taking

Cϵ

=

C5

ϵ

2C 4β ϵ(s+1) 2

1

+

log

2

ϵ(

s

+

1

)

+

1



β(ϵ(1s++β)1)+2 to be a constant independent ofmor

δ

.

WhenKisCwe know from [18,19] that(2.3)holds for anys

>

0. Take

β

=

1 ands

=

ϵ 1−ϵ when

0

< ϵ <

12 inTheorem 2. We know that by taking

λ

=

m2ϵ−1, for any 0

< δ <

1, with confidence 1

δ

, we have

fz

fρ

2L2 ρX

≤ ˜

Cϵm2ϵ−1

log4

δ

2ϵ+2

.

This verifiesTheorem 1by scaling 2

ϵ

to

ϵ

. References

[1] G. Bennett, Probability inequalities for the sum of independent random variables, J. Amer. Statist. Assoc. 57 (1962) 33–45. [2] A. Caponnetto, E. De Vito, Optimal rates for regularized least-squares algorithms, Found. Comput. Math. 7 (2007) 331–368. [3] D.R. Chen, Q. Wu, Y. Ying, D.X. Zhou, Support vector machine soft margin classifiers: error analysis, J. Mach. Learn. Res. 5

(2004) 1143–1175.

[4] E. De Vito, A. Caponnetto, L. Rosasco, Model selection for regularized least-squares algorithm in learning theory, Found. Comput. Math. 5 (2005) 59–85.

[5] S. Mendelson, J. Neeman, Regularization in kernel learning, Ann. Statist. 38 (2010) 526–565.

[6] S. Smale, D.X. Zhou, Learning theory estimates via integral operators and their approximations, Constr. Approx. 26 (2007) 153–172.

[7] S. Smale, D.X. Zhou, Estimating the approximation error in learning theory, Anal. Appl. 1 (2003) 17–41. [8] S. Smale, D.X. Zhou, Online learning with Markov sampling, Anal. Appl. 7 (2009) 87–113.

[9] I. Steinwart, A. Christmann, Support Vector Machines, Springer-Verlag, New York, 2008.

[10] I. Steinwart, D. Hush, C. Scovel, A new concentration result for regularized risk minimizers, E. Giné, V. Kolchinskii, W. Li, J. Zinn (Eds.), High Dimensional Probability IV, Institute of Mathematical Statistics, Beachwood, 2006 pp. 260–275. [11] I. Steinwart, D. Hush, C. Scovel, Optimal rates for regularized least-squares regression, in: S. Dasgupta, A. Klivans (Eds.),

Proceedings of the 22nd Annual Conference on Learning Theory, 2009, pp. 79–93.

[12] I. Steinwart, C. Scovel, Fast rates for support vector machines, Lecture Notes in Comput. Sci. 3559 (2005) 279–294. [13] Q. Wu, Y. Ying, D.X. Zhou, Learning rates of least-square regularized regression, Found. Comput. Math. 6 (2006) 171–192. [14] Q. Wu, Y. Ying, D.X. Zhou, Multi-kernel regularized classifiers, J. Complexity 23 (2007) 108–134.

[15] Q. Wu, D.X. Zhou, Analysis of support vector machine classification, J. Comput. Anal. Appl. 8 (2006) 99–119.

[16] G.B. Ye, D.X. Zhou, SVM learning andLpapproximation by Gaussians on Riemannian manifolds, Anal. Appl. 7 (2009) 309–339.

[17] T. Zhang, Leave-one-out bounds for kernel methods, Neural Comput. 15 (2003) 1397–1437.

[18] D.X. Zhou, Capacity of reproducing kernel spaces in learning theory, IEEE Trans. Inform. Theory 49 (2003) 1743–1752. [19] D.X. Zhou, Derivative reproducing properties for kernel methods in learning theory, J. Comput. Appl. Math. 220 (2008)

References

Related documents

Commonwealth shall, after compliance with section 1301.11, where required, and on or before April 15 of the year following the year in which the property first became subject

The lukewarm recognition of procurement activities as a key value-add function for strategic decisions and product/process design, the lack of strategic innovations with

While the second part of questionnaire assessed project success that was influenced by stakeholder impact, stakeholder engagement, and stakeholder psychological

To improve communication, an interprofessional high fidelity simulation-based program that uses the Agency for Healthcare Research and Quality TeamSTEPPS® 2.0 Framework and

Although the effect of microwave-assisted hydrodistillation has been conduct on various essential oil extraction, its effect on Ginger ( Zingiber Officinale Roscoe) and

The offset characterization of the two magnetic sensors components was performed inside a three-layer magnetic shielded chamber of 2.5 m 3 , previously characterized (at dif-

Perioperative data, including basic characteristics of patients, the length of hospital stay, intraoperative blood loss, thoracic intubation indwelling time, and complications

• To enter patient data click on the Patient List icon and select the New button or pull down the lists menu and choose Patients/Guarantors and Cases... MediSoft will