Error Analysis of ERM Algorithm with Unbounded and Non Identical Sampling

(1)

http://dx.doi.org/10.4236/jamp.2016.41019

How to cite this paper: Nie, W.L. and Wang, C. (2016) Error Analysis of ERM Algorithm with Unbounded and Non-Identical Sampling. Journal of Applied Mathematics and Physics, 4, 156-168. http://dx.doi.org/10.4236/jamp.2016.41019

Error Analysis of ERM Algorithm with

Unbounded and Non-Identical Sampling

*

Weilin Nie, Cheng Wang

#

Department of Mathematics, Huizhou University, Huizhou, China

Received 9 November 2015; accepted 24 January 2016; published 27 January 2016

This work is licensed under the Creative Commons Attribution International License (CC BY).

http://creativecommons.org/licenses/by/4.0/

Abstract

A standard assumption in the literature of learning theory is the samples which are drawn inde-pendently from an identical distribution with a uniform bounded output. This excludes the com-mon case with Gaussian distribution. In this paper we extend these assumptions to a general case. To be precise, samples are drawn from a sequence of unbounded and non-identical probability distributions. By drift error analysis and Bennett inequality for the unbounded random variables, we derive a satisfactory learning rate for the ERM algorithm.

Keywords

Learning Theory, ERM, Non-Identical, Unbounded Sampling, Covering Number

1. Introduction

In learning theory we study the problem of looking for a function or its approximation which reflects the relationship between the input and the output via samples. It can be considered as a mathematical analysis of artificial intelligence or machine learning. Since the exact distributions of the samples are usually unknown, we can only construct algorithms based on an empirical sample set. A typical setting of learning theory in mathematics can be like this: the input space X is a compact metric space, and the output space Y ⊂_ for regression. (When Y = + −

{

1, 1 ,

}

, it can be regarded as a binary classification problem.) Then Z: =X Y× is the whole sample space. We assume a distribution ρ on Z, which can be decomposed to two parts: marginal distribution

X

ρ on X and conditional distribution ρ

(

y x|

)

given some x X∈ . This implies

*_{This work is supported by NSF of China (Grant No. 11326096, 11401247), Foundation for Distinguished Young Talents in Higher}

Educa-tion of Guangdong, China (No. 2013LYM 0089), Doctor Grants of Huizhou University (Grant No. C511.0206) and NSF of Guangdong Province in China (No. 2015A030313674).

(2)

( )

d

(

, d

) (

| d

)

_X

Zg z ρ= X Yg x y ρ y x ρ

∫

∫ ∫

for any integrable function g z

( )

[1].

To evaluate the efficiency of a function f X: →Y we can choose the generalization error:

( )

(

( )

)

(

( )

)

2

, d d .

Z Z

f =

_∫

φ f x y ρ=

_∫

f x −y ρ



Here φ

(

f x y

( )

,

)

is a loss function which measures the difference between the prediction f x

( )

via f and the actual output y. It can be hinge loss in SVM (support vector machine) or pinball loss in quantile learning and etc.. In this paper we focus on the classical least square loss φ

(

f x y

( )

,

)

=

(

f x

( )

−y

)

2 for simplicity. [2] shows that

( )

(

( )

)

2

d .X X

f = fρ +

∫

f x − f xρ ρ

  (1)

From this we can see the regression function

( )

_Y d

(

|

)

f xρ =

∫

y ρ y x

is our goal minimizing the generalization error. The empirical risk minimization (ERM) algorithm aims to find a function which approximates the goal function fρ well. While ρ is always unknown beforehand, a sample

set

{ }

m₁

{

(

,

)

}

m₁ m

i i i i i

z ₌ x y ₌ Z

= = ∈

z is accessible. Then ERM algorithm can be described as

( )

(

)

2 1

1

arg min_f m i i ,

i

f f x y

m

∈ ₌

=

∑

−



z

where function space  is the hypothesis space which will be chosen to be a compact subset of C X

( )

. Then the error produced by ERM algorithm is 

( )

fz . We expect it is close to the optimal one 

( )

fρ , which means the excess generalization error 

( )

fz −

( )

fρ should be small, while the sample size m tends to infinity.

Dependent sampling has considered in some literature such as [3] for concentration inequality and [4] [5] for learning. More recently, in [6] and [7], the authors studied learning with non-identical sampling and dependent sampling, and obtained satisfactory learning rates.

In this paper we concentrate on the non-identical setting that each sample zi is drawn according to a

different distribution

_ρ

( )i _on_Z_{. And each}

_ρ

( )i _{can also be decomposed to marginal distribution} ( )i X

ρ

and conditional distribution _ρ( )i

(

_{y x}|

)

_{. Assume they are elements of}

(

_{C X}s

_{( )}

)

*_and

(

_{C Y}s

_{( )}

)

*_{respectively,} where _{C X}s

( )

_and _{C Y}s

( )

_{are Hölder spaces with} ₀_{≤ ≤}_s ₁_{. Hölder spaces} _{C X}s

( )

_{is the set of}

continuous functions with finite norm

( )

,

s s

C C C

f X = f X + f X

where

( )

sup

( )

.

s

s C _{x x}

f x f x

f X

x x ′

≠

′ − =

′ −

We assume a polynomial convergence condition for both sequences

{ }

( ) 1

i X _i

ρ

≥ and

( )

₍

₎

{

|

}

1

i

y x

ρ

≥ , i.e., there

exist 0, 0, s

( )

b X

b> C >

ρ

∈C X and

_ρ

(

_{y x}|

)

_∈_{C Y}s

( )

_{, such that}

( )

_d ( )i

_{( )}

_d b s

_{( )}

_, s

_{( )}

_, _.

X X b C

Xf x ρ Xf x ρ C i f X f C X i

−

− ≤ ∀ ∈ ∈

∫

 (2)

( )

_d ( )i

₍

_|

₎

_{( ) (}

_d _|

₎

b s

_{( )}

_, s

_{( )}

_, _.

b C

Yg y ρ y x Yg y ρ y x C i g Y g C Y i

−

− ≤ ∀ ∈ ∈

(3)

Power index b measures quantitatively differences between the non-identical setting and the i.i.d. case. The distributions are more similar as b is larger, and when b= ∞ it is indeed i.i.d. sampling, i.e. ( )i

X X

ρ

=

ρ

and

( )i

₍

_{y x}_|

₎

₍

_{y x}_|

₎

ρ =ρ for any i∈. The following example is taken from [8].

Example 1.Let

{ }

_h( )i _be_a _sequence_of_bounded_functions_on_X_such _that _sup ( )i

_{( )}

b i

x X∈ h x ≤C− .Thenthe

sequence

{ }

( )

1,2,

i X i

ρ

= definedby

( ) ( )

_{( )}

d t d t d

X X h x X

ρ = ρ + ρ satisfies (2) forany 0≤ ≤s 1.

On the other hand, most literature assume the output space is uniformly bounded, that is, y M≤ for some positive constant M and almost surely with respect to ρ. A typical kernel dependent result for the least-squares regularization algorithm under this assumption is [9]. There the authors get a learning rate close to 1 under some capacity condition for the hypothesis space. However, the most common distribution-Gaussian distribution is not bounded. This requirement is from the bounded condition in Bernstein inequality and limits the application of algorithms. In [10]-[13], some unbounded conditions for the output space are discussed in different forms, which extends the classical bounded condition. Here we will follow the latter one which is more generalized and simple in expression, and this is the second novelty of this paper. We assume the moment incremental condition for the output space, an extension of that we proposed in [11]:

(

y x|

)

=

_∫

_Y yd

ρ

(

y x|

)

≤C M! , ∀ ≤ ∈2 ,

  (4)

and

( )i

(

_|

)

_d ( )i

₍

_|

₎

_! _, _{, 2} _.

Y

y x =

_∫

y

ρ

y x ≤C M ∀ ∈i ≤ ∈

   (5)

We can see the Gaussian distribution satisfies this setting.

Example 2.Let B>0 and B₀ >0.Ifforeach x X f x∈ , ρ

( )

≤B andthe conditiondistribution ρ

( )

⋅|x isanormaldistributionwithvariance 2

x

σ boundedby B0, then (4) issatisfiedwith M =max 2 ,

{

B B0

}

and

4

C= .

Next we need to introduce the covering number and interpolation space.

Definition 1. Thecovering number  

(

,η

)

fora subset  of C X

( )

and

η

>0 isdefinedtobethe

minimalintegerNsuchthatthereexistNballswithradius η covering  .

Let the hypothesis space H C X⊆

( )

, be a compact Banach space with inclusion I H: →C X

( )

bounded and compact. We follow the assumption [14] [15] that there exist some constants r>0 and Cr >0, such that

the hypothesis space satisfies the capacity condition

(

1

)

log , r, 0,

r

B

η

_≤C

η

− _{∀ >}

η

 (6)

where B1=

{

f H f∈ : H ≤1

}

. Capacity condition describes the amount of functions in the hypothesis space. The sample error will decrease but approximation error will increase when covering number of H is larger (or simply say H is larger). So how to choose an appropriate hypothesis space is the key problem of ERM algorithm. We will demonstrate this in our main theorem.

Definition 2.Theinterpolationspace

(

2

)

,

X

Lρ H _θ_∞ isafunctionspaceconsistsof f L∈ 2ρ_X withnorm

( )

, 0

,

sup ,

t

K f t f

tθ θ∞

>

= < +∞

where K f t

( )

, is the K-functional defined as

( )

, inf

{

2

}

, 0.

X

L H

g H

K f t g f t g t

ρ ∈

= − + >

Interpolation space is used to characterize the position of the regression function, and it is related with the approximation error. Now we can state our main result as follow.

(4)

(

2

)

,

X

fρ∈ Lρ H _θ_∞ for some 0< <θ 1, the sample distribution satisfies (2), (3) for some b>0,Cb>0 and

0< <s 1, (4) and (5). Forany 1≤ ∈p , choosethe hypothesis space  tobe the ballofHcenteredat 0

withradius R=

(

γ

( )

m

)

−1r+1−2θ, where γ

( )

m =max

{

(

v m_b

( )

)

(p−1)p,m−1 1( )+r

}

and

( )

1

, 0 1

log , 1.

, 1

b

m b

m

v m b

m

m b

−

 < <

 

= =



> 

Moreover, we assume all functions in H and fρ are Hölder continuous of order s, i.e., there is a constant

0

s

C > , such that

( )

_{{ }}

, , .

s s

f x f y

C f H f x y X

x y ρ

−

≤ ∀ ∈ ≠ ∈

− ∪ (7)

Then for any 0< <δ 1, with confidence at least 1−δ, we have

( )

f

( )

f C

(

( )

m

)

r r2 2log .3

θ θ

ρ γ − + _δ

− ≤

 z 

Here C is a constant independent with m and δ.

Remark 1.In [6], theauthorspointedoutthatifwechoosethehypothesisspacetobethereproducingkernel Hilbert space (RKHS) K on X ⊂n, andthe kernel K C X X∈ 2

(

×

)

, then ourassumption (7) will hold

true.In particular, if the kernel ischosento beGaussian kernel Kσ , then (7) holds forany 0< ≤s 1. [16]

discussedthisindetail.

In all, we extend the polynomial convergence condition on the conditional distribution sequense and accordingly, set the moment inremental condition for the sequence in the least squares ERM algorithm. By error decomposition, truncate technique and unbounded concentration inequality, we can finally obtain the total error bound Theorem 1.

Compared with the non-identical settings in [6] and [17], our setting is more general since the conditional distribution sequence

{

( )

(

)

}

1

| i

i

y x

ρ

≥ is also a polynomially convergence sequence, but not identical as in their

settings. This together with unbounded y lead to the main difficulty for the error analysis in this paper.

For the classical i.i.d. and bounded conditions, [9] indicates that _X _⊂_Rn_{and kernel} _{K C}_∈ ∞_while

K

fρ∈ , the rate of least square regularization algorithm is

(

( )

)

1

p

O m −

for any >0. [17] shows that under some conditions on kernel, object function fρ, exponential convergence condition for distribution sequence and choose some special parameters, the optimal rate of online learning algorithm is close to

( )

(

1 4

)

1

p

O m . In [6], the best case occurs when _X _⊂_Rn_{and kernel} _{K C}_∈ ∞_{. The rate of least square} regularization algorithm can be close to Op

(

( )

1m 2 3

)

. However, our result implicates that while b>1, θ

tends to 1 and

r

,

s

tends to 0, since p can be any integer, the learning rate can be arbitrarily close to O_p

( )

1 m , which is the same as in i.i.d. case [9], and better than the former results with non-identical settings. With this result, we can extend the application of learning algorithm to more situations and still keep the best learning rate. The explicit expression of C_{in the theorem can be found through the proof of the theorem below.}

2. Error Decomposition

(5)

( )

arg min .

f

f_= _∈ f



Then the generalization error can be written as

( )

f −

( )

fρ =

{

( )

f −

( )

f

}

+

{

( )

f −

( )

fρ

}

.

 z   z   

The first term on the right hand side is the sample error, and the second term A=

( )

f −

( )

fρ is called approximation error which is independent with samples. [18] analyzed the approximation error by approximation theory. In the following we mainly study the sample error bound.

Now we break the sample error to some parts which can be bounded using truncate technique and unbounded concentration inequality. We refer the error decomposition 

( )

fz −

( )

f_ to [6]. Denote

( )

(

( )

)

2

1

1 m

i i

i

f f x y

m =

=

∑

−

z

( )

(

( )

)

2 ( )

1

1 m _d _,

i

m _Z

i

f f x y

m = ρ

=

∑∫

−



then fz =arg min_f∈z

( )

f and we have

( )

( ) ( )

( )

( ) ( )

( )

(

)

(

( )

)

(

( )

)

(

( )

)

( )

(

)

(

( )

)

(

( )

)

(

( )

)

( )

(

)

(

( )

)

(

( )

)

(

( )

)

.

m m m m

f f

f f f f f f f f f f

f f f f f f f f

ρ ρ ρ ρ

−

= − + − + − ≤ − + −

= − − − + − − −

   

=_ − − − _{ }+ − − − _

   

+_ − − − _{ }+ − − − _



    

   

 

         

       

z

z z z z z z z z z z z

z z z z z z

z z

In the following, we call the first and fourth brackets drift errors, and the left sample errors. We will bound the two types of errors respectively in the following sections, and finally obtain the total error bounds.

3. Drift Errors

Firstly we consider the drift error involving f in this section.To avoid handling two polynomial convergence

sequences simultaneously, we break the drift errors to two parts. Meanwhile, a truncate technique is used to deal with the unbounded assumption. Since  is a subset of C X

( )

, functions in  is uniformly bounded. Then we have

Proposition 1.Assume f _C

( )

X ≤B forsome B>0, forany 1≤ ∈p ,

( )

(

)

(

( )

)

( )

(

)

(

)

(

)

1

2

2 2

1 2 3 4 2 4

4 6 8 6 2 .

p p b

m m b b s

b s

w m

f f f f C B CM pC M C C B

m

C CC p C M CC M

ρ ρ

−

  _

− − − ≤  _ + + + + +

 



+ + + + _

 

   

Proof. From the definition of m and , we know that

( )

(

)

(

( )

)

(

( )

(

( )

)

(

( )

)

( )

(

( )

)

(

)

(

( )

₍

₎

₍

₎

)

( )

(

( )

)

(

)

(

)

(

( )

)

2 2

1

2 2

1

2 2

1 ₂ _d

1 ₂ _d _| _| _d

2 d | d .

m

i

m m _Z

i m

i i

X X Y

i

i X X X Y

f f f f f X f X f X f X y

m

f X f X f X f X y y x y x

m

f X f X f X f X y y x

ρ ρ ρ ρ

ρ ρ

ρ ρ ρ

=

− − − = − − − −



= _ − − − −



+ − − − − _

∑∫

∑ ∫ ∫

∫ ∫

   

 

   

Since 2

( )

2_, 2

( )

(

_d

(

_|

)

2 2_d

(

_|

)

₂ 2

Y Y

(6)

bracket as follow.

( )

(

( )

)

(

)

(

( )

₍

₎

₍

₎

)

( )

(

( )

₍

₎

₍

₎

)

_{( )}

(

( )

₍

₎

₍

₎

)

( )

(

( )

₍

₎

₍

₎

)

( )

(

)

(

)

(

( )

₍

₎

₍

₎

)

( ) 2 2 2 2 2 2

2 d | | d

d | | d | |

2 d | | d

2 2 d | | d .

i i

X X Y

i i

X Y Y

i i

X Y

i i

b

b _{X Y} X

f X f X f X f X y y x y x

f X y x y x f X y x y x

f X f X y y x y x

B CM C i B CM y y x y x

ρ ρ

ρ

ρ ρ ρ

ρ ρ ρ ρ

ρ ρ ρ

− − − − −  ≤ _ − + −  + − ⋅ − _ ≤ + + + −

∫ ∫

∫

∫ ∫

   

But for any K≥1 and 1≤ ∈p , there holds

( )

₍

₎

₍

₎

(

)

(

( )

₍

₎

₍

₎

)

(

( )

₍

₎

₍

₎

)

( )

₍

₎

₍

₎

(

)

_{( )} 1 1 1

d | | d | | d | |

1 _d _| _|

2 ! 2 !

sup 3 .

i i i

Y y K y K

p i b

b

p y K y K _{C Y}s

p p

p b p b

b b

p s p

y y

y y x y x y y x y x y y x y x

y y x y x y C i

K

C M y y C M

K C i KC i

K y y K

ρ ρ ρ ρ ρ ρ

ρ ρ

> ≤

−

− > ≤

− − − _≠ _′ − − ≤ − + − ≤ − +  − ′    ≤ + + ≤ +  ₋ _′   

∫

From (3.12) in [6], we have

( )

1

1 _{, 0} _1,

1

1 log , 1,

, 1. 1 b m b b i m b b

i w m m b

b _b

b

−

=

 _{< <}

 −  ≤ = + =   > − 

∑

Then we can bound the sum of the first term as

( )

(

( )

)

(

)

(

( )

₍

₎

₍

₎

)

( )

(

)

( )

(

)

( )

2 2 1 2 2 1

1 ₂ _d _| _| _d

2 !

2 2 3 .

m i i X X Y i p p b b

b p b

f X f X f X f X y y x y x

m

C M

w m w m

B CM C B CM KC

m K m

ρ ρ ρ ρ ρ

= − − − − −   ≤ + + + _ + _  

∑∫ ∫

 

Choose K to be

(

m w mb

( )

)

1p pM, we have

( )

(

( )

)

(

)

(

( )

₍

₎

₍

₎

)

( )

₍

₎

₍

₎

2 2 1 1

2 2 2

1 ₂ _d _| _| _d

2 3 4 6 .

m i i X X Y i p p b

b p b

f X f X f X f X y y x y x

m w m

C B C pC MB C CC p M m

ρ ρ ρ ρ ρ

= − − − − −   _ _ ≤_ _ _ + + + + _  

∑∫ ∫

 

For the second term, notice fg _{C X}s_{( )}≤ f _{C X}_{( )} g _{C X}s_{( )}+ f _{C X}s_{( )} g _{C X}_{( )}, and fρ _{C X}s_{( )}≤ 2CM C+ s so

( )

(

( )

)

(

)

(

)

(

( )

)

( )

(

( )

)

_{( )}

(

( )

)

(

_{( )}

)

₍

₎

(

( )

)

( )

( ) ( )

(

( )

)

( )

( ) ( ) ( ) ( ) ( )

(

)

2 2 2 2 2 2 2 2

2 d | d

d d 2 d | d

2 2 sup 2 2 s s s s i X X X Y

i i i

X X X X X X

X X X Y

b b b

s C X C X C X

x x

s _{C X} _{C X} _{C X} _{C X}

f X f X f X f X y y x

f X f X f X f X y y x

f X f x

B C i f f C i f X f X f X C i

x x

B BC f f f f f f

ρ ρ

ρ ρ ρ ρ

ρ ρ ρ

ρ ρ ρ ρ ρ ρ ρ

− − − ′ ≠ − − − − ≤ − + − + − −  − ′    ≤ + + + −  − ′     ≤ + + − + −

∫ ∫

∫

        

(

)

2 _{4 2} ₈ 2 _{6 2} _.

b b b

s s b

C i

B C C B CM CC M C i

(7)

( )

(

( )

)

(

)

(

)

(

( )

)

( )

₍

₎

2 2

1

2 2

1 ₂ _d _{| d}

4 2 8 6 2 .

m

i X X X Y

i b

s s

f X f X f X f X y y x

m

w m _B _{C C B} _CM _{CC M}

m

ρ ρ ρ ρ ρ

=

− − − −

 

≤ _ + + + + _

∑∫ ∫

 

Combining the two bounds, we have

( )

(

)

(

( )

)

( )

₍

₎

₍

₎

(

)

1

2

2 2

1 2 3 4 2 4

4 6 8 6 2 .

m m

p p b

b b s

b s

f f f f

w m

C B CM pC M C C B

m

ρ ρ

−

− − −



≤ _ + + + + +



+ + + + _

 

   

And this is indeed the proposition.

For the drift error involving fz, we have the same result since fz∈ as well, i.e.,

Proposition 2.Assume f_z _{C X}_{( )}≤B forsome B>0, forany 1≤ ∈p , wehave

( )

(

)

(

( )

)

( )

₍

₎

₍

₎

(

)

1

2

2 2

1 2 3 4 2 4

4 6 8 6 2 .

m m

p p b

b b s

b s

f f f f

w m

m

ρ ρ

−

− − −

  _

≤_ _ _ + + + + +

 



+ + + + _

 z   z 

4. Sample Error Estimate

We devote this section to the analysis of the sample errors. For the sample error term involving f, we will use

the Bennett inequality as in [11] and [19], which is initially introduced in [20]. Since two polynomial convergence conditions are posed on the marginal and conditional distribution sequences, we have to modify the Bennett inequality to fit our setting. Denote g=

_∫

_Zg z

( )

dρ and _g 1 _i 1m_{g z}

( )

m

=

∑

=

z for an integrable

function g, the lemma can be stated as follow.

Lemma 1. Assume 1 ! 2

2

g₋ g_≤ M v−

  holdsfor =2,3, andsomeconstants M v, >0 , thenwe have

{

}

₍

2

₎

Prob exp .

2

m Z

m

g g

v M

ε ε

ε

∈

 

 

− ≥ ≤ _− _

+

 

 

z 

z

For our non-identical setting, we can have a similar result from the same idea of proof. By denoting

( )i

_{( )}

_d ( )i Z

g=

_∫

g z ρ

 and 1 1m

( )

d ( )i

mg= _m

∑

i=

∫

_Zg z ρ

 , the following lemma holds.

Lemma 2. Assume 1 ! 2

2

g₋ g_≤ M v−

  and ( ) ( ) 1 ! 2

2

i _g₋ i_g_≤ _{M v}−

  for someconstants M v, >0

andany i∈ , =2,3,, thenwehave

{

}

₍

2

₎

Prob exp .

2

m m

Z

m

g g

v M

ε ε

ε

∈

 

 

− ≥ ≤ − ₊ 

 

 

z 

z

Now we can bound the sample error term z

( )

f −m

( )

f by applying this lemma.

(8)

( )

(

)

(

( )

)

2

(

2

)

_log3 1 _, 2

B

m m

C M

f f f f A

m

ρ ρ _δ

+

− − − ≤ +

  

z z  

where MB=6

(

B C+

(

+3

)

M

)

2 and A is the approximation error.

Proof.Let

( )

(

( )

)

2

(

( )

)

2

(

( )

)

(

( )

)

2

g z = f x −y − f xρ −y = f x − f xρ f x + f xρ − y , then

( )

(

z f −z fρ

)

−

(

m

( )

f −m

( )

fρ

)

=zg−mg.

Since f xρ

( )

=

∫

_Yydρ

(

y x|

)

≤

(

y x|

)

≤ 

(

y x2|

)

≤ 2CM , we have

(

)

( )

(

)

( )

(

)

(

)

(

)

(

( )

)

( )

(

)

(

)

( )

(

)

1

2 2

1

2

1 ₂

2

1 ₂

2 2

2 2 d

2 2 2 d | d

2 3 2 2 d |

2 3 2 2 !

2 ! 6 3

Z

X

X Y

Y

g g g g g

f x f x f x f x y

B CM f x f x B CM y y x

B CM f f B C M y y x

B CM A B C M C M

B C

ρ ρ

ρ

ρ ρ

ρ

+

− +

− ≤ + ≤

= − ⋅ + −

≤ + − + +

 

≤ + − _ + + _

 

 

≤ + ⋅ ⋅ _ + + _

 

≤ ⋅ + +

∫

         

 



 

(

)

2 2

(

)

1 ₂

1 !

2 B B

M − C₊ A _≤ M v−



for any =1, 2,, where

1 and vB =4

(

C+1

)

M AB . In the same way, we have the following bounds

( ) ( ) 1 ! 2 _{, ,} _{1, 2,}

2

i i

B B

g ₋ g _≤ M v− i ₌

   

as well. Then from Lemma 2 above, we have

{

}

₍

2

₎

Prob exp .

2

Z m

B B

m

g g

v M

ε ε

ε

∈

 

 

− ≥ ≤ _− _

+

 

 

   

z z

Set the right hand side to be δ 3, we can solve that

(

)

(

)

2 2

1 3 3 3

= log log 2 log

4 1

2 _log3 _log3

2 2 _log3 1 _. 2

m B B B

B B

B

g g M M mv

m

C M A

M

m m

C M

A m

ε

δ δ δ

δ δ

δ

 

− ≤ _ + + _

 

+

≤ +

+

≤ +

   



z

Therefore with confidence at least 1−δ 3, there holds

(

)

2 2 _log3 1 _. 2

B m

C M

g g A

m δ

+

− ≤ +

z    

This proves the proposition.

For the sample error term involving fz, analysis will be more involved since we need a concentration

inequality for a set of functions. Firstly we have to introduce the ratio inequality [9].

(9)

and ( ) ( ) 1 ! 2

2 i _g₋ i_g _≤ _{M v}−

  forsomeconstants M v, >0 and i,=2,3,, thenwehave

(

)

{

}

₍

₎

Prob exp .

2 4 5

m z m Z

m

g g g

M C

ε ε ε

∈

 

 

− ≥ + ≤ − ₊ 

 

 

  

z

Proof. Let ε to be ε ε

(

+

( )

f −

( )

fρ

)

in the Lemma 2, from the proof of the last proposition, we can conclude that

( )

(

)

{

}

( )

(

)

(

)

(

( )

)

(

( )

)

(

)

Prob

exp

2 4 1

exp .

2 4 5

m m

Z

B B

B

g g f f

m f f

C M f f M f f

m

M C

ρ

ρ ρ

ε ε

ε

∈ − ≥ + −

 ₊ ₋ 

 

≤ − 

 ₊ ₋ ₊ ₊ ₋ 

 _ _

 

 

 

≤ − ₊ 

 

 

   

 

   

z z

Note that 

( )

f −

( )

fρ =g and the lemma is proved.

Then we have the following result.

Lemma 4.Forasetoffunctions

{ }

f_{i i}N₌₁⊂ with N∈,constructfunctions

( )

(

( )

)

2

(

( )

)

2

)

i i

g z = − f x −y − f xρ −y for i=1, 2, , N, withconfidenceatleast 1−δ 3, wehave

(

)

2 4 5 _log3 1 _, _{1, 2, ,} 2

B

i m i i

M C N

g g A i N

m δ

+

− ≤ + ∀ =

z 

where Ai =

( )

fi −

( )

fρ forany i=1, 2, , . N

Proof. Since fi is an element of , from Lemma 3 we have

(

)

Prob exp ,

2 4 5

m i m i Z

B i

g g m

M C

g

ε ε

ε

∈

 −   

 _≥ _≤ ₋ 

   ₊ 

+  

   

 

 



z z

then there holds

(

)

1

Prob sup exp .

2 4 5

m i m i Z _{i N}

B i

g g _N m

M C

g

ε ε

ε

∈ _{≤ ≤}

 −   

 _≥ _≤ ₋ 

   ₊ 

+  

   

 

 



z z

Set the right hand side to be δ 3 and we have with probability at least 1−δ 3,

(

)

(

)

1 2

2 4 5 _log3 1 _, _{1, 2, , .}

2

i m i i i

B

i

g g g g

M C N _{A i} _N

m

ε ε ε

δ

− ≤ + ≤ +

+

≤ + =

z   

Here Ai =

( )

fi −

( )

fρ =gi. And this proves the lemma.

Now by a covering number argument we can bound the sample error term involving fz.

Proposition 4. If =B H_R

( )

:=

{

f H f∈ : _H ≤R

}

for some R>0, where H satisfies the capacity condition,forany 0< <δ 1, withconfidenceatleast 1 2 3− δ , thereholds

( )

(

)

(

( )

)

(

)

(

)

(

)

(

)

(

( )

)

1

1 _log3 ₅ ₁₃ ₁₆ ₂ ₄ ₅ ₁ 1 _.

2

m m

r r

B r

f f f f

m B C M M C C R f f

ρ ρ

ρ

δ

− +

− − −

≤ + + + + + + −

   

 

z z z z

(10)

Proof. Denote N= 

(

,η

)

where η is to be determined, then we can find an η-net

{ }

f_{i i}N₌₁ of , and there exist a function f j_j, ∈

{

1, 2, , N

}

, we have

( )

(

)

(

( )

)

( )

(

)

(

( )

)

(

( ) ( )

)

(

( )

)

( )

(

)

(

)

(

( )

)

.

m m

m m j m j m j j

m m j j m j j

f f f f

f f f f f f f f

f f g g f f

ρ ρ

− − −

= − + − − − + −

= − + _ −_ + −

   

       

   

z z z z

z z z z z z z z z z z

For the first term, since d ( )i

(

|

)

2

Yy ρ y x ≤ CM

∫

for all i=1, 2, , m, we have

( )

(

)

(

( )

)

( )

₍

₎

( )

(

)

(

)

1

= 2 d | d

2 2 2 2 .

m m j

m

i i

j j X

X Y i m i

f f

f x f x f x f x y y x

m

B CM B CM

m

ρ ρ

η _η

=

−

− + −

≤ + = +

∑∫ ∫

∑

 z 

z z

And for the third term,

( )

(

( )

)

(

( )

)

( )

(

)

(

( )

)

2 2

1

1 1

1

1 ₂ ₂ ₂1 _,

m

j j

i

m m

j j

i i

f f f x y f x y

m

f x f x f x f x y B y

m η m

=

= =

− = − − −

 

= − + − ≤ _ + _

 

∑

z z z z

z z

we need to bound

(

)

1

1 m _.

z z

i y y y y y

m =

= = − +

∑

   

Let g z

( )

= y and then

(

)

(

)

2

(

)

1 1 2

2 2 ! 2 ! 2 16 , 1, 2, .

2

g₋ g _≤ + y_≤ C M ₌ M − CM ₌

  

From Lemma 1 we have

{

}

₍

₂2

₎

Prob exp .

2 16 2

m

Z

m g g

CM M

ε ε

ε

∈

 

 

− ≥ ≤ _− _

+

 

 

z 

z

Set the right hand side to be δ 3 and with confidence at least 1−δ 3 we have

(

)

2

1

1 m _d 4 _log3 32 _log3 _{4 1} ₂ _{log .}3

Z i

M CM

g g y y C

m = ρ ε m δ m δ δ

− =

∑

−

_∫

≤ ≤ + ≤ +

z 

And this means,

(

)

(

)

(

)

1

1 m _d _| ₄ ₁ _{2 log}3 ₅ _{9 log}3

Y

i y y y x M C M C

m = ρ δ δ

≤ + + ≤ +

∑

∫

with probability at least 1−δ 3.

The second term can be bounded by 4 above. That is, with confidence at least 1−δ 3, we have

(

)

2 4 5 _log3 1 _. 2

B

j j M C N j

g g A

m δ

+

− ≤ +

z 

Since log log

(

,

)

log 1

( )

,

(

)

r r

N B H C R

R

η

η   η

= = _ _≤

 

(11)

( ) ( ) ( )

( )

(

2 2

)

(

( )

)

,

j j j

A = f − fρ = f − fz + fz − fρ ≤η B+ CM +  fz − fρ combining the three parts above, we have the following bound with confidence at least 1 2 3− δ ,

( )

(

)

(

( )

)

(

)

(

)

(

)

(

)

(

( )

)

(

)

(

)

(

)

₍

_{( )}

₎

2 4 5

3 3

2 5 9 log 2 1 log

1 2

2 4 5

3 3

5 13 20 log log

2 4 5 1 _.

2

m m

B

r

r B r

f f f f

M C N

B C M B C M

m

B CM f f

M C N

B C M

m C M C R

f f

m

ρ ρ

ρ

η η

δ δ

η

δ δ

η−

− − −

+

 

≤ _ + + _+ + + +

 

+ + + −

+

 

≤ _ + + _+

 

+

+ + −

   

 

z z z z

z

By choosing

_η

₌_m−1 1( )+r

for balancing, we have

( )

(

)

(

( )

)

(

)

(

)

(

)

(

( )

)

1

1 _log3 ₅ ₁₃ ₂₀ ₂ ₄ ₅ ₁ 1

2

m m

r r

B r

f f f f

m B C M M C C R f f

ρ ρ

ρ

δ

− +

− − −

 

≤ _ + + + + + _+ −

   

 

z z z z

z

with confidence atleast 1 2 3− δ , this proves the proposition.

5. Approximation Error and Total Error

Combining the results above, we can derive the error bound for the generalization error 

( )

fz −

( )

fρ .

Proposition 5.Underthemomentconditionforthedistributionofthesampleandcapacityconditionforthe hypothesisspace , forany 0< <δ 1 and 1≤ ∈p , withconfidenceatleast 1−δ, wehave

( )

₍

₎

₍

₎

(

)

(

)

(

)

(

)

(

)

1

2

2 2

1 1

2 1 2 2 3 4 2 4

4 2 3 4 12 2

3 2 5 13 20 2 4 5 1

4 2 _{3 3 ,}

p p b

b b s

b s

r r

B r

B

f f f f A

w m

m

m B C M M C C R

C M

A m

ρ

δ

−

− +

− = − +

  _

≤_ _ _ + + + + +

 



+ + + + _

 

+ ⋅ _ + + + + + _

+

+ +

 



 z   z 

where MB=6

(

B C+

(

+3

)

M

)

2.

What is left to be determined in the proposition is the approximation error A. By the choice of hypothesis

space we can get our main result. Proof of Theorem 1. Let

( )

1

, 0 1,

log , 1,

, 1,

b

m b

m

v m b

m

m b

−

 < <

 

=_ =



> 

and

( )

_max

{

(

( )

)

(p1) p_, 1 1( )r

}

b

m v m m

γ

₌ − − +

(12)

indicates that

( )

2

, , , ,

3

log r 3

b s r p M

f − fρ = f − f +A≤γ m _δ B R C + A

 z   z 

holds with confidence at least 1−δ for any 1≤ ∈p _, where Cb s r p M, , , , is a constant independent on m or δ .

For the approximation error A, we can bound it by Theorem 3.1 of [18]. Since the hypothesis space

( )

R

B H

=

 , and

(

2

)

0,

,

X

fρ∈ Lρ H _∞ with 0< <θ 1, we have

( )

(

)

2 ₂

1

2 ₁

0,

1

d X .

X

A f x f x f

R θ

θ _θ

ρ ρ ρ

− ₋

∞

 

= − _{≤  }

 

∫

 

The upper bound B is now chosen to be I R since f _C

( )

X ≤ I ⋅ f _H ≤ I R, then with confidence at least 1−δ,

( )

2 2 12

(

)

12

, , , , _0,

3

log r 3 .

b s r p M

f f m I R C R θ_θ f θ

ρ γ _δ ρ

− ₋

+ −

∞

− ≤ +

 z 

By choosing

( )

(

)

12

1 ,

r

R m

θ

γ −+

−

=

we have

( )

(

)

2 ₂ 2

(

)

12

, , , , _0,

3

( ) r r log _{b s r p M} 3

f f m θ_θ C I f θ

ρ γ − + _δ ρ _∞ −

 

− ≤  + 

 

 

 z 

holds with confidence at least 1−δ. Denote

(

)

2

2 1

, , , , , , , , , , 3 _0,

b s r p M b s r p M

C C C I f θ

θ ρ _∞ −

= = +

 , then the theorem

is obtained.

6. Summary and Future Work

We investigate the least squares ERM algorithm with non-identical and unbounded sample, i.e., polynomial convergence for

{ }

( )

1

i X _i

ρ

≥ and

( )

₍

₎

{

|

}

1

i

y x

ρ

≥ and moment inremental condition for the latter ones. Analogue

error decomposition as classical analysis for least sqaures regularization [9] [11] is conducted. Truncate technique is introduced for handling unbounded setting, and Bennett concentration inequality is used for the sample error. By the above analysis we finally get the error bound and learning rate.

However, our work only considers the ERM algorithm. It is neccesary for us to extend this to the regularization algorithms which are more widely used in practice. A more recent relative reference can be found in [21]. Another interesting topic in future study is dependent sampling [7].

References

[1] Cucker, F. and Zhou, D.X. (2007) Learning Theory: An Approximation Theory Viewpoint. Cambridge University Pre- ss, Cambridge. http://dx.doi.org/10.1017/CBO9780511618796

[2] Cucker, F. and Smale, S. (2002) On the Mathematical Foundations of Learning. BulletinoftheAmericanMathematical Society, 39, 1-49.

[3] Dehling, H., Mikosch, T. and Sorensen, M. (2002) Empirical Process Techniques for Dependent Data. Birkhauser Boston, Inc., Boston. http://dx.doi.org/10.1007/978-1-4612-0099-4

[4] Steinwart, I., Hush, D. and Scovel, C. (2009) Learning from Dependent Observations. JournalofMultivariateAnalysis, 100, 175-194.

(13)

[6] Xiao, Q.W. and Pan, Z.W. (2010) Learning from Non-Identical Sampling for Classification. Advancesin Computa-tionalMathematics, 33, 97-112.

[7] Pan, Z.W. and Xiao, Q.W. (2009) Least-Square Regularized Regression with Non-i.i.d. Sampling. Journalof Statistic-alPlanningandInference, 139, 3579-3587.

[8] Hu, T. and Zhou, D.X. (2009) Online Learning with Samples Drawn from Non-identical Distributions. Journalof Ma-chineLearningResearch, 10, 2873-2898.

[9] Wu, Q., Ying, Y. and Zhou, D.X. (2006) Learning Rates of Least-Square Regularized Regression. Foundationsof ComputationalMathematics, 6, 171-192.

[10] Capponnetto, A. and De Vito, E. (2007) Optimal Rates for the Regularized Least Squares Algorithm. Foundationsof ComputationalMathematics, 7, 331-368.

[11] Wang, C. and Zhou, D.X. (2011) Optimal Learning Rates for Least Squares Regularized Regression with Unbounded Sampling. JournalofComplexity, 27, 55-67.

[12] Guo, Z.C. and Zhou, D.X. (2013) Concentration Estimates for Learning with Unbounded Sampling. Advances in ComputationalMathematics, 38, 207-223.

[13] He, F. (2014) Optimal Convergence Rates of High Order Parzen Windows with Unbounded Sampling. Statistics& ProbabilityLetters, 92, 26-32.

[14] Zhou, D.X. (2002) The Covering Number in Learning Theory. JournalofComplexity, 18, 739-767.

[15] Zhou, D.X. (2003) Capacity of Reproducing Kernel Spaces in Learning Theory. IEEETransactionsonInformation Theory, 49, 1743-1752.

[16] Zhou, D.X. (2008) Derivative Reproducing Properties for Kernel Methods in Learning Theory. Journalof Computa-tionalandAppliedMathematics, 220, 456-463.

[17] Smale, S. and Zhou, D.X. (2009) Online Learning with Markov Sampling. AnalysisandApplications, 7, 87-113. [18] Smale, S. and Zhou, D.X. (2003) Estimating the Approximation Error in Learning Theory. AnalysisandApplications,

1, 17-41.

[19] Wang, C. and Guo, Z.C. (2012) ERM Learning with Unbounded Sampling. ActaMathematicaSinica, EnglishSeries, 28, 97-104.

[20] Bennett, G. (1962) Probability Inequalities for the Sum of Independent Random Variables. JournaloftheAmerican StatisticalAssociation, 57, 33-45.