Least Square Regression Learning with Data Dependent Hypothesis and Coefficient Regularization


Bao-Huai Sheng

Department of Mathematics, Shaoxing College of Arts and Sciences, Shaoxing, Zhejiang 312000, China. E-mail: shengbaohuai@163.com

Pei-Xin Ye

School of Mathematics and LPMC, Nankai University, Tianjin 300071, China. E-mail: [email protected]

Abstract--We study least square regression with a data dependent hypothesis space and coefficient regularization, based on a general kernel. An explicit expression for the solution of this kernel scheme is derived. We then provide a sample error bound with a decay of $O(1/\sqrt{m})$ and estimate the approximation error in terms of a $K$-functional.

Index Terms -- Least Square Regressions, Data Dependent Hypothesis, Coefficient Regularization, General Kernel, Learning Rate.

I. INTRODUCTION AND MAIN RESULTS

We establish in this paper the mathematical foundations of least square regression learning with general kernel and coefficient regularization.

Let $X \subset \mathbb{R}^q$ and $Y \subset \mathbb{R}$ be Borel sets and let $\rho$ be a Borel probability measure on $Z := X \times Y$. For a function $f : X \to Y$ define the error

$$E_\rho(f) = \int_Z (y - f(x))^2 \, d\rho.$$

Consider $\rho(y|x)$, the conditional (with respect to $x$) probability measure on $Y$, and $\rho_X$, the marginal probability measure on $X$. Define $f_\rho(x)$ to be the conditional expectation of $y$ with respect to the measure $\rho(y|x)$, i.e.,

$$f_\rho(x) = \int_Y y \, d\rho(y|x), \quad x \in X.$$

The function $f_\rho$ is known in statistics as the regression function of $\rho$. It is clear that if $f_\rho \in L_2(\rho_X)$, where

$$L_2(\rho_X) = \Big\{ f : \|f\|_{2,\rho_X} = \Big( \int_X |f(x)|^2 \, d\rho_X \Big)^{1/2} < +\infty \Big\},$$

then it minimizes the error $E_\rho(f)$ over all $f \in L_2(\rho_X)$. Thus, in the sense of this error, the regression function $f_\rho(x)$ is the best description of the relation between inputs $x \in X$ and outputs $y \in Y$.

In most cases, the distribution $\rho(x, y)$ is unknown and what one can know is a set of samples $z = \{z_i\}_{i=1}^m = \{(x_i, y_i)\}_{i=1}^m \in Z^m$, drawn independently and identically distributed according to $\rho(x, y)$. Our goal is to find an estimator $f_z$, on the basis of the given data $z$, that approximates $f_\rho$ well with high probability. This is an ill-posed problem and a regularization technique is needed. In many areas of machine learning, the following Tikhonov regularization scheme is commonly used to overcome the ill-posedness:

$$f_z := \arg\min_{f \in H} \Big\{ \frac{1}{m} \sum_{i=1}^m (f(x_i) - y_i)^2 + \lambda \|f\|_H^2 \Big\}. \tag{1}$$

We take $H$ to be a reproducing kernel Hilbert space (RKHS) induced by a Mercer kernel. Recall that a Mercer kernel $K$ is a function on $X \times X$ which is continuous, symmetric and positive semi-definite, i.e., for any positive integer $m$ and any finite set of distinct points $\{x_1, x_2, \ldots, x_m\} \subset X$, the matrix $(K(x_i, x_j))_{i,j=1}^m$ is positive semi-definite. The RKHS $H_K$ associated with the kernel $K$ is defined to be the closure of the linear span of the set of functions $\{K_x := K(x, \cdot) : x \in X\}$ with the inner product $\langle K_x, K_y \rangle_{H_K} = K(x, y)$. The reproducing property takes the form

$$\langle f, K_x \rangle_{H_K} = f(x), \quad \forall x \in X, \ f \in H_K. \tag{2}$$

Corresponding author: Ye Peixin. This work is supported by the Natural


This kind of kernel scheme has been studied extensively in the literature, cf. [1-11].

In this paper we consider a different kernel scheme.

Let $K : X \times X \to \mathbb{R}$ be a continuous and bounded function, which is called a general kernel. For a given data set $Y := \{y_1, y_2, \ldots, y_m\} \subset X$, the data dependent hypothesis space is given by

$$H_{K,Y} := \Big\{ f_\alpha(x) = \sum_{j=1}^m \alpha_j K(x, y_j) : \alpha = (\alpha_1, \ldots, \alpha_m)^T \in \mathbb{R}^m \Big\}.$$

Every hypothesis function is determined by its coefficients and the penalty is imposed on these coefficients. This leads to the general coefficient regularized scheme

$$\alpha_z := \arg\min_{\alpha \in \mathbb{R}^m} \Big\{ \frac{1}{m} \sum_{i=1}^m (f_\alpha(x_i) - y_i)^2 + \lambda \Omega(\alpha) \Big\}, \tag{3}$$

where $\Omega(\alpha)$ is a positive function on $\mathbb{R}^m$. Formulation (3) is a data dependent scheme which has found many applications in the design of support vector machines, micro-array analysis and variable selection (see e.g. [12]-[18]). Coefficient regularization was first introduced by Vapnik [1] to design linear programming support vector machines. It has some advantages. Firstly, the algorithm is directly a finite dimensional optimization problem and is easy to adapt to other algorithms. Secondly, one can freely choose the regularizer for different purposes. For instance, a sparse representation can be obtained if the $\ell^1$ norm of the coefficients is used as the regularizer, while a suitable choice of the regularizer gives back the regularization scheme (1). Besides these advantages, an important observation is that, when a positive definite kernel is used, the coefficient regularization scheme usually provides performance quite comparable to that of the regularization scheme in a reproducing kernel Hilbert space.

We now study a particular coefficient regularization. We endow $\mathbb{R}^m$ with the usual inner product, i.e., for any $a = (a_1, a_2, \ldots, a_m)^T$, $b = (b_1, b_2, \ldots, b_m)^T \in \mathbb{R}^m$, we take

$$(a, b)_2 = \sum_{i=1}^m a_i b_i = a^T b.$$

In particular, $\|a\|_2^2 = a^T a$. Setting $\Omega(\alpha) = m \|\alpha\|_2^2 = m \sum_{i=1}^m |\alpha_i|^2$, we have the following coefficient regularization with $l^2$-penalization:

$$\alpha_z = \alpha_{z,\lambda} := \arg\min_{\alpha \in \mathbb{R}^m} \Big\{ \frac{1}{m} \sum_{i=1}^m (y_i - f_\alpha(x_i))^2 + \lambda m \|\alpha\|_2^2 \Big\}. \tag{4}$$

We notice that (4) is a strictly convex optimization problem whose solution may be analyzed with tools from convex analysis (see [19]). Based on this consideration, we shall give an explicit expression for the solution of (4); together with an inequality for convex functions, this shows the robustness of the solutions (see Lemma 3). Thus we will use a new approach to estimate the learning rate $\|f_{\alpha_z} - f_\rho\|_{2,\rho_X}$.
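As an illustration of why scheme (4) is a finite dimensional problem with an explicit solution, setting the gradient of its objective to zero gives the linear system $(G^T G + \lambda m^2 I)\alpha = G^T y$, where $G_{ij} = K(x_i, y_j)$. The following is a minimal sketch under stated assumptions, not the authors' implementation: the Gaussian kernel, the choice of centers equal to the sample inputs, and the names `gaussian_kernel`, `objective` and `solve_coefficients` are all hypothetical.

```python
import numpy as np

def gaussian_kernel(x, c, width=0.5):
    # Illustrative general kernel K(x, c); any continuous bounded K could be used.
    return np.exp(-((x[:, None] - c[None, :]) ** 2) / (2.0 * width ** 2))

def objective(alpha, G, y, lam):
    # Objective of scheme (4): (1/m) sum (y_i - f_alpha(x_i))^2 + lam * m * ||alpha||_2^2
    m = len(y)
    return np.mean((y - G @ alpha) ** 2) + lam * m * (alpha @ alpha)

def solve_coefficients(G, y, lam):
    # Zero-gradient condition of (4): (G^T G + lam * m^2 * I) alpha = G^T y
    m = len(y)
    return np.linalg.solve(G.T @ G + lam * m ** 2 * np.eye(G.shape[1]), G.T @ y)

rng = np.random.default_rng(0)
m = 30
x = rng.uniform(-1.0, 1.0, size=m)
y = np.sin(np.pi * x) + 0.1 * rng.standard_normal(m)   # synthetic data, for illustration only
G = gaussian_kernel(x, x)      # assumption: the centers y_j are the sample inputs x_j
lam = 1e-3
alpha_z = solve_coefficients(G, y, lam)

# Gradient of the objective at the computed solution (should vanish up to round-off)
grad = -2.0 / m * G.T @ (y - G @ alpha_z) + 2.0 * lam * m * alpha_z
```

Because the objective is strictly convex, the vanishing gradient identifies the unique minimizer; solving one linear system is therefore enough, and no iterative optimizer is needed.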

For this purpose we define the integral regularized risk scheme corresponding to (4) as

$$\alpha^{(\rho)} = \alpha^{(\rho)}_\lambda := \arg\min_{\alpha \in \mathbb{R}^m} \big\{ E_\rho(f_\alpha) + \lambda m \|\alpha\|_2^2 \big\}. \tag{5}$$

Then we have the following error decomposition:

$$\|f_{\alpha_z} - f_\rho\|_{2,\rho_X} \le \|f_{\alpha_z} - f_{\alpha^{(\rho)}}\|_{2,\rho_X} + \|f_{\alpha^{(\rho)}} - f_\rho\|_{2,\rho_X}, \tag{6}$$

where the first term on the right-hand side is called the sample error and the second term is called the approximation error.

Throughout this paper, we always assume $|y| < M$ almost surely. So the regression function $f_\rho$ is bounded and square integrable with respect to $\rho_X$. For the kernel function $K$, we only assume it is continuous and bounded. We denote

$$k := \sup_{(x,y) \in X \times X} |K(x, y)| \quad \text{and} \quad |\rho|_2 := \Big( \int_Z y^2 \, d\rho \Big)^{1/2}.$$

Our first main result is an $O(1/\sqrt{m})$ convergence rate for the sample error.

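Theorem 1. Let $K(x, y)$ be a general kernel on $X \times X$, and let $\alpha_z$ and $\alpha^{(\rho)}$ be defined as in (4) and (5) respectively. Then, for any $0 < \delta < 1$, with confidence $1 - \delta$, there holds

$$\|f_{\alpha_z} - f_{\alpha^{(\rho)}}\|_{2,\rho_X} \le \frac{12 k^2 |\rho|_2 \log\frac{2}{\delta}}{\lambda \sqrt{m}} \tag{7}$$

for $|\rho|_2 \ge \frac{M}{\sqrt{m}}$ and $\lambda \ge \frac{k^2}{m}$.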

Next we will show that the approximation error can be bounded by the following K-functional:

$$K(f_\rho, \lambda) = \inf_{\alpha \in \mathbb{R}^m} \big( \|f_\alpha - f_\rho\|_{2,\rho_X}^2 + \lambda m \|\alpha\|_2^2 \big).$$

Combining this estimate with (6) and (7), we have the following learning rate estimate.

Theorem 2. Under the assumptions of Theorem 1, for any $0 < \delta < 1$, with confidence $1 - \delta$, there holds

$$\|f_{\alpha_z} - f_\rho\|_{2,\rho_X} \le \frac{12 k^2 |\rho|_2 \log\frac{2}{\delta}}{\lambda \sqrt{m}} + K(f_\rho, \lambda)^{1/2}. \tag{8}$$

II. ESTIMATE OF SAMPLE ERROR AND APPROXIMATION ERROR

Since

$$\|f_{\alpha_z} - f_{\alpha^{(\rho)}}\|_{2,\rho_X} \le k \sqrt{m} \, \|\alpha_z - \alpha^{(\rho)}\|_2, \tag{9}$$

the estimate of the sample error reduces to that of $\|\alpha_z - \alpha^{(\rho)}\|_2$. For this purpose we need some lemmas.
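The reduction from function norms to coefficient norms in (9) rests only on the Cauchy-Schwarz inequality and the bound $|K(x, y)| \le k$. A short derivation, written for arbitrary coefficient vectors $\alpha, \beta \in \mathbb{R}^m$ (the symbol $\beta$ is introduced here purely for illustration), might read:

```latex
\begin{aligned}
|f_\alpha(x) - f_\beta(x)|
  &= \Big| \sum_{j=1}^m (\alpha_j - \beta_j) K(x, y_j) \Big|
   \le \Big( \sum_{j=1}^m K(x, y_j)^2 \Big)^{1/2} \|\alpha - \beta\|_2
   \le k \sqrt{m} \, \|\alpha - \beta\|_2 , \\
\|f_\alpha - f_\beta\|_{2,\rho_X}
  &= \Big( \int_X |f_\alpha(x) - f_\beta(x)|^2 \, d\rho_X \Big)^{1/2}
   \le k \sqrt{m} \, \|\alpha - \beta\|_2 ,
\end{aligned}
```

where the last step uses that $\rho_X$ is a probability measure. Taking $\alpha = \alpha_z$ and $\beta = \alpha^{(\rho)}$ gives the stated inequality.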

Lemma 1. Let $\nabla f(x)|_{x = x_0}$ be the gradient of $f(x)$ at $x_0$. Then the following results hold:

(i) There exists a unique minimizer $\alpha^{(\rho)}$ of the problem (5), and

$$\int_Z (y - f_{\alpha^{(\rho)}}(x))^2 \, d\rho \le |\rho|_2^2, \qquad \|\alpha^{(\rho)}\|_2 \le \frac{|\rho|_2}{\sqrt{\lambda m}}. \tag{10}$$

(ii) For any $(x, y) \in Z$ there holds

$$\nabla_\alpha (y - f_\alpha(x))^2 = -2 K_Y(x)^T (y - f_\alpha(x)). \tag{11}$$

(iii) $\alpha^{(\rho)}$ satisfies

$$\lambda m \, \alpha^{(\rho)} = \int_Z K_Y(x)^T (y - f_{\alpha^{(\rho)}}(x)) \, d\rho, \tag{12}$$

where $K_Y(x) = (K(x, y_1), \ldots, K(x, y_m))$. For a vector-valued function $f(x, y) = (f_1(x, y), \ldots, f_m(x, y))^T$ and a scalar-valued function $\alpha(x)$, we define

$$f(x, y) \alpha(x) = (f_1(x, y) \alpha(x), \ldots, f_m(x, y) \alpha(x))^T$$

and

$$\int_Z f(x, y) \alpha(x) \, d\rho = \Big( \int_Z f_1(x, y) \alpha(x) \, d\rho, \ldots, \int_Z f_m(x, y) \alpha(x) \, d\rho \Big)^T.$$
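Parts (ii) and (iii) of Lemma 1 can be sanity-checked numerically: the gradient of the squared loss with respect to $\alpha$ should equal $-2 K_Y(x)^T (y - f_\alpha(x))$, and the minimizer should satisfy $\lambda m \, \alpha^{(\rho)} = \int_Z K_Y(x)^T (y - f_{\alpha^{(\rho)}}(x)) \, d\rho$. The sketch below is only an illustration under assumptions that are not part of the paper: $\rho$ is taken to be a discrete measure with equal mass on $n$ synthetic points, the kernel is Gaussian, and the names `kernel_row`, `sq_loss` and `centers` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 5, 40                                   # m centers, n atoms of the discrete measure rho
centers = rng.uniform(-1.0, 1.0, size=m)       # plays the role of Y = {y_1, ..., y_m}
xs = rng.uniform(-1.0, 1.0, size=n)
ys = np.cos(2.0 * xs) + 0.05 * rng.standard_normal(n)
lam = 0.1

def kernel_row(x):
    # K_Y(x) = (K(x, y_1), ..., K(x, y_m)) for an illustrative Gaussian kernel
    return np.exp(-(x - centers) ** 2)

def sq_loss(alpha, x, y):
    # (y - f_alpha(x))^2 with f_alpha(x) = K_Y(x) alpha
    return (y - kernel_row(x) @ alpha) ** 2

# (ii): closed-form gradient versus central finite differences
alpha = rng.standard_normal(m)
x0, y0 = xs[0], ys[0]
grad_formula = -2.0 * kernel_row(x0) * (y0 - kernel_row(x0) @ alpha)
eps = 1e-6
grad_numeric = np.array([
    (sq_loss(alpha + eps * e, x0, y0) - sq_loss(alpha - eps * e, x0, y0)) / (2 * eps)
    for e in np.eye(m)
])

# (iii): minimizer of E_rho(f_alpha) + lam * m * ||alpha||_2^2 for the discrete rho
G = np.stack([kernel_row(x) for x in xs])      # n x m matrix with entries K(x_i, y_j)
alpha_rho = np.linalg.solve(G.T @ G / n + lam * m * np.eye(m), G.T @ ys / n)
lhs = lam * m * alpha_rho
rhs = G.T @ (ys - G @ alpha_rho) / n           # integral of K_Y^T (y - f) with respect to rho
```

For a discrete measure the integral in (12) is a finite average, so both sides can be compared entry by entry.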

Proof. It is easy to see that $E_\rho(f_\alpha)$ is a convex function and $\lambda m \|\alpha\|_2^2$ is a strictly convex function of $\alpha$ on $\mathbb{R}^m$. Hence (5) is a strictly convex optimization problem on $\mathbb{R}^m$. It therefore has a unique solution. Since $\alpha^{(\rho)}$ is the minimizer of (5), comparing with $\alpha = 0$ gives

$$E_\rho(f_{\alpha^{(\rho)}}) + \lambda m \|\alpha^{(\rho)}\|_2^2 \le E_\rho(0) = \int_Z y^2 \, d\rho = |\rho|_2^2,$$

which implies (10). A simple computation yields (11). Finally, by the definition of $\alpha^{(\rho)}$ and Fermat's theorem, one has

$$\begin{aligned}
0 &= \nabla_\alpha \Big( \int_Z (y - f_\alpha(x))^2 \, d\rho \Big) \Big|_{\alpha = \alpha^{(\rho)}} + 2 \lambda m \, \alpha^{(\rho)} \\
  &= \int_Z \nabla_\alpha (y - f_\alpha(x))^2 \big|_{\alpha = \alpha^{(\rho)}} \, d\rho + 2 \lambda m \, \alpha^{(\rho)} \\
  &= -2 \int_Z K_Y(x)^T (y - f_{\alpha^{(\rho)}}(x)) \, d\rho + 2 \lambda m \, \alpha^{(\rho)}.
\end{aligned}$$

Hence (12) holds. The proof of Lemma 1 is complete.

We now recall a law of large numbers for random variables with values in a Hilbert space from [11]. There are other forms of the law of large numbers (see e.g. [4]).

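Lemma 2. Let $H$ be a Hilbert space and $\xi$ be a random variable on $(Z, \rho)$ with values in $H$. Assume $\|\xi\|_H \le \tilde{M} < \infty$ almost surely. Denote $\sigma^2(\xi) = E(\|\xi\|_H^2)$. Let $\{\xi_i\}_{i=1}^m$ be independent samples of $\xi$ drawn according to $\rho$. For any $0 < \delta < 1$, with confidence $1 - \delta$, there holds

$$\Big\| \frac{1}{m} \sum_{i=1}^m \big( \xi_i - E(\xi_i) \big) \Big\|_H \le \frac{2 \tilde{M} \log\frac{2}{\delta}}{m} + \sqrt{\frac{2 \sigma^2(\xi) \log\frac{2}{\delta}}{m}}. \tag{13}$$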
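A quick Monte Carlo experiment can illustrate how such a concentration bound behaves. The sketch below assumes the right-hand side has the form commonly quoted from [11], namely $2\tilde{M}\log(2/\delta)/m + \sqrt{2\sigma^2(\xi)\log(2/\delta)/m}$, where $\tilde{M}$ is an almost sure bound on $\|\xi\|_H$; the random variable (uniform on a cube in $\mathbb{R}^d$), the parameter values and the name `concentration_bound` are illustrative choices and not part of the paper.

```python
import numpy as np

def concentration_bound(M_tilde, sigma2, m, delta):
    # Assumed right-hand side: 2*M*log(2/delta)/m + sqrt(2*sigma^2*log(2/delta)/m)
    log_term = np.log(2.0 / delta)
    return 2.0 * M_tilde * log_term / m + np.sqrt(2.0 * sigma2 * log_term / m)

rng = np.random.default_rng(2)
d, m, delta, trials = 5, 200, 0.1, 2000

# xi is uniform on [-1, 1]^d, so E(xi) = 0, ||xi||_2 <= sqrt(d) and E||xi||_2^2 = d / 3
M_tilde, sigma2 = np.sqrt(d), d / 3.0
bound = concentration_bound(M_tilde, sigma2, m, delta)

samples = rng.uniform(-1.0, 1.0, size=(trials, m, d))
deviations = np.linalg.norm(samples.mean(axis=1), axis=1)   # ||(1/m) sum xi_i - E(xi)||_2
failure_rate = np.mean(deviations > bound)                  # fraction of trials violating the bound
```

The lemma asserts that the fraction of violating trials should not exceed $\delta$; in experiments of this kind the bound is typically quite conservative.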

The next lemma shows the robustness of the solutions of (5).

Lemma 3. Let $\rho, \mu$ be distributions on $X \times Y$ with $|\rho|_2 < +\infty$ and $|\mu|_2 < +\infty$, let $K(x, y)$ be a general kernel on $X \times X$, and let $\alpha^{(\rho)}$ and $\alpha^{(\mu)}$ be the solutions of scheme (5) for $\rho$ and $\mu$ respectively. Then there holds

$$\|\alpha^{(\rho)} - \alpha^{(\mu)}\|_2 \le \frac{2}{\lambda m} \times \Big\| \int_Z K_Y(x)^T (y - f_{\alpha^{(\rho)}}(x)) \, d\rho - \int_Z K_Y(x)^T (y - f_{\alpha^{(\rho)}}(x)) \, d\mu \Big\|_2. \tag{14}$$

Proof. Let $V(x)$ be a differentiable convex function on $\mathbb{R}$. Then it is well known that the following inequality holds:

$$V(x) - V(y) \ge V'(y)(x - y), \quad x, y \in \mathbb{R}.$$

Taking $V(x) = x^2$, one has the inequality

$$x^2 - y^2 \ge 2 y (x - y), \quad x, y \in \mathbb{R}. \tag{15}$$

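It follows that

$$\begin{aligned}
(y - f_{\alpha^{(\mu)}}(x))^2 - (y - f_{\alpha^{(\rho)}}(x))^2
  &\ge 2 (y - f_{\alpha^{(\rho)}}(x)) \big( f_{\alpha^{(\rho)}}(x) - f_{\alpha^{(\mu)}}(x) \big) \\
  &= -2 (y - f_{\alpha^{(\rho)}}(x)) K_Y(x) \big( \alpha^{(\mu)} - \alpha^{(\rho)} \big) \\
  &= -2 \big( \alpha^{(\mu)} - \alpha^{(\rho)}, \ (y - f_{\alpha^{(\rho)}}(x)) K_Y(x)^T \big)_2.
\end{aligned}$$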

Integrating the above inequality with respect to $\mu$ on both sides, we have the following useful inequality:

$$\int_Z (y - f_{\alpha^{(\mu)}}(x))^2 \, d\mu - \int_Z (y - f_{\alpha^{(\rho)}}(x))^2 \, d\mu \ge -2 \Big( \alpha^{(\mu)} - \alpha^{(\rho)}, \ \int_Z (y - f_{\alpha^{(\rho)}}(x)) K_Y(x)^T \, d\mu \Big)_2. \tag{16}$$

Note that

$$\|\alpha^{(\mu)}\|_2^2 - \|\alpha^{(\rho)}\|_2^2 = 2 \big( \alpha^{(\mu)} - \alpha^{(\rho)}, \ \alpha^{(\rho)} \big)_2 + \|\alpha^{(\rho)} - \alpha^{(\mu)}\|_2^2. \tag{17}$$

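By (16) and (17), we have

$$\begin{aligned}
&\big( E_\mu(f_{\alpha^{(\mu)}}) + \lambda m \|\alpha^{(\mu)}\|_2^2 \big) - \big( E_\mu(f_{\alpha^{(\rho)}}) + \lambda m \|\alpha^{(\rho)}\|_2^2 \big) \\
&\quad = \int_Z (y - f_{\alpha^{(\mu)}}(x))^2 \, d\mu - \int_Z (y - f_{\alpha^{(\rho)}}(x))^2 \, d\mu + \lambda m \big( \|\alpha^{(\mu)}\|_2^2 - \|\alpha^{(\rho)}\|_2^2 \big) \\
&\quad \ge -2 \Big( \alpha^{(\mu)} - \alpha^{(\rho)}, \ \int_Z (y - f_{\alpha^{(\rho)}}(x)) K_Y(x)^T \, d\mu \Big)_2 + 2 \lambda m \big( \alpha^{(\mu)} - \alpha^{(\rho)}, \ \alpha^{(\rho)} \big)_2 + \lambda m \|\alpha^{(\rho)} - \alpha^{(\mu)}\|_2^2 \\
&\quad = 2 \Big( \alpha^{(\mu)} - \alpha^{(\rho)}, \ \lambda m \, \alpha^{(\rho)} - \int_Z (y - f_{\alpha^{(\rho)}}(x)) K_Y(x)^T \, d\mu \Big)_2 + \lambda m \|\alpha^{(\rho)} - \alpha^{(\mu)}\|_2^2.
\end{aligned} \tag{18}$$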

Substituting the expression for $\lambda m \, \alpha^{(\rho)}$ in (12) into (18), one has

$$\begin{aligned}
&\big( E_\mu(f_{\alpha^{(\mu)}}) + \lambda m \|\alpha^{(\mu)}\|_2^2 \big) - \big( E_\mu(f_{\alpha^{(\rho)}}) + \lambda m \|\alpha^{(\rho)}\|_2^2 \big) \\
&\quad \ge 2 \Big( \alpha^{(\mu)} - \alpha^{(\rho)}, \ \int_Z (y - f_{\alpha^{(\rho)}}(x)) K_Y(x)^T \, d\rho - \int_Z (y - f_{\alpha^{(\rho)}}(x)) K_Y(x)^T \, d\mu \Big)_2 + \lambda m \|\alpha^{(\rho)} - \alpha^{(\mu)}\|_2^2.
\end{aligned}$$

Since $\alpha^{(\mu)}$ is the minimizer of scheme (5) for $\mu$, we have

$$\big( E_\mu(f_{\alpha^{(\mu)}}) + \lambda m \|\alpha^{(\mu)}\|_2^2 \big) - \big( E_\mu(f_{\alpha^{(\rho)}}) + \lambda m \|\alpha^{(\rho)}\|_2^2 \big) \le 0,$$

and therefore, by Cauchy's inequality,

$$\begin{aligned}
\lambda m \|\alpha^{(\rho)} - \alpha^{(\mu)}\|_2^2
  &\le 2 \Big( \alpha^{(\rho)} - \alpha^{(\mu)}, \ \int_Z (y - f_{\alpha^{(\rho)}}(x)) K_Y(x)^T \, d\rho - \int_Z (y - f_{\alpha^{(\rho)}}(x)) K_Y(x)^T \, d\mu \Big)_2 \\
  &\le 2 \|\alpha^{(\rho)} - \alpha^{(\mu)}\|_2 \times \Big\| \int_Z (y - f_{\alpha^{(\rho)}}(x)) K_Y(x)^T \, d\rho - \int_Z (y - f_{\alpha^{(\rho)}}(x)) K_Y(x)^T \, d\mu \Big\|_2.
\end{aligned}$$

Thus (14) holds.
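The robustness statement of Lemma 3 can be illustrated on two discrete measures, for which both minimizers and both integrals are computable exactly. The sketch below is a hedged illustration, not the authors' code: $\rho$ and $\mu$ are empirical measures on synthetic samples of different sizes, the kernel is Gaussian, and the helper names `design`, `minimizer` and `moment` are hypothetical. It checks that $\|\alpha^{(\rho)} - \alpha^{(\mu)}\|_2$ is at most $2/(\lambda m)$ times the norm of the difference of the two integrals of $K_Y(x)^T (y - f_{\alpha^{(\rho)}}(x))$.

```python
import numpy as np

rng = np.random.default_rng(3)
m, lam = 6, 0.05
centers = rng.uniform(-1.0, 1.0, size=m)       # plays the role of Y = {y_1, ..., y_m}

def design(xs):
    # Rows are K_Y(x_i) for an illustrative Gaussian kernel
    return np.exp(-(xs[:, None] - centers[None, :]) ** 2)

def minimizer(xs, ys):
    # Solves scheme (5) for the discrete measure with equal mass on each (x_i, y_i)
    G, n = design(xs), len(xs)
    return np.linalg.solve(G.T @ G / n + lam * m * np.eye(m), G.T @ ys / n)

def moment(xs, ys, alpha):
    # Integral of K_Y(x)^T (y - f_alpha(x)) with respect to the discrete measure
    G = design(xs)
    return G.T @ (ys - G @ alpha) / len(xs)

x_rho = rng.uniform(-1.0, 1.0, size=500)
y_rho = np.sin(3.0 * x_rho) + 0.1 * rng.standard_normal(500)
x_mu = rng.uniform(-1.0, 1.0, size=50)
y_mu = np.sin(3.0 * x_mu) + 0.1 * rng.standard_normal(50)

alpha_rho, alpha_mu = minimizer(x_rho, y_rho), minimizer(x_mu, y_mu)
lhs = np.linalg.norm(alpha_rho - alpha_mu)
gap = np.linalg.norm(moment(x_rho, y_rho, alpha_rho) - moment(x_mu, y_mu, alpha_rho))
rhs = 2.0 * gap / (lam * m)
```

Note that both integrals use the same coefficient vector $\alpha^{(\rho)}$; only the measure changes, which is what makes the lemma convenient for comparing a distribution with its empirical counterpart.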

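Proof of Theorem 1.

Take $\xi(z) = (y - f_{\alpha^{(\rho)}}(x)) K_Y(x)^T$; then $\xi(z) \in \mathbb{R}^m$ for any $z = (x, y)$. By (10) and the definition of $k$, we know

$$\int_Z \|\xi(z)\|_2^2 \, d\rho \le k^2 m \int_Z (y - f_{\alpha^{(\rho)}}(x))^2 \, d\rho \le k^2 m |\rho|_2^2. \tag{19}$$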

Also,

$$\|\xi(z)\|_2 \le k \sqrt{m} \, |y - f_{\alpha^{(\rho)}}(x)| \le k \sqrt{m} \big( M + k \sqrt{m} \, \|\alpha^{(\rho)}\|_2 \big) \le k \sqrt{m} \Big( M + \frac{k |\rho|_2}{\sqrt{\lambda}} \Big).$$

Let $\rho_z$ be the equi-distribution measure on the sample $z$, so that

$$\int_Z f(x, y) \, d\rho_z = \frac{1}{m} \sum_{i=1}^m f(x_i, y_i). \tag{20}$$

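Combining the above estimates with Lemma 2, we know that for any $0 < \delta < 1$, with confidence $1 - \delta$, there holds

$$\begin{aligned}
&\Big\| \int_Z (y - f_{\alpha^{(\rho)}}(x)) K_Y(x)^T \, d\rho - \frac{1}{m} \sum_{i=1}^m (y_i - f_{\alpha^{(\rho)}}(x_i)) K_Y(x_i)^T \Big\|_2 \\
&\quad \le \frac{2 k \sqrt{m} \big( M + \frac{k |\rho|_2}{\sqrt{\lambda}} \big) \log\frac{2}{\delta}}{m} + \sqrt{\frac{2 k^2 m |\rho|_2^2 \log\frac{2}{\delta}}{m}} \\
&\quad \le \Big( \frac{2 k M}{\sqrt{m}} + \frac{2 k^2 |\rho|_2}{\sqrt{\lambda m}} + 2 k |\rho|_2 \Big) \log\frac{2}{\delta} \\
&\quad \le 6 k |\rho|_2 \log\frac{2}{\delta},
\end{aligned} \tag{21}$$

where the second inequality uses $\sqrt{2 \log\frac{2}{\delta}} \le 2 \log\frac{2}{\delta}$ and the last one holds under the conditions $|\rho|_2 \ge \frac{M}{\sqrt{m}}$ and $\lambda \ge \frac{k^2}{m}$.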

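Since $|\rho|_2 \ge \frac{M}{\sqrt{m}}$ and $\lambda \ge \frac{k^2}{m}$, by (21) and Lemma 3 (applied with $\mu = \rho_z$, for which the solution of scheme (5) is $\alpha_z$ by (20)) we have

$$\|\alpha_z - \alpha^{(\rho)}\|_2 \le \frac{12 k |\rho|_2 \log\frac{2}{\delta}}{\lambda m}. \tag{22}$$

The conclusion of Theorem 1 follows from (9) and (22).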

Proof of Theorem 2. It is well known that for all $f \in L_2(\rho_X)$ there holds

$$E_\rho(f) - E_\rho(f_\rho) = \int_X |f(x) - f_\rho(x)|^2 \, d\rho_X.$$
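For readers who want the intermediate step, this identity follows by expanding the square around $f_\rho$ and using the definition of $f_\rho$ as a conditional expectation. A brief sketch of the computation:

```latex
\begin{aligned}
E_\rho(f)
  &= \int_Z \big( (y - f_\rho(x)) + (f_\rho(x) - f(x)) \big)^2 \, d\rho \\
  &= E_\rho(f_\rho)
     + 2 \int_X (f_\rho(x) - f(x)) \Big( \int_Y (y - f_\rho(x)) \, d\rho(y|x) \Big) d\rho_X
     + \int_X (f_\rho(x) - f(x))^2 \, d\rho_X \\
  &= E_\rho(f_\rho) + \int_X |f(x) - f_\rho(x)|^2 \, d\rho_X ,
\end{aligned}
```

since the inner integral $\int_Y (y - f_\rho(x)) \, d\rho(y|x)$ vanishes for every $x \in X$.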

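It follows that

$$\begin{aligned}
\|f_{\alpha^{(\rho)}} - f_\rho\|_{2,\rho_X}
  &= \big( E_\rho(f_{\alpha^{(\rho)}}) - E_\rho(f_\rho) \big)^{1/2} \\
  &\le \big( E_\rho(f_{\alpha^{(\rho)}}) - E_\rho(f_\rho) + \lambda m \|\alpha^{(\rho)}\|_2^2 \big)^{1/2} \\
  &= \Big[ \inf_{\alpha \in \mathbb{R}^m} \big( E_\rho(f_\alpha) - E_\rho(f_\rho) + \lambda m \|\alpha\|_2^2 \big) \Big]^{1/2} \\
  &= K(f_\rho, \lambda)^{1/2},
\end{aligned} \tag{23}$$

where in the second equality we use the fact that $\alpha^{(\rho)}$ is the minimizer of (5). Combining this with (6) and (7), we complete the proof of Theorem 2.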

Note that when the kernel is a Mercer kernel of a suitable type, under a very mild regularity condition on the regression function, we may derive a dimension-free learning rate. We will study this problem in future work.

ACKNOWLEDGMENT

This work was supported in part by the Natural Science Foundation of China (Grant Nos. 10871226, 10971251) and the Natural Science Foundation of Zhejiang Province (Grant No. Y6100096).


REFERENCES

[1] V. Vapnik, The Nature of Statistical Learning Theory, Springer, New York, 1995.

[2] P. Niyogi, F. Girosi, Generalization bounds for function approximation from scattered noisy data, Adv. Comput. Math. 10 (1999), 51-80.

[3] T. Evgeniou, M. Pontil, T. Poggio, Regularization networks and support vector machines, Adv. Comput. Math. 13 (2000), 1-50.

[4] F. Cucker, S. Smale, On the mathematical foundations of learning theory, Bull. Amer. Math. Soc. 39 (2001), 1-49.

[5] F. Cucker, S. Smale, Best choices for regularization parameters in learning theory: On the bias-variance problem, Found. Comput. Math. 2 (2002), 413-428.

[6] E. De Vito, A. Caponnetto, L. Rosasco, Model selection for regularized least-squares algorithm in learning theory, Found. Comput. Math. 5 (2005), 59-85.

[7] Q. Wu, Y. M. Ying, D. X. Zhou, Learning rates of least square regression, Found. Comput. Math. 6(2) (2006), 171-192.

[8] A. Caponnetto, E. De Vito, Optimal rates for the regularized least-squares algorithm, Found. Comput. Math. 7(3) (2007), 331-368.

[9] S. Smale, D. X. Zhou, Shannon sampling and function reconstruction from point values, Bull. (New Series) Amer. Math. Soc. 41(3) (2004), 279-305.

[10] S. Smale, D. X. Zhou, Shannon sampling II. Connections to learning theory, Appl. Comput. Harmonic Anal. 19 (2005), 285-302.

[11] S. Smale, D. X. Zhou, Learning theory estimates via integral operators and their applications, Constr. Approx. 26 (2007), 153-172.

[12] H. W. Sun, Q. Wu, Application of integral operator for regularized least-square regression, Math. and Comput. Model. 49 (2009), 276-285.

[13] H. W. Sun, Q. Wu, Least square regression with indefinite kernels and coefficient regularization, Appl. Computat. Harmo. Anal., doi: 10.1016/j.acha.2010.04.001.

[14] C. De Mol, E. De Vito, L. Rosasco, Elastic-net regularization in learning theory, J. Complexity 25(2) (2009), 201-230.

[15] G. Gnecco, M. Sanguineti, The weight-decay technique in learning from data: an optimization point of view, Comput. Manag. Sci. 6 (2009), 53-79.

[16] J. Mairal, F. Bach, J. Ponce, G. Sapiro, Online learning for matrix factorization and sparse coding, J. Mach. Learn. Res. 11 (2010), 19-60.

[17] B. H. Sheng, J. L. Wang, P. Li, The covering number for some Mercer kernel Hilbert spaces, J. Complexity 24 (2008), 241-258.

[18] V. Koltchinskii, Rejoinder: Local Rademacher complexities and oracle inequalities in risk minimization, The Annals of Statistics 34(6) (2006), 2697-2706.

[19] J.-B. Hiriart-Urruty, C. Lemaréchal, Fundamentals of Convex Analysis, Springer-Verlag, Berlin, 2001.

Baohuai Sheng received the Ph.D. degree in applied mathematics from Xidian University, Xi’an, China, in 2002. He is a Full Professor at Shaoxing College of Arts and Sciences. He has published more than 70 journal and conference papers. His current research interests include approximation theory and machine learning.