
Least Square Regression Learning with Data Dependent Hypothesis and Coefficient Regularization

Bao-Huai Sheng

Department of Mathematics, Shaoxing College of Arts and Sciences, Shaoxing, Zhejiang 312000, China E-mail: shengbaohuai@163.com

Pei-Xin Ye

School of Mathematics and LPMC, Nankai University, Tianjin 300071, China E-mail: [email protected]

Abstract--We study least square regression with a data dependent hypothesis space and coefficient regularization, based on a general kernel. An explicit expression for the solution of this kernel scheme is derived. We then provide a sample error bound with a decay of $O(1/\sqrt{m})$ and estimate the approximation error in terms of a $K$-functional.

Index Terms -- Least Square Regressions, Data Dependent Hypothesis, Coefficient Regularization, General Kernel, Learning Rate.

I. INTRODUCTION AND MAIN RESULTS

We establish in this paper the mathematical foundations of least square regression learning with general kernel and coefficient regularization.

Let $X \subset \mathbb{R}^q$ and $Y \subset \mathbb{R}$ be Borel sets and let $\rho$ be a Borel probability measure on $Z := X \times Y$. For a function $f : X \to Y$ define the error

$$E_\rho(f) = \int_Z (y - f(x))^2 \, d\rho.$$

Let $\rho(y|x)$ be the conditional (with respect to $x$) probability measure on $Y$ and $\rho_X$ the marginal probability measure on $X$. Define $f_\rho(x)$ to be the conditional expectation of $y$ with respect to the measure $\rho(y|x)$, i.e.

$$f_\rho(x) = \int_Y y \, d\rho(y|x), \quad x \in X;$$

then the function $f_\rho$ is known in statistics as the regression function of $\rho$. It is clear that if $f_\rho \in L^2(\rho_X)$, where

$$L^2(\rho_X) = \left\{ f : \|f\|_{2,\rho_X} = \left( \int_X |f(x)|^2 \, d\rho_X \right)^{1/2} < +\infty \right\},$$

then it minimizes the error $E_\rho(f)$ over all $f \in L^2(\rho_X)$.

Thus, in the sense of the error $E_\rho(f)$, the regression function $f_\rho$ is the best description of the relation between inputs $x \in X$ and outputs $y \in Y$.

In most cases, the distribution $\rho(x, y)$ is unknown and what one can know is a set of samples $z = \{z_i\}_{i=1}^m = \{(x_i, y_i)\}_{i=1}^m \in Z^m$ which are drawn independently and identically distributed according to $\rho(x, y)$. Our goal is to find an estimator $f_z$ on the basis of the given data $z$ that approximates $f_\rho$ well with high probability. This is an ill-posed problem and a regularization technique is needed. In many areas of machine learning, the following Tikhonov regularization scheme is commonly used to overcome the ill-posedness:

$$f_z := \arg\min_{f \in H} \left\{ \frac{1}{m} \sum_{i=1}^m (f(x_i) - y_i)^2 + \lambda \|f\|_H^2 \right\}. \tag{1}$$

We take $H$ to be a reproducing kernel Hilbert space (RKHS) induced by a Mercer kernel. Recall that a Mercer kernel $K$ is a function on $X \times X$ which is continuous, symmetric and positive semi-definite, i.e., for any given positive integer $m$ and any finite set of distinct points $\mathbf{X} = \{x_1, x_2, \ldots, x_m\} \subset X$, the matrix $K_{\mathbf{X},\mathbf{X}} = (K(x_i, x_j))_{i,j=1}^m$ is positive semi-definite. The RKHS $H_K$ associated with the kernel $K$ is defined to be the closure of the linear span of the set of functions $\{K_x := K(x, \cdot) : x \in X\}$ with the inner product $\langle K_x, K_y \rangle_{H_K} = K(x, y)$. The reproducing property takes the form

$$\langle f, K_x \rangle_{H_K} = f(x), \quad \forall x \in X, \ f \in H_K. \tag{2}$$

Corresponding author: Ye Peixin. This work is supported by the Natural

This kind of kernel scheme has been studied extensively in the literature, cf. [1]-[11].

In this paper we consider a different kernel scheme.

Let $K : X \times X \to \mathbb{R}$ be a continuous and bounded function, which is called a general kernel. For a given data set $\mathbf{Y} := \{y_1, y_2, \ldots, y_m\} \subset X$ the data dependent hypothesis space is given by

$$H_{K,\mathbf{Y}} := \left\{ f_\alpha(x) = \sum_{j=1}^m \alpha_j K(x, y_j) : \alpha = (\alpha_1, \ldots, \alpha_m)^T \in \mathbb{R}^m \right\}.$$

Every hypothesis function is determined by its coefficients, and the penalty is imposed on these coefficients. This leads to the general coefficient regularized scheme

$$\alpha_z := \arg\min_{\alpha \in \mathbb{R}^m} \left\{ \frac{1}{m} \sum_{i=1}^m (f_\alpha(x_i) - y_i)^2 + \lambda \Omega(\alpha) \right\}, \tag{3}$$

where $\Omega(\alpha)$ is a positive function on $\mathbb{R}^m$. Formulation (3) is a data dependent scheme which has found many applications in the design of support vector machines, micro-array analysis and variable selection (see e.g. [12]-[18]). Coefficient regularization was first introduced by Vapnik [1] to design linear programming support vector machines. It has some advantages. Firstly, the algorithm is directly a finite dimensional optimization problem and is easy to adapt to other algorithms. Secondly, one can freely choose the regularizer for different purposes. For instance, a sparse representation can be obtained if the $l^1$ norm of the coefficients is used as the regularizer, while another choice of regularizer gives back the regularization scheme (1). Besides these advantages, an important observation is that, when a positive definite kernel is used, the coefficient regularization scheme usually provides performance comparable to that of the regularization scheme in a reproducing kernel Hilbert space. We now study a particular coefficient regularization. We endow $\mathbb{R}^m$ with the usual inner product, i.e., for any $a = (a_1, a_2, \ldots, a_m)^T$, $b = (b_1, b_2, \ldots, b_m)^T \in \mathbb{R}^m$, we take

$$(a, b)_2 = \sum_{i=1}^m a_i b_i = a^T b.$$

In particular, $\|a\|_2^2 = a^T a$. Setting $\Omega(\alpha) = m \|\alpha\|_2^2 = m \sum_{i=1}^m |\alpha_i|^2$, we have the following coefficient regularization with $l^2$-penalization:

$$\alpha_z = \alpha_{z,\lambda} := \arg\min_{\alpha \in \mathbb{R}^m} \left\{ \frac{1}{m} \sum_{i=1}^m (y_i - f_\alpha(x_i))^2 + \lambda m \|\alpha\|_2^2 \right\}. \tag{4}$$

We notice that (4) is a strictly convex optimization problem whose solution may be analyzed with tools from convex analysis (see [19]). Based on this consideration, we shall give an explicit expression for the solution of (4), which, together with an inequality for convex functions, shows the robustness of the solutions (see Lemma 3). Thus we will use a new approach to estimate the learning rate $\|f_{\alpha_z} - f_\rho\|_{2,\rho_X}$.
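As an illustration of scheme (4), the following is a minimal sketch, not the authors' implementation. Setting the gradient of the objective in (4) to zero gives the normal equations $(G^T G + \lambda m^2 I)\alpha = G^T y$, where $G_{ij} = K(x_i, c_j)$. The kernel `general_kernel`, the helper names, and the use of `centers` for the points $y_1, \ldots, y_m \subset X$ (to avoid confusion with the sample outputs, here `ys`) are assumptions made only for this example.

```python
import math


def general_kernel(x, t):
    # Hypothetical general kernel: continuous and bounded (|K| <= 1), not symmetric.
    return math.exp(-(x - t) ** 2) * math.cos(x - 2.0 * t)


def design_matrix(xs, centers, kernel=general_kernel):
    # Row i is (K(x_i, c_1), ..., K(x_i, c_m)).
    return [[kernel(x, c) for c in centers] for x in xs]


def solve(a, b):
    # Gaussian elimination with partial pivoting; the inputs are left unmodified.
    n = len(b)
    aug = [list(row) + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def fit_coefficients(xs, ys, centers, lam, kernel=general_kernel):
    # Solve the normal equations of (4): (G^T G + lam * m^2 * I) alpha = G^T y.
    m = len(xs)
    assert len(ys) == m and len(centers) == m
    g = design_matrix(xs, centers, kernel)
    lhs = [[sum(g[i][p] * g[i][q] for i in range(m)) + (lam * m * m if p == q else 0.0)
            for q in range(m)] for p in range(m)]
    rhs = [sum(g[i][p] * ys[i] for i in range(m)) for p in range(m)]
    return solve(lhs, rhs)


def objective(alpha, xs, ys, centers, lam, kernel=general_kernel):
    # Empirical regularized risk of (4) evaluated at alpha.
    m = len(xs)
    g = design_matrix(xs, centers, kernel)
    residuals = [ys[i] - sum(g[i][j] * alpha[j] for j in range(m)) for i in range(m)]
    return sum(r * r for r in residuals) / m + lam * m * sum(a * a for a in alpha)
```

Because $\lambda m^2 > 0$, the matrix in the normal equations is positive definite even when the kernel is not symmetric or positive semi-definite, so the linear solve is well posed for any general kernel.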

For this purpose we define the integral regularized risk scheme corresponding to (4) as

$$\alpha^{(\rho)} = \alpha_\lambda^{(\rho)} := \arg\min_{\alpha \in \mathbb{R}^m} \left\{ E_\rho(f_\alpha) + \lambda m \|\alpha\|_2^2 \right\}. \tag{5}$$

Then we have the following error decomposition:

$$\|f_{\alpha_z} - f_\rho\|_{2,\rho_X} \le \|f_{\alpha_z} - f_{\alpha^{(\rho)}}\|_{2,\rho_X} + \|f_{\alpha^{(\rho)}} - f_\rho\|_{2,\rho_X}, \tag{6}$$

where the first term on the right-hand side is called the sample error and the second term is called the approximation error.

Throughout this paper, we always assume $|y| < M$ almost surely. So the regression function $f_\rho$ is bounded and square integrable with respect to $\rho_X$. For the kernel function $K$, we only assume it is continuous and bounded. We denote

$$k := \sup_{(x, y) \in X \times X} |K(x, y)| \quad \text{and} \quad |\rho|_2^2 := \int_Z y^2 \, d\rho.$$

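Our first main result is an $O(1/\sqrt{m})$ convergence rate for the sample error.

Theorem 1. Let $K(x, y)$ be a general kernel on $X \times X$, and let $\alpha_z$ and $\alpha^{(\rho)}$ be defined as in (4) and (5) respectively. Then, for any $0 < \delta < 1$, with confidence $1 - \delta$, there holds

$$\|f_{\alpha_z} - f_{\alpha^{(\rho)}}\|_{2,\rho_X} \le \frac{2 \cdot 6 \, k^2 |\rho|_2 \log \frac{2}{\delta}}{\lambda \sqrt{m}} \tag{7}$$

for $m \ge \dfrac{M^2}{|\rho|_2^2}$ and $\lambda \ge \dfrac{k^2}{m}$.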

Next we will show that the approximation error can be bounded by the following $K$-functional:

$$K(f_\rho, \lambda) = \inf_{\alpha \in \mathbb{R}^m} \left( \|f_\alpha - f_\rho\|_{2,\rho_X}^2 + \lambda m \|\alpha\|_2^2 \right).$$

Combining this estimate with (6) and (7), we have the following learning rate estimate.

Theorem 2. Under the assumptions of Theorem 1, for any $0 < \delta < 1$, with confidence $1 - \delta$, there holds

$$\|f_{\alpha_z} - f_\rho\|_{2,\rho_X} \le \frac{2 \cdot 6 \, k^2 |\rho|_2 \log \frac{2}{\delta}}{\lambda \sqrt{m}} + K(f_\rho, \lambda)^{1/2}. \tag{8}$$

II. ESTIMATE OF SAMPLE ERROR AND APPROXIMATION ERROR

Since

$$\|f_{\alpha_z} - f_{\alpha^{(\rho)}}\|_{2,\rho_X} \le k \sqrt{m} \, \|\alpha_z - \alpha^{(\rho)}\|_2, \tag{9}$$

we reduce the estimate of the sample error to that of $\|\alpha_z - \alpha^{(\rho)}\|_2$. For this purpose we need some lemmas.
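One way to see (9), sketched here under the standing assumption $|K(x, y)| \le k$ and writing $K_{\mathbf{Y}}(x) = (K(x, y_1), \ldots, K(x, y_m))$ so that $f_\alpha(x) = K_{\mathbf{Y}}(x)\alpha$, is the Cauchy-Schwarz inequality in $\mathbb{R}^m$. The vector $\beta$ below is an arbitrary second coefficient vector introduced only for this sketch.

```latex
\begin{aligned}
|f_\alpha(x) - f_\beta(x)|
  &= |K_{\mathbf{Y}}(x)(\alpha - \beta)|
   \le \|K_{\mathbf{Y}}(x)\|_2 \, \|\alpha - \beta\|_2
   \le k\sqrt{m}\, \|\alpha - \beta\|_2 , \\
\|f_\alpha - f_\beta\|_{2,\rho_X}
  &= \left( \int_X |f_\alpha(x) - f_\beta(x)|^2 \, d\rho_X \right)^{1/2}
   \le k\sqrt{m}\, \|\alpha - \beta\|_2 .
\end{aligned}
```

The last step uses only that $\rho_X$ is a probability measure; taking $\alpha = \alpha_z$ and $\beta = \alpha^{(\rho)}$ gives (9).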

Lemma 1. Let $\nabla f(x)|_{x = x_0}$ be the gradient of $f(x)$ at $x_0$. Then the following results hold:

(i). There exists a unique minimizer $\alpha^{(\rho)}$ of the problem (5), and

$$\int_Z \left( y - f_{\alpha^{(\rho)}}(x) \right)^2 d\rho \le |\rho|_2^2, \qquad \|\alpha^{(\rho)}\|_2^2 \le \frac{|\rho|_2^2}{\lambda m}. \tag{10}$$

(ii). For any $(x, y) \in Z$ there holds

$$\nabla_\alpha \left( y - f_\alpha(x) \right)^2 = -2 K_{\mathbf{Y}}(x)^T \left( y - f_\alpha(x) \right). \tag{11}$$

(iii). $\alpha^{(\rho)}$ satisfies

$$\lambda m \, \alpha^{(\rho)} = \int_Z K_{\mathbf{Y}}(x)^T \left( y - f_{\alpha^{(\rho)}}(x) \right) d\rho, \tag{12}$$

where $K_{\mathbf{Y}}(x) = (K(x, y_1), \ldots, K(x, y_m))$. For a vector-valued function $f(x, y) = (f_1(x, y), \ldots, f_m(x, y))^T$ and a scalar-valued function $\alpha(x)$, we define

$$f(x, y)\alpha(x) = \left( f_1(x, y)\alpha(x), \ldots, f_m(x, y)\alpha(x) \right)^T$$

and

$$\int_Z f(x, y)\alpha(x) \, d\rho = \left( \int_Z f_1(x, y)\alpha(x) \, d\rho, \ldots, \int_Z f_m(x, y)\alpha(x) \, d\rho \right)^T.$$
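Because $f_\alpha(x) = K_{\mathbf{Y}}(x)\alpha$ is linear in $\alpha$, identity (12) can be rearranged into a finite linear system for $\alpha^{(\rho)}$. The following is a sketch of that rearrangement; the symbols $A_\rho$ and $b_\rho$ are shorthand introduced here only and are not part of the paper's notation.

```latex
A_\rho := \int_Z K_{\mathbf{Y}}(x)^T K_{\mathbf{Y}}(x) \, d\rho \in \mathbb{R}^{m \times m},
\qquad
b_\rho := \int_Z K_{\mathbf{Y}}(x)^T y \, d\rho \in \mathbb{R}^m,
\qquad
(\lambda m I + A_\rho)\, \alpha^{(\rho)} = b_\rho .
```

Since $A_\rho$ is positive semi-definite and $\lambda m > 0$, the matrix $\lambda m I + A_\rho$ is invertible, which is consistent with the uniqueness stated in (i).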

Proof: It is easy to see that $E_\rho(f_\alpha)$ is a convex function and $\lambda m \|\alpha\|_2^2$ is a strictly convex function on $\mathbb{R}^m$. Hence, (5) is a strictly convex optimization problem on $\mathbb{R}^m$. It therefore has a unique solution. Since $\alpha^{(\rho)}$ is the minimizer of (5), we have

$$E_\rho(f_{\alpha^{(\rho)}}) + \lambda m \|\alpha^{(\rho)}\|_2^2 \le E_\rho(f_0) = \int_Z y^2 \, d\rho = |\rho|_2^2,$$

which implies (10). A simple computation yields (11). Finally, by the definition of $\alpha^{(\rho)}$ and the Fermat theorem, one has

$$\begin{aligned}
0 &= \nabla_\alpha \left( \int_Z (y - f_\alpha(x))^2 \, d\rho + \lambda m \|\alpha\|_2^2 \right) \Big|_{\alpha = \alpha^{(\rho)}} \\
  &= \int_Z \nabla_\alpha (y - f_\alpha(x))^2 \big|_{\alpha = \alpha^{(\rho)}} \, d\rho + 2 \lambda m \, \alpha^{(\rho)} \\
  &= -2 \int_Z K_{\mathbf{Y}}(x)^T \left( y - f_{\alpha^{(\rho)}}(x) \right) d\rho + 2 \lambda m \, \alpha^{(\rho)}.
\end{aligned}$$

Hence, (12) holds. The proof of Lemma 1 is complete.

We now recall a law of large numbers for random variables with values in a Hilbert space from [11]. There are other forms of the law of large numbers (see e.g. [4]).

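Lemma 2. Let $H$ be a Hilbert space and $\xi$ be a random variable on $(Z, \rho)$ with values in $H$. Assume $\|\xi\|_H \le \widetilde{M} < +\infty$ almost surely. Denote $\sigma^2(\xi) = E(\|\xi\|_H^2)$. Let $\{\xi_i\}_{i=1}^m$ be independent random draws of $\xi$. For any $0 < \delta < 1$, with confidence $1 - \delta$, there holds

$$\left\| \frac{1}{m} \sum_{i=1}^m \left( \xi_i - E(\xi_i) \right) \right\|_H \le \frac{2 \widetilde{M} \log \frac{2}{\delta}}{m} + \sqrt{\frac{2 \sigma^2(\xi) \log \frac{2}{\delta}}{m}}. \tag{13}$$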

The next lemma shows the robustness for the solutions of (5).

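Lemma 3. Let $\rho$, $\mu$ be distributions on $X \times Y$ with $|\rho|_2 < +\infty$, $|\mu|_2 < +\infty$, let $K(x, y)$ be a general kernel on $X \times X$, and let $\alpha^{(\rho)}$ and $\alpha^{(\mu)}$ be the solutions of scheme (5) for $\rho$ and $\mu$ respectively. Then there holds

$$\|\alpha^{(\rho)} - \alpha^{(\mu)}\|_2 \le \frac{2}{m \lambda} \left\| \int_Z K_{\mathbf{Y}}(x)^T \left( y - f_{\alpha^{(\rho)}}(x) \right) d\rho - \int_Z K_{\mathbf{Y}}(x)^T \left( y - f_{\alpha^{(\rho)}}(x) \right) d\mu \right\|_2. \tag{14}$$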

Proof. Let $V(x)$ be a differentiable convex function on $\mathbb{R}$. Then it is well known that the following inequality holds:

$$V(x) - V(y) \ge V'(y)(x - y), \quad x, y \in \mathbb{R}.$$

In this paper, we take $V(x) = x^2$; then one has the following inequality:

$$x^2 - y^2 \ge 2y(x - y), \quad x, y \in \mathbb{R}. \tag{15}$$
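Inequality (15) can also be checked directly, without appealing to convexity, since the gap between its two sides is a perfect square:

```latex
x^2 - y^2 - 2y(x - y) = x^2 - 2xy + y^2 = (x - y)^2 \ge 0, \qquad x, y \in \mathbb{R}.
```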

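It follows that

$$\begin{aligned}
\left( y - f_{\alpha^{(\mu)}}(x) \right)^2 - \left( y - f_{\alpha^{(\rho)}}(x) \right)^2
  &\ge 2 \left( y - f_{\alpha^{(\rho)}}(x) \right) \left( f_{\alpha^{(\rho)}}(x) - f_{\alpha^{(\mu)}}(x) \right) \\
  &= -2 \left( y - f_{\alpha^{(\rho)}}(x) \right) K_{\mathbf{Y}}(x) \left( \alpha^{(\mu)} - \alpha^{(\rho)} \right) \\
  &= \left( \alpha^{(\mu)} - \alpha^{(\rho)}, \ -2 \left( y - f_{\alpha^{(\rho)}}(x) \right) K_{\mathbf{Y}}(x)^T \right)_2 .
\end{aligned}$$

Integrating the above inequality with respect to $\mu$ on both sides, we have the following useful inequality:

$$\int_Z \left( y - f_{\alpha^{(\mu)}}(x) \right)^2 d\mu - \int_Z \left( y - f_{\alpha^{(\rho)}}(x) \right)^2 d\mu \ge \left( \alpha^{(\mu)} - \alpha^{(\rho)}, \ -2 \int_Z \left( y - f_{\alpha^{(\rho)}}(x) \right) K_{\mathbf{Y}}(x)^T d\mu \right)_2 . \tag{16}$$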

Note that

$$\|\alpha^{(\mu)}\|_2^2 - \|\alpha^{(\rho)}\|_2^2 = 2 \left( \alpha^{(\mu)} - \alpha^{(\rho)}, \ \alpha^{(\rho)} \right)_2 + \|\alpha^{(\rho)} - \alpha^{(\mu)}\|_2^2 . \tag{17}$$

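By (16) and (17), we have

$$\begin{aligned}
& \left( E_\mu(f_{\alpha^{(\mu)}}) + \lambda m \|\alpha^{(\mu)}\|_2^2 \right) - \left( E_\mu(f_{\alpha^{(\rho)}}) + \lambda m \|\alpha^{(\rho)}\|_2^2 \right) \\
&\quad = \int_Z \left( y - f_{\alpha^{(\mu)}}(x) \right)^2 d\mu - \int_Z \left( y - f_{\alpha^{(\rho)}}(x) \right)^2 d\mu + \lambda m \left( \|\alpha^{(\mu)}\|_2^2 - \|\alpha^{(\rho)}\|_2^2 \right) \\
&\quad \ge \left( \alpha^{(\mu)} - \alpha^{(\rho)}, \ -2 \int_Z \left( y - f_{\alpha^{(\rho)}}(x) \right) K_{\mathbf{Y}}(x)^T d\mu \right)_2 + 2 \lambda m \left( \alpha^{(\mu)} - \alpha^{(\rho)}, \ \alpha^{(\rho)} \right)_2 + \lambda m \|\alpha^{(\rho)} - \alpha^{(\mu)}\|_2^2 \\
&\quad = 2 \left( \alpha^{(\mu)} - \alpha^{(\rho)}, \ \lambda m \, \alpha^{(\rho)} - \int_Z \left( y - f_{\alpha^{(\rho)}}(x) \right) K_{\mathbf{Y}}(x)^T d\mu \right)_2 + \lambda m \|\alpha^{(\rho)} - \alpha^{(\mu)}\|_2^2 .
\end{aligned} \tag{18}$$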

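Substituting the expression of $\lambda m \, \alpha^{(\rho)}$ in (12) into (18), one has

$$\begin{aligned}
& \left( E_\mu(f_{\alpha^{(\mu)}}) + \lambda m \|\alpha^{(\mu)}\|_2^2 \right) - \left( E_\mu(f_{\alpha^{(\rho)}}) + \lambda m \|\alpha^{(\rho)}\|_2^2 \right) \\
&\quad \ge 2 \left( \alpha^{(\mu)} - \alpha^{(\rho)}, \ \int_Z \left( y - f_{\alpha^{(\rho)}}(x) \right) K_{\mathbf{Y}}(x)^T d\rho - \int_Z \left( y - f_{\alpha^{(\rho)}}(x) \right) K_{\mathbf{Y}}(x)^T d\mu \right)_2 + \lambda m \|\alpha^{(\rho)} - \alpha^{(\mu)}\|_2^2 .
\end{aligned}$$

Since $\alpha^{(\mu)}$ is the minimizer of scheme (5) for $\mu$,

$$\left( E_\mu(f_{\alpha^{(\mu)}}) + \lambda m \|\alpha^{(\mu)}\|_2^2 \right) - \left( E_\mu(f_{\alpha^{(\rho)}}) + \lambda m \|\alpha^{(\rho)}\|_2^2 \right) \le 0,$$

and we have by Cauchy's inequality that

$$\begin{aligned}
\lambda m \|\alpha^{(\rho)} - \alpha^{(\mu)}\|_2^2
  &\le 2 \left( \alpha^{(\rho)} - \alpha^{(\mu)}, \ \int_Z \left( y - f_{\alpha^{(\rho)}}(x) \right) K_{\mathbf{Y}}(x)^T d\rho - \int_Z \left( y - f_{\alpha^{(\rho)}}(x) \right) K_{\mathbf{Y}}(x)^T d\mu \right)_2 \\
  &\le 2 \|\alpha^{(\rho)} - \alpha^{(\mu)}\|_2 \times \left\| \int_Z \left( y - f_{\alpha^{(\rho)}}(x) \right) K_{\mathbf{Y}}(x)^T d\rho - \int_Z \left( y - f_{\alpha^{(\rho)}}(x) \right) K_{\mathbf{Y}}(x)^T d\mu \right\|_2 .
\end{aligned}$$

Thus (14) holds.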
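The robustness estimate (14) can be checked numerically when $\rho$ and $\mu$ are both uniform measures on finite point sets, since scheme (5) then reduces to a linear system. The following is a minimal sketch under that assumption, not part of the paper: the kernel, the point sets and all function names are hypothetical and chosen only for illustration.

```python
import math


def kernel(x, t):
    # Hypothetical bounded, continuous, non-symmetric general kernel.
    return math.exp(-(x - t) ** 2) * math.cos(x - 2.0 * t)


def solve(a, b):
    # Gaussian elimination with partial pivoting; the inputs are left unmodified.
    n = len(b)
    aug = [list(row) + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def k_row(x, centers):
    # The row vector K_Y(x) for the given centers.
    return [kernel(x, c) for c in centers]


def solve_scheme(points, centers, lam):
    # Minimizer of scheme (5) for the uniform measure on `points`:
    # (mean of K_Y^T K_Y + lam * m * I) alpha = mean of K_Y^T y.
    m, n = len(centers), len(points)
    rows = [k_row(x, centers) for x, _ in points]
    lhs = [[sum(r[p] * r[q] for r in rows) / n + (lam * m if p == q else 0.0)
            for q in range(m)] for p in range(m)]
    rhs = [sum(r[p] * y for r, (_, y) in zip(rows, points)) / n for p in range(m)]
    return solve(lhs, rhs)


def mean_residual_vector(points, centers, alpha):
    # Integral of K_Y(x)^T (y - f_alpha(x)) against the uniform measure on `points`.
    m, n = len(centers), len(points)
    total = [0.0] * m
    for x, y in points:
        row = k_row(x, centers)
        resid = y - sum(row[j] * alpha[j] for j in range(m))
        for j in range(m):
            total[j] += row[j] * resid / n
    return total


def norm2(v):
    return math.sqrt(sum(c * c for c in v))


def lemma3_sides(rho_points, mu_points, centers, lam):
    # Returns the left-hand and right-hand sides of (14).
    m = len(centers)
    a_rho = solve_scheme(rho_points, centers, lam)
    a_mu = solve_scheme(mu_points, centers, lam)
    lhs = norm2([p - q for p, q in zip(a_rho, a_mu)])
    d_rho = mean_residual_vector(rho_points, centers, a_rho)
    d_mu = mean_residual_vector(mu_points, centers, a_rho)
    rhs = 2.0 / (m * lam) * norm2([p - q for p, q in zip(d_rho, d_mu)])
    return lhs, rhs
```

For any two finite point sets, `lemma3_sides` should return a left-hand side no larger than the right-hand side; the same helpers can also be used to confirm identity (12) for the empirical measure.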

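Proof of Theorem 1.

Take $\xi(z) = \left( y - f_{\alpha^{(\rho)}}(x) \right) K_{\mathbf{Y}}(x)^T$; then $\xi(z) \in \mathbb{R}^m$ for any $z = (x, y)$. By (10), we know

$$\int_Z \|\xi(z)\|_2^2 \, d\rho \le k^2 m \int_Z \left( y - f_{\alpha^{(\rho)}}(x) \right)^2 d\rho \le k^2 m |\rho|_2^2 . \tag{19}$$

Also,

$$\|\xi(z)\|_2 \le k \sqrt{m} \, \left| y - f_{\alpha^{(\rho)}}(x) \right| \le k \sqrt{m} \left( M + k \sqrt{m} \, \|\alpha^{(\rho)}\|_2 \right) \le k \sqrt{m} \left( M + \frac{k |\rho|_2}{\sqrt{\lambda}} \right).$$

Let $\rho_z$ be the equi-distribution measure on $Z$ associated with the sample $z$. So we have

$$\int_Z f(x, y) \, d\rho_z = \frac{1}{m} \sum_{i=1}^m f(x_i, y_i). \tag{20}$$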

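Combining the above estimates with Lemma 2, we know that for any $0 < \delta < 1$, with confidence $1 - \delta$, there holds

$$\begin{aligned}
& \left\| \int_Z \left( y - f_{\alpha^{(\rho)}}(x) \right) K_{\mathbf{Y}}(x)^T d\rho - \frac{1}{m} \sum_{i=1}^m \left( y_i - f_{\alpha^{(\rho)}}(x_i) \right) K_{\mathbf{Y}}(x_i)^T \right\|_2 \\
&\quad \le \frac{2 k \sqrt{m} \left( M + \frac{k |\rho|_2}{\sqrt{\lambda}} \right) \log \frac{2}{\delta}}{m} + \sqrt{\frac{2 k^2 m |\rho|_2^2 \log \frac{2}{\delta}}{m}} \\
&\quad \le \left( \frac{2 k M}{\sqrt{m}} + \frac{2 k^2 |\rho|_2}{\sqrt{\lambda m}} + \sqrt{2} \, k |\rho|_2 \right) \log \frac{2}{\delta} \\
&\quad \le 6 k |\rho|_2 \log \frac{2}{\delta},
\end{aligned} \tag{21}$$

where the last step uses $m \ge \dfrac{M^2}{|\rho|_2^2}$ and $\lambda \ge \dfrac{k^2}{m}$. By (21) and Lemma 3 (applied with $\mu = \rho_z$, for which $\alpha^{(\mu)} = \alpha_z$ by (20)) we have

$$\|\alpha_z - \alpha^{(\rho)}\|_2 \le \frac{2 \cdot 6 \, k |\rho|_2 \log \frac{2}{\delta}}{\lambda m}. \tag{22}$$

The conclusion of Theorem 1 follows from (9) and (22).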

Proof of Theorem 2. It is well known that for all $f \in L^2(\rho_X)$, there holds

$$E_\rho(f) - E_\rho(f_\rho) = \int_X |f(x) - f_\rho(x)|^2 \, d\rho_X .$$
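For completeness, here is a short sketch of why this identity holds, using only the definition of $f_\rho$ as the conditional mean and the standing boundedness assumption on $y$ (so that all integrals are finite):

```latex
\begin{aligned}
E_\rho(f)
  &= \int_Z \left( y - f_\rho(x) + f_\rho(x) - f(x) \right)^2 d\rho \\
  &= E_\rho(f_\rho) + \int_X |f(x) - f_\rho(x)|^2 \, d\rho_X
     + 2 \int_X \left( f_\rho(x) - f(x) \right)
       \left( \int_Y \left( y - f_\rho(x) \right) d\rho(y|x) \right) d\rho_X .
\end{aligned}
```

The inner integral vanishes for every $x$ because $f_\rho(x) = \int_Y y \, d\rho(y|x)$, which leaves exactly the stated identity.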

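It follows that

$$\begin{aligned}
\|f_{\alpha^{(\rho)}} - f_\rho\|_{2,\rho_X}
  &= \left( E_\rho(f_{\alpha^{(\rho)}}) - E_\rho(f_\rho) \right)^{1/2} \\
  &\le \left( E_\rho(f_{\alpha^{(\rho)}}) - E_\rho(f_\rho) + \lambda m \|\alpha^{(\rho)}\|_2^2 \right)^{1/2} \\
  &= \left[ \inf_{\alpha \in \mathbb{R}^m} \left( E_\rho(f_\alpha) - E_\rho(f_\rho) + \lambda m \|\alpha\|_2^2 \right) \right]^{1/2} \\
  &= K(f_\rho, \lambda)^{1/2},
\end{aligned} \tag{23}$$

where in the second equality we use the fact that $\alpha^{(\rho)}$ is the minimizer of (5). Combining this with (6) and (7) we complete the proof of Theorem 2.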

Note that when the kernel is a Mercer kernel of a suitable kind, under a very mild regularity condition on the regression function, we may derive a dimension-free learning rate. We will study this problem in future work.

ACKNOWLEDGMENT

This work was supported in part by grants from the Natural Science Foundation of China (Grant Nos. 10871226, 10971251) and the Natural Science Foundation of Zhejiang Province (Grant No. Y6100096).

REFERENCES

[1] V. Vapnik, The Nature of Statistical Learning Theory, Springer, New York, 1995.

[2] P. Niyogi and F. Girosi, Generalization bounds for function approximation from scattered noisy data, Adv. Comput. Math. 10 (1999), 51-80.

[3] T. Evgeniou, M. Pontil, T. Poggio, Regularization networks and support vector machines, Adv. Comput. Math. 13 (2000), 1-50.

[4] F. Cucker, S. Smale, On the mathematical foundations of learning theory, Bull. Amer. Math. Soc. 39 (2001), 1-49.

[5] F. Cucker, S. Smale, Best choices for regularization parameters in learning theory: On the bias-variance problem, Found. Comput. Math. 2 (2002), 413-428.

[6] E. De Vito, A. Caponnetto, and L. Rosasco, Model selection for regularized least-squares algorithm in learning theory, Found. Comput. Math. 5 (2005), 59-85.

[7] Q. Wu, Y. M. Ying, D. X. Zhou, Learning rates of least square regression, Found. Comput. Math. 6(2) (2006), 171-192.

[8] A. Caponnetto, E. De Vito, Optimal rates for the regularized least-squares algorithm, Found. Comput. Math. 7(3) (2007), 331-368.

[9] S. Smale, D. X. Zhou, Shannon sampling and function reconstruction from point values, Bull. (New Series) Amer. Math. Soc. 41(3) (2004), 279-305.

[10] S. Smale and D. X. Zhou, Shannon sampling II. Connections to learning theory, Appl. Comput. Harmonic Anal. 19 (2005), 285-302.

[11] S. Smale, D. X. Zhou, Learning theory estimates via integral operators and their applications, Constr. Approx. 26 (2007), 153-172.

[12] H. W. Sun, Q. Wu, Application of integral operator for regularized least-square regression, Math. and Comput. Model. 49 (2009), 276-285.

[13] H. W. Sun, Q. Wu, Least square regression with indefinite kernels and coefficient regularization, Appl. Computat. Harmo. Anal., doi: 10.1016/j.acha.2010.04.001.

[14] C. De Mol, E. De Vito, L. Rosasco, Elastic-net regularization in learning theory, J. Complexity 25(2) (2009), 201-230.

[15] G. Gnecco, M. Sanguineti, The weight-decay technique in learning from data: an optimization point of view, Comput. Manag. Sci. 6 (2009), 53-79.

[16] J. Mairal, F. Bach, J. Ponce, G. Sapiro, Online learning for matrix factorization and sparse coding, J. Mach. Learn. Res. 11 (2010), 19-60.

[17] B. H. Sheng, J. L. Wang, P. Li, The covering number for some Mercer kernel Hilbert spaces, J. Complexity 24 (2008), 241-258.

[18] V. Koltchinskii, Rejoinder: Local Rademacher complexities and oracle inequalities in risk minimization, The Annals of Statistics 34(6) (2006), 2697-2706.

[19] J.-B. Hiriart-Urruty, C. Lemaréchal, Fundamentals of Convex Analysis, Springer-Verlag, Berlin, 2001.

Baohuai Sheng received the Ph.D. degree in applied mathematics from Xidian University, Xi’an, China, in 2002. He is a Full Professor at Shaoxing College of Arts and Sciences. He has published more than 70 journal and conference papers. His current research interests include approximation theory and machine learning.
