Least Square Regression Learning with Data Dependent Hypothesis and Coefficient Regularization
Bao-Huai Sheng
Department of Mathematics, Shaoxing College of Arts and Sciences, Shaoxing, Zhejiang 312000, China. E-mail: shengbaohuai@163.com
Pei-Xin Ye
School of Mathematics and LPMC, Nankai University, Tianjin 300071, China. E-mail: [email protected]
Abstract--We study least square regression with data dependent hypothesis spaces and coefficient regularization algorithms based on a general kernel. An explicit expression for the solution of this kernel scheme is derived. We then provide a sample error bound with a decay of $O(1/\sqrt{m})$ and estimate the approximation error in terms of a $K$-functional.
Index Terms--Least Square Regression, Data Dependent Hypothesis, Coefficient Regularization, General Kernel, Learning Rate.
I. INTRODUCTION AND MAIN RESULTS
We establish in this paper the mathematical foundations of least square regression learning with general kernel and coefficient regularization.
Let $X \subset \mathbb{R}^q$ and $Y \subset \mathbb{R}$ be Borel sets and let $\rho$ be a Borel probability measure on $Z := X \times Y$. For a function $f : X \to Y$, define the error
$$E_\rho(f) = \int_Z (y - f(x))^2 \, d\rho.$$
Consider $\rho(y|x)$, the conditional (with respect to $x$) probability measure on $Y$, and $\rho_X$, the marginal probability measure on $X$. Define $f_\rho(x)$ to be the conditional expectation of $y$ with respect to the measure $\rho(y|x)$, i.e.,
$$f_\rho(x) = \int_Y y \, d\rho(y|x), \quad x \in X.$$
The function $f_\rho$ is known in statistics as the regression function of $\rho$. It is clear that if $f_\rho \in L^2(\rho_X)$, where
$$L^2(\rho_X) = \Big\{ f : \|f\|_{2,\rho_X} = \Big( \int_X |f(x)|^2 \, d\rho_X \Big)^{1/2} < +\infty \Big\},$$
then it minimizes the error $E_\rho(f)$ over all $f \in L^2(\rho_X)$. Thus, in the sense of the error $E_\rho$, the regression function $f_\rho(x)$ is the best description of the relation between inputs $x \in X$ and outputs $y \in Y$.
In most cases, the distribution $\rho(x,y)$ is unknown and what one can know is a set of samples $z = \{z_i\}_{i=1}^m = \{(x_i, y_i)\}_{i=1}^m \in Z^m$ which are drawn independently and identically distributed according to $\rho(x,y)$. Our goal is to find an estimator $f_z$ on the basis of the given data $z$ that approximates $f_\rho$ well with high probability. This is an ill-posed problem and a regularization technique is needed. In many areas of machine learning, the following Tikhonov regularization scheme is commonly used to overcome the ill-posedness:
$$f_z^{(H)} := \arg\min_{f \in H} \Big\{ \frac{1}{m} \sum_{i=1}^m (f(x_i) - y_i)^2 + \lambda \|f\|_H^2 \Big\}. \tag{1}$$
We take $H$ to be a reproducing kernel Hilbert space (RKHS) induced by a Mercer kernel. Recall that a Mercer kernel $K$ is a function on $X \times X$ which is continuous, symmetric and positive semi-definite, i.e., for any given positive integer $m$ and any finite set of distinct points $\mathbf{X} = \{x_1, x_2, \ldots, x_m\} \subset X$, the matrix $K_{\mathbf{X},\mathbf{X}} = (K(x_i, x_j))_{i,j=1}^m$ is positive semi-definite. The RKHS $H_K$ associated with the kernel $K$ is defined to be the closure of the linear span of the set of functions $\{K_x := K(x, \cdot) : x \in X\}$ with the inner product $\langle K_x, K_y \rangle_{H_K} = K(x,y)$. The reproducing property takes the form
$$\langle f, K_x \rangle_{H_K} = f(x), \quad \forall x \in X,\ f \in H_K. \tag{2}$$
Corresponding author: Ye Peixin. This work is supported by the Natural
This kind of kernel scheme has been studied extensively in the literature; cf. [1-11].
In this paper we consider a different kernel scheme.
Let $K : X \times X \to \mathbb{R}$ be a continuous and bounded function, which is called a general kernel. For given data $\mathbf{Y} := \{y_1, y_2, \ldots, y_m\} \subset X$, the data dependent hypothesis space is given by
$$H_{K,\mathbf{Y}} := \Big\{ f_\alpha(x) = \sum_{j=1}^m \alpha_j K(x, y_j) : \alpha = (\alpha_1, \ldots, \alpha_m) \in \mathbb{R}^m \Big\}.$$
Every hypothesis function is determined by its coefficients and the penalty is imposed on these coefficients. This leads to the general coefficient regularized scheme
$$\alpha_z := \arg\min_{\alpha \in \mathbb{R}^m} \Big\{ \frac{1}{m} \sum_{i=1}^m (f_\alpha(x_i) - y_i)^2 + \lambda \Omega(\alpha) \Big\}, \tag{3}$$
where $\Omega(\alpha)$ is a positive function on $\mathbb{R}^m$. Formulation (3) is a data dependent scheme which has found many applications in the design of support vector machines, micro-array analysis and variable selection (see e.g. [12]-[18]). Coefficient regularization was first introduced by Vapnik [1] to design linear programming support vector machines. It has some advantages. Firstly, the algorithm is directly a finite dimensional optimization problem and is easy to adapt to other algorithms. Secondly, one can freely choose the regularizer for different purposes. For instance, a sparse representation can be obtained if the $\ell^1$ norm of the coefficients is used as the regularizer, while a suitable quadratic regularizer built from a Mercer kernel gives back the regularization scheme (1). Besides these advantages, an important observation is that, when a positive definite kernel is used, the coefficient regularization scheme usually provides performance quite comparable to that of the regularization scheme in a reproducing kernel Hilbert space.
We now study a particular coefficient regularization. We endow $\mathbb{R}^m$ with the usual inner product, i.e., for any $a = (a_1, a_2, \ldots, a_m)^T$, $b = (b_1, b_2, \ldots, b_m)^T \in \mathbb{R}^m$, we take
$$(a, b)_2 = \sum_{i=1}^m a_i b_i = a^T b.$$
In particular, $\|a\|_2^2 = a^T a$. Setting $\Omega(\alpha) = m\|\alpha\|_2^2 = m \sum_{i=1}^m |\alpha_i|^2$, we have the following coefficient regularization with $\ell^2$-penalization:
$$\alpha_z = \alpha_{z,\lambda} := \arg\min_{\alpha \in \mathbb{R}^m} \Big\{ \frac{1}{m} \sum_{i=1}^m (y_i - f_\alpha(x_i))^2 + \lambda m \|\alpha\|_2^2 \Big\}. \tag{4}$$
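To illustrate why scheme (4) is a finite dimensional problem, note that setting the gradient of its objective to zero gives the linear system $(G^T G + \lambda m^2 I)\alpha = G^T y$, where $G$ is the $m \times m$ matrix with entries $G_{ij} = K(x_i, y_j)$. The sketch below solves this system in plain Python. It is only a minimal illustration under assumptions: the kernel `kernel`, the helper names, and the toy data are hypothetical and do not come from the paper.

```python
def kernel(x, t):
    # Hypothetical general kernel: continuous and bounded, but not symmetric.
    return 1.0 / (1.0 + (x - 2.0 * t) ** 2)


def solve_linear(a, b):
    # Gaussian elimination with partial pivoting for a small dense system a x = b.
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(aug[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x


def fit_coefficients(xs, targets, centers, lam):
    # Solve (G^T G + lam * m^2 * I) alpha = G^T y, the stationarity condition of (4).
    m = len(xs)
    g = [[kernel(x, t) for t in centers] for x in xs]  # g[i][j] = K(x_i, y_j)
    gtg = [[sum(g[i][p] * g[i][q] for i in range(m)) + (lam * m * m if p == q else 0.0)
            for q in range(m)] for p in range(m)]
    gty = [sum(g[i][p] * targets[i] for i in range(m)) for p in range(m)]
    return solve_linear(gtg, gty), g


def objective(alpha, g, targets, lam):
    # Objective of (4): empirical squared error plus lam * m * ||alpha||_2^2.
    m = len(targets)
    fit = sum((targets[i] - sum(g[i][j] * alpha[j] for j in range(m))) ** 2
              for i in range(m)) / m
    return fit + lam * m * sum(a * a for a in alpha)


xs = [0.0, 0.5, 1.0, 1.5]        # sample inputs x_i
targets = [0.1, 0.4, 0.8, 0.3]   # sample outputs y_i
centers = [0.0, 0.5, 1.0, 1.5]   # the points y_j defining the hypothesis space
lam = 0.1
alpha_z, g = fit_coefficients(xs, targets, centers, lam)
```

Because the regularizer $\lambda m\|\alpha\|_2^2$ adds $\lambda m^2$ to the diagonal, the system is solvable for any bounded kernel and any $\lambda > 0$, with no symmetry or positive definiteness required of $K$.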
We notice that (4) is a strictly convex optimization problem whose solution may be analyzed with tools from convex analysis (see [19]). Based on this consideration, we shall give the explicit expression for the solution of (4); with this expression and an inequality for convex functions we show the robustness of the solutions (see Lemma 3). Thus we will use a new approach to estimate the learning rate $\|f_{\alpha_z} - f_\rho\|_{2,\rho_X}$.
For this purpose we define the integral regularized risk scheme corresponding to (4) as
$$\alpha^{(\rho)} = \alpha^{(\rho)}_\lambda := \arg\min_{\alpha \in \mathbb{R}^m} \big\{ E_\rho(f_\alpha) + \lambda m \|\alpha\|_2^2 \big\}. \tag{5}$$
Then we have the following error decomposition:
$$\|f_{\alpha_z} - f_\rho\|_{2,\rho_X} \le \|f_{\alpha_z} - f_{\alpha^{(\rho)}}\|_{2,\rho_X} + \|f_{\alpha^{(\rho)}} - f_\rho\|_{2,\rho_X}, \tag{6}$$
where the first term on the right-hand side is called the sample error and the second term is called the approximation error.
Throughout this paper, we always assume $|y| < M$ almost surely. So the regression function $f_\rho$ is bounded and square integrable with respect to $\rho_X$. For the kernel function $K$, we only assume it is continuous and bounded. We denote
$$k := \sup_{(x,y) \in X \times X} |K(x,y)| \quad \text{and} \quad |\rho|_2^2 := \int_Z y^2 \, d\rho.$$
Our first main result is an $O(1/\sqrt{m})$ convergence rate for the sample error.
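Theorem 1. Let $K(x,y)$ be a general kernel on $X \times X$, and let $\alpha_z$ and $\alpha^{(\rho)}$ be defined as in (4) and (5) respectively. Then, for any $0 < \delta < 1$, with confidence $1 - \delta$, there holds
$$\|f_{\alpha_z} - f_{\alpha^{(\rho)}}\|_{2,\rho_X} \le \frac{2\sqrt{6}\, k^2 |\rho|_2 \log\frac{2}{\delta}}{\lambda \sqrt{m}} \tag{7}$$
for $|\rho|_2^2 \ge \dfrac{M^2}{m}$ and $\lambda \ge \dfrac{k^2}{m}$.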
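As a small worked illustration of how the sample error bound scales with the sample size, the sketch below evaluates the right-hand side of (7), read here as $2\sqrt{6}\,k^2|\rho|_2\log(2/\delta)/(\lambda\sqrt{m})$. The function name and the numerical values are assumptions chosen for illustration, and the side conditions of Theorem 1 are not checked by the code.

```python
import math


def sample_error_bound(k, rho2, lam, m, delta):
    # Right-hand side of (7) as read here:
    # 2 * sqrt(6) * k^2 * |rho|_2 * log(2 / delta) / (lam * sqrt(m)).
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie in (0, 1)")
    return (2.0 * math.sqrt(6.0) * k ** 2 * rho2 * math.log(2.0 / delta)
            / (lam * math.sqrt(m)))


# Hypothetical values: kernel bound k, |rho|_2, regularization parameter, confidence level.
k, rho2, lam, delta = 1.0, 0.5, 0.1, 0.05
bounds = {m: sample_error_bound(k, rho2, lam, m, delta) for m in (100, 400, 1600)}
```

For a fixed $\lambda$, quadrupling $m$ halves the bound, which is the $O(1/\sqrt{m})$ behaviour; the bound also grows like $1/\lambda$ as the regularization parameter shrinks.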
Next we will show that the approximation error can be bounded by the following K-functional:
$$K(f_\rho, \lambda) = \inf_{\alpha \in \mathbb{R}^m} \big( \|f_\alpha - f_\rho\|_{2,\rho_X}^2 + \lambda m \|\alpha\|_2^2 \big).$$
Combining this estimate with (6) and (7), we have the following learning rate estimate.
Theorem 2. Under the assumptions of Theorem 1, for any $0 < \delta < 1$, with confidence $1 - \delta$, there holds
$$\|f_{\alpha_z} - f_\rho\|_{2,\rho_X} \le \frac{2\sqrt{6}\, k^2 |\rho|_2 \log\frac{2}{\delta}}{\lambda \sqrt{m}} + K(f_\rho, \lambda)^{1/2}. \tag{8}$$
II. ESTIMATE OF SAMPLE ERROR AND APPROXIMATION ERROR
Since
$$\|f_{\alpha_z} - f_{\alpha^{(\rho)}}\|_{2,\rho_X} \le k\sqrt{m}\, \|\alpha_z - \alpha^{(\rho)}\|_2, \tag{9}$$
we reduce the estimate of the sample error to that of $\|\alpha_z - \alpha^{(\rho)}\|_2$. For this purpose we need some lemmas.
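The reduction above rests on the boundedness of the kernel and the Cauchy-Schwarz inequality. A short derivation, written for two arbitrary coefficient vectors $\alpha, \beta \in \mathbb{R}^m$ (a notation introduced here only for illustration) and using that $\rho_X$ is a probability measure, may be sketched as follows:

```latex
\begin{aligned}
|f_\alpha(x) - f_\beta(x)|
  &= \Big| \sum_{j=1}^m (\alpha_j - \beta_j) K(x, y_j) \Big|
   \le k \sum_{j=1}^m |\alpha_j - \beta_j|
   \le k \sqrt{m}\, \|\alpha - \beta\|_2, \\
\|f_\alpha - f_\beta\|_{2,\rho_X}
  &= \Big( \int_X |f_\alpha(x) - f_\beta(x)|^2 \, d\rho_X \Big)^{1/2}
   \le k \sqrt{m}\, \|\alpha - \beta\|_2 .
\end{aligned}
```

Taking $\alpha = \alpha_z$ and $\beta = \alpha^{(\rho)}$ gives the stated inequality.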
Lemma 1. Let $\nabla f(x)|_{x = x_0}$ be the gradient of $f(x)$ at $x_0$. Then the following results hold:
(i) There exists a unique minimizer $\alpha^{(\rho)}$ of problem (5), and
$$\int_Z (y - f_{\alpha^{(\rho)}}(x))^2 \, d\rho \le |\rho|_2^2, \qquad \|\alpha^{(\rho)}\|_2^2 \le \frac{|\rho|_2^2}{\lambda m}. \tag{10}$$
(ii) For any $(x, y) \in Z$ there holds
$$\nabla_\alpha (y - f_\alpha(x))^2 = -2 K_{\mathbf{Y}}(x)^T (y - f_\alpha(x)). \tag{11}$$
(iii) $\alpha^{(\rho)}$ satisfies
$$\lambda m\, \alpha^{(\rho)} = \int_Z K_{\mathbf{Y}}(x)^T (y - f_{\alpha^{(\rho)}}(x)) \, d\rho, \tag{12}$$
where $K_{\mathbf{Y}}(x) = (K(x, y_1), \ldots, K(x, y_m))$. For a vector-valued function $f(x,y) = (f_1(x,y), \ldots, f_m(x,y))^T$ and a scalar-valued function $\alpha(x)$, we define
$$f(x,y)\alpha(x) = (f_1(x,y)\alpha(x), \ldots, f_m(x,y)\alpha(x))^T$$
and
$$\int_Z f(x,y)\alpha(x) \, d\rho = \Big( \int_Z f_1(x,y)\alpha(x) \, d\rho, \ldots, \int_Z f_m(x,y)\alpha(x) \, d\rho \Big)^T.$$
Proof. It is easy to see that $E_\rho(f_\alpha)$ is a convex function and $\lambda m \|\alpha\|_2^2$ is a strictly convex function of $\alpha$ on $\mathbb{R}^m$. Hence, (5) is a strictly convex optimization problem on $\mathbb{R}^m$. It therefore has a unique solution. Since $\alpha^{(\rho)}$ is the minimizer of (5), we have
$$E_\rho(f_{\alpha^{(\rho)}}) + \lambda m \|\alpha^{(\rho)}\|_2^2 \le E_\rho(f_0) = \int_Z y^2 \, d\rho = |\rho|_2^2,$$
where $f_0$ denotes the hypothesis $f_\alpha$ with $\alpha = 0$. This implies (10). By simple computations, we obtain (11). Finally, by the definition of $\alpha^{(\rho)}$ and the Fermat theorem, one has
$$\begin{aligned}
0 &= \nabla_\alpha \Big( \int_Z (y - f_\alpha(x))^2 \, d\rho + \lambda m \|\alpha\|_2^2 \Big) \Big|_{\alpha = \alpha^{(\rho)}} \\
&= \int_Z \nabla_\alpha (y - f_\alpha(x))^2 \big|_{\alpha = \alpha^{(\rho)}} \, d\rho + 2 \lambda m\, \alpha^{(\rho)} \\
&= -2 \int_Z K_{\mathbf{Y}}(x)^T (y - f_{\alpha^{(\rho)}}(x)) \, d\rho + 2 \lambda m\, \alpha^{(\rho)}.
\end{aligned}$$
Hence, (12) holds. The proof of Lemma 1 is complete.
We now recall a law of large numbers for random variables with values in a Hilbert space from [11]. There are other forms of the law of large numbers (see e.g. [4]).
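Lemma 2. Let $H$ be a Hilbert space and $\xi$ be a random variable on $(Z, \rho)$ with values in $H$. Assume $\|\xi\|_H \le \widetilde{M} < +\infty$ almost surely. Denote $\sigma^2(\xi) = E(\|\xi\|_H^2)$. Let $\{\xi_i\}_{i=1}^m$ be independent random draws of $\xi$ under $\rho$. For any $0 < \delta < 1$, with confidence $1 - \delta$, there holds
$$\Big\| \frac{1}{m} \sum_{i=1}^m \big( \xi_i - E(\xi_i) \big) \Big\|_H \le \frac{2 \widetilde{M} \log\frac{2}{\delta}}{m} + \sqrt{\frac{2 \sigma^2(\xi) \log\frac{2}{\delta}}{m}}. \tag{13}$$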
The next lemma shows the robustness of the solutions of (5).
Lemma 3. Let $\rho, \mu$ be distributions on $X \times Y$ with $|\rho|_2 < +\infty$, $|\mu|_2 < +\infty$, let $K(x,y)$ be a general kernel on $X \times X$, and let $\alpha^{(\rho)}$ and $\alpha^{(\mu)}$ be the solutions of scheme (5) for $\rho$ and $\mu$ respectively. Then there holds
$$\|\alpha^{(\rho)} - \alpha^{(\mu)}\|_2 \le \frac{2}{\lambda m} \Big\| \int_Z K_{\mathbf{Y}}(x)^T (y - f_{\alpha^{(\rho)}}(x)) \, d\rho - \int_Z K_{\mathbf{Y}}(x)^T (y - f_{\alpha^{(\rho)}}(x)) \, d\mu \Big\|_2. \tag{14}$$
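Proof. Let $V(x)$ be a differentiable convex function on $\mathbb{R}$. Then it is well known that the following inequality holds:
$$V(x) - V(y) \ge V'(y)(x - y), \quad x, y \in \mathbb{R}.$$
In this paper we take $V(x) = x^2$; then one has the inequality
$$x^2 - y^2 \ge 2y(x - y), \quad x, y \in \mathbb{R}. \tag{15}$$
It follows that
$$\begin{aligned}
(y - f_{\alpha^{(\mu)}}(x))^2 - (y - f_{\alpha^{(\rho)}}(x))^2
&\ge 2 (y - f_{\alpha^{(\rho)}}(x)) \big( f_{\alpha^{(\rho)}}(x) - f_{\alpha^{(\mu)}}(x) \big) \\
&= -2 (y - f_{\alpha^{(\rho)}}(x))\, K_{\mathbf{Y}}(x) \big( \alpha^{(\mu)} - \alpha^{(\rho)} \big) \\
&= -\Big( \alpha^{(\mu)} - \alpha^{(\rho)},\ 2 (y - f_{\alpha^{(\rho)}}(x)) K_{\mathbf{Y}}(x)^T \Big)_2 .
\end{aligned}$$
Integrating the above inequality with respect to $\mu$ on both sides, we have the following useful inequality:
$$\int_Z (y - f_{\alpha^{(\mu)}}(x))^2 \, d\mu - \int_Z (y - f_{\alpha^{(\rho)}}(x))^2 \, d\mu \ge -\Big( \alpha^{(\mu)} - \alpha^{(\rho)},\ 2 \int_Z (y - f_{\alpha^{(\rho)}}(x)) K_{\mathbf{Y}}(x)^T \, d\mu \Big)_2. \tag{16}$$
Note that
$$\|\alpha^{(\mu)}\|_2^2 - \|\alpha^{(\rho)}\|_2^2 = 2 \big( \alpha^{(\mu)} - \alpha^{(\rho)}, \alpha^{(\rho)} \big)_2 + \|\alpha^{(\rho)} - \alpha^{(\mu)}\|_2^2. \tag{17}$$
By (16) and (17), we have
$$\begin{aligned}
&\big( E_\mu(f_{\alpha^{(\mu)}}) + \lambda m \|\alpha^{(\mu)}\|_2^2 \big) - \big( E_\mu(f_{\alpha^{(\rho)}}) + \lambda m \|\alpha^{(\rho)}\|_2^2 \big) \\
&\quad = \int_Z (y - f_{\alpha^{(\mu)}}(x))^2 \, d\mu - \int_Z (y - f_{\alpha^{(\rho)}}(x))^2 \, d\mu + \lambda m \big( \|\alpha^{(\mu)}\|_2^2 - \|\alpha^{(\rho)}\|_2^2 \big) \\
&\quad \ge -\Big( \alpha^{(\mu)} - \alpha^{(\rho)},\ 2 \int_Z (y - f_{\alpha^{(\rho)}}(x)) K_{\mathbf{Y}}(x)^T \, d\mu \Big)_2 + 2 \lambda m \big( \alpha^{(\mu)} - \alpha^{(\rho)}, \alpha^{(\rho)} \big)_2 + \lambda m \|\alpha^{(\rho)} - \alpha^{(\mu)}\|_2^2 \\
&\quad = 2 \Big( \alpha^{(\mu)} - \alpha^{(\rho)},\ \lambda m\, \alpha^{(\rho)} - \int_Z (y - f_{\alpha^{(\rho)}}(x)) K_{\mathbf{Y}}(x)^T \, d\mu \Big)_2 + \lambda m \|\alpha^{(\rho)} - \alpha^{(\mu)}\|_2^2.
\end{aligned} \tag{18}$$
Substituting the expression of $\lambda m\, \alpha^{(\rho)}$ in (12) into (18), one has
$$\begin{aligned}
&\big( E_\mu(f_{\alpha^{(\mu)}}) + \lambda m \|\alpha^{(\mu)}\|_2^2 \big) - \big( E_\mu(f_{\alpha^{(\rho)}}) + \lambda m \|\alpha^{(\rho)}\|_2^2 \big) \\
&\quad \ge 2 \Big( \alpha^{(\mu)} - \alpha^{(\rho)},\ \int_Z (y - f_{\alpha^{(\rho)}}(x)) K_{\mathbf{Y}}(x)^T \, d\rho - \int_Z (y - f_{\alpha^{(\rho)}}(x)) K_{\mathbf{Y}}(x)^T \, d\mu \Big)_2 + \lambda m \|\alpha^{(\rho)} - \alpha^{(\mu)}\|_2^2.
\end{aligned}$$
Since $\alpha^{(\mu)}$ is the minimizer of scheme (5) for $\mu$, we have
$$\big( E_\mu(f_{\alpha^{(\mu)}}) + \lambda m \|\alpha^{(\mu)}\|_2^2 \big) - \big( E_\mu(f_{\alpha^{(\rho)}}) + \lambda m \|\alpha^{(\rho)}\|_2^2 \big) \le 0,$$
and therefore, by Cauchy's inequality,
$$\begin{aligned}
\lambda m \|\alpha^{(\rho)} - \alpha^{(\mu)}\|_2^2
&\le 2 \Big( \alpha^{(\rho)} - \alpha^{(\mu)},\ \int_Z (y - f_{\alpha^{(\rho)}}(x)) K_{\mathbf{Y}}(x)^T \, d\rho - \int_Z (y - f_{\alpha^{(\rho)}}(x)) K_{\mathbf{Y}}(x)^T \, d\mu \Big)_2 \\
&\le 2 \|\alpha^{(\rho)} - \alpha^{(\mu)}\|_2 \times \Big\| \int_Z (y - f_{\alpha^{(\rho)}}(x)) K_{\mathbf{Y}}(x)^T \, d\rho - \int_Z (y - f_{\alpha^{(\rho)}}(x)) K_{\mathbf{Y}}(x)^T \, d\mu \Big\|_2 .
\end{aligned}$$
Thus (14) holds.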
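The robustness statement of Lemma 3 can be checked numerically on small discrete measures. The sketch below takes $\rho$ and $\mu$ to be uniform measures on two finite point sets, computes $\alpha^{(\rho)}$ and $\alpha^{(\mu)}$ from the stationarity condition (12), and compares $\|\alpha^{(\rho)} - \alpha^{(\mu)}\|_2$ with $\frac{2}{\lambda m}$ times the norm of the difference of the two integrals. The kernel, the centers, the data points and all helper names are assumptions made for illustration, with $m = 2$ so that the linear system can be solved by Cramer's rule.

```python
import math

CENTERS = [0.0, 1.0]  # hypothetical set Y = {y_1, y_2}, so m = 2
LAM = 0.5             # regularization parameter lambda


def kernel(x, t):
    # Hypothetical bounded, continuous, non-symmetric kernel.
    return math.exp(-(x - 2.0 * t) ** 2)


def k_vec(x):
    return [kernel(x, t) for t in CENTERS]


def moment(points, alpha):
    # Integral of K_Y(x)^T (y - f_alpha(x)) against the uniform measure on `points`.
    out = [0.0] * len(CENTERS)
    for x, y in points:
        kv = k_vec(x)
        resid = y - sum(a * kj for a, kj in zip(alpha, kv))
        for j in range(len(CENTERS)):
            out[j] += kv[j] * resid / len(points)
    return out


def minimizer(points):
    # Solve (A + lam*m*I) alpha = b, which is (12) rewritten as a linear system (m = 2).
    m, n = len(CENTERS), len(points)
    a = [[LAM * m if p == q else 0.0 for q in range(m)] for p in range(m)]
    b = [0.0] * m
    for x, y in points:
        kv = k_vec(x)
        for p in range(m):
            b[p] += kv[p] * y / n
            for q in range(m):
                a[p][q] += kv[p] * kv[q] / n
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [(b[0] * a[1][1] - a[0][1] * b[1]) / det,
            (a[0][0] * b[1] - a[1][0] * b[0]) / det]


def norm2(v):
    return math.sqrt(sum(t * t for t in v))


rho_points = [(0.0, 0.2), (0.5, 0.9), (1.0, 0.4), (2.0, -0.3)]
mu_points = [(0.1, 0.3), (0.6, 0.7), (1.2, 0.5), (1.9, -0.1)]

alpha_rho = minimizer(rho_points)
alpha_mu = minimizer(mu_points)
diff = [r - u for r, u in zip(moment(rho_points, alpha_rho), moment(mu_points, alpha_rho))]
lhs = norm2([r - u for r, u in zip(alpha_rho, alpha_mu)])
rhs = 2.0 / (LAM * len(CENTERS)) * norm2(diff)
```

Here `lhs` and `rhs` are the two sides of the inequality; a small perturbation of the data points changes the coefficient vector by an amount controlled by the change in the integrals, which is the sense in which the solutions are robust.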
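Proof of Theorem 1.
Take $\xi(z) = (y - f_{\alpha^{(\rho)}}(x)) K_{\mathbf{Y}}(x)^T$; then $\xi(z) \in \mathbb{R}^m$ for any $z = (x, y)$. By the definition of $k$ and (10), we know
$$\int_Z \|\xi(z)\|_2^2 \, d\rho \le k^2 m \int_Z (y - f_{\alpha^{(\rho)}}(x))^2 \, d\rho \le k^2 m |\rho|_2^2. \tag{19}$$
Also,
$$\|\xi(z)\|_2 \le k \sqrt{m}\, |y - f_{\alpha^{(\rho)}}(x)| \le k \sqrt{m} \big( M + k \sqrt{m}\, \|\alpha^{(\rho)}\|_2 \big) \le k \sqrt{m} \Big( M + \frac{k |\rho|_2}{\sqrt{\lambda}} \Big).$$
Let $\rho_z$ be the equi-distribution (empirical) measure on the sample $z$, so that
$$\int_Z f(x,y) \, d\rho_z = \frac{1}{m} \sum_{i=1}^m f(x_i, y_i). \tag{20}$$
Combining the above estimates with Lemma 2, we know that for any $0 < \delta < 1$, with confidence $1 - \delta$, there holds
$$\begin{aligned}
&\Big\| \int_Z (y - f_{\alpha^{(\rho)}}(x)) K_{\mathbf{Y}}(x)^T \, d\rho - \frac{1}{m} \sum_{i=1}^m (y_i - f_{\alpha^{(\rho)}}(x_i)) K_{\mathbf{Y}}(x_i)^T \Big\|_2 \\
&\quad \le \frac{2 k \sqrt{m} \big( M + \frac{k |\rho|_2}{\sqrt{\lambda}} \big) \log\frac{2}{\delta}}{m} + \sqrt{\frac{2 k^2 m |\rho|_2^2 \log\frac{2}{\delta}}{m}} \\
&\quad \le \Big( \frac{2 k M}{\sqrt{m}} + \frac{2 k^2 |\rho|_2}{\sqrt{\lambda m}} + \sqrt{2}\, k |\rho|_2 \Big) \log\frac{2}{\delta} \\
&\quad \le 2 \sqrt{6}\, k |\rho|_2 \log\frac{2}{\delta}.
\end{aligned} \tag{21}$$
Since $|\rho|_2^2 \ge \dfrac{M^2}{m}$ and $\lambda \ge \dfrac{k^2}{m}$, by (21) and Lemma 3 (applied with $\mu = \rho_z$, for which the solution of scheme (5) is $\alpha_z$ by (20)) we have
$$\|\alpha_z - \alpha^{(\rho)}\|_2 \le \frac{2 \sqrt{6}\, k |\rho|_2 \log\frac{2}{\delta}}{\lambda m}. \tag{22}$$
The conclusion of Theorem 1 follows from (9) and (22).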
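Proof of Theorem 2. It is well known that for all $f \in L^2(\rho_X)$, there holds
$$E_\rho(f) - E_\rho(f_\rho) = \int_X |f(x) - f_\rho(x)|^2 \, d\rho_X.$$
It follows that
$$\begin{aligned}
\|f_{\alpha^{(\rho)}} - f_\rho\|_{2,\rho_X}
&= \big( E_\rho(f_{\alpha^{(\rho)}}) - E_\rho(f_\rho) \big)^{1/2} \\
&\le \big( E_\rho(f_{\alpha^{(\rho)}}) - E_\rho(f_\rho) + \lambda m \|\alpha^{(\rho)}\|_2^2 \big)^{1/2} \\
&= \Big[ \inf_{\alpha \in \mathbb{R}^m} \big( E_\rho(f_\alpha) - E_\rho(f_\rho) + \lambda m \|\alpha\|_2^2 \big) \Big]^{1/2} \\
&= K(f_\rho, \lambda)^{1/2},
\end{aligned} \tag{23}$$
where in the second equality we use the fact that $\alpha^{(\rho)}$ is the minimizer of (5). Combining this with (6) and (7) we complete the proof of Theorem 2.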
Note that when the kernel is a Mercer kernel of a suitable kind, under a very mild regularity condition on the regression function, we may derive a dimension-free learning rate. We will study this problem in future work.
ACKNOWLEDGMENT
This work was supported in part by grants from the Natural Science Foundation of China (Grant Nos. 10871226, 10971251) and the Natural Science Foundation of Zhejiang Province (Grant No. Y6100096).
REFERENCES
[1] V. Vapnik, The Nature of Statistical Learning Theory, Springer, New York, 1995.
[2] P. Niyogi, F. Girosi, Generalization bounds for function approximation from scattered noisy data, Adv. Comput. Math. 10 (1999), 51-80.
[3] T. Evgeniou, M. Pontil, T. Poggio, Regularization networks and support vector machines, Adv. Comput. Math. 13 (2000), 1-50.
[4] F. Cucker, S. Smale, On the mathematical foundations of learning theory, Bull. Amer. Math. Soc. 39 (2001), 1-49.
[5] F. Cucker, S. Smale, Best choices for regularization parameters in learning theory: On the bias-variance problem, Found. Comput. Math. 2 (2002), 413-428.
[6] E. De Vito, A. Caponnetto, L. Rosasco, Model selection for regularized least-squares algorithm in learning theory, Found. Comput. Math. 5 (2005), 59-85.
[7] Q. Wu, Y. M. Ying, D. X. Zhou, Learning rates of least square regression, Found. Comput. Math. 6(2) (2006), 171-192.
[8] A. Caponnetto, E. De Vito, Optimal rates for the regularized least-squares algorithm, Found. Comput. Math. 7(3) (2007), 331-368.
[9] S. Smale, D. X. Zhou, Shannon sampling and function reconstruction from point values, Bull. (New Series) Amer. Math. Soc. 41(3) (2004), 279-305.
[10] S. Smale, D. X. Zhou, Shannon sampling II. Connections to learning theory, Appl. Comput. Harmonic Anal. 19 (2005), 285-302.
[11] S. Smale, D. X. Zhou, Learning theory estimates via integral operators and their applications, Constr. Approx. 26 (2007), 153-172.
[12] H. W. Sun, Q. Wu, Application of integral operator for regularized least-square regression, Math. Comput. Model. 49 (2009), 276-285.
[13] H. W. Sun, Q. Wu, Least square regression with indefinite kernels and coefficient regularization, Appl. Comput. Harmonic Anal., doi: 10.1016/j.acha.2010.04.001.
[14] C. De Mol, E. De Vito, L. Rosasco, Elastic-net regularization in learning theory, J. Complexity 25(2) (2009), 201-230.
[15] G. Gnecco, M. Sanguineti, The weight-decay technique in learning from data: an optimization point of view, Comput. Manag. Sci. 6 (2009), 53-79.
[16] J. Mairal, F. Bach, J. Ponce, G. Sapiro, Online learning for matrix factorization and sparse coding, J. Mach. Learn. Res. 11 (2010), 19-60.
[17] B. H. Sheng, J. L. Wang, P. Li, The covering number for some Mercer kernel Hilbert spaces, J. Complexity 24 (2008), 241-258.
[18] V. Koltchinskii, Rejoinder: Local Rademacher complexities and oracle inequalities in risk minimization, The Annals of Statistics 34(6) (2006), 2697-2706.
[19] J.-B. Hiriart-Urruty, C. Lemaréchal, Fundamentals of Convex Analysis, Springer-Verlag, Berlin, 2001.
Baohuai Sheng received the Ph.D. degree in applied mathematics from Xidian University, Xi’an, China, in 2002. He is a Full Professor at Shaoxing College of Arts and Sciences. He has published more than 70 journal and conference papers. His current research interests include approximation theory and machine learning.