Learning Rates of Support Vector Machine
Classifiers with Data Dependent Hypothesis
Spaces
Bao-Huai Sheng
Department of Mathematics, Shaoxing College of Arts and Sciences Shaoxing, Zhejiang 312000, China e-mail: shengbaohuai@ 163.com
Pei-Xin Ye
School of Mathematics and LPMC, Nankai University,Tianjin 300071, China e-mail: [email protected]
Abstract—We study the error performances of p -norm Support Vector Machine classifiers based on reproducing kernel Hilbert spaces. We focus on two category problem and choose the data-dependent polynomial kernels as the Mercer kernel to improve the approximation error. We also provide the standard estimation of the sample error, and derive the explicit learning rate.
Index Terms—Support vector machine classification; Learning rate; Reproducing kernel Hilbert spaces; Cesaro means.
I. INTRODUCTION AND RESULTS
Support vector machine classification [1]-[7], [9]-[25] has a foundation in the framework of statistical learning theory and classical regularization theory for function approximation. It is one of the most important topics in the field of machine learning. It has been applied successfully to various practical problems in science, engineering and many other related fields. The goal of classification is to construct a classifier which can predict the unknown class of an observation with small misclassification error. This problem has been studied widely and many important algorithms have been developed (Ref. [3]-[8]).
Let X = −[ 1,1] , Y = −{ 1,1} . A binary classifier
:
f X →Ydivides the input space X into two classes. Let ρ be an unknown probability distribution on
X×Y and ( , )X Y be the corresponding random variable. The misclass -ification error for a classifier f :X →Y
is defined to be the probability of the event { ( )f X ≠Y}
( ) : Prob{ ( ) } ( ( ) | ) X( ),
X
f = f ≠ =
∫
P ≠ f x x dρ xR X Y Y
where ρX is the marginal distribution on X and
( | )y x
ρ is the conditional probability measure at x
Corresponding author: Ye Peixin. This work is supported by the Natural Science Foundation of China (Grant No. 10871226, 10971251.)
induced by ρ . The distribution ρ is known only through a set of samples : { } 1 {( , )} 1
m m m
i i i i i
z = z = = x y = ∈ Ζ independently drawn according to ρ .It is known from [9] the classifier which minimizes the misclassi -fication error is the Bayes ruler fc: sgn(= fρ) , where
( ) ( | ) ( 1 | ) ( 1 | ),
Y
fρ x =
∫
ydρ y x =P y= x −P y= − x x∈X.Denote the p-norm hinge loss function as
( ( )) : (1 ( )) | ( ) | { ( ) 1},
p p
yf x
V yf x = −yf x + = −y f x χ ≤ (1)
where χ
( )
⋅ is the indicator function. V yf x( ( )) measures the cost paid by replacing the true y with the estimate( )
f x . The corresponding V -risk is
( ) : ( ( )) ( ( )).
Z
f =
∫
V yf x dρ=EV yf xE
Let fV :X R
ρ → be a measurable function minimizing
the expected risk fρV : arg min ( ),= E f where the minimum is taken over all measurable functions. According to [8], we may always choose a fV
ρ such that
( ) [ 1,1]
V
fρ x ∈ − for each x∈X . Since the expected risk involving the unknown distribution ρ is not computable, its discretization is used instead which is computable in terms of the sample z, is defined as
1 1
1 1
( ) : ( ( )) (1 ( )) .
m m
p
z i i i i
i i
f V y f x y f x
m = m = +
=
∑
=∑
−E (2)
Regularized learning schemes are implemented by minimizing a penalized version of the empirical error over a set of functions, called a hypothesis space. Then the regularized classifier generated for a sample m
z∈ Ζ is defined as sgn(fz,λ), where fz,λ is a minimizer of the
following well-known Tikhonov regularization scheme
,
1 1
: arg min
{
( ( )) ( ) .}
m
z f i i
i
f V y f x f
m
λ ∈ λ
=
= H
∑
+ Ω (3)kind of scheme is very popular in many areas of the applications and theory of machine learning.
In this paper, we take the hypothesis space H to be the reproducing kernel spaces reproducing by polynomials on
[ 1,1] [ 1,1]− × − and a given data in [ 1,1]− . We want to estimate excess misclassification error
,
(sgn(fzλ))− (fc).
R R Many investigations of this topics have been done when the kernel K x y( , ) is the Gaussian
kernel
2
2
( , ) exp{ }
2 x t
Kσ x t σ
−
= − (Ref. e.g.[5],[8]). Now
let us turn to other kernels, when the kernel is a polynomial kernel, the excess misclassification error estimate may become simpler since the polynomial class has many advantages in estimating the covering number, which is needed in presenting the error estimate. In fact, the excess misclassification error estimate for the polynomial kernel is a field investigated by many mathematicians. Among others, [5] gave a quantitative estimate for the convergence rate in the univariate case
[0,1]
X = with the Bernstein- Durrmeyer polynomials operators, [4] gave the corresponding estimate in the
case n
X ⊂R being a simplex. It is known that the best error of the Bernstein operator is O( 1 )
m . Thus we introduce the generalized Vallee Poussin means of orthonormal algebraic polynomials which can approximate the target function with O( 1 )
mα for some 0
α > , and the orthogonal algebraic polynomials may exist for any positive measure ρX (Ref.e.g.[12],[13]).
These properties is much better than the the Bernstein operator and it enable us to estimate the excess misclassi -fication error by generalized Vallee Poussin means. So we may yield better approximation error and therefore the learning error can be improved.
Note that now our hypothesis space depends on the input data. This makes the analysis of the error quite difficult and different from the previous results which for the algorithms with a data independent hypothesis space (see [1]-[9]). There are already some literatures in this area but the research is not very rich yet. In [14] the uniform convergence inequality is studied for the data dependent functions. In the coefficient regularization was analyzed under the restriction that the kernel is positive semi-definite or has certain smoothness condition (such as Lipschitz condition).
Throughout the paper, we shall write A=O B( ) if there exists a constant C>0 such that A≤CB . We write A~B if A=O B( ) and B=O A( ). Let
[ 1,1]
X = − and assume the marginal distribution
( ) ( )
X
dρ x =w x dx
with w x( )= −(1 x) (1α1 +x) ,α2
1 1, 2 1,
α > − α > −
1 2 1,
α α+ > − being Jacobi weights on [ 1,1]− with finite
moments; i.e., 1
1| | ( ) ,
k
x w x dx
− < ∞
∫
k∈ Ν =0: 0,1, 2,". Itis known from [14] that for all k∈`0 there exists a unique polynomial
( , ) ( ) k , ( ) 0
k k k
p w x =γ w x +" γ w > such that
1
, 1p x pk( ) k′( ) ( )x w x dx δk k′.
− =
∫
Then we denoteak( )f =
∫
−11f x p x w x dx k( ) k( ) ( ) , ∈ Ν0. (4)Moreover, there uniquely exists Lagrange polynomial interpolating operator L xn( ) of order n−1, such that
L xn( k n, )=yk, k=1, 2,", ,n (5)
for any real numbers yk, k=1, 2,", .n.
Let p≥1, L dwp( ) be the class of all measurable real functions f for which
1 1
( )
(
1| ( ) | ( ))
pp p L dw
f f x w x dx
−
=
∫
‖ ‖
.
< +∞ For 1 2
1 1
max( , )
2 2
δ > α + α + the Cesaro means
( )
Cδ of f is defined by
0
1
( , ) ( ) ( ), [ 1,1],
N
N N k k k k
m
f x A a f p x x
A
δ δ
δ
σ −
=
=
∑
∈ −( 1)
, 0,1, ( 1) ( 1)
k
k
A N
k
δ δ
δ Γ + +
= =
Γ + Γ + "
For a given β >0 we define,
1
( ) ~ k( )
k
D β f k aβ f ∞
=
∑
(6) and we say ( )D β f∈L dwp( ) if there exists ϕ ∈L dwp( )such that ak( ) k ak( )f
β
ϕ = . It is known form [17] that there is some δ >1 / 2 such that
( ) ( ) 2 ( ), 1 ,
p p
N f L dw C f L dw p
δ
σ ≤ ≤ ≤ ∞
‖ ‖ ‖ ‖ (7)
with C2 independent of N.
For f∈L dwp( ) we define the generalized vallee Poussin means of the ortho-normal {p xk( )} by
2
0
( )( ) ( ) ( ) ( ), [ 1,1],
N
N k k
k
k
f x a f p x x
N
η η
=
=
∑
∀ ∈ − (8)where η( )u ∈C∞(R+) is a nonnegative non- increase function with ( )η u =1 for u∈[0,1] and η( )u =0 for
2
u> . Then ηN( , )f x = ( ),f x f∈L dwp( ). Combing
(9) with [18, Proposition 2.1], we have
‖ηN( )f‖Lp(dw)≤C2‖ ‖f Lp(dw). (9)
For a given N>1 we define the polynomial kernels by
2
0
( , ) ( ) ( ), , [ 1,1].
N
N k k
k
K x y p x p y x y
=
=
∑
∈ − (10)For a given discrete set tN ⊂ −[ 1,1], the corresponding RKHS N
N
t K
,
, ( , ), , .
N i j
K i j N i j i x j y
i j i j
f g c d K x y f c K g d K
〈 〉 =
∑
=∑
=∑
(11) We assume the distribution ρ( , )x y = ρ( | ) ( )y x w x with
( ) w x
being Jacobi weight. Corresponds to the scheme (3) we have the following scheme,
2
, : arg min tN
{
( ) N,N}
,K N
z f z K t
f λ = ∈H E f +λ‖ ‖f (12)
We define the projection operator π on the space of measurable function f :X →R as
1, if ( ) 1,
( )( ) 1, if ( ) 1,
( ), if 1 ( ) 1.
f x
f x f x
f x f x
π
> ⎧
⎪
= −⎨ < −
⎪ − ≤ ≤
⎩
Since V y( π( )( ))f x ≤V yf x( ( )) , we know that
( ( ))π f ≤ ( ),f z( ( ))π f ≤ z( ).f
E E E E By virtue of π( )f ,
we take π(fz,λ) instead of fz,λ to analyze the related
learning rates.
Theorem 1.1. Let 1 2 p (2 p)
ζ
β α
=
+ − , exp{ 2m }
ζ
λ = − ,
min{ / 2, 2 / }
p p p
α = for p>1 and α∈[0,1] for p=1,
0
β > , N~mζ be a given number. Then, for all
2 1 1/
(4 / 4 log( p ))
m≥ ζ + pM − ζ and 0< <δ 1 , with confidence at least 1 2− δ , we have
,
( ( )) ( ) , .
2 (2 )
V z
p
p
f f c m
p
θ
λ ρ β
π θ
β α
−
− ≤ ⋅ =
+ −
E E
Theorem 1.2. Under the assumption of Theorem 2.1, we have for all 0< <δ 1 / 2, with confidence at least 1 2− δ ,
, * / 2
, for p=1,
(sgn( )) ( )
, for p 1,
z c
cm
f f
c m
θ
λ θ
−
−
⎧⎪
− ≤ ⎨
> ⎪⎩
R R
where
2 (2 p)
p
p β θ
β α
=
+ − ,
* 2
c = c, and c is the same
as in Theorem 2.1.
II. THEAPPROXIMATIONERROR
We estimate the approximation error for ηN(fρV), i.e.
( ( V)) ( V)
N fρ fρ
η − +
E E 2
,
( )
N N
V N fρ K t
λ η‖ ‖ .
Let Pn be the set of all algebraic polynomials of
order not exceeding n . For pn∈Pn , denoted by
, 1
{ }n
n k n k
t = x = the zeroes of p xn( ) arranged as
, 1, 2, 1,
1 xn n xn− n x n xn 1.
− < < <"< < <
Then, there holds the Gauss quadrature formula (Ref. e.g.[15])
11 , , 2 1
1
( ) ( ) ( ), .
n
k k n k k n k n k
p x w x dx λ p x p −
− =
∑
= ∈∫
P (13)For any pn∈Pn, there holds the Nikolskii inequality
(see, e.g.[13],[16])
2
1 1
( )
2
( ) 1 p( ),
p
n L dw n L dw
p ≤C n − + p
‖ ‖ ‖ ‖ (14)
where a+=max( , 0)a , 1≤ ≤ ∞p .
Proposition 2.1. Let ηN( )f , KN( , )x y be defined as (10) and (12) respectively. Then, for any f ∈L dwp( ) p≥1,
0 β > ,
1 1
2( )
2 3 2 2
, 1
( ( )) ( ) ( ) ,
N N
V V V p
N N K t p
M
f f f C N
N
ρ ρ ρ β
η λ η λ − +
− + ‖ ‖ ≤ +
E E
(15)
To prove Proposition 2.1, we firstly bound
,
( )
N N
V N fρ K t
η
‖ ‖ .
Lemma 2.2. For any f∈L dwp( ) there is C1>0 such
that
1 1
2( )
2 2 2 2
, 1 ( )
( ) .
N N p
p
N f K t C N f L dw
η − +
≤
‖ ‖ ‖ ‖
(16)
Lemma 2.3 ([19], Proposition 2.1). Let D( )β f be given by (8). There is a constant M0>0 such that
( ) 0 ( )
( )
( ) p ,
p
L dw N L dw
M D f
f f
Nβ β
η − ≤ ‖ ‖
‖ ‖
(17)
Lemma 2.4 ([3], Theorem 25 ). If f :X →R is measurable, then
( )f − (fρV)≤
E E
( )
1 1 1
( ) ( )
|| || , if 1 p 2,
2 || || (2 || || ), if p 2. p
p p
V p L dw
p V p V p
L dw L dw
f f
p f f f f
ρ
ρ ρ
− − −
⎧ − ≤ ≤
⎪
⎨ − + − >
⎪⎩
(18)
Lemma 2.5. There are constants M1>0 , M2 >0 ,
3 0
M > such that p≥1, β >0,
3
( ( V)) ( V)
N p
M
f f
N
ρ ρ β
η − ≤
E E (19)
where 3 1 2 1 1 2
1 2
2p ( p 1) p, N
M p M M M
M
β
− −
= + ≥ .
Now Proposition 2.1. can be derived from Lemma 2.2 and Lemma 2.5.
Ⅲ.SAMPLEERROR
Proposition 3.1. Let 0 1 2 ζ
< < , λ=exp{ 2− mζ} ,
min{ / 2, 2 / }
p p p
α = for p>1and α∈[0,1] for p=1,
0
β > , N≤mζ . Then, for all
2 1 1/
(4 / 4 log( p ))
m≥ ζ + pM − ζ and 0< <δ 1 , with confidence at least 1 2− δ , there holds
, ,
[ ( (E π fzλ))−E(fρV)] [− Ez( (π fzλ))−Ez(fρV)]
[ ( ) ( V)] [ ( ) ( V)]
z f z fρ f fρ
+E −E − E −E
, 1
{ ( ( )) ( )} ( ),
2
V z
f λ fρ c m
π
≤ E −E + (20)
2 3 1 2 log ( ) 3 2 p p M c m m N β δ + = + 1 4 2 1 2 1 2 log 2 20 3
(
)
p(
p p B C m m α ζ δ − + − + + + 3
1/ ( 2 ) 1 2
1 1
2 log 8 4 log
.
3
(
)
)
p p p p B CB
m m m
α ζ δ δ + − − + +
For simplicity, we divide the sample error in (13) for
( V)
N
f =η fρ into two terms. The first term is
[ ( ( V)) ( V)]
z N z
A= E η fρ −E fρ − [ ( ( V)) ( V)]
N fρ fρ
η −
E E , the
second one is B= [ ( (E π fz,λ))−E(fρV)]− ,
[ ( ( )) ( V)]
z π fzλ − z fρ
E E . For a compact F⊂X , we
denote by N F Xi( , ) (see[23]) the set of functions in X
that have more than a unique best approximation in F
with respect to the norm, and *
( )f
λ =
i
sup
{
λ′≥1:λ′fc+ −(1 λ′)f∈/N F X( , ) .}
Lemma 3.2 ([23]). Let 1< < ∞p and set F⊂L dwp( ) be a compact set of functions that are bounded by 1 put Y to be a random variable bounded by 1. If
i( , ( ))
p
Y∈/N F L dw , set
i
*( ) sup
{
1: (1 ) ( , ) .}
c
Y f Y N F X
λ = λ′≥ λ′ + −λ′ ∈/ Then,
for every f ∈F,
{ ( ( )) ( V( ))}2 { ( ) ( V)} ,p
p
E V yf x −V yfρ x ≤B E f −E fρ α
(21) where
*
1 ( )
( ) inf 1
p
Y
B c p
λ λ λ λ ′ < < ′ =
′ − , c p( ).
Lemma 3.3. For any 0< <δ 1, with confidence at least
1−δ , there holds
2 1
2
1 1
2 log 2 log
3
(
)
p p p B A m m α δ δ + − ≤ +
1 ( ( )) ( ) .
2
(
)
V V N fρ fρ
η
+ E −E (22)
Lemma 3.4 ([1],[6]). Let ξ be a random variable on probability space Z with mean Eξ µ= and variance
2 2
( )
σ ξ =σ . If |ξ µ− ≤| B almost everywhere, then for all τ>0,
2
2 1
1
Prob ( ) exp .
1 2( ) 3
{
}
{
}
m m i z i m z m B τ ξ µ τ σ τ ∈ = − ≥ ≤ − +∑
] (23)
To prove the result of sample error we need to extend (22) to a function set by means of the covering number. So let us recall some definitions. Let F be a subset of a metric space and r′ >0. The covering number N F( , )r′
is defined to be the minimal integer l∈ Ν such that there exist l balls with radius r′ covering F . Denote the
covering number of the unit ball B1 as
1 ( ) :r′ = ( , ),r′
N N B ∀ >r′ 0. Take
*
,
{ N : }.
N N N
t
R = f ∈ K ‖ ‖f K t ≤R
B H
We recall some well-known results which will be used in our estimate for sample error.
Lemma 3.5 ( [1]). Let E be a finite dimension Banach
space with norm ⋅E , r=dimE ,
i
{ :|| || }
R = f ∈E f E≤ R
B , then
i 4 log ( R, ) log( ).
R r r r ′ ≤ ′ N B
The reproducing property implies ‖ ‖f ∞≤
,
sup ( , ) 2 1
N N
N K t x X
K x x f N R
∈
≤ +
‖ ‖ , so Lemma 3.5
implies
* 4 2 1
log ( R, )r (2N 1) log N R r
+
′ ≤ +
′
N B
CNlog N R. r
≤
′ (24)
The next known result we need is a quantitative version of the law of the large numbers
Lemma 3.6 ([6]). Let 0≤ ≤α 1, B>0, c≥0, and G be a set of functions on Z , such that for every g∈G,
0
Eg≥ , |Eg− ≤g| B and 2
( )
Eg ≤c Egα . Then for any 0
τ > ,
1
1 2
1
( )
Prob sup 4
( )
{
}
m m i i z gEg g z
m Eg α α τα τ − = ∈ ∈ − > +
∑
] G 2 1( , ) exp .
1
2( )
3
{
m}
c B α α τ τ τ − − − ≤ +
N G (25)
By direct calculation it is not difficult to derive the following lemma.
Lemma 3.7. Let 0 1 2 ζ
< < . Then, for all
2 1 1/
(4 / 4 log( p ))
m≥ ζ + pM − ζ and 0< <δ 1 , with
confidence at least 1−δ , we have
, ,
[ ( ( )) ( V)] [ ( ( )) ( V)]
z z z z
f λ fρ f λ fρ
π − − π −
E E E E
, 1
{ ( ( )) ( )} 20
2
V z
f λ fρ
π τ
≤ E −E + (26)
for all *
R
f∈B , where τ is equal to
4 3
1 2
2 2 log(1 / )
3 3
p C p
m
m ζ
δ
+ +
− + +
(
8CB1 2p 4Bplog(1 / ))
1/ (2 p). mm
α ζ
δ −
− + (27)
Proof of Proposition 3.1. The result of Proposition 3.1 can be derived from Lemma 3.3 and Lemma 3.6.
Ⅳ.ESTIMATEOFLEARNINGRATES
combining the previous results on approximation error and sample error.
Proof of Theorem 1.1. Putting (17) and (22) into (16) with f =ηN(fρV), we have
,
( ( )) ( V)
z
f λ fρ
π − ≤
E E
1 1
2( )
2 2 3
1 ,
1
{ ( ( )) ( )} ( ), 2
V p
z p
M
C N f f c m
N β λ ρ
λ − + π
+ + E −E +
where c m( ) given by (22), we have with confidence at least 1 2− δ , that
1 1
2( )
2 2
, 1
( ( )) ( V) 2 p
z
f λ fρ C N
π λ − +
− ≤
E E
3 1
2 3
1 1
2 log 2 log
3
2
3
(
)
p
p
p p
B M
m m
N
α
β δ δ
+
−
+ + +
3 1
4
2
1 2 1 2
1 1
2 log 8 4 log
2
40 .
3 3
(
(
)
p)
p
p p
p B CB C
m m
m m
α
ζ δ ζ δ
+ +
−
− −
+ + + +
(28)
Since N ~mζ and λ =exp{ 2− mζ} , we have
1 1
2( )
2 1 ,
exp{ }
p
N
mζ
λ − +
≤ then (28) can be rewritten as
,
( ( )) ( V)
z
f λ fρ
π − ≤
E E
3
2
1 3
1
2 log
1 1
2 ( ) 3 ( )
3
exp{ }
p p
C M
m
mζ m βζ
δ
+
′
+ + +
1 1 1
2 2 2
2
1 1 2 1
2(2 log ) 40 4 log ( ) ( )
3
{
p(
(
p))
p}
p p
p p
B B
m
α α α
α
δ δ
+
− − −
−
+ +
4 3 1
40 2
(
p C 2p log δ+ +
+ + +
1 1 2
2 2
1
8 4 log .
(
)
p)(
p)
p p
CB B m
ζ
α α
δ
− − − −
+
Take 1
2 p (2 p)
ζ
β α
=
+ − , we have
,
( (π fzλ))− (fρV)≤cm −θ.
E E
Proof of Theorem 1.2. Theorem 1.2 can be derived immediately by combining the result of Theorem 1.1 and the following relations between the expected risk and excess misclassification error:
(sgn( ))f − (fc)≤
R R
( ( )) ( ), for p=1, 2( ( ( )) ( )), for p 1,
c V
f f
f fρ
π π
− ⎧⎪
⎨
− >
⎪⎩
E E
E E
where f :X →R is measurable, Ref.[3]. Acknowledgment
This work was supported in part by a grant from Natural Science Foundation of China (Grant No.10871226, 10971251)and Zhejiang Natural SCience Foundation of Zhejiang Province(Grant No. Y6100096).
Ye Peixin is supported by the Program for New Century Excellent Talents in University of China..
REFERENCES
[1] F. Cucker, S. Smale, On the mathematical foundations of learning theory, Bull. Amer. Math. Soc. 2001, 39 (1): 1-49. [2] D.R. Chen, Q. Wu, Y. Ying, D.X. Zhou, Support vector
machine soft margin classifiers: Error analysis, J. Mach. Learn. Res. 2004, 5: 1143-1175.
[3] H. Z. Tong, D.R. Chen, L.Z. Peng, Learning rates for regularized classifiers using multivariate polynomial kernels, J. Complexity. 2008, 24: 619-631.
[4] D.X. Zhou, K. Jetter, Approximation with polynomial kernels and SVM classifiers, Adv. Comput. Math. 2006, 25: 323-344.
[5] Q. Wu, D.X. Zhou, SVM soft margin classifiers: Linear programming versus quadratic programming, Neural Comput. 2005, 17: 1160-1187.
[6] Q. Wu, D.X. Zhou, Analysis of support vector machine classification, J. Comp. Anal. Appl. 2006, 8: 99-119. [7] Y. Lin, Support vector machined and the Bayes ruler in
classification, Data Min. and Knowl. Discov. 2002, 23: 259- 275.
[8] P. G. Nevai, Orthogonal polynimials, Amer. Math. Soc. March, 1979, 18 (213).
[9] A. Cannon, J.M. Ettinger, D. Hush, C. Scovel, Machine learning with data dependent hypothesis classes, J. Mach. Learn. Res. 2002, 2: 335-358.
[10]N. Aronszajn, Theory of reproducing kernels, Trans. Amer. Math. Soc. 1950, 68 (2) : 337-404.
[11]F. Dai, Z. Ditzian, Cesaro summability and Marchaud inequality, Constr. Approx. 2007, 25: 73-88.
[12]Q. Wu, Y. Ying, D.X. Zhou , Learning rates of least-square regularized regression, Found. Comput. Math. 2006, 6: 171-192.
[13]S. Mendelson, Obtaining fast error rate in nonconvex situations, J. Complexity. 2008, 24: 380-397.
[14]D.X. Zhou, Capacity of reproducing kernel spaces in learning theory, IEEE Trans. Inf. Theory, 2003, 49(7) : 1743- 1752.
[15]F. Cucker, S. Smale, Best choices for regularization parameters in learning theory: On the bias-variance problem, Found. Compt. Math. 2002, 2: 413-428.
[16]W. Chen, Z. Ditzian, Best approximation and $K$-functional, Acta Math. Hungar, 1997, 75(3) : 165-208. [17] C. A. Micchelli, M. Pontil, Learning the kernel function via regularization, J. Mach. Learn. Res., 6(2005),1099-1125. [18]F. Dai, Z. Ditzian, Ces\`{a}ro summability and Marchaud
inequality, Constr. Approx. 2007, 25: 73-88.
[19]S. Smale, D.X. Zhou, Shannon sampling, Connections to learning theory, Appl. Comput. Harmon. Anal. 2005, 19: 285-302.
[20]S. Smale, D.X. Zhou, Learning theory estimates via integral operators and their applications, Constr. Approx. 2007, 26: 153-172.
[21]Q. Wu, Y. Ying, D.X. Zhou , Learning rates of least-square regularized regression, Found. Comput. Math.}2006, 6: 171-192.
[22]S. Mendelson, Obtaining fast error rate in non-convex situations, J. Complexity. 2008, 24: 380-397.
[23]D.X. Zhou, The coveing number in learning theory, J. Complexity, 2002, 18(3) : 739-767.
[24]D.X. Zhou, Capacity of reproducing kernel spaces in learning theory, IEEE Trans. Inf. Theory}, 2003, 49(7) : 1743-1752.
Baohuai Sheng attended Baoji Normal College, Baoji, Shaanxi, from 1981 to 1985. He earned his BS degree in mathematical teaching from the department of mathematics in 1985. He earned his M.S. degree in basic mathematics from the department of mathematics of Hangzhou University in 1988. From 1988-2001 he worked towards the Doctor of Sciences in the applied mathematics at Xidian University, Xian, P.R.China, and earned his S. D. degree in March,2 001.From March, 2001, to September,2003, he served as a Professor at the Ningbo University. Since September,2003,he has been a faculty member of Shaoxing College of Arts and Sciences, Shaoxing, Zhejiang, P.R.China, where he serves currently as a Professor and the Chairman of Mathematical Department. His current research interests focus on the area of approximation theory, nonlinear optimization, learning theory.