Learning errors of linear programming support vector regression

Feilong Cao, Yubo Yuan

Institute of Metrology and Computational Science, China Jiliang University, Hangzhou 310018, Zhejiang Province, China
Article info

Article history: Received 1 June 2009; Received in revised form 18 September 2010; Accepted 4 October 2010; Available online 13 October 2010

Keywords: Regression; Support vector machine; Linear programming; Generalization error; Empirical error

Abstract
In this paper, we give several results on learning errors for linear programming support vector regression. The corresponding theorems are proved in the reproducing kernel Hilbert space. With the covering number, the approximation property and the capacity of the reproducing kernel Hilbert space are measured. The obtained result (Theorem 2.1) shows that the learning error can be controlled by the sample error and the regularization error. The mentioned sample error is summarized by the errors of the learning regression function and the regularizing function in the reproducing kernel Hilbert space. After estimating the generalization error of the learning regression function (Theorem 2.2), the upper bound (Theorem 2.3) of the regularized learning algorithm associated with linear programming support vector regression is estimated.
Crown Copyright © 2010 Published by Elsevier Inc. All rights reserved.
1. Introduction
The main aim of this paper is the error analysis of the linear programming support vector regression (SVR) problem in learning theory. To this end, the paper is organized as follows. We start with a brief introduction to the basic techniques of the support vector machine (SVM) and SVR in Section 1, including the history, motivation, the most important publications, and the results most relevant to this paper. A brief review of linear programming support vector regression (LP-SVR) and its quadratic programming counterpart (QP-SVR) in the reproducing kernel Hilbert space is presented in Section 2. The learning error analysis of LP-SVR is given in Section 3. In Section 4, we give a short conclusion and summary of our research work, together with some directions for future work.
First of all, we briefly describe the historical background of the support vector learning algorithm (SVLA). The SVLA is a nonlinear generalization of the generalized portrait algorithm developed in the sixties (see [1,2]).
As such, the SVLA is firmly grounded in the framework of statistical learning theory. In 1995, Cortes and Vapnik (see [3]) proposed a learning network named the support-vector network. Originally, it was a learning machine for two-group classification problems. One of its most important ideas is that input vectors are non-linearly mapped to a very high-dimensional feature space, in which a linear decision surface is constructed. They showed that the training data (separable and non-separable) could be separated without errors. After that, this learning network became more and more popular and was renamed SVM.
Today, SVM has become an important subject in learning theory and has evolved into an active area of research. Mathematically, SVM arises from a pattern classification problem based on a given classification of $m$ points $(x_1, x_2, \ldots, x_m)$ in
doi:10.1016/j.apm.2010.10.012
The research was supported by the National Natural Science Foundation of China (Nos. 60873206, 61001200) and the Natural Science Foundation of Zhejiang Province of China (No. Y7080235).
Corresponding author. Tel.: +86 571 86835737. E-mail address: feilongcao@gmail.com (F. Cao).
Applied Mathematical Modelling
the $n$-dimensional space $\mathbb{R}^n$, represented by an $m \times n$ matrix $A = (x_1, x_2, \ldots, x_m)^T$, with the membership of each data point $x_i$, $i = 1, 2, \ldots, m$, in the class $1$ or $-1$ specified by a given $m \times m$ diagonal matrix $D$ with $1$ or $-1$ entries on its diagonal.
Primarily, SVM (see [4]) is given by the following quadratic program with linear inequality constraints:
$$\min_{(\alpha, b) \in \mathbb{R}^{n+1}} \frac{1}{2}\|\alpha\|^2 + \frac{1}{2}\|y\|^2, \quad \text{s.t. } D(A\alpha + be) \ge e + y, \; y \ge 0. \tag{1}$$
Model (1) can be seen as the original model of QP-SVM. Here, $\alpha$ is the vector of separator coefficients (the direction vector of the classification hyperplane), $b$ is an offset (controlling the distance of the hyperplane from the origin), and $e \in \mathbb{R}^m$ stands for the vector of ones.
The decision function for classification is given by
$$f(x) = \operatorname{sign}(\alpha^T x + b). \tag{2}$$
By now, many different forms of QP-SVM (1) have been introduced for different purposes (see [4]). In this work, we mainly pay attention to the learning error, or the convergence rate, of the proposed algorithm. For the convergence rate of QP-SVM (1) there are many works; we refer the readers to Steinwart [5], Zhang [6], Wu and Zhou [7], Zhao and Yin [8], Hong [9], and Wu et al. [10].
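To make model (1) and decision rule (2) concrete, the sketch below solves them numerically on a tiny synthetic two-class set with scipy's general-purpose SLSQP solver. The data, random seed, and the choice of solver are illustrative assumptions, not part of the paper.

```python
# Toy sketch of QP-SVM model (1) with decision rule (2).
# Data and solver choice are illustrative assumptions.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
A = np.vstack([rng.normal(2.0, 0.3, (4, 2)),     # class +1 cloud
               rng.normal(-2.0, 0.3, (4, 2))])   # class -1 cloud
d = np.array([1, 1, 1, 1, -1, -1, -1, -1])       # diagonal of D
m, n = A.shape
e = np.ones(m)

def objective(v):
    # (1/2)||alpha||^2 + (1/2)||y||^2 ; v = [alpha (n), b (1), y (m)]
    alpha, y = v[:n], v[n + 1:]
    return 0.5 * alpha @ alpha + 0.5 * y @ y

def margin(v):
    # Constraint of (1): D(A alpha + b e) - e - y >= 0
    alpha, b, y = v[:n], v[n], v[n + 1:]
    return d * (A @ alpha + b * e) - e - y

cons = [{"type": "ineq", "fun": margin},
        {"type": "ineq", "fun": lambda v: v[n + 1:]}]   # y >= 0
res = minimize(objective, np.zeros(n + 1 + m), method="SLSQP", constraints=cons)
alpha, b = res.x[:n], res.x[n]
pred = np.sign(A @ alpha + b)    # decision rule (2) on the training points
```

On such a well-separated toy set, the recovered hyperplane classifies all training points correctly.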
Among the forms of SVM, the LP-SVM is important because of its linearity and flexibility for large data settings. Many authors have introduced the LP-SVM; we refer the readers to Bradley and Mangasarian [11], Kecman and Hadaic [12], Niyogi and Girosi [13], Pedroso and Murata [14], and Vapnik [4]. Its primal optimization model is as follows:
$$\min_{(\lambda, y) \in \mathbb{R}^{2m}} \frac{1}{m}e^T\lambda + \frac{1}{C}e^T y, \quad \text{s.t. } D(AA^T\lambda + be) \ge e - y, \; \lambda \ge 0, \; y \ge 0. \tag{3}$$
The trade-off factor $C = C(m) > 0$ depends on $m$ and is crucial.
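Because (3) is a linear program, it can be passed directly to an off-the-shelf LP solver once it is written in the standard form $\min c^T x$ s.t. $A_{ub}x \le b_{ub}$. The sketch below does this with scipy's linprog on toy data; the data set and the value of $C$ are assumptions made for illustration.

```python
# Toy sketch of the LP-SVM primal (3) in standard linprog form.
# Variables x = [lambda (m), y (m), b (1)]. Data and C are assumptions.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(1)
A = np.vstack([rng.normal(2.0, 0.3, (5, 2)),
               rng.normal(-2.0, 0.3, (5, 2))])
d = np.r_[np.ones(5), -np.ones(5)]        # diagonal of D
m = len(d)
e, G = np.ones(m), A @ A.T                # G = A A^T (linear kernel matrix)
C = 10.0

c = np.concatenate([e / m, e / C, [0.0]])              # (1/m)e^T lam + (1/C)e^T y
# D(G lam + b e) >= e - y   <=>   -(D G) lam - y - d b <= -e
A_ub = np.hstack([-(d[:, None] * G), -np.eye(m), -d[:, None]])
res = linprog(c, A_ub=A_ub, b_ub=-e,
              bounds=[(0, None)] * (2 * m) + [(None, None)])
lam, b = res.x[:m], res.x[-1]
pred = np.sign(G @ lam + b)               # f(x_i) = x_i^T A^T lam + b on training data
```

On this separable toy problem the LP drives the slacks $y$ to zero and all training points are classified correctly.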
Many experiments demonstrate that LP-SVM is efficient and can even perform better than QP-SVM for some purposes: it is capable of solving huge sample size problems (see [11]), improving the computational speed (see [14]), and reducing the number of support vectors (see [12]).
However, little is known about the learning error or the convergence of the LP-SVM. We only found that a classification problem of LP-SVM was studied by Wu and Zhou [15].
The primary goal of this paper is to investigate the regression problem and to provide an error analysis for linear programming support vector regression (LP-SVR).
2. LP-SVR and QP-SVR in the reproducing kernel Hilbert space
For the regression problem, SVR can be seen as an SVLA with continuous output, or as a support vector machine for function estimation. In 2004, Smola and Schölkopf [16] gave an overview of the basic ideas underlying SVM for function estimation. They indicated that for SVR it is more difficult to analyze the learning error and convergence. This is the motivation of this work.
Let $(X, d)$ be a compact metric space and let $X \subset \mathbb{R}^n$, $Y = \mathbb{R}$. In this work, we only discuss the single-output regression problem. For the multi-output regression problem, difficulties arise from the multivariate output space, and it seems a very challenging task to study learning errors for multi-output LP-SVR in reproducing kernel Hilbert spaces. In 2009, Liu et al. [17] studied the output space as a Riemannian submanifold in order to incorporate its geometric structure into the regression process. They proposed a locally linear transformation (LLT) to define loss functions on the output manifold, and an algorithm was given under the framework of SVR.
Let $\rho$ be a probability distribution on $S = X \times Y$. The error (or generalization error) of a function $f : X \to Y$ is defined as
$$\mathcal{E}(f) = \int_S (y - f(x))^2 \, d\rho.$$
The function that minimizes this error is called the regression function. It is given by
$$f_\rho(x) = \int_Y y \, d\rho(y \mid x), \quad \forall x \in X,$$
where $\rho(\cdot \mid x)$ is the conditional probability measure at $x$ induced by $\rho$. The target of the regression problem is to learn the regression function, or to find a good approximation of it, from random samples. The least-square algorithm for the regression problem is a discrete least-square problem associated with a Mercer kernel. Let $K : X \times X \to \mathbb{R}$ be continuous, symmetric and positive semidefinite, i.e., for any finite set of distinct points $\{x_1, x_2, \ldots, x_m\} \subset X$, the matrix $(K(x_i, x_j))_{i,j=1}^m$ is positive semidefinite (in the rest of this paper we denote this matrix by $K$). Such a function is called a Mercer kernel.
The reproducing kernel Hilbert space (RKHS) $\mathcal{H}_K$ associated with $K$ is defined (see [18]) to be the closure of the linear span of the set of functions $\{K_x = K(x, \cdot) : x \in X\}$ with the inner product $\langle \cdot, \cdot \rangle_{\mathcal{H}_K} = \langle \cdot, \cdot \rangle_K$, i.e.,
$$\mathcal{H}_K = \overline{\operatorname{span}\{K_x : x \in X\}}.$$
The inner product satisfies $\langle K_x, K_y \rangle_K = K(x, y)$, that is,
$$\Big\langle \sum_i \alpha_i K_{x_i}, \sum_j \beta_j K_{y_j} \Big\rangle_K = \sum_{i,j} \alpha_i \beta_j K(x_i, y_j).$$
The reproducing property takes the form $\langle K_x, f \rangle_K = f(x)$, $\forall x \in X$, $f \in \mathcal{H}_K$. Denote by $C(X)$ the space of continuous functions on $X$ with the norm $\|\cdot\|_\infty$. Let $\kappa = \sup_{x \in X} \sqrt{K(x, x)}$. Then the above reproducing property tells us that
$$\|f\|_\infty \le \kappa\|f\|_K, \quad \forall f \in \mathcal{H}_K. \tag{4}$$
Throughout this paper, we assume that for some $M \ge 0$, $\rho(\cdot \mid x)$ is almost everywhere supported on $[-M, M]$, that is, $|y| \le M$ almost surely (with respect to $\rho$). It follows from the definition of the regression function $f_\rho$ that $|f_\rho(x)| \le M$. Let $z = \{(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)\} \in (X \times Y)^m$ be the sample.
In $\varepsilon$-SVR (see [4,16,19]), the primal goal is to find a function that has at most $\varepsilon$ deviation from the actually obtained targets $y_i$ for all the training data, and that at the same time is as flat as possible. In other words, we do not care about errors as long as they are less than $\varepsilon$, but we will not accept any deviation larger than this. This may be important if, for instance, you want to be sure not to lose more than $\varepsilon$ money when dealing with exchange rates (see [16], p. 200). The QP-SVR (see [4,16,19]) in the reproducing kernel Hilbert space can be formulated as
$$\min_{(\alpha, \alpha^*, b) \in \mathbb{R}^{2m+1}} \frac{1}{2}\big(\|\alpha\|^2 + \|\alpha^*\|^2\big), \quad \text{s.t. } -\varepsilon e \le K(\alpha - \alpha^*) + be - y \le \varepsilon e. \tag{5}$$
With the slack variables $\xi, \xi^* \in \mathbb{R}^m$, we arrive at the formulation presented in [4,16,19]:
$$\min_{(\alpha, \alpha^*, b) \in \mathbb{R}^{2m+1}} \frac{1}{2}\big(\|\alpha\|^2 + \|\alpha^*\|^2\big) + \frac{1}{C}\big(\|\xi\|^2 + \|\xi^*\|^2\big), \quad \text{s.t. } \begin{cases} K(\alpha - \alpha^*) + be - y \le \varepsilon e + \xi, \\ y - K(\alpha - \alpha^*) - be \le \varepsilon e + \xi^*, \\ \xi, \xi^* \ge 0. \end{cases} \tag{6}$$
Motivated by reducing the number of support vectors of the 1-norm soft margin SVM, Smola and Schölkopf (see [4,19]) and Vapnik (see [16]) introduced the LP-SVM algorithm associated with a Mercer kernel $K$. In fact, the LP-SVR is based on the following linear programming optimization problem:
$$\min_{(\alpha, \alpha^*, b) \in \mathbb{R}^{2m+1}} \frac{1}{m}\big(e^T\alpha + e^T\alpha^*\big) + \frac{1}{C}\big(e^T\xi + e^T\xi^*\big), \quad \text{s.t. } \begin{cases} K(\alpha - \alpha^*) + be - y \le \varepsilon e + \xi, \\ y - K(\alpha - \alpha^*) - be \le \varepsilon e + \xi^*, \\ \alpha, \alpha^*, \xi, \xi^* \ge 0. \end{cases} \tag{7}$$

Remark 1. There are many methods and algorithms for solving QP-SVR (6) (see [20–23]), but few for LP-SVR (7). Unlike QP-SVR, the LP-SVR has the following characteristics:

(i) The 1-norm is less sensitive to outliers, such as those occurring when the underlying data distributions have pronounced tails; hence LP-SVR (7) has an effect similar to that of robust regression (see [24], pp. 82–87);

(ii) The transformation into its dual does not give any improvement in the structure of the optimization problem. Hence it is best to minimize the empirical risk directly, which can be achieved by a linear optimizer (see [16], p. 210).
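To make (7) concrete, the sketch below instantiates it on a small one-dimensional toy regression problem with an assumed Gaussian kernel and solves it with scipy's linprog. The data, kernel width, $\varepsilon$ and $C$ are illustrative choices, not from the paper; note that a small $C$ means a heavy slack penalty $1/C$ in this parameterization.

```python
# Toy sketch of LP-SVR (7): variables x = [alpha, alpha*, xi, xi*, b],
# all nonnegative except b. Kernel, data, eps and C are assumptions.
import numpy as np
from scipy.optimize import linprog

x = np.linspace(0.0, 3.0, 10)
y = np.sin(x)
m = len(x)
Kmat = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * 0.3 ** 2))
e, I = np.ones(m), np.eye(m)
eps, C = 0.1, 0.01                        # 1/C = 100: slacks heavily penalized

c = np.concatenate([e / m, e / m, e / C, e / C, [0.0]])
# K(alpha - alpha*) + b e - y <= eps e + xi
top = np.hstack([Kmat, -Kmat, -I, np.zeros((m, m)), e[:, None]])
# y - K(alpha - alpha*) - b e <= eps e + xi*
bot = np.hstack([-Kmat, Kmat, np.zeros((m, m)), -I, -e[:, None]])
res = linprog(c, A_ub=np.vstack([top, bot]),
              b_ub=np.concatenate([y + eps * e, eps * e - y]),
              bounds=[(0, None)] * (4 * m) + [(None, None)])
al, als, b = res.x[:m], res.x[m:2 * m], res.x[-1]
f_z = Kmat @ (al - als) + b               # decision function at the training points
resid = np.max(np.abs(f_z - y))           # stays close to the eps-tube by construction
```

With the slack penalty dominating, the solution keeps the training residuals essentially inside the $\varepsilon$-tube.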
3. Learning error estimates
If $(\alpha, \alpha^*, b)$ solves the optimization problem (7), then the decision-making function of LP-SVR is
$$f_z = (\alpha - \alpha^*)^T K(A, x) + b.$$
Set the empirical error as
$$\mathcal{E}_z(f) = \frac{1}{m}\sum_{i=1}^m (y_i - f(x_i))^2.$$
Then the LP-SVR scheme (7) can be written as
$$f_z = \arg\min_{f = f^* + b \in \mathcal{H}_K + \mathbb{R}} \{\mathcal{E}_z(f) + \lambda\Omega(f^*)\}, \tag{8}$$
where we have denoted $\Omega(f^*) = \|\alpha\|_{\ell^1} = \sum_{i=1}^m \alpha_i$ for $f^* = \sum_{i=1}^m \alpha_i K_{x_i}$ with $\alpha_i \ge 0$. We focus on the error between the functions $f_z$ and $f_\rho$, i.e.,
$$\mathcal{E}(f_z) - \mathcal{E}(f_\rho). \tag{9}$$
Our main goal is to estimate the error (9) for the least-square regression algorithm (8) by means of properties of $\rho$ and $K$. From [15] we see that the penalty term is not a Hilbert space norm, which raises technical difficulties for the mathematical analysis. Since the solution $f_z$ of the LP-SVM has a representation similar to that of the QP-SVM, we can estimate the error for the former using the stepping stone
$$\Omega(f_z^*) \le C\mathcal{E}_z(f_z) + \|f_z\|^2.$$
In the last section we obtained the solution $f_z$. Next we analyze the excess generalization error $\mathcal{E}(f_z) - \mathcal{E}(f_\rho)$. Denote the regularization error as
$$\widetilde{D}(C) = \inf_{f = f^* + b \in \mathcal{H}_K + \mathbb{R}} \Big\{ \mathcal{E}(f) - \mathcal{E}(f_\rho) + \frac{1}{C}\Omega(f^*) \Big\}, \quad C > 0.$$
Let $\widetilde{f}_{K,C} = \arg\min_{f \in \mathcal{H}_K} \widetilde{D}(C)$. To estimate $\mathcal{E}(f_z) - \mathcal{E}(f_\rho)$, we introduce the regularizing function $f_{K,C} \in \mathcal{H}_K$, which depends on $K$ and $C$ and is defined by
$$f_{K,C} = \arg\min_{f = f^* + b \in \mathcal{H}_K + \mathbb{R}} \Big\{ \mathcal{E}(f) + \frac{1}{C}\Omega(f^*) \Big\}.$$
The regularization error for the regularizing function $f_{K,C}$ is defined as
$$D(C) = \mathcal{E}(f_{K,C}) - \mathcal{E}(f_\rho) + \frac{1}{C}\Omega(f_{K,C}^*).$$
The following result is an estimate of the error $\mathcal{E}(f_z) - \mathcal{E}(f_\rho)$.

Theorem 2.1. Assume $C > 0$ and $f_{K,C} \in \mathcal{H}_K$. Then there holds
$$\mathcal{E}(f_z) - \mathcal{E}(f_\rho) \le S(m, C) + D(C),$$
where $S(m, C)$ is the sample error defined by
$$S(m, C) = \mathcal{E}(f_z) - \mathcal{E}_z(f_z) + \mathcal{E}_z(f_{K,C}) - \mathcal{E}(f_{K,C}).$$

Proof. We see from the definition of $f_\rho$ that
$$\mathcal{E}(f_z) - \mathcal{E}(f_\rho) \le \mathcal{E}(f_z) - \mathcal{E}(f_\rho) + \frac{1}{C}\Omega(f_z^*)$$
and
$$\mathcal{E}(f_z) - \mathcal{E}(f_\rho) + \frac{1}{C}\Omega(f_z^*) = \big[\mathcal{E}(f_z) - \mathcal{E}_z(f_z)\big] + \big[\mathcal{E}_z(f_{K,C}) - \mathcal{E}(f_{K,C})\big] + \Big\{\mathcal{E}_z(f_z) + \frac{1}{C}\Omega(f_z^*) - \Big(\mathcal{E}_z(f_{K,C}) + \frac{1}{C}\Omega(f_{K,C}^*)\Big)\Big\} + \Big\{\mathcal{E}(f_{K,C}) - \mathcal{E}(f_\rho) + \frac{1}{C}\Omega(f_{K,C}^*)\Big\}.$$
From the definition of $f_z$ it follows that
$$\mathcal{E}_z(f_z) + \frac{1}{C}\Omega(f_z^*) - \Big(\mathcal{E}_z(f_{K,C}) + \frac{1}{C}\Omega(f_{K,C}^*)\Big) \le 0.$$
This enables us to get
$$\mathcal{E}(f_z) - \mathcal{E}(f_\rho) + \frac{1}{C}\Omega(f_z^*) \le \big(\mathcal{E}(f_z) - \mathcal{E}_z(f_z)\big) + \big(\mathcal{E}_z(f_{K,C}) - \mathcal{E}(f_{K,C})\big) + \mathcal{E}(f_{K,C}) - \mathcal{E}(f_\rho) + \frac{1}{C}\Omega(f_{K,C}^*).$$
This finishes the proof of Theorem 2.1. □

We now give some definitions. Definitions 2.1 and 2.2 can be found in [25–28], and Definitions 2.3 and 2.4 can be found in [15,28].
Definition 2.1. For a subset $\mathcal{F}$ of a metric space and $\eta > 0$, the covering number $\mathcal{N}(\mathcal{F}, \eta)$ is defined to be the minimal integer $l \in \mathbb{N}$ such that there exist $l$ disks with radius $\eta$ covering $\mathcal{F}$. Let $B_R = \{f \in \mathcal{H}_K : \|f\|_K \le R\}$; it is a subset of $C(X)$. We denote the covering number of the unit ball $B_1$ by $\mathcal{N}(\varepsilon) = \mathcal{N}(B_1, \varepsilon)$ for $\varepsilon > 0$.
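Definition 2.1 can be made concrete with a small computation: the greedy procedure below produces an upper bound on the covering number of a finite point set on the line. The point set and radius are illustrative assumptions; a greedy net is not minimal in general, so the result only bounds $\mathcal{N}(\mathcal{F}, \eta)$ from above.

```python
# Toy illustration of Definition 2.1: upper-bounding a covering number
# by greedily picking centers for a finite set of points on the line.
import numpy as np

def greedy_covering_number(points, eta):
    """Return the size of a greedy eta-net: centers are chosen until every
    point lies within distance eta of some chosen center. This is an
    upper bound on the true covering number N(F, eta)."""
    uncovered = list(points)
    centers = []
    while uncovered:
        c = uncovered[0]                  # take any uncovered point as a center
        centers.append(c)
        uncovered = [p for p in uncovered if abs(p - c) > eta]
    return len(centers)

F = np.linspace(0.0, 1.0, 101)            # F = {0, 0.01, ..., 1}
n_cover = greedy_covering_number(F, 0.1)
```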
Definition 2.2. We say that the RKHS associated with the Mercer kernel $K$ has polynomial complexity exponent $s > 0$ if
$$\log\mathcal{N}(\varepsilon) \le C_0(1/\varepsilon)^s, \quad \forall \varepsilon > 0. \tag{10}$$

Definition 2.3. We say that the probability measure $\rho$ can be approximated by $\mathcal{H}_K$ with exponent $0 < \beta \le 1$ if there exists a constant $c_\beta$ such that
$$\widetilde{D}(C) \le c_\beta C^{-\beta}, \quad \forall C > 0. \tag{11}$$

Definition 2.4. The projection operator $\pi$ is defined on the space of measurable functions $f : X \to \mathbb{R}$ as
$$\pi(f)(x) = \begin{cases} M, & \text{if } f(x) > M, \\ -M, & \text{if } f(x) < -M, \\ f(x), & \text{if } -M \le f(x) \le M. \end{cases}$$

The following theorem gives the bound for deterministic distributions, i.e., $\mathcal{E}(f_\rho) = 0$.
In order to prove the result, we need the following ratio probability inequalities and an estimate of the covering number. These results are standard in learning theory and can be found in [7,15,27–29], among others.

Bernstein inequality. Let $\xi$ be a random variable on $Z$ satisfying $E(\xi) = \mu$ and $\sigma^2(\xi) = \sigma^2$. If $|\xi - \mu| \le M$, then for every $\varepsilon > 0$ there holds
$$\mathrm{Prob}\bigg\{\Big|\mu - \frac{1}{m}\sum_{i=1}^m \xi(z_i)\Big| > \varepsilon\bigg\} \le 2\exp\bigg\{-\frac{m\varepsilon^2}{2(\sigma^2 + \frac{1}{3}M\varepsilon)}\bigg\}.$$

Lemma 2.1. Let $\xi$ be a random variable on $Z$ satisfying $E(\xi) = \mu \ge 0$, $|\xi - \mu| \le M$ almost everywhere, and $\sigma^2 \le c\mu^\tau$, $0 \le \tau \le 2$. Then for any $\varepsilon > 0$ there holds
$$\mathrm{Prob}\bigg\{\frac{\mu - \frac{1}{m}\sum_{i=1}^m \xi(z_i)}{\sqrt{\mu^\tau + \varepsilon^\tau}} > \varepsilon^{1-\tau/2}\bigg\} \le \exp\bigg\{-\frac{m\varepsilon^{2-\tau}}{2(c + \frac{1}{3}M\varepsilon^{1-\tau})}\bigg\}.$$

Lemma 2.2. Let $0 \le \tau \le 1$, $M > 0$, $c > 0$, and let $\mathcal{G}$ be a set of functions on $Z$ such that for every $g \in \mathcal{G}$, $Eg \ge 0$, $|g - Eg| \le M$, and $E(g^2) \le c(Eg)^\tau$. Then for $\varepsilon > 0$,
$$\mathrm{Prob}\bigg\{\sup_{g \in \mathcal{G}} \frac{Eg - \frac{1}{m}\sum_{i=1}^m g(z_i)}{\sqrt{(Eg)^\tau + \varepsilon^\tau}} > 4\varepsilon^{1-\tau/2}\bigg\} \le \mathcal{N}(\mathcal{G}, \varepsilon)\exp\bigg\{-\frac{m\varepsilon^{2-\tau}}{2(c + \frac{1}{3}M\varepsilon^{1-\tau})}\bigg\},$$
where $Eg = \int_Z g(z)\, d\rho$.
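The Bernstein inequality above can be sanity-checked by simulation. The sketch below estimates the left-hand probability by Monte Carlo for a bounded random variable; the uniform distribution and all parameters are assumptions made for illustration.

```python
# Monte Carlo check of the Bernstein inequality for a bounded variable:
# xi ~ Uniform[0, 1], so mu = 1/2, sigma^2 = 1/12, and |xi - mu| <= M = 1/2.
import numpy as np

rng = np.random.default_rng(4)
m, trials, eps = 50, 20000, 0.1
mu, sigma2, M = 0.5, 1.0 / 12.0, 0.5

samples = rng.uniform(0.0, 1.0, (trials, m))
deviations = np.abs(mu - samples.mean(axis=1))
empirical = np.mean(deviations > eps)     # Monte Carlo estimate of the LHS
bound = 2 * np.exp(-m * eps ** 2 / (2 * (sigma2 + M * eps / 3)))
```

For these parameters the empirical tail probability is far below the Bernstein bound, as the inequality guarantees.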
Denote the function set $\mathcal{G}_R$ as
$$\mathcal{G}_R = \{(f(x) - y)^2 - (f_\rho(x) - y)^2 : f \in B_R\}, \quad R > 0. \tag{12}$$

Lemma 2.3. Let $\mathcal{G}_R$ be defined by (12). If (10) holds, then there exists a constant $c_s' > 0$ such that
$$\log\mathcal{N}(\mathcal{G}_R, \varepsilon) \le c_s'\Big(\frac{R}{\varepsilon}\Big)^s.$$
From Lemmas 2.2 and 2.3 we have the following corollary.

Corollary. Let $\mathcal{G}_R$ be defined by (12) and let (10) hold. Then for every $0 < \delta < 1$, with confidence at least $1 - \delta$, there holds
$$\{\mathcal{E}(f) - \mathcal{E}(f_\rho)\} - \{\mathcal{E}_z(f) - \mathcal{E}_z(f_\rho)\} \le 4\sqrt{\varepsilon_{m,R}}\sqrt{(\mathcal{E}(f) - \mathcal{E}(f_\rho)) + \varepsilon_{m,R}}$$
for all $f \in B_R$, where $\varepsilon_{m,R}$ is given by
$$\varepsilon_{m,R} = 8\Big(c_R + \frac{1}{3}\Big)\bigg(\Big(\frac{c_s' R^s}{m}\Big)^{\frac{1}{1+s}} + \frac{\log(1/\delta)}{m}\bigg), \quad c_R = (\kappa R + 3M)^2.$$

Proof. Consider the set $\mathcal{G}_R$. Each function $g \in \mathcal{G}_R$ has the form $g(z) = (f(x) - y)^2 - (f_\rho(x) - y)^2$ with $f \in B_R$. Hence
$$Eg = \mathcal{E}(f) - \mathcal{E}(f_\rho) \ge 0, \qquad \frac{1}{m}\sum_{i=1}^m g(z_i) = \mathcal{E}_z(f) - \mathcal{E}_z(f_\rho),$$
and
$$g(z) = (f(x) - f_\rho(x))\big((f(x) - y) + (f_\rho(x) - y)\big).$$
Since $\|f\|_\infty \le \kappa\|f\|_K \le \kappa R$ and $|f_\rho(x)| \le M$ almost everywhere, we find that
$$|g(z)| \le (\kappa R + M)(\kappa R + 3M) \le c_R = (\kappa R + 3M)^2.$$
So we have $|g(z) - Eg| \le B = 2c_R$ almost everywhere. Also,
$$Eg^2 = E\big((f(x) - f_\rho(x))^2((f(x) - y) + (f_\rho(x) - y))^2\big) \le (\kappa R + 3M)^2(\mathcal{E}(f) - \mathcal{E}(f_\rho)).$$
Thus $Eg^2 \le c_R Eg$ for each $g \in \mathcal{G}_R$.

Applying Lemma 2.2 with $\tau = 1$, we deduce that
$$\sup_{f \in B_R} \frac{Eg - \frac{1}{m}\sum_{i=1}^m g(z_i)}{\sqrt{Eg + \varepsilon}} \le 4\sqrt{\varepsilon}$$
with confidence at least
$$1 - \mathcal{N}(\mathcal{G}_R, \varepsilon)\exp\bigg\{-\frac{m\varepsilon}{2(c_R + \frac{1}{3}B)}\bigg\} = 1 - \delta.$$
So we can see that
$$\{\mathcal{E}(f) - \mathcal{E}(f_\rho)\} - \{\mathcal{E}_z(f) - \mathcal{E}_z(f_\rho)\} \le 4\sqrt{\varepsilon_{m,R}}\sqrt{(\mathcal{E}(f) - \mathcal{E}(f_\rho)) + \varepsilon_{m,R}}.$$
We see from Corollary 5.1 of [15] (with $p = \infty$, $\tau = 1$) that
$$\varepsilon_{m,R} = 8\Big(c_R + \frac{1}{3}\Big)\bigg(\Big(\frac{c_s' R^s}{m}\Big)^{\frac{1}{1+s}} + \frac{\log(1/\delta)}{m}\bigg).$$
This enables us to end the proof of the Corollary. □

Theorem 2.2. Suppose $\mathcal{E}(f_\rho) = 0$. If $f_{K,C}$ is a function in $\mathcal{H}_K$ satisfying $\|(y - f_{K,C})^2\|_\infty \le M$, then for every $0 < \delta < 1$, with confidence
$1 - \delta$ there holds
$$\mathcal{E}(f_z) \le 17\varepsilon_{m,C} + \frac{20M\log(2/\delta)}{3m} + 8D(C),$$
where $\varepsilon_{m,C}$ is given by
$$\varepsilon_{m,C} = 8\Bigg(\Big(\kappa^2 C\Big(\frac{5M\log(2/\delta)}{3m} + 2D(C)\Big) + 3M\Big)^2 + \frac{1}{3}\Bigg)\Bigg(\bigg(\frac{c_s'\kappa^s C^s\big(\frac{5M\log(2/\delta)}{3m} + 2D(C)\big)^s}{m}\bigg)^{\frac{1}{1+s}} + \frac{\log(1/\delta)}{m}\Bigg).$$

Proof. Since $\mathcal{E}(f_\rho) = 0$, we have $(y - f_\rho(x))^2 = 0$ almost everywhere. We first consider the random variable $\xi = (y - f_{K,C})^2$. It satisfies
$$\sigma^2(\xi) \le E\xi^2 \le ME\xi \le MD(C).$$
Applying the Bernstein inequality to $\xi$, we see by solving the equation $\frac{m\varepsilon^2}{2(\sigma^2 + M\varepsilon/3)} = \log(2/\delta)$ that with confidence $1 - \delta/2$,
$$\mathcal{E}_z(f_{K,C}) - \mathcal{E}(f_{K,C}) \le \frac{2M\log(2/\delta)}{3m} + \sqrt{\frac{2\sigma^2(\xi)\log(2/\delta)}{m}} \le \frac{5M\log(2/\delta)}{3m} + D(C).$$
Next we estimate $\mathcal{E}(f_z) - \mathcal{E}_z(f_z)$. By the definition of $f_z$, there holds
$$\frac{1}{C}\Omega(f_z^*) \le \mathcal{E}_z(f_z) + \frac{1}{C}\Omega(f_z^*) \le \mathcal{E}_z(f_{K,C}) + \frac{1}{C}\Omega(f_{K,C}^*),$$
and $\mathcal{E}(f_\rho) = 0$, $D(C) = \mathcal{E}(f_{K,C}) + \frac{1}{C}\Omega(f_{K,C}^*)$. It follows that
$$\frac{1}{C}\Omega(f_z^*) \le \mathcal{E}_z(f_{K,C}) - \mathcal{E}(f_{K,C}) + D(C).$$
Recalling $f_z \in \mathcal{H}_K$, we see from the reproducing property of the kernel that
$$\|f_z^*\|_K = \bigg(\sum_{i,j=1}^m \alpha_i\alpha_j K(x_i, x_j)\bigg)^{1/2} \le \kappa\bigg(\sum_{i,j=1}^m \alpha_i\alpha_j\bigg)^{1/2} = \kappa\,\Omega(f_z^*) \le R = \kappa C\Big(\frac{5M\log(2/\delta)}{3m} + 2D(C)\Big).$$
The Corollary with $R$ given as above implies that
$$\mathcal{E}(f_z) - \mathcal{E}_z(f_z) \le 4\sqrt{\varepsilon_{m,C}}\sqrt{\mathcal{E}(f_z) + \varepsilon_{m,C}}$$
with confidence $1 - \delta$, where $\varepsilon_{m,C}$ is defined in the statement. Putting the above two estimates into Theorem 2.1, there holds
$$\mathcal{E}(f_z) \le 4\sqrt{\varepsilon_{m,C}}\sqrt{\mathcal{E}(f_z) + \varepsilon_{m,C}} + \frac{10M\log(2/\delta)}{3m} + 4D(C).$$
Solving the quadratic inequality for $\sqrt{\mathcal{E}(f_z) + \varepsilon_{m,C}}$ leads to
$$\mathcal{E}(f_z) \le 17\varepsilon_{m,C} + \frac{20M\log(2/\delta)}{3m} + 8D(C).$$
The proof of Theorem 2.2 is completed. □
The next result concerns general distributions satisfying the Tsybakov condition (see [30]).

Theorem 2.3. Assume the hypotheses (10) and (11) with $0 < s < 1$ and $0 < \beta \le 1$. Take $t > 1$. For every $\varepsilon > 0$ and every $0 < \delta < 1$ there exist two constants, $c_s$ depending on $s$ and $c_s''$ depending on $s$ and $\beta$, such that with confidence $1 - \delta$,
$$\mathcal{E}(f_z) - \mathcal{E}(f_\rho) \le 2c_s t\Big(\frac{R^s}{m}\Big)^{\frac{1}{1+s}} + 2tc_s''\Big(\frac{C^{2(1-\beta)}}{m} + \frac{1}{m} + \frac{C^{1-\beta}}{m} + C^{-\beta}\Big).$$
Proof. Denote $D_z = \mathcal{E}(f_z) - \mathcal{E}(f_\rho) + \frac{1}{C}\Omega(f_z^*)$. Then we have $\Omega(f_z^*) \le CD_z$. Theorem 2.1 tells us that
$$D_z \le S(m, C) + D(C).$$
Take $f_{K,C} = \widetilde{f}_{K,C}$. By assumption (11),
$$D_z \le S(m, C) + c_\beta C^{-\beta}.$$
Recalling the expression for $S(m, C)$, we have
$$S(m, C) = \{(\mathcal{E}(f_z) - \mathcal{E}(f_\rho)) - (\mathcal{E}_z(f_z) - \mathcal{E}_z(f_\rho))\} + \{(\mathcal{E}_z(f_{K,C}) - \mathcal{E}_z(f_\rho)) - (\mathcal{E}(f_{K,C}) - \mathcal{E}(f_\rho))\} = S_1 + S_2.$$
Take $t \ge 1$, $C \ge 1$. For $S_1$, we apply the Corollary with $\delta = e^{-t} \le 1/e$, and find that there is a set $V_R^{(1)} \subset Z^m$ of measure at most $\delta = e^{-t}$ such that
$$S_1 \le c_s t\bigg\{\Big(\frac{R^s}{m}\Big)^{\frac{1}{1+s}} + \Big(\frac{R^s}{m}\Big)^{\frac{1}{2(1+s)}} D_z^{1/2}\bigg\},$$
where $c_s = 32\big((\kappa R + 3M)^2 + \frac{1}{3}\big)(c_s' + 1)$ is a constant depending only on $s$.

To estimate $S_2$, consider $\xi = (f_{K,C} - y)^2 - (f_\rho - y)^2$ on $(Z, \rho)$. From (4) it follows that
$$\|f_{K,C}\|_\infty \le \kappa\|f_{K,C}\|_K \le \kappa^2\Omega(f_{K,C}^*) \le \kappa^2 CD(C).$$
Write $\xi = \xi_1 + \xi_2$, where
$$\xi_1 = (f_{K,C} - y)^2 - (\pi(f_{K,C}) - y)^2, \qquad \xi_2 = (\pi(f_{K,C}) - y)^2 - (f_\rho - y)^2.$$
It is easy to check that $0 \le \xi_1 \le (\kappa^2 CD(C) + 3M)^2 = B_C$, and $\sigma^2(\xi_1)$ is bounded by $(\kappa^2 CD(C) + 3M)^2 E(\xi_1)$. By the Bernstein inequality,
$$\mathrm{Prob}\bigg\{\frac{1}{m}\sum_{i=1}^m \xi_1(z_i) - E\xi_1 > \varepsilon\bigg\} \le \exp\bigg\{-\frac{m\varepsilon^2}{2(\sigma^2(\xi_1) + \frac{1}{3}B_C\varepsilon)}\bigg\}.$$
Solving the quadratic equation
$$\frac{m\varepsilon^2}{2(\sigma^2(\xi_1) + \frac{1}{3}B_C\varepsilon)} = t$$
for $\varepsilon$ (with $\delta = e^{-t}$), we obtain a set $V_R^{(2)} \subset Z^m$ of measure at most $\delta = e^{-t}$ such that for every $z \in Z^m \setminus V_R^{(2)}$,
$$\frac{1}{m}\sum_{i=1}^m \xi_1(z_i) - E\xi_1 \le \frac{\frac{1}{3}B_C t + \sqrt{\big(\frac{1}{3}B_C t\big)^2 + 2m\sigma^2(\xi_1)t}}{m} \le \frac{2B_C t}{3m} + \sqrt{\frac{2t\sigma^2(\xi_1)}{m}}.$$
But the fact $0 \le \xi_1 \le B_C$ implies $\sigma^2(\xi_1) \le B_C E(\xi_1)$. Therefore
$$\frac{1}{m}\sum_{i=1}^m \xi_1(z_i) - E\xi_1 \le \frac{7B_C t}{6m} + E(\xi_1), \quad \forall z \in Z^m \setminus V_R^{(2)}.$$
Next we consider $\xi_2$. Since both $\pi(f_{K,C})$ and $f_\rho$ take values in $[-M, M]$, $\xi_2$ is a random variable satisfying $|\xi_2| \le B$. Applying the Bernstein inequality as above, we know that there exists another subset $V_R^{(3)} \subset Z^m$ of measure at most $\delta = e^{-t}$ such that for every $z \in Z^m \setminus V_R^{(3)}$ there holds
$$\frac{1}{m}\sum_{i=1}^m \xi_2(z_i) - E\xi_2 \le \frac{2Bt}{3m} + \sqrt{\frac{2t\sigma^2(\xi_2)}{m}}.$$
By the fact $\sigma^2(\xi_2) \le BE(\xi_2)$, we have
$$\frac{1}{m}\sum_{i=1}^m \xi_2(z_i) - E\xi_2 \le \frac{7Bt}{6m} + E(\xi_2), \quad \forall z \in Z^m \setminus V_R^{(3)}.$$
Combining the above two estimates for $\xi_1$ and $\xi_2$ with the fact $E\xi_1 + E\xi_2 = E\xi \le D(C) \le c_\beta C^{-\beta}$, we conclude that
$$S_2 \le \frac{7B_C t + 7Bt}{6m} + D(C) \le c_s'' t\Big(\frac{C^{2(1-\beta)}}{m} + \frac{1}{m} + \frac{C^{1-\beta}}{m} + C^{-\beta}\Big), \quad \forall z \in Z^m \setminus V_R^{(2)} \setminus V_R^{(3)},$$
where $c_s'' = \frac{7\kappa^4 c_\beta^2 + 42M\kappa^2 c_\beta + 63M^2 + 7B}{6} + c_\beta$ is a constant depending only on $s$ and $\beta$.

Putting the above estimates for $S_1$ and $S_2$ together, we find that for every $z \in Z^m \setminus V_R^{(1)} \setminus V_R^{(2)} \setminus V_R^{(3)}$ there holds
$$D_z \le 2c_s t\Big(\frac{R^s}{m}\Big)^{\frac{1}{1+s}} + 2tc_s''\Big(\frac{C^{2(1-\beta)}}{m} + \frac{1}{m} + \frac{C^{1-\beta}}{m} + C^{-\beta}\Big).$$
Here we have used the following elementary inequality: if $a, b > 0$ and $0 < \alpha < 1$, then
$$x \le ax^\alpha + b, \; x > 0 \;\Longrightarrow\; x \le \max\{(2a)^{1/(1-\alpha)}, 2b\}.$$
The proof of Theorem 2.3 is complete. □
4. Conclusions
In reproducing kernel Hilbert spaces, using the covering number, a new upper bound for the learning error of linear programming support vector regression has been presented in this paper. The error bound takes the form
$$B(m, C) = 2c_s t\Big(\frac{R^s}{m}\Big)^{\frac{1}{1+s}} + 2tc_s''\Big(\frac{C^{2(1-\beta)}}{m} + \frac{1}{m} + \frac{C^{1-\beta}}{m} + C^{-\beta}\Big).$$
Here $m$ is the number of sample points, $C$ is the trade-off factor controlling the regularization term in the LP-SVR model (7), $s$ is the polynomial complexity exponent of the given reproducing kernel Hilbert space, $c_s$ and $c_s''$ are two constants which depend on $s$ and can be estimated from Definition 2.3 and Lemma 2.3, $t > 1$ is a given constant, and $0 < \beta \le 1$ is also a given constant.
Moreover,
$$\lim_{m \to +\infty} B(m, C) = 2tc_s'' C^{-\beta}.$$
This means that the gap cannot vanish, no matter how the sample data points are selected.
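This limiting behaviour can be seen numerically. The sketch below evaluates $B(m, C)$ for illustrative constant choices (the values of $c_s$, $c_s''$, $R$, $s$, $\beta$ and $t$ are assumptions, since the true constants depend on the kernel and distribution): as $m$ grows, every term except $2tc_s''C^{-\beta}$ vanishes.

```python
# Numeric sketch of the bound B(m, C) from the conclusion.
# All constants below are assumed illustrative values.
def B(m, C, t=2.0, c1=1.0, c2=1.0, R=1.0, s=0.5, beta=0.5):
    # First term: sample-error part, vanishing as m -> infinity.
    term1 = 2 * c1 * t * (R ** s / m) ** (1 / (1 + s))
    # Second term: regularization part; only C**(-beta) survives as m -> infinity.
    term2 = 2 * t * c2 * (C ** (2 * (1 - beta)) / m + 1 / m
                          + C ** (1 - beta) / m + C ** (-beta))
    return term1 + term2

C = 4.0
limit = 2 * 2.0 * 1.0 * C ** (-0.5)     # 2 t c2 C^{-beta} with the defaults above
gap_large_m = B(10 ** 8, C) - limit     # tiny: the bound approaches its limit
```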
Due to the difficulty of calculating the covering number, little can currently be done on the experimental side. The authors believe this will be a very challenging direction for future work.
Acknowledgements
The authors thank the referees for their careful reading of the manuscript, which improved its technical quality and presentation. Their very detailed comments and extremely important suggestions made this paper more readable and understandable.
References
[1] V. Vapnik, A. Lerner, Pattern recognition using generalized portrait method, Automat. Rem. Contr. 24 (1963) 774–780.
[2] M.A. Aizerman, E.M. Braverman, L.I. Rozonoer, Theoretical foundations of the potential function method in pattern recognition learning, Automat. Rem. Contr. 25 (1964) 821–837.
[3] C. Cortes, V. Vapnik, Support-vector networks, Mach. Learn. 20 (1995) 273–297.
[4] V. Vapnik, Statistical Learning Theory, Wiley, New York, 1998.
[5] I. Steinwart, Support vector machines are universally consistent, J. Complexity 18 (2002) 768–791.
[6] T. Zhang, Statistical behavior and consistency of classification methods based on convex risk minimization, Ann. Statist. 32 (2004) 56–85.
[7] Q. Wu, D.X. Zhou, Analysis of support vector machine classification, J. Comput. Anal. Appl. 8 (2006) 108–134.
[8] H.B. Zhao, S.D. Yin, Geomechanical parameters identification by particle swarm optimization and support vector machine, Appl. Math. Model. 33 (2009) 3997–4012.
[9] W.C. Hong, Electric load forecasting by support vector model, Appl. Math. Model. 33 (2009) 2444–2454.
[10] Q. Wu, Y.M. Ying, D.X. Zhou, Multi-kernel regularized classifiers, J. Complexity 23 (2007) 108–134.
[11] P.S. Bradley, O.L. Mangasarian, Massive data discrimination via linear support vector machines, Optim. Methods Softw. 13 (2000) 1–10.
[12] V. Kecman, I. Hadaic, Support vector selection by linear programming, Proc. of IJCNN 5 (2000) 193–198.
[13] P. Niyogi, F. Girosi, On the relationship between generalization error, hypothesis complexity, and sample complexity for radial basis functions, Neural Comput. 8 (1996) 819–842.
[14] J.P. Pedroso, N. Murata, Support vector machines with different norms: motivation, formulations and results, Pattern Recogn. Lett. 22 (2001) 1263–1272.
[15] Q. Wu, D.X. Zhou, SVM soft margin classifiers: linear programming versus quadratic programming, Neural Comput. 17 (2005) 1160–1187.
[16] A.J. Smola, B. Schölkopf, A tutorial on support vector regression, Stat. Comput. 14 (2004) 199–222.
[17] G. Liu, Z. Lin, Y. Yu, Multi-output regression on the output manifold, Pattern Recogn. 42 (2009) 2737–2743.
[18] N. Aronszajn, Theory of reproducing kernels, Trans. Amer. Math. Soc. 68 (1950) 337–404.
[19] V. Vapnik, The Nature of Statistical Learning Theory, Springer Verlag, New York, 1995.
[20] J.C. Platt, Fast training of support vector machines using sequential minimal optimization, in: Advances in Kernel Methods-Support Vector Learning, MIT Press, 1999, pp. 185–208.
[21] O.L. Mangasarian, D.R. Musicant, Successive overrelaxation for support vector machines, IEEE Trans. Neural Networks 10 (1999) 1032–1037.
[22] Y.J. Lee, O.L. Mangasarian, SSVM: a smooth support vector machine for classification, Comput. Optim. Appl. 22 (2001) 5–21.
[23] Y. Yuan, J. Yan, C. Xu, Polynomial smooth support vector machine, Chinese J. Comput. 28 (2005) 9–17 (in Chinese).
[24] M.H. Hassoun, Fundamentals of Artificial Neural Networks, MIT Press, Cambridge, MA, 1995.
[25] D.X. Zhou, The covering number in learning theory, J. Complexity 18 (2002) 739–767.
[26] Y. Guo, P.L. Bartlett, J. Shawe-Taylor, R.C. Williamson, Covering numbers for support vector machines, IEEE Trans. Inf. Theory 48 (2002) 239–250.
[27] F. Cucker, S. Smale, On the mathematical foundations of learning theory, Bull. Amer. Math. Soc. 39 (2001) 1–49.
[28] Q. Wu, D.X. Zhou, Learning rates of least-square regularized regression, Found. Comput. Math. 6 (2006) 171–192.
[29] F. Cucker, D.X. Zhou, Learning Theory: An Approximation Theory Viewpoint, Cambridge University Press, Cambridge, 2007.
[30] A.B. Tsybakov, Optimal aggregation of classifiers in statistical learning, Ann. Statist. 32 (2004) 135–166.