Learning errors of linear programming support vector regression

Feilong Cao, Yubo Yuan

Institute of Metrology and Computational Science, China Jiliang University, Hangzhou 310018, Zhejiang Province, China
Article info

Article history: Received 1 June 2009; Received in revised form 18 September 2010; Accepted 4 October 2010; Available online 13 October 2010

Keywords: Regression; Support vector machine; Linear programming; Generalization error; Empirical error

Abstract
In this paper, we give several results on learning errors for linear programming support vector regression. The corresponding theorems are proved in the reproducing kernel Hilbert space. With the covering number, the approximation property and the capacity of the reproducing kernel Hilbert space are measured. The obtained result (Theorem 2.1) shows that the learning error can be controlled by the sample error and the regularization error. The mentioned sample error is summarized by the errors of the learning regression function and the regularizing function in the reproducing kernel Hilbert space. After estimating the generalization error of the learning regression function (Theorem 2.2), the upper bound (Theorem 2.3) of the regularized learning algorithm associated with linear programming support vector regression is estimated.
Crown Copyright © 2010 Published by Elsevier Inc. All rights reserved.
1. Introduction
The main aim of this paper is the error analysis of the linear programming support vector regression (SVR) problem in learning theory. To this end, the paper is organized as follows. We start with a brief introduction to the basic techniques of the support vector machine (SVM) and SVR in Section 1, including the history, motivation, the most important publications, and the results most relevant to this paper. A brief review of linear programming support vector regression (LP-SVR) and its quadratic programming counterpart (QP-SVR) in the reproducing kernel Hilbert space is presented in Section 2. The learning error analysis of LP-SVR is given in Section 3. In Section 4, we give a short conclusion and summary of our research work, together with some directions for future work.
First of all, we briefly describe the historical background of the support vector learning algorithm (SVLA). The SVLA is a nonlinear generalization of the generalized portrait algorithm developed in the sixties (see [1,2]).
As such, the SVLA is firmly grounded in the framework of statistical learning theory. In 1995, Cortes and Vapnik (see [3]) proposed a learning network named the support-vector network. Originally, it was a learning machine for two-group classification problems. One of its most important ideas is that input vectors are non-linearly mapped to a very high-dimensional feature space, in which a linear decision surface is constructed. They showed that the training data (separable and non-separable) could be separated without errors. After that, this learning network became more and more popular and was renamed SVM.
Today, SVM has become an important subject in learning theory and has evolved into an active area of research. Mathematically, SVM arises from a pattern classification problem based on a given classification of $m$ points $(x_1, x_2, \ldots, x_m)$ in
doi:10.1016/j.apm.2010.10.012
The research was supported by the National Natural Science Foundation of China (Nos. 60873206, 61001200) and the Natural Science Foundation of Zhejiang Province of China (No. Y7080235).
Corresponding author. Tel.: +86 571 86835737. E-mail address: feilongcao@gmail.com (F. Cao).
Applied Mathematical Modelling
the $n$-dimensional space $\mathbb{R}^n$, represented by an $m \times n$ matrix $A = (x_1, x_2, \ldots, x_m)^T$, with the membership of each data point $x_i$, $i = 1, 2, \ldots, m$, in the class $1$ or $-1$ specified by a given $m \times m$ diagonal matrix $D$ with $1$ or $-1$ entries on its diagonal.
Primarily, SVM (see [4]) is given by the following quadratic program with linear inequality constraints:
$$\min_{(\alpha, b) \in \mathbb{R}^{n+1}} \frac{1}{2}\|\alpha\|^2 + \frac{1}{2}\|y\|^2, \quad \text{s.t. } D(A\alpha + be) \ge e + y, \; y \ge 0. \tag{1}$$
Model (1) can be seen as the original model of QP-SVM. Here, $\alpha$ is the vector of separator coefficients (the direction vector of the classification hyperplane), $b$ is an offset (controlling the distance of the hyperplane from the origin), and $e \in \mathbb{R}^m$ stands for the vector of ones.
The decision function for classification is given by
$$f(x) = \operatorname{sign}(\alpha^T x + b). \tag{2}$$
By now, many different forms of QP-SVM (1) have been introduced for different purposes (see [4]). In this work, we mainly pay attention to the learning error, or the convergence rate, of the proposed algorithm. For the convergence rate of QP-SVM (1) there are many works; we refer the readers to Steinwart [5], Zhang [6], Wu and Zhou [7], Zhao and Yin [8], Hong [9], and Wu et al. [10].
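To make model (1) and decision rule (2) concrete, the sketch below solves them numerically on a tiny synthetic two-class set with scipy's general-purpose SLSQP solver. The data, random seed, and the choice of solver are illustrative assumptions, not part of the paper.

```python
# Toy sketch of QP-SVM model (1) with decision rule (2).
# Data and solver choice are illustrative assumptions.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
A = np.vstack([rng.normal(2.0, 0.3, (4, 2)),     # class +1 cloud
               rng.normal(-2.0, 0.3, (4, 2))])   # class -1 cloud
d = np.array([1, 1, 1, 1, -1, -1, -1, -1])       # diagonal of D
m, n = A.shape
e = np.ones(m)

def objective(v):
    # (1/2)||alpha||^2 + (1/2)||y||^2 ; v = [alpha (n), b (1), y (m)]
    alpha, y = v[:n], v[n + 1:]
    return 0.5 * alpha @ alpha + 0.5 * y @ y

def margin(v):
    # Constraint of (1): D(A alpha + b e) - e - y >= 0
    alpha, b, y = v[:n], v[n], v[n + 1:]
    return d * (A @ alpha + b * e) - e - y

cons = [{"type": "ineq", "fun": margin},
        {"type": "ineq", "fun": lambda v: v[n + 1:]}]   # y >= 0
res = minimize(objective, np.zeros(n + 1 + m), method="SLSQP", constraints=cons)
alpha, b = res.x[:n], res.x[n]
pred = np.sign(A @ alpha + b)    # decision rule (2) on the training points
```

On such a well-separated toy set, the recovered hyperplane classifies all training points correctly.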
Among the forms of SVM, the LP-SVM is important because of its linearity and flexibility for large data settings. Many authors have introduced the LP-SVM; we refer the readers to Bradley and Mangasarian [11], Kecman and Hadaic [12], Niyogi and Girosi [13], Pedroso and Murata [14], and Vapnik [4]. Its primal optimization model is as follows:
$$\min_{(\lambda, y) \in \mathbb{R}^{2m}} \frac{1}{m}e^T\lambda + \frac{1}{C}e^T y, \quad \text{s.t. } D(AA^T\lambda + be) \ge e - y, \; \lambda \ge 0, \; y \ge 0. \tag{3}$$
The trade-off factor $C = C(m) > 0$ depends on $m$ and is crucial.
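Because (3) is a linear program, it can be passed directly to an off-the-shelf LP solver once it is written in the standard form $\min c^T x$ s.t. $A_{ub}x \le b_{ub}$. The sketch below does this with scipy's linprog on toy data; the data set and the value of $C$ are assumptions made for illustration.

```python
# Toy sketch of the LP-SVM primal (3) in standard linprog form.
# Variables x = [lambda (m), y (m), b (1)]. Data and C are assumptions.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(1)
A = np.vstack([rng.normal(2.0, 0.3, (5, 2)),
               rng.normal(-2.0, 0.3, (5, 2))])
d = np.r_[np.ones(5), -np.ones(5)]        # diagonal of D
m = len(d)
e, G = np.ones(m), A @ A.T                # G = A A^T (linear kernel matrix)
C = 10.0

c = np.concatenate([e / m, e / C, [0.0]])              # (1/m)e^T lam + (1/C)e^T y
# D(G lam + b e) >= e - y   <=>   -(D G) lam - y - d b <= -e
A_ub = np.hstack([-(d[:, None] * G), -np.eye(m), -d[:, None]])
res = linprog(c, A_ub=A_ub, b_ub=-e,
              bounds=[(0, None)] * (2 * m) + [(None, None)])
lam, b = res.x[:m], res.x[-1]
pred = np.sign(G @ lam + b)               # f(x_i) = x_i^T A^T lam + b on training data
```

On this separable toy problem the LP drives the slacks $y$ to zero and all training points are classified correctly.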
Many experiments demonstrate that LP-SVM is efficient and can even perform better than QP-SVM for some purposes: it is capable of solving huge sample size problems (see [11]), improving the computational speed (see [14]), and reducing the number of support vectors (see [12]).
However, little is known about the learning error or the convergence of the LP-SVM. We only found that a classification problem of LP-SVM was studied by Wu and Zhou [15].
The primary goal of this paper is to investigate the regression problem and to provide an error analysis for linear programming support vector regression (LP-SVR).
2. LP-SVR and QP-SVR in the reproducing kernel Hilbert space
For the regression problem, SVR can be seen as an SVLA with continuous output, or as a support vector machine for function estimation. In 2004, Smola and Schölkopf [16] gave an overview of the basic ideas underlying SVM for function estimation. They indicated that for SVR it is more difficult to analyze the learning error and convergence. This is the motivation of this work.
Let $(X, d)$ be a compact metric space and let $X \subset \mathbb{R}^n$, $Y = \mathbb{R}$. In this work, we only discuss the single-output regression problem. For the multi-output regression problem, difficulties arise from the multivariate output space, and it seems a very challenging task to study learning errors for multi-output LP-SVR in reproducing kernel Hilbert spaces. In 2009, Liu et al. [17] studied the output space as a Riemannian submanifold in order to incorporate its geometric structure into the regression process. They proposed a locally linear transformation (LLT) to define loss functions on the output manifold, and an algorithm was given under the framework of SVR.
Let $\rho$ be a probability distribution on $S = X \times Y$. The error (or generalization error) of a function $f : X \to Y$ is defined as
$$\mathcal{E}(f) = \int_S (y - f(x))^2 \, d\rho.$$
The function that minimizes this error is called the regression function. It is given by
$$f_\rho(x) = \int_Y y \, d\rho(y \mid x), \quad \forall x \in X,$$
where $\rho(\cdot \mid x)$ is the conditional probability measure at $x$ induced by $\rho$. The target of the regression problem is to learn the regression function, or to find a good approximation of it, from random samples. The least-square algorithm for the regression problem is a discrete least-square problem associated with a Mercer kernel. Let $K : X \times X \to \mathbb{R}$ be continuous, symmetric and positive semidefinite, i.e., for any finite set of distinct points $\{x_1, x_2, \ldots, x_m\} \subset X$, the matrix $(K(x_i, x_j))_{i,j=1}^m$ is positive semidefinite (in the rest of this paper we denote this matrix by $K$). Such a function is called a Mercer kernel.
The reproducing kernel Hilbert space (RKHS) $\mathcal{H}_K$ associated with $K$ is defined (see [18]) to be the closure of the linear span of the set of functions $\{K_x = K(x, \cdot) : x \in X\}$ with the inner product $\langle \cdot, \cdot \rangle_{\mathcal{H}_K} = \langle \cdot, \cdot \rangle_K$, i.e.,
$$\mathcal{H}_K = \overline{\operatorname{span}\{K_x : x \in X\}}.$$
The inner product satisfies $\langle K_x, K_y \rangle_K = K(x, y)$, that is,
$$\Big\langle \sum_i \alpha_i K_{x_i}, \sum_j \beta_j K_{y_j} \Big\rangle_K = \sum_{i,j} \alpha_i \beta_j K(x_i, y_j).$$
The reproducing property takes the form $\langle K_x, f \rangle_K = f(x)$, $\forall x \in X$, $f \in \mathcal{H}_K$. Denote by $C(X)$ the space of continuous functions on $X$ with the norm $\|\cdot\|_\infty$. Let $\kappa = \sup_{x \in X} \sqrt{K(x, x)}$. Then the above reproducing property tells us that
$$\|f\|_\infty \le \kappa\|f\|_K, \quad \forall f \in \mathcal{H}_K. \tag{4}$$
Throughout this paper, we assume that for some $M \ge 0$, $\rho(\cdot \mid x)$ is almost everywhere supported on $[-M, M]$, that is, $|y| \le M$ almost surely (with respect to $\rho$). It follows from the definition of the regression function $f_\rho$ that $|f_\rho(x)| \le M$. Let $z = \{(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)\} \in (X \times Y)^m$ be the sample.
In $\varepsilon$-SVR (see [4,16,19]), the primal goal is to find a function that has at most $\varepsilon$ deviation from the actually obtained targets $y_i$ for all the training data, and that at the same time is as flat as possible. In other words, we do not care about errors as long as they are less than $\varepsilon$, but we will not accept any deviation larger than this. This may be important if, for instance, you want to be sure not to lose more than $\varepsilon$ money when dealing with exchange rates (see [16], p. 200). The QP-SVR (see [4,16,19]) in the reproducing kernel Hilbert space can be formulated as
$$\min_{(\alpha, \alpha^*, b) \in \mathbb{R}^{2m+1}} \frac{1}{2}\big(\|\alpha\|^2 + \|\alpha^*\|^2\big), \quad \text{s.t. } -\varepsilon e \le K(\alpha - \alpha^*) + be - y \le \varepsilon e. \tag{5}$$
With the slack variables $\xi, \xi^* \in \mathbb{R}^m$, we arrive at the formulation presented in [4,16,19]:
$$\min_{(\alpha, \alpha^*, b) \in \mathbb{R}^{2m+1}} \frac{1}{2}\big(\|\alpha\|^2 + \|\alpha^*\|^2\big) + \frac{1}{C}\big(\|\xi\|^2 + \|\xi^*\|^2\big), \quad \text{s.t. } \begin{cases} K(\alpha - \alpha^*) + be - y \le \varepsilon e + \xi, \\ y - K(\alpha - \alpha^*) - be \le \varepsilon e + \xi^*, \\ \xi, \xi^* \ge 0. \end{cases} \tag{6}$$
Motivated by reducing the number of support vectors of the 1-norm soft margin SVM, Smola and Schölkopf (see [4,19]) and Vapnik (see [16]) introduced the LP-SVM algorithm associated with a Mercer kernel $K$. In fact, the LP-SVR is based on the following linear programming optimization problem:
$$\min_{(\alpha, \alpha^*, b) \in \mathbb{R}^{2m+1}} \frac{1}{m}\big(e^T\alpha + e^T\alpha^*\big) + \frac{1}{C}\big(e^T\xi + e^T\xi^*\big), \quad \text{s.t. } \begin{cases} K(\alpha - \alpha^*) + be - y \le \varepsilon e + \xi, \\ y - K(\alpha - \alpha^*) - be \le \varepsilon e + \xi^*, \\ \alpha, \alpha^*, \xi, \xi^* \ge 0. \end{cases} \tag{7}$$

Remark 1. There are many methods and algorithms for solving QP-SVR (6) (see [20–23]), but few for LP-SVR (7). Unlike QP-SVR, the LP-SVR has the following characteristics:

(i) The 1-norm is less sensitive to outliers, such as those occurring when the underlying data distributions have pronounced tails; hence LP-SVR (7) has an effect similar to that of robust regression (see [24], pp. 82–87);

(ii) The transformation into its dual does not give any improvement in the structure of the optimization problem. Hence it is best to minimize the empirical risk directly, which can be achieved by a linear optimizer (see [16], p. 210).
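To make (7) concrete, the sketch below instantiates it on a small one-dimensional toy regression problem with an assumed Gaussian kernel and solves it with scipy's linprog. The data, kernel width, $\varepsilon$ and $C$ are illustrative choices, not from the paper; note that a small $C$ means a heavy slack penalty $1/C$ in this parameterization.

```python
# Toy sketch of LP-SVR (7): variables x = [alpha, alpha*, xi, xi*, b],
# all nonnegative except b. Kernel, data, eps and C are assumptions.
import numpy as np
from scipy.optimize import linprog

x = np.linspace(0.0, 3.0, 10)
y = np.sin(x)
m = len(x)
Kmat = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * 0.3 ** 2))
e, I = np.ones(m), np.eye(m)
eps, C = 0.1, 0.01                        # 1/C = 100: slacks heavily penalized

c = np.concatenate([e / m, e / m, e / C, e / C, [0.0]])
# K(alpha - alpha*) + b e - y <= eps e + xi
top = np.hstack([Kmat, -Kmat, -I, np.zeros((m, m)), e[:, None]])
# y - K(alpha - alpha*) - b e <= eps e + xi*
bot = np.hstack([-Kmat, Kmat, np.zeros((m, m)), -I, -e[:, None]])
res = linprog(c, A_ub=np.vstack([top, bot]),
              b_ub=np.concatenate([y + eps * e, eps * e - y]),
              bounds=[(0, None)] * (4 * m) + [(None, None)])
al, als, b = res.x[:m], res.x[m:2 * m], res.x[-1]
f_z = Kmat @ (al - als) + b               # decision function at the training points
resid = np.max(np.abs(f_z - y))           # stays close to the eps-tube by construction
```

With the slack penalty dominating, the solution keeps the training residuals essentially inside the $\varepsilon$-tube.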
3. Learning error estimates
If $(\alpha, \alpha^*, b)$ solves the optimization problem (7), then the decision-making function of LP-SVR is
$$f_z = (\alpha - \alpha^*)^T K(A, x) + b.$$
Set the empirical error as
$$\mathcal{E}_z(f) = \frac{1}{m}\sum_{i=1}^m (y_i - f(x_i))^2.$$
Then the LP-SVR scheme (7) can be written as
$$f_z = \arg\min_{f = f^* + b \in \mathcal{H}_K + \mathbb{R}} \{\mathcal{E}_z(f) + \lambda\Omega(f^*)\}, \tag{8}$$
where we have denoted $\Omega(f^*) = \|\alpha\|_{\ell^1} = \sum_{i=1}^m \alpha_i$ for $f^* = \sum_{i=1}^m \alpha_i K_{x_i}$ with $\alpha_i \ge 0$. We focus on the error between the functions $f_z$ and $f_\rho$, i.e.,
$$\mathcal{E}(f_z) - \mathcal{E}(f_\rho). \tag{9}$$
Our main goal is to estimate the error (9) for the least-square regression algorithm (8) by means of properties of $\rho$ and $K$. From [15] we see that the penalty term is not a Hilbert space norm, which raises technical difficulties for the mathematical analysis. Since the solution $f_z$ of the LP-SVM has a representation similar to that of the QP-SVM, we can estimate the error for the former using the stepping stone
$$\Omega(f_z^*) \le C\mathcal{E}_z(f_z) + \|f_z\|^2.$$
In the last section we obtained the solution $f_z$. Next we analyze the excess generalization error $\mathcal{E}(f_z) - \mathcal{E}(f_\rho)$. Denote the regularization error as
$$\widetilde{D}(C) = \inf_{f = f^* + b \in \mathcal{H}_K + \mathbb{R}} \Big\{ \mathcal{E}(f) - \mathcal{E}(f_\rho) + \frac{1}{C}\Omega(f^*) \Big\}, \quad C > 0.$$
Let $\widetilde{f}_{K,C} = \arg\min_{f \in \mathcal{H}_K} \widetilde{D}(C)$. To estimate $\mathcal{E}(f_z) - \mathcal{E}(f_\rho)$, we introduce the regularizing function $f_{K,C} \in \mathcal{H}_K$, which depends on $K$ and $C$ and is defined by
$$f_{K,C} = \arg\min_{f = f^* + b \in \mathcal{H}_K + \mathbb{R}} \Big\{ \mathcal{E}(f) + \frac{1}{C}\Omega(f^*) \Big\}.$$
The regularization error for the regularizing function $f_{K,C}$ is defined as
$$D(C) = \mathcal{E}(f_{K,C}) - \mathcal{E}(f_\rho) + \frac{1}{C}\Omega(f_{K,C}^*).$$
The following result is an estimate of the error $\mathcal{E}(f_z) - \mathcal{E}(f_\rho)$.

Theorem 2.1. Assume $C > 0$ and $f_{K,C} \in \mathcal{H}_K$. Then there holds
$$\mathcal{E}(f_z) - \mathcal{E}(f_\rho) \le S(m, C) + D(C),$$
where $S(m, C)$ is the sample error defined by
$$S(m, C) = \mathcal{E}(f_z) - \mathcal{E}_z(f_z) + \mathcal{E}_z(f_{K,C}) - \mathcal{E}(f_{K,C}).$$

Proof. We see from the definition of $f_\rho$ that
$$\mathcal{E}(f_z) - \mathcal{E}(f_\rho) \le \mathcal{E}(f_z) - \mathcal{E}(f_\rho) + \frac{1}{C}\Omega(f_z^*)$$
and
$$\mathcal{E}(f_z) - \mathcal{E}(f_\rho) + \frac{1}{C}\Omega(f_z^*) = \big[\mathcal{E}(f_z) - \mathcal{E}_z(f_z)\big] + \big[\mathcal{E}_z(f_{K,C}) - \mathcal{E}(f_{K,C})\big] + \Big\{\mathcal{E}_z(f_z) + \frac{1}{C}\Omega(f_z^*) - \Big(\mathcal{E}_z(f_{K,C}) + \frac{1}{C}\Omega(f_{K,C}^*)\Big)\Big\} + \Big\{\mathcal{E}(f_{K,C}) - \mathcal{E}(f_\rho) + \frac{1}{C}\Omega(f_{K,C}^*)\Big\}.$$
From the definition of $f_z$ it follows that
$$\mathcal{E}_z(f_z) + \frac{1}{C}\Omega(f_z^*) - \Big(\mathcal{E}_z(f_{K,C}) + \frac{1}{C}\Omega(f_{K,C}^*)\Big) \le 0.$$
This enables us to get
$$\mathcal{E}(f_z) - \mathcal{E}(f_\rho) + \frac{1}{C}\Omega(f_z^*) \le \big(\mathcal{E}(f_z) - \mathcal{E}_z(f_z)\big) + \big(\mathcal{E}_z(f_{K,C}) - \mathcal{E}(f_{K,C})\big) + \mathcal{E}(f_{K,C}) - \mathcal{E}(f_\rho) + \frac{1}{C}\Omega(f_{K,C}^*).$$
This finishes the proof of Theorem 2.1. □

We now give some definitions. Definitions 2.1 and 2.2 can be found in [25–28], and Definitions 2.3 and 2.4 can be found in [15,28].
Definition 2.1. For a subset $\mathcal{F}$ of a metric space and $\eta > 0$, the covering number $\mathcal{N}(\mathcal{F}, \eta)$ is defined to be the minimal integer $l \in \mathbb{N}$ such that there exist $l$ disks with radius $\eta$ covering $\mathcal{F}$. Let $B_R = \{f \in \mathcal{H}_K : \|f\|_K \le R\}$; it is a subset of $C(X)$. We denote the covering number of the unit ball $B_1$ by $\mathcal{N}(\varepsilon) = \mathcal{N}(B_1, \varepsilon)$ for $\varepsilon > 0$.
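Definition 2.1 can be made concrete with a small computation: the greedy procedure below produces an upper bound on the covering number of a finite point set on the line. The point set and radius are illustrative assumptions; a greedy net is not minimal in general, so the result only bounds $\mathcal{N}(\mathcal{F}, \eta)$ from above.

```python
# Toy illustration of Definition 2.1: upper-bounding a covering number
# by greedily picking centers for a finite set of points on the line.
import numpy as np

def greedy_covering_number(points, eta):
    """Return the size of a greedy eta-net: centers are chosen until every
    point lies within distance eta of some chosen center. This is an
    upper bound on the true covering number N(F, eta)."""
    uncovered = list(points)
    centers = []
    while uncovered:
        c = uncovered[0]                  # take any uncovered point as a center
        centers.append(c)
        uncovered = [p for p in uncovered if abs(p - c) > eta]
    return len(centers)

F = np.linspace(0.0, 1.0, 101)            # F = {0, 0.01, ..., 1}
n_cover = greedy_covering_number(F, 0.1)
```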
Definition 2.2. We say that the RKHS associated with the Mercer kernel $K$ has polynomial complexity exponent $s > 0$ if
$$\log\mathcal{N}(\varepsilon) \le C_0(1/\varepsilon)^s, \quad \forall \varepsilon > 0. \tag{10}$$

Definition 2.3. We say that the probability measure $\rho$ can be approximated by $\mathcal{H}_K$ with exponent $0 < \beta \le 1$ if there exists a constant $c_\beta$ such that
$$\widetilde{D}(C) \le c_\beta C^{-\beta}, \quad \forall C > 0. \tag{11}$$

Definition 2.4. The projection operator $\pi$ is defined on the space of measurable functions $f : X \to \mathbb{R}$ as
$$\pi(f)(x) = \begin{cases} M, & \text{if } f(x) > M, \\ -M, & \text{if } f(x) < -M, \\ f(x), & \text{if } -M \le f(x) \le M. \end{cases}$$

The following theorem gives the bound for deterministic distributions, i.e., $\mathcal{E}(f_\rho) = 0$.
In order to prove the result, we need the following ratio probability inequalities and an estimate of the covering number. These results are standard in learning theory and can be found in [7,15,27–29], among others.

Bernstein inequality. Let $\xi$ be a random variable on $Z$ satisfying $E(\xi) = \mu$ and $\sigma^2(\xi) = \sigma^2$. If $|\xi - \mu| \le M$, then for every $\varepsilon > 0$ there holds
$$\mathrm{Prob}\bigg\{\Big|\mu - \frac{1}{m}\sum_{i=1}^m \xi(z_i)\Big| > \varepsilon\bigg\} \le 2\exp\bigg\{-\frac{m\varepsilon^2}{2(\sigma^2 + \frac{1}{3}M\varepsilon)}\bigg\}.$$

Lemma 2.1. Let $\xi$ be a random variable on $Z$ satisfying $E(\xi) = \mu \ge 0$, $|\xi - \mu| \le M$ almost everywhere, and $\sigma^2 \le c\mu^\tau$, $0 \le \tau \le 2$. Then for any $\varepsilon > 0$ there holds
$$\mathrm{Prob}\bigg\{\frac{\mu - \frac{1}{m}\sum_{i=1}^m \xi(z_i)}{\sqrt{\mu^\tau + \varepsilon^\tau}} > \varepsilon^{1-\tau/2}\bigg\} \le \exp\bigg\{-\frac{m\varepsilon^{2-\tau}}{2(c + \frac{1}{3}M\varepsilon^{1-\tau})}\bigg\}.$$

Lemma 2.2. Let $0 \le \tau \le 1$, $M > 0$, $c > 0$, and let $\mathcal{G}$ be a set of functions on $Z$ such that for every $g \in \mathcal{G}$, $Eg \ge 0$, $|g - Eg| \le M$, and $E(g^2) \le c(Eg)^\tau$. Then for $\varepsilon > 0$,
$$\mathrm{Prob}\bigg\{\sup_{g \in \mathcal{G}} \frac{Eg - \frac{1}{m}\sum_{i=1}^m g(z_i)}{\sqrt{(Eg)^\tau + \varepsilon^\tau}} > 4\varepsilon^{1-\tau/2}\bigg\} \le \mathcal{N}(\mathcal{G}, \varepsilon)\exp\bigg\{-\frac{m\varepsilon^{2-\tau}}{2(c + \frac{1}{3}M\varepsilon^{1-\tau})}\bigg\},$$
where $Eg = \int_Z g(z)\, d\rho$.
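The Bernstein inequality above can be sanity-checked by simulation. The sketch below estimates the left-hand probability by Monte Carlo for a bounded random variable; the uniform distribution and all parameters are assumptions made for illustration.

```python
# Monte Carlo check of the Bernstein inequality for a bounded variable:
# xi ~ Uniform[0, 1], so mu = 1/2, sigma^2 = 1/12, and |xi - mu| <= M = 1/2.
import numpy as np

rng = np.random.default_rng(4)
m, trials, eps = 50, 20000, 0.1
mu, sigma2, M = 0.5, 1.0 / 12.0, 0.5

samples = rng.uniform(0.0, 1.0, (trials, m))
deviations = np.abs(mu - samples.mean(axis=1))
empirical = np.mean(deviations > eps)     # Monte Carlo estimate of the LHS
bound = 2 * np.exp(-m * eps ** 2 / (2 * (sigma2 + M * eps / 3)))
```

For these parameters the empirical tail probability is far below the Bernstein bound, as the inequality guarantees.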
Denote the function set $\mathcal{G}_R$ as
$$\mathcal{G}_R = \{(f(x) - y)^2 - (f_\rho(x) - y)^2 : f \in B_R\}, \quad R > 0. \tag{12}$$

Lemma 2.3. Let $\mathcal{G}_R$ be defined by (12). If (10) holds, then there exists a constant $c_s' > 0$ such that
$$\log\mathcal{N}(\mathcal{G}_R, \varepsilon) \le c_s'\Big(\frac{R}{\varepsilon}\Big)^s.$$
From Lemmas 2.2 and 2.3 we have the following corollary.

Corollary. Let $\mathcal{G}_R$ be defined by (12) and let (10) hold. Then for every $0 < \delta < 1$, with confidence at least $1 - \delta$, there holds
$$\{\mathcal{E}(f) - \mathcal{E}(f_\rho)\} - \{\mathcal{E}_z(f) - \mathcal{E}_z(f_\rho)\} \le 4\sqrt{\varepsilon_{m,R}}\sqrt{(\mathcal{E}(f) - \mathcal{E}(f_\rho)) + \varepsilon_{m,R}}$$
for all $f \in B_R$, where $\varepsilon_{m,R}$ is given by
$$\varepsilon_{m,R} = 8\Big(c_R + \frac{1}{3}\Big)\bigg(\Big(\frac{c_s' R^s}{m}\Big)^{\frac{1}{1+s}} + \frac{\log(1/\delta)}{m}\bigg), \quad c_R = (\kappa R + 3M)^2.$$

Proof. Consider the set $\mathcal{G}_R$. Each function $g \in \mathcal{G}_R$ has the form $g(z) = (f(x) - y)^2 - (f_\rho(x) - y)^2$ with $f \in B_R$. Hence
$$Eg = \mathcal{E}(f) - \mathcal{E}(f_\rho) \ge 0, \qquad \frac{1}{m}\sum_{i=1}^m g(z_i) = \mathcal{E}_z(f) - \mathcal{E}_z(f_\rho),$$
and
$$g(z) = (f(x) - f_\rho(x))\big((f(x) - y) + (f_\rho(x) - y)\big).$$
Since $\|f\|_\infty \le \kappa\|f\|_K \le \kappa R$ and $|f_\rho(x)| \le M$ almost everywhere, we find that
$$|g(z)| \le (\kappa R + M)(\kappa R + 3M) \le c_R = (\kappa R + 3M)^2.$$
So we have $|g(z) - Eg| \le B = 2c_R$ almost everywhere. Also,
$$Eg^2 = E\big((f(x) - f_\rho(x))^2((f(x) - y) + (f_\rho(x) - y))^2\big) \le (\kappa R + 3M)^2(\mathcal{E}(f) - \mathcal{E}(f_\rho)).$$
Thus $Eg^2 \le c_R Eg$ for each $g \in \mathcal{G}_R$.

Applying Lemma 2.2 with $\tau = 1$, we deduce that
$$\sup_{f \in B_R} \frac{Eg - \frac{1}{m}\sum_{i=1}^m g(z_i)}{\sqrt{Eg + \varepsilon}} \le 4\sqrt{\varepsilon}$$
with confidence at least
$$1 - \mathcal{N}(\mathcal{G}_R, \varepsilon)\exp\bigg\{-\frac{m\varepsilon}{2(c_R + \frac{1}{3}B)}\bigg\} = 1 - \delta.$$
So we can see that
$$\{\mathcal{E}(f) - \mathcal{E}(f_\rho)\} - \{\mathcal{E}_z(f) - \mathcal{E}_z(f_\rho)\} \le 4\sqrt{\varepsilon_{m,R}}\sqrt{(\mathcal{E}(f) - \mathcal{E}(f_\rho)) + \varepsilon_{m,R}}.$$
We see from Corollary 5.1 of [15] (with $p = \infty$, $\tau = 1$) that
$$\varepsilon_{m,R} = 8\Big(c_R + \frac{1}{3}\Big)\bigg(\Big(\frac{c_s' R^s}{m}\Big)^{\frac{1}{1+s}} + \frac{\log(1/\delta)}{m}\bigg).$$
This enables us to end the proof of the Corollary. □

Theorem 2.2. Suppose $\mathcal{E}(f_\rho) = 0$. If $f_{K,C}$ is a function in $\mathcal{H}_K$ satisfying $\|(y - f_{K,C})^2\|_\infty \le M$, then for every $0 < \delta < 1$, with confidence
$1 - \delta$ there holds
$$\mathcal{E}(f_z) \le 17\varepsilon_{m,C} + \frac{20M\log(2/\delta)}{3m} + 8D(C),$$
where $\varepsilon_{m,C}$ is given by
$$\varepsilon_{m,C} = 8\Bigg(\Big(\kappa^2 C\Big(\frac{5M\log(2/\delta)}{3m} + 2D(C)\Big) + 3M\Big)^2 + \frac{1}{3}\Bigg)\Bigg(\bigg(\frac{c_s'\kappa^s C^s\big(\frac{5M\log(2/\delta)}{3m} + 2D(C)\big)^s}{m}\bigg)^{\frac{1}{1+s}} + \frac{\log(1/\delta)}{m}\Bigg).$$

Proof. Since $\mathcal{E}(f_\rho) = 0$, we have $(y - f_\rho(x))^2 = 0$ almost everywhere. We first consider the random variable $\xi = (y - f_{K,C})^2$. It satisfies
$$\sigma^2(\xi) \le E\xi^2 \le ME\xi \le MD(C).$$
Applying the Bernstein inequality to $\xi$, we see by solving the equation $\frac{m\varepsilon^2}{2(\sigma^2 + M\varepsilon/3)} = \log(2/\delta)$ that with confidence $1 - \delta/2$,
$$\mathcal{E}_z(f_{K,C}) - \mathcal{E}(f_{K,C}) \le \frac{2M\log(2/\delta)}{3m} + \sqrt{\frac{2\sigma^2(\xi)\log(2/\delta)}{m}} \le \frac{5M\log(2/\delta)}{3m} + D(C).$$
Next we estimate $\mathcal{E}(f_z) - \mathcal{E}_z(f_z)$. By the definition of $f_z$, there holds
$$\frac{1}{C}\Omega(f_z^*) \le \mathcal{E}_z(f_z) + \frac{1}{C}\Omega(f_z^*) \le \mathcal{E}_z(f_{K,C}) + \frac{1}{C}\Omega(f_{K,C}^*),$$
and $\mathcal{E}(f_\rho) = 0$, $D(C) = \mathcal{E}(f_{K,C}) + \frac{1}{C}\Omega(f_{K,C}^*)$. It follows that
$$\frac{1}{C}\Omega(f_z^*) \le \mathcal{E}_z(f_{K,C}) - \mathcal{E}(f_{K,C}) + D(C).$$
Recalling $f_z \in \mathcal{H}_K$, we see from the reproducing property of the kernel that
$$\|f_z^*\|_K = \bigg(\sum_{i,j=1}^m \alpha_i\alpha_j K(x_i, x_j)\bigg)^{1/2} \le \kappa\bigg(\sum_{i,j=1}^m \alpha_i\alpha_j\bigg)^{1/2} = \kappa\,\Omega(f_z^*) \le R = \kappa C\Big(\frac{5M\log(2/\delta)}{3m} + 2D(C)\Big).$$
The Corollary with $R$ given as above implies that
$$\mathcal{E}(f_z) - \mathcal{E}_z(f_z) \le 4\sqrt{\varepsilon_{m,C}}\sqrt{\mathcal{E}(f_z) + \varepsilon_{m,C}}$$
with confidence $1 - \delta$, where $\varepsilon_{m,C}$ is defined in the statement. Putting the above two estimates into Theorem 2.1, there holds
$$\mathcal{E}(f_z) \le 4\sqrt{\varepsilon_{m,C}}\sqrt{\mathcal{E}(f_z) + \varepsilon_{m,C}} + \frac{10M\log(2/\delta)}{3m} + 4D(C).$$
Solving the quadratic inequality for $\sqrt{\mathcal{E}(f_z) + \varepsilon_{m,C}}$ leads to
$$\mathcal{E}(f_z) \le 17\varepsilon_{m,C} + \frac{20M\log(2/\delta)}{3m} + 8D(C).$$
The proof of Theorem 2.2 is completed. □
The next result concerns general distributions satisfying the Tsybakov condition (see [30]).

Theorem 2.3. Assume the hypotheses (10) and (11) with $0 < s < 1$ and $0 < \beta \le 1$. Take $t > 1$. For every $\varepsilon > 0$ and every $0 < \delta < 1$ there exist two constants, $c_s$ depending on $s$ and $c_s''$ depending on $s$ and $\beta$, such that with confidence $1 - \delta$,
$$\mathcal{E}(f_z) - \mathcal{E}(f_\rho) \le 2c_s t\Big(\frac{R^s}{m}\Big)^{\frac{1}{1+s}} + 2tc_s''\Big(\frac{C^{2(1-\beta)}}{m} + \frac{1}{m} + \frac{C^{1-\beta}}{m} + C^{-\beta}\Big).$$
Proof. Denote $D_z = \mathcal{E}(f_z) - \mathcal{E}(f_\rho) + \frac{1}{C}\Omega(f_z^*)$. Then we have $\Omega(f_z^*) \le CD_z$. Theorem 2.1 tells us that
$$D_z \le S(m, C) + D(C).$$
Take $f_{K,C} = \widetilde{f}_{K,C}$. By assumption (11),
$$D_z \le S(m, C) + c_\beta C^{-\beta}.$$
Recalling the expression for $S(m, C)$, we have
$$S(m, C) = \{(\mathcal{E}(f_z) - \mathcal{E}(f_\rho)) - (\mathcal{E}_z(f_z) - \mathcal{E}_z(f_\rho))\} + \{(\mathcal{E}_z(f_{K,C}) - \mathcal{E}_z(f_\rho)) - (\mathcal{E}(f_{K,C}) - \mathcal{E}(f_\rho))\} = S_1 + S_2.$$
Take $t \ge 1$, $C \ge 1$. For $S_1$, we apply the Corollary with $\delta = e^{-t} \le 1/e$, and find that there is a set $V_R^{(1)} \subset Z^m$ of measure at most $\delta = e^{-t}$ such that
$$S_1 \le c_s t\bigg\{\Big(\frac{R^s}{m}\Big)^{\frac{1}{1+s}} + \Big(\frac{R^s}{m}\Big)^{\frac{1}{2(1+s)}} D_z^{1/2}\bigg\},$$
where $c_s = 32\big((\kappa R + 3M)^2 + \frac{1}{3}\big)(c_s' + 1)$ is a constant depending only on $s$.

To estimate $S_2$, consider $\xi = (f_{K,C} - y)^2 - (f_\rho - y)^2$ on $(Z, \rho)$. From (4) it follows that
$$\|f_{K,C}\|_\infty \le \kappa\|f_{K,C}\|_K \le \kappa^2\Omega(f_{K,C}^*) \le \kappa^2 CD(C).$$
Write $\xi = \xi_1 + \xi_2$, where
$$\xi_1 = (f_{K,C} - y)^2 - (\pi(f_{K,C}) - y)^2, \qquad \xi_2 = (\pi(f_{K,C}) - y)^2 - (f_\rho - y)^2.$$
It is easy to check that $0 \le \xi_1 \le (\kappa^2 CD(C) + 3M)^2 = B_C$, and $\sigma^2(\xi_1)$ is bounded by $(\kappa^2 CD(C) + 3M)^2 E(\xi_1)$. By the Bernstein inequality,
$$\mathrm{Prob}\bigg\{\frac{1}{m}\sum_{i=1}^m \xi_1(z_i) - E\xi_1 > \varepsilon\bigg\} \le \exp\bigg\{-\frac{m\varepsilon^2}{2(\sigma^2(\xi_1) + \frac{1}{3}B_C\varepsilon)}\bigg\}.$$
Solving the quadratic equation
$$\frac{m\varepsilon^2}{2(\sigma^2(\xi_1) + \frac{1}{3}B_C\varepsilon)} = t$$
for $\varepsilon$ (with $\delta = e^{-t}$), we obtain a set $V_R^{(2)} \subset Z^m$ of measure at most $\delta = e^{-t}$ such that for every $z \in Z^m \setminus V_R^{(2)}$,
$$\frac{1}{m}\sum_{i=1}^m \xi_1(z_i) - E\xi_1 \le \frac{\frac{1}{3}B_C t + \sqrt{\big(\frac{1}{3}B_C t\big)^2 + 2m\sigma^2(\xi_1)t}}{m} \le \frac{2B_C t}{3m} + \sqrt{\frac{2t\sigma^2(\xi_1)}{m}}.$$
But the fact $0 \le \xi_1 \le B_C$ implies $\sigma^2(\xi_1) \le B_C E(\xi_1)$. Therefore
$$\frac{1}{m}\sum_{i=1}^m \xi_1(z_i) - E\xi_1 \le \frac{7B_C t}{6m} + E(\xi_1), \quad \forall z \in Z^m \setminus V_R^{(2)}.$$
Next we consider $\xi_2$. Since both $\pi(f_{K,C})$ and $f_\rho$ take values in $[-M, M]$, $\xi_2$ is a random variable satisfying $|\xi_2| \le B$. Applying the Bernstein inequality as above, we know that there exists another subset $V_R^{(3)} \subset Z^m$ of measure at most $\delta = e^{-t}$ such that for every $z \in Z^m \setminus V_R^{(3)}$ there holds
$$\frac{1}{m}\sum_{i=1}^m \xi_2(z_i) - E\xi_2 \le \frac{2Bt}{3m} + \sqrt{\frac{2t\sigma^2(\xi_2)}{m}}.$$
By the fact $\sigma^2(\xi_2) \le BE(\xi_2)$, we have
$$\frac{1}{m}\sum_{i=1}^m \xi_2(z_i) - E\xi_2 \le \frac{7Bt}{6m} + E(\xi_2), \quad \forall z \in Z^m \setminus V_R^{(3)}.$$
Combining the above two estimates for $\xi_1$ and $\xi_2$ with the fact $E\xi_1 + E\xi_2 = E\xi \le D(C) \le c_\beta C^{-\beta}$, we conclude that
$$S_2 \le \frac{7B_C t + 7Bt}{6m} + D(C) \le c_s'' t\Big(\frac{C^{2(1-\beta)}}{m} + \frac{1}{m} + \frac{C^{1-\beta}}{m} + C^{-\beta}\Big), \quad \forall z \in Z^m \setminus V_R^{(2)} \setminus V_R^{(3)},$$
where $c_s'' = \frac{7\kappa^4 c_\beta^2 + 42M\kappa^2 c_\beta + 63M^2 + 7B}{6} + c_\beta$ is a constant depending only on $s$ and $\beta$.

Putting the above estimates for $S_1$ and $S_2$ together, we find that for every $z \in Z^m \setminus V_R^{(1)} \setminus V_R^{(2)} \setminus V_R^{(3)}$ there holds
$$D_z \le 2c_s t\Big(\frac{R^s}{m}\Big)^{\frac{1}{1+s}} + 2tc_s''\Big(\frac{C^{2(1-\beta)}}{m} + \frac{1}{m} + \frac{C^{1-\beta}}{m} + C^{-\beta}\Big).$$
Here we have used the following elementary inequality: if $a, b > 0$ and $0 < \alpha < 1$, then
$$x \le ax^\alpha + b, \; x > 0 \;\Longrightarrow\; x \le \max\{(2a)^{1/(1-\alpha)}, 2b\}.$$
The proof of Theorem 2.3 is complete. □
4. Conclusions
In reproducing kernel Hilbert spaces, using the covering number, a new upper bound for the learning error of linear programming support vector regression has been presented in this paper. The error bound takes the form
$$B(m, C) = 2c_s t\Big(\frac{R^s}{m}\Big)^{\frac{1}{1+s}} + 2tc_s''\Big(\frac{C^{2(1-\beta)}}{m} + \frac{1}{m} + \frac{C^{1-\beta}}{m} + C^{-\beta}\Big).$$
Here $m$ is the number of sample points, $C$ is the trade-off factor controlling the regularization term in the LP-SVR model (7), $s$ is the polynomial complexity exponent of the given reproducing kernel Hilbert space, $c_s$ and $c_s''$ are two constants which depend on $s$ and can be estimated from Definition 2.3 and Lemma 2.3, $t > 1$ is a given constant, and $0 < \beta \le 1$ is also a given constant.
Moreover,
$$\lim_{m \to +\infty} B(m, C) = 2tc_s'' C^{-\beta}.$$
This means that the gap cannot vanish, no matter how the sample data points are selected.
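This limiting behaviour can be seen numerically. The sketch below evaluates $B(m, C)$ for illustrative constant choices (the values of $c_s$, $c_s''$, $R$, $s$, $\beta$ and $t$ are assumptions, since the true constants depend on the kernel and distribution): as $m$ grows, every term except $2tc_s''C^{-\beta}$ vanishes.

```python
# Numeric sketch of the bound B(m, C) from the conclusion.
# All constants below are assumed illustrative values.
def B(m, C, t=2.0, c1=1.0, c2=1.0, R=1.0, s=0.5, beta=0.5):
    # First term: sample-error part, vanishing as m -> infinity.
    term1 = 2 * c1 * t * (R ** s / m) ** (1 / (1 + s))
    # Second term: regularization part; only C**(-beta) survives as m -> infinity.
    term2 = 2 * t * c2 * (C ** (2 * (1 - beta)) / m + 1 / m
                          + C ** (1 - beta) / m + C ** (-beta))
    return term1 + term2

C = 4.0
limit = 2 * 2.0 * 1.0 * C ** (-0.5)     # 2 t c2 C^{-beta} with the defaults above
gap_large_m = B(10 ** 8, C) - limit     # tiny: the bound approaches its limit
```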
Due to the difficulty of calculating the covering number, little can currently be done on the experimental side. The authors believe this will be a very challenging direction for future work.
Acknowledgements
The authors thank the referees for their careful reading of the manuscript, which improved its technical quality and presentation. Their very detailed comments and extremely important suggestions made this paper more readable and understandable.
References
[1] V. Vapnik, A. Lerner, Pattern recognition using generalized portrait method, Automat. Rem. Contr. 24 (1963) 774–780.
[2] M.A. Aizerman, E.M. Braverman, L.I. Rozonoer, Theoretical foundations of the potential function method in pattern recognition learning, Automat. Rem. Contr. 25 (1964) 821–837.
[3] C. Cortes, V. Vapnik, Support-vector networks, Mach. Learn. 20 (1995) 273–297.
[4] V. Vapnik, Statistical Learning Theory, Wiley, New York, 1998.
[5] I. Steinwart, Support vector machines are universally consistent, J. Complexity 18 (2002) 768–791.
[6] T. Zhang, Statistical behavior and consistency of classification methods based on convex risk minimization, Ann. Statist. 32 (2004) 56–85.
[7] Q. Wu, D.X. Zhou, Analysis of support vector machine classification, J. Comput. Anal. Appl. 8 (2006) 108–134.
[8] H.B. Zhao, S.D. Yin, Geomechanical parameters identification by particle swarm optimization and support vector machine, Appl. Math. Model. 33 (2009) 3997–4012.
[9] W.C. Hong, Electric load forecasting by support vector model, Appl. Math. Model. 33 (2009) 2444–2454.
[10] Q. Wu, Y.M. Ying, D.X. Zhou, Multi-kernel regularized classifiers, J. Complexity 23 (2007) 108–134.
[11] P.S. Bradley, O.L. Mangasarian, Massive data discrimination via linear support vector machines, Optim. Methods Softw. 13 (2000) 1–10.
[12] V. Kecman, I. Hadaic, Support vector selection by linear programming, Proc. of IJCNN 5 (2000) 193–198.
[13] P. Niyogi, F. Girosi, On the relationship between generalization error, hypothesis complexity, and sample complexity for radial basis functions, Neural Comput. 8 (1996) 819–842.
[14] J.P. Pedroso, N. Murata, Support vector machines with different norms: motivation, formulations and results, Pattern Recogn. Lett. 22 (2001) 1263–1272.
[15] Q. Wu, D.X. Zhou, SVM soft margin classifiers: linear programming versus quadratic programming, Neural Comput. 17 (2005) 1160–1187.
[16] A.J. Smola, B. Schölkopf, A tutorial on support vector regression, Stat. Comput. 14 (2004) 199–222.
[17] G. Liu, Z. Lin, Y. Yu, Multi-output regression on the output manifold, Pattern Recogn. 42 (2009) 2737–2743.
[18] N. Aronszajn, Theory of reproducing kernels, Trans. Amer. Math. Soc. 68 (1950) 337–404.
[19] V. Vapnik, The Nature of Statistical Learning Theory, Springer Verlag, New York, 1995.
[20] J.C. Platt, Fast training of support vector machines using sequential minimal optimization, in: Advances in Kernel Methods-Support Vector Learning, MIT Press, 1999, pp. 185–208.
[21] O.L. Mangasarian, D.R. Musicant, Successive overrelaxation for support vector machines, IEEE Trans. Neural Networks 10 (1999) 1032–1037.
[22] Y.J. Lee, O.L. Mangasarian, SSVM: a smooth support vector machine for classification, Comput. Optim. Appl. 22 (2001) 5–21.
[23] Y. Yuan, J. Yan, C. Xu, Polynomial smooth support vector machine, Chinese J. Comput. 28 (2005) 9–17 (in Chinese).
[24] M.H. Hassoun, Fundamentals of Artificial Neural Networks, MIT Press, Cambridge, MA, 1995.
[25] D.X. Zhou, The covering number in learning theory, J. Complexity 18 (2002) 739–767.
[26] Y. Guo, P.L. Bartlett, J. Shawe-Taylor, R.C. Williamson, Covering numbers for support vector machines, IEEE Trans. Inf. Theory 48 (2002) 239–250.
[27] F. Cucker, S. Smale, On the mathematical foundations of learning theory, Bull. Amer. Math. Soc. 39 (2001) 1–49.
[28] Q. Wu, D.X. Zhou, Learning rates of least-square regularized regression, Found. Comput. Math. 6 (2006) 171–192.
[29] F. Cucker, D.X. Zhou, Learning Theory: An Approximation Theory Viewpoint, Cambridge University Press, Cambridge, 2007.
[30] A.B. Tsybakov, Optimal aggregation of classifiers in statistical learning, Ann. Statist. 32 (2004) 135–166.