
Learning errors of linear programming support vector regression

Feilong Cao ⇑, Yubo Yuan

Institute of Metrology and Computational Science, China Jiliang University, Hangzhou 310018, Zhejiang Province, China

⇑ Corresponding author. Tel.: +86 571 86835737. E-mail address: feilongcao@gmail.com (F. Cao).

The research was supported by the National Natural Science Foundation of China (Nos. 60873206, 61001200) and the Natural Science Foundation of Zhejiang Province of China (No. Y7080235).

Article history: Received 1 June 2009; Received in revised form 18 September 2010; Accepted 4 October 2010; Available online 13 October 2010

Keywords: Regression; Support vector machine; Linear programming; Generalization error; Empirical error

Abstract

In this paper, we give several results on the learning errors of linear programming support vector regression. The corresponding theorems are proved in the reproducing kernel Hilbert space. With the covering number, the approximation property and the capacity of the reproducing kernel Hilbert space are measured. The obtained result (Theorem 2.1) shows that the learning error can be controlled by the sample error and the regularization error. The mentioned sample error is summarized by the errors of the learning regression function and the regularizing function in the reproducing kernel Hilbert space. After estimating the generalization error of the learning regression function (Theorem 2.2), the upper bound (Theorem 2.3) of the regularized learning algorithm associated with linear programming support vector regression is estimated.

Crown Copyright © 2010 Published by Elsevier Inc. All rights reserved.

1. Introduction

The main aim of this paper is the error analysis of the linear programming support vector regression (SVR) problem in learning theory. To this end, the paper is organized as follows. We start by giving a brief introduction to the basic techniques of the support vector machine (SVM) and SVR in Section 1, including the history, motivation, the most important publications and some results related to this paper. A brief review of linear programming support vector regression (LP-SVR) and its quadratic programming counterpart (QP-SVR) in the reproducing kernel Hilbert space is presented in Section 2. The learning error analysis of LP-SVR is given in Section 3. In Section 4, we give a short conclusion and summary of our work and indicate some directions for future research.

First of all, we briefly describe the historical background of the support vector learning algorithm (SVLA). The SVLA is a non-linear generalization of the generalized portrait algorithm developed in the sixties (see [1,2]).

As such, the SVLA is firmly grounded in the framework of statistical learning theory. In 1995, Cortes and Vapnik (see [3]) proposed a learning network named the support-vector network. Originally, it was a learning machine for two-group classification problems. One of its most important ideas is that input vectors are non-linearly mapped to a very high-dimensional feature space, in which a linear decision surface is constructed. They showed that the training data (separable and non-separable) could be separated without errors. After that, this learning network became more and more popular and was renamed SVM.

Today, SVM has become an important subject in learning theory and has evolved into an active area of research. Mathematically, SVM arises from a pattern classification problem: given a classification of $m$ points $(x_1, x_2, \ldots, x_m)$ in the $n$-dimensional space $\mathbb{R}^n$, represented by an $m \times n$ matrix $A = (x_1, x_2, \ldots, x_m)^T$, the membership of each data point $x_i$, $i = 1, 2, \ldots, m$, in the classes $1$ or $-1$ is specified by a given $m \times m$ diagonal matrix $D$ with $1$ or $-1$ on its diagonal.

Primarily, SVM (see [4]) is given by the following quadratic program with linear inequality constraints:
$$\min_{(\alpha, b) \in \mathbb{R}^{n+1}} \ \frac{1}{2}\|\alpha\|^{2} + \frac{1}{2}\|y\|^{2}, \qquad \text{s.t.}\ \ D(A\alpha + be) \geq e + y,\ \ y \geq 0. \tag{1}$$

Model (1) can be seen as the original model of QP-SVM. Here, $\alpha$ is the vector of separator coefficients (the direction vector of the classification hyperplane), $b$ is an offset (the parameter controlling the distance of the hyperplane from the origin), and $e \in \mathbb{R}^{m}$ stands for a vector of ones.

The decision function of classification is given by
$$f(x) = \operatorname{sign}(\alpha^{T}x + b). \tag{2}$$

By now, many different forms of QP-SVM (1) have been introduced for different purposes (see [4]). In this work, we mainly pay attention to the learning error, or the convergence rate, of the proposed algorithm. For the convergence rate of QP-SVM (1), there are many works; we refer the readers to Steinwart [5], Zhang [6], Wu and Zhou [7], Zhao and Yin [8], Hong [9], and Wu et al. [10].

Among the various forms of SVM, the LP-SVM is important because of its linearity and flexibility in large-data settings. Many authors have introduced the LP-SVM; we refer the readers to Bradley and Mangasarian [11], Kecman and Hadaic [12], Niyogi and Girosi [13], Pedroso and Murata [14] and Vapnik [4]. Its primal optimization model is as follows:

$$\min_{(\lambda, y) \in \mathbb{R}^{2m}} \ \frac{1}{m}e^{T}\lambda + \frac{1}{C}e^{T}y, \qquad \text{s.t.}\ \ D(AA^{T}\lambda + be) \geq e - y,\ \ \lambda \geq 0,\ \ y \geq 0. \tag{3}$$

The trade-off factor $C = C(m) > 0$ depends on $m$ and is crucial.

Many experiments demonstrate that LP-SVM is efficient and can even perform better than QP-SVM for some purposes: it is capable of solving problems with huge sample sizes (see [11]), it improves the computational speed (see [14]), and it reduces the number of support vectors (see [12]).

However, little is known about the learning error or the convergence of the LP-SVM. We only find that a classification problem for LP-SVM was studied by Wu and Zhou [15].

The primary goal of this paper is to investigate the regression problem and to provide an error analysis for linear programming support vector regression (LP-SVR).

2. LP-SVR and QP-SVR in the reproducing kernel Hilbert space

For the regression problem, SVR can be seen as an SVLA with continuous output, i.e., a support vector machine for function estimation. In 2004, Smola and Schölkopf [16] gave an overview of the basic ideas underlying SVM for function estimation. They indicated that for SVR it is more difficult to analyze the learning error and the convergence. This is the motivation of this work.

Let $(X, d)$ be a compact metric space and let $X \subseteq \mathbb{R}^{n}$, $Y = \mathbb{R}$. In this work, we only discuss the single-output regression problem. For the multi-output regression problem, there are difficulties arising from the multivariate output space, and it seems to be a very challenging task to study learning errors for multi-output LP-SVR in reproducing kernel Hilbert spaces. In 2009, Liu et al. [17] treated the output space as a Riemannian submanifold in order to incorporate its geometric structure into the regression process. They proposed a locally linear transformation (LLT) to define loss functions on the output manifold, and gave an algorithm under the framework of SVR.

Let $\rho$ be a probability distribution on $S = X \times Y$. The error (or generalization error) of a function $f: X \to Y$ is defined as
$$\mathcal{E}(f) = \int_{S}\left(y - f(x)\right)^{2} d\rho.$$

The function that minimizes the error is called the regression function. It is given by
$$f_{\rho}(x) = \int_{Y} y \, d\rho(y \mid x), \qquad \forall x \in X,$$
where $\rho(\cdot \mid x)$ is the conditional probability measure at $x$ induced by $\rho$.

The target of the regression problem is to learn the regression function, or to find a good approximation of it, from random samples. The least-square algorithm for the regression problem is a discrete least-square problem associated with a Mercer kernel. Let $K: X \times X \to \mathbb{R}$ be continuous, symmetric and positive semidefinite, i.e., for any finite set of distinct points $\{x_1, x_2, \ldots, x_m\} \subset X$, the matrix $\left(K(x_i, x_j)\right)_{i,j=1}^{m}$ is positive semidefinite (in the rest of this paper, we denote this matrix by $K$). Such a function is called a Mercer kernel.

The reproducing kernel Hilbert space (RKHS) $\mathcal{H}_{K}$ associated with $K$ is defined (see [18]) to be the closure of the linear span of the set of functions $\{K_{x} = K(x, \cdot) : x \in X\}$ with the inner product $\langle \cdot, \cdot \rangle_{\mathcal{H}_{K}} = \langle \cdot, \cdot \rangle_{K}$, i.e.,
$$\mathcal{H}_{K} = \overline{\operatorname{span}\{K_{x} : x \in X\}}.$$
The inner product satisfies $\langle K_{x}, K_{y} \rangle_{K} = K(x, y)$. That is,
$$\Big\langle \sum_{i} \alpha_{i} K_{x_{i}},\ \sum_{j} \beta_{j} K_{y_{j}} \Big\rangle_{K} = \sum_{i,j} \alpha_{i}\beta_{j} K(x_{i}, y_{j}).$$
The reproducing property takes the form
$$\langle K_{x}, f \rangle_{K} = f(x), \qquad \forall x \in X,\ f \in \mathcal{H}_{K}.$$

Denote by $C(X)$ the space of continuous functions on $X$ with the norm $\|\cdot\|_{\infty}$. Let $\kappa = \sup_{x \in X}\sqrt{K(x, x)}$. Then the above reproducing property, combined with the Cauchy–Schwarz inequality, tells us that
$$\|f\|_{\infty} \leq \kappa \|f\|_{K}, \qquad \forall f \in \mathcal{H}_{K}. \tag{4}$$
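As a quick numerical illustration of the Mercer condition and of the bound (4), the following Python sketch (illustrative only; the Gaussian kernel, its bandwidth and the sample points are assumptions made for the example, not choices from the paper) builds the kernel matrix, checks that it is positive semidefinite, and compares a grid approximation of $\|f\|_{\infty}$ with $\kappa\|f\|_{K}$ for a function $f = \sum_{i} a_{i}K_{x_{i}}$.

```python
import numpy as np

# Illustrative setup (not from the paper): Gaussian kernel on X = [0, 1].
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=20)      # sample points x_1, ..., x_m
sigma = 0.2                              # assumed kernel bandwidth

def kernel(s, t):
    """Gaussian (Mercer) kernel K(s, t) = exp(-|s - t|^2 / (2 sigma^2))."""
    return np.exp(-(s - t) ** 2 / (2.0 * sigma ** 2))

# Kernel matrix (K(x_i, x_j))_{i,j=1}^m; the Mercer condition makes it positive semidefinite.
K = kernel(x[:, None], x[None, :])
print("smallest eigenvalue of K:", np.linalg.eigvalsh(K).min())   # >= 0 up to round-off

# A function f = sum_i a_i K_{x_i} in the span of the kernel sections.
a = rng.normal(size=x.size)
f_norm_K = np.sqrt(a @ K @ a)            # ||f||_K = sqrt(a^T K a)
kappa = 1.0                              # kappa = sup_x sqrt(K(x, x)) = 1 for this kernel

# Approximate ||f||_inf on a dense grid and verify the bound (4).
grid = np.linspace(0.0, 1.0, 2001)
f_grid = kernel(grid[:, None], x[None, :]) @ a
print("||f||_inf ~", np.abs(f_grid).max(), "<=", kappa * f_norm_K)
```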

Throughout this paper, we assume that for some $M \geq 0$, $\rho(\cdot \mid x)$ is almost everywhere supported on $[-M, M]$, that is, $|y| \leq M$ almost surely (with respect to $\rho$). It follows from the definition of the regression function $f_{\rho}$ that $|f_{\rho}(x)| \leq M$.

Let $z = \{(x_{1}, y_{1}), (x_{2}, y_{2}), \ldots, (x_{m}, y_{m})\} \in (X \times Y)^{m}$ be the sample. In $\varepsilon$-SVR (see [4,16,19]), the primary goal is to find a function that has at most $\varepsilon$ deviation from the actually obtained targets $y_{i}$ for all the training data, and at the same time is as flat as possible. In other words, we do not care about errors as long as they are less than $\varepsilon$, but will not accept any deviation larger than this. This may be important if you want to be sure not to lose more than $\varepsilon$ money when dealing with exchange rates, for instance (see [16], p. 200). The QP-SVR (see [4,16,19]) in the reproducing kernel Hilbert space can be formulated as follows:
$$\min_{(\alpha, \alpha^{*}, b) \in \mathbb{R}^{2m+1}} \ \frac{1}{2}\left(\|\alpha\|^{2} + \|\alpha^{*}\|^{2}\right), \qquad \text{s.t.}\ \ -\varepsilon e \leq K(\alpha - \alpha^{*}) + be - y \leq \varepsilon e. \tag{5}$$

With the slack variables $\xi, \xi^{*} \in \mathbb{R}^{m}$, we arrive at the formulation presented in [4,16,19]:
$$\begin{aligned}
\min_{(\alpha, \alpha^{*}, b) \in \mathbb{R}^{2m+1}} \ & \frac{1}{2}\left(\|\alpha\|^{2} + \|\alpha^{*}\|^{2}\right) + \frac{1}{C}\left(\|\xi\|^{2} + \|\xi^{*}\|^{2}\right), \\
\text{s.t.}\ \ & K(\alpha - \alpha^{*}) + be - y \leq \varepsilon e + \xi, \\
& y - K(\alpha - \alpha^{*}) - be \leq \varepsilon e + \xi^{*}, \\
& \xi, \xi^{*} \geq 0.
\end{aligned} \tag{6}$$
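For orientation, a widely used software counterpart of the $\varepsilon$-insensitive formulation is the SVR estimator in scikit-learn; it solves a closely related quadratic program (with linear rather than squared slack penalties and an RKHS-norm regularizer), so the snippet below is only a hedged illustration of fitting $\varepsilon$-SVR on synthetic data, not an implementation of (6) itself. The data and parameter values are assumptions made for the example.

```python
import numpy as np
from sklearn.svm import SVR

# Synthetic one-dimensional sample z = {(x_i, y_i)} (illustrative only).
rng = np.random.default_rng(1)
X = np.sort(rng.uniform(0.0, 1.0, size=(80, 1)), axis=0)
y = np.sin(2.0 * np.pi * X).ravel() + 0.1 * rng.normal(size=80)

# epsilon is the width of the insensitive tube, C the trade-off factor.
model = SVR(kernel="rbf", C=10.0, epsilon=0.1)
model.fit(X, y)

y_hat = model.predict(X)
print("number of support vectors:", model.support_vectors_.shape[0])
print("training RMSE:", np.sqrt(np.mean((y - y_hat) ** 2)))
```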

Motivated by reducing the number of support vectors of the 1-norm soft margin SVM, Smola and Schölkopf and Vapnik (see [4,16,19]) introduced the LP-SVM algorithm associated with a Mercer kernel $K$. In fact, the LP-SVR is based on the following linear programming optimization problem:

$$\begin{aligned}
\min_{(\alpha, \alpha^{*}, b) \in \mathbb{R}^{2m+1}} \ & \frac{1}{m}\left(e^{T}\alpha + e^{T}\alpha^{*}\right) + \frac{1}{C}\left(e^{T}\xi + e^{T}\xi^{*}\right), \\
\text{s.t.}\ \ & K(\alpha - \alpha^{*}) + be - y \leq \varepsilon e + \xi, \\
& y - K(\alpha - \alpha^{*}) - be \leq \varepsilon e + \xi^{*}, \\
& \alpha, \alpha^{*}, \xi, \xi^{*} \geq 0.
\end{aligned} \tag{7}$$

Remark 1. There are many methods and algorithms for solving QP-SVR (6) (see [20–23]), but few for LP-SVR (7). Unlike QP-SVR, the LP-SVR has the following characteristics:

(i) The 1-norm is less sensitive to outliers such as those occurring when the underlying data distributions have pronounced tails; hence LP-SVR (7) has an effect similar to that of robust regression (see [24], pp. 82–87);
(ii) The transformation into its dual does not give any improvement in the structure of the optimization problem. Hence it is best to minimize the empirical risk directly, which can be achieved by a linear optimizer (see [16], p. 210); a sketch of such a direct linear-programming solution is given below.
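To make point (ii) of Remark 1 concrete, the sketch below assembles problem (7) as a standard-form linear program and passes it to a generic LP solver (SciPy's linprog). It is a minimal illustration under stated assumptions, not the authors' implementation: the data, the Gaussian kernel and the values of $\varepsilon$ and $C$ are chosen only for the example, and the intercept $b$ is treated as a free variable.

```python
import numpy as np
from scipy.optimize import linprog

def lp_svr_fit(K, y, C=10.0, eps=0.1):
    """Solve the LP-SVR problem (7) with a generic LP solver (illustrative sketch).

    Decision variables x = [alpha, alpha_star, b, xi, xi_star], with
    alpha, alpha_star, xi, xi_star >= 0 and the intercept b free.
    """
    m = len(y)
    # Objective: (1/m)(e^T alpha + e^T alpha*) + (1/C)(e^T xi + e^T xi*).
    c = np.concatenate([np.full(2 * m, 1.0 / m), [0.0], np.full(2 * m, 1.0 / C)])

    I, O, e = np.eye(m), np.zeros((m, m)), np.ones((m, 1))
    # K(alpha - alpha*) + b e - y <= eps e + xi
    A1, b1 = np.hstack([K, -K, e, -I, O]), y + eps
    # y - K(alpha - alpha*) - b e <= eps e + xi*
    A2, b2 = np.hstack([-K, K, -e, O, -I]), eps - y

    bounds = [(0, None)] * (2 * m) + [(None, None)] + [(0, None)] * (2 * m)
    res = linprog(c, A_ub=np.vstack([A1, A2]), b_ub=np.concatenate([b1, b2]),
                  bounds=bounds, method="highs")
    return res.x[:m], res.x[m:2 * m], res.x[2 * m]

# Illustrative data and Gaussian kernel matrix (assumed, not taken from the paper).
rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0.0, 1.0, 40))
y = np.cos(2.0 * np.pi * x) + 0.05 * rng.normal(size=x.size)
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2.0 * 0.1 ** 2))

alpha, alpha_star, b = lp_svr_fit(K, y, C=10.0, eps=0.05)
# Decision function of LP-SVR on the training points: f_z = K (alpha - alpha*) + b.
f_z = K @ (alpha - alpha_star) + b
print("empirical RMSE:", np.sqrt(np.mean((y - f_z) ** 2)))
print("nonzero expansion coefficients:", int(np.sum(alpha + alpha_star > 1e-8)))
```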


3. Learning error estimates

If $(\alpha, \alpha^{*}, b)$ solves the optimization problem (7), then the decision function of LP-SVR is
$$f_{z} = (\alpha - \alpha^{*})^{T} K(A, x) + b.$$
Set the empirical error as
$$\mathcal{E}_{z}(f) = \frac{1}{m}\sum_{i=1}^{m}\left(y_{i} - f(x_{i})\right)^{2}.$$

Then the LP-SVR scheme (7) can be written as
$$f_{z} = \arg\min_{f = f^{*} + b \in \mathcal{H}_{K} + \mathbb{R}}\left\{\mathcal{E}_{z}(f) + \lambda\,\Omega(f^{*})\right\}, \tag{8}$$
where we have denoted $\Omega(f^{*}) = \|\alpha\|_{\ell^{1}} = \sum_{i=1}^{m}\alpha_{i}$ for $f^{*} = \sum_{i=1}^{m}\alpha_{i}K_{x_{i}}$ with $\alpha_{i} \geq 0$ (in the sequel the regularization parameter $\lambda$ is taken to be $1/C$).
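For illustration, the small helper below evaluates the penalized empirical risk $\mathcal{E}_{z}(f) + \lambda\,\Omega(f^{*})$ of (8) for a candidate expansion $f = \sum_{i}\alpha_{i}K_{x_{i}} + b$ with nonnegative coefficients (for instance, coefficients returned by a linear-programming solver as in the earlier sketch). All inputs are assumed; this is a sketch of the objective only.

```python
import numpy as np

def penalized_empirical_risk(alpha, b, K, y, lam):
    """E_z(f) + lam * Omega(f*) for f = sum_i alpha_i K_{x_i} + b with alpha_i >= 0.

    E_z(f)    = (1/m) sum_i (y_i - f(x_i))^2   (empirical error)
    Omega(f*) = sum_i alpha_i                   (l^1 penalty on the coefficients)
    """
    f_on_sample = K @ alpha + b           # f(x_1), ..., f(x_m)
    return np.mean((y - f_on_sample) ** 2) + lam * np.sum(alpha)

# Example use (with K and y as in the previous sketch and a nonnegative coefficient vector alpha):
# value = penalized_empirical_risk(alpha, b, K, y, lam=1.0 / 10.0)
```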

We focus on the error between the functions $f_{z}$ and $f_{\rho}$, i.e.,
$$\mathcal{E}(f_{z}) - \mathcal{E}(f_{\rho}). \tag{9}$$
Our main goal is to estimate the error (9) for the least-square regression algorithm (8) by means of properties of $\rho$ and $K$. From [15], we see that the penalty term is not a Hilbert space norm, which raises a technical difficulty for the mathematical analysis. Since the solution $f_{z}$ of the LP-SVM has a representation similar to that of the QP-SVM, we can estimate the error for the former using the stepping stone
$$\Omega(f_{z}^{*}) \leq C\,\mathcal{E}_{z}(f_{z}) + \|f_{z}\|^{2}.$$

In the previous section, we obtained the solution $f_{z}$. Next we will analyze the excess generalization error $\mathcal{E}(f_{z}) - \mathcal{E}(f_{\rho})$. Denote the regularization error as
$$\widetilde{D}(C) = \inf_{f = f^{*} + b \in \mathcal{H}_{K} + \mathbb{R}}\left\{\mathcal{E}(f) - \mathcal{E}(f_{\rho}) + \frac{1}{C}\,\Omega(f^{*})\right\}, \qquad C \geq 0.$$
Let $\widetilde{f}_{K,C} = \arg\min_{f \in \mathcal{H}_{K}}\widetilde{D}(C)$.

To estimate $\mathcal{E}(f_{z}) - \mathcal{E}(f_{\rho})$, we introduce the regularizing function $f_{K,C} \in \mathcal{H}_{K}$, which depends on $K$ and $C$ and is defined by
$$f_{K,C} = \arg\min_{f = f^{*} + b \in \mathcal{H}_{K} + \mathbb{R}}\left\{\mathcal{E}(f) + \frac{1}{C}\,\Omega(f^{*})\right\}.$$
The regularization error for the regularizing function $f_{K,C}$ is defined as
$$D(C) = \mathcal{E}(f_{K,C}) - \mathcal{E}(f_{\rho}) + \frac{1}{C}\,\Omega(f_{K,C}^{*}).$$

The following result is an estimate of the error $\mathcal{E}(f_{z}) - \mathcal{E}(f_{\rho})$.

Theorem 2.1. Assume $C > 0$ and $f_{K,C} \in \mathcal{H}_{K}$. Then there holds
$$\mathcal{E}(f_{z}) - \mathcal{E}(f_{\rho}) \leq S(m, C) + D(C),$$
where $S(m, C)$ is the sample error defined by
$$S(m, C) = \mathcal{E}(f_{z}) - \mathcal{E}_{z}(f_{z}) + \mathcal{E}_{z}(f_{K,C}) - \mathcal{E}(f_{K,C}).$$

Proof. Since $\Omega(f_{z}^{*}) \geq 0$, we see that
$$\mathcal{E}(f_{z}) - \mathcal{E}(f_{\rho}) \leq \mathcal{E}(f_{z}) - \mathcal{E}(f_{\rho}) + \frac{1}{C}\,\Omega(f_{z}^{*})$$
and
$$\begin{aligned}
\mathcal{E}(f_{z}) - \mathcal{E}(f_{\rho}) + \frac{1}{C}\,\Omega(f_{z}^{*}) = {} & \left[\mathcal{E}(f_{z}) - \mathcal{E}_{z}(f_{z})\right] + \left[\mathcal{E}_{z}(f_{K,C}) - \mathcal{E}(f_{K,C})\right] \\
& + \left\{\mathcal{E}_{z}(f_{z}) + \frac{1}{C}\,\Omega(f_{z}^{*}) - \left(\mathcal{E}_{z}(f_{K,C}) + \frac{1}{C}\,\Omega(f_{K,C}^{*})\right)\right\} \\
& + \left\{\mathcal{E}(f_{K,C}) - \mathcal{E}(f_{\rho}) + \frac{1}{C}\,\Omega(f_{K,C}^{*})\right\}.
\end{aligned}$$
From the definition of $f_{z}$ it follows that
$$\mathcal{E}_{z}(f_{z}) + \frac{1}{C}\,\Omega(f_{z}^{*}) - \left(\mathcal{E}_{z}(f_{K,C}) + \frac{1}{C}\,\Omega(f_{K,C}^{*})\right) \leq 0.$$
This enables us to get
$$\mathcal{E}(f_{z}) - \mathcal{E}(f_{\rho}) + \frac{1}{C}\,\Omega(f_{z}^{*}) \leq \left(\mathcal{E}(f_{z}) - \mathcal{E}_{z}(f_{z})\right) + \left(\mathcal{E}_{z}(f_{K,C}) - \mathcal{E}(f_{K,C})\right) + \mathcal{E}(f_{K,C}) - \mathcal{E}(f_{\rho}) + \frac{1}{C}\,\Omega(f_{K,C}^{*}).$$
This finishes the proof of Theorem 2.1. □

We now give some definitions. Definitions 2.1 and 2.2 can be found in [25–28], and Definitions 2.3 and 2.4 can be found in [15,28].

Definition 2.1. For a subset $\mathcal{F}$ of a metric space and $\eta > 0$, the covering number $\mathcal{N}(\mathcal{F}, \eta)$ is defined to be the minimal integer $l \in \mathbb{N}$ such that there exist $l$ disks with radius $\eta$ covering $\mathcal{F}$.

Let $\mathcal{B}_{R} = \{f \in \mathcal{H}_{K} : \|f\|_{K} \leq R\}$. It is a subset of $C(X)$. We denote the covering number of the unit ball $\mathcal{B}_{1}$ as
$$\mathcal{N}(\varepsilon) = \mathcal{N}(\mathcal{B}_{1}, \varepsilon), \qquad \varepsilon > 0.$$

Definition 2.2. We say that the RKHS associated with the Mercer kernel $K$ has polynomial complexity exponent $s > 0$ if
$$\log\mathcal{N}(\varepsilon) \leq C_{0}\left(1/\varepsilon\right)^{s}, \qquad \forall \varepsilon > 0. \tag{10}$$

Definition 2.3. We say that the probability measure $\rho$ can be approximated by $\mathcal{H}_{K}$ with exponent $0 < \beta \leq 1$ if there exists a constant $c_{\beta}$ such that
$$\widetilde{D}(C) \leq c_{\beta}\,C^{-\beta}, \qquad \forall C > 0. \tag{11}$$

Definition 2.4. The projection operator $\pi$ is defined on the space of measurable functions $f: X \to \mathbb{R}$ as
$$\pi(f)(x) = \begin{cases} M, & \text{if } f(x) > M, \\ -M, & \text{if } f(x) < -M, \\ f(x), & \text{if } -M \leq f(x) \leq M. \end{cases}$$
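In code, the projection operator of Definition 2.4 is simply a truncation of function values to the interval $[-M, M]$; a minimal sketch (with an assumed value of $M$):

```python
import numpy as np

def project(f_values, M):
    """Projection operator pi of Definition 2.4: truncate f(x) to [-M, M]."""
    return np.clip(f_values, -M, M)

# Example with an assumed M = 1: values outside [-1, 1] are clipped.
print(project(np.array([-2.0, -0.5, 0.3, 1.7]), M=1.0))   # [-1.  -0.5  0.3  1. ]
```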

The following theorem gives the bound for deterministic distributions, i.e., $\mathcal{E}(f_{\rho}) = 0$.

In order to prove the result, we need the following ratio probability inequalities and an estimate of the covering number. These results are standard in learning theory and can be found in [7,15,27–29], etc.

Bernstein inequality. Let $\xi$ be a random variable on $Z$ satisfying $E(\xi) = \mu$ and $\sigma^{2}(\xi) = \sigma^{2}$. If $|\xi - \mu| \leq M$, then for every $\varepsilon > 0$ there holds
$$\operatorname{Prob}\left\{\left|\mu - \frac{1}{m}\sum_{i=1}^{m}\xi(z_{i})\right| > \varepsilon\right\} \leq 2\exp\left\{-\frac{m\varepsilon^{2}}{2\left(\sigma^{2} + \frac{1}{3}M\varepsilon\right)}\right\}.$$
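The Bernstein inequality can be checked numerically by Monte Carlo simulation. The sketch below estimates the two-sided deviation probability for a bounded random variable and compares it with the right-hand side of the bound; the distribution, $m$ and $\varepsilon$ are chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

# Bounded variable: xi ~ Uniform[0, 1], so mu = 1/2, sigma^2 = 1/12, |xi - mu| <= M = 1/2.
mu, sigma2, M = 0.5, 1.0 / 12.0, 0.5
m, eps, trials = 200, 0.08, 20000

samples = rng.uniform(0.0, 1.0, size=(trials, m))
deviations = np.abs(mu - samples.mean(axis=1))
empirical_prob = np.mean(deviations > eps)

bound = 2.0 * np.exp(-m * eps ** 2 / (2.0 * (sigma2 + M * eps / 3.0)))
print("empirical Prob{|mu - sample mean| > eps}:", empirical_prob)
print("Bernstein bound:                        ", bound)
```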

Lemma 2.1. Let $\xi$ be a random variable on $Z$ satisfying $E(\xi) = \mu \geq 0$, $|\xi - \mu| \leq M$ almost everywhere, and $\sigma^{2} \leq c\mu^{\tau}$ for some $0 \leq \tau \leq 2$. Then for any $\varepsilon > 0$ there holds
$$\operatorname{Prob}\left\{\frac{\mu - \frac{1}{m}\sum_{i=1}^{m}\xi(z_{i})}{\sqrt{\mu^{\tau} + \varepsilon^{\tau}}} > \varepsilon^{1 - \frac{\tau}{2}}\right\} \leq \exp\left\{-\frac{m\varepsilon^{2-\tau}}{2\left(c + \frac{1}{3}M\varepsilon^{1-\tau}\right)}\right\}.$$

Lemma 2.2. Let $0 \leq \tau \leq 1$, $M > 0$, $c > 0$, and let $\mathcal{G}$ be a set of functions on $Z$ such that for every $g \in \mathcal{G}$, $Eg \geq 0$, $|g - Eg| \leq M$, and $E(g^{2}) \leq c(Eg)^{\tau}$. Then for $\varepsilon > 0$,
$$\operatorname{Prob}\left\{\sup_{g \in \mathcal{G}}\frac{Eg - \frac{1}{m}\sum_{i=1}^{m}g(z_{i})}{\sqrt{(Eg)^{\tau} + \varepsilon^{\tau}}} > 4\varepsilon^{1-\frac{\tau}{2}}\right\} \leq \mathcal{N}(\mathcal{G}, \varepsilon)\exp\left\{-\frac{m\varepsilon^{2-\tau}}{2\left(c + \frac{1}{3}M\varepsilon^{1-\tau}\right)}\right\},$$
where $Eg = \int_{Z}g(z)\,d\rho$.

Denote the function set $\mathcal{G}_{R}$ as
$$\mathcal{G}_{R} = \left\{(f(x) - y)^{2} - (f_{\rho}(x) - y)^{2} : f \in \mathcal{B}_{R}\right\}, \qquad R > 0. \tag{12}$$

Lemma 2.3. Let $\mathcal{G}_{R}$ be defined by (12). If (10) holds, then there exists a constant $c_{s}' > 0$ such that
$$\log\mathcal{N}(\mathcal{G}_{R}, \varepsilon) \leq c_{s}'\left(\frac{R}{\varepsilon}\right)^{s}.$$

From Lemmas 2.2 and 2.3, we have the following corollary.

Corollary. Let $\mathcal{G}_{R}$ be defined by (12) and let (10) hold. Then for every $0 < \delta < 1$, with confidence at least $1 - \delta$, there holds
$$\left\{\mathcal{E}(f) - \mathcal{E}(f_{\rho})\right\} - \left\{\mathcal{E}_{z}(f) - \mathcal{E}_{z}(f_{\rho})\right\} \leq 4\sqrt{\varepsilon_{m,R}}\sqrt{\left(\mathcal{E}(f) - \mathcal{E}(f_{\rho})\right) + \varepsilon_{m,R}}$$
for all $f \in \mathcal{B}_{R}$, where $\varepsilon_{m,R}$ is given by
$$\varepsilon_{m,R} = 8\left(\left(c_{R} + \frac{1}{3}\right)\left(\frac{c_{s}'R^{s}}{m}\right)^{\frac{1}{1+s}} + \frac{\log(1/\delta)}{m}\right), \qquad c_{R} = (\kappa R + 3M)^{2}.$$

Proof. Consider the set $\mathcal{G}_{R}$. Each function $g \in \mathcal{G}_{R}$ has the form $g(z) = (f(x) - y)^{2} - (f_{\rho}(x) - y)^{2}$ with $f \in \mathcal{B}_{R}$. Hence
$$Eg = \mathcal{E}(f) - \mathcal{E}(f_{\rho}) \geq 0, \qquad \frac{1}{m}\sum_{i=1}^{m}g(z_{i}) = \mathcal{E}_{z}(f) - \mathcal{E}_{z}(f_{\rho})$$
and
$$g(z) = \left(f(x) - f_{\rho}(x)\right)\left((f(x) - y) + (f_{\rho}(x) - y)\right).$$
Since $\|f\|_{\infty} \leq \kappa\|f\|_{K} \leq \kappa R$ and $|f_{\rho}(x)| \leq M$ almost everywhere, we find that
$$|g(z)| \leq (\kappa R + M)(\kappa R + 3M) \leq c_{R} = (\kappa R + 3M)^{2}.$$
So we have $|g(z) - Eg| \leq B = 2c_{R}$ almost everywhere. Also,
$$Eg^{2} = E\left(\left(f(x) - f_{\rho}(x)\right)^{2}\left((f(x) - y) + (f_{\rho}(x) - y)\right)^{2}\right) \leq (\kappa R + 3M)^{2}\left(\mathcal{E}(f) - \mathcal{E}(f_{\rho})\right).$$
Thus $Eg^{2} \leq c_{R}Eg$ for each $g \in \mathcal{G}_{R}$.

Applying Lemma 2.2 with $\tau = 1$, we deduce that
$$\sup_{f \in \mathcal{B}_{R}}\frac{Eg - \frac{1}{m}\sum_{i=1}^{m}g(z_{i})}{\sqrt{Eg + \varepsilon}} \leq 4\sqrt{\varepsilon}$$
with confidence at least
$$1 - \mathcal{N}(\mathcal{G}_{R}, \varepsilon)\exp\left\{-\frac{m\varepsilon}{2c_{R} + \frac{1}{3}B}\right\} = 1 - \delta.$$
So we can see that
$$\left\{\mathcal{E}(f) - \mathcal{E}(f_{\rho})\right\} - \left\{\mathcal{E}_{z}(f) - \mathcal{E}_{z}(f_{\rho})\right\} \leq 4\sqrt{\varepsilon_{m,R}}\sqrt{\left(\mathcal{E}(f) - \mathcal{E}(f_{\rho})\right) + \varepsilon_{m,R}}.$$
We see from Corollary 5.1 of [15] (with $p = \infty$, $\tau = 1$) that
$$\varepsilon_{m,R} = 8\left(\left(c_{R} + \frac{1}{3}\right)\left(\frac{c_{s}'R^{s}}{m}\right)^{\frac{1}{1+s}} + \frac{\log(1/\delta)}{m}\right).$$
This enables us to end the proof of the Corollary. □

Theorem 2.2. Suppose $\mathcal{E}(f_{\rho}) = 0$. If $f_{K,C}$ is a function in $\mathcal{H}_{K}$ satisfying $\|(y - f_{K,C})^{2}\|_{\infty} \leq M$, then for every $0 < \delta < 1$, with confidence $1 - \delta$ there holds
$$\mathcal{E}(f_{z}) \leq 17\varepsilon_{m,C} + \frac{20M\log(2/\delta)}{3m} + 8D(C),$$
where $\varepsilon_{m,C}$ is given by
$$\varepsilon_{m,C} = 8\left(\left(\left(\kappa^{2}C\left(\frac{5M\log(2/\delta)}{3m} + 2D(C)\right) + 3M\right)^{2} + \frac{1}{3}\right)\left(\frac{c_{s}'\kappa^{s}C^{s}\left(\frac{5M\log(2/\delta)}{3m} + 2D(C)\right)^{s}}{m}\right)^{\frac{1}{1+s}} + \frac{\log(1/\delta)}{m}\right).$$

Proof. Since $\mathcal{E}(f_{\rho}) = 0$, we have $(y - f_{\rho}(x))^{2} = 0$ almost everywhere. We first consider the random variable $\xi = (y - f_{K,C})^{2}$. Since $\|(y - f_{K,C})^{2}\|_{\infty} \leq M$,
$$\sigma^{2}(\xi) \leq E\xi^{2} \leq M E\xi \leq M\mathcal{E}(f_{K,C}).$$
Applying the Bernstein inequality to $\xi$ and solving the equation $\frac{m\varepsilon^{2}}{2(\sigma^{2} + M\varepsilon/3)} = \log(2/\delta)$ for $\varepsilon$, we see that with confidence $1 - \delta/2$,
$$\mathcal{E}_{z}(f_{K,C}) - \mathcal{E}(f_{K,C}) \leq \frac{2M\log(2/\delta)}{3m} + \sqrt{\frac{2\sigma^{2}(\xi)\log(2/\delta)}{m}} \leq \frac{5M\log(2/\delta)}{3m} + D(C).$$
Next we estimate $\mathcal{E}(f_{z}) - \mathcal{E}_{z}(f_{z})$. By the definition of $f_{z}$, there holds
$$\frac{1}{C}\,\Omega(f_{z}^{*}) \leq \mathcal{E}_{z}(f_{z}) + \frac{1}{C}\,\Omega(f_{z}^{*}) \leq \mathcal{E}_{z}(f_{K,C}) + \frac{1}{C}\,\Omega(f_{K,C}^{*}),$$
and, since $\mathcal{E}(f_{\rho}) = 0$, $D(C) = \mathcal{E}(f_{K,C}) + \frac{1}{C}\,\Omega(f_{K,C}^{*})$. It follows that
$$\frac{1}{C}\,\Omega(f_{z}^{*}) \leq \mathcal{E}_{z}(f_{K,C}) - \mathcal{E}(f_{K,C}) + D(C).$$
Recalling $f_{z} \in \mathcal{H}_{K}$, we see from the reproducing property of the kernel that
$$\|f_{z}^{*}\|_{K} = \left(\sum_{i,j=1}^{m}\alpha_{i}\alpha_{j}K(x_{i}, x_{j})\right)^{1/2} \leq \kappa\left(\sum_{i,j=1}^{m}\alpha_{i}\alpha_{j}\right)^{1/2} = \kappa\,\Omega(f_{z}^{*}) \leq R = \kappa C\left(\frac{5M\log(2/\delta)}{3m} + 2D(C)\right).$$
The Corollary, with $R$ given as above, implies that
$$\mathcal{E}(f_{z}) - \mathcal{E}_{z}(f_{z}) \leq 4\sqrt{\varepsilon_{m,C}}\sqrt{\mathcal{E}(f_{z}) + \varepsilon_{m,C}}$$
with confidence $1 - \delta$, where $\varepsilon_{m,C}$ is defined in the statement of the theorem. Putting the above two estimates into Theorem 2.1, there holds
$$\mathcal{E}(f_{z}) \leq 4\sqrt{\varepsilon_{m,C}}\sqrt{\mathcal{E}(f_{z}) + \varepsilon_{m,C}} + \frac{10M\log(2/\delta)}{3m} + 4D(C).$$
Solving the resulting quadratic inequality for $\sqrt{\mathcal{E}(f_{z}) + \varepsilon_{m,C}}$ leads to
$$\mathcal{E}(f_{z}) \leq 17\varepsilon_{m,C} + \frac{20M\log(2/\delta)}{3m} + 8D(C).$$
The proof of Theorem 2.2 is completed. □

The next result applies to general distributions satisfying the Tsybakov condition (see [30]).

Theorem 2.3. Assume hypotheses (10) and (11) with $0 < s < 1$ and $0 < \beta \leq 1$. Take $t > 1$. For every $\varepsilon > 0$ and every $0 < \delta < 1$ there exist two constants, $c_{s}$ depending on $s$ and $c_{s}''$ depending on $s$ and $\beta$, such that with confidence $1 - \delta$,
$$\mathcal{E}(f_{z}) - \mathcal{E}(f_{\rho}) \leq 2c_{s}t\left(\frac{R^{s}}{m}\right)^{\frac{1}{1+s}} + 2tc_{s}''\left(\frac{C^{2-\beta}}{m} + \frac{1}{m} + \frac{C^{1-\beta}}{m} + C^{-\beta}\right).$$

Proof. Denote $\Delta_{z} = \mathcal{E}(f_{z}) - \mathcal{E}(f_{\rho}) + \frac{1}{C}\,\Omega(f_{z}^{*})$. Then we have $\Omega(f_{z}^{*}) \leq C\Delta_{z}$. Theorem 2.1 tells us that
$$\Delta_{z} \leq S(m, C) + D(C).$$
Take $f_{K,C} = \widetilde{f}_{K,C}$. By the assumption (11),
$$\Delta_{z} \leq S(m, C) + c_{\beta}C^{-\beta}.$$
Recalling the expression for $S(m, C)$, we have
$$S(m, C) = \left\{\left(\mathcal{E}(f_{z}) - \mathcal{E}(f_{\rho})\right) - \left(\mathcal{E}_{z}(f_{z}) - \mathcal{E}_{z}(f_{\rho})\right)\right\} + \left\{\left(\mathcal{E}_{z}(f_{K,C}) - \mathcal{E}_{z}(f_{\rho})\right) - \left(\mathcal{E}(f_{K,C}) - \mathcal{E}(f_{\rho})\right)\right\} = S_{1} + S_{2}.$$
Take $t \geq 1$ and $C \geq 1$. For $S_{1}$, we apply the Corollary with $\delta = e^{-t} \leq 1/e$ and find that there is a set $V_{R}^{(1)} \subset Z^{m}$ of measure at most $\delta = e^{-t}$ such that
$$S_{1} \leq c_{s}t\left\{\left(\frac{R^{s}}{m}\right)^{\frac{1}{1+s}} + \left(\frac{R^{s}}{m}\right)^{\frac{1}{2(1+s)}}\Delta_{z}^{\frac{1}{2}}\right\},$$
where $c_{s} = 32\left((\kappa R + 3M)^{2} + \frac{1}{3}\right)(c_{s}' + 1)$ is a constant depending only on $s$.

To estimate $S_{2}$, consider $\xi = (f_{K,C} - y)^{2} - (f_{\rho} - y)^{2}$ on $(Z, \rho)$. From (4), it follows that
$$\|f_{K,C}\|_{\infty} \leq \kappa\|f_{K,C}\|_{K} \leq \kappa^{2}\,\Omega(f_{K,C}^{*}) \leq \kappa^{2}CD(C).$$
Write $\xi = \xi_{1} + \xi_{2}$, where
$$\xi_{1} = (f_{K,C} - y)^{2} - (\pi(f_{K,C}) - y)^{2}, \qquad \xi_{2} = (\pi(f_{K,C}) - y)^{2} - (f_{\rho} - y)^{2}.$$

It is easy to check that $0 \leq \xi_{1} \leq (\kappa^{2}CD(C) + 3M)^{2} = B_{C}$, and $\sigma^{2}(\xi_{1})$ is bounded by $(\kappa^{2}CD(C) + 3M)^{2}E(\xi_{1})$. The Bernstein inequality applied to $\xi_{1}$ gives
$$\operatorname{Prob}\left\{\frac{1}{m}\sum_{i=1}^{m}\xi_{1}(z_{i}) - E\xi_{1} > \varepsilon\right\} \leq \exp\left\{-\frac{m\varepsilon^{2}}{2\left(\sigma^{2}(\xi_{1}) + \frac{1}{3}B_{C}\varepsilon\right)}\right\}.$$
Taking $\delta = e^{-t}$ and solving the quadratic equation
$$\frac{m\varepsilon^{2}}{2\left(\sigma^{2}(\xi_{1}) + \frac{1}{3}B_{C}\varepsilon\right)} = t$$
for $\varepsilon$, we see that there is a set $V_{R}^{(2)} \subset Z^{m}$ of measure at most $\delta = e^{-t}$ such that for every $z \in Z^{m}\setminus V_{R}^{(2)}$ there holds
$$\frac{1}{m}\sum_{i=1}^{m}\xi_{1}(z_{i}) - E\xi_{1} \leq \frac{\frac{1}{3}B_{C}t + \sqrt{\left(\frac{1}{3}B_{C}t\right)^{2} + 2m\sigma^{2}(\xi_{1})t}}{m} \leq \frac{2B_{C}t}{3m} + \sqrt{\frac{2t\sigma^{2}(\xi_{1})}{m}}.$$
But the fact $0 \leq \xi_{1} \leq B_{C}$ implies $\sigma^{2}(\xi_{1}) \leq B_{C}E(\xi_{1})$. Therefore, we have
$$\frac{1}{m}\sum_{i=1}^{m}\xi_{1}(z_{i}) - E\xi_{1} \leq \frac{7B_{C}t}{6m} + E(\xi_{1}), \qquad \forall z \in Z^{m}\setminus V_{R}^{(2)}.$$

Next we consider $\xi_{2}$. Since both $\pi(f_{K,C})$ and $f_{\rho}$ take values in $[-M, M]$, $\xi_{2}$ is a random variable satisfying $|\xi_{2}| \leq B$. Applying the Bernstein inequality as above, we know that there exists another subset $V_{R}^{(3)} \subset Z^{m}$ with measure at most $\delta = e^{-t}$ such that for every $z \in Z^{m}\setminus V_{R}^{(3)}$ there holds
$$\frac{1}{m}\sum_{i=1}^{m}\xi_{2}(z_{i}) - E\xi_{2} \leq \frac{2Bt}{3m} + \sqrt{\frac{2t\sigma^{2}(\xi_{2})}{m}}.$$
By the fact $\sigma^{2}(\xi_{2}) \leq BE(\xi_{2})$, we have
$$\frac{1}{m}\sum_{i=1}^{m}\xi_{2}(z_{i}) - E\xi_{2} \leq \frac{7Bt}{6m} + E(\xi_{2}), \qquad \forall z \in Z^{m}\setminus V_{R}^{(3)}.$$

Combining the above two estimates for $\xi_{1}$ and $\xi_{2}$ with the fact $E\xi_{1} + E\xi_{2} = E\xi \leq D(C) \leq c_{\beta}C^{-\beta}$, we conclude that
$$S_{2} \leq \frac{7B_{C}t + 7Bt}{6m} + D(C) \leq c_{s}''t\left(\frac{C^{2-\beta}}{m} + \frac{1}{m} + \frac{C^{1-\beta}}{m} + C^{-\beta}\right), \qquad \forall z \in Z^{m}\setminus V_{R}^{(2)}\setminus V_{R}^{(3)},$$
where $c_{s}'' = \frac{7\kappa^{4}c_{\beta}^{2} + 42M\kappa^{2}c_{\beta} + 63M^{2} + 7B}{6} + c_{\beta}$ is a constant depending only on $s$ and $\beta$.

Putting together the above two estimates for $S_{1}$ and $S_{2}$, we find that for every $z \in Z^{m}\setminus V_{R}^{(1)}\setminus V_{R}^{(2)}\setminus V_{R}^{(3)}$ there holds
$$\Delta_{z} \leq 2c_{s}t\left(\frac{R^{s}}{m}\right)^{\frac{1}{1+s}} + 2tc_{s}''\left(\frac{C^{2-\beta}}{m} + \frac{1}{m} + \frac{C^{1-\beta}}{m} + C^{-\beta}\right).$$
Here we have used the elementary inequality: if $a, b > 0$ and $0 < \gamma < 1$, then
$$x \leq ax^{\gamma} + b,\ x > 0 \ \Longrightarrow\ x \leq \max\left\{(2a)^{1/(1-\gamma)},\, 2b\right\}.$$
The proof of Theorem 2.3 is complete. □

4. Conclusions

In reproducing kernel Hilbert spaces, a new upper bound for the learning errors of linear programming support vector regression has been presented in this paper by means of the covering number. The error bound takes the form
$$B(m, C) = 2c_{s}t\left(\frac{R^{s}}{m}\right)^{\frac{1}{1+s}} + 2tc_{s}''\left(\frac{C^{2-\beta}}{m} + \frac{1}{m} + \frac{C^{1-\beta}}{m} + C^{-\beta}\right).$$
Here $m$ is the number of sample points, $C$ is the trade-off factor that controls the regularization term in the LP-SVR model (7), $s$ is the polynomial complexity exponent of the given reproducing kernel Hilbert space, $c_{s}$ and $c_{s}''$ are two constants depending on $s$ (and $\beta$) that can be estimated via Definition 2.3 and Lemma 2.3, $t > 1$ is a given constant, and $0 < \beta \leq 1$ is also a given constant.

For any fixed $C$,
$$\lim_{m \to +\infty} B(m, C) = 2tc_{s}''\,C^{-\beta}.$$
This means that the gap cannot vanish, no matter how the sample data points are selected.

Due to the difficulty of calculating the covering number, a large body of experimental work cannot yet be carried out in this field. The authors think that this will be a very challenging direction for future work.
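Purely to illustrate the shape of the bound, one can tabulate $B(m, C)$ against $m$ and watch it approach the non-vanishing limit $2tc_{s}''C^{-\beta}$. The constants below ($c_{s}$, $c_{s}''$, $R$, $s$, $\beta$, $t$, $C$) are hypothetical placeholders, since their true values depend on the unknown covering number.

```python
# Hypothetical constants for illustration only; they are not estimated from any real kernel.
c_s, c_s2, R, s, beta, t, C = 1.0, 1.0, 5.0, 0.5, 0.75, 2.0, 100.0

def B(m):
    """Learning-error bound B(m, C) from the Conclusions, with placeholder constants."""
    sample_term = 2.0 * c_s * t * (R ** s / m) ** (1.0 / (1.0 + s))
    regularization_term = 2.0 * t * c_s2 * (C ** (2.0 - beta) / m + 1.0 / m
                                            + C ** (1.0 - beta) / m + C ** (-beta))
    return sample_term + regularization_term

for m in (10 ** 2, 10 ** 4, 10 ** 6, 10 ** 8):
    print(f"m = {m:>9d}:  B(m, C) = {B(m):.6f}")
print(f"limit as m -> infinity: {2.0 * t * c_s2 * C ** (-beta):.6f}")
```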

Acknowledgements

The authors thank the referees for their careful reading of the manuscript, which improved its technical quality and presentation. Their very detailed comments and extremely important suggestions made this paper more readable and understandable.

References

[1] V. Vapnik, A. Lerner, Pattern recognition using generalized portrait method, Automat. Rem. Contr. 24 (1963) 774–780.
[2] M.A. Aizerman, E.M. Braverman, L.I. Rozonoer, Theoretical foundations of the potential function method in pattern recognition learning, Automat. Rem. Contr. 25 (1964) 821–837.
[3] C. Cortes, V. Vapnik, Support-vector networks, Mach. Learn. 20 (1995) 273–297.
[4] V. Vapnik, Statistical Learning Theory, Wiley, New York, 1998.
[5] I. Steinwart, Support vector machines are universally consistent, J. Complexity 18 (2002) 768–791.
[6] T. Zhang, Statistical behavior and consistency of classification methods based on convex risk minimization, Ann. Statist. 32 (2004) 56–85.
[7] Q. Wu, D.X. Zhou, Analysis of support vector machine classification, J. Comput. Anal. Appl. 8 (2006) 108–134.
[8] H.B. Zhao, S.D. Yin, Geomechanical parameters identification by particle swarm optimization and support vector machine, Appl. Math. Model. 33 (2009) 3997–4012.
[9] W.C. Hong, Electric load forecasting by support vector model, Appl. Math. Model. 33 (2009) 2444–2454.
[10] Q. Wu, Y.M. Ying, D.X. Zhou, Multi-kernel regularized classifiers, J. Complexity 23 (2007) 108–134.
[11] P.S. Bradley, O.L. Mangasarian, Massive data discrimination via linear support vector machines, Optim. Methods Softw. 13 (2000) 1–10.
[12] V. Kecman, I. Hadaic, Support vector selection by linear programming, Proc. of IJCNN 5 (2000) 193–198.
[13] P. Niyogi, F. Girosi, On the relationship between generalization error, hypothesis complexity, and sample complexity for radial basis functions, Neural Comput. 8 (1996) 819–842.
[14] J.P. Pedroso, N. Murata, Support vector machines with different norms: motivation, formulations and results, Pattern Recogn. Lett. 22 (2001) 1263–1272.
[15] Q. Wu, D.X. Zhou, SVM soft margin classifiers: linear programming versus quadratic programming, Neural Comput. 17 (2005) 1160–1187.
[16] A.J. Smola, B. Schölkopf, A tutorial on support vector regression, Stat. Comput. 14 (2004) 199–222.
[17] G. Liu, Z. Lin, Y. Yu, Multi-output regression on the output manifold, Pattern Recogn. 42 (2009) 2737–2743.
[18] N. Aronszajn, Theory of reproducing kernels, Trans. Amer. Math. Soc. 68 (1950) 337–404.
[19] V. Vapnik, The Nature of Statistical Learning Theory, Springer Verlag, New York, 1995.
[20] J.C. Platt, Fast training of support vector machines using sequential minimal optimization, in: Advances in Kernel Methods – Support Vector Learning, MIT Press, 1999, pp. 185–208.
[21] O.L. Mangasarian, D.R. Musicant, Successive overrelaxation for support vector machines, IEEE Trans. Neural Networks 10 (1999) 1032–1037.
[22] Y.J. Lee, O.L. Mangasarian, SSVM: a smooth support vector machine for classification, Comput. Optim. Appl. 22 (2001) 5–21.
[23] Y. Yuan, J. Yan, C. Xu, Polynomial smooth support vector machine, Chinese J. Comput. 28 (2005) 9–17 (in Chinese).
[24] M.H. Hassoun, Fundamentals of Artificial Neural Networks, MIT Press, Cambridge, MA, 1995.
[25] D.X. Zhou, The covering number in learning theory, J. Complexity 18 (2002) 739–767.
[26] Y. Guo, P.L. Bartlett, J. Shawe-Taylor, R.C. Williamson, Covering numbers for support vector machines, IEEE Trans. Inf. Theory 48 (2002) 239–250.
[27] F. Cucker, S. Smale, On the mathematical foundations of learning theory, Bull. Amer. Math. Soc. 39 (2001) 1–49.
[28] Q. Wu, D.X. Zhou, Learning rates of least-square regularized regression, Found. Comput. Math. 6 (2006) 171–192.
[29] F. Cucker, D.X. Zhou, Learning Theory: An Approximation Theory Viewpoint, Cambridge University Press, Cambridge, 2007.
[30] A.B. Tsybakov, Optimal aggregation of classifiers in statistical learning, Ann. Statist. 32 (2004) 135–166.
