Approximation Analysis of Learning Algorithms for Support Vector Regression and Quantile Regression

(1)

Volume 2012, Article ID 902139,17pages doi:10.1155/2012/902139

Research Article

Approximation Analysis of

Learning Algorithms for Support Vector

Regression and Quantile Regression

Dao-Hong Xiang,

1

_{Ting Hu,}

2

_{and Ding-Xuan Zhou}

3

1_{Department of Mathematics, Zhejiang Normal University, Zhejiang 321004, China} 2_{School of Mathematics and Statistics, Wuhan University, Wuhan 430072, China} 3_{Department of Mathematics, City University of Hong Kong, Kowloon, Hong Kong}

Correspondence should be addressed to Ding-Xuan Zhou,[email protected]

Received 14 July 2011; Accepted 14 November 2011 Academic Editor: Yuesheng Xu

Copyrightq2012 Dao-Hong Xiang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

We study learning algorithms generated by regularization schemes in reproducing kernel Hilbert spaces associated with an-insensitive pinball loss. This loss function is motivated by the -insen-sitive loss for support vector regression and the pinball loss for quantile regression. Approximation analysis is conducted for these algorithms by means of a variance-expectation bound when a noise condition is satisfied for the underlying probability measure. The rates are explicitly derived under a priori conditions on approximation and capacity of the reproducing kernel Hilbert space. As an application, we get approximation orders for the support vector regression and the quantile regu-larized regression.

1. Introduction and Motivation

In this paper, we study a family of learning algorithms serving both purposes of support vector

regression and quantile regression. Approximation analysis and learning rates will be provided,

which also helps better understanding of some classical learning methods.

Support vector regression is a classical kernel-based algorithm in learning theory introduced in1. It is a regularization scheme in a reproducing kernel Hilbert spaceRKHS HKassociated with an-insensitive lossψ:R → Rdefined for≥0 by

ψu max{|u| −,0}

|u| −, if |u| ≥,

(2)

Here, for learning functions on a compact metric spaceX,K :X×X → Ris a continuous, symmetric, and positive semidefinite function called a Mercer kernel. The associated RKHS HKis defined2as the completion of the linear span of the set of function{Kx Kx,· :

x ∈X}with the inner product·,·_K satisfyingKx, Ky_K Kx, y. LetY Randρbe a

Borel probability measure onZ:X×Y. With a sample z{xi, yi}mi1∈Zmindependently

drawn according toρ, the support vector regression is defined as

fzSVRarg min f∈HK 1 m m i1 ψfxi−yi λf2_K , 1.2

whereλλm>0 is a regularization parameter.

When >0 is fixed, convergence of1.2was analyzed in3. Notice from the original motivation1for the insensitive parameterfor balancing the approximation and sparsity of the algorithm thatshould change with the sample size and usuallym → 0 as the sample sizemincreases. Mathematical analysis for this original algorithm is still open. We will solve this problem in a special case of our approximation analysis for general learning algorithms. In particular, we show howfSVR

z approximates the median functionfρ,1/2, which

is one of the purposes of this paper. Here, forx∈X, the median function valuefρ,1/2xis a

median of the conditional distributionρ· |xofρatx.

Quantile regression, compared with the least squares regression, provides richer infor-mation about response variables such as stretching or compressing tails4. It aims at esti-mating quantile regression functions. With a quantile parameter 0< τ <1, a quantile regression

functionfρ,τ is defined by its valuefρ,τxto be aτ-quantile ofρ· |x, that is, a valueu∈Y

satisfying

ρy∈Y :y≤u |x≥τ, ρy∈Y :y≥u |x≥1−τ. 1.3 Quantile regression has been studied by kernel-based regularization schemes in a learning theory literaturee.g.,5–8. These regularization schemes take the form

fzQRarg min f∈HK 1 m m i1 ψτ fxi−yi λf2_K , 1.4

whereψτ :R → Ris the pinball loss shown inFigure 1defined by

ψτu

1−τu, if u >0,

−τu, if u≤0. 1.5

Motivated by the-insensitive lossψ_{and the pinball loss}_ψ

τ, we propose the

-insen-sitive pinball lossψ

τ :R → Rwith an insensitive parameter≥0 shown inFigure 1defined

as ψ_τu ⎧ ⎪ ⎪ ⎨ ⎪ ⎪ ⎩ 1−τu−, ifu > , −τu, ifu≤ −, 0, otherwise. 1.6

(3)

0 u −τu (1−τ)u ψτ(u) 0 ψɛ_τ(u) −τ(u+ɛ) (1−τ)(u u −ɛ) −ɛ ɛ Figure 1

This loss function has been applied to online learning for quantile regression in our previous work8. It is applied here to a regularization scheme in the RKHS as

fzf_z_,λ,τ arg min f∈HK 1 m m i1 ψ_τfxi−yi λf2_K . 1.7

The main goal of this paper is to study how the output functionfzgiven by1.7converges

to the quantile regression functionfρ,τ and how explicit learning rates can be obtained with

suitable choices of the parameters λ m−α_, _m−β _{based on a priori conditions on the}

probability measureρ.

2. Main Results on Approximation

Throughout the paper, we assume that the conditional distributionρ· | xis supported on

−1,1for everyx∈X. Then, we see from1.3that we can take values offρ,τto be on−1,1.

So to see howfzapproximatesfρ,τ, it is natural to project values of the output functionfz

onto the same interval by the projection operator introduced in9.

Definition 2.1. The projection operatorπon the space of function onXis defined by

πfx ⎧ ⎪ ⎪ ⎨ ⎪ ⎪ ⎩ 1, iffx>1, −1, iffx<−1, fx, if −1≤fx≤1. 2.1

Our approximation analysis aims at establishing bounds for the errorπfz−fρ,τ_Lp∗ ρX

in the spaceLpρ∗Xwith somep∗>0 whereρXis the marginal distribution ofρonX.

2.1. Support Vector Regression and Quantile Regression

Our error bounds and learning rates are presented in terms of a noise condition and approx-imation condition onρ.

(4)

Definition 2.2. Letp∈0,∞andq∈1,∞. We say thatρhas aτ-quantile ofp-average type

qif for everyx∈X, there exist aτ-quantilet∗∈Rand constantsax ∈0,2, bx>0 such that

for eachu∈0, ax,

ρy∈t∗−u, t∗ |x≥bxuq−1, ρ

y∈t∗, t∗u |x≥bxuq−1, 2.2

and that the function onXtaking valuebxaqx−1−1atx∈Xlies inLpρX.

Note that condition2.2tells us thatfρ,τx t∗is uniquely defined at everyx∈X.

The approximation condition on ρ is stated in terms of the integral operator LK :

L2

ρX → L

2

ρX defined byLKfx

XKx, ufudρXu. SinceK is positive semidefinite,

LK is a compact positive operator and itsr-th powerLrKis well-defined for anyr > 0. Our

approximation condition is given as

fρ,τ LrK gρ,τ for some 0< r ≤ 1 2, gρ,τ ∈L 2 ρX. 2.3

Let us illustrate our approximation analysis by the following special case which will be proved inSection 5.

Theorem 2.3. LetX ⊂ Rn _and_K _∈_C∞_X_×_X_{. Assume}_f

ρ,τ ∈ HKandρhas aτ-quantile ofp

-average type 2 for somep∈0,∞. Takeλm−p1/p2_and_m−β_with_p₁_/_p₂_≤_β_{≤ ∞}_.

Let 0< η <p1/2p2. Then, withp∗ 2p/p1 >0, for any 0< δ <1, with confidence 1−δ, one has πfz −fρ,τ Lp_ρX∗ ≤ Clog3 δm η−p1/2p2_, _2.4

whereCis a constant independent ofmorδ.

Ifp≥1/2η−2 for 0< η <1/4, we see that the power exponent for the learning rate

2.4is at least1/2−2η. This exponent can be arbitrarily close to 1/2 whenηis small enough. In particular, if we takeτ 1/2,Theorem 2.3provides rates for output functionfzSVR

of the support vector regression1.2to approximate the median functionfρ,1/2.

If we takeβ∞leading to0,Theorem 2.3provides rates for output functionfzQR

of the quantile regression algorithm1.4to approximate the quantile regression functionfρ,τ.

2.2. General Approximation Analysis

To state our approximation analysis in the general case, we need the capacity of the hypo-thesis space measured by covering numbers.

Definition 2.4. For a subsetSofCXandu > 0, the covering numberNS, uis the minimal integerl∈Nsuch that there existldisks with radiusucoveringS.

(5)

The covering numbers of ballsBR {f ∈ HK :fK ≤ R}with radiusR > 0 of the

RKHS have been well studied in the learning theory literature10,11. In this paper, we as-sume for somes >0 andCs>0 that

logNB1, u≤Cs 1 u s , ∀u >0. 2.5

Now we can state our main result which will be proved inSection 5. Forp ∈ 0,∞

andq∈1,∞, we denote θmin 2 q, p p1 ∈0,1. 2.6

Theorem 2.5. Assume2.3with 0 < r ≤1/2 and2.5withs > 0. Suppose thatρhas aτ

-quan-tile ofp-average typeqfor somep ∈0,∞andq∈1,∞. Takeλ m−α_{with 0} _{< α}_≤_{1 and}_{α <}

2s/s2s−θ. Setm−β_with_αr/₁₋_r_≤_β_{≤ ∞}_{. Let}

0< η < 1s22s−sα2s−θ

s2s−θ2s . 2.7

Then, withp∗pq/p1>0, for any 0< δ <1, with confidence 1−δ, one has

πfz −fρ,τ Lp_ρX∗ ≤ C log3 η 2 log3 δm −ϑ_, _2.8

whereCis a constant independent ofmorδand the power indexϑis given in terms ofr, s, p, q, α, andηby ϑ 1 qmin αr 1−r, 1 2s−θ− sα2s−θ−1 2s−θ2s − sη 1s, 1 2s−θ − sα1−2r 1s2−2r, 1 2s−θ− s 1s α 2 − 1 22−θ . 2.9

The indexϑcan be viewed as a function of variablesr, s, p, q, α, η. The restriction 0 < α <2s/s2s−θonαand2.7onηensure thatϑis positive, which verifies the valid learning rate inTheorem 2.5.

Assumption2.5is a measurement of regularity of the kernelKwhenXis a subset of Rn_{. In particular,}_s_{can be arbitrarily small when}_K_{is smooth enough. In this case, the power}

indexϑin2.8can be arbitrarily close to1/qmin{αr/1−r,1/2−θ}. Again, whenβ∞,

0, algorithm 1.7corresponds to algorithm1.4 for quantile regression. In this case,

(6)

Error analysis has been done for quantile regression algorithm1.4in5,6. Under the assumptions thatρsatisfies2.2with somep∈0,∞andq >1 and the2_{empirical covering}

numberNzB1, η, 2 see5for more detailsofB1is bounded as

sup z∈Zm logNz B1, η, 2 ≤a 1 η s withs∈0,2, a≥1, 2.10

it was proved in5that with confidence 1−δ,

πfzQR −fρ,τ Lp_ρX∗ ≤ ⎧ ⎨ ⎩Dτλ Dτλ λ log3/δ m Ks,Cpa λs/2_m p1/p2−s/2 Ks,Cpa λs/2_m 5 32Cplog3/δ m p1/p2 145 log3/δ m ⎫ ⎪ ⎪ ⎬ ⎪ ⎪ ⎭ 1/q , 2.11 whereCpandKs,Cpare constants independent ofmorλ. Here,Dτλis the regularization error

defined as Dτλ inf f∈HK Eτ f− Eτ fρ,τ λf2_K, _2.12

andEτfis the generalization error associated with the pinball lossψτdefined by

Eτ f Z ψτ fx−ydρ X Y ψτ fx−ydρy|xdρXx. 2.13

Note thatEτfis minimized by the quantile regression functionfρ,τ. Thus, when the

regular-ization errorDτλdecays polynomially asDτλOλr/1−r which is ensured byLemma 2.6

below when2.3is satisfiedandλm−α, thenπfzQR−fρ,τ_Lp∗

ρX Olog3/δm −ϑ_with ϑ 1 qmin αr 1−r, p1 p2−s/2 1−αs 2 ,p1 p2 . 2.14

Sincep1/p2 1/2−θ, we see that this learning rate is comparable to our result in

(7)

2.3. Comparison with Least Squares Regression

There has been a large literature in learning theorydescribed in12for the least squares algorithms: f_zLSarg min f∈HK 1 m m i1 fxi−yi 2 λf2_K . 2.15

It aims at learning the regression functionfρx

Yydρy | x. A crucial property for its

error analysis is the identityElsf− Elsfρ f−fρ2_L2

ρX for the least squares generalization

errorElsf

Zy−fx

2_dρ_{. It yields a variance-expectation bound E}_ξ2 _≤ _4E_ξ_{for the}

random variableξ y−fx2−y−fρx2onZ, ρwheref :X → Yis an arbitrary

mea-surable function. Such a variance-expectation bound with Eξpossibly replaced by its posi-tive powerEξθplays an essential role for analyzing regularization schemes and the power exponentθdepends on strong convexity of the loss. See13and references therein. However, the pinball loss in the quantile regression setting has no strong convexity6and we would not expect a variance-expectation bound for a general distributionρ. Whenρhas aτ-quantile ofp-average typeq, the following variance-expectation bound withθgiven by2.6can be found in5,7 derived by means ofLemma 3.1below.

Lemma 2.6. Ifρhas aτ-quantile ofp-average typeqfor somep∈0,∞andq∈1,∞, then

Eψτ fx−y−ψτ fρ,τx−y 2_≤ Cθ Eτ f− Eτ fρ,τ θ , ∀f:X−→Y, 2.16

where the power indexθis given by2.6and the constantCθisCθ22−θqθγ−1θ_Lp

ρX.

Lemma 2.6overcomes the diﬃculty of quantile regression caused by lack of strong convexity of the pinball loss. It enables us to derive satisfactory learning rates, as in

Theorem 2.5.

3. Insensitive Relation and Error Decomposition

An important relation for quantile regression observed in5assets that the errorπfz−

fρ,τ taken in a suitable Lp

∗

ρX space can be bounded by the excess generalization error

Eτπfz− Eτfρ,τwhen the noise condition is satisfied.

Lemma 3.1. Letp ∈0,∞andq∈ 1,∞. Denotep∗ pq/p1 >0. Ifρhas aτ-quantile of

p-average typeq, then for any measurable functionfonX, one has

f−fρ,τ_Lp∗ ρX ≤Cq,ρ Eτ f− Eτ fρ,τ 1/q , 3.1 whereCq,ρ21−1/qq1/q{bxaqx−1−1}x∈X 1/q Lp_ρX.

(8)

ByLemma 3.1, to estimateπfz−fρ,τ_Lp∗

ρX, we only need to bound the excess

gener-alization errorEτπfz− Eτfρ,τ. This will be done by conducting an error decomposition

which has been developed in the literature for regularization schemes9,13–15. Technical diﬃculty arises for our problem here because the insensitive parameterchanges withm. This can be overcome16by the following insensitive relation

ψτu−≤ψτu≤ψτu, u∈R. 3.2

Now, we can conduct an error decomposition. Define the empirical errorEz,τffor

f:X → Ras Ez,τ f 1 m m i1 ψτ fxi−yi . 3.3

Lemma 3.2. Letλ >0,fzbe defined by1.7and

f_λ0arg min f∈HK Eτ f− Eτ fρ,τ λf2_K. _3.4 Then, Eτ πfz − Eτ fρ,τ λfz 2 K≤ S1S2Dτλ, 3.5 where S1 Eτ πfz − Eτ fρ,τ − Ez,τ πfz − Ez,τ fρ,τ , S2 Ez,τ f_λ0− Ez,τ fρ,τ − Eτ f_λ0− Eτ fρ,τ , 3.6

Proof. The regularized excess generalization errorEτπfz− Eτfρ,τ λfz2_Kcan be

ex-pressed as Eτ πfz − Ez,τ πfz ! Ez,τ πfz λfz 2 K " − ! Ez,τ f_λ0λf_λ02 K " Ez,τ f_λ0− Eτ f_λ0 Eτ f_λ0− Eτ fρ,τ λf_λ02 K . 3.7

(9)

The fact|y| ≤ 1 implies thatEz,τπfz ≤ Ez,τfz. The insensitive relation3.2and the

definition offztell us that

Ez,τ fz λfz 2 K≤ 1 m m i1 ψ_τfzxi−yi λfz 2 K ≤ 1 m m i1 ψ_τf_λ0xi−yi λf_λ02 K≤ Ez,τ f_λ0λf_λ02 K. 3.8

Then, by subtracting and addingEτfρ,τandEz,τfρ,τand notingDτλ Eτf_λ0−Eτfρ,τ

λf_λ02

K, we see that the desired inequality inLemma 3.2holds true.

In the error decomposition3.5, the first two terms are called sample error. The last term is the regularization error defined in2.12. It can be estimated as follows.

Proposition 3.3. Assume2.3. Definef_λ0by3.4. Then, one has

Dτλ≤C0λr/1−r, f_λ0 K≤ # C0λ2r−1/2−2r, 3.9

whereC0is the constantC0 gρ,τL2

ρX gρ,τ

2

L2

ρX.

Proof. Letμλ1/1−r_>_{0 and}

fμ

LKμI

₋₁

LKfρ,τ. 3.10

It can be found in17,18that when2.3holds, we have

fμ−fρ,τ2_L2 ρXμfμ 2 K≤μ 2r_g ρ,τ2_L2 ρX. 3.11 Hence,fμ−fρ,τL2 ρX ≤μ r_g ρ,τL2 ρX andfμ 2 K≤μ2r−1gρ,τ2L2

ρX. Sinceψτis Lipschitz, we know

by takingffμin2.12that Dτλ≤ Z ψτ fμx−y −ψτ fρ,τx−y λfμ2_K ≤fμ−fρ,τ_L1 ρXλ _f_μ2 K≤fμ−fρ,τL2 ρXλ _f_μ2 K ≤μrgρ,τ_L2 ρXλμ 2r−1_g ρ,τ2_L2 ρX gρ,τ_L2 ρX _g_ρ,τ2 L2 ρX λr/1−r. 3.12

This verifies the desired bound forDτλ. By takingf 0 in2.12, we have

λf_λ02

K≤ Dτλ≤C0λ

r/1−r_. _3.13

(10)

4. Estimating Sample Error

This section is devoted to estimating the sample error. This is conducted by using the vari-ance-expectation bound inLemma 2.6.

Denoteκsup_x_∈_X#Kx, x. ForR≥1, denote WR z∈Zm:fz

K≤R

. 4.1

Proposition 4.1. Assume2.3and2.5. LetR≥1 and 0< δ <1. Ifρhas aτ-quantile ofp-average

typeqfor somep∈0,∞andq∈1,∞, then there exists a subsetVR ofZmwith measure at most

δsuch that for any z∈ WR\VR,

Eτ πfz − Eτ fρ,τ λfz 2 K≤2C 1log 3 δλ r/1−r_max 1,λ −1/2−2r m C₂log3 δm −1/2−θ_C 3m−1/2s−θRs/1s, 4.2

whereθis given by2.6andC₁, C₂, C₃are constants given by

C₁4C02κ # C0, C2245324C 1/2−θ θ , C3 406Cs1/1s408CθCs1/2s−θ. 4.3

Proof. Let us first estimate the second partS2of the sample error. It can be decomposed into

two partsS2S2,1S2,2where

S2,1 Ez,τ f_λ0− Ez,τ πf_λ0 −Eτ f_λ0− Eτ πf_λ0 , S2,2 Ez,τ πf_λ0− Ez,τ fρ,τ − Eτ πf_λ0− Eτ fρ,τ . 4.4

For boundingS2,1, we take the random variableξzψτf_λ0x−y−ψτπf_λ0x−y

onZ, ρ. It satisfies 0≤ξ≤ |πf_λ0x−f_λ0x| ≤1f_λ0_∞. Hence,|ξ−Eξ| ≤1f_λ0_∞

and Eξ−Eξ2 ≤Eξ2 _≤₁_f0

λ ∞Eξ. Applying the one-side Bernstein inequality12,

we know that there exists a subsetZ1,δofZmwith measure at least 1−δ/3such that

S2,1≤ 71f_λ0 ∞ log3/δ 6m Eτ f_λ0− Eτ πf_λ0, ∀z∈Z1,δ. 4.5

(11)

For S2,2, we apply the one-side Bernstein inequality again to the random variable

ξz ψτπf_λ0x−y−ψτfρ,τx−y, bound the variance byLemma 2.6withfπf_λ0,

and find that there exists another subsetZ2,δofZmwith measure at least 1−δ/3such that

S2,2≤ 4 log3/δ 3m θ 2 θ/4−2θ 1−θ 2 ₂_C θlog3/δ m 1/2−θ Eτ πf_λ0− Eτ fρ,τ , ∀z∈Z2,δ. 4.6

Next, we estimate the first partS1of the sample error. Consider the function set

Gψτ πfx−y−ψτ fρ,τx−y :f_K≤R . 4.7

A function from this setgz ψτπfx−y−ψτfρ,τx−ysatisfies Eg≥0,|gz| ≤2,

and Eg2 _≤ _C

θEgθ by 2.16. Also, the Lipschitz property of the pinball loss yields

NG, u≤ NB1, u/R. Then, we apply a standard covering number argument with a ratio

inequality12,13,19,20toGand find from the covering number condition2.5that

Probz∈Zm ⎧ ⎪ ⎨ ⎪ ⎩fsup_K≤R $ Eτ πf− Eτ fρ,τ % −$Ez,τ πf− Ez,τ fρ,τ % & Eτ πf− Eτ fρ,τ θ uθ ≤4u1−θ/2 ⎫ ⎪ ⎬ ⎪ ⎭ ≥1− NB1, u R exp − mu2−θ 2Cθ 4/3u1−θ ≥1−exp Cs R u s − mu2−θ 2Cθ 4/3u1−θ . 4.8

Setting the confidence to be 1−δ/3, we takeu∗R, m, δ/3to be the positive solution to the equation Cs R u s − mu2−θ 2Cθ 4/3u1−θ logδ 3. 4.9

Then, there exists a third subsetZ3,δ ofZmwith measure at least 1−δ/3such that

sup f_K≤R $ Eτ πf−Eτ fρ,τ % −$Ez,τ πf−Ez,τ fρ,τ % & Eτ πf−Eτ fρ,τ θ u∗R, m, δ/3θ ≤4 u∗ R, m,δ 3 1−θ/2 , ∀z∈Z3,δ. 4.10

(12)

Thus, forz∈ WR∩Z3,δ, we have S1 Eτ πfz − Eτ fρ,τ − Ez,τ πfz − Ez,τ fρ,τ ≤4 ! u∗ R, m,δ 3 "1−θ/2 Eτ πfz − Eτ fρ,τ θ ! u∗ R, m,δ 3 "θ ≤ 1−θ 2 42/2−θu∗ R, m,δ 3 θ 2 Eτ πfz − Eτ fρ,τ 4u∗ R, m,δ 3 . 4.11

Here, we have used the elementary inequality√ab ≤ √a√band Young’s inequality. Putting this bound and4.5,4.6into3.5, we know that for z∈ WR∩Z3,δ∩Z1,δ∩Z2,δ,

there holds Eτ πfz− Eτ fρ,τ λfz 2 K≤ 1 2 Eτ πfz − Eτ fρ,τ 2Dτλ 20u∗ R, m,δ 3 7 1f_λ0 ∞ log3/δ 6m 4 log3/δ 3m ₂_C θlog3/δ m 1/2−θ , 4.12 which together withProposition 3.3implies

Eτ πfz− Eτ fρ,τ λfz 2 K≤24C0λ r/1−r₅₂₂_C θ1/2−θ log3 δm −1/2−θ 40u∗ R, m,δ 3 2κ#C0log 3 δ λ2r−1/2−2r m . 4.13 Here, we have used the reproducing property inHKwhich yields12

f_∞≤κf_K, ∀f∈ HK. 4.14

Equation4.9can be expressed as

u2s−θ− 4 3mlog 3 δu 1s−θ₋2Cθ m log 3 δu s₋ 4CsRs 3m u 1−θ₋2CθCsRs m 0. 4.15

By Lemma 7.2 in12, the positive solutionu∗R, m, δ/3to this equation can be bounded as

u∗R, m, δ/3≤max 6 mlog 3 δ, 8Cθ m log 3 δ 1/2−θ , 6CsRs m 1/1s , 8CθCsRs m 1/2s−θ ≤6 8Cθ1/2−θ log3 δm −1/2−θ₆_C s1/1s 8CθCs1/2s−θ ×m−1/2s−θRs/1s. 4.16

(13)

Thus, for z∈ WR∩Z3,δ∩Z1,δ∩Z2,δ, the desired bound4.2holds true. Since the measure

of the setZ3,δ∩Z1,δ∩Z2,δis at least 1−δ, our conclusion is proved.

5. Deriving Convergence Rates by Iteration

To applyProposition 4.1 for error analysis, we need someR ≥ 1 for z ∈ WR. One may chooseRλ−1/2_{according to}

fz

K≤λ

−1/2_, _∀_z_∈_Zm _5.1

which is seen by takingf 0 in1.7. This choice is too rough. Recall fromProposition 3.3

thatf_λ0K≤

#

C0λ2r−1/2−2rwhich is a bound for the noise-free limitf_λ0offz. It is much

better thanλ−1/2_{. This motivates us to try similar tight bounds for}_f

z . This target will be

achieved in this section by applyingProposition 4.1iteratively. The iteration technique has been used in13,21to improve learning rates.

Lemma 5.1. Assume2.3with 0< r ≤1/2 and2.5withs >0. Takeλm−α_{with 0}_{< α}_≤_{1 and}

m−β_{with 0}_{< β}_{≤ ∞}_{. Let 0}_{< η <}_{1. If}_ρ_{has a}_τ_{-quantile of}_p_{-average type}_q_{for some}_p_∈₀_,_∞

andq∈1,∞, then for any 0< δ <1, with confidence 1−δ, there holds

fz K≤4C 3 1√2 & C₁ & C₂ log3 η 2 log3 δm θη_, 5.2 whereθηis given by θηmax α−β 2 , α1−2r 2−2r , α 2 − 1 22−θ, α2s−θ−11s 2s−θ2s η ≥0. 5.3

Proof. Puttingλ m−α_{with 0} _{< α}_≤ _{1 and} _m−β_{with 0}_{< β} _{≤ ∞}_into_{Proposition 4.1}_{, we}

know that for anyR≥1 there exists a subsetVRofZmwith measure at mostδsuch that

fz

K≤amR

s/22s_b

m, ∀z∈ WR\VR, 5.4

where withζ: max{α−β/2, α1−2r/2−2r, α/2−1/22−θ} ≥ 0, the constants are given by am & C₃mα/2−1/22s−θ, bm ⎛ ⎝√₂ C₁log3 δ C₂log3 δ ⎞ ⎠_mζ_. _5.5 It follows that WR⊆ WamRs/22sbm ∪VR. 5.6

(14)

Let us apply5.6iteratively to a sequence{Rj_}J

j0defined byR0λ−1/2andRj

amRj−1s/22sbmwhereJ ∈Nwill be determined later. Then,WRj−1⊆ WRj∪VRj−1. By5.1,WR0 Zm_{. So we have} ZmWR0⊆ WR1∪V_R0 ⊆ · · · ⊆ W RJ∪∪_jJ₀−1V_Rj . 5.7 As the measure ofVRj is at most δ, we know that the measure of∪J_j₀−1V_Rj is at most Jδ. Hence,WRJ_{has measure at least 1}₋_Jδ_.

DenoteΔ s/22s<1/2. The definition of the sequence{Rj_}J

j0tells us that RJa_m1ΔΔ2···ΔJ−1R0Δ J J−1 j1 a_m1ΔΔ2···Δj−1b_mΔjbm. 5.8

Let us bound the two terms on the right-hand side. The first term equals

C₃1−ΔJ/21−Δmα2s−θ−1/42s−2θ1−ΔJ/1−Δmα/2ΔJ, 5.9

which is bounded by

C₃mα2s−θ−1/42s−2θ1−Δmα/2−α2s−θ−1/42s−2θ1−ΔΔJ

≤C₃mα2s−θ−11s/2s−θ2sm1/2s−θ2−J. 5.10

TakeJto be the smallest integer greater than or equal to log1/η/log 2. The above expres-sion can be bounded byC₃mα2s−θ−11s/2s−θ2sη_.

The second terms equals

J−1 j1 a1ΔΔ_m 2···Δj−1b_mΔjbm≤ J−1 j1 C₃mα/2−1/22s−θ1−Δj/1−ΔbΔ₁jmζΔjb1mζ, 5.11 whereb1 : √

2&C₁log3/δ &C₂log3/δ. It is bounded by

mα2s−θ−11s/2s−θ2sC₃b1

J−1

j0

mζ−α2s−θ−11s/2s−θ2ssj/22sj. 5.12

When ζ ≤ α2s−θ−11s/2s−θ2s, the above expression is bounded by

C₃b1Jmα2s−θ−11s/2s−θ2s. When ζ ≥ α2s−θ−11s/2s−θ2s, it is

(15)

Based on the above discussion, we obtain

RJ≤C₃C₃b1J

mθη_, _5.13

whereθηmax{α2s−θ−11s/2s−θ2s η, ζ}. So with confidence 1−Jδ,

there holds fz K≤R J_≤_C 3 1√2 & C₁ & C₂ J log3 δm θη_. 5.14

Then, our conclusion follows by replacingδbyδ/Jand notingJ≤2 log3/η. Now, we can prove our main result,Theorem 2.5.

Proof ofTheorem 2.5. TakeRto be the right side of5.2. ByLemma 5.1, there exists a subset

V_R ofZm_{with measure at most}_δ_{such that}_Zm_\_V

R⊆ WR. ApplyingProposition 4.1to this

R, we know that there exists another subsetVR ofZmwith measure at mostδsuch that for

any z∈ WR\VR, Eτ πfz − Eτ fρ,τ ≤2m−βC₁log3 δm −αr/1−r_C 2log 3 δm −1/2−θ C₄ log3 η 2 log3 δm s/1sθη−1/2s−θ_, 5.15 where C₄ C₃4C₃s/1s 1√2 & C₁ & C₂ . 5.16

Since the setVR∪V_R has measure at most 2δ, after scaling 2δtoδand setting the constantC

by

C2Cq,ρ

2C₁C₂ C₄, 5.17

we see that the above estimate together withLemma 3.1gives the error bound

πfz −fρ,τ Lp_ρX∗ ≤C log3 η 2 log3 δm −ϑ _5.18

with confidence 1−δand the power indexϑgive by

ϑ 1 qmin β, αr 1−r, 1 2s−θ − s 1sθη , 5.19

(16)

provided that

θη<

1s

s2s−θ. 5.20

Sinceβ≥αr/1−r, we know thatα−β/2≤α1−2r/2−2r. By the restriction 0< α <

2s/s2s−θonα, we findα1−2r/2−2r<1s/s2s−θandα/2−1/22−θ<

1s/s2s−θ. Moreover, restriction2.7onηtell us thatα2s−θ−11s/2

s−θ2s η <1s/s2s−θ. Therefore, condition5.20is satisfied. The restriction

β≥αr/1−rand the above expression forϑtells us that the power index for the error bound can be exactly expressed by formula2.9. The proof ofTheorem 2.5is complete.

Finally, we proveTheorem 2.3.

Proof ofTheorem 2.3. Sincefρ,τ ∈ HK, we know that2.3holds withr1/2. The noise

condi-tion onρis satisfied withq2 andp∈0,∞. Then,θp/p1∈0,1. SinceX ⊂Rn_and

K ∈ C∞X×X, we know from11that2.5holds true for anys > 0. With 0< η <p

1/2p2, let us choosesto be a positive number satisfying the following four inequalities:

p1 p2 < 2s s2s−θ, 1 3 < 1s$22s−s2s−θp1/p2% s2s−θ2s , p1 p2 −2η≤ 1 2s−θ − s$2s−θp1/p2−1% 2s−θ2s − s 31s, p1 p2 −2η≤ 1 2s−θ. 5.21

The first inequality above tells us that the restrictions onα, βare satisfied by choosingα

p1/p2 1/2−θandp1/p2≤ β ≤ ∞. The second inequality shows that condition2.7for the parameterηrenamed now asη∗is also satisfied by takingη∗ 1/3. Thus, we applyTheorem 2.5and know that with confidence 1−δ,2.8holds with the power indexϑgiven by2.9; butr 1/2, α1/2−θ, andη∗1/3 imply that

ϑ 1 2min α, 1 2s−θ, 1 2s−θ − sα2s−θ−1 2s−θ2s − s 31s . 5.22 The last two inequalities satisfied bysyieldϑ ≥α/2−η. So2.8verifies2.4. This com-pletes the proof of the theorem.

Acknowledgment

The work described in this paper is supported by National Science Foundation of China un-der Grant 11001247 and by a grant from the Research Grants Council of Hong KongProject no. CityU 103709.

(17)

References

1 V. N. Vapnik, Statistical Learning Theory, Adaptive and Learning Systems for Signal Processing, Com-munications, and Control, John Wiley & Sons, New York, NY, USA, 1998.

2 N. Aronszajn, “Theory of reproducing kernels,” Transactions of the American Mathematical Society, vol. 68, pp. 337–404, 1950.

3 H. Tong, D.-R. Chen, and L. Peng, “Analysis of support vector machines regression,” Foundations of

Computational Mathematics, vol. 9, no. 2, pp. 243–257, 2009.

4 R. Koenker, Quantile Regression, vol. 38 of Econometric Society Monographs, Cambridge University Press, Cambridge, UK, 2005.

5 I. Steinwart and A. Christman, “How SVMs can estimate quantile and the median,” in Advances in

Neural Information Processing Systems, vol. 20, pp. 305–312, MIT Press, Cambridge, Mass, USA, 2008.

6 I. Steinwart and A. Christmann, “Estimating conditional quantiles with the help of the pinball loss,”

Bernoulli, vol. 17, no. 1, pp. 211–225, 2011.

7 D. H. Xiang, “Conditional quantiles with varying Gaussians,” online first in Advances in Computational

Mathematics.

8 T. Hu, D. H. Xiang, and D. X. Zhou, “Online learning for quantile regression and support vector regression,” Submitted.

9 D.-R. Chen, Q. Wu, Y. Ying, and D.-X. Zhou, “Support vector machine soft margin classifiers: error analysis,” Journal of Machine Learning Research, vol. 5, pp. 1143–1175, 2004.

10 D.-X. Zhou, “The covering number in learning theory,” Journal of Complexity, vol. 18, no. 3, pp. 739– 767, 2002.

11 D.-X. Zhou, “Capacity of reproducing kernel spaces in learning theory,” IEEE Transactions on

Infor-mation Theory, vol. 49, no. 7, pp. 1743–1752, 2003.

12 F. Cucker and D.-X. Zhou, Learning Theory: An Approximation Theory Viewpoint, Cambridge Mono-graphs on Applied and Computational Mathematics, Cambridge University Press, Cambridge, UK, 2007.

13 Q. Wu, Y. Ying, and D.-X. Zhou, “Learning rates of least-square regularized regression,” Foundations

of Computational Mathematics, vol. 6, no. 2, pp. 171–192, 2006.

14 Q. Wu and D.-X. Zhou, “Analysis of support vector machine classification,” Journal of Computational

Analysis and Applications, vol. 8, no. 2, pp. 99–119, 2006.

15 T. Hu, “Online regression with varying Gaussians and non-identical distributions,” Analysis and

Ap-plications, vol. 9, no. 4, pp. 395–408, 2011.

16 D.-H. Xiang, T. Hu, and D.-X. Zhou, “Learning with varying insensitive loss,” Applied Mathematics

Letters, vol. 24, no. 12, pp. 2107–2109, 2011.

17 S. Smale and D.-X. Zhou, “Learning theory estimates via integral operators and their approxima-tions,” Constructive Approximation, vol. 26, no. 2, pp. 153–172, 2007.

18 S. Smale and D.-X. Zhou, “Online learning with Markov sampling,” Analysis and Applications, vol. 7, no. 1, pp. 87–113, 2009.

19 Y. Yao, “On complexity issues of online learning algorithms,” IEEE Transactions on Information Theory, vol. 56, no. 12, pp. 6470–6481, 2010.

20 Y. Ying, “Convergence analysis of online algorithms,” Advances in Computational Mathematics, vol. 27, no. 3, pp. 273–291, 2007.

21 I. Steinwart and C. Scovel, “Fast rates for support vector machines using Gaussian kernels,” The

(18)

Submit your manuscripts at

http://www.hindawi.com

Hindawi Publishing Corporation

http://www.hindawi.com Volume 2014

Mathematics

Journal of

http://www.hindawi.com Volume 2014 Mathematical Problems in Engineering

Hindawi Publishing Corporation http://www.hindawi.com

Differential Equations

International Journal of

Volume 2014

Applied Mathematics Hindawi Publishing Corporation

Probability and Statistics Hindawi Publishing Corporation

Mathematical PhysicsAdvances in

Complex Analysis

Journal of

Optimization

Journal of Hindawi Publishing Corporation

Combinatorics

International Journal of Hindawi Publishing Corporation

Operations Research

Journal of

Function Spaces

Abstract and Applied Analysis Hindawi Publishing Corporation

http://www.hindawi.com Volume 2014 International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporation http://www.hindawi.com Volume 2014

The Scientific

World Journal

Algebra

Discrete Dynamics in Nature and Society

http://www.hindawi.com Volume 2014 Hindawi Publishing Corporation

Decision Sciences

Discrete Mathematics

Journal of

http://www.hindawi.com Volume 2014 Hindawi Publishing Corporationhttp://www.hindawi.com Volume 2014

Approximation Analysis of Learning Algorithms for Support Vector Regression and Quantile Regression

Research Article