Estimates of the Approximation Error Using Rademacher Complexity: Learning Vector Valued Functions

(1)

Journal of Inequalities and Applications Volume 2008, Article ID 640758,16pages doi:10.1155/2008/640758

Research Article

Estimates of the Approximation Error

Using Rademacher Complexity:

Learning Vector-Valued Functions

Giorgio Gnecco1, 2_{and Marcello Sanguineti}2

1_{Department of Mathematics (DIMA), University of Genova, Via Dodecaneso 35, 16146 Genova, Italy} 2_{Department of Communications, Computer, and System Sciences (DIST), University of Genova,}

Via Opera Pia 13, 16145 Genova, Italy

Correspondence should be addressed to Marcello Sanguineti,[email protected]

Received 15 January 2008; Revised 13 August 2008; Accepted 20 October 2008

Recommended by Ulrich Abel

For certain families of multivariable vector-valued functions to be approximated, the accuracy of approximation schemes made up of linear combinations of computational units containing adjustable parameters is investigated. Upper bounds on the approximation error are derived that depend on the Rademacher complexities of the families. The estimates exploit possible relationships among the components of the multivariable vector-valued functions. All such components are approximated simultaneously in such a way to use, for a desired approximation accuracy, less computational units than those required by componentwise approximation. An application toN-stage optimization problems is discussed.

Copyrightq2008 G. Gnecco and M. Sanguineti. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. Introduction

Various authors have derived upper bounds on the approximation error of certain linear combinations of computational units containing adjustable parameters, calledvariable-basis approximation schemes1, for various families of functions to be approximatedsee, e.g.,2–

(2)

Upper bounds on the approximation error, giving a partial explanation of this eﬃciency, have been derived using tools from statistical learning theory SLT 15. This approach was first applied by Barron2and Girosi4. The latter exploited a well-known theorem by Vapnik and Chervonenkis15, which gives, for a family of real-valued functions, a probabilistic uniform bound on the diﬀerence between the expected and empirical risks associated with a learning problem. Such a bound is expressed in terms of a combinatorial parameter, theVC dimensionof the approximating family. For functions having an integral representation as the convolutionκ∗λof anL1Rdfunctionλwith a bounded functionκ :

Rd _→ _R_{, Girosi}₄_{estimated in terms of VC-dimension the sup-norm error in approximation}

by linear combinations ofκ· −t1, . . . , κ· −tn, with the parameterst1, . . . ,tnvarying inRd;

this is a variable-basis approximation scheme1.

In8, Girosi’s estimate4was extended to the approximation of functions for which the representation κ∗λ holds withλ ∈ LpRd, 1 < p < ∞, with the error measured in a

weighted essential supremum norm. In9, Kon and Raphael used Girosi’s approach4to derive error bounds for approximation in certain Hilbert spaces frequently used in learning theory, called reproducing kernel Hilbert spacesRKSHs; see16,17, 18, Section III.3, and 19. In10, we exploited recent developments of SLT20to improve the approximation bounds from4,8,9. The estimates in10were derived in terms of theRademacher complexity

20 of the families of functions to be approximated. Among recent works dealing with learning from the point of view of approximation theory, we cite18 on the mathematical foundations of learning,21 which uses Rademacher averages, too,22 which considers noise in the sampling data,23, and the references therein. An excellent monograph devoted to this topic is24.

The above-mentioned estimates were derived for scalar functions. They can be applied separately to every component of a vector mapping but, in doing so, one does not exploit mutual dependencies, similarities, and relationships that may hold among the components themselves. This may happen in various contexts, for example, when one has to approximate the optimal policy functions inN-stage optimization problems25. Despite its relevance in a number of applications, in the learning theory community the problem of approximating vector-valued functions has been studied much less than the scalar case. Its importance in learning seems to have been first pointed out in26,27. A framework to study this problem in RKHSs was set down in 28 and further developed in other works; see29 and the references therein.

In this paper, first we improve an estimate, obtained in 10, of the approximation error for scalar functions. Then, we derive upper bounds on the error of approximating

simultaneously all componentsof multivariable vector-valued functions, using a variable-basis approximation scheme in whichall the scalar components share the same adjustable parameters inside the basis functions. In this way, one may obtain the same approximation accuracy using fewer parameters to be optimized than those required by componentwise hence scalar approximation, in which, in general, the adjustable parameters inside the basis functions are diﬀerent for the diﬀerent components. We derive our estimates in terms of the Rademacher complexities of the families of functions f : X ⊂ Rd _→ _Rk_{, whose components have}

(3)

The paper is organized as follows.Section 2describes notations and gives definitions.

Section 3 refines some estimates for scalar functions, obtained in 4 and improved in 10.Section 4contains our upper bounds on the approximation error, for certain families of vector-valued functions. Section 5 compares the bounds derived for scalar and vector functions. Section 6 discusses the application of the results to approximate dynamic programming.

2. Notations and definitions

ByRandRwe denote the sets of real and positive real numbers, respectively, and byNand Nthe sets of natural numbers and positive integers, respectively.

For a real normed linear spaceH,·,f ∈ H, andr >0, we denote byBrf,·the

closed ball of radiusrin the norm·centered atf ∈ H, that is,

Br

f,· h∈ H | h−f ≤r. 2.1

We write Br·instead of Br0,·. When the norm is clear from the context, we write

merelyH,Brf, andBrinstead ofH,·,Brf,·, andBr·, respectively.

For 1 ≤ p < ∞, a positive integer d, and a Lebesgue-measurable set X ⊆ Rd_{, we}

denote byLpXthe space of equivalence classes ofreal-valued functions onXthat have

integrablepth power with respect to the Lebesgue measure, endowed with the standard norm

·p,X. ByL∞Xwe denote the space ofequivalence classes ofreal-valued functions on

X that are essentially bounded with respect to the Lebesgue measure, endowed with the essential supremum norm·_∞,X, and byCXthe space of continuous functions onXwith

the supremum norm. Whenever there is no ambiguity, we omitXfrom the notations. Thed-dimensional Fourier transformis defined30, pages 180, 187as the operator on

L1Rd∩ L2Rd, continuously extended to an operator fromL2RdtoL2Rd, such that for every functionf∈ L1Rd∩ L2Rdone has

ft−→fs 1 2πd/2

Rd

eit,sftdt, 2.2

where·,·denotes the Euclidean inner product inRd_{. For}_f _{∈ L}

2Rd, one hasf2 f2

30, page 187.

For a positive integerd, a setX ⊆ Rd_{, and a family}_F _{of functions on}_X_{, we denote}

byFx : X → Ra function inF, wherexis a parameter used to identify elements inFwe use the notationFx since we will consider familiesF of functions onX having the integral representationfx _XKxtλtdt, whereK:X×X → R. So,Fxtis defined viaKxtand the parameterxused to identify the elements ofFis a pointx∈X.

ByPXwe denote a probability distribution onX; we write merelyPwhen the setXis

clear from the context. For every positive integern, aPX-i.i.d. sequenceis a sequence{ti}ofn

points obtained by samplingXindependentlyntimes according toPX similarly, we define

apX-i.i.d. sequence whenpXis a probability density.

(4)

sequence, and {εi} a sequence of n independent Rademacher random variables. Given a

familyF {Fx}of functionsf :X → R, theRademacher complexityofFis defined as20

RnFEt1,...,tnEε1,...,εn 1

√

nsup_F_x_∈_F

n

i 1

εiFxti

. 2.3

Upper bounds on the Rademacher complexities of various families of functions are available; see, for example,10,20,31.

The VC dimension of a family F {Fx} of real-valued functions on a set X is the maximum number hof points {ti} in X that can be separated into two distinct classes in

all 2h_{possible ways, by using functions of the form}_F_x_t₋_α_{, where the parameters}_x_and_α

vary inXandR,respectively15.

3. A bound for the approximation of scalar functions

Forr >0, theBessel potential of orderris defined as the functionβr :Rd → Rwith the Fourier

transform

βrs 2π−d/2

1s2−r/2. 3.1

We consider the family of functions defined as

F1

r

f:Rd_−→_R_|_f _β

r∗λ, λ∈ L1

Rd_, _3.2

where for two functionsg, h :Rd _→ _R_,_g_∗_hx

Rdgyhx−ydyis theirconvolution. The spaceF1

r is calledBessel potential space of order1; it is a normed space with the norm·F1

r defined for everyf∈ F1

r asfF1

r λ1.

The following result from 10 improves the approximation bound from 4, Proposition 3.1. We let

KBessel_r,x tβrx−t. 3.3

Theorem 3.1see10. There exists an absolute positive constantCsuch that the following holds. Letr, dbe positive integers,r > d, and lethrbe theV Cdimension of{Kr,Besselx }. For everyf ∈ F1rand every positive integern, there existt1, . . . ,tn∈Rdandc1, . . . , cn∈ {−1,1}such that

sup x∈Rd

fx− λ1

n

i 1

ciβrx−ti

≤Cλ1

hr

n. 3.4

In4,10, no upper bound on hr was given up to our knowledge, in the literature

even the boundedness ofhr had not yet been proven till now. In the following we provide

such an upper bound.

Proposition 3.2. Letr, d be positive integers,r > d, and lethr be theV Cdimension of{Kr,Besselx }.

(5)

Proof. Consider the two families of functionsE {KBessel

r,x }and G {Gx}, whereGxt e−x−t2. By inspection of the proof of10, Corollary 5.2we get

βrt

2−d/2

Γr/2 _∞

0

ur−d/2−1_e−t2_/₄_u

e−u_du. _3.5

This integral representation shows thatβrtis a decreasing function oft. Let us define the

two functionsa, b:0,∞ → Ras

atβrt,

bte−t2, b∞ 0.

3.6

Sincebis one-to-one, there exists a functionφ:0,1 → Rsuch thata φ◦b. Indeed, one can takeφ a◦b−1_{. Then}_β

rx−t φe−x−t

2

, that is,E φG, whereφis one-to-onesince it is the composition of two one-to-one functions. Then theV CdimensionsV CEandV CGof EandGare the same. SinceGis a subset of the family of functionsL {KGauss

x }considered in the proof of10, Corollary 5.2, where it was shown thatV CL≤d3, one has also

V CE V CG≤V CL≤d3. 3.7

The proof of the boundV CL≤d3 in10, Corollary 5.2can be slightly adapted to find a better upper bound directly onV CG. Here we report only the modifications that are required. By the same arguments of the proof of10, Corollary 5.2it is easy to show that, for every fixedx, each element of the family{−x−t2_α_}_{, where}_α_{is a real parameter, can be} expressed as a linear combination of thed2 functions

1, t1, . . . , td,t2. 3.8

Hence, by32, Theorem 1,V CGis at mostd2.

Combining10, Theorem 4.5with the proof technique ofProposition 3.2, we get the following estimate. Recall that a function ψ with bounded variation can be written as the diﬀerence of two decreasing functionsψ1andψ233, Theorem 4, page 331. In order to state the result we require in addition thatψ1andψ2are bounded.

Proposition 3.3. Let d be a positive integer, X ⊂ Rd _{a compact domain,} _K _: _Rd _×_Rd _→ _R

continuous, λ ∈ L1X, and let f : X → R have the representation fx

XKxtλtdt, whereKxt ψx−tandψzhas bounded variation. Suppose that there existτ1, τ2 ≥ 0and

a decomposition of ψz as ψz ψ1z−ψ2z such thatψ1zand ψ2zare decreasing and sup_t_∈_R|ψ1z| ≤ τ1,supt∈R|ψ2z| ≤τ2. Then there exist two absolute positive constantsC1andC2

such that for every positive integern, there existt1, . . . ,tn∈Xandc1, . . . , cn∈ {−1,1}for which

sup x∈X

fx− λ1

n

i 1

ciK

x,ti

≤ λ1

C1

d2C2

τ1τ2

1

(6)

Proof. SettingK1,xtψ1x−tandK2,xtψ2x−t, we get

fx

X

Kxtλtdt

X

K1,xtλtdt−

X

K2,xtλtdt. 3.10

Since the functionsψ1andψ2are decreasing, exploiting similar arguments as in the proof of Proposition 3.2we conclude that the upper boundd2 holds for theV Cdimensions of the families of functions{K1,x}and{K2,x}, too. Then the statement follows by10, Theorem 4.3 and Lemma 4.4and the subadditivity property of the Rademacher complexity34, Theorem 12, point 7.

4. Bounds for the approximation of vector-valued functions

In10, the following approximation error bound was obtained in terms of the Rademacher complexity for some quite general families of functions. It improves, at least asymptotically, the bound derived in4for the same families of functions.

Theorem 4.1see10. LetX ⊂Rd_{be a compact domain,}_K_:_X_×_X _→ _R_{continuous, and}_{τ >}₀

such that, for allxandt, one has|Kx,t| ≤τ. Letλ∈ L1X, letfbe a real-valued function onX

having the representationfx _XKx,tλtdt, andRnthe Rademacher complexity of the family

{Kx,·}. There exists an absolute positive constantCsuch that for every positive integernthere exist

t1, . . . ,tn∈Xandc1, . . . , cn∈ {−1,1}for which

sup x∈X

fx− λ1,X

n

i 1

ciKx,ti

≤Cλ1,X

Rnτ

1

n. 4.1

Note that the approximation scheme in4.1requiresnbinary parameters andndreal parameters.

Remark 4.2. Inspection of the proof of Theorem 4.1 shows that if λt is a nonnegative function, thenci 1,i 1, . . . , n. Thus, ifKx,tis convexor concave, bothfxand its

sparse approximationλ1/n

_n

i 1ciKx,tiare convexconcave, resp., too. Preservation of

convexityor concavityis an interesting property of approximators, sometimes exploited in applications35.

We first derive for vector-valued functions an immediate corollary ofTheorem 4.1. Then, we turn our attention to the simultaneous approximation of all the components of a vector-valued function, which allows one to exploit similarities that may hold among them.

Corollary 4.3. LetX ⊂ Rd_{be a compact domain,}_K _: _X_×_X _→ _R_continuous,_{τ >} ₀_{such that,}

for allxandt, one has|Kx,t| ≤τ, andRnthe Rademacher complexity of the family{Kx,·}. Let

f:X → Rk_{be such that each of its components has the representation}_f mx

XKx,tλmtdt, whereλm∈ L1X,m 1, . . . , k. Then for every >0it is possible to approximate in the supremum

norm onXeach componentfmoffwith an error at mostby using approximations of the form

fmx

λm1,X

nm nm

i 1

cm,iK

x,tm,i

(7)

where, fori 1, . . . , nmand m 1, . . . , k,tm,i ∈ X and cm,i ∈ {−1,1}, provided that for every

m 1, . . . , kone has

nm≥

C2_R

nmτ 2

λm2₁_,X

2 , 4.3

whereCis an absolute positive constant.

Proof. It follows immediately by Theorem 4.1 applied to each of the k components of f, choosing each nm such that the correspondent approximation error is upper bounded

by.

To approximate all thek components off, the approximation scheme4.2requires n k_m ₁nmbinary parameters, andndreal parameters.

Let us now consider the case where similarities among the k components of fare present. We would like to exploit them to obtain a desired approximation accuracy with a number of parameters smaller than the one given by 4.2, by letting some parameters be shared among thekapproximators. Assume that, in the expressionfmx

XKx,tλmtdt,

m 1, . . . , k, for the components off, eachλm∈ L1Xcan be written as

λmt λmtλt, 4.4

whereλm are measurable functions taking only the values−1 and1 and 0/λ ∈ L1Xis a nonnegative function there is no loss of generality in assuming that λ is nonnegative. Setting

Kmx,tKx,tλmt, 4.5

one has

fmx

X

Kmx,tλtdt, 4.6

which can be written as

fmx

λ1,X

X

Kmx,t λ

t

λ1,X

dt. 4.7

As noted in4, for everyx∈Xthe representation4.7can be regarded as the expected value of the random variable sgnλtKmx,twith respect to the probability density|λt|/λ1,X.

(8)

Exploiting the particular nature of the functionsλmt, which can take only the values

−1 and1, the following proposition shows that the Rademacher complexities of the families

{Kmx,·}and{Kx,·}are the same.

Proposition 4.4. Let Rn and Rm,n be the Rademacher complexities of the families {Kx,·} and

{Kmx,·}, respectively. Then for everym 1, . . . , k,Rm,n Rn.

Proof. It follows directly from the definition of the Rademacher complexity, as eachλmtcan

take only the values−1 and1 and the Rademacher random variablesiare independent and

symmetrically distributed around 0.

To derive an extension ofTheorem 4.1to families of vector-valued functions, we will exploit the following estimate from10. Recall that for a familyF {Fx}of functions onX and a probability distributionPX onX, theexpected riskassociate with a functionFx ∈ F is defined as

RFx

X

FxtdPXt. 4.8

So,RFx EPX{Fxt}, whereEPXis the expectation operator. Theempirical riskassociate with the functionFxt∈Fand the sequence{ti}of samples is defined as

Remp

Fx,ti

1

n

i 1

Fxti. 4.9

Theorem 4.5see10. LetPX be a probability distribution onX ⊆ Rd,{ti}aPX-i.i.d. sequence ofnpoints in X,τ > 0, andF a family of −τ, τ-valued continuous functions with Rademacher complexityRn. Then there exists an absolute constantCsuch that for all0< δ <1, with probability at least1−δ,one has

sup

Fx∈F

_R

Fx−Remp

Fx,ti≤2τ C

1 nmax

_R

nτ

2τ 2

,ln1 δ

. 4.10

In the next theoremTheorem 4.7, we will improve the estimate fromTheorem 4.1

for approximation of vector-valued functions fork 1 and in the limit δ → 0 one gets

Theorem 4.1. To derive such an extension, we will need the density result stated in the following lemma. ForX ⊂ Rd _compact,_K _: _X _×_X _→ _R_{continuous, and} _λ _{∈ L}

1X, we let

Fλ

f:X−→R|fx

X

Kx,tλboundtλtdt, λbound∞,X≤1

,

Gλ

g:X−→R|gx

X

Kx,tλbintλtdt, λbin:X−→ {±1}measurable

.

4.11

Lemma 4.6. LetX⊂Rd_{be compact,}_K_:_X_×_X _→ _R_{continuous, and}_λ_{∈ L}

(9)

Proof. Given an element ofFλ, sayfboundx

XKx,tλboundtλtdt, and an arbitrary >0,

let us find an element ofGλ, saygbinx

XKx,tλbintλtdt, such that supx∈X|fboundx−

gbinx|< .

AsKx,tis uniformly continuous onX ×X, for everyη > 0 it is possible to find a partition{Pi:i 1, . . . , Nη}ofXsuch that for everyi, j∈ {1, . . . , Nη}one has

Kmaxi,j−Kmini,j≤η, 4.12

whereKmaxi,jmaxx,t∈Pi×PjKx,tandKmini,jminx,t∈Pi×PjKx,t. Letx∈Pi. As

Pi

Kx,tλboundtλtdt

Pj

Kx,t−Kmini,jλboundtλtdt

Kmini,j

Pj

λboundt−λbint

λtdt

Pj

Kmini,j−Kx,tλbintλtdt

Pj

Kx,tλbintλtdt,

4.13

we get

sup x∈Pi

Pj

Kx,tλboundtλtdt−

Pj

Kx,tλbintλtdt

≤sup

x∈Pi

Pj

sup

x∈Pi Kmini,j

Pj

_λ_bound_t₋_λ_bin_t_λ_t_d_t

sup

x∈Pi

Pj

.

4.14

As to the first and third terms in the right-hand side of4.14, by H ¨older’s inequality and4.12we get

sup x∈Pi

Pj

≤sup

x∈Pi

_Kx,·−Kmini,j_∞_,P_jλboundλ₁_,P_j

≤ηλ1,Pj,

sup x∈Pi

Pj

≤sup

x∈Pi

_Kx,·−Kmini,j_∞,Pjλbinλ1,Pj

≤ηλ1,Pj.

4.15

Now we consider the second term in the right-hand side of4.14. Letc_P

jλboundtλtdt and dividePjinto two setsAjandBj, such that

(10)

by33, Theorem 6, page 300, asλbound∞ ≤ 1 impliesc≤ λ1,Pj. Choosingλbinsuch that

λbint sgnλt, for allt∈Ajandλbint −sgnλt, for allt∈Bj, simple computations

give

Pj

_λ_bound_t₋_λ_bin_t_λ_t_d_t ₀_. _4.16

Summing up, takingη /2 for every >0 we get

sup x∈X

fboundx−gbinx

sup x∈X

Pj Pj

Kx,tλboundtλtdt−

Pj

Kx,tλbintλtdt

≤λ1.

4.17

The statement follows by letting → 0 and considering correspondingly refined partitions.

Now we can prove our main result.

Theorem 4.7. LetX ⊂Rd_{be a compact domain,}_K_:_X_×_X _→ _R_continuous,_{τ >}₀_{such that, for}

allxandt, one has|Kx,t| ≤ τ, andRnbe the Rademacher complexity of the family{Kx,·}. Let

0/λ∈ L1Xbe a nonnegative function onXandf:X → Rksuch that each of its components has

the representationfmx

XKmx,tλtdt. Form 1, . . . , k, setKmx,t Kx,tλmt, where eachλmis a measurable function such thatλm∞,X≤1. Then there exists an absolute constantCsuch that for every positive integernthere existt1, . . . ,tn∈Xandcm,1, . . . , cm,n∈ {−1,1},m 1, . . . , k, for which one has simultaneously for allm 1, . . . , k

sup x∈X

fmx−

λ1,X

n

i 1

cm,iK

x,ti

≤2τ Cλ1,X

1 nmax

_R

nτ

2τ 2

,lnk

. 4.18

Proof. For every x ∈ X, fmx/λ1,X

XKmx,tλt/λ1,Xdt can be considered as

the expected risk of the function sgnλ·Kmx,· with respect to the probability density

|λt|/λ1,X. Combining Proposition 4.4, Theorem 4.5, and Lemma 4.6, for every positive

integerlandm 1, . . . , k, it is possible to find an approximationλl_bin_,mtofλmtsuch that

for everyδm > 0 and everyλt/λ1,X-i.i.d. sequence{ti}, we get with probability at least

1−δm

sup x∈X

fmx

λ1,X −

1 n n i 1

λl_bin_,mtiK

x,ti

≤2τC

1 nmax

_R

nτ

2τ 2

,ln 1 δm

1

l. 4.19

For simplicity and without any loss of generality, we takeδm δ <1/k, for allm 1, . . . , k.

By standard probability arguments we get with probability at least 1−kδ

sup x∈X

fmx

λ1,X −

1 n n i 1

λl_bin_,mtiK

x,ti

≤2τ C

1 nmax

_R

nτ

2τ 2

,ln1 δ

1

(11)

simultaneously for all components off. Then, there exists a choicetl,δ_i ofti,i 1, . . . , n, for

which the above-written upper bound holds.

AsXis compact, by lettingl → ∞andδ → 1/kit is possible to extract a subsequence denoted—with a little abuse—by{tl,δ_i }, such that, for eachi 1, . . . , n,tl,δ_i → t∗_i ∈X. Similarly, since for everylone has a finite number of binary parameterscl,δ_m,i λl

bin,mt l,δ

i , one can also

extract a subsequence denoted—with a little abuse—by{cl,δ_m,i}, such that, for everyi 1, . . . , n andm 1, . . . , k, cl,δ_m,i → c∗_m,i∈ {−1,1}asl → ∞andδ → 1/k. Then

sup x∈X

fmx−

λ1,X

n

i 1

c_m,i∗ Kx,t∗_i≤sup x∈X

fmx−

λ1,X

n

i 1

cl,δ_m,iKx,tl,δ_i

sup

x∈X

λ1,X

n

i 1

cl,δ_m,iKx,tl,δ_i −c∗_m,iKx,t∗_i.

4.21

AsKx,tis uniformly continuous onX×X, in the limit we get

sup x∈X

fmx−

λ1,X

n

i 1

c_m,i∗ Kx,t∗_i≤2τCλ1,X

1 nmax

_R

nτ

2τ 2

,lnk

. 4.22

Remark 4.8. Note that Theorem 4.7 does not require knowledge of the Rademacher complexities of the families{Kmx,·}, which might not be related in a simple way to that

of{Kx,·}diﬀerently fromProposition 4.4, where a very special structure ofλmthas been

used.

Remark 4.9. IfR_nτ/2τ2 ≥ lnkin particular, this holds for the scalar casek 1, one simply gets

sup x∈X

fmx−

λ1,X

n

i 1

cm,iK

x,ti

≤Cλ1,X

Rnτ

1 n,

which has the same form of the bound given inTheorem 4.1but holds simultaneously for all the components off.

Corollary 4.10. LetX⊂Rd_{be a compact domain,}_K_:_X_×_X _→ _R_continuous,_{τ >}₀_{such that, for}

allxandt, one has|Kx,t| ≤τ, and letRnbe the Rademacher complexity of the family{Kx,·}. Let

0/λ ∈ L1Xbe a nonnegative function onXandf:X → Rksuch that every component has the

representationfmx

XKmx,tλtdt. Form 1, . . . , k, setKmx,t Kx,tλmt, where each

λmis a measurable function such thatλm∞,X≤1. Then for each >0it is possible to approximate in the supremum norm onXeach componentfmoffwith an error at mostby using approximations of the form

fmx

λ1,X

n n i 1

cm,iK

x,ti

(12)

whereti∈Xandcm,i∈ {−1,1}, fori 1, . . . ,nandm 1, . . . , k, provided that

n≥ C

2_λ2 1,X

2 max

Rnτ

2

, 4τ2lnk, 4.24

whereCis an absolute constant.

Proof. It follows immediately by applying Theorem 4.7, by choosing n such that the approximation error for each of thekcomponents offis upper bounded by.

5. Comparison between the scalar and the vector approximation schemes

To approximate in the supremum norm onXall thekcomponents of a functionf:X⊂Rd _→

Rk_{with an error at most}_>_{0, the vector approximation scheme}_4.23_requires

ink binary parameters, iind real parameters,

wheren≥C2_λ2

1,X/2max{Rnτ

2_, ₄_τ2_ln_k_}_.

To the same end, the scalar approximation scheme4.2requires

inbinary parameters, iin d real parameters,

wheren k_m ₁nmandnm≥C2Rnm τ

2_λ

m2₁_,X/2.

In contrast to4.2, in4.23the real parameterstiare the same for all the components.

In the following, we compare the lower bounds on the numbers of real parameters given by 4.2and4.23. To this end, we focus on a case in which the dependence of the Rademacher complexityRnonncan be estimated from above.

Recall that areproducing kernel Hilbert space (RKHS)is a Hilbert space Hformed by functions defined on a nonempty setX such that for everyx ∈ X the evaluation functional

Fx, defined for anyf∈ HasFxf fx, is bounded16,19. RKHSs can be characterized in terms ofkernels, which forX ⊆Rd_are_{symmetric positive-semidefinite}_functions_K_:_X_×_X _→ _R_,

that is, symmetric functions such that for all positive integersm, allw1, . . . , wm∈Rm, and

allx1, . . . ,xm∈Xmsatisfy the conditionmi,j 1wiwjKxi,xj≥0.A kernel is called positive-definiteif the previous inequality is strict for allw1, . . . , wm/ 0, . . . ,0. If the kernelKis also

continuous, then it is called aMercer kernel. For every kernelK :X×X → Randx∈X, we define the functionKx:X → Ras

Kx·Kx,·. 5.1

If there exists a functionκ:Rd _→ _R_{such that the kernel}_K_{can be written as}_K_x_t_κ_x₋_t_,

(13)

representerofx, such thatFxf f, Kx,for allf ∈ Hthis property is called thereproducing

property. It is easy to check that the functionK :X×Xdefined for allx,y∈XasKx,y

x,y, where·,·denotes the ordinary scalar product inRd_{restricted to}_X_{, is a kernel. On the}

other hand, every kernelK:X×X → Rgenerates an RKHSHKXthat is the completion of

the linear span of the set{Kx :x∈X}, with the inner product defined asKx, Ky_K Kx,y; we denote by·Kthe induced normsee, e.g.,16and19, page 81.

For a compact domainX⊂Rd_{, a positive-definite, continuous kernel}_K_:_X_×_X _→ _R_,

and a probability measureμonX, we define the integral operatorTK:L2μ → L2μas

TKfx

X

Kx,tftdμt. 5.2

The following proposition gives an upper bound on the Rademacher complexity of the family GK {Kx,·}associate with the kernelKdefiningTK. By TrTK,we denote thetraceofTK,

that is, the sum of its eigenvalues.

Proposition 5.1. LetX⊂Rd_{be a compact domain,}_K_:_X_×_X _→ _R_{a positive-definite Mercer kernel,}

μa probability measure onX, andTK :L2μ → L2μthe integral operator defined by5.2. Let

HKXbe the RKHS associate with the kernelK, GK {Kx,·}, andsK supx∈XKx,·K

maxx∈X

Kx,x. If the largest eigenvaluel1TKofTKsatisfiesl1TK≥1/n, then

Rn

GK

≤sK

TrTK

.

Proof. LetBKbe the unit ball ofHKXcentered at 0. By the definition ofsK,GK ⊆ sKBK.

Then, the statement follows by20, Theorem 15, parts 1 and 3and20, Theorem 16.

Let us assume that the required approximation error > 0 is suﬃciently small, so that the assumptionl1TK ≥ 1/nofProposition 5.1is satisfied withnequal to eachnm in

Corollary 4.3and toninCorollary 4.10. Then, taking for simplicity of comparisonλm1,X

λ1,X which holds, e.g., when the functionsλmtake values in{±1}and using the upper

bound on the Rademacher complexity given inProposition 5.1, the lower bounds onnandn given byCorollary 4.3andCorollary 4.10become, respectively,

in≥kC2_s

K

TrTK τ

2

λ2

1,X/2;

iin≥C2_λ2

1,X/2max{sK

TrTK τ

2

,4τ2_ln_k_}_.

(14)

6. Application toN-stage optimization problems

Let us consider the solution, by the well-known dynamic programmingDPalgorithm14, of the followingfinite-horizon, discrete-time dynamic optimization problem, modeled as in37: givenx0∈X, findx1, . . . ,xN ∈Xsuch that

Jox0

sup

N−1

t 0

βthxt,xt1

βNhN

xN

,

wherext,xt1

∈D, t 0,1, . . . , N−1.

6.1

The vectorxt∈X ⊆ Rd represents the state of a dynamical system,Xis the set to which the

state vector belongsstate space,D ⊆X×Xis the graph of acorrespondencethat models the transition from one stage to the following one,h: D → Ris atransition reward,hN :X →

Ris thefinal reward associate with the final stage, 0 < β ≤ 1 is a discount factor, and Jo _is

the so-calledvalue function. Using an economic terminology,βt_h_x

t,xt1andβNhNxNare

actualized values of transition and final rewards, respectively. For simplicity of notations and without loss of generality, we assume thatX,h, andDdo not depend on the timet.

Dynamic programming considers the followingNsubproblems:

J_NoxN hNxN; 6.2

givenxi∈X, findxi1, . . . ,xN∈Xsuch that

J_ioxi

sup

N−1

t i

βt−ihxt,xt1

βN−ihN

xN

, i N−1, . . . ,0,

wherext,xt1

∈D, t i, . . . , N−1.

6.3

Under suitable conditions on the problem formulation 38, Theorem 4.1 can be exploited to find sparse approximations for both theith stage value functionsJo

i : X → R

and each component of the optimalith stage policy functionsgo_i :X → X,i N−1, . . . ,0. For the latter, since each component is dealt with separately, in general one expects that the parameterst1, . . . ,tn∈Xandc1, . . . , cn ∈ {−1,1}for which the bound given inTheorem 4.1

holds will be diﬀerent from one component to the other. In order to reduce the number of parameters needed to obtain a desired accuracy of approximation of the optimal policy functions, it is useful to have at one’s disposal an approximation error bound for which some of the parameters are common to the approximators of all the components of each optimal policy function. If similaritiescorrelationsamong the components of each optimal policy function to be approximated exist and can be modeled as inTheorem 4.7, the latter gives such an upper bound. In dynamic optimization problems, these similarities can arise in several waysalthough their analysis is problem-dependent.

For example, applying the dynamic programming algorithm at stageN−1, for each

xN−1∈Xone has to find

xo

N goN−1

xN−1

arg max xN∈X

hxN−1,xN

βhN

xN

(15)

Similarities among the components ofgo_N₋₁may be present, depending on the properties of hN−1andhN, which are given in the problem formulation.

Another case in which there can be similarities among the components of the optimal policy function may occur in the case of infinite-horizon, discrete-time dynamic optimization problemsmodeled as follows: givenx0∈Xand 0< β <1, findx1,x2, . . . ,∈Xsuch that

Jox0

sup

∞

t 0

βthxt,xt1

,

wherext,xt1

∈D, t 0,1, . . . .

6.5

It is known that under suitable hypotheses an optimal policy has one or more stationary points, that is, there existsx0 ∈ X such that gox0 x0. In such a case, ifgo is Lipschitz continuous which holds under some conditions; see 37 and its Lipschitz constant is “suﬃciently small,” then one hasgo_x∼_x_{on a “su}_ﬃ_{ciently large” neighborhood of}_x

0.

Acknowledgment

The authors were partially supported by a PRIN grant from the Italian Ministry for University and Research, project “Models and Algorithms for Robust Network Optimization.”

References

1 V. K ˚urkov´a and M. Sanguineti, “Comparison of worst case errors in linear and neural network approximation,”IEEE Transactions on Information Theory, vol. 48, no. 1, pp. 264–275, 2002.

2 A. R. Barron, “Neural net approximation,” inProceedings of the 7th Yale Workshop on Adaptive and Learning Systems, K. Narendra, Ed., pp. 69–72, Yale University Press, New Haven, Conn, USA, May 1992.

3 A. R. Barron, “Universal approximation bounds for superpositions of a sigmoidal function,”IEEE Transactions on Information Theory, vol. 39, no. 3, pp. 930–945, 1993.

4 F. Girosi, “Approximation error bounds that use VC-bounds,” inProceedings of the International Conference on Artificial Neural Networks (ICANN ’95), pp. 295–302, Paris, France, October 1995.

5 Y. Makovoz, “Uniform approximation by neural networks,”Journal of Approximation Theory, vol. 95, no. 2, pp. 215–228, 1998.

6 V. E. Maiorov, “On best approximation by ridge functions,”Journal of Approximation Theory, vol. 99, no. 1, pp. 68–94, 1999.

7 A. Pinkus, “Approximation theory of the MLP model in neural networks,”Acta Numerica, vol. 8, no. 1, pp. 143–195, 1999.

8 M. A. Kon, L. A. Raphael, and D. A. Williams, “Extending Girosi’s approximation estimates for functions in Sobolev spaces via statistical learning theory,”Journal of Analysis and Applications, vol. 3, no. 2, pp. 67–90, 2005.

9 M. A. Kon and L. A. Raphael, “Approximating functions in reproducing kernel Hilbert spaces via statistical learning theory,” inWavelets and Splines: Athens 2005, G. Chen and M.-J. Lai, Eds., Modern Methods in Mathematics, pp. 271–286, Nashboro Press, Brentwood, Tenn, USA, 2006.

10 G. Gnecco and M. Sanguineti, “Approximation error bounds via Rademacher’s complexity,”Applied Mathematical Sciences, vol. 2, no. 1–4, pp. 153–176, 2008.

11 V. K ˚urkov´a and M. Sanguineti, “Geometric upper bounds on rates of variable-basis approximation,” IEEE Transactions on Information Theory, vol. 54, no. 12, pp. 5681–5688, 2008.

12 V. K ˚urkov´a and M. Sanguineti, “Learning with generalization capability by kernel methods of bounded complexity,”Journal of Complexity, vol. 21, no. 3, pp. 350–367, 2005.

13 V. K ˚urkov´a and M. Sanguineti, “Approximate minimization of the regularized expected error over kernel models,”Mathematics of Operations Research, vol. 33, no. 3, pp. 747–756, 2008.

(16)

15 V. N. Vapnik,The Nature of Statistical Learning Theory, Springer, New York, NY, USA, 1995.

16 N. Aronszajn, “Theory of reproducing kernels,”Transactions of the American Mathematical Society, vol. 68, pp. 337–404, 1950.

17 J. Buescu and A. C. Paix˜ao, “Inequalities for diﬀerentiable reproducing kernels and an application to positive integral operators,”Journal of Inequalities and Applications, vol. 2006, Article ID 53743, 9 pages, 2006.

18 F. Cucker and S. Smale, “On the mathematical foundations of learning,”Bulletin of the American Mathematical Society, vol. 39, no. 1, pp. 1–49, 2002.

19 C. Berg, J. P. R. Christensen, and P. Ressel,Harmonic Analysis on Semigroups: Theory of Positive Definite and Related Function, vol. 100 ofGraduate Texts in Mathematics, Springer, New York, NY, USA, 1984.

20 S. Mendelson, “A few notes on statistical learning theory,” inAdvanced Lectures on Machine Learning, S. Mendelson and A. Smola, Eds., vol. 2600 ofLecture Notes in Computer Science, pp. 1–40, Springer, Berlin, Germany, 2003.

21 G. Blanchard, O. Bousquet, and L. Zwald, “Statistical properties of kernel principal component analysis,”Machine Learning, vol. 66, no. 2-3, pp. 259–294, 2007.

22 S. Smale and D.-X. Zhou, “Shannon sampling and function reconstruction from point values,”Bulletin of the American Mathematical Society, vol. 41, no. 3, pp. 279–305, 2004.

23 S. Smale and D.-X. Zhou, “Learning theory estimates via integral operators and their approxima-tions,”Constructive Approximation, vol. 26, no. 2, pp. 153–172, 2007.

24 F. Cucker and D.-X. Zhou, Learning Theory: An Approximation Theory Viewpoint, Cambridge Monographs on Applied and Computational Mathematics, Cambridge University Press, Cambridge, UK, 2007.

25 D. P. Bertsekas,Dynamic Programming and Optimal Control. Vol. II, Athena Scientific, Belmont, Mass, USA, 2nd edition, 2001.

26 F. A. Mussa-Ivaldi, “From basis functions to basis fields: vector field approximation from sparse data,” Biological Cybernetics, vol. 67, no. 6, pp. 479–489, 1992.

27 F. A. Mussa-Ivaldi and F. Gandolfo, “Networks that approximate vector-valued mappings,” in Proceedings of IEEE International Conference on Neural Networks (ICNN ’93), vol. 3, pp. 1973–1978, San Francisco, Calif, USA, March-April 1993.

28 C. A. Micchelli and M. Pontil, “On learning vector-valued functions,”Neural Computation, vol. 17, no. 1, pp. 177–204, 2005.

29 A. Caponnetto, C. A. Micchelli, M. Pontil, and Y. Ying, “Universal multi-task kernels,”Journal of Machine Learning Research, vol. 9, pp. 1615–1646, 2008.

30 W. Rudin,Real and Complex Analysis, McGraw-Hill, New York, NY, USA, 3rd edition, 1987.

31 C. S. Ong, X. Mary, S. Canu, and A. J. Smola, “Learning with non positive kernels,” inProceedings of the 21st International Conference on Machine Learning (ICML ’04), pp. 639–646, Banﬀ, Canada, July 2004.

32 E. Sontag, “VC dimension of neural networks,” inNeural Networks and Machine Learning, C. Bishop, Ed., pp. 69–95, Springer, Berlin, Germany, 1998.

33 A. N. Kolmogorov and S. V. Fom¯ın,Introductory Real Analysis, Dover, New York, NY, USA, 1975.

34 P. L. Bartlett and S. Mendelson, “Rademacher and Gaussian complexities: risk bounds and structural results,”Journal of Machine Learning Research, vol. 3, pp. 463–482, 2002.

35 O. L. Mangasarian, J. B. Rosen, and M. E. Thompson, “Convex kernel underestimation of functions with multiple local minima,”Computational Optimization and Applications, vol. 34, no. 1, pp. 35–45, 2006.

36 A. Friedman,Foundations of Modern Analysis, Dover, New York, NY, USA, 1982.

37 L. Montrucchio, “Lipschitz continuous policy functions for strongly concave optimization problems,” Journal of Mathematical Economics, vol. 16, no. 3, pp. 259–273, 1987.