• No results found

Estimates of the Approximation Error Using Rademacher Complexity: Learning Vector Valued Functions

N/A
N/A
Protected

Academic year: 2020

Share "Estimates of the Approximation Error Using Rademacher Complexity: Learning Vector Valued Functions"

Copied!
16
0
0

Loading.... (view fulltext now)

Full text

(1)

Journal of Inequalities and Applications Volume 2008, Article ID 640758,16pages doi:10.1155/2008/640758

Research Article

Estimates of the Approximation Error

Using Rademacher Complexity:

Learning Vector-Valued Functions

Giorgio Gnecco1, 2and Marcello Sanguineti2

1Department of Mathematics (DIMA), University of Genova, Via Dodecaneso 35, 16146 Genova, Italy 2Department of Communications, Computer, and System Sciences (DIST), University of Genova,

Via Opera Pia 13, 16145 Genova, Italy

Correspondence should be addressed to Marcello Sanguineti,[email protected]

Received 15 January 2008; Revised 13 August 2008; Accepted 20 October 2008

Recommended by Ulrich Abel

For certain families of multivariable vector-valued functions to be approximated, the accuracy of approximation schemes made up of linear combinations of computational units containing adjustable parameters is investigated. Upper bounds on the approximation error are derived that depend on the Rademacher complexities of the families. The estimates exploit possible relationships among the components of the multivariable vector-valued functions. All such components are approximated simultaneously in such a way to use, for a desired approximation accuracy, less computational units than those required by componentwise approximation. An application toN-stage optimization problems is discussed.

Copyrightq2008 G. Gnecco and M. Sanguineti. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. Introduction

Various authors have derived upper bounds on the approximation error of certain linear combinations of computational units containing adjustable parameters, calledvariable-basis approximation schemes1, for various families of functions to be approximatedsee, e.g.,2–

(2)

Upper bounds on the approximation error, giving a partial explanation of this efficiency, have been derived using tools from statistical learning theory SLT 15. This approach was first applied by Barron2and Girosi4. The latter exploited a well-known theorem by Vapnik and Chervonenkis15, which gives, for a family of real-valued functions, a probabilistic uniform bound on the difference between the expected and empirical risks associated with a learning problem. Such a bound is expressed in terms of a combinatorial parameter, theVC dimensionof the approximating family. For functions having an integral representation as the convolutionκλof anL1Rdfunctionλwith a bounded functionκ :

Rd R, Girosi4estimated in terms of VC-dimension the sup-norm error in approximation

by linear combinations ofκ· −t1, . . . , κ· −tn, with the parameterst1, . . . ,tnvarying inRd;

this is a variable-basis approximation scheme1.

In8, Girosi’s estimate4was extended to the approximation of functions for which the representation κλ holds withλ ∈ LpRd, 1 < p < ∞, with the error measured in a

weighted essential supremum norm. In9, Kon and Raphael used Girosi’s approach4to derive error bounds for approximation in certain Hilbert spaces frequently used in learning theory, called reproducing kernel Hilbert spacesRKSHs; see16,17, 18, Section III.3, and 19. In10, we exploited recent developments of SLT20to improve the approximation bounds from4,8,9. The estimates in10were derived in terms of theRademacher complexity

20 of the families of functions to be approximated. Among recent works dealing with learning from the point of view of approximation theory, we cite18 on the mathematical foundations of learning,21 which uses Rademacher averages, too,22 which considers noise in the sampling data,23, and the references therein. An excellent monograph devoted to this topic is24.

The above-mentioned estimates were derived for scalar functions. They can be applied separately to every component of a vector mapping but, in doing so, one does not exploit mutual dependencies, similarities, and relationships that may hold among the components themselves. This may happen in various contexts, for example, when one has to approximate the optimal policy functions inN-stage optimization problems25. Despite its relevance in a number of applications, in the learning theory community the problem of approximating vector-valued functions has been studied much less than the scalar case. Its importance in learning seems to have been first pointed out in26,27. A framework to study this problem in RKHSs was set down in 28 and further developed in other works; see29 and the references therein.

In this paper, first we improve an estimate, obtained in 10, of the approximation error for scalar functions. Then, we derive upper bounds on the error of approximating

simultaneously all componentsof multivariable vector-valued functions, using a variable-basis approximation scheme in whichall the scalar components share the same adjustable parameters inside the basis functions. In this way, one may obtain the same approximation accuracy using fewer parameters to be optimized than those required by componentwise hence scalar approximation, in which, in general, the adjustable parameters inside the basis functions are different for the different components. We derive our estimates in terms of the Rademacher complexities of the families of functions f : X ⊂ Rd Rk, whose components have

(3)

The paper is organized as follows.Section 2describes notations and gives definitions.

Section 3 refines some estimates for scalar functions, obtained in 4 and improved in 10.Section 4contains our upper bounds on the approximation error, for certain families of vector-valued functions. Section 5 compares the bounds derived for scalar and vector functions. Section 6 discusses the application of the results to approximate dynamic programming.

2. Notations and definitions

ByRandRwe denote the sets of real and positive real numbers, respectively, and byNand Nthe sets of natural numbers and positive integers, respectively.

For a real normed linear spaceH,·,f ∈ H, andr >0, we denote byBrf,·the

closed ball of radiusrin the norm·centered atf ∈ H, that is,

Br

f,· h∈ H | hfr. 2.1

We write Br·instead of Br0,·. When the norm is clear from the context, we write

merelyH,Brf, andBrinstead ofH,·,Brf,·, andBr·, respectively.

For 1 ≤ p < ∞, a positive integer d, and a Lebesgue-measurable set X ⊆ Rd, we

denote byLpXthe space of equivalence classes ofreal-valued functions onXthat have

integrablepth power with respect to the Lebesgue measure, endowed with the standard norm

·p,X. ByL∞Xwe denote the space ofequivalence classes ofreal-valued functions on

X that are essentially bounded with respect to the Lebesgue measure, endowed with the essential supremum norm·,X, and byCXthe space of continuous functions onXwith

the supremum norm. Whenever there is no ambiguity, we omitXfrom the notations. Thed-dimensional Fourier transformis defined30, pages 180, 187as the operator on

L1Rd∩ L2Rd, continuously extended to an operator fromL2RdtoL2Rd, such that for every functionf∈ L1Rd∩ L2Rdone has

ft−→fs 1 2πd/2

Rd

eit,sftdt, 2.2

where·,·denotes the Euclidean inner product inRd. Forf ∈ L

2Rd, one hasf2 f2

30, page 187.

For a positive integerd, a setX ⊆ Rd, and a familyF of functions onX, we denote

byFx : X → Ra function inF, wherexis a parameter used to identify elements inFwe use the notationFx since we will consider familiesF of functions onX having the integral representationfx XKxtλtdt, whereK:X×X → R. So,Fxtis defined viaKxtand the parameterxused to identify the elements ofFis a pointxX.

ByPXwe denote a probability distribution onX; we write merelyPwhen the setXis

clear from the context. For every positive integern, aPX-i.i.d. sequenceis a sequence{ti}ofn

points obtained by samplingXindependentlyntimes according toPX similarly, we define

apX-i.i.d. sequence whenpXis a probability density.

(4)

sequence, and {εi} a sequence of n independent Rademacher random variables. Given a

familyF {Fx}of functionsf :X → R, theRademacher complexityofFis defined as20

RnFEt1,...,tnEε1,...,εn 1

nsupFxF

n

i 1

εiFxti

. 2.3

Upper bounds on the Rademacher complexities of various families of functions are available; see, for example,10,20,31.

The VC dimension of a family F {Fx} of real-valued functions on a set X is the maximum number hof points {ti} in X that can be separated into two distinct classes in

all 2hpossible ways, by using functions of the formFxtα, where the parametersxandα

vary inXandR,respectively15.

3. A bound for the approximation of scalar functions

Forr >0, theBessel potential of orderris defined as the functionβr :Rd → Rwith the Fourier

transform

βrs 2πd/2

1s2−r/2. 3.1

We consider the family of functions defined as

F1

r

f:Rd−→R|f β

rλ, λ∈ L1

Rd, 3.2

where for two functionsg, h :Rd R,ghx

Rdgyhxydyis theirconvolution. The spaceF1

r is calledBessel potential space of order1; it is a normed space with the norm·F1

r defined for everyf∈ F1

r asfF1

r λ1.

The following result from 10 improves the approximation bound from 4, Proposition 3.1. We let

KBesselr,x tβrxt. 3.3

Theorem 3.1see10. There exists an absolute positive constantCsuch that the following holds. Letr, dbe positive integers,r > d, and lethrbe theV Cdimension of{Kr,Besselx }. For everyf ∈ F1rand every positive integern, there existt1, . . . ,tn∈Rdandc1, . . . , cn∈ {−1,1}such that

sup x∈Rd

fxλ1

n

n

i 1

ciβrxti

1

hr

n. 3.4

In4,10, no upper bound on hr was given up to our knowledge, in the literature

even the boundedness ofhr had not yet been proven till now. In the following we provide

such an upper bound.

Proposition 3.2. Letr, d be positive integers,r > d, and lethr be theV Cdimension of{Kr,Besselx }.

(5)

Proof. Consider the two families of functionsE {KBessel

r,x }and G {Gx}, whereGxt ext2. By inspection of the proof of10, Corollary 5.2we get

βrt

2−d/2

Γr/2

0

urd/2−1et2/4u

eudu. 3.5

This integral representation shows thatβrtis a decreasing function oft. Let us define the

two functionsa, b:0,∞ → Ras

atβrt,

btet2, b∞ 0.

3.6

Sincebis one-to-one, there exists a functionφ:0,1 → Rsuch thata φb. Indeed, one can takeφ ab−1. Thenβ

rxt φext

2

, that is,E φG, whereφis one-to-onesince it is the composition of two one-to-one functions. Then theV CdimensionsV CEandV CGof EandGare the same. SinceGis a subset of the family of functionsL {KGauss

x }considered in the proof of10, Corollary 5.2, where it was shown thatV CLd3, one has also

V CE V CGV CLd3. 3.7

The proof of the boundV CLd3 in10, Corollary 5.2can be slightly adapted to find a better upper bound directly onV CG. Here we report only the modifications that are required. By the same arguments of the proof of10, Corollary 5.2it is easy to show that, for every fixedx, each element of the family{−x−t2α}, whereαis a real parameter, can be expressed as a linear combination of thed2 functions

1, t1, . . . , td,t2. 3.8

Hence, by32, Theorem 1,V CGis at mostd2.

Combining10, Theorem 4.5with the proof technique ofProposition 3.2, we get the following estimate. Recall that a function ψ with bounded variation can be written as the difference of two decreasing functionsψ1andψ233, Theorem 4, page 331. In order to state the result we require in addition thatψ1andψ2are bounded.

Proposition 3.3. Let d be a positive integer, X ⊂ Rd a compact domain, K : Rd ×Rd R

continuous, λ ∈ L1X, and let f : X → R have the representation fx

XKxtλtdt, whereKxt ψx−tandψzhas bounded variation. Suppose that there existτ1, τ2 ≥ 0and

a decomposition of ψz as ψz ψ1z−ψ2z such thatψ1zand ψ2zare decreasing and suptR|ψ1z| ≤ τ1,supt∈R|ψ2z| ≤τ2. Then there exist two absolute positive constantsC1andC2

such that for every positive integern, there existt1, . . . ,tnXandc1, . . . , cn∈ {−1,1}for which

sup xX

fxλ1

n

n

i 1

ciK

x,ti

λ1

C1

d2C2

τ1τ2

1

(6)

Proof. SettingK1,xtψ1x−tandK2,xtψ2x−t, we get

fx

X

Kxtλtdt

X

K1,xtλtdt

X

K2,xtλtdt. 3.10

Since the functionsψ1andψ2are decreasing, exploiting similar arguments as in the proof of Proposition 3.2we conclude that the upper boundd2 holds for theV Cdimensions of the families of functions{K1,x}and{K2,x}, too. Then the statement follows by10, Theorem 4.3 and Lemma 4.4and the subadditivity property of the Rademacher complexity34, Theorem 12, point 7.

4. Bounds for the approximation of vector-valued functions

In10, the following approximation error bound was obtained in terms of the Rademacher complexity for some quite general families of functions. It improves, at least asymptotically, the bound derived in4for the same families of functions.

Theorem 4.1see10. LetX ⊂Rdbe a compact domain,K:X×X Rcontinuous, andτ >0

such that, for allxandt, one has|Kx,t| ≤τ. Letλ∈ L1X, letfbe a real-valued function onX

having the representationfx XKx,tλtdt, andRnthe Rademacher complexity of the family

{Kx,·}. There exists an absolute positive constantCsuch that for every positive integernthere exist

t1, . . . ,tnXandc1, . . . , cn∈ {−1,1}for which

sup xX

fxλ1,X

n

n

i 1

ciKx,ti

1,X

R

1

n. 4.1

Note that the approximation scheme in4.1requiresnbinary parameters andndreal parameters.

Remark 4.2. Inspection of the proof of Theorem 4.1 shows that if λt is a nonnegative function, thenci 1,i 1, . . . , n. Thus, ifKx,tis convexor concave, bothfxand its

sparse approximationλ1/n

n

i 1ciKx,tiare convexconcave, resp., too. Preservation of

convexityor concavityis an interesting property of approximators, sometimes exploited in applications35.

We first derive for vector-valued functions an immediate corollary ofTheorem 4.1. Then, we turn our attention to the simultaneous approximation of all the components of a vector-valued function, which allows one to exploit similarities that may hold among them.

Corollary 4.3. LetX ⊂ Rdbe a compact domain,K : X×X Rcontinuous,τ > 0such that,

for allxandt, one has|Kx,t| ≤τ, andRnthe Rademacher complexity of the family{Kx,·}. Let

f:X → Rkbe such that each of its components has the representationf mx

XKx,tλmtdt, whereλm∈ L1X,m 1, . . . , k. Then for every >0it is possible to approximate in the supremum

norm onXeach componentfmoffwith an error at mostby using approximations of the form

fmx

λm1,X

nm nm

i 1

cm,iK

x,tm,i

(7)

where, fori 1, . . . , nmand m 1, . . . , k,tm,iX and cm,i ∈ {−1,1}, provided that for every

m 1, . . . , kone has

nm

C2R

nmτ 2

λm21,X

2 , 4.3

whereCis an absolute positive constant.

Proof. It follows immediately by Theorem 4.1 applied to each of the k components of f, choosing each nm such that the correspondent approximation error is upper bounded

by.

To approximate all thek components off, the approximation scheme4.2requires n km 1nmbinary parameters, andndreal parameters.

Let us now consider the case where similarities among the k components of fare present. We would like to exploit them to obtain a desired approximation accuracy with a number of parameters smaller than the one given by 4.2, by letting some parameters be shared among thekapproximators. Assume that, in the expressionfmx

XKx,tλmtdt,

m 1, . . . , k, for the components off, eachλm∈ L1Xcan be written as

λmt λmtλt, 4.4

whereλm are measurable functions taking only the values−1 and1 and 0 ∈ L1Xis a nonnegative function there is no loss of generality in assuming that λ is nonnegative. Setting

Kmx,tKx,tλmt, 4.5

one has

fmx

X

Kmx,tλtdt, 4.6

which can be written as

fmx

λ1,X

X

Kmx,t λ

t

λ1,X

dt. 4.7

As noted in4, for everyxXthe representation4.7can be regarded as the expected value of the random variable sgnλtKmx,twith respect to the probability density|λt|1,X.

(8)

Exploiting the particular nature of the functionsλmt, which can take only the values

−1 and1, the following proposition shows that the Rademacher complexities of the families

{Kmx,·}and{Kx,·}are the same.

Proposition 4.4. Let Rn and Rm,n be the Rademacher complexities of the families {Kx,·} and

{Kmx,·}, respectively. Then for everym 1, . . . , k,Rm,n Rn.

Proof. It follows directly from the definition of the Rademacher complexity, as eachλmtcan

take only the values−1 and1 and the Rademacher random variablesiare independent and

symmetrically distributed around 0.

To derive an extension ofTheorem 4.1to families of vector-valued functions, we will exploit the following estimate from10. Recall that for a familyF {Fx}of functions onX and a probability distributionPX onX, theexpected riskassociate with a functionFxF is defined as

RFx

X

FxtdPXt. 4.8

So,RFx EPX{Fxt}, whereEPXis the expectation operator. Theempirical riskassociate with the functionFxtFand the sequence{ti}of samples is defined as

Remp

Fx,ti

1

n

n

i 1

Fxti. 4.9

Theorem 4.5see10. LetPX be a probability distribution onX ⊆ Rd,{ti}aPX-i.i.d. sequence ofnpoints in X,τ > 0, andF a family ofτ, τ-valued continuous functions with Rademacher complexityRn. Then there exists an absolute constantCsuch that for all0< δ <1, with probability at least1−δ,one has

sup

FxF

R

FxRemp

Fx,ti≤2τ C

1 nmax

R

2τ 2

,ln1 δ

. 4.10

In the next theoremTheorem 4.7, we will improve the estimate fromTheorem 4.1

for approximation of vector-valued functions fork 1 and in the limit δ → 0 one gets

Theorem 4.1. To derive such an extension, we will need the density result stated in the following lemma. ForX ⊂ Rd compact,K : X ×X Rcontinuous, and λ ∈ L

1X, we let

f:X−→R|fx

X

Kx,tλboundtλtdt, λbound∞,X≤1

,

g:X−→R|gx

X

Kx,tλbintλtdt, λbin:X−→ {±1}measurable

.

4.11

Lemma 4.6. LetX⊂Rdbe compact,K:X×X Rcontinuous, andλ∈ L

(9)

Proof. Given an element of, sayfboundx

XKx,tλboundtλtdt, and an arbitrary >0,

let us find an element of, saygbinx

XKx,tλbintλtdt, such that supxX|fboundx

gbinx|< .

AsKx,tis uniformly continuous onX ×X, for everyη > 0 it is possible to find a partition{Pi:i 1, . . . , Nη}ofXsuch that for everyi, j∈ {1, . . . , Nη}one has

Kmaxi,jKmini,jη, 4.12

whereKmaxi,jmaxx,tPi×PjKx,tandKmini,jminx,tPi×PjKx,t. LetxPi. As

Pi

Kx,tλboundtλtdt

Pj

Kx,tKmini,jλboundtλtdt

Kmini,j

Pj

λboundtλbint

λtdt

Pj

Kmini,jKx,tλbintλtdt

Pj

Kx,tλbintλtdt,

4.13

we get

sup xPi

Pj

Kx,tλboundtλtdt

Pj

Kx,tλbintλtdt

≤sup

xPi

Pj

Kx,tKmini,jλboundtλtdt

sup

xPi Kmini,j

Pj

λboundtλbintλtdt

sup

xPi

Pj

Kmini,jKx,tλbintλtdt

.

4.14

As to the first and third terms in the right-hand side of4.14, by H ¨older’s inequality and4.12we get

sup xPi

Pj

Kx,tKmini,jλboundtλtdt

≤sup

xPi

Kx,·−Kmini,j,Pjλboundλ1,Pj

ηλ1,Pj,

sup xPi

Pj

Kmini,jKx,tλbintλtdt

≤sup

xPi

Kx,·−Kmini,j,Pjλbinλ1,Pj

ηλ1,Pj.

4.15

Now we consider the second term in the right-hand side of4.14. LetcP

boundtλtdt and dividePjinto two setsAjandBj, such that

(10)

by33, Theorem 6, page 300, asλbound∞ ≤ 1 impliescλ1,Pj. Choosingλbinsuch that

λbint sgnλt, for alltAjandλbint −sgnλt, for alltBj, simple computations

give

Pj

λboundtλbintλtdt 0. 4.16

Summing up, takingη /2 for every >0 we get

sup xX

fboundxgbinx

sup xX

Pj Pj

Kx,tλboundtλtdt

Pj

Kx,tλbintλtdt

λ1.

4.17

The statement follows by letting → 0 and considering correspondingly refined partitions.

Now we can prove our main result.

Theorem 4.7. LetX ⊂Rdbe a compact domain,K:X×X Rcontinuous,τ >0such that, for

allxandt, one has|Kx,t| ≤ τ, andRnbe the Rademacher complexity of the family{Kx,·}. Let

0∈ L1Xbe a nonnegative function onXandf:X → Rksuch that each of its components has

the representationfmx

XKmx,tλtdt. Form 1, . . . , k, setKmx,t Kx,tλmt, where eachλmis a measurable function such thatλm,X≤1. Then there exists an absolute constantCsuch that for every positive integernthere existt1, . . . ,tnXandcm,1, . . . , cm,n∈ {−1,1},m 1, . . . , k, for which one has simultaneously for allm 1, . . . , k

sup xX

fmx

λ1,X

n

n

i 1

cm,iK

x,ti

≤2τ Cλ1,X

1 nmax

R

2τ 2

,lnk

. 4.18

Proof. For every xX, fmx/λ1,X

XKmx,tλt/λ1,Xdt can be considered as

the expected risk of the function sgnλ·Kmx,· with respect to the probability density

|λt|1,X. Combining Proposition 4.4, Theorem 4.5, and Lemma 4.6, for every positive

integerlandm 1, . . . , k, it is possible to find an approximationλlbin,mtofλmtsuch that

for everyδm > 0 and everyλt/λ1,X-i.i.d. sequence{ti}, we get with probability at least

1−δm

sup xX

fmx

λ1,X

1 n n i 1

λlbin,mtiK

x,ti

≤2τC

1 nmax

R

2τ 2

,ln 1 δm

1

l. 4.19

For simplicity and without any loss of generality, we takeδm δ <1/k, for allm 1, . . . , k.

By standard probability arguments we get with probability at least 1−

sup xX

fmx

λ1,X

1 n n i 1

λlbin,mtiK

x,ti

≤2τ C

1 nmax

R

2τ 2

,ln1 δ

1

(11)

simultaneously for all components off. Then, there exists a choicetl,δi ofti,i 1, . . . , n, for

which the above-written upper bound holds.

AsXis compact, by lettingl → ∞andδ → 1/kit is possible to extract a subsequence denoted—with a little abuse—by{tl,δi }, such that, for eachi 1, . . . , n,tl,δitiX. Similarly, since for everylone has a finite number of binary parameterscl,δm,i λl

bin,mt l,δ

i , one can also

extract a subsequence denoted—with a little abuse—by{cl,δm,i}, such that, for everyi 1, . . . , n andm 1, . . . , k, cl,δm,icm,i∈ {−1,1}asl → ∞andδ → 1/k. Then

sup xX

fmx

λ1,X

n

n

i 1

cm,iKx,ti≤sup xX

fmx

λ1,X

n

n

i 1

cl,δm,iKx,tl,δi

sup

xX

λ1,X

n

n

i 1

cl,δm,iKx,tl,δicm,iKx,ti.

4.21

AsKx,tis uniformly continuous onX×X, in the limit we get

sup xX

fmx

λ1,X

n

n

i 1

cm,iKx,ti≤2τCλ1,X

1 nmax

R

2τ 2

,lnk

. 4.22

Remark 4.8. Note that Theorem 4.7 does not require knowledge of the Rademacher complexities of the families{Kmx,·}, which might not be related in a simple way to that

of{Kx,·}differently fromProposition 4.4, where a very special structure ofλmthas been

used.

Remark 4.9. IfRnτ/2τ2 ≥ lnkin particular, this holds for the scalar casek 1, one simply gets

sup xX

fmx

λ1,X

n

n

i 1

cm,iK

x,ti

1,X

R

1 n,

which has the same form of the bound given inTheorem 4.1but holds simultaneously for all the components off.

Corollary 4.10. LetX⊂Rdbe a compact domain,K:X×X Rcontinuous,τ >0such that, for

allxandt, one has|Kx,t| ≤τ, and letRnbe the Rademacher complexity of the family{Kx,·}. Let

0 ∈ L1Xbe a nonnegative function onXandf:X → Rksuch that every component has the

representationfmx

XKmx,tλtdt. Form 1, . . . , k, setKmx,t Kx,tλmt, where each

λmis a measurable function such thatλm,X≤1. Then for each >0it is possible to approximate in the supremum norm onXeach componentfmoffwith an error at mostby using approximations of the form

fmx

λ1,X

n n i 1

cm,iK

x,ti

(12)

wheretiXandcm,i∈ {−1,1}, fori 1, . . . ,nandm 1, . . . , k, provided that

nC

2λ2 1,X

2 max

Rnτ

2

, 4τ2lnk, 4.24

whereCis an absolute constant.

Proof. It follows immediately by applying Theorem 4.7, by choosing n such that the approximation error for each of thekcomponents offis upper bounded by.

5. Comparison between the scalar and the vector approximation schemes

To approximate in the supremum norm onXall thekcomponents of a functionf:X⊂Rd

Rkwith an error at most >0, the vector approximation scheme4.23requires

ink binary parameters, iind real parameters,

wheren≥C2λ2

1,X/2max{R

2, 4τ2lnk}.

To the same end, the scalar approximation scheme4.2requires

inbinary parameters, iin d real parameters,

wheren km 1nmandnmC2Rnm τ

2λ

m21,X/2.

In contrast to4.2, in4.23the real parameterstiare the same for all the components.

In the following, we compare the lower bounds on the numbers of real parameters given by 4.2and4.23. To this end, we focus on a case in which the dependence of the Rademacher complexityRnonncan be estimated from above.

Recall that areproducing kernel Hilbert space (RKHS)is a Hilbert space Hformed by functions defined on a nonempty setX such that for everyxX the evaluation functional

Fx, defined for anyf∈ HasFxf fx, is bounded16,19. RKHSs can be characterized in terms ofkernels, which forX ⊆Rdaresymmetric positive-semidefinitefunctionsK:X×X R,

that is, symmetric functions such that for all positive integersm, allw1, . . . , wm∈Rm, and

allx1, . . . ,xmXmsatisfy the conditionmi,j 1wiwjKxi,xj≥0.A kernel is called positive-definiteif the previous inequality is strict for allw1, . . . , wm/ 0, . . . ,0. If the kernelKis also

continuous, then it is called aMercer kernel. For every kernelK :X×X → RandxX, we define the functionKx:X → Ras

Kx·Kx,·. 5.1

If there exists a functionκ:Rd Rsuch that the kernelKcan be written asKxt κxt,

(13)

representerofx, such thatFxf f, Kx,for allf ∈ Hthis property is called thereproducing

property. It is easy to check that the functionK :X×Xdefined for allx,yXasKx,y

x,y, where·,·denotes the ordinary scalar product inRdrestricted toX, is a kernel. On the

other hand, every kernelK:X×X → Rgenerates an RKHSHKXthat is the completion of

the linear span of the set{Kx :xX}, with the inner product defined asKx, KyK Kx,y; we denote by·Kthe induced normsee, e.g.,16and19, page 81.

For a compact domainX⊂Rd, a positive-definite, continuous kernelK:X×X R,

and a probability measureμonX, we define the integral operatorTK:L2μ → L2μas

TKfx

X

Kx,tftt. 5.2

The following proposition gives an upper bound on the Rademacher complexity of the family GK {Kx,·}associate with the kernelKdefiningTK. By TrTK,we denote thetraceofTK,

that is, the sum of its eigenvalues.

Proposition 5.1. LetX⊂Rdbe a compact domain,K:X×X Ra positive-definite Mercer kernel,

μa probability measure onX, andTK :L2μ → L2μthe integral operator defined by5.2. Let

HKXbe the RKHS associate with the kernelK, GK {Kx,·}, andsK supxXKx,·K

maxx∈X

Kx,x. If the largest eigenvaluel1TKofTKsatisfiesl1TK≥1/n, then

Rn

GK

sK

TrTK

.

Proof. LetBKbe the unit ball ofHKXcentered at 0. By the definition ofsK,GKsKBK.

Then, the statement follows by20, Theorem 15, parts 1 and 3and20, Theorem 16.

Let us assume that the required approximation error > 0 is sufficiently small, so that the assumptionl1TK ≥ 1/nofProposition 5.1is satisfied withnequal to eachnm in

Corollary 4.3and toninCorollary 4.10. Then, taking for simplicity of comparisonλm1,X

λ1,X which holds, e.g., when the functionsλmtake values in{±1}and using the upper

bound on the Rademacher complexity given inProposition 5.1, the lower bounds onnandn given byCorollary 4.3andCorollary 4.10become, respectively,

inkC2s

K

TrTK τ

2

λ2

1,X/2;

iin≥C2λ2

1,X/2max{sK

TrTK τ

2

,4τ2lnk}.

(14)

6. Application toN-stage optimization problems

Let us consider the solution, by the well-known dynamic programmingDPalgorithm14, of the followingfinite-horizon, discrete-time dynamic optimization problem, modeled as in37: givenx0∈X, findx1, . . . ,xNXsuch that

Jox0

sup

N−1

t 0

βthxt,xt1

βNhN

xN

,

wherext,xt1

D, t 0,1, . . . , N−1.

6.1

The vectorxtX ⊆ Rd represents the state of a dynamical system,Xis the set to which the

state vector belongsstate space,DX×Xis the graph of acorrespondencethat models the transition from one stage to the following one,h: D → Ris atransition reward,hN :X

Ris thefinal reward associate with the final stage, 0 < β ≤ 1 is a discount factor, and Jo is

the so-calledvalue function. Using an economic terminology,βthx

t,xt1andβNhNxNare

actualized values of transition and final rewards, respectively. For simplicity of notations and without loss of generality, we assume thatX,h, andDdo not depend on the timet.

Dynamic programming considers the followingNsubproblems:

JNoxN hNxN; 6.2

givenxiX, findxi1, . . . ,xNXsuch that

Jioxi

sup

N−1

t i

βtihxt,xt1

βNihN

xN

, i N−1, . . . ,0,

wherext,xt1

D, t i, . . . , N−1.

6.3

Under suitable conditions on the problem formulation 38, Theorem 4.1 can be exploited to find sparse approximations for both theith stage value functionsJo

i : X → R

and each component of the optimalith stage policy functionsgoi :XX,i N−1, . . . ,0. For the latter, since each component is dealt with separately, in general one expects that the parameterst1, . . . ,tnXandc1, . . . , cn ∈ {−1,1}for which the bound given inTheorem 4.1

holds will be different from one component to the other. In order to reduce the number of parameters needed to obtain a desired accuracy of approximation of the optimal policy functions, it is useful to have at one’s disposal an approximation error bound for which some of the parameters are common to the approximators of all the components of each optimal policy function. If similaritiescorrelationsamong the components of each optimal policy function to be approximated exist and can be modeled as inTheorem 4.7, the latter gives such an upper bound. In dynamic optimization problems, these similarities can arise in several waysalthough their analysis is problem-dependent.

For example, applying the dynamic programming algorithm at stageN−1, for each

xN−1∈Xone has to find

xo

N goN−1

xN−1

arg max xNX

hxN−1,xN

βhN

xN

(15)

Similarities among the components ofgoN1may be present, depending on the properties of hN−1andhN, which are given in the problem formulation.

Another case in which there can be similarities among the components of the optimal policy function may occur in the case of infinite-horizon, discrete-time dynamic optimization problemsmodeled as follows: givenx0∈Xand 0< β <1, findx1,x2, . . . ,Xsuch that

Jox0

sup

t 0

βthxt,xt1

,

wherext,xt1

D, t 0,1, . . . .

6.5

It is known that under suitable hypotheses an optimal policy has one or more stationary points, that is, there existsx0 ∈ X such that gox0 x0. In such a case, ifgo is Lipschitz continuous which holds under some conditions; see 37 and its Lipschitz constant is “sufficiently small,” then one hasgoxxon a “suciently large” neighborhood ofx

0.

Acknowledgment

The authors were partially supported by a PRIN grant from the Italian Ministry for University and Research, project “Models and Algorithms for Robust Network Optimization.”

References

1 V. K ˚urkov´a and M. Sanguineti, “Comparison of worst case errors in linear and neural network approximation,”IEEE Transactions on Information Theory, vol. 48, no. 1, pp. 264–275, 2002.

2 A. R. Barron, “Neural net approximation,” inProceedings of the 7th Yale Workshop on Adaptive and Learning Systems, K. Narendra, Ed., pp. 69–72, Yale University Press, New Haven, Conn, USA, May 1992.

3 A. R. Barron, “Universal approximation bounds for superpositions of a sigmoidal function,”IEEE Transactions on Information Theory, vol. 39, no. 3, pp. 930–945, 1993.

4 F. Girosi, “Approximation error bounds that use VC-bounds,” inProceedings of the International Conference on Artificial Neural Networks (ICANN ’95), pp. 295–302, Paris, France, October 1995.

5 Y. Makovoz, “Uniform approximation by neural networks,”Journal of Approximation Theory, vol. 95, no. 2, pp. 215–228, 1998.

6 V. E. Maiorov, “On best approximation by ridge functions,”Journal of Approximation Theory, vol. 99, no. 1, pp. 68–94, 1999.

7 A. Pinkus, “Approximation theory of the MLP model in neural networks,”Acta Numerica, vol. 8, no. 1, pp. 143–195, 1999.

8 M. A. Kon, L. A. Raphael, and D. A. Williams, “Extending Girosi’s approximation estimates for functions in Sobolev spaces via statistical learning theory,”Journal of Analysis and Applications, vol. 3, no. 2, pp. 67–90, 2005.

9 M. A. Kon and L. A. Raphael, “Approximating functions in reproducing kernel Hilbert spaces via statistical learning theory,” inWavelets and Splines: Athens 2005, G. Chen and M.-J. Lai, Eds., Modern Methods in Mathematics, pp. 271–286, Nashboro Press, Brentwood, Tenn, USA, 2006.

10 G. Gnecco and M. Sanguineti, “Approximation error bounds via Rademacher’s complexity,”Applied Mathematical Sciences, vol. 2, no. 1–4, pp. 153–176, 2008.

11 V. K ˚urkov´a and M. Sanguineti, “Geometric upper bounds on rates of variable-basis approximation,” IEEE Transactions on Information Theory, vol. 54, no. 12, pp. 5681–5688, 2008.

12 V. K ˚urkov´a and M. Sanguineti, “Learning with generalization capability by kernel methods of bounded complexity,”Journal of Complexity, vol. 21, no. 3, pp. 350–367, 2005.

13 V. K ˚urkov´a and M. Sanguineti, “Approximate minimization of the regularized expected error over kernel models,”Mathematics of Operations Research, vol. 33, no. 3, pp. 747–756, 2008.

(16)

15 V. N. Vapnik,The Nature of Statistical Learning Theory, Springer, New York, NY, USA, 1995.

16 N. Aronszajn, “Theory of reproducing kernels,”Transactions of the American Mathematical Society, vol. 68, pp. 337–404, 1950.

17 J. Buescu and A. C. Paix˜ao, “Inequalities for differentiable reproducing kernels and an application to positive integral operators,”Journal of Inequalities and Applications, vol. 2006, Article ID 53743, 9 pages, 2006.

18 F. Cucker and S. Smale, “On the mathematical foundations of learning,”Bulletin of the American Mathematical Society, vol. 39, no. 1, pp. 1–49, 2002.

19 C. Berg, J. P. R. Christensen, and P. Ressel,Harmonic Analysis on Semigroups: Theory of Positive Definite and Related Function, vol. 100 ofGraduate Texts in Mathematics, Springer, New York, NY, USA, 1984.

20 S. Mendelson, “A few notes on statistical learning theory,” inAdvanced Lectures on Machine Learning, S. Mendelson and A. Smola, Eds., vol. 2600 ofLecture Notes in Computer Science, pp. 1–40, Springer, Berlin, Germany, 2003.

21 G. Blanchard, O. Bousquet, and L. Zwald, “Statistical properties of kernel principal component analysis,”Machine Learning, vol. 66, no. 2-3, pp. 259–294, 2007.

22 S. Smale and D.-X. Zhou, “Shannon sampling and function reconstruction from point values,”Bulletin of the American Mathematical Society, vol. 41, no. 3, pp. 279–305, 2004.

23 S. Smale and D.-X. Zhou, “Learning theory estimates via integral operators and their approxima-tions,”Constructive Approximation, vol. 26, no. 2, pp. 153–172, 2007.

24 F. Cucker and D.-X. Zhou, Learning Theory: An Approximation Theory Viewpoint, Cambridge Monographs on Applied and Computational Mathematics, Cambridge University Press, Cambridge, UK, 2007.

25 D. P. Bertsekas,Dynamic Programming and Optimal Control. Vol. II, Athena Scientific, Belmont, Mass, USA, 2nd edition, 2001.

26 F. A. Mussa-Ivaldi, “From basis functions to basis fields: vector field approximation from sparse data,” Biological Cybernetics, vol. 67, no. 6, pp. 479–489, 1992.

27 F. A. Mussa-Ivaldi and F. Gandolfo, “Networks that approximate vector-valued mappings,” in Proceedings of IEEE International Conference on Neural Networks (ICNN ’93), vol. 3, pp. 1973–1978, San Francisco, Calif, USA, March-April 1993.

28 C. A. Micchelli and M. Pontil, “On learning vector-valued functions,”Neural Computation, vol. 17, no. 1, pp. 177–204, 2005.

29 A. Caponnetto, C. A. Micchelli, M. Pontil, and Y. Ying, “Universal multi-task kernels,”Journal of Machine Learning Research, vol. 9, pp. 1615–1646, 2008.

30 W. Rudin,Real and Complex Analysis, McGraw-Hill, New York, NY, USA, 3rd edition, 1987.

31 C. S. Ong, X. Mary, S. Canu, and A. J. Smola, “Learning with non positive kernels,” inProceedings of the 21st International Conference on Machine Learning (ICML ’04), pp. 639–646, Banff, Canada, July 2004.

32 E. Sontag, “VC dimension of neural networks,” inNeural Networks and Machine Learning, C. Bishop, Ed., pp. 69–95, Springer, Berlin, Germany, 1998.

33 A. N. Kolmogorov and S. V. Fom¯ın,Introductory Real Analysis, Dover, New York, NY, USA, 1975.

34 P. L. Bartlett and S. Mendelson, “Rademacher and Gaussian complexities: risk bounds and structural results,”Journal of Machine Learning Research, vol. 3, pp. 463–482, 2002.

35 O. L. Mangasarian, J. B. Rosen, and M. E. Thompson, “Convex kernel underestimation of functions with multiple local minima,”Computational Optimization and Applications, vol. 34, no. 1, pp. 35–45, 2006.

36 A. Friedman,Foundations of Modern Analysis, Dover, New York, NY, USA, 1982.

37 L. Montrucchio, “Lipschitz continuous policy functions for strongly concave optimization problems,” Journal of Mathematical Economics, vol. 16, no. 3, pp. 259–273, 1987.

References

Related documents

Ninety four patients, 14 male, 80 female, aged 18-73 years (mean age: 40.4±12.0) with DTC, who underwent radioiodine ablation after thyroidectomy and showed only

treatment ( Fig. The absence of mitochondrial damage in treated resistant cells specifically suggests that the mechanism of resistance in this model acts upstream of

Students with a GED certificate may be considered for admission provided they have a minimum ACT of 26 or SAT I (Critical Reasoning + Math) of 1180. Students who have completed

Parishes volunteer to take on the program, and Catholic Youth Services will work with them to support and guide the young person/people to undertake the work of a youth

Approaches to treatment that help patients with OCD include behavioral therapy (involving exposure to feared situations and the prevention of compul- sive behavior), cognitive

5 NIGHT TOUR OPTIONS - BARCELONA, SPAIN / ROME, ITALY / DUBLIN, IRELAND 6 NIGHTS LONDON (MILLENIUM BAILEYS). 4 NIGHTS LIVERPOOL (CROWNE PLAZA) - COACH TRANSFERS AS

EndroPurnomo (2015) supported the statement by stating that cutting tubers by ¼ parts has proven best effect on plant height, number of tillers, number of leaves, number of

Central to the development of these strategies was the concept of service user involvement, mental health promotion, establishment of community mental health teams and the