A Better Approximation for Balls
Gerald H. L. Cheang
Division of Mathematics,School of Science,National Institute of Education,
Nanyang Technological University,469,Bukit Timah Road,
Singapore259756
E-mail: cheanghlgnie.edu.sg
and Andrew R. Barron Department of Statistics,Yale University,
P.O.Box208290,New Haven,Connecticut06520-8290
E-mail: Andrew.Barronyale.edu
Communicated by Allan Pinkus
Received May 3, 1999; accepted in revised form October 22, 1999
Unexpectedly accurate and parsimonious approximations for balls in Rd and
related functions are given using half-spaces. Instead of a polytope (an intersection of half-spaces) which would require exponentially many half-spaces (of order (1
=)d)
to have a relative accuracy=, we useT=c(d2=2) pairs of indicators of half-spaces and threshold a linear combination of them. In neural network terminology, we are using a single hidden layer perceptron approximation to the indicator of a ball. A special role in the analysis is played by probabilistic methods and approximation of Gaussian functions. The result is then applied to functions that have variationVf
with respect to a class of ellipsoids. Two hidden layer feedforward sigmoidal neural nets are used to approximate such functions. The approximation error is shown to be bounded by a constant timesVfT112+VfdT124, whereT1 is the number of nodes in the outer layer andT2is the number of nodes in the inner layer of the approximationfT1,T2. 2000 Academic Press
1. INTRODUCTION
There already exists a rich literature on approximation of convex bodies with other sorts of convex bodies and polytopes. See, for example, Gruber [7], Fejes Toth [6]. Like other convex bodies, a ball is an infinite interesection of tangent half-spaces. For a unit ballBin Rd,
B= ,
a#Sd&1
[a}x1], (1)
doi:10.1006jath.1999.3441, available online at http:www.idealibrary.com on
183
0021-90450035.00
Copyright2000 by Academic Press All rights of reproduction in any form reserved.
File: 640J 344102 . By:SD . Date:28:03:00 . Time:10:51 LOP8M. V8.B. Page 01:01 Codes: 2074 Signs: 1284 . Length: 45 pic 0 pts, 190 mm
whereSd&1
is the unit sphere inRd. If we approximate it with the intersec-tion of T, Td+1, of the half-spaces in (1), then we are approximating the ball with a T-faced polytope PT.
There are results that bound the approximation error between convex bodies and their polytope approximators. Dudley [5] has shown that for each convex body B, there exists a constant csuch that for every Tthere is a polytopePT achieving
$H(B,P T)
c
T2(d&1), (2)
where $H is the Hausdorff metric. Results from Schneider and Wieacker [17] and Gruber and Kenderov [8] have shown that for a convex body with sufficiently smooth boundary such as the ball B, there exists a constant csuch that for every polytopePT,
$(B,PT)
c
T2(d&1), (3)
where$ can be either the Hausdorff or the Lebesgue measure of the sym-metric difference. Hence for an approximation error of=, we would require a polytope with many faces of order (1
=)
(d&1)2
, which is exponential in d. To avoid this curse of dimensionality, we will use T half-spaces in the approximation in a different manner.
To illustrate the idea, consider the set of points in at least k out of n
given half-spaces. For instance, if we were given theT=9 half-spaces deter-mining the polygon approximation in Fig. 1, k=9 yields the nonagon
File: 640J 344103 . By:SD . Date:28:03:00 . Time:10:52 LOP8M. V8.B. Page 01:01 Codes: 2174 Signs: 1470 . Length: 45 pic 0 pts, 190 mm
FIGURE 2
inscribed in the circle. In Fig. 2, we use T=9 half-spaces, but we set the threshold at k=8 to obtain the star-shaped approximation shown. In higher dimensions, our approximation will look somewhat like a jagged multi-faceted star-shaped object.
Here we can think of the T half-spaces as providing a test for mem-bership in the set. Instead of requiring allT tests to be passed, we permit membership with at least kpassed out ofT. An extension of this idea is to weigh each test and determine membership by a weighted count exceeding a threshold.
While polygon approximation may appear superior in the low-dimen-sional example given in the figure, in high dimensions, polytopes have extremely poor accuracy as shown in (3). In contrast we show that the use of a weighted count to determine membership in a set permits accuracy that avoids the curse of dimensionality. Indeed, with 2T=cd2=2indicators
of half-spaces, where cis a constant, we threshold a linear combination of them, in order to obtain accuracy =. Note that the number of indicators of half-spaces needed is only quadratic indand not exponential indas in the classical method.
Our approximation to a ball takes the form
N2T=
{
x#Rd: :2T
i=1
ci1[ai}xbi]k
=
.Let f2T=1N2T be the indicator (characteristic) function of this set. In
neural network terminology, we are using a two hidden layer perceptron approximation to the indicator of a ball. We show that there is a constant
csuch that for everyTandd, there is such an approximationN2Tthat the
Hausdorff distance between a ballBRof radiusR andN2T satisfies
$H(B
R,N2T)c}R
d(d+1)
T ,
where c is some real-valued constant. A special role in the analysis is played by probabilistic methods and approximation of Gaussian functions.
2. SOME BACKGROUND AND THE GAUSSIAN FUNCTION A single hidden-layer feedforward sigmoidal network is a family of real-valued functions fT(x) of the form
fT(x)=: T
i=1
ci,(ai}x+bi)+k, x#Rd (4)
parametrized by internal weight vectors ai in Rd, internal location
parameters bi in R, external weights ci and a constant term k (Cybenko
[4] and Haykin [9]). By a sigmoidal function, we mean any nondecreasing function on R with distinct finite limits at + and &. Such a network has dinputs,Thidden nodes and a linear output unit. It implements ridge-functions ,(ai}x+bi) on the nodes in the hidden layer. Here we will exclusively use the Heaviside function,(z)=1[z0], in which case (4) is a
linear combination of indicators of half spaces. Such a network is also called a perceptron network (Rosenblatt [15, 16]). Thresholding the out-put of a single hidden-layer neural net at level k1, we obtain fT(x)=
,(fT(x)&k1) which equals
fT(x)=,
\
: Ti=1
ci,(ai}x+bi)+k$
+
. (5)For simplicity in the notation, we will often omit the parametersai, bi, ci,
and kin the arguments of fT andfT.
To approximate a ball we first consider approximation of the Gaussian functionf(x)=exp(&|x|22) and then take level sets. A level set of a func-tionfat levelkis simply the set[x#Rd: f(x)k]. Using the fact that the
Gaussian is a positive definite function with Fourier transform (2?)&d2 exp(&|||22), so that f has a representation in the convex hull of sinusoids, it is known that f(x) can be expressed using the convex hull of indicators of half-spaces (see Barron [1, 2], Hornik et al. [11], Yukich et al. [19]). We take advantage of a similar representation here. We use | } | to denote the EuclideanL2norm.
Let BK be a ball of radiusK large enough that it would contain Band
N2T. As shown in Appendix A, on BK the Gaussian function satisfies
f(x)=
|
Rd|
|a|K &|a|K 1[a}x+b0]sin(b) exp(&|a|22) (2?)d2 db da+exp\
& K2 2+
. (6) Here exp(&K22) is the value of the Gaussian evaluated on the surface of the ball BK. As we will see later, we can arrange for the neural net level setN2T to be entirely contained in Band hence take K=1.
Decomposing the integral representation of finto positive and negative parts, we have f(x)&exp
\
&K 2 2+
=f1(x)&f2(x) (7) =|
Rd|
|a|K &|a|K 1[a}x+b0]sin+(b) exp(&|a|22) (2?)d2 db da &|
Rd|
|a|K &|a|K 1[a}x+b0]sin&(b) exp(&|a|22) (2?)d2 db da =&1|
1[a}x+b0]dV1&&2|
1[a}x+b0]dV2, (8) where V1 is the probability measure for (a,b) on Rd with density1[&|a|K<b< |a|K]sin+(b)(exp(&|a|22)(2?)d2&1) with normalizing constant
&1=
|
Rd|
|a|K &|a|K sin+(b)exp(&|a| 22) (2?)d2 db daand similarly forV2and&2(with sin&(b) in place of sin+(b) ). We denote the positive part of sin(b) by sin+(b) and the negative part of sin(b) by sin&(b). The total variation of the measure used to representf is
&=&1+&2
=
|
Rd|
|a|K &|a|K |sin(b)|exp(&|a| 22) (2?)d2 db da <|
Rd 2 |a|Kexp(&|a|22) (2?)d2 da (9) 2K-d. (10)An integral representation of the Gaussian as an expected value invites Monte Carlo approximation by a sample average. In particular, both f1(x) andf2(x) in (7) are expected values of indicators of half-spaces inRd. Thus a 2T-term neural net approximation tof(x) is then
f2T(x)&exp
\
&K 2 2+
= &1 T : T i=1 ,(ai}x+bi)&&2 T : 2T i=T+1 ,(ai}x+bi), (11) where the parameters (ai,bi)Ti=1 are drawn at random independently from the distribution V1 and (ai,bi)2T
i=T+1 from V2. The sampling scheme is simple. For example, to obtain an approximation for f1(x), first draw a
from a standard multi-variate normal distribution over Rd, then draw b
from [&|a|K, |a|K] with density proportional to sin+(b).
We now bound the L approximation error between f(x) and f2T(x).
We will draw on symmetrization techniques and the concept of Orlicz norms in empirical process theory (see, for example, Pollard [13]), and the theory of VapnikC8 ervonenkis classes of sets (Vapnik and C8ervonenkis [18]). With the particular choice of 9(x)=1
5exp(x
2) used by Pollard [13], the Orlicz norm of a random variable Zis defined by
&Z&9=inf
{
C>0 :Eexp\
Z2
C2
+
5=
.We examine the approximation error betweenf1(x) andf1,T(x), itsT-term
neural net approximation, first.
From empirical process theory, the following lemma is obtained. See Appendix B.
Lemma 1. Let !=(a,b) and g
x(!)=,(a}x+b).If h(x)= ,(a}x+b)
P(da,db) for x#BK for some probability measure P on (a,b), then there exist !1,!2, ...,!T such that
sup x#BK
}
1 T : T i=1 gx(!i)&h(x)}
34 d+1 T . (12)Recall that for the approximation of the Gaussian function, the approximation f2T less exp(&K22) can be split up into two parts,
approximate the positive and negative parts f1 and f2 respectively. Using Lemma 1, we see that
sup x#BK
}
&1 T : T i=1 gx(!i)&f1(x)}
34&1 d+1 T (13) and similarly, sup x#BK}
&2 T : 2T i=T+1 gx(!i)&f2(x)}
34&2 d+1 T . (14)Hence by the triangle inequality, sup
x#BK
|f2T(x)&f(x)| 34(&1+&2)
d+1 T =34&
d+1 T 68Kd(d+1) T . (15)3. BOUNDING THE HAUSDORFF DISTANCE OF THE APPROXIMATION
The Hausdorff distance between two sets Fand Gis defined as
$H(F,G)=max[sup x#F inf y#G |x&y|, sup y#G inf x#F |x&y|].
The norm | } | is the usual Euclidean norm inRd. We bound the Hausdorff
distance between the ball and its approximating set $H(B,N
2T) in this
section. The ball is assumed to be centered at the origin. However, we apply the result later to other balls and ellipsoids that are not necessarily centered at the origin. Note that the unit ballBinRdmay be represented as
B=
{
x: exp\
&|x| 2 2+
exp\
& 1 2+=
. We defineN2Tas N2T=[x: f2T(x)exp(&12)+K=T].Let f(x)=exp(&|x|22) and let f
2T(x) be the approximation with T
pairs of indicators. Here
=T:=68
d(d+1)T ,
for which we have theLerror between the Gaussian and its approximant
bounded above by
sup
x#BK
|f2T(x)&f(x)| =T. (16)
We are going to bound the Hausdorff distance between B andN2T, using
this sup norm bound on the error between the functions f and f2T which yieldB andN2Tas level sets.
Theorem 1. Let BR be a ball of radius R in Rd centered at the origin,
and let N2T be the level set of the neural net approximation.For sufficiently
large T, such that =T12(exp(& 1 4)&exp(& 1 2)), $H(B R,N2T)318R
d(d+1) T .Proof. The ballB coincides with the level set off at the level exp(&1 2). Let T be such that =T is less than 21Kexp(&
1
2). Choose r0 such that exp(&r2
02)=exp(& 1
2)+2K=T. LetBr0be the ball of radius r0 centered at
the origin. If x#N2T, then exp(&12)f2T(x)&K=Texp(&|x|22) which
implies that x#B. Similarly ifx#Br0, then exp(&
1
2)+K=Texp(&|x|22)
&K=Tf2T(x), which implies thatx#N2T. Thus
Br
0/N2T/B.
BothB and its approximating setN2T are sandwiched betweenBr0andB.
Consequently
$H(B,N
2T)1&r0.
The function g(r)=exp(&r22) has derivative &rg(r) of magnitude which is largest at r=1. Now
r0=-2 log(1(e&12+2K=T))
=-1&2 log(1+2K=Te12),
which is close to 1. If T is large enough that =T is less than 21K
(exp(&1
4)&exp(&12)), thenr0g(r0)(1-2) exp(&12), and hence using the mean-value theorem
$H(B,N 2T)1&r0 g(r0)&g(1) r0g(r0) 2-2e K=T 136-2e K
d(d+1) T . (17)Now we set K. From Section 2, BK need only be large enough to cover
both the unit ball B and its approximation set N2T, which we have
arranged to be contained inB. Thus we can takeBKto beB, whenceK=1.
Again when=T12(exp(& 1 4)&exp(& 1 2)), we have $H(B,N 2T)136-2e
d(d+1) T 318 d(d+1) T . (18)For a ball BR of radius R, the Hausdorff distance between it and its approximation set is simply 318R-d(d+1)
T . This concludes the proof of
Theorem 1. K
4. AN L1BOUND
Let BR be a ball of radiusR, N2T the level set induced by the
approxi-mation as explained in Section 1, + is the Lebesgue measure, and$ is the Hausdorff distance between BR and N2T as obtained above. Since the
symmetric differenceBRqN2Tis included in the shellBR+$"BR&$, one has
|
|1BR&1N2T| +(dx) +(BR)= +(BRqN2T) +(BR) +(BR)&+(BR&$) +(BR) 1&\
1&$ R+
d d $ R 318dd(d+1) T , (19)Theorem 2. The relative Lebesgue measure of the symmetric difference
+(BRqN2T)+(BR) between BR and its approximation set N2T is bounded
above by +(BRqN2T) +(BR) 318d
d(d+1) T . 5. ELLIPSOID APPROXIMATIONConsider an ellipsoid E=[x:x$Mx1] centered at the origin with
M=A$Astrictly positive definite with ad_dpositive definite square rootA. Equivalently E=[x: exp(&x$A$Ax2)exp(&12)] is the level set of a Gaussian surface. In a similar manner to the ball, it can also be accurately and parsimoniously approximated by a single hidden-layer neural net. Let the eigenvalues of A be r1r2 } } } rd with the corresponding
eigen-vectors [r1,r2, ...,rd]. If the approximating set for the unit ball takes the form [x:2T
i=1ci1[ai}xbi]k], then the one for the ellipsoid Eis
E2T=
{
x: :2T
i=1
ci1[ai}Axbi]k
=
.We are interested in bounding $H(E,E
2T), the Hausdorff distance between
the ellipsoid and its approximating set.
Theorem 3. The Hausdorff distance between the ellipsoid E and its
approximating set E2Tis bounded above by
318rd
d(d+1)
T . (20)
Proof. The matrix transformation A transforms the unit ball to an ellipsoid by stretching the unit radius to lengthriin theridirection and the
approximating set NT is similarly stretched in the same way to E2T. For
the ballBr0(as defined in the proof of Theorem 1), the matrix
transforma-tionAtransforms it to an ellipsoidE$ by stretching its radius to lengthrir0 in theridirection. Thus the order of inclusivity is still preserved after the
transformation and
E$/E2T/E.
Note that the ellipsoids E and E$ are similar, centered at the origin and aligned along the same axes. The only difference is in the scale.
The two extreme parts of the ellipsoids are along the directionr1andrd.
The ellipsoid is contained in a ball of radius rd. Thus $H(E,E2T) is
bounded above by the greatest distance betweenEandE$, and this occurs along the direction of rd, and hence is bounded above by the Hausdorff
distance between that of a ball of radius rd (containing the ellipsoid) and
a ball of radius rdr0, and that is in turn bounded above by 318rd
d(d+1)
T . K
The error is the same as for approximation of a ball except that the radius of the ball is replaced by the maximal eigenvalue (length of major axis).
Now consider an ellipsoid E with axial lengths r1 } } } rd&1rd=R
and its approximating set E2T. The ellipsoid E$=(1&$
R)E is a scaled
down version ofEand it has axial lengthsr1(1&$
R) } } } rd&1(1&
$ R)
rd(1&R$)=R&$. Recall that the approximation set E2T is obtained by
scalingN2T(the approximation set for the unit ball) by a factor ofrialong
theith axis of the ellipsoid E. The Hausdorff distance betweenE andE2T
is$ which is bounded by 318R-d(d+1)
T from Theorem 3.
Corollary 1. The measure of the symmetric difference +(EqE2T)
between E and its approximation set E2Tis bounded above by
+(EqE2T)318+(E)d
d(d+1)T .
Proof. Since the difference EqE2T is included in the shell E"E$, we
obtain
|
|1E&1E2T|+(dx)=+(E)&+(E2T) +(E)&+(E$) =+(E)&\
1&$ R+
d +(E) =+(E)\
1&\
1&$ R+
d+
+(E)d $ R 318+(E)dd(d+1) T . K (21)6. REMARKS
In earlier work of one of the authors (Cheang [3]), an L2 approxima-tion bound of order T&16 was obtained with T pairs of half-spaces in a neural net with a ramp sigmoid applied to the output. Here our order
T&12 bound gives an improved rate.
The integral representation to the Gaussian on BKmay also be written
exp
\
&|x| 2 2+
=|
Rd|
|a|K &|a|K 1[sgn(b)a}x+bsgn(b)0]|sin(b)| exp(&|a|22) (2?)d2 db da &1 2|
Rd|
|a|K &|a|Ksin&(b)exp(&|a| 22) (2?)d2 db da +exp
\
&K2
2
+
. (22)Sampling from the distribution V proportional to |sin(b)| exp(&|a2|2), the approximation to the ball takes the form
NT=
{
x#Rd: : Ti=1
1[ai}xbi]k
=
,that is,xis inNTif it is in at leastkof the half-spaces. This approximation
achieves
$H(B,N
T)318
d(d+1)
T . (23)
In particular, when 2Tsigmoids are used in the approximation,
$H(B,N
2T)318
d(d+1) 2T
when the representation (22) is used, reducing the constant by a factor of 1-2 from the bound in Theorem 1.
It may be possible to extend our results to neural network approxima-tion of other classes of closed convex sets with smooth boundaries, for example, to classes of sets of the form D=[x#Rd: f(x) }f(x)1], where
f:RdÄRdhas a strictly positive definite derivative. If this is achieved, the
results pertaining to functions which have total finite variation with respect to a class of ellipsoids (in the following section) could be extended to those for a class of convex sets with some suitable smoothness properties.
7. APPROXIMATION BOUNDS FOR TWO LAYER NETS The second (outer) layer of a two layer net takes a linear combination of level sets Hof functions represented by linear combinations on the first (inner) layer. The class of sets represented by level sets of combinations of first layer nodes include half-spaces and rectangles, and (as we have seen) approximations to ellipsoids.
A function fis said to have variationVf,Hwith respect to a class of sets
H if Vf,His the infimum of numbers V such that fVis in the closure of the convex hull of signed indicators of sets inH, where the closure is taken in L2(PX). A special case of finite variation is the case we call total
varia-tion with respect to a class of sets. Suppose f(x) defined over a bounded region S inRd. We say thatf has total variationV with respect to a class
of sets H=[H!:!#5] if there exist some signed measure v over the
measurable space 5 and
f(x)=
|
5
1H!(x)v(d!) for x#S, (24)
and ifvhas finite total variationV. The setsH!are parametrized by!in5.
In our context, the H! are half spaces in Rd where the ! consist of the
location and orientation parameters. In the event that the representation (24) is not unique, we take the measure v that yields the smallest total variationV.
The function class FV,Hof functions with variationVf,Hbounded byV
arises naturally when thinking of the functions obtained by linear combina-tions on a layer of a network where the sum of absolute values of the coefficients of linear combination are bounded byVand the level sets from the preceding layer yield the sets inH. In our analysis of the two layer case we will take advantage of both L approximations bounds (used to yield
approximations to the indicators of ellipsoids in the inner layer) and L2
approximation bounds for convex hulls of indicators of ellipsoids (essen-tially achieved by the outer layer of the network). First we state a simple
L2 approximation bound, which is a counterpart to the L bound of
Lemma 1, but with smaller constants (and without requirement of integral representation).
Lemma 2. If f has variation V
f with respect to a class of setsH then for
each T there exists H1, ...,HTand c1, ...,cTwith T
i=1|ci| Vfsuch that the
approximation fT(x)=T
i=1ci1Hi(x) achieves
&f&fT&2
Vf
This lemma as a tool of approximation theory and two probabilistic proofs (one based on probabilistic sampling and one based on a greedy algorithm) are in Barron [2] and (with a somewhat larger constant) an earlier form of the greedy algorithm proof of the approximation result is in Jones [12]. The probabilistic sampling bound on L2 norms of averages used in the proof is classical Hilbert theory.
Proof. The proof is based on the Monte Carlo sampling idea as in Section 2. First fix T and suppose that f is not identically constant. (Equality occurs in (25) only if f is identically constant.) Since f is in the closure of the convex hull ofG=[\Vf1H:H#H], one takes af that is
a (potentially very large) finite convex combination with &f&f&2<$. In particular we take$==-Tand=small, say=<Vf&-V2
f&&f&24, which
is less than &2f&.
By the triangle inequality,
&f&fT&2&f&f&2+&f&fT&2
=
-T+&f&fT&2. (26)
Suppose f=i pigi with gi in G, and pi>0 with i pi=1. Since f is an
expectation, we apply the Monte Carlo sampling technique. Draw indices
i1, ...,iTindependently according to the distributionpiin the representation
off and letfT=T1 T
j=1gij. Then
Ei&f&fT&2
2=
Ei&gi&2&&f&2
T
V
2
f&&f&
2
T
V
2
f&&f&
24
T , (27)
and so there exists a choice of such anfTwith
&f&fT&2 2<
V2
f&&f&
24
T .
That is,
&f&fT&2<
-V2
f&&f&
24
-T . (28)
As a consequence of the lemma above, we have the following corollary involving approximation with a class of ellipsoids. Let !be the parameters that define the ellipsoids, and 1E!(x) the indicator of the ellipsoid.
Corollary 3. If f has variation Vf=Vf,Ewith respect to the classEof
ellipsoids then there is a choice of ellipsoids E1, ...,ET and s1, ...,sT1#
[&1, +1], and ci=Vf siT1such that
fT 1(x)=: T1 i=1 ci1Ei (29) satisfies
&fT1&f&2
Vf -T1
. (30)
The indicators of ellipsoids have two layer sigmoidal network approxi-mations consisting of a single outer node and a single hidden inner layer. These approximations to 1Eimay be substituted into the approximation in
(29) to yield a two hidden layer approximation to f.
Let E=[E!:!#5]be the set of ellipsoids with+(E!)+(S) where+is
the Lebesgue measure. Let PXbe the uniform probability measure overS, and let E2T
2 be the neural net level set with 2T2 sigmoids that is used to
approximateE. Using the bound in Corollary 1, for eachE#E,
|
S|1E(x)&1E2T 2(x)| 2P X(dx)= +((E&E2T2)&S) +(S) +(E) +(S)318d d(d+1) 2 318dd(d+1) T2 . (31)After replacing the indicators of the ellipsoids in (29) with their neural net approximations, we obtain
fT 1, 2T2=: T1 i=1 ci,
\
: T2 j=1 |ij,(aij}x&bij)&di+
. (32) The following theorem bounds the mean-squared approximation error. An ellipsoid inEis denoted by E.Theorem 4. If f has variation Vf with respect to the class of ellipsoidsE,
there exist a choice of parameters(aij,bij,ci,di,|ij)such that a two hidden
layer net with step activation function achieves approximation error bounded by
&f&fT1, 2T2&2
Vf -T1+Vf
\
318d d(d+1) T2+
12 , (33) and &f&fT 1, 2T2&1 Vf -T1 +Vf318dd(d+1) T2 ,where&}&p denotes the Lp(PX) norm; provided that T2 is large enough that 68-d(d+1)T212(exp(&
1
4)&exp(& 1 2)).
Proof. By the triangle inequality,
&f&fT
1, 2T2&2&f&fT1&2+&fT1&fT1, 2T2&2. (34)
Now
&f&fT1&2
Vf
-T1
from Corollary 1. The other term on the right hand side of (34) is bounded as follows. Let Ei be the neural net level set of the approximation to Ei
from Section 5. Then
&fT1&fT1, 2T2&2=
"
: T1 i=1 ci(1Ei&1Ei)"
2 1 T1 : T1 i=1|ci|&1Ei&1Ei&2
Vf
\
318dd(d+1)
T2
+
12, (35)
where (31) bounds the last inequality (35). K
The proof of the L1 bound is similar (using &f&fT1&1&f&fT1&2
Vf-T1) except that the square root in (35) is not used in bounding
&1Ei&1Ei&1.
We conclude with two examples of functions with variation with respect to a class of balls (ellipsoids).
Example 1. Convex Combination of Balls. Let B(a,b) denote a ball centered at a with radius b. InR3, the function
f(x)=4? 3&-2?(x 2 1+x 2 2+x 2 3) 12+ ? 3-2(x 2 1+x 2 2+x 2 3) 32 =
|
1B(%, 1)(x) 1B(0, 1)(%)d% (36)is a convex combination of indicators of balls. Thus
fT 1(x)= 4? 3T1 : T1 i=1 1B(% i, 1)(x) (37)
is an approximation tof(x) where the %i's are sampled from the uniform
distribution in a unit ball. We then approximate each ball 1B(%i, 1)(x) with
the form (5).
Example 2. A Radial Function Let +2,
1 2( |x| &++2), +&2< |x| + f(x)=
{
1 2(++2& |x| ), +< |x| ++2 (38) 0, otherwise. Then f(x)=|
R1(%&1,%+1)( |x| )211[&1, 1](%&+)d% (39) and thus f(x) can be approximated by
fT 1(x)= 1 T1 : T1 i=1 [1B(0,%i+1)(x)&1B(0,%i&1)(x)], (40)
where %itiid Uniform (+&1,++1).
APPENDIX A
Starting with the right hand side of (6) and recalling that |a}x| |a|K
|
Rd|
|a|K &|a|K 1[a}x+b0]sin(b) exp(&|a|22) (2?)d2 db da (41) = &Im|
Rd|
|a|K &|a|K 1[a}x+b0]exp(&ib) exp(&|a|22) (2?)d2 db da = &Im|
Rd_
|
a}x+ |a|K a}x& |a|K 1[s0]exp(&is)ds&
exp(ia}x) exp(&|a|22) (2?)d2 da = &Im|
Rd_
|
a}x+ |a|K 0exp(&is)ds
&
exp(ia}x) exp(&|a|22)
(2?)d2 da
=Imi
|
Rd
[1&exp(&ia}x) exp(&i|a|K)]exp(ia}x) exp(&|a| 22) (2?)d2 da =exp
\
&|x| 2 2+
&|
Rd exp(&i|a|K) exp(&|a|22) (2?)d2 da (42) =f(x)&exp\
&K 2 2+
. (43) In (41), we did a substitutions=a}x+b. APPENDIX BWe prove a more general version of Lemma 1. Let a parameterized class of setsH=[H!:!#5]inRdbe given where5is a measurable space. Let
H =[Hx:x#Rd] with Hx=[!:x#H!] be the dual class of sets in 5
parametrized by x.
First we define some terms that will be used in the lemma. Let G be a class of functions mapping fromXtoRand letx1, ...,xN#X. We say that
x1, ...,xN are shattered by G if there exists r#RN such that for each
b=(b1, ...,bN) #[0, 1]N, there is ang#Gsuch that for eachi,
g(xi)
{
ri <riif bi=1
if bi=0.
The pseudo-dimension is defined as
dimP(G)=max[N:_x1, ...,xN,Gshatters x1, ...,xN] (44)
if such a maximum exists, andotherwise. For the class of unit step func-tions,(a}x+b), the pseudo-dimension and the VC-dimensionDcoincide and isd+1. The=-packing numberDT(=,Lp) for a subset of a metric space
is defined as the largest numbermfor which there exist pointst1, ...,tm in
the subset of the metric space withdp(ti,tj)>=fori{j, wheredpis theLp
metric.
Lemma 4. IfH has VC-dimension D and if h is a function in the convex
hull of the indicators of sets inH which possesses an integral representation
h(x)=
|
1H!(x)P(d!) for x#BK,then there is a choice of !1,!2, ...,!T such that the approximation hT(x)=
1 T T i=11H !i(x) satisfies sup x#BK |hT(x)&h(x)| 34
D T. (45)Remark. Such a uniform approximation bound holds over any subset of Rdin which the integral representation holds. In our application we use
BK, the ball of radius K.
Proof. Let gx(!)=1Hx(!)=1H!(x) and let _i be independent random
variables taking the values \1 with probablity1
2. Define!
=(!1,!2, ...,!T), where the!i are independently and identically distributed with respect to
P( } ), and _
=(_1,_2, ...,_T). By symmetrization, using Jensen's inequality as in Pollard [13, p. 7], for C>0, we have
E9
\
supx#BK| T i=1 gx(!i)&Th(x)| C+
E! E_ 9\
2 supx#BK| T i=1_igx(!i)| C+
. (46)Conditioning on !, we need to find an upper bound to E_
9(2 supx#BK |T
i=1_igx(!i)| ). This involves bounding the Orlicz norm &2 supx#BK
|T
i=1_igx(!i)|&9with !
fixed. Using a result in Pollard [13, pp. 3537],
"
2 sup x#BK}
: T i=1 _igx(!i)}"
9 18-T|
1 0 -logDT(=,L2)d=, (47)where DT(=,L2) is the L2 =-packing number for H , where the L2 norm
on 5 is taken with respect to the empirical probability measure on
From Pollard [13, p. 14], DT(=,L2)
\
3 =+
D , (48)uniformly over all !1,!2, ...,!T. We now work out an upper bound to
1
0-logDT(=,L2)d=. From the CauchySchwartz inequality,
|
01-logDT(=,L2)d=|
1 0 logDT(=,L2)d= =Dlog 3&D|
1 0 log= d= -(1+log 3)D. (49)Substituting (49) into (47), we see that
"
2 sup x#BK}
: n i=1 _igx(!i)}"
9 18-(1+log 3)TD. (50)From the definition of the Orlicz norm, the choice of C0=
18-(1+log 3)TD ensures that
E_9
\
2 supx#BK| T i=1_igx(!i)| C0+
1, and hence, E9\
supx#BK| T i=1gx(!i)&Th(x)| C0+
1. (51)Thus we conclude that there exists!1,!2, ...,!T such that
9
\
supx#BK| T i=1gx(!i)&Th(x)| C0+
1, whence sup x#BK}
1 T : T i=1 gx(!i)&h(x)}
18-D(1+log 3) log 5 -T 34D T. K (52)In our case, !=(a,b) andgx(!)=1H!(x)=1[a}x+b0]. The dual class of
sets in 5 are Hx=[!:gx(!)=1]=[(a,b) :a}x+b0]. Since (a,b) #
Rd_R, which is a vector space of dimension d+1, the class of sets H =
[Hx:x#Rd] has VC-dimension D=d+1 (Pollard [12, p. 20], Haussler
[10]). Thus Lemma 1.
REFERENCES
1. A. R. Barron, Neural net approximation,in``Proc. of the 7th Yale Workshop on Adaptive and Learning Systems,'' 1992.
2. A. R. Barron, Universal approximation bounds for superpositions of a sigmoidal function,
IEEE Trans.Inform.Theory39(1993), 930945.
3. G. H. L. Cheang, Neural network approximation and estimation of functions,in``Proc. of the 1994 IEEE-IMS Workshop on Info. Theory and Stat.,'' 1994.
4. G. Cybenko, Approximations by superpositions of a sigmoidal function, Math.Control Signals Systems2(1989), 303314.
5. R. Dudley, Metric entropy of some classes of sets with differentiable boundaries,
J.Approx.Theory10(1974), 227236; Correction,J.Approx.Theory26(1979), 192193. 6. L. Fejes Toth, ``Lagerungen in der Ebene, auf der Kugel und im Raum,'' Springer-Verlag,
Berlin, 1953, 1972.
7. P. M. Gruber, Approximation of convex bodies, in ``Convexity and its Appplications'' (P. M. Gruber and J. M. Wills, Eds.), pp. 131162, Birkhauser, Basel, 1983.
8. P. M. Gruber and P. Kenderov, Approximation of convex bodies by polytopes, Rend.
Circ.Mat.Palermo31(1982), 195225.
9. S. S. Haykin, ``Neural Networks: a Comprehensive Foundation,'' Macmillan, New York, 1994.
10. D. Haussler, Decision theoretic generalizations of the PAC model for neural net and other learning applications, Inform.Comp.100, No. 1 (1992), 78150.
11. K. Hornik, M. B. Stinchcombe, and H. White, Multi-layer feedforward networks are universal approximators, Neural Networks2(1988), 359336.
12. L. K. Jones, A simple lemma on greedy approximation in Hilbert space and convergence rates for projection pursuit regression and neural network training, Ann. Statist. 20
(1990), 608613.
13. D. Pollard, ``Convergence of Stochastic Processes,'' Springer-Verlag, Berlin, 1984. 14. D. Pollard, ``Empirical Processes: Theory and Applications,'' NSF-CBMS Regional Conf.
Ser. Probab. Statist., Vol. 2, SIAM, Philadelphia, 1990.
15. F. Rosenblatt, The perceptron: A probabilistic model for information storage and organization in the brain,Psych.Rev.65(1958), 386408.
16. F. Rosenblatt, ``Principles of Neurodynamics,'' Spartan, Washington, DC, 1962. 17. R. Schneider and J. A. Wieacker, Approximation of convex bodies by polytopes, Bull.
London Math.Soc.13(1981), 149156.
18. V. N. Vapnik and A. Ya. C8ervonenkis, On the uniform convergence of relative frequencies
of events to their probabilities,Theory Probab.Appl.16, No. 2 (1971), 264280. 19. J. E. Yukich, M. B. Stinchcombe, and H. White, Sup-norm approximation bounds for
networks through probabilistic methods, IEEE Trans. Inform. Theory 41 (1995), 10211027.