A Better Approximation for Balls

(1)

A Better Approximation for Balls

Gerald H. L. Cheang

Division of Mathematics,School of Science,National Institute of Education,

Nanyang Technological University,469,Bukit Timah Road,

Singapore259756

E-mail: cheanghlgnie.edu.sg

and Andrew R. Barron Department of Statistics,Yale University,

P.O.Box208290,New Haven,Connecticut06520-8290

E-mail: Andrew.Barronyale.edu

Communicated by Allan Pinkus

Received May 3, 1999; accepted in revised form October 22, 1999

Unexpectedly accurate and parsimonious approximations for balls in Rd _and

related functions are given using half-spaces. Instead of a polytope (an intersection of half-spaces) which would require exponentially many half-spaces (of order (1

=)d)

to have a relative accuracy=, we useT=c(d2₌2_{) pairs of indicators of half-spaces} and threshold a linear combination of them. In neural network terminology, we are using a single hidden layer perceptron approximation to the indicator of a ball. A special role in the analysis is played by probabilistic methods and approximation of Gaussian functions. The result is then applied to functions that have variationVf

with respect to a class of ellipsoids. Two hidden layer feedforward sigmoidal neural nets are used to approximate such functions. The approximation error is shown to be bounded by a constant timesVfT112+VfdT124, whereT1 is the number of nodes in the outer layer andT2is the number of nodes in the inner layer of the approximationfT₁,T₂. 2000 Academic Press

1. INTRODUCTION

There already exists a rich literature on approximation of convex bodies with other sorts of convex bodies and polytopes. See, for example, Gruber [7], Fejes Toth [6]. Like other convex bodies, a ball is an infinite interesection of tangent half-spaces. For a unit ballBin Rd_,

B= ,

a#Sd&1

[a}x1], (1)

doi:10.1006jath.1999.3441, available online at http:www.idealibrary.com on

183

0021-90450035.00

(2)

File: 640J 344102 . By:SD . Date:28:03:00 . Time:10:51 LOP8M. V8.B. Page 01:01 Codes: 2074 Signs: 1284 . Length: 45 pic 0 pts, 190 mm

whereSd&1

is the unit sphere inRd_{. If we approximate it with the} intersec-tion of T, Td+1, of the half-spaces in (1), then we are approximating the ball with a T-faced polytope PT.

There are results that bound the approximation error between convex bodies and their polytope approximators. Dudley [5] has shown that for each convex body B, there exists a constant csuch that for every Tthere is a polytopePT achieving

$H₍_B_,_P T)

c

T2(d&1), (2)

where $H _{is the Hausdorff metric. Results from Schneider and Wieacker} [17] and Gruber and Kenderov [8] have shown that for a convex body with sufficiently smooth boundary such as the ball B, there exists a constant csuch that for every polytopePT,

$(B,PT)

c

T2(d&1), (3)

where$ can be either the Hausdorff or the Lebesgue measure of the sym-metric difference. Hence for an approximation error of=, we would require a polytope with many faces of order (1

=)

(d&1)2

, which is exponential in d. To avoid this curse of dimensionality, we will use T half-spaces in the approximation in a different manner.

To illustrate the idea, consider the set of points in at least k out of n

given half-spaces. For instance, if we were given theT=9 half-spaces deter-mining the polygon approximation in Fig. 1, k=9 yields the nonagon

(3)

File: 640J 344103 . By:SD . Date:28:03:00 . Time:10:52 LOP8M. V8.B. Page 01:01 Codes: 2174 Signs: 1470 . Length: 45 pic 0 pts, 190 mm

FIGURE 2

inscribed in the circle. In Fig. 2, we use T=9 half-spaces, but we set the threshold at k=8 to obtain the star-shaped approximation shown. In higher dimensions, our approximation will look somewhat like a jagged multi-faceted star-shaped object.

Here we can think of the T half-spaces as providing a test for mem-bership in the set. Instead of requiring allT tests to be passed, we permit membership with at least kpassed out ofT. An extension of this idea is to weigh each test and determine membership by a weighted count exceeding a threshold.

While polygon approximation may appear superior in the low-dimen-sional example given in the figure, in high dimensions, polytopes have extremely poor accuracy as shown in (3). In contrast we show that the use of a weighted count to determine membership in a set permits accuracy that avoids the curse of dimensionality. Indeed, with 2T=cd2₌2_indicators

of half-spaces, where cis a constant, we threshold a linear combination of them, in order to obtain accuracy =. Note that the number of indicators of half-spaces needed is only quadratic indand not exponential indas in the classical method.

Our approximation to a ball takes the form

N2T=

{

x#Rd: :

2T

i=1

ci1[ai}xbi]k

=

.

Let f2T=1N₂_T be the indicator (characteristic) function of this set. In

neural network terminology, we are using a two hidden layer perceptron approximation to the indicator of a ball. We show that there is a constant

(4)

csuch that for everyTandd, there is such an approximationN2Tthat the

Hausdorff distance between a ballBRof radiusR andN2T satisfies

$H₍_B

R,N2T)c}R

d(d+1)

T ,

where c is some real-valued constant. A special role in the analysis is played by probabilistic methods and approximation of Gaussian functions.

2. SOME BACKGROUND AND THE GAUSSIAN FUNCTION A single hidden-layer feedforward sigmoidal network is a family of real-valued functions f_T(x) of the form

fT(x)=: T

i=1

ci,(ai}x+bi)+k, x#Rd (4)

parametrized by internal weight vectors ai in Rd, internal location

parameters bi in R, external weights ci and a constant term k (Cybenko

[4] and Haykin [9]). By a sigmoidal function, we mean any nondecreasing function on R with distinct finite limits at + and &. Such a network has dinputs,Thidden nodes and a linear output unit. It implements ridge-functions ,(a_i}x+b_i) on the nodes in the hidden layer. Here we will exclusively use the Heaviside function,(z)=1[z0], in which case (4) is a

linear combination of indicators of half spaces. Such a network is also called a perceptron network (Rosenblatt [15, 16]). Thresholding the out-put of a single hidden-layer neural net at level k₁, we obtain fT(x)=

,(fT(x)&k1) which equals

fT(x)=,

\

: T

i=1

ci,(ai}x+bi)+k$

+

. (5)

For simplicity in the notation, we will often omit the parametersai, bi, ci,

and kin the arguments of fT andfT.

To approximate a ball we first consider approximation of the Gaussian functionf(x)=exp(&|x|2_{2) and then take level sets. A level set of a} func-tionfat levelkis simply the set[x#Rd_: _f₍_x₎_k]_{. Using the fact that the}

Gaussian is a positive definite function with Fourier transform (2?)&d2 exp(&|||2_{2), so that} _f _{has a representation in the convex hull of} sinusoids, it is known that f(x) can be expressed using the convex hull of indicators of half-spaces (see Barron [1, 2], Hornik et al. [11], Yukich et al. [19]). We take advantage of a similar representation here. We use | } | to denote the EuclideanL2norm.

(5)

Let BK be a ball of radiusK large enough that it would contain Band

N2T. As shown in Appendix A, on BK the Gaussian function satisfies

f(x)=

|

Rd

|

|a|K &|a|K 1[a}x+b0]sin(b) exp(&|a|2₂₎ (2?)d2 db da+exp

\

& K2 2

+

. (6) Here exp(&K2_{2) is the value of the Gaussian evaluated on the surface of} the ball B_K. As we will see later, we can arrange for the neural net level set

N2T to be entirely contained in Band hence take K=1.

Decomposing the integral representation of finto positive and negative parts, we have f(x)&exp

\

&K 2 2

+

=f1(x)&f2(x) (7) =

|

Rd

|

|a|K &|a|K 1[a}x+b0]sin+(b) exp(&|a|2₂₎ (2?)d2 db da &

|

Rd

|

|a|K &|a|K 1[a}x+b0]sin&(b) exp(&|a|2₂₎ (2?)d2 db da =&1

|

1[a}x+b0]dV1&&2

|

1[a}x+b0]dV2, (8) where V₁ is the probability measure for (a,b) on Rd _{with density}

1[&|a|K<b< |a|K]sin+(b)(exp(&|a|22)(2?)d2&1) with normalizing constant

&₁=

|

Rd

|

|a|K &|a|K sin+₍_b₎exp(&|a| 2₂₎ (2?)d2 db da

and similarly forV2and&2(with sin&(b) in place of sin+(b) ). We denote the positive part of sin(b) by sin+₍_b_{) and the negative part of sin(}_b_{) by} sin&₍_b_{). The total variation of the measure used to represent}_f _is

&=&₁+&₂

=

|

Rd

|

|a|K &|a|K |sin(b)|exp(&|a| 2₂₎ (2?)d2 db da <

|

Rd 2 |a|Kexp(&|a|2₂₎ (2?)d2 da (9) 2K-d. (10)

(6)

An integral representation of the Gaussian as an expected value invites Monte Carlo approximation by a sample average. In particular, both f1(x) andf2(x) in (7) are expected values of indicators of half-spaces inRd. Thus a 2T-term neural net approximation tof(x) is then

f₂_T(x)&exp

\

&K 2 2

+

= &1 T : T i=1 ,(a_i}x+b_i)&&2 T : 2T i=T+1 ,(a_i}x+b_i), (11) where the parameters (a_i,b_i)T

i=1 are drawn at random independently from the distribution V₁ and (a_i,b_i)2T

i=T+1 from V2. The sampling scheme is simple. For example, to obtain an approximation for f₁(x), first draw a

from a standard multi-variate normal distribution over Rd_{, then draw} _b

from [&|a|K, |a|K] with density proportional to sin+₍_b_).

We now bound the L approximation error between f(x) and f2T(x).

We will draw on symmetrization techniques and the concept of Orlicz norms in empirical process theory (see, for example, Pollard [13]), and the theory of VapnikC8 ervonenkis classes of sets (Vapnik and C8ervonenkis [18]). With the particular choice of 9(x)=1

5exp(x

2_{) used by Pollard} [13], the Orlicz norm of a random variable Zis defined by

&Z&9=inf

{

C>0 :Eexp

\

Z2

C2

+

5

=

.

We examine the approximation error betweenf1(x) andf1,T(x), itsT-term

neural net approximation, first.

From empirical process theory, the following lemma is obtained. See Appendix B.

Lemma 1. Let !=(a,b) and g

x(!)=,(a}x+b).If h(x)= ,(a}x+b)

P(da,db) for x#B_K for some probability measure P on (a,b), then there exist !₁,!₂, ...,!_T such that

sup x#BK

}

1 T : T i=1 gx(!i)&h(x)

}

34

d+1 T . (12)

Recall that for the approximation of the Gaussian function, the approximation f2T less exp(&K22) can be split up into two parts,

(7)

approximate the positive and negative parts f1 and f2 respectively. Using Lemma 1, we see that

sup x#BK

}

&₁ T : T i=1 gx(!i)&f1(x)

}

34&1

d+1 T (13) and similarly, sup x#BK

}

&2 T : 2T i=T+1 gx(!i)&f2(x)

}

34&2

d+1 T . (14)

Hence by the triangle inequality, sup

x#BK

|f2T(x)&f(x)| 34(&1+&2)

d+1 T =34&

d+1 T 68K

d(d+1) T . (15)

3. BOUNDING THE HAUSDORFF DISTANCE OF THE APPROXIMATION

The Hausdorff distance between two sets Fand Gis defined as

$H₍_F_,_G_)=max_[_sup x#F inf y#G |x&y|, sup y#G inf x#F |x&y|].

The norm | } | is the usual Euclidean norm inRd_{. We bound the Hausdorff}

distance between the ball and its approximating set $H₍_B_,_N

2T) in this

section. The ball is assumed to be centered at the origin. However, we apply the result later to other balls and ellipsoids that are not necessarily centered at the origin. Note that the unit ballBinRd_{may be represented as}

B=

{

x: exp

\

&|x| 2 2

+

exp

\

& 1 2

+=

. We defineN2Tas N2T=[x: f2T(x)exp(&12)+K=T].

(8)

Let f(x)=exp(&|x|2_{2) and let} _f

2T(x) be the approximation with T

pairs of indicators. Here

=_T:=68

d(d+1)

T ,

for which we have theLerror between the Gaussian and its approximant

bounded above by

sup

x#BK

|f2T(x)&f(x)| =T. (16)

We are going to bound the Hausdorff distance between B andN2T, using

this sup norm bound on the error between the functions f and f₂_T which yieldB andN₂Tas level sets.

Theorem 1. Let B_R be a ball of radius R in Rd centered at the origin,

and let N₂T be the level set of the neural net approximation.For sufficiently

large T, such that =T12(exp(& 1 4)&exp(& 1 2)), $H₍_B R,N2T)318R

d(d+1) T .

Proof. The ballB coincides with the level set off at the level exp(&1 2). Let T be such that =T is less than 21Kexp(&

1

2). Choose r0 such that exp(&r2

02)=exp(& 1

2)+2K=T. LetBr0be the ball of radius r0 centered at

the origin. If x#N2T, then exp(&12)f2T(x)&K=Texp(&|x|22) which

implies that x#B. Similarly ifx#Br0, then exp(&

1

2)+K=Texp(&|x|22)

&K=Tf2T(x), which implies thatx#N2T. Thus

B_r

0/N2T/B.

BothB and its approximating setN2T are sandwiched betweenBr0andB.

Consequently

$H₍_B_,_N

2T)1&r0.

The function g(r)=exp(&r2_{2) has derivative &}_rg₍_r_{) of magnitude} which is largest at r=1. Now

r0=-2 log(1(e&12+2K=T))

=-1&2 log(1+2K=Te12),

which is close to 1. If T is large enough that =T is less than 21K

(exp(&1

4)&exp(&12)), thenr0g(r0)(1-2) exp(&12), and hence using the mean-value theorem

(9)

$H₍_B_,_N 2T)1&r0 g(r0)&g(1) r0g(r0) 2-2e K=T 136-2e K

d(d+1) T . (17)

Now we set K. From Section 2, BK need only be large enough to cover

both the unit ball B and its approximation set N2T, which we have

arranged to be contained inB. Thus we can takeBKto beB, whenceK=1.

Again when=T12(exp(& 1 4)&exp(& 1 2)), we have $H₍_B_,_N 2T)136-2e

d(d+1) T 318

d(d+1) T . (18)

For a ball B_R of radius R, the Hausdorff distance between it and its approximation set is simply 318R-d(d+1)

T . This concludes the proof of

Theorem 1. K

4. AN L1BOUND

Let BR be a ball of radiusR, N2T the level set induced by the

approxi-mation as explained in Section 1, + is the Lebesgue measure, and$ is the Hausdorff distance between BR and N2T as obtained above. Since the

symmetric differenceBRqN2Tis included in the shellBR+$"BR&$, one has

|

|1BR&1N2T| +(dx) +(B_R)= +(BRqN2T) +(B_R) +(BR)&+(BR&$) +(BR) 1&

\

1&$ R

+

d d $ R 318d

d(d+1) T , (19)

(10)

Theorem 2. The relative Lebesgue measure of the symmetric difference

+(BRqN2T)+(BR) between BR and its approximation set N2T is bounded

above by +(B_Rq_N₂_T₎ +(B_R) 318d

d(d+1) T . 5. ELLIPSOID APPROXIMATION

Consider an ellipsoid E=[x:x$Mx1] centered at the origin with

M=A$Astrictly positive definite with ad_dpositive definite square rootA. Equivalently E=[x: exp(&x$A$Ax2)exp(&12)] is the level set of a Gaussian surface. In a similar manner to the ball, it can also be accurately and parsimoniously approximated by a single hidden-layer neural net. Let the eigenvalues of A be r₁r₂ } } } rd with the corresponding

eigen-vectors [r1,r2, ...,rd]. If the approximating set for the unit ball takes the form [x:2T

i=1ci1[ai}xbi]k], then the one for the ellipsoid Eis

E₂T=

{

x: :

2T

i=1

ci1[a_i}Axb_i]k

=

.

We are interested in bounding $H₍_E_,_E

2T), the Hausdorff distance between

the ellipsoid and its approximating set.

Theorem 3. The Hausdorff distance between the ellipsoid E and its

approximating set E₂_Tis bounded above by

318rd

d(d+1)

T . (20)

Proof. The matrix transformation A transforms the unit ball to an ellipsoid by stretching the unit radius to lengthr_iin theridirection and the

approximating set NT is similarly stretched in the same way to E2T. For

the ballBr₀(as defined in the proof of Theorem 1), the matrix

transforma-tionAtransforms it to an ellipsoidE$ by stretching its radius to lengthrir0 in theridirection. Thus the order of inclusivity is still preserved after the

transformation and

E$/E2T/E.

Note that the ellipsoids E and E$ are similar, centered at the origin and aligned along the same axes. The only difference is in the scale.

(11)

The two extreme parts of the ellipsoids are along the directionr1andrd.

The ellipsoid is contained in a ball of radius rd. Thus $H(E,E2T) is

bounded above by the greatest distance betweenEandE$, and this occurs along the direction of rd, and hence is bounded above by the Hausdorff

distance between that of a ball of radius rd (containing the ellipsoid) and

a ball of radius rdr0, and that is in turn bounded above by 318rd

d(d+1)

T . K

The error is the same as for approximation of a ball except that the radius of the ball is replaced by the maximal eigenvalue (length of major axis).

Now consider an ellipsoid E with axial lengths r₁ } } } r_d_&1r_d=R

and its approximating set E₂_T. The ellipsoid E$_=(1&$

R)E is a scaled

down version ofEand it has axial lengthsr₁(1&$

R) } } } rd&1(1&

$ R)

rd(1&R$)=R&$. Recall that the approximation set E2T is obtained by

scalingN2T(the approximation set for the unit ball) by a factor ofrialong

theith axis of the ellipsoid E. The Hausdorff distance betweenE andE2T

is$ which is bounded by 318R-d(d+1)

T from Theorem 3.

Corollary 1. The measure of the symmetric difference +(EqE₂_T)

between E and its approximation set E2Tis bounded above by

+(EqE₂_T)318+(E)d

d(d+1)

T .

Proof. Since the difference Eq_E₂_T _{is included in the shell} _E_"_E$_{, we}

obtain

|

|1E&1E2T|+(dx)=+(E)&+(E2T) +(E)&+(E$₎ =+(E)&

\

1&$ R

+

d +(E) =+(E)

\

1&

\

1&$ R

+

d

+

+(E)d $ R 318+(E)d

d(d+1) T . K (21)

(12)

6. REMARKS

In earlier work of one of the authors (Cheang [3]), an L2 approxima-tion bound of order T&16 _{was obtained with} _T _{pairs of half-spaces in a} neural net with a ramp sigmoid applied to the output. Here our order

T&12 _{bound gives an improved rate.}

The integral representation to the Gaussian on BKmay also be written

exp

\

&|x| 2 2

+

=

|

Rd

|

|a|K &|a|K 1[sgn(b)a}x+bsgn(b)0]|sin(b)| exp(&|a|2₂₎ (2?)d2 db da &1 2

|

Rd

|

|a|K &|a|K

sin&₍_b₎exp(&|a| 2₂₎ (2?)d2 db da +exp

\

&K

2

+

. (22)

Sampling from the distribution V proportional to |sin(b)| exp(&|a₂|2), the approximation to the ball takes the form

NT=

{

x#Rd: : T

i=1

1[a_i}xb_i]k

=

,

that is,xis inNTif it is in at leastkof the half-spaces. This approximation

achieves

$H₍_B_,_N

T)318

d(d+1)

T . (23)

In particular, when 2Tsigmoids are used in the approximation,

$H₍_B_,_N

2T)318

d(d+1) 2T

when the representation (22) is used, reducing the constant by a factor of 1-2 from the bound in Theorem 1.

It may be possible to extend our results to neural network approxima-tion of other classes of closed convex sets with smooth boundaries, for example, to classes of sets of the form D=[x#Rd_: _f₍_x_{) }}_f₍_x₎₁_]_{, where}

f:Rd_Ä_Rd_{has a strictly positive definite derivative. If this is achieved, the}

results pertaining to functions which have total finite variation with respect to a class of ellipsoids (in the following section) could be extended to those for a class of convex sets with some suitable smoothness properties.

(13)

7. APPROXIMATION BOUNDS FOR TWO LAYER NETS The second (outer) layer of a two layer net takes a linear combination of level sets Hof functions represented by linear combinations on the first (inner) layer. The class of sets represented by level sets of combinations of first layer nodes include half-spaces and rectangles, and (as we have seen) approximations to ellipsoids.

A function fis said to have variationV_f_,_Hwith respect to a class of sets

H if V_f_,_His the infimum of numbers V such that fVis in the closure of the convex hull of signed indicators of sets inH, where the closure is taken in L₂(PX). A special case of finite variation is the case we call total

varia-tion with respect to a class of sets. Suppose f(x) defined over a bounded region S inRd_{. We say that}_f _{has total variation}_V _{with respect to a class}

of sets H=[H!:!#5] if there exist some signed measure v over the

measurable space 5 and

f(x)=

|

5

1H!(x)v(d!) for x#S, (24)

and ifvhas finite total variationV. The setsH!are parametrized by!in5.

In our context, the H! are half spaces in Rd where the ! consist of the

location and orientation parameters. In the event that the representation (24) is not unique, we take the measure v that yields the smallest total variationV.

The function class FV,Hof functions with variationVf,Hbounded byV

arises naturally when thinking of the functions obtained by linear combina-tions on a layer of a network where the sum of absolute values of the coefficients of linear combination are bounded byVand the level sets from the preceding layer yield the sets inH. In our analysis of the two layer case we will take advantage of both L approximations bounds (used to yield

approximations to the indicators of ellipsoids in the inner layer) and L₂

approximation bounds for convex hulls of indicators of ellipsoids (essen-tially achieved by the outer layer of the network). First we state a simple

L2 approximation bound, which is a counterpart to the L bound of

Lemma 1, but with smaller constants (and without requirement of integral representation).

Lemma 2. If f has variation V

f with respect to a class of setsH then for

each T there exists H₁, ...,H_Tand c₁, ...,c_Twith T

i=1|ci| Vfsuch that the

approximation f_T(x)=T

i=1ci1Hi(x) achieves

&f&f_T&2

Vf

(14)

This lemma as a tool of approximation theory and two probabilistic proofs (one based on probabilistic sampling and one based on a greedy algorithm) are in Barron [2] and (with a somewhat larger constant) an earlier form of the greedy algorithm proof of the approximation result is in Jones [12]. The probabilistic sampling bound on L2 norms of averages used in the proof is classical Hilbert theory.

Proof. The proof is based on the Monte Carlo sampling idea as in Section 2. First fix T and suppose that f is not identically constant. (Equality occurs in (25) only if f is identically constant.) Since f is in the closure of the convex hull ofG=[\V_f1H:H#H], one takes af that is

a (potentially very large) finite convex combination with &f&f&2<$. In particular we take$==-Tand=small, say=<V_f&-V2

f&&f&24, which

is less than &₂f&.

By the triangle inequality,

&f&fT&2&f&f&2+&f&fT&2

=

-T+&f&fT&2. (26)

Suppose f=i pigi with gi in G, and pi>0 with i pi=1. Since f is an

expectation, we apply the Monte Carlo sampling technique. Draw indices

i1, ...,iTindependently according to the distributionpiin the representation

off and letfT=T1 T

j=1gij. Then

E_i&f&f_T&2

2=

Ei&gi&2&&f&2

T

V

2

f&&f&

2

T

V

2

f&&f&

2₄

T , (27)

and so there exists a choice of such anfTwith

&f&f_T&2 2<

V2

f&&f&

2₄

T .

That is,

&f&fT&2<

-V2

f&&f&

2₄

-T . (28)

(15)

As a consequence of the lemma above, we have the following corollary involving approximation with a class of ellipsoids. Let !be the parameters that define the ellipsoids, and 1E!(x) the indicator of the ellipsoid.

Corollary 3. If f has variation V_f=V_f_,_Ewith respect to the classEof

ellipsoids then there is a choice of ellipsoids E1, ...,ET and s1, ...,sT1#

[&1, +1], and ci=Vf siT1such that

f_T 1(x)=: T1 i=1 c_i1Ei (29) satisfies

&fT₁&f&2

Vf -T1

. (30)

The indicators of ellipsoids have two layer sigmoidal network approxi-mations consisting of a single outer node and a single hidden inner layer. These approximations to 1Eimay be substituted into the approximation in

(29) to yield a two hidden layer approximation to f.

Let E=[E!:!#5]be the set of ellipsoids with+(E!)+(S) where+is

the Lebesgue measure. Let P_Xbe the uniform probability measure overS, and let E₂_T

2 be the neural net level set with 2T2 sigmoids that is used to

approximateE. Using the bound in Corollary 1, for eachE#E,

|

_S|1E(x)&1E₂_T 2(x)| 2_P X(dx)= +((E&E2T2)&S) +(S) +(E) +(S)318d

d(d+1) 2 318d

d(d+1) T₂ . (31)

After replacing the indicators of the ellipsoids in (29) with their neural net approximations, we obtain

f_T 1, 2T2=: T1 i=1 c_i,

\

: T2 j=1 |_ij,(a_ij}x&b_ij)&d_i

+

. (32) The following theorem bounds the mean-squared approximation error. An ellipsoid inEis denoted by E.

Theorem 4. If f has variation V_f with respect to the class of ellipsoidsE,

(16)

there exist a choice of parameters(aij,bij,ci,di,|ij)such that a two hidden

layer net with step activation function achieves approximation error bounded by

&f&fT₁, 2T₂&2

V_f -T₁+Vf

\

318d

d(d+1) T₂

+

12 , (33) and &f&f_T 1, 2T2&1 Vf -T1 +V_f318d

d(d+1) T2 ,

where&}&p denotes the Lp(PX) norm; provided that T2 is large enough that 68-d(d+1)T212(exp(&

1

4)&exp(& 1 2)).

Proof. By the triangle inequality,

&f&f_T

1, 2T2&2&f&fT1&2+&fT1&fT1, 2T2&2. (34)

Now

&f&fT1&2

V_f

-T₁

from Corollary 1. The other term on the right hand side of (34) is bounded as follows. Let E_i be the neural net level set of the approximation to E_i

from Section 5. Then

&fT1&fT1, 2T2&2=

"

: T1 i=1 ci(1Ei&1Ei)

"

2 1 T1 : T1 i=1

|c_i|&1Ei&1Ei&2

Vf

\

318d

d(d+1)

T2

+

12

, (35)

where (31) bounds the last inequality (35). K

The proof of the L1 bound is similar (using &f&fT1&1&f&fT1&2

Vf-T1) except that the square root in (35) is not used in bounding

&1Ei&1Ei&1.

We conclude with two examples of functions with variation with respect to a class of balls (ellipsoids).

(17)

Example 1. Convex Combination of Balls. Let B(a,b) denote a ball centered at a with radius b. InR3_{, the function}

f(x)=4? 3&-2?(x 2 1+x 2 2+x 2 3) 12₊ ? 3-2(x 2 1+x 2 2+x 2 3) 32 =

|

1B(%, 1)(x) 1B(0, 1)(%)d% (36)

is a convex combination of indicators of balls. Thus

f_T 1(x)= 4? 3T1 : T1 i=1 1_B₍_% i, 1)(x) (37)

is an approximation tof(x) where the %i's are sampled from the uniform

distribution in a unit ball. We then approximate each ball 1B(%i, 1)(x) with

the form (5).

Example 2. A Radial Function Let +2,

1 2( |x| &++2), +&2< |x| + f(x)=

{

1 2(++2& |x| ), +< |x| ++2 (38) 0, otherwise. Then f(x)=

|

R

1(%&1,%+1)( |x| )211[&1, 1](%&+)d% (39) and thus f(x) can be approximated by

f_T 1(x)= 1 T1 : T1 i=1 [1B(0,%i+1)(x)&1B(0,%i&1)(x)], (40)

where %itiid Uniform (+&1,++1).

APPENDIX A

Starting with the right hand side of (6) and recalling that |a}x| |a|K

(18)

|

_Rd

|

|a|K &|a|K 1[a}x+b0]sin(b) exp(&|a|2₂₎ (2?)d2 db da (41) = &Im

|

Rd

|

|a|K &|a|K 1[a}x+b0]exp(&ib) exp(&|a|2₂₎ (2?)d2 db da = &Im

|

Rd

_

|

a}x+ |a|K a}x& |a|K 1[s0]exp(&is)ds

&

exp(ia}x) exp(&|a|2₂₎ (2?)d2 da = &Im

|

Rd

_

|

a}x+ |a|K 0

exp(&is)ds

&

exp(ia}x) exp(&|a|

2₂₎

(2?)d2 da

=Imi

|

Rd

[1&exp(&ia}x) exp(&i|a|K)]exp(ia}x) exp(&|a| 2₂₎ (2?)d2 da =exp

\

&|x| 2 2

+

&

|

Rd exp(&i|a|K) exp(&|a|2₂₎ (2?)d2 da (42) =f(x)&exp

\

&K 2 2

+

. (43) In (41), we did a substitutions=a}x+b. APPENDIX B

We prove a more general version of Lemma 1. Let a parameterized class of setsH=[H!:!#5]inRdbe given where5is a measurable space. Let

H =[Hx:x#Rd] with Hx=[!:x#H!] be the dual class of sets in 5

parametrized by x.

First we define some terms that will be used in the lemma. Let G be a class of functions mapping fromXtoRand letx₁, ...,x_N#X. We say that

x₁, ...,x_N are shattered by G if there exists r#RN _{such that for each}

b=(b₁, ...,b_N) #[0, 1]N_{, there is an}_g_#_G_{such that for each}_i_,

g(x_i)

{

ri <ri

if bi=1

if bi=0.

The pseudo-dimension is defined as

dimP(G)=max[N:_x1, ...,xN,Gshatters x1, ...,xN] (44)

if such a maximum exists, andotherwise. For the class of unit step func-tions,(a}x+b), the pseudo-dimension and the VC-dimensionDcoincide and isd+1. The=-packing numberDT(=,Lp) for a subset of a metric space

(19)

is defined as the largest numbermfor which there exist pointst1, ...,tm in

the subset of the metric space withdp(ti,tj)>=fori{j, wheredpis theLp

metric.

Lemma 4. IfH has VC-dimension D and if h is a function in the convex

hull of the indicators of sets inH which possesses an integral representation

h(x)=

|

1H!(x)P(d!) for x#BK,

then there is a choice of !₁,!₂, ...,!T such that the approximation hT(x)=

1 T T i=11H !i(x) satisfies sup x#BK |hT(x)&h(x)| 34

D T. (45)

Remark. Such a uniform approximation bound holds over any subset of Rd_{in which the integral representation holds. In our application we use}

BK, the ball of radius K.

Proof. Let gx(!)=1Hx(!)=1H!(x) and let _i be independent random

variables taking the values \1 with probablity1

2. Define!

=(!1,!2, ...,!T), where the!i are independently and identically distributed with respect to

P( } ), and _

=(_1,_2, ...,_T). By symmetrization, using Jensen's inequality as in Pollard [13, p. 7], for C>0, we have

E9

\

supx#BK| T i=1 gx(!i)&Th(x)| C

+

E! E_ 9

\

2 supx#BK| T i=1_igx(!i)| C

+

. (46)

Conditioning on !, we need to find an upper bound to E_

9(2 supx#BK |T

i=1_igx(!i)| ). This involves bounding the Orlicz norm &2 supx#BK

|T

i=1_igx(!i)|&9with !

fixed. Using a result in Pollard [13, pp. 3537],

"

2 sup x#BK

}

: T i=1 _igx(!i)

}"

9 18-T

|

1 0 -logDT(=,L2)d=, (47)

where DT(=,L2) is the L2 =-packing number for H , where the L2 norm

on 5 is taken with respect to the empirical probability measure on

(20)

From Pollard [13, p. 14], DT(=,L2)

\

3 =

+

D , (48)

uniformly over all !₁,!₂, ...,!_T. We now work out an upper bound to

1

0-logDT(=,L2)d=. From the CauchySchwartz inequality,

|

₀1-logD_T(=,L2)d=

|

1 0 logD_T(=,L2)d= =

Dlog 3&D

|

1 0 log= d= -(1+log 3)D. (49)

Substituting (49) into (47), we see that

"

2 sup x#BK

}

: n i=1 __ig_x(!_i)

}"

9 18-(1+log 3)TD. (50)

From the definition of the Orlicz norm, the choice of C₀=

18-(1+log 3)TD ensures that

E_9

\

2 supx#BK| T i=1_igx(!i)| C0

+

1, and hence, E9

\

supx#BK| T i=1gx(!i)&Th(x)| C₀

+

1. (51)

Thus we conclude that there exists!1,!2, ...,!T such that

9

\

supx#BK| T i=1gx(!i)&Th(x)| C0

+

1, whence sup x#BK

}

1 T : T i=1 gx(!i)&h(x)

}

18-D(1+log 3) log 5 -T 34

D T. K (52)

(21)

In our case, !=(a,b) andgx(!)=1H!(x)=1[a}x+b0]. The dual class of

sets in 5 are Hx=[!:gx(!)=1]=[(a,b) :a}x+b0]. Since (a,b) #

Rd_{_}_R_{, which is a vector space of dimension} _d_{+1, the class of sets} _H ₌

[Hx:x#Rd] has VC-dimension D=d+1 (Pollard [12, p. 20], Haussler

[10]). Thus Lemma 1.

REFERENCES

1. A. R. Barron, Neural net approximation,in``Proc. of the 7th Yale Workshop on Adaptive and Learning Systems,'' 1992.

2. A. R. Barron, Universal approximation bounds for superpositions of a sigmoidal function,

IEEE Trans.Inform.Theory39(1993), 930945.

3. G. H. L. Cheang, Neural network approximation and estimation of functions,in``Proc. of the 1994 IEEE-IMS Workshop on Info. Theory and Stat.,'' 1994.

4. G. Cybenko, Approximations by superpositions of a sigmoidal function, Math.Control Signals Systems2(1989), 303314.

5. R. Dudley, Metric entropy of some classes of sets with differentiable boundaries,

J.Approx.Theory10(1974), 227236; Correction,J.Approx.Theory26(1979), 192193. 6. L. Fejes Toth, ``Lagerungen in der Ebene, auf der Kugel und im Raum,'' Springer-Verlag,

Berlin, 1953, 1972.

7. P. M. Gruber, Approximation of convex bodies, in ``Convexity and its Appplications'' (P. M. Gruber and J. M. Wills, Eds.), pp. 131162, Birkhauser, Basel, 1983.

8. P. M. Gruber and P. Kenderov, Approximation of convex bodies by polytopes, Rend.

Circ.Mat.Palermo31(1982), 195225.

9. S. S. Haykin, ``Neural Networks: a Comprehensive Foundation,'' Macmillan, New York, 1994.

10. D. Haussler, Decision theoretic generalizations of the PAC model for neural net and other learning applications, Inform.Comp.100, No. 1 (1992), 78150.

11. K. Hornik, M. B. Stinchcombe, and H. White, Multi-layer feedforward networks are universal approximators, Neural Networks2(1988), 359336.

12. L. K. Jones, A simple lemma on greedy approximation in Hilbert space and convergence rates for projection pursuit regression and neural network training, Ann. Statist. 20

(1990), 608613.

13. D. Pollard, ``Convergence of Stochastic Processes,'' Springer-Verlag, Berlin, 1984. 14. D. Pollard, ``Empirical Processes: Theory and Applications,'' NSF-CBMS Regional Conf.

Ser. Probab. Statist., Vol. 2, SIAM, Philadelphia, 1990.

15. F. Rosenblatt, The perceptron: A probabilistic model for information storage and organization in the brain,Psych.Rev.65(1958), 386408.

16. F. Rosenblatt, ``Principles of Neurodynamics,'' Spartan, Washington, DC, 1962. 17. R. Schneider and J. A. Wieacker, Approximation of convex bodies by polytopes, Bull.

London Math.Soc.13(1981), 149156.

18. V. N. Vapnik and A. Ya. C8_{ervonenkis, On the uniform convergence of relative frequencies}

of events to their probabilities,Theory Probab.Appl.16, No. 2 (1971), 264280. 19. J. E. Yukich, M. B. Stinchcombe, and H. White, Sup-norm approximation bounds for

networks through probabilistic methods, IEEE Trans. Inform. Theory 41 (1995), 10211027.