arXiv:1810.08033v1 [stat.ML] 18 Oct 2018

ADAPTIVITY OF DEEP RELU NETWORK FOR LEARNING IN BESOV AND MIXED SMOOTH BESOV SPACES: OPTIMAL RATE AND CURSE OF DIMENSIONALITY

Taiji Suzuki

The University of Tokyo, Tokyo, Japan

Center for Advanced Intelligence Project, RIKEN Japan Digital Design

[email protected]

ABSTRACT

Deep learning has shown high performance in various types of tasks, from visual recognition to natural language processing, which indicates the superior flexibility and adaptivity of deep learning. To understand this phenomenon theoretically, we develop a new approximation and estimation error analysis of deep learning with the ReLU activation for functions in a Besov space and its variant with mixed smoothness. The Besov space is a considerably general function space including the Hölder space and the Sobolev space, and in particular it can capture spatial inhomogeneity of smoothness. Through the analysis in the Besov space, it is shown that deep learning can achieve the minimax optimal rate and outperform any non-adaptive (linear) estimator such as kernel ridge regression, which shows that deep learning has higher adaptivity to the spatial inhomogeneity of the target function than other estimators such as linear ones. In addition to this, it is shown that deep learning can avoid the curse of dimensionality if the target function is in a mixed smooth Besov space. We also show that the dependency of the convergence rate on the dimensionality is tight due to its minimax optimality. These results support the high adaptivity of deep learning and its superior ability as a feature extractor.

1 INTRODUCTION

Deep learning has shown great success in several applications such as computer vision and natural language processing. As its application range is getting wider, theoretical analysis to reveal the reason why deep learning works so well is also gathering much attention. To understand deep learning theoretically, several studies have been developed from several aspects such as approximation theory and statistical learning theory. A remarkable property of neural networks is that they have universal approximation capability even if there is only one hidden layer (Cybenko, 1989; Hornik, 1991; Sonoda & Murata, 2015). Thanks to this property, deep and shallow neural networks can approximate any function with any precision (of course, the meaning of the terminology “any” must be rigorously defined like “any function in $L^1(\mathbb{R})$”). A natural question coming next to the universal approximation capability is its expressive power. It is shown that the expressive power of deep neural networks grows exponentially against the number of layers (Montufar et al., 2014; Bianchini & Scarselli, 2014; Cohen et al., 2016; Cohen & Shashua, 2016; Poole et al., 2016) where the “expressive power” is defined in several ways.

The expressive power of neural networks can be analyzed more precisely by specifying the target function's property such as smoothness. Barron (1993; 1994) developed an approximation theory for functions having limited “capacity” that is measured by integrability of their Fourier transform. An interesting point of the analysis is that the approximation error is not affected by the dimensionality of the input. This observation matches the experimental observations that deep learning is quite effective also in high dimensional situations. Another typical approach is to analyze function spaces with smoothness conditions such as the Hölder space. In particular, deep neural networks with the ReLU activation (Nair & Hinton, 2010; Glorot et al., 2011) have been extensively studied recently from the viewpoint of their expressive power and generalization error. For example, Yarotsky (2016) derived the approximation error of the deep network with the ReLU activation for functions in the Hölder space. Schmidt-Hieber (2017) evaluated the estimation error of the regularized least squares estimator performed by deep ReLU networks based on this approximation error analysis in a nonparametric regression setting. Petersen & Voigtlaender (2017) generalized the analysis by Yarotsky (2016) to the class of piece-wise smooth functions. Imaizumi & Fukumizu (2018) utilized this analysis to derive the estimation error for estimating piece-wise smooth functions and concluded that deep learning can outperform linear estimators in that setting; here, the linear method indicates an estimator which is linearly dependent on the output observations $(y_1, \dots, y_n)$ (it could be non-linearly dependent on the input $(x_1, \dots, x_n)$; for example, the kernel ridge regression depends on the output observations linearly, but it is nonlinearly dependent on the inputs). Although these error analyses are standard from a nonparametric statistics viewpoint and the derived rates are known to be (near) minimax optimal, the analysis is rather limited because it is mainly based on the Hölder space. However, there are several other function spaces such as the Sobolev space and the space of finite total variations. A comprehensive analysis to deal with such function classes from a unified viewpoint is required.

Table 1: Comparison between the performances achieved by deep learning and linear methods. Here, $N$ is the number of parameters to approximate a function in a Besov space ($B^s_{p,q}([0,1]^d)$), and $n$ is the sample size. The approximation error is measured by the $L^r$-norm. The $\tilde{O}$ symbol hides the poly-log order.

| Model | Deep learning | Linear method |
|---|---|---|
| Approximation error rate | $\tilde{O}(N^{-s/d})$ | $\tilde{O}(N^{-s/d + (1/p - 1/r)_+})$ |
| Estimation error rate | $\tilde{O}(n^{-\frac{2s}{2s+d}})$ | $\Omega(n^{-\frac{2s - (2/(p \wedge 2) - 1)}{2s + 1 - (2/(p \wedge 2) - 1)}})$ |

In this paper, we give generalization error bounds of deep ReLU networks for a Besov space and its variant with mixed smoothness, which include the Hölder space, the Sobolev space, and the function class with total variation as special cases. By doing so, (i) we show that deep learning achieves the minimax optimal rate on the Besov space and notably it outperforms any linear estimator such as the kernel ridge regression, and (ii) we show that deep learning can avoid the curse of dimensionality on the mixed smooth Besov space and achieves the minimax optimal rate. As related work, Mhaskar & Micchelli (1992); Mhaskar (1993); Chui et al. (1994); Mhaskar (1996); Pinkus (1999) also developed an approximation error analysis which essentially leads to analyses for Besov spaces. However, the ReLU activation is basically excluded and comprehensive analyses for the Besov space have not been given. Consequently, it has not been clear whether ReLU neural networks can outperform other representative methods such as kernel methods. In summary, the contributions of this paper are as follows:

(i) To investigate adaptivity of deep learning, we give an explicit form of approximation and estimation error bounds for deep learning with the ReLU activation where the target functions are in the Besov spaces ($B^s_{p,q}$) for $s > 0$ and $0 < p, q \le \infty$ with $s > d(1/p - 1/r)_+$, where the $L^r$-norm is used for error evaluation. In particular, deep learning outperforms any linear estimator such as kernel ridge regression if the smoothness of the target function is highly spatially inhomogeneous. See Table 1 for the overview.

(ii) To investigate the effect of dimensionality, we analyze approximation and estimation problems in the so-called mixed smooth Besov space by ReLU neural networks. It is shown that deep learning with the ReLU activation can avoid the curse of dimensionality and achieve the near minimax optimal rate. The theory is developed on the basis of the sparse grid technique (Smolyak, 1963). See Table 2 for the overview.

2 SET UP OF FUNCTION SPACES

In this section, we define the function classes for which we develop error bounds. In particular, we define the Besov space and its variant with mixed smoothness. The typical setting in statistical learning theory is to estimate a function with a smoothness condition. There are several ways to characterize “smoothness.” Here, we summarize the definitions of representative function spaces that are appropriate to define the smoothness assumption.

Table 2: Summary of relation between related existing work and our work for a mixed smooth Besov space. $N$ is the number of parameters in the deep neural network, $n$ is the sample size. $\beta$ represents the smoothness parameter, and $d$ represents the dimensionality of the input. The approximation accuracy is measured by the $L^2$-norm and the estimation accuracy is measured by the square of the $L^2$-norm. See Theorem 3 for the definition of $u$.

| Function class | Hölder | Barron class | m-Sobolev ($0 < \beta \le 2$) | m-Besov ($0 < \beta$) |
|---|---|---|---|---|
| Approximation: author | Yarotsky (2016), Liang & Srikant (2016) | Barron (1993) | Montanelli & Du (2017) | This work |
| Approximation error | $\tilde{O}(N^{-\beta/d})$ | $\tilde{O}(N^{-1/2})$ | $\tilde{O}(N^{-\beta})$ | $\tilde{O}(N^{-\beta})$ |
| Estimation: author | Schmidt-Hieber (2017) | Barron (1993) | - | This work |
| Estimation error | $\tilde{O}(n^{-\frac{2\beta}{2\beta+d}})$ | $\tilde{O}(n^{-\frac{1}{2}})$ | - | $\tilde{O}(n^{-\frac{2\beta}{2\beta+1}} \times \log(n)^{\frac{2(d-1)(u+\beta)}{1+2\beta}})$ |

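Let $\Omega \subset \mathbb{R}^d$ be a domain of the functions. Throughout this paper, we employ $\Omega = [0,1]^d$. For a function $f : \Omega \to \mathbb{R}$, let $\|f\|_p := \|f\|_{L^p(\Omega)} := (\int_\Omega |f|^p \, dx)^{1/p}$ for $0 < p < \infty$. For $p = \infty$, we define $\|f\|_\infty := \|f\|_{L^\infty(\Omega)} := \sup_{x \in \Omega} |f(x)|$. For $\alpha \in \mathbb{R}^d$, let $|\alpha| = \sum_{j=1}^d |\alpha_j|$. Let $C^0(\Omega)$ be the set of continuous functions equipped with the $L^\infty$-norm: $C^0(\Omega) := \{f : \Omega \to \mathbb{R} \mid f \text{ is continuous and } \|f\|_\infty < \infty\}$.¹ For $\alpha \in \mathbb{N}_+^d$, we denote $D^\alpha f(x) = \frac{\partial^{|\alpha|} f}{\partial^{\alpha_1} x_1 \cdots \partial^{\alpha_d} x_d}(x)$.²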

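Definition 1 (Hölder space ($C^\beta(\Omega)$)). Let $\beta > 0$ with $\beta \notin \mathbb{N}$ be the smoothness parameter. For an $m$ times differentiable function $f : \mathbb{R}^d \to \mathbb{R}$, let the norm of the Hölder space $C^\beta(\Omega)$ be
$$\|f\|_{C^\beta} := \max_{|\alpha| \le m} \|D^\alpha f\|_\infty + \max_{|\alpha| = m} \sup_{x, y \in \Omega} \frac{|D^\alpha f(x) - D^\alpha f(y)|}{|x - y|^{\beta - m}},$$
where $m = \lfloor \beta \rfloor$. Then, the ($\beta$-)Hölder space $C^\beta(\Omega)$ is defined as $C^\beta(\Omega) = \{f \mid \|f\|_{C^\beta} < \infty\}$.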

The parameter $\beta > 0$ controls the “smoothness” of the function. Along with the Hölder space, the Sobolev space is also important.

Definition 2 (Sobolev space ($W^k_p(\Omega)$)). The Sobolev space ($W^k_p(\Omega)$) with a regularity parameter $k \in \mathbb{N}$ and a parameter $1 \le p \le \infty$ is the set of functions such that the Sobolev norm $\|f\|_{W^k_p} := (\sum_{|\alpha| \le k} \|D^\alpha f\|_p^p)^{1/p}$ is finite.

There are several ways to define a Sobolev space with fractional order, one of which uses the notion of interpolation space (Adams & Fournier, 2003), but we do not pursue this direction here. Finally, we introduce the Besov space, which further generalizes the definition of the Sobolev space. To define the Besov space, we introduce the modulus of smoothness.

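Definition 3. For a function $f \in L^p(\Omega)$ for some $p \in (0, \infty]$, the $r$-th modulus of smoothness of $f$ is defined by
$$w_{r,p}(f, t) = \sup_{\|h\|_2 \le t} \|\Delta_h^r(f)\|_p,$$
where
$$\Delta_h^r(f)(x) = \begin{cases} \sum_{j=0}^r \binom{r}{j} (-1)^{r-j} f(x + jh) & (x \in \Omega,\ x + rh \in \Omega), \\ 0 & (\text{otherwise}). \end{cases}$$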
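To make Definition 3 concrete, the following Python sketch approximates the modulus of smoothness on a grid for $d = 1$ and $\Omega = [0,1]$. The function names, the grid sizes, and the Riemann-sum approximation of the $L^p$-norm are illustrative assumptions and are not part of the original analysis.

```python
import math
import numpy as np

def finite_difference(f, x, h, r):
    """r-th difference Delta_h^r(f)(x) on Omega = [0, 1].

    Returns 0 wherever x or x + r*h lies outside [0, 1], as in Definition 3.
    f must accept arrays and be defined on the whole real line.
    """
    x = np.asarray(x, dtype=float)
    total = np.zeros_like(x)
    for j in range(r + 1):
        total += math.comb(r, j) * (-1) ** (r - j) * f(x + j * h)
    inside = (x >= 0) & (x <= 1) & (x + r * h >= 0) & (x + r * h <= 1)
    return np.where(inside, total, 0.0)

def modulus_of_smoothness(f, t, r, p, n_grid=2001, n_steps=50):
    """Grid approximation of w_{r,p}(f, t) for d = 1 (sup over |h| <= t)."""
    x = np.linspace(0.0, 1.0, n_grid)
    best = 0.0
    for h in np.linspace(-t, t, n_steps):
        diff = np.abs(finite_difference(f, x, h, r))
        if np.isinf(p):
            norm = diff.max()
        else:
            # Riemann approximation of the L^p norm on a domain of length 1.
            norm = np.mean(diff ** p) ** (1.0 / p)
        best = max(best, norm)
    return best

# Hypothetical test function: a unit jump at x = 1/2.
def step(x):
    return (np.asarray(x) >= 0.5).astype(float)
```

For the step function above, `modulus_of_smoothness(step, 0.1, r=1, p=1)` is approximately $0.1$, whereas with `p=np.inf` it equals $1$: the $L^1$-modulus decays with $t$ while the $L^\infty$-modulus does not, which is the kind of spatially inhomogeneous behavior that small values of $p$ are able to measure.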

Based on the modulus of smoothness, the Besov space is defined as in the following definition.

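¹ Since $\Omega = [0,1]^d$ in our setting, the boundedness automatically follows from the continuity.

² We let $\mathbb{N}_+ := \{0, 1, 2, 3, \dots\}$, $\mathbb{N}_+^d := \{(z_1, \dots, z_d) \mid z_i \in \mathbb{N}_+\}$, $\mathbb{R}_+ := \{x \ge 0 \mid x \in \mathbb{R}\}$, and $\mathbb{R}_{++} := \{x > 0 \mid x \in \mathbb{R}\}$.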

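Definition 4 (Besov space ($B^\alpha_{p,q}(\Omega)$)). For $0 < p, q \le \infty$, $\alpha > 0$, $r := \lfloor \alpha \rfloor + 1$, let the seminorm $|\cdot|_{B^\alpha_{p,q}}$ be
$$|f|_{B^\alpha_{p,q}} := \begin{cases} \left( \int_0^\infty (t^{-\alpha} w_{r,p}(f, t))^q \frac{dt}{t} \right)^{1/q} & (q < \infty), \\ \sup_{t > 0} t^{-\alpha} w_{r,p}(f, t) & (q = \infty). \end{cases}$$
The norm of the Besov space $B^\alpha_{p,q}(\Omega)$ can be defined by $\|f\|_{B^\alpha_{p,q}} := \|f\|_p + |f|_{B^\alpha_{p,q}}$, and we have $B^\alpha_{p,q}(\Omega) = \{f \in L^p(\Omega) \mid \|f\|_{B^\alpha_{p,q}} < \infty\}$.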

Note that $p, q < 1$ is also allowed. In that setting, the Besov space is no longer a Banach space but a quasi-Banach space. The Besov space plays an important role in several fields such as nonparametric statistical inference (Giné & Nickl, 2015) and approximation theory (Temlyakov, 1993a). These spaces are closely related to each other as follows (Triebel, 1983):

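• For $m \in \mathbb{N}$, $B^m_{p,1}(\Omega) \hookrightarrow W^m_p(\Omega) \hookrightarrow B^m_{p,\infty}(\Omega)$, and $B^m_{2,2}(\Omega) = W^m_2(\Omega)$.

• For $0 < s < \infty$ and $s \notin \mathbb{N}$, $C^s(\Omega) = B^s_{\infty,\infty}(\Omega)$.

• For $0 < s, p, q, r \le \infty$ with $s > \delta := d(1/p - 1/r)_+$, it holds that $B^s_{p,q}(\Omega) \hookrightarrow B^{s-\delta}_{r,q}(\Omega)$. In particular, under the same condition, from the definition of $\|\cdot\|_{B^s_{p,q}}$, it holds that
$$B^s_{p,q}(\Omega) \hookrightarrow L^r(\Omega). \tag{1}$$

• For $0 < s, p, q \le \infty$, if $s > d/p$, then
$$B^s_{p,q}(\Omega) \hookrightarrow C^0(\Omega). \tag{2}$$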

Hence, if the smoothness parameter satisfies $s > d/p$, then the Besov space is continuously embedded in the set of continuous functions. However, if $s < d/p$, then the elements of the space are no longer necessarily continuous. Moreover, it is known that $B^1_{1,1}([0,1])$ is included in the space of bounded total variation (Peetre & Dept, 1976). Hence, the Besov space also allows spatially inhomogeneous smoothness with spikes and jumps, which makes the difference between linear estimators and deep learning (see Sec. 4.1).
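The two embedding conditions (1) and (2) are simple inequalities in $(s, p, r, d)$, and the short Python sketch below checks them for given parameters. The helper names are hypothetical and the functions only evaluate the stated sufficient conditions; they do not decide membership of any particular function.

```python
def plus(x):
    """Positive part (x)_+ = max(x, 0)."""
    return max(x, 0.0)

def embeds_in_Lr(s, p, r, d):
    """Condition s > d(1/p - 1/r)_+ of Eq. (1); pass float('inf') for p or r = infinity."""
    return s > d * plus(1.0 / p - 1.0 / r)

def embeds_in_C0(s, p, d):
    """Condition s > d/p of Eq. (2)."""
    return s > d / p
```

For example, for $B^1_{1,1}([0,1])$ we have `embeds_in_Lr(1, 1, 2, 1) == True` but `embeds_in_C0(1, 1, 1) == False`, so the space sits inside $L^2$ although continuity is not guaranteed by Eq. (2).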

It is known that the minimax rate to estimate $f^o$ is lower bounded by $n^{-2s/(2s+d)}$ (Giné & Nickl, 2015). We see that the curse of dimensionality is unavoidable as long as we consider the Besov space. This is an undesirable property because we easily encounter high dimensional data in several machine learning problems. Hence, we need another condition to derive approximation and estimation error bounds that are not heavily affected by the dimensionality. To do so, we introduce the notion of mixed smoothness.
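The effect of $d$ on the rate $n^{-2s/(2s+d)}$ can be illustrated numerically. The following Python sketch (function names are hypothetical; constants and logarithmic factors are ignored) computes the exponent and the sample size at which the rate reaches a given error level.

```python
def minimax_exponent(s, d):
    """Exponent a in the rate n^{-a} with a = 2s / (2s + d)."""
    return 2.0 * s / (2.0 * s + d)

def samples_needed(s, d, target_error):
    """Smallest n with n^{-2s/(2s+d)} <= target_error (constants and logs ignored)."""
    return target_error ** (-1.0 / minimax_exponent(s, d))
```

With $s = 2$ and a target error of $0.1$, the required sample size is about $18$ for $d = 1$ but of order $10^{26}$ for $d = 100$, which is the curse of dimensionality in this setting.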

The Besov space with mixed smoothness is defined as follows (Schmeisser, 1987; Sickel & Ullrich, 2009). To define the space, we define the coordinate difference operator as
$$\Delta_h^{r,i}(f)(x) = \Delta_h^r(f(x_1, \dots, x_{i-1}, \cdot, x_{i+1}, \dots, x_d))(x_i)$$
for $f : \mathbb{R}^d \to \mathbb{R}$, $h \in \mathbb{R}_+$, $i \in [d]$, and $r \ge 1$. Accordingly, the mixed differential operator for $e \subset \{1, \dots, d\}$ and $h \in \mathbb{R}^d$ is defined as
$$\Delta_h^{r,e}(f) = \prod_{i \in e} \Delta_{h_i}^{r,i}(f), \qquad \Delta_h^{r,\emptyset}(f) = f.$$
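The coordinate-wise and mixed difference operators can be sketched in Python as follows. This is a minimal illustration under two assumptions: coordinates are indexed from 0 rather than 1, and the truncation to zero outside $\Omega$ used in Definition 3 is omitted for brevity. The function names are hypothetical.

```python
import math
import numpy as np

def coordinate_difference(f, r, i, h_i):
    """Return x -> Delta_{h_i}^{r,i}(f)(x): the r-th difference along coordinate i."""
    def g(x):
        x = np.asarray(x, dtype=float)
        total = 0.0
        for j in range(r + 1):
            shifted = x.copy()
            shifted[i] += j * h_i
            total += math.comb(r, j) * (-1) ** (r - j) * f(shifted)
        return total
    return g

def mixed_difference(f, r, e, h):
    """Compose the coordinate differences over i in e; e = () returns f itself."""
    g = f
    for i in e:
        g = coordinate_difference(g, r, i, h[i])
    return g
```

For $f(x) = x_1 x_2$ and $h = (0.1, 0.2)$, the mixed first-order difference over both coordinates equals $0.1 \times 0.2 = 0.02$ at every point, while for an additive function such as $f(x) = x_1^2 + x_2$ it vanishes identically.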

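Then, the mixed modulus of smoothness is defined as
$$w^e_{r,p}(f, t) := \sup_{|h_i| \le t_i,\, i \in e} \|\Delta_h^{r,e}(f)\|_p$$
for $t \in \mathbb{R}^d_+$ and $0 < p \le \infty$. Letting $0 < p, q \le \infty$, $\alpha \in \mathbb{R}^d_{++}$ and $r_i := \lfloor \alpha_i \rfloor + 1$, the semi-norm $|\cdot|_{MB^{\alpha,e}_{p,q}}$ based on the mixed smoothness is defined by
$$|f|_{MB^{\alpha,e}_{p,q}} := \begin{cases} \left\{ \int_\Omega \left[ \left( \prod_{i \in e} t_i^{-\alpha_i} \right) w^e_{r,p}(f, t) \right]^q \frac{dt}{\prod_{i \in e} t_i} \right\}^{1/q} & (0 < q < \infty), \\ \sup_{t \in \Omega} \left( \prod_{i \in e} t_i^{-\alpha_i} \right) w^e_{r,p}(f, t) & (q = \infty). \end{cases}$$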

By summing up the semi-norm over the choice of $e$, the (quasi-)norm of the mixed smooth Besov space (abbreviated to m-Besov space) is defined by
$$\|f\|_{MB^\alpha_{p,q}} := \|f\|_p + \sum_{e \subset \{1, \dots, d\}} |f|_{MB^{\alpha,e}_{p,q}},$$
and thus $MB^\alpha_{p,q}(\Omega) := \{f \in L^p(\Omega) \mid \|f\|_{MB^\alpha_{p,q}} < \infty\}$ where $0 < p, q \le \infty$ and $\alpha \in \mathbb{R}^d_{++}$. In this paper, we assume that $\alpha_1 = \dots = \alpha_d$. With a slight abuse of notation, we also use the notation $MB^\alpha_{p,q}$ for $\alpha > 0$ to indicate $MB^{(\alpha, \dots, \alpha)}_{p,q}$.

For $\alpha \in \mathbb{R}^d_+$, if $p = q$, the m-Besov space has an equivalent norm with the tensor product of the one-dimensional Besov spaces:
$$MB^\alpha_{p,p} = B^{\alpha_1}_{p,p} \otimes_{\delta_p} \cdots \otimes_{\delta_p} B^{\alpha_d}_{p,p},$$
where $\otimes_{\delta_p}$ is a tensor product with respect to the $p$-nuclear tensor norm (see Sickel & Ullrich (2009) for its definition and more details). We can see that the following models are included in the m-Besov space:

• Additive model (Meier et al., 2009): if $f_j \in B^{\alpha_j}_{p,q}([0,1])$ for $j = 1, \dots, d$,
$$f(x) = \sum_{j=1}^d f_j(x_j) \in MB^\alpha_{p,q}(\Omega).$$

• Tensor model (Signoretto et al., 2010): if $f_{r,j} \in B^{\alpha_j}_{p,q}([0,1])$ for $r = 1, \dots, R$ and $j = 1, \dots, d$,
$$f(x) = \sum_{r=1}^R \prod_{j=1}^d f_{r,j}(x_j) \in MB^\alpha_{p,q}(\Omega).$$
(The m-Besov space allows $R \to \infty$ if the summation converges with respect to the quasi-norm $\|\cdot\|_{MB^\alpha_{p,q}}$.)

It is known that an appropriate estimator in these models can avoid the curse of dimensionality (Meier et al., 2009; Raskutti et al., 2012b; Kanagawa et al., 2016; Suzuki et al., 2016). What we show in this paper suggests that this fact also applies to deep learning, from a unifying viewpoint.

The difference between the (normal) Besov space and the m-Besov space can be informally explained as follows. For the regularity condition $\alpha_i \le 2$ ($i = 1, 2$), the m-Besov space consists of functions for which the following derivatives are “bounded”:
$$\frac{\partial f}{\partial x_1},\ \frac{\partial f}{\partial x_2},\ \frac{\partial^2 f}{\partial x_1^2},\ \frac{\partial^2 f}{\partial x_2^2},\ \frac{\partial^2 f}{\partial x_1 \partial x_2},\ \frac{\partial^3 f}{\partial x_1 \partial x_2^2},\ \frac{\partial^3 f}{\partial x_1^2 \partial x_2},\ \frac{\partial^4 f}{\partial x_1^2 \partial x_2^2}.$$
That is, the “max” of the orders of derivatives over coordinates needs to be bounded by 2. On the other hand, the Besov space only ensures the boundedness of the following derivatives:
$$\frac{\partial f}{\partial x_1},\ \frac{\partial f}{\partial x_2},\ \frac{\partial^2 f}{\partial x_1^2},\ \frac{\partial^2 f}{\partial x_2^2},\ \frac{\partial^2 f}{\partial x_1 \partial x_2},$$
where the “sum” of the orders needs to be bounded by 2. This difference directly affects the rate of convergence of approximation accuracy. Further details about this space and related topics can be found in a comprehensive survey (Dũng et al., 2016).
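The “max” versus “sum” distinction can be enumerated directly. The Python sketch below lists the derivative multi-indices covered under each informal condition; the function names are hypothetical and the enumeration only mirrors the informal description above, not the formal definitions of the spaces.

```python
from itertools import product

def mixed_orders(d, bound):
    """Non-zero multi-indices whose maximum order per coordinate is at most `bound`."""
    return [a for a in product(range(bound + 1), repeat=d) if max(a) > 0]

def isotropic_orders(d, bound):
    """Non-zero multi-indices whose total order (sum) is at most `bound`."""
    return [a for a in product(range(bound + 1), repeat=d) if 0 < sum(a) <= bound]
```

For $d = 2$ and bound 2, `mixed_orders` returns the eight multi-indices of the first list and `isotropic_orders` the five of the second; in general the first count is $(\text{bound} + 1)^d - 1$, so the mixed condition controls many more derivatives as $d$ grows.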

Relation to Barron class. Barron (1991; 1993; 1994) showed that, if the Fourier transform of a function $f$ satisfies some integrability condition, then we may avoid the curse of dimensionality for estimating neural networks with sigmoidal activation functions. The integrability condition is given by
$$\int_{\mathbb{C}^d} \|\omega\| |\hat{f}(\omega)| \, d\omega < \infty,$$
where $\hat{f}$ is the Fourier transform of a function $f$. We call the class of functions satisfying this condition the Barron class. A similar function class is also analyzed by Klusowski & Barron (2016). We cannot directly compare the m-Besov space and the Barron class, but they are closely related. Indeed, if $p = q = 2$ and $s = \alpha_1 = \dots = \alpha_d$, then the m-Besov space $MB^s_{2,2}(\Omega)$ is equivalent to the tensor product of Sobolev spaces (Sickel & Ullrich, 2011), which consists of functions $f : \Omega \to \mathbb{R}$ satisfying
$$\int_{\mathbb{C}^d} \prod_{i=1}^d (1 + |\omega_i|^2)^s |\hat{f}(\omega)|^2 \, d\omega < \infty.$$
Therefore, our analysis gives a (similar but) different characterization of conditions to avoid the curse of dimensionality.


3 APPROXIMATION ERROR ANALYSIS

In this section, we evaluate how well the functions in the Besov and m-Besov spaces can be approximated by neural networks with the ReLU activation. Let us denote the ReLU activation by $\eta(x) = \max\{x, 0\}$ ($x \in \mathbb{R}$), and for a vector $x$, $\eta(x)$ is operated in an element-wise manner. Define the neural network with height $L$, width $W$, sparsity constraint $S$ and norm constraint $B$ as
$$\Phi(L, W, S, B) := \Big\{ (W^{(L)} \eta(\cdot) + b^{(L)}) \circ \cdots \circ (W^{(1)} x + b^{(1)}) \;\Big|\; W^{(\ell)} \in \mathbb{R}^{W \times W},\ b^{(\ell)} \in \mathbb{R}^W,\ \sum_{\ell=1}^L (\|W^{(\ell)}\|_0 + \|b^{(\ell)}\|_0) \le S,\ \max_\ell \|W^{(\ell)}\|_\infty \vee \|b^{(\ell)}\|_\infty \le B \Big\},$$
where $\|\cdot\|_0$ is the $\ell_0$-norm of the matrix (the number of non-zero elements of the matrix) and $\|\cdot\|_\infty$ is the $\ell_\infty$-norm of the matrix (maximum of the absolute values of the elements). We want to evaluate how large $L, W, S, B$ should be to approximate $f^o \in MB^\alpha_{p,q}(\Omega)$ by an element $f \in \Phi(L, W, S, B)$ with precision $\epsilon > 0$ measured by the $L^r$-norm: $\min_{f \in \Phi} \|f - f^o\|_r \le \epsilon$.
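The definition of $\Phi(L, W, S, B)$ can be read as a forward pass together with three constraints on the parameters. The following Python sketch illustrates this under the assumption that layers narrower than $W$ are allowed (they can be thought of as zero-padded to $W \times W$); the function names and the example network are hypothetical.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def forward(weights, biases, x):
    """Evaluate (W^(L) eta(.) + b^(L)) o ... o (W^(1) x + b^(1))."""
    z = weights[0] @ x + biases[0]
    for W_l, b_l in zip(weights[1:], biases[1:]):
        z = W_l @ relu(z) + b_l
    return z

def in_class(weights, biases, L, W, S, B):
    """Check the height, width, sparsity, and norm constraints of Phi(L, W, S, B)."""
    if len(weights) != L or len(biases) != L:
        return False
    if any(max(M.shape) > W for M in weights):
        return False
    nonzeros = sum(np.count_nonzero(M) + np.count_nonzero(b)
                   for M, b in zip(weights, biases))
    largest = max(max(np.abs(M).max(), np.abs(b).max())
                  for M, b in zip(weights, biases))
    return bool(nonzeros <= S and largest <= B)

# Example: a two-layer network computing the hat function
# eta(x) - 2 eta(x - 1) + eta(x - 2) on the real line.
hat_weights = [np.array([[1.0], [1.0], [1.0]]), np.array([[1.0, -2.0, 1.0]])]
hat_biases = [np.array([0.0, -1.0, -2.0]), np.array([0.0])]
```

This example network has 8 non-zero parameters and largest absolute parameter value 2, so it belongs to $\Phi(2, 3, 8, 2)$ but not to $\Phi(2, 3, 7, 2)$.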

3.1 APPROXIMATION ERROR ANALYSIS FOR BESOV SPACES

Here, we show how the neural network can approximate a function in the Besov space, which is useful to derive the generalization error of deep learning. Although its derivation is rather standard as considered in Chui et al. (1994); Bölcskei et al. (2017), it is worth noting that the bound derived here cannot be attained by any non-adaptive method, and the generalization error based on the analysis is also unattainable by any linear estimator including the kernel ridge regression. That explains the high adaptivity of deep neural networks and how they outperform usual linear methods such as kernel methods.

To show the approximation accuracy, a key step is to show that the ReLU neural network can approximate the cardinal B-spline with high accuracy. Let $\mathcal{N}(x) = 1$ ($x \in [0,1]$), $0$ (otherwise), then the cardinal B-spline of order $m$ is defined by taking the $(m+1)$-times convolution of $\mathcal{N}$:
$$\mathcal{N}_m(x) = (\underbrace{\mathcal{N} * \mathcal{N} * \cdots * \mathcal{N}}_{m+1 \text{ times}})(x),$$
where $f * g(x) := \int f(x - t) g(t) \, dt$. It is known that $\mathcal{N}_m$ is a piece-wise polynomial of order $m$.
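For numerical experiments, the cardinal B-spline can be evaluated without computing convolutions. The Python sketch below uses the truncated-power representation $\mathcal{N}_m(x) = \frac{1}{m!} \sum_{j=0}^{m+1} (-1)^j \binom{m+1}{j} (x - j)_+^m$, a standard identity for cardinal B-splines that is assumed here rather than derived in this section; the function name is hypothetical and the code is restricted to $m \ge 1$.

```python
import math
import numpy as np

def cardinal_bspline(x, m):
    """Cardinal B-spline N_m(x), supported on [0, m + 1], for m >= 1.

    Uses the truncated-power formula
    N_m(x) = (1/m!) * sum_j (-1)^j C(m+1, j) (x - j)_+^m.
    """
    x = np.asarray(x, dtype=float)
    total = np.zeros_like(x)
    for j in range(m + 2):
        total += (-1) ** j * math.comb(m + 1, j) * np.maximum(x - j, 0.0) ** m
    return total / math.factorial(m)
```

For $m = 1$ the formula reads $\eta(x) - 2\eta(x-1) + \eta(x-2)$, so the hat function is exactly a small ReLU network. For $m \ge 2$ the truncated powers are no longer ReLU units, which is why an approximation result for the ReLU activation is needed.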

For $k = (k_1, \dots, k_d) \in \mathbb{N}^d$ and $j = (j_1, \dots, j_d) \in \mathbb{N}^d$, let $M^d_{k,j}(x) = \prod_{i=1}^d \mathcal{N}_m(2^{k_i} x_i - j_i)$. Even for $k \in \mathbb{N}$, we also use the same notation to express $M^d_{k,j}(x) = \prod_{i=1}^d \mathcal{N}_m(2^k x_i - j_i)$. Here, $k$ controls the spatial “resolution” and $j$ specifies the location on which the basis is put. Basically, we approximate a function $f$ in a Besov space by a superposition of $M^d_{k,j}(x)$, which is closely related to wavelet analysis (Mallat, 1999).

Mhaskar & Micchelli (1992); Chui et al. (1994) have shown the approximation ability of neural networks for a function with bounded modulus of smoothness. However, the activation functions dealt with by their analysis do not include ReLU; instead, the analysis deals with a class of activation functions satisfying the following conditions,
$$\lim_{x \to \infty} \eta(x)/x^k \to 1, \quad \lim_{x \to -\infty} \eta(x)/x^k = 0, \quad \exists K > 1 \text{ s.t. } |\eta(x)| \le K(1 + |x|)^k \ (x \in \mathbb{R}), \tag{3}$$
for $k = 2$ which excludes ReLU. Mhaskar (1993) analyzed deep neural networks under the same setting but it restricts the smoothness parameter to $s = k + 1$. Mhaskar (1996) considered the Sobolev space $W^m_p$ with an infinitely differentiable “bump” function which also excludes ReLU. However, approximating the cardinal B-spline by ReLU can be attained by appropriately using the technique developed by Yarotsky (2016) as in the following lemma.

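Lemma 1 (Approximation of cardinal B-spline basis by the ReLU activation). There exists a constant $c_{(d,m)}$ depending only on $d$ and $m$ such that, for all $\epsilon > 0$, there exists a neural network $\check{M} \in \Phi(L_0, W_0, S_0, B_0)$ with $L_0 := 3 + 2 \left\lceil \log_2\left( \frac{3^{d \vee m}}{\epsilon c_{(d,m)}} \right) + 5 \right\rceil \lceil \log_2(d \vee m) \rceil$, $W_0 := 6dm(m+2) + 2d$, $S_0 := L_0 W_0^2$ and $B_0 := 2(m+1)^m$ that satisfies
$$\|M^d_{0,0} - \check{M}\|_{L^\infty(\mathbb{R}^d)} \le \epsilon,$$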

The proof is in Appendix A. Based on this lemma, we can translate several B-spline approximation results into those of deep neural network approximation. In particular, combining this lemma and the B-spline interpolant representations of functions in Besov spaces (DeVore & Popov, 1988; DeVore et al., 1993; Dũng, 2011b), we obtain the optimal approximation error bound for deep neural networks. Here, let $U(\mathcal{H})$ be the unit ball of a quasi-Banach space $\mathcal{H}$, and for a set $\mathcal{F}$ of functions, define the worst case approximation error as
$$R_r(\mathcal{F}, \mathcal{H}) := \sup_{f^o \in U(\mathcal{H})} \inf_{f \in \mathcal{F}} \|f^o - f\|_{L^r([0,1]^d)}.$$

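Proposition 1 (Approximation ability for Besov space). Suppose that $0 < p, q, r \le \infty$ and $0 < s < \infty$ satisfy the following condition:
$$s > d(1/p - 1/r)_+. \tag{4}$$
Assume that $m \in \mathbb{N}$ satisfies $0 < s < \min(m, m - 1 + 1/p)$. Let $\nu = (s - \delta)/(2\delta)$. For sufficiently large $N \in \mathbb{N}$ and $\epsilon = N^{-s/d - (\nu^{-1} + d^{-1})(d/p - s)_+} \log(N)^{-1}$, let
$$L = 3 + 2 \left\lceil \log_2\left( \frac{3^{d \vee m}}{\epsilon c_{(d,m)}} \right) + 5 \right\rceil \lceil \log_2(d \vee m) \rceil, \quad W = N W_0, \quad S = (L - 1) W_0^2 N + N, \quad B = O(N^{(\nu^{-1} + d^{-1})(1 \vee (d/p - s)_+)}),$$
then it holds that
$$R_r(\Phi(L, W, S, B), B^s_{p,q}([0,1]^d)) \lesssim N^{-s/d}.$$

Remark 1. By Eq. (1), the condition (4) indicates that $f^o \in B^s_{p,q}$ satisfies $f^o \in L^r(\Omega)$. If we set $p = q = \infty$ and $r = \infty$, then $B^s_{p,q}(\Omega) = C^s(\Omega)$ which yields the result by Yarotsky (2016) as a special case.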

The proof is in Appendix B. An interesting point is that the statement is valid even for $p \ne r$. In particular, the proposition also supports the non-continuous regime ($s < d/p$) in which $L^\infty$-convergence no longer holds but instead $L^r$-convergence is guaranteed under the condition $s > d(1/p - 1/r)_+$. In that sense, the convergence of the approximation error is guaranteed in considerably general settings. Pinkus (1999) gave an explicit form of convergence when $1 \le p = r$ for the activation functions satisfying Eq. (3), which does not cover ReLU and the important setting $p \ne r$. Petrushev (1998) considered $p = r = 2$ and activation functions satisfying Eq. (3) where $s$ is an integer and $s \le k + 1 + (d-1)/2$. Chui et al. (1994) and Bölcskei et al. (2017) dealt with the smooth sigmoidal activation satisfying the condition (3) with $k \ge 2$ or a “smoothed version” of the ReLU activation which excludes ReLU; but Bölcskei et al. (2017) presented a general strategy for neural-net approximation by using the notion of best $M$-term approximation. Mhaskar & Micchelli (1992) gives an approximation bound using the modulus of smoothness, but the smoothness $s$ and the order $k$ of the sigmoidal function in (3) are tightly connected and $f^o$ is assumed to be continuous, which excludes the situation $s < d/p$. On the other hand, the above proposition does not require such a tight connection and it explicitly gives the approximation bound for Besov spaces. Williamson & Bartlett (1992) derived a spline approximation error bound for an element in a Besov space when $d = 1$, but the derived bound is only $O(N^{-s + (1/p - 1/r)_+})$, which is that of the non-adaptive methods described below, and approximation by a ReLU activation network is not discussed. We may also use the analysis of Cohen et al. (2001) which is based on compactly supported wavelet bases, but the cardinal B-spline is easy to handle through the quasi-interpolant representation as performed in the proof of Proposition 1.

It should be noted that the presented approximation accuracy bound is not trivial because it cannot be achieved by a non-adaptive method. Actually, the linear $N$-width (Tikhomirov, 1960) of the Besov space is lower bounded as
$$\inf_{L_N} \sup_{f \in U(B^s_{p,q})} \|f - L_N(f)\|_r \gtrsim \begin{cases} N^{-s/d + (1/p - 1/r)_+} & \text{either } (0 < p \le r \le 2), \text{ or } (2 \le p \le r \le \infty), \text{ or } (0 < r \le p \le \infty), \\ N^{-s/d + 1/p - 1/2} & (0 < p < 2 < r < \infty,\ s > d \max(1 - 1/r, 1/p)), \end{cases} \tag{5}$$
where the infimum is taken over all linear operators $L_N$ with rank $N$ from $B^s_{p,q}$ to $L^r$ (see Vybíral (2008) for more details). Similarly, the best $N$-term approximation error (Kolmogorov width) of the Besov space is lower bounded as
$$\inf_{S_N \subset B^s_{p,q}} \sup_{f \in U(B^s_{p,q})} \inf_{\check{f} \in S_N} \|f - \check{f}\|_{L^r(\Omega)} \gtrsim \begin{cases} N^{-s/d + (1/p - 1/r)_+} & (1 < p < r \le 2,\ s > d(1/p - 1/r)), \\ N^{-s/d + 1/p - 1/2} & (1 < p < 2 < r \le \infty,\ s > d/p), \\ N^{-s/d} & (2 \le p < r \le \infty,\ s > d/2), \end{cases} \tag{6}$$
if $1 < p < r \le \infty$, $1 \le q < \infty$ and $1 < s$, where $S_N$ is any $N$-dimensional subspace of $B^s_{p,q}$ (see Vybíral (2008), and see also Romanyuk (2009); Myronyuk (2016) for a related space). That is, any linear/non-linear approximator with fixed $N$-bases does not achieve the approximation error $N^{-s/d}$ in some parameter settings such as $0 < p < 2 < r$. On the other hand, adaptive methods including deep learning can improve the error rate up to $N^{-s/d}$, which is rate optimal (Dũng, 2011b). The difference is significant when $p < r$. This implies that deep neural networks possess high adaptivity to find which part of the function should be intensively approximated. In other words, deep neural networks can properly extract the features of the input (which corresponds to constructing an appropriate set of bases) to approximate the target function in the most efficient way.
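The gap between the adaptive rate $N^{-s/d}$ and the lower bound (5) for linear methods can be tabulated with a few lines of Python. The sketch below only transcribes the exponents of Eq. (5) and Proposition 1; the function names are hypothetical, and parameter regimes not covered by Eq. (5) raise an error rather than being guessed.

```python
def plus(x):
    return max(x, 0.0)

def adaptive_exponent(s, d):
    """Exponent a in the rate N^{-a} of Proposition 1."""
    return s / d

def linear_width_exponent(s, d, p, r):
    """Exponent a such that the lower bound (5) on the linear N-width is N^{-a}."""
    inf = float("inf")
    if p <= r <= 2 or 2 <= p <= r or r <= p:
        return s / d - plus(1.0 / p - 1.0 / r)
    if p < 2 < r < inf and s > d * max(1.0 - 1.0 / r, 1.0 / p):
        return s / d - 1.0 / p + 0.5
    raise ValueError("parameter regime not covered by Eq. (5)")
```

For instance, with $s = 2$, $d = 1$, $p = 1$ and $r = 2$ the linear exponent is $1.5$ against the adaptive exponent $2$, while for $p = r = 2$ both exponents coincide, in line with the remark that the difference matters when $p < r$.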

3.2 APPROXIMATION ERROR ANALYSIS FOR M-BESOV SPACE

Here, we deal with m-Besov spaces instead of the ordinary Besov space. The next theorem gives the approximation error bound to approximate functions in the m-Besov spaces by deep neural network models. Here, define $D_{k,d} := \left(1 + \frac{d-1}{k}\right)^k \left(1 + \frac{k}{d-1}\right)^{d-1}$. Then, we have the following theorem.

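Theorem 1 (Approximation ability for m-Besov space). Suppose that $0 < p, q, r \le \infty$ and $s < \infty$ satisfy $s > (1/p - 1/r)_+$. Assume that $m \in \mathbb{N}$ satisfies $0 < s < \min(m, m - 1 + 1/p)$. Let $\delta = (1/p - 1/r)_+$ and $\nu = (s - \delta)/(2\delta)$. For any $K \ge 1$, let $K^* = \lceil K(1 + \frac{2\delta}{\alpha - \delta}) \rceil$. Then, for $N = (2 + (1 - 2^{-\nu})^{-1}) 2^K D_{K^*,d}$, if we set
$$L = 3 + 2 \left\lceil \log_2\left( \frac{3^{d \vee m}}{c_{(d,m)}} \right) + 5 + (s + (1/p - s)_+ + 1) K^* + \log([e(m+1)]^d (1 + K^*)) \right\rceil \lceil \log_2(d \vee m) \rceil,$$
$$W = N W_0, \quad S = (L - 1) N W_0^2 + N, \quad B = O(N^{(\nu^{-1} + 1)(1 \vee (1/p - s)_+)}),$$
then it holds that

(i) For $p \ge r$,
$$R_r(\Phi(L, W, S, B), MB^s_{p,q}([0,1]^d)) \lesssim 2^{-Ks} D_{K,d}^{(1/\min(r,1) - 1/q)_+}, \tag{7a}$$

(ii) For $p < r$,
$$R_r(\Phi(L, W, S, B), MB^s_{p,q}([0,1]^d)) \lesssim \begin{cases} 2^{-Ks} D_{K,d}^{(1/r - 1/q)_+} & (r < \infty), \\ 2^{-Ks} D_{K,d}^{(1 - 1/q)_+} & (r = \infty). \end{cases} \tag{7b}$$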

The proof is given in Appendix C. Now, the number $S$ of non-zero parameters for a given $K$ is evaluated as $S = \Omega(N) \simeq 2^K D_{K,d}$ in this theorem. It holds that $N \simeq 2^K K^{(d-1)}$, which implies $2^{-K} \simeq N^{-1} \log^{d-1}(N)$ if $N \gg d$ (see also the discussion right after Theorem 5 in Appendix D.1 for more details of the calculation). Therefore, when $r \gg q$, the approximation error is given as $O(N^{-s} \log^{s(d-1)}(N))$, in which the effect of the dimensionality $d$ is much milder than that of Proposition 1. This means that the curse of dimensionality is much eased in the mixed smooth space.

The obtained bound is far from obvious. Actually, it is better than that of any linear approximation method, as follows. Let the linear $N$-width introduced by Tikhomirov (1960) be
$$\lambda_N(MB^s_{p,q}, L^r) := \inf_{L_N} \sup_{f \in U(MB^s_{p,q})} \|f - L_N(f)\|_r,$$
where the infimum is taken over all linear operators $L_N$ with rank $N$ from $MB^s_{p,q}$ to $L^r$. The linear $N$-width of the m-Besov space has been extensively studied, as in the following proposition (see Lemma 5.1 of Dũng (2011a), and Romanyuk (2001)).

Proposition 2. Let $1 \le p, r \le \infty$, $0 < q \le \infty$ and $s > (1/p - 1/r)_+$. Then we have the following asymptotic order of the linear width for the asymptotics $N \gg d$:

(a) For $p \ge r$,
$$\lambda_N(MB^s_{p,q}, L^r) \simeq \begin{cases} (N^{-1} \log^{d-1}(N))^s & (q \le 2 \le r \le p < \infty),\ (q \le 1,\ p = r = \infty),\ (1 < p = r \le 2,\ q \le r), \\ (N^{-1} \log^{d-1}(N))^s (\log^{d-1}(N))^{1/r - 1/q} & (1 < p = r \le 2,\ q > r), \\ (N^{-1} \log^{d-1}(N))^s (\log^{d-1}(N))^{(1/2 - 1/q)_+} & (2 \le q,\ 1 < r < 2 \le p < \infty), \end{cases}$$

(b) For $1 < p < r < \infty$,
$$\lambda_N(MB^s_{p,q}, L^r) \simeq \begin{cases} (N^{-1} \log^{d-1}(N))^{s + 1/r - 1/p} & (2 \le p,\ 2 \le q \le r), \\ (N^{-1} \log^{d-1}(N))^{s + 1/r - 1/p} (\log^{d-1}(N))^{(1/r - 1/q)_+} & (r \le 2). \end{cases}$$

Therefore, the approximation error given in Theorem 1 achieves the optimal linear width ($(N^{-1} \log^{d-1}(N))^s$) for several parameter settings of $p, q, s$. In particular, when $p < r$, the bound in Theorem 1 is better than that of Proposition 2. This is because, to prove Theorem 1, we used an adaptive recovery technique instead of a linear recovery method. This implies that, by constructing a deep neural network accurately, we achieve the same approximation accuracy as the adaptive one which is better than that of linear approximation.
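The quantities appearing in this subsection can be evaluated numerically to compare the two approximation rates. In the Python sketch below, the function names are hypothetical, all constants are dropped, and the m-Besov rate is taken in the form $(N^{-1} \log^{d-1}(N))^s$; the comparison is therefore only indicative of the asymptotic behavior.

```python
import math

def D(k, d):
    """D_{k,d} = (1 + (d-1)/k)^k * (1 + k/(d-1))^(d-1), for d >= 2."""
    return (1 + (d - 1) / k) ** k * (1 + k / (d - 1)) ** (d - 1)

def besov_rate(N, s, d):
    """Rate N^{-s/d} of Proposition 1 (constants ignored)."""
    return N ** (-s / d)

def m_besov_rate(N, s, d):
    """Rate (N^{-1} log^{d-1}(N))^s for the m-Besov space (constants ignored)."""
    return (math.log(N) ** (d - 1) / N) ** s
```

For $s = 2$, $d = 3$ and $N = 10^6$, `m_besov_rate` is several orders of magnitude smaller than `besov_rate`. Since the statement is asymptotic ($N \gg d$), the logarithmic factor can still dominate for moderate $N$ when $d$ is large.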

4 ESTIMATION ERROR ANALYSIS

In this section, we connect the approximation theory to generalization error analysis (estimation error analysis). For the statistical analysis, we assume the following nonparametric regression model:
$$y_i = f^o(x_i) + \xi_i \quad (i = 1, \dots, n),$$
where $x_i \sim P_X$ with density $0 \le p(x) < R$ on $[0,1]^d$, and $\xi_i \sim N(0, \sigma^2)$. The data $D_n = (x_i, y_i)_{i=1}^n$ is independently identically distributed. We want to estimate $f^o$ from the data. Here, we consider a regularized learning procedure:
$$\hat{f} = \mathop{\mathrm{argmin}}_{\bar{f} :\, f \in \Phi(L, W, S, B)} \sum_{i=1}^n (y_i - \bar{f}(x_i))^2,$$
where $\bar{f}$ is the clipping of $f$ defined by $\bar{f} = \min\{\max\{f, -F\}, F\}$ for $F > 0$, which is realized by ReLU units. Since the sparsity level is controlled by $S$ and the parameter is bounded by $B$, this estimator can be regarded as a regularized estimator. In practice, it is hard to exactly compute $\hat{f}$. Thus, we approximately solve the problem by applying sparse regularization such as $L_1$-regularization and optimal parameter search through Bayesian optimization. The generalization error that we present here is an “ideal” bound which is valid if the optimal solution $\hat{f}$ is computable.
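One way in which the clipping $\bar{f} = \min\{\max\{f, -F\}, F\}$ can be realized by ReLU units is the identity $\bar{f} = \eta(f + F) - \eta(f - F) - F$. The Python sketch below checks this identity numerically; the particular construction and the function names are illustrative assumptions, since the text only states that a ReLU realization exists.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def clip_with_relu(f_values, F):
    """Clip to [-F, F] using two ReLU units: eta(f + F) - eta(f - F) - F."""
    return relu(f_values + F) - relu(f_values - F) - F
```

The three cases $f < -F$, $-F \le f \le F$ and $f > F$ give $-F$, $f$ and $F$ respectively, so the output agrees with `np.clip(f_values, -F, F)`.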

4.1 ESTIMATION ERROR IN BESOV SPACES

In this subsection, we provide the estimation error rate of deep learning to estimate functions in Besov spaces by using the approximation error bound given in the previous sections.

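Theorem 2. Suppose that $0 < p, q \le \infty$ and $s > d(1/p - 1/2)_+$. If $f^o \in B^s_{p,q}(\Omega) \cap L^\infty(\Omega)$, $\|f^o\|_{B^s_{p,q}} \le 1$ and $\|f^o\|_\infty \le F$ for $F \ge 1$, then letting $(W, L, S, B)$ be as in Proposition 1 with $N \asymp n^{\frac{d}{2s+d}}$, we obtain
$$E_{D_n}[\|f^o - \hat{f}\|^2_{L^2(P_X)}] \lesssim n^{-\frac{2s}{2s+d}} \log(n)^2,$$
where $E_{D_n}[\cdot]$ indicates the expectation w.r.t. the training data $D_n$.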
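The choice $N \asymp n^{d/(2s+d)}$ can be motivated by a heuristic balance between approximation and estimation errors. The calculation below assumes that the estimation (variance) part of the bound scales like $N/n$ up to poly-log factors, which is the typical form for a model with $O(N)$ non-zero parameters; it is only an informal sketch, and the rigorous argument is the one in Appendix E.

```latex
\underbrace{N^{-2s/d}}_{\text{squared approximation error (Proposition 1)}}
\;\asymp\;
\underbrace{\frac{N}{n}}_{\text{assumed variance term}}
\quad\Longleftrightarrow\quad
N^{(2s+d)/d} \asymp n
\quad\Longleftrightarrow\quad
N \asymp n^{d/(2s+d)},
\qquad\text{so that}\qquad
N^{-2s/d} \asymp n^{-2s/(2s+d)}.
```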

The proof is given in Appendix E. The condition $\|f^o\|_\infty \le F$ is required to connect the empirical $L^2$-norm $\frac{1}{n} \sum_{i=1}^n (\hat{f}(x_i) - f^o(x_i))^2$ to the population $L^2$-norm $\|\hat{f} - f^o\|^2_{L^2(P_X)}$. It is known that the rate $n^{-2s/(2s+d)}$ cannot be improved by any estimator. Therefore, deep learning can achieve the minimax optimal rate up to $\log(n)^2$-order. The term $\log(n)^2$ could be improved to $\log(n)$ by using the construction of Petersen & Voigtlaender (2017). However, we do not pursue this direction for simplicity.

Here an important remark is that this minimax optimal rate cannot be achieved by any linear estimator. We call an estimator linear when the estimator depends on $(y_i)_{i=1}^n$ linearly (it can depend non-linearly on $(x_i)_{i=1}^n$). Several classical methods such as kernel ridge regression, the Nadaraya-Watson estimator and the sieve estimator are included in the class of linear estimators (e.g., kernel ridge regression is given as $\hat{f}(x) = k_{x,X}(k_{XX} + \lambda I)^{-1} Y$). The following proposition given by Donoho et al. (1998); Zhang et al. (2002) states that the minimax rate of linear estimators is lower bounded by $n^{-\frac{2s - 2(1/p - 1/2)_+}{2s + 1 - 2(1/p - 1/2)_+}}$, which is larger than the minimax rate $n^{-\frac{2s}{2s+1}}$ if $p < 2$.

Proposition 3 (Donoho et al. (1998); Zhang et al. (2002)). Suppose that $d = 1$ and the input distribution $P_X$ is the uniform distribution on $[0,1]$. Assume that $s > 1/p$, $1 \le p, q \le \infty$ or $s = p = q = 1$. Then,

$$\inf_{\hat{f}:\,\text{linear}} \; \sup_{f^o \in U(B^s_{p,q})} E_{D_n}\big[\|f^o - \hat{f}\|^2_{L^2(P_X)}\big] \gtrsim n^{-\frac{2s - v}{2s + 1 - v}},$$

where $v = 2/(p \wedge 2) - 1$ and $\hat{f}$ runs over all linear estimators, that is, $\hat{f}$ depends on $(y_i)_{i=1}^n$ linearly.
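The notion of a linear estimator can be checked numerically. The following is a minimal sketch, not taken from the paper, that uses the Nadaraya-Watson estimator with a Gaussian kernel (the kernel, the bandwidth and all function names are assumptions made for illustration) and verifies that the prediction for the responses $a y^{(1)} + b y^{(2)}$ equals the same linear combination of the individual predictions.

```python
import math
import random


def nadaraya_watson(x_train, y_train, x, bandwidth=0.1):
    """Nadaraya-Watson estimate at x; the Gaussian kernel is an illustrative assumption."""
    weights = [math.exp(-((x - xi) ** 2) / (2.0 * bandwidth ** 2)) for xi in x_train]
    total = sum(weights)
    return sum(w * yi for w, yi in zip(weights, y_train)) / total


random.seed(0)
n = 50
x_train = [random.random() for _ in range(n)]
y1 = [random.gauss(0.0, 1.0) for _ in range(n)]
y2 = [random.gauss(0.0, 1.0) for _ in range(n)]
a, b = 2.0, -3.0
y_mix = [a * u + b * v for u, v in zip(y1, y2)]

# Linearity in y: f_hat[a*y1 + b*y2] should equal a*f_hat[y1] + b*f_hat[y2].
x_test = [0.1 * i for i in range(11)]
max_gap = max(
    abs(
        nadaraya_watson(x_train, y_mix, x)
        - (a * nadaraya_watson(x_train, y1, x) + b * nadaraya_watson(x_train, y2, x))
    )
    for x in x_test
)
```

The weights depend on the inputs $x_i$ in a non-linear way, yet the estimate is a fixed weighted sum of the $y_i$, which is exactly the property used in Proposition 3.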

When $p < 2$, the smoothness of the Besov space is somewhat inhomogeneous, that is, a function in the Besov space contains spiky/jump parts and smooth parts (remember that when $s = p = q = 1$ for $d = 1$, the Besov space is included in the set of functions with bounded total variation). Here, the setting $p < 2$ is the regime where a difference appears between non-adaptive methods and deep learning in terms of approximation accuracy (see Eq. (6)). On the other hand, the linear estimator captures only global properties of the function and cannot capture variability of local shapes of the function. Hence, the linear estimator cannot achieve the minimax optimal rate if the function has spatially inhomogeneous smoothness. However, deep learning possesses adaptivity to the spatial inhomogeneity.
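To make the gap concrete, the two exponents can be evaluated directly. The sketch below (function names are ours, and logarithmic factors are ignored) computes the exponent $2s/(2s+d)$ of Theorem 2 and the exponent $(2s - v)/(2s + 1 - v)$ of Proposition 3 for $d = 1$.

```python
def deep_rate_exponent(s, d=1):
    """Exponent a in the rate n^{-a} of Theorem 2 (log factors ignored)."""
    return 2.0 * s / (2.0 * s + d)


def linear_rate_exponent(s, p):
    """Exponent in the lower bound of Proposition 3 for linear estimators (d = 1)."""
    v = 2.0 / min(p, 2.0) - 1.0
    return (2.0 * s - v) / (2.0 * s + 1.0 - v)


# Bounded-variation-like case s = p = q = 1.
deep = deep_rate_exponent(1.0)          # 2/3
linear = linear_rate_exponent(1.0, 1.0)  # v = 1, so (2 - 1)/(2 + 1 - 1) = 1/2

# For p >= 2 we have v = 0 and the two exponents coincide.
same_for_p2 = abs(linear_rate_exponent(1.0, 2.0) - deep_rate_exponent(1.0)) < 1e-12
```

For $s = p = q = 1$ this gives $n^{-2/3}$ for deep learning against $n^{-1/2}$ for any linear estimator, while for $p \ge 2$ no gap remains.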

We would like to remark that the shrinkage estimator proposed in Donoho et al. (1998); Zhang et al. (2002) achieves the minimax optimal rate for $s > 1/p$ with $d = 1$ and $1 \le p, q \le \infty$, which excludes an interesting setting such as $s = p = q = 1$. In contrast, the result of Theorem 2 also covers more general settings where $d \ge 1$ and $s > d(1/p - 1/2)_+$ with $0 < p, q \le \infty$.

Imaizumi & Fukumizu (2018) pointed out that such a discrepancy between deep learning and linear estimators appears when the target function is non-smooth. Interestingly, the parameter setting $s > 1/p$ assumed in Proposition 3 ensures smoothness (see Eq. (2)). This means that non-smoothness is not necessarily required to characterize the superiority of deep learning, but non-convexity of the set of target functions is essentially important. In fact, the gap comes from the property that the quadratic hull of the model $U(B^s_{p,q})$ is strictly larger than the original set (Donoho et al., 1998).

4.2 ESTIMATION ERROR IN MIXED SMOOTH BESOV SPACES

Here, we provide the estimation error rate of deep learning to estimate functions in mixed smooth Besov spaces.

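Theorem 3. Suppose that $0 < p, q \le \infty$ and $s > (1/p - 1/2)_+$. Let $u = (1 - 1/q)_+$ for $p \ge 2$ and $u = (1/2 - 1/q)_+$ for $p < 2$. If $f^o \in MB^s_{p,q}(\Omega) \cap L^\infty(\Omega)$, $\|f^o\|_{MB^s_{p,q}} \le 1$ and $\|f^o\|_\infty \le F$ for $F \ge 1$, then letting $(W, L, S, B)$ be as in Theorem 1, we obtain

$$E_{D_n}\big[\|f^o - \hat{f}\|^2_{L^2(P_X)}\big] \lesssim n^{-\frac{2s}{2s+1}} \log(n)^{\frac{2(d-1)(u+s)}{1+2s}} \log(n)^2.$$

Under the same assumption, if $s > u \log_2(e)$ is additionally satisfied, we also have

$$E_{D_n}\big[\|f^o - \hat{f}\|^2_{L^2(P_X)}\big] \lesssim n^{-\frac{2s - 2u\log_2(e)}{2s + 1 + (1 - 2u)\log_2(e)}} \log(n)^2.$$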

The proof is given in Appendix E. The risk bound (Theorem 3) indicates that the curse of dimensionality can be eased by assuming the mixed smoothness compared with the ordinary Besov space ($n^{-\frac{2s}{2s+d}}$). We show that this is almost minimax optimal in Theorem 4 below. In the first bound, the dimensionality $d$ comes in the exponent of the $\mathrm{poly}\log(n)$ term. If $u = 0$, then the effect of $d$ can be further eased. Actually, in this situation ($u = 0$), the second bound can be rewritten as

$$n^{-\frac{2s}{2s + 1 + \log_2(e)}} \log(n)^2,$$

where the effect of the dimensionality $d$ completely disappears from the exponent. This explains partially why deep learning performs well for high dimensional data. Montanelli & Du (2017) analyzed the mixed smooth Hölder space with $s < 2$. However, our analysis is applicable to the m-Besov space, which is more general than the mixed smooth Hölder space, and the covered range of $s, p, q$ is much larger.
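The effect of the dimension on the rates can be tabulated with a few lines of code. This is a minimal sketch under the assumption that the second bound has the exponent $(2s - 2u\log_2 e)/(2s + 1 + (1 - 2u)\log_2 e)$, which reduces to the $u = 0$ expression above; the function names are ours.

```python
import math

LOG2_E = math.log2(math.e)


def u_param(p, q):
    """u from Theorem 3: (1 - 1/q)_+ if p >= 2, else (1/2 - 1/q)_+."""
    inv_q = 0.0 if math.isinf(q) else 1.0 / q
    base = 1.0 if p >= 2 else 0.5
    return max(base - inv_q, 0.0)


def besov_exponent(s, d):
    """Power of n in the ordinary Besov rate n^{-2s/(2s+d)}."""
    return 2.0 * s / (2.0 * s + d)


def mixed_first_bound(s, d, p, q):
    """(power of n, power of log n) in the first bound, including the extra log(n)^2."""
    u = u_param(p, q)
    return 2.0 * s / (2.0 * s + 1.0), 2.0 * (d - 1) * (u + s) / (1.0 + 2.0 * s) + 2.0


def mixed_second_exponent(s, p, q):
    """Power of n in the second bound; only valid when s > u * log2(e)."""
    u = u_param(p, q)
    if s <= u * LOG2_E:
        raise ValueError("second bound requires s > u * log2(e)")
    return (2.0 * s - 2.0 * u * LOG2_E) / (2.0 * s + 1.0 + (1.0 - 2.0 * u) * LOG2_E)


s, d, p, q = 2.0, 10, 1.0, 1.0
u = u_param(p, q)                     # 0 for p < 2 and q <= 2
n_pow, log_pow = mixed_first_bound(s, d, p, q)
second = mixed_second_exponent(s, p, q)
ordinary = besov_exponent(s, d)
```

With $s = 2$ and $d = 10$ the ordinary Besov exponent is $4/14$, whereas both mixed smooth exponents are independent of $d$; only the power of $\log(n)$ in the first bound grows with $d$.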

Here, we again remark on the adaptivity of deep learning. Recall that this rate cannot be achieved by the linear estimator for $p < 2$ when $d = 1$ by Proposition 3. This explains the adaptivity of deep learning to the spatial inhomogeneity of the smoothness.

Minimax optimal rate for estimating a function in the m-Besov space. Here, we show the minimax optimality of the obtained bound as follows.

Theorem 4. Assume that $0 < p, q \le \infty$, $s > (1/p - 1/2)_+$ and $P_X$ is the uniform distribution over $[0,1]^d$. Regarding $d$ as a constant, the minimax learning rate in the asymptotics of $n \to \infty$ is lower bounded as follows: there exists a constant $\hat{C}_1$ such that

$$\inf_{\hat{f}} \; \sup_{f^o \in U(MB^s_{p,q})} E_{D_n}\big[\|\hat{f} - f^o\|^2_{L^2(P_X)}\big] \ge \hat{C}_1\, n^{-\frac{2s}{2s+1}} \log(n)^{\frac{2(d-1)(s + 1/2 - 1/q)_+}{2s+1}}, \qquad (8)$$

where “inf” is taken over all measurable functions of the observations $(x_i, y_i)_{i=1}^n$ and the expectation is taken for the sample distribution.

The proof is given in Appendix F. Because of this theorem, our bound given in Theorem 3 achieves the minimax optimal rate in the regime of $p < 2$ and $1/2 - 1/q > 0$ up to $\log(n)^2$ order. Even outside of this parameter setting, the discrepancy between our upper bound and the minimax lower bound is just of poly-log order. See also Neumann (2000) for some other related spaces and specific examples such as $p = q = 2$.
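The matching of the logarithmic powers can be verified mechanically. The following sketch (our own helper names; the extra $\log(n)^2$ factor of the upper bound is left out) compares the power $2(d-1)(u+s)/(1+2s)$ from Theorem 3 with the power $2(d-1)(s + 1/2 - 1/q)_+/(2s+1)$ from Theorem 4.

```python
def u_param(p, q):
    """u from Theorem 3: (1 - 1/q)_+ if p >= 2, else (1/2 - 1/q)_+ (finite q assumed)."""
    base = 1.0 if p >= 2 else 0.5
    return max(base - 1.0 / q, 0.0)


def upper_log_power(s, d, p, q):
    """Power of log(n) in the first bound of Theorem 3, without the extra log(n)^2."""
    return 2.0 * (d - 1) * (u_param(p, q) + s) / (1.0 + 2.0 * s)


def lower_log_power(s, d, q):
    """Power of log(n) in the lower bound of Theorem 4."""
    return 2.0 * (d - 1) * max(s + 0.5 - 1.0 / q, 0.0) / (2.0 * s + 1.0)


# Regime p < 2 and 1/2 - 1/q > 0: the two powers agree.
gap_matching = upper_log_power(1.0, 3, 1.0, 4.0) - lower_log_power(1.0, 3, 4.0)

# Regime p >= 2: the upper bound carries a larger power of log(n).
gap_p2 = upper_log_power(1.0, 3, 2.0, 4.0) - lower_log_power(1.0, 3, 4.0)
```

In the first case $u = 1/2 - 1/q$, so $u + s = s + 1/2 - 1/q$ and the gap vanishes; in the second case the gap is a positive constant, which corresponds to a poly-log discrepancy only.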

5 CONCLUSION

This paper investigated the learning ability of deep ReLU neural networks when the target function is in a Besov space or a mixed smooth Besov space. Based on the analysis for the Besov space, it is shown that deep learning using the ReLU activation can achieve the minimax optimal rate and outperform linear methods when $p < 2$, which indicates spatial inhomogeneity of the shape of the target function. The analysis for the mixed smooth Besov space shows that deep learning can adaptively avoid the curse of dimensionality. The bound is derived by the sparse grid technique. All analyses in the paper adopted the cardinal B-spline expansion and the adaptive non-linear approximation technique, which allowed us to show the minimax optimal rate. The consequences of the analyses partly support the superiority of deep learning in terms of adaptivity and the ability to avoid the curse of dimensionality. From a higher-level viewpoint, these favorable properties can be attributed to its high feature extraction ability.

This paper did not discuss any optimization aspect of deep learning. However, it is important to investigate what kind of practical algorithms can actually achieve the optimal rate derived in this paper in an efficient way. We leave this important issue for future work.

ACKNOWLEDGMENT

TS was partially supported by MEXT Kakenhi (25730013, 25120012, 26280009, 15H05707 and 18H03201), Japan Digital Design, and JST-CREST.

REFERENCES

R.A. Adams and J.J.F. Fournier. Sobolev Spaces. Pure and Applied Mathematics. Elsevier Science, 2003.

Andrew Barron. Approximation and estimation bounds for artificial neural networks. In Proceedings of the Fourth Annual Workshop on Computational Learning Theory, pp. 243–249, 1991.

Andrew R Barron. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information theory, 39(3):930–945, 1993.

Andrew R Barron. Approximation and estimation bounds for artificial neural networks. Machine Learning, 14(1):115–133, 1994.

Monica Bianchini and Franco Scarselli. On the complexity of neural network classifiers: A comparison between shallow and deep architectures. IEEE transactions on neural networks and learning systems, 25(8):1553–1565, 2014.

Helmut Bölcskei, Philipp Grohs, Gitta Kutyniok, and Philipp Petersen. Optimal approximation with sparsely connected deep neural networks. arXiv preprint arXiv:1705.01714, 2017.

CK Chui, Xin Li, and HN Mhaskar. Neural networks for localized approximation. Mathematics of Computation, 63(208):607–623, 1994.

Albert Cohen, Wolfgang Dahmen, Ingrid Daubechies, and Ronald DeVore. Tree approximation and optimal encoding. Applied and Computational Harmonic Analysis, 11(2):192–226, 2001.

Nadav Cohen and Amnon Shashua. Convolutional rectifier networks as generalized tensor decompositions. In Proceedings of the 33rd International Conference on Machine Learning, volume 48 of JMLR Workshop and Conference Proceedings, pp. 955–963, 2016.

Nadav Cohen, Or Sharir, and Amnon Shashua. On the expressive power of deep learning: A tensor analysis. In The 29th Annual Conference on Learning Theory, pp. 698–728, 2016.

George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems (MCSS), 2(4):303–314, 1989.

Ronald A DeVore and Vasil A Popov. Interpolation of Besov spaces. Transactions of the American Mathematical Society, 305(1):397–414, 1988.

Ronald A DeVore, George Kyriazis, Dany Leviatan, and Vladimir M Tikhomirov. Wavelet compression and nonlinear n-widths. Advances in Computational Mathematics, 1(2):197–214, 1993.

David L Donoho, Iain M Johnstone, et al. Minimax estimation via wavelet shrinkage. The Annals of Statistics, 26(3):879–921, 1998.

Dinh Dũng. On recovery and one-sided approximation of periodic functions of several variables. In Dokl. Akad. SSSR, volume 313, pp. 787–790, 1990.

Dinh Dũng. On optimal recovery of multivariate periodic functions. In Satoru Igari (ed.), ICM-90 Satellite Conference Proceedings, pp. 96–105, Tokyo, 1991. Springer Japan. ISBN 978-4-431-68168-7.

Dinh Dũng. Optimal recovery of functions of a certain mixed smoothness. Vietnam Journal of Mathematics, 20(2):18–32, 1992.

Dinh Dũng. B-spline quasi-interpolant representations and sampling recovery of functions with mixed smoothness. Journal of Complexity, 27(6):541–567, 2011a.

Dinh Dũng. Optimal adaptive sampling recovery. Advances in Computational Mathematics, 34(1):1–41, 2011b.

Dinh Dũng, Vladimir N Temlyakov, and Tino Ullrich. Hyperbolic cross approximation. arXiv preprint arXiv:1601.03978, 2016.

É. M. Galeev. Linear widths of Hölder-Nikol'skii classes of periodic functions of several variables. Matematicheskie Zametki, 59(2):189–199, 1996.

E. Giné and R. Nickl. Mathematical Foundations of Infinite-Dimensional Statistical Models. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2015.

Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier neural networks. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, volume 15 of Proceedings of Machine Learning Research, pp. 315–323, 2011.

Kurt Hornik. Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2):251–257, 1991.

Masaaki Imaizumi and Kenji Fukumizu. Deep neural networks learn non-smooth functions effectively. arXiv preprint arXiv:1802.04474, 2018.

Heishiro Kanagawa, Taiji Suzuki, Hayato Kobayashi, Nobuyuki Shimizu, and Yukihiro Tagami. Gaussian process nonparametric tensor estimator and its minimax optimality. In Proceedings of the 33rd International Conference on Machine Learning (ICML2016), pp. 1632–1641, 2016.

Jason M Klusowski and Andrew R Barron. Risk bounds for high-dimensional ridge function combinations including neural networks. arXiv preprint arXiv:1607.01434, 2016.

Shiyu Liang and R Srikant. Why deep neural networks for function approximation? arXiv preprint arXiv:1610.04161, 2016. ICLR2017.

Stephane Mallat. A Wavelet Tour of Signal Processing. Academic Press, 1999.

Lukas Meier, Sara van de Geer, and Peter Bühlmann. High-dimensional additive modeling. The Annals of Statistics, 37(6B):3779–3821, 2009.

Hrushikesh N Mhaskar. Neural networks for optimal approximation of smooth and analytic functions. Neural computation, 8(1):164–177, 1996.

Hrushikesh N Mhaskar and Charles A Micchelli. Approximation by superposition of sigmoidal and radial basis functions. Advances in Applied mathematics, 13(3):350–373, 1992.

Hrushikesh Narhar Mhaskar. Approximation properties of a multilayered feedforward artificial neural network. Advances in Computational Mathematics, 1(1):61–80, 1993.

Hadrien Montanelli and Qiang Du. Deep relu networks lessen the curse of dimensionality. arXiv preprint arXiv:1712.08688, 2017.

Guido F. Montufar, Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio. On the number of linear regions of deep neural networks. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger (eds.), Advances in Neural Information Processing Systems 27, pp. 2924–2932. Curran Associates, Inc., 2014.

V Myronyuk. Kolmogorov widths of the anisotropic Besov classes of periodic functions of many variables. Ukrainian Mathematical Journal, 68(5), 2016.

Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning, pp. 807–814, 2010.

Michael H. Neumann. Multivariate wavelet thresholding in anisotropic function spaces. Statistica Sinica, 10(2):399–431, 2000.

J. Peetre and Duke University. Mathematics Dept. New thoughts on Besov spaces. Duke University mathematics series. Mathematics Dept., Duke University, 1976.

Philipp Petersen and Felix Voigtlaender. Optimal approximation of piecewise smooth functions using deep relu neural networks. arXiv preprint arXiv:1709.05289, 2017.

Pencho P Petrushev. Approximation by ridge functions and neural networks. SIAM Journal on Mathematical Analysis, 30(1):155–189, 1998.

Allan Pinkus. Approximation theory of the mlp model in neural networks. Acta numerica, 8:143–195, 1999.

Ben Poole, Subhaneil Lahiri, Maithreyi Raghu, Jascha Sohl-Dickstein, and Surya Ganguli. Exponential expressivity in deep neural networks through transient chaos. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett (eds.), Advances in Neural Information Processing Systems 29, pp. 3360–3368. Curran Associates, Inc., 2016.

Garvesh Raskutti, Martin Wainwright, and Bin Yu. Minimax-optimal rates for sparse additive models over kernel classes via convex programming. Journal of Machine Learning Research, 13:389–427, 2012a.

Garvesh Raskutti, Martin J Wainwright, and Bin Yu. Minimax-optimal rates for sparse additive models over kernel classes via convex programming. The Journal of Machine Learning Research, 13(1):389–427, 2012b.

A. S. Romanyuk. Linear widths of the Besov classes of periodic functions of many variables. II. Ukrainian Mathematical Journal, 53(6):965–977, Jun 2001.

A. S. Romanyuk. Bilinear approximations and Kolmogorov widths of periodic Besov classes. Theory of Operators, Differential Equations, and the Theory of Functions, 6(1):222–236, 2009. Proc. of the Institute of Mathematics, Ukrainian National Academy of Sciences.

H-J Schmeisser. An unconditional basis in periodic spaces with dominating mixed smoothness properties. Analysis Mathematica, 13(2):153–168, 1987.

J. Schmidt-Hieber. Nonparametric regression using deep neural networks with ReLU activation function. ArXiv e-prints, August 2017.

Winfried Sickel and Tino Ullrich. Tensor products of Sobolev–Besov spaces and applications to approximation from the hyperbolic cross. Journal of Approximation Theory, 161(2):748–786, 2009.

Winfried Sickel and Tino Ullrich. Spline interpolation on sparse grids. Applicable Analysis, 90(3-4):337–383, 2011.

M. Signoretto, L. De Lathauwer, and J.A.K. Suykens. Nuclear norms for tensors and their use for convex multilinear estimation. Technical Report 10-186, ESAT-SISTA, K.U.Leuven, 2010.

Sergey Smolyak. Quadrature and interpolation formulas for tensor products of certain classes of functions. In Soviet Math. Dokl., volume 4, pp. 240–243, 1963.

Sho Sonoda and Noboru Murata. Neural network with unbounded activation functions is universal approximator. Applied and Computational Harmonic Analysis, 2015.

Taiji Suzuki, Heishiro Kanagawa, Hayato Kobayashi, Nobuyuki Shimizu, and Yukihiro Tagami. Minimax optimal alternating minimization for kernel nonparametric tensor learning. In Advances In Neural Information Processing Systems, pp. 3783–3791, 2016.

V.N. Temlyakov. Approximation of periodic functions of several variables with bounded mixed difference. Math. USSR Sb, 41(1):53–66, 1982.

V.N. Temlyakov. Approximation of Periodic Functions. Nova Science Publishers, 1993a.

V.N. Temlyakov. On approximate recovery of functions with bounded mixed derivative. Journal of Complexity, 9:41–59, 1993b.

Vladimir Mikhailovich Tikhomirov. Diameters of sets in function spaces and the theory of best approximations. Uspekhi Matematicheskikh Nauk, 15(3):81–120, 1960.

Hans Triebel. Theory of function spaces. Monographs in mathematics. Birkhäuser Verlag, 1983. ISBN 9783764313814.

A. W. van der Vaart and J. A. Wellner. Weak Convergence and Empirical Processes: With Applications to Statistics. Springer, New York, 1996.

Robert C Williamson and Peter L Bartlett. Splines, rational functions and neural networks. In Advances in Neural Information Processing Systems, pp. 1040–1047, 1992.

Yuhong Yang and Andrew Barron. Information-theoretic determination of minimax rates of convergence. The Annals of Statistics, 27(5):1564–1599, 1999.

Dmitry Yarotsky. Error bounds for approximations with deep relu networks. CoRR, abs/1610.01145, 2016.

Shuanglin Zhang, Man-Yu Wong, and Zhongguo Zheng. Wavelet threshold estimation of a regression function with random design. Journal of multivariate analysis, 80(2):256–284, 2002.

A PROOF OF LEMMA 1

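Proof of Lemma 1. First note that $\mathcal{N}_m(x) = \frac{1}{m!}\sum_{j=0}^{m+1} (-1)^j \binom{m+1}{j} (x - j)_+^m$ (see Eq. (4.28) of Mhaskar & Micchelli (1992) for example). Thus, if we can make an approximation of $\eta(x)^m$, then by taking a summation of those bases, we obtain an approximation of $\mathcal{N}_m(x)$. It is shown by Yarotsky (2016) that, for $D \in \mathbb{N}$ and any $\epsilon > 0$, there exists a neural network $\phi_{\mathrm{mult}} \in \Phi(L, W, S, B)$ with $L = \lceil \log_2(3^D/\epsilon) + 5 \rceil \lceil \log_2(D) \rceil$, $W = 6D$, $S = LW^2$ and $B = 1$ such that

$$\sup_{x \in [0,1]^D} \Big| \phi_{\mathrm{mult}}(x_1, \dots, x_D) - \prod_{i=1}^{D} x_i \Big| \le \epsilon,$$

and $\phi_{\mathrm{mult}}(y_1, \dots, y_D) = 0$ for $y \in \mathbb{R}^D$ such that $\prod_{j=1}^{D} y_j = 0$. Moreover, for any $M > 0$, we can realize the function $\min\{M, \max\{x, 0\}\}$ by a single-layer neural network $\phi_{(0,M)}(x) := \eta(x) - \eta(x - M)$ $(= \min\{M, \max\{x, 0\}\})$. Thus, for $x \in \mathbb{R}$, it holds that

$$\sup_{x \in [0,M]} \Big| \phi_{\mathrm{mult}}\big(\phi_{(0,1)}(x/M), \dots, \phi_{(0,1)}(x/M)\big) - \big(\phi_{(0,1)}(x/M)\big)^m \Big| \le \epsilon.$$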
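The clipping identity used here is easy to confirm numerically. The sketch below (function names are ours) implements $\phi_{(0,M)}(x) = \eta(x) - \eta(x - M)$ with the ReLU $\eta$ and compares it with $\min\{M, \max\{x, 0\}\}$ on a grid.

```python
def relu(x):
    """ReLU activation eta(x) = max(x, 0)."""
    return max(x, 0.0)


def clipped_unit(x, M):
    """phi_{(0,M)}(x) = eta(x) - eta(x - M), a difference of two ReLU units."""
    return relu(x) - relu(x - M)


M = 2.5
grid = [-3.0 + 0.25 * i for i in range(41)]  # points from -3 to 7
matches = all(
    abs(clipped_unit(x, M) - min(M, max(x, 0.0))) < 1e-12 for x in grid
)
```

The unit is zero for $x \le 0$, equals $x$ on $[0, M]$ and saturates at $M$ afterwards, which is what allows the inputs of $\phi_{\mathrm{mult}}$ to be kept inside $[0,1]$.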

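Now, since $\mathcal{N}_m(x) = 0$ for $x \notin [0, m+1]$, it also holds that

$$\mathcal{N}_m(x) = \frac{1}{m!}\sum_{j=0}^{m+1} (-1)^j \binom{m+1}{j} \phi_{(0,m+1-j)}(x - j)^m = \frac{1}{m!}\sum_{j=0}^{m+1} (-1)^j \binom{m+1}{j} (m+1)^m\, \phi_{(0,1-j/(m+1))}\big((x - j)/(m+1)\big)^m.$$

Therefore, letting

$$f(x) = \frac{1}{m!}\sum_{j=0}^{m+1} (-1)^j (m+1)^m \binom{m+1}{j}\, \phi_{\mathrm{mult}}\Big(\underbrace{\phi_{(0,1-\frac{j}{m+1})}\big(\tfrac{x-j}{m+1}\big), \dots, \phi_{(0,1-\frac{j}{m+1})}\big(\tfrac{x-j}{m+1}\big)}_{m\text{-times}}\Big),$$

we have that $f(x) = 0$ for all $x \le 0$ and

$$\sup_{0 \le x \le m+1} |\mathcal{N}_m(x) - f(x)| \le \frac{1}{m!}\sum_{j=0}^{m+1} \binom{m+1}{j} (m+1)^m \epsilon \le \frac{(m+1)^m}{\sqrt{2\pi}\, m^{m+1/2} e^{-m}}\, 2^{m+1} \epsilon \le e\, \frac{(2e)^m}{\sqrt{m}}\, \epsilon =: \epsilon',$$

where we used $\sum_{j=0}^{m+1} \binom{m+1}{j} = 2^{m+1}$ and Stirling's approximation $m! \ge \sqrt{2\pi}\, m^{m+1/2} e^{-m}$ in the second inequality. Hence, we also have

$$f(x) = \frac{1}{m!}\sum_{j=0}^{m+1} (-1)^j \binom{m+1}{j} (m+1)^m\, \phi_{\mathrm{mult}}\Big(\phi_{(0,1-\frac{j}{m+1})}\big(\tfrac{m+1-j}{m+1}\big), \dots, \phi_{(0,1-\frac{j}{m+1})}\big(\tfrac{m+1-j}{m+1}\big)\Big) =: \delta' \quad (\forall x > m+1).$$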

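It holds that $|\delta'| \le \epsilon'$. Because of this and noting $0 \le \mathcal{N}_m(x) \le 1$, we see that $g(x) := \phi_{(0,1)}\big(f(x) - \frac{\delta'}{m+1}\phi_{(0,m+1)}(x)\big)$ yields

$$\sup_{x \in \mathbb{R}} |\mathcal{N}_m(x) - g(x)| \le 2\epsilon',$$

$\sup_{x \in \mathbb{R}} |g(x)| \le 1$, and $g(x) = 0$ for all $x \notin [0, m+1]$. Hence, by applying $\phi_{\mathrm{mult}}$ again, we finally obtain that

$$\sup_{x \in [0,1]^d} \big| M^d_{0,0}(x) - \phi_{\mathrm{mult}}(g(x_1), \dots, g(x_d)) \big| \le \sup_{x \in [0,1]^d} \Big| M^d_{0,0}(x) - \prod_{j=1}^{d} g(x_j) \Big| + \sup_{x \in [0,1]^d} \Big| \prod_{j=1}^{d} g(x_j) - \phi_{\mathrm{mult}}(g(x_1), \dots, g(x_d)) \Big| \le 2d\epsilon' + \epsilon.$$

Applying $\phi_{(0,1)}$ again, we obtain that $h = \phi_{(0,1)} \circ \phi_{\mathrm{mult}}(g(x_1), \dots, g(x_d))$ satisfies $\|M^d_{0,0} - h\|_{L^\infty(\mathbb{R}^d)} \le 2d\epsilon' + \epsilon$, $h(x) = 0$ for all $x \notin [0, m+1]^d$, and $\|h\|_\infty \le 1$. Finally, by carefully checking the network construction, it is shown that $h \in \Phi(L, W, S, B)$ with $L = 3 + 2\lceil \log_2(3^{d \vee m}/\epsilon) + 5 \rceil \lceil \log_2(d \vee m) \rceil$, $W = 6dm(m+2) + 2d$, $S = LW^2$ and $B = 2(m+1)^m$. Hence, resetting $\epsilon \leftarrow 2d\epsilon' + \epsilon = \big(1 + 2de\frac{(2e)^m}{\sqrt{m}}\big)\epsilon$, this $h$ is the desired $\check{M}$.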
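The two analytic ingredients of this proof, the truncated power representation of $\mathcal{N}_m$ and the Stirling-based constant, can be sanity-checked numerically. The following is a small sketch with our own function names, assuming $m \ge 1$; it does not construct the network $\phi_{\mathrm{mult}}$ itself.

```python
import math


def relu(x):
    return max(x, 0.0)


def bspline_truncated_power(x, m):
    """N_m(x) = (1/m!) * sum_{j=0}^{m+1} (-1)^j C(m+1, j) (x - j)_+^m, for m >= 1."""
    total = sum(
        (-1) ** j * math.comb(m + 1, j) * relu(x - j) ** m for j in range(m + 2)
    )
    return total / math.factorial(m)


def bspline_clipped(x, m):
    """Same sum with (x - j)_+ replaced by phi_{(0, m+1-j)}(x - j)."""
    total = 0.0
    for j in range(m + 2):
        clipped = relu(x - j) - relu(x - j - (m + 1 - j))
        total += (-1) ** j * math.comb(m + 1, j) * clipped ** m
    return total / math.factorial(m)


def stirling_chain_holds(m):
    """Check (m+1)^m 2^{m+1} / m! <= e (2e)^m / sqrt(m)."""
    lhs = (m + 1) ** m * 2 ** (m + 1) / math.factorial(m)
    rhs = math.e * (2 * math.e) ** m / math.sqrt(m)
    return lhs <= rhs


hat_value = bspline_truncated_power(0.5, 1)       # linear B-spline (hat function)
quadratic_peak = bspline_truncated_power(1.5, 2)  # quadratic B-spline at its centre
agree_on_support = all(
    abs(bspline_truncated_power(0.1 * i, 3) - bspline_clipped(0.1 * i, 3)) < 1e-9
    for i in range(41)                            # x in [0, 4] = [0, m + 1] for m = 3
)
vanishes_outside = abs(bspline_clipped(10.0, 3)) < 1e-9
constants_ok = all(stirling_chain_holds(m) for m in range(1, 11))
```

The clipped version agrees with the truncated power formula on $[0, m+1]$ and stays at zero for $x > m+1$, which mirrors the role of $\phi_{(0,m+1-j)}$ in the construction of $f$.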

B PROOF OF PROPOSITION 1

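For the order $m \in \mathbb{N}$ of the cardinal B-spline bases, let $J(k) = \{-m, -m+1, \dots, 2^k - 1, 2^k\}^d$ and let the quasi-norm of the coefficient $(\alpha_{k,j})_{k,j}$ for $k \in \mathbb{N}_+$ and $j \in J(k)$ be

$$\|(\alpha_{k,j})_{k,j}\|_{b^s_{p,q}} = \Bigg( \sum_{k \in \mathbb{N}_+} \Big[ 2^{k(s - d/p)} \Big( \sum_{j \in J(k)} |\alpha_{k,j}|^p \Big)^{1/p} \Big]^q \Bigg)^{1/q}.$$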

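Lemma 2. Under one of the conditions (4) in Proposition 1 and the condition $0 < s < \min(m, m - 1 + 1/p)$ where $m \in \mathbb{N}$ is the order of the cardinal B-spline bases, for any $f \in B^s_{p,q}(\Omega)$, there exists $f_N$ that satisfies

$$\|f - f_N\|_{L^r(\Omega)} \lesssim N^{-s/d} \|f\|_{B^s_{p,q}} \qquad (9)$$

for $N \gg 1$, and has the following form:

$$f_N(x) = \sum_{k=0}^{K} \sum_{j \in J(k)} \alpha_{k,j} M^d_{k,j}(x) + \sum_{k=K+1}^{K^*} \sum_{i=1}^{n_k} \alpha_{k,j_i} M^d_{k,j_i}(x), \qquad (10)$$

where $(j_i)_{i=1}^{n_k} \subset J(k)$, $K = \lceil C_1 \log(N)/d \rceil$, $K^* = \lceil \log(\lambda N)\nu^{-1} \rceil + K + 1$, $n_k = \lceil \lambda N 2^{-\nu(k - K)} \rceil$ $(k = K+1, \dots, K^*)$ for $\delta = d(1/p - 1/r)_+$ and $\nu = (s - \delta)/(2\delta)$, and the real number constants $C_1 > 0$ and $\lambda > 0$ are chosen to satisfy $\sum_{k=1}^{K} (2^k + m)^d + \sum_{k=K+1}^{K^*} n_k \le N$ independently of $N$. Moreover, we can choose the coefficients $(\alpha_{k,j})$ to satisfy $\|(\alpha_{k,j})_{k,j}\|_{b^s_{p,q}} \lesssim \|f\|_{B^s_{p,q}}$.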
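To illustrate how the term budget of Lemma 2 is split between the dense levels $k \le K$ and the adaptive levels $K < k \le K^*$, here is a rough sketch. The values $C_1 = 0.5$ and $\lambda = 0.1$, the use of base-2 logarithms, and all function names are assumptions made only for this illustration; the lemma merely asserts that suitable constants exist.

```python
import math


def lemma2_layout(N, d, m, s, p, r, C1=0.5, lam=0.1):
    """Return (K, K_star, dense_terms, n_k) for hypothetical constants C1 and lam."""
    delta = d * max(1.0 / p - 1.0 / r, 0.0)
    if delta <= 0.0:
        raise ValueError("nu is only defined when delta > 0, i.e. p < r")
    nu = (s - delta) / (2.0 * delta)
    K = math.ceil(C1 * math.log2(N) / d)
    K_star = math.ceil(math.log2(lam * N) / nu) + K + 1
    # Dense part: every index j on levels k = 1, ..., K is kept.
    dense_terms = sum((2 ** k + m) ** d for k in range(1, K + 1))
    # Adaptive part: only n_k coefficients per level, decaying geometrically in k.
    n_k = {
        k: math.ceil(lam * N * 2.0 ** (-nu * (k - K)))
        for k in range(K + 1, K_star + 1)
    }
    return K, K_star, dense_terms, n_k


N = 10_000
K, K_star, dense_terms, n_k = lemma2_layout(N, d=2, m=3, s=2.0, p=1.0, r=2.0)
total_terms = dense_terms + sum(n_k.values())
within_budget = total_terms <= N
```

For these values the dense part uses only a few hundred terms, and the adaptive levels add a geometrically decaying number of terms, so the total stays well below $N$.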

Proof of Lemma 2. DeVore & Popov (1988) constructed a linear bounded operator $P_k$ having the following form:

$$P_k(f)(x) = \sum_{j \in J(k)} a_{k,j} M^d_{k,j}(x), \qquad (11)$$

where $a_{k,j}$ is constructed in a certain way, where for every $f \in L^p([0,1]^d)$ with $0 < p \le \infty$, it holds

Let

$$p_k(f) := P_k(f) - P_{k-1}(f), \qquad P_{-1}(f) = 0.$$

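Then, it is shown that for $0 < p, q \le \infty$ and $0 < s < \min(m, m - 1 + 1/p)$, $f$ belongs to $B^s_{p,q}$ if and only if $f$ can be decomposed into

$$f = \sum_{k=0}^{\infty} p_k(f),$$

with the convergence condition $\|(p_k(f))_{k=0}^{\infty}\|_{b^s_q(L^p)} < \infty$; in particular, $\|f\|_{B^s_{p,q}} \simeq \|(p_k(f))_{k=0}^{\infty}\|_{b^s_q(L^p)} := \big( \sum_{k \in \mathbb{N}_+} (2^{sk} \|p_k\|_{L^p})^q \big)^{1/q}$. Here, each $p_k$ can be expressed as $p_k(x) = \sum_{j \in J(k)} \alpha_{k,j} M^d_{k,j}(x)$ for a coefficient $(\alpha_{k,j})_{k,j}$ (which could be different from $(a_{k,j})_{k,j}$ appearing in Eq. (11)). Hence, $f \in B^s_{p,q}$ can be decomposed into

$$f = \sum_{k=0}^{\infty} \sum_{j \in J(k)} \alpha_{k,j} M^d_{k,j}(x) \qquad (13)$$

with convergence in the sense of $L^p$. Moreover, it is shown that $\|p_k\|_{L^p} \simeq \big( 2^{-kd} \sum_{j \in J(k)} |\alpha_{k,j}|^p \big)^{1/p}$ and thus

$$\|f\|_{B^s_{p,q}} \simeq \|(\alpha_{k,j})_{k,j}\|_{b^s_{p,q}}. \qquad (14)$$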

Based on this decomposition, Dũng (2011b) proposed an optimal adaptive recovery method such that the approximator has the form (10) under the conditions for $K$, $K^*$, $n_k$ given in the statement and satisfies the approximation accuracy (9). This can be proven by applying the proof of Theorem 3.1 in Dũng (2011b) to the decomposition (13) instead of Eq. (3.8) of that paper. See also Theorem 5.4 of Dũng (2011b). Moreover, the equivalence (14) gives the norm bound of the coefficient $(\alpha_{k,j})$.

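Proof of Proposition 1. Basically, we combine Lemma 1 and Lemma 2. We substitute the approximated cardinal B-spline basis $\check{M}$ into the decomposition of $f_N$ (10). Let the set of indexes $(k, j) \in \mathbb{N} \times \mathbb{N}$ that constitute $f_N$ given in Eq. (10) be $E_N$: $f_N = \sum_{(k,j) \in E_N} \alpha_{k,j} M^d_{k,j}$. Accordingly, we set $\check{f} := \sum_{(k,j) \in E_N} \alpha_{k,j} \check{M}^d_{k,j}$. For each $x \in \mathbb{R}^d$, it holds that

$$|f_N(x) - \check{f}(x)| \le \sum_{(k,j) \in E_N} |\alpha_{k,j}|\, |M^d_{k,j}(x) - \check{M}^d_{k,j}(x)| \le \epsilon \sum_{(k,j) \in E_N} |\alpha_{k,j}|\, \mathbf{1}\{M^d_{k,j}(x) \ne 0\}$$
$$\le \epsilon (m+1)^d (1 + K^*)\, 2^{K^*(d/p - s)_+} \|f\|_{B^s_{p,q}} \lesssim \log(N)\, N^{(\nu^{-1} + d^{-1})(d/p - s)_+} \epsilon\, \|f\|_{B^s_{p,q}},$$

where we used the definition of $K^*$ in the last inequality. Therefore, for each $f \in U(B^s_{p,q}([0,1]^d))$, it holds that

$$\|f - \check{f}\|_{L^r} \lesssim \|f - f_N\|_{L^r} + \|f_N - \check{f}\|_{L^r} \lesssim \log(N)\, N^{(\nu^{-1} + d^{-1})(d/p - s)_+} \|f\|_{B^s_{p,q}}\, \epsilon + N^{-s/d}.$$

By taking $\epsilon$ to satisfy $\log(N)\, N^{(\nu^{-1} + d^{-1})(d/p - s)_+} \epsilon \le N^{-s/d}$ (i.e., $\epsilon \le N^{-s/d - (\nu^{-1} + d^{-1})(d/p - s)_+} \log(N)^{-1}$), we obtain the approximation error bound.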

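Next, we bound the magnitude of the coefficients. Each coefficient $\alpha_{k,j}$ satisfies $|\alpha_{k,j}| \lesssim 2^{k(d/p - s)_+} \|f\|_{B^s_{p,q}} \le 2^{k(d/p - s)_+} \lesssim N^{(\nu^{-1} + d^{-1})(d/p - s)_+}$ for $k \le K^*$. Finally, the magnitudes of the coefficients hidden in $\check{M}^d_{k,j}$ are evaluated. Remembering that $\check{M}^d_{k,j}(x) = \check{M}(2^k x_1 - j_1, \dots, 2^k x_d - j_d)$, we see that we just need to bound the quantity $2^k$ $(k \le K^*)$. However, this is bounded by $2^{K^*} \lesssim N^{\nu^{-1} + d^{-1}}$, as already used in the bound above.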

C PROOF OF THEOREM 1

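Let $\mathbb{N}^d_+(e) := \{ s \in \mathbb{N}^d_+ \mid s_i = 0,\ i \notin e \}$ and for $k \in \mathbb{N}^d_+(e)$, we define $2^{-k} := (2^{-k_{i_1}}, \dots, 2^{-k_{i_{|e|}}}) \in \mathbb{R}^{|e|}_+$ where $(i_1, \dots, i_{|e|}) = e$. By defining $\|(g_k)_k\|_{b^{\alpha,e}_q} := \big( \sum_{k \in \mathbb{N}^d_+(e)} (2^{\alpha \|k\|_1} |g_k|)^q \big)^{1/q}$ for a sequence $(g_k)_{k \in \mathbb{N}^d_+(e)}$, it holds that

$$|f|_{MB^{\alpha,e}_{p,q}} = \sum_{e \subset \{1, \dots, d\}} \big\| (w^e_{r,p}(f, 2^{-k}))_k \big\|_{b^{\alpha,e}_q}.$$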

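Proof of Theorem 1. The result immediately follows from Theorem 5. Let the set of indexes $(k, j)$ constituting $R_K$ be $E_K$: $R_K(f) = \sum_{(k,j) \in E_K} \alpha_{k,j} M^d_{k,j}(x)$. As in the proof of Proposition 1, we approximate $R_K(f)$ by a neural network given as

$$\check{f}(x) = \sum_{(k,j) \in E_K} \alpha_{k,j} \check{M}^d_{k,j}(x).$$

Each coefficient $\alpha_{k,j}$ satisfies $|\alpha_{k,j}| \lesssim 2^{\|k\|_1 (1/p - s)_+} \|f\|_{MB^s_{p,q}} \lesssim 2^{K^*(1/p - s)_+}$. The difference between $R_K(f)$ and $\check{f}$ is bounded as

$$|R_K(f) - \check{f}(x)| \le \sum_{(k,j) \in E_K} |\alpha_{k,j}|\, |M^d_{k,j}(x) - \check{M}^d_{k,j}(x)| \le \epsilon \sum_{(k,j) \in E_K} |\alpha_{k,j}|\, \mathbf{1}\{M^d_{k,j}(x) \ne 0\} \lesssim \epsilon (m+1)^d (1 + K^*) D_{K^*,d}\, 2^{K^*(1/p - s)_+} \|f\|_{MB^s_{p,q}}.$$

Therefore, by taking $\epsilon$ so that $\epsilon (m+1)^d (1 + K^*) D_{K^*,d}\, 2^{K^*(1/p - s)_+} \le 2^{-Ks}$ is satisfied, it holds that $|R_K(f) - \check{f}(x)| \lesssim 2^{-Ks}$. By the inequality $D_{K^*,d} \le e^{K^* + d - 1}$, it suffices to let $\epsilon \le \frac{e^{-K^*(s + (1/p - s)_+ + 1)}}{[e(m+1)]^d (1 + K^*)}$. The cardinality of $E_K$ is bounded as

$$\sum_{\kappa = 0, \dots, K} 2^{\kappa} \binom{\kappa + d - 1}{d - 1} + \sum_{k:\, K < \|k\|_1 \le K^*} n_k \le 2^{K+1} \binom{K + d - 1}{d - 1} + \sum_{K < \kappa \le K^*} 2^{K - \frac{s - \delta}{2\delta}(\kappa - K)} \binom{\kappa + d - 1}{d - 1}$$
$$\le 2^{K+1} D_{K,d} + 2^{K} \big(1 - 2^{-\frac{s - \delta}{2\delta}}\big)^{-1} D_{K^*,d} \le \Big(2 + \big(1 - 2^{-\frac{s - \delta}{2\delta}}\big)^{-1}\Big) 2^{K} D_{K^*,d} = N.$$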

Since each unit $\check{M}^d_{k,j}$ requires width $W_0$, the whole width becomes $W = N W_0$. The number of nonzero parameters to construct $\check{M}^d_{k,j}$ is bounded by $S = (L - 1) W_0^2 N + N$.

Finally, the magnitudes of the coefficients hidden in $\check{M}^d_{k,j}$ are evaluated. Remembering that $\check{M}^d_{k,j}(x) = \check{M}(2^{k_1} x_1 - j_1, \dots, 2^{k_d} x_d - j_d)$, the maximum of $2^{k_j}$ is bounded by $2^{K^*} \lesssim N^{(1 + 1/\nu)}$. Hence, we obtain the assertion. Similarly, it holds that $|\alpha_{k,j}| \lesssim N^{(1 + 1/\nu)\{1 \vee (1/p - s)_+\}}$.
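The first step of the cardinality bound can be checked by brute force. The sketch below (our own helper names) confirms that the number of multi-indices $k \in \mathbb{N}^d_+$ with $\|k\|_1 = \kappa$ is $\binom{\kappa + d - 1}{d - 1}$ and that $\sum_{\kappa \le K} 2^{\kappa} \binom{\kappa + d - 1}{d - 1} \le 2^{K+1} \binom{K + d - 1}{d - 1}$.

```python
import itertools
import math


def count_levels(kappa, d):
    """Brute-force count of k in {0, 1, ...}^d with k_1 + ... + k_d = kappa."""
    return sum(
        1 for k in itertools.product(range(kappa + 1), repeat=d) if sum(k) == kappa
    )


def dense_part_size(K, d):
    """sum_{kappa=0}^{K} 2^kappa * C(kappa + d - 1, d - 1)."""
    return sum(2 ** kappa * math.comb(kappa + d - 1, d - 1) for kappa in range(K + 1))


stars_and_bars_ok = all(
    count_levels(kappa, d) == math.comb(kappa + d - 1, d - 1)
    for d in (1, 2, 3)
    for kappa in range(6)
)
bound_ok = all(
    dense_part_size(K, d) <= 2 ** (K + 1) * math.comb(K + d - 1, d - 1)
    for d in (1, 2, 3, 5)
    for K in range(1, 12)
)
```

The bound holds because the binomial coefficient is non-decreasing in $\kappa$ and $\sum_{\kappa \le K} 2^{\kappa} < 2^{K+1}$.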

D PROOF OF THEOREM 5

D.1 PREPARATION: SPARSE GRID

Here, we give technical details behind the approximation bound. The analysis utilizes the so-called sparse grid technique (Smolyak, 1963), which has been developed in the field of function approximation theory.

As we have seen above, in a typical B-spline approximation scheme, we put the basis functions $M^m_{k,j}$ on a regular grid and approximate $f(x) \approx \sum_{k=1,\dots,K} \sum_{j \in J(k)} \alpha_{k,j} M^m_{k,j}(x)$, which consists of $O(2^{Kd})$ terms (see Eq. (10)). Hence, the number of parameters $O(2^{Kd})$ is affected by the dimensionality $d$ in an exponential order. However, to approximate functions with mixed smoothness, we do not need to put the basis on the whole range of the regular grid. Instead, we just need to put them on a sparse grid, which is a subset of the regular grid and has much smaller cardinality than the whole set. The approximation algorithm utilizing the sparse grid is based on Smolyak's construction (Smolyak, 1963) and its applications to mixed smooth spaces (Dũng, 1990; 1991; 1992; Temlyakov, 1982; 1993a;b). Dũng (2011a) studied an optimal non-adaptive linear sampling recovery method for the mixed smooth Besov space based on the cardinal B-spline bases. We adopt this method, and combining this with the adaptive technique developed in Dũng (2011b), we give the following approximation bound using a non-linear adaptive method to obtain better convergence for the setting $p < r$.
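The saving obtained from the sparse grid can be seen by counting basis functions. The sketch below is illustrative only: it assumes that a level $k = (k_1, \dots, k_d)$ carries $\prod_i (2^{k_i} + m + 1)$ shifted B-splines, counts the regular grid over the isotropic levels $k = 1, \dots, K$, and counts the non-adaptive sparse grid over the multi-indices with $\|k\|_1 \le K$; the function names are ours.

```python
import itertools
import math


def regular_grid_terms(K, d, m):
    """Number of bases on the regular grid: isotropic levels k = 1, ..., K."""
    return sum((2 ** k + m + 1) ** d for k in range(1, K + 1))


def sparse_grid_terms(K, d, m):
    """Number of bases on the sparse grid: multi-index levels with |k|_1 <= K."""
    total = 0
    for k in itertools.product(range(K + 1), repeat=d):
        if sum(k) <= K:
            total += math.prod(2 ** ki + m + 1 for ki in k)
    return total


K, d, m = 6, 3, 2
regular = regular_grid_terms(K, d, m)
sparse = sparse_grid_terms(K, d, m)
ratio = regular / sparse
```

Already for $K = 6$ and $d = 3$ the sparse grid uses more than ten times fewer basis functions, and the ratio grows quickly with $d$ because the regular grid scales like $2^{Kd}$ while the sparse grid scales like $2^K$ times a polynomial in $K$.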

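Before we state the theorem, we define a quasi-norm of a set of coefficients $\alpha_{k,j} \in \mathbb{R}$ for $k \in \mathbb{N}^d_+$ and $j \in J^d_m(k) := \{-m, -m+1, \dots, 2^{k_1} - 1, 2^{k_1}\} \times \cdots \times \{-m, -m+1, \dots, 2^{k_d} - 1, 2^{k_d}\}$ as

$$\|(\alpha_{k,j})_{k,j}\|_{mb^{\alpha}_{p,q}} := \Bigg( \sum_{k \in \mathbb{N}^d_+} \Big[ 2^{(\alpha - 1/p)\|k\|_1} \Big( \sum_{j \in J^d_m(k)} |\alpha_{k,j}|^p \Big)^{1/p} \Big]^q \Bigg)^{1/q}.$$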

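Theorem 5. Suppose that $0 < p, q, r \le \infty$ and $\alpha > (1/p - 1/r)_+$. Assume that the order $m \in \mathbb{N}$ of the cardinal B-spline satisfies $0 < s < \min(m, m - 1 + 1/p)$. Let $\delta = (1/p - 1/r)_+$. Then, for any $f \in MB^s_{p,q}(\Omega)$ and $K > 0$, there exists $R_K(f)$ such that $R_K(f)$ can be represented as

$$R_K(f)(x) = \sum_{k \in \mathbb{N}^d_+:\, \|k\|_1 \le K} \; \sum_{j \in J^d_m(k)} \alpha_{k,j} M^d_{k,j}(x) + \sum_{k \in \mathbb{N}^d_+:\, K < \|k\|_1 \le K^*} \; \sum_{i=1}^{n_k} \alpha_{k,j^{(k)}_i} M^d_{k,j^{(k)}_i}(x),$$

where $K^* = \lceil K(1 + \frac{2\delta}{\alpha - \delta}) \rceil$, $(j^{(k)}_i)_{i=1}^{n_k} \subset J^d_m(k)$, and $n_k = \lceil 2^{K - \frac{\alpha - \delta}{2\delta}(\|k\|_1 - K)} \rceil$, and has the following properties:

(i) For $p \ge r$,

$$\|f - R_K(f)\|_r \lesssim 2^{-K\alpha} D_{K,d}^{(1/\min(r,1) - 1/q)_+} \|f\|_{MB^s_{p,q}}.$$

(ii) For $p < r$,

$$\|f - R_K(f)\|_r \lesssim \begin{cases} 2^{-K\alpha} D_{K,d}^{(1/r - 1/q)_+} \|f\|_{MB^s_{p,q}} & (r < \infty), \\ 2^{-K\alpha} D_{K,d}^{(1 - 1/q)_+} \|f\|_{MB^s_{p,q}} & (r = \infty). \end{cases}$$

Moreover, the coefficients $(\alpha_{k,j})_{k,j}$ can be taken to satisfy $\|(\alpha_{k,j})_{k,j}\|_{mb^{\alpha}_{p,q}} \lesssim \|f\|_{MB^{\alpha}_{p,q}}$.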

The proof is given in Appendix D.2. The total number of cardinal B-spline bases constituting $R_K(f)$ can be evaluated as

$$2^{K+1} \binom{K + d - 1}{d - 1} + \sum_{k:\, K < \|k\|_1 \le K^*} n_k \lesssim 2^{K} D_{K,d} + 2^{K} D_{K^*,d} \lesssim 2^{K} D_{K,d} \quad (\because \text{Eq. (17)}).$$

Here, $D_{K,d}$ can be evaluated as

$$D_{K,d} \lesssim K^{d-1} \quad \text{or} \quad D_{K,d} \lesssim d^{K}.$$

Therefore, the total number of bases can be evaluated as

$$2^{K} \min\{K^{d-1}, d^{K}\},$$

which is much smaller than the $2^{Kd}$ bases required to approximate functions in the ordinary Besov space (see Lemma 2). In this proposition, $K$ controls the resolution and as $K$ goes to infinity, the approximation error goes to 0 exponentially fast. A remarkable point in the proposition is in the