
arXiv:2002.06956v1 [math.ST] 17 Feb 2020

Density estimation using Dirichlet kernels

Frédéric Ouimet^{a,1,*}

^a California Institute of Technology, Pasadena, USA.

Abstract

In this paper, we introduce Dirichlet kernels for the estimation of multivariate densities supported on the $d$-dimensional simplex. These kernels generalize the beta kernels from Brown & Chen (1999); Chen (1999, 2000a); Bouezmarni & Rolin (2003), originally studied in the context of smoothing for regression curves. We prove various asymptotic properties for the estimator: bias, variance, mean squared error, mean integrated squared error, asymptotic normality and uniform strong consistency. In particular, the asymptotic normality and uniform strong consistency results are completely new, even for the case $d = 1$ (beta kernels). These new kernel smoothers can be used for density estimation of compositional data. The estimator is simple to use, free of boundary bias, allocates non-negative weights everywhere on the simplex, and achieves the optimal convergence rate of $n^{-4/(d+4)}$ for the mean integrated squared error.

Keywords: Dirichlet kernels, beta kernels, simplex, density estimation, mean squared error, mean integrated squared error, asymptotic normality, uniform strong consistency

2010 MSC: Primary 62G07; Secondary 62G05, 62G20

1. Dirichlet kernel estimators

The d-dimensional simplex and its interior are defined by

$\mathcal{S} := \{s \in [0,1]^d : \|s\|_1 \le 1\} \quad \text{and} \quad \mathrm{Int}(\mathcal{S}) := \{s \in (0,1)^d : \|s\|_1 < 1\}, \qquad (1.1)$

where $\|s\|_1 := \sum_{i=1}^d |s_i|$. For $\alpha_1, \dots, \alpha_d, \beta > 0$, the density of the $\mathrm{Dirichlet}(\alpha, \beta)$ distribution is

$K_{\alpha,\beta}(s) := \frac{\Gamma(\|\alpha\|_1 + \beta)}{\Gamma(\beta) \prod_{i=1}^d \Gamma(\alpha_i)} \cdot (1 - \|s\|_1)^{\beta-1} \prod_{i=1}^d s_i^{\alpha_i - 1}, \quad s \in \mathcal{S}. \qquad (1.2)$

For a given bandwidth parameter $b > 0$, and a sample of i.i.d. vectors $X_1, X_2, \dots, X_n$ that are $F$ distributed with a density $f$ supported on $\mathcal{S}$, the Dirichlet kernel estimator is

$\hat{f}_{n,b}(s) := \frac{1}{n} \sum_{i=1}^n K_{s/b+1,\,(1-\|s\|_1)/b+1}(X_i). \qquad (1.3)$

Throughout the paper, we use the notation

$[d] := \{1, 2, \dots, d\}. \qquad (1.4)$

Also, as $n \to \infty$ and/or $b \to 0$, we use the standard asymptotic notation $O(\cdot)$ and $o(\cdot)$, which implicitly can depend on the density $f$ and the dimension $d$, but on no other variable unless explicitly written as a subscript.
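To make (1.2)-(1.3) concrete, here is a minimal Python sketch of the estimator (an illustration, not taken from the paper): the kernel is evaluated through log-gamma functions for numerical stability, and the sample, the evaluation point and the bandwidth $b = 0.05$ are arbitrary assumptions.

```python
import numpy as np
from math import lgamma

def dirichlet_log_pdf(x_full, alpha_full):
    # Log of the Dirichlet density (1.2); x_full and alpha_full have length d+1,
    # the last coordinates carrying 1 - ||s||_1 and the parameter beta.
    return (lgamma(sum(alpha_full)) - sum(lgamma(a) for a in alpha_full)
            + sum((a - 1.0) * np.log(x) for a, x in zip(alpha_full, x_full)))

def dirichlet_kernel_estimate(s, X, b):
    # The Dirichlet kernel estimator f_hat_{n,b}(s) from (1.3).
    # s: point in Int(S), shape (d,); X: sample of shape (n, d); b: bandwidth.
    alpha_full = np.append(s / b + 1.0, (1.0 - s.sum()) / b + 1.0)
    log_vals = [dirichlet_log_pdf(np.append(x, 1.0 - x.sum()), alpha_full) for x in X]
    return float(np.mean(np.exp(log_vals)))

# Illustrative check: the uniform density on the 2-dimensional simplex equals
# Gamma(3) = 2, so the estimate near the barycenter should be close to 2.
rng = np.random.default_rng(0)
X = rng.dirichlet(np.ones(3), size=500)[:, :2]   # uniform sample on the simplex
est = dirichlet_kernel_estimate(np.array([1/3, 1/3]), X, b=0.05)
```

Note that the estimator allocates non-negative weight everywhere on the simplex, so no explicit boundary correction is needed.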

*Corresponding author

Email address: ouimetfr@caltech.edu (Frédéric Ouimet)


2. Main results

For each result stated in this section, one of the following two assumptions will be used.

Assumptions.

• The density $f$ is Lipschitz continuous on $\mathcal{S}$. (2.1)

• The second order partial derivatives of $f$ are (uniformly) continuous on $\mathcal{S}$. (2.2)

We denote the expectation of $\hat{f}_{n,b}(s)$ by

$f_b(s) := \mathbb{E}\big[ K_{s/b+1,\,(1-\|s\|_1)/b+1}(X) \big] = \int_{\mathcal{S}} f(x)\, K_{s/b+1,\,(1-\|s\|_1)/b+1}(x)\, dx. \qquad (2.3)$

Alternatively, notice that if $\xi_s \sim \mathrm{Dirichlet}(s/b+1,\,(1-\|s\|_1)/b+1)$, then we also have the representation

$f_b(s) = \mathbb{E}[f(\xi_s)]. \qquad (2.4)$

Proposition 2.1. Under assumption (2.2), we have, uniformly for $s \in \mathcal{S}$,

$f_b(s) = f(s) + b\, g(s) + o(b), \qquad (2.5)$

as $b \to 0$, where

$g(s) := \sum_{i=1}^d (1 - (d+1)s_i) \frac{\partial}{\partial s_i} f(s) + \frac{1}{2} \sum_{i,j=1}^d s_i (\mathbb{1}_{\{i=j\}} - s_j) \frac{\partial^2}{\partial s_i \partial s_j} f(s). \qquad (2.6)$

Theorem 2.2 (Bias and variance). Assuming (2.2), we have, uniformly for $s \in \mathcal{S}$,

$\mathrm{Bias}[\hat{f}_{n,b}(s)] = f_b(s) - f(s) = b\, g(s) + o(b). \qquad (2.7)$

Furthermore, for every subset $\mathcal{J} \subseteq [d]$, denote

$\psi(s) := \psi_{\varnothing}(s) \quad \text{and} \quad \psi_{\mathcal{J}}(s) := \Big[ (4\pi)^{d-|\mathcal{J}|} \cdot (1 - \|s\|_1) \prod_{i \in [d] \setminus \mathcal{J}} s_i \Big]^{-1/2}. \qquad (2.8)$

Then, for any $s \in \mathrm{Int}(\mathcal{S})$, any $\varnothing \ne \mathcal{J} \subseteq [d]$, and any $\kappa \in (0,\infty)^d$, we have, only assuming (2.1),

$\mathrm{Var}(\hat{f}_{n,b}(s)) = \begin{cases} n^{-1} b^{-d/2} \big( \psi(s) f(s) + O_s(b^{1/2}) \big), & \text{if } s_i/b \to \infty\ \forall i \in [d] \text{ and } (1 - \|s\|_1)/b \to \infty, \\[1ex] n^{-1} b^{-(d+|\mathcal{J}|)/2} \Big( \prod_{i \in \mathcal{J}} \frac{\Gamma(2\kappa_i + 1)}{2^{2\kappa_i + 1} \Gamma^2(\kappa_i + 1)} \cdot \psi_{\mathcal{J}}(s) f(s) + O_{\kappa,s}(b^{1/2}) \Big), & \text{if } s_i/b \to \kappa_i\ \forall i \in \mathcal{J} \text{ and } s_i/b \to \infty\ \forall i \in [d] \setminus \mathcal{J} \text{ and } (1 - \|s\|_1)/b \to \infty. \end{cases} \qquad (2.9)$

Corollary 2.3 (Mean squared error). Under assumption (2.2), we have, for $s \in \mathrm{Int}(\mathcal{S})$,

$\mathrm{MSE}(\hat{f}_{n,b}(s)) = n^{-1} b^{-d/2} \psi(s) f(s) + b^2 g^2(s) + O_s(n^{-1} b^{-d/2+1/2}) + o(b^2). \qquad (2.10)$

In particular, if $f(s) \cdot g(s) \ne 0$, the asymptotically optimal choice of $b$, with respect to MSE, is

$b_{\mathrm{opt}} = n^{-2/(d+4)} \Big[ \frac{d}{4} \cdot \frac{\psi(s) f(s)}{g^2(s)} \Big]^{2/(d+4)}, \quad \text{with} \qquad (2.11)$

$\mathrm{MSE}[\hat{f}_{n,b_{\mathrm{opt}}}(s)] = n^{-4/(d+4)} \Big[ 1 + \frac{d}{4} \Big] \Big( \frac{d}{4} \Big)^{-d/(d+4)} \big( \psi(s) f(s) \big)^{4/(d+4)} \big( g^2(s) \big)^{d/(d+4)} + o_s(n^{-4/(d+4)}), \qquad (2.12)$

and, more generally, if $n^{2/(d+4)} b \to \lambda$ for some $\lambda > 0$, then

$\mathrm{MSE}[\hat{f}_{n,b}(s)] = n^{-4/(d+4)} \big[ \lambda^{-d/2} \psi(s) f(s) + \lambda^2 g^2(s) \big] + o_s(n^{-4/(d+4)}). \qquad (2.13)$
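As a quick numerical sanity check on the constants above (an illustration, not part of the paper), one can minimize the profile $\lambda \mapsto \lambda^{-d/2} V + \lambda^2 B$ on a grid, with placeholder values standing in for $V = \psi(s) f(s)$ and $B = g^2(s)$, and compare the grid minimum with the closed form:

```python
import numpy as np

d, V, B = 2, 1.3, 0.7   # placeholder dimension and values for psi(s)f(s) and g^2(s)

lam = np.linspace(0.05, 10.0, 200_000)
profile = lam ** (-d / 2) * V + lam ** 2 * B   # bracketed profile at b = lam * n^{-2/(d+4)}

lam_opt = ((d / 4) * V / B) ** (2 / (d + 4))                 # optimal lambda, cf. (2.11)
mse_opt = ((1 + d / 4) * (d / 4) ** (-d / (d + 4))
           * V ** (4 / (d + 4)) * B ** (d / (d + 4)))        # closed form, cf. (2.12)
```

The grid minimum and its location agree with the closed-form expressions to within the grid resolution.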


Theorem 2.4 (Mean integrated squared error). Under assumption (2.2), we have

$\mathrm{MISE}[\hat{f}_{n,b}] = n^{-1} b^{-d/2} \int_{\mathcal{S}} \psi(s) f(s)\, ds + b^2 \int_{\mathcal{S}} g^2(s)\, ds + o(n^{-1} b^{-d/2}) + o(b^2). \qquad (2.14)$

In particular, if $\int_{\mathcal{S}} g^2(s)\, ds > 0$, the asymptotically optimal choice of $b$, with respect to MISE, is

$b_{\mathrm{opt}} = n^{-2/(d+4)} \Big[ \frac{d}{4} \cdot \frac{\int_{\mathcal{S}} \psi(s) f(s)\, ds}{\int_{\mathcal{S}} g^2(s)\, ds} \Big]^{2/(d+4)}, \quad \text{with} \qquad (2.15)$

$\mathrm{MISE}[\hat{f}_{n,b_{\mathrm{opt}}}] = n^{-4/(d+4)} \Big[ 1 + \frac{d}{4} \Big] \Big( \frac{d}{4} \Big)^{-d/(d+4)} \Big( \int_{\mathcal{S}} \psi(s) f(s)\, ds \Big)^{4/(d+4)} \Big( \int_{\mathcal{S}} g^2(s)\, ds \Big)^{d/(d+4)} + o(n^{-4/(d+4)}), \qquad (2.16)$

and, more generally, if $n^{2/(d+4)} b \to \lambda$ for some $\lambda > 0$, then

$\mathrm{MISE}[\hat{f}_{n,b}] = n^{-4/(d+4)} \Big[ \lambda^{-d/2} \int_{\mathcal{S}} \psi(s) f(s)\, ds + \lambda^2 \int_{\mathcal{S}} g^2(s)\, ds \Big] + o(n^{-4/(d+4)}). \qquad (2.17)$

Theorem 2.5 (Uniform strong consistency). Assume (2.1). As $b \to 0$, we have

$\sup_{s \in \mathcal{S}} |f_b(s) - f(s)| = O(b^{1/2}). \qquad (2.18)$

Furthermore, define

$\mathcal{S}_{\delta} := \{ s \in \mathcal{S} : 1 - \|s\|_1 \ge \delta \text{ and } s_i \ge \delta\ \forall i \in [d] \}, \quad \delta > 0. \qquad (2.19)$

Then, if $|\log b|^2 b^{-2d} \le n$ as $n \to \infty$ and $b \to 0$, we have

$\sup_{s \in \mathcal{S}_{bd}} |\hat{f}_{n,b}(s) - f(s)| = O\big( |\log b|\, b^{-d} \sqrt{\log n / n} \big) + O(b^{1/2}), \quad \text{a.s.} \qquad (2.20)$

In particular, if $|\log b|^2 b^{-2d} = o(n / \log n)$, then $\sup_{s \in \mathcal{S}_{bd}} |\hat{f}_{n,b}(s) - f(s)| \to 0$ a.s.

Theorem 2.6 (Asymptotic normality). Assume (2.1). Let $s \in \mathrm{Int}(\mathcal{S})$ be such that $f(s) > 0$. If $n^{1/2} b^{d/4} \to \infty$ as $n \to \infty$ and $b \to 0$, then

$n^{1/2} b^{d/4} \big( \hat{f}_{n,b}(s) - f_b(s) \big) \stackrel{\mathcal{D}}{\longrightarrow} \mathcal{N}(0, \psi(s) f(s)). \qquad (2.21)$

If we also have $n^{1/2} b^{d/4+1/2} \to 0$ as $n \to \infty$ and $b \to 0$, then (2.18) of Theorem 2.5 implies

$n^{1/2} b^{d/4} \big( \hat{f}_{n,b}(s) - f(s) \big) \stackrel{\mathcal{D}}{\longrightarrow} \mathcal{N}(0, \psi(s) f(s)). \qquad (2.22)$

Independently of the above rates for $n$ and $b$, if we assume (2.2) instead and $n^{2/(d+4)} b \to \lambda$ for some $\lambda > 0$ as $n \to \infty$ and $b \to 0$, then Proposition 2.1 implies

$n^{2/(d+4)} \big( \hat{f}_{n,b}(s) - f(s) \big) \stackrel{\mathcal{D}}{\longrightarrow} \mathcal{N}\big( \lambda\, g(s),\ \lambda^{-d/2} \psi(s) f(s) \big). \qquad (2.23)$

Remark 2.7. The rate of convergence for the $d$-dimensional kernel density estimator with i.i.d. data and bandwidth $h$ is $O(n^{-1/2} h^{-d/2})$ in Theorem 3.1.15 of Prakasa Rao (1983), whereas our estimator $\hat{f}_{n,b}$ converges at a rate of $O(n^{-1/2} b^{-d/4})$. Hence, matching the two rates, the bandwidth $b$ of the Dirichlet kernel estimator plays the role of $h^2$ in the traditional setting.


3. Proof of Proposition 2.1

First, we estimate the expectation and covariances of the random vector

$\xi_s = (\xi_1, \xi_2, \dots, \xi_d) \sim \mathrm{Dirichlet}\big( s/b + 1,\ (1 - \|s\|_1)/b + 1 \big), \quad s \in \mathcal{S},\ b > 0. \qquad (3.1)$

If $0 < b \le 2^{-1}(d+1)^{-1}$, then, for all $i, j \in [d]$,

$\mathbb{E}[\xi_i] = \frac{s_i/b + 1}{1/b + d + 1} = \frac{s_i + b}{1 + b(d+1)} = s_i + b(1 - (d+1)s_i) + O(b^2), \qquad (3.2)$

$\mathrm{Cov}(\xi_i, \xi_j) = \frac{(s_i/b + 1)\big( (1/b + d + 1)\mathbb{1}_{\{i=j\}} - (s_j/b + 1) \big)}{(1/b + d + 1)^2 (1/b + d + 2)} = \frac{b(s_i + b)\big( \mathbb{1}_{\{i=j\}} - s_j + b(d+1)\mathbb{1}_{\{i=j\}} - b \big)}{(1 + b(d+1))^2 (1 + b(d+2))} = b\, s_i(\mathbb{1}_{\{i=j\}} - s_j) + O(b^2), \qquad (3.3)$

$\mathbb{E}[(\xi_i - s_i)(\xi_j - s_j)] = \mathrm{Cov}(\xi_i, \xi_j) + (\mathbb{E}[\xi_i] - s_i)(\mathbb{E}[\xi_j] - s_j) = b\, s_i(\mathbb{1}_{\{i=j\}} - s_j) + O(b^2). \qquad (3.4)$
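The expansions (3.2)-(3.3) are easy to check numerically from the exact Dirichlet moments; in the illustrative snippet below (the point $s$ and the bandwidths are arbitrary choices, not from the paper), dividing $b$ by 10 shrinks the remainder terms by roughly a factor of 100, as the $O(b^2)$ rate predicts.

```python
import numpy as np

s = np.array([0.2, 0.5])   # arbitrary point in Int(S), d = 2
d = len(s)

def remainders(b):
    # Exact moments of xi_s ~ Dirichlet(s/b + 1, (1 - ||s||_1)/b + 1), cf. (3.1).
    mean = (s / b + 1.0) / (1.0 / b + d + 1.0)
    i, j = 0, 1   # off-diagonal entry, so the indicator 1{i=j} vanishes
    cov_ij = ((s[i] / b + 1.0) * (-(s[j] / b + 1.0))
              / ((1.0 / b + d + 1.0) ** 2 * (1.0 / b + d + 2.0)))
    err_mean = float(np.abs(mean - (s + b * (1.0 - (d + 1.0) * s))).max())
    err_cov = abs(cov_ij - b * s[i] * (-s[j]))
    return err_mean, err_cov

e1, c1 = remainders(1e-2)
e2, c2 = remainders(1e-3)
```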

By a second order mean value theorem, we have

$f(\xi_s) - f(s) = \sum_{i=1}^d \frac{\partial}{\partial x_i} f(s)\, (\xi_i - s_i) + \frac{1}{2} \sum_{i,j=1}^d \frac{\partial^2}{\partial x_i \partial x_j} f(s)\, (\xi_i - s_i)(\xi_j - s_j) + \frac{1}{2} \sum_{i,j=1}^d \Big( \frac{\partial^2}{\partial x_i \partial x_j} f(\zeta_s) - \frac{\partial^2}{\partial x_i \partial x_j} f(s) \Big) (\xi_i - s_i)(\xi_j - s_j), \qquad (3.5)$

for some random vector $\zeta_s \in \mathcal{S}$ on the line segment joining $\xi_s$ and $s$. If we take the expectation in the last equation, and then use (3.2) and (3.4), we get

$\Big| f_b(s) - f(s) - b \Big( \sum_{i=1}^d (1 - (d+1)s_i) \frac{\partial}{\partial x_i} f(s) + \frac{1}{2} \sum_{i,j=1}^d s_i (\mathbb{1}_{\{i=j\}} - s_j) \frac{\partial^2}{\partial x_i \partial x_j} f(s) \Big) \Big| \le \frac{1}{2} \sum_{i,j=1}^d \mathbb{E}\Big[ \Big| \frac{\partial^2}{\partial x_i \partial x_j} f(\zeta_s) - \frac{\partial^2}{\partial x_i \partial x_j} f(s) \Big|\, |\xi_i - s_i|\, |\xi_j - s_j| \cdot \mathbb{1}_{\{\|\xi_s - s\|_1 \le \delta_{\varepsilon,d}\}} \Big] + \frac{1}{2} \sum_{i,j=1}^d \mathbb{E}\Big[ \Big| \frac{\partial^2}{\partial x_i \partial x_j} f(\zeta_s) - \frac{\partial^2}{\partial x_i \partial x_j} f(s) \Big|\, |\xi_i - s_i|\, |\xi_j - s_j| \cdot \mathbb{1}_{\{\|\xi_s - s\|_1 > \delta_{\varepsilon,d}\}} \Big] =: \Delta_1 + \Delta_2, \qquad (3.6)$

where, for any given $\varepsilon > 0$, the real number $\delta_{\varepsilon,d} \in (0,1]$ is such that $\|s' - s\|_1 \le \delta_{\varepsilon,d}$ implies $\big| \frac{\partial^2}{\partial x_i \partial x_j} f(s') - \frac{\partial^2}{\partial x_i \partial x_j} f(s) \big| < \varepsilon$, uniformly for $s, s' \in \mathcal{S}$.² The uniform continuity of $\big( \frac{\partial^2}{\partial x_i \partial x_j} f \big)_{i,j=1}^d$, the fact that $\|\zeta_s - s\|_1 \le \|\xi_s - s\|_1$, the Cauchy-Schwarz inequality and (3.4) yield

$\Delta_1 \le \frac{1}{2} \sum_{i,j=1}^d \varepsilon \cdot \sqrt{\mathbb{E}|\xi_i - s_i|^2} \sqrt{\mathbb{E}|\xi_j - s_j|^2} = \varepsilon \cdot O(b). \qquad (3.7)$

²We know that such a number exists because the second order partial derivatives of $f$ are uniformly continuous on $\mathcal{S}$.


Since $\big( \frac{\partial^2}{\partial x_i \partial x_j} f \big)_{i,j=1}^d$ are uniformly continuous, they are in particular bounded (say by $M_d > 0$). Furthermore, $\{\|\xi_s - s\|_1 > \delta_{\varepsilon,d}\}$ implies that at least one component of $(\xi_k - s_k)_{k=1}^d$ is larger than $\delta_{\varepsilon,d}/d$, so a union bound over $k$ followed by $d$ concentration bounds for the beta distribution³ (see e.g. Lemma A.1 when $d = 1$) yield

$\Delta_2 \le \frac{1}{2} \sum_{i,j=1}^d 2 M_d \cdot 1^2 \cdot \sum_{k=1}^d \mathbb{P}\big( |\xi_k - s_k| \ge \delta_{\varepsilon,d}/d \big) \le d^3 M_d \cdot 2 \exp\Big( - \frac{(\delta_{\varepsilon,d}/d)^2}{2 \cdot \frac{1}{4}(1/b + d + 2)^{-1}} \Big). \qquad (3.8)$

If we choose a sequence $\varepsilon = \varepsilon(b) \searrow 0$ that goes to 0 slowly enough that $\delta_{\varepsilon,d} > d \sqrt{b \cdot |\log b|}$, then $\Delta_1 + \Delta_2$ in (3.6) is $o(b)$ by (3.7) and (3.8). This ends the proof of Proposition 2.1.

4. Proof of Theorem 2.2

The expression for the bias is a trivial consequence of Proposition 2.1. In order to compute the asymptotics of the variance, we only assume that $f$ is Lipschitz on $\mathcal{S}$. First, note that

$\hat{f}_{n,b}(s) - f_b(s) = \frac{1}{n} \sum_{i=1}^n Y_{i,b}(s), \qquad (4.1)$

where the random variables

$Y_{i,b}(s) := K_{s/b+1,\,(1-\|s\|_1)/b+1}(X_i) - f_b(s), \quad 1 \le i \le n, \text{ are i.i.d.} \qquad (4.2)$

Hence, if $\gamma_s \sim \mathrm{Dirichlet}(2s/b + 1,\ 2(1 - \|s\|_1)/b + 1)$, then

$\mathrm{Var}[\hat{f}_{n,b}(s)] = n^{-1} \mathbb{E}\big[ K_{s/b+1,\,(1-\|s\|_1)/b+1}(X)^2 \big] - n^{-1} (f_b(s))^2 = n^{-1} A_b(s)\, \mathbb{E}[f(\gamma_s)] - O(n^{-1}) = n^{-1} A_b(s) \big( f(s) + O(b^{1/2}) \big) - O(n^{-1}), \qquad (4.3)$

where

$A_b(s) := \frac{\Gamma(2(1 - \|s\|_1)/b + 1) \prod_{i=1}^d \Gamma(2 s_i/b + 1)}{\Gamma^2((1 - \|s\|_1)/b + 1) \prod_{i=1}^d \Gamma^2(s_i/b + 1)} \cdot \frac{\Gamma^2(1/b + d + 1)}{\Gamma(2/b + d + 1)}, \qquad (4.4)$

and where the last line in (4.3) follows from the Lipschitz continuity of $f$, the Cauchy-Schwarz inequality and the analogue of (3.4) for $\gamma_s$:

$\mathbb{E}[f(\gamma_s)] - f(s) = \sum_{i=1}^d O\big( \mathbb{E}|\gamma_i - s_i| \big) \le \sum_{i=1}^d O\Big( \sqrt{\mathbb{E}|\gamma_i - s_i|^2} \Big) = O(b^{1/2}). \qquad (4.5)$

The conclusion of Theorem 2.2 follows from (4.3) and Lemma 4.1 below.

³The $k$th component of a $\mathrm{Dirichlet}(\alpha, \beta)$ random vector has a $\mathrm{Beta}(\alpha_k,\ \|\alpha\|_1 - \alpha_k + \beta)$ distribution.


Lemma 4.1. As $b \to 0$, we have, uniformly for $s \in \mathcal{S}$,

$0 < A_b(s) \le \frac{b^{(d+1)/2} (1/b + d)^{d+1/2}}{(4\pi)^{d/2} \sqrt{s_1 s_2 \cdots s_d (1 - \|s\|_1)}}\, (1 + O(b)). \qquad (4.6)$

Furthermore, for any $\varnothing \ne \mathcal{J} \subseteq [d]$, and any $\kappa \in (0,\infty)^d$,

$A_b(s) = \begin{cases} b^{-d/2} \psi(s) (1 + O_s(b)), & \text{if } s_i/b \to \infty\ \forall i \in [d] \text{ and } (1 - \|s\|_1)/b \to \infty, \\[1ex] b^{-(d+|\mathcal{J}|)/2} \prod_{i \in \mathcal{J}} \frac{\Gamma(2\kappa_i + 1)}{2^{2\kappa_i + 1} \Gamma^2(\kappa_i + 1)} \cdot \psi_{\mathcal{J}}(s) (1 + O_{\kappa,s}(b)), & \text{if } s_i/b \to \kappa_i\ \forall i \in \mathcal{J} \text{ and } s_i/b \to \infty\ \forall i \in [d] \setminus \mathcal{J} \text{ and } (1 - \|s\|_1)/b \to \infty, \end{cases} \qquad (4.7)$

where $\psi$ and $\psi_{\mathcal{J}}$ are defined as in (2.8).

Proof. If we denote

$S_b(s) := \frac{R^2((1 - \|s\|_1)/b + 1) \prod_{i=1}^d R^2(s_i/b + 1)}{R(2(1 - \|s\|_1)/b + 1) \prod_{i=1}^d R(2 s_i/b + 1)} \cdot \frac{R(2/b + d + 1)}{R^2(1/b + d + 1)}, \qquad (4.8)$

where

$R(z) := \frac{\sqrt{2\pi}\, e^{-z} z^{z+1/2}}{\Gamma(z+1)}, \quad z \ge 0, \qquad (4.9)$

then, for all $s \in \mathrm{Int}(\mathcal{S})$, we have

$A_b(s) = \frac{2^{2(1-\|s\|_1)/b + 1/2} \prod_{i=1}^d 2^{2 s_i/b + 1/2}}{(2\pi)^{(d+1)/2} \sqrt{(1-\|s\|_1)/b}\, \prod_{i=1}^d \sqrt{s_i/b}} \cdot \frac{\sqrt{2\pi}\, e^{-d} (1/b + d)^{2/b + 2d + 1}}{(2/b + d)^{2/b + d + 1/2}} \cdot S_b(s) = \frac{b^{(d+1)/2} (1/b + d)^{d+1/2}}{(4\pi)^{d/2} \sqrt{s_1 s_2 \cdots s_d (1 - \|s\|_1)}} \cdot \Big( \frac{2/b + 2d}{2/b + d} \Big)^{2/b + d + 1/2} e^{-d} \cdot S_b(s). \qquad (4.10)$

It was shown in Lemma 3 of Brown & Chen (1999) that $R(z) < 1$ for all $z \ge 0$. Together with the fact that $z \mapsto R(z)$ is increasing on $(1,\infty)$,⁴ we see that

$\max_{s \in \mathcal{S}} S_b(s) \le \frac{R(2/b + d + 1)}{R^2(1/b + d + 1)} \to 1, \quad \text{as } b \to 0, \qquad (4.11)$

from which (4.6) follows.

To prove the second claim of the lemma, note that $S_b(s) = 1 + O_s(b)$ by Stirling's formula, so as $s_i/b \to \infty\ \forall i \in [d]$ and $(1 - \|s\|_1)/b \to \infty$, we have, from (4.10),

$A_b(s) = \frac{b^{-d/2} (1 + O(b))}{(4\pi)^{d/2} \sqrt{s_1 s_2 \cdots s_d (1 - \|s\|_1)}} \cdot \Big( 1 + \frac{d}{2/b + d} \Big)^{2/b + d} e^{-d} \cdot (1 + O_s(b)) = \frac{b^{-d/2} (1 + O_s(b))}{(4\pi)^{d/2} \sqrt{s_1 s_2 \cdots s_d (1 - \|s\|_1)}}. \qquad (4.12)$

⁴By the standard relation $(\Gamma'/\Gamma)(z+1) = 1/z + (\Gamma'/\Gamma)(z)$ and Lemma 2 in Minc & Sathre (1964), we have

$\frac{d}{dz} \log R(z) = \log z + \frac{1}{2z} - \frac{\Gamma'(z+1)}{\Gamma(z+1)} = \log z - \frac{1}{2z} - \frac{\Gamma'(z)}{\Gamma(z)} > 0, \quad \text{for all } z > 1,$

which means that $z \mapsto R(z)$ is increasing on $(1,\infty)$.
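Both facts about $R(z)$ used in this proof are easy to confirm numerically (an illustrative check, with an arbitrary grid of points): Stirling's series gives $\Gamma(z+1) = \sqrt{2\pi}\, z^{z+1/2} e^{-z} e^{1/(12z) - \cdots} > \sqrt{2\pi}\, z^{z+1/2} e^{-z}$, so $R(z) < 1$, and $R(z)$ increases toward 1.

```python
from math import lgamma, log, pi, exp

def R(z):
    # R(z) from (4.9): sqrt(2*pi) * e^{-z} * z^{z+1/2} / Gamma(z+1),
    # computed on the log scale to avoid overflow for large z.
    return exp(0.5 * log(2 * pi) - z + (z + 0.5) * log(z) - lgamma(z + 1))

zs = [1.5, 2.0, 3.0, 5.0, 10.0, 50.0, 200.0, 1000.0]
vals = [R(z) for z in zs]
```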


Next, let $\varnothing \ne \mathcal{J} \subseteq [d]$ and $\kappa \in (0,\infty)^d$. If $s_i/b \to \kappa_i$ for all $i \in \mathcal{J}$, $s_i/b \to \infty$ for all $i \in [d] \setminus \mathcal{J}$, and $(1 - \|s\|_1)/b \to \infty$, then, from (4.4),

$A_b(s) = \prod_{i \in \mathcal{J}} \frac{\Gamma(2\kappa_i + 1)}{\Gamma^2(\kappa_i + 1)} (1 + O_{\kappa,s}(b)) \cdot \frac{2^{2(1-\|s\|_1)/b + 1/2} \prod_{i \in [d] \setminus \mathcal{J}} 2^{2 s_i/b + 1/2}}{(2\pi)^{(d - |\mathcal{J}| + 1)/2} \sqrt{(1-\|s\|_1)/b}\, \prod_{i \in [d] \setminus \mathcal{J}} \sqrt{s_i/b}} \cdot \frac{\sqrt{2\pi}\, e^{-d} (1/b + d)^{2/b + 2d + 1}}{(2/b + d)^{2/b + d + 1/2}} \cdot S_b^{\mathcal{J}}(s) = \prod_{i \in \mathcal{J}} \frac{\Gamma(2\kappa_i + 1)}{\Gamma^2(\kappa_i + 1)} (1 + O_{\kappa,s}(b)) \cdot \frac{b^{(d - |\mathcal{J}| + 1)/2} (1/b + d)^{d+1/2}}{2^{d/2} \prod_{i \in \mathcal{J}} 2^{2\kappa_i + 1/2} (2\pi)^{(d - |\mathcal{J}|)/2} \sqrt{(1 - \|s\|_1) \prod_{i \in [d] \setminus \mathcal{J}} s_i}} \cdot \Big( \frac{2/b + 2d}{2/b + d} \Big)^{2/b + d + 1/2} e^{-d} \cdot S_b^{\mathcal{J}}(s), \qquad (4.13)$

where

$S_b^{\mathcal{J}}(s) := \frac{R^2((1 - \|s\|_1)/b + 1) \prod_{i \in [d] \setminus \mathcal{J}} R^2(s_i/b + 1)}{R(2(1 - \|s\|_1)/b + 1) \prod_{i \in [d] \setminus \mathcal{J}} R(2 s_i/b + 1)} \cdot \frac{R(2/b + d + 1)}{R^2(1/b + d + 1)}. \qquad (4.14)$

Similarly to (4.12), Stirling's formula implies

$A_b(s) = \prod_{i \in \mathcal{J}} \frac{\Gamma(2\kappa_i + 1)}{2^{2\kappa_i + 1} \Gamma^2(\kappa_i + 1)} \cdot \frac{b^{-(d+|\mathcal{J}|)/2} (1 + O_{\kappa,s}(b))}{(4\pi)^{(d - |\mathcal{J}|)/2} \sqrt{(1 - \|s\|_1) \prod_{i \in [d] \setminus \mathcal{J}} s_i}}, \qquad (4.15)$

which concludes the proof of Lemma 4.1 and Theorem 2.2.
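For $d = 1$, the first case of (4.7) reduces to $A_b(s) \approx b^{-1/2} \psi(s)$ with $\psi(s) = [4\pi s(1-s)]^{-1/2}$, the familiar variance constant of the beta kernel estimator of Chen (1999). This can be checked numerically from the exact expression (4.4) via log-gamma evaluations (an illustrative check; the point $s$ and bandwidth below are arbitrary):

```python
from math import lgamma, exp, log, pi, sqrt

def log_A_b(s1, b):
    # log A_b(s) from (4.4) with d = 1, on the log scale for numerical stability.
    x, y = s1 / b, (1.0 - s1) / b
    return (lgamma(2 * y + 1) + lgamma(2 * x + 1)
            - 2 * lgamma(y + 1) - 2 * lgamma(x + 1)
            + 2 * lgamma(1 / b + 2) - lgamma(2 / b + 2))

s1, b = 0.5, 1e-4
psi = 1.0 / sqrt(4 * pi * s1 * (1.0 - s1))        # psi(s) from (2.8) with d = 1
ratio = exp(log_A_b(s1, b) + 0.5 * log(b)) / psi  # b^{1/2} A_b(s) / psi(s) = 1 + O(b)
```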

5. Proof of Theorem 2.4

Proof. By the bound (4.6), the fact that $f$ is uniformly bounded (it is continuous on $\mathcal{S}$), the almost-everywhere convergence in (4.12), and the dominated convergence theorem, we have

$b^{d/2} \int_{\mathcal{S}} A_b(s) f(s)\, ds = \int_{\mathcal{S}} \psi(s) f(s)\, ds + o(1). \qquad (5.1)$

The expression for the variance in (4.3), and the expression for the bias in (2.7), yield

$\mathrm{MISE}[\hat{f}_{n,b}] = \int_{\mathcal{S}} \Big( \mathrm{Var}(\hat{f}_{n,b}(s)) + \mathrm{Bias}[\hat{f}_{n,b}(s)]^2 \Big)\, ds = n^{-1} b^{-d/2} \int_{\mathcal{S}} \psi(s) f(s)\, ds + b^2 \int_{\mathcal{S}} g^2(s)\, ds + o(n^{-1} b^{-d/2}) + o(b^2). \qquad (5.2)$

This ends the proof.

6. Proof of Theorem 2.5

This is the most technical proof, so here is the idea. The first three lemmas below bound, uniformly, the Dirichlet density (Lemma 6.1), the partial derivatives of the Dirichlet density with respect to each of its $d + 1$ parameters (Lemma 6.2), and then the absolute difference of densities (pointwise and under expectations) that have different parameters (Lemma 6.3). This is then used to show continuity estimates for the random field $s \mapsto Y_{i,b}(s)$ from (4.2) (Proposition 6.4), meaning that we get a control on the probability that $Y_{i,b}(s)$ and $Y_{i,b}(s')$ are far from each other, and in turn on the supremum of $Y_{i,b}(s')$ over points $s'$ that are inside a small hypercube of width $2b$ centered at $s$ (Corollary 6.5). Since $\hat{f}_{n,b}(s) - f_b(s) = \frac{1}{n} \sum_{i=1}^n Y_{i,b}(s)$, we can estimate tail probabilities for the global supremum of $|\hat{f}_{n,b} - f_b|$ by a union bound over suprema on the small hypercubes and apply a large deviation bound from Corollary 6.5 for each one of them.

The first lemma bounds the density of the Dirichlet(α, β) distribution from (1.2).

Lemma 6.1. If $\alpha_1, \dots, \alpha_d, \beta \ge 2$, then

$\sup_{s \in \mathcal{S}} K_{\alpha,\beta}(s) \le \sqrt{\frac{\|\alpha\|_1 + \beta - 1}{(\beta - 1) \prod_{i=1}^d (\alpha_i - 1)}}\; (\|\alpha\|_1 + \beta - d - 1)^d. \qquad (6.1)$

Proof. The Dirichlet density is well-known to be maximized at $s^{\star} = \frac{\alpha - 1}{\|\alpha\|_1 + \beta - d - 1}$. At this point, we have

$K_{\alpha,\beta}(s^{\star}) = \frac{\Gamma(\|\alpha\|_1 + \beta)}{\Gamma(\beta) \prod_{i=1}^d \Gamma(\alpha_i)} \cdot \frac{(\beta - 1)^{\beta - 1} \prod_{i=1}^d (\alpha_i - 1)^{\alpha_i - 1}}{(\|\alpha\|_1 + \beta - d - 1)^{\|\alpha\|_1 + \beta - d - 1}}. \qquad (6.2)$

From Theorem 2.2 in Batir (2017), we know that

$\sqrt{2\pi}\, e^{-y+1} (y-1)^{y - 1/2} \le \Gamma(y) \le \frac{7}{5} \sqrt{2\pi}\, e^{-y+1} (y-1)^{y - 1/2}, \quad \text{for all } y \ge 2, \qquad (6.3)$

so (6.2) is

$\le \frac{\frac{7}{5} \sqrt{2\pi}\, e^{-\|\alpha\|_1 - \beta + 1} (\|\alpha\|_1 + \beta - 1)^{\|\alpha\|_1 + \beta - 1/2}}{\sqrt{2\pi}\, e^{-\beta+1} (\beta - 1)^{\beta - 1/2} \prod_{i=1}^d \sqrt{2\pi}\, e^{-\alpha_i + 1} (\alpha_i - 1)^{\alpha_i - 1/2}} \cdot \frac{(\beta - 1)^{\beta - 1} \prod_{i=1}^d (\alpha_i - 1)^{\alpha_i - 1}}{(\|\alpha\|_1 + \beta - d - 1)^{\|\alpha\|_1 + \beta - d - 1}} = \frac{7}{5} (2\pi)^{-d/2} e^{-d} \Big( 1 - \frac{d}{\|\alpha\|_1 + \beta - 1} \Big)^{-(\|\alpha\|_1 + \beta - 1)} \cdot \sqrt{\frac{\|\alpha\|_1 + \beta - 1}{(\beta - 1) \prod_{i=1}^d (\alpha_i - 1)}} \cdot (\|\alpha\|_1 + \beta - d - 1)^d \le \frac{7}{5} (2\pi)^{-d/2} e^{\frac{2}{5} d} \cdot \sqrt{\frac{\|\alpha\|_1 + \beta - 1}{(\beta - 1) \prod_{i=1}^d (\alpha_i - 1)}} \cdot (\|\alpha\|_1 + \beta - d - 1)^d \le \sqrt{\frac{\|\alpha\|_1 + \beta - 1}{(\beta - 1) \prod_{i=1}^d (\alpha_i - 1)}} \cdot (\|\alpha\|_1 + \beta - d - 1)^d, \qquad (6.4)$

where we used our assumption that $\alpha_1, \dots, \alpha_d, \beta \ge 2$ and the fact that $(1 - y)^{-1} \le e^{\frac{7}{5} y}$ for $y \in [0, 1/2]$ to obtain the second inequality. The conclusion follows.
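For $d = 1$ the bound (6.1) concerns the $\mathrm{Beta}(\alpha, \beta)$ density, so it can be compared with a grid maximum directly (an illustrative check; the parameter pairs, all with $\alpha, \beta \ge 2$ as the lemma requires, are arbitrary):

```python
from math import lgamma, exp, log, sqrt

def beta_pdf(x, a, b):
    # Beta(a, b) density, i.e. the Dirichlet density (1.2) with d = 1.
    return exp(lgamma(a + b) - lgamma(a) - lgamma(b)
               + (a - 1.0) * log(x) + (b - 1.0) * log(1.0 - x))

def sup_bound(a, b):
    # Right-hand side of (6.1) with d = 1.
    return sqrt((a + b - 1.0) / ((b - 1.0) * (a - 1.0))) * (a + b - 2.0)

grid = [k / 1000.0 for k in range(1, 1000)]
checks = [(max(beta_pdf(x, a, b) for x in grid), sup_bound(a, b))
          for (a, b) in [(2.0, 2.0), (3.0, 7.0), (12.5, 4.0), (40.0, 40.0)]]
```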

In the next lemma, we bound the partial derivatives of the Dirichlet(α, β) density function with respect to its parameters.

Lemma 6.2. If $\alpha_1, \dots, \alpha_d, \beta \ge 2$, then

$\sup_{s \in \mathcal{S}} \Big| \frac{d}{d\alpha_j} K_{\alpha,\beta}(s) \Big| \le \big\{ 2 |\log(\|\alpha\|_1 + \beta)| + |\log s_j| \big\} \cdot \sqrt{\frac{\|\alpha\|_1 + \beta - 1}{(\beta - 1) \prod_{i=1}^d (\alpha_i - 1)}}\; (\|\alpha\|_1 + \beta - d - 1)^d, \qquad (6.5)$

$\sup_{s \in \mathcal{S}} \Big| \frac{d}{d\beta} K_{\alpha,\beta}(s) \Big| \le \big\{ 2 |\log(\|\alpha\|_1 + \beta)| + |\log(1 - \|s\|_1)| \big\} \cdot \sqrt{\frac{\|\alpha\|_1 + \beta - 1}{(\beta - 1) \prod_{i=1}^d (\alpha_i - 1)}}\; (\|\alpha\|_1 + \beta - d - 1)^d. \qquad (6.6)$


Proof. The digamma function $\psi(z) := \Gamma'(z)/\Gamma(z)$ satisfies $\psi(z) < \log(z)$ for all $z \ge 1$ (see e.g. Lemma 2 in Minc & Sathre (1964)), so we have

$\Big| \frac{d}{d\alpha_j} K_{\alpha,\beta}(s) \Big| = \big| \psi(\|\alpha\|_1 + \beta) - \psi(\alpha_j) + \log s_j \big|\, K_{\alpha,\beta}(s) \le \big\{ 2 |\log(\|\alpha\|_1 + \beta)| + |\log s_j| \big\}\, K_{\alpha,\beta}(s). \qquad (6.7)$

The conclusion (6.5) follows from Lemma 6.1. The proof of (6.6) is virtually the same.

As a consequence of Lemma 6.2 and the multivariate mean value theorem, we can control the difference of two Dirichlet densities with different parameters, pointwise and under expectations.

Lemma 6.3. If $\alpha_1, \dots, \alpha_d, \beta, \alpha'_1, \dots, \alpha'_d, \beta' \ge 2$, and $X$ is $F$ distributed with a bounded density $f$ supported on $\mathcal{S}$, then

$\mathbb{E}\big| K_{\alpha',\beta'}(X) - K_{\alpha,\beta}(X) \big| \le C\, d\, \|f\|_{\infty} \sqrt{\frac{\|\alpha\|_1 + \beta - 1}{(\beta - 1) \prod_{i=1}^d (\alpha_i - 1)}}\; \big( \|\alpha \vee \alpha'\|_1 + (\beta \vee \beta') - d - 1 \big)^d \cdot \log\big( \|\alpha \vee \alpha'\|_1 + (\beta \vee \beta') \big) \cdot \|(\alpha', \beta') - (\alpha, \beta)\|_{\infty}, \qquad (6.8)$

where $C > 0$ is a universal constant. Furthermore, if

$\mathcal{S}_{\delta} := \{ s \in \mathcal{S} : 1 - \|s\|_1 \ge \delta \text{ and } s_i \ge \delta\ \forall i \in [d] \}, \quad \delta > 0, \qquad (6.9)$

then we have

$\max_{s \in \mathcal{S}_{\delta}} \big| K_{\alpha',\beta'}(s) - K_{\alpha,\beta}(s) \big| \le \widetilde{C}\, d\, \|f\|_{\infty} |\log \delta| \cdot \sqrt{\frac{\|\alpha\|_1 + \beta - 1}{(\beta - 1) \prod_{i=1}^d (\alpha_i - 1)}}\; \big( \|\alpha \vee \alpha'\|_1 + (\beta \vee \beta') - d - 1 \big)^d \cdot \log\big( \|\alpha \vee \alpha'\|_1 + (\beta \vee \beta') \big) \cdot \|(\alpha', \beta') - (\alpha, \beta)\|_{\infty}, \qquad (6.10)$

where $\widetilde{C} > 0$ is another universal constant.

Proof. By the multivariate mean value theorem and Lemma 6.2,

$\mathbb{E}\big| K_{\alpha',\beta'}(X) - K_{\alpha,\beta}(X) \big| \le \|f\|_{\infty} \sqrt{\frac{\|\alpha\|_1 + \beta - 1}{(\beta - 1) \prod_{i=1}^d (\alpha_i - 1)}}\; \big( \|\alpha \vee \alpha'\|_1 + (\beta \vee \beta') - d - 1 \big)^d \log\big( \|\alpha \vee \alpha'\|_1 + (\beta \vee \beta') \big) \cdot \|(\alpha', \beta') - (\alpha, \beta)\|_{\infty} \Big\{ \int_{\mathcal{S}} |\log(1 - \|s\|_1)|\, ds + \sum_{j=1}^d \int_{\mathcal{S}} |\log s_j|\, ds \Big\}. \qquad (6.11)$

The integrals are bounded by 1. Indeed, if $\widetilde{\mathcal{S}}$ denotes a $(d-1)$-dimensional simplex, then

$\int_{\mathcal{S}} |\log s_j|\, ds = \int_0^1 |\log s_j| \cdot \underbrace{\mathrm{Vol}_{d-1}\big( (1 - s_j) \widetilde{\mathcal{S}} \big)}_{\le\, 1}\, ds_j \le \int_0^1 |\log s_j|\, ds_j = 1. \qquad (6.12)$

The second claim is obvious, again by the multivariate mean value theorem and Lemma 6.2.
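The fact used in (6.12), $\int_0^1 |\log s|\, ds = 1$ (integration by parts gives $[s - s \log s]_0^1 = 1$), can be confirmed with a quick midpoint-rule computation (illustrative check):

```python
from math import log

N = 1_000_000
# Midpoint rule for the integral of |log s| = -log s over (0, 1).
approx = sum(-log((k + 0.5) / N) for k in range(N)) / N
```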

Proposition 6.4 (Continuity estimates). Let $b > 0$ and recall from (4.2):

$Y_{i,b}(s) := K_{s/b+1,\,(1-\|s\|_1)/b+1}(X_i) - \mathbb{E}\big[ K_{s/b+1,\,(1-\|s\|_1)/b+1}(X_i) \big], \quad 1 \le i \le n. \qquad (6.13)$

If $s \in \mathcal{S}_{b(d+1)}$ and $|\log b|^2 b^{-2d} \le n$, then we have, for any $h \in \mathbb{R}$ and $a \ge 1$,

$\mathbb{P}\Big( \sup_{s' \in s + [-b,b]^d} \frac{1}{n} \sum_{i=1}^n Y_{i,b}(s') \ge h + a,\ \frac{1}{n} \sum_{i=1}^n Y_{i,b}(s) \le h \Big) \le \exp\Big( - c_{f,d} \cdot \frac{n\, b^{2d} a^2}{|\log b|^2} \Big), \qquad (6.14)$

where $c_{f,d} > 0$ depends only on the density $f$ and the dimension $d$.

Proof. Clearly, the probability in (6.14) is

$\le \mathbb{P}\Big( \sup_{s' \in s + [-b,b]^d} \frac{1}{n} \sum_{i=1}^n \big( Y_{i,b}(s') - Y_{i,b}(s) \big) \ge a \Big). \qquad (6.15)$

The main idea of the proof now is to decompose the supremum with a chaining argument and apply concentration bounds on the increments at each level of the $d$-dimensional tree. With the notation $\mathcal{H}_k := 2^{-k} \cdot b \mathbb{Z}^d$, we have the embedded sequence of lattice points

$\mathcal{H}_0 \subseteq \mathcal{H}_1 \subseteq \cdots \subseteq \mathcal{H}_k \subseteq \cdots \subseteq \mathbb{R}^d. \qquad (6.16)$

Hence, for $s \in \mathcal{S}_{b(d+1)}$ fixed, and for any $s' \in s + [-b,b]^d$, let $(s_k)_{k \in \mathbb{N}_0}$ be a sequence that satisfies

$s_0 = s, \quad s_k - s \in \mathcal{H}_k \cap [-b,b]^d, \quad \lim_{k \to \infty} \|s_k - s'\|_{\infty} = 0, \qquad (6.17)$

and

$(s_{k+1})_i = (s_k)_i \pm 2^{-k-1} b, \quad \text{for all } i \in [d]. \qquad (6.18)$

Since the map $s \mapsto \frac{1}{n} \sum_{i=1}^n Y_{i,b}(s)$ is almost-surely continuous,

$\frac{1}{n} \sum_{i=1}^n \big( Y_{i,b}(s') - Y_{i,b}(s) \big) = \sum_{k=0}^{\infty} \frac{1}{n} \sum_{i=1}^n \big( Y_{i,b}(s_{k+1}) - Y_{i,b}(s_k) \big), \qquad (6.19)$

and since $\sum_{k=0}^{\infty} \frac{1}{2(k+1)^2} \le 1$, we have the inclusion of events

$\Big\{ \sup_{s' \in s + [-b,b]^d} \frac{1}{n} \sum_{i=1}^n \big( Y_{i,b}(s') - Y_{i,b}(s) \big) \ge a \Big\} \subseteq \bigcup_{k=0}^{\infty} \bigcup_{\substack{s_1 \in s + \mathcal{H}_k \cap (-b,b)^d \\ (s_2)_i = (s_1)_i \pm 2^{-k-1} b\ \forall i \in [d]}} \Big\{ \Big| \frac{1}{n} \sum_{i=1}^n \big( Y_{i,b}(s_2) - Y_{i,b}(s_1) \big) \Big| \ge \frac{a}{2(k+1)^2} \Big\}. \qquad (6.20)$

By a union bound and the fact that $|\mathcal{H}_k \cap (-b,b)^d| \le 2^{(k+1)d}$, the probability in (6.15) is

$\le \sum_{k=0}^{\infty} 2^{(k+1)d} \cdot 2^d \sup_{\substack{s_1 \in s + \mathcal{H}_k \cap (-b,b)^d \\ (s_2)_i = (s_1)_i \pm 2^{-k-1} b\ \forall i \in [d]}} \mathbb{P}\Big( \Big| \frac{1}{n} \sum_{i=1}^n \big( Y_{i,b}(s_2) - Y_{i,b}(s_1) \big) \Big| \ge \frac{a}{2(k+1)^2} \Big). \qquad (6.21)$

By Azuma's inequality (see e.g. Theorem 1.3.1 in Steele (1997)), Lemma 6.3 (note that $s \in \mathcal{S}_{b(d+1)}$ and $s' \in s + [-b,b]^d$ imply $s' \in \mathcal{S}_b$) and (6.13), the above is

$\le \sum_{k=0}^{\infty} 2^{(k+2)d} \cdot 2 \exp\Big( - \frac{n a^2}{8(k+1)^4} \cdot \big( C_{f,d}\, b^{-d} \log(b^{-1} + d + 1)\, 2^{-k-1} \big)^{-2} \Big) \le \sum_{k=0}^{\infty} 2^{(k+2)d} \cdot 2 \exp\Big( - \frac{2^{2k-1}}{C_{f,d}^2 (k+1)^4} \cdot \frac{n\, b^{2d} a^2}{|\log b|^2} \Big), \qquad (6.22)$

for some constant $C_{f,d} > 0$ that only depends on $f$ and $d$. Assuming $|\log b|^2 b^{-2d} \le n$, this is

$\le \exp\Big( - c_{f,d} \cdot \frac{n\, b^{2d} a^2}{|\log b|^2} \Big), \qquad (6.23)$

where $c_{f,d} > 0$ depends only on $f$ and $d$.


Corollary 6.5 (Large deviation estimates). If $s \in \mathcal{S}_{b(d+1)}$ and $|\log b|^2 b^{-2d} \le n$, then we have, for any $a \ge 1$,

$\mathbb{P}\Big( \sup_{s' \in s + [-b,b]^d} \frac{1}{n} \sum_{i=1}^n Y_{i,b}(s') \ge 2a \Big) \le \exp\Big( - c_{f,d} \cdot \frac{n\, b^{2d} a^2}{|\log b|^2} \Big), \qquad (6.24)$

where $c_{f,d} > 0$ depends only on the density $f$ and the dimension $d$.

Proof. By a union bound, the probability in (6.24) is

$\le \mathbb{P}\Big( \sup_{s' \in s + [-b,b]^d} \frac{1}{n} \sum_{i=1}^n Y_{i,b}(s') \ge 2a,\ \frac{1}{n} \sum_{i=1}^n Y_{i,b}(s) \le a \Big) + \mathbb{P}\Big( \frac{1}{n} \sum_{i=1}^n Y_{i,b}(s) \ge a \Big). \qquad (6.25)$

The first probability is bounded with Proposition 6.4 with $h = a$. We get the same bound on the second probability by applying Azuma's inequality and Lemma 6.3, as we did in (6.22).

We are now ready to prove Theorem 2.5. On the one hand, the Lipschitz continuity of $f$, Jensen's inequality and (3.4) imply that, uniformly for $s \in \mathcal{S}$,

$f_b(s) - f(s) = \mathbb{E}[f(\xi_s)] - f(s) = \sum_{i=1}^d O\big( \mathbb{E}|\xi_i - s_i| \big) \le \sum_{i=1}^d O\Big( \sqrt{\mathbb{E}|\xi_i - s_i|^2} \Big) = O(b^{1/2}). \qquad (6.26)$

On the other hand, recall from (4.1) that

$\hat{f}_{n,b}(s) - f_b(s) = \frac{1}{n} \sum_{i=1}^n Y_{i,b}(s), \quad \text{where } Y_{i,b}(s) := K_{s/b+1,\,(1-\|s\|_1)/b+1}(X_i) - f_b(s). \qquad (6.27)$

By a union bound over the suprema on hypercubes of width $2b$ centered at each $s \in 2b\mathbb{Z}^d \cap \mathcal{S}_{b(d+1)}$, and the large deviation estimates in Corollary 6.5 (assuming $|\log b|^2 b^{-2d} \le n$), we have

$\mathbb{P}\Big( \sup_{s \in \mathcal{S}_{bd}} |\hat{f}_{n,b}(s) - f_b(s)| > 2a \Big) \le \sum_{s \in 2b\mathbb{Z}^d \cap \mathcal{S}_{b(d+1)}} \mathbb{P}\Big( \sup_{s' \in s + [-b,b]^d} \Big| \frac{1}{n} \sum_{i=1}^n Y_{i,b}(s') \Big| > 2a \Big) \le b^{-d} \cdot \exp\Big( - c_{f,d} \cdot \frac{n\, b^{2d} (2a)^2}{|\log b|^2} \Big). \qquad (6.28)$

With the choice $a = (c_{f,d})^{-1/2} |\log b|\, b^{-d} \sqrt{\log n / n}$, the right-hand side of (6.28) is equal to $\exp(-4 \log n + d |\log b|)$, which is summable in $n$, so the Borel-Cantelli lemma yields

$\sup_{s \in \mathcal{S}_{bd}} |\hat{f}_{n,b}(s) - f_b(s)| = O\big( |\log b|\, b^{-d} \sqrt{\log n / n} \big), \quad \text{a.s.} \qquad (6.29)$

Together with (6.26), the conclusion follows.


7. Proof of Theorem 2.6

By (6.27), the asymptotic normality of $n^{1/2} b^{d/4} (\hat{f}_{n,b}(s) - f_b(s))$ will follow if we verify the Lindeberg condition for double arrays: for every $\varepsilon > 0$,

$s_b^{-2}\, \mathbb{E}\big[ |Y_{1,b}(s)|^2\, \mathbb{1}_{\{|Y_{1,b}(s)| > \varepsilon n^{1/2} s_b\}} \big] \longrightarrow 0, \quad \text{as } n \to \infty, \qquad (7.1)$

where $s_b^2 := \mathbb{E}|Y_{1,b}(s)|^2$ and $b = b(n) \to 0$. From Lemma 6.1, we know that

$|Y_{1,b}(s)| = O\big( \psi(s)\, b^{d/2} \cdot b^{-d} \big) = O_s(b^{-d/2}), \qquad (7.2)$

and we also know that $s_b = b^{-d/4} \sqrt{\psi(s) f(s)}\, (1 + o_s(1))$ when $f$ is Lipschitz continuous, by the proof of Theorem 2.2, so

$\frac{|Y_{1,b}(s)|}{n^{1/2} s_b} = O_s(n^{-1/2} b^{d/4} b^{-d/2}) = O_s(n^{-1/2} b^{-d/4}) \longrightarrow 0, \qquad (7.3)$

whenever $n^{1/2} b^{d/4} \to \infty$ as $n \to \infty$ and $b \to 0$. Under this condition, (7.1) holds and thus

$n^{1/2} b^{d/4} \big( \hat{f}_{n,b}(s) - f_b(s) \big) = n^{1/2} b^{d/4} \cdot \frac{1}{n} \sum_{i=1}^n Y_{i,b}(s) \stackrel{\mathcal{D}}{\longrightarrow} \mathcal{N}(0, \psi(s) f(s)). \qquad (7.4)$

This completes the proof of Theorem 2.6.

A. Supplemental material

The sub-Gaussianity property of the Dirichlet distribution allows us to get very useful concentration bounds. The following lemma is used in the proof of Proposition 2.1.

Lemma A.1 (Concentration bounds for the Dirichlet distribution). Let $D \sim \mathrm{Dirichlet}(\alpha, \beta)$ for some $\alpha_1, \dots, \alpha_d, \beta > 0$. There is a variance parameter $0 < \sigma^2_{\mathrm{opt}}(\alpha, \beta) \le (4(\|\alpha\|_1 + \beta + 1))^{-1}$ such that

$\mathbb{P}\big( |(D - \mathbb{E}[D])_i| \ge t_i\ \forall i \in [d] \big) \le 2^d \exp\Big( - \frac{\|t\|_2^2}{2 \sigma^2_{\mathrm{opt}}(\alpha, \beta)} \Big), \qquad (A.1)$

for all $t \in (0,\infty)^d$.

Proof. By Chernoff's inequality and the sub-Gaussianity of the Dirichlet distribution, shown in Theorem 3.3 of Marchal & Arbel (2017), we have, for all $\lambda \in (0,\infty)^d$,

$\mathbb{P}(D - \mathbb{E}[D] \ge t) \le \mathbb{E}\big[ e^{\lambda^{\top}(D - \mathbb{E}[D])} \big]\, e^{-\lambda^{\top} t} \le e^{\frac{\|\lambda\|_2^2\, \sigma^2_{\mathrm{opt}}(\alpha,\beta)}{2} - \lambda^{\top} t}, \qquad (A.2)$

for some $0 < \sigma^2_{\mathrm{opt}}(\alpha, \beta) \le (4(\|\alpha\|_1 + \beta + 1))^{-1}$. (The upper bound on $\sigma^2_{\mathrm{opt}}(\alpha, \beta)$ is stated in Theorem 2.1 of Marchal & Arbel (2017).) If we take the optimal vector $\lambda^{\star} = t / \sigma^2_{\mathrm{opt}}(\alpha, \beta)$, the right-hand side of (A.2) is $\le \exp(-\|t\|_2^2 / (2 \sigma^2_{\mathrm{opt}}(\alpha, \beta)))$. We get the same bound on any probability of the form

$\mathbb{P}\big( (D - \mathbb{E}[D])_i \le -t_i\ \forall i \in \mathcal{J},\ (D - \mathbb{E}[D])_i \ge t_i\ \forall i \in [d] \setminus \mathcal{J} \big), \quad \text{for } \mathcal{J} \subseteq [d], \qquad (A.3)$

simply by rerunning the above argument and multiplying the components $i \in \mathcal{J}$ of $\lambda^{\star}$ by $-1$. Since there are $2^d$ possible subsets of $[d]$, the conclusion (A.1) follows from a union bound.
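Since $\sigma^2_{\mathrm{opt}}(\alpha, \beta) \le (4(\|\alpha\|_1 + \beta + 1))^{-1}$, the computable quantity $2^d \exp(-2(\|\alpha\|_1 + \beta + 1)\|t\|_2^2)$ is also a valid (weaker) upper bound in (A.1). The Monte Carlo sketch below (an illustration; the parameters, threshold $t$ and sample size are arbitrary) checks that an empirical joint tail frequency stays below it:

```python
import numpy as np

alpha, beta = np.array([2.0, 2.0]), 2.0
d = len(alpha)
t = np.array([0.3, 0.3])

rng = np.random.default_rng(1)
full = rng.dirichlet(np.append(alpha, beta), size=200_000)
D = full[:, :d]                          # D ~ Dirichlet(alpha, beta)
mean = alpha / (alpha.sum() + beta)      # E[D]

# Empirical frequency of the joint deviation event in (A.1).
emp = float(np.mean(np.all(np.abs(D - mean) >= t, axis=1)))
weak_bound = 2.0 ** d * np.exp(-2.0 * (alpha.sum() + beta + 1.0) * float(t @ t))
```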


References

Abramowitz, M., & Stegun, I. A. 1964. Handbook of mathematical functions with formulas, graphs, and mathematical tables. National Bureau of Standards Applied Mathematics Series, vol. 55. U.S. Government Printing Office, Washington, D.C. MR0167642.
Babu, G. J., & Chaubey, Y. P. 2006. Smooth estimation of a distribution and density function on a hypercube using Bernstein polynomials for dependent random vectors. Statist. Probab. Lett., 76(9), 959-969. MR2270097.
Babu, G. J., Canty, A. J., & Chaubey, Y. P. 2002. Application of Bernstein polynomials for smooth estimation of a distribution and density function. J. Statist. Plann. Inference, 105(2), 377-392. MR1910059.
Barrientos, A. F., Jara, A., & Quintana, F. A. 2015. Bayesian density estimation for compositional data using random Bernstein polynomials. J. Statist. Plann. Inference, 166, 116-125. MR3390138.
Barrientos, A. F., Jara, A., & Quintana, F. A. 2017. Fully nonparametric regression for bounded data using dependent Bernstein polynomials. J. Amer. Statist. Assoc., 112(518).
Batir, N. 2017. Bounds for the gamma function. Results Math., 72(1-2), 865-874. MR3684463.
Belalia, M. 2016. On the asymptotic properties of the Bernstein estimator of the multivariate distribution function. Statist. Probab. Lett., 110, 249-256. MR3474765.
Belalia, M., Bouezmarni, T., & Leblanc, A. 2017a. Smooth conditional distribution estimators using Bernstein polynomials. Comput. Statist. Data Anal., 111, 166-182. MR3630225.
Belalia, M., Bouezmarni, T., Lemyre, F. C., & Taamouti, A. 2017b. Testing independence based on Bernstein empirical copula and copula density. J. Nonparametr. Stat., 29(2), 346-380. MR3635017.
Belalia, M., Bouezmarni, T., & Leblanc, A. 2019. Bernstein conditional density estimation with application to conditional distribution and regression functions. J. Korean Statist. Soc., 48(3), 356-383. MR3983257.
Bouezmarni, T., & Rolin, J.-M. 2003. Consistency of the beta kernel density function estimator. Canad. J. Statist., 31(1), 89-98. MR1985506.
Bouezmarni, T., & Rolin, J.-M. 2007. Bernstein estimator for unbounded density function. J. Nonparametr. Stat., 19(3), 145-161. MR2351744.
Bouezmarni, T., & Scaillet, O. 2005. Consistency of asymmetric kernel density estimators and smoothed histograms with application to income data. Econometric Theory, 21(2), 390-412. MR2179543.
Bouezmarni, T., Mesfioui, M., & Rolin, J.-M. 2007. L1-rate of convergence of smoothed histogram. Statist. Probab. Lett., 77(14), 1497-1504. MR2395599.
Bouezmarni, T., Rombouts, J. V. K., & Taamouti, A. 2010. Asymptotic properties of the Bernstein density copula estimator for α-mixing data. J. Multivariate Anal., 101(1), 1-10. MR2557614.
Bouezmarni, T., El Ghouch, A., & Taamouti, A. 2013. Bernstein estimator for unbounded copula densities. Stat. Risk Model., 30(4), 343-360. MR3143795.
Brown, B. M., & Chen, S. X. 1999. Beta-Bernstein smoothing for regression curves with compact support. Scand. J. Statist., 26(1), 47-59. MR1685301.
Chen, S. X. 1999. Beta kernel estimators for density functions. Comput. Statist. Data Anal., 31(2), 131-145. MR1718494.
Chen, S. X. 2000a. Beta kernel smoothers for regression curves. Statist. Sinica, 10(1), 73-91. MR1742101.
Chen, S. X. 2000b. Probability density function estimation using gamma kernels. Ann. Inst. Statist. Math., 52(3), 471-480. MR1794247.
Curtis, S. M., & Ghosh, S. K. 2011. A variable selection approach to monotonic regression with Bernstein polynomials. J. Appl. Stat., 38(5), 961-976. MR2782409.
Gawronski, W. 1985. Strong laws for density estimators of Bernstein type. Period. Math. Hungar., 16(1), 23-43. MR0791719.
Gawronski, W., & Stadtmüller, U. 1980. On density estimation by means of Poisson's distribution. Scand. J. Statist., 7(2), 90-94. MR0574548.
Gawronski, W., & Stadtmüller, U. 1981. Smoothing histograms by means of lattice and continuous distributions. Metrika, 28(3), 155-164. MR0638651.
Gawronski, W., & Stadtmüller, U. 1984. Linear combinations of iterated generalized Bernstein functions with an application to density estimation. Acta Sci. Math., 47(1-2), 205-221. MR0755576.
Ghosal, S. 2001. Convergence rates for density estimation with Bernstein polynomials. Ann. Statist., 29(5), 1264-1280. MR1873330.
Guan, Z. 2016. Efficient and robust density estimation using Bernstein type polynomials. J. Nonparametr. Stat., 28(2), 250-271. MR3488598.
Igarashi, G., & Kakizawa, Y. 2014. On improving convergence rate of Bernstein polynomial density estimator. J. Nonparametr. Stat., 26(1), 61-84. MR3174309.
Janssen, P., Swanepoel, J., & Veraverbeke, N. 2012. Large sample behavior of the Bernstein copula estimator. J. Statist. Plann. Inference, 142(5), 1189-1197. MR2879763.
Janssen, P., Swanepoel, J., & Veraverbeke, N. 2014. A note on the asymptotic behavior of the Bernstein estimator of the copula density. J. Multivariate Anal., 124, 480-487. MR3147339.
Janssen, P., Swanepoel, J., & Veraverbeke, N. 2016. Bernstein estimation for a copula derivative with application to conditional distribution and regression functionals. TEST, 25(2), 351-374. MR3493523.
Janssen, P., Swanepoel, J., & Veraverbeke, N. 2017. Smooth copula-based estimation of the conditional density function with a single covariate. J. Multivariate Anal., 159, 39-48. MR3668546.
Kakizawa, Y. 2004. Bernstein polynomial probability density estimation. J. Nonparametr. Stat., 16(5), 709-729. MR2068610.
Kakizawa, Y. 2006. Bernstein polynomial estimation of a spectral density. J. Time Ser. Anal., 27(2), 253-287. MR2235846.
Kakizawa, Y. 2011. A note on generalized Bernstein polynomial density estimators. Stat. Methodol., 8(2), 136-153. MR2769276.
Lange, K. 1995. Applications of the Dirichlet distribution to forensic match probabilities. Genetica, 96(1-2), 107-117. doi:10.1007/BF01441156.
Leblanc, A. 2009. Chung-Smirnov property for Bernstein estimators of distribution functions. J. Nonparametr. Stat., 21(2), 133-142. MR2488150.
Leblanc, A. 2010. A bias-reduced approach to density estimation using Bernstein polynomials. J. Nonparametr. Stat., 22(3-4), 459-475. MR2662607.
Leblanc, A. 2012a. On estimating distribution functions using Bernstein polynomials. Ann. Inst. Statist. Math., 64(5), 919-943. MR2960952.
Leblanc, A. 2012b. On the boundary properties of Bernstein polynomial estimators of density and distribution functions. J. Statist. Plann. Inference, 142(10), 2762-2778. MR2925964.
Lorentz, G. G. 1986. Bernstein polynomials. Second edition. Chelsea Publishing Co., New York. MR0864976.
Lu, L. 2015. On the uniform consistency of the Bernstein density estimator. Statist. Probab. Lett., 107, 52-61. MR3412755.
Marchal, O., & Arbel, J. 2017. On the sub-Gaussianity of the beta and Dirichlet distributions. Electron. Commun. Probab., 22(Paper No. 54), 14 pp. MR3718704.
Minc, H., & Sathre, L. 1964. Some inequalities involving (r!)^{1/r}. Proc. Edinburgh Math. Soc., 14(1), 41-46. MR0162751.
Ouimet, F. 2018. Complete monotonicity of multinomial probabilities and its application to Bernstein estimators on the simplex. J. Math. Anal. Appl., 466(2), 1609-1617. MR3825458.
Ouimet, F. 2019. Extremes of log-correlated random fields and the Riemann-zeta function, and some asymptotic results for various estimators in statistics. PhD thesis, Université de Montréal. http://hdl.handle.net/1866/22667.
Ouimet, F. 2020. A precise local limit theorem for the multinomial distribution. Preprint, 1-7. arXiv:2001.08512.
Petrone, S. 1999a. Bayesian density estimation using Bernstein polynomials. Canad. J. Statist., 27(1), 105-126. MR1703623.
Petrone, S. 1999b. Random Bernstein polynomials. Scand. J. Statist., 26(3), 373-393. MR1712051.
Petrone, S., & Wasserman, L. 2002. Consistency of Bernstein polynomial posteriors. J. Roy. Statist. Soc. Ser. B, 64(1), 79-100. MR1881846.
Prakasa Rao, B. L. S. 1983. Nonparametric functional estimation. Probability and Mathematical Statistics. Academic Press, Inc., New York. MR0740865.
Prakasa Rao, B. L. S. 2005. Estimation of distribution and density functions by generalized Bernstein polynomials. Indian J. Pure Appl. Math., 36(2), 63-88. MR2153833.
Sancetta, A. 2007. Nonparametric estimation of distributions with given marginals via Bernstein-Kantorovich polynomials: L1 and pointwise convergence theory. J. Multivariate Anal., 98(7), 1376-1390. MR2364124.
Sancetta, A., & Satchell, S. 2004. The Bernstein copula and its applications to modeling and approximations of multivariate distributions. Econometric Theory, 20(3), 535-562. MR2061727.
Serfling, R. J. 1980. Approximation theorems of mathematical statistics. Wiley Series in Probability and Mathematical Statistics. John Wiley & Sons, Inc., New York. MR0595165.
Stadtmüller, U. 1983. Asymptotic distributions of smoothed histograms. Metrika, 30(3), 145-158. MR0726014.
Stadtmüller, U. 1986. Asymptotic properties of nonparametric curve estimates. Period. Math. Hungar., 17(2), 83-108. MR0858109.
Steele, M. 1997. Probability theory and combinatorial optimization. CBMS-NSF Regional Conference Series in Applied Mathematics, vol. 69. SIAM, Philadelphia, PA. MR1422018.
Tenbusch, A. 1994. Two-dimensional Bernstein polynomial density estimators. Metrika, 41(3-4), 233-253. MR1293514.
Tenbusch, A. 1997. Nonparametric curve estimation with Bernstein estimates. Metrika, 45(1), 1-30. MR1437794.
Turnbull, B. C., & Ghosh, S. K. 2014. Unimodal density estimation using Bernstein polynomials. Comput. Statist. Data Anal., 72, 13-29. MR3139345.
Vitale, R. A. 1975. Bernstein polynomial approach to density function estimation. Pages 87-99 of: Statistical Inference and Related Topics. Academic Press, New York. MR0397977.
