Adaptive Bayesian Estimation of Mixed Discrete-Continuous Distributions under Smoothness and Sparsity


Working Paper 342

July 2018


Andriy Norets

Justinas Pelenis


ISSN: 1605-7996

2018 Institut für Höhere Studien - Institute for Advanced Studies (IHS) Josefstädter Straße 39, A-1080 Wien

Web: www.ihs.ac.at

All IHS Working Papers are available online: http://irihs.ihs.ac.at/view/ihs_series/

This paper is available for download without charge at: http://irihs.ihs.ac.at/4711/

By Andriy Norets‡ and Justinas Pelenis§

Brown University and Vienna Institute for Advanced Studies

We consider nonparametric estimation of a mixed discrete-continuous distribution under anisotropic smoothness conditions and possibly increasing number of support points for the discrete part of the distribution. For these settings, we derive lower bounds on the estimation rates in the total variation distance. Next, we consider a nonparametric mixture of normals model that uses continuous latent variables for the discrete part of the observations. We show that the posterior in this model contracts at rates that are equal to the derived lower bounds up to a log factor. Thus, Bayesian mixture of normals models can be used for optimal adaptive estimation of mixed discrete-continuous distributions.

Keywords: Bayesian nonparametrics, adaptive rates, minimax rates, posterior contraction, discrete-continuous distribution, mixed scale, mixtures of normal distributions, latent variables.

JEL classification: C11, C14.

1. Introduction.

Mixture models have proven to be very useful for Bayesian nonparametric modeling of univariate and multivariate distributions of continuous variables. These models possess outstanding asymptotic frequentist properties: in Bayesian nonparametric estimation of smooth densities the posterior in these models contracts at optimal adaptive rates up to a log factor (Rousseau (2010), Kruijer et al. (2010), Shen, Tokdar, and Ghosal (2013)). Tractable Markov chain Monte Carlo (MCMC) algorithms for exploring posterior distributions of these models are available (Escobar and West (1995), MacEachern and Muller (1998), Neal (2000), Miller and Harrison (2017), Norets (2017)) and they are widely used in empirical work (see Dey, Muller, and Sinha (1998), Chamberlain and Hirano (1999), Burda, Harding, and Hausman (2008), Chib and Greenberg (2010), and Jensen and Maheu (2014) among many others).

∗ First version: December 2017, current version: July 24, 2018.

† We thank participants of Harvard-MIT econometrics workshop, OBayes 2017, CIREQ 2018, and SBIES 2018 for helpful comments.

‡ Associate Professor, Department of Economics, Brown University.

§ Assistant Professor, Vienna Institute for Advanced Studies.

In most applications, data contain both continuous and discrete variables. From the computational perspective, discrete variables can be easily accommodated through the use of continuous latent variables in Bayesian MCMC estimation (Albert and Chib (1993), McCulloch and Rossi (1994)). In nonparametric modelling of discrete-continuous data by mixtures, latent variables were used by Canale and Dunson (2011) and Norets and Pelenis (2012) among others. Some results on frequentist asymptotic properties of the posterior distribution in such models have also been established. Norets and Pelenis (2012) obtained approximation results in Kullback-Leibler distance and weak posterior consistency for mixture models with a prior on the number of mixture components. DeYoreo and Kottas (2017) establish weak posterior consistency for Dirichlet process mixtures. In similar settings, Canale and Dunson (2015) derived posterior contraction rates that are not optimal. The question we address in the present paper is whether a mixture of normals model that uses latent variables for modeling the discrete part of the distribution can deliver (near) optimal and adaptive posterior contraction rates for nonparametric estimation of discrete-continuous distributions.

Our contribution has two main parts. First, we derive lower bounds on the estimation rate for mixed multivariate discrete-continuous distributions under anisotropic smoothness conditions and potentially growing support of the discrete part of the distribution. Second, we study the posterior contraction rate for a mixture of normals model with a variable number of components that uses continuous latent variables for the discrete part of the observations. We show that the posterior in this model contracts at rates that are equal to the derived lower bounds up to a log factor. Thus, Bayesian mixture models can be used for (up to a log factor) optimal adaptive estimation of mixed discrete-continuous distributions. These results are obtained in a rich asymptotic framework where the multivariate discrete part of the data generating distribution can have either a large or a small number of support points and it can be either very smooth or not, and these characteristics can differ from one discrete coordinate to another. In these settings, smoothing is beneficial only for a subset of discrete variables with a quickly growing number of support points and/or high level of smoothness. In a sense, this subset is automatically and correctly selected by the mixture model. The obtained optimal posterior contraction rates are adaptive since the priors we consider do not depend on the number of support points and the smoothness of the data generating process.

Our results on lower bounds have independent value outside of the literature on Bayesian mixture models and their frequentist properties. Let us briefly review most relevant results on lower bounds and place our results in that context. The minimax estimation rates for mixed discrete-continuous distributions appear to be studied first by Efromovich (2011). He considers discrete variables with a fixed support and shows that the optimal rates for discrete-continuous distributions are equal to the optimal nonparametric rates for the continuous part of the distribution. Relaxing the assumption of the fixed support for the discrete part of the distribution is very desirable in nonparametric settings. It has been commonly observed at least since Aitchison and Aitken (1976) that smoothing discrete


an asymptotic framework that provided a precise theoretical justification for improvements resulting from smoothing in the context of estimating a univariate discrete distribution with a support that can grow with the sample size. In their setup, the support is an ordered set and the probability mass function is β-smooth (in a sense that analogs of β-order Taylor expansions hold). They show that in their setup the minimax rate is the smaller one of the following two: (i) the optimal estimation rate for a continuous density with the smoothness level β, $n^{-\beta/(2\beta+1)}$, and (ii) the rate of convergence of the standard frequency estimator, $(N/n)^{1/2}$, where $N$ is the cardinality of the support and $n$ is the sample size. Hall and Titterington (1987) refer to their setup as “Sparse Multinomial Data” since $N$ can be larger than $n$, and this is the reason we refer to sparsity in the title of the present paper. Burman (1987) established similar results for β = 2. Subsequent literature in multivariate settings (e.g., Dong and Simonoff (1995), Aerts et al. (1997)) did not consider lower bounds but demonstrated that when the support of the discrete distribution grows sufficiently fast then estimators that employ smoothing can achieve the standard nonparametric rates for β-smooth densities on $\mathbb{R}^d$, $n^{-\beta/(2\beta+d)}$.

We generalize the results of Hall and Titterington (1987) on lower bounds for univariate discrete distributions to multivariate mixed discrete-continuous case and anisotropic smoothness. Alternatively, our results can be viewed as a generalization of results in Efromovich (2011) to settings with potentially growing supports for discrete variables.

Some details of our settings and assumptions differ from those in Hall and Titterington (1987) and Efromovich (2011) because our original motivation was in understanding the behavior of the posterior in mixture models with latent variables. Specifically, we consider lower and upper estimation bounds in the total variation distance since posterior concentration in nonparametric settings is much better understood when the total variation distance is considered (Ghosal et al. (2000)). We also introduce a new definition of anisotropic smoothness that, on the one hand, accommodates an extension of techniques for deriving lower bounds from Ibragimov and Hasminskii (1984) and, on the other hand, lets us exploit approximation results for mixtures of multivariate normal distributions developed by Shen et al. (2013).

The rest of the paper is organized as follows. In Section 2, we describe our framework and define notation. Section 3 presents our results on lower bounds for estimation rates. The results on the posterior contraction rates are given in Section 4. Appendix contains auxiliary results and some proofs.

2. Preliminaries and Notation.

Let us denote the continuous part of observations by $x \in \mathcal{X} \subset \mathbb{R}^{d_x}$ and the discrete part by $y = (y_1, \ldots, y_{d_y}) \in \mathcal{Y}$, where

$$\mathcal{Y} = \prod_{j=1}^{d_y} \mathcal{Y}_j, \quad \text{with} \quad \mathcal{Y}_j = \left\{ \frac{1 - 1/2}{N_j}, \frac{2 - 1/2}{N_j}, \ldots, \frac{N_j - 1/2}{N_j} \right\},$$

is a grid on $[0,1]^{d_y}$ (a product symbol Π applied to sets hereafter denotes a Cartesian product). The number of values that the discrete coordinates $y_j$ can take, $N_j$, can potentially grow with the sample size or stay constant.


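For $y = (y_1, \ldots, y_{d_y}) \in \mathcal{Y}$, let $A_y = \prod_{j=1}^{d_y} A_{y_j}$, where

$$A_{y_j} = \begin{cases} (-\infty,\; y_j + 0.5/N_j] & \text{if } y_j = 0.5/N_j \\ (y_j - 0.5/N_j,\; \infty) & \text{if } y_j = 1 - 0.5/N_j \\ (y_j - 0.5/N_j,\; y_j + 0.5/N_j] & \text{otherwise} \end{cases}$$

and let us represent the data generating density-probability mass function as

$$p_0(y, x) = \int_{A_y} f_0(\tilde{y}, x)\, d\tilde{y}, \tag{2.1}$$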

where $f_0$ belongs to $\mathcal{D}$, the set of probability density functions on $\mathbb{R}^{d_x + d_y}$ with respect to the Lebesgue measure. The representation of a mixed discrete-continuous distribution in (2.1) is so far without a loss of generality since for any given $p_0$ one could always define $f_0$ using a mixture of densities with non-overlapping supports included in $A_y$, $y \in \mathcal{Y}$.
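The latent-variable representation in (2.1) can be illustrated for a single discrete coordinate. The following is a minimal sketch, assuming a normal latent density and $N \ge 2$; the function names (`grid`, `cell`, `pmf_from_latent_normal`) are hypothetical and only serve the illustration.

```python
import math

def grid(N):
    """Support points (k - 1/2)/N, k = 1..N, of one discrete coordinate."""
    return [(k - 0.5) / N for k in range(1, N + 1)]

def cell(y, N):
    """Interval A_{y_j} around a grid point y; the two boundary cells are unbounded.
    Assumes N >= 2 so that the first and last cells are distinct."""
    k = round(y * N + 0.5)  # recover the index k from y = (k - 1/2)/N
    lo = -math.inf if k == 1 else y - 0.5 / N
    hi = math.inf if k == N else y + 0.5 / N
    return lo, hi

def normal_cdf(t, mu, sigma):
    """CDF of N(mu, sigma^2), with explicit handling of infinite arguments."""
    if t == math.inf:
        return 1.0
    if t == -math.inf:
        return 0.0
    return 0.5 * (1.0 + math.erf((t - mu) / (sigma * math.sqrt(2.0))))

def pmf_from_latent_normal(N, mu, sigma):
    """p_0(y) = integral of the latent density over A_y (illustrative normal latent density)."""
    pmf = {}
    for y in grid(N):
        lo, hi = cell(y, N)
        pmf[y] = normal_cdf(hi, mu, sigma) - normal_cdf(lo, mu, sigma)
    return pmf
```

Because the cells partition the real line, the resulting probabilities sum to one for any latent density, which is what makes the representation free of loss of generality before smoothness is imposed.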

In this paper, we consider independently identically distributed observations from $p_0$: $(Y^n, X^n) = (Y_1, X_1, \ldots, Y_n, X_n)$. Let $P_0$, $E_0$, $P_0^n$, and $E_0^n$ denote the probability measures and expectations corresponding to $p_0$ and its product $p_0^n$.

When $N_j$’s grow with the sample size the generality of the representation in (2.1) can be lost when assumptions such as smoothness are imposed on $f_0$. Nevertheless, in what follows we do impose a smoothness assumption on $f_0$. The interpretation of this assumption is that the values of discrete variables can be ordered and that borrowing of information from nearby discrete points can be useful in estimation.

To get more refined results, we allow $N_j$’s to grow at different rates for different $j$’s. For the same reason, we work with anisotropic smoothness. Let $\mathbb{Z}_+$ denote the set of non-negative integers. For smoothness coefficients $\beta_i > 0$, $i = 1, \ldots, d$, $d = d_x + d_y$, and an envelope function $L: \mathbb{R}^{2d} \to \mathbb{R}$, an anisotropic $(\beta_1, \ldots, \beta_d)$-Holder class, $\mathcal{C}^{\beta_1, \ldots, \beta_d, L}$, is defined as follows.

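Definition 2.1. $f \in \mathcal{C}^{\beta_1, \ldots, \beta_d, L}$ if for any $k = (k_1, \ldots, k_d) \in \mathbb{Z}_+^d$, $\sum_{i=1}^d k_i/\beta_i < 1$, mixed partial derivative of order $k$, $D^k f$, is finite and

$$|D^k f(z + \Delta z) - D^k f(z)| \le L(z, \Delta z) \sum_{j=1}^d |\Delta z_j|^{\beta_j (1 - \sum_{i=1}^d k_i/\beta_i)}, \tag{2.2}$$

where $\Delta z_j = 0$ when $\sum_{i=1}^d k_i/\beta_i + 1/\beta_j < 1$.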

In this definition, a Holder condition is imposed on $D^k f$ for a coordinate $j$ when $D^k f$ cannot be differentiated with respect to $z_j$ anymore ($\sum_{i=1}^d k_i/\beta_i < 1$ but $\sum_{i=1}^d k_i/\beta_i + 1/\beta_j \ge 1$). This definition slightly differs from definitions available in the literature on anisotropic smoothness that we found. Section 13.2 in Schumaker (2007) presents some general anisotropic smoothness definitions but restricts attention to integer smoothness coefficients. Ibragimov and Hasminskii (1984), and most of the


Bhattacharya et al. (2014), do not restrict mixed derivatives. Shen et al. (2013) use $|\Delta z_j|^{\min(\beta_j - k_j, 1)}$ instead of $|\Delta z_j|^{\beta_j (1 - \sum k_i/\beta_i)}$ in (2.2). Their requirement is stronger than ours for functions with bounded support, and it appears too strong for our derivation of lower bounds on the estimation rate. However, our definition is sufficiently strong to obtain a Taylor expansion with remainder terms that have the same order as those in Shen et al. (2013) (while the definitions that do not restrict mixed derivatives do not deliver such an expansion).

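When $\beta_j = \beta$, $\forall j$, and $\sum_{i=1}^d k_i/\beta + 1/\beta \ge 1$, $\beta_j (1 - \sum k_i/\beta_i) = \beta - \lfloor \beta \rfloor$, where $\lfloor \beta \rfloor$ is the largest integer that is strictly smaller than β, and we get the standard definition of β-Holder smoothness for the isotropic case.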
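The bookkeeping in Definition 2.1 can be made concrete with a short sketch that enumerates the admissible multi-indices and the Holder exponent attached to each coordinate. The function names are hypothetical, and floating-point arithmetic is used, so inputs that sit exactly on the boundary $\sum_i k_i/\beta_i + 1/\beta_j = 1$ with non-representable fractions are outside the scope of this illustration.

```python
from itertools import product

def admissible_multi_indices(betas):
    """All k in Z_+^d with sum_i k_i / beta_i < 1 (the derivatives restricted in (2.2))."""
    # k_i <= beta_i is necessary for the sum to stay below 1, so the search box is finite.
    ranges = [range(int(b) + 1) for b in betas]
    return [k for k in product(*ranges)
            if sum(ki / b for ki, b in zip(k, betas)) < 1]

def holder_exponents(k, betas):
    """Exponent of |dz_j| in (2.2) for each coordinate j.

    None marks coordinates where D^k f can still be differentiated in z_j,
    i.e. where the definition sets dz_j = 0."""
    s = sum(ki / b for ki, b in zip(k, betas))
    return [b * (1 - s) if s + 1 / b >= 1 else None for b in betas]
```

For example, with $\beta_1 = \beta_2 = 2.5$ and $k = (1, 1)$ the exponent is $2.5 \cdot (1 - 0.8) = 0.5 = \beta - \lfloor \beta \rfloor$, in line with the isotropic case described above.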

The envelope $L$ can be assumed to be a function of $(z, \Delta z)$ to accommodate densities with unbounded support. We derive lower bounds on estimation rates for a constant envelope function. Upper bounds on posterior contraction rates are derived under more general assumptions on $L$ as in Shen et al. (2013).

Some extra notation: for a multi-index $k = (k_1, \ldots, k_d) \in \mathbb{Z}_+^d$, $k! = \prod_{i=1}^d k_i!$, and for $z \in \mathbb{R}^d$, $z^k = \prod_{i=1}^d z_i^{k_i}$. The $m$-dimensional simplex is denoted by $\Delta^{m-1}$. $I_d$ stands for the $d \times d$ identity matrix. Let $\phi_{\mu,\sigma}(\cdot)$ and $\phi(\cdot; \mu, \sigma)$ denote a multivariate normal density with mean $\mu \in \mathbb{R}^d$ and covariance matrix $\sigma^2 I_d$ (or a diagonal matrix with squared elements of $\sigma$ on the diagonal when $\sigma$ is a $d$-vector). For $z \in \mathbb{R}^d$ and $J \subset \{1, 2, \ldots, d\}$, $z_J$ denotes sub-vector $\{z_i, i \in J\}$. Operator “$\lesssim$” denotes less or equal up to a multiplicative positive constant relation.

3. Lower Bounds on Estimation Rates.

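Let $\mathcal{A}$ denote a collection of all subsets of indices for discrete coordinates $\{1, \ldots, d_y\}$. For $J \in \mathcal{A}$, define $J^c = \{1, \ldots, d\} \setminus J$,

$$N_J = \prod_{i \in J} N_i, \qquad \beta_{J^c} = \left[ \sum_{i \in J^c} \beta_i^{-1} \right]^{-1},$$

$N_\emptyset = 1$, $\beta_\emptyset = \infty$, and $\beta_\emptyset/(2\beta_\emptyset + 1) = 1/2$.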

For a class of probability distributions $\mathcal{P}$, ζ is said to be a lower bound on the estimation error in metric ρ if

$$\inf_{\hat{p}} \sup_{p \in \mathcal{P}} P(\rho(\hat{p}, p) \ge \zeta) \ge \text{const} > 0.$$

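We consider the following class of probability distributions: for a positive constant $L$, let

$$\mathcal{P} = \left\{ p : \; p(y, x) = \int_{A_y} f(\tilde{y}, x)\, d\tilde{y}, \; f \in \mathcal{C}^{\beta_1, \ldots, \beta_d, L} \cap \mathcal{D} \right\}. \tag{3.1}$$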

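Theorem 3.1. For $\mathcal{P}$ defined in (3.1),

$$\Gamma_n = \min_{J \in \mathcal{A}} \left[ \frac{N_J}{n} \right]^{\frac{\beta_{J^c}}{2\beta_{J^c} + 1}} = \left[ \frac{N_{J_*}}{n} \right]^{\frac{\beta_{J_*^c}}{2\beta_{J_*^c} + 1}} \tag{3.2}$$

multiplied by a positive constant is a lower bound on estimation error in the total variation distance.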


One could recognize expression $[N_J/n]^{\beta_{J^c}/(2\beta_{J^c}+1)}$ in (3.2) as the standard estimation rate for a $\mathrm{card}(J^c)$-dimensional density with anisotropic smoothness coefficients $\{\beta_j, j \in J^c\}$ and the sample size $n/N_J$ (Ibragimov and Hasminskii (1984)). One way to interpret this is that the density of $\{x, \tilde{y}_j, j \in J^c\}$ conditional on $y_J$ is $\{\beta_j, j \in J^c\}$-smooth and the number of observations available for its estimation (observations with the same value of $y_J$) should be of the order $n/N_J$; also, the estimation rate for the marginal probability mass function for $y_J$ is $[N_J/n]^{1/2}$, which is at least as fast as $[N_J/n]^{\beta_{J^c}/(2\beta_{J^c}+1)}$.

In this interpretation, smoothing is not performed over the discrete coordinates with indices in set $J$, and the lower bound is obtained when $J$ minimizes $[N_J/n]^{\beta_{J^c}/(2\beta_{J^c}+1)}$. Thus, an estimator that delivers the rate in (3.2) should, in a sense, optimally choose the subset of discrete variables over which to perform smoothing.
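The minimization in (3.2) can be evaluated by brute force over all subsets $J$ of discrete coordinates. The sketch below is only an illustration of the formula: the function name and argument layout are hypothetical, discrete coordinates are indexed from 0, and the exponent is computed through the identity $\beta_{J^c}/(2\beta_{J^c}+1) = 1/(2 + \sum_{i \in J^c} \beta_i^{-1})$, which also covers $J^c = \emptyset$ (exponent $1/2$).

```python
from itertools import combinations

def lower_bound_rate(n, N, betas_discrete, betas_continuous):
    """Return (Gamma_n, J_star) from (3.2), up to the multiplicative constant.

    N[i] and betas_discrete[i] describe discrete coordinate i (0-based);
    betas_continuous lists the smoothness of the continuous coordinates."""
    d_y = len(N)
    best = None
    for size in range(d_y + 1):
        for J in combinations(range(d_y), size):
            N_J = 1
            for i in J:
                N_J *= N[i]
            # Sum of 1/beta_i over J^c: discrete coordinates not in J plus all continuous ones.
            inv = sum(1.0 / betas_discrete[i] for i in range(d_y) if i not in J)
            inv += sum(1.0 / b for b in betas_continuous)
            exponent = 1.0 / (2.0 + inv)  # equals beta_Jc / (2 beta_Jc + 1)
            rate = (N_J / n) ** exponent
            if best is None or rate < best[0]:
                best = (rate, J)
    return best
```

With a single discrete variable and no continuous part, the function reproduces the comparison between the smoothing rate $n^{-\beta/(2\beta+1)}$ and the frequency-estimator rate $(N/n)^{1/2}$: for $n = 10^4$ and $\beta = 2$, smoothing is selected ($J_* = \emptyset$) when $N = 10$ but not when $N = 4$.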

We set up the notation and an outline of the proof of Theorem 3.1 below and delegate detailed calculations to lemmas in Appendix 5.1. The proof of the theorem is based on a general theorem from the literature on lower bounds, which we present next in a slightly simplified form.

Lemma 3.1. (Theorem 2.5 in Tsybakov (2008), see also Ibragimov and Hasminskii (1977)) ζ is a lower bound on the estimation error in metric ρ for a class $\mathcal{Q}$ if there exist a positive integer $M \ge 2$ and $q_j, q_i \in \mathcal{Q}$, $0 \le j < i \le M$, such that $\rho(q_j, q_i) \ge 2\zeta$, $q_j \ll q_0$, $j = 1, \ldots, M$, and

$$\sum_{j=1}^{M} KL(Q_j^n, Q_0^n)/M < \log(M)/8, \tag{3.3}$$

where $KL$ is the Kullback-Leibler divergence and $Q_j^n$ is the distribution of a random sample from $q_j$.

The following standard result on bounding the number of unequal elements in binary sequences is used in our construction of $q_j$, $j = 1, \ldots, M$.

Lemma 3.2. (Varshamov-Gilbert bound, Lemma 2.9 in Tsybakov (2008)) Consider the set of all binary sequences of length $\bar{m}$, $\Omega = \{w = (w_1, \ldots, w_{\bar{m}}) : w_r \in \{0,1\}\} = \{0,1\}^{\bar{m}}$. Suppose $\bar{m} \ge 8$. Then there exists a subset $\{w^0, \ldots, w^M\}$ of Ω such that $w^0 = (0, \ldots, 0)$,

$$\sum_{r=1}^{\bar{m}} 1\{w_r^j \ne w_r^i\} \ge \bar{m}/8, \quad \forall\, 0 \le j < i \le M,$$

and $M \ge 2^{\bar{m}/8}$.

To define $q_j$’s for our problem, we need some additional notation. Let

$$K_0(u) = \exp\{-1/(1 - u^2)\} \cdot 1\{|u| \le 1\}.$$

This function has bounded derivatives of all orders and it smoothly decreases to zero at the boundary of its support. This type of kernel functions is usually used for constructing hypotheses for lower bounds, see Section 2.5 in Tsybakov (2008). Since we need to construct a smooth density that integrates to 1, we define (as illustrated in Figure 1)

$$g(u) = c_0 \left[ K_0(4(u + 1/4)) - K_0(4(u - 1/4)) \right],$$

where $c_0 > 0$ is a sufficiently small constant that will be specified below.

Fig 1. Function $g$ for $c_0 = 1$.
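The shape of $g$ can be checked numerically. The sketch below is an illustration with hypothetical helper names; it evaluates $K_0$ as zero at $|u| = 1$ (its limiting value) and uses a simple midpoint rule, chosen here only for brevity, to confirm that $g$ integrates to zero over its support $[-1/2, 1/2]$, so that adding multiples of $g$ to a density does not change its total mass.

```python
import math

def K0(u):
    """Bump kernel exp{-1/(1 - u^2)} on (-1, 1), zero elsewhere (including |u| = 1)."""
    if abs(u) >= 1.0:
        return 0.0
    return math.exp(-1.0 / (1.0 - u * u))

def g(u, c0=1.0):
    """g(u) = c0 [K0(4(u + 1/4)) - K0(4(u - 1/4))]; odd in u, supported on [-1/2, 1/2]."""
    return c0 * (K0(4.0 * (u + 0.25)) - K0(4.0 * (u - 0.25)))

def integrate(f, a, b, steps=20000):
    """Midpoint rule on [a, b]."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))
```

The extreme values of $g$ are attained at $u = \mp 1/4$, where $g(\mp 1/4) = \pm c_0 e^{-1}$.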

Function $g$ will be used as a kernel in construction of $q_k$’s. Let us define the bandwidth for these kernels first.

For the continuous coordinates, we define the bandwidth as in Ibragimov and Hasminskii (1984),

$$h_i = \Gamma_n^{1/\beta_i}, \quad i \in \{d_y + 1, \ldots, d\}.$$

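For the discrete ones, over which smoothing is beneficial, we define the bandwidth as

$$h_i = \varrho_i \cdot \Gamma_n^{1/\beta_i} = \frac{2}{N_i} \cdot R_i, \quad i \in J_*^c \cap \{1, \ldots, d_y\},$$

where $R_i = \lfloor \Gamma_n^{1/\beta_i} N_i/2 \rfloor + 1$ is a positive integer and $\varrho_i \in (1, 2]$ as shown in Lemma 5.5.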

For the rest of the discrete coordinates, our innovation is to first define artificial anisotropic smoothness coefficients $\beta_i^* = -\log(\Gamma_n)/\log N_i$, $i \in J_*$, at which the rate in (3.2) would have the same value whether we smooth over $y_i$ ($i \in J_*^c$) or not ($i \in J_*$). Then, we define the bandwidth as

$$h_i = 2 \cdot \Gamma_n^{1/\beta_i^*} = 2/N_i, \quad i \in J_*.$$

To streamline the notation, we also define $\beta_i^* = \beta_i$ for $i \in J_*^c$.
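The three bandwidth rules can be written down directly. This is a minimal sketch with hypothetical function names; it assumes $0 < \Gamma_n < 1$ and $N_i \ge 2$. The ratio $\varrho_i = h_i/\Gamma_n^{1/\beta_i}$ returned for smoothed discrete coordinates lies in $(1, 2]$ whenever $\Gamma_n^{1/\beta_i} N_i/2 \ge 1$, since $R_i$ exceeds that quantity by at most one.

```python
import math

def bandwidth_continuous(gamma_n, beta):
    """h_i = Gamma_n^{1/beta_i} for a continuous coordinate."""
    return gamma_n ** (1.0 / beta)

def bandwidth_smoothed_discrete(gamma_n, beta, N):
    """h_i = (2/N_i) R_i for a discrete coordinate over which smoothing is beneficial.

    Returns (h_i, R_i, rho_i) with rho_i = h_i / Gamma_n^{1/beta_i}."""
    base = gamma_n ** (1.0 / beta)
    R = math.floor(base * N / 2.0) + 1
    h = 2.0 * R / N
    return h, R, h / base

def bandwidth_unsmoothed_discrete(gamma_n, N):
    """h_i = 2 Gamma_n^{1/beta_i^*} with the artificial coefficient beta_i^* = -log(Gamma_n)/log(N_i).

    Returns (h_i, beta_i^*); algebraically h_i = 2/N_i."""
    beta_star = -math.log(gamma_n) / math.log(N)
    return 2.0 * gamma_n ** (1.0 / beta_star), beta_star
```

In the unsmoothed case the bandwidth spans exactly two adjacent grid cells, so each kernel built from $g$ places its positive lobe over one support point and its negative lobe over the next.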

Let $m_i$ be the integer part of $h_i^{-1}$, $i = 1, \ldots, d$. Let us consider $\bar{m} = \prod_{i=1}^d m_i$ adjacent rectangles in $[0,1]^d$, $B_r$, $r = 1, \ldots, \bar{m}$, with the side lengths $(h_1, \ldots, h_d)$ and centers $c^r = (c_1^r, \ldots, c_d^r)$, $c_i^r = h_i (k_i^r - 1/2)$, $k_i^r \in \{1, \ldots, m_i\}$. For $z \in \mathbb{R}^d$ and $r = 1, \ldots, \bar{m}$, define

$$g_r(z) = \Gamma_n \prod_{i=1}^d g((z_i - c_i^r)/h_i),$$

which can be non-zero only on $B_r$. A set of hypotheses is defined by sequences of binary weights on $g_r$’s as follows

$$q_j(y, x) = \int_{A_y} \left[ 1_{[0,1]^d}(\tilde{y}, x) + \sum_{r=1}^{\bar{m}} w_r^j g_r(\tilde{y}, x) \right] d\tilde{y}, \tag{3.4}$$

where $w_r^j \in \{0,1\}$, $j = 0, \ldots, M$, and $M$ are defined in Lemma 3.2.
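To see how a hypothesis of the form (3.4) behaves, consider the simplest case of one continuous coordinate and no discrete ones, so that no integration over $A_y$ is needed. The sketch below is an illustration under stated assumptions: the helper names are hypothetical, and the cell centers are taken to be $h(k - 1/2)$, $k = 1, \ldots, m$, which is the natural choice for adjacent cells of side $h$ starting at zero.

```python
import math

def K0(u):
    """Bump kernel exp{-1/(1 - u^2)} on (-1, 1), zero elsewhere."""
    if abs(u) >= 1.0:
        return 0.0
    return math.exp(-1.0 / (1.0 - u * u))

def g(u, c0=1.0):
    """Odd perturbation kernel supported on [-1/2, 1/2]."""
    return c0 * (K0(4.0 * (u + 0.25)) - K0(4.0 * (u - 0.25)))

def hypothesis_density(w, h, gamma_n, c0=1.0):
    """q(x) = 1_[0,1](x) + sum_r w_r * Gamma_n * g((x - c_r)/h) for d = 1.

    w is a binary sequence of length m = integer part of 1/h;
    centers c_r = h (r + 1/2), r = 0..m-1, are an assumption of this sketch."""
    centers = [h * (r + 0.5) for r in range(len(w))]

    def q(x):
        if not 0.0 <= x <= 1.0:
            return 0.0
        return 1.0 + sum(wr * gamma_n * g((x - c) / h, c0)
                         for wr, c in zip(w, centers))

    return q
```

Each active cell adds a perturbation of height at most $\Gamma_n c_0 e^{-1}$ that integrates to zero, so $q$ remains a probability density as long as $\Gamma_n c_0 e^{-1} < 1$, and two hypotheses differ exactly on the cells where their binary weights differ.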

The rest of the proof is delegated to lemmas in Appendix 5.1, which show that $q_k$ in (3.4) satisfy the sufficient conditions from Lemma 3.1. Specifically, Lemma 5.1 derives the lower bound on the total variation distance. Lemma 5.2 verifies condition (3.3) when $\bar{m} \ge 8$. Lemma 5.3, part (i) of Lemma 5.5, and the fact that $q_k$’s are defined on $[0,1]^d$ imply $q_j \in \mathcal{C}^{\beta_1, \ldots, \beta_d, L}$, $j = 0, \ldots, M$.

This argument (Lemma 5.2 specifically) requires $\bar{m} \ge 8$ as it relies on Lemma 3.2. Observe that as $n \to \infty$, $\bar{m} \ge 8$ if there are continuous variables or there are discrete variables over which smoothing is beneficial ($J_*^c \ne \emptyset$). Thus, $\bar{m} < 8$ can happen only if there are no continuous variables and $N_{J_*} = N_1 \cdots N_d$ is bounded. This is just a problem of estimating a multinomial distribution with finite support and the standard results for parametric problems deliver the usual $n^{-1/2}$ rate.

Finally, note that we prove the lower bound results for a class of densities that includes densities that are in $\mathcal{C}^{\beta_1, \ldots, \beta_d, L}$ on $[0,1]^d$. It is straightforward to modify the proof so that it works for a class of smooth densities on $\mathbb{R}^d$. To accomplish this we can replace $1_{[0,1]^d}(\cdot)$ in (3.4) with a smooth function on $\mathbb{R}^d$ that has a bounded support and is bounded away from zero on $[0,1]^d$, for example,

$$\prod_{i=1}^d \left[ 1_{[0,1]}(z_i) + IK_0(z_i + 1) \cdot 1(z_i < 0) + IK_0(2 - z_i) \cdot 1(z_i > 1) \right],$$

multiplied by a normalization constant, where $IK_0(z_i) = \int_{-1}^{z_i} K_0(u)\,du \big/ \int_{-1}^{1} K_0(u)\,du$. Then, proofs of Lemmas 5.1-5.3 go through with minor modifications.


4. Posterior Contraction Rates for a Mixture of Normals Model.

4.1. Model and Prior.

In this section, we consider a Bayesian model for the data generating process in (2.1). We use a mixture of normal distributions with a variable number of components for modelling the joint distribution of $(\tilde{y}, x)$,

$$f(\tilde{y}, x|\theta, m) = \sum_{j=1}^{m} \alpha_j \phi(\tilde{y}, x; \mu_j, \sigma), \qquad p(y, x|\theta, m) = \int_{A_y} f(\tilde{y}, x|\theta, m)\, d\tilde{y}, \tag{4.1}$$

where $\theta = (\mu_j^y, \mu_j^x, \alpha_j, \; j = 1, 2, \ldots, m; \; \sigma)$.
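Because the covariance matrix in (4.1) is diagonal, the integral over $A_y$ factorizes within each mixture component into a normal probability for the latent discrete coordinate times a normal density for the continuous one. The following is a minimal sketch for one discrete and one continuous coordinate; all names are hypothetical and $N \ge 2$ is assumed.

```python
import math

def norm_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def norm_cdf(t, mu, sigma):
    """CDF of N(mu, sigma^2), with explicit handling of infinite arguments."""
    if t == math.inf:
        return 1.0
    if t == -math.inf:
        return 0.0
    return 0.5 * (1.0 + math.erf((t - mu) / (sigma * math.sqrt(2.0))))

def cell(k, N):
    """Interval A_y for the k-th grid point y = (k - 1/2)/N, k = 1..N (N >= 2 assumed)."""
    lo = -math.inf if k == 1 else (k - 1.0) / N
    hi = math.inf if k == N else k / N
    return lo, hi

def p_mixed(k, x, alphas, mus, sigma, N):
    """p(y_k, x | theta, m) from (4.1) with d_y = d_x = 1.

    mus[j] = (mu_j^y, mu_j^x); sigma = (sigma_y, sigma_x) is shared across components."""
    lo, hi = cell(k, N)
    total = 0.0
    for a, (mu_y, mu_x) in zip(alphas, mus):
        mass_y = norm_cdf(hi, mu_y, sigma[0]) - norm_cdf(lo, mu_y, sigma[0])
        total += a * mass_y * norm_pdf(x, mu_x, sigma[1])
    return total
```

Summing `p_mixed` over all grid points recovers the marginal mixture density of $x$, which gives a quick consistency check of the discretization.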

We assume the following conditions on the prior Π for $(\theta, m)$. For positive constants $a_1, a_2, \ldots, a_9$, for each $i \in \{1, \ldots, d\}$ the prior for $\sigma_i$ satisfies

$$\Pi(\sigma_i^{-2} \ge s) \le a_1 \exp\{-a_2 s^{a_3}\} \quad \text{for all sufficiently large } s > 0, \tag{4.2}$$

$$\Pi(\sigma_i^{-2} < s) \le a_4 s^{a_5} \quad \text{for all sufficiently small } s > 0, \tag{4.3}$$

$$\Pi\{s < \sigma_i^{-2} < s(1 + t)\} \ge a_6 s^{a_7} t^{a_8} \exp\{-a_9 s^{1/2}\}, \quad s > 0, \; t \in (0,1). \tag{4.4}$$

An example of a prior that satisfies (4.2)-(4.4) is the inverse Gamma prior for $\sigma_i$.

Prior for $(\alpha_1, \ldots, \alpha_m)$ conditional on $m$ is Dirichlet$(a/m, \ldots, a/m)$, $a > 0$. Prior for the number of mixture components $m$ is

$$\Pi(m = i) \propto \exp(-a_{10}\, i (\log i)^{\tau_1}), \quad i = 2, 3, \ldots, \quad a_{10} > 0, \; \tau_1 \ge 0. \tag{4.5}$$

More generally, a prior that can be bounded above and below by functions in the form of the right hand side of (4.5), possibly with different constants, would also work.

A priori, the components of $\mu_j$, $\mu_{j,i}$, $i = 1, \ldots, d$, are independent from each other, other parameters, and across $j$. Prior density for $\mu_{j,i}$ is bounded below for some $a_{12}, \tau_2 > 0$ by

$$a_{11} \exp(-a_{12} |\mu_{j,i}|^{\tau_2}), \tag{4.6}$$

and for some $a_{13}, \tau_3 > 0$ and all sufficiently large $\mu > 0$,

$$\Pi(\mu_{j,i} \notin [-\mu, \mu]) \le \exp(-a_{13} \mu^{\tau_3}). \tag{4.7}$$
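A single draw from a prior of this type can be sketched as follows. Several choices here are assumptions made only for illustration and are not prescribed by the conditions above: the support of $m$ is truncated at a hypothetical `m_max` so that the weights in (4.5) can be normalized numerically, the component means receive independent normal priors (which have the tail behaviour required in (4.6)-(4.7) with $\tau_2 = \tau_3 = 2$), and the inverse Gamma hyperparameters are arbitrary.

```python
import math
import random

def draw_from_prior(d, rng, a=1.0, a10=1.0, tau1=1.0,
                    m_max=50, shape=2.0, scale=1.0, mu_sd=3.0):
    """Draw (m, alphas, mus, sigmas) from an illustrative prior of the form in Section 4.1."""
    # Number of components: Pi(m = i) proportional to exp(-a10 * i * (log i)^tau1), i >= 2,
    # truncated at m_max (an assumption of this sketch).
    support = list(range(2, m_max + 1))
    weights = [math.exp(-a10 * i * math.log(i) ** tau1) for i in support]
    m = rng.choices(support, weights=weights)[0]

    # Mixing weights: Dirichlet(a/m, ..., a/m) via normalized Gamma variates.
    gammas = [rng.gammavariate(a / m, 1.0) for _ in range(m)]
    total = sum(gammas)
    alphas = [x / total for x in gammas]

    # Scales: sigma_i is inverse Gamma, i.e. 1/sigma_i is Gamma(shape, scale).
    sigmas = [1.0 / rng.gammavariate(shape, scale) for _ in range(d)]

    # Locations: independent normal priors across components and coordinates.
    mus = [[rng.gauss(0.0, mu_sd) for _ in range(d)] for _ in range(m)]
    return m, alphas, mus, sigmas
```

Passing a seeded `random.Random` instance as `rng` makes the draw reproducible.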

4.2. Assumptions on the Data Generating Process.

In what follows, we consider a fixed subset of discrete indices $J \in \mathcal{A}$ and show that under regularity conditions, the posterior contraction rate is bounded above by $[N_J/n]^{\beta_{J^c}/(2\beta_{J^c}+1)}$ times a log factor. If the regularity conditions we describe below for a fixed $J$ hold for every subset of $\mathcal{A}$, then the posterior contraction rate matches the lower bound in (3.2) up to a log factor.

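Without a loss of generality, let $J = \{1, \ldots, d_J\}$, $I = \{d_J + 1, \ldots, d_y\}$, $J^c = \{1, \ldots, d\} \setminus J$, and $d_{J^c} = \mathrm{card}(J^c)$. Similarly to $\mathcal{Y}$ and $A_y$ defined in Section 2, we define $\mathcal{Y}_J = \prod_{j \in J} \mathcal{Y}_j$ and $A_{y_J} = \prod_{i \in J} A_{y_i}$. Also, let $y_J = \{y_i\}_{i \in J}$, $\tilde{y}_I = \{\tilde{y}_i\}_{i \in I}$, $\tilde{x} = (\tilde{y}_I, x) \in \tilde{\mathcal{X}} = \mathbb{R}^{d_{J^c}}$.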


Also, let $F_{0|J}$ and $E_{0|J}$ denote the conditional probability and expectation corresponding to $f_{0|J}$. If $\pi_{0J}(y_J) = 0$ for a particular $y_J$, then we can define the conditional density $f_{0|J}(\tilde{x}|y_J)$ arbitrarily. We make the following assumptions on the data generating process.

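Assumption 4.1. There are positive finite constants $b$, $\bar{f}_0$, $\tau$ such that for any $y_J \in \mathcal{Y}_J$ and $\tilde{x} \in \tilde{\mathcal{X}}$

$$f_{0|J}(\tilde{x}|y_J) \le \bar{f}_0 \exp(-b \|\tilde{x}\|^{\tau}). \tag{4.8}$$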

It appears that all the papers on (near) optimal posterior contraction rates for mixtures of normal densities impose similar tail conditions on data generating densities.

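Assumption 4.2. There exists a positive and finite $\bar{y}$ such that for any $(y_I, y_J) \in \mathcal{Y}$ and $x \in \mathcal{X}$

$$\int_{A_{y_I} \cap \{\|\tilde{y}_I\| \le \bar{y}\}} f_{0|J}(\tilde{y}_I, x|y_J)\, d\tilde{y}_I \ge \int_{A_{y_I} \cap \{\|\tilde{y}_I\| > \bar{y}\}} f_{0|J}(\tilde{y}_I, x|y_J)\, d\tilde{y}_I. \tag{4.9}$$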

This assumption always holds for $A_{y_I} \subset [0,1]^{d_{J^c} - d_x}$. When $A_{y_I}$ is a rectangle with at least one infinite side, an interpretation of this assumption is that the tail probabilities for $\tilde{y}_I$ conditional on $(x, y_J)$ decline uniformly in $(x, y_J)$. Bounded support for $\tilde{y}_I$ is a sufficient condition for this assumption.

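Assumption 4.3. We assume that

$$f_{0|J} \in \mathcal{C}^{\beta_{d_J+1}, \ldots, \beta_d, L}, \tag{4.10}$$

where for some $\tau_0 \ge 0$ and any $(\tilde{x}, \Delta\tilde{x}) \in \mathbb{R}^{2 d_{J^c}}$

$$L(\tilde{x}, \Delta\tilde{x}) = \tilde{L}(\tilde{x}) \exp\{\tau_0 \|\Delta\tilde{x}\|^2\}, \tag{4.11}$$

$$\tilde{L}(\tilde{x} + \Delta\tilde{x}) \le \tilde{L}(\tilde{x}) \exp\{\tau_0 \|\Delta\tilde{x}\|^2\}. \tag{4.12}$$

The smoothness assumption (4.10) on the conditional density $f_{0|J}$ is implied by the smoothness of the joint density $f_0$ at least under boundedness away from zero assumption, see Lemma 5.8.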

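Assumption 4.4. There are positive finite constants $\varepsilon$ and $\bar{F}$, such that for any $y_J \in \mathcal{Y}_J$ and $k = \{k_i\}_{i \in J^c} \in \mathbb{N}_0^{d_{J^c}}$, $\sum_{i \in J^c} k_i/\beta_i < 1$,

$$\int \left[ \frac{|D^k f_{0|J}(\tilde{x}|y_J)|}{f_{0|J}(\tilde{x}|y_J)} \right]^{\frac{2 + \varepsilon \beta_{J^c}^{-1} d_{J^c}^{-1}}{\sum_{i \in J^c} k_i/\beta_i}} f_{0|J}(\tilde{x}|y_J)\, d\tilde{x} < \bar{F}, \tag{4.13}$$

$$\int \left[ \frac{\tilde{L}(\tilde{x})}{f_{0|J}(\tilde{x}|y_J)} \right]^{2 + \varepsilon \beta_{J^c}^{-1} d_{J^c}^{-1}} f_{0|J}(\tilde{x}|y_J)\, d\tilde{x} < \bar{F}. \tag{4.14}$$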

The envelope function and restrictions on its behaviour are mostly relevant for the case of unbounded support. Condition (4.14) suggests that the envelope function $\tilde{L}$ should be comparable to $f_{0|J}$.

Assumption 4.5. For some small $\nu > 0$,

$$N_J = o(n^{1 - \nu}). \tag{4.15}$$

We impose this assumption to exclude from consideration the cases with very slow (non-polynomial) rates as some parts of the proof require $\log(1/\epsilon_n)$ to be of order $\log n$.

4.3. Posterior Contraction Rates.

Let

tJ0=      dJ c[1+1/(βJ cdJ c)+1/τ]+max{τ1,1,τ2/τ} 2+1/βJ c ifJ c₆₌_∅ max{τ1,1}/2 ifJc=∅ (4.16)

where (τ, τ1, τ2) are defined in Sections4.1-4.2.

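Theorem 4.1. Suppose the assumptions from Sections 4.1-4.2 hold for a given $J \in \mathcal{A}$. Let

$$\epsilon_n = \left[ \frac{N_J}{n} \right]^{\beta_{J^c}/(2\beta_{J^c}+1)} (\log n)^{t_J}, \tag{4.17}$$

where $t_J > t_{J0} + \max\{0, (1 - \tau_1)/2\}$. Suppose also $n\epsilon_n^2 \to \infty$. Then, there exists $\bar{M} > 0$ such that

$$\Pi\left( p : d_{TV}(p, p_0) > \bar{M} \epsilon_n \,\middle|\, Y^n, X^n \right) \overset{P_0^n}{\to} 0.$$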

As in Section 3, when $J^c = \emptyset$, $\beta_{J^c}$ can be defined to be infinity and $\beta_{J^c}/(2\beta_{J^c} + 1) = 1/2$ in (4.17).
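The log-factor exponent in (4.16) and the rate in (4.17) are simple to evaluate. The sketch below uses hypothetical function names and encodes $J^c = \emptyset$ by `d_Jc == 0` and `beta_Jc is None`, which is a convention of this illustration only.

```python
import math

def t_J0(d_Jc, beta_Jc, tau, tau1, tau2):
    """Lower limit for the power of log n, as in (4.16)."""
    if d_Jc == 0:  # J^c is empty
        return max(tau1, 1.0) / 2.0
    numerator = d_Jc * (1.0 + 1.0 / (beta_Jc * d_Jc) + 1.0 / tau) + max(tau1, 1.0, tau2 / tau)
    return numerator / (2.0 + 1.0 / beta_Jc)

def contraction_rate(n, N_J, beta_Jc, t_J):
    """epsilon_n = [N_J / n]^{beta_Jc / (2 beta_Jc + 1)} (log n)^{t_J}, as in (4.17)."""
    exponent = 0.5 if beta_Jc is None else beta_Jc / (2.0 * beta_Jc + 1.0)
    return (N_J / n) ** exponent * math.log(n) ** t_J
```

For instance, with $d_{J^c} = 1$, $\beta_{J^c} = 2$, $\tau = 2$, $\tau_1 = 1$, $\tau_2 = 2$ the numerator in (4.16) is $1 \cdot (1 + 1/2 + 1/2) + 1 = 3$ and the denominator is $2.5$, so $t_{J0} = 1.2$.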

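Corollary 4.1. Suppose the assumptions from Sections 4.1-4.2 hold for every $J \in \mathcal{A}$. Let

$$\epsilon_n = \min_{J \in \mathcal{A}} \left[ \frac{N_J}{n} \right]^{\beta_{J^c}/(2\beta_{J^c}+1)} (\log n)^{t_J}, \tag{4.18}$$

where $t_J > t_{J0} + \max\{0, (1 - \tau_1)/2\}$. Suppose also $n\epsilon_n^2 \to \infty$. Then, there exists $\bar{M} > 0$ such that

$$\Pi\left( p : d_{TV}(p, p_0) > \bar{M} \epsilon_n \,\middle|\, Y^n, X^n \right) \overset{P_0^n}{\to} 0.$$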

Under the assumptions of the corollary, Theorem 4.1 delivers a valid upper bound on the posterior contraction rate for every $J \in \mathcal{A}$ including the one for which the minimum in (4.18) is attained. Hence, the corollary is an immediate implication of Theorem 4.1 whose proof is presented in the following section.

The results on lower bounds in Section 3 hold for any class of data generating densities that includes $f_0$ satisfying the following conditions: $f_0 \in \mathcal{C}^{\beta_1, \ldots, \beta_d, L}$, $f_0 = 0$ outside $[0,1]^d$, and $\bar{f} \ge f_0 \ge \underline{f} > 0$, where $L$, $\bar{f}$, and $\underline{f}$ are finite positive constants. It is worth pointing out that these conditions imply Assumptions 4.1-4.4, which combined with Assumption 4.5 for $N_{J_*}$ and a prior specified in Section 4.1 would deliver the sufficient conditions of Corollary 4.1.


4.4. Proof of Posterior Contraction Results.

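To prove Theorem 4.1, we use the following sufficient conditions for posterior contraction from Theorem 2.1 in Ghosal and van der Vaart (2001). Let $\epsilon_n$ and $\tilde{\epsilon}_n$ be positive sequences with $\tilde{\epsilon}_n \le \epsilon_n$, $\epsilon_n \to 0$, and $n\tilde{\epsilon}_n^2 \to \infty$, and $c_1$, $c_2$, $c_3$, and $c_4$ be some positive constants. Let ρ be Hellinger or total variation distance. Suppose $\mathcal{F}_n \subset \mathcal{F}$ is a sieve with the following bound on the metric entropy $M_e(\epsilon_n, \mathcal{F}_n, \rho)$

$$\log M_e(\epsilon_n, \mathcal{F}_n, \rho) \le c_1 n \epsilon_n^2, \tag{4.19}$$

$$\Pi(\mathcal{F}_n^c) \le c_3 \exp\{-(c_2 + 4) n \tilde{\epsilon}_n^2\}. \tag{4.20}$$

Suppose also that the prior thickness condition holds

$$\Pi(\mathcal{K}(p_0, \tilde{\epsilon}_n)) \ge c_4 \exp\{-c_2 n \tilde{\epsilon}_n^2\}, \tag{4.21}$$

where the generalized Kullback-Leibler neighborhood $\mathcal{K}(p_0, \tilde{\epsilon}_n)$ is defined by

$$\mathcal{K}(p_0, \epsilon) = \left\{ p : \int_{\mathcal{X}} \sum_{y \in \mathcal{Y}} p_0(y, x) \log \frac{p_0(y, x)}{p(y, x)}\, dx < \epsilon^2, \;\; \int_{\mathcal{X}} \sum_{y \in \mathcal{Y}} p_0(y, x) \left[ \log \frac{p_0(y, x)}{p(y, x)} \right]^2 dx < \epsilon^2 \right\}.$$

Then, there exists $\bar{M} > 0$ such that

$$\Pi\left( p : \rho(p, p_0) > \bar{M} \epsilon_n \,\middle|\, Y^n, X^n \right) \overset{P_0^n}{\to} 0.$$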

The definition of the sieve and verification of conditions (4.19) and (4.20) closely follow analogous results in the literature on contraction rates for mixture models in the context of density estimation. The details are given in Lemma 5.18 in Appendix 5.2.2. Verification of the prior thickness condition is more involved and we formulate it as a separate result in the following theorem.

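Theorem 4.2. Suppose the assumptions from Sections 4.1-4.2 hold for a given $J \in \mathcal{A}$. Let $t_J > t_{J0}$, where $t_{J0}$ is defined in (4.16), and

$$\tilde{\epsilon}_n = \left[ \frac{N_J}{n} \right]^{\beta_{J^c}/(2\beta_{J^c}+1)} (\log n)^{t_J}. \tag{4.22}$$

For any $C > 0$ and all sufficiently large $n$,

$$\Pi(\mathcal{K}(p_0, \tilde{\epsilon}_n)) \ge \exp\{-C n \tilde{\epsilon}_n^2\}. \tag{4.23}$$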

Approximation results are key for showing the prior thickness condition (4.23). Appropriate approximation results for $f_{0J}(y_J, \tilde{x}) = f_{0|J}(\tilde{x}|y_J)\pi_{0J}(y_J)$ are obtained as follows. Based on approximation results for continuous densities by normal mixtures from Shen et al. (2013), we obtain approximations for $f_{0|J}(\cdot|y_J)$ for every $y_J$ in the form

$$f^\star_{|J}(\tilde{x}|y_J) = \sum_{j=1}^{K} \alpha^\star_{j|y_J}\, \phi(\tilde{x}; \mu^\star_{j|y_J}, \sigma^\star_{J^c}), \tag{4.24}$$

where the parameters of the mixture will be defined precisely below. For the discrete variables over which smoothing is not performed, $y_J$, we show that $\pi_{0J}(y_J)$ can be appropriately approximated by

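$$\int_{A_{y_J}} \sum_{y_J'} \pi_{0J}(y_J')\, \phi(\tilde{y}_J; y_J', \sigma_J^\star)\, d\tilde{y}_J,$$

where $\int_{A_{y_J}} \phi(\tilde{y}_J; y_J', \sigma_J^\star)\, d\tilde{y}_J$ behaves like an indicator $1\{y_J = y_J'\}$ for sufficiently small $\sigma_J^\star$. The following subsection presents proof details.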
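The indicator-like behaviour of the normal cell probabilities is easy to verify numerically for one unsmoothed discrete coordinate. The sketch below uses hypothetical function names, assumes $N \ge 2$, and takes the scale of the form $\sigma_i^{\star 2} = 1/[64 N_i^2 \beta \log(1/\sigma_n)]$ that appears in the proof; the numerical values of $N$, $\beta$ and $\sigma_n$ in any call are illustrative only.

```python
import math

def norm_cdf(t, mu, sigma):
    """CDF of N(mu, sigma^2), with explicit handling of infinite arguments."""
    if t == math.inf:
        return 1.0
    if t == -math.inf:
        return 0.0
    return 0.5 * (1.0 + math.erf((t - mu) / (sigma * math.sqrt(2.0))))

def cell(k, N):
    """Interval A_y for the k-th grid point (k - 1/2)/N, k = 1..N (N >= 2 assumed)."""
    lo = -math.inf if k == 1 else (k - 1.0) / N
    hi = math.inf if k == N else k / N
    return lo, hi

def sigma_star(N, beta, sigma_n):
    """Square root of 1 / [64 N^2 beta log(1 / sigma_n)]; requires 0 < sigma_n < 1."""
    return 1.0 / (8.0 * N * math.sqrt(beta * math.log(1.0 / sigma_n)))

def cell_mass(k, k_prime, N, sigma):
    """Probability that N(y_{k'}, sigma^2) falls in the cell A_{y_k}."""
    lo, hi = cell(k, N)
    center = (k_prime - 0.5) / N
    return norm_cdf(hi, center, sigma) - norm_cdf(lo, center, sigma)
```

With this scale the half-width of a cell, $0.5/N$, equals $4\sqrt{\beta \log(1/\sigma_n)}$ standard deviations, so the mass placed on the cell's own support point is close to one and the mass leaking into neighbouring cells is negligible.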

4.4.1. Proof of Theorem 4.2 for $J^c \ne \emptyset$.

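Define $\beta = d_{J^c} \left[ \sum_{k \in J^c} \beta_k^{-1} \right]^{-1}$, $\beta_{\min} = \min_{j \in J^c} \beta_j$, and $\sigma_n = [\tilde{\epsilon}_n/\log(1/\tilde{\epsilon}_n)]^{1/\beta}$. For $\varepsilon$ defined in (4.13)-(4.14), $b$ and $\tau$ defined in (4.8), and a sufficiently small $\delta > 0$, let $a_0 = \{(8\beta + 4\varepsilon + 8 + 8\beta/\beta_{\min})/(b\delta)\}^{1/\tau}$, $a_{\sigma_n} = a_0 \{\log(1/\sigma_n)\}^{1/\tau}$, and $b_1 > \max\{1, 1/2\beta\}$ satisfying $\tilde{\epsilon}_n^{b_1} \{\log(1/\tilde{\epsilon}_n)\}^{5/4} \le \tilde{\epsilon}_n$. Then, the proofs of Theorems 4 and 6 in Shen et al. (2013) imply the following two claims for each $y_J = k \in \mathcal{Y}_J$ under the assumptions of Section 4.2.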

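First, there exists a partition $\{U_{j|k}, j = 1, \ldots, K\}$ of $\{\tilde{x} \in \tilde{\mathcal{X}} : \|\tilde{x}\| \le 2a_{\sigma_n}\}$, such that for $j = 1, \ldots, N$, $U_{j|k}$ is contained within an ellipsoid with center $\mu^\star_{j|k}$ and radii $\{\sigma_n^{\beta/\beta_i} \tilde{\epsilon}_n^{2b_1}, i \in J^c\}$

$$U_{j|k} \subset \left\{ \tilde{x} : \sum_{i=1}^{d_{J^c}} \left[ (\tilde{x}_i - \mu^\star_{j|k,i}) \big/ \left(\sigma_n^{\beta/\beta_{d_J+i}} \tilde{\epsilon}_n^{2b_1}\right) \right]^2 \le 1 \right\};$$

for $j = N+1, \ldots, K$, $U_{j|k}$ is contained within an ellipsoid with radii $\{\sigma_n^{\beta/\beta_i}, i \in J^c\}$, and $1 \le N < K \le C_1 \sigma_n^{-d_{J^c}} \{\log(1/\tilde{\epsilon}_n)\}^{d_{J^c} + d_{J^c}/\tau}$, where $C_1 > 0$ does not depend on $n$ and $y_J$.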

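Second, for each $k \in \mathcal{Y}_J$ there exist $\alpha^\star_{j|k}$, $j = 1, \ldots, K$, with $\alpha^\star_{j|k} = 0$ for $j > N$, and $\mu^\star_{j|k} \in U_{j|k}$ for $j = N+1, \ldots, K$ such that for a positive constant $C_2$ and $\sigma^\star_{J^c} = \{\sigma_n^{\beta/\beta_i} \text{ for } i \in J^c\}$,

$$d_H\left( f_{0|J}(\cdot|k), f^\star_{|J}(\cdot|k) \right) \le C_2 \sigma_n^{\beta}, \tag{4.25}$$

where $f^\star_{|J}$ is defined in (4.24). Constant $C_2$ is the same for all $k \in \mathcal{Y}_J$ since all the bounds on $f_{0|J}$ assumed in Section 4.2 are uniform over $k$.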

Note also that our smoothness definition is different from the one used by Shen et al. (2013). In Lemmas 5.6 and 5.7 we show that our smoothness definition ($f_{0|J} \in \mathcal{C}^{\beta_{d_J+1}, \ldots, \beta_d, L}$) delivers an anisotropic Taylor expansion with bounds on remainder terms such that the argument on p. 637 of Shen et al. (2013) goes through.

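Third, by Lemma 5.10, which is an extension of a part of Proposition 1 in Shen et al. (2013), there exists a constant $B_0 > 0$ such that for all $y_J \in \mathcal{Y}_J$

$$F_{0|J}\left( \|\tilde{X}\| > a_{\sigma_n} \,\middle|\, y_J \right) \le B_0 \sigma_n^{4\beta + 2\varepsilon} \underline{\sigma}_n^{8}, \tag{4.26}$$

where $\underline{\sigma}_n = \min_{i \in J^c} \sigma_n^{\beta/\beta_i}$. For $m = N_J K$ we define $\theta^\star$ and $S_{\theta^\star}$ as:

$$\begin{aligned} \theta^\star = \Big\{ & \{\mu^\star_1, \ldots, \mu^\star_m\} = \{(k, \mu^\star_{j|k}), \; j = 1, \ldots, K, \; k \in \mathcal{Y}_J\}, \\ & \{\alpha^\star_1, \ldots, \alpha^\star_m\} = \{\alpha^\star_{jk} = \alpha^\star_{j|k} \pi_{0J}(k), \; j = 1, \ldots, K, \; k \in \mathcal{Y}_J\}, \\ & \sigma_J^{\star 2} = \{\sigma_i^{\star 2} = 1/[64 N_i^2 \beta \log(1/\sigma_n)], \; i \in J\}, \\ & \sigma^\star_{J^c} = \{\sigma^\star_i = \sigma_n^{\beta/\beta_i}, \; i \in J^c\} \Big\}, \end{aligned}$$

$$\begin{aligned} S_{\theta^\star} = \Big\{ & \{\mu_1, \ldots, \mu_m\} = \{(\mu_{jk,J}, \mu_{jk,J^c}), \; j = 1, \ldots, K, \; k \in \mathcal{Y}_J\}, \\ & \mu_{jk,J^c} \in U_{j|k}, \quad \mu_{jk,i} \in \left[ k_i - \frac{1}{4N_i}, \; k_i + \frac{1}{4N_i} \right], \; i \in J, \\ & \sigma_i^2 \in \left( 0, \sigma_i^{\star 2} \right), \; i \in J, \quad \sigma_i^2 \in \left( \sigma_i^{\star 2}(1 + \sigma_n^{2\beta})^{-1}, \; \sigma_i^{\star 2} \right), \; i \in J^c, \\ & (\alpha_1, \ldots, \alpha_m) = \{\alpha_{jk}, \; j = 1, \ldots, K, \; k \in \mathcal{Y}_J\} \in \Delta^{m-1}, \\ & \sum_{r=1}^{m} |\alpha_r - \alpha^\star_r| \le 2\sigma_n^{2\beta}, \quad \min_{j \le K, \, k \in \mathcal{Y}_J} \alpha_{jk} \ge \frac{\sigma_n^{2\beta + d_{J^c}}}{2m^2} \Big\}. \end{aligned}$$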

The rest of the proof of the Kullback-Leibler thickness condition follows the general argument developed for mixture models in Ghosal and van der Vaart (2007) and Shen et al. (2013) among others. First, we will show that for $m = N_J K$ and $\theta \in S_{\theta^\star}$, the Hellinger distance $d_H^2(p_0(\cdot,\cdot), p(\cdot,\cdot|\theta, m))$ can be bounded by $\sigma_n^{2\beta}$ up to a multiplicative constant. Second, we construct bounds on the ratios $p(\cdot,\cdot|\theta, m)/p_0(\cdot,\cdot)$ and combine them with the bound on the Hellinger distance using Lemma 5.9. Finally, we will show that the prior puts sufficient probability on $m = N_J K$ and $S_{\theta^\star}$.

For $f^\star_{|J}$ defined in (4.24), let us define

$$p^\star_{|J}(y_I, x|y_J) = \int_{A_{y_I}} f^\star_{|J}(\tilde{y}_I, x|y_J)\, d\tilde{y}_I.$$

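For $m = N_J K$ and $\theta \in S_{\theta^\star}$, we can bound the Hellinger distance between the DGP and the model as follows,

$$\begin{aligned} d_H^2(p_0(\cdot,\cdot), p(\cdot,\cdot|\theta, m)) &= d_H^2(p_{0|J}(\cdot|\cdot)\pi_{0J}(\cdot), p(\cdot,\cdot|\theta, m)) \\ &\le d_H^2(p_{0|J}(\cdot|\cdot)\pi_{0J}(\cdot), p^\star_{|J}(\cdot|\cdot)\pi_{0J}(\cdot)) + d_H^2(p^\star_{|J}(\cdot|\cdot)\pi_{0J}(\cdot), p(\cdot,\cdot|\theta, m)). \end{aligned}$$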

It follows from (4.25) and Lemma 5.4, which links distances between probability mass functions and the corresponding latent variable densities, that the first term on the right hand side of this inequality is bounded by $(C_2)^2 \sigma_n^{2\beta}$. Combining this result with the bound on $d_H^2(p^\star_{|J}(\cdot|\cdot)\pi_{0J}(\cdot), p(\cdot,\cdot|\theta,m))$ from Lemma 5.11 we obtain

$$d_H^2(p_0(\cdot,\cdot), p(\cdot,\cdot|\theta,m)) \lesssim \sigma_n^{2\beta}. \tag{4.27}$$

Next, for $\theta \in S_{\theta^\star}$ and $m = N_J K$, let us consider lower bounds on the ratio $p(y_J, y_I, x|\theta,m)/p_0(y_J, y_I, x)$. In Lemma 5.14 in the Appendix we show that lower bounds on the ratio $f_J(y_J, \tilde x|\theta,m)/[f_{0|J}(\tilde x|y_J)\pi_0(y_J)]$ imply the following bounds for all sufficiently large $n$: for any $x \in \mathcal{X}$ with $\|x\| \le a_{\sigma_n}$,

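$$\frac{p(y_J, y_I, x|\theta,m)}{p_0(y_J, y_I, x)} \ge C_3 \frac{\sigma_n^{2\beta}}{m^2} \equiv \lambda_n \tag{4.28}$$

for some constant $C_3 > 0$; and for any $x \in \mathcal{X}$ with $\|x\| > a_{\sigma_n}$,

$$\frac{p(y_J, y_I, x|\theta,m)}{p_0(y_J, y_I, x)} \ge \exp\Big\{-\frac{8\|x\|^2}{\underline{\sigma}_n^2} - C_4 \log n\Big\}, \tag{4.29}$$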

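for some constant $C_4 > 0$. Consider all sufficiently large $n$ such that $\lambda_n < e^{-1}$ and (4.28) and (4.29) hold. Then, for any $\theta \in S_{\theta^\star}$,

$$\sum_{y\in\mathcal{Y}} \int_{\mathcal{X}} \Big(\log\frac{p_0(y_J,y_I,x)}{p(y_J,y_I,x|\theta,m)}\Big)^2 1\Big\{\frac{p(y_J,y_I,x|\theta,m)}{p_0(y_J,y_I,x)} < \lambda_n\Big\}\, p_0(y_J,y_I,x)\,dx$$

$$= \sum_{y\in\mathcal{Y}} \int_{\tilde{\mathcal{X}}} \Big(\log\frac{p_0(y_J,y_I,x)}{p(y_J,y_I,x|\theta,m)}\Big)^2 1\Big\{\frac{p(y_J,y_I,x|\theta,m)}{p_0(y_J,y_I,x)} < \lambda_n\Big\}\, 1\{\tilde y_I \in A_{y_I}\}\, f_{0J}(y_J,\tilde x)\,d\tilde x$$

$$= \sum_{y\in\mathcal{Y}} \int_{\tilde{\mathcal{X}}} \Big(\log\frac{p_0(y_J,y_I,x)}{p(y_J,y_I,x|\theta,m)}\Big)^2 1\Big\{\frac{p(y_J,y_I,x|\theta,m)}{p_0(y_J,y_I,x)} < \lambda_n,\ \|x\| > a_{\sigma_n},\ \tilde y_I \in A_{y_I}\Big\}\, f_{0J}(y_J,\tilde x)\,d\tilde x$$

$$\le \sum_{y\in\mathcal{Y}} \int_{\{\tilde x:\ \|x\| > a_{\sigma_n}\}} \Big(\log\frac{p_0(y_J,y_I,x)}{p(y_J,y_I,x|\theta,m)}\Big)^2 1\{\tilde y_I \in A_{y_I}\}\, f_{0J}(y_J,\tilde x)\,d\tilde x$$

$$\le \sum_{y\in\mathcal{Y}} \int_{\{\tilde x:\ \|x\| > a_{\sigma_n}\}} \Big[\frac{128}{\underline{\sigma}_n^4}\|x\|^4 + 2(C_4\log n)^2\Big] f_{0|J}(\tilde x|y_J)\, 1\{\tilde y_I \in A_{y_I}\}\,d\tilde x\ \pi_{0J}(y_J)$$

$$\le \sum_{y_J\in\mathcal{Y}_J} \int_{\{\tilde x:\ \|\tilde x\| > a_{\sigma_n}\}} \Big[\frac{128}{\underline{\sigma}_n^4}\|\tilde x\|^4 + 2(C_4\log n)^2\Big] f_{0|J}(\tilde x|y_J)\,d\tilde x\ \pi_{0J}(y_J)$$

$$\le \frac{128}{\underline{\sigma}_n^4} \sum_{y_J\in\mathcal{Y}_J} \Big[E_{0|y_J}\big(\|\tilde X\|^8\big)\Big]^{1/2} \Big[F_{0|y_J}\big(\|\tilde X\| > a_{\sigma_n}\big)\Big]^{1/2} \pi_{0J}(y_J) + 2(C_4\log n)^2 B_0\, \sigma_n^{4\beta+2\varepsilon}\underline{\sigma}_n^8$$

$$\le C_5\, \sigma_n^{2\beta+\varepsilon} \tag{4.30}$$

for some constant $C_5 > 0$ and all sufficiently large $n$, where the last inequality holds by the tail condition in (4.8), (4.26), and $(\log n)^2 \sigma_n^{2\beta+\varepsilon}\underline{\sigma}_n^8 \to 0$. Furthermore, as $\lambda_n < e^{-1}$,

$$\log\frac{p_0(y_J,y_I,x)}{p(y_J,y_I,x|\theta,m)}\, 1\Big\{\frac{p(y_J,y_I,x|\theta,m)}{p_0(y_J,y_I,x)} < \lambda_n\Big\} \le \Big(\log\frac{p_0(y_J,y_I,x)}{p(y_J,y_I,x|\theta,m)}\Big)^2 1\Big\{\frac{p(y_J,y_I,x|\theta,m)}{p_0(y_J,y_I,x)} < \lambda_n\Big\}$$

and, therefore,

$$\sum_{y\in\mathcal{Y}} \int_{\mathcal{X}} \log\frac{p_0(y_J,y_I,x)}{p(y_J,y_I,x|\theta,m)}\, 1\Big\{\frac{p(y_J,y_I,x|\theta,m)}{p_0(y_J,y_I,x)} < \lambda_n\Big\}\, p_0(y_J,y_I,x)\,dx \le C_5\, \sigma_n^{2\beta+\varepsilon}. \tag{4.31}$$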

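Inequalities (4.27), (4.30), and (4.31) combined with Lemma 5.9 imply

$$E_0\Big(\log\frac{p_0(y_J,y_I,x)}{p(y_J,y_I,x|\theta,m)}\Big) \le A\tilde\epsilon_n^2, \qquad E_0\Big(\Big[\log\frac{p_0(y_J,y_I,x)}{p(y_J,y_I,x|\theta,m)}\Big]^2\Big) \le A\tilde\epsilon_n^2$$

for any $\theta \in S_{\theta^\star}$, $m = N_J K$, and some positive constant $A$ (details are provided in Lemma 5.15 in the Appendix).

By Lemma 5.16 in the Appendix, for all sufficiently large $n$, $s = 1 + 1/\beta + 1/\tau$, and some $C_6 > 0$,

$$\Pi(\mathcal{K}(p_0,\tilde\epsilon_n)) \ge \Pi(m = N_J K,\ \theta \in S_{\theta^\star}) \ge \exp\Big[-C_6 N_J\, \tilde\epsilon_n^{-d_{J^c}/\beta} \{\log(n)\}^{d_{J^c}s + \max\{\tau_1,1,\tau_2/\tau\}}\Big].$$

The last expression of the above display is bounded below by $\exp\{-Cn\tilde\epsilon_n^2\}$ for any $C > 0$, $\tilde\epsilon_n = [N_J/n]^{\beta/(2\beta+d_{J^c})}(\log n)^{t_J}$, any $t_J > (d_{J^c}s + \max\{\tau_1,1,\tau_2/\tau\})/(2 + d_{J^c}/\beta)$, and all sufficiently large $n$. Since the inequality in the definition of $t_J$ is strict, the claim of the theorem follows.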

When J = ∅ and NJ = 1, the preceding argument delivers the claim of the theorem if we add an artificial discrete coordinate with only one possible value to the vector of observables.
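The rate expression in the last step can be evaluated numerically. The following is a minimal sketch, assuming the formulas $\tilde\epsilon_n = [N_J/n]^{\beta/(2\beta+d_{J^c})}(\log n)^{t_J}$ and $t_J > (d_{J^c}s + \max\{\tau_1,1,\tau_2/\tau\})/(2+d_{J^c}/\beta)$ as reconstructed above; the function and argument names are illustrative and do not appear in the paper.

```python
import math

def log_exponent_threshold(beta, d_jc, tau, tau1, tau2):
    """Strict lower bound on t_J:
    (d_Jc * s + max{tau1, 1, tau2/tau}) / (2 + d_Jc / beta), s = 1 + 1/beta + 1/tau."""
    s = 1.0 + 1.0 / beta + 1.0 / tau
    return (d_jc * s + max(tau1, 1.0, tau2 / tau)) / (2.0 + d_jc / beta)

def contraction_rate(n, n_j, beta, d_jc, t_j):
    """eps_n = (N_J / n)^(beta / (2 beta + d_Jc)) * (log n)^t_J."""
    return (n_j / n) ** (beta / (2.0 * beta + d_jc)) * math.log(n) ** t_j

# Hypothetical inputs, chosen only for illustration.
t_min = log_exponent_threshold(beta=2.0, d_jc=1, tau=2.0, tau1=1.0, tau2=2.0)
eps = contraction_rate(n=10_000, n_j=8, beta=2.0, d_jc=1, t_j=t_min)
```

At the boundary value of $t_J$ the two sides $n\tilde\epsilon_n^2$ and $N_J\tilde\epsilon_n^{-d_{J^c}/\beta}(\log n)^{t_J(2+d_{J^c}/\beta)}$ coincide, which is why a strict inequality on $t_J$ leaves room for the arbitrary constant $C$.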

4.4.2. Proof of Theorem 4.2 for $J^c = \emptyset$.

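In this case, the proof from the previous subsection can be simplified as follows. For $m = N_J$ and for any $\beta > 0$ we define $\theta^\star$ and $S_{\theta^\star}$ as

$$\theta^\star = \Big\{ \{\mu_1^\star,\ldots,\mu_m^\star\} = \{k,\ k\in\mathcal{Y}_J\},\quad \{\alpha_1^\star,\ldots,\alpha_m^\star\} = \{\alpha_k^\star,\ k\in\mathcal{Y}_J\} = \{\pi_0(k)\}_{k\in\mathcal{Y}_J},$$

$$\sigma^{\star 2} = \Big\{\sigma_i^{\star 2} = \frac{1}{64 N_i^2 \beta \log(1/\sigma_n)},\ i\in J\Big\} \Big\},$$

$$S_{\theta^\star} = \Big\{ \{\mu_1,\ldots,\mu_m\} = \{\mu_k,\ k\in\mathcal{Y}_J\},\quad \mu_{k,i} \in \Big[k_i - \frac{1}{4N_i},\ k_i + \frac{1}{4N_i}\Big],\ i = 1,\ldots,d_J,$$

$$\sigma = \{\sigma_i \in (0, \sigma_i^\star),\ i\in J\},\quad \{\alpha_j,\ j=1,\ldots,m\} = \{\alpha_k,\ k\in\mathcal{Y}_J\} \in \Delta^{m-1},$$

$$\sum_{k\in\mathcal{Y}_J} |\alpha_k - \alpha_k^\star| \le 2\sigma_n^{2\beta},\quad \min_{k\in\mathcal{Y}_J} \alpha_k \ge \frac{\sigma_n^{2\beta}}{2m^2} \Big\}.$$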

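For $m = N_J$ and $\theta \in S_{\theta^\star}$, a simplification of the proof of Lemma 5.11 delivers

$$d_H^2(p_0(\cdot), p(\cdot|\theta,m)) \le 2\max_{k\in\mathcal{Y}_J} \int_{A_k^c} \phi(\tilde y_J; \mu_k, \sigma)\,d\tilde y_J + \sum_{k\in\mathcal{Y}_J} |\alpha_k^\star - \alpha_k| \lesssim \sigma_n^{2\beta}.$$

A simplification of the derivations in Lemma 5.14 shows that for all $y_J \in \mathcal{Y}_J$

$$\frac{p(y_J|\theta,m)}{p_0(y_J)} \ge \frac{1}{2}\cdot\frac{\sigma_n^{2\beta}}{2m^2} \equiv \lambda_n.$$

Then, for any $\theta \in S_{\theta^\star}$, $p(y_J|\theta,m)/p_0(y_J) \ge \lambda_n$ for all $y_J \in \mathcal{Y}_J$. As $\lambda_n \to 0$, by Lemma 5.9 for $\lambda_n < \lambda_0$, both $E_0\big(\log\frac{p_0(y_J)}{p(y_J|\theta,m)}\big)$ and $E_0\big(\big[\log\frac{p_0(y_J)}{p(y_J|\theta,m)}\big]^2\big)$ are bounded by $C_7 \log(1/\lambda_n)^2 \sigma_n^{2\beta} \le A\tilde\epsilon_n^2$ for some constant $A$. By the simplification of Lemma 5.16 for this particular case, for all sufficiently large $n$ and some $C_8 > 0$,

$$\Pi(\mathcal{K}(p_0,\tilde\epsilon_n)) \ge \Pi(m = N_J,\ \theta \in S_{\theta^\star}) \ge \exp\Big[-C_8 N_J \{\log(n)\}^{\max\{\tau_1,1\}}\Big].$$

The last expression of the above display is bounded below by $\exp\{-Cn\tilde\epsilon_n^2\}$ for any $C > 0$, $\tilde\epsilon_n = [N_J/n]^{1/2}(\log n)^{t_J}$, any $t_J > \max\{\tau_1,1\}/2$, and all sufficiently large $n$. Since the inequality in the definition of $t_J$ is strict, the claim of the theorem follows.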

5. Future Work.

It seems feasible to extend the results of this paper to conditional density estimation by covariate dependent mixtures along the lines of Norets and Pati (2017). We leave this to future work.

References.

Aerts, M., I. Augustyns, and P. Janssen (1997): “Local Polynomial Estimation of Contingency Table Cell Probabilities,” Statistics, 30, 127–148.

Aitchison, J. and C. G. G. Aitken (1976): “Multivariate binary discrimination by the kernel method,” Biometrika, 63, 413–420.

Albert, J. H. and S. Chib (1993): “Bayesian Analysis of Binary and Polychotomous Response Data,” Journal of the American Statistical Association, 88, 669–679.

Barron, A., L. Birgé, and P. Massart (1999): “Risk bounds for model selection via penalization,” Probab. Theory Related Fields, 113, 301–413.

Bhattacharya, A., D. Pati, and D. Dunson (2014): “Anisotropic function estimation using multi-bandwidth Gaussian processes,” The Annals of Statistics, 42, 352–381.

Burda, M., M. Harding, and J. Hausman (2008): “A Bayesian Mixed Logit-Probit Model for Multinomial Choice,” Journal of Econometrics, 147, 232–246.

Burman, P. (1987): “Smoothing Sparse Contingency Tables,” Sankhyā: The Indian Journal of Statistics, Series A (1961-2002), 49, 24–36.

Canale, A. and D. B. Dunson (2011): “Bayesian Kernel Mixtures for Counts,” Journal of the American Statistical Association, 106, 1528–1539.

——— (2015): “Bayesian multivariate mixed-scale density estimation,” Statistics and its Interface, 8, 195–201.

Chamberlain, G. and K. Hirano (1999): “Predictive Distributions Based on Longitudinal Earnings Data,” Annales d'Économie et de Statistique, 211–242.

Chib, S. and E. Greenberg (2010): “Additive cubic spline regression with Dirichlet process mixture errors,” Journal of Econometrics, 156, 322–336.

Dey, D., P. Muller, and D. Sinha, eds. (1998): Practical Nonparametric and Semiparametric Bayesian Statistics, Lecture Notes in Statistics, Vol. 133, Springer.

DeYoreo, M. and A. Kottas (2017): “Bayesian Nonparametric Modeling for Multivariate Ordinal Regression,” Journal of Computational and Graphical Statistics, 0, 1–14.

Dong, J. and J. S. Simonoff (1995): “A Geometric Combination Estimator for d-Dimensional Ordinal Sparse Contingency Tables,” Ann. Statist., 23, 1143–1159.

Efromovich, S. (2011): “Nonparametric estimation of the anisotropic probability density of mixed variables,” Journal of Multivariate Analysis, 102, 468–481.

Escobar, M. and M. West (1995): “Bayesian Density Estimation and Inference Using Mixtures,” Journal of the American Statistical Association, 90, 577–588.

Ghosal, S., J. K. Ghosh, and A. W. v. d. Vaart (2000): “Convergence Rates of Posterior Distributions,” The Annals of Statistics, 28, 500–531.

Ghosal, S. and A. van der Vaart (2007): “Posterior convergence rates of Dirichlet mixtures at smooth densities,” The Annals of Statistics, 35, 697–723.

Ghosal, S. and A. W. van der Vaart (2001): “Entropies and rates of convergence for maximum likelihood and Bayes estimation for mixtures of normal densities,” The Annals of Statistics, 29, 1233–1263.

Hall, P. and D. M. Titterington (1987): “On Smoothing Sparse Multinomial Data,” Australian Journal of Statistics, 29, 19–37.

Ibragimov, I. and R. Hasminskii (1977): “Estimation of infinite-dimensional parameter in Gaussian white noise,” Doklady Akademii Nauk SSSR, 236, 1053–1055.

Ibragimov, I. A. and R. Z. Hasminskii (1984): “More on the estimation of distribution densities,” Journal of Soviet Mathematics, 25, 1155–1165.

Jensen, M. J. and J. M. Maheu (2014): “Estimating a semiparametric asymmetric stochastic volatility model with a Dirichlet process mixture,” Journal of Econometrics, 178, 523–538.

Kruijer, W., J. Rousseau, and A. van der Vaart (2010): “Adaptive Bayesian density estimation with location-scale mixtures,” Electronic Journal of Statistics, 4, 1225–1257.

MacEachern, S. and P. Muller (1998): “Estimating Mixture of Dirichlet Process Models,” Journal of Computational and Graphical Statistics, 7, 223–238.

McCulloch, R. and P. Rossi (1994): “An exact likelihood analysis of the multinomial probit model,” Journal of Econometrics, 64, 207–240.

Miller, J. W. and M. T. Harrison (2017): “Mixture Models With a Prior on the Number of Components,” Journal of the American Statistical Association, 0, 1–17.

Neal, R. (2000): “Markov Chain Sampling Methods for Dirichlet Process Mixture Models,” Journal of Computational and Graphical Statistics, 9, 249–265.

Norets, A. (2017): “Optimal Auxiliary Priors and Reversible Jump Proposals for a Class of Variable Dimension Models,” Unpublished manuscript, Brown University.

Norets, A. and D. Pati (2017): “Adaptive Bayesian Estimation of Conditional Densities,” Econometric Theory, 33, 980–1012.

Norets, A. and J. Pelenis (2012): “Bayesian modeling of joint and conditional distributions,” Journal of Econometrics, 168, 332–346.

——— (2014): “Posterior Consistency in Conditional Density Estimation by Covariate Dependent Mixtures,” Econometric Theory, 30, 606–646.

Rousseau, J. (2010): “Rates of convergence for the posterior distributions of mixtures of betas and adaptive nonparametric estimation of the density,” The Annals of Statistics, 38, 146–180.

Schumaker, L. (2007): Spline functions: basic theory, Cambridge New York: Cambridge University Press.

Shen, W., S. T. Tokdar, and S. Ghosal (2013): “Adaptive Bayesian multivariate density estimation with Dirichlet mixtures,” Biometrika, 100, 623–640.

Tsybakov, A. B. (2008): Introduction to Nonparametric Estimation (Springer Series in Statistics), Springer, New York, USA.

Appendix.


Lemma 5.1. For $q_j, q_l$, $j \ne l$, defined in (3.4), the total variation distance $d_{TV}(q_j, q_l)$ is bounded below by $\mathrm{const}\cdot\Gamma_n$.

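Proof. Let us establish several facts about $g_r$ in the definition of $q_j$. For any $(\tilde y, x) \in [0,1]^d$, there exists $r(\tilde y, x)$ such that

$$g_r(\tilde y, x) = 0,\quad \forall r \ne r(\tilde y, x). \tag{5.1}$$

For $(\tilde y, x) \in B_r$, $r(\tilde y, x) = r$, and for $(\tilde y, x) \notin \cup_{r=1}^{\bar m} B_r$, $r(\tilde y, x)$ can have an arbitrary value. Thus,

$$d_{TV}(q_j, q_l) = \sum_y \int \Big| \int_{A_y} \Big[\sum_{r=1}^{\bar m} (w_r^j - w_r^l)\, g_r(\tilde y, x)\Big] d\tilde y \Big|\, dx = \sum_y \int \Big| \int_{A_y} (w^j_{r(\tilde y,x)} - w^l_{r(\tilde y,x)})\, g_{r(\tilde y,x)}(\tilde y, x)\, d\tilde y \Big|\, dx.$$

From $h_i = (2/N_i)\cdot R_i$ for $i \in \{1,\ldots,d_y\}$, where $R_i$ is a positive integer, and the definitions of $g$, $g_r$, and $A_y$, it follows that for fixed $y \in \mathcal{Y}$ and $x \in [0,1]^{d_x}$, $(w^j_{r(\tilde y,x)} - w^l_{r(\tilde y,x)})\, g_{r(\tilde y,x)}(\tilde y, x)$ does not change sign as $\tilde y$ changes within $A_y$ ($r(\tilde y, x)$ is the same $\forall \tilde y \in A_y$ by the choice of $c_i^r$ and $h_i$). Therefore,

$$d_{TV}(q_j, q_l) = \int\!\!\int \big|(w^j_{r(\tilde y,x)} - w^l_{r(\tilde y,x)})\, g_{r(\tilde y,x)}(\tilde y, x)\big|\, d\tilde y\, dx = \sum_{r=1}^{\bar m} \int_{B_r} \big|(w^j_{r(z)} - w^l_{r(z)})\, g_{r(z)}(z)\big|\, dz = \sum_{r=1}^{\bar m} |w^j_r - w^l_r| \int_{B_r} |g_r(z)|\, dz. \tag{5.2}$$

Finally,

$$d_{TV}(q_j, q_l) = \sum_{r=1}^{\bar m} 1\{w^j_r \ne w^l_r\}\cdot\Gamma_n\cdot\prod_{i=1}^d h_i\cdot\Big[\int_{-1/2}^{1/2} |g(u)|\,du\Big]^d \quad \text{(change of variables in (5.2))}$$

$$\ge \Gamma_n\cdot\prod_{i=1}^d m_i h_i\cdot\Big[\int_{-1/2}^{1/2} |g(u)|\,du\Big]^d \Big/ 8 \quad \text{(by Lemma 3.2)}$$

$$\ge \Gamma_n\cdot\Big[\int_{-1/2}^{1/2} |g(u)|\,du \Big/ 2\Big]^d \Big/ 8 \quad \text{(since } m_i h_i > 1/2\text{)}.$$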

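Lemma 5.2. For $\Gamma_n \to 0$ and $\bar m \ge 8$ and a sufficiently small $c_0$ in the definition of $g$, condition (3.3) in Lemma 3.1 holds for all sufficiently large $n$.

Proof. By Lemma 3.2, it suffices to show that

$$d_{KL}(Q_j^n, Q_0^n) = n\cdot d_{KL}(q_j, q_0) < (\bar m \log 2)/64. \tag{5.3}$$

First, note that for a density $f \ge c > 0$ on $[0,1]^d$, $d_{KL}(f, 1_{[0,1]^d})$ is bounded above by

$$d_{KL}(f, 1_{[0,1]^d}) + d_{KL}(1_{[0,1]^d}, f) = \int_{[0,1]^d} (f-1)\log f \le \frac{\int_{[0,1]^d} (f-1)^2}{c}. \tag{5.4}$$

Next, note that for any $z \in [0,1]^d$, the density in the definition of $q_j$ satisfies

$$1_{[0,1]^d}(z) + \sum_{r=1}^{\bar m} w_r^j g_r(z) \ge 1 - \Gamma_n \Big[\max_{u\in[-1/2,1/2]} g(u)\Big]^d \ge 1/2 \tag{5.5}$$

for all sufficiently large $n$. Thus,

$$d_{KL}(q_j, q_0) \le d_{KL}\Big(1_{[0,1]^d} + \sum_{r=1}^{\bar m} w_r^j g_r,\ 1_{[0,1]^d}\Big) \quad \text{(by (5.14))}$$

$$\le 2\int \Big[\sum_{r=1}^{\bar m} w_r^j g_r(z)\Big]^2 dz \quad \text{(by (5.4) and (5.5))}$$

$$= 2\int \sum_{r=1}^{\bar m} w_r^j (g_r(z))^2\, dz \quad \text{(since } g_r(z)g_l(z) = 0,\ \forall r \ne l\text{)}$$

$$\le 2\bar m \int (g_1(z))^2\, dz = 2\Gamma_n^2 \prod_i (m_i h_i) \Big[\int_{-1/2}^{1/2} g(u)^2\,du\Big]^d \le 2\Gamma_n^2 \Big[\int_{-1/2}^{1/2} g(u)^2\,du\Big]^d \le 2\Gamma_n^2 c_0^{2d}. \tag{5.6}$$

Finally,

$$\bar m = \prod_{i=1}^d m_i \ge 2^{-d}\prod_{i=1}^d h_i^{-1} \quad \text{(by definitions of } \bar m \text{ and } m_i\text{)}$$

$$= 2^{-d} \prod_{i\in J_*} (N_i/2) \cdot \prod_{i\in J_*^c,\, i\le d_y} \big(\Gamma_n^{-\beta_i^{-1}}/\varrho_i\big) \cdot \prod_{i\in J_*^c,\, i> d_y} \Gamma_n^{-\beta_i^{-1}} \quad \text{(by definition of } h_i\text{)}$$

$$\ge 2^{-d} \prod_{i\in J_*} (N_i/2) \cdot \prod_{i\in J_*^c,\, i\le d_y} \big(\Gamma_n^{-\beta_i^{-1}}/2\big) \cdot \prod_{i\in J_*^c,\, i> d_y} \Gamma_n^{-\beta_i^{-1}} \quad \text{(by restrictions on } \varrho_i\text{)}$$

$$= 2^{-d-d_y}\cdot N_{J_*}\cdot\Gamma_n^{-\beta_{J_*^c}^{-1}} = 2^{-d-d_y}\, n\Gamma_n^2 \ge 2^{-d-d_y}\, n\cdot d_{KL}(q_j, q_0)/(2c_0^{2d}) \quad \text{(by (5.6))}.$$

The last inequality implies (5.3) if

$$c_0 \le \big[2^{-(d+d_y+7)}\log 2\big]^{1/(2d)}.$$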

Lemma 5.3. For $j \in \{0,\ldots,M\}$, $q_j \in \mathcal{C}^{\beta_1^*,\ldots,\beta_d^*,L}$ with $L = 1$ for any sufficiently small constant $c_0$ in the definition of $g$.

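Proof. For $j = 0$, the result is trivial. For $j \ne 0$, consider $k = (k_1,\ldots,k_d)$ and $z, \Delta z \in \mathbb{R}^d$ such that for some $i \in \{1,\ldots,d\}$, $\Delta z_i \ne 0$, for any $l \ne i$, $\Delta z_l = 0$, $\sum_{l=1}^d k_l/\beta_l^* < 1$, and $\sum_{l=1}^d k_l/\beta_l^* + 1/\beta_i^* \ge 1$ so that

$$0 \le \beta_i^*\Big(1 - \sum_{l=1}^d k_l/\beta_l^*\Big) \le 1. \tag{5.7}$$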

For $r(\cdot)$ defined in (5.1),

$$D^k q_j(z) = 1\{k = (0,\ldots,0)\} + w_{r(z)}\Gamma_n \prod_{l=1}^d g^{(k_l)}\big((z_l - c_l^{r(z)})/h_l\big)/h_l^{k_l}$$

$$= 1\{k = (0,\ldots,0)\} + B_i\cdot w_{r(z)}\, h_i^{\beta_i^*(1 - \sum_{l=1}^d k_l/\beta_l^*)} \prod_{l=1}^d g^{(k_l)}\big((z_l - c_l^{r(z)})/h_l\big), \tag{5.8}$$

where $B_i \in \{1, 1/2, \varrho_i^{-\beta_i^*}\} \subset (0,1]$. In what follows we consider $k \ne (0,\ldots,0)$ to simplify the notation; when $k = (0,\ldots,0)$ the argument below goes through as the indicator function $1\{k = (0,\ldots,0)\}$ cancels out in the differences of derivatives. From Tsybakov (2008), (2.33)-(2.34), for any sufficiently small $c_0$ and $s \le \max_l \beta_l^* + 1$,

$$\max_z |g^{(s)}(z)| \le 1/8. \tag{5.9}$$

This implies that

$$\big|g^{(k_i)}\big((z_i + \Delta z_i - c_i^r)/h_i\big) - g^{(k_i)}\big((z_i - c_i^r)/h_i\big)\big| \le |\Delta z_i|/(8h_i). \tag{5.10}$$

First, let us consider the case when $r(z) = r(z + \Delta z)$ and $|\Delta z_i| \le h_i$. From (5.8), (5.9), and (5.10),

where the last inequality follows from ∆zi ≤hi and (5.7).

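Third, consider the case when $r(z) \ne r(z + \Delta z)$ and $|\Delta z_i| \le h_i/2$. If $w_{r(z)} = w_{r(z+\Delta z)} = 0$ or $z, z + \Delta z \notin \cup_{r=1}^{\bar m} B_r$,

$$|D^k q_j(z + \Delta z) - D^k q_j(z)| = D^k q_j(z + \Delta z) = D^k q_j(z) = 0.$$

If $w_{r(z)} \ne w_{r(z+\Delta z)}$ or if one of $z$ and $z + \Delta z$ is not in $\cup_{r=1}^{\bar m} B_r$, then without loss of generality suppose that $w_{r(z)} = 1$ or that $z + \Delta z \notin \cup_{r=1}^{\bar m} B_r$. Let $|\Delta z_i^\star| \in [0, |\Delta z_i|]$ and $\Delta z^\star = (0,\ldots,0,\Delta z_i^\star,0,\ldots,0)$ be such that $z + \Delta z^\star$ is a boundary point of $B_{r(z)}$. Then, $D^k q_j(z + \Delta z^\star) = 0$ and (5.11) imply

$$|D^k q_j(z + \Delta z) - D^k q_j(z)| = |D^k q_j(z)| = |D^k q_j(z + \Delta z^\star) - D^k q_j(z)| \le |\Delta z_i^\star|^{\beta_i^*(1 - \sum_{l=1}^d k_l/\beta_l^*)} \le |\Delta z_i|^{\beta_i^*(1 - \sum_{l=1}^d k_l/\beta_l^*)}.$$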

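If $w_{r(z)} = w_{r(z+\Delta z)} = 1$ and $z, z + \Delta z \in \cup_{r=1}^{\bar m} B_r$, then by construction of $q_j$ and $g$,

$$|D^k q_j(z + \Delta z) - D^k q_j(z)| = |D^k q_j(z + \Delta z + 0.5h_i) - D^k q_j(z + 0.5h_i)| \le |\Delta z_i|^{\beta_i^*(1 - \sum_{l=1}^d k_l/\beta_l^*)},$$

where the last inequality follows from (5.11).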

Finally, when $r(z) \ne r(z + \Delta z)$ and $\Delta z_i > h_i/2$,

Now, let us consider a general $\Delta z$ such that for $\Delta z_i \ne 0$, $\sum_{l=1}^d k_l/\beta_l^* + 1/\beta_i^* \ge 1$.

$$|D^k q_j(z + \Delta z) - D^k q_j(z)| \le \sum_{i=1}^d \big|D^k q_j(z_1,\ldots,z_{i-1}, z_i + \Delta z_i,\ldots,z_d + \Delta z_d) - D^k q_j(z_1,\ldots,z_i, z_{i+1} + \Delta z_{i+1},\ldots,z_d + \Delta z_d)\big|.$$

The preceding argument applies to every term in this sum and, thus, $q_j \in \mathcal{C}^{\beta_1^*,\ldots,\beta_d^*,1}$.

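Lemma 5.4. Let $f_i: \tilde{\mathcal{Y}} \times \mathcal{X} \to \mathbb{R}$, $i \in \{1,2\}$, be densities with respect to a product measure $\lambda \times \mu$ on $\tilde{\mathcal{Y}} \times \mathcal{X} \subset \mathbb{R}^d$. For a finite set $\mathcal{Y}$, let $\{A_y, y \in \mathcal{Y}\}$ be a partition of $\tilde{\mathcal{Y}}$ and let $p_i(y, x) = \int_{A_y} f_i(\tilde y, x)\,d\lambda(\tilde y)$. Then,

$$d_{TV}(p_1, p_2) \le d_{TV}(f_1, f_2) \tag{5.12}$$

$$d_H(p_1, p_2) \le d_H(f_1, f_2) \tag{5.13}$$

$$d_{KL}(p_1, p_2) \le d_{KL}(f_1, f_2). \tag{5.14}$$

Also, if for given $(y, x)$, $f_2(\tilde y, x) > 0$ for any $\tilde y \in A_y$, then

$$\inf_{\tilde y \in A_y} \frac{f_1(\tilde y, x)}{f_2(\tilde y, x)} \le \frac{p_1(y, x)}{p_2(y, x)} \le \sup_{\tilde y \in A_y} \frac{f_1(\tilde y, x)}{f_2(\tilde y, x)}. \tag{5.15}$$

Proof. Trivially,

$$d_{TV}(p_1, p_2) = \sum_y \int \Big| \int_{A_y} (f_1(\tilde y, x) - f_2(\tilde y, x))\,d\lambda(\tilde y) \Big|\, d\mu(x) \le \sum_y \int\!\!\int_{A_y} |f_1(\tilde y, x) - f_2(\tilde y, x)|\,d\lambda(\tilde y)\,d\mu(x) = d_{TV}(f_1, f_2).$$

By the Hölder inequality,

$$d_H^2(p_1, p_2) = 2\Big(1 - \sum_y \int \sqrt{\int 1_{A_y}(\tilde y_1) f_1(\tilde y_1, x)\,d\lambda(\tilde y_1)\cdot\int 1_{A_y}(\tilde y_2) f_2(\tilde y_2, x)\,d\lambda(\tilde y_2)}\ d\mu(x)\Big)$$

$$\le 2\Big(1 - \sum_y \int\!\!\int 1_{A_y}(\tilde y)\sqrt{f_1(\tilde y, x) f_2(\tilde y, x)}\,d\lambda(\tilde y)\,d\mu(x)\Big) = d_H^2(f_1, f_2).$$

For fixed $(y, x)$,

$$\int_{A_y} \big(f_1(\tilde y, x)/p_1(y, x)\big) \log\frac{f_1(\tilde y, x)/p_1(y, x)}{f_2(\tilde y, x)/p_2(y, x)}\,d\lambda(\tilde y) \ge 0$$

since the Kullback-Leibler divergence is nonnegative. Thus,

$$\int_{A_y} f_1(\tilde y, x) \log\frac{f_1(\tilde y, x)}{f_2(\tilde y, x)}\,d\lambda(\tilde y) \ge \int_{A_y} f_1(\tilde y, x) \log\frac{p_1(y, x)}{p_2(y, x)}\,d\lambda(\tilde y) = p_1(y, x)\log\frac{p_1(y, x)}{p_2(y, x)}.$$

This inequality integrated with respect to $d\mu(x)$ and summed over $y$ implies (5.14). The last claim follows from

$$f_2(\tilde y, x) \inf_{\tilde z \in A_y} \frac{f_1(\tilde z, x)}{f_2(\tilde z, x)} \le f_1(\tilde y, x) \le f_2(\tilde y, x) \sup_{\tilde z \in A_y} \frac{f_1(\tilde z, x)}{f_2(\tilde z, x)}.$$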
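The statement of Lemma 5.4 can be illustrated on a finite grid, where integration over $A_y$ reduces to summing probabilities within cells. The sketch below is only an illustration under that discretization assumption; the two probability vectors and the partition are made-up values, and the distances follow the conventions used in the proof (total variation as the $L_1$ distance, squared Hellinger as $\sum(\sqrt{a}-\sqrt{b})^2$).

```python
import math

def tv(p, q):
    """Total variation as the L1 distance, matching the convention in the proof."""
    return sum(abs(a - b) for a, b in zip(p, q))

def hellinger_sq(p, q):
    """Squared Hellinger distance: sum of (sqrt(a) - sqrt(b))^2."""
    return sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))

def kl(p, q):
    """Kullback-Leibler divergence of p from q (q is assumed positive)."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def coarsen(p, cells):
    """Aggregate fine-grid probabilities over the cells of a partition."""
    return [sum(p[i] for i in cell) for cell in cells]

# Illustrative fine-grid distributions (stand-ins for f_1, f_2) and a partition {A_y}.
f1 = [0.10, 0.20, 0.05, 0.25, 0.30, 0.10]
f2 = [0.25, 0.15, 0.10, 0.10, 0.15, 0.25]
cells = [(0, 1), (2, 3), (4, 5)]
p1, p2 = coarsen(f1, cells), coarsen(f2, cells)

# Ratio bounds in the spirit of (5.15), cell by cell.
ratio_ok = all(
    min(f1[i] / f2[i] for i in cell) <= a / b <= max(f1[i] / f2[i] for i in cell)
    for cell, a, b in zip(cells, p1, p2)
)
```

In this example all three distances between the coarsened distributions are strictly smaller than between the fine-grid ones, as (5.12)-(5.14) require.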

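Lemma 5.5. For $\Gamma_n$, $h_i$, $\varrho_i$, and $\beta_i^*$ defined in Section 3, (i) $\beta_i^* \ge \beta_i$ for $i = 1,\ldots,d$ and (ii) $\varrho_i \in (1,2]$ for $i \in J_*^c \cap \{1,\ldots,d_y\}$.

Proof. For $i \notin J_*$, $\beta_i^* = \beta_i$ by definition. For $i \in J_*$, from the definition of $\Gamma_n$,

$$\Gamma_n \le \Big[\frac{N_{J_*}/N_i}{n}\Big]^{\frac{1}{2+\beta_{J_*^c}^{-1}+\beta_i^{-1}}} = \Gamma_n^{\frac{2+\beta_{J_*^c}^{-1}}{2+\beta_{J_*^c}^{-1}+\beta_i^{-1}}}\, N_i^{\frac{-1}{2+\beta_{J_*^c}^{-1}+\beta_i^{-1}}},$$

which implies $N_i^{-\beta_i} \ge \Gamma_n$. By the definition of $\beta_i^*$, $N_i^{-\beta_i^*} = \Gamma_n$ and, thus, $\beta_i^* \ge \beta_i$. For $i \in J_*^c$, from the definition of $\Gamma_n$,

$$\Big[\frac{N_{J_*}N_i}{n}\Big]^{\frac{1}{2+\beta_{J_*^c}^{-1}-\beta_i^{-1}}} \ge \Big[\frac{N_{J_*}}{n}\Big]^{\frac{1}{2+\beta_{J_*^c}^{-1}}},$$

which implies

$$N_i \ge \Big[\frac{N_{J_*}}{n}\Big]^{\frac{2+\beta_{J_*^c}^{-1}-\beta_i^{-1}}{2+\beta_{J_*^c}^{-1}} - 1} = \Gamma_n^{-\beta_i^{-1}} \implies \Gamma_n^{\beta_i^{-1}} \ge \frac{1}{N_i},$$

and, therefore, $\Gamma_n^{\beta_i^{-1}} N_i \ge 1$. Next, define

$$\varrho_i = \frac{\big\lfloor \Gamma_n^{\beta_i^{-1}} N_i/2 \big\rfloor + 1}{\Gamma_n^{\beta_i^{-1}} N_i/2}.$$

Then $\varrho_i \in (1,2]$ as $\Gamma_n^{\beta_i^{-1}} N_i \ge 1$.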
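Part (ii) of Lemma 5.5 is easy to check numerically. The following is a minimal sketch, assuming the definition $\varrho_i = (\lfloor t \rfloor + 1)/t$ with $t = \Gamma_n^{1/\beta_i} N_i/2$ as written in the proof; the function name and the input values are hypothetical.

```python
import math

def varrho(gamma_n, beta_i, n_i):
    """varrho_i = (floor(t) + 1) / t with t = Gamma_n^(1/beta_i) * N_i / 2.
    Lemma 5.5 (ii) concerns the case Gamma_n^(1/beta_i) * N_i >= 1, i.e. t >= 1/2."""
    t = gamma_n ** (1.0 / beta_i) * n_i / 2.0
    return (math.floor(t) + 1) / t

# Hypothetical (Gamma_n, beta_i, N_i) triples, each with Gamma_n^(1/beta_i) * N_i >= 1.
examples = [(0.25, 1.0, 4), (0.25, 2.0, 10), (0.25, 2.0, 4), (0.01, 2.0, 37)]
values = [varrho(g, b, n) for g, b, n in examples]
```

Since $\lfloor t \rfloor + 1 > t$, the ratio always exceeds 1; it is at most 2 because $\lfloor t \rfloor + 1 \le 2t$ whenever $t \ge 1/2$. The rounding also makes $\varrho_i \Gamma_n^{1/\beta_i} N_i/2$ an integer, which is consistent with the relation $h_i = (2/N_i)\cdot R_i$ used in the proof of Lemma 5.1.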

5.2. Proofs and Auxiliary Results for Posterior Contraction Rates.

5.2.1. Prior Thickness.

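Lemma 5.6. (Anisotropic Taylor Expansion) For $f \in \mathcal{C}^{\beta_1,\ldots,\beta_d,L}$ and $r \in \{1,\ldots,d\}$,

$$f(x_1 + y_1,\ldots,x_d + y_d) = \sum_{k\in I_r} \frac{y^k}{k!} D^k f(x_1,\ldots,x_r, x_{r+1} + y_{r+1},\ldots,x_d + y_d) \tag{5.16}$$

$$+ \sum_{l=1}^r \sum_{k\in\bar I_l} \frac{y^k}{k!} \Big[ D^k f(x_1,\ldots,x_l + \zeta_l^k, x_{l+1} + y_{l+1},\ldots,x_d + y_d) \tag{5.17}$$

$$- D^k f(x_1,\ldots,x_l, x_{l+1} + y_{l+1},\ldots,x_d + y_d) \Big], \tag{5.18}$$

where $\zeta_l^k \in [x_l, x_l + y_l] \cup [x_l + y_l, x_l]$,

$$I_l = \Big\{k = (k_1,\ldots,k_l,0,\ldots,0) \in \mathbb{Z}_+^d:\ k_i \le \Big\lfloor \beta_i\Big(1 - \sum_{j=1}^{i-1} k_j/\beta_j\Big) \Big\rfloor,\ i = 1,\ldots,l\Big\},$$

$$\bar I_l = \Big\{k \in I_l:\ k_l = \Big\lfloor \beta_l\Big(1 - \sum_{j=1}^{l-1} k_j/\beta_j\Big) \Big\rfloor\Big\},$$

and the differences in derivatives in (5.17)-(5.18) are bounded by $L\,|\zeta_l^k|^{\beta_l(1 - \sum_{i=1}^d k_i/\beta_i)}$.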

Proof. The lemma is proved by induction. For $r = 1$, (5.16)-(5.18) is a standard univariate Taylor expansion of $f(x + y)$ in the first argument around $(x_1, x_2 + y_2,\ldots,x_d + y_d)$. Suppose (5.16)-(5.18) holds for some $r \in \{1,\ldots,d\}$. Then, let us show that (5.16)-(5.18) holds for $r + 1$. For that, consider a univariate Taylor expansion of $D^k f$ in (5.16). The following notation will be useful. Let $e_i \in \mathbb{R}^d$, $i = 1,\ldots,d$, be such that $e_{ij} = 1$ for $i = j$ and $e_{ij} = 0$ for $i \ne j$, and let $k^*_{r+1} = \lfloor \beta_{r+1}(1 - \sum_{j=1}^r k_j/\beta_j) \rfloor$. Then,

$$D^k f(x_1,\ldots,x_r, x_{r+1} + y_{r+1},\ldots,x_d + y_d) = \sum_{k_{r+1}=0}^{k^*_{r+1}} \frac{y_{r+1}^{k_{r+1}}}{k_{r+1}!} D^{k + k_{r+1}\cdot e_{r+1}} f(x_1,\ldots,x_{r+1}, x_{r+2} + y_{r+2},\ldots,x_d + y_d)$$

$$+ \frac{y_{r+1}^{k^*_{r+1}}}{k^*_{r+1}!} \Big[ D^{k + k^*_{r+1}\cdot e_{r+1}} f\big(x_1,\ldots,x_r, x_{r+1} + \zeta_{r+1}^{k + k^*_{r+1}\cdot e_{r+1}}, x_{r+2} + y_{r+2},\ldots,x_d + y_d\big)$$

$$- D^{k + k^*_{r+1}\cdot e_{r+1}} f(x_1,\ldots,x_r, x_{r+1}, x_{r+2} + y_{r+2},\ldots,x_d + y_d) \Big].$$

Inserting this expansion into (5.16) delivers the result for $r + 1$.

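Lemma 5.7. Let $R(x, y)$ denote the remainder term in the anisotropic Taylor expansion ((5.17)-(5.18) for $r = d$). Suppose $f \in \mathcal{C}^{\beta_1,\ldots,\beta_d,L}$ and $L$ satisfies (4.11)-(4.12). Let $\sigma = \{\sigma_i = \sigma_n^{\beta/\beta_i},\ i = 1,\ldots,d\}$ and $\sigma_n \to 0$. Then, for all sufficiently large $n$,

$$\int |R(x, y)|\,\phi(y; 0, \sigma)\,dy \lesssim \tilde L(x)\,\sigma_n^\beta.$$

Proof. Note that $|R(x, y)|$ is bounded by a sum of the following terms over $k \in \bar I_l$ and $l \in \{1,\ldots,d\}$:

$$\Big|\frac{y^k}{k!}\Big[D^k f(x_1,\ldots,x_l + \zeta_l^k, x_{l+1} + y_{l+1},\ldots,x_d + y_d) - D^k f(x_1,\ldots,x_l, x_{l+1} + y_{l+1},\ldots,x_d + y_d)\Big]\Big|$$

$$\le \frac{|y^k|}{k!}\, L\big(x + (0,\ldots,0,y_{l+1:d}),\ \zeta_l^k e_l\big)\, |\zeta_l^k|^{\beta_l(1 - \sum_{i=1}^d k_i/\beta_i)}$$

$$\le \tilde L(x) \exp\{\tau_0\|y_{l+1:d}\|^2\} \exp\{\tau_0 |\zeta_l^k|^2\}\, \frac{|y^k|}{k!}\, |\zeta_l^k|^{\beta_l(1 - \sum_{i=1}^d k_i/\beta_i)}$$

$$\le \tilde L(x)\, \frac{|y^k|}{k!} \exp\{\tau_0\|y\|^2\}\, |y_l|^{\beta_l(1 - \sum_{i=1}^d k_i/\beta_i)},$$

where we used inequalities (2.2), (4.11), and (4.12) and that $|\zeta_l^k| \le |y_l|$. For all sufficiently large $n$ such that $\tau_0 < 0.5/\max_i \sigma_i^2$,

$$\int \tilde L(x)\, \frac{|y^k|}{k!} \exp\{\tau_0\|y\|^2\}\, |y_l|^{\beta_l(1 - \sum_{i=1}^d k_i/\beta_i)}\, \phi(y; 0, \sigma)\,dy$$

$$\lesssim \tilde L(x) \prod_{i=1}^{l-1} \int |y_i|^{k_i}\, \phi(y_i; 0, \sigma_i\sqrt{2})\,dy_i \cdot \int |y_l|^{k_l}\, |y_l|^{\beta_l(1 - \sum_{i=1}^d k_i/\beta_i)}\, \phi(y_l; 0, \sigma_l\sqrt{2})\,dy_l$$

$$\lesssim \tilde L(x)\, \sigma_1^{k_1}\cdots\sigma_{l-1}^{k_{l-1}}\, \sigma_l^{k_l + \beta_l(1 - \sum_{i=1}^d k_i/\beta_i)}$$

$$= \tilde L(x)\, \sigma_n^{k_1\beta/\beta_1}\cdots\sigma_n^{k_l\beta/\beta_l}\, \sigma_n^{\frac{\beta}{\beta_l}\beta_l(1 - \sum_{i=1}^d k_i/\beta_i)} = \tilde L(x)\, K_2\, \sigma_n^\beta,$$

where we use $\int |z|^\rho \phi(z, 0, \omega)\,dz \lesssim \omega^\rho$ and $k_{l+1} = \cdots = k_d = 0$ for $k \in \bar I_l$. Thus, the claim of the lemma follows.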

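Lemma 5.8. Suppose density $f_0 \in \mathcal{C}^{\beta_1,\ldots,\beta_d,L}$ with a constant envelope $L$ has support on $[0,1]^d$ and $f_0(z) \ge \underline{f} > 0$. Then, $f_{0|J} \in \mathcal{C}^{\beta_{d_J+1},\ldots,\beta_d,L/\underline{f}}$.

Proof. For $\tilde x, \Delta\tilde x \in \mathcal{X}$, $y_J \in \mathcal{Y}_J$, and some $\tilde y_J^* \in A_{y_J}$, by the mean value theorem,

$$D^k f_{0|J}(\tilde x + \Delta\tilde x|y_J) - D^k f_{0|J}(\tilde x|y_J) = \frac{1}{\pi_{0J}(y_J)} \int_{A_{y_J}} \big[D^{0,\ldots,0,k} f_0(\tilde y_J, \tilde x + \Delta\tilde x) - D^{0,\ldots,0,k} f_0(\tilde y_J, \tilde x)\big]\,d\tilde y_J$$

$$= \frac{1/N_J}{\pi_{0J}(y_J)} \big[D^{0,\ldots,0,k} f_0(\tilde y_J^*, \tilde x + \Delta\tilde x) - D^{0,\ldots,0,k} f_0(\tilde y_J^*, \tilde x)\big]$$

and the claim of the lemma follows from the definition of $\mathcal{C}^{\beta_1,\ldots,\beta_d,L}$ and $\pi_{0J}(y_J) \ge \underline{f}/N_J$.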

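Lemma 5.9. There is a $\lambda_0 \in (0,1)$ such that for any $\lambda \in (0, \lambda_0)$ and any two conditional densities $p, q \in \mathcal{F}$, a probability measure $P$ on $\mathcal{Z}$ that has a conditional density equal to $p$, and $d_h$ defined with the distribution on $\mathcal{X}$ implied by $P$,

$$P\log\frac{p}{q} \le d_h^2(p, q)\Big(1 + 2\log\frac{1}{\lambda}\Big) + 2P\Big\{\Big(\log\frac{p}{q}\Big) 1\Big(\frac{q}{p} \le \lambda\Big)\Big\},$$

$$P\Big(\log\frac{p}{q}\Big)^2 \le d_h^2(p, q)\Big(12 + 2\Big(\log\frac{1}{\lambda}\Big)^2\Big) + 8P\Big\{\Big(\log\frac{p}{q}\Big)^2 1\Big(\frac{q}{p} \le \lambda\Big)\Big\}.$$

Proof. The proof is exactly the same as the proof of Lemma 4 of Shen et al. (2013), which, in turn, follows the proof of Lemma 7 in Ghosal and van der Vaart (2007).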

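Lemma 5.10. Under the assumptions and notation of Section 4, for some $B_0 \in (0,\infty)$ and any $y_J \in \mathcal{Y}_J$,

$$F_{0|J}\big(\|\tilde X\| > a_{\sigma_n} \,\big|\, y_J\big) \le B_0\, \sigma_n^{4\beta+2\varepsilon}\, \underline{\sigma}_n^8.$$

Proof. Note that in the proof of Proposition 1 of Shen et al. (2013) it is shown that $a^{STG}_{\sigma_n} > a$, where $a^{STG}_0 = \{(8\beta + 4\varepsilon + 16)/(b\delta)\}^{1/\tau}$ and $a^{STG}_{\sigma_n} = a^{STG}_0 \log(1/\sigma_n)^{1/\tau}$. As $a_0 > a^{STG}_0$, we have $a_{\sigma_n} > a^{STG}_{\sigma_n}$ and, therefore, $a_{\sigma_n} > a$. Define

$$E^*_{\sigma_n} = \Big\{\tilde x \in \mathbb{R}^{d_{J^c}}:\ f_{0|J}(\tilde x|y_J) \ge \sigma_n^{(4\beta+2\varepsilon+8\beta/\beta_{\min})/\delta}\Big\}.$$

Note that by construction of $s_2$ in the proof of Proposition 1 of Shen et al. (2013) and as $\sigma_n < s_2$ it follows that

$$\frac{(4\beta + 2\varepsilon + 8)}{b\delta}\log\frac{1}{\sigma_n} \ge \frac{1}{b}\log\bar f_0 \implies \sigma_n^{-(4\beta+2\varepsilon+8)/\delta} \ge \bar f_0.$$

For $\tilde x \in E^*_{\sigma_n}$,

$$f_{0|J}(\tilde x|y_J) \ge \sigma_n^{(4\beta+2\varepsilon+8\beta/\beta_{\min})/\delta} = \sigma_n^{(8\beta+4\varepsilon+8\beta/\beta_{\min}+8)/\delta}\, \sigma_n^{-(4\beta+2\varepsilon+8)/\delta} \ge \bar f_0\, \sigma_n^{(8\beta+4\varepsilon+8\beta/\beta_{\min}+8)/\delta}$$

$$= \bar f_0\, \sigma_n^{a_0^\tau b} = \bar f_0 \exp\Big\{-b a_0^\tau \log\frac{1}{\sigma_n}\Big\} = \bar f_0 \exp\Big\{-b\Big(a_0\Big(\log\frac{1}{\sigma_n}\Big)^{1/\tau}\Big)^\tau\Big\} = \bar f_0 \exp\{-b a_{\sigma_n}^\tau\}.$$

As $a_{\sigma_n} > a$ and as $f_{0|J}(\tilde x|y_J) \ge \bar f_0 \exp\{-b a_{\sigma_n}^\tau\}$, the tail condition (4.8) is satisfied only if $\|\tilde x\| < a_{\sigma_n}$. Therefore, $E^*_{\sigma_n} \subset \{\tilde x \in \mathbb{R}^{d_{J^c}}:\ \|\tilde x\| \le a_{\sigma_n}\}$. As in the proof of Proposition 1 of Shen et al. (2013), by Markov's inequality,

$$F_{0|J}\big(\|\tilde X\| > a_{\sigma_n} \,\big|\, y_J\big) \le F_{0|J}\big(E^{*,c}_{\sigma_n} \,\big|\, y_J\big) = F_{0|J}\big(f_{0|J}(\tilde x|y_J)^{-\delta} > \sigma_n^{-(4\beta+2\varepsilon+8\beta/\beta_{\min})} \,\big|\, y_J\big)$$

$$\le B_0\, \sigma_n^{4\beta+2\varepsilon+8\beta/\beta_{\min}} = B_0\, \sigma_n^{4\beta+2\varepsilon}\, \underline{\sigma}_n^8$$

as desired since $\sigma_n^{\beta/\beta_{\min}} = \underline{\sigma}_n$ and the tail condition on $f_{0|J}(\cdot|y_J)$, (4.8), implies the existence of a $\delta > 0$ small enough such that $E_{0|J}(f_{0|J}^{-\delta}) \le B_0 < \infty$ for any $y_J \in \mathcal{Y}_J$.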

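Lemma 5.11. Under the assumptions and notation of Section 4, for $m = KN_J$ and any $\theta \in S_{\theta^\star}$,

$$d_H^2(p^\star_{|J}(\cdot|\cdot)\pi_0(\cdot), p(\cdot,\cdot|\theta,m)) \lesssim \sigma_n^{2\beta}.$$

Proof. Let us define

$$f_J(y_J, \tilde x|\theta,m) = \int_{A_{y_J}} f(\tilde y_J, \tilde x|\theta,m)\,d\tilde y_J.$$

Then,

$$d_H^2(p^\star_{|J}(\cdot|\cdot)\pi_0(\cdot), p(\cdot,\cdot|\theta,m)) \le d_{TV}(p^\star_{|J}(\cdot|\cdot)\pi_0(\cdot), p(\cdot,\cdot|\theta,m)) \le d_{TV}(f^\star_{|J}(\cdot|\cdot)\pi_0(\cdot), f_J(\cdot,\cdot|\theta,m))$$

$$= \sum_{y_J\in\mathcal{Y}_J} \int_{\tilde{\mathcal{X}}} \Big| \sum_{k\in\mathcal{Y}_J}\sum_{j=1}^K \Big[\alpha^\star_{j|k}\pi_0(k) 1\{k = y_J\}\phi(\tilde x, \mu^\star_{j|k}, \sigma^\star_{J^c}) - \alpha_{jk}\int_{A_{y_J}} \phi(\tilde y_J, \mu_{jk,J}, \sigma_J)\,d\tilde y_J\cdot\phi(\tilde x, \mu_{jk,J^c}, \sigma_{J^c})\Big] \Big|\,d\tilde x$$

$$\le \sum_{y_J\in\mathcal{Y}_J} \int_{\tilde{\mathcal{X}}} \Big| \sum_{k\in\mathcal{Y}_J}\sum_{j=1}^K \Big[\alpha^\star_{jk} 1\{k = y_J\}\phi(\tilde x, \mu^\star_{j|k}, \sigma^\star_{J^c}) - \alpha^\star_{jk} 1\{k = y_J\}\phi(\tilde x, \mu_{jk,J^c}, \sigma_{J^c})\Big] \Big|\,d\tilde x$$

$$+ \sum_{y_J\in\mathcal{Y}_J} \int_{\tilde{\mathcal{X}}} \Big| \sum_{k\in\mathcal{Y}_J}\sum_{j=1}^K \Big[\alpha^\star_{jk} 1\{k = y_J\}\phi(\tilde x, \mu_{jk,J^c}, \sigma_{J^c}) - \alpha_{jk}\int_{A_{y_J}} \phi(\tilde y_J, \mu_{jk,J}, \sigma_J)\,d\tilde y_J\,\phi(\tilde x, \mu_{jk,J^c}, \sigma_{J^c})\Big] \Big|\,d\tilde x,$$

where the first inequality follows from $d_H^2(\cdot,\cdot) \le d_{TV}(\cdot,\cdot)$, the second inequality holds by Lemma 5.4, and the last inequality is obtained by the triangle inequality.

Let us consider the two parts of the right hand side in the last inequality separately. First,

$$\sum_{y_J\in\mathcal{Y}_J} \int_{\tilde{\mathcal{X}}} \Big| \sum_{k\in\mathcal{Y}_J}\sum_{j=1}^K \Big[\alpha^\star_{jk} 1\{k = y_J\}\phi(\tilde x, \mu^\star_{j|k}, \sigma^\star_{J^c}) - \alpha^\star_{jk} 1\{k = y_J\}\phi(\tilde x, \mu_{jk,J^c}, \sigma_{J^c})\Big] \Big|\,d\tilde x$$

$$\le \sum_{y_J\in\mathcal{Y}_J}\sum_{k\in\mathcal{Y}_J}\sum_{j=1}^K \alpha^\star_{jk} 1\{k = y_J\} \int_{\tilde{\mathcal{X}}} \big|\phi(\tilde x, \mu^\star_{j|k}, \sigma^\star_{J^c}) - \phi(\tilde x, \mu_{jk,J^c}, \sigma_{J^c})\big|\,d\tilde x$$

$$\le \max_{j\le N,\,k\in\mathcal{Y}_J} d_{TV}\big(\phi(\cdot;\mu^\star_{j|k}, \sigma^\star_{J^c}), \phi(\cdot, \mu_{jk,J^c}, \sigma_{J^c})\big) \lesssim \sigma_n^{2\beta},$$

where the fact that $\alpha^\star_{jk} = 0$ for $j > N$ by design is used to get $j \le N$ rather than $j \le K$ in the max subscript. The last inequality is proved in Lemma 5.12.

Second, writing $\Phi_{jk}(y_J) = \int_{A_{y_J}} \phi(\tilde y_J, \mu_{jk,J}, \sigma_J)\,d\tilde y_J$ to shorten the display,

$$\sum_{y_J\in\mathcal{Y}_J} \int_{\tilde{\mathcal{X}}} \Big| \sum_{k\in\mathcal{Y}_J}\sum_{j=1}^K \Big[\alpha^\star_{jk} 1\{k = y_J\}\phi(\tilde x, \mu_{jk,J^c}, \sigma_{J^c}) - \alpha_{jk}\Phi_{jk}(y_J)\,\phi(\tilde x, \mu_{jk,J^c}, \sigma_{J^c})\Big] \Big|\,d\tilde x$$

$$\le \sum_{j=1}^K \Big[\sum_{y_J\in\mathcal{Y}_J}\sum_{k\in\mathcal{Y}_J} \big|\alpha^\star_{jk} 1\{k = y_J\} - \alpha_{jk}\Phi_{jk}(y_J)\big| \int_{\tilde{\mathcal{X}}} \phi(\tilde x, \mu_{jk,J^c}, \sigma_{J^c})\,d\tilde x\Big]$$

$$= \sum_{j=1}^K \sum_{y_J\in\mathcal{Y}_J}\sum_{k\in\mathcal{Y}_J} \big|\alpha^\star_{jk} 1\{k = y_J\} - \alpha_{jk}\Phi_{jk}(y_J)\big|$$

$$\le \sum_{y_J\in\mathcal{Y}_J}\sum_{k\in\mathcal{Y}_J}\sum_{j=1}^K \big|\alpha^\star_{jk} 1\{k = y_J\} - \alpha^\star_{jk}\Phi_{jk}(y_J)\big| + \sum_{y_J\in\mathcal{Y}_J}\sum_{k\in\mathcal{Y}_J}\sum_{j=1}^K \big|\alpha^\star_{jk}\Phi_{jk}(y_J) - \alpha_{jk}\Phi_{jk}(y_J)\big|$$

$$\le \sum_{y_J\in\mathcal{Y}_J}\sum_{k\in\mathcal{Y}_J}\sum_{j=1}^K \alpha^\star_{jk} \big|1\{k = y_J\} - \Phi_{jk}(y_J)\big| + \sum_{y_J\in\mathcal{Y}_J}\sum_{k\in\mathcal{Y}_J}\sum_{j=1}^K \big|\alpha^\star_{jk} - \alpha_{jk}\big|\,\Phi_{jk}(y_J)$$

$$= \sum_{k\in\mathcal{Y}_J}\sum_{j=1}^K \Big[\alpha^\star_{jk} \sum_{y_J\in\mathcal{Y}_J} \big|1\{k = y_J\} - \Phi_{jk}(y_J)\big|\Big] + \sum_{k\in\mathcal{Y}_J}\sum_{j=1}^K \Big[\big|\alpha^\star_{jk} - \alpha_{jk}\big| \sum_{y_J\in\mathcal{Y}_J} \Phi_{jk}(y_J)\Big]$$

$$\le \sum_{k\in\mathcal{Y}_J}\sum_{j=1}^K \alpha^\star_{jk} \Big[\int_{A_k^c} \phi(\tilde y_J, \mu_{jk,J}, \sigma_J)\,d\tilde y_J + \sum_{y_J \ne k} \Phi_{jk}(y_J)\Big] + \sum_{k\in\mathcal{Y}_J}\sum_{j=1}^K \big|\alpha^\star_{jk} - \alpha_{jk}\big|$$

$$= \sum_{k\in\mathcal{Y}_J}\sum_{j=1}^K \alpha^\star_{jk}\cdot 2\int_{A_k^c} \phi(\tilde y_J, \mu_{jk,J}, \sigma_J)\,d\tilde y_J + \sum_{k\in\mathcal{Y}_J}\sum_{j=1}^K \big|\alpha^\star_{jk} - \alpha_{jk}\big|$$

$$\le 2\max_{j\le N,\,k\in\mathcal{Y}_J} \int_{A_k^c} \phi(\tilde y_J, \mu_{jk,J}, \sigma_J)\,d\tilde y_J + \sum_{k\in\mathcal{Y}_J}\sum_{j=1}^K \big|\alpha^\star_{jk} - \alpha_{jk}\big| \lesssim \sigma_n^{2\beta}.$$

The last inequality follows from Lemma 5.13 and the definition of $S_{\theta^\star}$.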

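Lemma 5.12. Under the assumptions and notation of Section 4,

$$\max_{j\le N,\,k\in\mathcal{Y}_J} d_{TV}\big(\phi(\cdot;\mu^\star_{j|k}, \sigma^\star_{J^c}), \phi(\cdot, \mu_{jk,J^c}, \sigma_{J^c})\big) \lesssim \sigma_n^{2\beta}.$$

Proof. Fix some $j \le N$ and $k \in \mathcal{Y}_J$. It is known that

$$d_{TV}\big(\phi(\cdot;\mu^\star_{j|k}, \sigma^\star_{J^c}), \phi(\cdot, \mu_{jk,J^c}, \sigma_{J^c})\big) \le 2\sqrt{d_{KL}\big(\phi(\cdot;\mu^\star_{j|k}, \sigma^\star_{J^c}), \phi(\cdot, \mu_{jk,J^c}, \sigma_{J^c})\big)}$$

and

$$d_{KL}\big(\phi(\cdot;\mu^\star_{j|k}, \sigma^\star_{J^c}), \phi(\cdot, \mu_{jk,J^c}, \sigma_{J^c})\big) = \sum_{i\in J^c} \Big[\frac{\sigma_i^2}{\sigma_i^{\star 2}} - 1 - \log\frac{\sigma_i^2}{\sigma_i^{\star 2}} + \frac{(\mu^\star_{j|k,i} - \mu_{jk,i})^2}{\sigma_i^{\star 2}}\Big].$$

From the definition of $S_{\theta^\star}$,

$$\sum_{i\in J^c} \frac{(\mu^\star_{j|k,i} - \mu_{jk,i})^2}{\sigma_i^{\star 2}} \le \tilde\epsilon_n^{4b_1} \le \sigma_n^{4\beta}.$$

Since $\sigma_i^2 \in (\sigma_i^{\star 2}(1 + \sigma_n^{2\beta})^{-1}, \sigma_i^{\star 2})$ and $|z - 1 - \log z| \lesssim |z - 1|^2$ for $z$ in a neighborhood of 1, we have for all sufficiently large $n$

$$\Big|\frac{\sigma_i^2}{\sigma_i^{\star 2}} - 1 - \log\frac{\sigma_i^2}{\sigma_i^{\star 2}}\Big| \lesssim \Big(1 - \frac{\sigma_i^2}{\sigma_i^{\star 2}}\Big)^2 \lesssim \sigma_n^{4\beta}.$$

The three inequalities derived above imply the claim of the lemma.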
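The variance part of this argument can be checked numerically. The sketch below is an illustration only: `kl_normal` is the standard univariate closed form for $d_{KL}(N(\mu,\sigma^2)\,\|\,N(\mu^\star,\sigma^{\star 2}))$, which carries a factor $1/2$ relative to the summands displayed in the proof, and `delta` is a stand-in for $\sigma_n^{2\beta}$. The function names are not from the paper.

```python
import math

def kl_normal(mu, var, mu_star, var_star):
    """Closed-form KL( N(mu, var) || N(mu_star, var_star) ) for univariate normals."""
    z = var / var_star
    return 0.5 * (z - 1.0 - math.log(z) + (mu - mu_star) ** 2 / var_star)

def variance_term(z):
    """g(z) = z - 1 - log z, the variance contribution in the proof of Lemma 5.12."""
    return z - 1.0 - math.log(z)

# delta plays the role of sigma_n^(2 beta); z is the smallest variance ratio allowed in S_theta*.
checks = []
for delta in (1e-1, 1e-2, 1e-3):
    z = 1.0 / (1.0 + delta)
    checks.append((variance_term(z), (z - 1.0) ** 2, delta ** 2))
```

For each `delta` the triple satisfies $g(z) \le (z-1)^2 \le \delta^2$, so the variance contribution is of order $\sigma_n^{4\beta}$, matching the order of the mean contribution.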

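Lemma 5.13. Under the assumptions and notation of Section 4, for $\theta \in S_{\theta^\star}$,

$$\max_{j\le N,\,k\in\mathcal{Y}_J} \int_{A_k^c} \phi(\tilde y_J, \mu_{jk,J}, \sigma_J)\,d\tilde y_J \lesssim \sigma_n^{2\beta}.$$

Proof. Fix $j \le N$, $k \in \mathcal{Y}_J$, and $\theta \in S_{\theta^\star}$. Since $\mu_{jk,i} \in \big[k_i - \frac{1}{4N_i}, k_i + \frac{1}{4N_i}\big]$,

$$\int_{A_k^c} \phi(\tilde y_J, \mu_{jk,J}, \sigma_J)\,d\tilde y_J \le \sum_{i\in J} \Pr\Big(\tilde y_i \notin \Big[k_i - \frac{1}{2N_i}, k_i + \frac{1}{2N_i}\Big]\Big) \le \sum_{i\in J} \Pr\Big(\tilde y_i \notin \Big[\mu_{jk,i} - \frac{1}{4N_i}, \mu_{jk,i} + \frac{1}{4N_i}\Big]\Big)$$

$$= 2\sum_{i\in J} \int_{-\infty}^{-\frac{1}{4N_i\sigma_i}} \phi(\tilde y_i, 0, 1)\,d\tilde y_i \le 2\sum_{i\in J} \exp\Big\{-\frac{1}{2(4N_i\sigma_i)^2}\Big\} \le 2\sum_{i\in J} \sigma_n^{2\beta} \lesssim \sigma_n^{2\beta},$$

where the last inequality follows from the restrictions on $\sigma_J$ in $S_{\theta^\star}$ and the penultimate inequality follows from a bound on the normal tail probability derived below.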

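If $\tilde Y_i$ has the $N(0,1)$ distribution, then its moment generating function is $M(\theta) = \exp\{\theta^2/2\}$. Note that $\exp\{\theta(\tilde Y_i + (4N_i\sigma_i)^{-1})\} \ge 1$ when $\tilde Y_i \le -(4N_i\sigma_i)^{-1}$ and $\theta \le 0$; therefore,

$$\int_{-\infty}^{-\frac{1}{4N_i\sigma_i}} \phi(\tilde y_i, 0, 1)\,d\tilde y_i \le \inf_{\theta\le 0} P\exp\big\{\theta(\tilde Y_i + (4N_i\sigma_i)^{-1})\big\} = \inf_{\theta\le 0} \exp\big\{\theta(4N_i\sigma_i)^{-1}\big\} M(\theta)$$

$$= \inf_{\theta\le 0} \exp\big\{\theta(4N_i\sigma_i)^{-1}\big\} \exp\{\theta^2/2\} = \exp\big\{-(4N_i\sigma_i)^{-2}/2\big\}.$$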
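The tail bound and the role of $\sigma_i^{\star 2}$ can be verified numerically. The sketch below is a hedged illustration: it compares the exact standard normal lower tail (computed with `math.erfc`) with the bound $\exp\{-t^2/2\}$, and checks that at $\sigma_i = \sigma_i^\star$ with $t = (4N_i\sigma_i)^{-1}$ the bound equals $\sigma_n^{2\beta}$. The numerical inputs are arbitrary illustrative values.

```python
import math

def normal_lower_tail(t):
    """Exact P(Z <= -t) for Z ~ N(0, 1)."""
    return 0.5 * math.erfc(t / math.sqrt(2.0))

def chernoff_bound(t):
    """Bound exp(-t^2 / 2) on P(Z <= -t), valid for t >= 0."""
    return math.exp(-t * t / 2.0)

def sigma_star_sq(n_i, beta, sigma_n):
    """sigma_i*^2 = 1 / (64 N_i^2 beta log(1 / sigma_n)), as in the definition of theta*."""
    return 1.0 / (64.0 * n_i ** 2 * beta * math.log(1.0 / sigma_n))

# Illustrative values: N_i = 5, beta = 1.5, sigma_n = 0.1.
n_i, beta, sigma_n = 5, 1.5, 0.1
t_star = 1.0 / (4.0 * n_i * math.sqrt(sigma_star_sq(n_i, beta, sigma_n)))
bound_at_star = chernoff_bound(t_star)   # should coincide with sigma_n ** (2 * beta)
```

Because $S_{\theta^\star}$ restricts $\sigma_i^2 < \sigma_i^{\star 2}$ for $i \in J$, the actual threshold $t$ is larger than `t_star`, so the bound is at most $\sigma_n^{2\beta}$.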

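Lemma 5.14. Under the assumptions and notation of Section 4, for any $(y_J, y_I) \in \mathcal{Y}$, some constants $C_3, C_4 > 0$ and all sufficiently large $n$,

$$\frac{p(y_J, y_I, x|\theta,m)}{p_0(y_J, y_I, x)} \ge C_3\frac{\sigma_n^{2\beta}}{m^2} \equiv \lambda_n, \tag{5.19}$$

when $\|x\| \le a_{\sigma_n}$ and

$$\frac{p(y_J, y_I, x|\theta,m)}{p_0(y_J, y_I, x)} \ge \exp\Big\{-\frac{8\|x\|^2}{\underline{\sigma}_n^2} - C_4\log n\Big\} \tag{5.20}$$

when $\|x\| > a_{\sigma_n}$.

Proof. By assumption (4.8), $f_{0|J}(\tilde x|y_J) \le \bar f_0$, and $\pi_{0J}(y_J) \le 1$ for all $(\tilde x, y_J)$. Therefore,

$$\frac{f_J(y_J, \tilde x|\theta,m)}{f_{0|J}(\tilde x|y_J)\pi_{0J}(y_J)} \ge \bar f_0^{-1} f_J(y_J, \tilde x|\theta,m). \tag{5.21}$$

Let $k^\star = y_J$. Then, by Lemma 5.13, for any $j \in \{1,\ldots,K\}$,

$$\int_{A_{y_J}} \phi(\tilde y_J; \mu_{jk^\star,J}, \sigma_J)\,d\tilde y_J \ge \frac{1}{2}$$

for all $n$ large enough as $\sigma_n \to 0$.

For any $\tilde x \in \tilde{\mathcal{X}}$ with $\|\tilde x\| \le 2a_{\sigma_n}$, by the construction of the sets $U_{j|k^\star}$, there exists $j^\star \in \{1,\ldots,K\}$ such that $\tilde x, \mu_{j^\star|k^\star} \in U_{j^\star|k^\star}$ and for all sufficiently large $n$, $\sum_{i\in J^c} (\tilde x_i - \mu_{j^\star|k^\star,i})^2/\sigma_i^2 \le 4$. Then,

$$\phi(\tilde x, \mu_{j^\star|k^\star}, \sigma_{J^c}) = (2\pi)^{-d_{J^c}/2} \prod_{i\in J^c} \sigma_i^{-1} \exp\Big\{-0.5\sum_{i\in J^c} (\tilde x_i - \mu_{j^\star|k^\star,i})^2/\sigma_i^2\Big\} \ge (2\pi)^{-d_{J^c}/2}\, \sigma_n^{-d_{J^c}}\, e^{-2}.$$

Thus,

$$f_J(y_J, \tilde x|\theta) = \sum_{k\in\mathcal{Y}_J}\sum_{j=1}^K \alpha_{jk} \int_{A_{y_J}} \phi(\tilde y_J, \mu_{jk,J}, \sigma_J)\,d\tilde y_J\, \phi(\tilde x, \mu_{jk,J^c}, \sigma_{J^c}) \ge \alpha_{j^\star k^\star}\, \phi(\tilde x, \mu_{j^\star k^\star,J^c}, \sigma_{J^c}) \int_{A_{y_J}} \phi(\tilde y_J, \mu_{j^\star k^\star,J}, \sigma_J)\,d\tilde y_J$$

and for $C_3 = \bar f_0^{-1}(2\pi)^{-d_{J^c}/2} e^{-2}/8$,

$$\frac{f_J(y_J, \tilde x|\theta,m)}{f_{0|J}(\tilde x|y_J)\pi_{0J}(y_J)} \ge \bar f_0^{-1}\cdot\min_{j\le K,\,k\in\mathcal{Y}_J} \alpha_{jk}\cdot(2\pi)^{-d_{J^c}/2}\, \sigma_n^{-d_{J^c}}\, e^{-2}\cdot\frac{1}{2} \ge 2C_3\frac{\sigma_n^{2\beta}}{m^2} = 2\lambda_n. \tag{5.22}$$

By assumption (4.9), for any $x \in \mathcal{X}$, any $y_J \in \mathcal{Y}_J$, and all sufficiently large $n$,

$$\int_{A_{y_I}} f_{0|J}(\tilde x|y_J)\pi_{0J}(y_J)\,d\tilde y_I \le 2\int_{A_{y_I} \cap \{\tilde y_I:\ \|\tilde y_I\| \le a_{\sigma_n}\}} f_{0|J}(\tilde x|y_J)\pi_{0J}(y_J)\,d\tilde y_I. \tag{5.23}$$

For any $x \in \mathcal{X}$ with $\|x\| \le a_{\sigma_n}$ and $\tilde y_I \in A_{y_I} \cap \{\tilde y_I:\ \|\tilde y_I\| \le a_{\sigma_n}\}$, we have $\|\tilde x\| \le 2a_{\sigma_n}$ and

$$\frac{p(y_J, y_I, x|\theta,m)}{p_0(y_J, y_I, x)} = \frac{\int_{A_{y_I}} f_J(y_J, \tilde x|\theta,m)\,d\tilde y_I}{\int_{A_{y_I}} f_{0|J}(\tilde x|y_J)\pi_{0J}(y_J)\,d\tilde y_I} \ge \frac{\int_{A_{y_I} \cap \{\tilde y_I:\ \|\tilde y_I\| \le a_{\sigma_n}\}} f_J(y_J, \tilde x|\theta,m)\,d\tilde y_I}{2\int_{A_{y_I} \cap \{\tilde y_I:\ \|\tilde y_I\| \le a_{\sigma_n}\}} f_{0|J}(\tilde x|y_J)\pi_{0J}(y_J)\,d\tilde y_I} \ge \lambda_n, \tag{5.24}$$

where the first inequality follows from (5.23) and the second one from (5.22) combined with Lemma 5.4.

Next, let us bound $f_J(y_J, \tilde x|\theta,m)/[f_{0|J}(\tilde x|y_J)\pi_0(y_J)]$ from below for $\tilde x \in \tilde{\mathcal{X}}$ such that $\|x\| > a_{\sigma_n}$ and $\|\tilde y_I\| \le a_{\sigma_n}$. For any $j \le K$ and $k \in \mathcal{Y}_J$, $\|\tilde x - \mu_{jk,J^c}\|^2 \le 2(\|\tilde x\|^2 + \|\mu_{jk,J^c}\|^2) \le 16\|x\|^2$ as $\|\mu_{jk,J^c}\| \le 2a_{\sigma_n}$ by construction of $U_{j|k}$ and $2\|x\| > \|\tilde x\|$. Then

$$\phi(\tilde x, \mu_{jk,J^c}, \sigma_{J^c}) = (2\pi)^{-d_{J^c}/2} \prod_{i\in J^c} \sigma_i^{-1} \exp\Big\{-0.5\sum_{i\in J^c} (\tilde x_i - \mu_{jk,i})^2/\sigma_i^2\Big\} \ge (2\pi)^{-d_{J^c}/2}\, \sigma_n^{-d_{J^c}} \exp\Big\{-\frac{8\|x\|^2}{\underline{\sigma}_n^2}\Big\}.$$

Then, for $n$ large enough,

$$f_J(y_J, \tilde x|\theta,m) = \sum_{k\in\mathcal{Y}_J}\sum_{j=1}^K \alpha_{jk} \int_{A_{y_J}} \phi(\tilde y_J, \mu_{jk,J}, \sigma_J)\,d\tilde y_J\, \phi(\tilde x, \mu_{jk,J^c}, \sigma_{J^c})$$

$$\ge (2\pi)^{-d_{J^c}/2}\, \sigma_n^{-d_{J^c}} \exp\Big\{-\frac{8\|x\|^2}{\underline{\sigma}_n^2}\Big\} \sum_{k\in\mathcal{Y}_J}\sum_{j=1}^K \alpha_{jk} \int_{A_{y_J}} \phi(\tilde y_J, \mu_{jk,J}, \sigma_J)\,d\tilde y_J$$

$$\ge (2\pi)^{-d_{J^c}/2}\, \sigma_n^{-d_{J^c}} \exp\Big\{-\frac{8\|x\|^2}{\underline{\sigma}_n^2}\Big\}\, \frac{1}{2} K \min_{j,k} \alpha_{jk}.$$

Combining this inequality with (5.21), we get

$$\frac{f_J(y_J, \tilde x|\theta,m)}{f_{0|J}(\tilde x|y_J)\pi_{0J}(y_J)} \ge \frac{1}{2}(2\pi)^{-d_{J^c}/2}\, \bar f_0^{-1}\, \sigma_n^{-d_{J^c}}\, K\, \frac{\sigma_n^{2\beta+d_{J^c}}}{2m^2} \exp\Big\{-\frac{8\|x\|^2}{\underline{\sigma}_n^2}\Big\} \ge \exp\Big\{-\frac{8\|x\|^2}{\underline{\sigma}_n^2} - C_4\log n\Big\} \tag{5.25}$$

for sufficiently large $C_4$ because $|\log(K\sigma_n^{2\beta}/m^2)| \lesssim \log n$.

Thus, for $\|x\| > a_{\sigma_n}$, (5.25) and the first inequality in (5.24), which holds for any $x \in \mathcal{X}$, deliver

$$\frac{p(y_J, y_I, x|\theta,m)}{p_0(y_J, y_I, x)} \ge \exp\Big\{-\frac{8\|x\|^2}{\underline{\sigma}_n^2} - C_4\log n\Big\}. \tag{5.26}$$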