The Annals of Statistics 2003, Vol. 31, No. 2, 536–559 © Institute of Mathematical Statistics, 2003
ADAPTIVE BAYESIAN INFERENCE ON THE MEAN OF AN INFINITE-DIMENSIONAL NORMAL DISTRIBUTION
BY EDUARD BELITSER AND SUBHASHIS GHOSAL
Utrecht University and North Carolina State University
We consider the problem of estimating the mean of an infinite-dimensional normal distribution from the Bayesian perspective. Under the assumption that the unknown true mean satisfies a "smoothness condition," we first derive the convergence rate of the posterior distribution for a prior that is the infinite product of certain normal distributions and compare it with the minimax rate of convergence for point estimators. Although the posterior distribution can achieve the optimal rate of convergence, the required prior depends on a "smoothness parameter" $q$. When this parameter $q$ is unknown, besides the estimation of the mean, we encounter the problem of selecting a model. In a Bayesian approach, this uncertainty in the model selection can be handled simply by further putting a prior on the index of the model. We show that if $q$ takes values only in a discrete set, the resulting hierarchical prior leads to the same convergence rate of the posterior as if we had a single model. A slightly weaker result is presented when $q$ is unrestricted. An adaptive point estimator based on the posterior distribution is also constructed.
1. Introduction. Suppose we observe an infinite-dimensional random vector
$X = (X_1, X_2, \ldots)$, where the $X_i$'s are independent, $X_i$ has distribution $N(\theta_i, n^{-1})$, $i = 1, 2, \ldots$, and $\theta = (\theta_1, \theta_2, \ldots) \in \ell^2$, that is, $\sum_{i=1}^\infty \theta_i^2 < \infty$. The parameter $\theta$ is unknown and the goal is to make inference about $\theta$. Let $P_{\theta,n}$ stand for the distribution of $X$; here and throughout, we suppress the dependence of $X$ and $P_\theta = P_{\theta,n}$ on $n$ unless indicated otherwise. We shall write $\|\cdot\|$ for the $\ell^2$ norm throughout the paper.
We also note that $X$ may be thought of as the sample mean vector of the first $n$ observations of an i.i.d. sample $\mathbf{Y}_1, \mathbf{Y}_2, \ldots$, each taking values in $\mathbb{R}^\infty$ and distributed like $\mathbf{Y} = (Y_1, Y_2, \ldots)$, where the $Y_i$'s are independent and $Y_i$ has distribution $N(\theta_i, 1)$, $i = 1, 2, \ldots$. Since the sample mean is sufficient, these two formulations are statistically equivalent. To avoid complicated notation, we shall generally work with the first formulation. However, the latter formulation will help us apply the general theory of posterior convergence developed by Ghosal, Ghosh and van der Vaart (2000).
The interest in the infinite-dimensional normal model is partly due to its equivalence with the prototypical white noise model
$$dX^\varepsilon(t) = f(t)\,dt + \varepsilon\,dW(t), \qquad 0 \le t \le 1,$$
Received February 2000; revised March 2002.
AMS 2000 subject classifications. Primary 62G20; secondary 62C10, 62G05.
Key words and phrases. Adaptive Bayes procedure, convergence rate, minimax risk, posterior distribution, model selection.
where $X^\varepsilon(t)$ is the noise-corrupted signal, $f(\cdot) \in L_2[0,1]$ is the unknown signal, $W(t)$ is a standard Wiener process and $\varepsilon > 0$ is a small parameter. The statistical estimation problem is to recover the signal $f(t)$ based on the observation $X^\varepsilon(t)$. The above model arises as the limiting experiment in some curve estimation problems, such as density estimation [Nussbaum (1996); Klemelä and Nussbaum (1998)] and nonparametric regression [Brown and Low (1996)], where $\varepsilon = n^{-1/2}$ and $n$ is the sample size.
Suppose that the functions $\{\phi_i, i = 1, 2, \ldots\}$ form an orthonormal basis of $L_2[0,1]$. Under the assumption $f(\cdot) \in L_2[0,1]$, we can reduce this problem to the problem of estimating the mean $\theta = (\theta_1, \theta_2, \ldots) \in \ell^2$ of an infinite-dimensional normal distribution. Indeed,
$$X_i = \theta_i + \varepsilon \xi_i, \qquad i = 1, 2, \ldots,$$
where $X_i = \int_0^1 \phi_i(t)\,dX^\varepsilon(t)$, $\theta_i = \int_0^1 \phi_i(t) f(t)\,dt$ and $\xi_i = \int_0^1 \phi_i(t)\,dW(t)$, so that the $\xi_i$'s are independent standard normal random variables. In so doing, we arrive at the infinite-dimensional Gaussian shift experiment with $\varepsilon = n^{-1/2}$. The signal $f(t)$ can be recovered from the basis expansion $f(t) = \sum_{i=1}^\infty \theta_i \phi_i(t)$, and the map relating $f$ and $\theta$ is an isometric isomorphism.
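To make this reduction concrete, here is a small numerical sketch of the sequence-space representation. The trigonometric basis, grid size, test signal and the value of $\varepsilon$ are illustrative choices, not taken from the paper.

```python
import numpy as np

# Illustrative sketch: map a signal f in L2[0,1] to its coefficients theta_i
# in an orthonormal trigonometric basis, then observe the Gaussian shift
# X_i = theta_i + eps * xi_i with xi_i iid standard normal.

def trig_basis(i, t):
    """Orthonormal trigonometric basis on [0, 1]:
    phi_1 = 1, then sqrt(2)*cos and sqrt(2)*sin pairs."""
    if i == 1:
        return np.ones_like(t)
    k = i // 2
    if i % 2 == 0:
        return np.sqrt(2.0) * np.cos(2 * np.pi * k * t)
    return np.sqrt(2.0) * np.sin(2 * np.pi * k * t)

def coefficients(f, n_coeffs, n_grid=20000):
    """theta_i = int_0^1 phi_i(t) f(t) dt, approximated by the midpoint rule."""
    t = (np.arange(n_grid) + 0.5) / n_grid
    return np.array([np.mean(trig_basis(i, t) * f(t))
                     for i in range(1, n_coeffs + 1)])

rng = np.random.default_rng(0)
f = lambda t: np.sin(2 * np.pi * t)        # a smooth periodic test signal
theta = coefficients(f, 10)
eps = 0.01                                  # eps = n**-0.5 for sample size n
X = theta + eps * rng.standard_normal(10)   # the Gaussian shift experiment
```

Since $f = (1/\sqrt{2})\,\phi_3$ in this basis, only the third coefficient is (numerically) nonzero.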
Coming back to the infinite-dimensional normal model, if a Bayesian analysis is intended, one assigns a prior to $\theta$ and looks at the posterior distribution. Diaconis and Freedman (1997) and Freedman (1999) considered the independent normal prior $N(0, \tau_i^2)$ for $\theta_i$, $i = 1, 2, \ldots$, and comprehensively studied the nonlinear functional $\|\theta - \hat\theta\|^2$, where $\hat\theta$ is the Bayes estimator, from both the Bayesian and the frequentist perspectives. As a consequence of their main results, they established frequentist consistency of the Bayes estimator for all $\theta \in \ell^2$ if $\sum_{i=1}^\infty \tau_i^2 < \infty$. A peculiar implication of their result is that the Bayesian and frequentist asymptotic distributions of $\|\theta - \hat\theta\|^2$ differ, and hence the Bernstein–von Mises theorem fails to hold for this simple infinite-dimensional problem. A similar conclusion was reached earlier by Cox (1993) in a slightly different model.
This estimation problem was first studied in a minimax setting by Pinsker (1980). He showed that if the unknown infinite-dimensional parameter $\theta$ is assumed to belong to an ellipsoid $\Theta_q(Q) = \{\theta : \sum_{i=1}^\infty i^{2q}\theta_i^2 \le Q\}$, then the exact asymptotics of the minimax quadratic risk over the ellipsoid $\Theta_q(Q)$ are given by
$$\lim_{n\to\infty} \inf_{\hat\theta} \sup_{\theta \in \Theta_q(Q)} n^{2q/(2q+1)} E_\theta \|\hat\theta - \theta\|^2 = Q^{1/(2q+1)} \gamma(q), \tag{1.1}$$
where $\gamma(q) = (2q+1)^{1/(2q+1)} (q/(q+1))^{2q/(2q+1)}$ is the Pinsker constant [Pinsker (1980)] and $E_\theta$ denotes the expectation with respect to the probability measure $P_\theta$ generated by $X$ given $\theta$.
Zhao (2000) considered the prior under which the $\theta_i$'s are independent $N(0, i^{-(2q+1)})$ and showed that the posterior mean attains the minimax rate $n^{-q/(2q+1)}$. We shall write $\Pi_q$ to denote this prior. We show in Theorem 2.1 that the posterior distribution $\Pi_q(\cdot \mid X)$ obtained from the prior $\Pi_q$ also converges at the rate $n^{-q/(2q+1)}$ in the sense that, for any sequence $M_n \to \infty$,
$$\Pi_q\bigl(\theta : n^{q/(2q+1)}\|\theta - \theta_0\| > M_n \mid X\bigr) \to 0$$
in $P_{\theta_0}$-probability as $n \to \infty$, where the true value of the parameter $\theta_0$ belongs to the linear subspace $\Theta_q = \{\theta : \sum_{i=1}^\infty i^{2q}\theta_i^2 < \infty\}$. It has long been known that for finite-dimensional models the posterior distribution converges at the classical $n^{-1/2}$ rate, but for infinite-dimensional models results have only been obtained recently by Ghosal, Ghosh and van der Vaart (2000) and Shen and Wasserman (2001).
Our main goal in the present paper is to construct, without knowing the value of the smoothness parameter $q$, a prior such that the posterior distribution attains the optimal rate of convergence. In this case, instead of one single model $\theta \in \Theta_q$, we have a nested collection of models $\theta \in \Theta_q$, $q \in Q$. The inference based on such a prior is therefore adaptive in the sense that the same prior gives the optimal rate of convergence irrespective of the smoothness condition. Here, besides the problem of estimation, we encounter the problem of model selection. In a Bayesian framework, perhaps the most natural candidate for a prior that gives the optimal rate over various competing models is a mixture of the appropriate priors in the different models, indexed by the smoothness parameter $q$. So we consider a mixture of the $\Pi_q$'s over $q$, that is, a prior of the form $\sum_q \lambda_q \Pi_q$, where $\lambda_q > 0$, $\sum_q \lambda_q = 1$ and the sum is taken over some countable set. We show that the resulting hierarchical or mixture prior leads to the optimal convergence rate of the posterior simultaneously for all $q$ under consideration. Similar results were also found by Ghosal, Lember and van der Vaart (2002) in the context of density estimation and by Huang (2000) for density estimation and regression.
The problem of adaptive estimation in a minimax sense was first studied by Efromovich and Pinsker (1984). They proposed an estimator θˆ of θ which does not assume knowledge of the smoothness parameter q, yet attains the minimum in (1.1).
result. Uniformity of the convergence of the posterior distribution with respect to the parameter lying in an infinite dimensional ellipsoid is discussed in Section 8.
From now on, all symbols $O$ and $o$ refer to the asymptotics $n \to \infty$ unless otherwise specified. By $\lfloor x \rfloor$, we shall mean the greatest integer less than or equal to $x$.
2. Known smoothness. Recall that we have independent observations $X_i$ distributed as $N(\theta_i, n^{-1})$, $i = 1, 2, \ldots$. From now on, we assume that the unknown parameter $\theta = (\theta_1, \theta_2, \ldots)$ belongs to the set $\Theta_q = \{\theta : \sum_{i=1}^\infty i^{2q}\theta_i^2 < \infty\}$, a Sobolev-type subspace of $\ell^2$. Note that $q$ measures the "smoothness" of $\theta$, since, in the equivalent white noise model with the standard trigonometric Fourier basis, the set $\Theta_q$ essentially corresponds to the class of all periodic functions $f(\cdot) \in L_2[0,1]$ whose $q$th derivative has bounded $L_2[0,1]$ norm, when $q$ is an integer (otherwise the $q$th Weyl derivative is meant). So, in some sense, $q$ stands for the "number of derivatives" of the unknown signal in the equivalent white noise model. For this reason, $q$ will often be referred to as the smoothness parameter of $\theta$.
Let $\theta_0 = (\theta_{10}, \theta_{20}, \ldots)$ denote the true value of the parameter, so that $\theta_0 \in \Theta_q$. Let the prior $\Pi_p$ be defined as follows:

the $\theta_i$'s are independent $N(0, i^{-(2p+1)})$.

Then the posterior distribution of $\theta$ given $X$ is described by:

the $\theta_i$'s are independent $N\Bigl(\dfrac{nX_i}{n+i^{2p+1}},\ \dfrac{1}{n+i^{2p+1}}\Bigr)$.

The posterior mean is given by $\hat\theta = (\hat\theta_1, \hat\theta_2, \ldots)$, where
$$\hat\theta_i = \frac{nX_i}{n+i^{2p+1}}, \qquad i = 1, 2, \ldots.$$
Note that the posterior distribution of $\theta_i$ depends on $X$ only through $X_i$. In Theorem 5.1 of Zhao (2000) [and implicitly in Section 3 of Cox (1993) and Theorem 5 of Freedman (1999)], it is shown that $\hat\theta$ converges to $\theta_0$ at the rate $n^{-\min(p,q)/(2p+1)}$. By a slight extension of Zhao's (2000) argument, we observe that the posterior distribution also converges at this rate.
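As a quick numerical sketch of the conjugate formulas above (the truncation to 2000 coordinates and the test sequence $\theta_{i0} = i^{-(q+1)}$ are illustrative choices, not from the paper), one can compute the posterior mean and spread and watch the expected posterior squared distance from $\theta_0$ contract as $n$ grows:

```python
import numpy as np

# Sketch of the conjugate posterior under the prior Pi_p: theta_i ~ N(0, i^-(2p+1)).
# Given X_i ~ N(theta_i, 1/n), the posterior of theta_i is
# N(n X_i / (n + i^(2p+1)), 1 / (n + i^(2p+1))).

def posterior(X, n, p):
    i = np.arange(1, len(X) + 1)
    prec = n + i ** (2 * p + 1.0)      # posterior precision
    return n * X / prec, 1.0 / prec    # posterior mean and variance

rng = np.random.default_rng(1)
p = q = 1.0
i = np.arange(1, 2001)
theta0 = i ** -(q + 1)                 # illustrative truth; lies in Theta_q

def risk(n):
    """E[ ||theta - theta0||^2 | X ] = ||mean - theta0||^2 + sum of variances."""
    X = theta0 + rng.standard_normal(theta0.size) / np.sqrt(n)
    mean, var = posterior(X, n, p)
    return np.sum((mean - theta0) ** 2) + np.sum(var)

r100, r10000 = risk(100), risk(10000)  # posterior spread shrinks with n
```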
THEOREM 2.1. For any sequence $M_n \to \infty$,
$$\Pi_p\bigl(\theta : n^{\min(p,q)/(2p+1)}\|\theta - \theta_0\| > M_n \mid X\bigr) \to 0 \tag{2.1}$$
in $P_{\theta_0}$-probability as $n \to \infty$.
PROOF. Applying Chebyshev's inequality, the posterior probability in (2.1) is at most
$$M_n^{-2} n^{2\min(p,q)/(2p+1)} \sum_{i=1}^\infty \bigl[(\hat\theta_i - \theta_{i0})^2 + \operatorname{var}(\theta_i \mid X_i)\bigr].$$
It suffices to show that the $P_{\theta_0}$-expectation of the above expression tends to zero. Now it is easy to see that
$$\sum_{i=1}^\infty \operatorname{var}(\theta_i \mid X_i) = \sum_{i=1}^\infty (n + i^{2p+1})^{-1} = O\bigl(n^{-2p/(2p+1)}\bigr).$$
It follows from Theorem 5.1 of Zhao (2000) that
$$\sum_{i=1}^\infty E_{\theta_0}(\hat\theta_i - \theta_{i0})^2 = O\bigl(n^{-2\min(p,q)/(2p+1)}\bigr)$$
and hence the result follows. $\Box$
REMARK 2.1. Shen and Wasserman (2001) also calculated the rate of convergence of the posterior distribution using a different method for the special case when $\theta_{i0} = i^{-p}$, and also showed that the obtained rate is sharp.
REMARK 2.2. Interestingly, as Zhao (2000) pointed out, although the prior $\Pi_p$ with $p = q$ is the best choice (among independent normal priors with power–variance structure) as far as the convergence rate is concerned, $\Pi_q$ has a peculiar property: both the prior and the posterior assign zero probability to $\Theta_q$. This is an easy consequence of the criterion for summability of a random series. It is also interesting to note that summability of $\sum_{i=1}^\infty i^{2q}\theta_i^2$ is barely missed by the prior, since $\sum_{i=1}^\infty i^{2p}\theta_i^2 < \infty$ almost surely for any $p < q$, and so $q = \sup\{p : \sum_{i=1}^\infty i^{2p}\theta_i^2 < \infty \text{ almost surely}\}$. To fix this problem, Zhao (2000) proposed a compound prior $\sum_{k=1}^\infty w_k \pi_k$, where under $\pi_k$ the $\theta_i$'s are independent with distribution $N(0, i^{-(2q+1)})$ for $i = 1, \ldots, k$ and degenerate at 0 otherwise, and the $w_k$'s are weights bounded from below by a sequence decaying exponentially in $k$. She showed that, on the one hand, the posterior mean attains the minimax rate of convergence and, on the other hand, the prior assigns probability 1 to $\Theta_q$.
3. Discrete spectrum adaptation. So far our approach relies heavily on the fact that we know the smoothness, that is, we have chosen a single model as the “correct” one from the collection of possible models corresponding to different choices of q >0. In general, when the parameter specifying the model is not chosen correctly, this may lead to suboptimal or inconsistent procedures, since further analysis does not take into account the possibility of other models.
Theorem 2.1 with $p = q$ gives the rate $n^{-q/(2q+1)}$ whenever $\theta_0 \in \Theta_q$, and the assertion holds for no larger value of $q$. If actually $\theta_0 \in \Theta_{q'}$ for some $q' > q$, then we lose in terms of the rate of convergence, since we could have obtained the better rate $n^{-q'/(2q'+1)}$ by using the prior $\Pi_{q'}$. On the other hand, if $\theta_0 \in \Theta_{q'}$ for $q' < q$ only, then the model is misspecified. In general, one expects inconsistency in this type of situation. In this case, however, since all the models are dense in $\ell^2$, the posterior is nevertheless consistent. Still, there is again a loss in the rate of convergence, as we only get the rate $n^{-q'/(2q+1)}$ from Theorem 2.1 instead of the rate $n^{-q'/(2q'+1)}$ achievable by the prior $\Pi_{q'}$.
Therefore, we intend to present a prior that achieves the posterior rate of convergence $n^{-q/(2q+1)}$ at $\theta_0$ whenever $\theta_0 \in \Theta_q$, for different values of the smoothness parameter $q$. Let $Q = \{\ldots, q_{-1}, q_0, q_1, \ldots\}$ be a countable subset of the positive real semiaxis without any accumulation point other than 0 or $\infty$. Such a set may be arranged in an increasing sequence that preserves the natural order, that is, $0 < \cdots < q_{-1} < q_0 < q_1 < \cdots$. We show that there is a prior for $\theta$ such that whenever $\theta_0 \in \Theta_q$, $q \in Q$, the posterior converges at the rate $n^{-q/(2q+1)}$. We may think of $q$ as another unknown parameter. So, instead of one single model, we have a sequence of nested models parameterized by a "smoothness" parameter $q$ ranging over the discrete set $Q$. In a Bayesian approach, perhaps the most natural way to handle this uncertainty in the model is to put a prior on the index of the model, or the "hyperparameter," $q$. The resulting prior $\Pi$ is therefore the two-level hierarchical prior:

given $q$, the $\theta_i$'s are independent $N(0, i^{-(2q+1)})$;
$q$ has distribution $\lambda$.

The main purpose of this article is to show that this simple and natural procedure is rate adaptive, that is, the mixture prior achieves the convergence rate $n^{-q/(2q+1)}$ whenever $\theta_0 \in \Theta_q$ and $q \in Q$.
Let $\lambda_m = \lambda(q = q_m)$. Henceforth, we shall write $\Pi_m$ for $\Pi_{q_m}$. We thus have $\Pi = \sum_{m=-\infty}^\infty \lambda_m \Pi_m$. Throughout, we shall assume that $\lambda_m > 0$ for all $m$.
Without loss of generality, we may assume thatθ0∈q0. Otherwise, we simply relabel.
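The two-level hierarchical prior above can be sketched numerically as follows. The finite stand-in for $Q$, the weights $\lambda$ and the truncation to finitely many coordinates are all illustrative choices, not from the paper.

```python
import numpy as np

# Minimal sketch of sampling from the hierarchical prior:
# first draw q ~ lambda on a (here finite) set Q, then draw
# theta_i ~ N(0, i^-(2q+1)) independently, truncated to n_coords coordinates.

def sample_hierarchical_prior(rng, qs, weights, n_coords=1000):
    q = rng.choice(qs, p=weights)            # model index q ~ lambda
    i = np.arange(1, n_coords + 1)
    sd = i ** (-(2 * q + 1) / 2.0)           # prior sd of theta_i given q
    return q, sd * rng.standard_normal(n_coords)

rng = np.random.default_rng(2)
qs = np.array([0.5, 1.0, 2.0])               # illustrative stand-in for Q
weights = np.array([0.25, 0.5, 0.25])        # lambda_m > 0 for every m
q, theta = sample_hierarchical_prior(rng, qs, weights)
```

Note that any such draw has $\sum_i \theta_i^2 < \infty$ almost surely, in line with the prior living on $\ell^2$.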
THEOREM 3.1. For any sequence $M_n \to \infty$, the posterior probability
$$\Pi\bigl(\theta : n^{q_0/(2q_0+1)}\|\theta - \theta_0\| > M_n \mid X\bigr) \to 0 \tag{3.1}$$
in $P_{\theta_0}$-probability as $n \to \infty$.
The idea of the proof is as follows: Lemma 3.1 below shows that the models with $q < q_0$ receive negligible posterior mass, effectively reducing the prior to the form $\sum_{m \ge 0} \lambda_m \Pi_m$. For such a prior we then show that the posterior probability of $\{\theta : \sum_{i=1}^\infty i^{2q_0}\theta_i^2 > B,\ q > q_0\}$ is small for a sufficiently large $B$, while Theorem 2.1 implies that $\Pi(\theta : n^{q_0/(2q_0+1)}\|\theta - \theta_0\| > M_n,\ q = q_0 \mid X)$ converges to 0 in $P_{\theta_0}$-probability. Therefore, the effective parameter space is further reduced to $\{\theta : \sum_{i=1}^\infty i^{2q_0}\theta_i^2 \le B\}$, for which the general theory developed by Ghosal, Ghosh and van der Vaart (2000) applies.
REMARK 3.1. It should be noted that the cases $q < q_0$ and $q > q_0$ are not treated symmetrically. Unlike $q < q_0$, we do not show that $\Pi(q > q_0 \mid X)$ tends to 0. An obvious reason is that it may still be possible that $\theta_0 \in \Theta_{q_m}$ for $m > 0$. For instance, if $\theta_{i0}^2$ decreases exponentially with $i$, then $\theta_0 \in \Theta_q$ for all $q$. In such cases, clearly it is not to be expected that $\Pi(q > q_0 \mid X)$ tends to 0. Even if $\theta_0 \notin \Theta_{q_1}$, $q_1 > q_0$, it is still possible that $\Pi(q > q_0 \mid X)$ does not tend to 0. Indeed, considering two values $\{q_0, q_1\}$ and the special case $\theta_{i0} = i^{-p}$, a referee exhibited that if $p \ge q_1 + 1 - (2q_1+1)/(2(2q_0+1))$, then $\Pi(q = q_1 \mid X) \to 1$ in probability. In general, the prior $\Pi_{q_1}$ leads to a slower posterior convergence rate than the required $n^{-q_0/(2q_0+1)}$ for a general $\theta_0 \in \Theta_{q_0}$, so it seems at first glance that our mixture prior cannot simultaneously achieve the optimal rate. The apparent paradox is resolved when one observes, as the referee did, that the latter happens only when $p < q_1 + 1 - (2q_1+1)/(2(2q_0+1))$.
The following lemma shows that, given the observations, the posterior probability of misspecifying the model in favor of a coarser model than the true one converges to zero.
LEMMA 3.1. There exist an integer $N$ and a constant $c > 0$ such that for any $m < 0$ and $n > N$,
$$E_{\theta_0} \Pi(q = q_m \mid X) \le \frac{\lambda_m}{\lambda_0} \exp\bigl[-c n^{1/(q_0 + q_m + 1)}\bigr] \tag{3.2}$$
and, therefore,
$$\Pi(q < q_0 \mid X) \to 0 \tag{3.3}$$
in $P_{\theta_0}$-probability as $n \to \infty$.
The proof of Lemma 3.1 is somewhat lengthy and involves some messy calculations. We defer it to Section 5.
According to Lemma 3.1, the posterior mass is asymptotically concentrated on $q \ge q_0$. We separate two possibilities: $q = q_0$ and $q > q_0$. For $m > 0$, introduce $\tilde\lambda_m = \lambda_m / \sum_{j=1}^\infty \lambda_j$ and $\tilde\Pi = \sum_{m=1}^\infty \tilde\lambda_m \Pi_m$. The next lemma shows that the posterior $\tilde\Pi(\cdot \mid X)$ is effectively concentrated on the set $\{\theta : \sum_{i=1}^\infty i^{2q_0}\theta_i^2 \le B\}$ for a large $B$. The proof of this lemma is postponed to Section 6.
LEMMA 3.2.
$$\lim_{B\to\infty} \sup_{n \ge 1} E_{\theta_0} \tilde\Pi\Bigl(\theta : \sum_{i=1}^\infty i^{2q_0}\theta_i^2 > B \Bigm| X\Bigr) = 0. \tag{3.4}$$
Finally, the following lemma shows that the posterior converges at the rate $n^{-q_0/(2q_0+1)}$. In its proof, which is given in Section 6, we exploit the general theory of posterior convergence developed by Ghosal, Ghosh and van der Vaart (2000). For $m \ge 0$, introduce $\bar\lambda_m = \lambda_m / \sum_{j=0}^\infty \lambda_j$ and $\bar\Pi = \sum_{m=0}^\infty \bar\lambda_m \Pi_m$. Note that, unlike in $\tilde\Pi$, the possibility $q = q_0$ is not ruled out here.
LEMMA 3.3. For any $B > 0$ and $M_n \to \infty$,
$$\lim_{n\to\infty} E_{\theta_0} \bar\Pi\Bigl(\theta : n^{q_0/(2q_0+1)}\|\theta - \theta_0\| > M_n,\ \sum_{i=1}^\infty i^{2q_0}\theta_i^2 \le B \Bigm| X\Bigr) = 0. \tag{3.5}$$
With the help of Lemmas 3.1–3.3 and Theorem 2.1, the main result can be easily proved now.
PROOF OF THEOREM 3.1. For the sake of brevity, denote $r_n = r_n(q_0) = n^{q_0/(2q_0+1)}$. Clearly,
$$\begin{aligned}
\Pi\{\theta : r_n\|\theta - \theta_0\| > M_n \mid X\}
&= \Pi\Bigl\{\theta : r_n\|\theta-\theta_0\| > M_n,\ q > q_0,\ \textstyle\sum_{i=1}^\infty i^{2q_0}\theta_i^2 > B \,\Big|\, X\Bigr\} \\
&\quad+ \Pi\Bigl\{\theta : r_n\|\theta-\theta_0\| > M_n,\ q > q_0,\ \textstyle\sum_{i=1}^\infty i^{2q_0}\theta_i^2 \le B \,\Big|\, X\Bigr\} \\
&\quad+ \Pi\{\theta : r_n\|\theta-\theta_0\| > M_n,\ q < q_0 \mid X\} \\
&\quad+ \Pi\{\theta : r_n\|\theta-\theta_0\| > M_n,\ q = q_0 \mid X\} \\
&\le \tilde\Pi\Bigl\{\theta : \textstyle\sum_{i=1}^\infty i^{2q_0}\theta_i^2 > B \,\Big|\, X\Bigr\}\, \Pi(q > q_0 \mid X) \\
&\quad+ \Pi\Bigl\{\theta : r_n\|\theta-\theta_0\| > M_n,\ q \ge q_0,\ \textstyle\sum_{i=1}^\infty i^{2q_0}\theta_i^2 \le B \,\Big|\, X\Bigr\} \\
&\quad+ \Pi\{\theta : r_n\|\theta-\theta_0\| > M_n,\ q < q_0 \mid X\} \\
&\quad+ \Pi_0\{\theta : r_n\|\theta-\theta_0\| > M_n \mid X\}\, \Pi(q = q_0 \mid X) \\
&\le \tilde\Pi\Bigl\{\theta : \textstyle\sum_{i=1}^\infty i^{2q_0}\theta_i^2 > B \,\Big|\, X\Bigr\}
 + \bar\Pi\Bigl\{\theta : r_n\|\theta-\theta_0\| > M_n,\ \textstyle\sum_{i=1}^\infty i^{2q_0}\theta_i^2 \le B \,\Big|\, X\Bigr\} \\
&\quad+ \Pi(q < q_0 \mid X) + \Pi_0\{\theta : r_n\|\theta-\theta_0\| > M_n \mid X\}.
\end{aligned}$$
Given $\varepsilon > 0$, $B$ can be chosen sufficiently large to make the first term less than $\varepsilon$ by Lemma 3.2. For this $B$, apply Lemma 3.3 to the second term. The third term goes to zero in $P_{\theta_0}$-probability by Lemma 3.1, while the last term converges to zero by Theorem 2.1. The theorem is proved. $\Box$
REMARK 3.3. In place of $\Pi_m$, Zhao's (2000) mixture prior described in Remark 2.2 may also possibly be used. However, we do not pursue this approach for the following reasons. First, the expressions will be even more complicated. Second, we think that when $\ell^2$ is considered as the parameter space, the property that $\Pi_m$ assigns zero probability to $\Theta_q$ is not as bad as it might first appear, since $\Theta_q$ is not closed and its closure, the whole of $\ell^2$, obviously receives the whole mass. Indeed, both $\Pi_m$ and Zhao's mixture prior have as support the whole of $\ell^2$. Finally, the criterion that "$\Theta_q$ should receive the whole mass" loses much of its original motivation in the present context of adaptation, when $\ell^2$ is the grand parameter space.
4. Adaptive estimator. As a consequence of the result on adaptivity of the posterior distribution, we now show that there is an estimator $\hat\theta$ based on the posterior distribution that is rate adaptive in the frequentist sense. The problem of adaptive estimation in a minimax sense was first studied by Efromovich and Pinsker (1984). They proposed an estimator which attains the minimum in (1.1) without knowledge of the smoothness parameter $q$. Their method is based on adaptively determining optimal damping coefficients in an orthogonal series estimator.
To construct an estimator based on the posterior distribution, one may maximize the posterior probability of a ball of appropriate radius, as in Theorem 2.5 of Ghosal, Ghosh and van der Vaart (2000). However, their construction requires knowledge of the convergence rate and hence does not apply to the adaptive setup of the problem. The following modification of their construction is due to van der Vaart (2000). Let
$$\delta_n^* = \inf\bigl\{\delta : \Pi(\theta : \|\theta - \theta^*\| \le \delta \mid X) \ge 3/4 \text{ for some } \theta^* \in \ell^2\bigr\}. \tag{4.1}$$
Take any point $\hat\theta \in \ell^2$ which satisfies
$$\Pi\bigl(\theta : \|\theta - \hat\theta\| \le \delta_n^* + n^{-1} \mid X\bigr) \ge 3/4. \tag{4.2}$$
In words, $\hat\theta$ is the center of the ball of nearly the smallest radius subject to the constraint that its posterior mass is not less than $3/4$. Note that while defining $\hat\theta$, we have not used the value of the smoothness parameter. The next theorem shows that $\hat\theta$ has the optimal frequentist rate of convergence $n^{-q_0/(2q_0+1)}$ and hence is an adaptive estimator.
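In practice, a nearly-smallest credible ball of this kind can be approximated from Monte Carlo posterior draws. The sketch below searches only over the draws themselves as candidate centers, which is a simplification, and the posterior sample is a hypothetical stand-in; none of this is a construction from the paper.

```python
import numpy as np

# Illustrative sketch: approximate the smallest-radius 3/4-credible ball from
# posterior draws, using the draws themselves as candidate centers.

def smallest_credible_ball(draws, mass=0.75):
    """Return (center, radius): the candidate center whose ball containing a
    `mass` fraction of the draws has the smallest radius."""
    best_center, best_radius = None, np.inf
    for c in draws:
        dists = np.linalg.norm(draws - c, axis=1)
        r = np.quantile(dists, mass)    # radius of a ball holding >= mass
        if r < best_radius:
            best_center, best_radius = c, r
    return best_center, best_radius

rng = np.random.default_rng(3)
center_true = np.array([1.0, -2.0, 0.5])
draws = center_true + 0.1 * rng.standard_normal((400, 3))  # fake posterior
theta_hat, delta_star = smallest_credible_ball(draws)
```

The brute-force search is quadratic in the number of draws; it is meant only to make the definition of $\delta_n^*$ and $\hat\theta$ concrete.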
THEOREM 4.1 [van der Vaart (2000)]. For any $M_n \to \infty$ and $\theta_0 \in \Theta_{q_0}$,
$$P_{\theta_0}\bigl(n^{q_0/(2q_0+1)}\|\hat\theta - \theta_0\| > M_n\bigr) \to 0,$$
where $\hat\theta$ is defined by (4.1) and (4.2).
PROOF. By Theorem 3.1, we have that for any $M_n \to \infty$,
$$\Pi\bigl(\theta : \|\theta - \theta_0\| \le M_n n^{-q_0/(2q_0+1)} \mid X\bigr) \ge 3/4 \tag{4.3}$$
with $P_{\theta_0}$-probability tending to 1.
Let $B(\tilde\theta, r) = \{\theta : \|\theta - \tilde\theta\| \le r\}$ denote the ball of radius $r$ around $\tilde\theta$. The balls $B(\hat\theta, \delta_n^* + 1/n)$ and $B(\theta_0, M_n n^{-q_0/(2q_0+1)})$ both have posterior probability at least $3/4$, so they must intersect; otherwise, the total posterior mass would exceed 1. Therefore, by the triangle inequality,
$$\|\hat\theta - \theta_0\| \le \delta_n^* + n^{-1} + M_n n^{-q_0/(2q_0+1)} \le 2 M_n n^{-q_0/(2q_0+1)} + n^{-1},$$
since by the definition of $\delta_n^*$ and (4.3), $\delta_n^* \le M_n n^{-q_0/(2q_0+1)}$. The proof follows. $\Box$
5. Proof of Lemma 3.1. We begin with a couple of preliminary lemmas. We recall that given $\theta$, the $X_i$'s are independent and distributed as $N(\theta_i, n^{-1})$, and given $q = q_m$, the $\theta_i$'s are independent and distributed as $N(0, i^{-(2q_m+1)})$. Therefore, given $q = q_m$, the marginal distribution of $X$ is the countable product of $N(0, n^{-1} + i^{-(2q_m+1)})$, $i = 1, 2, \ldots$. Let us denote this measure by $P_m$. Let $P_\lambda = \sum_{m=-\infty}^\infty \lambda_m P_m$, the marginal distribution of $X$ when $q$ is distributed according to $\lambda$. The true distribution of $X$ is, however, $P_{\theta_0}$, the countable product of $N(\theta_{i0}, n^{-1})$, $i = 1, 2, \ldots$.
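These marginal laws make the posterior model probabilities computable in practice. The sketch below works with a truncated product of the $N(0, n^{-1} + i^{-(2q_m+1)})$ densities; the truncation level, grid of $q$-values, weights and true sequence are all illustrative, not from the paper.

```python
import numpy as np

# Sketch: posterior model probabilities Pi(q = q_m | X) from the marginal
# density of X under P_m, truncated to the first k coordinates.

def log_marginal(X, n, q, k):
    i = np.arange(1, k + 1)
    v = 1.0 / n + i ** -(2 * q + 1)    # marginal variance of X_i under P_m
    return -0.5 * np.sum(np.log(2 * np.pi * v) + X[:k] ** 2 / v)

def model_posterior(X, n, qs, weights, k=500):
    logs = np.array([np.log(w) + log_marginal(X, n, q, k)
                     for q, w in zip(qs, weights)])
    logs -= logs.max()                 # stabilize before exponentiating
    p = np.exp(logs)
    return p / p.sum()

rng = np.random.default_rng(4)
n = 400
i = np.arange(1, 501)
theta0 = i ** -1.5                     # illustrative true mean sequence
X = theta0 + rng.standard_normal(500) / np.sqrt(n)
post = model_posterior(X, n, qs=[0.25, 0.5, 1.0], weights=[1/3, 1/3, 1/3])
```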
LEMMA 5.1. For any $n \ge 1$ and $\theta_0 \in \ell^2$, $P_\lambda$ and $P_{\theta_0}$ are mutually absolutely continuous.
PROOF. The affinity between two densities $f(x)$ and $g(x)$ on the real line is defined by $A(f, g) = \int \sqrt{f(x) g(x)}\,dx$. The affinity between two normal densities $N(\mu_1, \sigma_1^2)$ and $N(\mu_2, \sigma_2^2)$ is therefore given by
$$\Bigl(1 - \frac{(\sigma_1 - \sigma_2)^2}{\sigma_1^2 + \sigma_2^2}\Bigr)^{1/2} \exp\Bigl[-\frac{(\mu_1 - \mu_2)^2}{4(\sigma_1^2 + \sigma_2^2)}\Bigr].$$
Since both $P_m$ and $P_{\theta_0}$ are product measures, according to Kakutani's criterion [see, e.g., Williams (1991), page 144], we need only show that the infinite product
$$\prod_{i=1}^\infty A\bigl(N(0, n^{-1} + i^{-(2q+1)}),\ N(\theta_{i0}, n^{-1})\bigr)$$
or, equivalently, the products
$$\prod_{i=1}^\infty \Bigl(1 - \frac{((n^{-1} + i^{-(2q+1)})^{1/2} - n^{-1/2})^2}{2n^{-1} + i^{-(2q+1)}}\Bigr) \tag{5.1}$$
and
$$\prod_{i=1}^\infty \exp\Bigl[-\frac{\theta_{i0}^2}{4(2n^{-1} + i^{-(2q+1)})}\Bigr] \tag{5.2}$$
do not diverge to zero. A product of the form $\prod_{i=1}^\infty (1 - a_i)$, $a_i > 0$, $i = 1, 2, \ldots$, converges if $\sum_{i=1}^\infty a_i$ converges. Also, $(a - b)^2 \le a^2 - b^2$ for $a \ge b \ge 0$. Therefore, (5.1) converges since
$$\sum_{i=1}^\infty \frac{i^{-(2q+1)}}{2n^{-1} + i^{-(2q+1)}} \le \frac{n}{2} \sum_{i=1}^\infty i^{-(2q+1)} < \infty.$$
The product in (5.2) also converges since
$$\sum_{i=1}^\infty \frac{\theta_{i0}^2}{8n^{-1} + 4i^{-(2q+1)}} \le \frac{n}{8} \sum_{i=1}^\infty \theta_{i0}^2 < \infty.$$
The lemma is proved. $\Box$
The following lemma estimates the posterior probability of the $m$th model.
LEMMA 5.2. For any integer $m$ and $n \ge 1$,
$$E_{\theta_0} \Pi(q = q_m \mid X) \le \frac{\lambda_m}{\lambda_0} \exp\Bigl[\frac{1}{2} \sum_{i=1}^\infty \frac{(i^{-(2q_0+1)} - i^{-(2q_m+1)})(i^{-(2q_0+1)} - \theta_{i0}^2)}{i^{-2(q_0+q_m+1)} + 2n^{-1} i^{-(2q_0+1)} + n^{-2}}\Bigr].$$
PROOF. By the martingale convergence theorem [see, e.g., Williams (1991), page 109],
$$\Pi(q = q_m \mid X) = \lim_{k\to\infty} \Pi(q = q_m \mid X_1, \ldots, X_k) \quad \text{a.s. } [P_\lambda].$$
Therefore, by Lemma 5.1, $\Pi(q = q_m \mid X_1, \ldots, X_k)$ converges, as $k \to \infty$, to $\Pi(q = q_m \mid X)$ a.s. $[P_{\theta_0}]$. Since these, being probabilities, are bounded by 1, the convergence also takes place in $L_1(P_{\theta_0})$. By Bayes' theorem,
$$\begin{aligned}
\Pi(q = q_m \mid X_1, \ldots, X_k)
&= \frac{\lambda_m \prod_{i=1}^k (n^{-1} + i^{-(2q_m+1)})^{-1/2} \exp[-\frac12 \sum_{i=1}^k (n^{-1} + i^{-(2q_m+1)})^{-1} X_i^2]}{\sum_{l=-\infty}^\infty \lambda_l \prod_{i=1}^k (n^{-1} + i^{-(2q_l+1)})^{-1/2} \exp[-\frac12 \sum_{i=1}^k (n^{-1} + i^{-(2q_l+1)})^{-1} X_i^2]} \\
&\le \frac{\lambda_m \prod_{i=1}^k (n^{-1} + i^{-(2q_m+1)})^{-1/2} \exp[-\frac12 \sum_{i=1}^k (n^{-1} + i^{-(2q_m+1)})^{-1} X_i^2]}{\lambda_0 \prod_{i=1}^k (n^{-1} + i^{-(2q_0+1)})^{-1/2} \exp[-\frac12 \sum_{i=1}^k (n^{-1} + i^{-(2q_0+1)})^{-1} X_i^2]}.
\end{aligned}$$
Put $a_i = (n^{-1} + i^{-(2q_m+1)})^{-1} - (n^{-1} + i^{-(2q_0+1)})^{-1}$. Exploiting the independence of the $X_i$'s under $P_{\theta_0}$ and using
$$E \exp\Bigl[-\frac{\alpha}{2} X^2\Bigr] = \frac{1}{\sqrt{1 + \alpha\sigma^2}} \exp\Bigl[-\frac{\mu^2 \alpha}{2(1 + \alpha\sigma^2)}\Bigr]$$
for $X$ distributed as $N(\mu, \sigma^2)$ and $\alpha > -\sigma^{-2}$, we obtain
$$\begin{aligned}
E_{\theta_0} \Pi(q = q_m \mid X)
&= \lim_{k\to\infty} E_{\theta_0} \Pi(q = q_m \mid X_1, \ldots, X_k) \\
&\le \frac{\lambda_m}{\lambda_0} \limsup_{k\to\infty} \prod_{i=1}^k \Bigl(\frac{n^{-1} + i^{-(2q_0+1)}}{n^{-1} + i^{-(2q_m+1)}}\Bigr)^{1/2} E_{\theta_0} \exp\Bigl[-\frac{a_i}{2} X_i^2\Bigr] \\
&= \frac{\lambda_m}{\lambda_0} \prod_{i=1}^\infty \Bigl(\frac{n^{-1} + i^{-(2q_0+1)}}{n^{-1} + i^{-(2q_m+1)}}\Bigr)^{1/2} \Bigl(1 + \frac{a_i}{n}\Bigr)^{-1/2} \exp\Bigl[-\frac{a_i \theta_{i0}^2}{2(1 + (a_i/n))}\Bigr] \\
&= \frac{\lambda_m}{\lambda_0} \prod_{i=1}^\infty \Bigl(1 + \frac{a_i i^{-(2q_0+1)}}{1 + (a_i/n)}\Bigr)^{1/2} \exp\Bigl[-\frac{a_i \theta_{i0}^2}{2(1 + (a_i/n))}\Bigr] \\
&\le \frac{\lambda_m}{\lambda_0} \exp\Bigl[\frac12 \sum_{i=1}^\infty \frac{a_i}{1 + (a_i/n)} \bigl(i^{-(2q_0+1)} - \theta_{i0}^2\bigr)\Bigr] \\
&= \frac{\lambda_m}{\lambda_0} \exp\Bigl[\frac12 \sum_{i=1}^\infty \frac{(i^{-(2q_0+1)} - i^{-(2q_m+1)})(i^{-(2q_0+1)} - \theta_{i0}^2)}{i^{-2(q_0+q_m+1)} + 2n^{-1} i^{-(2q_0+1)} + n^{-2}}\Bigr]. \qquad \Box
\end{aligned}$$
REMARK 5.1. By exactly the same arguments, we can also conclude that for any $l$,
$$E_{\theta_0} \Pi(q = q_m \mid X) \le \frac{\lambda_m}{\lambda_l} \exp\Bigl[\frac12 \sum_{i=1}^\infty \frac{(i^{-(2q_l+1)} - i^{-(2q_m+1)})(i^{-(2q_l+1)} - \theta_{i0}^2)}{i^{-2(q_l+q_m+1)} + 2n^{-1} i^{-(2q_l+1)} + n^{-2}}\Bigr].$$
PROOF OF LEMMA 3.1. Set
$$S_1 = \sum_{i=1}^\infty \frac{(i^{-(2q_0+1)} - i^{-(2q_m+1)})\, i^{-(2q_0+1)}}{i^{-2(q_0+q_m+1)} + 2n^{-1} i^{-(2q_0+1)} + n^{-2}}$$
and
$$S_2 = -\sum_{i=1}^\infty \frac{(i^{-(2q_0+1)} - i^{-(2q_m+1)})\, \theta_{i0}^2}{i^{-2(q_0+q_m+1)} + 2n^{-1} i^{-(2q_0+1)} + n^{-2}}.$$
We shall show that there exists an $N$ not depending on $m$, $m < 0$, such that for $n > N$,
$$S_1 \le -\tfrac{1}{12}\, n^{1/(q_0+q_m+1)} \tag{5.3}$$
and
$$S_2 \le \tfrac{1}{24}\, n^{1/(q_0+q_m+1)}. \tag{5.4}$$
Note that
$$i^{-(2q_0+1)} - i^{-(2q_m+1)} \le \begin{cases} 0, & \text{for all } i, \\ -\tfrac12 i^{-(2q_m+1)}, & \text{for } i > I, \end{cases}$$
where $I = 2^{1/(2(q_0 - q_{-1}))}$. Also, if $n > N_1 = 2^{(q_0 + q_{-1} + 1)/(q_0 - q_{-1})}$, then for $i \le n^{1/(q_0+q_m+1)}$, the first term in the common denominator of the expressions for $S_1$ and $S_2$ dominates the other two terms. Piecing these facts together, we see that the terms in $S_1$ are less than or equal to $-1/6$ for $I < i \le n^{1/(q_0+q_m+1)}$, and less than or equal to zero in general. Thus
$$S_1 \le -\tfrac16 \bigl(\lfloor n^{1/(q_0+q_m+1)} \rfloor - I\bigr) \le -\tfrac{1}{12}\, n^{1/(q_0+q_m+1)}$$
if $n > N_2 = (2I + 2)^{2q_0+1}$, and so (5.3) follows.
Let $C_0 = \max(1, \sum_{i=1}^\infty i^{2q_0}\theta_{i0}^2)$. Now, for any $k$,
$$\begin{aligned}
S_2 &\le \sum_{i=1}^\infty \frac{i^{-(2q_m+1)} \theta_{i0}^2}{i^{-2(q_0+q_m+1)} + 2n^{-1} i^{-(2q_0+1)} + n^{-2}} \\
&\le \sum_{i=1}^k i^{2q_0+1}\theta_{i0}^2 + n^2 \sum_{i=k+1}^\infty i^{-(2q_m+1)} \theta_{i0}^2 \tag{5.5} \\
&\le k C_0 + n^2 (k+1)^{-(2q_0+2q_m+1)} \sum_{i=k+1}^\infty i^{2q_0}\theta_{i0}^2.
\end{aligned}$$
The first term can be made less than or equal to $(48)^{-1} n^{1/(q_0+q_m+1)}$ by choosing $k = \lfloor (48 C_0)^{-1} n^{1/(q_0+q_m+1)} \rfloor$. Since $k + 1 \ge (48 C_0)^{-1} n^{1/(2q_0+1)}$, the second term is also less than or equal to $(48)^{-1} n^{1/(q_0+q_m+1)}$ for $n > N_3$, where $N_3$ is the smallest $n$ such that
$$\sum_{i \ge (48 C_0)^{-1} n^{1/(2q_0+1)}} i^{2q_0}\theta_{i0}^2 < (48 C_0)^{-2(2q_0+1)}. \tag{5.6}$$
Note that $N_1$, $N_2$ and $N_3$ do not depend on the particular $m < 0$. Thus for $n > N = \max(N_1, N_2, N_3)$, (5.3) and (5.4) follow. Equation (3.2) now follows from Lemma 5.2. Equation (3.3) is an immediate consequence of (3.2). $\Box$
6. Proof of Lemmas 3.2 and 3.3. To simplify notation, we drop the overhead tildes from $\tilde\Pi$ and the $\tilde\lambda_m$'s and simply write $\Pi = \sum_{m=1}^\infty \lambda_m \Pi_m$, where $\sum_{m=1}^\infty \lambda_m = 1$.
PROOF OF LEMMA 3.2. By Chebyshev's inequality,
$$\Pi\Bigl(\theta : \sum_{i=1}^\infty i^{2q_0}\theta_i^2 > B \Bigm| X\Bigr) \le B^{-1} \sum_{i=1}^\infty i^{2q_0} E(\theta_i^2 \mid X).$$
Now
$$\begin{aligned}
E(\theta_i^2 \mid X) &= \sum_{m=1}^\infty \Pi(q = q_m \mid X)\, E(\theta_i^2 \mid X, q = q_m) \\
&= \sum_{m=1}^\infty \Pi(q = q_m \mid X) \Bigl(\frac{1}{n + i^{2q_m+1}} + \frac{n^2 X_i^2}{(n + i^{2q_m+1})^2}\Bigr) \\
&\le \frac{1}{n + i^{2q_1+1}} + \frac{n^2 X_i^2}{(n + i^{2q_0+1})^2}.
\end{aligned}$$
Therefore,
$$\begin{aligned}
E_{\theta_0} \Pi\Bigl(\theta : \sum_{i=1}^\infty i^{2q_0}\theta_i^2 > B \Bigm| X\Bigr)
&\le B^{-1} \sum_{i=1}^\infty i^{2q_0} \Bigl(\frac{1}{n + i^{2q_1+1}} + \frac{n^2 E_{\theta_0} X_i^2}{(n + i^{2q_0+1})^2}\Bigr) \\
&= B^{-1} \Bigl(\sum_{i=1}^\infty \frac{i^{2q_0}}{n + i^{2q_1+1}} + \sum_{i=1}^\infty \frac{n^2 i^{2q_0} \theta_{i0}^2}{(n + i^{2q_0+1})^2} + \sum_{i=1}^\infty \frac{n i^{2q_0}}{(n + i^{2q_0+1})^2}\Bigr).
\end{aligned}$$
It therefore suffices to show that the sums in the above display are finite and bounded in $n$. The first two sums clearly are, because they are bounded by the convergent series $\sum_{i=1}^\infty i^{-1-2(q_1-q_0)}$ and $\sum_{i=1}^\infty i^{2q_0}\theta_{i0}^2$, respectively. Split the last sum into sums over the ranges $\{i : 1 \le i \le k\}$ and $\{i : i > k\}$, with $k = \lfloor n^{1/(2q_0+1)} \rfloor$.
Then
$$\sum_{i=1}^\infty \frac{n i^{2q_0}}{(n + i^{2q_0+1})^2} \le n^{-1} \sum_{i=1}^k i^{2q_0} + n \sum_{i=k+1}^\infty i^{-2q_0-2} \le \frac{n^{-1} k^{2q_0+1} + n (k+1)^{-2q_0-1}}{2q_0+1},$$
which is bounded by $2/(2q_0+1)$. $\Box$
To prove Lemma 3.3, it is more convenient to view the sample as $n$ i.i.d. observations $\mathbf{Y}_1, \mathbf{Y}_2, \ldots, \mathbf{Y}_n$, where each $\mathbf{Y}_j$ is distributed like $\mathbf{Y} = (Y_1, Y_2, \ldots)$ with distribution $P_\theta$: the $Y_i$'s are independently distributed as $N(\theta_i, 1)$, $i = 1, 2, \ldots$. Note that here we distinguish between this $P_\theta$ and $P_{\theta,n}$, the distribution of $X$. We need some preparatory lemmas.
LEMMA 6.1. For any $\theta, \theta_0 \in \ell^2$:

(i) $P_\theta$ is absolutely continuous with respect to $P_{\theta_0}$.

(ii) $\sum_{i=1}^\infty (Y_i - \theta_{i0})(\theta_i - \theta_{i0})$ converges a.s. $[P_{\theta_0}]$ and in $L_2(P_{\theta_0})$, and has mean 0 and variance $\|\theta - \theta_0\|^2$.

(iii) $\dfrac{dP_\theta}{dP_{\theta_0}}(\mathbf{Y}) = \exp\Bigl[\sum_{i=1}^\infty (Y_i - \theta_{i0})(\theta_i - \theta_{i0}) - \tfrac12 \|\theta - \theta_0\|^2\Bigr]$.

(iv) $-\displaystyle\int \log \frac{dP_\theta}{dP_{\theta_0}}(y)\,dP_{\theta_0}(y) = \tfrac12 \|\theta - \theta_0\|^2$.

(v) $\displaystyle\int \Bigl(\log \frac{dP_\theta}{dP_{\theta_0}}(y)\Bigr)^2 dP_{\theta_0}(y) = \|\theta - \theta_0\|^2 + \tfrac14 \|\theta - \theta_0\|^4$.

(vi) The Hellinger distance
$$H(\theta_0, \theta) = \biggl(\int \Bigl[\Bigl(\frac{dP_\theta}{dP_{\theta_0}}(y)\Bigr)^{1/2} - 1\Bigr]^2 dP_{\theta_0}(y)\biggr)^{1/2}$$
satisfies
$$H^2(\theta_0, \theta) = 2\bigl(1 - \exp\bigl[-\tfrac18 \|\theta - \theta_0\|^2\bigr]\bigr), \tag{6.1}$$
so that, in particular, $H(\theta_0, \theta) \le \tfrac12 \|\theta - \theta_0\|$, and if $\|\theta - \theta_0\| \le 1$, then $H(\theta_0, \theta) \ge \tfrac12 e^{-1/16} \|\theta - \theta_0\|$.
PROOF. Since $P_\theta$ and $P_{\theta_0}$ are product measures and the infinite product of the affinities of their respective components converges to $\exp[-\tfrac18 \|\theta - \theta_0\|^2]$, (i) follows from Kakutani's criterion as in the proof of Lemma 5.1. For (ii), consider the mean zero martingale $S_k = \sum_{i=1}^k (Y_i - \theta_{i0})(\theta_i - \theta_{i0})$ and note that $\sup_{k \ge 1} E S_k^2 = \|\theta - \theta_0\|^2 < \infty$. The martingale convergence theorem [see, e.g., Williams (1991), page 109] now applies. For (iii), note that on the sigma-field generated by $(Y_1, Y_2, \ldots, Y_k)$, the Radon–Nikodym derivative is given by
$$\exp\Bigl[\sum_{i=1}^k (Y_i - \theta_{i0})(\theta_i - \theta_{i0}) - \tfrac12 \sum_{i=1}^k (\theta_i - \theta_{i0})^2\Bigr].$$
The rest follows from (ii) and the martingale convergence theorem. Assertions (iv) and (v) are immediate consequences of (ii) and (iii). For (vi), note that $H^2(\theta_0, \theta) = 2 - 2E_{\theta_0}\bigl((dP_\theta/dP_{\theta_0})(\mathbf{Y})\bigr)^{1/2}$. To evaluate $E_{\theta_0}\bigl((dP_\theta/dP_{\theta_0})(\mathbf{Y})\bigr)^{1/2}$, consider the martingale
$$S_k = \exp\Bigl[\tfrac12 \sum_{i=1}^k (Y_i - \theta_{i0})(\theta_i - \theta_{i0}) - \tfrac18 \sum_{i=1}^k (\theta_i - \theta_{i0})^2\Bigr].$$
Clearly $S_k$ is positive, has unit expectation and bounded second moment, so its limit $(dP_\theta/dP_{\theta_0})^{1/2}(\mathbf{Y}) \exp[\tfrac18 \|\theta - \theta_0\|^2]$ also has expectation 1. This implies (6.1). The next assertion in (vi) now follows from the inequality $1 - e^{-x} \le x$, and the last from the mean value theorem. $\Box$
The following lemma, which estimates the probability of a ball under product normal measure from below, will be used in the calculation of certain prior probabilities. The result appeared as Lemma 5 in Shen and Wasserman (2001). Its simple proof is included for completeness.
LEMMA 6.2. Let $W_1, \ldots, W_N$ be independent random variables, with $W_i$ having distribution $N(-\xi_i, i^{-2d})$, $d > 0$. Then
$$\Pr\Bigl(\sum_{i=1}^N W_i^2 \le \delta^2\Bigr) \ge 2^{-N/2} e^{-dN} \exp\Bigl(-\sum_{i=1}^N i^{2d}\xi_i^2\Bigr) \Pr\Bigl(\sum_{i=1}^N V_i^2 \le 2\delta^2 N^{2d}\Bigr),$$
where $V_1, \ldots, V_N$ are independent standard normal random variables.
PROOF. Using $(a+b)^2 \le 2(a^2 + b^2)$, $N! \ge e^{-N} N^N$ and the change of variables $v_i = \sqrt2\, N^d w_i$, we obtain
$$\begin{aligned}
\Pr\Bigl(\sum_{i=1}^N W_i^2 \le \delta^2\Bigr)
&= \int_{\sum_{i=1}^N w_i^2 \le \delta^2} \prod_{i=1}^N (2\pi)^{-1/2} i^d \exp\bigl[-\tfrac12 i^{2d}(w_i + \xi_i)^2\bigr]\,dw_1 \cdots dw_N \\
&\ge (2\pi)^{-N/2} (N!)^d \int_{\sum_{i=1}^N w_i^2 \le \delta^2} \exp\Bigl[-\sum_{i=1}^N i^{2d}(w_i^2 + \xi_i^2)\Bigr]\,dw_1 \cdots dw_N \\
&\ge (2\pi)^{-N/2} (N!)^d \exp\Bigl[-\sum_{i=1}^N i^{2d}\xi_i^2\Bigr] \int_{\sum_{i=1}^N w_i^2 \le \delta^2} \exp\Bigl[-N^{2d} \sum_{i=1}^N w_i^2\Bigr]\,dw_1 \cdots dw_N \\
&\ge (N!)^d \exp\Bigl[-\sum_{i=1}^N i^{2d}\xi_i^2\Bigr] 2^{-N/2} N^{-dN} (2\pi)^{-N/2} \int_{\sum_{i=1}^N v_i^2 \le 2N^{2d}\delta^2} \exp\Bigl[-\tfrac12 \sum_{i=1}^N v_i^2\Bigr]\,dv_1 \cdots dv_N \\
&\ge 2^{-N/2} e^{-dN} \exp\Bigl[-\sum_{i=1}^N i^{2d}\xi_i^2\Bigr] \Pr\Bigl(\sum_{i=1}^N V_i^2 \le 2N^{2d}\delta^2\Bigr). \qquad \Box
\end{aligned}$$
For simplicity, we now drop the overhead bars from $\bar\Pi$ and the $\bar\lambda_m$'s and simply write $\Pi = \sum_{m=0}^\infty \lambda_m \Pi_m$, where $\sum_{m=0}^\infty \lambda_m = 1$.
LEMMA 6.3. There exist positive constants $C, c$ such that for all $\varepsilon > 0$,
$$\Pi\{\theta : \|\theta - \theta_0\| \le \varepsilon\} \ge C \lambda_0 \exp\bigl[-c\,\varepsilon^{-1/q_0}\bigr]. \tag{6.2}$$
PROOF. Clearly the left-hand side (LHS) of (6.2) is bounded from below by $\lambda_0 \Pi_0\{\theta : \sum_{i=1}^\infty (\theta_i - \theta_{i0})^2 \le \varepsilon^2\}$. Now by independence, for any $N$,
$$\Pi_0\Bigl(\theta : \sum_{i=1}^\infty (\theta_i - \theta_{i0})^2 \le \varepsilon^2\Bigr) \ge \Pi_0\Bigl(\theta : \sum_{i=1}^N (\theta_i - \theta_{i0})^2 \le \varepsilon^2/2\Bigr)\, \Pi_0\Bigl(\theta : \sum_{i=N+1}^\infty (\theta_i - \theta_{i0})^2 \le \varepsilon^2/2\Bigr).$$
Also,
$$\sum_{i=N+1}^\infty (\theta_i - \theta_{i0})^2 \le 2\sum_{i=N+1}^\infty \theta_i^2 + 2\sum_{i=N+1}^\infty \theta_{i0}^2. \tag{6.3}$$
The second sum in (6.3) satisfies
$$2\sum_{i=N+1}^\infty \theta_{i0}^2 \le 2N^{-2q_0} \sum_{i=N+1}^\infty i^{2q_0}\theta_{i0}^2 \le 2N^{-2q_0} C_0 < \frac{\varepsilon^2}{4}$$
whenever $N > N_1 = (8C_0)^{1/2q_0}\varepsilon^{-1/q_0}$, where $C_0 = \sum_{i=1}^\infty i^{2q_0}\theta_{i0}^2$. By Chebyshev's inequality, the first sum on the right-hand side (RHS) of (6.3) is less than $\varepsilon^2/4$ with probability at least
$$1 - \frac{8}{\varepsilon^2}\sum_{i=N+1}^\infty E_{\Pi_0}(\theta_i^2) = 1 - \frac{8}{\varepsilon^2}\sum_{i=N+1}^\infty i^{-(2q_0+1)} \ge 1 - \frac{4}{q_0 N^{2q_0}\varepsilon^2} > \frac12$$
if $N > N_2 = (8/q_0)^{1/2q_0}\varepsilon^{-1/q_0}$. To bound $\Pi_0\{\theta : \sum_{i=1}^N (\theta_i - \theta_{i0})^2 \le \varepsilon^2/2\}$, we apply Lemma 6.2 with $d = q_0 + \tfrac12$, $\xi_i = \theta_{i0}$ and $\delta^2 = \varepsilon^2/2$. Note that by the central limit theorem, the factor $\Pr\{\sum_{i=1}^N V_i^2 \le 2\delta^2 N^{2d}\}$ on the RHS of the inequality from Lemma 6.2 is at least $\tfrac14$ if $2\delta^2 N^{2d} > N$ and $N$ is large, that is, if $N > N_3 = \varepsilon^{-1/q_0}$ and $N > N_4$. Choosing $N = \max(N_1, N_2, N_3, N_4)$ and noting that $\sum_{i=1}^N i^{2q_0+1}\theta_{i0}^2 \le N C_0$, (6.2) is obtained. $\Box$
The final lemma gives the entropy estimates. Recall that for a totally bounded subset $S$ of a metric space with a distance function $d$, the $\varepsilon$-packing number $D(\varepsilon, S, d)$ is defined to be the largest integer $m$ such that there exist $s_1, \ldots, s_m \in S$ with $d(s_j, s_k) > \varepsilon$ for all $j, k = 1, 2, \ldots, m$, $j \ne k$. The $\varepsilon$-entropy is $\log D(\varepsilon, S, d)$. A bound for the entropy of Sobolev balls was obtained in Theorem 5.2 of Birman and Solomjak (1967). It may be possible to derive a bound similar to that in Lemma 6.4 from their result by a Fourier transformation. However, the exact correspondence is somewhat unclear because of the possible fractional value of the index $q$. Below, we present a simple proof which could be of some independent interest as well.
LEMMA 6.4. For all $\varepsilon > 0$,
$$\log D\Bigl(\varepsilon, \Bigl\{\theta : \sum_{i=1}^\infty i^{2q}\theta_i^2 \le B\Bigr\}, \|\cdot\|\Bigr) \le \tfrac12 (8B)^{1/(2q)} \bigl(\log\bigl[4(2e)^{2q}\bigr]\bigr)\, \varepsilon^{-1/q}. \tag{6.4}$$
PROOF. Denote $\Theta_q(B) = \{\theta : \sum_{i=1}^\infty i^{2q}\theta_i^2 \le B\}$. If $\varepsilon^2 > 4B$, then
$$\sum_{i=1}^\infty \theta_i^2 \le \sum_{i=1}^\infty i^{2q}\theta_i^2 \le B < \frac{\varepsilon^2}{4},$$
and hence for any two $\theta, \theta' \in \Theta_q(B)$, $\|\theta - \theta'\| < \varepsilon$, implying that $D(\varepsilon, \Theta_q(B)) = 1$. Thus (6.4) is trivially satisfied for such an $\varepsilon$. We shall therefore assume that $\varepsilon^2 \le 4B$.
Let $\theta_1, \ldots, \theta_m \in \Theta_q(B)$ be such that $\|\theta_j - \theta_k\| > \varepsilon$, $j, k = 1, 2, \ldots, m$, $j \ne k$. For an integer $N$, consider the set
$$\Theta_{q,N}(B) = \Bigl\{(\theta_1, \ldots, \theta_N, 0, \ldots) : \sum_{i=1}^N i^{2q}\theta_i^2 \le B\Bigr\}.$$
With $\bar\theta = (\theta_1, \ldots, \theta_N, 0, \ldots)$, note that for any $\theta \in \Theta_q(B)$,
$$\|\theta - \bar\theta\|^2 = \sum_{i=N+1}^\infty \theta_i^2 \le (N+1)^{-2q} B \le \frac{\varepsilon^2}{8}$$
if $N = \lfloor (8B)^{1/(2q)}\varepsilon^{-1/q} \rfloor$. For this choice of $N$, we have
$$1 \le N \le (8B)^{1/(2q)}\varepsilon^{-1/q} < N + 1 \le 2N. \tag{6.5}$$
Therefore, for $j, k = 1, 2, \ldots, m$, $j \ne k$,
$$\varepsilon^2 < \|\theta_j - \theta_k\|^2 = \|\bar\theta_j - \bar\theta_k\|^2 + \|(\theta_j - \bar\theta_j) - (\theta_k - \bar\theta_k)\|^2 \le \|\bar\theta_j - \bar\theta_k\|^2 + 4\cdot\frac{\varepsilon^2}{8},$$
implying that $\|\bar\theta_j - \bar\theta_k\| > \varepsilon/\sqrt2$. The set $\Theta_{q,N}(B)$ may be viewed as an $N$-dimensional ellipsoid and the $\bar\theta_j$'s as $N$-dimensional vectors in $\mathbb{R}^N$ with the usual Euclidean distance. For $t = (t_1, \ldots, t_N)$ and $\delta > 0$, let $B(t, \delta)$ stand for the Euclidean ball with radius $\delta$ centered at $t$. The balls $B(\bar\theta_j, \varepsilon/2\sqrt2)$ are clearly disjoint. Note that if $t \in \Theta_{q,N}(B)$ and $t' = (t'_1, \ldots, t'_N) \in B(t, \varepsilon/2\sqrt2)$, we have
$$\sum_{i=1}^N i^{2q} t_i'^2 \le 2\sum_{i=1}^N i^{2q} t_i^2 + 2\sum_{i=1}^N i^{2q}(t'_i - t_i)^2 \le 2B + 2N^{2q}\frac{\varepsilon^2}{8} \le 4B$$
by (6.5), and hence
$$\bigcup_{j=1}^m B\Bigl(\bar\theta_j, \frac{\varepsilon}{2\sqrt2}\Bigr) \subset \Theta_{q,N}(4B).$$
Let V_N stand for the volume of the unit ball in R^N. Note that the volume of the ellipsoid {t : Σ_{i=1}^N a_i² t_i² ≤ A} is given by V_N A^{N/2} / ∏_{i=1}^N a_i. Bounding the volume of ∪_{j=1}^m B(θ̄^j, ε/(2√2)) by that of Θ_{q,N}(4B), we obtain

    m (ε²/8)^{N/2} V_N ≤ (4B)^{N/2} V_N ∏_{i=1}^N i^{−q} = (4B)^{N/2} V_N (N!)^{−q}.
Since N! ≥ N^N e^{−N} for all N ≥ 1, we arrive at the estimate

    m ≤ (32 B e^{2q} / (N^{2q} ε²))^{N/2}.  (6.6)

From (6.5), we also have N^{2q} ε² ≥ 2^{3−2q} B. Substituting in (6.6), the required bound is obtained.
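As a sanity check on Lemma 6.4, the following sketch (not part of the paper; the greedy construction, truncation dimension and parameter values are our own choices) builds a crude ε-separated set inside a truncated version of the ellipsoid and compares its log-cardinality, a lower bound for the ε-entropy, with the bound (6.4):

```python
import math
import random

def packing_bound(eps, q, B):
    # RHS of (6.4): (1/2) (8B)^{1/(2q)} log(4 (2e)^{2q}) eps^{-1/q}
    return 0.5 * (8 * B) ** (1 / (2 * q)) \
        * math.log(4 * (2 * math.e) ** (2 * q)) * eps ** (-1 / q)

def greedy_packing(eps, q, B, dim=20, trials=2000, seed=0):
    # Greedily collect an eps-separated set inside the truncated ellipsoid
    # {theta : sum_i i^{2q} theta_i^2 <= B}; its size is a lower bound
    # for the packing number D(eps, ., l2).
    rng = random.Random(seed)
    centers = []
    for _ in range(trials):
        theta = [rng.uniform(-1, 1) * math.sqrt(B) / (i + 1) ** q
                 for i in range(dim)]
        s = sum((i + 1) ** (2 * q) * t * t for i, t in enumerate(theta))
        if s > B:  # rescale onto the ellipsoid so the constraint holds
            theta = [t * math.sqrt(B / s) for t in theta]
        if all(math.dist(theta, c) > eps for c in centers):
            centers.append(theta)
    return len(centers)

q, B, eps = 1.0, 1.0, 0.5
m = greedy_packing(eps, q, B)
print("log packing (lower bound):", math.log(m))
print("bound (6.4):", packing_bound(eps, q, B))
```

The greedy set is of course far from a maximal packing, so the comparison only illustrates that the bound is not violated, not that it is tight.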
THEOREM 6.1. Suppose we have i.i.d. observations Y_1, Y_2, … from a distribution P having a density p (with respect to some sigma-finite measure) belonging to a class P. Let H(P_0, P) denote the Hellinger distance between P_0 and P. Let Π be a prior on P and P̄ ⊂ P. Let ε_n be a positive sequence such that ε_n → 0 and nε_n² → ∞ and suppose that

    log D(ε_n, P̄, H) ≤ c₁ n ε_n²  (6.7)

and

    Π(P : −∫ log(p/p₀) dP₀ ≤ ε_n², ∫ (log(p/p₀))² dP₀ ≤ ε_n²) ≥ c₂ e^{−c₃ n ε_n²}  (6.8)

for some constants c₁, c₂, c₃, where p₀ stands for the density of P₀. Then for a sufficiently large constant M, the posterior probability

    Π(P ∈ P̄ : H(P₀, P) ≥ M ε_n | Y₁, Y₂, …, Y_n) → 0  (6.9)

in P₀ⁿ-probability.
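Theorem 6.1 is applied by choosing ε_n to balance the entropy condition (6.7) against bounds such as Lemma 6.4. A minimal numeric sketch (our own illustration, not from the paper) checks that ε_n = n^{−q/(2q+1)} equalizes, up to constants, the entropy side ε_n^{−1/q} with n ε_n²:

```python
# Condition (6.7) with the entropy bound of Lemma 6.4 reads, constants aside,
#   eps_n^(-1/q) <= c1 * n * eps_n^2,  i.e.  eps_n^(-(2q+1)/q) <= c1 * n,
# and eps_n = n^(-q/(2q+1)) makes the two sides equal up to constants.
def check_rate(q, n):
    eps = n ** (-q / (2 * q + 1))
    lhs = eps ** (-1 / q)    # entropy side
    rhs = n * eps ** 2       # n * eps_n^2 side
    return lhs, rhs

for q in (0.5, 1.0, 2.0):
    for n in (10, 10**3, 10**6):
        lhs, rhs = check_rate(q, n)
        # both sides equal n^(1/(2q+1)), so they agree up to rounding
        assert abs(lhs - rhs) < 1e-6 * rhs
        print(q, n, lhs)
```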
PROOF OF LEMMA 3.3. Take P = {P_θ : θ ∈ ℓ₂} and P̄ = {P_θ : Σ_{i=1}^∞ i^{2q₀} θ_i² ≤ B}. By part (vi) of Lemma 6.1, the Hellinger distance H(P_{θ₀}, P_θ) is equivalent to the ℓ₂ distance ‖θ₀ − θ‖, so in (6.7) and in the conclusion (6.9), the former may be replaced by the latter. Lemma 6.4 shows that (6.7) is satisfied by a multiple of n^{−q₀/(2q₀+1)}. For (6.8), note that by parts (iv) and (v) of Lemma 6.1, the set

    {θ : −E_{θ₀} log (dP_θ/dP_{θ₀})(Y) ≤ ε², E_{θ₀}(log (dP_θ/dP_{θ₀})(Y))² ≤ ε²}

contains {θ : ‖θ − θ₀‖² ≤ ε²/2} for ε < 1. Therefore, by Lemma 6.3, (6.8) is also satisfied by a multiple of n^{−q₀/(2q₀+1)}. The result thus follows.
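The equivalence of the Hellinger and ℓ₂ distances invoked above can be checked numerically. For a single observation Y with independent N(θ_i, 1) coordinates, a standard Gaussian computation (assumed here; the paper's Lemma 6.1 is not reproduced) gives H²(P_θ, P_{θ′}) = 2(1 − exp(−‖θ − θ′‖²/8)), which is two-sidedly comparable to ‖θ − θ′‖² on bounded sets:

```python
import math

# Squared Hellinger distance between two infinite products of N(theta_i, 1)
# laws, as a function of the l2 distance d between the mean sequences
# (closed form assumed from the standard Gaussian Hellinger computation).
def hellinger_sq(d):
    return 2 * (1 - math.exp(-d * d / 8))

# For d <= 1 we have c * d^2 <= H^2 <= d^2 / 4, with c = 2(1 - e^{-1/8}),
# using x >= 1 - e^{-x} >= 8(1 - e^{-1/8}) x on 0 <= x <= 1/8.
c = 2 * (1 - math.exp(-1 / 8))
for d in (0.01, 0.1, 0.5, 1.0):
    h2 = hellinger_sq(d)
    assert c * d * d <= h2 <= d * d / 4
    print(d, h2)
```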
7. Continuous spectrum adaptation. Suppose now that q₀ ∈ Q, where Q is an arbitrary subset of the positive semiaxis ℝ₊. Since our basic approach relies on the assumption that the smoothness parameter can take only countably many possible values, we proceed as follows. Choose a countable dense subset Q* of Q and put a prior λ on it such that λ(q = s) > 0 for each s ∈ Q*.
THEOREM 7.1. For any sequence M_n → ∞ and for any given δ > 0, the posterior probability

    Π(θ : n^{q₀/(2q₀+1)−δ} ‖θ − θ₀‖ > M_n | X) → 0

in P_{θ₀}-probability.
Note that the prior does not depend on δ, so the obtained rate is almost optimal. The proof of this theorem is very similar to that of Theorem 3.1; the main differences are indicated in the outline below.
OUTLINE OF THE PROOF. The true q₀ may not belong to Q*, but we may choose q* ∈ Q*, q* < q₀, arbitrarily close to q₀. Choose also q̄, q̃ ∈ Q* which satisfy q̄ < q* < q̃ < q₀, and hence θ₀ ∈ Θ_{q₀} ⊂ Θ_{q̃} ⊂ Θ_{q*} ⊂ Θ_{q̄}.
Lemmas 3.1 and 3.2 go through in the sense that Π(q < q* | X) → 0 as n → ∞ and

    sup_n Π(θ : Σ_{i=1}^∞ i^{2q̄} θ_i² > B, q ≥ q* | X) → 0 as B → ∞

in P_{θ₀}-probability. To see this, apply Lemma 3.1 with q₀ replaced by q̃ and q₋₁ by q*, and Lemma 3.2 with q₀ replaced by q̄ and q₁ by q*.
As in the proof of Theorem 3.1, we obtain the bound

    Π(θ : n^{q̄/(2q̄+1)} ‖θ − θ₀‖ > M_n | X)
      = Π(θ : n^{q̄/(2q̄+1)} ‖θ − θ₀‖ > M_n, q ≥ q* | X) + Π(θ : n^{q̄/(2q̄+1)} ‖θ − θ₀‖ > M_n, q < q* | X)
      ≤ Π(θ : Σ_{i=1}^∞ i^{2q̄} θ_i² > B, q ≥ q* | X)
        + Π(θ : n^{q̄/(2q̄+1)} ‖θ − θ₀‖ > M_n, Σ_{i=1}^∞ i^{2q̄} θ_i² ≤ B, q ≥ q* | X)
        + Π(q < q* | X).
Note that the fourth term on the RHS of the series of inequalities in the proof of Theorem 3.1 has been absorbed into the first two terms on the RHS of the last inequality. Now, the second term on the RHS of the last display converges to zero by Lemma 3.3 with q₀ replaced by q̄. Therefore all the terms go to zero. Since q̄ can be made arbitrarily close to q₀, this means that for any given δ > 0, the posterior converges at least at the rate n^{−q₀/(2q₀+1)+δ} in ℓ₂.
REMARK 7.1. The loss of a power δ is due to the fact that the q’s are not strictly separated. It is intuitively clear that the requirement of strict separation stems from the fact that the closer the possible values of the smoothness parameter are, the more difficult it is to distinguish between them.
on the estimates in Lemmas 5.2 and 6.3 does not seem to work because of the absence of point masses. Nevertheless, we believe that Theorem 7.1 continues to hold in this case, although a complete analogue of Theorem 3.1 will possibly not hold. The intuition behind our belief is essentially the same as in Theorem 7.1: the slightest deviation in the assumed value of q from the true q₀ affects the rate by a power of n, and without a point mass at q₀ one can at best hope for concentration of the posterior distribution of q in a neighborhood of q₀, so the rate of convergence of the posterior distribution will be affected by an arbitrarily small power of n.
8. Uniformity over ellipsoids. It is of some interest to know the extent to which the posterior convergence in Theorems 2.1, 3.1 and 7.1 is uniform over θ₀; see Brown, Low and Zhao (1997) in this context. For Theorem 2.1, it is immediately seen from the proof that (2.1) holds uniformly over an ellipsoid, that is, for any Q ≥ 0 and M_n → ∞,

    sup_{θ₀ ∈ Θ_q(Q)} E_{θ₀} Π_q(θ : n^{q/(2q+1)} ‖θ − θ₀‖ > M_n | X) → 0.
For Theorem 3.1, such a proposition is much more subtle. One needs to check uniformity in each of Lemmas 3.1, 3.2 and 3.3. In Theorem 8.1 below, we show that a local uniformity holds in the sense that for every θ₀ ∈ Θ_{q₀}, there exists a small ellipsoid around it where the convergence is uniform.
THEOREM 8.1. For any θ* ∈ Θ_{q₀} there exists ε > 0 such that

    sup_{θ₀ ∈ E_{q₀}(θ*, ε)} E_{θ₀} Π(θ : n^{q₀/(2q₀+1)} ‖θ − θ₀‖ > M_n | X) → 0,

where E_{q₀}(θ*, ε) = {θ₀ : Σ_{i=1}^∞ i^{2q₀} (θ_i⁰ − θ_i*)² < ε}.
OUTLINE OF THE PROOF. We need to check the uniformity in every step of the proof of Theorem 3.1. Going through the proof of Lemma 3.2, it is easily seen that for any Q > 0,

    lim_{B→∞} sup_{θ₀ ∈ Θ_{q₀}(Q)} sup_{n≥1} E_{θ₀} Π̃(θ : Σ_{i=1}^∞ i^{2q₀} θ_i² > B | X) = 0.
For the uniform version of Lemma 3.3, we need to show that for any B > 0,

    lim_{n→∞} sup_{θ₀ ∈ Θ_{q₀}(Q)} E_{θ₀} Π̄(θ : r_n ‖θ − θ₀‖ > M_n, Σ_{i=1}^∞ i^{2q₀} θ_i² ≤ B | X) = 0.  (8.1)
Further, note that (6.9) in Theorem 6.1 could have been stated as an inequality [see the proof of Theorem 2.1 of Ghosal, Ghosh and van der Vaart (2000)],

    E_{θ₀} Π(P ∈ P̄ : H(P₀, P) ≥ M ε_n | Y₁, Y₂, …, Y_n) ≤ C e^{−c n ε_n²} + 1/(n ε_n²),

where C and c are positive constants depending on c₁, c₂ and c₃ that appeared in the conditions of Theorem 6.1. Therefore, (8.1) holds if the conditions of Theorem 6.1 hold simultaneously for θ₀ ∈ Θ_{q₀}(Q). The entropy condition (6.7) does not depend on the true value θ₀ [see relation (6.4) in Lemma 6.4], so c₁ is free of θ₀. It is possible to choose the same c₂ and c₃ for θ₀ belonging to the ellipsoid Θ_q(Q) by the estimates given by Lemmas 6.2 and 6.3. Therefore, it remains to check uniformity in Lemma 3.1.
However, (3.3) in Lemma 3.1 does not hold uniformly over θ₀ ∈ Θ_{q₀}(Q) for a given Q > 0. The problem is that by varying θ₀ over Θ_{q₀}(Q), one can encounter arbitrarily slow convergence of the LHS of (5.6) to zero, making the integer N₃ defined there dependent on θ₀. Nevertheless, for a given θ* = (θ₁*, θ₂*, …) ∈ Θ_{q₀}, it is possible to choose an ε > 0 such that

    lim_{n→∞} sup_{θ₀ ∈ E_{q₀}(θ*, ε)} E_{θ₀} Π(q < q₀ | X) = 0.  (8.2)
The proof is largely a repetition of the arguments given in the proof of Lemma 3.1, so we restrict ourselves only to those places where modifications are needed.
First, if 0 < ε < 1, θ₀ ∈ E_{q₀}(θ*, ε) and C₀ = Σ_{i=1}^∞ i^{2q₀} (θ_i*)², then using (a − b)² ≤ 2(a² + b²), we have Σ_{i=1}^∞ i^{2q₀} (θ_i⁰)² ≤ D₀, where D₀ = 2C₀ + 2. Proceed as in the proof of Lemma 3.1 with k = ⌊(48D₀)^{−1} n^{1/(q₀+q_m+1)}⌋. If we choose ε = 4^{−1} (48D₀)^{−2(2q₀+1)} and N₃ the smallest integer n such that

    Σ_{i ≥ (48D₀)^{−1} n^{1/(2q₀+1)}} i^{2q₀} (θ_i*)² < 4^{−1} (48D₀)^{−2(2q₀+1)},

then it easily follows that for n ≥ N₃,

    Σ_{i ≥ (48D₀)^{−1} n^{1/(2q₀+1)}} i^{2q₀} (θ_i⁰)² < (48D₀)^{−2(2q₀+1)}.
The proof of (8.2) now follows as before and, hence, the theorem is proved.
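The reduction from θ₀ ∈ E_{q₀}(θ*, ε) to the ellipsoid bound Σ i^{2q₀}(θ_i⁰)² ≤ D₀ rests on the elementary inequality (a − b)² ≤ 2(a² + b²). A small numeric sketch (the particular θ* and θ⁰ below are hypothetical, chosen only to lie in the required sets) verifies the chain of inequalities:

```python
import math
import random

# If sum_i i^{2q0} (theta0_i - theta*_i)^2 < eps < 1 and
# C0 = sum_i i^{2q0} (theta*_i)^2, then
#   sum_i i^{2q0} (theta0_i)^2 <= 2*C0 + 2*eps <= D0 = 2*C0 + 2.
rng = random.Random(1)
q0, dim, eps = 1.0, 30, 0.5

def weight(i):
    return (i + 1) ** (2 * q0)

# hypothetical theta* in Theta_{q0} (coefficients decay fast enough)
theta_star = [rng.gauss(0, 1) / (i + 1) ** (q0 + 1) for i in range(dim)]
# hypothetical theta0 inside the ellipsoid E_{q0}(theta*, eps):
# each weighted squared increment is below eps/dim by construction
theta0 = [t + rng.uniform(-1, 1) * math.sqrt(eps / dim) / (i + 1) ** q0
          for i, t in enumerate(theta_star)]

C0 = sum(weight(i) * t * t for i, t in enumerate(theta_star))
dist = sum(weight(i) * (a - b) ** 2
           for i, (a, b) in enumerate(zip(theta0, theta_star)))
assert dist < eps                       # theta0 is in E_{q0}(theta*, eps)
total = sum(weight(i) * t * t for i, t in enumerate(theta0))
assert total <= 2 * C0 + 2 * dist <= 2 * C0 + 2   # the D0 bound
print(C0, dist, total)
```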
REMARK 8.1. The uniformity holds for θ₀ belonging to a compact set in the topology generated by the norm (Σ_{i=1}^∞ i^{2q₀} θ_i²)^{1/2} on Θ_{q₀}.
REMARK 8.2. In a similar manner, one can formulate the uniform version of posterior convergence for the case of a continuous spectrum as well.
REFERENCES

BIRMAN, M. S. and SOLOMJAK, M. Z. (1967). Piecewise-polynomial approximation of functions of the class W_p^α. Soviet Math. Sb. 73 295–317.
BROWN, L. D. and LOW, M. G. (1996). Asymptotic equivalence of nonparametric regression and white noise. Ann. Statist. 24 2384–2398.
BROWN, L. D., LOW, M. G. and ZHAO, L. H. (1997). Superefficiency in nonparametric function estimation. Ann. Statist. 25 2607–2625.
COX, D. D. (1993). An analysis of Bayesian inference for nonparametric regression. Ann. Statist. 21 903–923.
DIACONIS, P. and FREEDMAN, D. (1997). On the Bernstein–von Mises theorem with infinite dimensional parameters. Technical Report 492, Dept. Statistics, Univ. California.
EFROMOVICH, S. Y. and PINSKER, M. S. (1984). Learning algorithm for nonparametric filtering. Autom. Remote Control 45 1434–1440.
FREEDMAN, D. (1999). On the Bernstein–von Mises theorem with infinite dimensional parameters. Ann. Statist. 27 1119–1140.
GHOSAL, S., GHOSH, J. K. and VAN DER VAART, A. W. (2000). Convergence rates of posterior distributions. Ann. Statist. 28 500–531.
GHOSAL, S., LEMBER, J. and VAN DER VAART, A. W. (2002). On Bayesian adaptation. In Proceedings of 8th Vilnius Conference on Probability and Statistics.