• No results found

Minimising MCMC variance via diffusion limits, with an application to simulated tempering

N/A
N/A
Protected

Academic year: 2020

Share "Minimising MCMC variance via diffusion limits, with an application to simulated tempering"

Copied!
20
0
0

Loading.... (view fulltext now)

Full text

(1)

warwick.ac.uk/lib-publications

Original citation:

Roberts, Gareth O. and Rosenthal, Jeffrey S. (Jeffrey Seth). (2014) Minimising MCMC variance via diffusion limits, with an application to simulated tempering. The Annals of Applied Probability, 24 (1). pp. 131-149.

Permanent WRAP URL:

http://wrap.warwick.ac.uk/60148

Copyright and reuse:

The Warwick Research Archive Portal (WRAP) makes this work by researchers of the University of Warwick available open access under the following conditions. Copyright © and all moral rights to the version of the paper presented here belong to the individual author(s) and/or other copyright owners. To the extent reasonable and practicable the material made available in WRAP has been checked for eligibility before being made available.

Copies of full items can be used for personal research or study, educational, or not-for-profit purposes without prior permission or charge. Provided that the authors, title and full bibliographic details are credited, a hyperlink and/or URL is given for the original metadata page and the content is not changed in any way.

Publisher’s statement:

Published version: http://dx.doi.org/10.1214/12-AAP918

A note on versions:

The version presented in WRAP is the published version or, version of record, and may be cited as it appears here.

(2)

DOI:10.1214/12-AAP918

©Institute of Mathematical Statistics, 2014

MINIMISING MCMC VARIANCE VIA DIFFUSION LIMITS, WITH AN APPLICATION TO SIMULATED TEMPERING

BYGARETHO. ROBERTS ANDJEFFREYS. ROSENTHAL1

University of Warwick and University of Toronto

We derive new results comparing the asymptotic variance of diffusions by writing them as appropriate limits of discrete-time birth–death chains which themselves satisfy Peskun orderings. We then apply our results to sim-ulated tempering algorithms to establish which choice of inverse tempera-tures minimises the asymptotic variance of all functionals and thus leads to the most efficient MCMC algorithm.

1. Introduction. Markov chain Monte Carlo (MCMC) algorithms are very widely used to approximately compute expectations with respect to complicated high-dimensional distributions; see, for example, [7,24]. Specifically, if a Markov chain {Xn} has stationary distribution π on state space X, and h:XR with

π|h|< ∞, then π(h):= h(x)π(dx) can be estimated by 1nni=1h(Xi) for suitably large n. This estimator is unbiased if the chain is started in station-arity (i.e., if X0 ∼ π), and in any case has bias only of order 1/n.

Further-more, it is consistent provided the Markov chain is φ-irreducible. Thus, the ef-ficiency of the estimator is often measured in terms of the asymptotic variance Varπ(h, P ):=limn→∞1nVarπ(

n

i=1h(Xi))(where the subscriptπ indicates that {Xn}is in stationarity): the smaller the variance, the better the estimator.

An important question in MCMC research is how to optimiseit, that is, how to choose the Markov chain optimally; see, for example, [10,15]. This leads to the question of how to comparedifferent Markov chains. Indeed, for two differ-ent φ-irreducible Markov chain kernels P1 and P2 on X, both having the same

invariant probability measureπ, we say thatP1dominatesP2in the efficiency

or-dering, written P1P2, if Varπ(h, P1)≤Varπ(h, P2) for all L2(π ) functionals h:XR, that is, ifP1 is “better” thanP2 in the sense of being uniformly more

efficient for estimating expectations of functionals.

It was proved by Peskun [18] for finite state spaces, and by Tierney [25] for general state spaces (see also [15,16]), that ifP1andP2are discrete-time Markov

chains which are both reversible with respect to the same stationary distributionπ, then a sufficient condition forP1P2 is thatP1(x, A)P2(x, A)for allxX

andAF withx /A, that is, thatP1 dominatesP2off the diagonal.

Received March 2012; revised December 2012.

1Supported in part by NSERC of Canada.

MSC2010 subject classifications.Primary 60J22; secondary 62M05, 62F10.

Key words and phrases.Markov chain Monte Carlo, simulated tempering, optimal scaling, diffu-sion limits.

(3)

Meanwhile, diffusion limits have become a common way to establish asymp-totic comparisons of MCMC algorithms [2–5, 20–22]. Specifically, if P1,d and

P2,d are two different Markov kernels in dimensiond (ford =1,2,3, . . .), with diffusion limitsP1,∗andP2,∗respectively as d→ ∞, then one way to show that

P1,d is more efficient than P2,d for large d is to prove that P1,∗ is more effi-cient that P2,∗. This leads to the question of how to establish that one diffusion is more efficient than another. In some cases (e.g., random-walk Metropolis [20], and Langevin algorithms [21]), this is easy since one diffusion is simply a time-change of the other. But more general diffusion comparisons are less clear; for example, the processes’ spectral gaps

1−sup

h(y)P (x, dy):

h(y)π(dy)=0,

h2(y)π(dy)=1

can be ordered directly by using Dirichlet forms, but this does not lead to bounds on the asymptotic variances.

In this paper, we develop (Section2) a new comparison of asymptotic variance of diffusions. Specifically, we prove (Theorem 1) that if Pi are Langevin diffu-sions with respect to the same stationary distributionπ, with variance functions

σi2(fori=1,2), then ifσ12(x)σ22(x)for allx, thenP1P2, that is,P1is more

efficient thanP2. (We note that Mira and Leisen [12,17] extended the Peskun

or-dering in an interesting way to continuous-time Markov processes on finite state spaces, and on general state spaces when the processes have generators which can be represented asGif (x)=

f (y)Qi(x, dy)and which satisfy the condition that

Q1(x, A\ {x})Q2(x, A\ {x})for allx andA. However, their results do not

ap-pear to apply in our context, since generators of diffusions involve differentiation and thus do not admit such representation.)

We then consider (Section3) simulated tempering algorithms [10,14], and in particular the question of how best to choose the intermediate temperatures. It was previously shown in [1], generalising some results in the physics literature [11,19], that a particular choice of temperatures (which leads to an asymptotic temperature-swap acceptance rate of 0.234) maximises the asymptotic L2 jumping distance, that is, limn→∞E(|XnXn−1|2). (Indeed, this result has already influenced

adap-tive MCMC algorithms for simulated tempering; see, for example, [9].) However, the previous papers did not prove a diffusion limit, nor did they provide any com-parisons of Markov chain variances. In this paper, we establish (Theorem6) diffu-sion limits for certain simulated tempering algorithms. We then apply our diffudiffu-sion comparison results to prove (Theorem7) that the given choice of temperatures does indeed minimise the asymptotic variance of all functionals.

2. Comparison of diffusions. Let π:X(0,) be a C1 target density function, where X is either R or some finite interval [a, b]. We shall consider

nonexplosive Langevin diffusions onX with stationary densityπ, satisfying

dXtσ=σXσdBt +

1

2σ 2Xσ

t

logπXtσ+σXtσσXtσdt

(4)

for some C1 function σ:X → [k, k] for some fixed 0< k < k <∞, and with reflecting boundaries ataandbin the caseX = [a, b].

For two such diffusions 1 and Xσ2, we write (similarly to the above) that 1Xσ2, and say that Xσ1 dominatesXσ2 in the efficiency ordering, if for all L2(π )functionalsf:XR,

lim T→∞T

−1/2Var T 0 f

1

s

ds

≤ lim T→∞T

−1/2Var T 0 f

2

s

ds

.

We wish to argue that ifσ1(x)σ2(x) for allx, then12. Intuitively,

this is because1 “moves faster” thanXσ2, while maintaining the same

station-ary distribution. Indeed, if σ1 andσ2 are constants, then this result is trivial (and

implicit in earlier works [20–22]), since then1

t has the same distribution asX σ2

ct where c=σ12>1; that is, 1 accomplishes the same sampling as 2 in a

shorter time, so it must be more efficient. However, ifσ1 andσ2 are nonconstant

functions, then the comparison of1 andXσ2is less clear.

To make theoretical progress, we assume:

(A1) π is log-Lipschitz function onX; that is, there isL <∞with

logπ(y)logπ(x)L|yx|, x, yX.

(2)

(A2) Either (a)X is a bounded interval[a, b], and the diffusions have re-flecting boundaries at a and b, or (b) X is all of R, and π has exponentially-bounded tails; that is, there is 0< K <∞andr >0 such that

π(x+y)π(x)ery, x > K, y >0

and

π(xy)π(x)ery, x <K, y >0.

In case (A2)(b), we can then find sufficiently largeqKsuch that

i

|i/m|≥q

π(i/m)(1/4)

i

π(i/m) for allmN

(3)

[where the sums in (3) must be finite due to (2)], and then set

Q=infπ(x):|x| ≤q+1,

(4)

which must be positive by continuity ofπand compactness of the interval[−q−1, q+1].

Our main result is then the following.

THEOREM1. IfXσ1 andXσ2are two Langevin diffusions of the form(1)with

respect to the same densityπ,with variance functionsσ1 andσ2 respectively,and

(5)

2.1. Proof of Theorem 1. To prove Theorem 1, we introduce auxiliary pro-cesses for eachmN. Givenσ:XR, letS=2keL, and letZm,σ be a discrete-time birth and death process on the discrete state space Xm:= {i/m;iZ} in case (A2)(b), orXm:= {i/m;iZ} ∩ [a, b]in case (A2)(a), with transition prob-abilities given by

Pi/m, (i+1)/m= 1

2S σ

2(i/m)+σ2((i+1)/m)π((i+1)/m) π(i/m)

,

Pi/m, (i−1)/m= 1

2S σ

2(i/m)+σ2((i−1)/m)π((i−1)/m) π(i/m)

and

P (i/m, i/m)=1−Pi/m, (i+1)/mPi/m, (i−1)/m.

(In case (A2)(a), any transitions which would cause the process to move out of the interval [a, b] are instead given probability 0.) These transition rates are chosen to satisfy detailed balance with respect to the stationary distribution πm on Xm given byπm(i/m)=π(i/m)/xXmπ(x)[andSis chosen to be large enough to

ensure thatP (i/m, (i+1)/m)+P (i/m, (i−1)/m)≤1].

In terms ofZm,σ, we then let{Ym,tσ }t≥0be the continuous-time version ofZm,σ,

speeded up by a factor of m2S/2, that is, defined by Ym,tσ =Zm,σm2St /2 fort≥0.

(Here and throughout, ris the floor function which rounds r down to the next integer, e.g. 6.8 =6 and −2.1 = −3.) It then follows that Ym,t converges toXm,σ, as stated in the following lemma (whose proof is deferred until the end of the paper, since it uses similar ideas to those of the following section).

LEMMA 2. Assuming(A1)and(A2),asm→ ∞,the processesYmσ converge weakly(in the Skorokhod topology)toXσ.

We then apply the usual discrete-time Peskun ordering to theZm,σ processes, as follows.

LEMMA3. Suppose thatσ1(x)σ2(x)for allxR.ThenZm,σ1Zm,σ2.

PROOF. By inspection, the fact thatσ1(x)σ2(x)implies that

PZm,σ1

(i+1)/m=j+1|Z m,σ1

i/m =j

PZm,σ2

(i+1)/m=j+1|Z m,σ2

i/m =j

and

PZm,σ1

(i+1)/m=j−1|Z m,σ1

i/m =j

PZm,σ2

(i+1)/m=j −1|Z m,σ2

i/m =j

.

It follows that Zm,σ1 dominatesZm,σ2 off the diagonal. The usual discrete-time

(6)

To continue, let

V(f, σ ):= lim T→∞T

−1Var π

T

0

fXσsds

,

which we assume satisfies the usual relation

V(f, σ )=

−∞Covπ

fX0σ, fXσsds.

Also, let

Vm(f, σ ):=nlim→∞n−1Varπ

mn

i=1

fZim,σ

,

which we assume satisfies the usual relation

Vm(f, σ )=

i=−∞ Covπ

fZ0m,σ, fZm,σi .

(In both cases, the subscriptπ indicates that the process is assumed to be in sta-tionarity, all the way from time−∞to∞.) We then have the following.

LEMMA 4. Let Gm be the spectral gap of the process Zm,σ. Assume there is some constant g >0 such thatGmg/m2 for all m. Then for all bounded functionsf :RR, limm→∞(m2S/2)Vm(f, σ )=V(f, σ ).

PROOF. Let

Am,t =Covπ

fZ0m,σ, fZm,σm2St /2

and let

A,t=Covπ

fXσ0, fXσt .

Then

V(f, σ )=

A,tdt

and (sincem2St/2is a step-function oft, with steps of sizem2S/2)

Vm(f, σ )=

−∞Am,tdt

m2S/2 .

Now, by Lemma2, sincef is bounded,

lim

m→∞Am,t=A,t.

To continue, letF be the forward operator corresponding to the chainZm,σ, that is, F h(x)=E[h(Z1m,σ)|Z0m,σ =x]. Then sinceF is reversible, it follows from Lemma 2.3 of [13] that

Ft= Ft =supCov

π

h1

Zm,σ0 , h2

Zm,σt

: Varπ(h1)=Varπ(h2)=1

(7)

Lettingv=Varπ[f (X)], we then have, for allmNandt≥0, that

Am,t =Covπ

fZm,σ0 , fZm,σm2St /2

≤supCovπ

hZ0m,σ, hZm,σm2St /2

:hL2(π ),Varπ

h(X)=v

=vFm2St /2=vFm2St /2=v(1−Gm)m

2St /2

v1−g/m2m2St /2≤veg/m2m2St /2=vegSt /2.

Hence,

Vm(f, σ )=

−∞Am,tdt≤2

0 Am,tdt≤4v/gS <.

Hence, by the dominated convergence theorem,

lim m→∞

−∞Am,tdt=mlim→∞

−∞A,tdt,

that is,

lim m→∞

m2S/2Vm(f, σ )=V(f, σ )

as claimed.

To make use of Lemma 4, we need to bound the spectral gaps of the Zm,σ

processes. We do this using a capacitance argument; see, for example, [23]. Let

κm= inf

AXm 0<π(A)≤1/2

1

πm(A)

xA

Pm

x, ACπm(x)

be the capacitance ofZm,σ. We prove

LEMMA5. The capacitanceκmsatisfies that

κm≥min

keLr

2m ,

Qke−2L/m

2m

,

where the quantities LandQare defined in(2) and(4),respectively,and where the bound reduces to simplyκmke

Lr

2m in case(A2)(a).

PROOF. We consider two different cases [only the second of which can occur in case (A2)(a)]:

(i) ∃aAwith|a| ≤q. Then, sinceπm(A)≤1/2, there isjZwith|j/m| ≤

(8)

that(j +1)/mAC. We will need the following estimate onjZπ(j/m). For x∈ [i/m, (i+1)/m),

π(x)π(i/m)eL(xi/m)

so that

(i+1)/m

i/m

π(x)π(i/m)

1/m

0

eLudu=π(i/m) 1−eL/m

L

=π(i/m)eL/m e

L/m1

L

π(i/m)eL/m L/m

L

=π(i/m)eL/m

m .

Therefore summing both sides over alliZ,

1=

−∞π(x) dx

eL/m m

iZ

π(i/m),

whence

iZ

π(i/m)meL/m.

Then

xA

Pm

x, ACπm(x)πm(j/m)Pm

j/m, (j+1)/m

=πm(j/m)(1/22(j/m)eL/mπ(j/m)/m(k/2)e−2L/m

Qke−2L/m/2m.

(ii) A(−∞, q)(q,). LetaAwithπ(a)=max{π(x):xA}. Assume WOLOG thata >0. Then

xA

Pm

x, ACπm(x)πm(a)Pm

a, a(1/m)

keL/mπ(a)

i

|i/m|≥a

π(i/m)

keL/mπ(a)2

j=0

π(a)erj/m

=1 2ke

L/m1er/m1 2ke

(9)

Thus, in either case, the conclusion of the lemma is satisfied.

Now, it is known (e.g., [23]) that the spectral gap can be bounded in terms of the capacitance, specifically thatGmκm2/2. Thus, form≥1,

Gm

min12keL(r/m), Qke−2L/m/2m2/2

≥min12keL(r/m), Qke−2L/2m2/2 =g/m2,

where g = [min(12keLr, Qke−2L/2)]2/2 > 0. This together with Lemma 2 shows that the conditions of Lemma 4 are satisfied. Hence, by Lemma 4, limm→∞(m2S/2)Vm(f, σ )=V(f, σ )for all bounded functionsf.

On the other hand, by Lemma3,Zm,σ1Zm,σ2, that is,V

m(f, σ1)Vm(f, σ2).

Hence, for all bounded functionsf,

V(f, σ1)= lim m→∞

m2S/2Vm(f, σ1)

≤ lim m→∞

m2S/2Vm(f, σ2)

(5)

=V(f, σ2).

Finally, iff is inL2 but not bounded, then letting

fm(x)=

⎧ ⎨ ⎩

m, f (x) > m,

f (x),mf (x)m,

m, f (x) <m,

we have by the monotone (or dominated) convergence theorem that V(f, σ1)=

limm→∞V(fm, σ1) and V(f, σ2) = limm→∞V(fm, σ2). Hence, it follows

from (5) thatV(f, σ1)V(f, σ2)for allL2(π )functionsf. That is,12,

thus proving Theorem1.

3. Simulated tempering diffusion limit. We now apply our results to a ver-sion of the simulated tempering algorithm. Specifically, following [1], we consider ad-dimensional target density

fd(x)=edK d

i=1 f (xi) (6)

for some unnormalised one-dimensional density function f:R→ [0,), where

(10)

We consider simulated tempering ind dimensions, with inverse-temperatures

cho-sen as follows: β0(d)=1, and βi(d)+1=βi(d) (d) i )

d1/2 for some fixed C1 function :[0,1] →R. (The question then becomes, what is the optimal choice of.) As for when to stop adding new temperature values, we fix someχ(0,1)and keep going until the temperatures drop below χ; that is, we stop at temperature βk(d)(d)

wherek(d)=sup{i:βi(d)χ}.

We shall consider a joint process (yn(d), Xn), with XnRd, and with yn(d)

Ed := {βi(d);0≤ik(d)} defined as follows. If yn−1 =βi(d) [where 0< i <

k(d)], then the chain proceeds by choosing Xn−1 ∼, then proposing Zn to beβi+1orβi−1with probability 1/2 each, and finally acceptingZnwith the usual Metropolis acceptance probability. (A proposed move toβ(d)1 orβk(d)(d)+1 is auto-matically rejected.) We assume, as in [1], that the chain then immediately jumps to stationary at the new temperature, that is, that mixing within a temperature is infinitely more efficient than mixing between temperatures.

The process(yn(d), Xn)is thus a Markov chain on the state spaceEd×Rd, with joint stationary density given by

fd(β, x)=edK(β) d

i=1

fβ(xi),

whereK(β)= −logfβ(x) dxis the normalising constant.

We now prove that the{yn(d)} process has a diffusion limit (similar to random-walk Metropolis and Langevin algorithms, see [20–22]), and furthermore the asymptotic variance of the algorithm is minimised by choosing the functionthat leads to an asymptotic temperature acceptance rate=. 0.234. Specifically, we prove the following:

THEOREM 6. Under the above assumptions, the {yn(d)} inverse-temperature process, when speeded up by a factor ofd,converges in the Skorokhod topology asd→ ∞to a diffusion limit{Xt}t≥0 satisfying

dXt =

22 −I 1/2

2

1/2 dBt (7)

+

(X)(X) I

1/2

2

2 I 1/2

2

φI

1/2

2

dt

(11)

Then, combining Theorems1and6, we immediately obtain:

THEOREM7. For the above simulated tempering algorithm,for anyL2 func-tionalf,the choice ofwhich minimises the limiting asymptotic varianceV(f )=

limm→∞Vm(f ),is the same as the choice which maximises σ (x),that is, is the choice which leads to an asymptotic temperature acceptance probability of0.234 (to three decimal places).

REMARK. In this context, it was proved in [1] that as d → ∞, the choice ofleading to an asymptotic temperature acceptance rate=. 0.234 maximises the expected squared jumping distance of the {yn(d)} process. However, the question of whether that choice would also minimise the asymptotic variance for any L2

function was left open. That question is resolved by Theorem7.

3.1. Proof of Theorem6. The key computation for proving Theorem6will be given next, but first we require some additional notation. We let int(Ed) denote

Ed \ {1, βk(d)(d)}. We also denote by G(d) the generator of the inverse-temperature process{yn(d)}and setH to be the set of all functionshC2[χ ,1]with h(χ )=

h(1)=0. We also letG∗ be the generator of the diffusion given in (7), defined, for all functionshH, by

Gh=σ

2(x)h(x)

2 +μ(x)h

(x), hH,

(8)

where

μ(x)=(x)(x) I

1/2

2

2 I 1/2

2

φI

1/2

2

and

σ2(x)=22 −I

1/2

2

.

(9)

(12)

(hd, dG(d)hd)dNsuch that

lim d→∞xsupE

d

h(x)hd(x)=0

(10)

and

lim d→∞xsupEd

Gh(x)dG(d)hd(x)=0. (11)

To establish this convergence on int(Ed), we can simply let hd =h (see Lemma8 below). However, to establish the convergence on the boundary ofEd (Lemma9), we need to modifyhslightly [without destroying the convergence on int(Ed)]. We do this as follows. First, given anyhH, we let

hd(x)=h

γd(x)

,

where

γd(x)=

(1−χ )x+χχd 1−χd

,

so thathd is just likehexcept “stretched” to be defined on[χd,1] instead of just on[χ ,1]. Here we setχd =βk(d)(d) , andχd+=βk(d)(d)1; thusχdχχd+. Notice that sinceχdχasd→ ∞,hd and its first and second derivatives converge toh and its corresponding derivatives uniformly forx∈ [χd,1]asd→ ∞.

Finally, given the functionh, we letη(x)to be any smooth function:[χ ,1] →R

satisfying

η(χ )=h(χ )(χ )/2 and η(1)=h(1)(1)/2

and then set

hd(x)=hd(x)+d−1/2η

γd(x)

=hγd(x)

+d−1/2ηγd(x)

,

so thathd(x)is similar tohd(x)except with the addition of a separateO(d−1/2) term (which will only be relevant at the boundary points, i.e., in Lemma9below). In particular, (10) certainly holds.

In light of the above discussion, Theorem6 will follow by establishing (11), which is done in Lemmas8and9below.

LEMMA8. For allhH,

lim d→∞xsupint(E

d)

dG(d)h(x)Gh(x)=0

(13)

and

lim d→∞xsupint(E

d)

dG(d)h

d(x)Gh(x)=0. (13)

PROOF. We begin with a Taylor series expansion forG(d). Since the compu-tations shall get somewhat messy, we wish to keep only higher-order terms, so for

simplicity we shall use the notation r(d)≈ to mean that the expansion holds up to

terms of order 1/r(d), uniformly forxEd, asd→ ∞[e.g.,LHS d

RHSmeans that limd→∞supxEdd(LHSRHS)=0]. Then for boundedC

2functionalsh, we

have (combining the twohterms together) that forβi(d)∈int(Ed):

G(d)hβi(d)d h (d)

i ) 2

α+βi(d)+1βi(d)+αβi(d)1βi(d)

+h(β (d) i ) 2

βi(d)+1βi(d)2α+

dh(β

(d) i ) 2

α+βi(d)+1βi(d)+αβi(d)1βi(d)

+h(β (d) i ) 2

βi(d)+1βi(d)2α+

= h(β (d) i ) 2

αi(d)1)α+i(d)) d1/2

+h(β (d) i ) 2

(d)

i )2α+

d

,

whereα+ is the probability of accepting an upwards move, andα−is the proba-bility of accepting a downwards move.

To continue, we letg=logf, and

M(β)=Eβ(g)=

logf (x)fβ(x) dx

(x) dx and

I (β)=Varβ(g)=

(logf (x))2fβ(x) dx

(x) dxM(β)

2.

It follows, as in [1], that M(β) =I (β) and K(β)= −M(β), so K(β)=

M(β)= −I (β). We also defineg=gM(β).

(14)

Then, withX,

α− = E

1∧f β+ε

d (X)edK(β+ε)

fdβ(X)edK(β)

= E

1∧exp

K(β+ε)K(β)d+εdM(β)+ε

d

i=1 g(Xi)

d1/2 ≈ E

1∧exp

2

2 K

+3

6 K

+N0, I ε2d

(14)

= E

1∧exp

2

2K

+ε2

6 K

+N0, I 2

= −I1/2 2 +

εK

6I1/2

+expε2K/6 −I

1/2

2 −

εK

6I1/2

.

Similarly,

α+=E

1∧f βε

d (X)edK(βε)

fdβ(X)edK(β)

=E

1∧exp

K(βε)K(β)dεdM(β)ε

d

i=1 g(Xi)

1

E

1∧exp

2

2 K

N0, I ε2d

=E

1∧exp

2

2I

ε2

6 K

N0, I 2

= −I1/2 2 −

εK

6I1/2

+exp−ε2K/6 −I

1/2

2 −

εK

6I1/2

.

Hence

α+βi(d)d 1/2

≈ −I1/2 (d) i )

2 −

εK(βi(d))

6I1/2(d) i )

+exp−ε2βi(d)K(βi)/6

I1/2 (d) i )

2 +

εK(βi(d))

6I1/2(d) i )

(15)

A first order approximation of this expression is

α+βi(d)≈1 2 −I

1/2(d) i ) 2

.

Next, we note that in the current setting,β is itself marginally a Markov chain with uniform stationary distribution among all temperatures. In fact it is a birth and death process, and hence reversible. So, by detailed balance,

α−=α+βi(d)/d.

Therefore,

αβi(d) = α+βi(d)/d

d1/2

α+βi(d)

((β (d) i )I1/2

(d) i )) 2

d

φI

1/2(d) i )

2 −

εK(βi(d))

6I1/2(d) i )

−exp−ε2βi(d)K(βi)/6

((βi(d))I1/2i(d)))

2

× √−

d

φI

1/2(d) i )

2 +

εK(βi(d))

6I1/2(d) i )

.

Then, sinced 1/2

+εd 1/2

+ε=+d1/2, we compute that

μβi(d)d 1/2

≈ 1

2d1/2

α++ +

d1/2

× α+βi(d)

((β (d) i )I1/2

(d) i )) 2

× √−

d

φI

1/2(d) i )

2 −

εK(βi(d))

6I1/2(d) i )

−exp−ε2βi(d)K(βi)/6

((βi(d))I1/2i(d)))

2

× √−

d

φI

1/2(d) i )

2 +

εK(βi(d))

6I1/2(d) i )

(16)

Hence, ignoring all lower order terms,

μβi(d)d 1/2

≈ 1

2d1/2

((β

(d) i )I1/2

(d) i )) 2 × √− d

φI

1/2(d) i )

2 −

εK(βi(d))

6I1/2(d) i )

exp((β (d) i )I

1/2(d) i )) 2

× √−

d

φI

1/2(d) i )

2 +

εK(βi(d))

6I1/2(d) i )

+2 (I1/2 (d)

i )/2)

d1/2

d1/2 ≈ 1

d

2((β

(d) i )I1/2

(d) i ))

2 φ

I1/2i(d))

2

+ −I1/2 (d) i ) 2 .

Similarlyσ2i(d))is to first order

22

d

I1/2i(d))

2

so that we can write (for 0< β <1)

Gdhd 1

d

2 I1/2 (d) i ) 2

h(β)

+ −I1/2 (d) i ) 2

2((β

(d) i )I1/2

(d) i ))

2 φ

I1/2i(d))

2

h(β)

.

However, this expression is justd−1Gh, thus establishing (12).

Finally, to establish (13), we note that in this case the termsd−1/2η(γd(x))and

hd(x)h(x)are both lower-order and do not affect the limit. Hence, (13) follows directly from (12).

(17)

would be complete simply by settinghd =hand applying Lemma8.) However, the following lemma shows that with the definition ofhd used here, the extension to the boundary does indeed hold.

LEMMA9. For allhH,forx=1and forx=χd,

lim d→∞

dG(d)h

d(x)Gh(x)=0.

PROOF. We prove the case whenx=χd; the casex=1 is similar but some-what easier (since thenx does not depend ond).

Mimicking the Taylor expansion of Lemma8,

G(d)hd(χd) d

hd(χd)[α(χd+−χd)] 2

+hd(χd) 4

χdχd+

2 α

=hd(χd) 2

αd+) d1/2 +

hd(χd) 4

d)2α

d

d

α(χd+) 2d1/2

h(χ )+η(χ )d−1/2

+hd(χd) 4

d)2α

d

.

Thus sinceh(χ )=0, this expression equals

hd(χd) 2

d)2α

d

.

Next we note from (14) that

α−≈1 2 −I

1/2

2

.

Hence, the above results show that

lim

d→∞dGdhd(χd)= 2

(χ )h(χ ) I 1/2

2

.

In light of formulae (8) and (9), this completes the proof.

(18)

PROOF OF LEMMA 2. We first compute that, to first order as h0 and

m→ ∞, writingx=i/mande=1/m, we have

E Ym,tσ +hYm,tσ Ym,tσ = i m

m2Sh 2

1

m

1 2S

×

σ2 i m

+π((i+1)/m)σ2((i+1)/m)

π(i/m)

σ2 i m

π((i−1)/m)σ2((i−1)/m)

π(i/m)

=hm 4

π(x+e)σ2(x+e)

π(x)

π(xe)σ2(xe) π(x)

hm 4

π(x)+eπ(x)σ2(x)+2(x)

π(x)eπ(x)σ2(x)2(x)/π(x)

=hm 4

2

eπ(x)σ2(x)+2eπ(x)(σ2)(x) π(x)

=hm 4 (2e)

(logπ )(x)σ2(x)+2σ (x)σ(x)

=h

1

2(logπ )

(x)σ2(x)+σ (x)σ(x)

and also

E Ym,tσ +hYm,tσ 2Ym,tσ = i m

m2Sh 2

1 2S

1

m2

2σ2(x)+2σ2(x)=2(x).

A comparison with (1) then shows thatYmσ satisfies the same first and second mo-ment characteristics ast , so thatt is indeed the correct putative limit.

(19)

4. Discussion. This paper has linked the usual Peskun ordering on asymp-totic variance of discrete-time Markov chains, to asympasymp-totic variance of diffusion processes. It has then applied these results to simulated tempering algorithms, by proving that the inverse-temperatures of such algorithms converge (in an appropri-ate limit) to a diffusion. By maximising the speed of the resulting diffusion, it has obtained results about the optimal choice of the temperature spacings.

We believe that Theorem1could be useful in other contexts as well, whenever we wish to compare two Langevin diffusion algorithms directly, or alternatively whenever we wish to compare two discrete-time processes which both have ap-propriate diffusion limits.

Of course, Theorem1requires assumptions (A1) and (A2). These are primarily just regularity assumptions, which would likely be satisfied in most applications of interest. On the other hand, the “exponentially-bounded tails” aspect of assump-tion (A2) is more than technical; rather, it provides us with some control over the extreme tail excursions of the processes which we consider, and we suspect that our limiting results might fail if no such control is provided.

Finally, our simulated tempering diffusion limit is only proven under the rather strong and artificial assumption (6) involving a product form of the target density. Indeed, this assumption is central to our method of proof. However, as mentioned earlier, it is known [2–5,20,22] that the general conclusions in this special case often hold in greater generality, either approximately in numerical simulation stud-ies, or theoretically through more general methods of proof. In a similar spirit, we believe that the simulated tempering diffusion limit proven herein would approxi-mately hold numerically in greater generality. In addition, it might be possible to prove a stronger version of our diffusion limit, with weaker assumptions, though such proofs would get rather technical and we do not pursue them here.

REFERENCES

[1] ATCHADÉ, Y. F., ROBERTS, G. O. and ROSENTHAL, J. S. (2011). Towards optimal scaling of Metropolis-coupled Markov chain Monte Carlo.Stat.Comput.21555–568.MR2826692 [2] BÉDARD, M. (2007). Weak convergence of Metropolis algorithms for non-i.i.d. target

distribu-tions.Ann.Appl.Probab.171222–1244.MR2344305

[3] BÉDARD, M. (2008). Optimal acceptance rates for Metropolis algorithms: Moving beyond 0.234.Stochastic Process.Appl.1182198–2222.MR2474348

[4] BÉDARD, M. and ROSENTHAL, J. S. (2008). Optimal scaling of Metropolis algorithms: Head-ing toward general target distributions.Canad.J.Statist.36483–503.MR2532248 [5] BESKOS, A., ROBERTS, G. and STUART, A. (2009). Optimal scalings for local Metropolis–

Hastings chains on nonproduct targets in high dimensions.Ann.Appl.Probab.19863– 898.MR2537193

[6] BHATTACHARYA, R. N. and WAYMIRE, E. C. (1990).Stochastic Processes with Applications. Wiley, New York.MR1054645

[7] BROOKS, S., GELMAN, A., JONES, G. L. and MENG, X.-L., eds. (2011). Handbook of Markov Chain Monte Carlo. Chapman & Hall/CRC, Boca Raton, FL.MR2742422 [8] ETHIER, S. N. and KURTZ, T. G. (1986).Markov Processes:Characterization and

(20)

[9] FORT, G., MOULINES, E. and PRIOURET, P. (2011). Convergence of adaptive and interacting Markov chain Monte Carlo algorithms.Ann.Statist.393262–3289.MR3012408 [10] GEYER, C. (1992). Practical Markov chain Monte Carlo.Statist.Sci.7473–483.

[11] KOFKE, D. A. (2002). On the acceptance probability of replica-exchange Monte Carlo trials. J.Chem.Phys.1176911. Erratum:J. Chem. Phys.12010852.

[12] LEISEN, F. and MIRA, A. (2008). An extension of Peskun and Tierney orderings to continuous time Markov chains.Statist.Sinica181641–1651.MR2469328

[13] LIU, J. S., WONG, W. H. and KONG, A. (1994). Covariance structure of the Gibbs sampler with applications to the comparisons of estimators and augmentation schemes.Biometrika 8127–40.MR1279653

[14] MARINARI, E. and PARISI, G. (1992). Simulated tempering: A new Monte Carlo scheme. Europhys.Lett.19451–458.

[15] MIRA, A. (2001). Ordering and improving the performance of Monte Carlo Markov chains. Statist.Sci.16340–350.MR1888449

[16] MIRA, A. and GEYER, C. J. (2000). On non-reversible Markov chains. InMonte Carlo Meth-ods(Toronto,ON, 1998).Fields Institute Communications2695–110. Amer. Math. Soc., Providence, RI.MR1772309

[17] MIRA, A. and LEISEN, F. (2009). Covariance ordering for discrete and continuous time Markov chains.Statist.Sinica19651–666.MR2514180

[18] PESKUN, P. H. (1973). Optimum Monte-Carlo sampling using Markov chains.Biometrika60 607–612.MR0362823

[19] PREDESCU, C., PREDESCU, M. and CIOBANU, C. V. (2004). The incomplete beta function law for parallel tempering sampling of classical canonical systems.J.Chem.Phys.120 4119–4128.

[20] ROBERTS, G. O., GELMAN, A. and GILKS, W. R. (1997). Weak convergence and opti-mal scaling of random walk Metropolis algorithms. Ann. Appl. Probab. 7 110–120. MR1428751

[21] ROBERTS, G. O. and ROSENTHAL, J. S. (1998). Optimal scaling of discrete approximations to Langevin diffusions.J.R.Stat.Soc.Ser.B Stat.Methodol.60255–268.MR1625691 [22] ROBERTS, G. O. and ROSENTHAL, J. S. (2001). Optimal scaling for various Metropolis–

Hastings algorithms.Statist.Sci.16351–367.MR1888450

[23] SINCLAIR, A. (1992). Improved bounds for mixing rates of Markov chains and multicommod-ity flow.Combin.Probab.Comput.1351–370.MR1211324

[24] TIERNEY, L. (1994). Markov chains for exploring posterior distributions. Ann. Statist. 22 1701–1762.MR1329166

[25] TIERNEY, L. (1998). A note on Metropolis–Hastings kernels for general state spaces.Ann. Appl.Probab.81–9.MR1620401

[26] WARD, A. R. and GLYNN, P. W. (2003). A diffusion approximation for a Markovian queue with reneging.Queueing Syst.43103–128.MR1957808

DEPARTMENT OFSTATISTICS UNIVERSITY OFWARWICK COVENTRY, CV4 7AL UNITEDKINGDOM

E-MAIL:[email protected]

DEPARTMENT OFSTATISTICS UNIVERSITY OFTORONTO TORONTO, ONTARIO, M5S 3G3 CANADA

E-MAIL:[email protected]

References

Related documents

Similarly, fiscal consolidation programs combined with inflation are likely to increase inequality and the effects of fiscal adjustments on inequality are amplified during periods

T a ble 2 .1 CDN composition taxonomy m apping CDN N ame and T ype CDN O rg anization S erv ers Relationships Interaction P rotocols Content/Service T ypes Commer cial CDNs Akamai

Pharmacy records may also include evidence of the pharmacist work up and problem identification, evidence used in the assessment and the detailed care plan (e.g. drug-related

Lady Macbeth confessed that she would have killed King Duncan herself except for the fact that a.. she couldn't gain easy access to

International human rights standards require governments to take steps to “improve child and maternal health, sexual and reproductive health services, including access to

Services and Real Estate Somague SGPS Real Estate Portugal Windmills farms Co-generation Somague Concessions Portugal Brazil Spain Somague Environment Portugal Brazil China

A possible explanation of this could be that the self-perception of being a victim is conditioned by mainly external factors (e.g. legal mismanagement of waste, illegal disposal

Indeed, if we calculate the rate of saving of each group weighted by the income of each household (third line in Table 3), we notice that for 1988 survey both the group with