Minimising MCMC variance via diffusion limits, with an application to simulated tempering

(1)

warwick.ac.uk/lib-publications

Original citation:

Roberts, Gareth O. and Rosenthal, Jeffrey S. (Jeffrey Seth). (2014) Minimising MCMC variance via diffusion limits, with an application to simulated tempering. The Annals of Applied Probability, 24 (1). pp. 131-149.

Permanent WRAP URL:

http://wrap.warwick.ac.uk/60148

Copyright and reuse:

The Warwick Research Archive Portal (WRAP) makes this work by researchers of the University of Warwick available open access under the following conditions. Copyright © and all moral rights to the version of the paper presented here belong to the individual author(s) and/or other copyright owners. To the extent reasonable and practicable the material made available in WRAP has been checked for eligibility before being made available.

Copies of full items can be used for personal research or study, educational, or not-for-profit purposes without prior permission or charge. Provided that the authors, title and full bibliographic details are credited, a hyperlink and/or URL is given for the original metadata page and the content is not changed in any way.

Publisher’s statement:

Published version: http://dx.doi.org/10.1214/12-AAP918

A note on versions:

The version presented in WRAP is the published version or, version of record, and may be cited as it appears here.

(2)

DOI:10.1214/12-AAP918

©Institute of Mathematical Statistics, 2014

MINIMISING MCMC VARIANCE VIA DIFFUSION LIMITS, WITH AN APPLICATION TO SIMULATED TEMPERING

BYGARETHO. ROBERTS ANDJEFFREYS. ROSENTHAL1

University of Warwick and University of Toronto

We derive new results comparing the asymptotic variance of diffusions by writing them as appropriate limits of discrete-time birth–death chains which themselves satisfy Peskun orderings. We then apply our results to sim-ulated tempering algorithms to establish which choice of inverse tempera-tures minimises the asymptotic variance of all functionals and thus leads to the most efficient MCMC algorithm.

1. Introduction. Markov chain Monte Carlo (MCMC) algorithms are very widely used to approximately compute expectations with respect to complicated high-dimensional distributions; see, for example, [7,24]. Specifically, if a Markov chain {Xn} has stationary distribution π on state space X, and h:X →R with

π|h|< ∞, then π(h):= h(x)π(dx) can be estimated by 1_nn_i₌₁h(Xi) for suitably large n. This estimator is unbiased if the chain is started in station-arity (i.e., if X0 ∼ π), and in any case has bias only of order 1/n.

Further-more, it is consistent provided the Markov chain is φ-irreducible. Thus, the ef-ficiency of the estimator is often measured in terms of the asymptotic variance Varπ(h, P ):=limn→∞1_nVarπ(

n

i=1h(Xi))(where the subscriptπ indicates that {Xn}is in stationarity): the smaller the variance, the better the estimator.

An important question in MCMC research is how to optimiseit, that is, how to choose the Markov chain optimally; see, for example, [10,15]. This leads to the question of how to comparedifferent Markov chains. Indeed, for two differ-ent φ-irreducible Markov chain kernels P1 and P2 on X, both having the same

invariant probability measureπ, we say thatP1dominatesP2in the efficiency

or-dering, written P1P2, if Varπ(h, P1)≤Varπ(h, P2) for all L2(π ) functionals h:X →R, that is, ifP1 is “better” thanP2 in the sense of being uniformly more

efficient for estimating expectations of functionals.

It was proved by Peskun [18] for finite state spaces, and by Tierney [25] for general state spaces (see also [15,16]), that ifP1andP2are discrete-time Markov

chains which are both reversible with respect to the same stationary distributionπ, then a sufficient condition forP1P2 is thatP1(x, A)≥P2(x, A)for allx∈X

andA∈F withx /∈A, that is, thatP1 dominatesP2off the diagonal.

Received March 2012; revised December 2012.

1_{Supported in part by NSERC of Canada.}

MSC2010 subject classifications.Primary 60J22; secondary 62M05, 62F10.

Key words and phrases.Markov chain Monte Carlo, simulated tempering, optimal scaling, diffu-sion limits.

(3)

Meanwhile, diffusion limits have become a common way to establish asymp-totic comparisons of MCMC algorithms [2–5, 20–22]. Specifically, if P1,d and

P2,d are two different Markov kernels in dimensiond (ford =1,2,3, . . .), with diffusion limitsP1,∗andP2,∗respectively as d→ ∞, then one way to show that

P1,d is more efficient than P2,d for large d is to prove that P1,∗ is more effi-cient that P2,∗. This leads to the question of how to establish that one diffusion is more efficient than another. In some cases (e.g., random-walk Metropolis [20], and Langevin algorithms [21]), this is easy since one diffusion is simply a time-change of the other. But more general diffusion comparisons are less clear; for example, the processes’ spectral gaps

1−sup

h(y)P (x, dy):

h(y)π(dy)=0,

h2(y)π(dy)=1

can be ordered directly by using Dirichlet forms, but this does not lead to bounds on the asymptotic variances.

In this paper, we develop (Section2) a new comparison of asymptotic variance of diffusions. Specifically, we prove (Theorem 1) that if Pi are Langevin diffu-sions with respect to the same stationary distributionπ, with variance functions

σ_i2(fori=1,2), then ifσ₁2(x)≥σ₂2(x)for allx, thenP1P2, that is,P1is more

efficient thanP2. (We note that Mira and Leisen [12,17] extended the Peskun

or-dering in an interesting way to continuous-time Markov processes on finite state spaces, and on general state spaces when the processes have generators which can be represented asGif (x)=

f (y)Qi(x, dy)and which satisfy the condition that

Q1(x, A\ {x})≥Q2(x, A\ {x})for allx andA. However, their results do not

ap-pear to apply in our context, since generators of diffusions involve differentiation and thus do not admit such representation.)

We then consider (Section3) simulated tempering algorithms [10,14], and in particular the question of how best to choose the intermediate temperatures. It was previously shown in [1], generalising some results in the physics literature [11,19], that a particular choice of temperatures (which leads to an asymptotic temperature-swap acceptance rate of 0.234) maximises the asymptotic L2 jumping distance, that is, limn→∞E(|Xn−Xn−1|2). (Indeed, this result has already influenced

adap-tive MCMC algorithms for simulated tempering; see, for example, [9].) However, the previous papers did not prove a diffusion limit, nor did they provide any com-parisons of Markov chain variances. In this paper, we establish (Theorem6) diffu-sion limits for certain simulated tempering algorithms. We then apply our diffudiffu-sion comparison results to prove (Theorem7) that the given choice of temperatures does indeed minimise the asymptotic variance of all functionals.

2. Comparison of diffusions. Let π:X →(0,∞) be a C1 target density function, where X is either R or some finite interval [a, b]. We shall consider

nonexplosive Langevin diffusionsXσ onX with stationary densityπ, satisfying

dX_tσ=σXσdBt +

₁

2σ 2_Xσ

t

logπX_tσ+σX_tσσX_tσdt

(4)

for some C1 function σ:X → [k, k] for some fixed 0< k < k <∞, and with reflecting boundaries ataandbin the caseX = [a, b].

For two such diffusions Xσ1 _and _Xσ2_{, we write (similarly to the above) that} Xσ1_Xσ2_{, and say that} _Xσ1 _dominates_Xσ2 _{in the efficiency ordering, if for all} L2(π )functionalsf:X →R,

lim T→∞T

−1/2_Var T 0 f

Xσ1

s

ds

≤ lim T→∞T

−1/2_Var T 0 f

Xσ2

s

ds

.

We wish to argue that ifσ1(x)≥σ2(x) for allx, thenXσ1Xσ2. Intuitively,

this is becauseXσ1 _{“moves faster” than}_Xσ2_{, while maintaining the same}

station-ary distribution. Indeed, if σ1 andσ2 are constants, then this result is trivial (and

implicit in earlier works [20–22]), since thenXσ1

t has the same distribution asX σ2

ct where c=σ1/σ2>1; that is, Xσ1 accomplishes the same sampling as Xσ2 in a

shorter time, so it must be more efficient. However, ifσ1 andσ2 are nonconstant

functions, then the comparison ofXσ1 _and_Xσ2_{is less clear.}

To make theoretical progress, we assume:

(A1) π is log-Lipschitz function onX; that is, there isL <∞with

_log_π(y)₋_log_π(x)_≤_L_|_y₋_x_|_, _{x, y}_∈_X_.

(2)

(A2) Either (a)X is a bounded interval[a, b], and the diffusionsXσ have re-flecting boundaries at a and b, or (b) X is all of R, and π has exponentially-bounded tails; that is, there is 0< K <∞andr >0 such that

π(x+y)≤π(x)e−ry, x > K, y >0

and

π(x−y)≤π(x)e−ry, x <−K, y >0.

In case (A2)(b), we can then find sufficiently largeq≥Ksuch that

i

|i/m|≥q

π(i/m)≤(1/4)

i

π(i/m) for allm∈N

(3)

[where the sums in (3) must be finite due to (2)], and then set

Q=infπ(x):|x| ≤q+1,

(4)

which must be positive by continuity ofπand compactness of the interval[−q−1, q+1].

Our main result is then the following.

THEOREM1. IfXσ1 _and_Xσ2_{are two Langevin diffusions of the form}₍₁₎_with

respect to the same densityπ,with variance functionsσ1 andσ2 respectively,and

(5)

2.1. Proof of Theorem 1. To prove Theorem 1, we introduce auxiliary pro-cesses for eachm∈N. Givenσ:X→R, letS=2keL, and letZm,σ be a discrete-time birth and death process on the discrete state space Xm:= {i/m;i ∈Z} in case (A2)(b), orXm:= {i/m;i∈Z} ∩ [a, b]in case (A2)(a), with transition prob-abilities given by

Pi/m, (i+1)/m= 1

2S σ

2_(i/m)₊σ2((i+1)/m)π((i+1)/m) π(i/m)

,

Pi/m, (i−1)/m= 1

2S σ

2_(i/m)₊σ2((i−1)/m)π((i−1)/m) π(i/m)

and

P (i/m, i/m)=1−Pi/m, (i+1)/m−Pi/m, (i−1)/m.

(In case (A2)(a), any transitions which would cause the process to move out of the interval [a, b] are instead given probability 0.) These transition rates are chosen to satisfy detailed balance with respect to the stationary distribution πm on Xm given byπm(i/m)=π(i/m)/x∈Xmπ(x)[andSis chosen to be large enough to

ensure thatP (i/m, (i+1)/m)+P (i/m, (i−1)/m)≤1].

In terms ofZm,σ, we then let{Y_m,tσ }t≥0be the continuous-time version ofZm,σ,

speeded up by a factor of m2S/2, that is, defined by Y_m,tσ =Zm,σ_m2_{St /}₂ fort≥0.

(Here and throughout, ris the floor function which rounds r down to the next integer, e.g. 6.8 =6 and −2.1 = −3.) It then follows that Ym,t converges toXm,σ, as stated in the following lemma (whose proof is deferred until the end of the paper, since it uses similar ideas to those of the following section).

LEMMA 2. Assuming(A1)and(A2),asm→ ∞,the processesY_mσ converge weakly(in the Skorokhod topology)toXσ.

We then apply the usual discrete-time Peskun ordering to theZm,σ processes, as follows.

LEMMA3. Suppose thatσ1(x)≥σ2(x)for allx∈R.ThenZm,σ1Zm,σ2.

PROOF. By inspection, the fact thatσ1(x)≥σ2(x)implies that

PZm,σ1

(i+1)/m=j+1|Z m,σ1

i/m =j

≥PZm,σ2

(i+1)/m=j+1|Z m,σ2

i/m =j

and

PZm,σ1

(i+1)/m=j−1|Z m,σ1

i/m =j

≥PZm,σ2

(i+1)/m=j −1|Z m,σ2

i/m =j

.

It follows that Zm,σ1 _dominates_Zm,σ2 _{off the diagonal. The usual discrete-time}

(6)

To continue, let

V_∗(f, σ ):= lim T→∞T

−1_Var π

T

0

fXσ_sds

,

which we assume satisfies the usual relation

V_∗(f, σ )=

_∞

−∞Covπ

fX₀σ, fXσ_sds.

Also, let

Vm(f, σ ):=_nlim_→∞n−1Varπ

_mn

i=1

fZ_im,σ

,

which we assume satisfies the usual relation

Vm(f, σ )=

∞

i=−∞ Covπ

fZ₀m,σ, fZm,σ_i .

(In both cases, the subscriptπ indicates that the process is assumed to be in sta-tionarity, all the way from time−∞to∞.) We then have the following.

LEMMA 4. Let Gm be the spectral gap of the process Zm,σ. Assume there is some constant g >0 such thatGm≥g/m2 for all m. Then for all bounded functionsf :R→R, limm→∞(m2S/2)Vm(f, σ )=V∗(f, σ ).

PROOF. Let

Am,t =Covπ

fZ₀m,σ, fZm,σ_m2_{St /}₂

and let

A_∗,t=Covπ

fXσ₀, fXσ_t .

Then

V_∗(f, σ )=

_∞

∞ A∗,tdt

and (sincem2St/2is a step-function oft, with steps of sizem2S/2)

Vm(f, σ )=

_∞

−∞Am,tdt

m2_S/₂ .

Now, by Lemma2, sincef is bounded,

lim

m→∞Am,t=A∗,t.

To continue, letF be the forward operator corresponding to the chainZm,σ, that is, F h(x)=E[h(Z₁m,σ)|Z₀m,σ =x]. Then sinceF is reversible, it follows from Lemma 2.3 of [13] that

_Ft₌_Ft ₌_sup_Cov

π

h1

Zm,σ₀ , h2

Zm,σt

: Varπ(h1)=Varπ(h2)=1

(7)

Lettingv=Varπ[f (X)], we then have, for allm∈Nandt≥0, that

Am,t =Covπ

fZm,σ₀ , fZm,σ_m2_{St /}₂

≤supCovπ

hZ₀m,σ, hZm,σ_m2_{St /}₂

:h∈L2(π ),Varπ

h(X)=v

=vFm2St /2=vFm2St /2=v(1−Gm)m

2_{St /}₂

≤v1−g/m2m2St /2≤ve−g/m2m2St /2=ve−gSt /2.

Hence,

Vm(f, σ )=

_∞

−∞Am,tdt≤2

_∞

0 Am,tdt≤4v/gS <∞.

Hence, by the dominated convergence theorem,

lim m→∞

_∞

−∞Am,tdt=mlim→∞

_∞

−∞A∗,tdt,

that is,

lim m→∞

m2S/2Vm(f, σ )=V∗(f, σ )

as claimed.

To make use of Lemma 4, we need to bound the spectral gaps of the Zm,σ

processes. We do this using a capacitance argument; see, for example, [23]. Let

κm= inf

A⊆Xm 0<π(A)≤1/2

1

πm(A)

x∈A

Pm

x, ACπm(x)

be the capacitance ofZm,σ. We prove

LEMMA5. The capacitanceκmsatisfies that

κm≥min

ke−Lr

2m ,

Qke−2L/m

2m

,

where the quantities LandQare defined in(2) and(4),respectively,and where the bound reduces to simplyκm≥ ke

−L_r

2m in case(A2)(a).

PROOF. We consider two different cases [only the second of which can occur in case (A2)(a)]:

(i) ∃a∈Awith|a| ≤q. Then, sinceπm(A)≤1/2, there isj∈Zwith|j/m| ≤

(8)

that(j +1)/m∈AC. We will need the following estimate onj∈Zπ(j/m). For x∈ [i/m, (i+1)/m),

π(x)≥π(i/m)e−L(x−i/m)

so that

(i+1)/m

i/m

π(x)≥π(i/m)

1/m

0

e−Ludu=π(i/m) 1−e −L/m

L

=π(i/m)e−L/m e

L/m₋₁

L

≥π(i/m)e−L/m L/m

L

=π(i/m)e−L/m

m .

Therefore summing both sides over alli∈Z,

1=

_∞

−∞π(x) dx≥

e−L/m m

i∈Z

π(i/m),

whence

i∈Z

π(i/m)≤meL/m.

Then

x∈A

Pm

x, ACπm(x)≥πm(j/m)Pm

j/m, (j+1)/m

=πm(j/m)(1/2)σ2(j/m)e−L/m ≥π(j/m)/m(k/2)e−2L/m

≥Qke−2L/m/2m.

(ii) A⊆(−∞, q)∪(q,∞). Leta∈Awithπ(a)=max{π(x):x∈A}. Assume WOLOG thata >0. Then

x∈A

Pm

x, ACπm(x)≥πm(a)Pm

a, a−(1/m)

≥ke−L/mπ(a)

i

|i/m|≥a

π(i/m)

≥ke−L/mπ(a)2

∞

j=0

π(a)e−rj/m

=1 2ke

−L/m₁₋_e−r/m_≤1 2ke

(9)

Thus, in either case, the conclusion of the lemma is satisfied.

Now, it is known (e.g., [23]) that the spectral gap can be bounded in terms of the capacitance, specifically thatGm≥κm2/2. Thus, form≥1,

Gm≥

min1₂ke−L(r/m), Qke−2L/m/2m2/2

≥min1₂ke−L(r/m), Qke−2L/2m2/2 =g/m2,

where g = [min(1₂ke−Lr, Qke−2L/2)]2/2 > 0. This together with Lemma 2 shows that the conditions of Lemma 4 are satisfied. Hence, by Lemma 4, limm→∞(m2S/2)Vm(f, σ )=V∗(f, σ )for all bounded functionsf.

On the other hand, by Lemma3,Zm,σ1_Zm,σ2_{, that is,}_V

m(f, σ1)≤Vm(f, σ2).

Hence, for all bounded functionsf,

V_∗(f, σ1)= lim m→∞

m2S/2Vm(f, σ1)

≤ lim m→∞

m2S/2Vm(f, σ2)

(5)

=V_∗(f, σ2).

Finally, iff is inL2 but not bounded, then letting

fm(x)=

⎧ ⎨ ⎩

m, f (x) > m,

f (x), −m≤f (x)≤m,

−m, f (x) <−m,

we have by the monotone (or dominated) convergence theorem that V_∗(f, σ1)=

limm→∞V∗(fm, σ1) and V∗(f, σ2) = limm→∞V∗(fm, σ2). Hence, it follows

from (5) thatV_∗(f, σ1)≤V∗(f, σ2)for allL2(π )functionsf. That is,Xσ1Xσ2,

thus proving Theorem1.

3. Simulated tempering diffusion limit. We now apply our results to a ver-sion of the simulated tempering algorithm. Specifically, following [1], we consider ad-dimensional target density

fd(x)=edK d

i=1 f (xi) (6)

for some unnormalised one-dimensional density function f:R→ [0,∞), where

(10)

We consider simulated tempering ind dimensions, with inverse-temperatures

cho-sen as follows: β₀(d)=1, and β_i(d)₊₁=β_i(d)− (β (d) i )

d1/2 for some fixed C1 function :[0,1] →R. (The question then becomes, what is the optimal choice of.) As for when to stop adding new temperature values, we fix someχ∈(0,1)and keep going until the temperatures drop below χ; that is, we stop at temperature β_k(d)(d)

wherek(d)=sup{i:β_i(d)≥χ}.

We shall consider a joint process (yn(d), Xn), with Xn∈Rd, and with yn(d)∈

Ed := {βi(d);0≤i ≤k(d)} defined as follows. If yn−1 =βi(d) [where 0< i <

k(d)], then the chain proceeds by choosing Xn−1 ∼fβ, then proposing Zn to beβi+1orβi−1with probability 1/2 each, and finally acceptingZnwith the usual Metropolis acceptance probability. (A proposed move toβ₋(d)₁ orβ_k(d)(d)₊₁ is auto-matically rejected.) We assume, as in [1], that the chain then immediately jumps to stationary at the new temperature, that is, that mixing within a temperature is infinitely more efficient than mixing between temperatures.

The process(yn(d), Xn)is thus a Markov chain on the state spaceEd×Rd, with joint stationary density given by

fd(β, x)=edK(β) d

i=1

fβ(xi),

whereK(β)= −logfβ(x) dxis the normalising constant.

We now prove that the{yn(d)} process has a diffusion limit (similar to random-walk Metropolis and Langevin algorithms, see [20–22]), and furthermore the asymptotic variance of the algorithm is minimised by choosing the functionthat leads to an asymptotic temperature acceptance rate=. 0.234. Specifically, we prove the following:

THEOREM 6. Under the above assumptions, the {yn(d)} inverse-temperature process, when speeded up by a factor ofd,converges in the Skorokhod topology asd→ ∞to a diffusion limit{Xt}t≥0 satisfying

dXt =

22 −I 1/2

2

1/2 dBt (7)

+

(X)(X) −I

1/2

2

−2 I 1/2

2

φ −I

1/2

2

dt

(11)

Then, combining Theorems1and6, we immediately obtain:

THEOREM7. For the above simulated tempering algorithm,for anyL2 func-tionalf,the choice ofwhich minimises the limiting asymptotic varianceV_∗(f )=

limm→∞Vm(f ),is the same as the choice which maximises σ (x),that is, is the choice which leads to an asymptotic temperature acceptance probability of0.234 (to three decimal places).

REMARK. In this context, it was proved in [1] that as d → ∞, the choice ofleading to an asymptotic temperature acceptance rate=. 0.234 maximises the expected squared jumping distance of the {yn(d)} process. However, the question of whether that choice would also minimise the asymptotic variance for any L2

function was left open. That question is resolved by Theorem7.

3.1. Proof of Theorem6. The key computation for proving Theorem6will be given next, but first we require some additional notation. We let int(Ed) denote

Ed \ {1, β_k(d)(d)}. We also denote by G(d) the generator of the inverse-temperature process{yn(d)}and setH to be the set of all functionsh∈C2[χ ,1]with h(χ )=

h(1)=0. We also letG∗ be the generator of the diffusion given in (7), defined, for all functionsh∈H, by

G∗h=σ

2_(x)h_(x)

2 +μ(x)h

_(x), _h_∈_H,

(8)

where

μ(x)=(x)(x) −I

1/2

2

−2 I 1/2

2

φ −I

1/2

2

and

σ2(x)=22 −I

1/2

2

.

(9)

(12)

(hd, dG(d)hd)d∈Nsuch that

lim d→∞_xsup_∈_E

d

_h(x)₋_h_d_(x)₌₀

(10)

and

lim d→∞_xsup_∈_E_d

G∗h(x)−dG(d)hd(x)=0. (11)

To establish this convergence on int(Ed), we can simply let hd =h (see Lemma8 below). However, to establish the convergence on the boundary ofEd (Lemma9), we need to modifyhslightly [without destroying the convergence on int(Ed)]. We do this as follows. First, given anyh∈H, we let

hd(x)=h

γd(x)

,

where

γd(x)=

(1−χ )x+χ−χd 1−χd

,

so thathd is just likehexcept “stretched” to be defined on[χd,1] instead of just on[χ ,1]. Here we setχd =β_k(d)(d) , andχ_d+=β_k(d)(d)₋₁; thusχd ≤χ≤χ_d+. Notice that sinceχd→χasd→ ∞,hd and its first and second derivatives converge toh and its corresponding derivatives uniformly forx∈ [χd,1]asd→ ∞.

Finally, given the functionh, we letη(x)to be any smooth function:[χ ,1] →R

satisfying

η(χ )=h(χ )(χ )/2 and η(1)=h(1)(1)/2

and then set

hd(x)=hd(x)+d−1/2η

γd(x)

=hγd(x)

+d−1/2ηγd(x)

,

so thathd(x)is similar tohd(x)except with the addition of a separateO(d−1/2) term (which will only be relevant at the boundary points, i.e., in Lemma9below). In particular, (10) certainly holds.

In light of the above discussion, Theorem6 will follow by establishing (11), which is done in Lemmas8and9below.

LEMMA8. For allh∈H,

lim d→∞_x_∈sup_int_(E

d)

_dG(d)_h(x)₋_G∗_h(x)₌₀

(13)

and

lim d→∞_x_∈sup_int_(E

d)

_dG(d)_h

d(x)−G∗h(x)=0. (13)

PROOF. We begin with a Taylor series expansion forG(d). Since the compu-tations shall get somewhat messy, we wish to keep only higher-order terms, so for

simplicity we shall use the notation r(d)≈ to mean that the expansion holds up to

terms of order 1/r(d), uniformly forx∈Ed, asd→ ∞[e.g.,LHS d

≈RHSmeans that limd→∞supx∈Edd(LHS−RHS)=0]. Then for boundedC

2_functionals_h_{, we}

have (combining the twohterms together) that forβ_i(d)∈int(Ed):

G(d)hβ_i(d)≈d h _(β(d)

i ) 2

α+β_i(d)₊₁−β_i(d)+α−β_i(d)₋₁−β_i(d)

+h(β (d) i ) 2

β_i(d)₊₁−β_i(d)2α+

d ≈ h(β

(d) i ) 2

α+β_i(d)₊₁−β_i(d)+α−β_i(d)₋₁−β_i(d)

+h(β (d) i ) 2

β_i(d)₊₁−β_i(d)2α+

= h(β (d) i ) 2

α−(β_i(d)₋₁)−α+(β_i(d)) d1/2

+h(β (d) i ) 2

_(β(d)

i )2α+

d

,

whereα+ is the probability of accepting an upwards move, andα−is the proba-bility of accepting a downwards move.

To continue, we letg=logf, and

M(β)=Eβ(g)=

logf (x)fβ(x) dx

fβ_{(x) dx} and

I (β)=Varβ(g)=

(logf (x))2fβ(x) dx

fβ_{(x) dx} −M(β)

2_.

It follows, as in [1], that M(β) =I (β) and K(β)= −M(β), so K(β)=

−M(β)= −I (β). We also defineg=g−M(β).

(14)

Then, withX∼fβ,

α− = E

1∧f β+ε

d (X)edK(β+ε)

f_dβ(X)edK(β)

= E

1∧exp

K(β+ε)−K(β)d+εdM(β)+ε

d

i=1 g(Xi)

d1/2 ≈ E

1∧exp dε

2

2 K

₊dε3

6 K

₊_N₀_{, I ε}2_d

(14)

= E

1∧exp

2

2K

₊ε2

6 K

₊_N₀_{, I}2

= −I1/2 2 +

εK

6I1/2

+expε2K/6 −I

1/2

2 −

εK

6I1/2

.

Similarly,

α+=E

1∧f β−ε

d (X)edK(β−ε)

f_dβ(X)edK(β)

=E

1∧exp

K(β−ε)−K(β)d−εdM(β)−ε

d

i=1 g(Xi)

1

≈E

1∧exp dε

2

2 K

₋_N₀_{, I ε}2_d

=E

1∧exp

2

2I−

ε2

6 K

₋_N₀_{, I}2

= −I1/2 2 −

εK

6I1/2

+exp−ε2K/6 −I

1/2

2 −

εK

6I1/2

.

Hence

α+β_i(d)d 1/2

≈ −I1/2(β (d) i )

2 −

εK(β_i(d))

6I1/2_(β(d) i )

+exp−ε2β_i(d)K(βi)/6

−I1/2(β (d) i )

2 +

εK(β_i(d))

6I1/2_(β(d) i )

(15)

A first order approximation of this expression is

α+β_i(d)≈1 2 −I

1/2_(β(d) i ) 2

.

Next, we note that in the current setting,β is itself marginally a Markov chain with uniform stationary distribution among all temperatures. In fact it is a birth and death process, and hence reversible. So, by detailed balance,

α−=α+β_i(d)−/√d.

Therefore,

α−β_i(d) = α+β_i(d)−/√d

d1/2

≈ α+β_i(d)

−((β (d) i )I1/2(β

(d) i )) 2

−

√

d

φ −I

1/2_(β(d) i )

2 −

εK(β_i(d))

6I1/2_(β(d) i )

−exp−ε2β_i(d)K(βi)/6

((β_i(d))I1/2(β_i(d)))

2

× √−

d

φ −I

1/2_(β(d) i )

2 +

εK(β_i(d))

6I1/2_(β(d) i )

.

Then, sinced 1/2

≈ +εd 1/2

≈ +ε=+_d1/2, we compute that

μβ_i(d)d 1/2

≈ 1

2d1/2

−α++ +

d1/2

× α+β_i(d)

−((β (d) i )I1/2(β

(d) i )) 2

× √−

d

φ −I

1/2_(β(d) i )

2 −

εK(β_i(d))

6I1/2_(β(d) i )

−exp−ε2β_i(d)K(βi)/6

((β_i(d))I1/2(β_i(d)))

2

× √−

d

φ −I

1/2_(β(d) i )

2 +

εK(β_i(d))

6I1/2_(β(d) i )

(16)

Hence, ignoring all lower order terms,

μβ_i(d)d 1/2

≈ 1

2d1/2

−((β

(d) i )I1/2(β

(d) i )) 2 × √− d

φ −I

1/2_(β(d) i )

2 −

εK(β_i(d))

6I1/2_(β(d) i )

−exp((β (d) i )I

1/2_(β(d) i )) 2

× √−

d

φ −I

1/2_(β(d) i )

2 +

εK(β_i(d))

6I1/2_(β(d) i )

+2 (−I1/2(β (d)

i )/2)

d1/2

d1/2 ≈ 1

d

−2((β

(d) i )I1/2(β

(d) i ))

2 φ −

I1/2(β_i(d))

2

+ −I1/2(β (d) i ) 2 .

Similarlyσ2(β_i(d))is to first order

22

d −

I1/2(β_i(d))

2

so that we can write (for 0< β <1)

Gdh≈d 1

d

2 ₋I1/2(β (d) i ) 2

h(β)

+ −I1/2(β (d) i ) 2

−2((β

(d) i )I1/2(β

(d) i ))

2 φ −

I1/2(β_i(d))

2

h(β)

.

However, this expression is justd−1G∗h, thus establishing (12).

Finally, to establish (13), we note that in this case the termsd−1/2η(γd(x))and

hd(x)−h(x)are both lower-order and do not affect the limit. Hence, (13) follows directly from (12).

(17)

would be complete simply by settinghd =hand applying Lemma8.) However, the following lemma shows that with the definition ofhd used here, the extension to the boundary does indeed hold.

LEMMA9. For allh∈H,forx=1and forx=χd,

lim d→∞

_dG(d)_h

d(x)−G∗h(x)=0.

PROOF. We prove the case whenx=χd; the casex=1 is similar but some-what easier (since thenx does not depend ond).

Mimicking the Taylor expansion of Lemma8,

G(d)hd(χd) d

≈hd(χd)[α−(χd+−χd)] 2

+hd(χd) 4

χd −χ_d+

2 α−

=hd(χd) 2

α−(χ_d+) d1/2 +

h_d(χd) 4

_(χ

d)2α−

d

≈α−(χd+) 2d1/2

h(χ )+η(χ )d−1/2

+hd(χd) 4

_(χ

d)2α−

d

.

Thus sinceh(χ )=0, this expression equals

h_d(χd) 2

_(χ

d)2α−

d

.

Next we note from (14) that

α−≈1 2 −I

1/2

2

.

Hence, the above results show that

lim

d→∞dGdhd(χd)= 2

(χ )h(χ ) −I 1/2

2

.

In light of formulae (8) and (9), this completes the proof.

(18)

PROOF OF LEMMA 2. We first compute that, to first order as h0 and

m→ ∞, writingx=i/mande=1/m, we have

E Y_m,tσ ₊_h−Y_m,tσ Y_m,tσ = i m

≈ m2Sh 2

1

m

1 2S

×

σ2 i m

+π((i+1)/m)σ2((i+1)/m)

π(i/m)

−σ2 i m

−π((i−1)/m)σ2((i−1)/m)

π(i/m)

=hm 4

_π(x₊_e)σ2_(x₊_e)

π(x) −

π(x−e)σ2(x−e) π(x)

≈hm 4

π(x)+eπ(x)σ2(x)+eσ2(x)

−π(x)−eπ(x)σ2(x)−eσ2(x)/π(x)

=hm 4

₂

eπ(x)σ2(x)+2eπ(x)(σ2)(x) π(x)

=hm 4 (2e)

(logπ )(x)σ2(x)+2σ (x)σ(x)

=h

₁

2(logπ )

_(x)σ2_(x)₊_{σ (x)σ}_(x)

and also

E Y_m,tσ ₊_h−Y_m,tσ 2Y_m,tσ = i m

≈ m2Sh 2

1 2S

1

m2

2σ2(x)+2σ2(x)=hσ2(x).

A comparison with (1) then shows thatY_mσ satisfies the same first and second mo-ment characteristics asXσ_t , so thatXσ_t is indeed the correct putative limit.

(19)

4. Discussion. This paper has linked the usual Peskun ordering on asymp-totic variance of discrete-time Markov chains, to asympasymp-totic variance of diffusion processes. It has then applied these results to simulated tempering algorithms, by proving that the inverse-temperatures of such algorithms converge (in an appropri-ate limit) to a diffusion. By maximising the speed of the resulting diffusion, it has obtained results about the optimal choice of the temperature spacings.

We believe that Theorem1could be useful in other contexts as well, whenever we wish to compare two Langevin diffusion algorithms directly, or alternatively whenever we wish to compare two discrete-time processes which both have ap-propriate diffusion limits.

Of course, Theorem1requires assumptions (A1) and (A2). These are primarily just regularity assumptions, which would likely be satisfied in most applications of interest. On the other hand, the “exponentially-bounded tails” aspect of assump-tion (A2) is more than technical; rather, it provides us with some control over the extreme tail excursions of the processes which we consider, and we suspect that our limiting results might fail if no such control is provided.

Finally, our simulated tempering diffusion limit is only proven under the rather strong and artificial assumption (6) involving a product form of the target density. Indeed, this assumption is central to our method of proof. However, as mentioned earlier, it is known [2–5,20,22] that the general conclusions in this special case often hold in greater generality, either approximately in numerical simulation stud-ies, or theoretically through more general methods of proof. In a similar spirit, we believe that the simulated tempering diffusion limit proven herein would approxi-mately hold numerically in greater generality. In addition, it might be possible to prove a stronger version of our diffusion limit, with weaker assumptions, though such proofs would get rather technical and we do not pursue them here.

REFERENCES

[1] ATCHADÉ, Y. F., ROBERTS, G. O. and ROSENTHAL, J. S. (2011). Towards optimal scaling of Metropolis-coupled Markov chain Monte Carlo.Stat.Comput.21555–568.MR2826692 [2] BÉDARD, M. (2007). Weak convergence of Metropolis algorithms for non-i.i.d. target

distribu-tions.Ann.Appl.Probab.171222–1244.MR2344305

[3] BÉDARD, M. (2008). Optimal acceptance rates for Metropolis algorithms: Moving beyond 0.234.Stochastic Process.Appl.1182198–2222.MR2474348

[4] BÉDARD, M. and ROSENTHAL, J. S. (2008). Optimal scaling of Metropolis algorithms: Head-ing toward general target distributions.Canad.J.Statist.36483–503.MR2532248 [5] BESKOS, A., ROBERTS, G. and STUART, A. (2009). Optimal scalings for local Metropolis–

Hastings chains on nonproduct targets in high dimensions.Ann.Appl.Probab.19863– 898.MR2537193

[6] BHATTACHARYA, R. N. and WAYMIRE, E. C. (1990).Stochastic Processes with Applications. Wiley, New York.MR1054645

[7] BROOKS, S., GELMAN, A., JONES, G. L. and MENG, X.-L., eds. (2011). Handbook of Markov Chain Monte Carlo. Chapman & Hall/CRC, Boca Raton, FL.MR2742422 [8] ETHIER, S. N. and KURTZ, T. G. (1986).Markov Processes:Characterization and

(20)

[9] FORT, G., MOULINES, E. and PRIOURET, P. (2011). Convergence of adaptive and interacting Markov chain Monte Carlo algorithms.Ann.Statist.393262–3289.MR3012408 [10] GEYER, C. (1992). Practical Markov chain Monte Carlo.Statist.Sci.7473–483.

[11] KOFKE, D. A. (2002). On the acceptance probability of replica-exchange Monte Carlo trials. J.Chem.Phys.1176911. Erratum:J. Chem. Phys.12010852.

[12] LEISEN, F. and MIRA, A. (2008). An extension of Peskun and Tierney orderings to continuous time Markov chains.Statist.Sinica181641–1651.MR2469328

[13] LIU, J. S., WONG, W. H. and KONG, A. (1994). Covariance structure of the Gibbs sampler with applications to the comparisons of estimators and augmentation schemes.Biometrika 8127–40.MR1279653

[14] MARINARI, E. and PARISI, G. (1992). Simulated tempering: A new Monte Carlo scheme. Europhys.Lett.19451–458.

[15] MIRA, A. (2001). Ordering and improving the performance of Monte Carlo Markov chains. Statist.Sci.16340–350.MR1888449

[16] MIRA, A. and GEYER, C. J. (2000). On non-reversible Markov chains. InMonte Carlo Meth-ods(Toronto,ON, 1998).Fields Institute Communications2695–110. Amer. Math. Soc., Providence, RI.MR1772309

[17] MIRA, A. and LEISEN, F. (2009). Covariance ordering for discrete and continuous time Markov chains.Statist.Sinica19651–666.MR2514180

[18] PESKUN, P. H. (1973). Optimum Monte-Carlo sampling using Markov chains.Biometrika60 607–612.MR0362823

[19] PREDESCU, C., PREDESCU, M. and CIOBANU, C. V. (2004). The incomplete beta function law for parallel tempering sampling of classical canonical systems.J.Chem.Phys.120 4119–4128.

[20] ROBERTS, G. O., GELMAN, A. and GILKS, W. R. (1997). Weak convergence and opti-mal scaling of random walk Metropolis algorithms. Ann. Appl. Probab. 7 110–120. MR1428751

[21] ROBERTS, G. O. and ROSENTHAL, J. S. (1998). Optimal scaling of discrete approximations to Langevin diffusions.J.R.Stat.Soc.Ser.B Stat.Methodol.60255–268.MR1625691 [22] ROBERTS, G. O. and ROSENTHAL, J. S. (2001). Optimal scaling for various Metropolis–

Hastings algorithms.Statist.Sci.16351–367.MR1888450

[23] SINCLAIR, A. (1992). Improved bounds for mixing rates of Markov chains and multicommod-ity flow.Combin.Probab.Comput.1351–370.MR1211324

[24] TIERNEY, L. (1994). Markov chains for exploring posterior distributions. Ann. Statist. 22 1701–1762.MR1329166

[25] TIERNEY, L. (1998). A note on Metropolis–Hastings kernels for general state spaces.Ann. Appl.Probab.81–9.MR1620401

[26] WARD, A. R. and GLYNN, P. W. (2003). A diffusion approximation for a Markovian queue with reneging.Queueing Syst.43103–128.MR1957808

DEPARTMENT OFSTATISTICS UNIVERSITY OFWARWICK COVENTRY, CV4 7AL UNITEDKINGDOM

E-MAIL:[email protected]

DEPARTMENT OFSTATISTICS UNIVERSITY OFTORONTO TORONTO, ONTARIO, M5S 3G3 CANADA

E-MAIL:[email protected]