Accelerating Metropolis - Hastings algorithms by delayed acceptance

(1)

warwick.ac.uk/lib-publications

Manuscript version: Author’s Accepted Manuscript

The version presented in WRAP is the author’s accepted manuscript and may differ from the

published version or Version of Record.

Persistent WRAP URL:

http://wrap.warwick.ac.uk/115289

How to cite:

Please refer to published version for the most recent bibliographic citation information.

If a published version is known of, the repository item page linked to above, will contain

details on accessing it.

Copyright and reuse:

The Warwick Research Archive Portal (WRAP) makes this work by researchers of the

University of Warwick available open access under the following conditions.

Copyright © and all moral rights to the version of the paper presented here belong to the

individual author(s) and/or other copyright owners. To the extent reasonable and

practicable the material made available in WRAP has been checked for eligibility before

being made available.

Copies of full items can be used for personal research or study, educational, or not-for-profit

purposes without prior permission or charge. Provided that the authors, title and full

bibliographic details are credited, a hyperlink and/or URL is given for the original metadata

page and the content is not changed in any way.

Publisher’s statement:

Please refer to the repository item page, publisher’s statement section, for further

information.

(2)

AIMS’ Journals

Volume X, Number 0X, XX 200X pp. X–XX

ACCELERATING METROPOLIS–HASTINGS ALGORITHMS BY DELAYED ACCEPTANCE

Marco Banterle∗

Department of Medical Statistics London School of Hygiene and Tropical Medicine Keppel St, Bloomsbury, London WC1E 7HT, UK

Clara Grazian

Dipartimento di Economia

Universit`a degli Studi “Gabriele D’Annunzio” Viale Pindaro, 42, 65127 Pescara, Italy

Anthony Lee

School of Mathematics, University of Bristol University Walk, Bristol BS8 1TW, UK

and Christian P. Robert

Department of Statistics, University of Warwick Gibbet Hill Road, Coventry CV4 7AL, UK

(Communicated by the associate editor name)

Abstract. MCMC algorithms such as Metropolis–Hastings algorithms are slowed down by the computation of complex target distributions as exemplified by huge datasets. We offer a useful generalisation of the Delayed Acceptance approach, devised to reduce such computational costs by a simple and univer-sal divide-and-conquer strategy. The generic acceleration stems from breaking the acceptance step into several parts, aiming at a major gain in computing time that out-ranks a corresponding reduction in acceptance probability. Each component is sequentially compared with a uniform variate, the first rejection terminating this iteration. We develop theoretical bounds for the variance of associated estimators against the standard Metropolis–Hastings and produce results on optimal scaling and general optimisation of the procedure.

1. Introduction. When running an MCMC sampler such as Metropolis–Hastings algorithms (Robert and Casella,2004), the complexity of the target density required by the acceptance ratio may lead to severe slow-downs in the execution of the algorithm. A direct illustration of this difficulty is the simulation from a posterior distribution involving a large dataset of n points for which the computing time is at least of order O(n). Several solutions to this issue have been proposed in the

2010 Mathematics Subject Classification. Primary: 68U20, 65C40; Secondary: 62C10. Key words and phrases. Large Scale Learning and Big Data, MCMC algorithms, likelihood function, acceptance probability, mixtures of distributions, Jeffreys prior.

∗_{Corresponding author: Christian Robert.}

(3)

literature (Korattikara et al.,2014;Neiswanger et al.,2013;Scott et al.,2013;Wang and Dunson,2013), taking advantage of the likelihood decomposition

n

Y

i=1

`(θ|xi) (1)

to handle subsets of the data on different processors (CPU), graphical units (GPU), or even computers. However, there is no consensus on the method of choice, some leading to instabilities by removing most prior inputs and others to approximations delicate to evaluate or even to implement.

Our approach here is to delay acceptance (rather than rejection as in Tierney and Mira(1998)) by sequentially comparing parts of the MCMC acceptance ratio to independent uniforms, in order to stop earlier the computation of the aforesaid ratio, namely as soon as one term is below the corresponding uniform.

More formally, consider a generic Metropolis–Hastings algorithm where the ac-ceptance ratio π(y)q(y, x)/π(x)q(x, y) is compared with a U (0, 1) variate to decide whether or not the Markov chain switches from the current value x to the proposed value y (Robert and Casella,2004). If we now decompose the ratio as an arbitrary product π(y) q(y, x)π(x)q(x, y) = d Y k=1 ρk(x, y) (2)

where the only constraint is that the functions ρk are all positive and satisfy the

balance condition ρk(x, y) = ρk(y, x)−1 and then accept the move with probability d

Y

k=1

min {ρk(x, y), 1} , (3)

i.e. by successively comparing uniform variates uk to the terms ρk(x, y), the

moti-vation for our delayed approach is that the same target density π(·) is stationary for the resulting Markov chain.

The mathematical validation of this simple if surprising result can be seen as a consequence ofChristen and Fox (2005). This paper reexaminesFox and Nicholls

(1997), where the idea of testing for acceptance using an approximation before computing the exact likelihood was first suggested. InChristen and Fox(2005), the original proposal density q is used to generate a value y that is tested against an approximate target ˜π. If accepted, y can be seen as coming from a pseudo-proposal ˜

q that simply is formalising the earlier preliminary step and is then tested against the true target π. The validation inChristen and Fox(2005) follows from standard detailed balance arguments; we will focus formally on this point in Section2.

In practice, sequentially comparing those probabilities with d uniform variates means that the comparisons stop at the first rejection, implying a gain in computing time if the most costly items in the product (2) are saved for the final comparisons. Examples of the specific two-stage Delayed Acceptance as defined by Christen and Fox (2005) can be found inGolightly et al. (2015), in the pMCMC context, and inShestopaloff and Neal(2013).

The major drawback of the scheme is that Delayed Acceptance efficiently re-duces the computing cost only when the approximation ˜π is “good enough” or “flat enough”, since the probability of acceptance of a proposed value will always be smaller than in the original Metropolis–Hastings scheme. In other words, the origi-nal Metropolis–Hastings kernel dominates the new one in Peskun’s (Peskun,1973)

(4)

Figure 1. Fit of a two-step Metropolis–Hastings algorithm ap-plied to a normal-normal posterior distribution µ|x ∼ N (x/({1 + σ_µ−2}, 1/{1 + σ−2

µ }) when x = 3 and σµ = 10, based on T = 105

it-erations and a first acceptance step considering the likelihood ratio and a second acceptance step considering the prior ratio, resulting in an overall acceptance rate of 12%

Figure 2. (left) Fit of a multiple-step Metropolis–Hastings al-gorithm applied to a Beta-binomial posterior distribution p|x ∼ Be(x + a, n + b − x) when N = 100, x = 32, a = 7.5 and b = .5. The binomial B(N, p) likelihood is replaced with a product of 100 Bernoulli terms and an acceptance step is considered for the ratio of each term. The histogram is based on 105 _{iterations, with an}

overall acceptance rate of 9%; (centre) raw sequence of successive values of p in the Markov chain simulated in the above experiment; (right) autocorrelogram of the above sequence.

sense. The most relevant question raised byChristen and Fox (2005) is thus how to achieve a proper approximation; note in fact that while in Bayesian statistics a decomposition of the target is always available, by breaking original data in subsam-ples and considering the corresponding likelihood parts or even by just separating the prior, proposal and likelihood ratio into different factors, these decompositions may just lead to a deterioration of the algorithm properties without impacting the computational efficiency.

(5)

However, even in these simple cases, it is possible to find examples where Delayed Acceptance may be profitable. Consider for instance resorting to a costly non-informative prior distribution (as illustrated in Section5.3in the case of mixtures); here the first acceptance step can be solely based on the ratio of the likelihoods and the second step, which involves the ratio of the priors, does not require to be computed when the first test leads to rejection. Even more often, the converse decomposition applies to complex or just costly likelihood functions, in that the prior ratio may first be used to eliminate values of the parameter that are too unlikely for the prior density. As shown in Figure 1, a standard normal-normal example confirms that the true posterior and the histogram resulting from such a simulated sample are in agreement.

In more complex settings, as for example in “Big Data” settings where the like-lihood is made of a very large number of terms, the above principle also applies to any factorisation of the like of (1) so that each individual likelihood factor can be evaluated separately. This approach increases both the variability of the evaluation and the potential for rejection, but, if each term of the factored likelihood is suffi-ciently costly to compute, the decomposition brings some improvement in execution time. The graphs in Figure 2 illustrate an implementation of this perspective in the Beta-binomial case, namely when the binomial B(N, p) observation x = 32 is replaced with a sequence of N Bernoulli observations. The fit is adequate on 105 iterations, but the autocorrelation in the sequence is very high (note that the ACF is for the 100 times thinned sequence) while the acceptance rate falls down to 9%. (When the original y = 32 observation is (artificially) divided into 10, 20, 50, and 100 parts, the acceptance rates are 0.29, 0.25, 0.12, and 0.09, respectively.) The gain in using this decomposition is only appearing when each Bernoulli likelihood computation becomes expensive enough.

On one hand, the order in which the product (3) is explored determines the computational efficiency of the scheme, while, on the other hand, it has no impact on the overall convergence of the resulting Markov chain, since the acceptance of a proposal does require computing all likelihood values. It therefore makes sense to try to optimise this order by ranking the entries in a way that improves the execution speed of the algorithm (see Section3.2).

We also stress that the Delayed Acceptance principle remains valid even when the likelihood function or the prior are not integrable over the parameter space. There-fore, the prior may well be improper. For instance, when the prior distribution is constant, a two-stage acceptance scheme reverts to the original Metropolis–Hastings one.

Similar questions about efficiency of Delayed Acceptance schemes were considered in Sherlock et al.(2015a) and in Sherlock et al.(2017), where the authors propose to build ˜π using a weighted average of the previous evaluations of the likelihood computed at the k nearest neighbours in the parameter space; as the chain pro-gresses, pairs of proposed parameters and their corresponding likelihood are stored in a KD-tree to efficiently search for the nearest neighbours and adaptively build the surrogate distribution.

Finally, while the Delayed Acceptance methodology is intended to cater to com-plex likelihoods or priors, it does not bring a solution per se to the “Big Data” problem in that (a) all terms in the product must eventually be computed; and (b) all terms previously computed (i.e., those computed for the last accepted value of

(6)

the parameter) must be either stored for future comparison or recomputed. Cor-nish et al. (2019) combine Delayed Acceptance and the factorisation in (3) with control variate ideas to address this issue: provided each term in the decomposition can be bounded from below, the corresponding independent Bernoulli draws can be efficiently (sub-)sampled via thinning (Devroye, 1986), avoiding thus the need to compute every ρk term. See insteadBardenet et al.(2017), for a review on different

parallel ways of handling massive datasets.

The plan of the paper is as follows: in Section2, we validate the decomposition of the acceptance step into a sequence of decisions, arguing about the computational gains brought by this generic modification of Metropolis–Hastings algorithms and further analysing the relation between the proposed method and the Metropolis– Hastings algorithm in terms of convergence properties and asymptotic variances of statistical estimates. In Section 4 we briefly state the relations between Delayed Acceptance and other methods present in the literature. In Section 3 we aim at giving some intuitions on how to improve the behaviour of Delayed Acceptance by ranking the factors in a given decomposition to achieve optimal computational efficiency and finally give some preliminary results in terms of optimal scaling for the proposed method. Then Section 5 studies Delayed Acceptance within three realistic environments, the first one made of logistic regression targets, the second one alleviating the computational burden from a Geometric Metropolis adjusted Langevin algorithm and a third one handling an original analysis of a parametric mixture model via genuine Jeffreys priors. Section6 concludes the paper.

2. Validation and convergence of Delayed Acceptance. In this section, we establish that Delayed Acceptance is a valid Markov chain Monte Carlo scheme and analyse on a theoretical basis the differences with the original version.

2.1. The general scheme. We assume for simplicity that the target distribution π and the proposal distributions Q(x, ·) all admit densities w.r.t. the Lebesgue or counting measures. We also denote by π the target density and let q(x, y) denote the proposal density.

Let (Xn)n≥1be a Markov chain evolving on X with Metropolis–Hastings Markov

transition kernel P associated with q and π, i.e. for A ∈ B(X) P (x, A) := Z A q(x, y)α(x, y)dy + 1 − Z X q(x, y)α(x, y)dy 1A(x), where α(x, y) := 1 ∧ r(x, y), r(x, y) := π(y)q(y, x) π(x)q(x, y).

Above, α(x, y) is known as the Metropolis–Hastings acceptance probability and r(x, y) as the Metropolis–Hastings acceptance ratio.

We consider the class of “Delayed acceptance” Markov kernels associated with P , which are defined by factorisations of the function r as

r(x, y) =

d

Y

k=1

ρk(x, y) (4)

with all components in the product satisfying ρk(x, y) = ρk(y, x)−1. The associated

Delayed Acceptance Markov kernel is then defined as ˜ P (x, A) := Z A q(x, y) ˜α(x, y)dy + 1 − Z X q(x, y) ˜α(x, y)dy 1A(x),

(7)

where ˜ α(x, y) := d Y k=1 {1 ∧ ρk(x, y)}.

We will denote by ( ˜Xn)n≥1the Markov chain associated with ˜P .

The order in which the sequence of functions ρk appears in the factorisation

(4) is important for algorithmic specification, as can be seen in Algorithm 1. It means that ρ1(x, Y ) is evaluated first, ρ2(x, Y ) second, and so on until ρd(x, Y ) =

r(x, Y )/Qd−1

k=1ρk(x, Y ) which is last, with the motivation that “early rejection”

can allow computational savings by avoiding the computation of the subsequent ρk(x, Y ).

Algorithm 1 Delayed Acceptance

To sample from ˜P (x, ·): 1. Sample Y ∼ Q(x, ·).

2. For k = 1, . . . , d:

• With probability 1 ∧ ρk(x, Y ) continue, otherwise stop and output x.

3. Output Y .

2.2. Validation. The first lemma is a standard representation leading to the vali-dation of the Delayed Acceptance Markov chain:

Lemma 2.1. For any Markov chain with transition kernel Π of the form Π(x, A) = Z A q(x, y)a(x, y)dy + 1 − Z X q(x, y)a(x, y)dy 1A(x),

and satisfying detailed balance, the function a(·) satisfies (for π-a.a. x, y) a(x, y)

a(y, x)= r(x, y).

Proof. This follows immediately from the detailed balance condition π(x)q(x, y)a(x, y) = π(y)q(y, x)a(y, x).

The Delayed Acceptance Markov chain ( ˜Xn)n≥1 is then associated with the

in-tended target:

Lemma 2.2. ( ˜Xn)n≥1 is a π-reversible Markov chain.

Proof. From Lemma2.1it suffices to verify that ˜α(x, y)/ ˜α(y, x) = r(x, y). Indeed, we have ˜ α(x, y) ˜ α(y, x) = Qd k=1{1 ∧ ρk(x, y)} Qd k=1{1 ∧ ρk(y, x)} = d Y k=1 1 ∧ ρk(x, y) 1 ∧ ρk(y, x) = d Y k=1 ρk(x, y) = r(x, y),

(8)

Remark 1. It is immediate to show that ˜ α(x, y) = d Y k=1 {1 ∧ ρk(x, y)} ≤ 1 ∧ d Y k=1 ρk(x, y) = 1 ∧ r(x, y) = α(x, y),

since (1 ∧ a)(1 ∧ b) ≤ (1 ∧ ab) for a, b ∈ R+.

2.3. Comparisons of the kernels P and ˜P . Given a probability measure µ, let us denote µ(f ) := Z E f (x)µ(dx) , L2(E, µ) := {f : µ(f2) < ∞} L2₀ E, µ := f ∈ L2_{(E, µ) : µ f = 0 .}

For a generic Markov kernel Π : E × B(E) with unique invariant probability measure µ, we define var(f, Π) := lim n→∞var n −1 2 n X i=1 [f (Xi) − µ(f )] ! ,

where (Xn)n≥1 is a Markov chain with Markov kernel Π initialised with X1∼ µ.

Remark 2. One can immediately conclude from the construction of ˜P that var(f, P ) ≤ var(f, ˜P ) for any f ∈ L2(X, π), using Peskun ordering (Peskun,1973;Tierney,1998), since ˜α(x, y) ≤ α(x, y) for any (x, y) ∈ X2.

For any f ∈ L2_{(E, µ) we define the Dirichlet form associated with a µ-reversible}

Markov kernel Π : E × B(E) as EΠ(f ) :=

1 2 Z

µ(dx)Π(x, dy) [f (x) − f (y)]2.

The (right) spectral gap of a generic µ-reversible Markov kernel has the following variational representation Gap (Π) := inf f ∈L2 0(E,µ) EΠ(f ) hf, f i_µ . which leads to the following comparison lemma:

Lemma 2.3 (Andrieu et al., 2018, Lemma 34). Let Π1 and Π2 be µ-reversible

Markov transition kernels of µ-irreducible and aperiodic Markov chains, and assume that there exists % > 0 such that for any f ∈ L20 E, µ

EΠ2 f ≥ %EΠ1 f , then Gap Π2 ≥ %Gap Π1 and

var f, Π2 ≤ (%−1− 1)varµ(f ) + %−1var (f, Π1) f ∈ L20(E, µ).

We will need the following condition in the sequel, which imposes a uniform lower bound on each ρk(x, y) when α(x, y) = 1:

Condition 1. Defining A := {(x, y) ∈ X2 : r(x, y) ≥ 1}, there exists a c > 0 such that

(9)

Intuitively, this condition ensures that when the acceptance probability α(x, y) is 1 then the acceptance probability ˜α(x, y) is uniformly lower bounded by a constant. Reversibility then implies that ˜α(x, y) is uniformly lower bounded by a constant multiple of α(x, y) for all x, y ∈ X. Ultimately, this allows one to show, using Lemma2.3, that ˜P is not too different from P .

Proposition 1. Assume Condition 1. Then Lemma 2.3holds with Π1= P , Π2=

˜

P , µ = π and % = cd−1_.

Proof. Let (x, y) ∈ A, defined in Condition1. Since r(x, y) ≥ 1, we have α(x, y) = 1. On the other hand, from Condition1,

˜ α(x, y) = d Y k=1 1 ∧ ρk(x, y) = Y k:ρk(x,y)<1 ρk(x, y) ≥ c|{k:ρk(x,y)<1}|≥ cd−1,

since at least one ρk(x, y) ≥ 1 whenever r(x, y) ≥ 1.

From Lemma2.1, when (x, y) ∈ A, we have ˜

α(y, x) = ˜α(x, y)/r(x, y) ≥ cd−1α(x, y)/r(x, y) = cd−1α(y, x)

and thus ˜α(x, y) ≥ cd−1_{α(x, y) for any (x, y) ∈ A. Now consider (x, y) /}_{∈ A, which}

implies that (y, x) ∈ A. Then combining this result with Lemma 2.1, ˜α(y, x) = ˜

α(x, y)/r(x, y) ≥ cd−1_{α(x, y)/r(x, y) = c}d−1_{α(y, x). So ˜}_{α(x, y) ≥ c}d−1_{α(x, y) for}

any (x, y) ∈ X2_. It follows that E_P˜(f ) = Z X π(dx) ˜P (x, dy) (f (x) − f (y))2 = Z X π(dx)P (x, dy)α(x, y)˜ α(x, y)(f (x) − f (y)) 2 ≥ cd−1 Z X π(dx)P (x, dy) (f (x) − f (y))2= cd−1EP(f ), and we conclude.

The implication of this result is that, if P admits a right spectral gap, then so does ˜P , whenever Condition 1 holds. Furthermore, and irrespective of whether or not P admits a right spectral gap, quantitative bounds on the asymptotic variance of MCMC estimates using ( ˜Xn)n≥1in relation to those using (Xn)n≥1are available.

2.4. Modification of a given factorisation. The easiest way to use the above result is to modify any candidate factorisation. Given a factorisation of the function r r(x, y) = d Y k=1 ˜ ρk(x, y) ,

satisfying the balance condition, we can define a sequence of functions ρk such

that both r(x, y) = Qd

k=1ρk(x, y) and Condition 1 holds. To that effect, take an

arbitrary c ∈ (0, 1] and define b := cd−11 _{. Then, if we set} ρk(x, y) := min

1

b, max {b, ˜ρk(x, y)}

(10)

it then follows that one must define ρd(x, y) := r(x, y) Qd−1 k=1ρk(x, y) . From this modification, we deduce the following result:

Proposition 2. Using this scheme, Lemma2.3holds with Π1= P , Π2= ˜P , µ = π

and % = c2_.

Proof. We note that inf(x,y)∈X2Q

d−1 k=11 ∧ ρk(x, y) ≥ bd−1= c and that ρd(x, y) = r(x, y) Qd−1 k=1ρk(x, y) ≥ bd−1_{r(x, y) = cr(x, y).}

With A := {(x, y) ∈ X2 _{: r(x, y) ≥ 1}, it follows that inf}

(x,y)∈Aρd(x, y) ≥ c, and

so inf(x,y)∈Aα(x, y) ≥ c˜ 2. We conclude along the same lines as in the proof of

Proposition1.

While this modification ensures that one can take % = c2 _{in Proposition} ₁_{, it is}

too general to suggest that using ˜P can be more computationally efficient than using P when the cost of evaluating each ρk is taken into account. Indeed, Proposition 2

holds when the functions ˜ρk are chosen completely arbitrarily. Of course in practice,

one should choose ˜ρk and hence ρk so that they are in some sense in agreement with

r.

We will show in the next example that a certain class of ˜ρk’s are beneficial, namely

those which correspond to Metropolis–Hastings acceptance ratios with “flattened” surrogate target densities. On the other hand, it is far from difficult to come up with surrogate target densities for which unmodified use of ˜ρk can lead to disastrous

performance.

2.5. Example: unmodified surrogate targets. One common usage (Christen and Fox, 2005) of Delayed Acceptance is to substitute a surrogate target ¯π for π in ρ1(x, y). We consider the case d = 2 and a random walk proposal to examine

Condition1in this context. Here we have q(x, y) = q(y, x) and so α(x, y) = 1 ∧ π(y) π(x), while ρ1(x, y) = 1 ∧ ¯ π(y) ¯ π(x), ρ2(x, y) = 1 ∧ π(y)¯π(x) π(x)¯π(y).

Considering (x, y) ∈ A = {(x, y) ∈ X2: r(x, y) ≥ 1} we require c > 0 satisfying simultaneously ¯ π(y) ¯ π(x) ≥ c, π(y)¯π(x) π(x)¯π(y)≥ c.

The first of these says that ¯π(y)/¯π(x) cannot be too small when π(y) ≥ π(x). The second says that ¯π(y)/¯π(x) should not be a large multiple of π(y)/π(x). There are a large variety of choices of ¯π that allow one to take c = 1. For example, ¯

π(x) = π(x) + C for some constant C ≥ 0 and ¯π(x) ∝ π(x)β for some β ∈ [0, 1]. Note that β = 0 corresponds to ¯π being a constant function and β = 1 corresponds to ¯π ∝ π. In between, one can think of ¯π as being a flattened version of π.

(11)

2.6. Counter-example: failure to reproduce geometric ergodicity. Con-sider the case π(x) = N (x; 0, 1) and ¯π(x) = N (x; 0, σ2) with Q(x, ·) a normal distribution with mean x and fixed variance for each x ∈ R. Here we have

ρ1(x, y) = exp (x − y)(x + y) 2σ2 , ρ2(x, y) = exp (1 − σ2_{)(y − x)(y + x)} 2σ2 .

Mengersen and Tweedie (1996) showed that a random-walk Metropolis–Hastings chain for targets with super-exponential tails is geometrically ergodic. We now exploit this result to derive that, if σ2< 1, then the unmodified delayed acceptance Markov chain is not geometrically ergodic.

Proposition 3. The unmodified Delayed Acceptance Markov chain using the fac-torisation into ρ1 and ρ2 as above is not geometrically ergodic when σ < 1.

Intuitively, when x is large P (x, (−∞, x)) ≈ 1

2but limx→∞P (x, {x}) = 1 because˜

ρ1(x, y) takes on smaller and smaller values for y > x and ρ2(x, y) takes on smaller

and smaller values for y < x.

Proof. From Roberts and Tweedie (1996, Theorem 5.1), it suffices to show that π−ess infx∈XP (x, {x}˜ {) = 0, i.e. that for any τ ∈ (0, 1) we can find A ⊆ X such

that π(A) > 0 and sup_x∈AP (x, {x}˜ {) ≤ τ . Let Bs(z) denote the ball of radius s

around z. Given τ ∈ (0, 1), we define

r := sup{s > 0 : Q(x, Bs(x)) < τ /3}, and A := x : x > r 2 − σ2log(τ /3) r(1 − σ2₎ \ n x : Q(x, R−) < τ 3 o . Then ˜ P (x, {x}{) = P (x, B˜ r(x) \ {x}) + ˜P (x, Br{(x)) ≤ τ 3 + Z B{ r(x) Q(x, dy) ˜α(x, y) ≤ 2τ 3 + Z B{ r(x)∩R+ Q(x, dy) ˜α(x, y) ≤ 2τ 3 +_y∈Bsup{ r(x)∩R+ ˜ α(x, y) = 2τ 3 +_y∈Bsup{ r(x)∩R+ [1 ∧ ρ1(x, y)] [1 ∧ ρ2(x, y)] . Now let x ∈ A, y ∈ B{

r(x) ∩ R+ and assume y < x. It follows that ρ2(x, y) attains

its maximum when y = x − r and therefore ρ2(x, y) ≤ exp (1 − σ2_{)r(r − 2x)} 2σ2 ≤ exp (1 − σ 2_)r 2σ2 2σ2_{log(τ /3)} r(1 − σ2₎ = τ 3.

(12)

Similarly, let x ∈ A, y ∈ B{

r(x) ∩ R+ and assume y > x. It follows that ρ1(x, y)

attains its maximum when y = x + r and therefore ρ1(x, y) ≤ exp −r(2x + r) 2σ2 ≤ exp − r 2σ2 2r −2σ 2_{log(τ /3)} r(1 − σ2₎ ≤ exp r 2σ2 2σ2_{log(τ /3)} r(1 − σ2₎ ≤ τ 3, since log(τ /3) < 0 and σ2_{< 1. Therefore,}

sup y∈B{ r(x)∩R+ [1 ∧ ρ1(x, y)] [1 ∧ ρ2(x, y)] ≤ τ 3 so ˜P (x, {x}{) ≤ τ and we conclude.

The same argument can be made for much more general targets and proposals, albeit at the expense of brevity and clarity. We refrain from such a generalisation as our purpose here is to demonstrate that the DA chain may fail to inherit geometric ergodicity and that the simple proposed modification of the Delayed Acceptance kernel provided in Section2.4allows one to avoid this.

3. Optimisation. When considering Markov Chain Monte Carlo methods in prac-tice, their efficiency as measured by mixing properties and computational cost is a fundamental issue. This section addresses both perspectives in connection with Delayed Acceptance. Section 3.1 examines the proposal distribution and derives its optimal scaling from standard random–walk Metropolis–Hastings theory. Then Section3.2covers the ranking of the factors ρi, which drives the total computational

cost of the procedure.

3.1. Optimising the proposal mechanism. The explorative performances of a random–walk MCMC are strongly dependent on its proposal distribution. As exemplified inRoberts et al.(1997), finding the optimal scale parameter does lead to efficient ‘jumps’ in the state space and the overall acceptance rate of the chain is directly connected to the average jump distance and to the asymptotic variance of ergodic averages. This provides practitioners with an approach to ‘auto-tune’ the resulting random–walk MCMC algorithm. Extending this calibration to the Delayed Acceptance scheme is equally important, on its own ground towards finding a reasonable scaling for the proposal distribution and to avoid comparisons with the standard Metropolis–Hastings version.

The original framework ofRoberts et al.(1997) is centered on estimating a collec-tion of expected funccollec-tionals, say g, where a plausible criterion for the performances of the MCMC is the minimisation of the stationary integrated auto-correlation time (ACT) of the Markov chain, defined as

τg= 1 + 2 ∞

X

i=1

Cor(g(X0), g(Xi))

where the index g stresses the dependence on the considered functional, which is connected to the asymptotic variance through var(P, g) = τg× varπ(g) whenever

the chain is φ-irreducible, aperiodic, and reversible, varπ(g) is finite and g ∈ L2(π).

(13)

• Consider the limit in the dimension of the target distribution toward ∞, where

Roberts et al. (1997) gave conditions under which each marginal chain con-verges toward a Langevin diffusion. Maximising the speed of that diffusion, say h(`) where ` is a parameter of the scale of the proposal, implies a minimisa-tion of the ACT and also that τ is free from the dependence on the funcminimisa-tional, defining thus an independent measure of efficiency for the algorithm;

• Sherlock and Roberts(2009) focus on unimodal elliptically symmetric targets and show that a proxy for the ACT in finite dimensions is the Expected Square Jumping Distance (ESJD), defined as

EkX0− Xk2β = E " _d X i=1 1 β2_i(X 0 i− Xi)2 #

where X and X0 _{are two successive points in the chain and k · k}

β represent

the norm on the principal axes of the ellipse rescaled by the coefficients βi so

that every direction contributes equally.

An interesting result in Sherlock and Roberts (2009) is that, as d → ∞, the ESJD on one marginal component of the chain converges with the same speed as the diffusion process described inRoberts et al.(1997). These authors furthermore show the asymptotic result holds for rather small dimension, roughly starting from d = 5.

Moreover, when considering efficiency for Delayed Acceptance, which is a tech-nique tailored on costly computations, we need to focus on the execution time of the algorithm as well. We then proceed to define our measure of efficiency as

Eff := ESJDcost per iteration (5)

similarly toSherlock et al. (2015b) for Pseudo-Marginal MCMC.

Due to the complex acceptance ratio in Delayed Acceptance, an extension of the previous results requires rather stringent assumptions, albeit providing a proper guideline in practice. Section5will further demonstrate optimality extends beyond those conditions. Note that our assumptions are quite standard in the literature on the subject.

(H1) We assume for simplicity’s sake that the Delayed Acceptance procedure op-erates on two factors only, i.e., that r(x, y) = ρ1(x, y) × ρ2(x, y). The acceptance

probability of the scheme is thus ˜ α(x, y) = 2 Y i=1 (1 ∧ ρi(x, y)).

We also consider the ideal setting where a computationally cheap approximation ˜

f (·) is available for π(·) and precise enough so that ρ2(x, y) = r(x, y)/ρ1(x, y) =

π(y)π(x) × ˜f (x)_˜ f (y) = 1.

(H2) We further assume that the target distribution satisfies (A1) and (A2) in

Roberts et al.(1997), which are regularity conditions on π and its first and second derivatives, and that π(x) =

n

Q

i=1

f (xi).

(H3) We consider a random walk proposal y = x +p`2_{/d Z, where Z ∼ N (0, I} d).

Note that Gaussianity can be easily relaxed to distributions with finite fourth mo-ment and similar results are available for more heavy-tailed distributions (Neal and

(14)

Roberts, 2011).

(H4) Finally we assume that the cost of computing ˜f (·), say c, is proportional to the cost of computing π(·), named C, with c = δC.

Normalising costs by setting C = 1, the average total cost per iteration of the Delayed Acceptance chain is δ + E [˜α] and the efficiency of the proposed method under the above conditions is

Eff (δ, `) = ESJ D δ + E [˜α]

Lemma 3.1. Under the above conditions (H1–H4) on the target π(x), on the pro-posal q(x, y) and on the factorised acceptance probability ˜α(x, y) =

2 Q i=1 (1 ∧ ρi(x, y)) we have that ˜ α(x, y) = (1 ∧ ρ1(x, y)) and that as d → ∞ Eff (δ, `) = h(`) δ + E [˜α]= 2`2_Φ(−`√I 2 ) δ + 2Φ(−`√I 2 ) a(`) = E [˜α] = 2Φ(−` √ I 2 ) where I := E ( π(x) )0 π(x) 2

as defined in Roberts et al.(1997).

Proof. It is easy to see that (H1) implies ˜f (·) = π(·) and so ρ1(x, y) = r(x, y).

Moreover, by definition, ρ2(x, y) = r(x, y)/ρ1(x, y) = 1 and hence the second test

is always accepted. The acceptance rate reduces then to just the ratio ˜f (y)/ ˜f (x) = ρ1(x, y).

The second part of the lemma follows directly from Theorem 1.1 inRoberts et al.

(1997).

Let us stress that almost all assumptions in the above Lemma can be relaxed and that performances are robust against small deviances from those assumptions, as shown by the literature on standard Metropolis–Hastings. Obtaining analytical results without such conditions, while possible, requires however a considerable mathematical effort.

We now state the main practical implication of Lemma3.1.

Proposition 4. If the conditions of Lemma3.1 holds, the optimal average accep-tance rate α∗(δ) is independent of I.

Proof. Consider Eff (δ, `) in terms of (δ, a(`)): a = g(`) = 2Φ −` √ I 2 ! ; ` = g−1(a) = −Φ−1a 2 2 √ I Eff (δ, a) = 4 I h Φ−1 a₂2ai δ + a = 4 I ₁ δ + a Φ−1a 2 2 a

(15)

Figure 3. Two top panels: behaviour of `∗(δ) and α∗(δ) as the rel-ative cost varies. Note that for δ >> 1 the optimal values converges towards the values computed for the standard Metropolis–Hastings (dashed in red). Two bottom panels: close–up of the interesting region for 0 < δ < 1.

where we dropped the dependence on ` in a for notation’s sake. It is now evident that to maximise Eff (δ, a) in a we only need maximisen_δ+a1 hΦ−1 a₂2

aio, which is independent of I.

The optimal scale of the proposal `∗_{(δ) and the optimal acceptance rate a}∗_(δ)

are thus given as functions of δ. In particular, as the relative cost of computing ρ1(x, y) with respect to ρ2(x, y) decreases, the proposed moves become bolder, in

that `∗increases and a∗decreases, since rejecting costs the algorithm little in terms of time, while every accepted move results in an almost independent sample. On the contrary when δ grows larger the chain rapidly approaches a Metropolis–Hastings behaviour, as it is no longer convenient to reject early. Figure3 helps visualise the result.

3.2. Ranking the Blocks. As mentioned at the end of Section 1, the order in which the factors ρi(x, y) are tested has a strong influence on the performance of

the algorithm. Delayed Acceptance was first developed inFox and Nicholls(1997);

Christen and Fox (2005) to speed up computations using a cheap approximation ˜

π(·) of the target distribution π(·) as a first step before computing the actual, and costly, Metropolis–Hastings ratio π(y)π(x) only in the cases where the acceptance

(16)

test based on the approximation ˜π was satisfied. The main idea, namely to avoid the computation of the most costly parts as often as possible, remains relevant even for factorisations composed of more than two terms.

Consider an i.i.d. framework; the target (in x) is given by π(x|Z) ∝ p(x) × L(Z|x) = p(x|Z) ×

n

Y

i=1

f (zi|x)

where Z = (z1, . . . , zn) is an i.i.d. sample from f (z|x) and p(x) is the prior

distri-bution for x, we can always consider the decomposition r(x, y) =

K

Y

i=1

ξi(x, y) (6)

where each ξi(x, y) is made of a small number of density ratio terms, with one

including the prior and proposal ratios. In the limit, it is feasible if not necessarily efficient to consider the case K = n + 1 with

ξi(x, y) = f (zi|y) f (zi|x) i = 1, . . . , n and ξn+1(x, y) = p(y)q(y, x) p(x)q(x, y).

Assuming the computing cost is comparable for all terms, a solution for optimis-ing the order of these factors ranks the entries accordoptimis-ing to the success rates observed so far, starting with the least successful values. Alternatively, the factorisation can start with the ratio that has the highest variance, since it is the most likely to be rejected. (Note however that poor factorisations (6) lead to very low acceptance rates, as for instance when picking only outliers in a given group of observations.) Lastly, we can rank factors by their correlation with the full Metropolis–Hastings ratio; taking the argument to the limit, if the first factor has a perfect correlation with r(·, ·) then all the successive terms must be accepted and their order is hence of no interest.

This later setting is akin to considering the hypothetical optimal solution intro-duced in Section3.1with only two terms in the decomposition. Let a small number of the best scoring terms be merged to form ρ1 and let the remaining factors

be-come ρ2. ρ1(x, y) = ˜π(y)/˜π(x) is then highly correlated to r(x, y), ρ2(x, y) ∼ 1 for

every (x, y) and hence ˜π(x) is a close approximation of the target, albeit probably flattened, which is exactly what we want (see Section2).

As all these features can be evaluated for each subsample while running a chain with acceptance ratio factored as in (6), an implementation based on this intuition is then to take ˜ πZ∗(x) ∝ p(x)m/n m Y i=1 f (zi∗|x) or π˜Z∗(x) ∝ m Y i=1 f (zi∗|x)

with m < n, where Z∗= (z∗1, . . . , zm∗) is a subsample of Z. At each iteration t of the

Markov chain we compute all the ξi(x(t−1), y) and Z∗(t)is chosen as the subset that

maximise the observed correlation between the values ofπ˜Z∗(x(j))

˜

πZ∗(x(j)) (j = 1, . . . , t−1) and the full Metropolis–Hastings ratio (or whatever other selected criterion). As computing the real argf maxZ∗_⊂Z is expensive, in our practical implementation we resort to a forward selection scheme; starting with the factor ξi with the maximal

correlation we build Z∗(t) merging one term at a time until a desired correlation level is achieved, the observed correlation after including another term does not

(17)

grow more than a small > 0 or the size of Z∗(t) has reached a critical point for computational purposes (e.g. 10% of the whole sample Z).

A relevant warning is that if we rearrange terms during the run, not only reorder-ing but also mergreorder-ing them, in accordance to their correlation with the unmodified ratio, the resulting method has no theoretical guarantee since the kernel is po-tentially changing at each iteration depending on properties of previous samples (Gelfand and Sahu,1994).

As with standard adaptive MCMC (Roberts and Rosenthal,2005) we resort thus to a finite adaptation scheme; we start with a fixed number of iterations to rank and rearrange the factors, followed by a fixed ordering to achieve ergodicity of the chain. We test this procedure in Section5.1on a simulated example.

Finally note that while we focused on the i.i.d. setting, in more complex cases where the ratio is factored and Delayed Acceptance can be applied, it is often the case that the optimal ordering of such factors is already known.

4. Relation with other methods.

4.1. Delayed Acceptance and Prefetching. Prefetching, as defined by Brock-well (2006), is a programming method that accelerates the convergence of a single MCMC chain by guessing future states in the path of a random walk Metropolis– Hastings Markov chain in order to use any additional computing power available, in the form of extra parallel processors, to calculate in advance necessary quantities (like the Metropolis–Hastings ratio) so that when the chain reaches a given state the computationally-heavy part of that iteration are ready.

Clearly the usefulness of this technique depends on our ability to guess the path of the chain correctly and hence many advanced prefetching strategies make use of the observed acceptance rate of the chain or even of a fast approximation ˜π of the target distribution to select the most likely future outcomes.

Since an in-depth exploration of prefetching is outside the scope of this work the reader is referred toStrid(2010) and citations therein for a complete discussion of the argument.

As mentioned above and demonstrated inStrid(2010);Angelino et al.(2014) if a cheap approximation ˜π of the target density is available, it can be used to select more likely future paths of the chain and this results in an efficient prefetching algorithm.

In our case the master process sequentially samples from the (Delayed Ac-ceptance) chain by checking only the (assumed) inexpensive first approximation ρ1(x, y) = ˜π(y)/˜π(x) while the other additional processors provide him the more

expensive ρ2(x, y) = π(y)˜π(x)/π(x)˜π(y) computed beforehand thanks to

prefetch-ing. The theoretical properties of the chain are unchanged while the achievable speed-up may be substantial, especially for the first few additional processors. 4.2. Alternative procedure for Delayed Acceptance. In the case that every factor ρi(x, y) has roughly the same computational cost, Philip Nutzman suggested

(personal communication) that Delayed Acceptance can be slightly modified by taking the overall acceptance probability

d

Y

i=1

min {ρi(x, y), 1} to be instead min k=1,... ,d ( k Y i=1 ρi(x, y), 1 ) .

(18)

Such a decomposition follows from the same idea that one would like to compute as few factors as possible once one realizes that the proposal is likely to be rejected. Under this modification the associated Markov chain still achieves the correct target in the stationary regime and the procedure satisfies detailed balance, provided the ordering of the terms is uniformly random.

An interesting consequence of this modification is that, as the number of factor increases, the acceptance rate eventually stabilises, while for the method described in Section1the acceptance rate decreases to zero. This property is indeed appeal-ing, even though this procedure logically takes longer to complete when compared with the standard Delayed Acceptance (albeit less than the reference Metropolis– Hastings procedure).

The evident disadvantage of the modification in a general setting is that detailed balance implies that the factors are computed in a random order at each iteration, making vain any attempt to adapt in terms of the ordering (Section3.2) or to set the order based on respective computational costs.

This drawback can be somewhat reduced by combining the above two approaches; consider the decomposition

π(y) q(y, x)π(x)q(x, y) = "d1 Y i=1 ρi(x, y) # ×   d2 Y j=1 φj(x, y)  

where d1+ d2 = d and the factors ρi and φj represent respectively cheap factors

and costly factors. By taking now

min (m=1,... ,d1) ( 1, m Y i=1 ρi(x, y) ) × min (k=1,... ,d2)    1, k Y j=1 φi(x, y)   

the algorithm computes cheap factors first and expensive factors last, applying the symmetry requirement to satisfy detail balance inside each of both subsets. Clearly the above can be generalised to a larger number of subsets, each with di factors in

it. Intuitively, this last modification can be explained as an early rejection of each of the intermediate acceptance/rejection steps inside a Delayed Acceptance scheme. Remark 3. Interestingly if dl= 1 ∀l (l being the number of subsets considered) this

procedure reduces to Delayed Acceptance, and for l that increases and dl> 1 ∀l this

combined technique will have a even lower overall acceptance rate than standard Delayed Acceptance.

4.3. Delayed Acceptance and Slice Sampling. As a final remark, we stress another analogy between our Delayed Acceptance algorithm and slice sampling (Neal,1997;Robert and Casella,2004). Based on the same decomposition (1), slice sampling proceeds as follows

1. simulate u1, . . . , un∼ U (0, 1) and set λi= ui`(θ|xi) (i = 1, . . . , n);

2. simulate θ0 as a uniform under the constraints `i(θ0|xi) ≥ λi (i = 1, . . . , n).

to compare with Delayed Acceptance which conversely 1. simulate θ0∼ q(θ0_|θ);

2. simulate u1, . . . , un∼ U (0, 1) and set λi= ui`(θ|xi) (i = 1, . . . , n);

3. check that `i(θ0|xi) ≥ λi (i = 1, . . . , n).

The differences between both schemes are thus that (a) slice sampling always accepts a move, (b) slice sampling requires the simulation of θ0_{under the constraints, which}

(19)

may prove infeasible, and (c) Delayed Acceptance re-simulates the uniform variates in the event of a rejection. In this respect, Delayed Acceptance appears as a “poor man’s” slice sampler in that values of θ0s are proposed until one is accepted. 5. Examples. To illustrate the improvement brought by Delayed Acceptance, we study three different realistic settings that reflect on the generality of the method. First, in Section5.1we consider a Bayesian analysis of a logistic regression model, to assess the computational gain brought by our approach in a “Big-Data” environ-ment where obtaining the likelihood is the main computational burden. Secondly (Section 5.2) we examine a high dimensional toy Normal-Normal model, sample with a geometric Metropolis adjusted Langevin algorithm where the main compu-tational cost comes from the proposal distribution which is position specific and involves derivatives of the density up till the third level, which are computed nu-merically at each iteration. Finally in Section 5.3we investigate a mixture model where a formal Jeffreys prior is used, as it is not available in closed-form and does require an expensive approximation by numerical or Monte Carlo means. The latter example comes as a realistic setting where the prior itself is a burdensome object, even for small datasets.

5.1. Logistic Regression. While a simple model, or due to its simplicity, logistic regression is widely used in applied statistics, especially in classification problems. The challenge in the Bayesian analysis of this model is not generic, since simple Markov Chain Monte Carlo techniques providing satisfactory approximations, but stems from the data-size itself. This explains why this model is used as a benchmark in some of the earlier accelerating papers (Korattikara et al.,2014;Neiswanger et al.,

2013; Scott et al., 2013; Wang and Dunson, 2013). Indeed, in “Big Data” setups, MCMC is deemed to be progressively inefficient and researchers are striving to keep simulation effective, focusing mainly on parallel computing and on sub-sampling but also on replacing the classic Metropolis scheme itself.

We tested the proposed method against the standard Metropolis–Hastings algo-rithm on 106_{simulated data with a 100-dimensional parameter space. The proposal}

distribution is Gaussian: y|x ∼ N (x, Σ) with Σ initialised to be 0.2 × Id (d being

the dimension of the parameter space) and then adapted. The Metropolis–Hastings benchmark was made adaptive by targeting the asymptotic optimal acceptance rate of α∗= 0.234 (Roberts et al.,1997).

Delayed Acceptance was optimised first against the ordering of the factors as explained in Section 3; we split the data into subsamples of 10 elements and com-puted their empirical correlation with the full Metropolis–Hastings ratio as a crite-rion. Once these estimates were stable we merged into the surrogate target ˜f the smallest number of subsamples needed to achieve a ≥ 0.85 correlation with r(x, y). As soon as the ordering was fixed we computed δ, the relative cost of the obtained ρ1with respect to the full ratio, and run the chain for the remaining iterations

op-timising Σ against the optimal acceptance rate found through (5). We also added the modification explained in Section2.4 with c set such that b was slightly lower than the optimal acceptance rate above.

We collected 100 repetitions of the experiment and the results are presented in Table 1. Before commenting the results we highlight the fact that this situation may seem not particularly appealing for Delayed Acceptance and in fact straight application of the method by randomly choosing the composition of ρ1 and ρ2 may

(20)

Algorithm rel. ESS (av.) rel. ESJD (av.) rel. Time (av.)

DA-MH over MH 1.1066 12.962 0.098

Algorithm rel. gain (ESS) (av.) rel. gain (ESJD) (av.)

DA-MH over MH 5.47 56.18

Table 1. Comparison between MH and MH with Delayed Accep-tance on a logistic model. ESS is the effective sample size, ESJD the expected square jumping distance, time is the computation time.

adaptively how to split the MH ratio. Borrowing from both Section3.2and the end of Section3.1, i.e. by choosing during the brief burn-in of the chain which subset best represents the whole likelihood and then, based on how populated that subset is, targeting a specific acceptance ratio, produces both a completely automated MCMC version for this kind of data (iid ) and better results under a time constraint.

As shown in Table 1, while the assumption made in Section 3 not completely satisfied, the relative efficiency of Delayed Acceptance is higher that for MH by a factor of almost 6. We measured efficiency trough effective sample size (ESS, from the coda R package (Plummer et al., 2006)) or expected square jumping distance (ESJD). By choosing the first subsample to be informative on the whole ratio there is practically no loss on ESS (while the estimated ESJD actually increased) and, given the significantly reduced acceptance rate, the computing time is usually less then a fourth of the computing time of the corresponding optimal MH, taking into account the first part of chain used to determine the blocks ranking.

5.2. G-MALA with Delayed Acceptance.

5.2.1. MALA and Geometric MALA:. Random walk Metropolis–Hastings, while generic and popular, can struggle with posterior distributions in high dimensions or in the presence of high correlation between some components. In such cases it is inefficient, with low acceptance rate, poor mixing and highly correlated sam-ples. Metropolis adjusted Langevin algorithm (MALA, see for instanceRoberts and Stramer (2002)) has been devised to overcome these difficulties by taking advan-tage of the gradient of the target distribution in the proposal mechanism, making the Markov chain more robust with respect to the dimension of the problem and proposing broader moves with higher probability. MALA is based on a Langevin dif-fusion, with the target (the posterior distribution π(θ|y) in our case) as a stationary distribution, defined by the SDE

dθ dt = 1 2∇θlog(π(θ|y)) + dB dt

where B is a Brownian motion. Using a first-order discretisation the diffusion gives the following proposal mechanism:

θ0 = θ(i−1)+ ε2∇θlog(π(θ(i−1)|y))/2 + εZ

where ε is the step-size for the Euler’s integration and Z ∼ N (0, I). This discreti-sation is then compensated by introducing an accept/reject probability similar to a Metropolis–Hastings algorithm.

This diffusion is isotropic and will hence still be inefficient for highly correlated components or with very different scales, as the step size ε is fixed across dimensions.

(21)

Roberts and Stramer(2002) propose to alleviate the issue using a pre-conditioning matrix A so that the proposal becomes

θ0= θ(i−1)+ ε2ATA∇θlog(π(θ(i−1)|y))/2 + εAZ.

Christensen et al.(2005) demonstrate however that defining this matrix in general can be difficult and that tuning on the go may result in an inappropriate asymptotic behaviour.

Girolami and Calderhead(2011) propose the Geometric-MALA in order to over-come this difficulty, advising the use of a position specific metric for the matrix A, which takes advantage of the geometry of the target space that the chain is explor-ing. They suggest in particular the Fisher-Rao metric tensor. In terms of Bayesian inference, where the target distribution is the posterior density, this choice trans-lates into AT_{A being the expected Fisher information matrix plus the negative}

Hessian of the log-prior.

This theoretically efficient solution also performs well in practice but comes with a serious computational burden in the fact that at every evaluation of the Metropolis– Hastings ratio derivatives up till the third order of our log-target distribution are needed and, in the event of them being analytically not available, expensive nu-merical approximations are to be computed (see equation (10) of Girolami and Calderhead(2011)).

5.2.2. Sampling with Delayed Acceptance and GMALA:. Geometric-MALA repre-sent a perfect application for Delayed Acceptance since we can naturally divide its acceptance ratio into the product of the posterior ratio and the ratio of the propos-als, the latter to be only computed when the proposed point is associated with a relatively large posterior probability.

As described above, the computational bottleneck of the G-MALA lays in the computation of the third derivative of our log-target at the proposed point, while the computation of the posterior itself has usually a low relative cost. Moreover even with a non-symmetric efficient proposal mechanism (the discretised Langevin diffusion) G-MALA is still close to a random walk and we expect the ratio of the proposal to be near 1, especially at equilibrium especially when ε is small. Therefore, the first ratio is inexpensive, relative to the second one, while the decision reached at the first stage should be consistent with the overall acceptance rate.

Given that optimal scaling for MALA in terms of the dimension d of the target differs from the random-walk setting (see Roberts and Rosenthal, 2001), we set the variance of the random-walk normal component as σ2

d = `2

d1/3. Borrowing from Section3.1, we can obtain the optimal acceptance rate for the DA-MALA, through Equation (5), by maximising Eff (δ, `) = h(`) δ + E [˜α] − δ × E [˜α] = 2`2_Φ(−K`3 2 ) δ + 2Φ(−K`₂3) × (1 − δ) or equivalently Eff (δ, a) = − 2 K 23   aΦ−1 a 2 23 δ + a(1 − δ)  .

In the above the computational cost per iteration is taken to be c = δC for the posterior ratio, C = 1 for the proposal ratio (and hence c + E [˜α] (C − c) for the whole kernel), h(`) is again the speed of the limiting diffusion process and K is a measure of roughness of the target distribution, depending on its derivatives. Since

(22)

Figure 4. Optimal acceptance rate for the DA-MALA algorithm as a function of δ. In red, the optimal acceptance rate for MALA obtained byRoberts and Rosenthal(2001) is met for δ = 1.

the optimal a∗is independent from K, we do not define it more rigorously, referring to Roberts and Rosenthal (2001). Figure 4 shows that a∗ decreases with δ, as is the case with random-walk Metropolis–Hastings. It reaches the known optimum for the standard MALA when δ = 1.

5.2.3. Simulation study: To test the above assumptions we ran a toy MALA exam-ple where we drew 100 samexam-ples from a Nd(θ, I), with d = 10; π(θ) was set to be

Nd(0, 100). Figure 5 presents an example run. We then repeated the experiment

100 times and computed an average efficiency gain, defined either as ESS or as the ESJD, over the computing time. We computed δ at each run by averaging a few computed derivatives, required by the proposal ratio. We then adapt ε to get the optimal acceptance rate, being conservative in order to avoid overflow issues with the first-order numerical integrator. Results are presented in Table2. Delayed Ac-ceptance exhibits improvement by a factor of 10 in this example, obtained almost for free in terms of coding time.

5.2.4. HMC with Delayed Acceptance: As a side note, while the reasoning applied to MALA does theory apply to Hamiltonian Monte Carlo (HMC), the computational gain obtained through Delayed Acceptance is only connected with avoiding some proposal computations. In a general HMC though (with both point-dependent and independent pre-conditioning matrices), proposing a new value still involves the computation of L − 1 (with L the number of steps in the discretised–Hamiltonian integration) derivatives, as only the starting point is recovered from the previous iteration, while computing the final Metropolis–Hastings ratio involves just the extra computation at the end point. Therefore, in this setting, the computational gain is much reduced.

(23)

Algorithm ESS (av.) (sd) ESJD (av.) (sd) time (av.) (sd)

MALA 7504.48 107.21 5244.94 983.47 176078 1562.3

DA-MALA 6081.02 121.42 5373.253 2148.76 17342.91 6688.3 Algorithm a (aver.) ESS/time (aver.) ESJD/time (aver.)

MALA 0.661 0.04 0.03

DA-MALA 0.09 0.35 0.31

Table 2. Comparison between standard geometric MALA and geometric MALA with Delayed Acceptance, with ESS the effective sample size, ESJD the expected square jumping distance, time the computation time and a the observed acceptance rate.

Figure 5. Comparison between geometric MALA (top panels) and geometric MALA with Delayed Acceptance (bottom panels): marginal chains for two arbitrary components (left), estimated marginal posterior density for an arbitrary component (middle), 1D chain trace evaluating mixing (right).

5.3. Mixture Model.

5.3.1. Non-Informative inference on a Mixture Model: Consider a standard mixture model (MacLachlan and Peel,2000) with a fixed number of components

k X i=1 wif (x|θi) , with k X i=1 wi= 1 . (7)

(24)

This standard setting nonetheless offers a computational challenge in that the ref-erence objective Bayesian approach based on the Fisher information and the asso-ciated Jeffreys prior (Jeffreys,1939;Robert,2001) is not readily available for com-putational reasons and has thus not been implemented so far. Proxys using Jeffreys priors on the components of (7) have been proposed instead, with the drawback that since they always lead to improper posteriors, ad hoc corrections have to be implemented (Diebolt and Robert,1994; Roeder and Wasserman, 1997; Stephens,

1997).

When relying instead on dependent improper priors, it is not always the case that the posterior distribution is improper. For instance,Robert and Titterington

(1998) provide a location-scale representation that allows for some improper prior. In the current paper, we consider instead the genuine Jeffreys prior for the complete set of parameters in (7), derived from the Fisher information matrix for the whole model. While establishing the analytical properness of the associated posterior is beyond the goal of the current paper, we handle large enough samples to posit that a sufficient number of observations is allocated to each component and hence the likelihood function dominates the prior distribution. (In the event the poste-rior remains improper, the associated MCMC algorithm should exhibit a transient behaviour.)

Therefore, this is an appropriate and realistic example for evaluating Delayed Acceptance since the computation of the prior density is clearly costly, relying on many integrals of the form:

− Z X ∂2_loghPk i=1wif (x|θi) i ∂θh∂θj " k X i=1 wif (x|θi) # dx . (8)

Indeed, these integrals cannot be computed analytically and thus their derivation involve numerical or Monte Carlo integration. This setting is such that the prior ratio—as opposed to the more common case of the likelihood ratio—is the costly part of the target evaluated in the Metropolis–Hastings acceptance ratio. More-over, since the Jeffreys prior involves a determinant, there is no easy way to split the computation in more parts than “prior × likelihood”. Hence, the Delayed Ac-ceptance algorithm can be applied by simply splitting between the prior pJ_{(ψ) and}

the likelihood `(ψ|x) ratios, the latter being computed first. Moreover, since the proposed prior is “non informative”, its influence on the definition of the poste-rior distribution should be small with respect to the likelihood function and, then, computing the likelihood ratio first should not have a substantial impact on the acceptance rate. However, the improper nature of the prior means using a second acceptance ratio solely based on the prior can create trapping states in practice, even though the method remains theoretically valid. We therefore opted for stabil-ising this second step by saving a small fraction of the likelihood, corresponding to 5% of the sample, to regularise this second acceptance ratio. This choice translates into Algorithm2.

5.3.2. Simulation study: An experiment comparing a standard Metropolis–Hastings implementation with a Metropolis–Hastings version relying on Delayed Acceptance is summarised in Table3. Data were simulated from the following Gaussian mixture model:

(25)

Algorithm 2 Metropolis–Hastings with Delayed Acceptance for Mixture Models Set `2(·|x) = bpnc Q i=1 `(·|xi) and `1(·|x) = n Q i=bpnc+1 `(·|xi) where p ∈ (0, 1) 1. Simulate ψ0 _{∼ q(ψ}0_|ψ);

2. Simulate u1, u2∼ U (0, 1) and set λ1= u1`1(ψ|x);

3. if `1(ψ0|x) ≤ λ1, repeat the current parameter value and return to 1;

else set λ2= u2`2(ψ|x)pJ(ψ);

4. if `2(ψ0|x)pJ(ψ0) ≥ λ2 accept ψ0;

else repeat the current parameter value and return to 1.

Algorithm ESS (av.) (sd) ESJD (av.) (sd) time (av.) (sd)

MH 1575.96 245.96 0.226 0.44 513.95 57.81

MH + DA 628.77 87.86 0.215 0.45 42.22 22.95

Table 3. Comparison using different performance indicators in the example of mixture estimation, based on 100 replicas of the ex-periments according to model (9) with a sample size n = 500, 105

MH simulations and 500 samples for the prior estimation. (“ESS” is the effective sample size,“time” is the computational time). The actual averaged gain (ESSDA/ESSM H

timeDA/timeM H) is 9.58, higher than the “dou-ble average” that the ta“dou-ble above suggests as being around 5.

Both the standard Metropolis–Hastings and the Delayed Acceptance version are adapted against their respective optimal acceptance rate, which is computed to be 2%, given that δ is empirically established to be 0.01 using 500 samples for the Monte Carlo estimation of the prior. As a consequence the MH+DA algorithm will produce less unique samples in the total 105 iterations of the chain, as reflected in the lesser ESS in Table3, but this is counterbalanced by the impressive decrease in computing time, leading again to an overall gain in terms of ESS/t of about 9.

6. Conclusion. We introduced in this paper Delayed Acceptance, a generic and easily implemented modification of the standard Metropolis–Hastings algorithm that splits the acceptance rate into more than one step in order to increase the computational efficiency of the resulting MCMC, under the sole condition that the Metropolis–Hastings ratio can be factorised this way.

The choice of splitting the target distribution into parts ultimately depends on the respective costs of computing the said parts and of reducing theoretically the overall acceptance rate and expected square jump distance (ESJD). Still, this generic alternative to the standard Metropolis–Hastings approach can be considered on a customary basis, since it both requires very little modification in programming and can be easily tested against the basic version, both empirically and theoretically by the results of Section 2. The Delayed Acceptance algorithm presented in Sec-tion1can significantly decrease the computational time per se as well as the overall acceptance rate. Nevertheless, the examples presented in Section5suggest that the gain in terms of computational time is not linear in the reduction of the acceptance rate, especially in the presence of optimisation techniques like in Section3.

(26)

Furthermore, our Delayed Acceptance algorithm does naturally merge with the widening range of prefetching techniques, in order to make use of parallelisation and reduce the overall computational time even more significantly. Most settings of interest are open to take advantage of the proposed method, if mostly in the situation of Bayesian statistics where the target density and/or the Metropolis– Hastings ratio always allow for a natural factorisation. The case when the likelihood function can be factorised in an useful way represents the best gain brought by our solution, in terms of computational time, and it may easily improve even more by exploiting parallelisation techniques.

Acknowledgements. The authors are grateful to the reviewers for their comments and suggestions. Thanks to Christophe Andrieu for a very helpful discussion on an earlier version of the manuscript. The massive help provided by Jean-Michel Marin and Pierre Pudlo towards an implementation on a large cluster has been fundamental in the completion of this work. Thanks to Samuel Livingstone for suggesting the Geometric MALA example and also thanks to Philip Nutzman for an interesting conversation including the suggestion of the method proposed in (4.2). Christian P. Robert research was partly financed by Agence Nationale de la Recherche (ANR, 212, rue de Bercy 75012 Paris) on the 2012–2015 ANR-11-BS01-0010 grant and by a 2016–2021 senior chair grant of Institut Universitaire de France. Marco Banterle’s PhD was funded by Universit´e Paris Dauphine.

REFERENCES.

C. Andrieu, A. Lee and M. Vihola et al., Uniform ergodicity of the iterated condi-tional SMC and geometric ergodicity of particle Gibbs samplers (2018). Bernoulli, 24 842–872.

E. Angelino, E. Kohler, A. Waterland, M. Seltzer and R. Adams, Accelerating MCMC via parallel predictive prefetching (2014). UAI’14 Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence 22–31 .

R. Bardenet, A. Doucet and C. Holmes, On Markov chain Monte Carlo methods for tall data (2017). The Journal of Machine Learning Research, 18 1515–1557. A. Brockwell, Parallel Markov chain Monte Carlo simulation by pre-fetching (2006).

J. Comput. Graphical Stat., 15 246–261.

J. Christen and C. Fox, Markov chain Monte Carlo using an approximation (2005). Journal of Computational and Graphical Statistics, 14 795–810.

O.F. Christensen, G.O. Roberts and J.S. Rosenthal, Scaling limits for the tran-sient phase of local Metropolis–Hastings algorithms (2005). Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67 253–268.

R. Cornish, P. Vanetti, A. Bouchard-Cˆot´e, G. Deligiannidis and A. Doucet, Scal-able Metropolis-Hastings for exact Bayesian inference with large datasets (2019). arXiv preprint arXiv:1901.09881.

L. Devroye, Nonuniform random variate generation (1986). Springer-Verlag, New York.

J. Diebolt and C.P. Robert, Estimation of finite mixture distributions by Bayesian sampling (1994). Journal of the Royal Statistical Society: Series B (Statistical Methodology), 56 363–375.

C. Fox and G. Nicholls, Sampling conductivity images via MCMC (1997). The Art and Science of Bayesian Image Analysis 91–100.

(27)

A. Gelfand and S. Sahu, On Markov chain Monte Carlo acceleration (1994). J. Comput. Graph. Statist., 3 261–276.

M. Girolami and B. Calderhead, Riemann manifold Langevin and Hamiltonian Monte Carlo methods (2011). Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73 123–214.

A. Golightly, D. Henderson and C. Sherlock, Delayed acceptance particle MCMC for exact inference in stochastic kinetic models (2015). Statistics and Computing, 25 1039–1055.

H. Jeffreys, Theory of Probability (1939). 1st ed. The Clarendon Press, Oxford. A. Korattikara, Y Chen and M. Welling, Austerity in MCMC land: Cutting the

Metropolis-Hastings budget (2014). In ICML 2014, International Conference on Machine Learning. 181–189.

G. MacLachlan and D. Peel, Finite Mixture Models (2000). John Wiley, New York. K.L. Mengersen and R. Tweedie, Rates of convergence of the Hastings and

Metrop-olis algorithms (1996). Ann. Statist., 24 101–121.

P. Neal and G.O. Roberts, Optimal scaling of random walk Metropolis algorithms with non–Gaussian proposals (2011). Methodology and Computing in Applied Probability, 13 583–601.

R. Neal, Markov chain Monte Carlo methods based on ‘slicing’ the density function (1997). Tech. rep., University of Toronto.

W. Neiswanger, C. Wang and E. Xing, Asymptotically exact, embarrassingly par-allel MCMC (2013). arXiv preprint arXiv:1311.4780.

P. Peskun, Optimum Monte Carlo sampling using Markov chains (1973). Biometrika, 60 607–612.

M. Plummer, N. Best, K. Cowles and K. Vines, CODA: Convergence diagnosis and output analysis for MCMC (2006). R News, 6 7–11.

C.P. Robert, The Bayesian Choice (2001). 2nd ed. Springer-Verlag, New York. C.P. Robert and G. Casella, Monte Carlo Statistical Methods (2004). 2nd ed.

Springer-Verlag, New York.

C.P. Robert and D.M. Titterington, Reparameterisation strategies for hidden Markov models and Bayesian approaches to maximum likelihood estimation (1998). Statistics and Computing, 8 145–158.

G.O. Roberts, A. Gelman and W.R. Gilks, Weak convergence and optimal scaling of random walk Metropolis algorithms (1997). Ann. Appl. Probab., 7 110–120. G.O. Roberts and J.S. Rosenthal, Optimal scaling for various Metropolis-Hastings

algorithms (2001). Statist. Science, 16 351–367.

G.O. Roberts and J.S. Rosenthal, Coupling and ergodicity of adaptive MCMC (2005). J. Applied Proba., 44 458–475.

G.O. Roberts O. Stramer, Langevin diffusions and Metropolis–Hastings algorithms (2002). Methodology and Computing in Applied Probability, 4 337–358.

G.O. Roberts and R. Tweedie, Geometric convergence and central limit theorems for multidimensional Hastings and Metropolis algorithms (1996). Biometrika, 83 95–110.

K. Roeder and L. Wasserman, Practical Bayesian density estimation using mixtures of Normals (1997). J. American Statist. Assoc., 92 894–902.

S. Scott, A. Blocker, F. Bonassi, M. Chipman, E. George and R. McCulloch, Bayes and big data: The consensus Monte Carlo algorithm (2013). EFaBBayes 250 conference, 16.

(28)

C. Sherlock, A. Golightly and D.A. Henderson, Adaptive, delayed-acceptance MCMC for targets with expensive likelihoods (2017). Journal of Computational and Graphical Statistics, 26 434–444.

C. Sherlock and G.O. Roberts, Optimal scaling of the random walk Metropolis on elliptically symmetric unimodal targets (2009). Bernoulli, 15 774–798.

C. Sherlock, A. Thiery and A. Golightly, Efficiency of delayed-acceptance random walk Metropolis algorithms (2015). arXiv preprint arXiv:1506.08155.

C. Sherlock, A. Thiery, G.O. Roberts and J.S. Rosenthal, On the efficiency of pseudo-marginal random walk Metropolis algorithms (2015). The Annals of Sta-tistics, 43 238–275.

A.Y. Shestopaloff and R.M. Neal, MCMC for non-linear state space models using ensembles of latent sequences (2013). arXiv preprint arXiv:1305.0320.

M. Stephens, Bayesian Methods for Mixtures of Normal Distributions (1997). Ph.D. thesis, University of Oxford.

I. Strid, Efficient parallelisation of Metropolis–Hastings algorithms using a prefetch-ing approach (2010). Computational Statistics & Data Analysis, 54 2814–2835. L. Tierney, A note on Metropolis-Hastings kernels for general state spaces (1998).

Ann. Appl. Probab., 8 1–9.

L. Tierney and A. Mira, Some adaptive Monte Carlo methods for Bayesian inference (1998). Statistics in Medicine, 18 2507–2515.

X. Wang and D. Dunson, Parallel MCMC via Weierstrass sampler (2013). arXiv preprint arXiv:1312.4605.

Received February 2, 2019; revised March 19, 2019.

E-mail address: [email protected] E-mail address: [email protected] E-mail address: [email protected] E-mail address: [email protected]