5.3 Markov Chains and the Metropolis–Hastings Algorithm
5.3.2 The Metropolis–Hastings Algorithm
The Metropolis–Hastings algorithm (Metropolis et al., 1953) has been listed as one of the top 10 algorithms with the greatest influence on the development and practice of science and engineering in the 20th century (Cipra, 2000; Andrieu et al., 2003). The algorithm belongs to the class of Markov Chain Monte Carlo approaches for the simulation of a probability distribution. The approach developed in this chapter relies on the Metropolis–Hastings algorithm to perform the sampling from a distribution of structures conditioned on their label being equal to that of the target property.

Algorithm 5.2 Metropolis–Hastings
Input: target density function $\pi$, transition kernel $T$, initial state $x_0$, number of steps $n$
Output: sample $x$ from $\pi$
  $x \leftarrow x_0$
  for $t = 1, 2, \ldots, n$ do
    $x_t \sim T(x_{t-1} \to \cdot)$ and $u \sim U[0, 1]$
    if $u < \min\left\{ \frac{\pi(x_t)}{\pi(x_{t-1})} \cdot \frac{T(x_t \to x_{t-1})}{T(x_{t-1} \to x_t)},\ 1 \right\}$ then $x \leftarrow x_t$ end if
  end for

Algorithm 5.2 is a pseudo-code description of the approach. The algorithm takes as input a target density function $\pi$ specified up to a normalization constant, the proposal generator given by a transition kernel $T$, an instance $x_0$ from the state space $\mathcal{X}$ as the initial state, and the number of Markov chain steps $n$. In the first step of each iteration, the transition kernel is used to sample a candidate state from the corresponding conditional density function, conditioned on the current state of the chain (i.e., either the initial state provided as input to the algorithm or the state visited in the previous iteration). Following this, the chain makes a transition to the sampled candidate state with the acceptance probability
\[ \min\left\{ \frac{\pi(x)}{\pi(x_{t-1})} \cdot \frac{T(x \to x_{t-1})}{T(x_{t-1} \to x)},\ 1 \right\}, \]
where $x$ denotes the sampled candidate and $x_{t-1}$ the current state of the chain. The chain iterates for $n$ steps and the last accepted state is returned as an approximate sample from the target density function $\pi$. To ensure that the sample indeed follows the target distribution, the number of steps needs to be sufficiently large so that the chain forgets the initial state and moves away from the stationary distribution of the proposal generator to the target density function $\pi$.
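As a rough illustration of the loop described above, the following Python sketch runs the chain on a small discrete state space. The names `metropolis_hastings`, `propose`, and `kernel`, the four-state target, and the uniform proposal are assumptions made for this example only and are not part of the approach developed in this chapter. The candidate is drawn conditioned on the current state of the chain, as in the description above.

```python
import random


def metropolis_hastings(pi, propose, kernel, x0, n, rng):
    """Run n Metropolis-Hastings steps and return the last accepted state.

    pi      -- target density, known up to a normalization constant
    propose -- propose(x, rng) draws a candidate from T(x -> .)
    kernel  -- kernel(x, y) evaluates the proposal density T(x -> y)
    """
    x = x0
    for _ in range(n):
        candidate = propose(x, rng)
        u = rng.random()  # u ~ U[0, 1)
        # Ratio pi(candidate) T(candidate -> x) / (pi(x) T(x -> candidate)).
        ratio = (pi(candidate) * kernel(candidate, x)) / (pi(x) * kernel(x, candidate))
        if u < min(ratio, 1.0):
            x = candidate
    return x


# Hypothetical toy problem: four states with unnormalized weights 1, 2, 3, 4.
WEIGHTS = [1.0, 2.0, 3.0, 4.0]


def toy_pi(x):
    return WEIGHTS[x]


def toy_propose(x, rng):
    # Uniform proposal over all states (ignores x, positive on supp(pi)).
    return rng.randrange(len(WEIGHTS))


def toy_kernel(x, y):
    return 1.0 / len(WEIGHTS)


def empirical_distribution(num_chains, n, seed=0):
    """Run independent chains and return the frequencies of their final states."""
    rng = random.Random(seed)
    counts = [0] * len(WEIGHTS)
    for _ in range(num_chains):
        counts[metropolis_hastings(toy_pi, toy_propose, toy_kernel, 0, n, rng)] += 1
    return [c / num_chains for c in counts]
```

With a sufficiently large number of steps, a call such as `empirical_distribution(4000, 30)` should return frequencies close to the normalized weights 0.1, 0.2, 0.3, 0.4, up to Monte Carlo error.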
Having described the Metropolis–Hastings algorithm, we proceed to review the theoretical properties of the corresponding chain such as ergodicity and convergence. These properties are mainly determined by the choice of the transition kernel defining a proposal generator. For instance, if there exists a subset $A \subset \mathcal{X}$ such that $\pi(A) > 0$ together with $T(x \to x') = 0$ for all $x \in \mathcal{X}$ and any $x' \in A$, then the target density $\pi$ is not the stationary distribution of the Markov chain generated using the Metropolis–Hastings algorithm. The latter can be seen by observing that the chain never visits the set $A$. Thus, a minimal necessary condition for convergence is that
\[ \operatorname{supp}(\pi) \subseteq \bigcup_{x \in \mathcal{X}} \operatorname{supp} T(x \to \cdot). \]
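For a finite state space, the support condition can be checked directly. The sketch below is only an illustration: the function name `uncovered_states` and both proposal matrices are hypothetical. It lists the states with positive target mass that no state can ever propose, i.e., the elements of a set $A$ as in the example above.

```python
def uncovered_states(pi, T):
    """Return the states in supp(pi) that the proposal kernel can never generate.

    pi -- list of target probabilities over states 0..k-1
    T  -- proposal matrix with T[x][y] = T(x -> y)
    """
    k = len(pi)
    return [y for y in range(k)
            if pi[y] > 0.0 and all(T[x][y] == 0.0 for x in range(k))]


pi = [0.5, 0.3, 0.2]

# Hypothetical proposal that never generates state 2, although pi(2) > 0.
T_bad = [[0.5, 0.5, 0.0],
         [0.5, 0.5, 0.0],
         [0.5, 0.5, 0.0]]

# Hypothetical proposal that is positive everywhere.
T_ok = [[1.0 / 3.0] * 3 for _ in range(3)]

bad = uncovered_states(pi, T_bad)  # [2]
ok = uncovered_states(pi, T_ok)    # []
```

An empty result means that the necessary condition holds for the given pair of target and proposal.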
Assuming that this condition is satisfied, it can be shown that the transition kernel of the Metropolis–Hastings chain satisfies the detailed balance condition with the density function $\pi$. The following proposition is a formal statement of the result.
Proposition 5.9. (Robert and Casella, 2005, Theorem 7.2) Suppose that $T$ is a transition kernel whose support contains that of a target density function $\pi$. Let $\{x_t\}_{t \in \mathbb{N}}$ be a Markov chain generated using the Metropolis–Hastings algorithm with $\pi$ as the target density function and $T$ as the transition kernel of the proposal generator. The transition kernel of the Metropolis–Hastings chain satisfies the detailed balance condition with the target density function $\pi$.
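Proof. Let $M$ be the transition kernel of the Metropolis–Hastings chain. Then, it holds that
\[ M(x \to x') = a(x, x')\, T(x \to x') + \delta_x(x')\, r(x), \]
where $a(x, x')$ is the acceptance probability of a transition from state $x$ to $x'$, $\delta_x$ is the Dirac mass in $x$, and $r(x) = \sum_{z \in \mathcal{X}} T(x \to z)\,(1 - a(x, z))$. The first term in this transition kernel can be transformed as
\[ a(x, x')\, T(x \to x') = \min\left( T(x \to x'),\ \frac{\pi(x')\, T(x' \to x)}{\pi(x)} \right) = \frac{\pi(x')}{\pi(x)}\, T(x' \to x)\, a(x', x). \]
The second term is nonzero only for $x = x'$, so that $\pi(x)\, \delta_x(x')\, r(x) = \pi(x')\, \delta_{x'}(x)\, r(x')$. Thus, we have that the detailed balance condition holds for all $x, x' \in \mathcal{X}$, i.e.,
\[ \pi(x)\, M(x \to x') = \pi(x')\, T(x' \to x)\, a(x', x) + \pi(x')\, \delta_{x'}(x)\, r(x') = \pi(x')\, M(x' \to x). \]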
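The detailed balance identity can also be checked numerically on a finite state space. The following sketch builds the Metropolis–Hastings transition matrix with off-diagonal entries $a(x, x')\, T(x \to x')$ and the remaining mass on the diagonal, and then measures how far $\pi(x) M(x \to x')$ is from $\pi(x') M(x' \to x)$. The function names, the target weights, and the proposal matrix are hypothetical choices made for illustration.

```python
def mh_transition_matrix(pi, T):
    """Build the Metropolis-Hastings transition matrix M on states 0..k-1.

    pi -- positive target weights (normalization is not required)
    T  -- row-stochastic proposal matrix with T[x][y] = T(x -> y)
    """
    k = len(pi)
    M = [[0.0] * k for _ in range(k)]
    for x in range(k):
        for y in range(k):
            if y != x and T[x][y] > 0.0:
                accept = min(1.0, (pi[y] * T[y][x]) / (pi[x] * T[x][y]))
                M[x][y] = accept * T[x][y]
        # The diagonal collects accepted self-proposals and all rejections.
        M[x][x] = 1.0 - sum(M[x])
    return M


def detailed_balance_gap(pi, M):
    """Largest violation of pi(x) M(x -> y) = pi(y) M(y -> x)."""
    k = len(pi)
    return max(abs(pi[x] * M[x][y] - pi[y] * M[y][x])
               for x in range(k) for y in range(k))


def stationarity_gap(pi, M):
    """Largest violation of sum_x pi(x) M(x -> y) = pi(y)."""
    k = len(pi)
    return max(abs(sum(pi[x] * M[x][y] for x in range(k)) - pi[y])
               for y in range(k))


# Hypothetical example: unnormalized target and a non-symmetric proposal.
pi_weights = [1.0, 2.0, 3.0, 4.0]
T_prop = [[0.10, 0.40, 0.30, 0.20],
          [0.25, 0.25, 0.25, 0.25],
          [0.50, 0.10, 0.10, 0.30],
          [0.20, 0.20, 0.50, 0.10]]
M = mh_transition_matrix(pi_weights, T_prop)
```

Both `detailed_balance_gap(pi_weights, M)` and `stationarity_gap(pi_weights, M)` should be zero up to floating-point rounding, the latter because detailed balance implies that $\pi$ is stationary for $M$.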
Now, from Theorem 5.8 it follows that the Metropolis–Hastings chain is uniformly ergodic if the transition kernel of the chain is aperiodic and $\pi$-irreducible. A sufficient condition for the chain to be aperiodic is that the transition kernel allows events $\{x_{t+1} = x_t\}$ with positive probability. More specifically, the Metropolis–Hastings chain is aperiodic if the acceptance probability, $a(x, x')$, satisfies
\[ P\left( a(x, x') \geq 1 \right) < 1. \]
This condition implies that the transition kernel $T$ corresponding to the proposal generator is not the transition kernel of a Markov chain with the stationary density function $\pi$. The latter is reasonable in the sense that if the proposal kernel already has $\pi$ as its stationary distribution, then there is no point in perturbing it with the Metropolis–Hastings algorithm. A sufficient condition for the $\pi$-irreducibility of the Metropolis–Hastings chain is that the transition kernel of the proposal generator is positive on the support of $\pi$, i.e.,
\[ T(x \to x') > 0 \quad \text{for all } x, x' \in \operatorname{supp}(\pi). \]
Proposition 5.10. (Robert and Casella, 2005, Theorem 7.4) Suppose that a transition kernel $T$ defined on a discrete state space is positive on the support of a target density function $\pi$. Assume also that $\pi$ is not the stationary distribution of the Markov chain corresponding to $T$. Then, the Markov chain generated using the Metropolis–Hastings algorithm with $\pi$ as the target density function and $T$ as the transition kernel of the proposal generator is uniformly ergodic.
Of particular interest to the algorithm proposed in this chapter is an instance of the Metropolis–Hastings algorithm where the transition kernel of a proposal generator is independent of the previous states, i.e., $T(x \to x') = T(x')$. This instance of the algorithm is called the independent Metropolis–Hastings algorithm and the following theorem provides a sufficient condition for the algorithm to produce a uniformly ergodic Markov chain.

Theorem 5.11. (Mengersen and Tweedie, 1996; Robert and Casella, 2005) The independent Metropolis–Hastings algorithm produces a uniformly ergodic Markov chain if there exists a constant $c > 1$ such that $\pi(x) < c\, T(x)$ for all $x \in \operatorname{supp}(\pi)$. In this case, for all $x \in \mathcal{X}$,
\[ \left\| P(x_n \mid x_0 = x) - \pi \right\|_{TV} \leq 2 \left( 1 - \frac{1}{c} \right)^n, \]
where $\|\cdot\|_{TV}$ denotes the total variation norm.
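The bound can be compared against the exact $n$-step distribution on a small example. The sketch below is an illustration under stated assumptions: the four-state target, the uniform proposal `q`, and all function names are hypothetical, and the total variation norm is computed as the sum of absolute differences (without a factor of 1/2), which is the convention assumed here to match the constant 2 in the bound.

```python
def independent_mh_matrix(pi, q):
    """Transition matrix of independent Metropolis-Hastings with proposal q."""
    k = len(pi)
    M = [[0.0] * k for _ in range(k)]
    for x in range(k):
        for y in range(k):
            if y != x:
                accept = min(1.0, (pi[y] * q[x]) / (pi[x] * q[y]))
                M[x][y] = accept * q[y]
        # The diagonal collects accepted self-proposals and all rejections.
        M[x][x] = 1.0 - sum(M[x])
    return M


def tv_norm_after(M, pi, start, n):
    """Return sum_y |P(x_n = y | x_0 = start) - pi(y)| for a normalized pi."""
    k = len(pi)
    dist = [1.0 if y == start else 0.0 for y in range(k)]
    for _ in range(n):
        dist = [sum(dist[x] * M[x][y] for x in range(k)) for y in range(k)]
    return sum(abs(d - p) for d, p in zip(dist, pi))


pi = [0.1, 0.2, 0.3, 0.4]           # normalized target
q = [0.25, 0.25, 0.25, 0.25]        # proposal T(x') that ignores the current state
c = 1.01 * max(p / w for p, w in zip(pi, q))  # some c > 1 with pi(x) < c q(x)

M = independent_mh_matrix(pi, q)
bound_holds = all(
    tv_norm_after(M, pi, start, n) <= 2.0 * (1.0 - 1.0 / c) ** n + 1e-12
    for start in range(len(pi))
    for n in range(1, 21)
)
```

A proposal that matches the target more closely allows a constant $c$ closer to 1 and hence a faster geometric rate in the bound.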
Having reviewed the Metropolis–Hastings algorithm, we proceed to investigate two properties characteristic of Algorithm 5.1: the consistency of the approach and the behavior of the Metropolis–Hastings algorithm for drawing samples from $p(x \mid y^{*}, \theta)$.
5.4 Theoretical Analysis
In this section, we first show that Algorithm 5.1 is consistent and then analyze the mixing time of an independent Metropolis–Hastings chain for sampling from the posterior $p(x \mid y^{*}, \theta)$. The section concludes with a method for handling large importance weights that can occur in Algorithm 5.1 while performing the weighted maximum a posteriori estimation.