
5.4.2 Mixing Time Analysis

Having shown the consistency of Algorithm 5.1, we proceed to bound the mixing time of the Metropolis–Hastings chain. For that, we consider an independent proposal generator $G$ and provide a simple coupling analysis (Vembu et al., 2009) to bound the worst case mixing time of an independent Metropolis–Hastings chain for sampling from the posterior $p(x \mid y^*, \theta_t)$.

We start by formally defining the coupling of two random processes and then provide a result by Aldous (1983) that relates the coupling and worst case mixing time of a Markov chain.

Definition 5.5. Let $M$ be a finite, ergodic Markov chain defined on a state space $\Omega$ with transition probabilities $p(x \to x')$. A coupling is a joint process $(A, B) = (A_t, B_t)$ on $\Omega \times \Omega$ such that each of the processes $A$, $B$, considered marginally, is a faithful copy of $M$.

The following result by Aldous (1983) allows us to utilize perfect sampling algorithms such as coupling from the past (Propp and Wilson, 1996) to draw samples from the posterior. In particular, suppose $|X|$ parallel and identical chains are started from all possible states $x \in X$ and an identical random bit sequence is used to simulate all the chains. Then, whenever two chains move to a common state, all future transitions of the two chains are the same, and from that point on it is sufficient to track only one of them. This is called coalescence (Huber, 1998). Propp and Wilson (1996) have shown that if all the chains were started at time $-T$ and have coalesced to a single chain at step $-T'$ with $T > T' > 0$, then samples drawn at time 0 are exact samples from the stationary distribution. The following lemma embodies this principle and is crucial for our bound on the worst case mixing time for sampling from the posterior distribution of structures using an independent Metropolis–Hastings chain.

Lemma 5.17. (Aldous, 1983) Let $M$ be a finite, ergodic Markov chain, and let $(A_t, B_t)$ be a coupling for $M$. Suppose that $P(A_{t(\varepsilon)} \ne B_{t(\varepsilon)}) \le \varepsilon$, uniformly over the choice of initial state $(A_0, B_0)$. Then the mixing time $\tau(\varepsilon)$ of $M$ (starting at any state) is bounded from above by $t(\varepsilon)$.
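The coalescence idea can be illustrated with a small simulation. The following is only a sketch under stated assumptions: the four-state space, the unnormalised weights `target`, and the uniform `proposal` are hypothetical, and the chains are run forward from time 0 rather than from the past, so it shows coalescence under shared randomness and not the full Propp–Wilson procedure.

```python
import random

def imh_step(state, proposed, u, target, proposal):
    """One independent Metropolis-Hastings update driven by shared randomness.

    `proposed` is the state drawn from the proposal and `u` is a uniform
    draw in [0, 1); both are shared by all coupled chains.
    """
    # Importance weights w(x) = target(x) / proposal(x).
    w_new = target[proposed] / proposal[proposed]
    w_old = target[state] / proposal[state]
    accept = min(1.0, w_new / w_old)
    return proposed if u < accept else state

def coalescence_time(target, proposal, rng, max_steps=10_000):
    """Run one chain per start state with identical random inputs and return
    the first step at which all chains occupy the same state."""
    states = list(range(len(target)))  # one chain per initial state
    for step in range(1, max_steps + 1):
        proposed = rng.choices(range(len(proposal)), weights=proposal)[0]
        u = rng.random()
        states = [imh_step(s, proposed, u, target, proposal) for s in states]
        if len(set(states)) == 1:
            return step
    return None  # no coalescence within max_steps

# Hypothetical unnormalised target and normalised proposal on four states.
target = [1.0, 2.0, 3.0, 4.0]
proposal = [0.25, 0.25, 0.25, 0.25]
t = coalescence_time(target, proposal, random.Random(0))
```

In this toy setting all chains accept simultaneously whenever the state with the largest importance weight is proposed, which is why coalescence happens after a geometrically distributed number of steps.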

The following proposition gives a worst case bound on the mixing time of an independent Metropolis–Hastings chain for sampling from the posterior distribution $p(x \mid y^*, \theta_t)$.

Proposition 5.18. For all $0 < \varepsilon < 1$, the mixing time $\tau(\varepsilon)$ of an independent Metropolis–Hastings chain for sampling from the posterior distribution $p(x \mid y^*, \theta_t)$ is bounded from above by

$$\left\lceil \frac{\ln \varepsilon}{\ln\left(1 - \exp(-4 r \|\theta_t\|)\right)} \right\rceil.$$
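Read as $\lceil \ln \varepsilon / \ln(1 - \exp(-4 r \|\theta_t\|)) \rceil$, the bound of Proposition 5.18 is easy to evaluate numerically. The helper below is a minimal sketch; the function name and its arguments `r` and `theta_norm` (standing for $r$ and $\|\theta_t\|$) are illustrative assumptions, not part of Algorithm 5.1.

```python
import math

def mixing_time_bound_5_18(eps, r, theta_norm):
    """Evaluate ceil(ln(eps) / ln(1 - exp(-4 * r * theta_norm)))."""
    if not 0.0 < eps < 1.0:
        raise ValueError("eps must lie in (0, 1)")
    if r <= 0.0 or theta_norm <= 0.0:
        raise ValueError("r and theta_norm must be positive")
    q = math.exp(-4.0 * r * theta_norm)  # lower bound on the transition probability
    # log1p(-q) computes ln(1 - q) accurately when q is tiny.
    return math.ceil(math.log(eps) / math.log1p(-q))

bound = mixing_time_bound_5_18(eps=0.01, r=1.0, theta_norm=1.0)
```

Because $\exp(-4 r \|\theta_t\|)$ decays exponentially in $r \|\theta_t\|$, the bound grows very quickly as the parameter norm increases.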

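Proof. As $\min_{x \in X} p(y^* \mid x, \theta_t) \le \max_{x \in X} p(y^* \mid x, \theta_t)$, the lower bound on the Metropolis–Hastings acceptance criterion is never greater than 1. Then, from Eqs. (5.1) and (5.2) it follows that, for a finite space $Y$, the transition probability from a state $x$ to a state $x'$ satisfies

$$p(x \to x') \ge \frac{\exp\left(\langle \phi(x', y^*), \theta_t \rangle - A(\theta_t \mid x')\right)}{\exp\left(\langle \phi(x, y^*), \theta_t \rangle - A(\theta_t \mid x)\right)} = \frac{\sum_{y \in Y} \exp\left(\langle \phi(x', y^*) + \phi(x, y), \theta_t \rangle\right)}{\sum_{y \in Y} \exp\left(\langle \phi(x, y^*) + \phi(x', y), \theta_t \rangle\right)}.$$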

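Now, we can lower bound the transition probability by

$$p(x \to x') \ge \frac{|Y| \exp\left(2 \langle \phi(x_\downarrow, y_\downarrow), \theta_t \rangle\right)}{|Y| \exp\left(2 \langle \phi(x_\uparrow, y_\uparrow), \theta_t \rangle\right)} \ge \exp\left(-2 \left| \langle \phi(x_\downarrow, y_\downarrow) - \phi(x_\uparrow, y_\uparrow), \theta_t \rangle \right|\right), \tag{5.22}$$

where $\langle \phi(x_\downarrow, y_\downarrow), \theta_t \rangle$ and $\langle \phi(x_\uparrow, y_\uparrow), \theta_t \rangle$ are the minimum and maximum values of the dot product $\langle \phi(x, y), \theta_t \rangle$, respectively.

Then, using the Cauchy–Schwarz inequality, we derive

$$p(x \to x') \ge \exp\left(-2 \left\| \phi(x_\downarrow, y_\downarrow) - \phi(x_\uparrow, y_\uparrow) \right\| \|\theta_t\|\right).$$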

From our assumptions we have that $\|\theta\| \le R$ and $\|\phi(x, y)\| \le r$. Thus, it holds that

$$p(x \to x') \ge \exp(-4 r \|\theta_t\|) \ge \exp(-4 R r). \tag{5.23}$$
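The chain of inequalities leading to Eq. (5.23) can be sanity-checked numerically. The sketch below draws hypothetical feature vectors with $\|\phi(x, y)\| \le r$ and a random parameter vector, evaluates the ratio of the two sums over $y$ from the proof for every pair of states, and compares the smallest ratio with $\exp(-4 r \|\theta_t\|)$. The sizes, the random features, and the index `y_star` are all illustrative assumptions.

```python
import math
import random

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

rng = random.Random(1)
n_x, n_y, dim, r = 5, 4, 3, 1.0

def random_feature():
    """Random vector rescaled so that its Euclidean norm is at most r."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    scale = r * rng.random() / norm(v)
    return [scale * vi for vi in v]

phi = [[random_feature() for _ in range(n_y)] for _ in range(n_x)]  # phi[x][y]
theta = [rng.gauss(0.0, 1.0) for _ in range(dim)]
y_star = 0  # hypothetical index of the observed target

def ratio(x, x_new):
    """Ratio of the two sums over y that lower bounds p(x -> x_new)."""
    num = sum(math.exp(dot(phi[x_new][y_star], theta) + dot(phi[x][y], theta))
              for y in range(n_y))
    den = sum(math.exp(dot(phi[x][y_star], theta) + dot(phi[x_new][y], theta))
              for y in range(n_y))
    return num / den

lower_bound = math.exp(-4.0 * r * norm(theta))  # right-hand side of Eq. (5.23)
worst = min(ratio(x, x_new) for x in range(n_x) for x_new in range(n_x))
```

In such experiments `worst` is typically far above `lower_bound`, which reflects how conservative the worst case analysis is.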

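From Eq. (5.23) it follows that the probability of not coalescing for $T$ steps is upper bounded by $\left(1 - \exp(-4 r \|\theta_t\|)\right)^T$. Then for $t(\varepsilon) = \left\lceil \ln \varepsilon / \ln\left(1 - \exp(-4 r \|\theta_t\|)\right) \right\rceil$, we have

$$P\left(A_{t(\varepsilon)} \ne B_{t(\varepsilon)}\right) \le \left(1 - \exp(-4 r \|\theta_t\|)\right)^{t(\varepsilon)} \le \varepsilon,$$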

and the result follows from the coupling lemma (e.g., see Lemma 5.17 or Aldous, 1983).

The bound from Proposition 5.18 does not exploit the fact that the posterior distribution can be related to the stationary distribution of the proposal generator used in the Metropolis–Hastings sampler. The following bound uses this information and gives a significantly better estimate of the worst case mixing of an independent Metropolis–Hastings chain for sampling from $p(x \mid y^*, \theta_t)$. In fact, the chain mixes in sublinear time expressed as a function of the approximation quality $\varepsilon > 0$.

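Proposition 5.19. The mixing time $\tau(\varepsilon)$ of an independent Metropolis–Hastings chain for sampling from the posterior distribution $p(x \mid y^*, \theta_t)$ is bounded from above by

$$\left\lceil \frac{\ln(2 / \varepsilon)}{\ln\left(c / (c - 1)\right)} \right\rceil, \quad \text{where} \quad c = \max_{x \in X} \frac{p(y^* \mid x, \theta_t)}{p(y^*)}.$$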

Proof. First observe that for all $x \in X$ it holds that

$$p(x \mid y^*, \theta_t) = \frac{p(y^* \mid x, \theta_t) \, \rho(x)}{p(y^*)} \le c \, \rho(x),$$

with $c \ge 1$. The result then follows from Theorem 5.11.
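For a finite state space the constant $c$ and the resulting bound can be computed directly. The sketch below reads the bound of Proposition 5.19 as $\lceil \ln(2/\varepsilon) / \ln(c/(c-1)) \rceil$ and assumes that $p(y^*) = \sum_x p(y^* \mid x, \theta_t) \rho(x)$, which is what makes the posterior in the proof sum to one. The function name and the example inputs `likelihood` and `rho` are hypothetical.

```python
import math

def mixing_time_bound_5_19(eps, likelihood, rho):
    """Evaluate ceil(ln(2 / eps) / ln(c / (c - 1))).

    likelihood[x] plays the role of p(y* | x, theta_t) and rho[x] that of the
    stationary distribution of the proposal; both are illustrative inputs.
    """
    if not 0.0 < eps < 1.0:
        raise ValueError("eps must lie in (0, 1)")
    evidence = sum(l * p for l, p in zip(likelihood, rho))  # p(y*)
    c = max(likelihood) / evidence
    if c <= 1.0:
        return 1  # posterior equals rho: every proposal is accepted
    return math.ceil(math.log(2.0 / eps) / math.log(c / (c - 1.0)))

# Hypothetical likelihood values and a uniform proposal on four states.
likelihood = [0.1, 0.2, 0.4, 0.8]
rho = [0.25, 0.25, 0.25, 0.25]
bound = mixing_time_bound_5_19(0.01, likelihood, rho)
```

The bound depends on the model only through $c$, so it stays small whenever the likelihood of the best structure is not much larger than the average likelihood under $\rho$.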

Any bound on the worst case mixing time of the Metropolis–Hastings chain with a proposal generator defined with a conditional transition kernel depends on the specifics of that kernel. Such studies of the mixing time are beyond the scope of this thesis and will be deferred to future work with specific instantiations of Algorithm 5.1. However, we note here that a simple condition can be imposed on the proposal generator such that the corresponding Metropolis–Hastings chain is uniformly ergodic. The following proposition gives a sufficient condition for the uniform ergodicity of the Metropolis–Hastings chain with a proposal generator defined with a conditional transition kernel.

Proposition 5.20. The Metropolis–Hastings chain is uniformly ergodic if $G(x \to x') > 0$ for all $x, x' \in \operatorname{supp} p(x \mid y^*, \theta_t)$.
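For a finite state space, the positivity condition of Proposition 5.20 can be checked mechanically. The sketch below assumes the proposal generator is available as a row-stochastic matrix `kernel[x][x_new]` and the posterior as a list of probabilities; both representations and the function name are illustrative assumptions.

```python
def satisfies_positivity(kernel, posterior):
    """Check G(x -> x') > 0 for all x, x' in the support of the posterior."""
    support = [x for x, p in enumerate(posterior) if p > 0.0]
    return all(kernel[x][x_new] > 0.0 for x in support for x_new in support)

# Hypothetical three-state example in which state 2 has zero posterior mass.
posterior = [0.7, 0.3, 0.0]
kernel_ok = [[0.5, 0.5, 0.0], [0.4, 0.6, 0.0], [0.0, 0.0, 1.0]]
kernel_bad = [[1.0, 0.0, 0.0], [0.4, 0.6, 0.0], [0.0, 0.0, 1.0]]
```

Here `kernel_ok` passes the check because only transitions involving the zero-mass state vanish, while `kernel_bad` fails because state 1 can never be proposed from state 0.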

For conditional exponential family models we have $p(y \mid x, \theta) > 0$, and the lower bound can be controlled with the regularization parameter. Thus, there will always be a path with non-zero probability between any two target structures. As is the case with other Metropolis algorithms, for difficult problems where clusters of targets are far apart in the search space, the mixing will be slower as the model becomes more confident.

In document Constructive Approximation and Learning by Greedy Algorithms (Page 164-166)