
5.3 Markov Chains and the Metropolis–Hastings Algorithm

5.3.1 Markov Chains

For clarity, we restrict our review to Markov chains on finite discrete spaces. A more general introduction to this class of algorithms can be found in Robert and Casella (2005). This restriction is also in line with the problem investigated in this chapter, namely the discovery of combinatorial objects with desired properties. Thus, throughout this section we assume that the state-space $\mathcal{X}$ is a discrete set with finitely many combinatorial objects.

A stochastic process $\{x_t\}_{t \in \mathbb{N}}$ is called a Markov chain if, for all $t \geq 1$, the conditional distribution of $x_t$ given $x_{t-1}, \ldots, x_0$ is the same as the distribution of $x_t$ given $x_{t-1}$, i.e.,

$$p(x_t \mid x_{t-1}, \ldots, x_0) = p(x_t \mid x_{t-1}).$$

If the initial state of the chain $x_0$ is known, then the construction of the chain is completely determined by its transition probabilities, i.e., the conditional density function $p(x_t \mid x_{t-1})$. This density function is also known as the transition kernel of the chain and in the remainder of the chapter we will denote this kernel with $T(x_{t-1} \to x_t) := p(x_t \mid x_{t-1})$.
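To make the role of the transition kernel concrete, the following minimal sketch simulates a chain on a three-state space. The kernel `T`, the helper `simulate_chain`, and the seed are illustrative assumptions and do not come from the chapter; the only point is that each new state is drawn from the row of the kernel indexed by the previous state.

```python
import random

# Hypothetical 3-state kernel: T[i][j] plays the role of T(x_i -> x_j).
# Each row is a probability distribution over the next state.
T = [
    [0.5, 0.5, 0.0],
    [0.25, 0.5, 0.25],
    [0.0, 0.5, 0.5],
]


def simulate_chain(T, x0, steps, rng):
    """Return [x_0, x_1, ..., x_steps]; x_t is drawn using only x_{t-1}."""
    states = list(range(len(T)))
    path = [x0]
    for _ in range(steps):
        current = path[-1]
        # Sample the next state from the kernel row of the current state.
        nxt = rng.choices(states, weights=T[current], k=1)[0]
        path.append(nxt)
    return path


rng = random.Random(0)  # fixed seed so that the run is reproducible
path = simulate_chain(T, x0=0, steps=10, rng=rng)
print(path)
```

Given the initial state, the trajectory is generated from the kernel alone, which is the sense in which the kernel determines the construction of the chain.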

A $\sigma$-finite probability measure $\pi$ defined on the state-space $\mathcal{X}$ is invariant for a transition kernel $T(\cdot \to \cdot)$ and the corresponding Markov chain if, for all $x' \in \mathcal{X}$, it holds that

$$\pi(x') = \sum_{x \in \mathcal{X}} T(x \to x')\, \pi(x).$$
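On a finite state-space the invariance condition is a finite system of linear equations and can be checked directly. The sketch below does so with exact rational arithmetic for a hypothetical three-state kernel; the kernel, the candidate distribution `pi`, and the helper names are assumptions made for illustration.

```python
from fractions import Fraction as F

# Hypothetical kernel, T[x][xp] = T(x -> xp), with exact rational entries.
T = [
    [F(1, 2), F(1, 2), F(0)],
    [F(1, 4), F(1, 2), F(1, 4)],
    [F(0), F(1, 2), F(1, 2)],
]
# Candidate invariant distribution for this kernel.
pi = [F(1, 4), F(1, 2), F(1, 4)]


def apply_kernel(pi, T):
    """Return the vector with entries sum_x T(x -> xp) * pi(x)."""
    n = len(T)
    return [sum(pi[x] * T[x][xp] for x in range(n)) for xp in range(n)]


def is_invariant(pi, T):
    """Check pi(xp) = sum_x T(x -> xp) * pi(x) for every state xp."""
    return apply_kernel(pi, T) == pi


print(is_invariant(pi, T))               # True
print(is_invariant([F(1, 3)] * 3, T))    # False: uniform is not invariant here
```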

A Markov chain with an invariant probability measure is stationary in distribution. To see this, observe that $x_0 \sim \pi$ implies $x_t \sim \pi$ for all $t \geq 1$. As a result, the invariant probability measure $\pi$ is also called the stationary distribution of the chain. The existence of the stationary distribution is an important stability property of a Markov chain and one of the main reasons for the popularity of Markov chain Monte Carlo methods. More specifically, for distributions that are difficult to simulate (e.g., not analytically tractable) the property enables their simulation via a corresponding Markov chain, subject to additional stability properties such as ergodicity, which is discussed subsequently.

Having introduced the notion of a stationary chain, we proceed to review the stability properties of the chain required for the existence of a unique stationary distribution. For that, let $A \subset \mathcal{X}$ and denote the first time step $t$ in which the chain enters the set $A$ with

$$\tau_A = \min\{t \geq 1 \mid t \in \mathbb{N} \wedge x_t \in A\}. \tag{5.10}$$

The time step $\tau_A$ is called the stopping time in $A$, and $\tau_A = \infty$ if $x_t \notin A$ for all $t \geq 1$. For the set $A$, denote the number of visits of the chain to $A$ with

$$\eta_A = \sum_{t=1}^{\infty} \mathbb{I}_A(x_t). \tag{5.11}$$

This quantity allows us to define the stability property measuring the expected number of visits to $A$ given an initial state of the chain $x \in \mathcal{X}$, denoted with $\mathbb{E}[\eta_A \mid x_0 = x]$. This stability measure is needed to ensure that the trajectory of the chain will visit each state often enough. To further formalize this stability property, we need to introduce the notion of state recurrence. A state $x \in \mathcal{X}$ is called recurrent if the expected number of returns to $x$ is infinite, i.e., $\mathbb{E}[\eta_x \mid x_0 = x] = \infty$, and transient otherwise. For chains with discrete state-spaces the recurrence property of a state is equivalent to the guarantee of return to that state. In other words, the recurrence of a state can be characterized with the probability of return to $x$ in a finite number of steps given the initial state $x \in \mathcal{X}$, denoted with $P(\tau_x < \infty \mid x_0 = x)$. More specifically, a state $x \in \mathcal{X}$ is recurrent if $P(\tau_x < \infty \mid x_0 = x) = 1$. To see that these two definitions are equivalent, note that for $P(\tau_x < \infty \mid x_0 = x) > 0$ we have

$$\mathbb{E}[\eta_x \mid x_0 = x] = \sum_{k=1}^{\infty} P(\eta_x \geq k \mid x_0 = x) = \sum_{k=1}^{\infty} P(\tau_x < \infty \mid x_0 = x)^k,$$

because, by the Markov property, the chain starts afresh from $x$ after each return. The series is finite when $P(\tau_x < \infty \mid x_0 = x) < 1$ and diverges when $P(\tau_x < \infty \mid x_0 = x) = 1$. The claim now follows by setting $P(\tau_x < \infty \mid x_0 = x) = 1$ or $\mathbb{E}[\eta_x \mid x_0 = x] = \infty$.
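The link between the return probability and the expected number of returns can be illustrated numerically. The sketch below is a minimal illustration under stated assumptions: the two-state kernel `T_abs` (state 1 is absorbing) and both helper functions are hypothetical, the return probability is only accumulated up to a finite horizon, and the expected number of visits uses the standard geometric-series identity $\mathbb{E}[\eta_x \mid x_0 = x] = \sum_{k \geq 1} p^k$ with $p = P(\tau_x < \infty \mid x_0 = x)$.

```python
from fractions import Fraction as F

# Hypothetical kernel: state 0 may stay or jump to state 1, state 1 is absorbing.
T_abs = [
    [F(1, 2), F(1, 2)],
    [F(0), F(1)],
]


def return_probability(T, x, horizon):
    """P(tau_x <= horizon | x_0 = x), accumulated from first-return probabilities."""
    n = len(T)
    # g[y] = P(x_t = y and no return to x at times 1, ..., t-1 | x_0 = x)
    g = list(T[x])
    total = g[x]
    for _ in range(horizon - 1):
        # Advance one step while forbidding paths that already returned to x.
        g = [sum(g[z] * T[z][y] for z in range(n) if z != x) for y in range(n)]
        total += g[x]
    return total


def expected_visits(p):
    """sum_{k >= 1} p^k for a return probability p in [0, 1]."""
    return float("inf") if p == 1 else p / (1 - p)


p0 = return_probability(T_abs, 0, horizon=50)
p1 = return_probability(T_abs, 1, horizon=50)
print(p0, expected_visits(p0))  # state 0 is transient
print(p1, expected_visits(p1))  # state 1 is recurrent
```

In this example the chain can never come back to state 0 once it has left, so the return probability is below one and the expected number of returns is finite, whereas the absorbing state returns to itself with probability one.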

Having introduced the notion of a recurrent state, we now turn our attention to a stability property that quantifies the sensitivity of the chain to initial conditions. This property will turn out to be crucial for the existence of the stationary distribution of a chain with discrete state-space. A Markov chain is irreducible if, starting from any state, it is possible to reach all states of the state-space in a finite number of steps with positive probability. More formally, a chain is irreducible if, for all $x, x' \in \mathcal{X}$, it holds that

$$P(\tau_{x'} < \infty \mid x_0 = x) > 0. \tag{5.12}$$
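On a finite state-space, condition (5.12) depends only on which transitions have positive probability, so it can be checked with a graph search. The following sketch is one possible way to do this; the kernels and function names are illustrative assumptions. Note that the search starts from the successors of a state, because the stopping time only counts time steps $t \geq 1$.

```python
from collections import deque

# Hypothetical kernels: T is a lazy walk on three states, T_abs has an absorbing state.
T = [
    [0.5, 0.5, 0.0],
    [0.25, 0.5, 0.25],
    [0.0, 0.5, 0.5],
]
T_abs = [
    [0.5, 0.5],
    [0.0, 1.0],
]


def reachable_in_positive_time(T, start):
    """States y with P(tau_y < infinity | x_0 = start) > 0."""
    n = len(T)
    seen = set()
    # Begin with the one-step successors, since tau counts steps t >= 1.
    queue = deque(y for y in range(n) if T[start][y] > 0)
    while queue:
        y = queue.popleft()
        if y in seen:
            continue
        seen.add(y)
        queue.extend(z for z in range(n) if T[y][z] > 0 and z not in seen)
    return seen


def is_irreducible(T):
    """Check that every state is reachable from every state."""
    everything = set(range(len(T)))
    return all(reachable_in_positive_time(T, x) == everything for x in range(len(T)))


print(is_irreducible(T))      # True
print(is_irreducible(T_abs))  # False: state 0 cannot be reached from state 1
```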

An equivalent definition of irreducibility requires that the chain satisfies $\mathbb{E}[\eta_{x'} \mid x_0 = x] > 0$ for all $x, x' \in \mathcal{X}$. For a given measure $\psi$ on the state-space $\mathcal{X}$, the Markov chain is $\psi$-irreducible if, for all $x' \in \mathcal{X}$ with $\psi(x') > 0$ and all $x \in \mathcal{X}$, $P(\tau_{x'} < \infty \mid x_0 = x) > 0$.

A Markov chain is called recurrent if there exists a measure $\psi$ on $\mathcal{X}$ such that the chain is $\psi$-irreducible and if all the states from the support of $\psi$ are recurrent. An irreducible Markov chain on a discrete state-space is guaranteed to have at least one recurrent state (the cardinality of the state-space is finite, whereas the trajectory of the chain consists of infinitely many time steps). The following proposition establishes the connection between irreducibility and recurrence of a chain on a discrete state-space.

Proposition 5.5. (Robert and Casella, 2005) An irreducible Markov chain defined on a discrete state-space $\mathcal{X}$ is recurrent.

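Proof. As the chain has at least one recurrent state $x^* \in \mathcal{X}$, we have that $P(\tau_{x^*} < \infty \mid x_0 = x^*) = 1$. Assume now there is a transient state $z \in \mathcal{X}$ with $P(\tau_z < \infty \mid x_0 = z) < 1$. From the irreducibility of the chain, there exist $m_1, m_2 \in \mathbb{N}$ such that $P(\tau_{x^*} = m_1 \mid x_0 = z) > 0$ and $P(\tau_z = m_2 \mid x_0 = x^*) > 0$. Thus, for all $n \in \mathbb{N}$ it holds that

$$\begin{aligned}
P(x_{m_1+m_2+n} = z \mid x_0 = z) &\geq \sum_{x \in \mathcal{X}} P(x_{m_1} = x \mid x_0 = z)\, P(x_n = x \mid x_0 = x)\, P(x_{m_2} = z \mid x_0 = x) \\
&\geq P(x_{m_1} = x^* \mid x_0 = z)\, P(x_n = x^* \mid x_0 = x^*)\, P(x_{m_2} = z \mid x_0 = x^*).
\end{aligned}$$

Here the first inequality holds because the sum only accounts for trajectories that occupy the same state at times $m_1$ and $m_1 + n$. Now, summing the last inequality over $n \in \mathbb{N}$ we deduce

$$\sum_{n=0}^{\infty} P(x_{m_1+m_2+n} = z \mid x_0 = z) \geq P(x_{m_1} = x^* \mid x_0 = z)\, P(x_{m_2} = z \mid x_0 = x^*)\, \mathbb{E}[\eta_{x^*} \mid x_0 = x^*].$$

As the state $x^*$ is recurrent, it must hold that $\mathbb{E}[\eta_{x^*} \mid x_0 = x^*] = \infty$. The latter inequality then implies that $\mathbb{E}[\eta_z \mid x_0 = z] = \infty$, which contradicts the assumption that $z$ is transient. As all the states from the state-space $\mathcal{X}$ are recurrent, the chain is also recurrent.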

We can now relate the properties of irreducibility and recurrence of a chain to the existence of the unique stationary probability measure. In particular, as the following theorem will show, for any recurrent chain there exists a unique stationary probability measure. Thus, Proposition 5.5 together with the following theorem implies that an irreducible Markov chain defined on a finite discrete state-space has a unique stationary probability measure.

Theorem 5.6. (Meyn and Tweedie, 2009; Robert and Casella, 2005) If a Markov chain is recurrent, then there exists an invariant $\sigma$-finite measure which is unique up to a multiplicative factor.

An alternative constraint can also be imposed on the transition kernel of a chain to ensure the existence of a stationary probability measure. More specifically, the detailed balance condition, formally defined below, is a sufficient but not necessary condition for the existence of a unique stationary probability measure of a Markov chain. Moreover, when designing chains, this condition is often easier to impose than recurrence or irreducibility.

Definition 5.3. A Markov chain with transition kernel $T(\cdot \to \cdot)$ satisfies the detailed balance condition if there exists a function $\pi$ satisfying, for all $x, x' \in \mathcal{X}$,

$$T(x \to x')\, \pi(x) = T(x' \to x)\, \pi(x'). \tag{5.13}$$
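Condition (5.13) is a pairwise check and is straightforward to verify on a finite state-space. The sketch below, with hypothetical kernels and helper names, tests it with exact rational arithmetic. The second kernel, a deterministic cycle, is included to illustrate that detailed balance is not necessary for invariance: the uniform distribution is invariant for the cycle, yet (5.13) fails.

```python
from fractions import Fraction as F

# Hypothetical lazy walk on three states and a candidate distribution pi.
T = [
    [F(1, 2), F(1, 2), F(0)],
    [F(1, 4), F(1, 2), F(1, 4)],
    [F(0), F(1, 2), F(1, 2)],
]
pi = [F(1, 4), F(1, 2), F(1, 4)]

# Hypothetical deterministic cycle 0 -> 1 -> 2 -> 0 with the uniform distribution.
T_cyc = [
    [F(0), F(1), F(0)],
    [F(0), F(0), F(1)],
    [F(1), F(0), F(0)],
]
uniform = [F(1, 3)] * 3


def satisfies_detailed_balance(T, pi):
    """Check T(x -> xp) * pi(x) == T(xp -> x) * pi(xp) for all pairs of states."""
    n = len(T)
    return all(
        T[x][xp] * pi[x] == T[xp][x] * pi[xp]
        for x in range(n)
        for xp in range(n)
    )


print(satisfies_detailed_balance(T, pi))           # True
print(satisfies_detailed_balance(T_cyc, uniform))  # False
```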

The following theorem provides a guarantee that a unique stationary probability density function corresponds to a Markov chain satisfying the detailed balance condition.

Theorem 5.7. (Robert and Casella, 2005, Theorem 6.46) Suppose that the transition kernel of a Markov chain satisfies the detailed balance condition with a probability density function $\pi$. Then, the density function $\pi$ is the invariant density of the chain.
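For a finite state-space the argument behind this result is short. The following sketch is included for illustration and is not taken from the cited reference; it uses only the detailed balance condition (5.13) and the fact that the transition probabilities out of any state sum to one.

```latex
\sum_{x \in \mathcal{X}} T(x \to x')\, \pi(x)
  = \sum_{x \in \mathcal{X}} T(x' \to x)\, \pi(x')
  = \pi(x') \sum_{x \in \mathcal{X}} T(x' \to x)
  = \pi(x') .
```

The first equality applies (5.13) to each term of the sum, and the final expression is exactly the invariance condition for $\pi$.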

Having presented the standard constraints imposed on a chain for the existence of a unique stationary probability measure, we now review the convergence properties of discrete state-space Markov chains. For that, we need to introduce another stability property of the chain ensuring that the chain does not get trapped in cycles as a result of the constraints imposed by the transition kernel. The period of a state x ∈ X is defined as

$$d(x) = \gcd\left(\{m \geq 1 \mid P(x_m = x \mid x_0 = x) > 0\}\right),$$

where $\gcd(S)$ denotes the greatest common divisor of a set of positive integers $S$. A Markov chain is aperiodic if $d(x) = 1$ for all $x \in \mathcal{X}$. As we will demonstrate shortly, this is an important stability property for the convergence of the chain.
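The period of a state can be approximated by collecting the possible return times up to a finite horizon and taking their greatest common divisor. The sketch below does this by tracking which states are reachable at each step. The kernels, the function name, and the default horizon `max_steps` are assumptions for illustration; the result equals the period only if the horizon is large enough to contain the relevant return times.

```python
from math import gcd

# Hypothetical kernels: a two-state flip, a lazy walk, and a three-state cycle.
T_flip = [[0, 1], [1, 0]]
T_lazy = [
    [0.5, 0.5, 0.0],
    [0.25, 0.5, 0.25],
    [0.0, 0.5, 0.5],
]
T_cyc = [[0, 1, 0], [0, 0, 1], [1, 0, 0]]


def period(T, x, max_steps=50):
    """gcd of the return times m <= max_steps with P(x_m = x | x_0 = x) > 0."""
    n = len(T)
    # support[y] is True when P(x_m = y | x_0 = x) > 0 at the current step m.
    support = [y == x for y in range(n)]
    d = 0  # gcd(0, m) == m, so 0 acts as the neutral starting value
    for m in range(1, max_steps + 1):
        support = [
            any(support[z] and T[z][y] > 0 for z in range(n)) for y in range(n)
        ]
        if support[x]:
            d = gcd(d, m)
    return d  # 0 means that no return was possible within the horizon


print(period(T_flip, 0))  # 2
print(period(T_lazy, 0))  # 1
print(period(T_cyc, 2))   # 3
```

The lazy walk is aperiodic because every state can remain in place with positive probability, so a return in a single step is possible.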

Let us denote the probability of being in the state $x$ at time $t$ with $p(x_t = x)$ and the corresponding probability distribution over the state-space with a row vector $p_t$. A transition kernel defined on a discrete state-space can be represented with a non-negative matrix $T$ such that $T_{ij} = T(x_i \to x_j)$ for $1 \leq i, j \leq |\mathcal{X}|$. Then, for $n \geq 1$ the chain evolves as

$$p_n = p_{n-1} T = p_0 T^n. \tag{5.14}$$
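The recursion (5.14) amounts to repeated multiplication of a row vector by the transition matrix. The following minimal sketch applies it to a hypothetical three-state kernel, starting from a point mass on the first state; the kernel and the helper names are illustrative assumptions, and plain lists are used instead of a linear algebra library to keep the example self-contained.

```python
# Hypothetical kernel, T[i][j] = T(x_i -> x_j).
T = [
    [0.5, 0.5, 0.0],
    [0.25, 0.5, 0.25],
    [0.0, 0.5, 0.5],
]


def step(p, T):
    """One application of (5.14): the row vector p multiplied by the matrix T."""
    n = len(T)
    return [sum(p[i] * T[i][j] for i in range(n)) for j in range(n)]


def distribution_at(p0, T, n_steps):
    """Return p_n = p_0 T^n by applying the one-step update n_steps times."""
    p = list(p0)
    for _ in range(n_steps):
        p = step(p, T)
    return p


p0 = [1.0, 0.0, 0.0]  # the chain starts in the first state with probability one
p1 = distribution_at(p0, T, 1)
p50 = distribution_at(p0, T, 50)
print(p1)
print(p50)
```

For this particular kernel the distribution after many steps is numerically close to $(1/4, 1/2, 1/4)$, which is the invariant distribution of the kernel.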

Now, if the transition matrix $T$ is irreducible and the corresponding Markov chain is aperiodic, then the Perron–Frobenius theorem (Frobenius, 1912) guarantees the existence of the limit of the matrix power, i.e., $\lim_{n \to \infty} T^n < \infty$. The latter is an important condition for the ergodicity property of the chain, formally defined as follows.

Definition 5.4. Let $\{x_t\}_{t \in \mathbb{N}}$ be a Markov chain on a discrete state-space $\mathcal{X}$ and let $\pi$ be the corresponding stationary distribution. The chain is uniformly ergodic if

$$\lim_{t \to \infty} \sup_{x \in \mathcal{X}} \left\| P(x_t \mid x_0 = x) - \pi \right\|_{TV} = 0,$$

where $\| \cdot \|_{TV}$ is the total variation norm.

Definition 5.4 introduces a strong notion of convergence for Markov chains. In particular, the uniform ergodicity property implies that the limiting behaviour of the chain is independent of the initial conditions and that a sample from the chain, $x_t$, is asymptotically distributed according to the corresponding stationary distribution. The following theorem formally specifies conditions for the uniform ergodicity of a Markov chain defined on a discrete state-space.

Theorem 5.8. (Meyn and Tweedie, 2009; Robert and Casella, 2005) For any starting point $x_0 \in \mathcal{X}$, the Markov chain with a transition kernel defined on a discrete state-space $\mathcal{X}$ is uniformly ergodic if the transition kernel is irreducible and aperiodic.
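The quantity in Definition 5.4 can be evaluated exactly on a small state-space, which gives a simple numerical illustration of the theorem. The sketch below is written under stated assumptions: the kernels and helper names are hypothetical, and the total variation distance is taken as half of the sum of absolute differences, which is one common convention. The lazy walk is irreducible and aperiodic, while the two-state flip is irreducible but has period two.

```python
# Hypothetical kernels with their stationary distributions.
T = [
    [0.5, 0.5, 0.0],
    [0.25, 0.5, 0.25],
    [0.0, 0.5, 0.5],
]
pi = [0.25, 0.5, 0.25]
T_flip = [[0.0, 1.0], [1.0, 0.0]]
pi_flip = [0.5, 0.5]


def tv_distance(p, q):
    """Total variation distance, taken here as half of the L1 distance."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def worst_case_tv(T, pi, t):
    """Maximum over starting states x of the distance between P(x_t | x_0 = x) and pi."""
    n = len(T)
    worst = 0.0
    for x in range(n):
        # Start from a point mass on x and push it through the kernel t times.
        p = [1.0 if y == x else 0.0 for y in range(n)]
        for _ in range(t):
            p = [sum(p[i] * T[i][j] for i in range(n)) for j in range(n)]
        worst = max(worst, tv_distance(p, pi))
    return worst


for t in (0, 1, 5, 20):
    print(t, worst_case_tv(T, pi, t), worst_case_tv(T_flip, pi_flip, t))
```

For the lazy walk the worst-case distance shrinks towards zero as $t$ grows, whereas for the periodic flip it remains at one half for every $t$, so the convergence in Definition 5.4 fails without aperiodicity.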

In document Constructive Approximation and Learning by Greedy Algorithms (Page 153-156)