15. MARKOV CHAINS: DEFINITIONS AND EXAMPLES
Along with the discussion of martingales, we have introduced the concept of a discrete-time stochastic process. In this chapter we will study a particular class of such stochastic processes called Markov chains. Informally, a Markov chain is a discrete-time stochastic process for which, given the present state, the future and the past are independent. The formal definition is as follows:
Definition 15.1 Consider a probability space $(\Omega,\mathcal F,P)$ with a filtration $\{\mathcal F_n\}_{n\ge0}$ and a standard Borel space $(S,\mathcal B(S))$. Let $(X_n)_{n\ge0}$ be an $S$-valued stochastic process adapted to $\{\mathcal F_n\}$. We call this process a Markov chain with state space $S$ if for every $B\in\mathcal B(S)$ and every $n\ge0$,
$$P(X_{n+1}\in B\mid\mathcal F_n)=P(X_{n+1}\in B\mid X_n),\qquad P\text{-a.s.}\tag{15.1}$$
Here and henceforth, $P(\cdot\mid X_n)$ abbreviates $P(\cdot\mid\sigma(X_n))$. (The conditional distributions exist by Theorem 7.1.)
The relation (15.1) expresses the fact that, in order to know the future, we only need to know the present state. In practical situations, in order to prove that (15.1) is true, we typically calculate $P(X_{n+1}\in B\mid\mathcal F_n)$ explicitly and show that it depends only on $X_n$. Since $\sigma(X_n)\subset\mathcal F_n$, eq. (15.1) then follows by the “smaller always wins” principle for conditional expectations.
If we turn the above definition around, we can view (15.1) as a prescription that defines the Markov chain. Indeed, suppose we know the distribution of $X_0$. Then (15.1) allows us to calculate the joint distribution of $(X_0,X_1)$. Similarly, if we know the distribution of $(X_0,\dots,X_n)$, the above lets us calculate the distribution of $(X_0,\dots,X_{n+1})$. The object on the right-hand side of (15.1) then falls into the following category:
Definition 15.2 A function $p\colon S\times\mathcal B(S)\to[0,1]$ is called a transition probability if
(1) $B\mapsto p(x,B)$ is a probability measure on $\mathcal B(S)$ for all $x\in S$.
(2) $x\mapsto p(x,B)$ is $\mathcal B(S)$-measurable for all $B\in\mathcal B(S)$.
Our first item of concern is the existence of a Markov chain with given transition probabilities and initial distribution:
Theorem 15.3 Let $(p_n)_{n\ge0}$ be a sequence of transition probabilities and let $\mu$ be a probability measure on $\mathcal B(S)$. Then there exists a unique probability measure $P_\mu$ on the measurable space $(S^{\mathbb N_0},\mathcal B(S^{\mathbb N_0}))$, where $\mathbb N_0=\mathbb N\cup\{0\}$, such that $\omega\mapsto X_n(\omega)=\omega_n$ is a Markov chain with respect to the filtration $\mathcal F_n=\sigma(X_0,\dots,X_n)$ and such that for all $B\in\mathcal B(S)$,
$$P_\mu(X_0\in B)=\mu(B)\tag{15.2}$$
and
$$P_\mu(X_{n+1}\in B\mid X_n)=p_n(X_n,B),\qquad P_\mu\text{-a.s.}\tag{15.3}$$
for all $n\ge0$. In other words, $(X_n)$ defined by the coordinate maps on $(S^{\mathbb N_0},\mathcal B(S^{\mathbb N_0}),P_\mu)$ is a Markov chain with transition probabilities $(p_n)$ and initial distribution $\mu$.
Proof. This result is a direct consequence of the Kolmogorov Extension Theorem. Indeed, recall that $\mathcal B(S^{\mathbb N_0})$ is the least $\sigma$-algebra containing all sets $A_0\times\dots\times A_n\times S\times\cdots$ where $A_j\in\mathcal B(S)$. We will define a consistent family of probability measures on the measurable spaces $(S^{n+1},\mathcal B(S^{n+1}))$ by putting
$$P^{(n)}_\mu(A_0\times\dots\times A_n):=\int_{A_0}\mu(dx_0)\int_{A_1}p_0(x_0,dx_1)\cdots\int_{A_n}p_{n-1}(x_{n-1},dx_n),\tag{15.4}$$
and extending this to $\mathcal B(S^{n+1})$. It is easy to see that these measures are consistent because if $A\in\mathcal B(S^{n+1})$ then
$$P^{(n+1)}_\mu(A\times S)=P^{(n)}_\mu(A).\tag{15.5}$$
By Kolmogorov's Extension Theorem, there exists a unique probability measure $P_\mu$ on the infinite product space $(S^{\mathbb N_0},\mathcal B(S^{\mathbb N_0}))$ such that $P_\mu(A\times S\times\cdots)=P^{(n)}_\mu(A)$ for every $n\ge0$ and every set $A\in\mathcal B(S^{n+1})$.
It remains to show that the coordinate maps $X_n(\omega):=\omega_n$ define a Markov chain such that (15.2–15.3) hold true. The proof of (15.2) is easy:
$$P_\mu(X_0\in B)=P_\mu(B\times S\times\cdots)=P^{(0)}_\mu(B)=\mu(B).\tag{15.6}$$
In order to prove (15.3), we claim that for all $B\in\mathcal B(S)$ and all $A\in\mathcal F_n$,
$$E\bigl(1_{\{X_{n+1}\in B\}}1_A\bigr)=E\bigl(p_n(X_n,B)1_A\bigr).\tag{15.7}$$
To this end we first note that, by interpreting both sides as finite measures (in $A$) on $\mathcal B(S^{\mathbb N_0})$, it suffices to prove this just for $A\in\mathcal F_n$ of the form $A=A_0\times\dots\times A_n\times S\times\cdots$. But for such $A$ we have
$$E\bigl(1_{\{X_{n+1}\in B\}}1_A\bigr)=P^{(n+1)}_\mu(A_0\times\dots\times A_n\times B)\tag{15.8}$$
which by inspection of (15.4) (indeed, it suffices to integrate the last coordinate) simply equals $E(1_A\,p_n(X_n,B))$ as desired.
Once (15.7) is proved, it remains to note that $p_n(X_n,B)$ is $\mathcal F_n$-measurable. Hence $p_n(X_n,B)$ is a version of $P_\mu(X_{n+1}\in B\mid\mathcal F_n)$. But $\sigma(X_n)\subset\mathcal F_n$ and since $p_n(X_n,B)$ is $\sigma(X_n)$-measurable, the “smaller always wins” principle implies that (15.1) holds. Thus $(X_n)$ is a Markov chain satisfying (15.2–15.3) as we desired to prove. $\square$
While the existence argument was carried out in a “continuum” setup, the remainder of this chapter will be specialized to the case when the $X_n$'s can take only a countable set of values with positive probability. Denoting the (countable) state space by $S$, the transition probabilities $p_n(x,dy)$ will become functions $p_n\colon S\times S\to[0,1]$ with the property
$$\sum_{y\in S}p_n(x,y)=1\tag{15.9}$$
for all $x\in S$. (Clearly, $p_n(x,y)$ is an abbreviation of $p_n(x,\{y\})$.) We will call such $p_n$ a stochastic matrix. Similarly, the initial distribution $\mu$ will become a function $\mu\colon S\to[0,1]$ with the property
$$\sum_{x\in S}\mu(x)=1.\tag{15.10}$$
(Again, $\mu(x)$ abbreviates $\mu(\{x\})$.)
The subindex $n$ on the transition matrix $p_n(x,y)$ reflects the possibility that a different transition matrix is used at each step. This would correspond to a time-inhomogeneous Markov chain. While this is sometimes a useful generalization, an overwhelming majority of the Markov chains that are ever considered are time-homogeneous. In light of Theorem 15.3, such Markov chains are then determined by two objects: the initial distribution $\mu$ and the transition matrix $p(x,y)$, satisfying (15.10) and (15.9) respectively. For the rest of this chapter we will focus on time-homogeneous Markov chains.
Let us make a remark on general notation. As used before, the object $P_\mu$ denotes the Markov chain with initial distribution $\mu$. If $\mu$ is the point mass at state $x\in S$, then we denote the resulting measure by $P_x$. We now proceed by a list of examples of countable-state time-homogeneous Markov chains.
Example 15.4 (SRW starting at $x$) Let $S=\mathbb Z^d$ and, using $x\sim y$ to denote that $x$ and $y$ are nearest neighbors on $\mathbb Z^d$, let
$$p(x,y)=\begin{cases}\dfrac1{2d},&\text{if }x\sim y,\\[1mm]0,&\text{otherwise.}\end{cases}\tag{15.11}$$
Consider the measure $P_x$ generated from the initial distribution $\mu(y)=\delta_x(y)$ using the above transition matrix. As is easy to check, the resulting Markov chain is the simple random walk started at $x$.
Example 15.5 (Galton-Watson branching process) Consider i.i.d. random variables $(\xi_n)_{n\ge1}$ taking values in $\mathbb N\cup\{0\}$ and define the stochastic matrix $p(n,m)$ by
$$p(n,m)=P(\xi_1+\dots+\xi_n=m),\qquad n,m\ge0.\tag{15.12}$$
(Here we use the interpretation $p(0,m)=0$ unless $m=0$.) Let $S=\mathbb N\cup\{0\}$ and let $P_1$ be the corresponding Markov chain started at $x=1$. As is easy to verify, the result is exactly the Galton-Watson branching process discussed in the context of martingales.
Example 15.6 (Ehrenfest chain) In his study of convergence to equilibrium in thermodynamics, Ehrenfest introduced the following simple model: Consider two boxes with altogether $m$ labeled balls. At each time step we pick one ball at random and move it to the other box.
To formulate the problem more precisely, we will only keep track of the number of balls in one of the boxes. Thus, the set of possible values (i.e., the state space) is simply $S=\{0,1,\dots,m\}$.
To calculate the transition probabilities, we note that the number of balls always changes only by one. The resulting probabilities are thus
$$p(n,n-1)=\frac nm\quad\text{and}\quad p(n,n+1)=\frac{m-n}m.\tag{15.13}$$
Note that this automatically gives zero probability to the impossible moves $0\to-1$ and $m\to m+1$, i.e., to removing a ball from an empty box or adding one to a box that already holds all the balls. The initial distribution can be whatever is of interest.
Example 15.7 (Birth-death chain) A generalization of the Ehrenfest chain is the situation where we think of a population evolving in discrete time steps. At each time only one of three things can happen: either an individual is born, or one dies, or nothing happens. The state space of such a chain will be $S=\mathbb N\cup\{0\}$. The transition probabilities will then be determined by three sequences $(\alpha_x)$, $(\beta_x)$ and $(\gamma_x)$ via
$$p(x,x+1)=\alpha_x,\qquad p(x,x-1)=\beta_x,\qquad p(x,x)=\gamma_x.\tag{15.14}$$
Clearly, we need that $\alpha_x+\beta_x+\gamma_x=1$ for all $x\in S$ and, since the number of individuals is always non-negative, we also require that $\beta_0=0$.
Example 15.8 (Stochastic matrix) An abstract example of a Markov chain arises whenever we are given a square stochastic matrix. For instance,
$$p=\begin{pmatrix}3/4&1/4\\1/5&4/5\end{pmatrix}\tag{15.15}$$
defines a Markov chain on $S=\{0,1\}$ with the matrix elements corresponding to the values of $p(x,y)$.
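As a small illustration of how a stochastic matrix drives the law of the chain, the following sketch pushes a point mass at state 0 through the matrix of (15.15) for two steps. The helper names `is_stochastic` and `step` are hypothetical and introduced only for this example; exact rational arithmetic is used so the row sums can be checked without rounding.

```python
from fractions import Fraction

# Transition matrix from (15.15); row x holds p(x, 0), p(x, 1).
P = [[Fraction(3, 4), Fraction(1, 4)],
     [Fraction(1, 5), Fraction(4, 5)]]

def is_stochastic(matrix):
    """Entries are non-negative and every row sums to one, cf. (15.9)."""
    return all(min(row) >= 0 and sum(row) == 1 for row in matrix)

def step(dist, matrix):
    """One step of the chain on distributions: new(y) = sum_x dist(x) p(x, y)."""
    size = len(matrix)
    return [sum(dist[x] * matrix[x][y] for x in range(size)) for y in range(size)]

# Law of X_2 for the chain started at state 0 (point-mass initial distribution).
law = [Fraction(1), Fraction(0)]
for _ in range(2):
    law = step(law, P)

print(is_stochastic(P), law)
```

Running the loop longer gives the law of $X_n$ for larger $n$; the same `step` function works for any finite stochastic matrix.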
Example 15.9 (Random walk on a graph) Consider a graph $G=(V,E)$, where $V$ is a countable set (the vertex set) and $E\subset V\times V$ is a binary relation (the edge set of $G$). We will assume that $G$ is unoriented, i.e., $E$ is symmetric, and that there are no self-loops, i.e., $E$ is antireflexive. Let $d(x)$ denote the degree of vertex $x$, which is simply the number of $y\in V$ with $(x,y)\in E$, i.e., the number of neighbors of $x$. Suppose that there are no isolated vertices, which means that $d(x)>0$ for all $x$.
We define a random walk on $V$ as a Markov chain on $S=V$ with transition probabilities
$$p(x,y)=\frac{a_{xy}}{d(x)},\tag{15.16}$$
where $(a_{xy})_{x,y\in V}$ is the adjacency matrix,
$$a_{xy}=\begin{cases}1,&\text{if }(x,y)\in E,\\0,&\text{otherwise.}\end{cases}\tag{15.17}$$
Since $\sum_{y\in V}a_{xy}=d(x)$ we see that $p(x,y)$ is a stochastic matrix. (Here we assumed without saying that $G$ is locally finite, i.e., $d(x)<\infty$ for all $x\in V$.)
Note that this extends the “usual” simple random walk on $\mathbb Z^d$, which we defined in terms of sums of i.i.d. random variables, to any locally finite graph. The initial distribution is typically concentrated at one point, namely the starting point of the walk; see Example 15.4.
Having amassed a bunch of examples, we begin investigating some general properties of countable-state time-homogeneous Markov chains. (To keep the statements of theorems and lemmas concise, we will not state these as our assumptions any more.)
16. STATIONARY AND REVERSIBLE MEASURES
The first question we will try to explore is that of stationarity. To motivate the forthcoming definitions, consider a Markov chain $(X_n)_{n\ge0}$ with transition probability $p$ and $X_0$ distributed according to $\mu_0$. As is easy to check, the law of $X_1$ is then described by the measure $\mu_1$ which is computed by
$$\mu_1(y)=\sum_{x\in S}\mu_0(x)p(x,y).\tag{16.1}$$
We are interested in the situations when the distribution ofX1is the same as that ofX0. This leads us to this, somewhat more general, definition:
Definition 16.1 A (positive) measure $\nu$ on the state space $S$ is called stationary if
$$\nu(y)=\sum_{x\in S}\nu(x)p(x,y),\qquad y\in S.\tag{16.2}$$
If $\nu$ has total mass one we call it a stationary distribution.
Remark 16.2 While we allow ourselves to consider measures $\mu$ on $S$ that are not normalized, we will always assume that the measure assigns finite mass to every element of $S$.
Clearly, once the laws of $X_0$ and $X_1$ are the same, then all $X_n$'s have the same law (provided the chain is time-homogeneous). Let us find stationary measures for the Ehrenfest and birth-death chains:
Lemma 16.3 Consider the Ehrenfest chain with state space $S=\{0,1,\dots,m\}$ and let the transition matrix be as in (15.13). Then
$$\nu(k)=\binom mk2^{-m},\qquad k=0,\dots,m,\tag{16.3}$$
is a stationary distribution.
Proof. We have to show that $\nu$ satisfies (16.2). First we note that
$$\sum_{k\in S}\nu(k)p(k,l)=\nu(l-1)p(l-1,l)+\nu(l+1)p(l+1,l).\tag{16.4}$$
Then we calculate
$$\text{rhs of (16.4)}=2^{-m}\Bigl[\binom m{l-1}\frac{m-l+1}m+\binom m{l+1}\frac{l+1}m\Bigr]=2^{-m}\binom ml\Bigl[\frac l{m-l+1}\,\frac{m-l+1}m+\frac{m-l}{l+1}\,\frac{l+1}m\Bigr].\tag{16.5}$$
The proof is finished by noting that, after a cancellation, the bracket is simply one. $\square$
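The computation in Lemma 16.3 can be double-checked numerically for a specific number of balls. The sketch below builds the Ehrenfest matrix from (15.13) and tests the stationarity equation (16.2) for the binomial weights (16.3) in exact arithmetic. The function names and the choice $m=6$ are assumptions made for illustration only.

```python
from fractions import Fraction
from math import comb

def ehrenfest_matrix(m):
    """Transition matrix of (15.13) on the states 0, 1, ..., m."""
    P = [[Fraction(0)] * (m + 1) for _ in range(m + 1)]
    for n in range(m + 1):
        if n > 0:
            P[n][n - 1] = Fraction(n, m)       # a ball leaves the tracked box
        if n < m:
            P[n][n + 1] = Fraction(m - n, m)   # a ball enters the tracked box
    return P

def binomial_measure(m):
    """The candidate stationary distribution (16.3)."""
    return [Fraction(comb(m, k), 2 ** m) for k in range(m + 1)]

def is_stationary(nu, P):
    """Check nu(y) = sum_x nu(x) p(x, y) for every y, cf. (16.2)."""
    size = len(P)
    return all(nu[y] == sum(nu[x] * P[x][y] for x in range(size))
               for y in range(size))

m = 6
print(is_stationary(binomial_measure(m), ehrenfest_matrix(m)))
```

A check of this kind only confirms the lemma for the chosen $m$; the proof above is what covers all $m$.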
Lemma 16.4 Consider the birth-death chain on $\mathbb N\cup\{0\}$ characterized by sequences $(\alpha_n)$, $(\beta_n)$ and $(\gamma_n)$, cf. (15.14). Suppose that $\beta_n>0$ for all $n\ge1$. Then $\nu(0)=1$ and
$$\nu(n)=\prod_{k=1}^n\frac{\alpha_{k-1}}{\beta_k},\qquad n\ge1,\tag{16.6}$$
defines a stationary measure of the chain.
In order to prove this lemma, we will introduce the following interesting concept:
Definition 16.5 A measure $\nu$ of a Markov chain on $S$ is called reversible if
$$\nu(x)p(x,y)=\nu(y)p(y,x)\tag{16.7}$$
holds for all $x,y\in S$.
As we will show later, reversibility owes its name to the fact that, if we run the chain backwards (whatever this means for now) starting from $\nu$, we would get the same Markov chain. A simple consequence of reversibility is stationarity:
Lemma 16.6 A reversible measure is automatically stationary.
Proof. We just have to sum both sides of (16.7) over $y$. Since $p(x,y)$ is stochastic, the left-hand side produces $\nu(x)$ while the right-hand side gives $\sum_{y\in S}\nu(y)p(y,x)$. $\square$
Equipped with these observations, the proof of Lemma 16.4 is a piece of cake:
Proof of Lemma 16.4. We claim that $\nu$ in (16.6) is reversible. To prove this we observe that (16.6) implies for all $k\ge0$
$$\nu(k+1)=\nu(k)\frac{\alpha_k}{\beta_{k+1}}=\nu(k)\frac{p(k,k+1)}{p(k+1,k)}.\tag{16.8}$$
This shows that (16.7) holds for $x$ and $y$ differing by one. For $x=y$ (16.7) holds trivially and in the remaining cases $p(x,y)=0$, so (16.7) is proved in general. Hence $\nu$ is reversible and thus stationary. $\square$
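To see the product formula (16.6) and the detailed-balance relation (16.7) at work, here is a minimal sketch for one concrete birth-death chain. The rates $\alpha_x=1/3$ for all $x$, $\beta_x=2/3$ for $x\ge1$ and $\beta_0=0$ are an assumption chosen purely for illustration; with them (16.6) gives $\nu(n)=2^{-n}$.

```python
from fractions import Fraction

def alpha(x):
    """Birth probability p(x, x+1); constant 1/3 in this illustrative chain."""
    return Fraction(1, 3)

def beta(x):
    """Death probability p(x, x-1); zero at the origin as required by the text."""
    return Fraction(0) if x == 0 else Fraction(2, 3)

def nu(n):
    """Stationary measure from (16.6): nu(0) = 1, nu(n) = prod alpha_{k-1}/beta_k."""
    value = Fraction(1)
    for k in range(1, n + 1):
        value *= alpha(k - 1) / beta(k)
    return value

# Detailed balance (16.7) between neighbouring states k and k+1.
balanced = all(nu(k) * alpha(k) == nu(k + 1) * beta(k + 1) for k in range(20))
print(balanced, [nu(n) for n in range(4)])
```

In this example $\nu$ is summable, so it can be normalized to a stationary distribution; other choices of rates need not have this property.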
Remark 16.7 Recall that stationary measures need not be finite and thus the existence of a stationary measure does not imply the existence of a stationary distribution. (The distinction is really whether the measure is finite or infinite, because a finite stationary measure can always be normalized.) For the Ehrenfest chain we immediately produced a stationary distribution. However, for the birth-death chain the question whether $\nu$ is finite or infinite depends sensitively on the asymptotic properties of the ratio $\alpha_{k-1}/\beta_k$ appearing in (16.6). We will return to this later in this chapter.
Next we will address the underlying meaning of the reversible measure. We will do this by showing that reversing the “flow of time” we obtain another Markov chain, which in the reversible situation will be the same as the original chain.
Theorem 16.8 (Reversed chain) Consider a Markov chain $(X_n)_{n\ge0}$ started from a stationary initial distribution $\mu$ and transition matrix $p(x,y)$. Fix $N$ large and let
$$Y_n:=X_{N-n},\qquad n=0,\dots,N.\tag{16.9}$$
Then $(Y_n)_{n=0}^N$ is a (time-homogeneous) Markov chain, called the reversed chain, with initial distribution $\mu$ and transition matrix $q(x,y)$ defined by
$$q(x,y)=\frac{\mu(y)}{\mu(x)}p(y,x),\qquad\mu(x)>0.\tag{16.10}$$
(The values $q(x,y)$ for $x$ such that $\mu(x)=0$ are immaterial since such $x$ will never be visited starting from the initial distribution $\mu$.)
Proof. Fix a collection of values $y_0,\dots,y_N\in S$ and consider the probability $P(Y_n=y_n,\ n=0,\dots,N)$, where $Y_n$ are defined from $X_n$ as in (16.9). Since $(X_n)$ is a Markov chain, we have
$$P(Y_n=y_n,\ n=0,\dots,N)=\mu(y_N)p(y_N,y_{N-1})\cdots p(y_1,y_0).\tag{16.11}$$
A simple argument now shows that either this probability vanishes or $\mu(y_k)>0$ for all $k=0,\dots,N$. Focusing on the latter case we can rewrite this as follows:
$$P(Y_n=y_n,\ n=0,\dots,N)=\frac{\mu(y_N)}{\mu(y_{N-1})}p(y_N,y_{N-1})\cdots\frac{\mu(y_1)}{\mu(y_0)}p(y_1,y_0)\,\mu(y_0)=\mu(y_0)q(y_0,y_1)\cdots q(y_{N-1},y_N).\tag{16.12}$$
Hence $(Y_n)$ is a Markov chain with initial distribution $\mu$ and transition matrix $q(x,y)$. $\square$
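The formula (16.10) is easy to evaluate on a small example. The sketch below uses a hypothetical three-state chain that prefers to move "clockwise"; because its matrix is doubly stochastic, the uniform distribution is stationary, and the reversed chain turns out to prefer the opposite direction. The matrix and the helper name `reversed_matrix` are assumptions made for this illustration.

```python
from fractions import Fraction

third = Fraction(1, 3)
# Illustrative chain on {0, 1, 2}: step +1 (mod 3) w.p. 2/3, step -1 w.p. 1/3.
P = [[0 * third, 2 * third, third],
     [third, 0 * third, 2 * third],
     [2 * third, third, 0 * third]]
mu = [third, third, third]  # uniform; stationary since the columns of P sum to one

def reversed_matrix(P, mu):
    """Transition matrix of the reversed chain: q(x, y) = mu(y) p(y, x) / mu(x)."""
    size = len(P)
    return [[mu[y] * P[y][x] / mu[x] for y in range(size)] for x in range(size)]

Q = reversed_matrix(P, mu)
mu_is_stationary = all(
    mu[y] == sum(mu[x] * P[x][y] for x in range(3)) for y in range(3))
print(mu_is_stationary, Q != P, [sum(row) for row in Q])
```

Here $q\ne p$, so the uniform measure is stationary but not reversible; for a reversible measure the same computation would return $q=p$.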
Clearly, if $\mu$ is a reversible measure then $q(x,y)=p(x,y)$, i.e., the dual chain $(Y_n)$ has the same law as $(X_n)$. This allows us to extend any stationary Markov chain to negative infinity, into a two-sided chain $(X_n)_{n\in\mathbb Z}$. As an additional exercise we will apply these concepts to the random walk on a locally finite graph.
Lemma 16.9 Consider a locally finite unoriented graph $G=(V,E)$ and let $d(x)$ denote the degree of vertex $x$. Suppose that there are no isolated vertices, i.e., $d(x)>0$ for every $x$. Then the measure
$$\nu(x)=d(x),\qquad x\in V,\tag{16.13}$$
is reversible and hence stationary for the random walk on $G$.
Proof. We have $p(x,y)=a_{xy}/d(x)$, where $a_{xy}$ is the adjacency matrix. Since $G$ is unoriented, the adjacency matrix is symmetric. This allows us to calculate
$$\nu(x)p(x,y)=d(x)\frac{a_{xy}}{d(x)}=a_{xy}=a_{yx}=d(y)\frac{a_{yx}}{d(y)}=\nu(y)p(y,x).\tag{16.14}$$
Thus $\nu$ is reversible and hence stationary. $\square$
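The statement of Lemma 16.9 can be checked directly on a small graph. The following sketch assumes a hypothetical four-vertex graph (a triangle with a pendant vertex), builds the transition probabilities (15.16) from its edge list, and tests detailed balance for the degree measure. The edge list and function names are illustrative only.

```python
from fractions import Fraction

# Illustrative unoriented graph: triangle 1-2-3 with a pendant vertex 0 attached to 1.
EDGES = [(0, 1), (1, 2), (1, 3), (2, 3)]
VERTICES = sorted({v for edge in EDGES for v in edge})

# Symmetric adjacency relation: store each edge in both directions.
ADJACENT = {(x, y) for x, y in EDGES} | {(y, x) for x, y in EDGES}

def degree(x):
    """Number of neighbours of x."""
    return sum((x, y) in ADJACENT for y in VERTICES)

def p(x, y):
    """Random-walk transition probability a_xy / d(x), cf. (15.16)."""
    return Fraction(int((x, y) in ADJACENT), degree(x))

# Detailed balance (16.7) for nu(x) = d(x).
reversible = all(degree(x) * p(x, y) == degree(y) * p(y, x)
                 for x in VERTICES for y in VERTICES)
print(reversible, [degree(x) for x in VERTICES])
```

Dividing the degrees by their sum (here 8, twice the number of edges) gives the stationary distribution of the walk on this finite graph.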
Clearly, $\nu$ is finite if and only if $E$ is finite which, by the fact that no vertex is isolated, implies that $V$ is finite. However, the stationary measure may not be unique. Indeed, if $G$ has two separate components, even the restriction of $\nu$ to one of the components would be stationary. We proceed by analyzing the question of uniqueness (and existence) of stationary measures.
17. EXISTENCE/UNIQUENESS OF STATIONARY MEASURES
As alluded to in the example of the random walk on a general graph, a simple obstruction to uniqueness is when there are parts of the state space S for which the transition from one to another happens with zero probability. We introduce a proper name for this situation:
Definition 17.1 We call the transition matrix $p$ (or the Markov chain itself) irreducible if for each $x,y\in S$ there is a number $n=n(x,y)$ such that
$$p^n(x,y)=\sum_{y_1,\dots,y_{n-1}\in S}p(x,y_1)p(y_1,y_2)\cdots p(y_{n-1},y)>0.\tag{17.1}$$
The object $p^n(x,y)$, not to be confused with the time-inhomogeneous transition matrix $p_n(x,y)$, is the $(x,y)$-th entry of the $n$-th power of the transition matrix $p$. As is easy to check, $p^n(x,y)$ simply equals the probability that the Markov chain started at $x$ is at $y$ at time $n$, i.e., $P_x(X_n=y)=p^n(x,y)$.
Irreducibility can be characterized in terms of a stopping time.
Lemma 17.2 Consider a Markov chain $(X_n)_{n\ge0}$ on $S$ and let $T_y=\inf\{n\ge1\colon X_n=y\}$ be the first time the chain visits $y$ (note that we do not count the initial state $X_0$ in this definition). Then the chain is irreducible if and only if for all $x,y$,
$$P_x(T_y<\infty)>0.\tag{17.2}$$
Proof. This is a trivial consequence of the fact that $P_x(X_n=y)=p^n(x,y)$ and that $p^m(x,y)\le P_x(T_y<\infty)\le\sum_{n\ge1}p^n(x,y)$ for every $m\ge1$. $\square$
However, irreducibility alone is not sufficient to guarantee the existence and uniqueness of a stationary measure. The principal concept here isrecurrence, which we have already encountered in the context of random walks:
Definition 17.3 A state $x\in S$ is called recurrent if $P_x(T_x<\infty)=1$. A Markov chain is recurrent if every state is recurrent.
Theorem 17.4 (Existence of stationary measure) Consider an irreducible Markov chain $(X_n)_{n\ge0}$ and let $x\in S$ be a recurrent state. Then
$$\nu_x(y)=E_x\Bigl(\sum_{n=1}^{T_x}1_{\{X_n=y\}}\Bigr)=\sum_{n\ge1}P_x(X_n=y,\,T_x\ge n)\tag{17.3}$$
is finite for all $y\in S$ and defines a stationary measure on $S$. Moreover, any other stationary measure is a multiple of $\nu_x$.
The crux of the proof, and the principal reason why we need recurrence, is the following observation: If $T_x<\infty$ almost surely, we can also write $\nu_x$ as follows:
$$\nu_x(y)=E_x\Bigl(\sum_{n=0}^{T_x-1}1_{\{X_n=y\}}\Bigr)=\sum_{n\ge1}P_x(X_{n-1}=y,\,T_x\ge n).\tag{17.4}$$
The first equality comes from the fact that if $y\ne x$ then $X_n\ne y$ for $n=0$ and $n=T_x$ anyway, while if $y=x$ then the sum in the first expectation in (17.3) and (17.4) equals one in both cases. The second equality in (17.4) follows by a convenient relabeling.
Proof of existence. Let $x\in S$ be a recurrent state. Then
$$\begin{aligned}P_x(X_n=y,\,T_x\ge n)&=\sum_{z\in S}P_x(X_{n-1}=z,\,X_n=y,\,T_x\ge n)\\&=\sum_{z\in S}E_x\bigl(P_x(X_n=y\mid\mathcal F_{n-1})1_{\{X_{n-1}=z\}}1_{\{T_x\ge n\}}\bigr)\\&=\sum_{z\in S}p(z,y)P_x(X_{n-1}=z,\,T_x\ge n),\end{aligned}\tag{17.5}$$
where we used that $\{X_{n-1}=z\}$ and $\{T_x\ge n\}$ are both $\mathcal F_{n-1}$-measurable to derive the second line. The third line is a consequence of the fact that on $\{X_{n-1}=z\}$ we have that $P_x(X_n=y\mid\mathcal F_{n-1})=P_x(X_n=y\mid X_{n-1})=p(z,y)$.
Summing the above over $n\ge1$, applying discrete Fubini (everything is positive) and invoking (17.3) on the left and (17.4) on the right-hand side gives us $\nu_x(y)=\sum_{z\in S}\nu_x(z)p(z,y)$. It remains to show that $\nu_x(y)<\infty$ for all $y\in S$. First note that $\nu_x(x)=1$ by definition. Next we note that we actually have
$$\nu_x(x)=\sum_{z\in S}\nu_x(z)p^n(z,x)\ge\nu_x(y)p^n(y,x)\tag{17.6}$$
for all $n\ge1$ and all $y\in S$. Thus, $\nu_x(y)<\infty$ whenever $p^n(y,x)>0$. By irreducibility this will happen for some $n$ for every $y\in S$ and so $\nu_x(y)<\infty$ for all $y\in S$. Hence $\nu_x$ is a stationary measure. $\square$
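The series in (17.3) can be summed numerically for a small chain, which makes the construction of $\nu_x$ concrete. The sketch below uses the two-state matrix of Example 15.8 with $x=0$ and iterates the recursion (17.5), in which the intermediate states are forbidden to equal $x$. The function name `excursion_measure` and the truncation at 2000 terms are assumptions of this illustration, not part of the text.

```python
# Two-state chain of Example 15.8, in floating point.
P = [[0.75, 0.25],
     [0.20, 0.80]]

def excursion_measure(P, x, n_terms=2000):
    """Truncated series nu_x(y) = sum_{n>=1} P_x(X_n = y, T_x >= n), cf. (17.3)."""
    size = len(P)
    # current[y] = P_x(X_n = y, T_x >= n); for n = 1 this is just p(x, y).
    current = [P[x][y] for y in range(size)]
    nu = current[:]
    for _ in range(n_terms - 1):
        # Step (17.5): the state at time n-1 must differ from x, so sum over z != x.
        current = [sum(current[z] * P[z][y] for z in range(size) if z != x)
                   for y in range(size)]
        nu = [a + b for a, b in zip(nu, current)]
    return nu

nu0 = excursion_measure(P, x=0)
print(nu0)
```

For this chain the terms decay geometrically, so the truncation error is negligible; the result is approximately $(1,\,5/4)$, which satisfies the stationarity equation (16.2).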
The proof of uniqueness provides some motivation for how $\nu_x$ was constructed:
Proof of uniqueness. Suppose $x$ is a recurrent state and let $\nu_x$ be the stationary measure in (17.3). Let $\mu$ be another stationary measure (we require that $\mu(y)<\infty$ for all $y\in S$ even though, as we will see, it is enough to assume that $\mu(x)<\infty$). Stationarity of $\mu$ can also be written as
$$\mu(y)=\mu(x)p(x,y)+\sum_{z\ne x}\mu(z)p(z,y).\tag{17.7}$$
Plugging this for $\mu(z)$ in the second term and iterating gives us
$$\mu(y)=\mu(x)\Bigl[p(x,y)+\sum_{z\ne x}p(x,z)p(z,y)+\dots+\sum_{z_1,\dots,z_n\ne x}p(x,z_1)\cdots p(z_n,y)\Bigr]+\sum_{z_0,\dots,z_n\ne x}\mu(z_0)p(z_0,z_1)\cdots p(z_n,y).\tag{17.8}$$
We would like to pass to the limit and conclude that the last term tends to zero. However, a direct proof of this appears unlikely and so we proceed by using inequalities. Noting that the $k$-th term in the bracket equals $P_x(X_k=y,\,T_x\ge k)$, we have
$$\mu(y)\ge\mu(x)\sum_{k=1}^{n+1}P_x(X_k=y,\,T_x\ge k)\ \xrightarrow[n\to\infty]{}\ \mu(x)\nu_x(y).\tag{17.9}$$
In particular, we have $\mu(y)\ge\mu(x)\nu_x(y)$ for all $y\in S$.
Our goal is to show that equality holds. Suppose that for some $y$ we have $\mu(y)>\mu(x)\nu_x(y)$. By irreducibility, there exists $n\ge1$ such that $p^n(y,x)>0$ and so
$$\mu(x)=\sum_{z\in S}\mu(z)p^n(z,x)>\mu(x)\sum_{z\in S}\nu_x(z)p^n(z,x)=\mu(x)\nu_x(x)=\mu(x),$$
a contradiction. So $\mu(y)=\mu(x)\nu_x(y)$, i.e., $\mu$ is a rescaled version of $\nu_x$. $\square$
We finish by noting that there are criteria for existence of stationary measures for transient chains (proved by Harris (Proceedings AMS 1957) and Veech (Proceedings AMS 1963)). In particular, there are irreducible transient chains without stationary measures and also those with stationary measures. (For the latter, note that the random walk on any locally finite graph has a reversible, and thus stationary, measure.)
18. STRONG MARKOV PROPERTY AND DENSITY OF RETURNS
In the proof of existence and uniqueness of the stationary measure we have barely touched upon the recurrence property. Before we delve deeper into that subject let us state and prove an interesting consequence of the definition of Markov chain.
We will suppose that our Markov chain is defined on its canonical measurable space $(\Omega,\mathcal F)$, where $\Omega:=S^{\mathbb N_0}$ and $\mathcal F$ is the product $\sigma$-algebra. The $X_n$'s are represented by the coordinate maps $X_n(\omega):=\omega_n$. This setting permits the consideration of the shift operator $\theta$ which acts on sequences $\omega$ by
$$(\theta\omega)_n=\omega_{n+1}.\tag{18.1}$$
For any $n\ge1$ we define $\theta_n$ to be the $n$-fold composition of $\theta$, i.e., $(\theta_n\omega)_k=\omega_{k+n}$. If $N$ is a stopping time of the filtration $\mathcal F_n=\sigma(X_0,\dots,X_n)$, then $\theta_N$ is defined to be $\theta_n$ on $\{N=n\}$. On $\{N=\infty\}$ we leave $\theta_N$ undefined.
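Theorem 18.1 (Strong Markov property) Consider a Markov chain $(X_n)_{n\ge0}$ with initial distribution $\mu$ and let $N$ be a stopping time of $\{\mathcal F_n\}$. Suppose that $P_\mu(N<\infty)>0$ and let $\theta_N$ be as defined above. Then for all $B\in\mathcal F$,
$$P_\mu(1_B\circ\theta_N\mid\mathcal F_N)=P_{X_N}(B)\qquad P_\mu\text{-a.s. on }\{N<\infty\}.\tag{18.2}$$
Here $\mathcal F_N=\{A\in\mathcal F\colon A\cap\{N=n\}\in\mathcal F_n,\ n\ge0\}$ and $X_N$ is defined to be $X_n$ on $\{N=n\}$.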
This property is called “strong” because it is a strengthening of the Markov property
$$P_\mu(1_B\circ\theta_n\mid\mathcal F_n)=P_{X_n}(B)\tag{18.3}$$
to random $n$. In our case the proofs of the Markov property and the strong Markov property amount to more or less the same argument. In particular, no additional assumptions are needed. This is not true in continuous time, where the strong Markov property may fail in the absence of (rather natural) continuity conditions.
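Proof of Theorem 18.1. Let $A\in\mathcal F_N$ be such that $A\subset\{N<\infty\}$. First we will partition according to the values of $N$:
$$E_\mu\bigl(1_A(1_B\circ\theta_N)\bigr)=\sum_{n\ge0}E_\mu\bigl(1_{A\cap\{N=n\}}(1_B\circ\theta_n)\bigr).\tag{18.4}$$
(Note that, by our assumptions about $A$, we do not have to include the value $N=\infty$. Once we are on $\{N=n\}$ we can replace $\theta_N$ by $\theta_n$.) Now $A\cap\{N=n\}\in\mathcal F_n$ while $1_B\circ\theta_n$ is $\sigma(X_n,X_{n+1},\dots)$-measurable. This allows us to condition on $\mathcal F_n$ and use (15.3):
$$E_\mu\bigl(1_{A\cap\{N=n\}}(1_B\circ\theta_n)\bigr)=E_\mu\bigl(1_{A\cap\{N=n\}}E_\mu(1_B\circ\theta_n\mid\mathcal F_n)\bigr)=E_\mu\bigl(1_{A\cap\{N=n\}}P_{X_n}(B)\bigr)=E_\mu\bigl(1_{A\cap\{N=n\}}P_{X_N}(B)\bigr).\tag{18.5}$$
Plugging this back to (18.4) and summing over $n$ we deduce
$$E_\mu\bigl(1_A(1_B\circ\theta_N)\bigr)=E_\mu\bigl(1_AP_{X_N}(B)\bigr)\tag{18.6}$$
for all $A\in\mathcal F_N$ with $A\subset\{N<\infty\}$. Since $P_{X_N}(B)$ is $\mathcal F_N$-measurable, a standard argument implies (18.2). $\square$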
We proceed by listing some applications of the strong Markov property. Consider the stopping time $T_x$ defined by
$$T_x=\inf\{n\ge1\colon X_n=x\}.\tag{18.7}$$
Here we deliberately omit $n=0$ from the infimum, so that even in $P_x$ we may have $T_x=\infty$ almost surely. Let $T^n_x$ denote the $n$-th iteration of $T_x$, by which we mean $T^n_x=T^{n-1}_x+T_x\circ\theta_{T^{n-1}_x}$ with $T^0_x=0$. One of the principal conclusions of the strong Markov property is the following observation:
Lemma 18.2 Consider a Markov chain with state space $S$, let $x,y\in S$ and let $T_y$ be as defined above. Consider arbitrary events $A_0,\dots,A_{n-1}\in\mathcal F_{T_y}$ with $A_j\subset\{T_y<\infty\}$ and let $A_n\in\mathcal F$. Then for the Markov chain started at $x$, the events
$$\theta^{-1}_{T^j_y}(A_j),\qquad j=0,\dots,n,\tag{18.8}$$
are independent and $P_x\bigl(\theta^{-1}_{T^j_y}(A_j)\bigr)=P_y(A_j)$ for all $j=1,\dots,n$. (Here $T^0_y=0$.)
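Proof. Suppose $n=1$. By $A_0\subset\{T_y<\infty\}$, the strong Markov property, and the fact that $X_{T_y}=y$ on $\{T_y<\infty\}$,
$$P_x\bigl(A_0\cap\theta^{-1}_{T_y}(A_1)\bigr)=E_x\bigl(1_{A_0}P_{X_{T_y}}(A_1)\bigr)=P_x(A_0)P_y(A_1).\tag{18.9}$$
The general case follows by induction. $\square$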
Another interesting conclusion is:
Corollary 18.3 Consider a Markov chain with state space $S$ and let $T_y$ be as defined above. Then for all $x,y\in S$,
$$P_x(T^n_y<\infty)=P_x(T_y<\infty)P_y(T_y<\infty)^{n-1}.\tag{18.10}$$
Proof. Let $A_j=\{T_y<\infty\}$ in Lemma 18.2 and apply $\{T^n_y<\infty\}=\bigcap_{j=0}^{n-1}\theta^{-1}_{T^j_y}(A_j)$. $\square$
As for the random walk this statement allows us to characterize recurrence ofxin terms of the expected number of visits of the chain back tox.
Corollary 18.4 Let $N(x):=\sum_{n\ge1}1_{\{X_n=x\}}$. Then
$$E_x\bigl(N(y)\bigr)=\frac{P_x(T_y<\infty)}{1-P_y(T_y<\infty)}.\tag{18.11}$$
Here the right-hand side should be interpreted as zero if the numerator vanishes and as infinity if the numerator is positive and the denominator vanishes.
Proof. This is a consequence of the fact that $E_x(N(y))=\sum_{n\ge1}P_x(T^n_y<\infty)$ and the formula (18.10). $\square$
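Formula (18.11) can be sanity-checked on a chain with transient states, where both sides are finite. The sketch below assumes a hypothetical three-state chain in which state 2 is absorbing, computes $E_x N(y)=\sum_{n\ge1}p^n(x,y)$ by a truncated series, and computes $P_x(T_y<\infty)$ by summing the first-passage probabilities $P_x(T_y=n)$. The matrix, the function names and the truncation level are assumptions for illustration.

```python
# Illustrative chain: states 0 and 1 are transient, state 2 is absorbing.
P = [[0.50, 0.25, 0.25],
     [0.25, 0.50, 0.25],
     [0.00, 0.00, 1.00]]

def expected_visits(P, x, y, n_terms=400):
    """Truncated series E_x N(y) = sum_{n>=1} p^n(x, y)."""
    size = len(P)
    row = [1.0 if z == x else 0.0 for z in range(size)]  # law of X_0 under P_x
    total = 0.0
    for _ in range(n_terms):
        row = [sum(row[z] * P[z][w] for z in range(size)) for w in range(size)]
        total += row[y]
    return total

def hitting_probability(P, x, y, n_terms=400):
    """Truncated series P_x(T_y < infinity) = sum_{n>=1} P_x(T_y = n)."""
    size = len(P)
    # current[w] = P_x(X_n = w, X_1, ..., X_{n-1} != y), starting with n = 1.
    current = [P[x][w] for w in range(size)]
    total = current[y]
    for _ in range(n_terms - 1):
        current = [sum(current[z] * P[z][w] for z in range(size) if z != y)
                   for w in range(size)]
        total += current[y]
    return total

lhs = expected_visits(P, 0, 1)
rhs = hitting_probability(P, 0, 1) / (1 - hitting_probability(P, 1, 1))
print(lhs, rhs)
```

For a recurrent target state the denominator vanishes and the truncated series on the left keeps growing with `n_terms`, in line with the convention stated after (18.11).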
Corollary 18.5 A state $x$ is recurrent if and only if $E_x(N(x))=\infty$. In particular, for an irreducible Markov chain either all states are recurrent or none of them are. Finally, an irreducible finite-state Markov chain is recurrent.
Proof. A state $x$ is recurrent iff $P_x(T_x<\infty)=1$ which by (18.11) is true iff $E_x(N(x))=\infty$. To show the second claim, suppose that $x$ is recurrent and let us show that so is any $y\in S$. To that end, let $k$ and $l$ be numbers such that $p^k(y,x)>0$ and $p^l(x,y)>0$; these numbers exist by irreducibility. Then
$$p^{n+k+l}(y,y)\ge p^k(y,x)p^n(x,x)p^l(x,y),\tag{18.12}$$
which implies
$$E_y\bigl(N(y)\bigr)=\sum_{m\ge1}p^m(y,y)\ge\sum_{n\ge1}p^k(y,x)p^n(x,x)p^l(x,y)=p^k(y,x)p^l(x,y)E_x\bigl(N(x)\bigr).\tag{18.13}$$
But $p^k(y,x)p^l(x,y)>0$ and so $E_x(N(x))=\infty$ implies $E_y(N(y))=\infty$. Hence, all states of an irreducible Markov chain are recurrent if one of them is. Finally, if $S$ is finite, the trivial relation $\sum_{x\in S}N(x)=\infty$ implies that for every $z\in S$ there is an $x\in S$ with $E_z(N(x))=\infty$. Then (18.11) yields $P_x(T_x<\infty)=1$, i.e., $x$ is recurrent, and by the second claim so is every state. $\square$
In the previous section we concluded that, for irreducible Markov chains, recurrence is a class property, i.e., a property that either holds for all states or none. We have also shown that, once the chain is recurrent (on top of being irreducible), there exists a stationary measure. Next we will give conditions under which the stationary measure has finite mass which means it can be normalized to produce a stationary distribution. To that end we introduce the following definitions:
Definition 18.6 A state $x\in S$ of a Markov chain is said to be
• transient if $P_x(T_x<\infty)<1$.
• null recurrent if $P_x(T_x<\infty)=1$ but $E_xT_x=\infty$.
• positive recurrent if $E_xT_x<\infty$.
We will justify the terminology later. Our goal is to show that a stationary distribution exists if and only if every state of an (irreducible) chain is positive recurrent. The principal result is formulated as follows:
Theorem 18.7 Consider a Markov chain with state space $S$. If there exists a stationary measure $\mu$ with $0<\mu(S)<\infty$ then every $x$ with $\mu(x)>0$ is recurrent. If the chain is irreducible, then
$$\mu(x)=\frac{\mu(S)}{E_xT_x}\tag{18.14}$$
for all $x\in S$. In particular, $E_xT_x<\infty$ for all $x$, i.e., every state is positive recurrent.
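Proof. Let $x$ be such that $\mu(x)>0$. Since $\mu$ is stationary, $\mu(x)=\sum_{z\in S}\mu(z)p^n(z,x)$ for all $n\ge1$. Therefore
$$\infty=\sum_{n\ge1}\mu(x)\overset{\text{Fubini}}{=}\sum_{z\in S}\mu(z)\sum_{n\ge1}p^n(z,x)=\sum_{z\in S}\mu(z)E_z\bigl(N(x)\bigr).\tag{18.15}$$
But (18.11) implies $E_z(N(x))\le[1-P_x(T_x<\infty)]^{-1}$ and so
$$\infty\le\frac{\mu(S)}{1-P_x(T_x<\infty)}.\tag{18.16}$$
Since $\mu(S)<\infty$, this forces $P_x(T_x<\infty)=1$, i.e., $x$ is recurrent.
In order to prove the second part of the claim, note that irreducibility implies that $\mu(x)>0$ for all $x$ and so all states are recurrent. From Theorem 17.4 we have $\mu(y)=\mu(x)\nu_x(y)$ which by $\sum_y\nu_x(y)=\nu_x(S)=E_xT_x$ yields
$$\mu(S)=\mu(x)\sum_{y\in S}\nu_x(y)=\mu(x)E_xT_x,\tag{18.17}$$
implying (18.14). Since $\mu(S)<\infty$ we must have $E_xT_x<\infty$. $\square$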
We summarize the interesting part of this result as a corollary:
Corollary 18.8 For an irreducible Markov chain on a state space $S$, the following are equivalent:
(1) Some state is positive recurrent.
(2) There exists a stationary measure $\mu$ with $\mu(S)<\infty$.
(3) Every state is positive recurrent.
Proof. (1)$\Rightarrow$(2): Let $x$ be positive recurrent. Then $\nu_x$ is a stationary measure with $\nu_x(S)=E_xT_x<\infty$. (2)$\Rightarrow$(3): This is the content of Theorem 18.7. (3)$\Rightarrow$(1): Trivial. $\square$
We finish this section by providing a justification for the terminology of “positive and null recurrent” states/Markov chains (both are class properties):
Theorem 18.9 (Density of returns) Consider a Markov chain with state space $S$. Let $N_n(y)=\sum_{m=1}^n1_{\{X_m=y\}}$ be the number of visits to $y$ up to time $n$. If $y$ is recurrent, then for all $x\in S$,
$$\lim_{n\to\infty}\frac{N_n(y)}n=\frac1{E_yT_y}1_{\{T_y<\infty\}},\qquad P_x\text{-a.s.}\tag{18.18}$$
Proof. Let us first consider the case $x=y$. Then recurrence implies $1_{\{T_y<\infty\}}=1$ almost surely. Define the sequence of times $\tau_n=T^n_y-T^{n-1}_y$ where $\tau_1=T_y$. By the strong Markov property, $(\tau_n)$ are i.i.d. with the same distribution as $T_y$. In terms of the $\tau_n$'s, we have
$$N_n(y)=\sup\{k\ge0\colon\tau_1+\dots+\tau_k\le n\},\tag{18.19}$$
i.e., $N_n(y)$ is a renewal sequence. The Renewal Theorem then gives us
$$\lim_{n\to\infty}\frac{N_n(y)}n=\frac1{E_y\tau_1}=\frac1{E_yT_y},\qquad P_y\text{-a.s.}\tag{18.20}$$
Now we will look at the cases $x\ne y$. If $P_x(T_y=\infty)=1$ then $N_n(y)=0$ almost surely for all $n$ and there is nothing to prove. We can thus assume that $P_x(T_y<\infty)>0$ and decompose according to the values of $T_y$. We will use the Markov property which tells us that for any $A\in\mathcal F$, we have $P_x(\theta_m^{-1}(A)\mid T_y=m)=P_y(A)$. We will apply this to the event
$$A=\Bigl\{\lim_{n\to\infty}\frac{N_n(y)}n=\frac1{E_yT_y}\Bigr\}.\tag{18.21}$$
Indeed, this event occurs almost surely in $P_y$ and so we have $P_x(\theta_m^{-1}(A)\mid T_y=m)=1$. But on $\theta_m^{-1}(A)\cap\{T_y=m\}$ we have
$$\lim_{n\to\infty}\frac{N_{n+m}(y)-N_m(y)}n=\frac1{E_yT_y}\tag{18.22}$$
which implies that $N_n(y)/n\to1/E_yT_y$. Therefore, $P_x(A\mid T_y=m)=1$ for all $m$ with $P_x(T_y=m)>0$. It follows that $A$ occurs almost surely in $P_x(\cdot\mid T_y<\infty)$. Hence, the limit in (18.18) equals $1/E_yT_y$ almost surely on $\{T_y<\infty\}$ and zero almost surely on $\{T_y=\infty\}$. This proves the claim. $\square$
The upshot of this theorem is that a state is positive recurrent if it is visited at a positive density of times and null recurrent if it is visited infinitely often, but the density of visits is zero. Notwithstanding, in all cases all $N(y)$'s grow at roughly the same rate:
Theorem 18.10 (Ratio limit theorem) Consider an irreducible Markov chain and let $x$ be a recurrent state. Let $\nu_x$ be the measure from (17.3). Then for all $x,y,z\in S$,
$$\lim_{n\to\infty}\frac{N_n(y)}{N_n(z)}=\frac{\nu_x(y)}{\nu_x(z)},\qquad P_x\text{-almost surely.}\tag{18.23}$$
The proof of this claim is part of the homework assignment for this part.
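The density-of-returns statement of Theorem 18.9 lends itself to a quick simulation. The sketch below runs the two-state chain of Example 15.8 from state 1 and records the fraction of time spent in state 0. For that chain a first-step computation gives $E_0T_0=\frac34\cdot1+\frac14\cdot(1+5)=\frac94$, so the fraction should be close to $4/9$. The function name, the seed and the run length are assumptions chosen for this illustration.

```python
import random

# Two-state chain of Example 15.8, in floating point.
P = [[0.75, 0.25],
     [0.20, 0.80]]

def visit_fraction(P, start, target, n_steps, seed=0):
    """Simulate X_1, ..., X_n from `start` and return N_n(target) / n."""
    rng = random.Random(seed)          # fixed seed, for reproducibility only
    states = range(len(P))
    state, visits = start, 0
    for _ in range(n_steps):
        # Sample the next state from row `state` of the transition matrix.
        state = rng.choices(states, weights=P[state])[0]
        visits += state == target
    return visits / n_steps

density = visit_fraction(P, start=1, target=0, n_steps=200_000)
print(density, 4 / 9)
```

The agreement is only approximate at any finite run length; the theorem asserts almost sure convergence as the number of steps tends to infinity.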
19. CONVERGENCE TO EQUILIBRIUM
Markov chains are often run on a computer in order to sample from a complicated distribution on a large state space. The idea is to define a Markov chain for which the desired distribution is stationary and then wait long enough for the chain to “equilibrate.” The last aspect of Markov chains we wish to examine is the convergence to equilibrium.
When run on a computer, only one state of the Markov chain is stored at each time (this is why Markov chains are relatively easy to implement) and so we are asking about the convergence of the distribution $P_\mu(X_n\in\cdot)$. Noting that this is captured by the quantities $p^n(x,y)$, we will thus study the convergence of $p^n(x,\cdot)$ as $n\to\infty$.
For irreducible Markov chains, we can generally guarantee convergence in the Cesàro sense:
Lemma 19.1 Consider an irreducible Markov chain on $S$. Then for all $x,y\in S$, the limit
$$\lim_{n\to\infty}\frac1n\sum_{m=1}^np^m(x,y)=\mu(y)\tag{19.1}$$
exists and defines a stationary measure (possibly $\mu\equiv0$).
Proof. Let $N_n(y)=\sum_{1\le k\le n}1_{\{X_k=y\}}$ and note that
$$\sum_{m=1}^np^m(x,y)=E_x\bigl(N_n(y)\bigr).\tag{19.2}$$
By Theorem 18.9 we know that, $P_x$-almost surely, $N_n(y)/n$ converges to the reciprocal value of $E_yT_y$ in recurrent cases. In transient cases this ratio converges to zero by direct observation; the a.s. limit thus always exists and vanishes unless the chain is positive recurrent. Since $N_n(y)/n\le1$, the Bounded Convergence Theorem applies and the same is true even under expectation.
Suppose now that the chain is positive recurrent, fix $x$ and let $\varphi_n(x,y)=\frac1nE_x(N_n(y))$. Then the Markov property tells us
$$\varphi_{n+1}(x,y)=\frac{p(x,y)}{n+1}+\frac n{n+1}\sum_{z\in S}\varphi_n(x,z)p(z,y).\tag{19.3}$$
By our previous reasoning, $\varphi_n(x,y)\to\varphi(y):=(E_yT_y)^{-1}$ as $n\to\infty$. Reducing the sum over $z$ to a finite subset of $S$, passing to the limit, and then increasing the size of this subset to comprise all of $S$ we derive the inequality
$$\varphi(y)\ge\sum_{z\in S}\varphi(z)p(z,y),\qquad y\in S.\tag{19.4}$$
But summing over $y\in S$ we get equal terms on both sides so if this inequality were strict for at least one $y$, we would have a contradiction. Hence $\varphi$ is a stationary measure. $\square$
Remark 19.2 Note that the previous proof yields the following surprising, and quite unintuitive, identity:
$$\frac1{E_xT_x}=\sum_{y\in S}\frac1{E_yT_y}p(y,x).\tag{19.5}$$
Unfortunately, the $p^n(x,y)$ themselves may not converge. For instance, if we have a chain that “hops” between two states, $p^n(x,y)$ will oscillate between zero and one as $n$ changes. The obstruction is clearly related to periodicity: if there were the slightest probability to not “hop,” the chain would soon get out of sync and the equilibrium would be reached.
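The contrast between the oscillating $p^n(x,y)$ and their convergent Cesàro averages from Lemma 19.1 is easy to exhibit for the "hopping" chain. The sketch below takes the deterministic flip on two states (an assumption matching the informal description above) and lists $p^n(0,0)$ for $n=1,\dots,10$ together with their average. The names `FLIP` and `n_step_probabilities` are hypothetical.

```python
from fractions import Fraction

# Deterministic "hopping" chain on {0, 1}: the state flips at every step.
FLIP = [[Fraction(0), Fraction(1)],
        [Fraction(1), Fraction(0)]]

def n_step_probabilities(P, x, y, n_max):
    """Return the list [p^1(x, y), ..., p^{n_max}(x, y)]."""
    size = len(P)
    row = [Fraction(int(z == x)) for z in range(size)]  # point mass at x
    values = []
    for _ in range(n_max):
        row = [sum(row[z] * P[z][w] for z in range(size)) for w in range(size)]
        values.append(row[y])
    return values

probs = n_step_probabilities(FLIP, 0, 0, 10)
cesaro = sum(probs) / len(probs)
print(probs, cesaro)
```

The sequence alternates between 0 and 1 and so has no limit, while the average over an even number of steps is exactly $1/2$, the mass that the uniform stationary distribution assigns to state 0.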
In order to classify the periodic situation, let $I_x=\{n\ge1\colon p^n(x,x)>0\}$. By standard arguments, $I_x$ is an additive semigroup (a set closed under addition). This allows us to define a number $d_x$ as the largest integer that divides all $n\in I_x$. (Since 1 divides all integers, such a number indeed exists.) We call $d_x$ the period of $x$.
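For a finite chain the period can be computed directly from the definition. The sketch below is only an illustration under one stated assumption: the set $I_x$ is truncated at a hypothetical horizon `n_max`, so the returned value is the greatest common divisor of $I_x\cap\{1,\dots,n_{\max}\}$, which equals $d_x$ once the horizon is large enough.

```python
from math import gcd

def period(p, x, n_max=60):
    """gcd of {1 <= n <= n_max : p^n(x,x) > 0}; returns 0 if no return is seen."""
    size = len(p)
    # reachable[y] is True exactly when p^n(x, y) > 0 for the current n.
    reachable = [p[x][y] > 0 for y in range(size)]
    d = 0
    for n in range(1, n_max + 1):
        if reachable[x]:
            d = gcd(d, n)  # gcd(0, n) == n, so the first return time starts it off
        reachable = [any(reachable[z] and p[z][y] > 0 for z in range(size))
                     for y in range(size)]
    return d

cycle3 = [[0, 1, 0], [0, 0, 1], [1, 0, 0]]   # deterministic 3-cycle
hop = [[0, 1], [1, 0]]                       # two-state hopping chain
lazy_hop = [[0.5, 0.5], [0.5, 0.5]]          # hopping with a chance to stay
print(period(cycle3, 0), period(hop, 0), period(lazy_hop, 0))  # prints: 3 2 1
```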
Lemma 19.3 If the Markov chain is irreducible, then $d_x$ is the same for all $x$.
Proof. See (5.3) Lemma on page 309 of the textbook. □
Definition 19.4 A Markov chain is called aperiodic if $d_x=1$ for all $x$.
Lemma 19.5 An irreducible, aperiodic Markov chain with state space $S$ satisfies the following: For all $x,y\in S$ there exists $n_0=n_0(x,y)$ such that $p^n(x,y)>0$ for all $n\ge n_0$.
Proof. First we note that, thanks to irreducibility, it suffices to prove this for $x=y$. Indeed, if $k\ge1$ is such that $p^k(x,y)>0$, then $p^{n+k}(x,y)\ge p^n(x,x)\,p^k(x,y)$ and so if $p^n(x,x)>0$ for $n\ge n_0(x,x)$, then $p^n(x,y)>0$ for $n\ge k+n_0(x,x)$.
Let us set $x=y$. By aperiodicity, there exists a $k\ge1$ such that $p^k(x,x)>0$ and $p^{k+1}(x,x)>0$. Since $p^{2k+l+m}(x,x)\ge p^{k+l}(x,x)\,p^{k+m}(x,x)$, we thus have $p^{2k}(x,x),\,p^{2k+1}(x,x),\,p^{2k+2}(x,x)>0$. Proceeding similarly, we conclude that
$$p^{k^2+j}(x,x)>0,\qquad j=0,\dots,k-1. \tag{19.6}$$
But that implies
$$p^{k^2+mk+j}(x,x)\ge p^{k^2+j}(x,x)\bigl[p^k(x,x)\bigr]^m>0,\qquad m\ge0,\ j=0,\dots,k-1. \tag{19.7}$$
Every integer $n\ge k^2$ can be written in the form $n=k^2+mk+j$ for such $m$ and $j$ and so the result holds with $n_0(x,x)\le k^2$. □
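On a finite state space the conclusion of Lemma 19.5 can be checked mechanically. The sketch below (function and variable names are hypothetical) tracks only which entries of $p^n$ are positive and returns the first $n$ at which all of them are. Since every row of $p$ has a positive entry, $p^{n+1}(x,y)=\sum_z p(x,z)p^n(z,y)$ stays positive from then on, so this $n$ serves as a common $n_0$ for all pairs. The example chain is an arbitrary choice with return loops of lengths 2 and 3.

```python
def first_all_positive(p, n_max=1000):
    """Smallest n <= n_max with p^n(x,y) > 0 for all x, y; None if there is none."""
    size = len(p)
    support = [[p[x][y] > 0 for y in range(size)] for x in range(size)]
    power = support  # positivity pattern of p^1
    for n in range(1, n_max + 1):
        if all(all(row) for row in power):
            return n
        # Positivity pattern of p^(n+1) from that of p^n.
        power = [[any(power[x][z] and support[z][y] for z in range(size))
                  for y in range(size)] for x in range(size)]
    return None

# 0 -> 1 -> 2 -> 0 is a loop of length 3, and 1 -> 2 -> 1 a loop of length 2.
chain = [[0, 1, 0], [0, 0, 1], [0.5, 0.5, 0]]
print(first_all_positive(chain))             # prints: 5
print(first_all_positive([[0, 1], [1, 0]]))  # periodic chain, prints: None
```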
Our goal is to prove the following result:
Theorem 19.6 Consider an irreducible, aperiodic Markov chain on state space $S$. Suppose there exists a stationary distribution $\pi$. Then for all $x\in S$,
$$p^n(x,y)\underset{n\to\infty}{\longrightarrow}\pi(y),\qquad y\in S. \tag{19.8}$$
The proof of this theorem will be based on a general technique, called coupling. The idea is as follows: We will run one Markov chain started at $x$ and the other started at a $z$ which was itself chosen at random from the distribution $\pi$. As long as the chains stay away from each other, we keep generating them independently. The first moment they collide, we glue them and from that time on move both of them synchronously.
The upshot is that, if we observe only the chain started at $x$, we see a chain started at $x$, while if we observe only the chain started at $z$, we see a chain started at $z$. But the latter was started from the stationary distribution and so it will be stationary at each time. It follows that, provided the chains have glued, also the one started from $x$ will eventually be stationary.
To make this precise, we will have to define both chains on the same probability space. We will generalize the initial distributions to any two measures $\mu$ and $\nu$ on $S$. Let us therefore consider a Markov chain on $S\times S$ with transition probabilities
$$p\bigl((x_1,x_2),(y_1,y_2)\bigr)=\begin{cases}p(x_1,y_1)\,p(x_2,y_2),&\text{if }x_1\ne x_2,\\ p(x_1,y_1),&\text{if }x_1=x_2\text{ and }y_1=y_2,\\ 0,&\text{otherwise,}\end{cases} \tag{19.9}$$
and initial distribution $\mu\otimes\nu$. We will use $P^{\mu\otimes\nu}$ to denote the corresponding probability measure, called the coupling measure, and $(X_n^{(1)},X_n^{(2)})$ to denote the coupled process. First we will verify that each of the marginals is the original Markov chain:
Lemma 19.7 Let $(X_n^{(1)},X_n^{(2)})$ denote the coupled process in measure $P^{\mu\otimes\nu}$. Then $(X_n^{(1)})$ is the original Markov chain on $S$ with initial distribution $\mu$, while $(X_n^{(2)})$ is the original Markov chain on $S$ with initial distribution $\nu$.
Proof. Let $A=\{X_k^{(1)}=x_k,\ k=0,\dots,n\}$. Abusing the notation slightly, we want to show that $P^{\mu\otimes\nu}(A)=P^{\mu}(A)$. Since $A$ fixes only the $X_k^{(1)}$'s, we can calculate the probability of $A$ by summing over the possible values of $X_k^{(2)}$:
$$P^{\mu\otimes\nu}(A)=\sum_{(y_k)}\mu(x_0)\,\nu(y_0)\prod_{k=0}^{n-1}p\bigl((x_k,y_k),(x_{k+1},y_{k+1})\bigr). \tag{19.10}$$
Next we note that
$$\sum_{y'\in S}p\bigl((x,y),(x',y')\bigr)=\begin{cases}\sum_{y'\in S}p(x,x')\,p(y,y'),&\text{if }x\ne y,\\ p(x,x'),&\text{if }x=y.\end{cases} \tag{19.11}$$
In both cases the sum equals $p(x,x')$ which we note is independent of $y$. Therefore, the sums in (19.10) can be performed one by one with the result
$$P^{\mu\otimes\nu}(A)=\mu(x_0)\prod_{k=0}^{n-1}p(x_k,x_{k+1}), \tag{19.12}$$
which is exactly $P^{\mu}(A)$. The second marginal is handled analogously. □
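The key step (19.11) is a finite computation and can be verified exhaustively for a small chain. The following sketch builds the coupled transition probabilities (19.9) as a dictionary and checks that summing out either coordinate returns the original $p$. The function names and the three-state example are assumptions made for illustration.

```python
from fractions import Fraction
from itertools import product

def coupled_kernel(p):
    """Transition probabilities (19.9) on S x S, keyed by ((x1, x2), (y1, y2))."""
    states = range(len(p))
    kernel = {}
    for x1, x2, y1, y2 in product(states, repeat=4):
        if x1 != x2:
            value = p[x1][y1] * p[x2][y2]   # not yet glued: independent moves
        elif y1 == y2:
            value = p[x1][y1]               # glued: both chains make the same move
        else:
            value = Fraction(0)
        kernel[(x1, x2), (y1, y2)] = value
    return kernel

def marginals_match(p):
    """Check (19.11) and its analogue for the second coordinate."""
    states = range(len(p))
    kernel = coupled_kernel(p)
    first = all(sum(kernel[(x1, x2), (y1, y2)] for y2 in states) == p[x1][y1]
                for x1, x2, y1 in product(states, repeat=3))
    second = all(sum(kernel[(x1, x2), (y1, y2)] for y1 in states) == p[x2][y2]
                 for x1, x2, y2 in product(states, repeat=3))
    return first and second

h = Fraction(1, 2)
p = [[h, h, 0], [0, h, h], [h, 0, h]]   # hypothetical 3-state example
print(marginals_match(p))               # prints: True
```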
Our next item of interest is the time when the chains first collide:
Lemma 19.8 Let $T=\inf\{n\ge0\colon X_n^{(1)}=X_n^{(2)}\}$. Under the conditions of Theorem 19.6,
$$P^{\mu\otimes\nu}(T<\infty)=1 \tag{19.13}$$
for any pair of initial distributions $\mu$ and $\nu$.
Proof. We will consider an uncoupled chain on $S\times S$ where both original Markov chains move independently forever. This chain has the transition probability
$$q\bigl((x_1,x_2),(y_1,y_2)\bigr)=p(x_1,y_1)\,p(x_2,y_2). \tag{19.14}$$
As a moment's thought reveals, the time $T$ has the same distribution in both coupled and uncoupled chains. Therefore, we just need to prove the lemma for the uncoupled chain.
First, let us note that the uncoupled chain is irreducible (this is where aperiodicity is needed). Indeed, by Lemma 19.5 aperiodicity implies that $p^n(x_1,y_1)>0$ and $p^n(x_2,y_2)>0$ for $n$ sufficiently large and so we also have $q^n\bigl((x_1,x_2),(y_1,y_2)\bigr)=p^n(x_1,y_1)\,p^n(x_2,y_2)>0$ for $n$ sufficiently large. Second, we observe that the uncoupled chain is recurrent. Indeed, $\hat\pi(x,y)=\pi(x)\pi(y)$ is a stationary distribution and, using irreducibility, every state of the chain is thus recurrent. But then, for any $x\in S$, the first hitting time of $(x,x)$ is finite almost surely, which implies the same for $T$, which is the first hitting time of the diagonal in $S\times S$. □
The principal idea behind coupling now reduces to the following lemma:
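Lemma 19.9 (Coupling inequality) Consider the coupled Markov chain with initial distribution $\mu\otimes\nu$ and let $T=\inf\{n\ge0\colon X_n^{(1)}=X_n^{(2)}\}$. Let $\mu_n(\cdot)=P^{\mu\otimes\nu}(X_n^{(1)}\in\cdot)$ and $\nu_n(\cdot)=P^{\mu\otimes\nu}(X_n^{(2)}\in\cdot)$ be the marginals at time $n$. Then
$$\|\mu_n-\nu_n\|\le P^{\mu\otimes\nu}(T>n), \tag{19.15}$$
where $\|\mu_n-\nu_n\|=\sup_{A\subseteq S}|\mu_n(A)-\nu_n(A)|$ is the variational distance of $\mu_n$ and $\nu_n$.
Proof. Let $S_+=\{x\in S\colon\mu_n(x)>\nu_n(x)\}$. The proof is based on the fact that
$$\|\mu_n-\nu_n\|=\mu_n(S_+)-\nu_n(S_+). \tag{19.16}$$
This makes it reasonable to evaluate the difference
$$\begin{aligned}\mu_n(S_+)-\nu_n(S_+)&=P^{\mu\otimes\nu}(X_n^{(1)}\in S_+)-P^{\mu\otimes\nu}(X_n^{(2)}\in S_+)\\&=E^{\mu\otimes\nu}\bigl(1_{\{X_n^{(1)}\in S_+\}}-1_{\{X_n^{(2)}\in S_+\}}\bigr)\\&=E^{\mu\otimes\nu}\Bigl(1_{\{T>n\}}\bigl(1_{\{X_n^{(1)}\in S_+\}}-1_{\{X_n^{(2)}\in S_+\}}\bigr)\Bigr).\end{aligned} \tag{19.17}$$
Here we have noted that if $T\le n$ then either both $\{X_n^{(1)}\in S_+\}$ and $\{X_n^{(2)}\in S_+\}$ occur or both don't. Estimating the difference of the two indicators by one, we thus get $\mu_n(S_+)-\nu_n(S_+)\le P^{\mu\otimes\nu}(T>n)$. Plugging this into (19.16), the desired estimate follows. □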
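For a small chain both sides of (19.15) can be computed exactly by evolving the law of the coupled process. The sketch below does this under two observations: once the chains are glued they stay glued, so $\{T>n\}=\{X_n^{(1)}\ne X_n^{(2)}\}$, and on a countable space the variational distance equals half the $\ell^1$-distance of the two laws. The example chain, the initial laws and all names are hypothetical.

```python
from fractions import Fraction

def coupled_step(dist, p):
    """Push a law on S x S one step forward under the coupled dynamics (19.9)."""
    size, new = len(p), {}
    for (x1, x2), mass in dist.items():
        for y1 in range(size):
            if x1 == x2:   # glued: the second chain copies the first
                moves = [((y1, y1), p[x1][y1])]
            else:          # not glued: independent moves
                moves = [((y1, y2), p[x1][y1] * p[x2][y2]) for y2 in range(size)]
            for key, prob in moves:
                if prob:
                    new[key] = new.get(key, 0) + mass * prob
    return new

def coupling_table(p, mu, nu, n_max):
    """List of (n, ||mu_n - nu_n||, P(T > n)) for n = 0, ..., n_max."""
    size = len(p)
    dist = {(x, y): mu[x] * nu[y] for x in range(size) for y in range(size)}
    rows = []
    for n in range(n_max + 1):
        mu_n = [sum(m for (a, _), m in dist.items() if a == x) for x in range(size)]
        nu_n = [sum(m for (_, b), m in dist.items() if b == x) for x in range(size)]
        tv = sum(abs(a - b) for a, b in zip(mu_n, nu_n)) / 2
        not_met = sum(m for (a, b), m in dist.items() if a != b)
        rows.append((n, tv, not_met))
        dist = coupled_step(dist, p)
    return rows

h, third = Fraction(1, 2), Fraction(1, 3)
p = [[h, h, 0], [0, h, h], [h, 0, h]]      # irreducible and aperiodic
mu = [Fraction(1), Fraction(0), Fraction(0)]  # point mass at state 0
nu = [third, third, third]                 # stationary for this p
rows = coupling_table(p, mu, nu, 40)
for n, tv, not_met in rows[:5]:
    print(n, float(tv), float(not_met))
```

In every row the second number should not exceed the third, and both decay to zero, which is the mechanism behind the convergence proof that follows.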
Now we are ready to prove the convergence to equilibrium:
Proof of Theorem 19.6. Consider two Markov chains, one started from $\mu$ and the other from $\nu$. By Lemmas 19.7 and 19.9, the variational distance between the distributions $\mu_n$ and $\nu_n$ of $X_n$ in these two chains is bounded by $P^{\mu\otimes\nu}(T>n)$. But Lemma 19.8 implies that $P^{\mu\otimes\nu}(T>n)$ tends to zero as $n\to\infty$ and so $\|\mu_n-\nu_n\|\to0$.
To get (19.8) we now let $\mu=\delta_x$ and $\nu=\pi$. Then $\mu_n(\cdot)=p^n(x,\cdot)$ while $\nu_n=\pi$ for all $n$. Hence we have $\|p^n(x,\cdot)-\pi\|\to0$, which means that $p^n(x,\cdot)\to\pi$ in the variational norm. This implies (19.8). □
The method of proof is quite general and can be adapted to other circumstances. See Lindvall's book "Lectures on the Coupling Method." We observe that Lemmas 19.7 and 19.9 allow us to estimate the time it takes for the two marginals to get closer than prescribed. On the basis of the proof of Lemma 19.8, the coupling time can be studied in terms of the uncoupled process, which is slightly easier to handle.
The above technique also provides some estimate of how fast the chain converges to its equilibrium. Indeed, the coupling time provides a bound on the mixing time
$$t_{\mathrm{mix}}(\varepsilon)=\sup_{x\in S}\inf\bigl\{n\ge0\colon\|p^n(x,\cdot)-\nu\|\le\varepsilon\bigr\}, \tag{19.18}$$
where $\nu$ denotes the stationary distribution. Much of the current research on Markov chains is devoted to deriving sharp bounds on the mixing time in various specific examples. Sophisticated techniques based on coupling and spectral analysis have been developed; unfortunately, we don't have time to cover these in this course. Instead, we will discuss in detail a simple, yet fairly representative, Markov chain.
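For a finite chain with known stationary distribution, the mixing time (19.18) can be evaluated by brute force. The sketch below relies on the standard fact that the variational distance to the stationary distribution is non-increasing in $n$, so the sup-inf in (19.18) equals the first $n$ at which the worst starting point is within $\varepsilon$. The function names, the cutoff `n_max` and the two-state example are assumptions for illustration.

```python
from fractions import Fraction

def tv_distance(a, b):
    """Variational distance of two laws on a finite set (half the l1 distance)."""
    return sum(abs(s - t) for s, t in zip(a, b)) / 2

def mixing_time(p, stationary, eps, n_max=10_000):
    """First n <= n_max with max_x ||p^n(x,.) - stationary|| <= eps, else None."""
    size = len(p)
    # rows[x] holds p^n(x, .); for n = 0 this is the point mass at x.
    rows = [[Fraction(int(x == y)) for y in range(size)] for x in range(size)]
    for n in range(n_max + 1):
        if max(tv_distance(row, stationary) for row in rows) <= eps:
            return n
        rows = [[sum(row[z] * p[z][y] for z in range(size)) for y in range(size)]
                for row in rows]
    return None

q = Fraction(1, 4)
p = [[1 - q, q], [q, 1 - q]]                  # symmetric two-state chain
stationary = [Fraction(1, 2), Fraction(1, 2)]
# Here ||p^n(x,.) - stationary|| = (1/2)^(n+1), so the answers can be checked by hand.
print(mixing_time(p, stationary, Fraction(1, 4)))    # prints: 1
print(mixing_time(p, stationary, Fraction(1, 100)))  # prints: 6
```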
Example 19.10 (Random walk on hypercube) Consider a $d$-dimensional $2\times\dots\times2$ hypercube which we represent as $S=\{-1,1\}^d$. The vertices of the hypercube are $d$-tuples of $\pm1$'s, $x=(x_1,\dots,x_d)$. We will consider a random walk on $S$ which, formally, is a Markov chain with transition probabilities
$$p(x,y)=\begin{cases}\frac{1}{2d},&\text{if }(\exists i)\bigl(x_i=-y_i\ \&\ (\forall j\ne i)\,x_j=y_j\bigr),\\[2pt] \frac12,&\text{if }x=y,\\[2pt] 0,&\text{otherwise.}\end{cases} \tag{19.19}$$
The reason for the middle line is to make the chain aperiodic, which is a necessary condition for the convergence $p^n(x,\cdot)\to\nu$ to the stationary distribution $\nu$.
To analyze the behavior of this Markov chain, we will consider a natural coupling between two copies of this chain. Explicitly, consider a Markov chain on $S\times S$ with the following random "move": Given a state $(x,y)$ of this chain, pick $i\in\{1,\dots,d\}$ uniformly at random and, if $x_i=y_i$, perform the above single-chain move (flip the $i$-th coordinate with probability $1/2$) in both coordinates simultaneously, while if $x_i\ne y_i$, move them independently. To see how these chains gradually couple, we introduce the Hamming distance on $S$ which is defined by
$$d(x,y)=\bigl|\{i\colon 1\le i\le d,\ x_i\ne y_i\}\bigr|. \tag{19.20}$$
We also introduce the stopping times
$$T_j=\inf\bigl\{n\ge0\colon d(X_n,Y_n)\le d(X_0,Y_0)-j\bigr\}. \tag{19.21}$$
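The coupled move and the stopping times (19.21) are easy to simulate. The following sketch is one possible reading of the move described above, with the single-chain move taken to be "flip the chosen coordinate with probability $1/2$"; the function names and the seed are hypothetical.

```python
import random

def hamming(x, y):
    """Hamming distance (19.20): number of coordinates where x and y differ."""
    return sum(a != b for a, b in zip(x, y))

def coupled_move(x, y, rng):
    """One step of the coupling; returns the new pair of states."""
    i = rng.randrange(len(x))
    flip_x = rng.random() < 0.5
    # Shared coin where the coordinates agree, an independent coin where they differ.
    flip_y = flip_x if x[i] == y[i] else (rng.random() < 0.5)
    x, y = list(x), list(y)
    if flip_x:
        x[i] = -x[i]
    if flip_y:
        y[i] = -y[i]
    return tuple(x), tuple(y)

def stopping_times(x, y, rng):
    """Return [T_0, T_1, ..., T_{d_0}] from (19.21) along one simulated run."""
    d0, n, times = hamming(x, y), 0, [0]
    while len(times) <= d0:
        x, y = coupled_move(x, y, rng)
        n += 1
        # len(times) is the index j of the next stopping time still to be reached.
        while len(times) <= d0 and hamming(x, y) <= d0 - len(times):
            times.append(n)
    return times

rng = random.Random(0)
x0, y0 = (1,) * 6, (-1,) * 6   # opposite corners of the 6-dimensional cube
print(stopping_times(x0, y0, rng))
```

The last entry of the returned list is the time at which the two copies have met.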
Lemma 19.11 Consider the above duplicated Markov chain whose components are each started from a fixed point, let $d_0=d(X_0,Y_0)$ be the Hamming distance of the initial states and let $\tau_j=T_{j+1}-T_j$, $j=0,\dots,d_0-1$. Then the random times $(\tau_j)$ are independent and $\tau_j$ has geometric distribution with parameter $\frac{d_0-j}{2d}$.
Proof. The independence is a consequence of the strong Markov property. To find the distribution of $\tau_j$, it suffices, by restarting the chain at time $T_j$, to consider a single step from a pair of states at Hamming distance $d_0-j$. Now the Hamming distance never increases, and it will decrease by one only if the following occurs: An index $i$ is chosen among the components where the two states differ (this happens with probability $\frac{d_0-j}{d}$) and one of the chains moves while the other does not (which happens with probability $1/2$). Thus, the Hamming distance will decrease with probability $\frac{d_0-j}{2d}$ in each step; it thus takes a geometric time with this parameter for the move to occur. □
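The one-step probability used in the proof can be confirmed by enumerating all outcomes of a single coupled move exactly. The sketch below assumes the same reading of the move as in the text (a shared fair coin where the coordinates agree, independent fair coins where they differ); the function name and the example pair are hypothetical.

```python
from fractions import Fraction

def one_step_distance_law(x, y):
    """Exact law of the Hamming distance after one coupled move from (x, y)."""
    d = len(x)
    k = sum(a != b for a, b in zip(x, y))
    law = {}
    for i in range(d):
        for flip_x in (False, True):
            for flip_y in (False, True):
                if x[i] == y[i]:
                    # Shared coin: only the outcomes with equal flips are possible.
                    prob = Fraction(1, 2) if flip_x == flip_y else Fraction(0)
                else:
                    prob = Fraction(1, 4)   # two independent fair coins
                if not prob:
                    continue
                new_xi = -x[i] if flip_x else x[i]
                new_yi = -y[i] if flip_y else y[i]
                new_k = k - (x[i] != y[i]) + (new_xi != new_yi)
                law[new_k] = law.get(new_k, 0) + Fraction(1, d) * prob
    return law

x, y = (1, 1, 1, 1), (1, -1, -1, 1)   # d = 4, Hamming distance 2
print(one_step_distance_law(x, y))    # distance 1 has probability 2/(2*4) = 1/4
```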
Next we recall the coupon collector problem: A coupon collector collects coupons that arrive one per unit time. There are $r$ distinct flavors of the coupons; each coupon is sampled independently at random from the set of $r$ flavors. The goal is to find when the collector has seen coupons of all $r$ flavors for the first time. If $Z_n\in\{1,\dots,r\}$ marks the flavor of the $n$-th coupon that arrived, we define
$$\hat T_j^{(r)}=\inf\bigl\{n\ge0\colon|\{Z_1,\dots,Z_n\}|\ge j\bigr\} \tag{19.22}$$
to be the time when the collector has for the first time seen $j$ distinct flavors of coupons. It is easily checked that $(\hat T_{j+1}^{(r)}-\hat T_j^{(r)})$ are independent with $\hat T_{j+1}^{(r)}-\hat T_j^{(r)}$ geometric with parameter $\frac{r-j}{r}$. The WLLN-type analysis also gives that
$$\frac{\hat T_r^{(r)}}{r\log r}\underset{r\to\infty}{\longrightarrow}1,\qquad\text{in probability.} \tag{19.23}$$
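A short simulation makes (19.23) tangible. The sketch below compares the empirical mean of $\hat T_r^{(r)}$ with its exact expectation, obtained by summing the means $r/(r-j)$ of the geometric increments, and with $r\log r$. The value of $r$, the number of trials, the seed and the function names are arbitrary choices for illustration.

```python
import math
import random

def collect_all(r, rng):
    """Number of coupons drawn until all r flavors have been seen."""
    seen, n = set(), 0
    while len(seen) < r:
        seen.add(rng.randrange(r))
        n += 1
    return n

def expected_collect_all(r):
    """Sum of the means r/(r-j) of the geometric increments, j = 0, ..., r-1."""
    return sum(r / (r - j) for j in range(r))

rng = random.Random(2024)
r = 50
trials = [collect_all(r, rng) for _ in range(2000)]
print(sum(trials) / len(trials))   # empirical mean
print(expected_collect_all(r))     # exact mean
print(r * math.log(r))             # leading-order prediction from (19.23)
```

For moderate $r$ the exact mean exceeds $r\log r$ by a term of order $r$, so the ratio in (19.23) approaches 1 only slowly.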
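Going back to our Markov chain, comparing the parameters $\frac{d_0-j}{2d}$ and $\frac{r-j}{r}$ for $r=2d$ shows that the random times $(\tau_j)_{j<d_0}$ have the same law as $(\hat T^{(2d)}_{2d-d_0+j+1}-\hat T^{(2d)}_{2d-d_0+j})_{j<d_0}$. In particular, $T_{d_0}=\sum_{j<d_0}\tau_j$ has the same law as $\hat T^{(2d)}_{2d}-\hat T^{(2d)}_{2d-d_0}$, which in the extreme case $d_0=d$ reads $\hat T^{(2d)}_{2d}-\hat T^{(2d)}_{d}$. But $\hat T^{(2d)}_{d}/d$ converges in probability to a finite constant and so, by (19.23), we conclude that $T_{d_0}\sim2d\log(2d)$ when $d_0=d$; for smaller $d_0$ the time $T_{d_0}$ is stochastically smaller. Since this provides an upper bound on the mixing time, we have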
$$t_{\mathrm{mix}}(\varepsilon)=O(d\log d). \tag{19.24}$$
This actually turns out to be correct to within a multiplicative constant. Indeed, it is known that for $\varepsilon>0$ and $d\gg1$, the quantity $\sup_{x\in S}\|p^n(x,\cdot)-\nu\|$ for the chain in dimension $d$ sweeps very rapidly from nearly 1 to nearly 0 as $n$ increases from $(1/2-\varepsilon)d\log d$ to $(1/2+\varepsilon)d\log d$. This feature, discovered by Persi Diaconis and referred to as the cutoff phenomenon, is one of the most exciting recent developments in the area of Markov chains.
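The sharp drop can be observed numerically without enumerating all $2^d$ vertices. By the symmetry of the walk (19.19) under permutations of coordinates, $p^n(x,y)$ depends on $y$ only through the Hamming distance from $x$, so the variational distance to the uniform distribution equals that between the law of this distance and the Binomial$(d,1/2)$ law. The sketch below uses this reduction; the function name, the value $d=50$ and the choice $\varepsilon=1/4$ are assumptions made for illustration, and floating-point arithmetic is used throughout.

```python
import math

def tv_profile(d, n_max):
    """Variational distance to uniform of the walk (19.19) after n = 0..n_max steps."""
    binom = [math.comb(d, k) / 2 ** d for k in range(d + 1)]
    law = [0.0] * (d + 1)
    law[0] = 1.0   # at time 0 the Hamming distance from the starting point is 0
    profile = []
    for _ in range(n_max + 1):
        profile.append(sum(abs(a - b) for a, b in zip(law, binom)) / 2)
        new = [0.0] * (d + 1)
        for k, mass in enumerate(law):
            new[k] += mass / 2                          # the walk stays put
            if k > 0:
                new[k - 1] += mass * k / (2 * d)        # a changed coordinate flips back
            if k < d:
                new[k + 1] += mass * (d - k) / (2 * d)  # an unchanged coordinate flips
        law = new
    return profile

d = 50
scale = d * math.log(d)
profile = tv_profile(d, math.ceil(scale))
for factor in (0.25, 0.5, 0.75, 1.0):
    n = min(round(factor * scale), len(profile) - 1)
    print(factor, n, profile[n])
```

For such a small $d$ the transition is still gradual; the window around $\frac12 d\log d$ becomes sharp, on the scale $d\log d$, only as $d$ grows.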