Markov Chain Monte Carlo Algorithms

3.2 Model Fitting

3.2.2 Markov Chain Monte Carlo Algorithms

Markov Chain Monte Carlo (MCMC) procedures are the standard method in the Bayesian community for fitting not only mixture models, but models in general. In contrast to the EM algorithm, they yield not just point estimates of the model parameters but a sample from the posterior distribution of each of these parameters. Integration across the posterior distribution, e.g., to calculate expectations of quantities of interest, can thus be approximated by summarizing across the obtained sample (this approach is called Monte Carlo integration).

A Markov chain is a sequence of random variables in which each new state A_{t+1} is drawn from the conditional distribution p(A | A_t), i.e., conditional on A_t, A_{t+1} is independent of {A_1, …, A_{t−1}}. For any given model parameter, the basic concept of MCMC methods is to construct a Markov chain with a stationary distribution ϕ that equals the desired posterior distribution

$$p(\theta \mid y) = \frac{p(y \mid \theta)\, p(\theta)}{\int p(y \mid \theta)\, p(\theta)\, d\theta} \propto p(y \mid \theta)\, p(\theta). \tag{3.11}$$

This may be achieved by means of the Metropolis-Hastings algorithm (Hastings, 1970), presented in Algorithm 1, or one of its many special cases.

Algorithm 1 Metropolis-Hastings

If the Markov chain A is in iteration t ≥ 0, carry out the following steps:

1. Draw a candidate B from a proposal distribution q(· | A_t), e.g., a (potentially multivariate) normal distribution.

2. Defining ϕ as the stationary distribution of the Markov chain, accept B with probability

$$\alpha(A_t, B) = \min\left\{1, \frac{\phi(B)\, q(A_t \mid B)}{\phi(A_t)\, q(B \mid A_t)}\right\}, \tag{3.12}$$

i.e., draw U_MH ∼ U(0, 1) and set A_{t+1} = B if U_MH ≤ α(A_t, B). Otherwise set A_{t+1} = A_t.

3. Increase the iteration count, i.e., set t = t + 1.
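The steps of Algorithm 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used in this work: the target is a one-dimensional standard normal density known up to a constant, the proposal is a normal random walk (so the proposal ratio in (3.12) equals 1), and the names `metropolis_hastings`, `log_std_normal` and `proposal_sd` are hypothetical. The last lines show Monte Carlo integration, i.e., expectations approximated by averages over the chain.

```python
import math
import random

def metropolis_hastings(log_target, initial, n_iter, proposal_sd, rng):
    """Random-walk Metropolis-Hastings for a one-dimensional target.

    log_target returns the log of the stationary density phi, up to an
    additive constant. The normal proposal is symmetric, so the ratio
    q(A_t | B) / q(B | A_t) in (3.12) equals 1 and drops out.
    """
    chain = [initial]
    current = initial
    current_logp = log_target(current)
    for _ in range(n_iter):
        # Step 1: draw a candidate B from q(. | A_t).
        candidate = rng.gauss(current, proposal_sd)
        candidate_logp = log_target(candidate)
        # Step 2: accept with probability min(1, phi(B) / phi(A_t)),
        # evaluated on the log scale for numerical stability.
        log_alpha = min(0.0, candidate_logp - current_logp)
        if math.log(1.0 - rng.random()) <= log_alpha:
            current, current_logp = candidate, candidate_logp
        # Step 3: store A_{t+1} and move on to the next iteration.
        chain.append(current)
    return chain

def log_std_normal(x):
    # Illustrative target: standard normal density without its constant.
    return -0.5 * x * x

rng = random.Random(0)
chain = metropolis_hastings(log_std_normal, initial=0.0, n_iter=20000,
                            proposal_sd=2.4, rng=rng)

# Monte Carlo integration: expectations become sample averages.
mean_estimate = sum(chain) / len(chain)
second_moment = sum(x * x for x in chain) / len(chain)
```

For this target the two averages should land close to 0 and 1, the mean and second moment of the standard normal distribution.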

In (3.12), ϕ is the stationary distribution of the Markov chain regardless of the choice of proposal distribution q(· | ·), and once the Markov chain reaches the stationary distribution, it will not leave it anymore (see, e.g., Gelman et al., 2013).

A good proposal distribution q(· | A_t) should be easy to sample from for any A_t, and each accepted jump should go a reasonable distance in the parameter space, since otherwise the chain would move too slowly. Finally, the acceptance ratio α(A_t, B) should be easy to calculate, and the acceptance rate should be in a reasonable range (Gelman et al., 2013). A very low acceptance rate will cause a long running time of the algorithm, while a very high acceptance rate typically signals small proposal steps and thus leads to autocorrelation between the values of the chain, running contrary to the goal of obtaining an independent sample of draws from the posterior distribution. Thus, in general the goal is to achieve an acceptance rate between 0.1 and 0.9, ideally between 0.3 and 0.7. To ensure this, it may be necessary to tune the standard deviation of the proposal distribution. Any autocorrelation that remains in the Markov chain after running the algorithm can be reduced by keeping and analyzing only every w-th draw and discarding the others, a procedure usually referred to as 'thinning'.
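The effect of the proposal standard deviation on the acceptance rate, and of thinning on autocorrelation, can be checked empirically. The following sketch assumes a standard normal target and a normal random-walk proposal; the function names, the three proposal standard deviations and the thinning interval w = 10 are illustrative choices, not values taken from this work.

```python
import math
import random

def random_walk_chain(proposal_sd, n_iter, seed):
    """Random-walk Metropolis chain targeting a standard normal density.
    Returns the draws and the observed acceptance rate."""
    rng = random.Random(seed)
    current, accepted, draws = 0.0, 0, []
    for _ in range(n_iter):
        candidate = rng.gauss(current, proposal_sd)
        # Log of phi(candidate) / phi(current) for the standard normal.
        log_alpha = min(0.0, 0.5 * (current ** 2 - candidate ** 2))
        if math.log(1.0 - rng.random()) <= log_alpha:
            current = candidate
            accepted += 1
        draws.append(current)
    return draws, accepted / n_iter

def lag1_autocorrelation(values):
    """Sample autocorrelation between consecutive values."""
    mean = sum(values) / len(values)
    centered = [v - mean for v in values]
    num = sum(a * b for a, b in zip(centered, centered[1:]))
    den = sum(c * c for c in centered)
    return num / den

# Acceptance rate for a tiny, a moderate and a huge proposal sd.
rates = {}
for sd in (0.05, 2.5, 50.0):
    _, rates[sd] = random_walk_chain(sd, n_iter=20000, seed=1)

# Thinning: keep only every w-th draw of a strongly correlated chain.
draws, _ = random_walk_chain(0.5, n_iter=20000, seed=2)
w = 10
thinned = draws[::w]
acf_full = lag1_autocorrelation(draws)
acf_thinned = lag1_autocorrelation(thinned)
```

In this setup the tiny step size accepts almost every candidate, the huge one almost none, and the thinned chain shows a clearly lower lag-1 autocorrelation than the full chain, at the price of a shorter sample.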

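Note that although the acceptance probability (3.12) contains the unknown posterior distribution ϕ, it can still be calculated: the normalizing integral in the denominator of (3.11) is the same for ϕ(B) and ϕ(A_t) and therefore cancels in the acceptance ratio, leaving only known quantities:

$$\alpha(A_t, B) = \min\left\{1, \frac{p(y \mid B)\, p(B)\, q(A_t \mid B)}{p(y \mid A_t)\, p(A_t)\, q(B \mid A_t)}\right\}.$$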

The first B iterations that pass before the Markov chain reaches the stationary distribution are discarded as burn-in, leaving the remaining sample (θ_t, ϕ_t), t = B + 1, …, as the basis for inference.
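A small example may make both points concrete: the sampler only ever evaluates likelihood times prior, and the early draws are dropped as burn-in. The sketch below assumes a hypothetical data set with y_i ∼ N(θ, 1) and a N(0, 10²) prior on θ, chosen because the exact posterior mean is available for comparison; the data values, the starting value 50 and the burn-in length of 2000 are illustrative assumptions.

```python
import math
import random

# Hypothetical data, assumed to follow y_i ~ N(theta, 1).
y = [4.2, 5.1, 3.8, 4.9, 5.4, 4.4, 5.0, 4.7]
prior_mean, prior_sd = 0.0, 10.0   # assumed prior theta ~ N(0, 10^2)

def log_unnormalized_posterior(theta):
    """log p(y | theta) + log p(theta), dropping all additive constants.
    The normalizing integral in (3.11) is never evaluated."""
    log_lik = -0.5 * sum((yi - theta) ** 2 for yi in y)
    log_prior = -0.5 * ((theta - prior_mean) / prior_sd) ** 2
    return log_lik + log_prior

rng = random.Random(42)
theta = 50.0                       # deliberately poor starting value
logp = log_unnormalized_posterior(theta)
chain = []
for _ in range(12000):
    candidate = rng.gauss(theta, 0.8)          # symmetric proposal
    cand_logp = log_unnormalized_posterior(candidate)
    # log(U) <= log ratio is equivalent to U <= min(1, ratio).
    if math.log(1.0 - rng.random()) <= cand_logp - logp:
        theta, logp = candidate, cand_logp
    chain.append(theta)

# Discard the first draws as burn-in and use the rest for inference.
burn_in = 2000
kept = chain[burn_in:]
posterior_mean_mcmc = sum(kept) / len(kept)

# Closed-form posterior mean of this conjugate model, for comparison.
n = len(y)
precision = n + 1.0 / prior_sd ** 2
posterior_mean_exact = (sum(y) + prior_mean / prior_sd ** 2) / precision
```

Without the burn-in step, the draws made while the chain travels from the starting value towards the bulk of the posterior would bias the average.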

There also exists a componentwise version of the Metropolis-Hastings algorithm, presented in Algorithm 2.

The distributions ϕ(A_{.i} | A_{.−i}) = ϕ(A) / ∫ ϕ(A) dA_{.i} are called full conditional distributions. In Algorithm 2, components can be multi-dimensional themselves and may be defined based on the specifics of the model, e.g., correlated components are often grouped into a new component. The grouping may also vary between the iterations.

Algorithm 2 Componentwise Metropolis-Hastings

Partition the Markov chain A into components/subchains {A_{.1}, …, A_{.n}} and define A_{.−i} = {A_{.1}, …, A_{.i−1}, A_{.i+1}, …, A_{.n}}. If the Markov chain A_{.i} rests in iteration t + 1 and is, thus, still in state A_{t.i}, carry out the following steps:

1. Draw a candidate B_{.i} from the proposal distribution q_i(B_{.i} | A_{t.i}, A_{t.−i}), where A_{t.−i} = {A_{t+1.1}, …, A_{t+1.i−1}, A_{t.i+1}, …, A_{t.n}}.

2. Accept B_{.i} with probability

$$\alpha(A_{.-i}, A_{.i}, B_{.i}) = \min\left\{1, \frac{\phi(B_{.i} \mid A_{.-i})\, q_i(A_{.i} \mid B_{.i}, A_{.-i})}{\phi(A_{.i} \mid A_{.-i})\, q_i(B_{.i} \mid A_{.i}, A_{.-i})}\right\}.$$

3. Set A_{t+1.i} = B_{.i} if B_{.i} is accepted. Otherwise, set A_{t+1.i} = A_{t.i}.

The componentwise Metropolis-Hastings algorithm owes its importance to one of its special cases in particular, the Gibbs sampler (Gelfand and Smith, 1990), probably the MCMC procedure most frequently used in Bayesian modeling. Its proposal distribution for the ith component is given by

$$q_i(B_{.i} \mid A_{.i}, A_{.-i}) = \phi(B_{.i} \mid A_{.-i}),$$

where ϕ(B_{.i} | A_{.−i}) is the full conditional distribution. Thus, the Gibbs sampler can only be used if the full conditional distributions are known. This is particularly the case when conjugate prior distributions are employed, which is an important reason for the popularity of conjugate models in practice (see Section 3.1.2). The advantage of the Gibbs sampler is that it always accepts the proposed candidate value, since

$$\alpha(A_{.-i}, A_{.i}, B_{.i}) = \min\left\{1, \frac{\phi(B_{.i} \mid A_{.-i})\, \phi(A_{.i} \mid A_{.-i})}{\phi(A_{.i} \mid A_{.-i})\, \phi(B_{.i} \mid A_{.-i})}\right\} = 1,$$

which typically saves a considerable amount of computation time compared to the general Metropolis-Hastings algorithm.
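A textbook case in which both full conditionals are known is the standard bivariate normal distribution with correlation ρ. The following sketch is a toy example chosen for illustration, not a model from this work; it alternates between the two full conditionals, and because each draw comes directly from a full conditional, the loop contains no acceptance step. The function name and the value ρ = 0.6 are assumptions.

```python
import math
import random

def gibbs_bivariate_normal(rho, n_iter, seed):
    """Gibbs sampler for a standard bivariate normal with correlation rho.

    Both full conditionals are known normal distributions:
    x1 | x2 ~ N(rho * x2, 1 - rho^2) and x2 | x1 ~ N(rho * x1, 1 - rho^2).
    Every draw is accepted, so no acceptance step is needed.
    """
    rng = random.Random(seed)
    cond_sd = math.sqrt(1.0 - rho ** 2)
    x1, x2 = 0.0, 0.0
    samples = []
    for _ in range(n_iter):
        x1 = rng.gauss(rho * x2, cond_sd)   # uses the current x2
        x2 = rng.gauss(rho * x1, cond_sd)   # uses the freshly updated x1
        samples.append((x1, x2))
    return samples

samples = gibbs_bivariate_normal(rho=0.6, n_iter=20000, seed=3)
mean_x1 = sum(s[0] for s in samples) / len(samples)
mean_x2 = sum(s[1] for s in samples) / len(samples)
# E[x1 * x2] equals rho for standardized components.
cross_moment = sum(s[0] * s[1] for s in samples) / len(samples)
```

Note that the second component is updated conditional on the already updated first component, mirroring the definition of A_{t.−i} in Algorithm 2.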

In the case of finite mixture models employing a Dirichlet distribution as prior, a standard implementation of the Gibbs sampler starts from an initial allocation T^(0) and then, in iteration t, alternates between the following steps, in which y_k^(t) = {y_i : T_i^(t) = k} and n_k is the number of observations allocated to component k in T^(t) (cf. Algorithm 2):

1. Draw π*^(t+1) | T^(t) from the Dirichlet(α_1 + n_1, …, α_K + n_K) distribution.

2. For k = 1, …, K, draw θ_k^(t+1) | y, T^(t) from p(θ_k | y_k^(t)). If p(θ_k | y_k^(t)) is not known, a Metropolis-Hastings step has to be employed here.

3. For i = 1, …, n, draw T_i^(t+1) | y_i, θ^(t+1), π*^(t+1) from the corresponding conditional distribution, i.e., from (3.2).

4. Update chain values and proceed to the next iteration.
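The four steps can be sketched for a deliberately simple case. The code below assumes a two-component Gaussian mixture with known component standard deviation and a conjugate normal prior on each component mean, so that step 2 needs no Metropolis-Hastings step; it further assumes that the allocation probabilities in step 3 are proportional to weight times component density. The synthetic data, the prior settings and all variable names are illustrative assumptions rather than the model of this work.

```python
import math
import random

rng = random.Random(7)
K, sigma = 2, 1.0                    # assumed: known component sd
alpha = [1.0] * K                    # Dirichlet prior parameters
prior_mean, prior_sd = 0.0, 10.0     # assumed N(0, 10^2) prior on theta_k

# Synthetic data for illustration: 60 draws around -3, 40 around +3.
y = [rng.gauss(-3.0, sigma) for _ in range(60)] + \
    [rng.gauss(3.0, sigma) for _ in range(40)]
n = len(y)

# Initial allocation T^(0): split the data at its overall mean.
y_bar = sum(y) / n
T = [0 if yi < y_bar else 1 for yi in y]
theta_draws, weight_draws = [], []

for t in range(1500):
    # Step 1: weights | T ~ Dirichlet(alpha_1 + n_1, ..., alpha_K + n_K),
    # drawn by normalizing independent gamma variates.
    counts = [T.count(k) for k in range(K)]
    g = [rng.gammavariate(alpha[k] + counts[k], 1.0) for k in range(K)]
    weights = [gk / sum(g) for gk in g]

    # Step 2: theta_k | y_k from its conjugate normal full conditional.
    theta = []
    for k in range(K):
        y_k = [yi for yi, Ti in zip(y, T) if Ti == k]
        precision = len(y_k) / sigma ** 2 + 1.0 / prior_sd ** 2
        mean = (sum(y_k) / sigma ** 2
                + prior_mean / prior_sd ** 2) / precision
        theta.append(rng.gauss(mean, 1.0 / math.sqrt(precision)))

    # Step 3: T_i | y_i, theta, weights, with probabilities proportional
    # to weights[k] * N(y_i | theta[k], sigma^2).
    for i, yi in enumerate(y):
        probs = [weights[k] * math.exp(-0.5 * ((yi - theta[k]) / sigma) ** 2)
                 for k in range(K)]
        T[i] = rng.choices(range(K), weights=probs)[0]

    # Step 4: store the current values and continue.
    theta_draws.append(theta)
    weight_draws.append(weights)

burn_in = 500
kept = theta_draws[burn_in:]
theta_means = [sum(d[k] for d in kept) / len(kept) for k in range(K)]
weight_means = [sum(w[k] for w in weight_draws[burn_in:]) / len(kept)
                for k in range(K)]
```

With components this well separated, the posterior means of the two θ_k should sit near −3 and 3. For overlapping components the component labels may switch during sampling, which this sketch does not address.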

The EM algorithm and the Gibbs sampler are structurally similar in that both alternate between updates of one set of quantities conditional on the others. The Gibbs sampler may thus be interpreted as a stochastic version of the EM algorithm. While both may in principle be used to fit Bayesian mixture models, only the Gibbs sampler may be used to fit complex hierarchical models.

In document Finite Bayesian mixture models with applications in spatial cluster analysis and bioinformatics (Page 47-51)