
3.2 Model Fitting

3.2.2 Markov Chain Monte Carlo Algorithms

Markov Chain Monte Carlo (MCMC) procedures are the standard method in the Bayesian community for fitting not only mixture models, but models in general. In contrast to the EM algorithm, they yield not just point estimates of the model parameters, but a sample from the posterior distribution of each of these parameters. Integration across the posterior distribution, e.g. to calculate expectations of quantities of interest, can thus be approximated by summarizing across the obtained sample (this approach is called Monte Carlo integration).

A Markov chain $A$ is a sequence of random variables in which $A^{t+1}$ is drawn from the conditional distribution $p(A \mid A^t)$, i.e. conditional on $A^t$, $A^{t+1}$ is independent of $\{A^1, \ldots, A^{t-1}\}$. For any given model parameter, the basic concept of MCMC methods is to construct a Markov chain with a stationary distribution $\phi$ that equals the desired posterior distribution

$$p(\theta \mid y) = \frac{p(y \mid \theta)\, p(\theta)}{\int p(y \mid \theta)\, p(\theta)\, d\theta} \propto p(y \mid \theta)\, p(\theta). \tag{3.11}$$

This may be achieved by means of the Metropolis-Hastings algorithm (Hastings, 1970), presented in Algorithm 1, or one of its many special cases.

Algorithm 1 Metropolis-Hastings

If the Markov chain $A$ is in iteration $t \geq 0$, carry out the following steps:

1. Draw a candidate $B$ from a proposal distribution $q(\cdot \mid A^t)$, e.g., a (potentially multivariate) normal distribution.

2. Defining $\phi$ as the stationary distribution of the Markov chain, accept $B$ with probability

$$\alpha(A^t, B) = \min\left\{1, \frac{\phi(B)\, q(A^t \mid B)}{\phi(A^t)\, q(B \mid A^t)}\right\}, \tag{3.12}$$

i.e. draw $U_{MH} \sim U(0, 1)$ and set $A^{t+1} = B$ if $U_{MH} \leq \alpha(A^t, B)$. Otherwise set $A^{t+1} = A^t$.

3. Increase the iteration count, i.e. set $t = t + 1$.

In (3.12), $\phi$ is the stationary distribution of the Markov chain regardless of the choice of proposal distribution $q(\cdot \mid \cdot)$, and once the Markov chain has reached the stationary distribution, it will not leave it again (see, e.g., Gelman et al., 2013).
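The steps of Algorithm 1 can be sketched in a few lines of Python. The following is a minimal illustration rather than a reference implementation. It assumes a toy model that does not appear in the text (normal data with unit variance and a normal prior on the mean), and the names `log_unnormalized_posterior` and `metropolis_hastings` are hypothetical. The proposal is a symmetric random-walk normal, so the $q$ terms in (3.12) cancel. The last lines approximate the posterior mean by Monte Carlo integration over the retained draws.

```python
import math
import random


def log_unnormalized_posterior(mu, data, prior_sd=10.0):
    """log p(y | mu) + log p(mu) up to an additive constant.

    Assumed toy model: y_i ~ N(mu, 1), mu ~ N(0, prior_sd^2).
    """
    log_prior = -0.5 * (mu / prior_sd) ** 2
    log_likelihood = -0.5 * sum((y - mu) ** 2 for y in data)
    return log_prior + log_likelihood


def metropolis_hastings(log_target, start, n_iter, proposal_sd, rng):
    """Random-walk Metropolis-Hastings; returns (chain, acceptance rate)."""
    chain = [start]
    current_lp = log_target(start)
    accepted = 0
    for _ in range(n_iter):
        current = chain[-1]
        # Step 1: draw a candidate from a normal proposal centred at the current state.
        candidate = rng.gauss(current, proposal_sd)
        candidate_lp = log_target(candidate)
        # Step 2: acceptance probability on the log scale; the proposal is
        # symmetric, so only the ratio of target densities remains.
        log_alpha = min(0.0, candidate_lp - current_lp)
        if math.log(1.0 - rng.random()) <= log_alpha:
            chain.append(candidate)
            current_lp = candidate_lp
            accepted += 1
        else:
            chain.append(current)
        # Step 3: the loop moves on to the next iteration.
    return chain, accepted / n_iter


rng = random.Random(1)
data = [rng.gauss(2.0, 1.0) for _ in range(50)]
chain, acceptance_rate = metropolis_hastings(
    lambda mu: log_unnormalized_posterior(mu, data),
    start=0.0, n_iter=20000, proposal_sd=0.4, rng=rng,
)
kept = chain[2000:]  # drop early iterations before summarizing
posterior_mean_estimate = sum(kept) / len(kept)  # Monte Carlo integration
```

For this conjugate toy model the exact posterior mean is available in closed form, which makes it easy to check the estimate against it.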

A good proposal distribution $q(\cdot \mid A^t)$ should be easy to sample from for any $A^t$, and each accepted jump should cover a reasonable distance in the parameter space, since otherwise the chain would move too slowly. Finally, the acceptance ratio $\alpha(A^t, B)$ should be easy to calculate, and the acceptance rate should be in a reasonable range (Gelman et al., 2013). A very low acceptance rate leads to a long running time, since the chain rarely moves, while a very high acceptance rate typically goes along with small steps and thus strong autocorrelation between successive values of the chain, running contrary to the goal of obtaining an independent sample of draws from the posterior distribution. Thus, in general the goal is to achieve an acceptance rate between 0.1 and 0.9, ideally between 0.3 and 0.7. To ensure this, it may be necessary to tune the standard deviation of the proposal distribution. Autocorrelation that remains in the chain after running the algorithm can be reduced by keeping and analyzing only every $w$th draw and discarding the others, a procedure usually referred to as 'thinning'.
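The effect of the proposal standard deviation on the acceptance rate, and of thinning on autocorrelation, can be explored with a small experiment. The sketch below is illustrative only: the target is assumed to be a standard normal distribution, the proposal standard deviations (0.05, 2.4, 50) and the thinning interval $w = 10$ are arbitrary choices, and `random_walk_chain` and `lag1_autocorrelation` are hypothetical helper names.

```python
import math
import random


def random_walk_chain(proposal_sd, n_iter, rng):
    """Random-walk Metropolis chain targeting N(0, 1); returns (chain, acceptance rate)."""
    x, chain, accepted = 0.0, [], 0
    for _ in range(n_iter):
        candidate = rng.gauss(x, proposal_sd)
        # Log ratio of standard normal densities; the symmetric proposal cancels.
        log_alpha = min(0.0, 0.5 * (x * x - candidate * candidate))
        if math.log(1.0 - rng.random()) <= log_alpha:
            x = candidate
            accepted += 1
        chain.append(x)
    return chain, accepted / n_iter


def lag1_autocorrelation(xs):
    """Sample autocorrelation between successive values of a sequence."""
    mean = sum(xs) / len(xs)
    num = sum((a - mean) * (b - mean) for a, b in zip(xs[:-1], xs[1:]))
    den = sum((a - mean) ** 2 for a in xs)
    return num / den


rng = random.Random(0)

# Acceptance rate for a very narrow, a moderate, and a very wide proposal.
rates = {sd: random_walk_chain(sd, 20000, rng)[1] for sd in (0.05, 2.4, 50.0)}

# Thinning: keep only every w-th draw of a chain and compare autocorrelations.
chain, _ = random_walk_chain(2.4, 20000, rng)
w = 10
thinned = chain[::w]
acf_full = lag1_autocorrelation(chain)
acf_thinned = lag1_autocorrelation(thinned)
```

In such a run, the narrow proposal is accepted almost always and the wide proposal almost never, while the thinned chain shows a lower lag-1 autocorrelation than the full chain at the price of fewer retained draws.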

Note that although the acceptance probability (3.12) contains the unknown posterior distribution $\phi$, it can still be calculated, since the denominators in (3.11) cancel out in the acceptance ratio, leaving only known quantities:

$$\frac{\phi(B)}{\phi(A^t)} = \frac{p(B \mid y)}{p(A^t \mid y)} = \frac{p(y \mid B)\, p(B) \,/ \int p(y \mid \theta)\, p(\theta)\, d\theta}{p(y \mid A^t)\, p(A^t) \,/ \int p(y \mid \theta)\, p(\theta)\, d\theta} = \frac{p(y \mid B)\, p(B)}{p(y \mid A^t)\, p(A^t)}.$$
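This cancellation can be checked numerically in a case where the normalized posterior happens to be known. The sketch below assumes a beta-binomial model chosen purely for illustration (it is not part of the text): with a Beta($a$, $b$) prior and $k$ successes in $n$ trials, the posterior is Beta($a + k$, $b + n - k$). The function names and the numbers are hypothetical.

```python
import math


def unnormalized_posterior(theta, k, n, a, b):
    """Likelihood times prior with all factors free of theta dropped."""
    return theta ** (k + a - 1) * (1.0 - theta) ** (n - k + b - 1)


def normalized_posterior(theta, k, n, a, b):
    """Exact Beta(a + k, b + n - k) posterior density."""
    a_post, b_post = a + k, b + n - k
    log_norm = math.lgamma(a_post + b_post) - math.lgamma(a_post) - math.lgamma(b_post)
    return math.exp(log_norm) * theta ** (a_post - 1) * (1.0 - theta) ** (b_post - 1)


k, n, a, b = 7, 20, 2.0, 2.0
current, candidate = 0.30, 0.45

# Ratio phi(B) / phi(A^t) computed without and with the normalizing constant.
ratio_unnormalized = (unnormalized_posterior(candidate, k, n, a, b)
                      / unnormalized_posterior(current, k, n, a, b))
ratio_normalized = (normalized_posterior(candidate, k, n, a, b)
                    / normalized_posterior(current, k, n, a, b))
```

Both ratios agree up to floating-point error, so a sampler only ever needs the product of likelihood and prior.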

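The first $B$ iterations that pass before the Markov chain reaches the stationary distribution are discarded as burn-in (here, $B$ denotes the number of burn-in iterations, not the candidate value of Algorithm 1), leaving the remaining sample $(\theta^t, \phi^t)$, $t = B + 1, \ldots$, as the basis for inference.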

There also exists a componentwise version of the Metropolis-Hastings algorithm, presented in Algorithm 2.

The distributions $\phi(A_{.i} \mid A_{.-i}) = \phi(A) \,/ \int \phi(A)\, dA_{.i}$ are called full conditional distributions. In Algorithm 2, components can be multi-dimensional themselves and may be defined based on the specifics of the model, e.g., correlated components are often grouped into a new component. The grouping may also vary between the iterations.
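A componentwise Metropolis-Hastings update (Algorithm 2) might be sketched as follows. This is an illustration under stated assumptions, not the implementation used in the text: the target is assumed to be a standard bivariate normal distribution with correlation 0.8, each of the two scalar components is updated with a symmetric random-walk proposal, and all names are hypothetical. Because the full conditional of a component is the joint density divided by a term that does not depend on that component, the ratio of full conditionals in the acceptance probability equals the ratio of joint densities.

```python
import math
import random

RHO = 0.8  # assumed correlation of the illustrative target


def log_joint(x):
    """Log density (up to a constant) of a standard bivariate normal with correlation RHO."""
    return -(x[0] ** 2 - 2.0 * RHO * x[0] * x[1] + x[1] ** 2) / (2.0 * (1.0 - RHO ** 2))


def componentwise_mh(n_iter, proposal_sd, rng):
    """Update one component at a time, conditioning on the latest values of the others."""
    state = [0.0, 0.0]
    chain = []
    for _ in range(n_iter):
        for i in range(len(state)):
            # Components before i already hold their values for this iteration.
            candidate = list(state)
            candidate[i] = rng.gauss(state[i], proposal_sd)
            # Ratio of full conditionals = ratio of joint densities here;
            # the symmetric proposal cancels.
            log_alpha = min(0.0, log_joint(candidate) - log_joint(state))
            if math.log(1.0 - rng.random()) <= log_alpha:
                state = candidate
        chain.append(tuple(state))
    return chain


def sample_correlation(pairs):
    """Pearson correlation of a list of (x, y) pairs."""
    n = len(pairs)
    mx = sum(p[0] for p in pairs) / n
    my = sum(p[1] for p in pairs) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in pairs)
    sxx = sum((p[0] - mx) ** 2 for p in pairs)
    syy = sum((p[1] - my) ** 2 for p in pairs)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(42)
draws = componentwise_mh(40000, 1.0, rng)[1000:]  # drop the first 1000 iterations
mean_first = sum(d[0] for d in draws) / len(draws)
estimated_correlation = sample_correlation(draws)
```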

Algorithm 2 Componentwise Metropolis-Hastings

Partition the Markov chain $A$ into components/subchains $\{A_{.1}, \ldots, A_{.n}\}$ and define $A_{.-i} = \{A_{.1}, \ldots, A_{.i-1}, A_{.i+1}, \ldots, A_{.n}\}$. If the subchain $A_{.i}$ is in iteration $t + 1$ and is, thus, still in state $A^t_{.i}$, carry out the following steps:

1. Draw a candidate $B_{.i}$ from the proposal distribution $q_i(B_{.i} \mid A^t_{.i}, A^t_{.-i})$, where $A^t_{.-i} = \{A^{t+1}_{.1}, \ldots, A^{t+1}_{.i-1}, A^t_{.i+1}, \ldots, A^t_{.n}\}$.

2. Accept $B_{.i}$ with probability

$$\alpha(A_{.-i}, A_{.i}, B_{.i}) = \min\left\{1, \frac{\phi(B_{.i} \mid A_{.-i})\, q_i(A_{.i} \mid B_{.i}, A_{.-i})}{\phi(A_{.i} \mid A_{.-i})\, q_i(B_{.i} \mid A_{.i}, A_{.-i})}\right\}.$$

3. Set $A^{t+1}_{.i} = B_{.i}$ if $B_{.i}$ is accepted. Otherwise, set $A^{t+1}_{.i} = A^t_{.i}$.

The componentwise Metropolis-Hastings algorithm owes its importance to one of its special cases in particular, the Gibbs sampler (Gelfand and Smith, 1990), probably the MCMC procedure most frequently used in Bayesian modeling. Its proposal distribution for the $i$th component is given by

$$q_i(B_{.i} \mid A_{.i}, A_{.-i}) = \phi(B_{.i} \mid A_{.-i}),$$

where $\phi(B_{.i} \mid A_{.-i})$ is the full conditional distribution. Thus, the Gibbs sampler can only be used if the full conditional distributions are known. This is the case in particular when conjugate prior distributions are employed, which is an important reason for the popularity of conjugate models in practice (see Section 3.1.2). The advantage of the Gibbs sampler is that it always accepts the proposed candidate value, since

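$$\alpha(A_{.-i}, A_{.i}, B_{.i}) = \min\left\{1, \frac{\phi(B_{.i} \mid A_{.-i})\, \phi(A_{.i} \mid A_{.-i})}{\phi(A_{.i} \mid A_{.-i})\, \phi(B_{.i} \mid A_{.-i})}\right\} = 1,$$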

which typically saves a considerable amount of computation time compared to the general Metropolis-Hastings algorithm.
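As a minimal illustration of a Gibbs sampler, consider a target whose full conditionals are known in closed form. The sketch below assumes a standard bivariate normal distribution with correlation $\rho$, for which each component given the other is normal with mean $\rho$ times the other component and variance $1 - \rho^2$. The example and the function names are hypothetical and not taken from the text. Every draw is kept, so no acceptance step appears in the code.

```python
import math
import random


def gibbs_bivariate_normal(rho, n_iter, rng):
    """Gibbs sampler for a standard bivariate normal with correlation rho."""
    conditional_sd = math.sqrt(1.0 - rho ** 2)
    x1, x2 = 0.0, 0.0
    draws = []
    for _ in range(n_iter):
        # Each component is drawn from its full conditional given the
        # most recent value of the other component.
        x1 = rng.gauss(rho * x2, conditional_sd)
        x2 = rng.gauss(rho * x1, conditional_sd)
        draws.append((x1, x2))
    return draws


def sample_correlation(pairs):
    """Pearson correlation of a list of (x, y) pairs."""
    n = len(pairs)
    mx = sum(p[0] for p in pairs) / n
    my = sum(p[1] for p in pairs) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in pairs)
    sxx = sum((p[0] - mx) ** 2 for p in pairs)
    syy = sum((p[1] - my) ** 2 for p in pairs)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(3)
rho = 0.8
draws = gibbs_bivariate_normal(rho, 20000, rng)[500:]  # drop the first 500 iterations
mean_first = sum(d[0] for d in draws) / len(draws)
estimated_correlation = sample_correlation(draws)
```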

For finite mixture models with a Dirichlet prior on the component weights, a standard implementation of the Gibbs sampler starts from an initial allocation $T^{(0)}$ and then, in iteration $t$, alternates between the following steps, in which $y_k^{(t)} = \{y_i : T_i^{(t)} = k\}$ and $n_k$ is the number of observations in $y_k^{(t)}$ (cf. Algorithm 2):

1. Draw $\pi^{*(t+1)} \mid T^{(t)}$ from the Dirichlet$(\alpha_1 + n_1, \ldots, \alpha_K + n_K)$ distribution.

2. For $k = 1, \ldots, K$, draw $\theta_k^{(t+1)} \mid y, T^{(t)}$ from $p(\theta_k \mid y_k^{(t)})$. If $p(\theta_k \mid y_k^{(t)})$ is not known, a Metropolis-Hastings step has to be employed here.

3. For $i = 1, \ldots, n$, draw $T_i^{(t+1)} \mid y_i, \theta^{(t+1)}, \pi^{*(t+1)}$ from the corresponding conditional distribution, i.e. from (3.2).

4. Update the chain values and proceed to the next iteration.
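The four steps above can be sketched for a simple special case. The code below is a hedged illustration, not the implementation referred to in the text. It assumes a univariate Gaussian mixture with known unit variance, a conjugate N($m_0$, $s_0^2$) prior on each component mean, and a symmetric Dirichlet prior. It also assumes that the allocation probabilities in (3.2), which is not reproduced here, have the usual form proportional to $\pi_k\, f(y_i \mid \theta_k)$. All function and parameter names are hypothetical.

```python
import math
import random


def gibbs_mixture(y, K, n_iter, rng, alpha=1.0, m0=0.0, s0=10.0):
    """Gibbs sampler for a K-component normal mixture with unit variances."""
    T = [rng.randrange(K) for _ in y]  # initial allocation T^(0)
    draws = []
    for _ in range(n_iter):
        counts = [T.count(k) for k in range(K)]
        # Step 1: weights from Dirichlet(alpha + n_1, ..., alpha + n_K),
        # generated as normalized gamma variates.
        g = [rng.gammavariate(alpha + counts[k], 1.0) for k in range(K)]
        total = sum(g)
        pi = [v / total for v in g]
        # Step 2: component means from their conjugate normal posteriors.
        theta = []
        for k in range(K):
            sum_k = sum(yi for yi, t in zip(y, T) if t == k)
            precision = 1.0 / s0 ** 2 + counts[k]
            post_mean = (m0 / s0 ** 2 + sum_k) / precision
            theta.append(rng.gauss(post_mean, 1.0 / math.sqrt(precision)))
        # Step 3: allocations with probability proportional to pi_k * N(y_i | theta_k, 1).
        for i, yi in enumerate(y):
            weights = [pi[k] * math.exp(-0.5 * (yi - theta[k]) ** 2) for k in range(K)]
            T[i] = rng.choices(range(K), weights=weights)[0]
        # Step 4: store the current values and continue.
        draws.append((pi, theta))
    return draws


rng = random.Random(7)
# Simulated data: 80 observations around -2 and 120 observations around 3.
y = [rng.gauss(-2.0, 1.0) for _ in range(80)] + [rng.gauss(3.0, 1.0) for _ in range(120)]
draws = gibbs_mixture(y, K=2, n_iter=600, rng=rng)[200:]  # drop 200 burn-in draws

# Order components by their mean within each draw to sidestep label switching.
ordered = [sorted(zip(theta, pi)) for pi, theta in draws]
theta_means = [sum(d[k][0] for d in ordered) / len(ordered) for k in range(2)]
weight_means = [sum(d[k][1] for d in ordered) / len(ordered) for k in range(2)]
```

Sorting the components within each draw is only one simple way of dealing with label switching and is adequate here because the simulated components are well separated.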

The EM algorithm and the Gibbs sampler are structurally similar in that both alternate between updates of one set of variables conditional on the others. The Gibbs sampler may thus be interpreted as a stochastic version of the EM algorithm. While both may in principle be used to fit Bayesian mixture models, only the Gibbs sampler may be used to fit complex hierarchical models.
