Monte-Carlo (MCMC)
As introduced in the previous section, the Bayesian iterative simulation method is all about yielding draws from the posterior distribution. This idea is not hard to understand. However, the devil is in the detail. When we try to implement the simulation idea for some data, a difficult problem appears: how do we generate draws from a distribution that we are not familiar with, or one that is high dimensional and complicated? The Markov Chain Monte Carlo (MCMC) sampling method provides a way to draw from unknown or complex posterior distributions. To draw from those distributions, MCMC often involves breaking them down into more manageable distributions.
“Markov Chain” refers to the process in which the draws are made in sequence and are dependent, but where each draw depends only on the previous one. In Bayesian terminology, it generates a new value from the posterior distribution, given the previous value. “Monte Carlo” refers to the random simulation process.
We will introduce two common MCMC methods in the rest of this section.
7.2.1
Gibbs sampling algorithm
The Gibbs sampler is one of the most basic special cases of the MCMC method (Scott 2007). It is simply an iterative simulation method that produces a draw from the joint distribution in the case of a general pattern of missing data (Little & Rubin 2002). The Gibbs sampler can also be regarded as a multivariate extension of the chained data augmentation algorithm in which we estimate p parameters $\theta_1, \ldots, \theta_p$ (Peter 1997). This means it has the ability to simulate from the full conditional distribution $p(\theta_i \mid \theta_1, \ldots, \theta_{i-1}, \theta_{i+1}, \ldots, \theta_p)$, where $i \in \{1, \ldots, p\}$. A generic Gibbs sampler follows this iterative process (t indexes the iteration count):
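0. Assign a vector of starting values, $\theta^0$, to the parameter vector, and set $t = 0$.
1. Set $t = t + 1$.
2. Draw $\theta_1^t$ from $p(\theta_1^t \mid \theta_2^{t-1}, \theta_3^{t-1}, \ldots, \theta_p^{t-1})$.
3. Draw $\theta_2^t$ from $p(\theta_2^t \mid \theta_1^t, \theta_3^{t-1}, \ldots, \theta_p^{t-1})$.
...
p+1. Draw $\theta_p^t$ from $p(\theta_p^t \mid \theta_1^t, \theta_2^t, \ldots, \theta_{p-1}^t)$.
p+2. Return to step 1, repeating until convergence².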
In other words, Gibbs sampling orders the parameters, generates a draw from the conditional distribution of each parameter given the current values of all the other parameters, and cycles through this updating process repeatedly. The cycling stops when the distributions become stable and stationary³, that is, when the algorithm converges. Section 7.4 discusses the convergence of MCMC in more detail.
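The steps above can be sketched in a few lines of Python. This is only a minimal illustration under assumptions of our own: the target is a standard bivariate normal with correlation `rho`, chosen because both full conditionals are known normal distributions, and the function names, the seed and the burn-in length of 2000 are hypothetical choices rather than part of the method.

```python
import math
import random


def gibbs_bivariate_normal(n_iter, rho, start=(0.0, 0.0), seed=0):
    """Gibbs sampler for a standard bivariate normal with correlation rho.

    Illustrative target only: each full conditional is
    theta_i | theta_j ~ N(rho * theta_j, 1 - rho**2).
    """
    rng = random.Random(seed)
    cond_sd = math.sqrt(1.0 - rho ** 2)
    theta1, theta2 = start                  # step 0: starting values, t = 0
    draws = []
    for _ in range(n_iter):                 # step 1: t = t + 1
        # step 2: draw theta1 given the previous value of theta2
        theta1 = rng.gauss(rho * theta2, cond_sd)
        # step 3: draw theta2 given the value of theta1 just drawn
        theta2 = rng.gauss(rho * theta1, cond_sd)
        draws.append((theta1, theta2))
    return draws


def sample_correlation(pairs):
    """Pearson correlation of a list of (x, y) pairs."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy)


draws = gibbs_bivariate_normal(n_iter=20000, rho=0.8)
kept = draws[2000:]                         # discard an arbitrary burn-in
print("sample correlation:", round(sample_correlation(kept), 3))
```

A fixed number of iterations stands in here for a proper convergence check, so the sketch says nothing about when a real chain should be stopped.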
7.2.2
Metropolis-Hastings (MH) algorithm
As mentioned above, MCMC methods break down a complex or unfamiliar posterior distribution into smaller, manageable distributions, and parameters are then drawn from those smaller distributions. This is exactly what the Gibbs sampling algorithm does. The Gibbs sampler usually works well until we encounter situations in which even the conditional posterior distributions $p(\theta_k \mid \theta_1, \ldots, \theta_{k-1}, \theta_{k+1}, \ldots, \theta_p)$ are foreign to us, or in which we know the functional form but not the normalising constant, and have no means of drawing a sample directly. Again, we ask the question: “how do we generate draws from a distribution which we are not familiar with?”
The Metropolis-Hastings (MH) algorithm overcomes the unfamiliar distribution problem by generating draws from the posterior distribution indirectly (Hastings 1970). Basically, the MH algorithm draws a candidate point from a proposal distribution, then applies an acceptance rule to determine whether we accept the candidate point as a draw from the full conditional distribution (or the target distribution). The MH algorithm therefore bypasses the need to generate draws directly from the full conditional distribution or its component distributions. We can also have MCMC updates in blocks where all blocks except one or two are Gibbs updates, and the remaining special cases are MH steps.
Here is the generic process of using the MH algorithm to generate parameters from the posterior distribution (t indexes the iteration count):
1. Establish a starting value $\theta^0$ for the parameter $\theta$. Set $t = 1$.
2. Draw a “candidate” parameter, $\theta^c$, from a “proposal density” $q(\theta^c \mid \theta^{t-1})$.
3. Compute the ratio

$$R = \min\left(1, \frac{f(\theta^c)\, q(\theta^{t-1} \mid \theta^c)}{f(\theta^{t-1})\, q(\theta^c \mid \theta^{t-1})}\right) \tag{7.2}$$

4. Compare R with a U(0, 1) random draw u. If R > u, then set $\theta^t = \theta^c$. Otherwise, set $\theta^t = \theta^{t-1}$.
5. Set $t = t + 1$ and return to step 2 until the chain converges (see Section 7.4 for details of convergence).
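As a rough sketch of these five steps, the following Python code applies the ratio in Eq. (7.2) to a toy problem. Several things here are our own assumptions and not part of the algorithm as stated: the target `f` is an unnormalised N(3, 1) density, the proposal is a normal distribution centred at the previous value with standard deviation `step`, and a fixed iteration count replaces a convergence check. Because this proposal is symmetric, the two q terms cancel, but they are kept so that the code mirrors Eq. (7.2) term by term.

```python
import math
import random


def normal_pdf(x, mean, sd):
    """Density of N(mean, sd**2) at x; plays the role of q in this sketch."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def metropolis_hastings(f, n_iter, start, step=1.0, seed=0):
    """MH sampler for an unnormalised density f with a normal proposal."""
    rng = random.Random(seed)
    theta = start                               # step 1: starting value
    draws, accepted = [], 0
    for _ in range(n_iter):
        cand = rng.gauss(theta, step)           # step 2: candidate from q(. | theta)
        # step 3: ratio R as in Eq. (7.2)
        num = f(cand) * normal_pdf(theta, cand, step)
        den = f(theta) * normal_pdf(cand, theta, step)
        R = min(1.0, num / den)
        if R > rng.random():                    # step 4: compare R with u ~ U(0, 1)
            theta = cand
            accepted += 1
        draws.append(theta)                     # step 5: next iteration
    return draws, accepted / n_iter


def target(theta):
    """Unnormalised N(3, 1) density (hypothetical target for illustration)."""
    return math.exp(-0.5 * (theta - 3.0) ** 2)


draws, accept_rate = metropolis_hastings(target, n_iter=20000, start=0.0, step=2.0)
kept = draws[2000:]                             # discard an arbitrary burn-in
post_mean = sum(kept) / len(kept)
print("mean:", round(post_mean, 2), "acceptance rate:", round(accept_rate, 2))
```

Note that only the unnormalised form of the target is needed: any normalising constant of `f` cancels in the ratio.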
² Convergence is discussed in a later part of this chapter.
³ Stationary means the joint probability distribution of a stochastic process does not vary with respect to a
In the MH algorithm described above, suppose the “proposal density” $q(\theta^c \mid \theta^{t-1})$ is chosen to be independent of $\theta^{t-1}$, that is,

$$q(\theta^c \mid \theta^{t-1}) = q(\theta^c)$$

for a given probability density function $q(\theta^c)$. Then the candidate point is generated from $q(\theta^c)$ and is accepted or rejected with an acceptance probability $\alpha(\theta^{t-1}, \theta^c)$ given by:

$$\alpha(\theta^{t-1}, \theta^c) = \min\left(1, \frac{f(\theta^c)\, q(\theta^{t-1})}{f(\theta^{t-1})\, q(\theta^c)}\right)$$
This version of the MH algorithm is called the MH independence sampler. The MH independence sampler has the potential to speed up computation, since the candidate points are simply random draws from a fixed proposal distribution and are then either accepted or rejected. However, if the proposal distribution is not well matched to the target density, many proposals will be rejected. For example, if the proposal distribution is too wide, it will take a very long time for the Bayesian iterative chain to converge; if the proposal distribution is too narrow, the Bayesian iterative chain will not cover the target distribution. Hence, this method requires the proposal distribution to be as close as possible to the target distribution, otherwise the chain can get stuck in the tails of the target distribution (Marin & Robert 2007, pg. 93).
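The effect of a poorly matched proposal can be seen in a small experiment. The sketch below is only an illustration under our own assumptions: the target is an unnormalised standard normal density, the proposal is a fixed N(0, `proposal_sd`²) distribution, and the two proposal widths 1.5 and 10 are arbitrary choices for a "reasonably matched" and a "too wide" proposal.

```python
import math
import random


def normal_pdf(x, mean, sd):
    """Density of N(mean, sd**2) at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def target(theta):
    """Unnormalised standard normal density (illustrative target f)."""
    return math.exp(-0.5 * theta ** 2)


def independence_sampler(f, proposal_sd, n_iter, start=0.0, seed=0):
    """MH independence sampler with a fixed N(0, proposal_sd**2) proposal."""
    rng = random.Random(seed)
    theta = start
    draws, accepted = [], 0
    for _ in range(n_iter):
        # The candidate is drawn from q(theta_c), ignoring the current value.
        cand = rng.gauss(0.0, proposal_sd)
        # Acceptance probability alpha(theta_{t-1}, theta_c).
        num = f(cand) * normal_pdf(theta, 0.0, proposal_sd)
        den = f(theta) * normal_pdf(cand, 0.0, proposal_sd)
        alpha = min(1.0, num / den)
        if alpha > rng.random():
            theta = cand
            accepted += 1
        draws.append(theta)
    return draws, accepted / n_iter


_, rate_matched = independence_sampler(target, proposal_sd=1.5, n_iter=10000)
_, rate_wide = independence_sampler(target, proposal_sd=10.0, n_iter=10000)
print("acceptance rate, matched proposal:", round(rate_matched, 2))
print("acceptance rate, wide proposal:", round(rate_wide, 2))
```

With the wide proposal most candidates fall where the target density is negligible, so the chain repeats its current value for long stretches.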
7.2.3
Relationship between Gibbs and MH sampling
The Gibbs sampler is actually a special case of the MH algorithm. The only difference is that no candidate point is rejected in Gibbs sampling, because the ratio R is always 1 (Gamerman & Lopes 2006). Why? Consider the equation for the ratio R, Eq. (7.2). In Gibbs sampling, we set the “proposal density” $q(\theta^c \mid \theta^{t-1})$ equal to the target density $f(\theta^c)$. This means that $\theta^c$ is independent of $\theta^{t-1}$, so the update is an independence sampler. Hence, we have:
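$$R = \frac{f(\theta^c)\, q(\theta^{t-1} \mid \theta^c)}{f(\theta^{t-1})\, q(\theta^c \mid \theta^{t-1})} = \frac{f(\theta^c)\, f(\theta^{t-1})}{f(\theta^{t-1})\, f(\theta^c)} = \frac{f(\theta^c) / f(\theta^c)}{f(\theta^{t-1}) / f(\theta^{t-1})} = 1$$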
Since the candidate point is accepted with probability min(1, R), and R = 1 always holds, every draw is accepted.
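This result can be checked numerically for one concrete case. The sketch below rests on assumptions of our own: the target is a standard bivariate normal with correlation `rho`, and the proposal for the first component is its full conditional given the second, which is how the identity "proposal equals target" reads for a single Gibbs update. All function names are hypothetical.

```python
import math
import random


def joint_pdf(t1, t2, rho):
    """Standard bivariate normal density with correlation rho (target f)."""
    norm = 2.0 * math.pi * math.sqrt(1.0 - rho ** 2)
    quad = (t1 ** 2 - 2.0 * rho * t1 * t2 + t2 ** 2) / (1.0 - rho ** 2)
    return math.exp(-0.5 * quad) / norm


def cond_pdf(t1, t2, rho):
    """Full conditional density of t1 given t2: N(rho * t2, 1 - rho**2)."""
    var = 1.0 - rho ** 2
    return math.exp(-0.5 * (t1 - rho * t2) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def gibbs_mh_ratio(current, candidate, t2, rho):
    """MH ratio when the proposal for t1 is its full conditional given t2."""
    num = joint_pdf(candidate, t2, rho) * cond_pdf(current, t2, rho)
    den = joint_pdf(current, t2, rho) * cond_pdf(candidate, t2, rho)
    return num / den


rng = random.Random(1)
ratios = [
    gibbs_mh_ratio(rng.uniform(-3, 3), rng.uniform(-3, 3), rng.uniform(-3, 3), rho=0.8)
    for _ in range(5)
]
print(ratios)
```

Whatever current value and candidate are plugged in, the ratio equals 1 up to floating-point rounding, which is why a Gibbs update never rejects.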
Given that the Gibbs sampler is a special case of the MH algorithm, there is no inherent reason why we cannot combine both algorithms. In fact, the MH algorithm can be a sub-algorithm inside a Gibbs sampling cycle (Gilks et al. 1996; Muller 1991). It is also fine to have the Gibbs sampler inside the MH algorithm (Gamerman & Lopes 2006; Scott 2007). However, as discussed above, the Gibbs sampler automatically accepts every candidate point, whereas the MH algorithm does not accept all the candidate points coming from the proposal density. To be precise, all Gibbs sampling is MH sampling, but not all MH sampling is Gibbs sampling.
7.2.4
Block Updating
So far, the MH algorithm and the Gibbs sampler we have introduced update only a single scalar parameter $\theta$ at a time. For a high dimensional posterior distribution, where we have a large number of parameters $\theta_1, \ldots, \theta_p$, updating only a single parameter $\theta_i$ at a time, where $i \in \{1, \ldots, p\}$, is not only a daunting task, but may also have a very slow convergence rate. Hastings (1970) proposed a method that applies the MH algorithm in turn to subblocks of the vector of parameters $\theta = (\theta_1, \ldots, \theta_p)$. Hence, instead of updating p parameters one at a time, we group these p parameters into b subblocks, where b < p. For example, if b = 2, then we have $\theta = ((\theta_1, \ldots, \theta_k), (\theta_{k+1}, \ldots, \theta_p)) = (\theta_{\text{block}1}, \theta_{\text{block}2})$.
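A minimal sketch of block updating with p = 4 and b = 2 is given below. The details are assumptions made for illustration only: the target is a product of four standard normal densities, each block is updated with a symmetric normal random-walk proposal (so the q terms in the MH ratio cancel), and the ratio is evaluated on the log scale for numerical stability. The function and variable names are hypothetical.

```python
import math
import random


def log_target(theta):
    """Log of an unnormalised density: independent standard normals."""
    return -0.5 * sum(x * x for x in theta)


def blockwise_mh(log_f, start, blocks, n_iter, step=1.0, seed=0):
    """Apply an MH update to each block of parameters in turn."""
    rng = random.Random(seed)
    theta = list(start)
    draws = []
    for _ in range(n_iter):
        for block in blocks:
            # Propose new values for this block only; other blocks stay fixed.
            cand = list(theta)
            for i in block:
                cand[i] = rng.gauss(theta[i], step)
            # Symmetric proposal, so the ratio reduces to f(cand) / f(theta).
            log_r = log_f(cand) - log_f(theta)
            # Accept when u < R, written as log(u) < log(R); 1 - random()
            # lies in (0, 1], so the logarithm is always defined.
            if math.log(1.0 - rng.random()) < log_r:
                theta = cand
        draws.append(tuple(theta))
    return draws


blocks = [(0, 1), (2, 3)]                   # theta_block1, theta_block2
draws = blockwise_mh(log_target, start=[0.0] * 4, blocks=blocks, n_iter=20000)
kept = draws[2000:]                         # discard an arbitrary burn-in
means = [sum(d[i] for d in kept) / len(kept) for i in range(4)]
print([round(m, 2) for m in means])
```

Each sweep here needs two accept/reject decisions instead of four, at the cost of proposing two coordinates jointly; how to choose the blocks in practice is not addressed by this sketch.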