Markov Chain Monte Carlo (MCMC) simulation is a statistical modelling tool that can be used in Bayesian statistics for integrating over high-dimensional probability distributions in order to make inferences about model parameters. MCMC can also be used in other situations as a simulation method. In this thesis, we will use it for Bayesian inference when the posterior distribution cannot be obtained analytically. The method works by approximately drawing dependent samples from the posterior distribution of a parameter, from which inference can be made about the moments of the distribution. MCMC was first formulated by Metropolis et al. (1953) with significant additions and improvements from Hastings (1970) and Geman and Geman (1984).
Given observed data, D, and model parameters, θ, the joint probability distribution is,
\[ P(D, \theta) = P(\theta) P(D \mid \theta), \]
where P(θ) is a prior distribution and P(D|θ) is a likelihood. Therefore, the distribution of θ conditional on the observed data, D, is,
\[ P(\theta \mid D) = \frac{P(\theta) P(D \mid \theta)}{\int P(\theta) P(D \mid \theta) \, d\theta} = \pi(\theta \mid D), \]
where π(θ|D) is the posterior distribution of θ that we are interested in. In the rest of this thesis, prior distributions will be denoted with P, likelihoods with L and posterior distributions with π.
Let θ be a vector of k random variables. Monte Carlo integration then evaluates E(θ) by drawing samples {θt, t = 1, ..., n} from π(θ|D) and approximating,
\[ E(\theta) \approx \frac{1}{n} \sum_{t=1}^{n} \theta_t. \]
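As a minimal illustration of the Monte Carlo estimate above, the sketch below assumes a case where the posterior can be sampled directly. The Normal(2, 1) distribution, the scalar θ and the function name `monte_carlo_mean` are assumptions chosen purely for illustration and are not taken from the thesis.

```python
import random


def monte_carlo_mean(draw, n):
    """Approximate E(theta) by averaging n independent draws."""
    total = 0.0
    for _ in range(n):
        total += draw()
    return total / n


# Hypothetical posterior for illustration only: Normal(mean=2, sd=1).
rng = random.Random(0)
estimate = monte_carlo_mean(lambda: rng.gauss(2.0, 1.0), n=100_000)
print(estimate)  # should be close to the true mean of 2
```

When direct sampling like this is not available, the draws have to come from some other process, which is the role MCMC plays.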
Sampling from π(θ|D) could be done by any process that samples the
distribution in the correct proportions. MCMC works by constructing {θt}
from a Markov chain that has π(θ|D) as its stationary distribution. A Markov chain is a sequence of random variables, such that at time t, the
value of θt+1 depends only on θt and is conditionally independent of the earlier samples. The value of θt+1 is sampled from the distribution, P(θt+1|θt), which is independent of t. The Markov chain will eventually become effectively independent of its starting state, θ0, and t, so the distribution of θt is invariant or stationary. The first m values of θt, which might be dependent on θ0, are discarded as burn-in, and the remaining n − m samples, taken to be from the stationary distribution, are used to give the estimator,
\[ E(\theta) \approx \frac{1}{n - m} \sum_{t=m+1}^{n} \theta_t. \]
Constructing a Markov chain with π(θ|D) as the stationary distribution can be done via the Metropolis-Hastings algorithm, Algorithm 1.7.1. See Chib and Greenberg (1995) for an overview of this technique. At each time,
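t, a candidate point, θ∗, is sampled from a proposal distribution, q(θ∗|θt), and is accepted as θt+1 with probability,
\[ \alpha(\theta_t, \theta^*) = \min\left(1, \frac{\pi(\theta^*) \, q(\theta_t \mid \theta^*)}{\pi(\theta_t) \, q(\theta^* \mid \theta_t)}\right), \tag{1.3} \]
otherwise θt+1 = θt.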
Algorithm 1.7.1 The Metropolis-Hastings algorithm
1. Initialise θ0 and let t = 0.
2. Sample θ∗ from q(θ∗|θt).
3. Sample U from the uniform U(0, 1) distribution.
4. If U ≤ α(θt, θ∗) let θt+1 = θ∗, otherwise let θt+1 = θt.
5. Increment t.
6. Repeat steps 2 to 5 n times.
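The steps of Algorithm 1.7.1 can be sketched in Python as follows. This is an illustrative sketch rather than the implementation used in the thesis: the function name `metropolis_hastings`, its arguments, the Normal(3, 1) target and the proposal scale `step` are all assumptions. The acceptance ratio is computed on the log scale, a common choice that avoids numerical underflow.

```python
import math
import random


def metropolis_hastings(log_target, propose, log_q, theta0, n, rng):
    """Run n iterations of Algorithm 1.7.1; return [theta_0, ..., theta_n].

    log_target(theta)   -- log pi(theta), up to an additive constant
    propose(theta, rng) -- draws theta_star from q(. | theta)
    log_q(a, b)         -- log q(a | b), up to an additive constant
    """
    theta = theta0
    chain = [theta]
    for _ in range(n):
        theta_star = propose(theta, rng)                        # step 2
        # Log of the ratio inside Equation (1.3).
        log_ratio = (log_target(theta_star) + log_q(theta, theta_star)
                     - log_target(theta) - log_q(theta_star, theta))
        alpha = math.exp(min(0.0, log_ratio))
        u = rng.random()                                        # step 3
        # Step 4. u lies in [0, 1), so a strict inequality is used here;
        # it differs from U <= alpha only on an event of probability zero.
        if u < alpha:
            theta = theta_star
        chain.append(theta)                                     # step 5
    return chain


# Hypothetical target for illustration only: Normal(mean=3, sd=1).
rng = random.Random(1)
step = 1.0  # assumed proposal standard deviation
chain = metropolis_hastings(
    log_target=lambda th: -0.5 * (th - 3.0) ** 2,
    propose=lambda th, r: r.gauss(th, step),
    log_q=lambda a, b: -0.5 * ((a - b) / step) ** 2,
    theta0=0.0,
    n=50_000,
    rng=rng,
)

m = 1_000  # burn-in length, chosen arbitrarily for this sketch
posterior_mean = sum(chain[m + 1:]) / (len(chain) - 1 - m)
```

The final two lines apply the burn-in estimator, averaging θt over t = m + 1, ..., n. Only the ratio of target densities enters the acceptance step, so `log_target` does not need the normalising constant of the posterior.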
Gilks et al. (1996) provide a complete proof, beyond the scope of this introduction, that the Markov chain resulting from the Metropolis-Hastings algorithm converges to, and continues to sample from, the specified stationary distribution. In brief, the argument requires us to show that the chain is irreducible, aperiodic and reversible. A chain is irreducible if, given enough iterations, it can reach all interesting parts of its state-space irrespective of its starting point. The aperiodicity requirement prevents the chain from oscillating between a fixed number of states in a regular periodic manner. If a chain satisfies these two conditions and has a stationary distribution, then that distribution is unique. The third condition, reversibility, is defined with respect to the distribution, π, and requires the balance equation,
\[ \pi(\theta_t) P(\theta_{t+1} \mid \theta_t) = \pi(\theta_{t+1}) P(\theta_t \mid \theta_{t+1}) \]
to be satisfied for all t. If the chain is reversible, as well as irreducible and aperiodic, then the chain’s unique stationary distribution is π. The Metropolis-Hastings algorithm can be shown to satisfy these conditions due to the acceptance/rejection step.
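The reversibility claim can be checked directly from Equation (1.3). The following is a sketch of the standard argument, not a reproduction of the proof in Gilks et al. (1996). It assumes θt+1 ≠ θt, so that the transition density is the proposal density multiplied by the acceptance probability; when θt+1 = θt the balance equation holds trivially.

```latex
\begin{align*}
\pi(\theta_t) P(\theta_{t+1} \mid \theta_t)
  &= \pi(\theta_t) \, q(\theta_{t+1} \mid \theta_t) \, \alpha(\theta_t, \theta_{t+1}) \\
  &= \pi(\theta_t) \, q(\theta_{t+1} \mid \theta_t)
     \min\left(1, \frac{\pi(\theta_{t+1}) \, q(\theta_t \mid \theta_{t+1})}
                       {\pi(\theta_t) \, q(\theta_{t+1} \mid \theta_t)}\right) \\
  &= \min\bigl(\pi(\theta_t) \, q(\theta_{t+1} \mid \theta_t),\;
               \pi(\theta_{t+1}) \, q(\theta_t \mid \theta_{t+1})\bigr).
\end{align*}
```

The final expression is unchanged when θt and θt+1 are swapped, so it also equals π(θt+1)P(θt|θt+1), which is the balance equation.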
The proposal distribution, q, can have almost any form, but must be chosen carefully for the chain to move around the support of π efficiently. If the distance between the proposed value and the current value is typically too large then the majority of proposals will be rejected. If the distance is too small then it will require more samples to move about the entire support of π. Either case will result in slow mixing.
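The effect of the proposal scale can be seen in a small experiment. The sketch below assumes a standard normal target and a normal random-walk proposal. The three step sizes and the function name `random_walk_summary` are illustrative choices, not values from the thesis. For each step size it records the acceptance rate and the mean squared jump between successive states, which is used here as a rough indicator of how quickly the chain moves.

```python
import math
import random


def random_walk_summary(step, n=20_000, seed=0):
    """Random-walk Metropolis on a Normal(0, 1) target.

    Returns (acceptance rate, mean squared jump between successive states).
    """
    rng = random.Random(seed)
    theta = 0.0
    accepted = 0
    squared_jumps = 0.0
    for _ in range(n):
        theta_star = rng.gauss(theta, step)
        # Symmetric proposal, so only the target ratio is needed.
        log_ratio = 0.5 * (theta ** 2 - theta_star ** 2)
        if rng.random() < math.exp(min(0.0, log_ratio)):
            squared_jumps += (theta_star - theta) ** 2
            theta = theta_star
            accepted += 1
    return accepted / n, squared_jumps / n


results = {step: random_walk_summary(step) for step in (0.01, 2.4, 100.0)}
for step, (rate, jump) in results.items():
    print(f"step={step}: acceptance={rate:.3f}, mean squared jump={jump:.4f}")
```

In this setting the very small step is accepted almost always but barely moves, the very large step is almost always rejected, and the intermediate step gives the largest mean squared jump. A high acceptance rate on its own is therefore not evidence of good mixing.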
If the proposal distribution is symmetric, q(θ∗|θt) = q(θt|θ∗) for all t, then Equation (1.3) simplifies. A special case is the random walk, where q(θ∗|θt) = q(|θ∗ − θt|). A common form of q(θ∗|θt), with q symmetric, is a multivariate normal distribution with mean, θt, and constant covariance matrix, Σ.
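Written out, the simplification for a symmetric proposal follows by cancelling the proposal terms in Equation (1.3), which leaves only the ratio of target densities:

```latex
\[
\alpha(\theta_t, \theta^*)
  = \min\left(1, \frac{\pi(\theta^*) \, q(\theta_t \mid \theta^*)}
                      {\pi(\theta_t) \, q(\theta^* \mid \theta_t)}\right)
  = \min\left(1, \frac{\pi(\theta^*)}{\pi(\theta_t)}\right).
\]
```

A proposal that increases the target density is therefore always accepted, and one that decreases it is accepted with probability equal to the density ratio.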
Sometimes it is more convenient to update each element of θ individually. Let θ = (θ1, ..., θk) be the vector of current parameter values and θ−i = (θ1, ..., θi−1, θi+1, ..., θk). In each iteration of the Metropolis-Hastings algorithm the elements of θ are updated in turn. For the ith parameter, we propose a new value, θ∗i, which we sample from the proposal distribution, qi(θ∗i|θi, θ−i). Then, θi is updated to θ∗i with probability,
\[ \alpha(\theta_i, \theta_i^*; \theta_{-i}) = \min\left(1, \frac{\pi(\theta_i^* \mid \theta_{-i}) \, q_i(\theta_i \mid \theta_i^*, \theta_{-i})}{\pi(\theta_i \mid \theta_{-i}) \, q_i(\theta_i^* \mid \theta_i, \theta_{-i})}\right). \]
All of the MCMC algorithms in this thesis will update parameters in turn within one iteration, although sometimes individual parameters are collated
in matrices and the matrices are updated in turn.
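A minimal sketch of updating parameters in turn is given below. It assumes a symmetric normal random-walk proposal for each element, so the proposal terms cancel from the acceptance ratio. It also uses the fact that, for fixed θ−i, the full conditional π(θi|θ−i) is proportional to the joint density, so the ratio can be computed from the joint. The bivariate normal target with correlation 0.5, the step sizes and the function names are assumptions made for illustration only.

```python
import math
import random


def componentwise_metropolis(log_target, theta0, steps, n, rng):
    """Update each element of theta in turn with a random-walk proposal."""
    theta = list(theta0)
    chain = [tuple(theta)]
    for _ in range(n):
        for i in range(len(theta)):
            proposal = list(theta)
            proposal[i] = rng.gauss(theta[i], steps[i])
            # Ratio of full conditionals equals the ratio of joint densities
            # because theta_-i is the same in numerator and denominator.
            log_ratio = log_target(proposal) - log_target(theta)
            if rng.random() < math.exp(min(0.0, log_ratio)):
                theta = proposal
        chain.append(tuple(theta))
    return chain


RHO = 0.5  # assumed correlation of the illustrative target


def log_bivariate_normal(theta):
    """Log density, up to a constant, of a standard bivariate normal."""
    x, y = theta
    return -(x * x - 2.0 * RHO * x * y + y * y) / (2.0 * (1.0 - RHO ** 2))


rng = random.Random(2)
chain = componentwise_metropolis(
    log_bivariate_normal, (0.0, 0.0), (1.5, 1.5), 40_000, rng
)
kept = chain[1_000:]  # discard an arbitrary burn-in
mean_x = sum(x for x, _ in kept) / len(kept)
mean_xy = sum(x * y for x, y in kept) / len(kept)
```

For this target the sample mean of θ1 should be close to 0 and the sample mean of the product θ1θ2 close to the assumed correlation.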
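If the full conditional distribution of a parameter given the rest, π(θi|θ−i), is known then choosing this distribution as the ith proposal distribution,
\[ \pi(\theta_i^* \mid \theta_{-i}) = q_i(\theta_i^* \mid \theta_i, \theta_{-i}), \qquad \pi(\theta_i \mid \theta_{-i}) = q_i(\theta_i \mid \theta_i^*, \theta_{-i}), \]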
results in the proposal being accepted with probability 1. This special case of the Metropolis-Hastings sampling method is known as the Gibbs sampler. If sampling from the full conditional distribution is possible, then using the Gibbs sampler is often computationally quicker and usually mixes more efficiently than alternative Metropolis-Hastings steps.
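A Gibbs sampler is easiest to illustrate when the full conditionals have a standard form. The sketch below assumes a standard bivariate normal target with correlation ρ, for which each full conditional is a normal distribution with mean ρ times the other element and variance 1 − ρ². The target and the function name `gibbs_bivariate_normal` are illustrative assumptions, not a model from the thesis.

```python
import math
import random


def gibbs_bivariate_normal(rho, n, rng, start=(0.0, 0.0)):
    """Gibbs sampler for a standard bivariate normal with correlation rho.

    Each element is drawn from its full conditional, so every draw is
    accepted and no acceptance step is needed.
    """
    conditional_sd = math.sqrt(1.0 - rho ** 2)
    x, y = start
    chain = [(x, y)]
    for _ in range(n):
        x = rng.gauss(rho * y, conditional_sd)  # draw from pi(x | y)
        y = rng.gauss(rho * x, conditional_sd)  # draw from pi(y | x)
        chain.append((x, y))
    return chain


rng = random.Random(3)
chain = gibbs_bivariate_normal(rho=0.5, n=50_000, rng=rng)
kept = chain[1_000:]  # discard an arbitrary burn-in
mean_xy = sum(x * y for x, y in kept) / len(kept)
```

Because there is no rejection step, no proposal scale has to be tuned. The sample mean of the product of the two elements should be close to the assumed correlation of 0.5.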
In practical applications, consideration must be given to the choice of prior distribution, based on a priori knowledge. If little is known, then using a non-informative prior, such as a uniform distribution over the parameter space, will place more emphasis on the data in the likelihood. Conversely, if a parameter is known to be in a narrow interval, then an informative prior, such as a Gaussian distribution with a small variance parameter, will be more appropriate. Care should also be taken that the support of the prior distribution does not extend outside the parameter space. For example, a variance parameter must be positive.