Bayesian Inference - MCMC for Bayesian Inference

1.5 MCMC for Bayesian Inference

1.5.1 Bayesian Inference

The fundamentals of Bayesian theory are reviewed in this section in an introductory manner. For a more detailed treatment see Bernardo and Smith (1994).

Bayes’ Theorem

In classical inference the data, which are assumed to depend on a vector of parameters θ, are regarded as random, with θ fixed but unknown. In Bayesian inference the roles are reversed: the data are regarded as fixed (at their observed values) and the parameter vector θ is treated as a random quantity with its own probability distribution.


In the Bayesian approach, in addition to specifying the model for the observed data y = (y_1, . . . , y_n) given the vector of unknown parameters θ, in the form of the likelihood function π(y|θ), we also define the prior distribution π(θ). The prior should contain all the knowledge we have about the unknown parameter before the analysis starts. Inference concerning θ is then based on its posterior distribution, given by

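$$\pi(\theta \mid y) = \frac{\pi(y \mid \theta)\,\pi(\theta)}{\int \pi(y \mid \theta)\,\pi(\theta)\,d\theta} \propto \pi(y \mid \theta)\,\pi(\theta). \tag{1.13}$$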

This formula is referred to as Bayes’ Theorem. The integral in the denominator is a normalising constant which ensures that the posterior is a valid probability distribution, and its calculation has traditionally been a computational obstacle. The main difficulty is that the calculation involves a high-dimensional integration and the resulting distribution cannot always be written down in closed form. However, MCMC methods make it possible to avoid this calculation. Equation (1.13) can be summarised as “the posterior is proportional to the likelihood times the prior”.
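As an illustration of how the normalising constant can be avoided, the following is a minimal sketch of a random-walk Metropolis sampler, one common MCMC method. It only ever evaluates the unnormalised posterior π(y|θ)π(θ). The model (normal likelihood with known variance, normal prior), the data values, and all function and parameter names are assumptions made for this example and do not come from the text.

```python
import math
import random


def log_unnormalised_posterior(theta, y, prior_mean=0.0, prior_var=100.0, noise_var=1.0):
    """Return log pi(y|theta) + log pi(theta), up to an additive constant."""
    log_lik = -sum((yi - theta) ** 2 for yi in y) / (2.0 * noise_var)
    log_prior = -((theta - prior_mean) ** 2) / (2.0 * prior_var)
    return log_lik + log_prior


def metropolis(y, n_iter=20000, step=0.5, start=0.0, seed=1):
    """Random-walk Metropolis: propose theta + N(0, step^2), accept or reject."""
    rng = random.Random(seed)
    theta = start
    log_p = log_unnormalised_posterior(theta, y)
    draws = []
    for _ in range(n_iter):
        proposal = theta + rng.gauss(0.0, step)
        log_p_new = log_unnormalised_posterior(proposal, y)
        # The acceptance probability depends only on a ratio of posterior
        # values, so the normalising constant cancels and is never computed.
        accept_prob = math.exp(min(0.0, log_p_new - log_p))
        if rng.random() < accept_prob:
            theta, log_p = proposal, log_p_new
        draws.append(theta)
    return draws


y = [1.2, 0.8, 1.9, 1.4, 1.1]  # hypothetical observations
draws = metropolis(y)
burned = draws[2000:]  # discard an initial burn-in period
posterior_mean = sum(burned) / len(burned)
print(f"estimated posterior mean: {posterior_mean:.3f}")
```

For this particular model the posterior is available in closed form, so the sampled mean can be checked against the exact value (6.4 / 5.01 for the data above); in realistic models no such check is available, which is where MCMC becomes useful.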

Prior Distributions

Presented here are the two most popular approaches for choosing a prior distribution.

Informative priors

An informative prior for a parameter θ is used when some information about the parameter is available before any data are obtained. For example, suppose we are interested in estimating the average weight of newborn female babies. Before collecting any observations, we find on a website that the average weight of a newborn is 3.4 kg. A prior is then chosen to incorporate this information: we choose a normal prior with mean 3.4 and variance σ^2, where σ^2 is chosen to reflect the strength of our belief in the mean value of 3.4. The lower the value of σ^2, the stronger our belief in the mean. If we believe female babies weigh less than male babies, we can also include this belief by reducing the prior mean.
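The effect of the prior variance can be sketched numerically. The example below assumes, purely for illustration, that the observed weights are normally distributed with a known sampling variance, in which case the posterior for the mean is again normal. The data values, the sampling variance of 0.25 and the function name are hypothetical.

```python
def normal_posterior(prior_mean, prior_var, data, data_var):
    """Posterior mean and variance for a normal mean with known data variance."""
    n = len(data)
    precision = 1.0 / prior_var + n / data_var  # precisions add
    post_var = 1.0 / precision
    post_mean = post_var * (prior_mean / prior_var + sum(data) / data_var)
    return post_mean, post_var


weights = [3.1, 3.6, 2.9, 3.3]  # hypothetical birth weights in kg
data_var = 0.25                 # assumed known sampling variance

strong = normal_posterior(3.4, 0.01, weights, data_var)  # small prior variance
weak = normal_posterior(3.4, 1.0, weights, data_var)     # large prior variance

print(f"strong prior: mean {strong[0]:.3f}, variance {strong[1]:.4f}")
print(f"weak prior:   mean {weak[0]:.3f}, variance {weak[1]:.4f}")
```

With the small prior variance the posterior mean stays close to 3.4, whereas with the large prior variance it moves towards the sample mean of the data.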

Non-Informative or Diffuse Priors

In many situations no prior information concerning θ is available, or inference based solely on the data is desirable. Typically in this case we wish to define a prior distribution π(θ) that contains no information whatsoever about the parameter θ in the sense that it does not favour one particular value of θ over another. Such a distribution is called a noninformative prior for θ and it can be argued that the information about θ contained in the posterior comes only from the data. In classical inference prior distributions are not used in fitting models and so ‘noninformative’ priors are often used in Bayesian inference to compare with classical results.

In the case where the parameter space is Θ = {θ_1, . . . , θ_n}, i.e. discrete and finite, the distribution

$$\pi(\theta_i) = \frac{1}{n}, \quad i = 1, \dots, n$$

places the same prior probability of 1/n on any of the n candidate θ values. Similarly, in the case of a bounded continuous parameter space, say Θ = [a, b], −∞ < a < b < ∞, the uniform distribution

$$\pi(\theta) = \frac{1}{b - a}, \quad a < \theta < b$$

is noninformative. A normal distribution with large variance may also be used as a noninformative prior. As the variance of a normal distribution is increased, the distribution becomes ‘flatter’ around the mean (see figure 1.5). This explains the alternative names for noninformative priors of ‘flat’ or ‘diffuse’ priors.


Figure 1.5: Normal distributions with mean 0 and variances 1, 5, 10, 20 and 100 respectively.
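The flattening can also be checked numerically. The short sketch below evaluates the normal density with mean 0 at θ = 0 and θ = 4 for the same variances as in the figure; the function name and the choice of evaluation points are assumptions made for this example.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


for var in (1, 5, 10, 20, 100):
    at_zero = normal_pdf(0.0, 0.0, var)
    # Ratio of the density at theta = 4 to the density at the mean;
    # a value near 1 means the prior barely distinguishes the two points.
    ratio = normal_pdf(4.0, 0.0, var) / at_zero
    print(f"variance {var:>3}: density at 0 = {at_zero:.4f}, ratio f(4)/f(0) = {ratio:.4f}")
```

The ratio equals exp(−8/variance), so it approaches 1 as the variance grows: over the plotted range a normal prior with variance 100 is close to constant.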

For unbounded intervals the definition of a noninformative distribution is not straightforward. In the case that Θ = (−∞, ∞), a distribution such as π(θ) = c is clearly improper since ∫ π(θ) dθ = ∞. However, Bayesian inference is still possible in the case where ∫ π(y|θ) dθ = D < ∞. Then

$$\pi(\theta \mid y) = \frac{\pi(y \mid \theta)\, c}{\int \pi(y \mid \theta)\, c \, d\theta} = \frac{\pi(y \mid \theta)}{D}.$$
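As a worked illustration of a proper posterior arising from an improper flat prior, consider the following example. The model is an assumption made here and is not taken from the text: y_1, . . . , y_n are independent N(θ, τ^2) observations with τ^2 known, and ȳ denotes the sample mean.

```latex
% Assumed model (illustrative only): y_1, ..., y_n i.i.d. N(\theta, \tau^2), \tau^2 known.
\begin{align*}
\pi(y \mid \theta)
  &= (2\pi\tau^2)^{-n/2}
     \exp\!\left\{-\frac{1}{2\tau^2}\sum_{i=1}^{n}(y_i-\theta)^2\right\} \\
  &= (2\pi\tau^2)^{-n/2}
     \exp\!\left\{-\frac{1}{2\tau^2}\sum_{i=1}^{n}(y_i-\bar{y})^2\right\}
     \exp\!\left\{-\frac{n}{2\tau^2}(\theta-\bar{y})^2\right\}, \\
D &= \int_{-\infty}^{\infty}\pi(y \mid \theta)\,d\theta
   = (2\pi\tau^2)^{-n/2}
     \exp\!\left\{-\frac{1}{2\tau^2}\sum_{i=1}^{n}(y_i-\bar{y})^2\right\}
     \sqrt{\frac{2\pi\tau^2}{n}} < \infty, \\
\pi(\theta \mid y) &= \frac{\pi(y \mid \theta)}{D}
   = \sqrt{\frac{n}{2\pi\tau^2}}
     \exp\!\left\{-\frac{n}{2\tau^2}(\theta-\bar{y})^2\right\}.
\end{align*}
```

Under these assumptions the posterior is the proper distribution N(ȳ, τ^2/n), even though the prior π(θ) = c does not integrate to a finite value.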

It should be noted that there is no ‘universal’ noninformative prior. A prior that is constant in one parameterisation can be informative under a different parameterisation. One method used to overcome this problem is Jeffreys’ prior,

$$\pi(\theta) \propto I(\theta)^{1/2},$$

where

$$I(\theta) = E\left[\left(\frac{\partial}{\partial\theta}\log \pi(y \mid \theta)\right)^{2}\right]$$

is the Fisher information.
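To make the role of the Fisher information concrete, the following worked example takes a prior proportional to the square root of I(θ), the construction usually known as Jeffreys' prior. The Bernoulli model is an assumption chosen for illustration and does not come from the text: y_1, . . . , y_n are independent Bernoulli(θ) observations and s denotes their sum.

```latex
% Assumed model (illustrative only): y_1, ..., y_n i.i.d. Bernoulli(\theta), s = \sum_i y_i.
\begin{align*}
\log \pi(y \mid \theta) &= s\log\theta + (n-s)\log(1-\theta), \\
\frac{\partial}{\partial\theta}\log \pi(y \mid \theta)
  &= \frac{s}{\theta} - \frac{n-s}{1-\theta}
   = \frac{s - n\theta}{\theta(1-\theta)}, \\
I(\theta)
  &= E\left[\left(\frac{s - n\theta}{\theta(1-\theta)}\right)^{2}\right]
   = \frac{\operatorname{Var}(s)}{\theta^{2}(1-\theta)^{2}}
   = \frac{n\theta(1-\theta)}{\theta^{2}(1-\theta)^{2}}
   = \frac{n}{\theta(1-\theta)}, \\
\pi(\theta) &\propto I(\theta)^{1/2} \propto \theta^{-1/2}(1-\theta)^{-1/2}.
\end{align*}
```

Under these assumptions the resulting prior is a Beta(1/2, 1/2) distribution rather than the uniform distribution on (0, 1), which shows that a prior built from the Fisher information need not be flat in the original parameterisation.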

When choosing a prior from a parametric family it can be possible to select a distribution which is conjugate to the likelihood, that is one that leads to a posterior belonging to the same family as the prior. The use of MCMC does not require conjugate priors but they can be computationally convenient.
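A standard example of conjugacy is a beta prior combined with a binomial likelihood, for which the posterior is again a beta distribution. The sketch below shows the resulting update; the function name, the prior parameters and the data are hypothetical and are not taken from the text.

```python
def beta_binomial_update(alpha, beta, successes, trials):
    """Update a Beta(alpha, beta) prior with binomial data.

    With a binomial likelihood the posterior is Beta(alpha + successes,
    beta + failures), so no integration or sampling is needed.
    """
    failures = trials - successes
    return alpha + successes, beta + failures


# Beta(1, 1) is the uniform prior on (0, 1); the data are hypothetical.
a, b = beta_binomial_update(1.0, 1.0, successes=7, trials=10)
posterior_mean = a / (a + b)
print(f"posterior: Beta({a}, {b}), mean {posterior_mean:.3f}")
```

Because the posterior stays in the same family, the update reduces to adding counts to the prior parameters, which is the computational convenience referred to above.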

In document Statistical analysis of proteomic mass spectrometry data (Page 39-43)