2.2 Learning and Bayesian Inference
2.2.3 Bayesian Inference
In order to retain a truly probabilistic framework we must also place distributions over the random variables before the model sees any evidence in the form of data. Such distributions are called prior distributions and express our a priori beliefs about the phenomenon we are trying to infer (as prior beliefs on the model parameters imply prior beliefs for the phenomenon). To update these prior beliefs to a posteriori beliefs, having seen the evidence, we need Bayes rule, which is the foundation of Bayesian inference:
Bayes Rule:
$$P(A|B) = \frac{P(B|A)\,P(A)}{P(B)} \tag{2.14}$$
where
• $P(A)$ - The prior belief for A independent of B.
• $P(B)$ - The prior belief for B independent of A. Also known as the marginal likelihood, as it is equivalent to integrating A out of the joint likelihood, which is the numerator.
• $P(B|A)$ - The conditional probability of B given A, which corresponds⁶ to the likelihood of A for known B.
• $P(A|B)$ - The posterior belief for A after observing B.

⁵ The variance of the estimate is available but has no contribution in the final prediction.
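As a small numerical illustration of Equation 2.14, the sketch below evaluates Bayes rule for a binary event A. The function name and the probabilities are hypothetical values chosen only for illustration; they do not come from the text.

```python
def posterior(prior_a, lik_b_given_a, lik_b_given_not_a):
    """Return P(A|B) from Bayes rule for a binary event A."""
    # Marginal likelihood P(B): sum the joint P(B, A) over both states of A.
    evidence = lik_b_given_a * prior_a + lik_b_given_not_a * (1.0 - prior_a)
    # Posterior = likelihood * prior / marginal likelihood.
    return lik_b_given_a * prior_a / evidence


# Hypothetical numbers: P(A) = 0.01, P(B|A) = 0.9, P(B|not A) = 0.05.
p_a_given_b = posterior(0.01, 0.9, 0.05)
print(round(p_a_given_b, 4))  # 0.1538
```

Even with a likelihood of 0.9, the posterior stays modest here because the prior P(A) is small, which is exactly the role the prior plays in the update.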
Returning to the linear regression framework, we place a zero-mean Gaussian prior distribution over the parameters, or regression coefficients, $w$:
$$p(w|\alpha) = \prod_{j=1}^{D} \left(\frac{\alpha}{2\pi}\right)^{1/2} \exp\left\{-\frac{\alpha}{2} w_j^2\right\} \tag{2.15}$$

where $\alpha$ is a common scale, or inverse variance, across dimensions. The prior distribution expresses our prior belief that the evidence is generated from a relatively smooth phenomenon, and hence smaller weights are preferred a priori. Following Bayes rule and recalling the likelihood function in Equation 2.10, we can now update our beliefs for the parameters $w$ to the posterior distribution (Tipping 2004):

$$p(w|y, X, \alpha, \sigma^2) = \frac{p(y|X, w, \sigma^2)\, p(w|\alpha)}{p(y|X, \alpha, \sigma^2)} = \mathcal{N}(\mu, \Sigma) \tag{2.16}$$

where

$$\mu = \left(X^T X + \sigma^2 \alpha I\right)^{-1} X^T y \tag{2.17}$$

$$\Sigma = \sigma^2 \left(X^T X + \sigma^2 \alpha I\right)^{-1} \tag{2.18}$$
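The posterior mean and covariance of Equations 2.17 and 2.18 can be computed directly. The following is a minimal sketch, assuming an N×D design matrix X with one observation per row; the function name `gaussian_posterior`, the synthetic data and the hyperparameter values are illustrative assumptions, not part of the original text.

```python
import numpy as np


def gaussian_posterior(X, y, alpha, sigma2):
    """Posterior N(mu, Sigma) over w under a zero-mean Gaussian prior with precision alpha."""
    D = X.shape[1]
    A = X.T @ X + sigma2 * alpha * np.eye(D)  # X^T X + sigma^2 alpha I
    mu = np.linalg.solve(A, X.T @ y)          # Equation 2.17
    Sigma = sigma2 * np.linalg.inv(A)         # Equation 2.18
    return mu, Sigma


# Synthetic data (assumed purely for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
w_true = np.array([1.0, -2.0, 0.5])
sigma2 = 0.1
y = X @ w_true + rng.normal(scale=np.sqrt(sigma2), size=50)

mu, Sigma = gaussian_posterior(X, y, alpha=1.0, sigma2=sigma2)
print(mu)     # posterior mean of the regression coefficients
print(Sigma)  # posterior covariance
```

Solving the linear system for the mean, rather than multiplying by an explicit inverse, is a common numerical choice; the inverse is formed only because the covariance itself is required.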
We therefore have a closed-form solution for the posterior over the parameters, thanks to the accommodating nature of linear regression, where both the likelihood and the prior are Gaussian and give rise to a Gaussian posterior. Unfortunately this will not always be the case, and we will then have to resort to either sampling techniques or deterministic approximations, which are described in later sections.
⁶ When $P(B|A)$ is treated as a function of B given A it corresponds to a probability (distribution/density) function, but when it is treated as a function of A given B it is a likelihood function.

It is worth noting that the prior placed on the regression coefficients has an analogous function to the regularisation component within the SLT framework. It places a bias towards smooth estimating functions and hence helps to prevent the model from over-fitting the data. We can further see the analogy between the approaches by maximising over the posterior and examining the mode of that distribution:
$$\hat{w}_{\mathrm{MAP}} = \mu = \hat{w}_{\mathrm{PLS}} \tag{2.19}$$
assuming $\lambda = \sigma^2\alpha$. Thus the maximum a posteriori (MAP) solution is equivalent to the PLS estimate, and the parameter product $\sigma^2\alpha$ has a similar function to $\lambda$ of penalising complex functions and avoiding over-fitting.
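The equivalence in Equation 2.19 can be checked numerically. The sketch below assumes that the PLS objective has the usual ridge form, the squared error plus λ times the squared norm of w (this form is an assumption here, as the PLS definition is given earlier in the chapter), and solves it by an independent route: ordinary least squares on an augmented data set. All data and hyperparameter values are synthetic.

```python
import numpy as np

rng = np.random.default_rng(1)
N, D = 40, 4
X = rng.normal(size=(N, D))   # synthetic design matrix
y = rng.normal(size=N)        # synthetic targets
alpha, sigma2 = 2.0, 0.5      # illustrative prior precision and noise variance
lam = sigma2 * alpha          # lambda = sigma^2 * alpha

# MAP solution: the posterior mean of Equation 2.17.
w_map = np.linalg.solve(X.T @ X + sigma2 * alpha * np.eye(D), X.T @ y)

# Penalised least squares (assumed ridge form), solved as ordinary least
# squares on data augmented with sqrt(lambda) * I rows and zero targets.
X_aug = np.vstack([X, np.sqrt(lam) * np.eye(D)])
y_aug = np.concatenate([y, np.zeros(D)])
w_pls, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)

print(np.allclose(w_map, w_pls))
```

The augmented rows contribute exactly λ times the squared norm of w to the residual sum of squares, so the two estimates agree up to numerical precision.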
This analogy holds only when we restrict our probabilistic model to point estimates such as the ML or MAP solutions. In reality we have a posterior distribution over the regression coefficients, and we can make full use of it through the Bayesian tool of marginalisation:
$$p(y_*|y, X, \alpha, \sigma^2) = \int p(y_*|w, \sigma^2)\, p(w|y, X, \alpha, \sigma^2)\, dw \tag{2.20}$$
where we see that our final predictive function is an average over the whole posterior of the regression coefficients. Where the integration cannot be performed in closed form, a Monte Carlo estimate can be employed. The above marginalisation provides another Bayesian benefit: it explicitly takes into account the uncertainty over the parameters, as expressed by the posterior distribution (whether it is concentrated or diffuse).
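To illustrate the Monte Carlo estimate of Equation 2.20, the sketch below averages the likelihood of a candidate response over samples drawn from the posterior and compares the result with the closed-form Gaussian predictive density, which has mean $x_*^T\mu$ and variance $\sigma^2 + x_*^T \Sigma x_*$ in this linear-Gaussian setting. The test input `x_star`, the evaluation point `y_star`, the sample size and the synthetic data are all assumptions made for illustration, and the likelihood is assumed to be Gaussian with mean $x_*^T w$ and variance $\sigma^2$.

```python
import numpy as np


def gauss_pdf(v, mean, var):
    """Density of a univariate Gaussian N(mean, var) evaluated at v."""
    return np.exp(-0.5 * (v - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)


rng = np.random.default_rng(2)
N, D = 30, 2
alpha, sigma2 = 1.0, 0.25     # illustrative prior precision and noise variance
X = rng.normal(size=(N, D))   # synthetic design matrix
y = X @ np.array([0.8, -1.2]) + rng.normal(scale=np.sqrt(sigma2), size=N)

# Posterior over w from Equations 2.17 and 2.18.
A = X.T @ X + sigma2 * alpha * np.eye(D)
mu = np.linalg.solve(A, X.T @ y)
Sigma = sigma2 * np.linalg.inv(A)

x_star = np.array([1.0, 0.5])  # hypothetical test input
y_star = 0.0                   # point at which the predictive density is evaluated

# Monte Carlo estimate of Equation 2.20: average the likelihood over posterior samples.
S = 100_000
w_samples = rng.multivariate_normal(mu, Sigma, size=S)
p_mc = gauss_pdf(y_star, w_samples @ x_star, sigma2).mean()

# Closed-form predictive density, available because likelihood and posterior are Gaussian.
pred_mean = x_star @ mu
pred_var = sigma2 + x_star @ Sigma @ x_star
p_exact = gauss_pdf(y_star, pred_mean, pred_var)

print(p_mc, p_exact)
```

The predictive variance exceeds the noise variance by the term $x_*^T \Sigma x_*$, which is how the parameter uncertainty enters the prediction.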
Finally, it is worth noting that we can place further prior distributions on the scales and the variance, propagating uncertainty into higher levels of the model and becoming “truly” Bayesian by marginalising over all model parameters. In some of these cases, however, we lose the benefit of a closed-form posterior distribution, as the joint posterior over all parameters can become intractable. At this point, sampling or deterministic approximations become necessary for Bayesian inference, and we will review such strategies later in this chapter.
The Bayesian framework will be adopted for the remainder of this thesis on the basis of its advantages, most of which we have already seen. In summary, these are:
• Prior beliefs - Explicitly incorporate prior knowledge regarding the problem under consideration via the prior distributions placed on the model parameters. Bayesian inference lies within the so-called subjective⁷ probability theory field (Good 1983) and accommodates prior knowledge as well as non-informative “objective” prior beliefs through appropriate distributions.
• Probabilistic Responses - Instead of a single point response, the Bayesian framework offers a distribution over responses. This gives a direct measure of the confidence in the model’s responses, which is crucial for decision making in critical applications such as health informatics or security.
• Marginalisation - Model parameters can be marginalised (integrated) out, effectively averaging over all their possible values. Very useful and informative quantities, such as the marginal likelihood, which we shall see and employ later on, are based on marginalisation.
• Uncertainty - Posterior distributions directly express the uncertainty over model parameters, which is taken directly into account via the process of marginalisation. Uncertainty can be encoded and propagated into higher levels of the model hierarchy through the use of priors and hyper-priors (prior distributions over the parameters of lower-level prior distributions).
• Formality - Bayesian inference is firmly based on probability theory and the corresponding axioms of plausible reasoning (Jaynes 2003), providing a systematic and formal way of dealing with uncertainty.

⁷ There is a great history and interesting controversy in statistics between “Bayesians” and