2.4 Statistical Inference
2.4.3 Occam’s Razor and Bayesian Model Selection
The principle of simplicity dates back to the days of Aristotle, who wrote
Nature operates in the shortest way possible.
Several scientists, philosophers and priests who succeeded Aristotle offered different points of view, until the 14th century, when William of Ockham, an English Franciscan friar, stated
Pluralitas non est ponenda sine necessitate.
This statement can be translated as ‘Plurality should not be posited without necessity’. Ockham was an important figure in the medieval era, and because he used this principle to cut out or shave away the arguments of others, it became known as Ockham’s razor or Occam’s razor. The principle remains valid today and has been used by many scientists. Hawking (1995), one of the most brilliant theoretical physicists of our time, stated
We could still imagine that there is a set of laws that determines events completely for some supernatural being, who could observe the present state of the universe without disturbing it. However, such models of the universe are not of much interest to us mortals. It seems better to employ the principle known as Occam’s razor and cut out all the features of the theory that cannot be observed.
We stated in Section 2.4.1 that a major limitation of the maximum likelihood approach to determining model parameters is the problem of overfitting. Bayesian inference is an alternative approach that avoids this problem; it consists of computing the posterior distribution over the model parameters, which takes the form
$$ p(\theta_i \mid X, M_i) = \frac{p(X \mid \theta_i, M_i)\, p(\theta_i \mid M_i)}{p(X \mid M_i)}. \tag{2.55} $$
Once determined, this distribution enables us to update our prior beliefs about the parameter values after observing the data. The model evidence, or marginal likelihood of the data, is the denominator of Equation (2.55); it is obtained by integrating over all possible parameter settings, as defined in Equation (2.51). We restate it here for convenience:

$$ p(X \mid M_i) = \int p(X \mid \theta_i, M_i)\, p(\theta_i \mid M_i)\, d\theta_i. $$
Finding the marginal likelihood is an important task: on the one hand it enables us to compute the posterior distribution, and on the other hand it is necessary for Bayesian model comparison, that is, for finding the model that best describes the data. In this setting we do not fit the parameters to the data; instead we integrate over the model parameters, which avoids overfitting. This approach does not prevent us from considering models with an infinitely large number of parameters, because the complexity penalty grows as the model complexity increases, as we shall see shortly. Hence Occam’s razor becomes crucial in striking the trade-off between data fit and model complexity when finding the best model.
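As a concrete illustration of integrating over the parameters, the following minimal sketch computes the marginal likelihood and the posterior for a coin-tossing model. The Bernoulli likelihood, the Beta prior and all function names are assumptions chosen for illustration; they do not come from the text. The numerical integral is checked against the closed form that happens to exist for this conjugate pair.

```python
import math


def log_beta(a, b):
    """Log of the Beta function B(a, b), computed via log-gamma."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def prior_density(theta, a, b):
    """Beta(a, b) prior density p(theta | M) for 0 < theta < 1."""
    log_p = (a - 1) * math.log(theta) + (b - 1) * math.log(1 - theta)
    return math.exp(log_p - log_beta(a, b))


def likelihood(theta, k, n):
    """p(X | theta, M) for one particular sequence of n tosses with k heads."""
    return theta**k * (1 - theta) ** (n - k)


def evidence_numeric(k, n, a=1.0, b=1.0, grid=2000):
    """Marginal likelihood by the midpoint rule: sum of likelihood * prior * width."""
    h = 1.0 / grid
    total = 0.0
    for i in range(grid):
        theta = (i + 0.5) * h
        total += likelihood(theta, k, n) * prior_density(theta, a, b) * h
    return total


def evidence_exact(k, n, a=1.0, b=1.0):
    """Closed form for this conjugate pair: B(a + k, b + n - k) / B(a, b)."""
    return math.exp(log_beta(a + k, b + n - k) - log_beta(a, b))


def posterior_density(theta, k, n, a=1.0, b=1.0):
    """Bayes' rule: likelihood * prior / evidence."""
    return likelihood(theta, k, n) * prior_density(theta, a, b) / evidence_exact(k, n, a, b)


if __name__ == "__main__":
    k, n = 7, 10  # hypothetical data: 7 heads in 10 tosses
    print("evidence (numeric):", evidence_numeric(k, n))
    print("evidence (exact):  ", evidence_exact(k, n))
    print("posterior density at theta = 0.7:", posterior_density(0.7, k, n))
```

No parameter value is ever selected here: the evidence is an average of the likelihood over the prior, which is why it can serve as a score for the model as a whole.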
Figure 2.7 illustrates the Occam’s razor principle. The horizontal axis represents the space of possible data sets to be modelled, so that each point on this axis corresponds to a particular data set; the vertical axis represents the marginal likelihood $p(X \mid M_i)$, which is a normalised distribution over data sets and therefore integrates to one. A data set can be simulated from a model by drawing parameter values from their prior distribution $p(\theta_i \mid M_i)$ and then generating data given those values, so the marginal likelihood is the probability of the data averaged over the prior. Accordingly, if a model has low variability, the generated data sets all follow almost the same pattern (a simple representation). On the other hand, if a model has high variability, the generated data sets are very different from one another (a complex representation). Because each distribution integrates to one, a model that spreads its probability mass over a wide range of data sets must assign a lower value to each of them. For example, Figure 2.7 shows three models $M_1$, $M_2$ and $M_3$ of increasing complexity: the first model, $p(X \mid M_1)$, is a very simple representation (it generates data sets of limited variability); the third model, $p(X \mid M_3)$, is a very complex representation (it generates a wide range of data sets); and the second model, $p(X \mid M_2)$, has an intermediate, reasonable level of complexity.
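The trade-off described above can be sketched numerically. In the example below, which is an assumption made for illustration and not taken from MacKay (2003), a "data set" is the number of heads $k$ in 20 coin tosses, and three hypothetical models of increasing flexibility are compared: a fair coin with no free parameter, a coin whose bias has a Beta(5, 5) prior, and a coin whose bias has a uniform Beta(1, 1) prior.

```python
import math


def log_beta(a, b):
    """Log of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def evidence_fixed(k, n, theta=0.5):
    """p(k | M) for a model with no free parameter: the coin bias is fixed."""
    return math.comb(n, k) * theta**k * (1 - theta) ** (n - k)


def evidence_beta(k, n, a, b):
    """p(k | M) when the bias has a Beta(a, b) prior (beta-binomial distribution)."""
    return math.comb(n, k) * math.exp(log_beta(a + k, b + n - k) - log_beta(a, b))


N_TOSSES = 20

# Three hypothetical models, ordered from least to most flexible.
MODELS = {
    "M1": lambda k: evidence_fixed(k, N_TOSSES),           # fair coin
    "M2": lambda k: evidence_beta(k, N_TOSSES, 5.0, 5.0),  # bias probably near 0.5
    "M3": lambda k: evidence_beta(k, N_TOSSES, 1.0, 1.0),  # any bias equally plausible
}


def best_model(k):
    """Return the name of the model with the highest evidence for k heads."""
    return max(MODELS, key=lambda name: MODELS[name](k))


def model_posteriors(k):
    """Posterior model probabilities, assuming equal prior model probabilities."""
    evidences = {name: f(k) for name, f in MODELS.items()}
    total = sum(evidences.values())
    return {name: e / total for name, e in evidences.items()}


if __name__ == "__main__":
    for name, f in MODELS.items():
        mass = sum(f(k) for k in range(N_TOSSES + 1))
        print(name, "evidence summed over all data sets:", round(mass, 6))
    for k in (10, 14, 20):
        print("k =", k, "-> highest evidence:", best_model(k))
```

Each model's evidence sums to one over the 21 possible data sets, so the flexible model pays for its coverage with a lower value everywhere. With these priors the fair coin has the highest evidence for 10 heads, the intermediate model for 14 heads, and the most flexible model for 20 heads. The normalised evidences returned by `model_posteriors` are the weights that a weighted average over the models would use if all models were equally probable a priori.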
Figure 2.7: Pictorial representation of Occam’s razor, adapted from MacKay (2003). Horizontal axis: possible data sets.

In general, one may select the model that provides the highest marginal likelihood value (known as model selection), or estimate some quantity under each candidate model and then construct a weighted average over all of them (known as model averaging). When the computation of the marginal likelihood becomes intractable, one may approximate the problem by using the Maximum a Posteriori (MAP) estimate, as defined in Equation (2.49), given by
$$ \theta_{\mathrm{MAP}} = \arg\max_{\theta}\; p(X \mid \theta, M)\, p(\theta \mid M), $$
which is equivalent to working with a simplified, unnormalised form of the posterior, such that

$$ \text{Posterior} \propto \text{Likelihood} \times \text{Prior}. $$
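A minimal sketch of MAP estimation follows, again assuming a coin-tossing model with a Beta prior (an illustrative choice, with hypothetical function names). Because only the location of the maximum matters, the code works with the unnormalised product of likelihood and prior and never evaluates the evidence.

```python
import math


def log_unnormalised_posterior(theta, k, n, a, b):
    """log(likelihood * prior) up to an additive constant.

    Assumes k heads in n tosses and a Beta(a, b) prior on the bias theta.
    """
    return (k + a - 1) * math.log(theta) + (n - k + b - 1) * math.log(1 - theta)


def map_grid(k, n, a, b, grid=10000):
    """MAP estimate by exhaustive search over interior grid points of (0, 1)."""
    candidates = [i / grid for i in range(1, grid)]
    return max(candidates, key=lambda t: log_unnormalised_posterior(t, k, n, a, b))


def map_closed_form(k, n, a, b):
    """Mode of the Beta(a + k, b + n - k) posterior.

    Valid when both posterior parameters exceed 1, so that the mode is interior.
    """
    return (k + a - 1) / (n + a + b - 2)


if __name__ == "__main__":
    k, n, a, b = 7, 10, 2.0, 2.0  # hypothetical data and prior
    print("ML estimate: ", k / n)
    print("MAP (grid):  ", map_grid(k, n, a, b))
    print("MAP (closed):", map_closed_form(k, n, a, b))
```

For these values the MAP estimate lies between the prior mean of 0.5 and the maximum likelihood estimate of 0.7, which shows how the prior pulls the estimate away from the pure data fit.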
Here the posterior distribution is maximised at the point $\theta_{\mathrm{MAP}}$, which is known as the MAP solution of the model. The Bayesian Information Criterion (BIC) (Schwarz, 1978) can be obtained from the Laplace approximation applied to the evidence; taking logs, we obtain
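$$ \log p(X) \approx \log p(X \mid \theta_{\mathrm{MAP}}) + \log p(\theta_{\mathrm{MAP}}) + \frac{D}{2}\log(2\pi) - \frac{1}{2}\log\det(A), \tag{2.56} $$

where $D$ is the dimension of the parameter space and $A$ is the matrix of second derivatives (the Hessian) of the negative log posterior evaluated at $\theta_{\mathrm{MAP}}$, which will be developed later. By assuming that $\det(A) \propto n^{N}$, where $n$ is the size of the data set and $N$ is the number of model parameters (so that $D = N$), and dropping the terms that do not grow with $n$, we obtain the BIC expression, which can be written in the form

$$ \log p(X) \approx \log p(X \mid \theta_{\mathrm{MAP}}) - \frac{N}{2}\log n. \tag{2.57} $$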
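The quality of these approximations can be checked on a model whose evidence is known exactly. The sketch below assumes a coin-tossing likelihood with a Beta prior, which is an illustrative choice and not a model from the text, and compares the exact log evidence with the Laplace approximation and with the cruder BIC-style score $\log p(X \mid \theta_{\mathrm{MAP}}) - \frac{N}{2}\log n$ for a single parameter ($N = 1$).

```python
import math


def log_beta(a, b):
    """Log of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def log_likelihood(theta, k, n):
    """log p(X | theta) for one sequence of n tosses with k heads."""
    return k * math.log(theta) + (n - k) * math.log(1 - theta)


def log_prior(theta, a, b):
    """log of the Beta(a, b) prior density."""
    return (a - 1) * math.log(theta) + (b - 1) * math.log(1 - theta) - log_beta(a, b)


def theta_map(k, n, a, b):
    """Posterior mode; assumes k + a > 1 and n - k + b > 1."""
    return (k + a - 1) / (n + a + b - 2)


def log_evidence_exact(k, n, a, b):
    """Exact log evidence for this conjugate pair."""
    return log_beta(a + k, b + n - k) - log_beta(a, b)


def log_evidence_laplace(k, n, a, b):
    """Laplace approximation with a single parameter (D = 1)."""
    t = theta_map(k, n, a, b)
    # Second derivative of the negative log posterior at the mode (scalar Hessian).
    hessian = (k + a - 1) / t**2 + (n - k + b - 1) / (1 - t) ** 2
    return (log_likelihood(t, k, n) + log_prior(t, a, b)
            + 0.5 * math.log(2 * math.pi) - 0.5 * math.log(hessian))


def log_evidence_bic(k, n, a, b, n_params=1):
    """BIC-style score: log-likelihood at the mode minus (N / 2) log n."""
    t = theta_map(k, n, a, b)
    return log_likelihood(t, k, n) - 0.5 * n_params * math.log(n)


if __name__ == "__main__":
    k, n, a, b = 70, 100, 2.0, 2.0  # hypothetical data and prior
    print("exact:  ", log_evidence_exact(k, n, a, b))
    print("Laplace:", log_evidence_laplace(k, n, a, b))
    print("BIC:    ", log_evidence_bic(k, n, a, b))
```

In this setting the Laplace value should track the exact log evidence closely, while the BIC-style score differs by a term that does not grow with $n$, since the prior and the constant factors of the Hessian have been dropped.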
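A similar criterion is the Akaike Information Criterion (AIC), defined as

$$ \mathrm{AIC} = -2\log p(X \mid \theta_{\mathrm{ML}}) + 2N, \tag{2.58} $$

where $\theta_{\mathrm{ML}}$ is the maximum likelihood estimate and $N$ is again the number of model parameters. Unlike the log evidence approximated by the BIC, the AIC is defined on the $-2\log$-likelihood scale, so lower values are better. Both criteria penalise the model when the number of parameters increases unnecessarily. A limitation of the AIC and BIC scores is that they do not account for parameter correlations, and hence cannot be used with regularised models (Rattray, 2008).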
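The two penalties can be compared with a short sketch. It assumes two hypothetical coin models, a fair coin with no free parameter and a coin with one free bias parameter, and puts both criteria on the same $-2\log$-likelihood scale, using the common form $-2\log p(X \mid \theta_{\mathrm{ML}}) + N\log n$ for the BIC so that lower is better in both cases. The function names are illustrative only.

```python
import math


def bernoulli_log_likelihood(theta, k, n):
    """log p(X | theta) for one sequence of n tosses with k heads."""
    return k * math.log(theta) + (n - k) * math.log(1 - theta)


def aic(log_lik_ml, n_params):
    """Akaike Information Criterion (lower is better)."""
    return -2.0 * log_lik_ml + 2.0 * n_params


def bic(log_lik_ml, n_params, n_data):
    """BIC on the -2 log-likelihood scale (lower is better)."""
    return -2.0 * log_lik_ml + n_params * math.log(n_data)


def score_models(k, n):
    """Return {model name: (AIC, BIC)}; requires 0 < k < n."""
    ll_fair = bernoulli_log_likelihood(0.5, k, n)    # no parameter is fitted
    ll_free = bernoulli_log_likelihood(k / n, k, n)  # bias set to its ML value
    return {
        "fair": (aic(ll_fair, 0), bic(ll_fair, 0, n)),
        "free": (aic(ll_free, 1), bic(ll_free, 1, n)),
    }


def preferred(k, n, criterion="aic"):
    """Name of the model with the lowest score under the chosen criterion."""
    index = 0 if criterion == "aic" else 1
    scores = score_models(k, n)
    return min(scores, key=lambda name: scores[name][index])


if __name__ == "__main__":
    for k in (52, 60, 70):
        print("k =", k, "AIC prefers", preferred(k, 100, "aic"),
              "| BIC prefers", preferred(k, 100, "bic"))
```

With 100 tosses, both criteria keep the fair coin for 52 heads and both prefer the free-bias model for 70 heads. At 60 heads they disagree: the AIC accepts the extra parameter while the BIC, whose penalty $\log n$ exceeds 2 here, does not.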