The marginal likelihood and model selection

1.2 Statistical concepts and methods

1.2.3 The marginal likelihood and model selection

From the Bayesian point of view, model comparison captures uncertainty in the choice of model. Suppose we want to compare a set of $L$ models $M_i$, $i = 1, \dots, L$, each of which defines a probability distribution over the observations $D$. Assume also that the data were generated by one of these models, but that we do not know which one. Our uncertainty can be expressed through a prior distribution over the models, $P(M_i)$. Given a set of data $D$, the posterior distribution over models can therefore be written as

$$P(M_i \mid D) \propto P(M_i)\, P(D \mid M_i).$$
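As a minimal sketch of this proportionality, the following Python snippet normalises the product of prior and marginal likelihood over a set of models. The log evidence values are hypothetical numbers chosen purely for illustration, and the function name `posterior_model_probs` is not from the source. The computation is done in log space with a log-sum-exp shift, since marginal likelihoods are usually far too small to exponentiate directly.

```python
import math

def posterior_model_probs(log_evidences, priors=None):
    """Normalise P(M_i) * P(D | M_i) over models, working in log space."""
    n_models = len(log_evidences)
    if priors is None:
        # Equal prior probability for every model.
        priors = [1.0 / n_models] * n_models
    log_joint = [math.log(p) + le for p, le in zip(priors, log_evidences)]
    shift = max(log_joint)  # log-sum-exp shift to avoid underflow
    weights = [math.exp(v - shift) for v in log_joint]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical values of log P(D | M_i) for three candidate models.
log_evidences = [-1050.2, -1047.9, -1049.1]
probs = posterior_model_probs(log_evidences)
print(probs)
```

With equal priors the ranking of the posterior probabilities is the ranking of the evidences, so the second model receives the largest posterior probability here.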

For simplicity we assume that all models are equally probable a priori. The term of interest here is the model evidence, also known as the marginal likelihood, $P(D \mid M_i)$, which expresses the preference shown by the data for the different models. In other words, the marginal likelihood can be seen as a likelihood function over the space of models in which the parameters have been marginalized out. Jeffreys [1961], Kass and Raftery [1995], and Berger and Pericchi [2001] proposed the Bayes factor for comparing models $M_1$ and $M_2$:

$$B_{12} = \frac{P(M_1 \mid D)}{P(M_2 \mid D)} \bigg/ \frac{P(M_1)}{P(M_2)}.$$
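Because the posterior odds divided by the prior odds equal the ratio of marginal likelihoods, the Bayes factor can be computed as $P(D \mid M_1) / P(D \mid M_2)$ whenever both evidences are available. The sketch below uses a toy example that is not from the source: a sequence of coin flips, where $M_1$ fixes the head probability at 0.5 and $M_2$ places a uniform prior on it. Both evidences are then available in closed form; all function names are illustrative.

```python
import math

def log_evidence_fair(n_heads, n_flips):
    # M1: head probability fixed at 0.5, so there are no free parameters.
    return n_flips * math.log(0.5)

def log_evidence_uniform(n_heads, n_flips):
    # M2: theta ~ Uniform(0, 1). Integrating theta^k (1 - theta)^(n - k)
    # over theta gives the Beta function B(k + 1, n - k + 1).
    k, n = n_heads, n_flips
    return math.lgamma(k + 1) + math.lgamma(n - k + 1) - math.lgamma(n + 2)

def bayes_factor_12(n_heads, n_flips):
    """B_12 = P(D | M1) / P(D | M2) for one observed flip sequence."""
    log_b12 = (log_evidence_fair(n_heads, n_flips)
               - log_evidence_uniform(n_heads, n_flips))
    return math.exp(log_b12)

print(bayes_factor_12(5, 10))   # balanced data: about 2.71, favours M1
print(bayes_factor_12(10, 10))  # all heads: about 0.011, favours M2
```

Values of the Bayes factor above one favour $M_1$ and values below one favour $M_2$.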

There are other standard frameworks for model selection which we can implement, for instance Schwarz's criterion, also called the Bayesian Information Criterion (BIC) [Schwarz, 1978]. The BIC provides a first-order approximation of the Bayes factor and requires the maximum likelihood estimates (MLE) of the parameters of all models. For comparing $M_1$ and $M_2$ the criterion is

$$S = -2 \log \lambda_n - (p_2 - p_1) \log n,$$

where $\lambda_n$ is the likelihood ratio of $M_1$ and $M_2$ evaluated at the MLE, $p_1$ and $p_2$ are the dimensions of the parameter spaces associated with $M_1$ and $M_2$, and $n$ is the sample size.
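The criterion can be sketched for a simple pair of nested models. The example below is an assumption made for illustration and is not taken from the source: $M_1$ is a Gaussian with mean fixed at zero and free variance ($p_1 = 1$), and $M_2$ is a Gaussian with free mean and variance ($p_2 = 2$). The data values are hypothetical. With $\lambda_n$ taken as the maximised likelihood of $M_1$ over that of $M_2$, $S$ equals the difference of the usual BIC values, $-2 \log \hat{L}_i + p_i \log n$, so negative $S$ favours $M_1$ and positive $S$ favours $M_2$.

```python
import math

def max_log_lik_gaussian(y, mean):
    """Gaussian log-likelihood at the variance MLE for a given mean."""
    n = len(y)
    var_hat = sum((v - mean) ** 2 for v in y) / n  # MLE of the variance
    return -0.5 * n * (math.log(2.0 * math.pi * var_hat) + 1.0)

def schwarz_criterion(y):
    """S = -2 log(lambda_n) - (p2 - p1) log(n) for the two toy models."""
    n = len(y)
    ll1 = max_log_lik_gaussian(y, 0.0)         # M1: mean fixed at 0, p1 = 1
    ll2 = max_log_lik_gaussian(y, sum(y) / n)  # M2: mean free, p2 = 2
    log_lambda = ll1 - ll2                     # log likelihood ratio at the MLE
    p1, p2 = 1, 2
    return -2.0 * log_lambda - (p2 - p1) * math.log(n)

# Hypothetical data: one set centred on zero, one shifted by 3.
y_centred = [-1.2, 0.4, 0.9, -0.3, 0.1, -0.6, 1.1, -0.4]
y_shifted = [v + 3.0 for v in y_centred]

print(schwarz_criterion(y_centred))  # negative: the simpler M1 is preferred
print(schwarz_criterion(y_shifted))  # positive: the free mean of M2 is needed
```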

Based on the deviance, Spiegelhalter et al. [2002] developed an alternative to the BIC called the Deviance Information Criterion (DIC). The DIC is particularly favoured for Bayesian model selection and comparison. The deviance is defined as $D(\theta) = -2 \log P(Y \mid \theta) + C$, where $Y$ is the data, $\theta$ is the unknown parameter and $P(Y \mid \theta)$ is the likelihood function. The constant $C$ cancels out when different models are compared. Based on the deviance of the model, the DIC can be calculated as

$$\mathrm{DIC} = D[E(\theta)] + 2 p_D,$$

where $E(\theta)$ is the posterior expectation of $\theta$ and $p_D$ is the effective number of parameters,

$$p_D = E[D(\theta)] - D[E(\theta)],$$

where $E[D(\theta)]$ is the posterior mean of the deviance, which measures how well the model fits the data. The DIC is then used to evaluate the model: the smaller the DIC value, the better the model is considered to be. This criterion is more satisfactory than the BIC for two reasons. First, it takes the prior information into account and gives a natural penalty on the log-likelihood. Second, it can easily be calculated from MCMC samples.
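The following sketch shows how these quantities might be computed from posterior samples. It uses the definition of Spiegelhalter et al. [2002], in which the DIC is the posterior mean deviance plus $p_D$, equivalently $D[E(\theta)] + 2 p_D$. Everything else is an assumption made for illustration: the model is $y_i \sim N(\theta, 1)$ with a flat prior, so the posterior is $N(\bar{y}, 1/n)$ and is sampled directly as a stand-in for MCMC output; the constant $C$ is set to zero; the data are simulated.

```python
import math
import random

def deviance(theta, y, sigma=1.0):
    """D(theta) = -2 log P(y | theta) for y_i ~ N(theta, sigma^2), with C = 0."""
    n = len(y)
    sq = sum((v - theta) ** 2 for v in y)
    return n * math.log(2.0 * math.pi * sigma ** 2) + sq / sigma ** 2

def dic_from_samples(theta_samples, y):
    """Return (DIC, p_D) estimated from posterior draws of theta."""
    n_draws = len(theta_samples)
    theta_bar = sum(theta_samples) / n_draws                       # E(theta)
    mean_dev = sum(deviance(t, y) for t in theta_samples) / n_draws  # E[D(theta)]
    dev_at_mean = deviance(theta_bar, y)                           # D[E(theta)]
    p_d = mean_dev - dev_at_mean
    return dev_at_mean + 2.0 * p_d, p_d

rng = random.Random(0)
y = [rng.gauss(2.0, 1.0) for _ in range(50)]  # simulated data
y_bar = sum(y) / len(y)
# Exact posterior draws, standing in for the output of an MCMC sampler.
theta_samples = [rng.gauss(y_bar, 1.0 / math.sqrt(len(y))) for _ in range(20000)]

dic, p_d = dic_from_samples(theta_samples, y)
print(dic, p_d)
```

For this one-parameter model the estimated $p_D$ should come out close to one, which is a useful sanity check before applying the same function to real sampler output.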

Finally, we describe the Bayesian evidence (or marginal likelihood) as a yardstick for model selection. The obvious question is why the marginal likelihood should be used for model selection. This can be answered by considering the principle of Ockham's Razor, which states a preference for simple models. Bayes' theorem may be used to rank models by comparing how well they predict the data, and these predictions are quantified by the model evidence. As shown in Figure 1.5, a simple model predicts only a limited range of data sets, whilst a more complex model is free to predict a wider variety. Because the evidence must sum to one over all possible data sets, the complex model spreads its probability more thinly. A simple model that is compatible with the observed data may therefore predict it more strongly than a complex one, and in this way the marginal likelihood embodies Ockham's Razor.

[Figure 1.5 placeholder. Axis labels: "All possible data sets" and P(Y|·). Curves labelled "too simple", "just right" and "too complex".]

Figure 1.5: Model classes may be either too simple or too complex to generate the data set. In such cases computing marginal likelihood gives a probabilistic yardstick for selection of the model class [MacKay, 2003].
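The idea behind the figure can be reproduced numerically with a small toy example, which is an assumption for illustration and not part of the source. For 10 coin flips, the data set is summarised by the number of heads; the simple model fixes the head probability at 0.5, while the flexible model gives it a uniform prior. Each model's evidence sums to one over all possible head counts, but the two models distribute that probability very differently.

```python
import math

def evidence_over_head_counts(n_flips):
    """Evidence for each possible head count under a simple and a flexible model."""
    simple, flexible = [], []
    for k in range(n_flips + 1):
        n_seq = math.comb(n_flips, k)  # number of flip sequences with k heads
        # Simple model: head probability fixed at 0.5.
        simple.append(n_seq * 0.5 ** n_flips)
        # Flexible model: uniform prior on the head probability; the integral
        # over theta is the Beta function B(k + 1, n - k + 1).
        log_beta = (math.lgamma(k + 1) + math.lgamma(n_flips - k + 1)
                    - math.lgamma(n_flips + 2))
        flexible.append(n_seq * math.exp(log_beta))
    return simple, flexible

simple, flexible = evidence_over_head_counts(10)
for k, (s, f) in enumerate(zip(simple, flexible)):
    print(k, round(s, 4), round(f, 4))
```

The simple model concentrates its evidence on head counts near 5, whereas the flexible model assigns the same evidence, 1/11, to every head count. Balanced data are therefore predicted more strongly by the simple model, and extreme data by the flexible one.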

In the practice of Bayesian statistics, MCMC methods are widely used to simulate from the posterior distribution [Gelfand and Smith, 1990]. Once sufficient samples have been drawn from the posterior distribution, problems of estimation and prediction can be handled well with these methods. Calculation of the model evidence, however, has proved extremely challenging. Chib [1995] demonstrates a method for computing marginal likelihoods from Gibbs sampler output. Chib's method gives a straightforward way to compute the marginal likelihood, given parameters drawn from the posterior distribution.
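The identity underlying this approach can be written out as follows. This is a sketch in the notation of this section, with $\theta^*$ denoting any fixed parameter value (in practice a point of high posterior density); the two-block split $\theta = (\theta_1, \theta_2)$ and the number of draws $G$ are assumptions made to keep the illustration short.

```latex
% Rearranging Bayes' theorem at a fixed point \theta^*:
\begin{align}
\log P(D \mid M_i)
  &= \log P(D \mid \theta^*, M_i) + \log P(\theta^* \mid M_i)
     - \log P(\theta^* \mid D, M_i).
\end{align}

% The posterior ordinate is estimated from the Gibbs output, here for two
% blocks with draws \theta_2^{(g)}, g = 1, \dots, G:
\begin{align}
P(\theta^* \mid D, M_i)
  &= P(\theta_1^* \mid D, M_i)\, P(\theta_2^* \mid \theta_1^*, D, M_i), \\
P(\theta_1^* \mid D, M_i)
  &\approx \frac{1}{G} \sum_{g=1}^{G} P(\theta_1^* \mid \theta_2^{(g)}, D, M_i).
\end{align}
```

The likelihood and prior terms are evaluated directly, so the only quantity that has to be estimated from the sampler is the posterior ordinate at $\theta^*$.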

We will later compare marginal likelihood calculations for State Space Models based on the output of the Gibbs sampler [Chib, 1995] with the lower bound calculated by the variational approximation [Beal et al., 2005]. The variational approximation has its roots in the calculus of variations. Recently, variational methods have been used in the context of approximate inference and estimation. Using the variational free energy as a framework for statistical inference, an ensemble of parameter vectors is optimised, rather than a single parameter vector [MacKay, 1995]. This method was utilised by Beal et al. [2005] in the reconstruction of genetic regulatory networks using hidden factors.
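The lower bound mentioned above can be stated in generic form. In this sketch, $\theta$ stands for all unknown quantities of the model (parameters and any hidden variables) and $q(\theta)$ is the approximating distribution; both symbols are notation introduced here for illustration and are not taken from the source.

```latex
\begin{align}
\log P(D \mid M_i)
  &= \mathcal{F}(q)
     + \mathrm{KL}\bigl(q(\theta) \,\|\, P(\theta \mid D, M_i)\bigr)
  \;\ge\; \mathcal{F}(q), \\
\mathcal{F}(q)
  &= \int q(\theta) \log \frac{P(D, \theta \mid M_i)}{q(\theta)} \, d\theta .
\end{align}
```

Since the Kullback-Leibler divergence is non-negative, maximising $\mathcal{F}(q)$ over $q$ tightens the bound on the log marginal likelihood. Note that some authors define the variational free energy with the opposite sign, in which case it is minimised.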

In document Reconstructing regulatory networks from high throughput post genomic data using MCMC methods (Page 38-41)