The criteria for model selection - Variational Bayesian data driven modelling for biomedical sy

As introduced in Section 2.2, a good model selection criterion selects a model that describes the data adequately without overfitting it. The VB method uses the lower bound of the model evidence as in Fig.2.2, i.e. the maximised value of the free energy, to select the best model candidates. Compared with other model selection criteria in the literature such as the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC) [47,78], the free energy criterion assigns a heavy penalisation term due to the integration over the whole parameter space, which gives a bias towards over- simplified models. As shown in the previous section, the calculation of the free energy is complicated and influenced by all of the priors and the posteriors of the parameters and the states. In cases where the free energy is sensitive to the change in priors, trusting the free energy value blindly can be problematic. Therefore, in practice, instead of treating the free energy criterion as a definite rule, other criteria, alongside the free energy, should be considered to gain quantitative information about how well the models fit the data, especially when the ‘goodness-of-fit’ for the measurements between the models is close [76]. When conflicts between the criteria occur, which is not unusual, it depends on the modeller to decide which criteria to use and which results to trust.

The Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC) are two commonly used model selection criteria, and both are simplified versions of the free energy criteria (or K-L divergence). Akaike introduced the K-L divergence (see Section2.3.1) as a fundamental basis for model selection, and started advocating the AIC in the mid-70s. The AIC is based on the idea of minimising the K-L divergence between the true probability distribution p(y) and the approximated probability distribution q(y|M) of the measurementsy as follows (refer to (2.11)):

The model-dependent part of the K-L divergence KL[PkQ] is the second term:

KLr =hlogq(y|M)ip(y) (2.39)

which is known as the relative K-L divergence. Assuming the distribution q(y|M) is parameterised byθ, the relative K-L divergence can be written as follows:

KLr=hlogq(y|θ, M)ip(y) (2.40)

Unlike the free energy criterion, the AIC only considers the most probable value of the

parameters θ. The problem is that the parameters are unknown and the expectation

with respect to p(y) in (2.40) cannot be evaluated since p(y) is unknown. Using the same measurement datayto estimate the parameters and then integrating logq(y|θ, M) with respect to the probability distribution ofy would have the bias of using the same datay twice. To avoid this problem, it is assumed that the most probable parameterθˆ can be obtained by applying the ML estimation from a fictitious data vectorxwith the same length and the same probability distribution asy but independent from y. Then the following equation can be considered as the approximation for logq(y|M):

logq(y|M) =hlogq(y|θˆx, M)ip(x) (2.41)

The fictitious dataxare created to estimate the parameters, and the dependence of the fictitious data x can be eliminated via the expectation operation h·i_p₍_x₎ in (2.41). A second-order Taylor expansion of logp(y|θˆx) aroundθˆy, logp(y|θˆx) can be obtained as

follows:

logq(y|θˆx, M) = logp(y|θˆy, M) +

2(θˆx−θˆy)H( ˆθy)(θˆx−θˆy) (2.42) where H( ˆθy) is the Hessian matrix evaluated atθˆy. θˆx and θˆy are the ML estimates

of the parameters for the data xand y respectively. Using the fact thatxand y have the same distribution and have the same length,KLr – which defines the AIC – can be

obtained by inserting (2.42) into (2.39) as follows (a formal derivation can be found in [79]): AIC =KLr= hlogq(y|θˆx, M)ip(x) p(y) = log(p(y|θˆy, M))−k (2.43)

wherek is the number of the parameters in the model.

A simple example will be presented to demonstrate how the AIC can be calculated. In this example, assuming the measurementsytare conditionally independent of each other

given the model and the parameter vectorθ, the marginal likelihood P(y|θ, M) can be factorised intoT separate terms:

P(y|M,θ) =

t=1

P(yt|M,θ) (2.44)

Assuming that the measurement noise is Gaussian distributed with zero mean and vari- anceσ2

ε, (2.44) can be further decomposed as follows:

P(y|M,θ) = T Y t=1 1 p 2πσ2 ε exp(− 1 2σ2 ε (yt−yˆt)2)) (2.45)

where ˆytis the estimation of the value ofytbased on the estimated parameters ˆθ. Taking

the natural logarithm of both sides of (2.45) yields:

logP(y|M,θ) =−T 2 log(2πσ 2 ε)− 1 2 T X t=1 yt−yˆt σε 2 (2.46)

The logarithm of the marginal likelihood, logP(y|M,θ), often called thelog-likelihood, is used instead of the likelihood itself to simplify the calculation (using summation instead of multiplication). The maximum value ofP(y|M,θ) can be achieved when the parameter mean and variance satisfy the following conditions (the ML estimation):

∂logP(y|M,θ) ∂θ = 0 (2.47) and ∂logP(y|M,θ) ∂σε = 0⇒σ_ε2= 1 T T X t=1 (yt−yˆt)2 (2.48)

Substituting the estimated variance in (2.46), the log-likelihood function can be calculated as follows: logP(y|M,θ) =−T 2 log yt−yˆt σε 2 +C (2.49)

whereCis a constant independent of the model. Substituting the log-likelihood function in (2.43) by (2.49), the AIC can be calculated as follows:

AIC =−T 2 log yt−yˆt σε 2 −k+C (2.50) where yt−yˆt σε 2

is defined as the residual sum of squares(RSS).

The AIC has been reported to perform poorly for small numbers of data points, which motivated the inclusion of a correction term as follows [79]:

AICc = AIC− k(k+ 1)

N −k−1 (2.51)

where N is the number of data points and k is the dimension of the parameter space. The AICc, known as the ‘corrected AIC’, penalises parameters more than AIC does and the two criteria become approximately equal forN > k2. Both the AIC and AICc have a tendency to select an overfitted model from the competing model candidates, which originates from overfitting the fictitious sample xto estimate the parameter vector θˆx.

However, the data generating mechanisms in practice are often more complicated than any proposed model candidates, especially when the data are sparse. In these cases, the AIC or the AICc may perform better compared with the free energy criterion due to the tendency of the AIC or the AICc to select an overfitted model. Both the AIC and AICc have been found useful in many applications reported in the literature [47]. The Bayesian Information Criterion (BIC) is an extension of the AIC. The expectation with respect to the probability distribution ofθˆx,hlogq(y|θˆx, M)i_p₍_θˆ_x₎, rather than the

probability distribution of x, hlogq(y|θˆx, M)ip(x), is used to estimate the log-evidence

logq(y|M). Viewing the estimated parameter vectorp(θˆx) as the prior for the parame-

ter, model evidence q(y|M) can be obtained as follows:

q(y|M) =

q(y|θˆx, M)q(θˆx|M)dθˆx (2.52)

An approximation of the integral can be obtained under two assumptions: 1) the probability distribution of the parameter is flat around the estimated value θˆx; 2) the prob-

ability of the parameter is independent of the length of the data. Expanding q(y|θˆx)

formal derivation can be found in [79]):

KLr≈logq(y|θˆy, M) + logq(θˆy) +

k 2 log 2π

| {z }

does not scale withN −k

2logN (2.53)

As the sample size N tends to infinity, the terms that do not scale with the number of data pointsN can be neglected, yielding the BIC:

BIC = logq(y|θ, Mˆ )− k

2logN (2.54)

The BIC does not need the prior to quantify the model evidence q(y|M) because the prior term does not scale with the number of data points as stated by assumption 2)

in the last paragraph. When the number of data points N is greater than 15, the

complexity term of the BIC, k₂logN, is larger than the complexity term of the AIC,k, and as a result the BIC has less probability of overfitting the data. Since it neglects the terms that do not scale with the number of data points, the complexity term is smaller than the complexity term in the free energy criterion. Therefore, the BIC is less strict towards model complexity compared with the free energy criterion, but stricter compared with the AIC.

Depending on the chosen criterion (AIC or BIC), a higher value of the AIC or the BIC indicates a better model. The difference in either the AIC or the BIC between two models is assessed using the Bayes factor [80]. Advocated by Jeffreys in 1961, the Bayes factor aims at providing a Bayesian equivalent to hypothesis testing in classical statistics. When two models are compared, the probability of one model M1 being true

over the probability of another modelM2 being true can be obtained as follows:

p(M1|y) p(M2|y) = p(M1) p(M2) p(y|M1) p(y|M₂) (2.55)

The prior probabilities of the two models are considered equal, i.e.p(M1) =p(M2), and

therefore the Bayes factor in favour ofM1 overM2 is defined as follows:

B1,2 =

p(y|M1)

p(y|M₂) (2.56)

Kass and Raftery [80] suggested interpreting the Bayes factor in the natural logarithm scale as in Table2.1:

Table 2.1: Bayes factor compared between modelsM1andM2

log(B1,2) B1,2 Evidence againstM2

0∼2 1∼3 Not significant

2∼6 3∼20 Positive

6∼10 20∼150 Strong

>10 >150 Very strong

Bayes’ rule works for the AIC, BIC, and the free energy criterion: the differences in the AIC values, the BIC values and the free energy values are equivalent to a log-Bayes factor log(B1,2) in Table 2.1. If the AIC is chosen to be the model selection criterion, then

AIC(M1)−AIC(M2) > 3 (3 is a conventional choice based on [81] and [80]) indicates

thatM1 outperformsM2. If the BIC is chosen to be the model selection criterion, then

BIC(M1)−BIC(M2)>3 indicates thatM1 outperformsM2. If the free energy criterion

is chosen, then F(M1)− F(M2)>3 indicates thatM1 outperformsM2.

Compared with the free energy criterion, both the AIC and BIC are more widely used due to their simplicity. They do not take the uncertainty about parameters into consid- eration, i.e. the value of the AIC or the BIC is only based on the estimated parameter values from the ML method. The uncertainty of the parameters is underestimated in both methods, and therefore overfitted models are more likely to be chosen using either of these two criteria [80], which has been shown by [82] and [83]. The free energy criterion, on the other hand, provides a model comparison taking the uncertainty of the parameters into account. It is worth noting that a higher free energy value does not guarantee higher model evidence because the gap between the free energy and the model evidence can be different for each model candidate. As shown in Chapter 4 in [41], the gap – the K-L divergence – is found to be positively correlated with the number of parameters, which means that the estimation of the model evidence becomes more pessimistic as the model complexity increases, rendering the VB methods to suffer from a tendency to select a model that underfits the data. Therefore, in later applications in Chapter 3 and Chapter 4, several criteria are considered together to select the best model, with the purpose of minimising the limitations of each criterion.

In document Variational Bayesian data driven modelling for biomedical systems (Page 58-64)