Model evaluation
5.2 Criteria based on scoring functions
In the previous section we saw how the Kullback–Leibler sample discrepancy can be used to derive statistical tests to compare models. Often, however, we will not be able to derive a formal test. Examples can be found even among statistical models for data mining, for example models for data analysis with missing values or mixed graphical models. Furthermore, it may be important to have a complete ordering of models, rather than a partial one based on pairwise comparisons. For this reason, it is important to develop scoring functions that assign a score to each model. The Kullback–Leibler discrepancy estimator is a scoring function that can often be approximated asymptotically for complex models.
A problem with the Kullback–Leibler score is that it depends on the complexity of a model, perhaps described by the number of parameters, hence its use may lead to complex models being chosen. Section 6.1 explained how a model selection strategy should reach a trade-off between model fit and model parsimony. We now look at this issue from a different perspective, based on a trade-off between bias and variance. In Section 4.9 we defined the mean squared error of an estimator. The mean squared error can be used to measure the Euclidean distance between the chosen model $p_{\hat\theta}$ and the underlying model $f$:
$$\mathrm{MSE}(p_{\hat\theta}) = E[(p_{\hat\theta} - f)^2].$$
Note that $p_{\hat\theta}$ is estimated on the basis of the data and is therefore subject to sampling variability. In particular, for $p_{\hat\theta}$ we can define an expected value $E(p_{\hat\theta})$, roughly corresponding to the arithmetic mean over a large number of repeated samples, and a variance $\mathrm{Var}(p_{\hat\theta})$, measuring its variability with respect to this expectation. From the properties of the mean squared error it follows that
$$\mathrm{MSE}(p_{\hat\theta}) = [\mathrm{bias}(p_{\hat\theta})]^2 + \mathrm{Var}(p_{\hat\theta}) = [E(p_{\hat\theta}) - f]^2 + E[(p_{\hat\theta} - E(p_{\hat\theta}))^2].$$
This indicates that the error associated with a model $p_{\hat\theta}$ can be decomposed into two parts: a systematic error (bias), which does not depend on the observed data and reflects the error due to the parametric approximation; and a sampling error (variance), which reflects the error due to the estimation process. A model should therefore be selected to balance the two parts. A very simple model (e.g. a constant model) will have a small variance but a rather large bias; a very complex model will have a small bias but a large variance. This is known as the bias–variance trade-off.
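The decomposition can be checked numerically. The following is a minimal sketch, not taken from the text: it replaces the model $p_{\hat\theta}$ by a scalar estimator of a mean, and the true mean, standard deviation, sample size, number of replications and the three estimators are all assumptions chosen for illustration. For each estimator it computes the MSE, the squared bias and the variance over repeated samples.

```python
import random
import statistics

def decompose(estimates, target):
    """Return (mse, squared_bias, variance) for repeated estimates of `target`."""
    mean_est = statistics.fmean(estimates)
    mse = statistics.fmean([(e - target) ** 2 for e in estimates])
    squared_bias = (mean_est - target) ** 2
    # Population variance: divides by the number of replications.
    variance = statistics.pvariance(estimates, mu=mean_est)
    return mse, squared_bias, variance

# Assumed settings, chosen only for illustration.
TRUE_MEAN, SIGMA, N, REPS = 0.5, 1.0, 10, 5000
rng = random.Random(0)  # fixed seed so the run is reproducible

# Repeated samples of size N from the assumed "true" distribution.
samples = [[rng.gauss(TRUE_MEAN, SIGMA) for _ in range(N)] for _ in range(REPS)]

# From very simple (a constant) to fully data-driven (the sample mean).
estimators = {
    "constant (always 0)": lambda xs: 0.0,
    "shrunk mean (0.9 * mean)": lambda xs: 0.9 * statistics.fmean(xs),
    "sample mean": lambda xs: statistics.fmean(xs),
}

results = {}
for name, estimator in estimators.items():
    estimates = [estimator(xs) for xs in samples]
    results[name] = decompose(estimates, TRUE_MEAN)
    mse, squared_bias, variance = results[name]
    print(f"{name:26s} MSE={mse:.4f}  bias^2={squared_bias:.4f}  Var={variance:.4f}")
```

In each row the MSE should equal the sum of the other two columns up to rounding. The constant estimator has zero variance and all of its error in the bias term, while the sample mean has almost no bias and all of its error in the variance term. Under these assumed settings the slightly shrunk mean trades a small bias for a reduced variance.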
We now define score functions that penalise model complexity. The most important of these functions is the Akaike information criterion (AIC). Akaike (1974) worked under two assumptions: (i) the parametric model is estimated using the method of maximum likelihood and (ii) the parametric family specified contains the unknown distribution $f(x)$ as a particular case. He therefore defined a function that assigns a score to each model by taking a function of the Kullback–Leibler sample discrepancy. In formal terms, the AIC criterion is defined by the following equation:
$$\mathrm{AIC} = -2 \log L(\hat\theta; x_1, \ldots, x_n) + 2q,$$
where $\log L(\hat\theta; x_1, \ldots, x_n)$ is the logarithm of the likelihood function calculated at the maximum likelihood parameter estimate and $q$ is the number of parameters in the model. Notice that the AIC score essentially penalises the log-likelihood score with a term that increases linearly with model complexity.
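As a concrete illustration of the definition, the sketch below computes AIC for two Gaussian models fitted by maximum likelihood to the same data. The observations and the two candidate models are assumptions made up for this example, and the helper names `gaussian_loglik` and `aic` are hypothetical.

```python
import math

def gaussian_loglik(xs, mu, sigma2):
    """Log-likelihood of i.i.d. N(mu, sigma2) observations xs."""
    n = len(xs)
    rss = sum((x - mu) ** 2 for x in xs)
    return -0.5 * n * math.log(2 * math.pi * sigma2) - rss / (2 * sigma2)

def aic(loglik, q):
    """AIC = -2 log L(theta_hat) + 2q, with q the number of parameters."""
    return -2.0 * loglik + 2 * q

# Made-up observations, for illustration only.
xs = [4.1, 5.3, 3.8, 6.0, 5.1, 4.7]
n = len(xs)

# Model A: N(0, sigma2) with the mean fixed at zero, so q = 1.
sigma2_a = sum(x ** 2 for x in xs) / n            # ML estimate of sigma2
aic_zero_mean = aic(gaussian_loglik(xs, 0.0, sigma2_a), q=1)

# Model B: N(mu, sigma2) with both parameters free, so q = 2.
mu_b = sum(xs) / n                                # ML estimate of mu
sigma2_b = sum((x - mu_b) ** 2 for x in xs) / n   # ML estimate of sigma2 (divides by n)
aic_full = aic(gaussian_loglik(xs, mu_b, sigma2_b), q=2)

print(f"AIC, zero-mean model: {aic_zero_mean:.2f}")
print(f"AIC, full model:      {aic_full:.2f}")
```

The model with the lower AIC is preferred. Here the extra parameter costs model B two additional units of penalty, which its much higher log-likelihood more than repays because the data lie far from zero.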
The AIC criterion is based on the implicit assumption that $q$ remains constant when the size of the sample increases. But this assumption is not always valid, so AIC does not lead to a consistent estimate for the dimension of the unknown model. An alternative and consistent scoring function is the Bayesian information criterion (BIC), formulated by Schwarz (1978) and defined by the following expression:
$$\mathrm{BIC} = -2 \log L(\hat\theta; x_1, \ldots, x_n) + q \log(n).$$
It differs from the AIC criterion only in the second term, which now also depends on the sample size $n$. Because $\log(n) > 2$ as soon as $n \geq 8$, BIC penalises complexity more heavily than AIC for all but the smallest samples, and so favours simpler models. As $n$ grows large, the first term (linear in $n$) will dominate the second term (logarithmic in $n$). This corresponds to the fact that, for large $n$, the variance term in the MSE expression becomes negligible. Despite the superficial similarity between AIC and BIC, AIC is usually justified by resorting to classical asymptotic arguments, whereas BIC is usually justified by appealing to the Bayesian framework.
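The difference between the two penalties can be made concrete with a short sketch. The sample size, log-likelihood values and parameter counts below are hypothetical numbers invented for illustration, not results from any fitted model. They are chosen so that the two criteria disagree.

```python
import math

def aic(loglik, q):
    """AIC = -2 log L(theta_hat) + 2q."""
    return -2.0 * loglik + 2 * q

def bic(loglik, q, n):
    """BIC = -2 log L(theta_hat) + q log(n)."""
    return -2.0 * loglik + q * math.log(n)

# Smallest sample size at which the BIC penalty per parameter, log(n), exceeds AIC's 2.
first_n = next(n for n in range(1, 100) if math.log(n) > 2)
print("BIC penalises more heavily than AIC from n =", first_n)

# Hypothetical fitted models: (name, maximised log-likelihood, number of parameters q).
n = 1000
models = [("small", -1510.0, 3), ("large", -1500.0, 10)]

# Lower scores are better, so sorting in ascending order ranks the models.
by_aic = sorted(models, key=lambda m: aic(m[1], m[2]))
by_bic = sorted(models, key=lambda m: bic(m[1], m[2], n))
best_aic, best_bic = by_aic[0][0], by_bic[0][0]

for name, loglik, q in models:
    print(f"{name:6s} AIC={aic(loglik, q):.2f}  BIC={bic(loglik, q, n):.2f}")
print("Preferred by AIC:", best_aic)
print("Preferred by BIC:", best_bic)
```

With these assumed numbers the gain of 10 log-likelihood units is enough to justify seven extra parameters under AIC, but not under BIC, whose penalty per parameter at $n = 1000$ is $\log(1000) \approx 6.9$ rather than 2.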
To conclude, the scoring function criteria we have examined are easy to calculate and lead to a total ordering of the models. Most statistical packages give the AIC and BIC scores for all the models considered. A further advantage of these criteria is that they can be used to compare non-nested models and, more generally, models that do not belong to the same class (e.g. a probabilistic neural network and a linear regression model). The disadvantage of these criteria is the lack of a threshold, as well as the difficulty of interpreting their measurement scale. In other words, it is not easy to determine whether or not the difference between two models is significant, and how it compares with another difference.