Information Criteria - Model Evaluation and Selection

Model Evaluation and Selection

5.2 Information Criteria

+ n log1

n(y − ˆy)^T(y − ˆy)

≈ n log

1+2

n(p+ 1)

+ n log1

n(y − ˆy)^T(y − ˆy) (5.20)

≈ 2(p + 1) + n log1

n(y − ˆy)^T(y − ˆy).

In contrast to RSS= (y − ˆy)^T(y − ˆy), which decreases with increas-ing model complexity, both Mallows’ Cpcriterion and the FPE criterion thus include the number of model parameters and thereby penalize model complexity. It may also be noted that they yield equations equivalent to that of the AIC in (5.43) for the Gaussian linear regression model in Ex-ample 5.4 of Section 5.2.2, which shows that they are closely related to AIC.

5.2 Information Criteria

In its prediction of the value at each data point xi by ˆyi = u(xi; ˆβ) for a model y = u(x; ˆβ) estimated on the basis of observed data, the PSE deﬁned by (5.5) is a measure of the distance from future data zion av-erage, and each criterion described in the previous section may be es-sentially regarded as an estimate of this distance. The basic idea behind information criteria, however, consists of expressing the estimated model in a probability distribution and using the Kullback-Leibler information (Kullback-Leibler, 1951) to measure the distance from the true proba-bility distribution generating the data. By its nature, this measurement is not just an assessment of the goodness of ﬁt of the model but one that is performed from the perspective of prediction, i.e., from the perspective of whether the model can be used to predict future data.

In this section, we discuss the basic concept of information criteria with consideration for evaluation of various statistical models expressed by probability distribution, as well as regression models.

5.2.1 Kullback-Leibler Information

Let us begin by considering the closeness of a probability distribution model estimated by maximum likelihood to the true probability distri-bution generating the data. For this purpose, we assume that the data y₁, · · · , ynare generated in accordance with the density function g(y) or the probability distribution function G(y). LetF = { f (y|θ); θ ∈ Θ ⊂ R^p} be a probability distribution model speciﬁed to approximate the true dis-tribution g(y) that generated the data. If the true probability disdis-tribution g(y) is contained in the speciﬁed probability distribution model, then there exists aθ0∈ Θ such that g(y) = f (y|θ0). One example of this is the case of a normal distribution model N(μ, σ²) where the probability dis-tribution that generates the data is also a normal disdis-tribution N(μ0, σ²₀).

We estimate the probability distribution model f (y|θ) by maximum likelihood and set the estimate ˆθ = (ˆθ1, ˆθ₂, · · · , ˆθp)^T. Then by substituting the estimates for the parameters of the speciﬁed probability distribution model, we approximate the true probability distribution g(y) by f (y|ˆθ).

The model thus expressed by a probability distribution f (y|ˆθ) is known as a statistical model. A simple example of a statistical model is a prob-ability distribution model N(y, s²) in which the sample mean y and sam-ple variance s², as maximum likelihood estimates, are substituted for the normal distribution model N(μ, σ²) ((5.32) in Example 5.1).

Let us ﬁrst consider statistical models within the framework of re-gression modeling. Suppose that the data y₁, y₂,· · ·, ynfor the response variable y are mutually independent and have been obtained in accor-dance with the probability distribution g(y|x), which is a conditional dis-tribution of y given data points x for the predictor variables. To approx-imate this true distribution, we assume a probability distribution model f (y|x; θ) characterized by parameters, and attempt to approximate g(y|x) with a statistical model f (y|x; ˆθ) having maximum likelihood estimators ˆθ substituted for the model parameters θ. Equation (5.40) in Example 5.4 is a statistical model for a Gaussian linear regression model expressed by a probability distribution model. Our purpose here is to evaluate the goodness or badness of the statistical model f (y|x; ˆθ). We describe the basic concept for derivation of an information-theoretic criterion from the standpoint of making a prediction.

We represent as the estimator ˆθ = ˆθ(y) dependent on the data y for a case in which it is necessary to elucidate the diﬀerence from future data.

To evaluate the models from the perspective of prediction, we assess the expected goodness or badness of ﬁt for the estimated model f (z|x; θ) when it is used to predict the independent future data Z = z taken at

random from g(z|x). To measure the closeness of these two distributions, we use the Kullback-Leibler information (K-L information)

I{g(z); f (z|ˆθ)} = EG

where the expectation is taken with respect to the unknown true distribu-tion g(z) by ﬁxing ˆθ=ˆθ(y).

In the discrete model approach, the procedure may be regarded as first measuring the difference between g(zi) and f (zi|ˆθ) at zi by log{g(zi)/ f (zi|ˆθ)}, adding the probability g(zi), which occurs at zi as a weight, and measuring the difference between the distributions on this basis. Accordingly, if the true distribution and the estimated model are distant at those places where the occurrence probability of data is high, the K-L information is then also large. The same applies for continuous models.

The K-L information has the following properties:

(i) I{g(z); f (z|ˆθ)} ≥ 0, (ii) I{g(z); f (z|ˆθ)} = 0 ⇐⇒ g(z) = f (z|ˆθ).

This implies that a smaller K-L information indicates closer proximity of the model f (z|ˆθ) to the probability distribution g(z) generating the data. In comparing multiple models, accordingly, the model yielding the smallest K-L information is the one that is selected. As may be seen when K-L information is rewritten

I{g(z); f (z|ˆθ)} = EG the ﬁrst term EG[log g(Z)] on the far right side is always constant, inde-pendent of the individual model, and the closeness of the model to the true distribution thus increases with

log f (Z|ˆθ)

log f (z|ˆθ)g(z)dz, (5.23)

which is known as the expected log-likelihood of the estimated model f (z|ˆθ). It must be noted, however, that this expected log-likelihood can-not be directly obtained as it is dependent on the unknown probability distribution g(z) that has generated the data. For this reason, the problem of constructing the information criterion is ultimately one of ﬁnding an eﬀective estimator for this expected log-likelihood.

5.2.2 Information Criterion AIC

The expected log-likelihood in (5.23) depends on the unknown probabil-ity distribution g(z) generating the data, which raises the possibilprobabil-ity of estimating this unknown probability distribution from the observed data.

One such estimator is ˆg(z)= 1

n, z= y1, y₂, · · · , yn, (5.24) a discrete probability distribution known as the empirical distribution in which an equal probability of 1/n is given to each of the n observations.

This yields, as an estimator of the expected log-likelihood, EGˆ

log f (Z|ˆθ)

= log f (y1|ˆθ(y))ˆg(y1)+ · · · + log f (yn|ˆθ(y))ˆg(yn)

= 1 n

n i=1

log f (yi|ˆθ(y)). (5.25)

This represents the log-likelihood of the estimated model f (z|ˆθ(y)) or the maximum log-likelihood

(ˆθ) =

n i=1

log f (yi|ˆθ(y)) = log f (y|ˆθ(y)). (5.26) It should be noted that the estimator of the expected log-likelihood E_G[log f (Z|ˆθ)] is here (ˆθ)/n , and the log-likelihood (ˆθ) is the estimator of nEG[log f (Z|ˆθ)].

The log-likelihood given by (5.26) is an estimator of the expected log-likelihood. However, the log-likelihood was obtained by estimating the expected log-likelihood by reusing the observed datay that were ini-tially used to estimate the model in place of the future data. The use of the same data twice for estimating the parameters and for estimating the evaluation measure (the expected log-likelihood) of the goodness of the estimated model gives rise to the bias

log f (y|ˆθ(y)) − nEG(z)

log f (Z|ˆθ(y))

. (5.27)

The above equation represents the bias for speciﬁc datay, but the bias that may be expected when data sets of the same size are repeatedly extracted may be given as

bias(G)= EG(y⁾

log f (Y|ˆθ(Y)) − nEG(z)

log f (Z|ˆθ(Y))

, (5.28) where the expectation is taken with respect to the joint distribution of Y.

Hence by correcting the bias of the log-likelihood, an estimator of the expected log-likelihood is given by

log f (y|ˆθ) − (estimator for bias(G)). (5.29) The general form of the information criterion (IC) can be constructed by evaluating the bias and correcting for the bias of the log-likelihood as follows:

IC= −2(log-likelihood of the estimated model − bias estimator)

= −2 log f (y|ˆθ) + 2(estimator for bias(G)). (5.30) In general, the bias in (5.28) can take various forms depending on the relationship between the true distribution generating the data and the speciﬁed model and on the estimation method employed to construct a statistical model.

The AIC, the information criterion proposed by Akaike (1973, 1974), is a criterion for evaluating models estimated by maximum likelihood, and is given by the following equation:

AIC= −2(maximum log-likelihood) + 2(no. of free parameters)

= −2 log f (y|ˆθ) + 2(number of free parameters). (5.31) This implies that the bias of the log-likelihood of the estimated model may be approximated by the number of free parameters contained in the model. In what is known as AIC minimization, the model that minimizes the AIC is selected as the optimum model from all the models under comparison. A detailed derivation of the information criterion will be given in the next section.

We consider several illustrations of the information criterion AIC, in Examples 5.1 through 5.4, and then sort out the relationship between the maximum log-likelihood and the number of free parameters in a statisti-cal model.

Example 5.1 (AIC in normal distribution model) Assume a normal distribution model N(μ, σ²) as the probability distribution for approxi-mating a histogram constructed from the data y1,y2,· · ·, yn. The

in which the parameters μ and σ²have been replaced by the sample mean y =n

and the model contains two free parameters μ and σ². From (5.31), the AIC is then given as

AIC= n log(2πs²)+ n + 2 × 2. (5.34)

Example 5.2 (AIC in polynomial regression model order selection) For response variable y and predictor variable x, assume a polynomial regression model with Gaussian noise N(0, σ²)

y = β0+ β1x+ β2x²+ · · · + βpx^p+ ε = β^Tx+ ε, (5.35) whereβ = (β0, β₁, · · · , βp)^T and x = (1, x, x², · · · , x^p)^T. The speciﬁed probability distribution model is then a normal distribution N(β^Tx, σ²) with meanβ^Tx and variance σ².

For the n observations {(yi, x_i); i = 1, 2, · · · , n}, the maximum likelihood estimates of the regression coeﬃcient vector β and the er-ror variance are respectively given by ˆβ = (X^TX)⁻¹Xy and ˆσ² =

model (maximum log-likelihood) is

n i=1

log f (yi|xi; ˆθ) = −n

2log(2π ˆσ²)− n 2 ˆσ²

1 n

n i=1

(yi− ˆβ^Tx_i)²

= −n

2log(2π ˆσ²)−n

2. (5.37)

Note that the number of free parameters in the model is (p + 2), which is the sum of the (p+ 1) number of regression coeﬃcients in the p-th order polynomial plus 1 for the error variance. The AIC for the polynomial regression model is therefore given by

AIC= n log(2π ˆσ²)+ n + 2(p + 2). (5.38) The polynomial model in the order that minimizes the value of AIC is selected as the optimum model.

Example 5.3 (Numerical example for polynomial regression model) Figure 5.2 shows the fitting of a linear model (dashed line), a 2nd-order polynomial model (solid line), and an 8th-order polynomial model (dot-ted line) to 20 data observed for predictor variable x and response vari-able y. Tvari-able 5.1 shows the RSS, prediction error estimate by CV, and AIC for the fits of the polynomial models of order 1 through 9. Although the RSS value decreases with further increases in the polynomial order, the values obtained by CV and AIC are both smallest with the 2nd-order polynomial model, and that model is therefore selected. As exemplified in the figure, a model that is too simple fails to reflect the structure of the target phenomenon, and a complex model of excessively high degree overfits the observed data and incorporates more error than necessary, and both would therefore be ineffective for the prediction of future phe-nomenon.

Example 5.4 (AIC for Gaussian linear regression model) Let us model the relationship between the response variable y and the predictor variables x₁, x2,· · ·, xpwith Gaussian linear regression model

y = β^Tx+ ε, ε ∼ N(0, σ²), (5.39) whereβ = (β0, β₁, · · · , βp)^T and x = (1, x1, x₂, · · · , xp)^T. The speciﬁed probability distribution model is then a normal distribution with mean β^Tx and variance σ². The maximum likelihood estimates of the regres-sion coeﬃcient vector and the error variance, as found in Section 2.2.2

OLQHDUPRGHO

WKRUGHUSRO\QRPLDO

QGRUGHUSRO\QRPLDO

[

\

Figure 5.2 Fitting a linear model (dashed line), a 2nd-order polynomial model (solid line), and an 8th-order polynomial model (dotted line) to 20 data.

Table 5.1 Comparison of the values of RSS, CV, and AIC for fitting the polyno-mial models of order 1 through 9.

order RSS CV AIC

1 0.0144 0.0185 -22.009 2 0.0068 0.0092 -35.003 3 0.0065 0.0099 -33.912 4 0.0064 0.0121 -32.111 5 0.0064 0.0177 -30.116 6 0.0062 0.0421 -29.034 7 0.0048 0.0284 -32.163 8 0.0044 0.0280 -31.934 9 0.0043 0.0283 -30.363

(2), are given by ˆβ = (X^TX)⁻¹X^Ty and ˆσ² = (y − X ˆβ)^T(y − X ˆβ)/n, re-spectively. The statistical model is therefore the probability distribution

f (y|x; ˆθ) = 1

√2π ˆσ²exp

− 1

2 ˆσ²(y− ˆβ^Tx)²

, (5.40)

where ˆθ = (ˆβ^T, ˆσ²)^T. This may be regarded as an expression of the stochastic variation in the data and the structure of the phenomenon, in a single probability distribution model.

The maximum log-likelihood, in which the maximum likelihood es-timates are substituted for the log-likelihood function, is then

( ˆβ, ˆσ²)= −n

2log(2π ˆσ²)− 1

2 ˆσ²(y − X ˆβ)^T(y − X ˆβ)

= −n

2log(2π ˆσ²)−n

2. (5.41)

Since the number of free parameters in the Gaussian linear regression model is (p+ 2), (p + 1) for the number of regression coeﬃcients and 1 for the number of error variances, AIC may then be given as

AIC= n log(2π ˆσ²)+ n + 2(p + 2). (5.42) The predictor variable combination that minimizes the value of this AIC is taken as the optimum model.

Maximum log-likelihood and number of free parameters

Both the AIC of polynomial regression model in (5.38) and the AIC of linear regression model in (5.42) can be rewritten as

AIC= −2 log f (y|X; ˆθ) + 2(p + 2)

= n log(2π) + n + n log ˆσ²+ 2(p + 2). (5.43) We consider only terms that affect the model comparison, fixing the term n log(2π)+ n. Then we find that n log ˆσ², which represents the model fit to the data, decreases with increasing model complexity and thus tends to lower AIC, whereas increasing the number of model parameters, and thus (p+ 2), has the opposite effect, clearly showing the role of the num-ber of parameters as a penalty in AIC.

Complex models characterized by a large number of parameters gen-erally tend to yield a relatively good ﬁt to observed data, but models of

excessive complexity do not function effectively in prediction of future phenomena. For optimum model selection based on information con-tained in observed data from the perspective of effective prediction, it is therefore necessary to control both goodness of model fit to the data and model complexity.

5.2.3 Derivation of Information Criteria

An information criterion was essentially constructed as an estimator of the Kullback-Leibler information between the true distribution g(z) gen-erating the datay and the estimated model f (z|ˆθ(y)) for a future obser-vation z that might be obtained on the same random structure. We recall that the information criterion in (5.30) is obtained by correcting the bias

bias(G)= EG(y⁾

log f (Y|ˆθ(Y)) − nEG(z)

log f (Z|ˆθ(Y))

, (5.44) when the expected log-likelihood nEG(z)

log f (Z|ˆθ(Y))

is estimated by the log-likelihood log f (Y|ˆθ(Y)). Here EG(y⁾denotes the expectation with respect to the joint distribution of a random sample Y, and EG(z) repre-sents the expectation with respect to the true distribution G(z). We derive the bias correction term (5.44) for the log-likelihood of a statistical model estimated by the methods of maximum likelihood.

The maximum likelihood estimator ˆθ is given as the solution of the likelihood equation Letθ0be a solution of the equation

EG(z) Then it can be shown that the maximum likelihood estimator ˆθ converges in probability toθ0as n tends to inﬁnity.

We ﬁrst decompose the bias in (5.44) as follows:

EG(y⁾

+EG(y⁾

Note that ˆθ = ˆθ(Y) depends on the random sample Y drawn from the true distribution G(y). In the next step, we calculate separately the three expectations D₁, D₂and D₃.

(1) Expectation D₂: Since D₂does not contain an estimator, it can easily be seen that

This implies that although D2varies randomly depending on the data, its expectation is 0.

(2) Expectation D3: Noting that D3depends on the maximum likelihood estimator ˆθ, we write

η(ˆθ) ≡ EG(z)

log f (Z|ˆθ)

. (5.49)

By expanding η(ˆθ) in a Taylor series around θ0 given as a solution of (5.46), we have (5.50) can be approximated as

η(ˆθ) = η(θ0)−1

2(ˆθ − θ0)^TJ(θ0)(ˆθ − θ0), (5.52)

where J(θ0) is the p× p matrix given by follows from (5.52) that D3can be approximately calculated as

D₃= EG(y⁾

It is known (see, e.g., Konishi and Kitagawa, 2008, p. 47) that for the maximum likelihood estimator ˆθ, the distribution of √

n(ˆθn − θ0) con-verges in law to the p-dimensional normal distribution with mean vector 0 and the variance-covariance matrix J⁻¹(θ0)I(θ0)J⁻¹(θ0) as n tends to

By substituting the asymptotic variance-covariance matrix EG(y⁾

(ˆθ − θ0)(ˆθ − θ0)^T

= 1

nJ(θ0)⁻¹I(θ0)J(θ0)⁻¹ (5.59)

into (5.55), we have Taylor series around the maximum likelihood estimator ˆθ yield

(θ) = (ˆθ) + (θ − ˆθ)^T∂ (ˆθ)

∂θ +1

2(θ − ˆθ)^T∂² (ˆθ)

∂θ∂θ^T(θ − ˆθ) + · · · . (5.61) Here ∂ (ˆθ)/∂θ = 0 for the maximum likelihood estimator given as a solution of the likelihood equation ∂ (θ)/∂θ = 0. By applying the law of large numbers, we see that

−1 as n tends to inﬁnity. Using these results, the equation in (5.61) can be approximated as

(θ0)− (ˆθ) ≈ −n

2(θ0− ˆθ)^TJ(θ0)(θ0− ˆθ). (5.63) Then, taking expectation and using the result in (5.59), we have

D₁= EG(y⁾

Finally, combining the results in (5.48), (5.60), and (5.64), we ob-tain the bias (5.44) for the likelihood in estimating the expected log-likelihood as follows:

where I(θ0) and J(θ0) are respectively given by (5.57) and (5.58).

Information criterion AIC

Let us now assume that the parametric model is{ f (y|θ); θ ∈ Θ ⊂ R^p}, and that the true distribution g(y) is contained in the speciﬁed parametric model, that is, there exists aθ0 ∈ Θ such that g(y) = f (y|θ0). Under the assumption, the equality I(θ0)= J(θ0) holds for the p× p matrices J(θ0) and I(θ0) given in the bias (5.65). Then the bias of the log-likelihood of f (y|ˆθ) in estimating the expected log-likelihood is asymptotically given by

EG(y⁾

log f (Y|ˆθ) − nEG(z)

log f (Z|ˆθ)

= tr

I(θ0)J(θ0)⁻¹

= tr(Ip)= p, (5.66) where Ipis the identity matrix of dimension p. Hence, by correcting the asymptotic bias p of the log-likelihood, we obtain AIC

AIC= −2 log f (y | ˆθ) + 2p. (5.67) We see that AIC is an asymptotic approximate estimator of the Kullback-Leibler information discrepancy between the estimated model and the true distribution generating the data. It was derived under the as-sumptions that model estimation is by maximum likelihood, and this is carried out in a parametric family of distributions including the true dis-tribution. The AIC does not require any analytical derivation of the bias correction terms for individual problems and does not depend on the un-known probability distribution G, which removes fluctuations due to the estimation of the bias. Further, Akaike (1974) states that if the true distri-bution that generated the data exists near the specified parametric model, the bias associated with the log-likelihood of the estimated model can be approximated by the number of free parameters. These attributes make the AIC a highly flexible technique from a practical standpoint.

Although direct application of AIC is limited to the models with pa-rameters estimated by the maximum likelihood methods, Akaike’s basic idea of bias correction for the log-likelihood can be applied to a wider class of models deﬁned by statistical functionals (Konishi and Kitagawa, 1996, 2008). Another direction of the extension is based on the boot-strapping. In Ishiguro et al. (1997), the bias of the log-likelihood was estimated by using the bootstrap methods. The theory of the bootstrap bias correction for the log-likelihood was further developed in Kitagawa and Konishi (2010).

Burnham and Anderson (2002) provided a nice review and explana-tion of the use of AIC in the model selecexplana-tion and evaluaexplana-tion problems (see also Linhart and Zucchini, 1986; Sakamoto et al., 1986; Bozdo-gan, 1987; Kitagawa and Gersch, 1996; Akaike and Kitagawa, 1998;

McQuarrie and Tsai, 1998; Konishi, 1999, 2002; Clarke et al., 2009;

Kitagawa, 2010). Burnham and Anderson (2002) also discussed model-ing philosophy and perspectives on model selection from an information-theoretic point of view, focusing on AIC.

Finite correction of Gaussian linear regression models

If the true probability distribution that generated the data is included in the speciﬁed Gaussian linear regression model, then we obtain the fol-lowing information criterion (Sugiura, 1978)

AICC= n

log(2π ˆσ²)+ 1

+ 2n(p+ 2)

n− p − 3, (5.68) incorporating exact log-likelihood bias correction. That is, with the as-sumption that the true distribution is included in the Gaussian linear re-gression model, the bias of the log-likelihood can be evaluated exactly using the properties of the normal distribution, and it is given by

EG where the expectation is taken over the joint normal distribution of Y.

The exact bias correction term in (5.69) can be expanded in powers of order 1/n as The leading factor on the right-hand side, (p+ 2), is an asymptotic bias, and this example thus shows that AIC corrects the asymptotic bias for the log-likelihood of the estimated model.

Hurvich and Tsai (1989, 1991) have obtained similar results for time series models and discussed the eﬀectiveness of exact bias correction.

For speciﬁc models and methods of estimation, it is thus possible to per-form exact bias correction by utilizing the characteristics of the normal distribution. In a general framework, however, the discussion becomes more diﬃcult.

5.2.4 Multimodel Inference

In contrast to model selection, which involves the selection of a sin-gle model that best approximates the probabilistic data structure from among a set of candidate models, multimodel inference (Burnham and Anderson, 2002) involves inference from a model set using the relative importance or certainty of multiple constructed models as weights. One form of multimodel inference is model averagingɽIt may essentially be regarded as a process of assigning large weights to good models, obtain-ing the mean value of the regression coeﬃcients of the multiple models, and then constructing a single model on this basis. In contrast to the Akaike (1978, 1979) concept, which relates the relative plausibility of models based on the diﬀerence of the AIC value of each model from the

In document [Sadanori Konishi]Introduction to Multivariate Analysis Linear and Nonlinear Modeling(pdf){Zzzzz}.pdf (Page 139-155)