
Model Evaluation and Selection

5.3 Bayesian Model Evaluation Criterion

Like the AIC, the Bayesian information criterion (BIC) proposed by Schwarz (1978) is a criterion for evaluation of models estimated by maximum likelihood. Its construction is based on the posterior probability of the model as an application of Bayes’ theorem, which is described in more detail in Chapter 7. In this section, we consider the basic concept of its derivation.

5.3.1 Posterior Probability and BIC

In its basic concept, the BIC consists of selecting as the optimum model the one that yields the highest posterior probability under Bayes’ theorem. The derivation of this criterion is essentially as follows.

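We begin by setting the $r$ candidate models as $M_1, M_2, \cdots, M_r$, and characterizing each model $M_i$ by the probability distribution $f_i(\boldsymbol{y}|\boldsymbol{\theta}_i)$ ($\boldsymbol{\theta}_i \in \Theta_i \subset R^{p_i}$) and the prior distribution $\pi_i(\boldsymbol{\theta}_i)$ of the parameter vector $\boldsymbol{\theta}_i$, as

$$\text{Model } M_i:\quad f_i(\boldsymbol{y}|\boldsymbol{\theta}_i),\quad \pi_i(\boldsymbol{\theta}_i),\quad \boldsymbol{\theta}_i \in \Theta_i \subset R^{p_i},\quad i = 1, 2, \cdots, r. \tag{5.74}$$

The $r$ probability distribution models that have been estimated contain unknown parameter vectors $\boldsymbol{\theta}_i$ of differing dimensions. We have previously replaced the unknown parameter vectors by estimators $\hat{\boldsymbol{\theta}}_i$, and constructed an information criterion for evaluating the goodness of the statistical models $f_i(\boldsymbol{y}|\hat{\boldsymbol{\theta}}_i)$. In the Bayesian approach, in contrast, the target of the evaluation with $n$ observed data $\boldsymbol{y}$ is the following distribution, obtained by integrating $f_i(\boldsymbol{y}|\boldsymbol{\theta}_i)$ over the prior distribution $\pi_i(\boldsymbol{\theta}_i)$ of the model parameters:

$$p_i(\boldsymbol{y}) = \int f_i(\boldsymbol{y}|\boldsymbol{\theta}_i)\,\pi_i(\boldsymbol{\theta}_i)\,d\boldsymbol{\theta}_i. \tag{5.75}$$

The $p_i(\boldsymbol{y})$ represents the likelihood when the data have been observed and thus the certainty (plausibility) of the data’s observation with model $M_i$, and is known as the marginal likelihood or the marginal distribution.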

If we designate as $P(M_i)$ the prior probability that the $i$-th model will occur, the posterior probability of the $i$-th model is then given by Bayes’ theorem as

$$P(M_i|\boldsymbol{y}) = \frac{p_i(\boldsymbol{y})P(M_i)}{\sum_{j=1}^{r} p_j(\boldsymbol{y})P(M_j)}, \quad i = 1, 2, \cdots, r. \tag{5.76}$$

This posterior probability represents the probability that, when the data $\boldsymbol{y}$ are observed, they will have originated in the $i$-th model. It follows that if one model is to be selected from among $r$ models, it is most naturally the one that exhibits the highest posterior probability. Since the denominator in (5.76) is the same for all of the models, moreover, the selected model will be the one that maximizes the numerator $p_i(\boldsymbol{y})P(M_i)$. When the prior probability $P(M_i)$ is the same for all of the models, furthermore, the selected model will be the one that maximizes the marginal likelihood $p_i(\boldsymbol{y})$ of the data. If we can express the marginal likelihood represented by the integral of (5.75) in a form that is practical and easy to use, there will be no need to obtain the integral for each problem and, just as with the AIC, it can thus be used as a general model evaluation criterion.
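The selection rule based on (5.76) can be sketched numerically as follows. This is a minimal illustration, not part of the original text: the function name `posterior_model_probs` and the three log marginal likelihood values are hypothetical, and equal prior probabilities are assumed when none are supplied. Working with logarithms and subtracting the maximum before exponentiating avoids numerical underflow, since marginal likelihoods of realistic data sets are extremely small numbers.

```python
import math

def posterior_model_probs(log_marginals, priors=None):
    """Posterior probabilities P(M_i | y) from log marginal likelihoods, as in (5.76)."""
    r = len(log_marginals)
    if priors is None:
        priors = [1.0 / r] * r  # assumption: equal prior probability for every model
    # Logarithm of each numerator p_i(y) P(M_i)
    log_num = [lm + math.log(p) for lm, p in zip(log_marginals, priors)]
    # Subtract the maximum before exponentiating to avoid underflow
    m = max(log_num)
    weights = [math.exp(v - m) for v in log_num]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical log marginal likelihoods log p_i(y) for three candidate models
log_p = [-1052.3, -1049.8, -1050.9]
probs = posterior_model_probs(log_p)
best = max(range(len(probs)), key=probs.__getitem__)  # index of the selected model
print(probs, best)
```

With equal priors the selected index is simply the model with the largest marginal likelihood, in agreement with the argument above.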

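The BIC proposed by Schwarz (1978) was obtained by approximating the integral in (5.75) using Laplace’s method of integration, which will be discussed in a later section, and is usually applied in the form of the natural logarithm multiplied by $-2$ and thus as

$$-2\log p_i(\boldsymbol{y}) = -2\log\left\{\int f_i(\boldsymbol{y}|\boldsymbol{\theta}_i)\,\pi_i(\boldsymbol{\theta}_i)\,d\boldsymbol{\theta}_i\right\} \approx -2\log f_i(\boldsymbol{y}|\hat{\boldsymbol{\theta}}_i) + p_i \log n, \tag{5.77}$$

where $\hat{\boldsymbol{\theta}}_i$ is a maximum likelihood estimate of a $p_i$-dimensional parameter vector $\boldsymbol{\theta}_i$ included in $f_i(\boldsymbol{y}|\boldsymbol{\theta}_i)$. The BIC for evaluating statistical models estimated by maximum likelihood is given by

$$\begin{aligned} \mathrm{BIC} &= -2(\text{maximum log-likelihood}) + \log n\,(\text{number of free parameters}) \\ &= -2\log f(\boldsymbol{y}|\hat{\boldsymbol{\theta}}) + \log n\,(\text{number of free parameters}). \end{aligned} \tag{5.78}$$

The model that minimizes BIC is selected as the optimum model.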
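As a small illustration of (5.78), the following sketch compares two normal models on simulated data. Everything specific here is an assumption made for the example: the helper names `bic` and `normal_max_log_lik`, the simulated sample, and the choice of candidates, namely $N(0, \sigma^2)$ with one free parameter and $N(\mu, \sigma^2)$ with two.

```python
import math
import random

def bic(max_log_lik, n_params, n_obs):
    """BIC = -2 (maximum log-likelihood) + (log n)(number of free parameters), eq. (5.78)."""
    return -2.0 * max_log_lik + n_params * math.log(n_obs)

def normal_max_log_lik(y, mu=None):
    """Maximum log-likelihood of N(mu, sigma^2); mu is estimated when it is None."""
    n = len(y)
    if mu is None:
        mu = sum(y) / n  # maximum likelihood estimate of the mean
    s2 = sum((v - mu) ** 2 for v in y) / n  # maximum likelihood estimate of the variance
    return -0.5 * n * (math.log(2.0 * math.pi * s2) + 1.0)

random.seed(0)
y = [random.gauss(1.0, 1.0) for _ in range(200)]  # simulated data with true mean 1
n = len(y)
bic_fixed = bic(normal_max_log_lik(y, mu=0.0), n_params=1, n_obs=n)  # model N(0, sigma^2)
bic_free = bic(normal_max_log_lik(y), n_params=2, n_obs=n)           # model N(mu, sigma^2)
print(bic_fixed, bic_free)
```

Because the data were generated with a nonzero mean, the gain in maximum log-likelihood from estimating $\mu$ outweighs the extra penalty of $\log n$, so the two-parameter model attains the smaller BIC.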

In the absence of observed data, the equality of all of the models in prior probability implies that all of them may be selected with the same probability. Once data are observed, however, the posterior probability of each model can be calculated from Bayes’ theorem. Then, even though the same prior probability $P(M_i)$ is assumed in (5.76) for all of the models, the posterior probability $P(M_i|\boldsymbol{y})$ incorporating the information gained from the data resolves the comparison between the models and thus identifies the model that generates the data.

5.3.2 Derivation of the BIC

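The marginal likelihood (5.75) of the data $\boldsymbol{y}$ can be approximated by using the Laplace approximation for integrals (Barndorff-Nielsen and Cox, 1989, p. 169). In this description, we omit the index $i$ and the marginal likelihood is expressed as

$$p(\boldsymbol{y}) = \int f(\boldsymbol{y}|\boldsymbol{\theta})\,\pi(\boldsymbol{\theta})\,d\boldsymbol{\theta}, \tag{5.79}$$

where $\boldsymbol{\theta}$ is a $p$-dimensional parameter vector. This equation can be rewritten as

$$p(\boldsymbol{y}) = \int \exp\left\{\log f(\boldsymbol{y}|\boldsymbol{\theta})\right\}\pi(\boldsymbol{\theta})\,d\boldsymbol{\theta} = \int \exp\{\ell(\boldsymbol{\theta})\}\,\pi(\boldsymbol{\theta})\,d\boldsymbol{\theta}, \tag{5.80}$$

where $\ell(\boldsymbol{\theta})$ is the log-likelihood function $\ell(\boldsymbol{\theta}) = \log f(\boldsymbol{y}|\boldsymbol{\theta})$.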

The Laplace approximation takes advantage of the fact that when the number $n$ of observations is sufficiently large, the integrand is concentrated in a neighborhood of the mode of $\ell(\boldsymbol{\theta})$, or in this case, in a neighborhood of the maximum likelihood estimator $\hat{\boldsymbol{\theta}}$, and that the value of the integral depends on the behavior of the function in this neighborhood. By applying Laplace’s method of integration, we approximate the marginal likelihood defined by (5.79), and then derive the Bayesian information criterion BIC.

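For the maximum likelihood estimator $\hat{\boldsymbol{\theta}}$, the Taylor expansion of the log-likelihood function $\ell(\boldsymbol{\theta})$ around $\hat{\boldsymbol{\theta}}$ is given by

$$\ell(\boldsymbol{\theta}) = \ell(\hat{\boldsymbol{\theta}}) - \frac{n}{2}(\boldsymbol{\theta} - \hat{\boldsymbol{\theta}})^T J(\hat{\boldsymbol{\theta}})(\boldsymbol{\theta} - \hat{\boldsymbol{\theta}}) + \cdots, \tag{5.81}$$

where the first-order term vanishes because the derivative of $\ell(\boldsymbol{\theta})$ is zero at $\hat{\boldsymbol{\theta}}$, and

$$J(\hat{\boldsymbol{\theta}}) = -\frac{1}{n}\left.\frac{\partial^2 \ell(\boldsymbol{\theta})}{\partial\boldsymbol{\theta}\,\partial\boldsymbol{\theta}^T}\right|_{\boldsymbol{\theta}=\hat{\boldsymbol{\theta}}} = -\frac{1}{n}\left.\frac{\partial^2 \log f(\boldsymbol{y}|\boldsymbol{\theta})}{\partial\boldsymbol{\theta}\,\partial\boldsymbol{\theta}^T}\right|_{\boldsymbol{\theta}=\hat{\boldsymbol{\theta}}}. \tag{5.82}$$

Similarly, we expand the prior distribution $\pi(\boldsymbol{\theta})$ in a Taylor series around the maximum likelihood estimator $\hat{\boldsymbol{\theta}}$ as

$$\pi(\boldsymbol{\theta}) = \pi(\hat{\boldsymbol{\theta}}) + (\boldsymbol{\theta} - \hat{\boldsymbol{\theta}})^T \left.\frac{\partial \pi(\boldsymbol{\theta})}{\partial\boldsymbol{\theta}}\right|_{\boldsymbol{\theta}=\hat{\boldsymbol{\theta}}} + \cdots. \tag{5.83}$$

Substituting (5.81) and (5.83) into (5.80) and arranging the results leads to the following approximation of the marginal likelihood:

$$\begin{aligned} p(\boldsymbol{y}) &= \int \exp\left\{\ell(\hat{\boldsymbol{\theta}}) - \frac{n}{2}(\boldsymbol{\theta} - \hat{\boldsymbol{\theta}})^T J(\hat{\boldsymbol{\theta}})(\boldsymbol{\theta} - \hat{\boldsymbol{\theta}}) + \cdots\right\} \left\{\pi(\hat{\boldsymbol{\theta}}) + (\boldsymbol{\theta} - \hat{\boldsymbol{\theta}})^T \left.\frac{\partial \pi(\boldsymbol{\theta})}{\partial\boldsymbol{\theta}}\right|_{\boldsymbol{\theta}=\hat{\boldsymbol{\theta}}} + \cdots\right\} d\boldsymbol{\theta} \\ &\approx \exp\{\ell(\hat{\boldsymbol{\theta}})\}\,\pi(\hat{\boldsymbol{\theta}}) \int \exp\left\{-\frac{n}{2}(\boldsymbol{\theta} - \hat{\boldsymbol{\theta}})^T J(\hat{\boldsymbol{\theta}})(\boldsymbol{\theta} - \hat{\boldsymbol{\theta}})\right\} d\boldsymbol{\theta}. \end{aligned} \tag{5.84}$$

Here, we used the fact that $\hat{\boldsymbol{\theta}}$ converges to $\boldsymbol{\theta}$ in probability with order $\hat{\boldsymbol{\theta}} - \boldsymbol{\theta} = O_p(n^{-1/2})$, and also that the following equation holds:

$$\int (\boldsymbol{\theta} - \hat{\boldsymbol{\theta}}) \exp\left\{-\frac{n}{2}(\boldsymbol{\theta} - \hat{\boldsymbol{\theta}})^T J(\hat{\boldsymbol{\theta}})(\boldsymbol{\theta} - \hat{\boldsymbol{\theta}})\right\} d\boldsymbol{\theta} = \boldsymbol{0}, \tag{5.85}$$

which can be considered, up to a constant factor, as the expectation of the random variable $\boldsymbol{\theta} - \hat{\boldsymbol{\theta}}$ distributed as the multivariate normal distribution with mean vector $\boldsymbol{0}$.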

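In (5.84), integrating with respect to the parameter vector $\boldsymbol{\theta}$ yields

$$\int \exp\left\{-\frac{n}{2}(\boldsymbol{\theta} - \hat{\boldsymbol{\theta}})^T J(\hat{\boldsymbol{\theta}})(\boldsymbol{\theta} - \hat{\boldsymbol{\theta}})\right\} d\boldsymbol{\theta} = (2\pi)^{p/2}\, n^{-p/2}\, |J(\hat{\boldsymbol{\theta}})|^{-1/2}, \tag{5.86}$$

since the integrand is, apart from its normalizing constant, the density function of the $p$-dimensional normal distribution with mean vector $\hat{\boldsymbol{\theta}}$ and variance-covariance matrix $J^{-1}(\hat{\boldsymbol{\theta}})/n$.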

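Consequently, when the sample size $n$ is large, the marginal likelihood defined by (5.79) can be approximated as

$$p(\boldsymbol{y}) \approx \exp\{\ell(\hat{\boldsymbol{\theta}})\}\,\pi(\hat{\boldsymbol{\theta}})\,(2\pi)^{p/2}\, n^{-p/2}\, |J(\hat{\boldsymbol{\theta}})|^{-1/2}. \tag{5.87}$$

Taking the logarithm of this expression and multiplying it by $-2$, we have

$$\begin{aligned} -2\log p(\boldsymbol{y}) &= -2\log\left\{\int f(\boldsymbol{y}|\boldsymbol{\theta})\,\pi(\boldsymbol{\theta})\,d\boldsymbol{\theta}\right\} \\ &\approx -2\log f(\boldsymbol{y}|\hat{\boldsymbol{\theta}}) + p\log n + \log|J(\hat{\boldsymbol{\theta}})| - p\log(2\pi) - 2\log\pi(\hat{\boldsymbol{\theta}}). \end{aligned} \tag{5.88}$$

Then the following model evaluation criterion BIC is obtained by ignoring the terms of order $O(1)$ or smaller with respect to the sample size $n$:

$$\mathrm{BIC} = -2\log f(\boldsymbol{y}|\hat{\boldsymbol{\theta}}) + p\log n. \tag{5.89}$$

We see that BIC was obtained by approximating the marginal likelihood associated with the posterior probability of the model by Laplace’s method for integrals, and that it is not an information-theoretic criterion that leads to an estimator of the Kullback-Leibler information. It can also be seen that BIC is an evaluation criterion for models estimated by the method of maximum likelihood. Konishi et al. (2004) extended the BIC in such a way that it can be applied to the evaluation of models estimated by the regularization methods discussed in Section 3.4.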
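The accuracy of the approximations (5.87) and (5.89) can be examined in a case where the marginal likelihood is available in closed form. The sketch below is an illustration under assumptions not taken from the text: a sequence of $n$ independent Bernoulli trials with $s$ successes ($0 < s < n$) and a uniform prior $\pi(\theta) = 1$ on $(0, 1)$, for which $p(\boldsymbol{y}) = \Gamma(s+1)\Gamma(n-s+1)/\Gamma(n+2)$. Here $p = 1$, $\hat{\theta} = s/n$, and $J(\hat{\theta}) = 1/\{\hat{\theta}(1-\hat{\theta})\}$. The function names are hypothetical.

```python
import math

def log_marginal_exact(s, n):
    """Exact log p(y) for s successes in n Bernoulli trials under a uniform prior."""
    return math.lgamma(s + 1) + math.lgamma(n - s + 1) - math.lgamma(n + 2)

def max_log_lik(s, n):
    """Log-likelihood evaluated at the maximum likelihood estimate s/n."""
    t = s / n
    return s * math.log(t) + (n - s) * math.log(1 - t)

def log_marginal_laplace(s, n):
    """Logarithm of the right-hand side of (5.87) with p = 1 and pi(theta) = 1."""
    t = s / n
    J = 1.0 / (t * (1 - t))  # -(1/n) times the second derivative of the log-likelihood at t
    return (max_log_lik(s, n) + 0.5 * math.log(2 * math.pi)
            - 0.5 * math.log(n) - 0.5 * math.log(J))

def log_marginal_bic(s, n):
    """-BIC/2 from (5.89): only the log-likelihood and the (p/2) log n term are kept."""
    return max_log_lik(s, n) - 0.5 * math.log(n)

s, n = 30, 100
exact = log_marginal_exact(s, n)
laplace = log_marginal_laplace(s, n)
bic_half = log_marginal_bic(s, n)
print(exact, laplace, bic_half)
```

In this example the Laplace value is very close to the exact log marginal likelihood, while $-\mathrm{BIC}/2$ differs from it by a quantity that stays bounded as $n$ grows, which corresponds to the $O(1)$ terms dropped in passing from (5.88) to (5.89).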

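The Laplace approximation for integrals is, in general, given as follows. Let $q(\boldsymbol{\theta})$ be a real-valued function of a $p$-dimensional parameter vector $\boldsymbol{\theta}$, and let $\hat{\boldsymbol{\theta}}$ be the mode of $q(\boldsymbol{\theta})$. Then the Laplace approximation of the integral is given by

$$\int \exp\{n q(\boldsymbol{\theta})\}\,d\boldsymbol{\theta} \approx \frac{(2\pi)^{p/2}}{n^{p/2}\,|J_q(\hat{\boldsymbol{\theta}})|^{1/2}} \exp\{n q(\hat{\boldsymbol{\theta}})\}, \tag{5.90}$$

where

$$J_q(\hat{\boldsymbol{\theta}}) = -\left.\frac{\partial^2 q(\boldsymbol{\theta})}{\partial\boldsymbol{\theta}\,\partial\boldsymbol{\theta}^T}\right|_{\boldsymbol{\theta}=\hat{\boldsymbol{\theta}}}. \tag{5.91}$$

The use of Laplace’s method for integrals has been extensively investigated as a useful tool for approximating Bayesian predictive distributions, Bayes factors, and Bayesian model selection criteria (Davison, 1986; Tierney and Kadane, 1986; Kass and Wasserman, 1995; Kass and Raftery, 1995; O’Hagan, 1995; Konishi and Kitagawa, 1996; Neath and Cavanaugh, 1997; Pauler, 1998; Lanterman, 2001; Konishi et al., 2004).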
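A quick numerical check of the general formula (5.90) is possible with a one-dimensional function whose integral is known. The example is an assumption chosen for illustration, not one from the text: take $q(\theta) = \log\theta - \theta$ on $\theta > 0$, so that $\int_0^\infty \exp\{nq(\theta)\}\,d\theta = \Gamma(n+1)/n^{n+1}$. The mode is $\hat{\theta} = 1$ with $q(\hat{\theta}) = -1$ and $J_q(\hat{\theta}) = 1$, and (5.90) then reduces to Stirling's formula. The function names are hypothetical, and the integration range is the positive half-line rather than the whole real line.

```python
import math

def log_integral_exact(n):
    """log of the integral of exp{n(log t - t)} over t > 0, i.e. log(Gamma(n+1) / n^(n+1))."""
    return math.lgamma(n + 1) - (n + 1) * math.log(n)

def log_integral_laplace(n):
    """log of the right-hand side of (5.90) with p = 1, mode 1, q(1) = -1, J_q(1) = 1."""
    return 0.5 * math.log(2 * math.pi) - 0.5 * math.log(n) - n

# The gap between the exact and approximate log integrals shrinks as n grows
for n in (5, 50, 500):
    gap = log_integral_exact(n) - log_integral_laplace(n)
    print(n, gap)
```

The gap on the log scale is approximately $1/(12n)$, so the relative error of the approximation decreases at the rate $1/n$ in this example.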

5.3.3 Bayesian Inference and Model Averaging

In statistical modeling, as we have seen, the BIC is focused on selection of a single optimum approximation model for prediction of future phenomena. In multimodel inference, in contrast, the focus has shifted to multiple models as a basis for prediction of phenomena (Burnham and Anderson, 2002). In this section, we consider multimodel inference through the construction of a predictive distribution by model averaging based on the Bayesian approach (Hoeting et al., 1999; Wasserman, 2000).

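As in Section 5.3.1, we designate the $r$ candidate models as $M_1, M_2, \cdots, M_r$ and characterize each model $M_i$ by the probability distribution $f_i(\boldsymbol{y}|\boldsymbol{\theta}_i)$ ($\boldsymbol{\theta}_i \in \Theta_i \subset R^{p_i}$) and the prior distribution $\pi_i(\boldsymbol{\theta}_i)$ of the parameter vector $\boldsymbol{\theta}_i$. The predictive distribution is the model used in inference of future data $\boldsymbol{Z} = \boldsymbol{z}$ and thus in the predictive perspective, and is defined as

$$h_i(\boldsymbol{z}|\boldsymbol{y}) = \int f_i(\boldsymbol{z}|\boldsymbol{\theta}_i)\,\pi_i(\boldsymbol{\theta}_i|\boldsymbol{y})\,d\boldsymbol{\theta}_i, \quad i = 1, 2, \cdots, r, \tag{5.92}$$

where $\pi_i(\boldsymbol{\theta}_i|\boldsymbol{y})$ is the posterior distribution defined by Bayes’ theorem as

$$\pi_i(\boldsymbol{\theta}_i|\boldsymbol{y}) = \frac{f_i(\boldsymbol{y}|\boldsymbol{\theta}_i)\,\pi_i(\boldsymbol{\theta}_i)}{\int f_i(\boldsymbol{y}|\boldsymbol{\theta}_i)\,\pi_i(\boldsymbol{\theta}_i)\,d\boldsymbol{\theta}_i}. \tag{5.93}$$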

The basic concept of model averaging in the Bayesian approach comprises model construction incorporating some form of weighting as in

$$h(\boldsymbol{z}|\boldsymbol{y}) = \sum_{i=1}^{r} w_i\, h_i(\boldsymbol{z}|\boldsymbol{y}), \tag{5.94}$$

rather than selection of a single optimum model from among the $r$ predictive distribution models. Posterior probability is used for this weighting in predictive distribution modeling by model averaging in the Bayesian inference and is defined essentially as follows.

By Bayes’ theorem, if $P(M_i)$ is given as the prior probability of the $i$-th model occurrence, the posterior probability of the $i$-th model is then

$$P(M_i|\boldsymbol{y}) = \frac{p_i(\boldsymbol{y})P(M_i)}{\sum_{j=1}^{r} p_j(\boldsymbol{y})P(M_j)}, \quad i = 1, 2, \cdots, r, \tag{5.95}$$

where $p_i(\boldsymbol{y})$ is the marginal distribution in (5.75) defined as the integral for the prior distribution $\pi_i(\boldsymbol{\theta}_i)$ of the parameter vector $\boldsymbol{\theta}_i$. The predictive distribution obtained by model averaging in Bayesian inference using the posterior probability is then given by

$$h(\boldsymbol{z}|\boldsymbol{y}) = \sum_{i=1}^{r} P(M_i|\boldsymbol{y})\, h_i(\boldsymbol{z}|\boldsymbol{y}). \tag{5.96}$$

The posterior probability represents the probability that data, when observed, will have originated in the $i$-th model. The relative certainty of each model in the model set is assessed in terms of its posterior probability, and the predictive distribution is thus obtained as the weighted average of the predictive distributions of the individual models, with weights given by their posterior probabilities.
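The averaging in (5.96) amounts to forming a mixture of the individual predictive densities. The following sketch shows the mechanics only: the posterior model probabilities and the three normal predictive densities are hypothetical values chosen for illustration, and in practice they would come from (5.95) and (5.92), respectively.

```python
import math

def normal_pdf(z, mean, sd):
    """Density of N(mean, sd^2) at z."""
    return math.exp(-0.5 * ((z - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def averaged_predictive(z, post_probs, predictive_densities):
    """h(z|y) = sum_i P(M_i|y) h_i(z|y), as in (5.96)."""
    return sum(w * h(z) for w, h in zip(post_probs, predictive_densities))

# Hypothetical posterior model probabilities P(M_i|y); they must sum to one
post_probs = [0.7, 0.2, 0.1]
# Hypothetical predictive densities h_i(z|y) of the three models
predictives = [
    lambda z: normal_pdf(z, 0.0, 1.0),
    lambda z: normal_pdf(z, 0.5, 1.5),
    lambda z: normal_pdf(z, -1.0, 2.0),
]

h0 = averaged_predictive(0.0, post_probs, predictives)

# Riemann-sum check that the averaged predictive is itself a density
step = 0.01
grid = [-15.0 + step * k for k in range(3001)]
area = sum(averaged_predictive(z, post_probs, predictives) for z in grid) * step
print(h0, area)
```

Since the weights are nonnegative and sum to one, the averaged predictive distribution integrates to one whenever each component does, which the numerical check confirms up to discretization error.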

Exercises

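5.1 Verify that the Kullback-Leibler information defined by (5.21) has the property

$$I\{g(\boldsymbol{z}); f(\boldsymbol{z}|\hat{\boldsymbol{\theta}})\} \ge 0,$$

and that equality holds only when $g(\boldsymbol{z}) = f(\boldsymbol{z}|\hat{\boldsymbol{\theta}})$.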

5.2 Suppose that the true distribution $g(y)$ ($G(y)$) generating data and the specified model $f(y)$ have normal distributions $N(m, \tau^2)$ and $N(\mu, \sigma^2)$, respectively.

(a) Show that

$$E_G[\log g(Y)] = -\frac{1}{2}\log(2\pi\tau^2) - \frac{1}{2},$$

where $E_G[\cdot]$ is an expectation with respect to the true distribution $N(m, \tau^2)$.

(b) Show that

$$E_G[\log f(Y)] = -\frac{1}{2}\log(2\pi\sigma^2) - \frac{\tau^2 + (m - \mu)^2}{2\sigma^2}.$$

(c) Show that the Kullback-Leibler information of $f(y)$ with respect to $g(y)$ is given by

$$I\{g(y), f(y)\} = E_G[\log g(Y)] - E_G[\log f(Y)] = \frac{1}{2}\left\{\log\frac{\sigma^2}{\tau^2} + \frac{\tau^2 + (m - \mu)^2}{\sigma^2} - 1\right\}.$$

5.3 Suppose that the true distribution is a double exponential (Laplace) distribution $g(y) = \frac{1}{2}\exp(-|y|)$ ($-\infty < y < \infty$) and that the specified model $f(y)$ is $N(\mu, \sigma^2)$.

(a) Show that

$$E_G[\log g(Y)] = -\log 2 - 1,$$

where $E_G[\cdot]$ is an expectation with respect to the double exponential distribution.

(b) Show that

$$E_G[\log f(Y)] = -\frac{1}{2}\log(2\pi\sigma^2) - \frac{1}{4\sigma^2}(4 + 2\mu^2).$$

(c) Show that the Kullback-Leibler information of $f(y)$ with respect to $g(y)$ is given by

$$I\{g(y), f(y)\} = E_G[\log g(Y)] - E_G[\log f(Y)] = \frac{1}{2}\log(2\pi\sigma^2) + \frac{2 + \mu^2}{2\sigma^2} - \log 2 - 1.$$

(d) Find the values of $\sigma^2$ and $\mu$ that minimize the Kullback-Leibler information.

5.4 Assume that there are two dice that have the following probabilities for rolling the numbers one to six:

$$f_a = \{0.20, 0.12, 0.18, 0.12, 0.20, 0.18\}, \qquad f_b = \{0.18, 0.12, 0.14, 0.19, 0.22, 0.15\}.$$

In terms of the Kullback-Leibler information, which is the fairer die?

5.5 Suppose that two sets of data $G_1 = \{y_1, y_2, \cdots, y_n\}$ and $G_2 = \{y_{n+1}, y_{n+2}, \cdots, y_{n+m}\}$ are given. To check the homogeneity of the two data sets in question, we assume the following models:

$$G_1:\ y_1, y_2, \cdots, y_n \sim N(\mu_1, \sigma_1^2), \qquad G_2:\ y_{n+1}, y_{n+2}, \cdots, y_{n+m} \sim N(\mu_2, \sigma_2^2).$$

Derive the AIC under the following three restricted cases:

(a) $\mu_1 = \mu_2 = \mu$ and $\sigma_1^2 = \sigma_2^2 = \sigma^2$.

(b) $\sigma_1^2 = \sigma_2^2 = \sigma^2$.

(c) $\mu_1 = \mu_2 = \mu$.

5.6 Suppose that there exist $k$ possible outcomes $E_1, \cdots, E_k$ in a trial. Let $P(E_i) = p_i$, where $\sum_{i=1}^{k} p_i = 1$, and let $Y_i$ ($i = 1, \cdots, k$) denote the number of times outcome $E_i$ occurs in $n$ trials, where $\sum_{i=1}^{k} Y_i = n$. If the trials are repeated independently, then a multinomial distribution with parameters $n, p_1, \cdots, p_k$ is defined as a discrete distribution having

$$f(y_1, y_2, \cdots, y_k | p_1, p_2, \cdots, p_k) = \frac{n!}{y_1!\,y_2! \cdots y_k!}\, p_1^{y_1} p_2^{y_2} \cdots p_k^{y_k},$$

where $y_i = 0, 1, 2, \cdots, n$ ($\sum_{i=1}^{k} y_i = n$).

Assume that we have a set of data $Y_1 = n_1, Y_2 = n_2, \cdots, Y_k = n_k$ having $k$ categories. Then, show that the AIC is given by

$$\mathrm{AIC} = -2\left\{\log n! - \sum_{i=1}^{k} \log n_i! + \sum_{i=1}^{k} n_i \log\left(\frac{n_i}{n}\right)\right\} + 2(k - 1).$$

This equation can be used to determine the optimal bin size of a histogram (Sakamoto et al., 1986).

Chapter 6