2.5 MODEL COMPARISON
Suppose we have two simple regression models, $M_1$ and $M_2$, which purport to explain $y$. These models differ in their explanatory variables. We distinguish the two models by adding subscripts to the variables and parameters. That is, $M_j$ for $j = 1, 2$ is based on the simple linear regression model:

$$y_i = \beta_j x_{ji} + \varepsilon_{ji} \tag{2.24}$$

for $i = 1, \ldots, N$. Assumptions about $\varepsilon_{ji}$ and $x_{ji}$ are the same as those about $\varepsilon_i$ and $x_i$ in the previous section (i.e. $\varepsilon_{ji}$ is i.i.d. $N(0, h_j^{-1})$, and $x_{ji}$ is either not random or exogenous for $j = 1, 2$).
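For the two models, we write the Normal-Gamma natural conjugate priors as:

$$\beta_j, h_j \mid M_j \sim NG(\underline{\beta}_j, \underline{V}_j, \underline{s}_j^{-2}, \underline{\nu}_j) \tag{2.25}$$

which implies posteriors of the form:

$$\beta_j, h_j \mid y, M_j \sim NG(\overline{\beta}_j, \overline{V}_j, \overline{s}_j^{-2}, \overline{\nu}_j) \tag{2.26}$$

where

$$\overline{V}_j = \frac{1}{\underline{V}_j^{-1} + \sum x_{ji}^2} \tag{2.27}$$

$$\overline{\beta}_j = \overline{V}_j \left( \underline{V}_j^{-1} \underline{\beta}_j + \hat{\beta}_j \sum x_{ji}^2 \right) \tag{2.28}$$

$$\overline{\nu}_j = \underline{\nu}_j + N \tag{2.29}$$

and $\overline{s}_j^2$ is defined implicitly through

$$\overline{\nu}_j \overline{s}_j^2 = \underline{\nu}_j \underline{s}_j^2 + \nu_j s_j^2 + \frac{(\hat{\beta}_j - \underline{\beta}_j)^2}{\underline{V}_j + \dfrac{1}{\sum x_{ji}^2}} \tag{2.30}$$

$\hat{\beta}_j$, $s_j^2$ and $\nu_j$ are OLS quantities analogous to those defined in (2.3)–(2.5). In other words, everything is as in (2.7)–(2.12), except that we have added $j$ subscripts to distinguish between the two models.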
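As an illustration, the updating rules (2.27)–(2.30) can be sketched in a few lines of Python. This is only a minimal sketch, not code from the text: the names `NormalGamma`, `ols_quantities` and `posterior` are hypothetical, and the OLS quantities are assumed to take their usual no-intercept form (slope estimate equal to the sum of $x_i y_i$ over the sum of $x_i^2$, with $\nu s^2$ equal to the sum of squared errors), since (2.3)–(2.5) are not reproduced in this section.

```python
from dataclasses import dataclass


@dataclass
class NormalGamma:
    """Hyperparameters of NG(beta, V, s^-2, nu); the class name is hypothetical."""
    beta: float  # location of the slope coefficient
    V: float     # scale factor for the slope's variance
    s2: float    # s^2, the inverse of the third Normal-Gamma argument
    nu: float    # degrees of freedom


def ols_quantities(x, y):
    """Return (beta_hat, sse, sum_x2) for a regression with no intercept.

    Assumption: nu * s^2 equals the sum of squared errors (sse), so sse is
    returned directly instead of s^2 and nu separately.
    """
    sum_x2 = sum(xi * xi for xi in x)
    beta_hat = sum(xi * yi for xi, yi in zip(x, y)) / sum_x2
    sse = sum((yi - beta_hat * xi) ** 2 for xi, yi in zip(x, y))
    return beta_hat, sse, sum_x2


def posterior(prior, x, y):
    """Map prior hyperparameters to posterior ones via (2.27)-(2.30)."""
    beta_hat, sse, sum_x2 = ols_quantities(x, y)
    V_bar = 1.0 / (1.0 / prior.V + sum_x2)                          # (2.27)
    beta_bar = V_bar * (prior.beta / prior.V + beta_hat * sum_x2)   # (2.28)
    nu_bar = prior.nu + len(y)                                      # (2.29)
    nu_s2_bar = (prior.nu * prior.s2 + sse
                 + (beta_hat - prior.beta) ** 2
                 / (prior.V + 1.0 / sum_x2))                        # (2.30)
    return NormalGamma(beta_bar, V_bar, nu_s2_bar / nu_bar, nu_bar)
```

Calling `posterior` once per model, each with its own explanatory variable and prior, gives the hyperparameters needed for the model comparison that follows.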
Equations (2.26)–(2.30) can be used to carry out posterior inference in either of the two models. However, our purpose here is to discuss model comparison. As described in Chapter 1, a chief tool of Bayesian model comparison is the posterior odds ratio:
$$PO_{12} = \frac{p(y \mid M_1)\, p(M_1)}{p(y \mid M_2)\, p(M_2)}$$
The prior model probabilities, $p(M_i)$ for $i = 1, 2$, must be selected before seeing the data. The noninformative choice, $p(M_1) = p(M_2) = \frac{1}{2}$, is commonly made. The marginal likelihood, $p(y \mid M_j)$, is calculated as:

$$p(y \mid M_j) = \int\!\!\int p(y \mid \beta_j, h_j)\, p(\beta_j, h_j)\, d\beta_j\, dh_j \tag{2.31}$$

Unlike with many models, in the Normal linear regression model with natural conjugate prior, the integrals in (2.31) can be calculated analytically. Poirier (1995, pp. 542–543) or Zellner (1971, pp. 72–75) provide details of this calculation, which allows us to write:
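$$p(y \mid M_j) = c_j \left( \frac{\overline{V}_j}{\underline{V}_j} \right)^{\frac{1}{2}} \left( \overline{\nu}_j \overline{s}_j^2 \right)^{-\frac{\overline{\nu}_j}{2}} \tag{2.32}$$

for $j = 1, 2$, where

$$c_j = \frac{\Gamma\!\left( \frac{\overline{\nu}_j}{2} \right) \left( \underline{\nu}_j \underline{s}_j^2 \right)^{\frac{\underline{\nu}_j}{2}}}{\Gamma\!\left( \frac{\underline{\nu}_j}{2} \right) \pi^{\frac{N}{2}}} \tag{2.33}$$

and $\Gamma(\cdot)$ is the Gamma function.² The posterior odds ratio comparing $M_1$ to $M_2$ becomes:

$$PO_{12} = \frac{c_1 \left( \dfrac{\overline{V}_1}{\underline{V}_1} \right)^{\frac{1}{2}} \left( \overline{\nu}_1 \overline{s}_1^2 \right)^{-\frac{\overline{\nu}_1}{2}} p(M_1)}{c_2 \left( \dfrac{\overline{V}_2}{\underline{V}_2} \right)^{\frac{1}{2}} \left( \overline{\nu}_2 \overline{s}_2^2 \right)^{-\frac{\overline{\nu}_2}{2}} p(M_2)} \tag{2.34}$$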
The posterior odds ratio can be used to calculate the posterior model probabilities, $p(M_j \mid y)$, using the relationships:

$$p(M_1 \mid y) = \frac{PO_{12}}{1 + PO_{12}}$$

and

$$p(M_2 \mid y) = \frac{1}{1 + PO_{12}}$$
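The marginal likelihood (2.32)–(2.33) and the resulting posterior model probabilities can be evaluated along the following lines. This is a hedged sketch rather than the text's own code: the function names and argument names are hypothetical, and working on the log scale (with `math.lgamma` for the log of the Gamma function) is a numerical choice made here because the raw quantities easily overflow or underflow for moderate $N$.

```python
import math


def log_marginal_likelihood(n, V_prior, s2_prior, nu_prior,
                            V_post, s2_post, nu_post):
    """Log of (2.32), with the constant c_j from (2.33).

    n is the sample size N; the *_prior and *_post arguments are the prior and
    posterior Normal-Gamma hyperparameters of one model (hypothetical names).
    """
    log_c = (math.lgamma(nu_post / 2.0)
             + (nu_prior / 2.0) * math.log(nu_prior * s2_prior)
             - math.lgamma(nu_prior / 2.0)
             - (n / 2.0) * math.log(math.pi))
    return (log_c
            + 0.5 * math.log(V_post / V_prior)
            - (nu_post / 2.0) * math.log(nu_post * s2_post))


def posterior_model_probs(log_ml_1, log_ml_2, prior_prob_m1=0.5):
    """Return (log PO_12, p(M1|y), p(M2|y)) from two log marginal likelihoods."""
    log_po = (log_ml_1 - log_ml_2
              + math.log(prior_prob_m1) - math.log(1.0 - prior_prob_m1))
    # p(M1|y) = PO/(1+PO), evaluated in a form that avoids overflow in exp().
    if log_po >= 0.0:
        p1 = 1.0 / (1.0 + math.exp(-log_po))
    else:
        po = math.exp(log_po)
        p1 = po / (1.0 + po)
    return log_po, p1, 1.0 - p1
```

Returning the log of the posterior odds instead of the odds themselves is a design choice of this sketch; exponentiate it if the ratio itself is wanted and is of moderate size.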
A discussion of (2.34) offers insight into the factors which enter a Bayesian comparison of models. First, the greater the prior odds ratio, $p(M_1)/p(M_2)$, the higher the support for $M_1$. Note, secondly, that $\overline{\nu}_j \overline{s}_j^2$ contains the term $\nu_j s_j^2$, which is the sum of squared errors (see (2.3) and (2.5)). The sum of squared errors is a common measure of the model fit, with lower values indicating a better model fit. Hence, the posterior odds ratio rewards models which fit the data better. Thirdly, other things being equal, the posterior odds ratio will indicate support for the model where there is the greatest coherency between prior and data information (i.e. $(\hat{\beta}_j - \underline{\beta}_j)^2$ enters $\overline{\nu}_j \overline{s}_j^2$). Finally, $\overline{V}_1 / \underline{V}_1$ is the ratio of posterior to prior variances. This term can be interpreted as saying, all else being equal, the model with more prior information (i.e. smaller prior variance) relative to posterior information receives most support.
As we shall see in the next chapter, posterior odds ratios also contain a reward for parsimony in that, all else being equal, posterior odds favor the model with fewer parameters. The two models compared here have the same number of parameters (i.e. $\beta_j$ and $h_j$) and, hence, this reward for parsimony is not evident. However, in general, this is an important feature of posterior odds ratios.
² See Poirier (1995, p. 98) for a definition of the Gamma function. All that you need to know here is that the Gamma function is calculated by the type of software used for Bayesian analysis (e.g. MATLAB or Gauss).
Under the noninformative variant of the natural conjugate prior (i.e. $\underline{\nu}_j = 0$, $\underline{V}_j^{-1} = 0$), the marginal likelihood is not defined and, hence, the posterior odds ratio is undefined. This is one problem with the use of noninformative priors for model comparison (we will see another problem in the next chapter). However, in the present context, a common solution to this problem is to set $\underline{\nu}_1 = \underline{\nu}_2$ equal to an arbitrarily small number and do the same with $\underline{V}_1^{-1}$ and $\underline{V}_2^{-1}$. Also, set $\underline{s}_1^2 = \underline{s}_2^2$. Under these assumptions, the posterior odds ratio is defined and simplifies and becomes arbitrarily close to:

$$PO_{12} = \frac{\left( \dfrac{1}{\sum x_{1i}^2} \right)^{\frac{1}{2}} \left( \nu_1 s_1^2 \right)^{-\frac{N}{2}} p(M_1)}{\left( \dfrac{1}{\sum x_{2i}^2} \right)^{\frac{1}{2}} \left( \nu_2 s_2^2 \right)^{-\frac{N}{2}} p(M_2)} \tag{2.35}$$

In this case, the posterior odds ratio reflects only the prior odds ratio, the relative goodness of fit of the two models, and the ratio of terms involving $\frac{1}{\sum x_{ji}^2}$, which reflect the precision of the posterior for $M_j$. However, as we shall see in the next chapter, this solution to the problem which arises from the use of the noninformative prior will not work when the number of parameters is different in the two models being compared.
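The limiting posterior odds ratio (2.35) depends only on the data and the prior odds, so it is easy to compute directly. The sketch below is illustrative only: the function names are hypothetical, the OLS slope and sum of squared errors are assumed to take their usual no-intercept form, and the result is returned on the log scale to avoid overflow when $N$ is large.

```python
import math


def _sse_and_sum_x2(x, y):
    """Sum of squared OLS errors and sum of x_i^2 for a no-intercept regression."""
    sum_x2 = sum(xi * xi for xi in x)
    beta_hat = sum(xi * yi for xi, yi in zip(x, y)) / sum_x2
    sse = sum((yi - beta_hat * xi) ** 2 for xi, yi in zip(x, y))
    return sse, sum_x2


def log_po12_noninformative(x1, x2, y, prior_prob_m1=0.5):
    """Log of the limiting posterior odds ratio in (2.35).

    x1 and x2 are the explanatory variables of M1 and M2. A model that fits
    the data perfectly (sse == 0) makes the logarithm undefined and is not
    handled here.
    """
    n = len(y)
    sse1, sum1 = _sse_and_sum_x2(x1, y)
    sse2, sum2 = _sse_and_sum_x2(x2, y)
    log_num = -0.5 * math.log(sum1) - 0.5 * n * math.log(sse1)
    log_den = -0.5 * math.log(sum2) - 0.5 * n * math.log(sse2)
    log_prior_odds = math.log(prior_prob_m1) - math.log(1.0 - prior_prob_m1)
    return log_num - log_den + log_prior_odds
```

A positive return value indicates support for $M_1$, a negative one support for $M_2$.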
In this section, we have shown how a Bayesian would compare two models. If you have many models, you can compare any or all pairs of them or calculate posterior model probabilities for each model (see the discussion after (1.7) in Chapter 1).
2.6 PREDICTION
Now let us drop the $j$ subscript and return to the single model with likelihood and prior defined by (2.6) and (2.7). Equations (2.8)–(2.12) describe Bayesian methods for learning about the parameters $\beta$ and $h$, based on a data set with $N$ observations. Suppose interest centers on predicting an unobserved data point generated from the same model. Formally, assume we have the equation:

$$y^* = \beta x^* + \varepsilon^* \tag{2.36}$$

where $y^*$ is not observed. Other than this, all the assumptions of this model are the same as for the simple regression model discussed previously (i.e. $\varepsilon^*$ is independent of $\varepsilon_i$ for $i = 1, \ldots, N$ and is $N(0, h^{-1})$, and the $\beta$ in (2.36) is the same as the $\beta$ in (2.1)). It is also necessary to assume $x^*$ is observed. To understand why the latter assumption is necessary, consider an application where the dependent variable is a worker’s salary, and the explanatory variable is some characteristic of the worker (e.g. years of education). If interest focuses on predicting the wage of a new worker, we would have to know her years of education in order to form a meaningful prediction.
As described in Chapter 1, Bayesian prediction is based on calculating:

$$p(y^* \mid y) = \int\!\!\int p(y^* \mid y, \beta, h)\, p(\beta, h \mid y)\, d\beta\, dh \tag{2.37}$$

The fact that $\varepsilon^*$ is independent of $\varepsilon_i$ implies that, conditional on $\beta$ and $h$, $y$ and $y^*$ are independent of one another and, hence, $p(y^* \mid y, \beta, h) = p(y^* \mid \beta, h)$. The terms inside the integral in (2.37) are thus the posterior, $p(\beta, h \mid y)$, and $p(y^* \mid \beta, h)$. Using a similar reasoning to that used for deriving the likelihood function, we find that

$$p(y^* \mid \beta, h) = \frac{h^{\frac{1}{2}}}{(2\pi)^{\frac{1}{2}}} \exp\left[ -\frac{h}{2} (y^* - \beta x^*)^2 \right] \tag{2.38}$$

Multiplying (2.38) by the posterior given in (2.8) and integrating as described in (2.37) yields (Zellner, 1971, pp. 72–75):
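$$p(y^* \mid y) \propto \left[ \overline{\nu} + (y^* - \overline{\beta} x^*)^2\, \overline{s}^{-2} \left( 1 + \overline{V} x^{*2} \right)^{-1} \right]^{-\frac{\overline{\nu} + 1}{2}} \tag{2.39}$$

It can be verified (see Appendix B, Definition B.25) that this is a univariate t-density with mean $\overline{\beta} x^*$, variance $\frac{\overline{\nu}}{\overline{\nu} - 2}\, \overline{s}^2 (1 + \overline{V} x^{*2})$, and degrees of freedom $\overline{\nu}$. In other words,

$$y^* \mid y \sim t\left( \overline{\beta} x^*,\ \overline{s}^2 \{ 1 + \overline{V} x^{*2} \},\ \overline{\nu} \right) \tag{2.40}$$

These results can be used to provide point predictions and measures of uncertainty associated with the point prediction (e.g. the predictive standard deviation).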
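To make the predictive results concrete, the following sketch computes the point prediction, the predictive variance and the predictive density implied by (2.39) and (2.40). It is a minimal illustration under stated assumptions: the function and argument names are hypothetical, the inputs are the posterior hyperparameters of the single model, and the normalizing constant is the standard one for a Student-t density with a location and squared-scale parameter, which (2.39) itself leaves out.

```python
import math


def predictive_summary(x_star, beta_bar, V_bar, s2_bar, nu_bar):
    """Return (mean, squared scale, variance) of the predictive t in (2.40)."""
    mean = beta_bar * x_star
    scale2 = s2_bar * (1.0 + V_bar * x_star ** 2)
    # The variance of a t distribution is finite only for more than 2 d.o.f.
    variance = nu_bar / (nu_bar - 2.0) * scale2 if nu_bar > 2.0 else math.inf
    return mean, scale2, variance


def predictive_pdf(y_star, x_star, beta_bar, V_bar, s2_bar, nu_bar):
    """Normalized Student-t density whose kernel is (2.39)."""
    mean, scale2, _ = predictive_summary(x_star, beta_bar, V_bar, s2_bar, nu_bar)
    z2 = (y_star - mean) ** 2 / scale2
    log_norm = (math.lgamma((nu_bar + 1.0) / 2.0)
                - math.lgamma(nu_bar / 2.0)
                - 0.5 * math.log(nu_bar * math.pi * scale2))
    return math.exp(log_norm - (nu_bar + 1.0) / 2.0 * math.log1p(z2 / nu_bar))
```

The predictive standard deviation mentioned in the text is the square root of the third value returned by `predictive_summary`.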
Our discussion of prediction is a logical place to introduce an important Bayesian concept: model averaging. In the previous section, we have shown how to calculate posterior model probabilities, $p(M_j \mid y)$, for $j = 1, 2$. These can be used to select one of the two models to work with. However, it is not always desirable to simply choose the one model with highest posterior model probability and throw away the other (or others). Bayesian model averaging involves keeping all models, but presenting results averaged over all models. In terms of the rules of probability, it is simple to derive:
$$p(y^* \mid y) = p(y^* \mid y, M_1)\, p(M_1 \mid y) + p(y^* \mid y, M_2)\, p(M_2 \mid y) \tag{2.41}$$

In words, insofar as interest centers on $p(y^* \mid y)$, you should not simply choose one model and work with, e.g., $p(y^* \mid y, M_1)$, but rather average results over the two models with weights given by the posterior model probabilities. Using the properties of the expected value operator (see Appendix B, Definition B.8), it follows immediately that:

$$E(y^* \mid y) = E(y^* \mid y, M_1)\, p(M_1 \mid y) + E(y^* \mid y, M_2)\, p(M_2 \mid y)$$

which can be used to calculate point predictions averaged over the two models. If $g(\cdot)$ is any function of interest (see (1.11)), then the preceding result generalizes to

$$E[g(y^*) \mid y] = E[g(y^*) \mid y, M_1]\, p(M_1 \mid y) + E[g(y^*) \mid y, M_2]\, p(M_2 \mid y) \tag{2.42}$$

which can be used to calculate other functions of the predictive such as the predictive variance.
These results can be generalized to the case of many models and to the case where the function of interest involves parameters instead of $y^*$. Bayesian model averaging is discussed in much greater detail in Chapter 11.
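As a small worked illustration of (2.41) and (2.42), the model-averaged predictive mean and variance can be obtained from each model's own predictive mean and variance. The sketch below is an assumption-laden example, not the text's code: the function name is hypothetical, and the variance is obtained by applying (2.42) with $g(y^*) = y^{*2}$ and then subtracting the squared averaged mean.

```python
def bma_predictive_moments(means, variances, probs):
    """Model-averaged predictive mean and variance.

    means[j] and variances[j] are the predictive mean and variance under
    model j; probs[j] is the posterior model probability p(M_j | y).
    """
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("posterior model probabilities must sum to one")
    # (2.42) with g(y*) = y*: the averaged point prediction.
    mean = sum(p * m for p, m in zip(probs, means))
    # (2.42) with g(y*) = y*^2, using E[y*^2 | y, M_j] = variance_j + mean_j^2.
    second_moment = sum(p * (v + m * m)
                        for p, m, v in zip(probs, means, variances))
    return mean, second_moment - mean ** 2
```

For example, two equally probable models with predictive means 1 and 3 and unit predictive variances give an averaged mean of 2 and an averaged variance of 2. The extra unit of variance, beyond the within-model variance of 1, reflects disagreement between the models. The function accepts lists of any length, so it also covers the case of many models.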