The Normal Linear Regression Model with Other Priors
4.3 THE NORMAL LINEAR REGRESSION MODEL SUBJECT TO INEQUALITY CONSTRAINTS
4.3.4 Model Comparison
The inequality restrictions usually make it impossible to calculate the marginal likelihood for this model directly. Depending upon the exact form of the models being compared, some of the model comparison techniques discussed above can be used. Alternatively, the generic method for marginal likelihood calculation which we will discuss in the next chapter can be used.
Here we discuss two particular sorts of model comparison. Consider first the case where $M_1$ is the model considered in this section, where inequality restrictions are imposed on the Normal linear regression model with natural conjugate prior (i.e. $\beta \in A$). Let $M_2$ be the same model except that the inequality restrictions are violated (i.e. $\beta \notin A$). Since inequality restrictions are often implied by economic theory, comparing models of this form is often of interest. That is, a particular economic theory might imply $\beta \in A$ and, hence, $p(M_1|y)$ is the probability that the economic theory is correct.
A particular example of model comparison of this sort, for the case of linear inequality restrictions, is given in Chapter 3 (Section 6). As described in this previous material, model comparison involving such inequality restrictions is quite easy (and use of noninformative priors is not a problem). In practice, we can use the unrestricted Normal linear regression model with natural conjugate prior, and calculate $p(M_1|y) = p(\beta \in A|y)$ and $p(M_2|y) = 1 - p(M_1|y)$. If the inequality restrictions are linear, $p(\beta \in A|y)$ can be calculated analytically. Alternatively, the importance sampling strategy outlined in (4.40) allows for its simple calculation. That is, with the unrestricted model, $p(\beta \in A|y) = E[g(\beta)|y]$, where $g(\beta) = 1(\beta \in A)$. But as we have stressed, posterior simulation is designed precisely to evaluate such quantities. Hence, we can take random draws from the unrestricted posterior density (which is $f_t(\beta|\bar{\beta}, \bar{s}^2\bar{V}, \bar{\nu})$) and simply calculate the proportion which satisfy $\beta \in A$. This proportion is an estimate of $p(\beta \in A|y)$. But taking draws from $f_t(\beta|\bar{\beta}, \bar{s}^2\bar{V}, \bar{\nu})$ is precisely what we advocated in the importance sampling strategy outlined in (4.40). Hence, by doing importance sampling and keeping a record of how many draws are kept and how many are discarded (i.e. receive weight of zero), you can easily calculate $p(M_1|y)$ and $p(M_2|y)$.
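The counting argument above can be sketched in a few lines of code. This is only an illustration under stated assumptions: the two-coefficient posterior location, the Cholesky factor of the scale matrix, the degrees of freedom and the region $A$ (both coefficients positive) are all hypothetical values chosen for the example, and the function names are not taken from the programs associated with this book.

```python
import math
import random

def draw_multivariate_t(mean, chol, nu, rng):
    """One draw from a multivariate t with location `mean`, scale matrix
    chol * chol' (chol lower triangular) and nu degrees of freedom."""
    k = len(mean)
    z = [rng.gauss(0.0, 1.0) for _ in range(k)]
    # A chi-squared(nu) variate is Gamma(shape = nu/2, scale = 2).
    w = rng.gammavariate(nu / 2.0, 2.0)
    scale = math.sqrt(nu / w)
    return [mean[i] + scale * sum(chol[i][j] * z[j] for j in range(i + 1))
            for i in range(k)]

def prob_in_region(mean, chol, nu, in_region, n_draws=20000, seed=0):
    """Estimate p(beta in A | y) as the share of unrestricted posterior
    draws that satisfy the restrictions; also return the kept/discarded counts."""
    rng = random.Random(seed)
    kept = sum(1 for _ in range(n_draws)
               if in_region(draw_multivariate_t(mean, chol, nu, rng)))
    return kept / n_draws, kept, n_draws - kept

# Hypothetical unrestricted posterior (illustrative values only).
post_mean = [1.0, 0.0]
post_chol = [[1.0, 0.0], [0.5, 1.0]]
p_m1, kept, discarded = prob_in_region(
    post_mean, post_chol, nu=10,
    in_region=lambda b: b[0] > 0 and b[1] > 0)
p_m2 = 1.0 - p_m1
print(p_m1, p_m2, kept, discarded)
```

The kept draws are exactly those that receive an importance sampling weight of one, so the same loop can serve both for posterior inference in the restricted model and for the model comparison.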
The Savage–Dickey density ratio can be used to compare nested models where both have the same inequality restrictions imposed. That is, we now let $M_2$ be the model given in this section (i.e. the Normal linear regression model with natural conjugate prior with inequality restrictions imposed and posterior given by equation (4.36)) and let $M_1$ be equal to $M_2$ except that $\beta = \beta_0$ is imposed. If the same prior is used for the error precision, $h$, in both models then the Savage–Dickey density ratio says that the Bayes factor can be calculated as
$$BF_{12} = \frac{p(\beta = \beta_0|y, M_2)}{p(\beta = \beta_0|M_2)}$$
Unfortunately, evaluating this Bayes factor is not as easy as it looks, since the results in (4.34) and (4.36) only provide the prior and posterior kernels (i.e. these equations have proportionality signs, not equality signs). Formally, the prior and posterior densities have the form
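$$p(\beta) = \underline{c}\, f_t(\beta|\underline{\beta}, \underline{s}^2\underline{V}, \underline{\nu})\, 1(\beta \in A)$$
and
$$p(\beta|y) = \bar{c}\, f_t(\beta|\bar{\beta}, \bar{s}^2\bar{V}, \bar{\nu})\, 1(\beta \in A)$$
where $\underline{c}$ and $\bar{c}$ are prior and posterior integrating constants which ensure the densities integrate to one. The Savage–Dickey density ratio thus has the form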
$$BF_{12} = \frac{\bar{c}\, f_t(\beta = \beta_0|\bar{\beta}, \bar{s}^2\bar{V}, \bar{\nu})}{\underline{c}\, f_t(\beta = \beta_0|\underline{\beta}, \underline{s}^2\underline{V}, \underline{\nu})} \qquad (4.42)$$
Note that this involves evaluating two multivariate t densities at the point $\beta = \beta_0$ and calculating $\underline{c}$ and $\bar{c}$. For some hypotheses, it is easy to obtain $\underline{c}$ and $\bar{c}$. Consider, for instance, the case of univariate inequality restrictions such as $\beta_j > 0$. In this case, we can simply use statistical tables for the t distribution (or their computer equivalent) to obtain these integrating constants. For more general inequality restrictions, the method outlined in the previous paragraph can be used. That is, this method calculated $p(M_1|y)$, which was the probability that the restriction $\beta \in A$ held. But $\bar{c} = 1/p(M_1|y)$ since
$$\bar{c} = \frac{1}{\int f_t(\beta|\bar{\beta}, \bar{s}^2\bar{V}, \bar{\nu})\, 1(\beta \in A)\, d\beta}$$
and $p(M_1|y) = \int f_t(\beta|\bar{\beta}, \bar{s}^2\bar{V}, \bar{\nu})\, 1(\beta \in A)\, d\beta$. Calculation of $\underline{c}$ can be done in an analogous manner, except that importance sampling must be done on the prior instead of the posterior.
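As a rough sketch of how (4.42) might be evaluated for a single coefficient restricted to be positive, the code below estimates both integrating constants by simulation and combines them with the two t densities. Everything numerical here is an assumption made for illustration: the prior and posterior locations, squared scales and degrees of freedom are hypothetical, as are the function names, and the point $\beta_0$ is taken to lie inside $A$ so that the indicator functions equal one.

```python
import math
import random

def t_pdf(x, loc, scale2, nu):
    """Univariate t density with location loc, squared scale scale2, nu d.o.f."""
    z2 = (x - loc) ** 2 / scale2
    log_c = (math.lgamma((nu + 1) / 2.0) - math.lgamma(nu / 2.0)
             - 0.5 * math.log(nu * math.pi * scale2))
    return math.exp(log_c - (nu + 1) / 2.0 * math.log1p(z2 / nu))

def integrating_constant(loc, scale2, nu, in_region, n_draws, rng):
    """Estimate 1 / Pr(beta in A) under the untruncated t by counting draws.
    Assumes at least one draw lands in A."""
    kept = 0
    for _ in range(n_draws):
        w = rng.gammavariate(nu / 2.0, 2.0)  # chi-squared(nu)
        beta = loc + math.sqrt(scale2 * nu / w) * rng.gauss(0.0, 1.0)
        kept += in_region(beta)
    return n_draws / kept

def savage_dickey_bf(beta0, prior, post, in_region, n_draws=50000, seed=0):
    """Bayes factor as in (4.42); prior and post are (loc, scale2, nu) tuples.
    beta0 is assumed to satisfy the inequality restriction."""
    rng = random.Random(seed)
    c_prior = integrating_constant(*prior, in_region, n_draws, rng)
    c_post = integrating_constant(*post, in_region, n_draws, rng)
    bf = (c_post * t_pdf(beta0, *post)) / (c_prior * t_pdf(beta0, *prior))
    return bf, c_prior, c_post

# Hypothetical prior and posterior for one coefficient with restriction beta > 0.
prior = (0.0, 4.0, 5)
post = (1.5, 0.25, 25)
bf12, c_prior, c_post = savage_dickey_bf(1.0, prior, post, lambda b: b > 0)
print(bf12, c_prior, c_post)
```

For a univariate restriction like this one the constants could instead be read off the t distribution function, as the text notes; simulation is used here only because it carries over unchanged to more general regions.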
4.3.5 Prediction
The strategy outlined in (4.27) to (4.32) to carry out prediction can be implemented here with only slight modifications. With importance sampling, the draws from the importance function must be weighted as described in (4.37) and (4.38). In terms of our generic model notation, let $\theta^{(s)}$ be a random draw from an importance function, and $y^{*(s)}$ be a random draw from $p(y^*|y, \theta^{(s)})$ for $s = 1, \ldots, S$. Then
$$\hat{g}_Y = \frac{\sum_{s=1}^{S} w(\theta^{(s)})\, g(y^{*(s)})}{\sum_{s=1}^{S} w(\theta^{(s)})} \qquad (4.43)$$
converges to $E[g(y^*)|y]$ as $S$ goes to infinity, where $w(\theta^{(s)})$ is given in (4.38) or (4.39). This strategy for calculating predictive features of interest can be used anywhere importance sampling is done, including the Normal linear regression model with natural conjugate prior subject to inequality constraints.
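The weighted average in (4.43) can be sketched on a toy problem where the answer is known. The setup below is entirely an assumption made for illustration and is not the regression model of this section: the posterior of a scalar $\theta$ is taken to be N(0, 1), the importance function is a standard t with 3 degrees of freedom, and $y^*$ given $\theta$ is N($\theta$, 1), so that the predictive mean is 0 and the predictive second moment is 2.

```python
import math
import random

def normal_pdf(x):
    """Standard Normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def t_pdf_std(x, nu):
    """Standard t density with nu degrees of freedom."""
    log_c = (math.lgamma((nu + 1) / 2.0) - math.lgamma(nu / 2.0)
             - 0.5 * math.log(nu * math.pi))
    return math.exp(log_c - (nu + 1) / 2.0 * math.log1p(x * x / nu))

def predictive_expectation(g, n_draws=20000, nu=3, seed=1):
    """Weighted average of g(y*) over importance sampling draws, as in (4.43)."""
    rng = random.Random(seed)
    num = 0.0
    den = 0.0
    for _ in range(n_draws):
        # theta^(s): draw from the importance function (standard t, nu d.o.f.).
        chi2 = rng.gammavariate(nu / 2.0, 2.0)
        theta = rng.gauss(0.0, 1.0) / math.sqrt(chi2 / nu)
        # Weight: assumed posterior density over importance function density.
        w = normal_pdf(theta) / t_pdf_std(theta, nu)
        # y*^(s): draw from the assumed p(y* | y, theta^(s)) = N(theta, 1).
        y_star = rng.gauss(theta, 1.0)
        num += w * g(y_star)
        den += w
    return num / den

pred_mean = predictive_expectation(lambda y: y)
pred_second_moment = predictive_expectation(lambda y: y * y)
print(pred_mean, pred_second_moment)
```

Note that only the draw of $\theta$ is weighted; the draw of $y^*$ given $\theta^{(s)}$ comes directly from the conditional predictive density and inherits the weight of the parameter draw it was generated from.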
4.3.6 Empirical Illustration
We continue with our empirical illustration using the house price data set. Remember that the dependent variable is the sales price of a house, and the explanatory variables are lot size, number of bedrooms, number of bathrooms and number of storeys. We would expect all of these explanatory variables to have a positive effect on the price of a house. Furthermore, let us suppose the researcher knows that $\beta_2 > 5$, $\beta_3 > 2500$, $\beta_4 > 5000$ and $\beta_5 > 5000$ and wishes to include this information in her prior. In terms of the terminology of (4.33), this defines the region $A$. The prior in (4.33) is the product of $1(\beta \in A)$ and a Normal-Gamma density and, hence, requires the elicitation of hyperparameters $\underline{\beta}$, $\underline{V}$, $\underline{s}^{-2}$ and $\underline{\nu}$. We choose the same values for these hyperparameters as in Chapter 3. That is, we set $\underline{s}^{-2} = 4.0 \times 10^{-8}$, $\underline{\nu} = 5$,
$$\underline{\beta} = \begin{bmatrix} 0.0 \\ 10 \\ 5000 \\ 10\,000 \\ 10\,000 \end{bmatrix}$$
and
$$\underline{V} = \begin{bmatrix} 2.40 & 0 & 0 & 0 & 0 \\ 0 & 6.0 \times 10^{-7} & 0 & 0 & 0 \\ 0 & 0 & 0.15 & 0 & 0 \\ 0 & 0 & 0 & 0.60 & 0 \\ 0 & 0 & 0 & 0 & 0.60 \end{bmatrix}$$
We use importance sampling to carry out inference in this model.⁸ The computer code necessary to do this is a simple extension of the computer code used to do Monte Carlo integration in the empirical illustration in Chapter 3. That is, we can use (4.40) as the importance function, but this importance function is precisely the same as the posterior in Chapter 3. The importance sampling weights are then calculated as (4.36). As described above (see the discussion after (4.38)), for this choice of importance function, the importance sampling weights are either equal to one (if the draw satisfies the constraints) or zero (if it does not). By taking weighted averages of the importance sampling draws, as in (4.35), we can calculate the posterior properties of $\beta$. Numerical standard errors can be calculated using the results in Theorem 4.3. Table 4.2 contains posterior means, standard deviations and NSEs of $\beta$ along with a posterior odds ratio for comparing a model with $\beta_j = \underline{\beta}_j$ against the model with only the inequality restrictions imposed. This choice of models to compare is purely illustrative, and the posterior odds ratio is calculated using (4.42). Since $\beta_j = \underline{\beta}_j$ is a univariate restriction, $\underline{c}$ and $\bar{c}$ can be calculated using the properties of the univariate t distribution. Table 4.2 is based on 10 000 replications (i.e. $S = 10\,000$).
⁸ Note that, for simple restrictions of the sort considered in this empirical illustration, it would be
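A minimal sketch of importance sampling with zero/one weights is given below for a single coefficient with a lower bound. The location, scale, degrees of freedom and bound are hypothetical values, not the house price results, and the numerical standard error uses a common delta-method approximation for a ratio of weighted sums; that formula is an assumption of this sketch and may differ in form from the expression in Theorem 4.3.

```python
import math
import random

def restricted_posterior_summary(loc, scale, nu, lower, n_draws=10000, seed=0):
    """Posterior mean, standard deviation and NSE of a coefficient restricted
    to beta > lower, using the unrestricted t posterior as importance function."""
    rng = random.Random(seed)
    draws = []
    weights = []
    for _ in range(n_draws):
        chi2 = rng.gammavariate(nu / 2.0, 2.0)  # chi-squared(nu)
        beta = loc + scale * rng.gauss(0.0, 1.0) / math.sqrt(chi2 / nu)
        draws.append(beta)
        # Weight is one if the draw satisfies the restriction, zero otherwise.
        weights.append(1.0 if beta > lower else 0.0)
    sum_w = sum(weights)
    mean = sum(w * b for w, b in zip(weights, draws)) / sum_w
    var = sum(w * (b - mean) ** 2 for w, b in zip(weights, draws)) / sum_w
    # Assumed NSE approximation: sqrt(sum of squared weighted deviations) / sum of weights.
    nse = math.sqrt(sum((w * (b - mean)) ** 2
                        for w, b in zip(weights, draws))) / sum_w
    return mean, math.sqrt(var), nse, int(sum_w)

# Hypothetical unrestricted posterior t(5.4, 0.4^2, 20) with restriction beta > 5.
post_mean, post_sd, nse, n_kept = restricted_posterior_summary(5.4, 0.4, 20, 5.0)
print(post_mean, post_sd, nse, n_kept)
```

With zero/one weights this NSE reduces to the posterior standard deviation divided by the square root of the number of kept draws, which makes clear why discarding draws costs numerical accuracy relative to sampling the posterior directly.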
The results in Table 4.2 are quite close to those presented in Tables 3.1 or 4.1. Note that, for $\beta_4$ and $\beta_5$, the inequality restrictions we have imposed have little impact. That is, the unrestricted posterior means (standard deviations) for $\beta_4$ and $\beta_5$ in Table 3.1 are 16 965 (1708) and 7641 (997), respectively. Thus virtually all of the posterior probability is in the region where $\beta_4 > 5000$ and $\beta_5 > 5000$. Imposing the latter inequality restrictions through the prior thus has a minimal effect on the posterior. Intuitively, the data already tell us that $\beta_4 > 5000$ and $\beta_5 > 5000$, so incorporating these restrictions into the prior does not add any new information.
The inequality restrictions do, however, affect $\beta_2$ and $\beta_3$, increasing their posterior means somewhat. By cutting off the regions of the posterior with $\beta_2 < 5$ and $\beta_3 < 2500$, it is not surprising that the means increase. The posterior standard deviations in Table 4.2 are somewhat smaller than those in Table 3.1, indicating that the additional information provided in the prior decreases our posterior uncertainty about what the coefficients are.
The numerical standard errors indicate that we are achieving reasonably precise estimates and, as with any posterior simulator, if you wish more accurate estimates you can increase $S$. A careful comparison with Table 3.4, however, indicates that NSEs (and, hence, approximation errors) are somewhat larger with importance sampling than with Monte Carlo integration. For instance, with an identical number of replications, 10 000, the NSE relating to the estimation of $E(\beta_2|y)$ was 0.0037 with Monte Carlo integration and 0.0041 with importance sampling. Since Monte Carlo integration involves drawing directly from the posterior, and importance sampling involves drawing from an approximation to the posterior, the latter is less numerically efficient than the former.

Table 4.2 Posterior Results for $\beta$

              Mean         Standard Deviation   NSE       Post. Odds for $\beta_j = \underline{\beta}_j$
$\beta_1$     -5645.47     2992.87              40.53     1.20
$\beta_2$     5.50         0.30                 0.0041    $1.36 \times 10^{-29}$
$\beta_3$     3577.58      782.58               10.60     0.49
$\beta_4$     16 608.02    1666.26              22.56     $5.5 \times 10^{-4}$
The posterior odds ratios are in line with the evidence provided by posterior means and standard deviations. Except for the intercept, there is no strong evidence that $\beta_j = \underline{\beta}_j$. However, for $\beta_3$ and $\beta_5$ the posterior odds ratios attach a little bit of probability to the restrictions. Since, for these coefficients, the posterior means are not that far from $\underline{\beta}_j$ (relative to posterior standard deviations), the evidence of the posterior odds ratios is sensible.
The predictive density of the price of a house with given characteristics can be calculated as described in Section 4.3.5. That is, for each importance sampling draw of $\beta$, the methods outlined in Section 4.2.6 can be used to take a random draw, $y^{*(s)}$, for $s = 1, \ldots, S$. These draws can then be averaged as described in (4.43) to obtain any predictive feature of interest. In the previous empirical illustration in Section 4.2.7, we took draws from $p(y^*|\beta^{(s)}, h^{(s)})$. This was simple to do since the latter density was Normal. It is straightforward to adopt the same strategy here, although we would have to extend our importance function to provide draws for $h^{(s)}$. The Normal-Gamma posterior in (3.9) would be a logical importance function for such a case. Alternatively, techniques analogous to those used to go from (3.39) to (3.40) imply that
$$p(y^*|y, \beta) = p(y^*|\beta) = f_t(y^*|X^*\beta, \bar{s}^2 I_T, \bar{\nu})$$
Hence, draws from $p(y^*|\beta^{(s)})$ can be taken from the t distribution. In case you are wondering where the inequality restrictions on $\beta$ have gone to, note that our predictive draws taken from $p(y^*|\beta^{(s)})$ are conditional on the importance sampling draws of $\beta$. The latter draws already have the inequality restrictions imposed on them. If we use this method to work out the predictive density of the sales price of a house with a lot size of 5000 square feet, two bedrooms, two bathrooms and one storey, we find the predictive mean and standard deviation to be 69 408 and 18 246, respectively. These results are similar to those we have found in previous empirical illustrations using this data set.
4.4 SUMMARY
In this chapter, we have described Bayesian methods for posterior and predictive analysis and model comparison for the Normal linear regression model with two priors. The first of these is an independent Normal-Gamma prior and the second a natural conjugate prior subject to inequality restrictions. These priors were partly introduced because they are useful in many empirical settings. However, another reason for discussing them is that they allowed us to introduce important methods of computation in a familiar setting. The first of these computational methods is Gibbs sampling. In contrast to Monte Carlo integration, which involved drawing from the joint posterior distribution, Gibbs sampling involves sequentially drawing from the full posterior conditional distributions. Such draws can be treated as though they came from the joint posterior, although care has to be taken since Gibbs draws are not independent of one another, and can be dependent on the initial point chosen to start the Gibbs sampler. MCMC diagnostics are described which can be used to ensure that these two problems are overcome.
The second computational method introduced is importance sampling. This algorithm involves taking random draws from an importance function and then appropriately weighting the draws to correct for the fact that the importance function and posterior are not identical. This chapter also introduces the Savage–Dickey density ratio, which is a convenient way of writing the Bayes factor for nested model comparison.
At this stage, we have three posterior simulation algorithms: Monte Carlo integration, Gibbs sampling and importance sampling. The question of which one to use is a model-specific one. If it is easy to draw from the posterior, then Monte Carlo integration is the appropriate tool. If direct simulation of the posterior is difficult, but simulation from posterior conditionals is simple, then Gibbs sampling suggests itself. If neither Monte Carlo integration nor Gibbs sampling is easy, but a convenient approximation to the posterior suggests itself, then importance sampling is a sensible choice.
4.5 EXERCISES
4.5.1 Theoretical Exercises
1. The Savage–Dickey density ratio.
(a) Prove Theorem 4.1. (Hint: If you are having trouble with this problem, the proof is provided in Verdinelli and Wasserman, 1995.)
(b) How would your answer change if the condition $p(\psi|\omega = \omega_0, M_2) = p(\psi|M_1)$ did not hold?
2. For the Normal linear regression model with natural conjugate prior, the Bayes factor for comparing $M_1: \beta_i = 0$ to $M_2: \beta_i \neq 0$ (where $\beta_i$ is a single regression coefficient and the same prior is used for $h$ in both models) can be obtained from Chapter 3 (3.34). Alternatively, this Bayes factor can be derived using the Savage–Dickey density ratio. Show that these two approaches lead to the same result.
4.5.2 Computer-Based Exercises
Remember that some data sets and MATLAB programs are available on the website associated with this book.
3. The purpose of this question is to learn about the properties of the Gibbs sampler in a very simple case. Assume that you have a model which yields a bivariate Normal posterior,
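$$\begin{pmatrix} \theta_1 \\ \theta_2 \end{pmatrix} \sim N\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix} \right)$$
where $|\rho| < 1$ is the (known) posterior correlation between $\theta_1$ and $\theta_2$.
(a) Write a program which uses Monte Carlo integration to calculate the posterior means and standard deviations of $\theta_1$ and $\theta_2$.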
(b) Write a program which uses Gibbs sampling to calculate the posterior means and standard deviations of $\theta_1$ and $\theta_2$. (Hint: Use the properties of the multivariate Normal in Appendix B, Theorem B.9 to work out the relevant conditional posterior distributions.)
(c) Set $\rho = 0$ and compare the programs from parts (a) and (b). How many replications from each posterior simulator are necessary to estimate posterior means and standard deviations of $\theta_1$ and $\theta_2$ to two decimal places?
(d) Repeat part (c) of this question for $\rho = 0.5$, 0.9, 0.95, 0.99 and 0.999. Discuss how the degree of correlation between $\theta_1$ and $\theta_2$ affects the performance of the Gibbs sampler.
(e) Modify your Monte Carlo and Gibbs sampling programs to include numerical standard errors and (for the Gibbs sampling program) Geweke’s convergence diagnostic. Repeat parts (c) and (d) of this question. Do the numerical standard errors provide a correct view of the accuracy of approximation of the posterior simulators? Does the convergence diagnostic accurately indicate when convergence of the Gibbs sampler has been achieved?
4. The purpose of this question is to learn about the properties of importance sampling in a very simple case. Assume you have a model which has a single parameter, $\theta$, and its posterior is $N(0, 1)$.
(a) Write a program which calculates the posterior mean and standard deviation of $\theta$ using Monte Carlo integration.
(b) Write a program which calculates the posterior mean and standard deviation of $\theta$ using importance sampling, calculates a numerical standard error using Theorem 4.3 and calculates the mean and standard deviation of the importance sampling weights. Use the $f_t(\theta|0, 1, \nu)$ density as an importance function.
(c) Carry out Monte Carlo integration and importance sampling with $\nu = 1$, 3, 5, 10, 20, 50 and 100 for a given number of replications (e.g. $S = 1000$). Compare the accuracy of the estimates across different algorithms and choices for $\nu$. What happens to the mean and standard deviation of the importance sampling weights as $\nu$ increases?