Basics of Bayesian computation
11.2 Importance sampling
Exercise 11.4 (Importance sampling) The purpose of this question is to learn about the properties of importance sampling in a very simple case. Assume you have a model with a single parameter, θ, and its posterior is N(0, 1).
(a) Write a program that calculates the posterior mean and standard deviation of θ using Monte Carlo integration.
(b) Write a program that calculates the posterior mean and standard deviation of θ using importance sampling and calculates the mean and standard deviation of the importance sampling weights. Use the t(0, 1, ν) distribution as the importance function.
(c) Carry out Monte Carlo integration and importance sampling with ν = 2, 5, and 100 for a given number of replications (e.g., R = 10). Compare the accuracy of the estimates across different algorithms and choices for ν. Discuss your results, paying attention to the issue of what happens to the importance sampling weights as ν increases.
(d) Redo part (c) using t(3, 1, ν) as the importance function. Discuss the factors affecting the accuracy of importance sampling in light of the fact that this importance function is a poor approximation to the posterior.
Solution
(a and b) The structure of computer code that does Monte Carlo integration and importance sampling to carry out posterior inference on θ is as follows:
• Step 1: Do all the preliminary things to create all the variables used in the Monte Carlo and importance sampling procedures (i.e., specify parameters of the importance function and initialize all the Monte Carlo and importance sampling sums at zero).
• Step 2: Draw θ from the normal posterior (for Monte Carlo integration).
• Step 3: Draw θ from the importance function and calculate the importance sampling weight (and weight squared). Multiply the drawn θ by the importance sampling weight to obtain that draw's contribution to the numerator of the importance sampling estimator (see, e.g., the introduction to this chapter).
• Step 4: Repeat Steps 2 and 3 R times.
• Step 5: Average the Step 2 draws and draws squared (to produce the Monte Carlo estimate of the posterior mean and standard deviation). Divide the sums of the weighted Step 3 draws and weighted draws squared by the sum of the weights (to produce the importance sampling estimate of the posterior mean and standard deviation). Calculate the mean and standard deviation of the weights.

Table 11.4: Monte Carlo and importance sampling results for N(0, 1) posterior.

                        Post.     Post.        Mean Imp.    Std. Dev. Imp.
                        Mean      Std. Dev.    Samp. Wt.    Samp. Wt.
Monte Carlo             0.046     0.874        -            -
Importance Sampling
  t(0, 1, 2)           −0.110     0.500        0.815        0.473
  t(0, 1, 5)            0.055     0.721        0.948        0.106
  t(0, 1, 100)          0.049     0.925        1.004        9.6 × 10^−4
  t(3, 1, 2)            1.469     0.525        0.373        0.638
  t(3, 1, 5)            1.576     0.426        0.186        0.439
  t(3, 1, 100)          1.996     0.368        0.055        0.120

The Matlab code used to perform these steps is provided on the Web site associated with this book.
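As a rough illustration of Steps 1 to 5, here is a minimal Python sketch (not the book's Matlab program). It assumes the t(mean, scale, ν) importance function is drawn as a normal variate divided by the square root of a scaled chi-square variate, and it uses only the standard library. The function names (t_pdf, draw_t, monte_carlo, importance_sampling) are hypothetical names chosen for this example.

```python
import math
import random


def t_pdf(x, mean, scale, nu):
    """Density at x of a t distribution with location mean, scale and nu d.o.f."""
    z = (x - mean) / scale
    log_c = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
             - 0.5 * math.log(nu * math.pi) - math.log(scale))
    return math.exp(log_c - (nu + 1) / 2 * math.log1p(z * z / nu))


def normal_pdf(x):
    """Density of the N(0, 1) posterior at x."""
    return math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)


def draw_t(rng, mean, scale, nu):
    """Draw from the t importance function: normal / sqrt(chi-square / nu)."""
    chi2 = rng.gammavariate(nu / 2, 2.0)  # chi-square with nu degrees of freedom
    return mean + scale * rng.gauss(0.0, 1.0) / math.sqrt(chi2 / nu)


def monte_carlo(R, rng):
    """Steps 2, 4, 5: posterior mean and std. dev. from direct N(0, 1) draws."""
    s1 = s2 = 0.0  # Step 1: initialize sums at zero
    for _ in range(R):
        theta = rng.gauss(0.0, 1.0)
        s1 += theta
        s2 += theta * theta
    mean = s1 / R
    return mean, math.sqrt(max(s2 / R - mean * mean, 0.0))


def importance_sampling(R, rng, mean_q, scale_q, nu):
    """Steps 3, 4, 5: weighted estimates plus mean and std. dev. of the weights."""
    sw = sw2 = swt = swt2 = 0.0  # Step 1: initialize sums at zero
    for _ in range(R):
        theta = draw_t(rng, mean_q, scale_q, nu)
        w = normal_pdf(theta) / t_pdf(theta, mean_q, scale_q, nu)
        sw += w
        sw2 += w * w
        swt += w * theta
        swt2 += w * theta * theta
    post_mean = swt / sw
    post_std = math.sqrt(max(swt2 / sw - post_mean ** 2, 0.0))
    w_mean = sw / R
    w_std = math.sqrt(max(sw2 / R - w_mean ** 2, 0.0))
    return post_mean, post_std, w_mean, w_std


if __name__ == "__main__":
    rng = random.Random(0)  # arbitrary seed, chosen only for reproducibility
    print("Monte Carlo:", monte_carlo(10, rng))
    for nu in (2, 5, 100):
        print(f"t(0, 1, {nu}):", importance_sampling(10, rng, 0.0, 1.0, nu))
```

Because the draws are random, the numbers printed by this sketch will not match any particular published table; only the qualitative pattern across ν should be comparable.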
(c and d) Table 11.4 presents Monte Carlo integration and the importance sampling results for each of the different importance functions specified in the question using R = 10.
Note that as R → ∞ the posterior means and standard deviations will be the same for all approaches (i.e., they will equal their true values of 0 and 1). Hence, choosing a relatively small value of R is important to see the effect of the importance function on accuracy of estimation. In an empirical exercise you would choose R to be much larger.
Table 11.4 illustrates some of the issues relating to selection of an importance function.
First, note that as ν increases the t-distribution approaches the normal; the t(0, 1, 100) is virtually the same as the true N(0, 1) posterior. Thus, importance sampling is virtually the same as Monte Carlo integration and the importance sampling weights are all virtually 1.0.
Second, the t(0, 1, 5) and t(0, 1, 2) importance functions both approximate the posterior well, but have fatter tails. Thus, they yield results that are a bit less accurate than Monte Carlo integration and have importance sampling weights that vary widely across draws.
That is, the standard deviation of the importance sampling weights indicates that some draws are receiving much more weight than others (in contrast to the more efficient Monte Carlo integration procedure, which weights all draws equally).
Third, the importance functions with mean 3.0 all approximate the N(0, 1) posterior poorly. Results with these importance functions are far from the true values (the estimated posterior means in Table 11.4 range from 1.469 to 1.996 rather than 0), indicating that it is important to choose an importance function that approximates the posterior well. Note that the posterior means are all too high. These importance functions are taking almost all of their draws in implausible regions of the parameter space (e.g., most of the importance sampling draws will be greater than 2.0 whereas the true posterior allocates very little probability to this region). Of course, importance sampling corrects for this by giving little weight to the draws greater than 2.0 and great weight to the (very few) draws less than 2.0 and, as R → ∞, importance sampling estimates will converge to the true values. But with small R importance sampling results can be misleading and care should be taken (e.g., by looking at numerical standard errors) to make sure R is large enough to yield accurate results. You might wish to experiment to find out how large R has to be to obtain accurate results using the t(3, 1, ν) importance function. (We find setting R = 10,000 yields reasonably accurate results.)
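One way to run the suggested experiment is sketched below in Python, under the same assumptions as any standard-library implementation: the t draw is built from a normal and a chi-square variate, and the weights are computed on the log scale up to a constant, which is harmless because the estimator divides by the sum of the weights. The names log_weight and is_posterior_mean are hypothetical.

```python
import math
import random


def log_weight(theta, mean_q, nu):
    """Log of N(0, 1) density over t(mean_q, 1, nu) density, up to a constant."""
    return (-0.5 * theta * theta
            + 0.5 * (nu + 1) * math.log1p((theta - mean_q) ** 2 / nu))


def is_posterior_mean(R, mean_q, nu, rng):
    """Self-normalized importance sampling estimate of the posterior mean."""
    draws, logw = [], []
    for _ in range(R):
        chi2 = rng.gammavariate(nu / 2, 2.0)  # chi-square with nu d.o.f.
        theta = mean_q + rng.gauss(0.0, 1.0) / math.sqrt(chi2 / nu)
        draws.append(theta)
        logw.append(log_weight(theta, mean_q, nu))
    top = max(logw)  # subtract the maximum before exponentiating, for stability
    w = [math.exp(lw - top) for lw in logw]
    return sum(wi * ti for wi, ti in zip(w, draws)) / sum(w)


if __name__ == "__main__":
    rng = random.Random(0)  # arbitrary seed
    for R in (10, 100, 1000, 10000):
        row = [is_posterior_mean(R, 3.0, nu, rng) for nu in (2, 5, 100)]
        print(R, [round(v, 3) for v in row])
```

Each printed row shows the estimated posterior mean for ν = 2, 5, and 100 at a given R; the true value is 0, so the distance from 0 is a direct measure of the error for that run.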
Fourth, of the t(3, 1, 2), t(3, 1, 5), and t(3, 1, 100) importance functions, it is the one with ν = 2 degrees of freedom that seems to be yielding the most accurate results. This is because it has fatter tails and, thus, is taking more dispersed draws, which means more draws in regions where the N(0, 1) posterior is appreciable.
Overall, these results suggest that it is important to get an importance function that approximates the posterior reasonably well. In empirical exercises, strategies such as setting the mean and variance of a t-distribution importance function to maximum likelihood or posterior quantities are common. However, as an insurance policy it is common to choose a small degrees-of-freedom parameter to ensure that key regions of the parameter space are covered. A common rule of thumb is that importance functions should have fatter tails than the posterior.
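The rule of thumb can be seen directly from the form of the weights in this exercise. In the sketch below, φ denotes the N(0, 1) density and μ the location of the t importance function; both symbols are notation assumed here for illustration. The first line shows that with a t importance function the weight stays bounded and vanishes in the tails. The second line reverses the roles (a t posterior with a normal importance function, a case not considered in the exercise) and shows the weight growing without bound, so that a single extreme draw can dominate the estimate.

```latex
% Assumed notation: \phi is the N(0,1) density, \mu the location of the t importance function.
w(\theta) = \frac{\phi(\theta)}{t(\theta \mid \mu, 1, \nu)}
  \;\propto\; \exp\!\left(-\tfrac{1}{2}\theta^{2}\right)
  \left[1 + \frac{(\theta - \mu)^{2}}{\nu}\right]^{(\nu+1)/2}
  \;\longrightarrow\; 0 \quad \text{as } |\theta| \to \infty ,
\\[1ex]
\tilde{w}(\theta) = \frac{t(\theta \mid 0, 1, \nu)}{\phi(\theta)}
  \;\propto\; \exp\!\left(\tfrac{1}{2}\theta^{2}\right)
  \left[1 + \frac{\theta^{2}}{\nu}\right]^{-(\nu+1)/2}
  \;\longrightarrow\; \infty \quad \text{as } |\theta| \to \infty .
```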
Exercise 11.5 (Importance sampling: prior sensitivity) The purpose of this question is to see how importance sampling can be used to carry out a prior sensitivity analysis.
This question uses the normal linear regression model with natural conjugate prior and a data set. Definitions and notation for the model are given in the introduction to Chapter 10.
Generate an artificial data set of size N = 100 from the normal linear regression model with an intercept and one other explanatory variable. Set the intercept (β1) to 0, the slope coefficient (β2) to 1.0, and h to 1.0. Generate the explanatory variable by taking random draws from the U(0, 1) distribution (although you may wish to experiment with other data sets). Suppose your prior beliefs are reflected in your base prior, which has prior hyperparameter values of β = [0 1], V = I_2, s^{-2} = 1, and ν = 1. However, you wish to carry out a prior sensitivity analysis with respect to the prior mean and standard deviation of the slope coefficient and, hence, also want to consider priors with β = [0 c] and

$$ V = \begin{bmatrix} 1 & 0 \\ 0 & d \end{bmatrix} $$

for values of c = 0, 1, and 2 and d = 0.01, 1, and 100.
(a) Calculate the posterior mean and standard deviation for the slope coefficient β2 for this data set for every one of these priors using analytical results.
(b) Write a program that does the same things as part (a) using Monte Carlo integration and use this program to produce R = 100 draws of β2 using the base prior.
(c) Write a program that uses importance sampling to carry out the prior sensitivity analysis, using only the draws produced in part (b). Compare results obtained using this approach to those obtained in part (a). How do your results change when you set R = 10,000?
Solution
(a and b) The solution to parts (a) and (b) is given as part of the solution to Exercise 11.3. Empirical results for these parts are given in the following. The answer to part (c) is based on the insight that the posterior for the normal linear regression model with the base prior can be treated as an importance function for the posteriors corresponding to the other priors used in the prior sensitivity analysis. Formally, let M1 be the normal linear regression model with base prior, M2 be the normal linear regression model with any other prior, and θ = [β1 β2 h] be the parameter vector (common to both models). Monte Carlo integration using the program written for part (b) will provide draws from p(θ|y, M1). If we treat this posterior as an importance function in a Bayesian analysis of M2, then the importance sampling weights will be
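$$ w_r = \frac{p(\theta = \theta^{(r)} \mid y, M_2)}{p(\theta = \theta^{(r)} \mid y, M_1)}, $$

where θ(r) for r = 1, . . . , R are draws from p(θ|y, M1). In a prior sensitivity analysis, the importance sampling weights simplify considerably since both models have the same likelihood function, which cancels out in the ratio, yielding (up to a multiplicative constant, the ratio of the two marginal likelihoods, which does not matter once the weights are normalized by their sum)

$$ w_r = \frac{p(\theta = \theta^{(r)} \mid M_2)}{p(\theta = \theta^{(r)} \mid M_1)}. $$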
In fact, for the particular prior structure in the question, the importance sampling weights simplify even further since the same prior is used for the error precision in each model and β1 and β2 are, a priori, uncorrelated with one another. Thus, we have

$$ w_r = \frac{p(\beta_2 = \beta_2^{(r)} \mid M_2)}{p(\beta_2 = \beta_2^{(r)} \mid M_1)}, $$

which is the ratio of two t densities.
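The reweighting idea can be illustrated in a deliberately simplified setting, which is an assumption of this sketch and not the model of the exercise: a single slope parameter with the error precision treated as known, so that the priors are normal rather than t and the posterior is available in closed form. The likelihood is approximated by a normal centered at the OLS estimate 1.074 with standard error 0.311, the values reported for this data set in the discussion of Table 11.5. Draws from the posterior under the base prior N(1, 1) are reweighted by the ratio of prior densities to approximate the posterior under a N(c, d) prior. The function names are hypothetical.

```python
import math
import random


def normal_logpdf(x, mean, var):
    """Log density of N(mean, var) at x."""
    return -0.5 * math.log(2 * math.pi * var) - 0.5 * (x - mean) ** 2 / var


def posterior(prior_mean, prior_var, b_hat, se):
    """Exact posterior mean and variance: normal prior, N(b_hat, se**2) likelihood."""
    prec = 1.0 / prior_var + 1.0 / se ** 2
    mean = (prior_mean / prior_var + b_hat / se ** 2) / prec
    return mean, 1.0 / prec


def reweight(draws, c, d, base_mean=1.0, base_var=1.0):
    """Posterior mean and std. dev. under a N(c, d) prior from base-posterior draws."""
    logw = [normal_logpdf(b, c, d) - normal_logpdf(b, base_mean, base_var)
            for b in draws]
    top = max(logw)  # subtract the maximum before exponentiating, for stability
    w = [math.exp(lw - top) for lw in logw]
    sw = sum(w)
    mean = sum(wi * b for wi, b in zip(w, draws)) / sw
    var = sum(wi * (b - mean) ** 2 for wi, b in zip(w, draws)) / sw
    return mean, math.sqrt(var)


if __name__ == "__main__":
    rng = random.Random(0)  # arbitrary seed
    b_hat, se = 1.074, 0.311  # OLS estimate and standard error quoted in the text
    m1, v1 = posterior(1.0, 1.0, b_hat, se)
    draws = [rng.gauss(m1, math.sqrt(v1)) for _ in range(10000)]
    for c in (0.0, 1.0, 2.0):
        for d in (0.01, 1.0, 100.0):
            exact_mean, exact_var = posterior(c, d, b_hat, se)
            print(c, d, round(exact_mean, 3), round(math.sqrt(exact_var), 3),
                  [round(v, 3) for v in reweight(draws, c, d)])
```

Each printed line gives c, d, the exact posterior mean and standard deviation for this simplified model, and the reweighted estimates. Changing the number of draws shows how the agreement depends on R, particularly when d = 0.01.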
(c) The sketch outline of the program is the same as for any Monte Carlo integration/importance sampling program (see the solutions to Exercises 11.3 and 11.4) and will not be repeated here. The Matlab code used to answer parts (a), (b), and (c) is provided on the Web site associated with this book.
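For readers who want to reproduce the setup outside Matlab, the artificial data set described in the question can be generated along the following lines. This is a standard-library Python sketch with hypothetical function names (simulate, ols). It also computes the OLS slope and its standard error, which are useful for judging how informative each prior is relative to the data. Since the data are random, the values it produces will differ from those reported for the book's data set.

```python
import math
import random


def simulate(n, beta1, beta2, h, rng):
    """Draw x ~ U(0, 1) and y = beta1 + beta2 * x + e with e ~ N(0, 1/h)."""
    x = [rng.random() for _ in range(n)]
    sd = 1.0 / math.sqrt(h)  # h is the error precision
    y = [beta1 + beta2 * xi + rng.gauss(0.0, sd) for xi in x]
    return x, y


def ols(x, y):
    """OLS intercept, slope, and slope standard error for one regressor."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    sse = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    s2 = sse / (n - 2)  # unbiased estimate of the error variance
    return intercept, slope, math.sqrt(s2 / sxx)


if __name__ == "__main__":
    rng = random.Random(0)  # arbitrary seed
    x, y = simulate(100, 0.0, 1.0, 1.0, rng)
    print(ols(x, y))
```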
Table 11.5 presents results using this program. A discussion of this table completes the answer to this question. It is worth noting that we are carrying out a sensitivity analysis over a very wide range of priors for β2. To give an idea about the information in the data, note that the OLS estimate of β2 is 1.074 and its standard error is 0.311. The priors with large prior variances (i.e., d = 100) are very noninformative relative to the data, whereas priors with d = 0.01 are more informative than the data. Priors with c = 1.0 are centered over the true value used to generate the data, whereas priors with c = 0.0 and c = 2.0 are far away from the true value (at least for priors with informative values for the prior variance of β2).
In the columns of Table 11.5 labeled “Analytical Results” we can see that results are quite sensitive to the prior (especially when d = 0.01).
Table 11.5: Results from prior sensitivity analysis.

Hyperparameter                                   Importance Sampling Results
Values             Analytical Results            R = 100                    R = 10,000
c      d           E(β2|y)    √Var(β2|y)         E(β2|y)    √Var(β2|y)      E(β2|y)    √Var(β2|y)
0.0    0.01        0.081      0.089              0.286      0.294           0.116      0.090
0.0    1.0         0.957      0.291              0.963      0.323           0.997      0.306
0.0    100.0       1.073      0.306              1.076      0.344           1.076      0.322
1.0    0.01        1.006      0.084              1.019      0.099           1.008      0.095
1.0    1.0         1.066      0.291              1.068      0.325           1.070      0.305
1.0    100.0       1.074      0.306              1.077      0.344           1.077      0.322
2.0    0.01        1.930      0.088              1.910      0.046           1.923      0.097
2.0    1.0         1.176      0.291              1.174      0.328           1.164      0.307
2.0    100.0       1.076      0.306              1.078      0.344           1.079      0.322

The posterior corresponding to the base prior (with c = 1.0 and d = 1.0) is used as an importance function for the posteriors corresponding to all the other priors. We expect importance sampling to work best when the former posterior is similar to the latter. Given the wide spread of priors, it might be the case that importance sampling works poorly, especially for the priors with d = 0.01. When R = 100 replications are used, then importance sampling results can, indeed, be a bit off, especially for the d = 0.01 cases. However, when R = 10,000 replications are used, the importance sampling algorithm performs much better, even for the extreme cases considered here. Thus, Table 11.5 indicates that importance sampling can be used to carry out prior sensitivity analyses, even when the sensitivity analysis is over a wide range of priors. Of course, for the normal linear regression model with a natural conjugate prior it is not necessary to use importance sampling since analytical results are available. However, for a model that does not allow for analytical posterior results, the strategy outlined in this question (taking posterior draws from a model based on one prior and then reweighting them using importance sampling) may be a very efficient way of conducting a prior sensitivity analysis.