Bayesian computation
5.3 Noniterative Monte Carlo methods
5.3.1 Direct sampling
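We begin with the most basic definition of Monte Carlo integration, found in many calculus texts. Suppose $\theta \sim h(\theta)$ and we seek $\gamma \equiv E[f(\theta)] = \int f(\theta) h(\theta)\, d\theta$. Then if $\theta_1, \ldots, \theta_N \stackrel{iid}{\sim} h(\theta)$, we have

$$\hat{\gamma} = \frac{1}{N} \sum_{j=1}^{N} f(\theta_j), \qquad (5.6)$$

which converges to $E[f(\theta)]$ with probability 1 as $N \to \infty$ by the Strong Law of Large Numbers. In our case, $h(\theta)$ is a posterior distribution and $\gamma$ is the posterior mean of $f(\theta)$. Hence the computation of posterior expectations requires only a sample of size N from the posterior distribution.

Notice that, in contrast to the methods of Section 5.2, the quality of the approximation in (5.6) improves as we increase N, the Monte Carlo sample size (which we choose), rather than n, the size of the dataset (which is typically beyond our control). Another contrast with asymptotic methods is that the structure of (5.6) also allows us to evaluate its accuracy for any fixed N. Since $\hat{\gamma}$ is itself a sample mean of independent observations, we have that $Var(\hat{\gamma}) = Var[f(\theta)]/N$. But $Var[f(\theta)]$ can be estimated by the sample variance of the $f(\theta_j)$ values, so that a standard error estimate for $\hat{\gamma}$ is given by

$$\widehat{se}(\hat{\gamma}) = \sqrt{\frac{1}{N(N-1)} \sum_{j=1}^{N} \left[ f(\theta_j) - \hat{\gamma} \right]^2}. \qquad (5.7)$$

Finally, the Central Limit Theorem implies that $\hat{\gamma} \pm 2\,\widehat{se}(\hat{\gamma})$ provides an approximate 95% confidence interval for the true value of the posterior mean $\gamma$. Again, N may be chosen as large as necessary to provide any preset level of narrowness to this interval. While it may seem strange for a practical Bayesian textbook to recommend use of a frequentist interval estimation procedure, Monte Carlo simulations provide one (and perhaps the only) example where they are clearly appropriate!

In addition, letting $I_{(a,b)}(x)$ denote the indicator function of the set (a, b), note that

$$p \equiv P\{a < f(\theta) < b \mid y\} = E\{ I_{(a,b)}(f(\theta)) \mid y \},$$

so that an estimate of p is available simply as

$$\hat{p} = \frac{\text{number of } f(\theta_j) \in (a, b)}{N},$$

with associated binomial standard error estimate $\sqrt{\hat{p}(1 - \hat{p})/N}$. In fact, this suggests that a histogram of the sampled $\theta_j$ values would estimate the posterior itself, since the probability in each histogram bin converges to the true bin probability. Alternatively, we could use a kernel density estimate to "smooth" the histogram,

$$\hat{p}(\theta \mid y) = \frac{1}{N h_N} \sum_{j=1}^{N} K\left( \frac{\theta - \theta_j}{h_N} \right),$$

where K is a "kernel" density (typically a normal or rectangular distribution) and $h_N$ is a window width satisfying $h_N \to 0$ and $N h_N \to \infty$ as $N \to \infty$. The interested reader is referred to the excellent book by Silverman (1986) for more on this and other methods of density estimation.

Example 5.2 Let $Y_i \stackrel{iid}{\sim} N(\mu, \sigma^2)$, i = 1, . . ., n, and suppose we adopt the reference prior $\pi(\mu, \sigma) = 1/\sigma$. Then one can show (see Berger (1985, problem 4.12), or Lee (1997, Section 2.12)) that the joint posterior of $\mu$ and $\sigma^2$ is given by

$$\mu \mid \sigma^2, y \sim N(\bar{y}, \sigma^2/n) \quad \text{and} \quad \sigma^2 \mid y \sim IG\left( \frac{n-1}{2}, \frac{(n-1)s^2}{2} \right),$$

where $s^2 = \sum_{i=1}^{n} (y_i - \bar{y})^2/(n-1)$ and IG denotes the inverse gamma distribution, written here with shape $(n-1)/2$ and scale $(n-1)s^2/2$ (equivalently, $(n-1)s^2/\sigma^2 \mid y \sim \chi^2_{n-1}$). We may thus generate samples from the joint posterior quite easily as follows. First sample $\sigma_j^2$ from this inverse gamma distribution, and then sample $\mu_j \sim N(\bar{y}, \sigma_j^2/n)$, j = 1, . . ., N. This then creates the set $\{(\mu_j, \sigma_j^2),\ j = 1, \ldots, N\}$ from $p(\mu, \sigma^2 \mid y)$. To estimate the posterior mean of $\mu$, we would use $\hat{E}(\mu \mid y) = \frac{1}{N} \sum_{j=1}^{N} \mu_j$. To obtain a 95% equal-tail posterior credible set for $\mu$, we might simply use the empirical .025 and .975 quantiles of the sample of $\mu_j$ values.

Estimates of functions of the parameters are also easily obtained. For example, suppose we seek an estimate of the distribution of $\gamma = \sigma/\mu$, the coefficient of variation. We simply define the transformed Monte Carlo samples $\gamma_j = \sigma_j/\mu_j$, j = 1, . . ., N, and create a histogram or kernel density estimate based on these values.

As a final illustration, suppose we wish to estimate $P(Z > c \mid y)$, where Z is a new observation, not part of y. Writing $\theta = (\mu, \sigma^2)$, we have

$$P(Z > c \mid y) = \int P(Z > c \mid \theta, y)\, p(\theta \mid y)\, d\theta$$
$$= \int P(Z > c \mid \theta)\, p(\theta \mid y)\, d\theta$$
$$= E_{\theta \mid y}\left[ P(Z > c \mid \theta) \right],$$

the second equality coming from the fact that Z and y are conditionally independent given $\theta$. But now the quantity inside the brackets in the third line is nothing but $1 - \Phi((c - \mu)/\sigma)$, where $\Phi$ is the cdf of a standard normal distribution. Hence

$$P(Z > c \mid y) \approx \frac{1}{N} \sum_{j=1}^{N} \left[ 1 - \Phi\left( \frac{c - \mu_j}{\sigma_j} \right) \right].$$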
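The direct sampling calculations of Example 5.2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data vector `y`, the cutoff `c = 12`, the seed, and all function names are hypothetical, and the posterior draws of $\sigma^2$ use the fact that $(n-1)s^2/\sigma^2 \mid y \sim \chi^2_{n-1}$ under the reference prior.

```python
import math
import numpy as np

def mc_estimate(values):
    """Monte Carlo mean as in (5.6) and its standard error as in (5.7)."""
    values = np.asarray(values, dtype=float)
    gamma_hat = values.mean()
    # Sample variance (ddof=1) divided by N estimates Var(gamma_hat).
    se = math.sqrt(values.var(ddof=1) / values.size)
    return gamma_hat, se

def sample_normal_posterior(y, n_draws, rng):
    """Draw (mu_j, sigma2_j) from the joint posterior under the prior 1/sigma."""
    y = np.asarray(y, dtype=float)
    n, ybar, s2 = y.size, y.mean(), y.var(ddof=1)
    # (n-1) s^2 / sigma^2 | y is chi-square(n-1), so invert a chi-square draw.
    sigma2 = (n - 1) * s2 / rng.chisquare(n - 1, size=n_draws)
    mu = rng.normal(ybar, np.sqrt(sigma2 / n))
    return mu, sigma2

def std_normal_cdf(x):
    """Standard normal cdf, evaluated elementwise with math.erf."""
    return np.array([0.5 * (1.0 + math.erf(v / math.sqrt(2.0)))
                     for v in np.atleast_1d(x)])

rng = np.random.default_rng(0)
y = rng.normal(10.0, 2.0, size=30)            # hypothetical dataset
mu, sigma2 = sample_normal_posterior(y, 20000, rng)

# Posterior mean of mu with an approximate 95% interval for the MC error.
post_mean, post_se = mc_estimate(mu)
ci_low, ci_high = post_mean - 2 * post_se, post_mean + 2 * post_se

# 95% equal-tail credible set for mu from empirical quantiles.
cred_low, cred_high = np.quantile(mu, [0.025, 0.975])

# Transformed samples for the coefficient of variation sigma / mu.
cv = np.sqrt(sigma2) / mu

# P(Z > c | y) for a new observation Z, averaging 1 - Phi((c - mu_j) / sigma_j).
c = 12.0
tail = 1.0 - std_normal_cdf((c - mu) / np.sqrt(sigma2))
p_hat, p_se = mc_estimate(tail)
```

Note that the interval `(ci_low, ci_high)` describes Monte Carlo error in the estimate of the posterior mean and shrinks as the number of draws grows, whereas `(cred_low, cred_high)` describes posterior uncertainty about $\mu$ and does not.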
5.3.2 Indirect methods
Example 5.2 suggests that given a sample from the posterior distribution, almost any quantity of interest can be estimated. But what if we can't directly sample from this distribution? This is an old problem that predates its interest by Bayesian statisticians by many years. As a result, there are several approaches one might try, of which we shall discuss only three: importance sampling, rejection sampling, and the weighted bootstrap.
Importance sampling
This approach is outlined carefully by Hammersley and Handscomb (1964); it has been championed for Bayesian analysis by Geweke (1989). Suppose we wish to approximate a posterior expectation, say

$$E(f(\theta) \mid y) = \frac{\int f(\theta) L(\theta) \pi(\theta)\, d\theta}{\int L(\theta) \pi(\theta)\, d\theta},$$

where for notational convenience we again suppress any dependence of the function of interest f and the likelihood L on the data y. Suppose we can roughly approximate the normalized likelihood times prior, $h(\theta) = L(\theta)\pi(\theta) / \int L(\theta)\pi(\theta)\, d\theta$, by some density $g(\theta)$ from which we can easily sample, say a multivariate t density, or perhaps a "split-t" (i.e., a t that uses possibly different scale parameters on either side of the mode in each coordinate direction; see Geweke, 1989, for details). Then defining the weight function $w(\theta) = L(\theta)\pi(\theta)/g(\theta)$, we have

$$E(f(\theta) \mid y) = \frac{\int f(\theta) w(\theta) g(\theta)\, d\theta}{\int w(\theta) g(\theta)\, d\theta} \approx \frac{\frac{1}{N} \sum_{j=1}^{N} f(\theta_j) w(\theta_j)}{\frac{1}{N} \sum_{j=1}^{N} w(\theta_j)}, \qquad (5.8)$$

where $\theta_j \stackrel{iid}{\sim} g(\theta)$. Here, $g(\theta)$ is called the importance function; how closely it resembles $h(\theta)$ controls how good the approximation in (5.8) is. To see this, note that if $g(\theta)$ is a good approximation, the weights will all be roughly equal, which in turn will minimize the variance of the numerator and denominator (see Ripley, 1987, Exercise 5.3). If on the other hand $g(\theta)$ is a poor approximation, many of the weights will be close to zero, and thus a few $\theta_j$ will dominate the sums, producing an inaccurate approximation.

Example 5.3 Suppose $g(\theta)$ is taken to be the relatively light-tailed normal distribution, but $h(\theta)$ has much heavier, Cauchy-like tails. Then it will take many draws from g to obtain a few samples in these tails, and these points will have disproportionately large weights (since g will be small relative to $L\pi$ for these points), thus destabilizing the estimate (5.8). As a result, a very large N will be required to obtain an approximation of acceptable accuracy.

We may check the accuracy of approximation (5.8) using the following formula:

[equation missing from this copy]

(We would of course plug in ...

Finally, West (1992) recommends adaptive approximation of posterior densities using mixtures of multivariate t distributions. That is, after drawing a sample of size $N_1$ from an initial importance sampling density $g_1$, we compute the weighted kernel density estimate

$$g_2(\theta) = \sum_{j=1}^{N_1} \bar{w}_j\, K(\theta \mid \theta_j, h^2 V), \qquad \bar{w}_j = \frac{w(\theta_j)}{\sum_{i=1}^{N_1} w(\theta_i)}.$$

Here, K is the density function of a multivariate t density with mode $\theta_j$ and scale matrix $h^2 V$, where V is an estimate of the posterior covariance matrix and the scalar h is a kernel window width. We then iterate the procedure, drawing $N_2$ importance samples from $g_2$ and revising the mixture density to $g_3$, and so on until a suitably accurate estimate is obtained.
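As a rough illustration of the ratio estimate (5.8), the following Python sketch uses a heavy-tailed Student t importance function for a light-tailed target, the safe direction suggested by Example 5.3. Everything specific here is an assumption made for the example: the unnormalized posterior is taken to be a N(1, 1) kernel, and the t degrees of freedom, location, scale, seed, and function names are hypothetical.

```python
import math
import numpy as np

def log_unnorm_posterior(theta):
    """Hypothetical log of L(theta) * pi(theta): a N(1, 1) kernel, constant dropped."""
    return -0.5 * (theta - 1.0) ** 2

def log_t_density(theta, df, loc, scale):
    """Log density of a location-scale Student t importance function g."""
    z = (theta - loc) / scale
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi) - math.log(scale))
    return log_const - 0.5 * (df + 1) * np.log1p(z ** 2 / df)

def importance_estimate(f, n_draws, rng, df=3.0, loc=0.0, scale=2.0):
    """Ratio estimate of E(f(theta) | y) in the spirit of (5.8)."""
    theta = loc + scale * rng.standard_t(df, size=n_draws)   # theta_j ~ g
    log_w = log_unnorm_posterior(theta) - log_t_density(theta, df, loc, scale)
    # Rescaling the weights by a constant cancels in the ratio.
    w = np.exp(log_w - log_w.max())
    q = w / w.sum()                                          # normalized weights
    return np.sum(q * f(theta)), q

rng = np.random.default_rng(1)
est_mean, q = importance_estimate(lambda t: t, 50000, rng)

# A crude check on weight balance: if a handful of points dominated the sums,
# the largest normalized weight would be far above 1 / N.
largest_weight = q.max()
```

Working with log weights and subtracting the maximum before exponentiating is a numerical convenience rather than part of the method; it leaves the ratio in (5.8) unchanged.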
Rejection sampling
This is an extremely general and quite common method of random generation; excellent summaries are given in the books by Ripley (1987, Section 3.2) and Devroye (1986, Section II.3). In this method, instead of trying to approximate the normalized posterior

$$h(\theta) = \frac{L(\theta)\pi(\theta)}{\int L(\theta)\pi(\theta)\, d\theta},$$

we try to "blanket" it. That is, suppose there exists an identifiable constant M > 0 and a smooth density $g(\theta)$, called the envelope function, such that $L(\theta)\pi(\theta) < M g(\theta)$ for all $\theta$ (this situation is illustrated in Figure 5.2). The rejection method proceeds as follows:
(i) Generate $\theta_j \sim g(\theta)$.
(ii) Generate U ~ Uniform(0, 1).
(iii) If $M U g(\theta_j) < L(\theta_j)\pi(\theta_j)$, accept $\theta_j$; otherwise, reject $\theta_j$.
(iv) Return to step (i) and repeat, until the desired sample $\{\theta_j,\ j = 1, \ldots, N\}$ is obtained.
The members of this sample will then be random variables from $h(\theta)$.

Figure 5.2 Unstandardized posterior distribution and proper rejection envelope.

A formal proof of this result is available in Devroye (1986, pp. 40-42) or Ripley (1987, pp. 60-62); we provide only a heuristic justification. Consider a fairly large sample of points generated from $g(\theta)$. An appropriately scaled histogram of these points would have roughly the same shape as the curve labeled "Mg" in Figure 5.2. Now consider the histogram bar centered at the point labeled "a" in the figure. The rejection step in the above algorithm has the effect of slicing off the top portion of the bar (i.e., the portion between the two curves), since only those points having $M U g(\theta_j)$ values below the lower curve are retained. But this is true for every potential value of "a" along the horizontal axis, so a histogram of the accepted $\theta_j$ values would mimic the shape of the lower curve, which of course is proportional to the posterior distribution $h(\theta)$, as desired.

Intuition suggests that M should be chosen as small as possible, so as not to unnecessarily waste samples. This is easy to confirm, since if K denotes the number of iterations required to get one accepted candidate $\theta_j$, then K is a geometric random variable, i.e.,

$$P(K = i) = (1 - p)^{i-1} p, \qquad (5.9)$$

where p is the probability of acceptance. So P(K = i) decreases monotonically, and at an exponential rate. It is left as an exercise to show that p = c/M, where $c = \int L(\theta)\pi(\theta)\, d\theta$ is the normalizing constant for the posterior $h(\theta)$. So our geometric distribution has mean E(K) = 1/p = M/c, and thus we do indeed want to minimize M. Note that if h were available for selection as the g function, we would choose the minimal acceptable value M = c, obtaining an acceptance probability of 1.

Like an importance sampling density, the envelope density g should be similar to the posterior in general appearance, but with heavier tails and sharper infinite peaks, in order to assure that there are sufficiently many rejection candidates available across its entire domain. One also has to be careful that Mg is actually an "envelope" for the unnormalized posterior $L(\theta)\pi(\theta)$. To see what happens if this condition is not met, suppose $L(\theta)\pi(\theta) > M g(\theta)$ for $\theta \in A$, the situation illustrated in Figure 5.3 with A = (a, b). Then the distribution of the accepted $\theta_j$ is not $h(\theta)$, but really

$$h^*(\theta) \propto \min\{ L(\theta)\pi(\theta),\ M g(\theta) \}. \qquad (5.10)$$

Unfortunately, even if the region of violation A is small, there is no guarantee that (5.10) is close to $h(\theta)$. That is, only a few observed envelope violations do not necessarily imply a small inaccuracy in the posterior sample.

Figure 5.3 Unstandardized posterior distribution and deficient rejection envelope.

As a possible solution, when we find a $\theta_j$ such that $L(\theta_j)\pi(\theta_j) > M g(\theta_j)$, we may do a local search in the neighborhood of $\theta_j$, and increase M accordingly. Of course, we should really go back and recheck all of the previously accepted $\theta_j$, since some may no longer be acceptable with the new, larger M. We discuss rejection algorithms designed to eliminate envelope violations following Example 5.6 below.
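Steps (i) through (iv) of the rejection method translate almost directly into code. The sketch below is a hypothetical instance, not an example from the text: the unnormalized posterior is taken to be $\theta^2(1-\theta)^3$ on (0, 1) (a Beta(3, 4) kernel, so that $c = 1/60$), the envelope density g is Uniform(0, 1), and M = 0.035 is chosen just above the kernel's maximum of 0.03456 at $\theta = 0.4$.

```python
import numpy as np

def unnorm_posterior(theta):
    """Hypothetical L(theta) * pi(theta): a Beta(3, 4) kernel on (0, 1)."""
    return theta ** 2 * (1.0 - theta) ** 3

def rejection_sample(n_samples, rng, M=0.035):
    """Rejection sampling with a Uniform(0, 1) envelope density g (g = 1 on (0, 1))."""
    accepted = []
    n_tries = 0
    while len(accepted) < n_samples:
        theta = rng.uniform(0.0, 1.0)   # (i)  candidate from g
        u = rng.uniform(0.0, 1.0)       # (ii) U ~ Uniform(0, 1)
        n_tries += 1
        g_theta = 1.0
        if M * u * g_theta < unnorm_posterior(theta):   # (iii) accept or reject
            accepted.append(theta)
    return np.array(accepted), n_tries  # (iv) stop once n_samples are accepted

rng = np.random.default_rng(2)
samples, n_tries = rejection_sample(5000, rng)

acceptance_rate = samples.size / n_tries
c = 1.0 / 60.0                          # integral of the kernel, B(3, 4)
theoretical_rate = c / 0.035            # p = c / M
```

With these choices the observed `acceptance_rate` should sit near `theoretical_rate` (roughly 0.48), and the accepted draws should have mean close to 3/7, the mean of a Beta(3, 4) distribution. Raising M lowers the acceptance rate in proportion, which is the sense in which a larger M wastes candidates.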
Weighted bootstrap
This method was presented by Smith and Gelfand (1992), and is very similar to the sampling-importance resampling algorithm of Rubin (1988).
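Suppose an M appropriate for the rejection method is not readily available, but that we do have a sample $\theta_1, \ldots, \theta_N$ from some approximating density $g(\theta)$. Define

$$w_i = \frac{L(\theta_i)\pi(\theta_i)}{g(\theta_i)} \quad \text{and} \quad q_i = \frac{w_i}{\sum_{j=1}^{N} w_j}.$$

Now draw $\theta^*$ from the discrete distribution over $\{\theta_1, \ldots, \theta_N\}$ which places mass $q_i$ at $\theta_i$. Then

$$\theta^* \sim h(\theta) \quad \text{(approximately)},$$

with the approximation improving as $N \to \infty$. This is a weighted bootstrap, since instead of resampling from the set $\{\theta_1, \ldots, \theta_N\}$ with equally likely probabilities of selection, we are resampling some points more often than others due to the unequal weighting.

To see that the method does perform as advertised, notice that for the standard bootstrap,

$$P(\theta^* \le a) = \frac{1}{N} \sum_{i=1}^{N} I_{(-\infty, a]}(\theta_i) \to \int_{-\infty}^{a} g(\theta)\, d\theta \quad \text{as } N \to \infty,$$

so that $\theta^*$ is approximately distributed as $g(\theta)$. For the weighted bootstrap,

$$P(\theta^* \le a) = \sum_{i=1}^{N} q_i I_{(-\infty, a]}(\theta_i) = \frac{\frac{1}{N} \sum_{i=1}^{N} w_i I_{(-\infty, a]}(\theta_i)}{\frac{1}{N} \sum_{i=1}^{N} w_i} \to \frac{\int_{-\infty}^{a} L(\theta)\pi(\theta)\, d\theta}{\int L(\theta)\pi(\theta)\, d\theta} = \int_{-\infty}^{a} h(\theta)\, d\theta,$$

so that $\theta^*$ is now approximately distributed as $h(\theta)$, as desired.

Note that, similar to the previous two indirect sampling methods, we need $g \approx h$, or else a very large N will be required to obtain acceptable accuracy. In particular, the "tail problem" mentioned before is potentially even more harmful here, since if there are no candidate $\theta_i$ located in the tails of $h(\theta)$, there is of course no way to resample them!

In any of the three methods discussed above, if the prior $\pi(\theta)$ is proper, it can play the role of $g(\theta)$, as shown in the following example.

Example 5.4 Suppose $\pi(\theta)$ is proper, and that the likelihood $L(\theta)$ is maximized at a known value $\hat{\theta}$. Let $M = L(\hat{\theta})$ in the rejection method, and let $g(\theta) = \pi(\theta)$. Then

$$L(\theta)\pi(\theta) \le L(\hat{\theta})\pi(\theta) = M g(\theta).$$

So we simply generate $\theta_j \sim \pi(\theta)$ and U ~ Uniform(0, 1), and accept $\theta_j$ if $U < L(\theta_j)/L(\hat{\theta})$. Clearly this ratio is also the probability of accepting a $\theta_j$ candidate. Hence this approach will be quite inefficient unless $\pi(\theta)$ is not too flat relative to the likelihood $L(\theta)$, so that most of the candidates have a reasonable chance of being accepted. Unfortunately, this will not normally be the case, since in most applications the data will carry much more information about $\theta$ than the prior.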
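The weighted bootstrap with the prior playing the role of g, as in Example 5.4, can be sketched as follows. The model is an assumption chosen so the answer can be checked analytically: hypothetical data $y_i \sim N(\theta, 1)$ with a proper $N(0, 2^2)$ prior, for which the posterior is normal with known mean and variance. When $g = \pi$, the weights reduce to $w_i = L(\theta_i)$.

```python
import numpy as np

y = np.array([1.2, 0.4, 1.9, 0.7, 1.3])   # hypothetical data, y_i ~ N(theta, 1)
prior_sd = 2.0                            # hypothetical prior: theta ~ N(0, 2^2)

def log_likelihood(theta):
    """Log L(theta) for the N(theta, 1) model, up to an additive constant."""
    theta = np.atleast_1d(theta)
    return -0.5 * ((y[None, :] - theta[:, None]) ** 2).sum(axis=1)

rng = np.random.default_rng(3)
n_candidates, n_resample = 50000, 5000

# Candidates from g = prior, so w_i = L(theta_i) pi(theta_i) / g(theta_i) = L(theta_i).
theta = rng.normal(0.0, prior_sd, size=n_candidates)
log_w = log_likelihood(theta)
w = np.exp(log_w - log_w.max())           # constant rescaling cancels in q_i
q = w / w.sum()

# Resample with replacement, placing mass q_i on theta_i.
theta_star = rng.choice(theta, size=n_resample, replace=True, p=q)

# Average acceptance probability L(theta_j) / L(theta_hat) that the rejection
# method of Example 5.4 would achieve with the same prior draws.
theta_hat = y.mean()                      # maximizer of this likelihood
accept_prob = np.exp(log_likelihood(theta) - log_likelihood(theta_hat)).mean()

# Exact conjugate posterior, used only to check the resampled values.
post_var = 1.0 / (y.size + 1.0 / prior_sd ** 2)
post_mean = post_var * y.sum()
```

Even with only five observations, fewer than one prior draw in five would be accepted by the rejection version here, which illustrates the inefficiency described in Example 5.4; the resampled `theta_star` values should nonetheless have mean and variance close to `post_mean` and `post_var`.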
5.4 Markov chain Monte Carlo methods