
Bayesian computation


5.3 Noniterative Monte Carlo methods

5.3.1 Direct sampling

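We begin with the most basic definition of Monte Carlo integration, found in many calculus texts. Suppose $\theta \sim h(\theta)$ and we seek $\gamma \equiv E[f(\theta)] = \int f(\theta) h(\theta)\, d\theta$. Then if $\theta_1, \ldots, \theta_N \stackrel{iid}{\sim} h(\theta)$, we have

$$\hat{\gamma} = \frac{1}{N} \sum_{j=1}^{N} f(\theta_j), \qquad (5.6)$$

which converges to $E[f(\theta)]$ with probability 1 as $N \to \infty$ by the Strong Law of Large Numbers. In our case, $h(\theta)$ is a posterior distribution and $\gamma$ is the posterior mean of $f(\theta)$. Hence the computation of posterior expectations requires only a sample of size N from the posterior distribution.

Notice that, in contrast to the methods of Section 5.2, the quality of the approximation in (5.6) improves as we increase N, the Monte Carlo sample size (which we choose), rather than n, the size of the dataset (which is typically beyond our control). Another contrast with asymptotic methods is that the structure of (5.6) also allows us to evaluate its accuracy for any fixed N. Since $\hat{\gamma}$ is itself a sample mean of independent observations, we have that $Var(\hat{\gamma}) = Var[f(\theta)]/N$. But $Var[f(\theta)]$ can be estimated by the sample variance of the $f(\theta_j)$ values, so that a standard error estimate for $\hat{\gamma}$ is given by

$$\widehat{se}(\hat{\gamma}) = \sqrt{\frac{1}{N(N-1)} \sum_{j=1}^{N} \left[ f(\theta_j) - \hat{\gamma} \right]^2}. \qquad (5.7)$$

Finally, the Central Limit Theorem implies that $\hat{\gamma} \pm 2\, \widehat{se}(\hat{\gamma})$ provides an approximate 95% confidence interval for the true value of the posterior mean $\gamma$. Again, N may be chosen as large as necessary to provide any preset level of narrowness to this interval. While it may seem strange for a practical Bayesian textbook to recommend use of a frequentist interval estimation procedure, Monte Carlo simulations provide one (and perhaps the only) example where they are clearly appropriate!

In addition, letting $I_{(a,b)}(x)$ denote the indicator function of the set (a, b), note that

$$p \equiv P\{a < \theta < b \mid y\} = E\{ I_{(a,b)}(\theta) \mid y \},$$

so that an estimate of p is available simply as

$$\hat{p} = \frac{\text{number of } \theta_j \in (a, b)}{N},$$

with associated binomial standard error estimate $\sqrt{\hat{p}(1 - \hat{p})/N}$. In fact, this suggests that a histogram of the sampled $\theta_j$ values would estimate the posterior itself, since the probability in each histogram bin converges to the true bin probability. Alternatively, we could use a kernel density estimate to "smooth" the histogram,

$$\hat{p}(\theta \mid y) = \frac{1}{N h_N} \sum_{j=1}^{N} K\!\left( \frac{\theta - \theta_j}{h_N} \right),$$

where K is a "kernel" density (typically a normal or rectangular distribution) and $h_N$ is a window width satisfying $h_N \to 0$ and $N h_N \to \infty$ as $N \to \infty$. The interested reader is referred to the excellent book by Silverman (1986) for more on this and other methods of density estimation.

Example 5.2 Let $Y_i \stackrel{iid}{\sim} N(\mu, \sigma^2)$, i = 1, . . ., n, and suppose we adopt the reference prior $\pi(\mu, \sigma) = 1/\sigma$. Then one can show (see Berger (1985, problem 4.12), or Lee (1997, Section 2.12)) that the joint posterior of $\mu$ and $\sigma^2$ is given by

$$\mu \mid \sigma^2, y \sim N(\bar{y},\ \sigma^2/n) \quad \text{and} \quad \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{\sigma^2} \,\Big|\, y \sim \chi^2_{n-1}.$$

We may thus generate samples from the joint posterior quite easily as follows. First sample $\sigma_j^2$ from its marginal posterior, and then sample $\mu_j \sim N(\bar{y}, \sigma_j^2/n)$, j = 1, . . ., N. This then creates the set $\{(\mu_j, \sigma_j^2),\ j = 1, \ldots, N\}$ from $p(\mu, \sigma^2 \mid y)$. To estimate the posterior mean of $\mu$, we would use $\hat{E}(\mu \mid y) = \frac{1}{N} \sum_{j=1}^{N} \mu_j$. To obtain a 95% equal-tail posterior credible set for $\mu$, we might simply use the empirical .025 and .975 quantiles of the sample of $\mu_j$ values.

Estimates of functions of the parameters are also easily obtained. For example, suppose we seek an estimate of the distribution of $\gamma = \sigma/\mu$, the coefficient of variation. We simply define the transformed Monte Carlo samples $\gamma_j = \sigma_j/\mu_j$, j = 1, . . ., N, and create a histogram or kernel density estimate based on these values.

As a final illustration, suppose we wish to estimate $P(Y_{n+1} > c \mid y)$, where $Y_{n+1}$ is a new observation, not part of y. Writing $\theta = (\mu, \sigma^2)$, we have

$$\begin{aligned}
P(Y_{n+1} > c \mid y) &= \int P(Y_{n+1} > c \mid \theta, y)\, p(\theta \mid y)\, d\theta \\
&= \int P(Y_{n+1} > c \mid \theta)\, p(\theta \mid y)\, d\theta \\
&= \int \left[ P\!\left( \frac{Y_{n+1} - \mu}{\sigma} > \frac{c - \mu}{\sigma} \,\Big|\, \theta \right) \right] p(\theta \mid y)\, d\theta,
\end{aligned}$$

the second equality coming from the fact that $Y_{n+1}$ and y are conditionally independent given $\theta$, since the $Y_i$ are iid given $\theta$. But now the quantity inside the brackets in the third line is nothing but $1 - \Phi((c - \mu)/\sigma)$, where $\Phi$ is the cdf of a standard normal distribution. Hence

$$P(Y_{n+1} > c \mid y) \approx \frac{1}{N} \sum_{j=1}^{N} \left[ 1 - \Phi\!\left( \frac{c - \mu_j}{\sigma_j} \right) \right].$$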
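The direct sampling recipe of Example 5.2 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the data `y`, the cutoff `c`, the sample size `N`, and the helper names `direct_posterior_sample`, `mc_mean_and_se`, and `std_normal_cdf` are all assumptions made for the example. The sketch assumes the normal model with the reference prior, under which $\sum (y_i - \bar{y})^2 / \sigma^2$ has a chi-square posterior with n - 1 degrees of freedom.

```python
import math
import numpy as np

def direct_posterior_sample(y, N, rng):
    """Draw N joint posterior samples (mu_j, sigma2_j) for iid N(mu, sigma^2)
    data under the reference prior (an assumption of this sketch)."""
    y = np.asarray(y, dtype=float)
    n = y.size
    ybar = y.mean()
    ss = np.sum((y - ybar) ** 2)
    # sigma^2 | y: ss / sigma^2 is chi-square with n - 1 degrees of freedom
    sigma2 = ss / rng.chisquare(n - 1, size=N)
    # mu | sigma^2, y ~ N(ybar, sigma^2 / n)
    mu = rng.normal(ybar, np.sqrt(sigma2 / n))
    return mu, sigma2

def mc_mean_and_se(values):
    """Monte Carlo estimate (sample mean) and its standard error,
    i.e. the sample standard deviation divided by sqrt(N)."""
    values = np.asarray(values, dtype=float)
    est = values.mean()
    se = values.std(ddof=1) / math.sqrt(values.size)
    return est, se

# Standard normal cdf built from math.erf, applied elementwise
std_normal_cdf = np.vectorize(
    lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
)

rng = np.random.default_rng(0)
y = rng.normal(10.0, 2.0, size=25)   # hypothetical dataset
N = 20000                            # Monte Carlo sample size (our choice)
mu, sigma2 = direct_posterior_sample(y, N, rng)
sigma = np.sqrt(sigma2)

# Posterior mean of mu with its Monte Carlo standard error
post_mean, post_se = mc_mean_and_se(mu)

# 95% equal-tail credible set from empirical quantiles
cred_low, cred_high = np.quantile(mu, [0.025, 0.975])

# Transformed samples for the coefficient of variation sigma / mu
cv = sigma / mu

# Predictive probability P(Y_new > c | y), averaging 1 - Phi((c - mu_j)/sigma_j)
c = 12.0                             # hypothetical cutoff
pred_prob, pred_se = mc_mean_and_se(1.0 - std_normal_cdf((c - mu) / sigma))

print("posterior mean of mu:", post_mean, "+/-", 2 * post_se)
print("95% credible set for mu:", (cred_low, cred_high))
print("median coefficient of variation:", np.median(cv))
print("P(Y_new > c | y):", pred_prob, "+/-", 2 * pred_se)
```

Increasing `N` narrows the reported Monte Carlo intervals but leaves the credible set essentially unchanged, which mirrors the distinction between Monte Carlo error and posterior uncertainty.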

5.3.2 Indirect methods

Example 5.2 suggests that given a sample from the posterior distribution, almost any quantity of interest can be estimated. But what if we can't directly sample from this distribution? This is an old problem that predates Bayesian statisticians' interest in it by many years. As a result, there are several approaches one might try, of which we shall discuss only three: importance sampling, rejection sampling, and the weighted bootstrap.

Importance sampling

This approach is outlined carefully by Hammersley and Handscomb (1964); it has been championed for Bayesian analysis by Geweke (1989). Suppose we wish to approximate a posterior expectation, say

$$E(f(\theta) \mid y) = \frac{\int f(\theta) L(\theta) \pi(\theta)\, d\theta}{\int L(\theta) \pi(\theta)\, d\theta},$$

where for notational convenience we again suppress any dependence of the function of interest f and the likelihood L on the data y. Suppose we can roughly approximate the normalized likelihood times prior, $L(\theta)\pi(\theta) / \int L(\theta)\pi(\theta)\, d\theta$, by some density $g(\theta)$ from which we can easily sample - say, a multivariate t density, or perhaps a "split-t" (i.e., a t that uses possibly different scale parameters on either side of the mode in each coordinate direction; see Geweke, 1989, for details). Then defining the weight function $w(\theta) = L(\theta)\pi(\theta)/g(\theta)$, we have

$$E(f(\theta) \mid y) = \frac{\int f(\theta) w(\theta) g(\theta)\, d\theta}{\int w(\theta) g(\theta)\, d\theta} \approx \frac{\frac{1}{N} \sum_{j=1}^{N} f(\theta_j) w(\theta_j)}{\frac{1}{N} \sum_{j=1}^{N} w(\theta_j)}, \qquad (5.8)$$

where $\theta_j \stackrel{iid}{\sim} g(\theta)$. Here, $g(\theta)$ is called the importance function; how closely it resembles the normalized $L(\theta)\pi(\theta)$ controls how good the approximation in (5.8) is. To see this, note that if $g(\theta)$ is a good approximation, the weights will all be roughly equal, which in turn will minimize the variance of the numerator and denominator (see Ripley, 1987, Exercise 5.3). If on the other hand $g(\theta)$ is a poor approximation, many of the weights will be close to zero, and thus a few $\theta_j$ will dominate the sums, producing an inaccurate approximation.

Example 5.3 Suppose g is taken to be the relatively light-tailed normal distribution, but the normalized $L(\theta)\pi(\theta)$ has much heavier, Cauchy-like tails. Then it will take many draws from g to obtain a few samples in these tails, and these points will have disproportionately large weights (since g will be small relative to $L\pi$ for these points), thus destabilizing the estimate (5.8). As a result, a very large N will be required to obtain an approximation of acceptable accuracy.

We may check the accuracy of approximation (5.8) using the following formula:

[formula missing from the source]

(We would of course plug in [text missing from the source])

Finally, West (1992) recommends adaptive approximation of posterior densities using mixtures of multivariate t distributions. That is, after drawing a sample of size $N_1$ from an initial importance sampling density $g_1$, we compute the weighted kernel density estimate

$$g_2(\theta) = \sum_{j=1}^{N_1} w_j\, K(\theta \mid \theta_j, h, V).$$

Here, the $w_j$ are the importance weights $w(\theta_j)$ normalized to sum to one, and K is the density function of a multivariate t density with mode $\theta_j$ and a scale matrix determined by h and V, where V is an estimate of the posterior covariance matrix and h is a kernel window width. We then iterate the procedure, drawing $N_2$ importance samples from $g_2$ and revising the mixture density to $g_3$, and so on until a suitably accurate estimate is obtained.
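The ratio estimate (5.8) and the tail problem of Example 5.3 can be illustrated with a short Python sketch. Everything specific here is an assumption made for the illustration: the unnormalized target is a Student t kernel with 3 degrees of freedom standing in for $L(\theta)\pi(\theta)$, the function of interest is $f(\theta) = \theta^2$ (true posterior mean 3), and the helper `importance_estimate` is hypothetical. The standard error uses the usual delta-method approximation for a ratio of means, which is not necessarily the accuracy formula the text refers to, and the effective sample size is a common diagnostic that the text does not mention.

```python
import numpy as np

def importance_estimate(f, log_unnorm_post, draw_g, log_g, N, rng):
    """Self-normalized importance sampling estimate of E[f(theta) | y],
    with a delta-method standard error and an effective sample size."""
    theta = draw_g(N, rng)
    log_w = log_unnorm_post(theta) - log_g(theta)
    # Rescale the weights for numerical stability; the constant cancels
    w = np.exp(log_w - log_w.max())
    fvals = f(theta)
    est = np.sum(fvals * w) / np.sum(w)
    # Delta-method (linearization) standard error of the ratio estimate
    se = np.sqrt(np.sum((w * (fvals - est)) ** 2)) / np.sum(w)
    # Effective sample size: equals N when all weights are equal
    ess = np.sum(w) ** 2 / np.sum(w ** 2)
    return est, se, ess

# Hypothetical unnormalized posterior: t kernel with 3 degrees of freedom
def log_target(theta):
    return -2.0 * np.log1p(theta ** 2 / 3.0)

def f(theta):
    return theta ** 2

# Heavy-tailed importance function: standard Cauchy
def draw_cauchy(N, rng):
    return rng.standard_cauchy(N)

def log_cauchy(theta):
    return -np.log(np.pi) - np.log1p(theta ** 2)

# Light-tailed importance function: standard normal (the Example 5.3 setting)
def draw_normal(N, rng):
    return rng.standard_normal(N)

def log_normal(theta):
    return -0.5 * theta ** 2 - 0.5 * np.log(2.0 * np.pi)

rng = np.random.default_rng(1)
N = 100_000
est_c, se_c, ess_c = importance_estimate(f, log_target, draw_cauchy, log_cauchy, N, rng)
est_n, se_n, ess_n = importance_estimate(f, log_target, draw_normal, log_normal, N, rng)

print("Cauchy g: estimate", est_c, "se", se_c, "ESS", ess_c)
print("Normal g: estimate", est_n, "se", se_n, "ESS", ess_n)
```

With the Cauchy importance function the weights are bounded, so the estimate is stable. With the normal importance function the weights have infinite variance, so the reported standard error itself cannot be trusted, which is one way the instability described in Example 5.3 shows up in practice.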

Rejection sampling

This is an extremely general and quite common method of random generation; excellent summaries are given in the books by Ripley (1987, Section 3.2) and Devroye (1986, Section II.3). In this method, instead of trying to approximate the normalized posterior

$$h(\theta) = \frac{L(\theta)\pi(\theta)}{\int L(\theta)\pi(\theta)\, d\theta},$$

we try to "blanket" it. That is, suppose there exists an identifiable constant M > 0 and a smooth density $g(\theta)$, called the envelope function, such that $L(\theta)\pi(\theta) < M g(\theta)$ for all $\theta$ (this situation is illustrated in Figure 5.2). The rejection method proceeds as follows:

(i) Generate $\theta_j \sim g(\theta)$.

(ii) Generate U ~ Uniform(0, 1).

(iii) If $M U g(\theta_j) < L(\theta_j)\pi(\theta_j)$, accept $\theta_j$; otherwise, reject $\theta_j$.

(iv) Return to step (i) and repeat, until the desired sample $\{\theta_j,\ j = 1, \ldots, N\}$ is obtained.

The members of this sample will then be random variables from $h(\theta)$.

Figure 5.2 Unstandardized posterior distribution and proper rejection envelope.

A formal proof of this result is available in Devroye (1986, pp. 40-42) or Ripley (1987, pp. 60-62); we provide only a heuristic justification. Consider a fairly large sample of points generated from $g(\theta)$. An appropriately scaled histogram of these points would have roughly the same shape as the curve labeled "Mg" in Figure 5.2. Now consider the histogram bar centered at the point labeled "a" in the figure. The rejection step in the above algorithm has the effect of slicing off the top portion of the bar (i.e., the portion between the two curves), since only those points having $M U g(\theta_j)$ values below the lower curve are retained. But this is true for every potential value of "a" along the horizontal axis, so a histogram of the accepted $\theta_j$ values would mimic the shape of the lower curve, which of course is proportional to the posterior distribution $h(\theta)$, as desired.

Intuition suggests that M should be chosen as small as possible, so as not to unnecessarily waste samples. This is easy to confirm, since if K denotes the number of iterations required to get one accepted candidate $\theta_j$, then K is a geometric random variable, i.e.,

$$P(K = i) = (1 - p)^{i-1} p, \qquad (5.9)$$

where p is the probability of acceptance. So P(K = i) decreases monotonically, and at an exponential rate. It is left as an exercise to show that p = c/M, where $c = \int L(\theta)\pi(\theta)\, d\theta$ is the normalizing constant for the posterior $h(\theta)$. Since our geometric distribution has mean $E(K) = 1/p = M/c$, we do indeed want to minimize M. Note that if h were available for selection as the g function, we would choose the minimal acceptable value M = c, obtaining an acceptance probability of 1.

Like an importance sampling density, the envelope density g should be similar to the posterior in general appearance, but with heavier tails and sharper infinite peaks, in order to assure that there are sufficiently many rejection candidates available across its entire domain. One also has to be careful that Mg is actually an "envelope" for the unnormalized posterior $L\pi$. To see what happens if this condition is not met, suppose the situation illustrated in Figure 5.3 with A = (a, b), the set on which $L(\theta)\pi(\theta) > M g(\theta)$. Then the distribution of the accepted $\theta_j$ is not $h(\theta)$, but really

$$h^*(\theta) \propto \begin{cases} L(\theta)\pi(\theta), & \theta \notin A, \\ M g(\theta), & \theta \in A. \end{cases} \qquad (5.10)$$

Unfortunately, even if A is small, there is no guarantee that (5.10) is close to $h(\theta)$. That is, only a few observed envelope violations do not necessarily imply a small inaccuracy in the posterior sample.

Figure 5.3 Unstandardized posterior distribution and deficient rejection envelope.

As a possible solution, when we find a $\theta_j$ such that $L(\theta_j)\pi(\theta_j) > M g(\theta_j)$, we may do a local search in the neighborhood of $\theta_j$ and increase M accordingly. Of course, we should really go back and recheck all of the previously accepted $\theta_j$, since some may no longer be acceptable with the new, larger M. We discuss rejection algorithms designed to eliminate envelope violations following Example 5.6 below.
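The four steps of the rejection method can be sketched in Python as follows. The setup is a hypothetical one chosen so that the answer is known: the unnormalized posterior is $\exp(-\theta^2/2)$ (so $h$ is standard normal and $c = \sqrt{2\pi}$), the envelope density g is standard Cauchy, and M = 4, which exceeds the smallest valid constant $2\pi e^{-1/2} \approx 3.81$. The function name `rejection_sample` and the batching scheme are assumptions of this sketch, and the violation counter is added only as a diagnostic in the spirit of the discussion of Figure 5.3.

```python
import math
import numpy as np

def rejection_sample(unnorm_post, draw_g, g_pdf, M, N, rng, batch=10_000):
    """Rejection sampling from h proportional to unnorm_post, using the
    envelope M * g. Returns N accepted draws, the observed acceptance
    rate, and the number of envelope violations seen along the way."""
    accepted = []
    n_accepted = 0
    n_tried = 0
    n_violations = 0
    while n_accepted < N:
        theta = draw_g(batch, rng)            # step (i): candidates from g
        u = rng.uniform(size=batch)           # step (ii): U ~ Uniform(0, 1)
        target = unnorm_post(theta)
        envelope = M * g_pdf(theta)
        # Diagnostic: candidates where the "envelope" fails to cover the target
        n_violations += int(np.sum(target > envelope))
        keep = theta[u * envelope < target]   # step (iii): accept or reject
        accepted.append(keep)
        n_accepted += keep.size
        n_tried += batch                      # step (iv): repeat as needed
    sample = np.concatenate(accepted)[:N]
    return sample, n_accepted / n_tried, n_violations

# Hypothetical unnormalized posterior: standard normal kernel, c = sqrt(2*pi)
def unnorm_post(theta):
    return np.exp(-0.5 * theta ** 2)

# Envelope density g: standard Cauchy (heavier tails than the target)
def draw_cauchy(n, rng):
    return rng.standard_cauchy(n)

def cauchy_pdf(theta):
    return 1.0 / (np.pi * (1.0 + theta ** 2))

rng = np.random.default_rng(2)
M = 4.0
N = 20_000
sample, acc_rate, n_violations = rejection_sample(
    unnorm_post, draw_cauchy, cauchy_pdf, M, N, rng
)

expected_rate = math.sqrt(2.0 * math.pi) / M   # p = c / M
print("observed acceptance rate:", acc_rate, "theory:", expected_rate)
print("envelope violations:", n_violations)
print("sample mean and variance:", sample.mean(), sample.var())
```

Lowering `M` toward its minimal valid value raises the acceptance rate, while choosing `M` below that value would produce envelope violations and a sample from the distorted density (5.10) rather than from $h$.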

Weighted bootstrap

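This method was presented by Smith and Gelfand (1992), and is very similar to the sampling-importance resampling algorithm of Rubin (1988). Suppose an M appropriate for the rejection method is not readily available, but that we do have a sample $\theta_1, \ldots, \theta_N$ from some approximating density $g(\theta)$. Define

$$w_i = \frac{L(\theta_i)\pi(\theta_i)}{g(\theta_i)} \quad \text{and} \quad q_i = \frac{w_i}{\sum_{j=1}^{N} w_j}.$$

Now draw $\theta^*$ from the discrete distribution over $\{\theta_1, \ldots, \theta_N\}$ which places mass $q_i$ at $\theta_i$. Then $\theta^* \sim h(\theta)$ approximately, with the approximation improving as $N \to \infty$. This is a weighted bootstrap, since instead of resampling from the set $\{\theta_1, \ldots, \theta_N\}$ with equally likely probabilities of selection, we are resampling some points more often than others due to the unequal weighting.

To see that the method does perform as advertised, notice that for the standard bootstrap,

$$P(\theta^* \le a) = \frac{1}{N} \sum_{i=1}^{N} I_{(-\infty, a]}(\theta_i) \longrightarrow \int_{-\infty}^{a} g(\theta)\, d\theta \quad \text{as } N \to \infty,$$

so that $\theta^*$ is approximately distributed as $g(\theta)$. For the weighted bootstrap,

$$P(\theta^* \le a) = \sum_{i=1}^{N} q_i\, I_{(-\infty, a]}(\theta_i) = \frac{\frac{1}{N} \sum_{i=1}^{N} w_i\, I_{(-\infty, a]}(\theta_i)}{\frac{1}{N} \sum_{i=1}^{N} w_i} \longrightarrow \frac{\int_{-\infty}^{a} L(\theta)\pi(\theta)\, d\theta}{\int_{-\infty}^{\infty} L(\theta)\pi(\theta)\, d\theta} = \int_{-\infty}^{a} h(\theta)\, d\theta,$$

so that $\theta^*$ is now approximately distributed as $h(\theta)$, as desired.

Note that, similar to the previous two indirect sampling methods, we need $g \approx h$, or else a very large N will be required to obtain acceptable accuracy. In particular, the "tail problem" mentioned before is potentially even more harmful here, since if there are no candidate $\theta_i$ located in the tails of $h$, there is of course no way to resample them!

In any of the three methods discussed above, if the prior $\pi(\theta)$ is proper, it can play the role of $g(\theta)$, as shown in the following example.

Example 5.4 Suppose the prior $\pi(\theta)$ is proper, and the likelihood $L(\theta)$ is maximized at $\hat{\theta}$. Let $M = L(\hat{\theta})$ in the rejection method, and let $g(\theta) = \pi(\theta)$. Then $L(\theta)\pi(\theta) \le M g(\theta)$ for all $\theta$. So we simply generate $\theta_j \sim \pi(\theta)$ and U ~ Uniform(0, 1), and accept $\theta_j$ if $U < L(\theta_j)/L(\hat{\theta})$. Clearly this ratio is also the probability of accepting a $\theta_j$ candidate. Hence this approach will be quite inefficient unless $\pi(\theta)$ is not too flat relative to the likelihood $L(\theta)$, so that most of the candidates have a reasonable chance of being accepted. Unfortunately, this will not normally be the case, since in most applications the data will carry much more information about $\theta$ than the prior.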
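A minimal Python sketch of the weighted bootstrap, using the prior as the approximating density g as in Example 5.4, is given below. The model and numbers are assumptions chosen so that the exact posterior is available for comparison: five hypothetical observations from N($\theta$, 1) with a N(0, 4) prior, for which the posterior is normal with precision $n + 1/4$ and mean $n\bar{y}/(n + 1/4)$. The function name `weighted_bootstrap` and the sample sizes are likewise hypothetical.

```python
import numpy as np

def weighted_bootstrap(theta, log_w, m, rng):
    """Resample m points from theta with probabilities q_i proportional to
    the weights w_i = L(theta_i) * pi(theta_i) / g(theta_i), given as logs."""
    w = np.exp(log_w - np.max(log_w))   # stabilize before normalizing
    q = w / np.sum(w)                   # q_i = w_i / sum_j w_j
    return rng.choice(theta, size=m, replace=True, p=q)

# Hypothetical data: Y_i ~ N(theta, 1); prior theta ~ N(0, tau^2) with tau = 2
y = np.array([1.2, 0.4, 1.9, 0.8, 1.1])
tau = 2.0
n = y.size

rng = np.random.default_rng(3)
N = 50_000                              # number of candidates drawn from g
theta = rng.normal(0.0, tau, size=N)    # g = prior, so w_i = L(theta_i)

# Log-likelihood of each candidate, up to an additive constant
log_w = -0.5 * np.sum((y[:, None] - theta[None, :]) ** 2, axis=0)

m = 5_000                               # size of the resampled posterior sample
theta_star = weighted_bootstrap(theta, log_w, m, rng)

# Exact conjugate posterior for comparison
post_prec = n + 1.0 / tau ** 2
exact_mean = y.sum() / post_prec
exact_var = 1.0 / post_prec

print("resampled mean:", theta_star.mean(), "exact:", exact_mean)
print("resampled variance:", theta_star.var(), "exact:", exact_var)
```

Because the prior here is much flatter than the likelihood, only a fraction of the candidates carry appreciable weight. With more informative data the weights would concentrate on even fewer candidates, which is the inefficiency noted at the end of Example 5.4.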

5.4 Markov chain Monte Carlo methods

In document Bayes and Empirical Bayes Methods for Data Analysis - Carlin Louis (Page 143-151)