In the simple slice sampler, we take a uniform joint distribution

whose marginal density for x is clearly p(x). A Gibbs sampler for this distribution would require only two uniform updates:

Convenient generation of the latter variate (over the "slice" defined by {x : p(x) ≥ u}) thus requires that p(x) be invertible, either analytically or numerically. Once we have a sample of (x, u) values, we simply disregard the u samples, thus obtaining (after a sufficiently long burn-in period) a sample from the target distribution p(x) via marginalization.
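As an illustration of the two uniform updates, the following is a minimal sketch of the simple slice sampler, assuming a target for which the slice is analytically invertible. The target p(x) = exp(-x) on x > 0, the function name `simple_slice_sampler`, and all default settings are assumptions chosen for illustration; they do not come from the text. For this target the slice {x : p(x) ≥ u} is the interval (0, -log u).

```python
import numpy as np

def simple_slice_sampler(n_iter, x0=1.0, burn_in=1000, seed=0):
    """Simple slice sampler for the unnormalized target p(x) = exp(-x), x > 0.

    The target is a hypothetical choice; any density with an invertible
    slice {x : p(x) >= u} could be substituted.
    """
    rng = np.random.default_rng(seed)
    x = x0
    draws = np.empty(n_iter)
    for t in range(burn_in + n_iter):
        # u | x ~ Unif(0, p(x)); 1 - random() lies in (0, 1], so u > 0.
        u = np.exp(-x) * (1.0 - rng.random())
        # x | u ~ Unif over the slice {x : p(x) >= u} = (0, -log u).
        x = rng.uniform(0.0, -np.log(u))
        if t >= burn_in:
            draws[t - burn_in] = x
    return draws  # the u draws are simply discarded

draws = simple_slice_sampler(20000)
print(draws.mean(), draws.var())  # both should be near 1 for this target
```

No tuning constant or proposal density appears anywhere in the sketch; the only problem-specific step is the inversion of p(x) that yields the slice endpoints.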

More generally, if p(x)

in the usual Bayesian context of an unknown posterior p proportional to a prior

conditional distribution of u is Unif(0, L(x)). More generally we might factor p(x) as q(x)L'(x), where L'(x) is a positive function. This is what is generally referred to as the slice sampler.

Note that, like the Gibbs sampler, this algorithm does not require the user to specify any tuning constants or proposal densities, and yet, like the Metropolis-Hastings algorithm, it allows sampling in nonconjugate model settings. Indeed, Higdon (1998) shows that the Metropolis algorithm can be viewed as a special case of the slice sampler.

Finally, suppose variables

We can use a collection of aux-where we take to be Uni f (0,

where

L(x) is

a positive function (as and This leads to by alternately updating x and u via Gibbs sampling or some

(5.37)

times a likelihood L), then a convenient choice for the

condi-illiary

Clearly the marginal density for x is p(x), as desired. The full conditionals for the

restricted to the set

(1999) show that this product slice sampler can be used to advantage in numerous hierarchical and nonconjugate Bayesian settings where the readily invertible and the intersection set

thus easy to compute. However, Neal (1997) points out that introducing so many auxiliary variables (one for each term in the likelihood) may lead to slow convergence. This author recommends hybrid Monte Carlo methods that suppress the random walk behavior, including the overrelaxed version (Neal, 1998) described above. Mira (1998) also shows clever approaches to sample the variable of interest given the proposed auxiliary variable(s); rejection sampling from

Mira and Tierney (1997) and Roberts and Rosenthal (1999a) give mild and easily verified regularity conditions that ensure that the slice sampler is geometrically and often even uniformly ergodic. Mira and Tierney (1997) show that given any Metropolis-Hastings independence chain algorithm using some h(x) as proposal distribution, a "corresponding" slice sampler can be designed (by taking q(x) = h(x) in the p(x) = q(x)L'(x) factorization above) that produces estimates with uniformly smaller asymptotic variance (on a sweep-by-sweep basis), and that converges faster to stationarity.

Roberts and Rosenthal (1999a) also show that the slice sampler is stochastically monotone, which enables development of some useful, quantitative convergence bounds. All of this suggests that, when the slice sampler can be implemented, it has spectacularly good theoretical convergence properties.

Future work with slice sampling looks to its "perfect sampling" extension (Mira, and Roberts, 1999; see also our brief discussion of this area on page 176), as well as its application in challenging applied settings beyond those concerned primarily with image restoration, as in Hurn (1997) and Higdon (1998). An exciting recent development in this regard is the polar slice sampler (Roberts and Rosenthal, 1999b), which has excellent convergence properties even for high dimensional target densities.

Finally, we hasten to add that there are many closely related papers that explore the connection between expectation-maximization (EM) and data augmentation methods, borrowing recent developments in the former to obtain speedups in the latter. The general idea is the same as that for the auxiliary variable methods described above: namely, to expand the parameter space in order to achieve algorithms which are both suitable for general usage (i.e., not requiring much fine-tuning) and faster than

stan-for all i. Now the joint density is given by

(5.38)

are f (0, while the full conditional for x is simply Damien et al.

are is

may be used as a last resort.

5.4.5 Variance estimation

We now turn to the problem of obtaining estimated variances (equivalently, standard errors) for posterior means obtained from output. Our summary here is rather brief; the reader is referred to Ripley (1987, Chapter 6) for further details and alternative approaches.

Suppose that for a given parameter

Many of the hybrid algorithms presented in this subsection attempt to remedy slow convergence due to high correlations within the parameter space. Examples include MCMC algorithms which employ blocking (i.e., updating parameters in medium-dimensional groups, as in SMCMC) and collapsing (i.e., generating from partially marginalized distributions), as well as overrelaxation and auxiliary variable methods like the slice sampler.

But of course, the problem of estimation in the face of high correlations is not a new one: it has long been the bane of maximization algorithms applied to likelihood surfaces. Such algorithms often use a transformation designed to make the likelihood function more regular (e.g., in 2-space, a switch from a parametrization wherein the effective support of the likelihood function is oblong to one where the support is more nearly circular). (1991)

transformation applied within his univariate Metropolis algorithm is one such approach; see also Hills and Smith (1992) in this regard. Within the class of hierarchical linear models, Gelfand et al. (1995) discuss simple hierarchical centering reparametrizations that often enable improved algorithm convergence.

Accelerating the convergence of an MCMC algorithm to its stationary distribution is currently an extremely active area of research. Besides those presented here, other promising ideas include resampling and adaptive switching of the transition kernel (Gelfand and Sahu, 1994) and multi-chain annealing or tempering (Geyer and Thompson, 1995; Neal, 1996a).

Gilks and Roberts (1996) give an overview of these and other acceleration methods. In the remainder of this chapter, we continue with variance estimation and convergence investigation for the basic algorithms, returning to these more specialized ones in subsequent chapters as the need arises.

standard Metropolis or Gibbs approaches. For example, the "conditional augmentation" EM algorithm of Meng and Van Dyk (1997) and the "marginal augmentation" parameter expanded EM (PX-EM) algorithm of Liu et al. (1998) are EM approaches that served to motivate the later, specifically MCMC-oriented augmentation schemes of Meng and Van Dyk (1999) and Liu and Wu (1999), the latter of which the authors refer to as parameter expanded data augmentation (PX-DA).

Summary

we have a single long chain of

analogous to the estimator given in (5.6) for the case of iid sampling. Continuing this analogy, then, we could attempt to estimate

lowing the approach of (5.7). That is, we would simply use the sample variance,

MCMC samples

ary distribution of the Markov chain (we attack the convergence problem in the next subsection). The simplest estimate of

While this estimate is easy to compute, it would very likely be an underestimate due to positive autocorrelation in the MCMC samples. This problem could be ameliorated somewhat by combining the draws from a collection of initially overdispersed sampling chains. Alternatively, one could simply use the sample variance of the

parallel chains, as in Figure 5.4. But as already mentioned, this approach is horribly wasteful, discarding an appalling (N - 1)/N of the samples. A potentially cheaper but similar alternative would be to subsample a single chain, retaining only every

tained samples are approximately independent. However, MacEachern and Berliner (1994) give a simple proof using the Cauchy-Schwarz inequality that such systematic subsampling from a stationary Markov chain always increases the variance of sample mean estimators (though more recent work by MacEachern and Peruggia, 2000a, shows that the variance can decrease if the subsampling strategy is tied to the actual updates made). Thus, in the spirit of Fisher, it is better to retain all the samples and use a more sophisticated variance estimate, rather than systematically discard a large portion of them merely to achieve approximate independence.
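The cost of systematic subsampling can be checked exactly in a simple special case. The sketch below assumes a stationary AR(1) chain with lag-1 autocorrelation rho (an illustrative stand-in for MCMC output, not a model taken from the text), for which the variance of the sample mean has a closed form. Retaining every k-th draw of such a chain gives another AR(1) chain with autocorrelation rho^k and 1/k as many samples. The function name and the values of N, rho, and k are hypothetical.

```python
import numpy as np

def var_of_mean_ar1(n, rho, sigma2=1.0):
    """Exact variance of the mean of n consecutive draws from a stationary
    AR(1) chain with marginal variance sigma2 and lag-1 autocorrelation rho:
    (sigma2 / n^2) * [n + 2 * sum_{j=1}^{n-1} (n - j) * rho^j].
    """
    lags = np.arange(1, n)
    return sigma2 / n**2 * (n + 2.0 * np.sum((n - lags) * rho**lags))

N, rho, k = 1000, 0.9, 10                    # hypothetical run length, correlation, thinning interval
var_full = var_of_mean_ar1(N, rho)           # keep all N draws
var_thin = var_of_mean_ar1(N // k, rho**k)   # keep every k-th draw
print(var_full, var_thin)
```

In this setting the thinned estimator has the larger variance even though its retained draws are much less correlated, which is consistent with the MacEachern and Berliner result cited above.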

One such alternative uses the notion of effective sample size, or ESS (Kass et al., 1998, p. 99). ESS is defined as

where

We may estimate

MCMC chain, cutting off the summation when these drop below, say, 0.1 which for now we assume come from the station-is then given by

divided by N, obtaining

fol-iteration output from m independent

sample with k large enough that the

re-is the ^time for given by

is the autocorrelation at lag k for the parameter of interest using sample autocorrelations estimated from the

5.4.6 Convergence monitoring and diagnosis

We conclude our presentation on MCMC algorithms with a discussion of their convergence. Because their output is random and autocorrelated, even the definition of this concept is often misunderstood. When we say that an MCMC algorithm has converged at time T, we mean that its output can be safely thought of as coming from the true stationary distribution of the Markov chain for all t > T. This definition is not terribly precise since we have not defined "safely," but it suffices for practical implementation: no pre-convergence samples should be retained for subsequent inference.

Some authors refer to this pre-convergence time as the burn-in period.

To provide an idea as to why a properly specified MCMC algorithm may where

If the batching method is used with fewer than 30 batches, it is a good idea to replace

m - 1 degrees of freedom.

provided that k is large enough so that the correlation between batches is negligible, and m is large enough to reliably estimate

important to verify that the batch means are indeed roughly independent - say, by checking whether the lag 1 autocorrelation of the

0.1. If this is not the case, we must increase k (hence N, unless the current m is already quite large), and repeat the procedure.
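The batching procedure can be sketched as follows, assuming the usual batch means construction: a single run of length N = mk is divided into m successive batches of length k, and the variance of the overall mean is estimated by the sample variance of the m batch means divided by m. The AR(1) test chain, the function names, and the choice m = 50 are assumptions made for illustration only.

```python
import numpy as np

def batch_means(chain, m):
    """Batch means estimate of the variance of the sample mean.

    Returns the overall mean, its estimated variance, and the lag 1
    autocorrelation of the batch means (to be checked against about 0.1).
    """
    x = np.asarray(chain, dtype=float)
    k = x.size // m                              # batch length
    b = x[: m * k].reshape(m, k).mean(axis=1)    # the m batch means
    est = b.mean()
    var_hat = np.sum((b - est) ** 2) / (m * (m - 1))
    bc = b - est
    lag1 = np.dot(bc[:-1], bc[1:]) / np.dot(bc, bc)
    return est, var_hat, lag1

def ar1_chain(n, rho, seed=0):
    """Hypothetical autocorrelated test chain: stationary AR(1), mean 0."""
    rng = np.random.default_rng(seed)
    e = rng.standard_normal(n)
    x = np.empty(n)
    x[0] = e[0] / np.sqrt(1.0 - rho**2)          # start in the stationary distribution
    for t in range(1, n):
        x[t] = rho * x[t - 1] + e[t]
    return x

chain = ar1_chain(50000, 0.5)
est, var_hat, lag1 = batch_means(chain, m=50)
naive_var = chain.var(ddof=1) / chain.size       # iid-style estimate, ignores autocorrelation
half_width = 1.96 * np.sqrt(var_hat)             # normal quantile, since m >= 30 here
print(est, var_hat, naive_var, lag1, half_width)
```

With fewer than 30 batches, the 1.96 in the last step would be replaced by the corresponding t quantile on m - 1 degrees of freedom, as the text recommends.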

Regardless of which of the above estimates is used to approximate a 95% confidence interval for

Note that unless the that

we have fewer than N samples, we expect some inflation in the variance of our estimate.
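A minimal sketch of the ESS calculation follows, assuming the standard definitions: the autocorrelation time is kappa = 1 + 2 * (sum of the lag k autocorrelations), ESS = N / kappa, and the variance estimate for the posterior mean is the sample variance divided by ESS. The summation is cut off once the sample autocorrelation drops below 0.1 in magnitude, as suggested in the text. The function names and the AR(1) test chain are hypothetical.

```python
import numpy as np

def autocorr_time(chain, cutoff=0.1):
    """Estimate kappa = 1 + 2 * sum_k rho_k, truncating the sum at the
    first lag whose sample autocorrelation is below `cutoff` in magnitude."""
    x = np.asarray(chain, dtype=float)
    xc = x - x.mean()
    denom = np.dot(xc, xc)
    kappa = 1.0
    for k in range(1, x.size):
        rho_k = np.dot(xc[:-k], xc[k:]) / denom
        if abs(rho_k) < cutoff:
            break
        kappa += 2.0 * rho_k
    return kappa

def ess_variance(chain, cutoff=0.1):
    """Return (ESS, variance estimate for the sample mean)."""
    x = np.asarray(chain, dtype=float)
    ess = x.size / autocorr_time(x, cutoff)
    return ess, x.var(ddof=1) / ess

# Hypothetical autocorrelated test chain: stationary AR(1) with rho = 0.5.
rng = np.random.default_rng(0)
n, rho = 20000, 0.5
e = rng.standard_normal(n)
chain = np.empty(n)
chain[0] = e[0] / np.sqrt(1.0 - rho**2)
for t in range(1, n):
    chain[t] = rho * chain[t - 1] + e[t]

ess, var_ess = ess_variance(chain)
var_iid = chain.var(ddof=1) / n
print(ess, var_ess, var_iid)
```

For positively autocorrelated output the estimated kappa exceeds 1, so ESS falls below N and the ESS-based variance exceeds the naive iid-style estimate, in line with the inflation described above.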

A final and somewhat simpler (though also more naive) method of esti-mating

N into m successive batches of length k (i.e., N = mk), with batch means B_1, ..., B_m. Clearly

estimate

in magnitude. The variance estimate for

(5.39)

by the upper .025 point of a t distribution with

= 1.96, the upper .025 point of a standard normal distribution.

is then given by

is less than It is then have the variance is through batching. Divide our single long run of length

in concert with intuition. That is, since

> 1 and ESS(

are uncorrelated,

is then

which gives have that bution of and

ities of getting from the two X-states to the two Y-states, and vice versa.

In fact, if all we care about is the X-marginal

X sequence alone, since it is also a Markov chain with transition matrix and

and similarly for Y. The two complete conditional distributions are also easily written in matrix form, namely,

The true marginal distribution of X is then and

characterized completely by the 2 x 2 matrix

Table 5.5 2 x 2 multinomial distribution of X and Y.

we have gives

Repeating this process t times, we Writing as the distri-we could consider the can be thought of astransition matrices, giving the probabil-(5.42) (5.41) (5.40) be expected to converge, we consider the following simple bivariate two-compartment model, originally analyzed by Casella and George (1992).

Example 5.11 Suppose the joint distribution of the two variables X and Y is summarized by the 2 x 2 layout in Table 5.5. In this table, all

Notating the joint distribution as we note it can be

Overparametrization and identifiability

Ironically, the most common source of MCMC convergence difficulties is a result of the methodology's own power. The MCMC approach is so generally applicable and easy to use that the class of candidate models for a given dataset now appears limited only by the user's imagination. However, with this generality has come the temptation to fit models so large that their parameters are unidentified, or nearly so. To see why this translates into convergence failure, consider the problem of finding the posterior distribution of

and we adopt independent flat priors for both

the two parameters is identified by the data, so without proper priors for and

Unfortunately, a naive application of the Gibbs sampler in this setting would not reveal this problem. The complete conditionals for

are both readily available for sampling as normal distributions. And for any starting point, the sampler would remain reasonably stable, due to the ridge in the likelihood surface

Using a rigorous mathematical framework, many authors have attempted to establish conditions for convergence of various MCMC algorithms in broad classes of problems. For example, Roberts and Smith (1993) provide relatively simple conditions for the convergence of the Gibbs sampler and the Metropolis-Hastings algorithm. Roberts and Tweedie (1996) show the geometric ergodicity of a broad class of Metropolis-Hastings algorithms, which in turn provides a central limit theorem (i.e., asymptotic normality of suitably standardized ergodic sums of the output from such an algorithm); see also Meyn and Tweedie (1993, Chapter 17) in this regard. Results such as these require elements of advanced probability theory that are well beyond the scope of this book, and hence we do not discuss them further.

We do however discuss some of the common causes of convergence failure, and provide a brief review of several diagnostic tools used to make stopping decisions for MCMC algorithms.

(A check of this fact in our scenario is left as Exercise 19.) Hence using any starting distribution

being a better and better approximation as t grows larger.
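The convergence argument of Example 5.11 can be checked numerically. The sketch below assumes a 2 x 2 joint distribution stored as an array P with rows indexing X and columns indexing Y; the numerical entries and the row/column layout are hypothetical and need not match Table 5.5. It forms the two conditional transition matrices, multiplies them to obtain the transition matrix of the X sequence alone, and propagates a starting distribution forward.

```python
import numpy as np

# Hypothetical joint probabilities P[x, y] for X, Y in {0, 1}; entries sum to 1.
P = np.array([[0.1, 0.2],
              [0.3, 0.4]])

f_x = P.sum(axis=1)                     # true marginal distribution of X
f_y = P.sum(axis=0)                     # true marginal distribution of Y

A_y_given_x = P / f_x[:, None]          # rows index x, columns index y
A_x_given_y = (P / f_y[None, :]).T      # rows index y, columns index x
A_x_given_x = A_y_given_x @ A_x_given_y # one X -> Y -> X sweep

f_0 = np.array([1.0, 0.0])              # arbitrary starting distribution for X
for t in (1, 2, 5, 20):
    f_t = f_0 @ np.linalg.matrix_power(A_x_given_x, t)
    print(t, f_t)

print("true marginal:", f_x)
```

Because every entry of the X-to-X transition matrix is positive here, the iterates approach the true marginal of X from any starting distribution, and the marginal itself is left unchanged by one further transition.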

Provided that every entry in

stochastic processes assures that for any which satisfies

(5.43)

is positive, a fundamental theorem from converges to a vector

does converge to as with

and Only the sum of their marginal posterior distributions will be improper as well.

and

The inexperienced where the likelihood is defined by

(5.44)

For the underidentified Gaussian linear model

less than full column rank, Gelfand and Sahu (1999) provide a surprising MCMC convergence result. They show that under a flat prior on the Gibbs sampler for the full parameter vector

samples from the identified subset of parameters (say, exact sample from their (unique) posterior density

sampler will produce identically distributed draws from the true posterior for

and Trevisam (2000) consider a broad class of Gaussian models with covariates, namely,

with

0 as the prior variance component for

has converged. This then permits independent sampling from the posterior distributions of estimable parameters. Exact sampling for these parameters is also possible in this case provided that all of the

components go to infinity. Simulation work by these authors suggests these results still hold under priors (rather than fixed values) for the variance components that are imprecise but with large means. For more on overparametrization and posterior impropriety, see Natarajan and McCulloch (1995), Hobert and Casella (1996), and Natarajan and Kass (2000). Carlin and Louis (2000) point out that these dangers might motivate a return to EB analysis in such cases; cf. Ten Have and Localio (1999) in this regard.

user might be tempted to use a smoothed histogram of the obtained as a (proper) estimate of the (improper) posterior density

An experienced analyst might well argue that Gibbs samplers like the one described in the preceding paragraph are perfectly legitimate, provided that their samples are used only to summarize the posterior distributions of identifiable functions of the parameters (in this case,

the deficiency in model (5.44) is immediately apparent, in more complicated settings (e.g., hierarchical and random effects models) failures in identifiability can be very subtle. Moreover, models that are overparametrized (either deliberately or accidentally) typically lead to high posterior correlations among the parameters (crosscorrelations) which will dramatically retard the movement of the Gibbs sampler through the parameter space.

Even in models which are identified, but "just barely" so (e.g., model (5.44) with a vague but proper prior for

associated high autocorrelations in the realized sample chains) can lead to excruciatingly slow convergence. MCMC algorithms defined on such spaces are thus appropriate only when the model permits a firm understanding of which parametric functions are well-identified and which are not.
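The behavior described above can be reproduced in a few lines. The sketch below assumes that the likelihood in (5.44) is Y_i ~ N(theta1 + theta2, 1), i = 1, ..., n, with independent flat priors on theta1 and theta2; under that assumption each complete conditional is normal with variance 1/n, for example theta1 given theta2 is N(ybar - theta2, 1/n). The simulated data, the function name, and the run length are hypothetical.

```python
import numpy as np

def gibbs_unidentified(y, n_iter=5000, seed=0):
    """Gibbs sampler for Y_i ~ N(theta1 + theta2, 1) under flat priors
    (assumed form of the model; only theta1 + theta2 is identified)."""
    rng = np.random.default_rng(seed)
    n, ybar = y.size, y.mean()
    sd = 1.0 / np.sqrt(n)
    theta1, theta2 = 0.0, 0.0
    out = np.empty((n_iter, 2))
    for t in range(n_iter):
        theta1 = rng.normal(ybar - theta2, sd)   # theta1 | theta2, y
        theta2 = rng.normal(ybar - theta1, sd)   # theta2 | theta1, y
        out[t] = theta1, theta2
    return out

rng = np.random.default_rng(1)
y = rng.normal(3.0, 1.0, size=20)                # hypothetical data
draws = gibbs_unidentified(y)
total = draws.sum(axis=1)                        # the identified function theta1 + theta2

print("ybar:", y.mean())
print("mean, variance of theta1 + theta2:", total.mean(), total.var())
print("variance of theta1 alone:", draws[:, 0].var())
```

Every individual update is a well-behaved normal draw, so nothing in the output signals trouble. Yet the draws of theta1 alone wander along the ridge like a random walk, while the draws of theta1 + theta2 settle immediately around ybar with variance close to 1/n.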

It then turns out approaches

goes to infinity once the chain

prior variance with X

form an That is, such a and convergence is immediate. In subsequent work, Gelfand, Car-), such high crosscorrelations (and the But while samples

is divergent, but the

Proving versus diagnosing convergence

We have already mentioned how a transition matrix (or kernel, for a continuous parameter space) can be analyzed to conclude whether or not the corresponding MCMC algorithm will converge. But in order to apply the method in actual practice, we need to know not only if the algorithm will converge, but when. This is a much more difficult problem. Theoretical results often provide convergence rates which can assist in selecting between competing algorithms (e.g., linear versus quadratic convergence), but since the rates are typically available only up to an arbitrary constant, they are of little use in deciding when to stop a given algorithm.

Recently, however, some authors have made progress in obtaining bounds on the number of iterations T needed to guarantee that the distribution being sampled at that time is in some sense within a specified tolerance of the true stationary distribution. For example, Polson (1996) develops polynomial time convergence bounds for a discrete jump Metropolis algorithm operating on a log-concave target distribution in a discretized parameter space.

Rosenthal (1993, instead uses Markov minorization conditions, providing bounds in continuous settings involving finite sample spaces and certain hierarchical models. Approaches like these hold great promise, but typically involve sophisticated mathematics in sometimes laborious derivations. Cowles and Rosenthal (1998) ameliorate this problem somewhat by showing how auxiliary simulations may often be used to verify the necessary conditions numerically, and at the same time provide specific values for use in the bound calculations. A second difficulty is that the bounds obtained in many of the examples analyzed to date in this way are fairly loose, suggesting numbers of iterations that are several orders of magnitude beyond what would be reasonable or even feasible in practice (though see Rosenthal, 1996, for some tight bounds in model (3.9), the two-stage normal-normal compound sampling model).

A closely related area that is the subject of intense recent research is exact or perfect sampling. This refers to MCMC simulation methods that can guarantee that a sample drawn at a given time will be exactly distributed according to the chain's stationary distribution. The most popular of these is the "coupling from the past" algorithm, initially outlined for discrete state spaces by Propp and Wilson (1996). Here, a collection of chains from

initial states are run at a sequence of starting times going backward into the past. When we go far enough back that all the chains have

"coupled" by time 0, this sample is guaranteed to be an exact draw from the target distribution. Green and Murdoch (1999) extend this approach to continuous state spaces. The idea has obvious appeal, since it seems to eliminate the convergence problem altogether! But to date such algorithms have been made practical only for relatively small problems within fairly well-defined model classes; their extension to high-dimensional models (in

is in some sense within

For example, Polson (1996) develops polynomial of the true

Of course, since the stationary distribution will always be unknown to us in practice, this same basic difficulty will plague any convergence diagnostic. Indeed, this is what leads many theoreticians to conclude that all such diagnostics are fundamentally unsound. In the context of the previous example, we would like to ensure that the distance between the true distribution and our estimate of it is small, i.e.

unfortunately, by the triangle inequality we have

The first term in the sum, sometimes referred to as the "mean squared error component" or "Monte Carlo noise," becomes small as m

Example 5.12 Recall the rat population growth model from Example 5.6.

Gelfand et al. (1990) used m = 50 parallel replications, and judged sampler

In document Bayes and Empirical Bayes Methods for Data Analysis - Carlin Louis (Page 182-200)