
CHAPTER 5. BAYESIAN LINEARLY CONSTRAINED GAUSSIAN

5.2 Bayesian Linearly Constrained GMRM

5.2.1 MCMC Estimation

Regardless of the choice of prior distributions in model (5.1), the joint posterior distribution π(p, ψ|z) is analytically intractable because the data distribution is a weighted sum of the component normal distributions. Therefore, we utilize Markov chain Monte Carlo (MCMC) to sample from the joint posterior. Since mixture models present a few unique challenges in MCMC estimation, we give a short review of the relevant literature here.
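To make the source of the intractability concrete, the posterior can be written out in a generic form. The notation below is an assumption made for illustration only: φ(· | ψ_k) stands for the k-th normal component density with parameters ψ_k, and the exact parameterization is the one given in model (5.1).

```latex
\pi(p, \psi \mid z) \;\propto\; \pi(p, \psi) \prod_{i=1}^{n} \sum_{k=1}^{K} p_k \, \phi(z_i \mid \psi_k)
```

Expanding the product of n sums yields K^n terms, one for each possible assignment of the n observations to the K components, so the normalizing constant and the marginal posteriors cannot be evaluated in closed form for realistic n.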

Gibbs sampling (Geman and Geman, 1984; Gelfand and Smith, 1990) is the most commonly used MCMC sampling scheme for Bayesian mixture models. As with the expectation-maximization algorithm (Dempster et al., 1977), the data, zi, i = 1, . . . , n, are augmented by latent variables Ci ∼ Categorical(p1, . . . , pK), i = 1, . . . , n, such that the Ci correspond to the unobserved component labels (Diebolt and Robert, 1994). The introduction of the latent component labels (a process often referred to as data completion or data augmentation) can greatly simplify Gibbs sampling schemes: in simple models, most or all of the full conditional distributions are then of closed form. However, MCMC estimation of mixture models is additionally complicated by a lack of identifiability in the marginal posterior distributions of the component parameters, due to a phenomenon referred to as label switching. The problem occurs when exchangeable priors are placed on the K sets of component parameters of the mixture model. The resulting posterior distribution is invariant to permutation of the labels (k = 1, . . . , K), and thus contains K! symmetric modes, one for each permutation of the component labels (Marin et al., 2005). The marginal posterior distributions of the parameters are therefore identical across components, and a well-mixing MCMC sampler will jump between the K! posterior modes, so that posterior summaries of component parameters computed from the MCMC samples are meaningless. The likelihood and posterior predictive distributions are unaffected by label switching, and therefore this problem is inconsequential for applications whose main goal is density estimation. For applications focused on clustering and inference about component parameters, however, the label-switching problem generally needs to be resolved. We revisit the label-switching problem in Section 5.3.4, and show how the problem can be “by-passed” for inference about Li growth and background trend.
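The data-augmented Gibbs scheme and the permutation invariance described above can be sketched for a toy case. This is a minimal illustration, not the sampler developed in this chapter: it assumes a univariate Gaussian mixture with a known common standard deviation `sigma`, a N(0, `prior_sd`²) prior on each mean, and a symmetric Dirichlet(`alpha`) prior on the weights. The function names and all default values are hypothetical.

```python
import numpy as np

def mixture_loglik(z, p, mu, sigma=1.0):
    """Log-likelihood of a univariate Gaussian mixture (known common sigma)."""
    dens = p * np.exp(-0.5 * ((z[:, None] - mu) / sigma) ** 2)
    dens /= sigma * np.sqrt(2.0 * np.pi)
    return np.log(dens.sum(axis=1)).sum()

def gibbs_mixture(z, K, n_iter=500, sigma=1.0, prior_sd=10.0, alpha=1.0, seed=0):
    """Toy data-augmented Gibbs sampler; sigma is assumed known for simplicity."""
    rng = np.random.default_rng(seed)
    n = z.size
    mu = rng.choice(z, size=K, replace=False)  # initialise means at data points
    p = np.full(K, 1.0 / K)
    mu_draws = np.empty((n_iter, K))
    p_draws = np.empty((n_iter, K))
    for t in range(n_iter):
        # Step 1: labels C_i | p, mu, z_i from their categorical full conditional.
        log_w = np.log(p) - 0.5 * ((z[:, None] - mu[None, :]) / sigma) ** 2
        log_w -= log_w.max(axis=1, keepdims=True)  # stabilise before exponentiating
        w = np.exp(log_w)
        w /= w.sum(axis=1, keepdims=True)
        u = rng.random(n)
        C = np.minimum((w.cumsum(axis=1) < u[:, None]).sum(axis=1), K - 1)
        counts = np.bincount(C, minlength=K)
        # Step 2: weights p | C ~ Dirichlet(alpha + counts).
        p = rng.dirichlet(alpha + counts)
        # Step 3: means mu_k | C, z from the conjugate normal full conditional.
        sums = np.bincount(C, weights=z, minlength=K)
        post_var = 1.0 / (counts / sigma**2 + 1.0 / prior_sd**2)
        post_mean = post_var * sums / sigma**2
        mu = rng.normal(post_mean, np.sqrt(post_var))
        mu_draws[t], p_draws[t] = mu, p
    return mu_draws, p_draws

# Simulated example: two well-separated components centred at -3 and 3.
data_rng = np.random.default_rng(1)
z = np.concatenate([data_rng.normal(-3.0, 1.0, 100), data_rng.normal(3.0, 1.0, 100)])
mu_draws, p_draws = gibbs_mixture(z, K=2)

# Sorting each draw is a crude relabelling that is adequate only in this toy case.
sorted_means = np.sort(mu_draws[100:], axis=1).mean(axis=0)

# Relabelling the components leaves the likelihood unchanged.
ll = mixture_loglik(z, p_draws[-1], mu_draws[-1])
ll_permuted = mixture_loglik(z, p_draws[-1][::-1], mu_draws[-1][::-1])
```

Because the two simulated components are well separated, this sampler is unlikely to switch labels within a run, which is exactly the situation in which per-component summaries look sensible even though the posterior has K! symmetric modes.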

Assessment of the convergence properties of MCMC samplers for mixture models is related to the label-switching problem. Several authors have suggested that the presence of label switching is necessary to ensure that the sampler mixes well, in the sense that it visits all regions of the posterior with non-zero probability (e.g. Celeux et al., 2000; Frühwirth-Schnatter, 2001; Jasra et al., 2005; Papastamoulis and Iliopoulos, 2010).

While usually straightforward to implement for mixtures, Gibbs sampling schemes utilizing data augmentation tend to get stuck in local modes with high posterior probability and miss regions of lower posterior probability, especially if the number of observations, n, is large (Celeux et al., 2000). When the symmetric modes are well separated, label switching may not occur at all and, while the samples may be statistically useful for inference, assessment of whether convergence has been reached is not possible (Marin et al., 2005). One could argue that the lack of label switching is a less serious issue when the posterior is multimodal only in the K! symmetric modes corresponding to unidentifiable component labels, as the symmetric modes contain only redundant information. The potential “stickiness” of a Gibbs sampler is more problematic in posteriors that exhibit genuine multimodality (i.e. modes beyond the symmetric modes), because a Gibbs sampler which does not exhibit label switching may not mix well enough to explore all of the genuine modes. Genuine multimodality can be common in mixture models, particularly when components are not well separated (i.e. overlap significantly) and multiple configurations of the mixture model result in very similar likelihoods.

Alternative MCMC schemes to the data-augmented Gibbs sampler have been introduced for mixture models, including simulation of the model without completion using the Metropolis-Hastings (M-H) algorithm (Marin et al., 2005). A downside of the M-H algorithm is that specification of appropriate proposal distributions may be very difficult for complex models. Trans-dimensional samplers such as reversible jump MCMC (Richardson and Green, 1997), and birth-and-death and continuous time samplers (Stephens, 2000a; Cappé et al., 2003), treat the number of components, K, as an additional parameter in the model with a prior distribution, and jump between models with different numbers of components within a sampling run. These samplers are particularly advantageous when K is unknown and estimation of the true number of components is of interest. Richardson and Green (1997) also found that mixing between symmetric modes often improved with trans-dimensional samplers, even if the sampler utilizes data completion, because the sampler is able to move out of local modes by jumping between models of varying K. Trans-dimensional samplers are, however, computationally expensive and inefficient when a single value of K is of interest, as the sampler will jump out of the model of interest throughout a sampling run. Other methods such as Hamiltonian Monte Carlo (HMC) and its extensions could also be used (Neal, 2010). However, using HMC in the statistical software Stan (Carpenter et al., 2017), we were able to sample successfully only from very simple Gaussian mixture models with well-separated components, and failed to sample from models with largely overlapping components.
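As an illustration of sampling without completion, mentioned at the start of the preceding paragraph, the following is a minimal random-walk Metropolis sketch that works directly with the mixture likelihood and introduces no latent labels. It is a toy under stated assumptions, not the scheme of Marin et al. (2005): the weights `p` and the common standard deviation `sigma` are held fixed, each mean has a N(0, `prior_sd`²) prior, and the step size `step` is hand-picked. All names and values are hypothetical.

```python
import numpy as np

def log_post(mu, z, p, sigma=1.0, prior_sd=10.0):
    """Unnormalised log-posterior of the means, with labels marginalised out."""
    dens = p * np.exp(-0.5 * ((z[:, None] - mu) / sigma) ** 2)
    log_lik = np.log(dens.sum(axis=1)).sum()
    log_prior = -0.5 * np.sum((mu / prior_sd) ** 2)
    return log_lik + log_prior

def rw_metropolis(z, mu0, p, n_iter=2000, step=0.2, seed=0):
    """Random-walk Metropolis on the component means (weights and sigma fixed)."""
    rng = np.random.default_rng(seed)
    mu = np.asarray(mu0, dtype=float)
    lp = log_post(mu, z, p)
    draws = np.empty((n_iter, mu.size))
    accepted = 0
    for t in range(n_iter):
        # Symmetric Gaussian proposal, so the Hastings ratio reduces to a posterior ratio.
        prop = mu + step * rng.normal(size=mu.size)
        lp_prop = log_post(prop, z, p)
        if np.log(rng.random()) < lp_prop - lp:
            mu, lp = prop, lp_prop
            accepted += 1
        draws[t] = mu
    return draws, accepted / n_iter

# Simulated example: two well-separated components centred at -3 and 3.
data_rng = np.random.default_rng(1)
z = np.concatenate([data_rng.normal(-3.0, 1.0, 100), data_rng.normal(3.0, 1.0, 100)])
draws, acc_rate = rw_metropolis(z, mu0=[-1.0, 1.0], p=np.array([0.5, 0.5]))
post_means = draws[500:].mean(axis=0)  # discard the first 500 draws as burn-in
```

Even in this two-parameter toy the step size has to be tuned by hand to obtain a reasonable acceptance rate, which hints at why choosing proposal distributions becomes difficult for more complex models.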

Acknowledging the challenges in MCMC estimation for mixture models discussed above, we focus on developing an MCMC scheme that samples well from the modes with highest posterior probability, for the purpose of clustering and obtaining Li labelings in (S)TEM images. This is accomplished by utilizing a data-augmented Gibbs sampling scheme with an over-fitted LC-GMRM. The extra components of an over-fitted model allow for better exploration of the parameter space when searching for posterior modes with high probability. The extra components are accounted for by specifying a sparse Dirichlet prior on the mixture weights, which encourages any redundant component to empty in the posterior; this is discussed in more detail in Section 5.3.3. We acknowledge that under the proposed Gibbs sampler label switching may not always occur, and that a true assessment of whether convergence has been attained may not be feasible. Simulations shown in Section 5.4, however, suggest that the sampler performs well at finding modes corresponding to reasonable clusterings and Li labelings, and that the sampler is often able to sample from minor modes (genuine multimodality).
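The effect of a sparse Dirichlet prior on redundant components can be illustrated with the standard conjugate update for the weights given the allocations, p | C ∼ Dirichlet(α + n1, . . . , α + nK), where nk is the number of observations allocated to component k. The sketch below is illustrative only: the allocation counts and the concentration values α = 0.01 and α = 1 are assumptions chosen for the comparison, not the hyperparameters used in Section 5.3.3, and `weight_draws` is a hypothetical helper.

```python
import numpy as np

def weight_draws(counts, alpha, n_draws=2000, seed=0):
    """Draw mixture weights from Dirichlet(alpha + counts), the full
    conditional of the weights given the component allocations."""
    rng = np.random.default_rng(seed)
    concentration = alpha + np.asarray(counts, dtype=float)
    return rng.dirichlet(concentration, size=n_draws)

# Hypothetical over-fitted model: K = 4 components, two of them currently empty.
counts = [120, 80, 0, 0]

sparse = weight_draws(counts, alpha=0.01)  # sparse prior (assumed value)
flat = weight_draws(counts, alpha=1.0)     # uniform prior on the simplex

# Average weight assigned to the two empty components under each prior.
empty_weight_sparse = sparse[:, 2:].mean()
empty_weight_flat = flat[:, 2:].mean()
```

Under the sparse prior the empty components receive weights very close to zero, so they are rarely repopulated in the next label update, whereas the flat prior leaves them with a small but non-negligible share of the weight.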