
3.2 Model Fitting

3.2.3 The Number of Components

The number K of components in a finite mixture model has so far been treated as fixed. Both the EM algorithm and standard MCMC algorithms (as discussed in Sections 3.2.1 and 3.2.2) in principle cannot handle a varying dimensionality, i.e., the treatment of K as a random variable. If K is not known, a model selection problem arises. In this situation, it is necessary to choose between models $M_1, \dots, M_{K_{\max}}$, where $M_K$ denotes a model with K components. There exist a number of model selection criteria intended to guide the choice of K that have the general structure

\[
-2 \log p(y \mid \hat{\theta}, \hat{\pi}^*, M_K) + C \cdot p_{M_K}
\]

and are minimized to identify the optimal number K. Here, $\hat{\theta}$ and $\hat{\pi}^*$ are the estimates given by the model $M_K$ (i.e., either maximum likelihood estimates or posterior means), $p_{M_K}$ is a measure of the complexity of $M_K$, and C is a penalty parameter. Choosing $p_{M_K}$ as the number of parameters in the model, together with C = 2 or C equal to the logarithm of the number of observations, leads to the Akaike information criterion (AIC; Akaike, 1974) and the Bayesian information criterion (BIC; Schwarz, 1978), respectively. Choosing C = 2 and

\[
p_{M_K} = E_{\theta, \pi^* \mid y}\left[ -2 \log p(y \mid \theta, \pi^*, M_K) \right] + 2 \log p(y \mid \hat{\theta}, \hat{\pi}^*, M_K) \tag{3.13}
\]

leads to the deviance information criterion (DIC; Spiegelhalter et al., 2002), which can be interpreted as a Bayesian analog to the AIC.
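As a minimal sketch of the general structure above, the following Python snippet evaluates the AIC and BIC for a set of candidate models. The log-likelihood values, the sample size, and the helper names are hypothetical and chosen only for illustration; the parameter count assumes univariate Gaussian components (mean and variance) plus K - 1 free mixture weights, which is an assumption and not taken from the text.

```python
import math


def information_criterion(log_lik, complexity, penalty):
    """Generic structure: -2 * log-likelihood + C * complexity."""
    return -2.0 * log_lik + penalty * complexity


def n_free_parameters(k, params_per_component=2):
    """Assumed count: component parameters plus K - 1 free weights."""
    return k * params_per_component + (k - 1)


def aic(log_lik, n_params):
    # C = 2
    return information_criterion(log_lik, n_params, 2.0)


def bic(log_lik, n_params, n_obs):
    # C = log(number of observations)
    return information_criterion(log_lik, n_params, math.log(n_obs))


# Made-up maximized log-likelihoods for K = 1, 2, 3 (illustration only).
example_log_liks = {1: -520.0, 2: -480.0, 3: -476.0}
n_obs = 200

aic_scores = {k: aic(ll, n_free_parameters(k))
              for k, ll in example_log_liks.items()}
bic_scores = {k: bic(ll, n_free_parameters(k), n_obs)
              for k, ll in example_log_liks.items()}

# Both criteria are minimized over K.
best_k_aic = min(aic_scores, key=aic_scores.get)
best_k_bic = min(bic_scores, key=bic_scores.get)
```

With these made-up numbers the heavier BIC penalty favors a smaller K than the AIC does, which is one way the two criteria can disagree.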

Although the generic nonidentifiability of mixture models (as discussed in Section 3.1.3) causes problems with the regularity conditions required for the asymptotic justification of these criteria, they are frequently used to evaluate such models. In the opinion of several authors, both AIC and DIC tend to select overfitted models, i.e., models in which K is too large, and corrections or alternatives to tackle this have been proposed for both criteria (see, e.g., Hurvich and Tsai, 1989; Ando, 2007).

In practice, the DIC is frequently employed in Bayesian frameworks, while either the AIC or the BIC is mostly used in frequentist settings. Preferences regarding AIC and BIC differ between authors. While some authors prefer the BIC because it is less prone to overestimating K (Fraley and Raftery, 2002), others favor the AIC for theoretical reasons, e.g., its derivation from information-theoretic principles (see Burnham and Anderson, 2002). It has also been proposed to use the BIC to approximate Bayes factors and then employ these factors as the criterion (Dasgupta and Raftery, 1998).

For Bayesian models fit via MCMC methods, such as the models developed in this thesis, the DIC may be calculated from quantities that can be recorded as part of the sampling process, while the calculation of the AIC and BIC requires a separate maximization of the likelihood. In principle, this would favor the use of the DIC to evaluate the methods presented in this thesis. However, the aspects underlying the evaluation of these methods are not fully captured by the measures based on (3.13). Specifically, the method GAMMICS presented in Chapter 4 is partly algorithmic, i.e., parts of the estimation carried out by the method are not represented by the likelihood. The models discussed in Chapter 5, on the other hand, are evaluated considerably more on classification and interpretability than on fit. Altogether, the applicability of measures based on (3.13) is therefore limited for these methods.
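To illustrate how the DIC can be obtained from quantities recorded during sampling, the following sketch computes the complexity term of (3.13) and the resulting DIC from a list of deviances. The function name and the numbers are hypothetical; the sketch assumes that the deviance -2 log p(y | θ, π∗, M_K) was stored for every retained MCMC draw and that the deviance at the point estimate (e.g., the posterior mean) is available.

```python
def dic_from_draws(deviance_draws, deviance_at_estimate):
    """Return (p_D, DIC) from deviances recorded during sampling.

    deviance_draws: -2 * log-likelihood at each retained MCMC draw.
    deviance_at_estimate: -2 * log-likelihood at the point estimate.
    """
    if not deviance_draws:
        raise ValueError("at least one recorded deviance is required")
    mean_deviance = sum(deviance_draws) / len(deviance_draws)
    # Complexity as in (3.13): posterior mean deviance minus the
    # deviance at the point estimate.
    p_d = mean_deviance - deviance_at_estimate
    # General structure with C = 2: -2 log p(y | estimate) + 2 * p_D.
    dic = deviance_at_estimate + 2.0 * p_d
    return p_d, dic


# Made-up deviances for five draws (illustration only).
example_draws = [1012.0, 1008.5, 1010.2, 1011.3, 1009.0]
p_d, dic = dic_from_draws(example_draws, 1006.0)
```

In a real application the list would contain one entry per retained draw, so no additional optimization is needed beyond the sampler itself.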

No matter which of the discussed options is pursued, the model needs to be fitted several times for different values of K before the resulting models are compared based on the criterion. In a Bayesian context, there exist further options for dealing with an unknown K. If explicit inference on K is required, it is possible to treat K as random and estimate it within the model. In this case, different dimensions of the parameter space have to be considered, and an MCMC algorithm that can deal with θ and π∗ of varying dimension has to be employed. The reversible jump algorithm introduced by Green (1995) fulfills these requirements. Applied to mixture models, it contains moves for adding or removing empty components, as well as for splitting a component in two or merging two separate components. Because every move is paired with its reverse, each step of the algorithm is reversible (hence its name), which is a crucial condition for guaranteeing that the Markov chain converges to the desired posterior distribution.
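The repeated-fitting strategy mentioned at the start of the preceding paragraph (fit the model for each candidate K, then compare by a criterion) can be sketched as follows. This is a toy illustration under stated assumptions, not the procedure used in the thesis: it fits univariate Gaussian mixtures by a bare-bones EM with a deterministic quantile-based start and a variance floor, and compares the fits by BIC. The data, function names, and tuning constants are all hypothetical.

```python
import math


def normal_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def fit_mixture_em(data, k, n_iter=200, var_floor=1e-3):
    """Fit a K-component univariate Gaussian mixture; return its log-likelihood."""
    n = len(data)
    ordered = sorted(data)
    # Deterministic start: means at evenly spaced quantiles, common variance.
    means = [ordered[int((j + 0.5) * n / k)] for j in range(k)]
    centre = sum(data) / n
    start_var = max(sum((x - centre) ** 2 for x in data) / n, var_floor)
    variances = [start_var] * k
    weights = [1.0 / k] * k

    def mixture_density(x):
        return sum(weights[j] * normal_pdf(x, means[j], variances[j])
                   for j in range(k))

    for _ in range(n_iter):
        # E-step: responsibilities of each component for each observation.
        resp = []
        for x in data:
            dens = [weights[j] * normal_pdf(x, means[j], variances[j])
                    for j in range(k)]
            total = sum(dens) or 1e-300  # guard against underflow
            resp.append([d / total for d in dens])
        # M-step: update weights, means, and variances.
        for j in range(k):
            nj = max(sum(r[j] for r in resp), 1e-12)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2
                      for r, x in zip(resp, data)) / nj
            variances[j] = max(var, var_floor)

    return sum(math.log(max(mixture_density(x), 1e-300)) for x in data)


def bic(log_lik, k, n_obs):
    n_params = 3 * k - 1  # K means, K variances, K - 1 free weights
    return -2.0 * log_lik + math.log(n_obs) * n_params


# Toy data: two well-separated groups on a regular grid (made up).
toy_data = ([-3.0 + 0.1 * i for i in range(-10, 11)]
            + [3.0 + 0.1 * i for i in range(-10, 11)])

bic_by_k = {k: bic(fit_mixture_em(toy_data, k), k, len(toy_data))
            for k in range(1, 5)}
selected_k = min(bic_by_k, key=bic_by_k.get)
```

The loop over K is the essential point: each candidate requires a complete fit, which is what makes this strategy costly for MCMC-based models. A reversible jump sampler avoids the loop but is considerably more involved and is not attempted here.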

It is also possible to define the number of components via a formal decision-theoretic approach. For instance, one might specify a loss function that reflects the tradeoff between model complexity and the precision in solving the specific inferential problem, e.g., an estimation task (see, e.g., Quintana and Iglesias, 2003; Lau and Green, 2007).

If no explicit inference on K is required, the number of components can be chosen as an upper bound that considerably exceeds any probable value of K. The model will then implicitly estimate K by leaving all unnecessary components empty, where the number of non-empty components is influenced by the mixture prior. Such a model (e.g., employing a truncated Dirichlet process or a finite-dimensional Dirichlet prior, as mentioned in Section 3.1.2) has already been applied to cluster data originating from both biological images (Ji et al., 2009), similar to the data considered in Chapter 4, and omics measurements (Kirk et al., 2012), similar to the data considered in Chapter 5.
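How the mixture prior influences the number of non-empty components can be illustrated with a small prior simulation. The sketch below assumes a symmetric Dirichlet prior on the weights and draws allocations for a fixed number of observations; it looks at the prior only and involves no data or likelihood. The function names and the values of K, the concentration parameter, and the sample size are hypothetical.

```python
import random


def sample_dirichlet(k, alpha, rng):
    """Draw weights from a symmetric Dirichlet via normalized gamma variates."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def count_nonempty(k, alpha, n_obs, rng):
    """Allocate n_obs observations to components; count occupied components."""
    weights = sample_dirichlet(k, alpha, rng)
    labels = rng.choices(range(k), weights=weights, k=n_obs)
    return len(set(labels))


def mean_nonempty(k, alpha, n_obs, n_rep, seed=0):
    """Average number of occupied components over repeated prior draws."""
    rng = random.Random(seed)
    counts = [count_nonempty(k, alpha, n_obs, rng) for _ in range(n_rep)]
    return sum(counts) / n_rep


# Upper bound K = 20, 100 observations, 200 repetitions (illustrative values).
sparse_prior = mean_nonempty(k=20, alpha=0.1, n_obs=100, n_rep=200)
diffuse_prior = mean_nonempty(k=20, alpha=5.0, n_obs=100, n_rep=200)
```

A small concentration parameter puts most prior mass on a few components, so on average far fewer of the 20 components are occupied than under the larger value.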

Of course, nonparametric models for infinite mixtures may be preferred from the start, providing more flexibility in terms of fit. Although the number of components is potentially infinite in this case, many of them will typically have weights near zero, so that the number of relevant components remains limited in practice. In the simplest case, the Dirichlet distribution is then replaced by the Dirichlet process. For an assessment of the influence of different priors on the number of components, see Ishwaran and Zarepour (2000).
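The observation that most weights are close to zero can be made concrete with the stick-breaking representation of the Dirichlet process, a standard construction that is not spelled out in the text above. The sketch below draws a truncated set of weights; the concentration parameter, the truncation level, the 0.01 threshold, and the function name are assumptions chosen for illustration.

```python
import random


def stick_breaking_weights(alpha, truncation, rng):
    """Draw the first `truncation` stick-breaking weights.

    Each weight is a Beta(1, alpha) fraction of the stick that is left.
    Returns the weights and the mass not yet assigned.
    """
    weights = []
    remaining = 1.0
    for _ in range(truncation):
        fraction = rng.betavariate(1.0, alpha)
        weights.append(remaining * fraction)
        remaining *= 1.0 - fraction
    return weights, remaining


rng = random.Random(1)
weights, leftover = stick_breaking_weights(alpha=1.0, truncation=50, rng=rng)

# Components whose weight exceeds an (arbitrary) relevance threshold.
n_relevant = sum(1 for w in weights if w > 0.01)
```

Because the remaining stick shrinks geometrically on average, only a handful of the 50 weights are typically of noticeable size, which is why a truncated version can stand in for the infinite mixture in practice.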

In this thesis, the numbers of groups or classes arise naturally from the application at hand. The number of components only needs to exceed the number of groups if one mixture component does not provide sufficient flexibility in terms of fit to represent one group. Furthermore, as mentioned, the focus lies considerably more on classification and interpretability than on the fit. Hence, for the applications considered in this thesis, it appears reasonable to fix K for the modeling tasks.

In document Finite Bayesian mixture models with applications in spatial cluster analysis and bioinformatics (Page 51-54)