
4.4 Validation, comparison, and stability

4.4.2 Choice of number of clusters and model

Bayesian and Akaike information criteria. Different criteria exist in the literature for selecting the best among a group of candidate models, including the choice of the optimal number of clusters.

Probably the oldest commonly used criterion is the Akaike Information Criterion (AIC), first presented by Akaike at a symposium in 1971 (Armenia, former USSR).

Its first term involves the log-likelihood evaluated at the maximum likelihood estimate (MLE) of the parameters, so it favors solutions under which the observed sample is more probable (i.e., a higher log-likelihood $\log L = \ln P(X \mid K, \hat{\Theta}^{MLE}_k)$). At the same time, the second term penalizes the complexity of the model in order to avoid overfitting. The penalty increases with the number of independent parameters $p$ to be estimated in the model:

$$\mathrm{AIC} = -2\log L + 2p$$

Another very common general-purpose criterion is the Bayesian Information Criterion (BIC) of Schwarz [134]. It is an asymptotic result obtained under the assumption that the data follow a distribution from the exponential family.

Inspired by the AIC, the BIC makes the penalty term grow with the size of the data ($n$):

$$\mathrm{BIC} = -2\log L + \ln(n)\,p$$

Minimization of AIC or BIC is the simplest way to choose the optimal number of clusters in model-based clustering.
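As a minimal sketch of this selection rule, the Python functions below compute both criteria from a maximised log-likelihood, using the convention in which smaller values are better (first term $-2\log L$), and return the K with the smallest value. The helper name `select_k` and the log-likelihoods and parameter counts stored in `fits` are hypothetical, chosen only for illustration; they are not results from the models discussed in this chapter.

```python
import math


def aic(log_lik: float, p: int) -> float:
    """AIC = -2 logL + 2p (smaller is better)."""
    return -2.0 * log_lik + 2.0 * p


def bic(log_lik: float, p: int, n: int) -> float:
    """BIC = -2 logL + ln(n) p (smaller is better)."""
    return -2.0 * log_lik + math.log(n) * p


def select_k(fits: dict, n: int, criterion: str = "bic") -> int:
    """Return the K with the smallest criterion value.

    `fits` maps K -> (maximised log-likelihood, number of free parameters).
    """
    scores = {
        k: (bic(ll, p, n) if criterion == "bic" else aic(ll, p))
        for k, (ll, p) in fits.items()
    }
    return min(scores, key=scores.get)


# Hypothetical fitted values, for illustration only.
fits = {1: (-520.0, 3), 2: (-480.0, 7), 3: (-474.0, 11)}
n = 200

k_aic = select_k(fits, n, "aic")
k_bic = select_k(fits, n, "bic")
```

With these illustrative numbers the two criteria disagree: the small gain in log-likelihood from K = 2 to K = 3 is enough to offset the AIC penalty but not the heavier BIC penalty.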

However, these criteria are not necessarily the most appropriate for all types of models and purposes. One disadvantage of the BIC (and the AIC) as a criterion for the number of clusters is that, because it favors the fit to the data, a single non-Gaussian cluster may be represented by two or more Gaussian components that together fit it better. The BIC may therefore overestimate the number of clusters.

Note that for models with several parameters per component (cluster), each additional parameter is counted once for every component, so the criterion is penalized more severely than when each component has a single parameter. For the HMTD model we are using, if the mean of each cluster includes two lags ($\Theta_k = \{\varphi_{k,0}, \varphi_{k,1}, \varphi_{k,2}, \theta_{k,0}\}$), the optimal solution may contain fewer clusters than when only one lag is used for the mean ($\Theta_k = \{\varphi_{k,0}, \varphi_{k,1}, \theta_{k,0}\}$), because of the penalty term.
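The effect of the penalty can be made concrete with a small arithmetic sketch. It assumes, for simplicity, that p counts only the component parameters listed above (three per cluster with one lag, four with two lags) and ignores any other parameters of the model; the sample size n = 500 and the function name `bic_penalty` are hypothetical.

```python
import math


def bic_penalty(n: int, n_clusters: int, params_per_cluster: int) -> float:
    """BIC penalty ln(n) * p, with p counted over component parameters only."""
    return math.log(n) * n_clusters * params_per_cluster


n = 500  # hypothetical sample size
for k in (2, 3, 4):
    one_lag = bic_penalty(n, k, 3)   # {phi_k0, phi_k1, theta_k0}
    two_lags = bic_penalty(n, k, 4)  # {phi_k0, phi_k1, phi_k2, theta_k0}
    print(k, round(one_lag, 1), round(two_lags, 1), round(two_lags - one_lag, 1))
```

Under this simplified count, the extra penalty of the two-lag specification is k ln(n): it grows with the number of clusters, so each additional cluster has to improve the log-likelihood by more before the criterion accepts it.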

Integrated Complete Likelihood. Since the paper of Biernacki, Celeux and Govaert [21] in 2000, another criterion has gained popularity for mixture models, particularly for clustering: the Integrated Complete Likelihood (ICL).

In order to understand this method we must recall the notion of Integrated (or marginal) Likelihood (IL). Also referred to as the evidence of the model, it is an important concept in Bayesian statistics. In general, its computation consists in marginalising out (integrating over) the parameters in the likelihood function. The sampled values are used for this purpose. The aim is to be left with a variable that characterizes the model itself: in mixture models, for instance, the selection of the optimal number of components (variable k) is often a major issue. One therefore needs a likelihood function that gives the probability that the data come from a mixture with k clusters, without assuming particular values for any other parameter (a function of k only).

In this case the marginal likelihood of interest is integrated over all parameters (denoted $\Theta$) except $K$:

$$P(x \mid K) = \int_{\Theta} p(x \mid \Theta, K)\, p(\Theta \mid K)\, d\Theta$$
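To illustrate how such an integral can be approximated from sampled parameter values, the sketch below uses a deliberately simple toy model rather than a mixture: observations with unit variance and unknown mean mu, with a standard normal prior on mu. Both the model and the function names are assumptions made for this example. The Monte Carlo estimate averages the likelihood over draws from the prior and is compared with the closed form available in this special case.

```python
import numpy as np


def log_marginal_mc(x, n_draws=200_000, seed=0):
    """Monte Carlo estimate of ln p(x) = ln of the prior average of p(x | mu)."""
    rng = np.random.default_rng(seed)
    mu = rng.standard_normal(n_draws)  # draws from the N(0, 1) prior
    # Log-likelihood of the whole sample for every prior draw (unit variance).
    resid = x[None, :] - mu[:, None]
    log_lik = -0.5 * (resid ** 2).sum(axis=1) - 0.5 * x.size * np.log(2 * np.pi)
    # Log of the average likelihood, computed stably (log-sum-exp trick).
    m = log_lik.max()
    return m + np.log(np.mean(np.exp(log_lik - m)))


def log_marginal_exact(x):
    """Closed form for the toy model: marginally, x ~ N(0, I + 11^T)."""
    n, s, ss = x.size, x.sum(), (x ** 2).sum()
    return (-0.5 * n * np.log(2 * np.pi)
            - 0.5 * np.log(1 + n)
            - 0.5 * (ss - s ** 2 / (1 + n)))


x = np.array([0.5, -0.2, 0.1, 0.8, 0.3])  # hypothetical observations
estimate = log_marginal_mc(x)
exact = log_marginal_exact(x)
```

Averaging over prior draws is the most direct reading of the integral, but it becomes inefficient when the posterior is much more concentrated than the prior, which is one reason why approximations are used in practice.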

The objective is to compare the evidence of one model with $K_1$ components against that of another model with $K_2$ components. The posterior odds ratio is computed by multiplying the prior odds ratio by the ratio of the marginal likelihoods (called the Bayes factor):

$$\frac{p(K_1 \mid x)}{p(K_2 \mid x)} = \frac{p_M(K_1)}{p_M(K_2)} \cdot \frac{p(x \mid K_1)}{p(x \mid K_2)}$$
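As a small worked example of this relation, the function below turns two log marginal likelihoods and two prior model probabilities into posterior odds. The numerical values are hypothetical and the function name is illustrative.

```python
import math


def posterior_odds(log_ml_1: float, log_ml_2: float,
                   prior_1: float = 0.5, prior_2: float = 0.5) -> float:
    """Posterior odds of model 1 against model 2: prior odds times Bayes factor."""
    bayes_factor = math.exp(log_ml_1 - log_ml_2)
    return (prior_1 / prior_2) * bayes_factor


# Hypothetical log marginal likelihoods for K1 and K2 components.
odds_equal_priors = posterior_odds(-240.0, -243.0)
odds_skewed_priors = posterior_odds(-240.0, -243.0, prior_1=0.25, prior_2=0.75)
```

Working with the difference of log marginal likelihoods avoids underflow, since the marginal likelihoods themselves are usually far too small to represent directly.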

In clustering problems where the mixture components are not well separated, an alternative version of the IL that uses the complete data is often recommended (Biernacki, Celeux and Govaert [21], Celeux [32]). The Integrated Complete Likelihood (ICL) makes use of the true missing data z (i.e. the allocation of observations to clusters) in addition to the observations x in the computation of the log-likelihood of the model. It is, however, not easy to estimate, and several approximation methods have been proposed by Celeux [32].

The first and most straightforward computation of the ICL, proposed by Biernacki et al., is the BIC-like approximation denoted $\mathrm{ICL}_{BIC}$. This approximation of the ICL uses the value of the BIC, penalized by the mean entropy of the solution:

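$$\mathrm{ICL}_{BIC}(K) = \mathrm{BIC}(K) + 2\,\mathrm{EN}(K), \qquad \mathrm{EN}(K) = -\sum_{i=1}^{n}\sum_{k=1}^{K} \hat{\tau}_{ik} \ln \hat{\tau}_{ik}$$

where $\hat{\tau}_{ik}$ denotes the estimated posterior probability that observation $i$ belongs to cluster $k$.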

By taking the entropy into account, the ICL favors partitions with better-separated clusters than the classic BIC criterion does. The latter is well suited to evaluating the fit of the model to the data and selecting the optimal data-generating model, whereas the ICL is better suited to clustering, where the separation between groups matters because it makes each cluster easier to interpret.
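The role of the entropy can be sketched as follows. The code assumes the common form of the approximation under the minimization convention, namely the BIC plus twice the entropy of the posterior membership probabilities; the matrix name `tau`, the function names, and the numerical values are hypothetical.

```python
import numpy as np


def entropy_term(tau: np.ndarray) -> float:
    """EN = -sum_i sum_k tau_ik ln(tau_ik), with 0 ln 0 taken as 0."""
    safe = np.where(tau > 0, tau, 1.0)  # ln(1) = 0, so zero entries contribute nothing
    return float(-(tau * np.log(safe)).sum())


def icl_bic(bic_value: float, tau: np.ndarray) -> float:
    """Assumed form: ICL_BIC = BIC + 2 * EN (smaller is better)."""
    return bic_value + 2.0 * entropy_term(tau)


# Rows are observations, columns are clusters; each row sums to one.
crisp = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
fuzzy = np.full((4, 2), 0.5)

icl_crisp = icl_bic(1000.0, crisp)  # well-separated clusters: no extra penalty
icl_fuzzy = icl_bic(1000.0, fuzzy)  # fully ambiguous allocation: largest penalty
```

For the same BIC value, the solution with unambiguous allocations is left unchanged while the ambiguous one is penalized, which is how the criterion comes to prefer well-separated clusters.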

Various other computations of the ICL also exist in the literature (see Celeux [32]). Biernacki, Celeux and Govaert [22] and Bertoletti, Friel and Rastelli [20] discuss methods for the exact computation of the ICL using, for instance, different prior distributions.

However, we must note that these papers, like the majority of the publications, focus on Bayesian estimation of the mixtures and are not adapted to the frequentist case.

Therefore for the examples in the next chapters we will implement the computation of the ICLBIC approximation.

Other criteria. Other approaches to choosing the number of clusters also exist besides those mentioned above. Some of them are based on bootstrap resampling. While such strategies are often used to measure the stability of a clustering, they may also be used to choose the optimal number of clusters k for a given dataset and clustering method.

Fang and Wang [50], for instance, aim to find the number of clusters for which the average dissimilarity between the partitions (instability S) is minimal. More concretely, the bootstrap stability assessment follows four steps:

1. Generate B pairs of bootstrap samples of size n (the number of observations), $(X_b, X'_b)$, $b = 1, \dots, B$.

2. Using the same clustering method, compute the partitions $P_{bk}$ and $P'_{bk}$ into k clusters for the two samples of each pair.

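3. Compute the clustering dissimilarity $s_{bk}$ between the two partitions of each pair by checking, for every pair of observations, whether they fall within the same group in both partitions:

$$s_{bk} = \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}\left| I\{P_{bk}(x_i) = P_{bk}(x_j)\} - I\{P'_{bk}(x_i) = P'_{bk}(x_j)\} \right|$$

where $P_{bk}(x_i)$ denotes the cluster to which the partition assigns observation $x_i$ and $I\{\cdot\}$ is the indicator function. Then define the clustering instability as the average of the dissimilarities over the B pairs of samples:

$$\bar{s}_{Bk} = \frac{1}{B}\sum_{b=1}^{B} s_{bk}$$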

4. Repeat these calculations for every candidate number of clusters k. The optimal number of clusters is the one for which the instability is minimal:

$$\hat{k} = \arg\min_{k \in \{2, \dots, K\}} \bar{s}_{Bk}$$
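The four steps above can be sketched as follows. This is an illustration under several assumptions, not the implementation of Fang and Wang [50]: a basic k-means stands in for the clustering method, the two partitions of each pair are obtained by assigning the original observations to the nearest centre fitted on each bootstrap sample (so that the same pairs of observations can be compared), and the dissimilarity is the share of ordered pairs, including i = j, on which the two partitions disagree about co-membership. All function names are hypothetical.

```python
import numpy as np


def assign(X, centers):
    """Label each row of X with the index of its nearest centre."""
    d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)


def kmeans_centers(X, k, rng, n_iter=50):
    """Basic Lloyd k-means returning k centres (stand-in clustering method)."""
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        labels = assign(X, centers)
        for j in range(k):
            if np.any(labels == j):  # keep the old centre if a cluster empties
                centers[j] = X[labels == j].mean(axis=0)
    return centers


def dissimilarity(labels_a, labels_b):
    """Share of pairs (i, j) grouped together in one partition but not the other."""
    same_a = labels_a[:, None] == labels_a[None, :]
    same_b = labels_b[:, None] == labels_b[None, :]
    return float(np.mean(same_a != same_b))


def instability(X, k, B=20, seed=0):
    """Average dissimilarity over B pairs of bootstrap samples (steps 1 to 3)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    s = []
    for _ in range(B):
        Xb = X[rng.integers(0, n, size=n)]   # first bootstrap sample
        Xb2 = X[rng.integers(0, n, size=n)]  # second bootstrap sample
        part_a = assign(X, kmeans_centers(Xb, k, rng))
        part_b = assign(X, kmeans_centers(Xb2, k, rng))
        s.append(dissimilarity(part_a, part_b))
    return float(np.mean(s))


def choose_k(X, k_max, B=20, seed=0):
    """Step 4: return the k in 2..k_max with the smallest instability."""
    scores = {k: instability(X, k, B, seed) for k in range(2, k_max + 1)}
    return min(scores, key=scores.get), scores


# Hypothetical data: two Gaussian groups in the plane.
data_rng = np.random.default_rng(1)
X = np.vstack([data_rng.normal(0, 1, (30, 2)), data_rng.normal(8, 1, (30, 2))])
best_k, scores = choose_k(X, k_max=4, B=5)
```

Because the dissimilarity only asks whether pairs of observations are grouped together, it is invariant to the labelling of the clusters, so no matching of labels between the two partitions is needed.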

Although this procedure is designed to find the optimal number of clusters for a single clustering method, it could also be used to compare the stability of the solutions of two different methods on a given dataset, provided the same number of clusters is chosen. The authors also propose a similar procedure to estimate the standard error of the estimated clustering instability.

Other methods are based on the between- and within-cluster sums of squared distances. The gap statistic, for instance, is a very popular method for k-means (Tibshirani, Walther, and Hastie, 2001). It evaluates the goodness of a clustering by comparing the average within-cluster dispersion with the dispersion expected under a reference distribution. It is computed for different numbers of clusters in order to choose the optimal one.
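A minimal sketch of the gap statistic is given below, under stated assumptions: a basic k-means is used for the clustering, the reference distribution is uniform over the bounding box of the data, and only the gap values are returned (the standard-error rule used by the authors to pick k is omitted). The function names and the data are hypothetical.

```python
import numpy as np


def within_dispersion(X, k, rng, n_iter=50):
    """Run a basic Lloyd k-means and return the sum of squared distances
    from each point to its nearest centre."""
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):  # keep the old centre if a cluster empties
                centers[j] = X[labels == j].mean(axis=0)
    d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d.min(axis=1).sum()


def gap_statistic(X, k_max, n_ref=10, seed=0):
    """Gap(k) = mean of ln W_k over reference datasets minus ln W_k of the data."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    gaps = {}
    for k in range(1, k_max + 1):
        log_w = np.log(within_dispersion(X, k, rng))
        ref_log_w = [
            np.log(within_dispersion(rng.uniform(lo, hi, size=X.shape), k, rng))
            for _ in range(n_ref)
        ]
        gaps[k] = float(np.mean(ref_log_w) - log_w)
    return gaps


# Hypothetical data: two Gaussian groups in the plane.
data_rng = np.random.default_rng(2)
X = np.vstack([data_rng.normal(0, 1, (40, 2)), data_rng.normal(10, 1, (40, 2))])
gaps = gap_statistic(X, k_max=3, n_ref=5)
```

A large gap at a given k indicates that the data are clustered much more tightly with k groups than structureless reference data would be.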

Note that the indices based on between- and within-cluster sums of squares and those based on dissimilarity are not suited to continuous longitudinal data.

4.4.3 Validation of clustering, comparison, and stability

In document Latent Markovian Modelling and Clustering for Continuous Data Sequences (Page 119-123)