2.2 Open Problems in Clustering
2.2.3 Estimating the Number of Clusters
In unsupervised learning, the true number of clusters is rarely known in advance, so it must be estimated as part of the learning process. This is an open problem in the literature, and each approach to clustering offers its own way of determining the number of clusters, such that the resulting groups remain consistent with the specified cluster definition.
Hierarchical Clustering
A complete hierarchical clustering, which returns a nested clustering structure, avoids this problem entirely by providing a clustering result for every possible number of clusters, $1, \ldots, n$. However, this is computationally expensive, and the user must often still extract a final clustering from the hierarchy and determine an appropriate number of clusters for the problem of interest. It may be desirable to define a stopping rule that automatically terminates the recursive splitting (or merging) of clusters in the hierarchy, so that the level of the hierarchy at which the rule is satisfied determines the number of clusters. For some applications a stopping rule is intuitive to specify, but this is not always the case, making this a non-trivial problem.
Given a complete hierarchy, with a single cluster at the root of the cluster tree (dendrogram) and $n$ leaves, one for each individual observation, the most common approach to extracting a final, flat clustering is to set a horizontal threshold across the dendrogram, locating the clusters which result from a single level of similarity (Jain and Dubes, 1988). However, it is well documented that this approach is unable to detect clusters on multiple scales (Stuetzle, 2003; Kriegel et al., 2011). Therefore, Campello et al. (2013) proposed the optimal extraction of clusters from hierarchies (OCE). This permits the extraction of clusters which correspond to non-horizontal cuts of the dendrogram, and locates the clustering that maximises the quality of the resulting clusters using a local measure of cluster quality. This allows the identification of clusters on multiple scales and with different densities.
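The limitation of a horizontal cut can be illustrated with a minimal Python sketch. This is not the procedure of any of the works cited above. It assumes single linkage on one-dimensional data, for which cutting the dendrogram at height t is equivalent to finding the connected components of the graph joining points that are no more than t apart. The function name cut_single_linkage and the toy data are hypothetical.

```python
def cut_single_linkage(points, threshold):
    """Flat cluster labels from a horizontal cut of a single-linkage
    dendrogram at height `threshold` (1-D data, absolute distance)."""
    parent = list(range(len(points)))

    def find(i):
        # Follow parent pointers to the root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Merge every pair of points that lies within the threshold.
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if abs(points[i] - points[j]) <= threshold:
                parent[find(i)] = find(j)

    # Relabel the roots 0, 1, 2, ... in order of first appearance.
    roots = [find(i) for i in range(len(points))]
    relabel = {r: c for c, r in enumerate(dict.fromkeys(roots))}
    return [relabel[r] for r in roots]


# Toy data (assumed): two tight groups and one loose group.
points = [0, 1, 2, 10, 11, 12, 100, 120, 140]
counts = {t: len(set(cut_single_linkage(points, t))) for t in (5, 15, 50, 100)}
```

In this toy data the two tight groups merge at height 8, before the loose group forms at height 20, so the thresholds tried give 5, 4, 2 and 1 clusters and no horizontal cut returns the three groups.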
Centroid-Based Clustering
For centroid-based clustering, it is intuitive to determine the number of clusters using the within-cluster sum of squared distances (or within-cluster variance), since this is the function which is minimised by this cluster definition, and which therefore determines the quality of a clustering. The elbow heuristic considers the reduction in the within-cluster variability for increasing numbers of clusters, and estimates the number of clusters such that any additional clusters do not significantly reduce the within-cluster variability.
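A minimal Python sketch of the elbow heuristic follows. It makes two assumptions that are not part of the text above. First, the data are one-dimensional, so the minimum within-cluster sum of squares can be found exactly by trying every split of the sorted data into contiguous runs. Second, an additional cluster is treated as "not significant" when it removes less than a fraction tol of the total variability $W_1$. The function names, the value of tol and the toy data are hypothetical.

```python
from itertools import combinations


def sse(xs):
    """Sum of squared distances of the values in xs to their mean."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs)


def optimal_wss_1d(data, k):
    """Exact minimum within-cluster sum of squares for 1-D data.
    Optimal 1-D clusters are contiguous runs of the sorted data,
    so every choice of k - 1 cut points is tried."""
    xs = sorted(data)
    n = len(xs)
    best = float("inf")
    for cuts in combinations(range(1, n), k - 1):
        bounds = (0,) + cuts + (n,)
        total = sum(sse(xs[a:b]) for a, b in zip(bounds, bounds[1:]))
        best = min(best, total)
    return best


def elbow(wss, tol=0.05):
    """wss[i] holds W_k for k = i + 1. Return the smallest k for which
    one more cluster removes less than tol * W_1 (assumed rule)."""
    for i in range(len(wss) - 1):
        if (wss[i] - wss[i + 1]) / wss[0] < tol:
            return i + 1
    return len(wss)


# Toy data (assumed): three groups near 1, 5 and 9.
data = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9, 9.0, 9.1, 8.8]
wss = [optimal_wss_1d(data, k) for k in range(1, 6)]
k_hat = elbow(wss)
```

For this toy data the rule returns three clusters. The threshold tol is a free choice, which is the subjectivity that the heuristic carries.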
The Gap statistic (Tibshirani et al., 2001) formalises this heuristic as a statistical procedure. It compares the total within-cluster variability for different numbers of clusters to its expected value under a null reference distribution with no obvious clustering structure (often the uniform distribution). For a given number of clusters, $k$, the Gap statistic is defined as,
\[ \mathrm{Gap}_n(k) = E_n\{\log(W_k)\} - \log(W_k), \]
where $W_k$ is the within-cluster sum of squared distances when the data are partitioned into $k$ clusters and $E_n(\cdot)$ denotes the expectation under a sample of size $n$ from the reference distribution, computed by Monte-Carlo simulation. Therefore, the Gap statistic measures the deviation of the observed within-cluster sum of squared distances from its expected value under the null reference distribution. The standard error of the Monte-Carlo simulation with $N$ null samples is defined as,
\[ s_k = \sigma_k \sqrt{1 + 1/N}, \]
where $\sigma_k$ is the standard deviation of the log within-cluster sum of squared distances when the null samples are partitioned into $k$ clusters. Finally, the number of clusters is chosen to be the minimum value of $k$ for which the following holds,
\[ \mathrm{Gap}_n(k) \geqslant \mathrm{Gap}_n(k+1) - s_{k+1}. \]
Therefore, $k$ is chosen to be the smallest value for which the Gap statistic is no more than one standard error, $s_{k+1}$, below the Gap statistic with $k+1$ clusters.
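The procedure can be sketched in Python with NumPy as follows. This is an illustrative sketch under stated assumptions, not the reference implementation of Tibshirani et al. (2001). It assumes that $W_k$ is obtained from Lloyd's k-means algorithm with a deterministic farthest-point initialisation, that the null reference distribution is uniform over the bounding box of the data, and that $N = 10$ null samples are drawn. The function names, the random seeds and the toy data are hypothetical.

```python
import numpy as np


def kmeans_wss(X, k, n_iter=50):
    """W_k from Lloyd's algorithm with farthest-point initialisation
    (an assumed, deterministic way of choosing starting centres)."""
    centres = [X[0]]
    for _ in range(1, k):
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centres], axis=0)
        centres.append(X[np.argmax(d2)])
    centres = np.array(centres, dtype=float)
    labels = None
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # assignments are stable, so the centres are too
        labels = new_labels
        for j in range(k):
            if np.any(labels == j):  # leave an empty cluster's centre as is
                centres[j] = X[labels == j].mean(axis=0)
    d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).sum()


def gap_statistic(X, k_max, n_null=10, seed=0):
    """Return (k_hat, gap, s); gap[i] and s[i] refer to k = i + 1."""
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    ks = range(1, k_max + 2)  # Gap(k_max + 1) is needed by the rule
    log_w = np.array([np.log(kmeans_wss(X, k)) for k in ks])
    # log(W_k) for each null sample drawn uniformly over the bounding box.
    null_log_w = np.array([
        [np.log(kmeans_wss(rng.uniform(lo, hi, size=X.shape), k)) for k in ks]
        for _ in range(n_null)
    ])
    gap = null_log_w.mean(axis=0) - log_w
    s = null_log_w.std(axis=0) * np.sqrt(1 + 1 / n_null)
    for k in range(1, k_max + 1):
        if gap[k - 1] >= gap[k] - s[k]:  # Gap(k) >= Gap(k+1) - s_{k+1}
            return k, gap, s
    return k_max, gap, s


# Toy data (assumed): three well-separated Gaussian groups in the plane.
rng = np.random.default_rng(1)
means = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([m + 0.5 * rng.standard_normal((30, 2)) for m in means])
k_hat, gap, s = gap_statistic(X, k_max=5)
```

With groups this well separated, the rule is expected to stop at three clusters. On less clearly separated data the estimate depends on the reference distribution and on the quality of the k-means solutions.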
Spectral Clustering
In spectral clustering, the number of distinct connected components within the graph indicates the number of clusters present. It has been shown (Ng et al., 2002) that the largest eigenvalue of the graph Laplacian is equal to one, and that this eigenvalue is repeated with multiplicity equal to the number of groups in the graph. Therefore, it is possible to determine the number of clusters by counting the eigenvalues of the graph Laplacian which are equal to one. However, this property only holds if the clusters correspond to completely disconnected components within the graph. If the clusters are not disconnected, the largest eigenvalues are not all equal to one. In this case, it may be possible to determine the number of clusters using the heuristic proposed by Polito and Perona (2002), which searches for the point where the eigenvalues of the graph Laplacian decrease sharply. However, the location of this point may not be clear in datasets with high levels of noise.
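Both ideas can be illustrated with a short Python sketch using NumPy. It assumes the normalised affinity matrix $D^{-1/2} A D^{-1/2}$, the form for which the largest eigenvalue equals one, and it reads "decrease sharply" as the largest drop between consecutive sorted eigenvalues, which is only one simple interpretation of the heuristic. The function names, the tolerance and the toy affinity matrices are hypothetical.

```python
import numpy as np


def normalised_affinity(A):
    """D^{-1/2} A D^{-1/2}, where D is the diagonal matrix of row sums."""
    inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    return A * inv_sqrt[:, None] * inv_sqrt[None, :]


def block_affinity(sizes, between=0.0):
    """Toy affinity: weight 1 within each group, `between` across groups,
    and zeros on the diagonal."""
    n = sum(sizes)
    A = np.full((n, n), float(between))
    start = 0
    for m in sizes:
        A[start:start + m, start:start + m] = 1.0
        start += m
    np.fill_diagonal(A, 0.0)
    return A


def count_unit_eigenvalues(A, tol=1e-8):
    """Number of eigenvalues numerically equal to one."""
    eig = np.linalg.eigvalsh(normalised_affinity(A))
    return int(np.sum(np.abs(eig - 1.0) < tol))


def eigengap_estimate(A):
    """Position of the largest drop in the decreasingly sorted eigenvalues."""
    eig = np.sort(np.linalg.eigvalsh(normalised_affinity(A)))[::-1]
    drops = eig[:-1] - eig[1:]
    return int(np.argmax(drops)) + 1


disconnected = block_affinity((3, 4, 2))            # three separate components
weakly_linked = block_affinity((3, 4, 2), between=0.01)

n_exact = count_unit_eigenvalues(disconnected)
n_linked = count_unit_eigenvalues(weakly_linked)
n_gap = eigengap_estimate(weakly_linked)
```

For the disconnected graph the count of unit eigenvalues recovers the three groups. Once weak links are added only one eigenvalue remains equal to one, and the largest drop in the spectrum is used instead.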
Zelnik-Manor and Perona (2004) propose to use the eigenvectors of the graph Laplacian to estimate the number of clusters for spectral clustering. If the clusters are completely disconnected, the graph Laplacian may be sorted into a strictly block diagonal matrix, where each block corresponds to the Laplacian of a sub-graph associated with a single cluster. In this case, the matrix of eigenvectors of the graph Laplacian, $V \in \mathbb{R}^{n \times k}$, will have non-zero values only in entries corresponding to a single cluster. For a graph with $k$ clusters, if we compute more than $k$ eigenvectors, $V$ will have some rows which contain more than one non-zero entry. Similarly, if we compute fewer than $k$ eigenvectors, $V$ will have some rows which contain no non-zero entries. Therefore, Zelnik-Manor and Perona (2004) propose to estimate the number of clusters as the value which gives the minimal alignment cost between the eigenvectors of the graph Laplacian and the canonical co-ordinate system $e_1, \ldots, e_k$.
Model-Based Clustering
The model-based approach to clustering allows the estimation of the number of clusters through standard statistical model selection techniques, provided it is possible to construct a likelihood for the chosen clustering model. The value of the likelihood for models with different numbers of mixture components (clusters) may be used to detect when a more complex model does not fit the data significantly better than a model with fewer parameters. The most common model selection techniques for this task are the Akaike information criterion (AIC) and Bayesian information criterion (BIC). For a model with $p$ fitted parameters and likelihood $L$, the AIC is defined as,
\[ \mathrm{AIC} = 2p - 2\log(L), \]
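while the BIC is,
\[ \mathrm{BIC} = p\log(n) - 2\log(L), \]
where $n$ is the number of observations. The number of clusters is chosen as the smallest value at which the information criterion stops decreasing, that is, the point from which it is non-decreasing.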
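The selection rule can be sketched in a few lines of Python. The maximised log-likelihoods below are hypothetical illustrative values, not the result of any real fit, and the parameter count $p = 3k - 1$ assumes a one-dimensional Gaussian mixture with $k$ components ($k$ means, $k$ variances and $k - 1$ free mixing weights). The function names are likewise assumptions.

```python
import math


def aic(log_lik, p):
    """Akaike information criterion: 2p - 2 log(L)."""
    return 2 * p - 2 * log_lik


def bic(log_lik, p, n):
    """Bayesian information criterion: p log(n) - 2 log(L)."""
    return p * math.log(n) - 2 * log_lik


def first_non_decreasing(values):
    """values[i] is the criterion for k = i + 1 clusters. Return the
    smallest k whose successor does not lower the criterion."""
    for i in range(len(values) - 1):
        if values[i + 1] >= values[i]:
            return i + 1
    return len(values)


n = 200                                           # number of observations (assumed)
log_liks = [-520.0, -450.0, -410.0, -405.0, -404.5]  # hypothetical values
params = [3 * k - 1 for k in range(1, 6)]         # 1-D Gaussian mixture (assumed)

aic_values = [aic(ll, p) for ll, p in zip(log_liks, params)]
bic_values = [bic(ll, p, n) for ll, p in zip(log_liks, params)]
k_aic = first_non_decreasing(aic_values)
k_bic = first_non_decreasing(bic_values)
```

With these illustrative values the AIC selects four clusters and the BIC selects three, since the BIC penalty $p\log(n)$ exceeds the AIC penalty $2p$ whenever $n > e^2 \approx 7.4$.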
Density-Based Clustering
For density-based clustering, the level set definition given in Eq. (2.2) inherently estimates the number of clusters to be the number of regions of density greater than the level parameter, $c$, which are separated by regions of density lower than $c$. Irrespective of the choice of density estimate applied by different density-based algorithms, the number of clusters equates to the number of high-density regions, concentrated around the modes of the estimated density of the data. In practice, the specification of a threshold at which the density is considered sufficiently high to constitute a cluster is non-trivial. However, varying the level parameter does allow the computation of a complete cluster hierarchy to avoid this problem.
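The dependence of the number of clusters on the level parameter can be seen in a minimal Python sketch. It assumes one-dimensional data, a Gaussian kernel density estimate with a fixed bandwidth h, and evaluation of the density on a regular grid, none of which is prescribed by the text above. The function names, the bandwidth and the toy data are hypothetical.

```python
import math


def kde(x, data, h):
    """Gaussian kernel density estimate at x with bandwidth h."""
    norm = len(data) * h * math.sqrt(2 * math.pi)
    return sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in data) / norm


def count_level_set_clusters(data, c, h=0.5, step=0.01):
    """Count the maximal grid intervals on which the estimated density
    exceeds the level c."""
    lo = min(data) - 4 * h
    hi = max(data) + 4 * h
    n_steps = int(round((hi - lo) / step))
    count, inside = 0, False
    for i in range(n_steps + 1):
        above = kde(lo + i * step, data, h) > c
        if above and not inside:
            count += 1  # entering a new high-density region
        inside = above
    return count


# Toy data (assumed): a larger group near 0 and a smaller group near 5.
data = [0.0, 0.2, -0.2, 0.1, -0.1, 5.0, 5.2, 4.8]
n_low = count_level_set_clusters(data, c=0.1)
n_mid = count_level_set_clusters(data, c=0.35)
n_high = count_level_set_clusters(data, c=0.6)
```

For this toy data a low level finds both groups, an intermediate level retains only the denser group, and a high level finds none, which is why a single threshold is difficult to specify.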