When should the iteration terminate? - Knowledge discovery in high dimensional data

method is that the cluster having in effect the widest (largest variance) projection on the first principal component is chosen (“shoot the biggest animal”).

This strategy can easily fail, as a simple example from [TT08] shows; the example is reproduced here for completeness.

Fig. 5.6 illustrates a case where the dataset has already been split into two clusters, shown in different colours. The large green cluster has an SV value of 0.25414, which is 0.17945 larger than the SV value of the red cluster. Selecting the cluster with the larger SV value for further splitting would therefore have no chance of producing a correct clustering.

The splitting criteria proposed in the previous section could also guide the cluster selection step. Among the clusters retrieved so far, we expect the one with the largest distance between consecutive projections to be the most likely to contain more than one actual cluster.

The same can be said for the minimum estimated density criterion. A minimiser with very small density should be a good indicator of multi-modality of the density function, and consequently choosing it lessens the chance of splitting an actual cluster. Thus we propose the following two selection criteria:

• (Cluster Selection Criterion CS1): Let Π be a partition of the data set D into k sets. Let M = {M_i : i = 1, . . . , k} be the set of the largest distances M_i among consecutive projections, for each C_i ∈ Π, i = 1, . . . , k. The next set to split is C_j, with j = arg max_i {M_i : M_i ∈ M}.

• (Cluster Selection Criterion CS2): Let Π be a partition of the data set D into k sets. Let F be the set of the density estimates f_i = f̂(x^∗_i; h) of the minimisers x^∗_i for the projection of the data of each C_i ∈ Π, i = 1, . . . , k. The next set to split is C_j, with j = arg min_i {f_i : f_i ∈ F}, i.e. the cluster whose minimiser has the smallest estimated density.
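The two selection criteria can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the implementation used in [TT08]: the function names, the Gaussian kernel, the grid-based search for local minima and the parameter n_grid are all hypothetical choices, and every cluster is assumed to contain at least two points. Following the reasoning above that a very small density at the minimiser signals multi-modality, the CS2 sketch picks the cluster with the smallest such density.

```python
import numpy as np


def project_on_first_pc(X):
    """Project the rows of X onto the first principal direction of the centred data."""
    Xc = X - X.mean(axis=0)
    # Rows of vt are the principal directions; vt[0] has the largest variance.
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ vt[0]


def largest_gap(proj):
    """M_i in CS1: the largest distance between consecutive sorted projections."""
    return float(np.max(np.diff(np.sort(proj))))


def density_at_minimiser(proj, h, n_grid=512):
    """f_i in CS2: lowest Gaussian-KDE value at an interior local minimum.

    Returns None when the estimate has no interior local minimum on the grid.
    The grid search and n_grid are assumptions made for this sketch.
    """
    grid = np.linspace(proj.min(), proj.max(), n_grid)
    u = (grid[:, None] - proj[None, :]) / h
    f = np.exp(-0.5 * u**2).sum(axis=1) / (len(proj) * h * np.sqrt(2.0 * np.pi))
    mid = f[1:-1]
    # Strictly below the left neighbour and not above the right one;
    # the non-strict side tolerates flat valleys.
    is_min = (mid < f[:-2]) & (mid <= f[2:])
    return float(mid[is_min].min()) if is_min.any() else None


def select_cs1(clusters):
    """Index of the cluster with the largest gap between consecutive projections."""
    return int(np.argmax([largest_gap(project_on_first_pc(C)) for C in clusters]))


def select_cs2(clusters, h):
    """Index of the cluster whose minimiser has the smallest density, or None."""
    dens = [density_at_minimiser(project_on_first_pc(C), h) for C in clusters]
    candidates = [(f, i) for i, f in enumerate(dens) if f is not None]
    return min(candidates)[1] if candidates else None
```

Here clusters is a list of two-dimensional arrays with one row per point. Clusters whose density estimate has no interior minimum are simply skipped by select_cs2, which returns None when no cluster offers a candidate.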

§ 5.4 When should the iteration terminate?

Irrespective of the method used, a fundamental issue in cluster analysis is the determination of the number of clusters present in a dataset, and it remains an open problem. For instance, well-known and widely used iterative techniques, such as the k-means algorithm [HW79], require the user to specify the number of clusters a priori. There also exist approaches that adjust the number of clusters during training. The ISODATA technique [BH67] is based on the same key idea as the k-means algorithm; starting with a typical number of initial clusters, it iteratively merges and splits existing clusters according to “within-group variability” and “closeness” thresholds. Another popular approach is to employ the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC) to choose among partitions with different numbers of clusters. In the same vein, the Integrated Completed Likelihood (ICL) criterion has been proposed as more appropriate for clustering purposes [BCG00]. In this category of algorithms, PG-means [FHE07] aims to learn the number of components of a Gaussian mixture model, using statistical hypothesis tests on one-dimensional projections of the data. The computationally efficient x-means algorithm, proposed in [PM00], is another popular approach from the class of partitioning algorithms that can approximate the number of clusters in the data.

Figure 5.6: A dataset of unbalanced clusters and the corresponding scatter values.

One of the most promising approaches from the density-based category of algorithms is DBSCAN [SEKX98], which tries to recover clusters from spatial databases and automatically decides the number of clusters. Clusters are defined by means of neighbourhoods of objects, and the density of each such neighbourhood has to exceed some threshold. The value of the threshold is critical for the execution of the algorithm, and heuristics have been proposed to determine it. Finally, there exist some recent agglomerative hierarchical clustering algorithms that have been shown to achieve high quality clustering results, such as BIRCH [ZRL96], CHAMELEON [KHK99] and CURE [GRS98]. However, these types of algorithms require more user intervention to provide accurate estimates of the number of clusters.

There also exist a few grid-based algorithms [HK99, AGGR05] that have been shown to produce good clustering results. One of the most notable of these is CLIQUE [AGGR05]. Their biggest drawback is that their running time is exponential in the data dimensionality. To be more precise, it is exponential not in the actual data dimensionality but in the dimensionality of the subspaces where the clusters reside, which is possibly much smaller than the full data dimensionality. So they are better suited to cases where specific clusters lie in few dimensions and the data dimensionality is moderate (e.g. 20, 40), and not to cases where the dimensionality is in the range of thousands.

Little has been done, however, to develop an efficient technique for PDDP-based approaches. The crudest approach would be to stop the execution when all the discovered clusters have a scatter value smaller than a predefined value, but tuning this parameter can be difficult from a user perspective. The criteria used in other algorithms could also be employed in the PDDP case. For example, in [KSI03] it is proposed to use BIC to determine whether a further split would improve the clustering result. Additionally, we could use nearest neighbour statistics like the ones used in cluster tendency [TK06, You82].

In [TT08], a termination criterion based on the maximum distance between consecutive projections was proposed. More specifically, we propose to have a maximum number of allowed clusters k_max as an upper bound and subsequently to continue splitting as long as there exist clusters with more than MinPts points, where MinPts is a user-defined parameter that describes the minimum number of points allowed to constitute a valid cluster. Notice that this is not an uncommon procedure for algorithms that are designed to deal with noisy datasets [SEKX98]. In this case, it is indirectly assumed that the distance between two outliers is larger than the distance between any two points of a cluster. Formally, the termination criterion, based on the two user-defined parameters k_max and MinPts, is the following:

• (Stopping Criterion ST1): Iteratively split the data set D into k_max subsets. Report as clusters the ones with more than MinPts points. Designate the remaining points as outliers.
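A minimal sketch of how ST1 could be driven is given here. It assumes that each split is made at the largest gap between consecutive projections on the first principal direction, and that the next subset to split is the one with the largest such gap. The function names (first_pc_projection, gap_split, st1) and the representation of subsets as index arrays are hypothetical, chosen only for illustration.

```python
import numpy as np


def first_pc_projection(X):
    """Project the rows of X onto the first principal direction of the centred data."""
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ vt[0]


def gap_split(X, idx):
    """Return (largest gap, left indices, right indices) for the points X[idx]."""
    proj = first_pc_projection(X[idx])
    order = np.argsort(proj)
    gaps = np.diff(proj[order])
    cut = int(np.argmax(gaps)) + 1  # first position on the right side of the gap
    return float(gaps[cut - 1]), idx[order[:cut]], idx[order[cut:]]


def st1(X, k_max, min_pts):
    """Split X into at most k_max subsets, then separate clusters from outliers."""
    parts = [np.arange(len(X))]
    while len(parts) < k_max:
        # Only subsets with at least two points can be split.
        splits = [(gap_split(X, p), i) for i, p in enumerate(parts) if len(p) >= 2]
        if not splits:
            break
        # Assumed selection rule: split the subset with the largest gap.
        (_, left, right), j = max(splits, key=lambda s: s[0][0])
        parts[j:j + 1] = [left, right]
    clusters = [p for p in parts if len(p) > min_pts]
    small = [p for p in parts if len(p) <= min_pts]
    outliers = np.concatenate(small) if small else np.array([], dtype=int)
    return clusters, outliers
```

The function returns the clusters as arrays of row indices into X, together with the indices of the points designated as outliers. With well separated groups and an isolated point, the isolated point tends to be split off early and ends up among the outliers because its subset has no more than MinPts points.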

For the density-based approach presented here, we could allow the existence of a minimiser to guide the termination of the procedure. We can stop the iteration as soon as no minimiser exists for any of the retrieved clusters.

This stopping criterion makes the assumption that all the retrieved clusters are uniformly dense, so they cannot be further split. Note, however, that this depends on the bandwidth selection, and automated bandwidth selection techniques could be employed to remove user intervention. Formally:

• (Stopping Criterion ST2): Let Π be a partition of the data set D into k sets. Let X be the set of minimisers x^∗_i of the density estimates f̂(x^∗_i; h) of the projection of the data of each C_i ∈ Π, i = 1, . . . , k. Stop the procedure when the set X is empty.
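The ST2 test can be sketched as follows, to make the dependence on the bandwidth h concrete. The sketch works directly on the one-dimensional projections of each cluster. The Gaussian kernel, the grid search, and the fallback bandwidth h = 1.06 σ n^(-1/5) (the common normal reference rule) are assumptions made here for illustration and are not prescribed by the text; the function names are hypothetical.

```python
import numpy as np


def normal_reference_bandwidth(proj):
    """Rule-of-thumb bandwidth h = 1.06 * sigma * n^(-1/5) (an assumed default)."""
    return 1.06 * np.std(proj) * len(proj) ** (-0.2)


def has_minimiser(proj, h=None, n_grid=512):
    """True if the Gaussian KDE of proj has an interior local minimum on the grid."""
    proj = np.asarray(proj, dtype=float)
    if len(proj) < 2:
        return False
    if h is None:
        h = normal_reference_bandwidth(proj)
    if h <= 0:
        return False  # all projections coincide; nothing to split
    grid = np.linspace(proj.min(), proj.max(), n_grid)
    u = (grid[:, None] - proj[None, :]) / h
    # The normalising constant is omitted: it does not move the minima.
    f = np.exp(-0.5 * u**2).sum(axis=1)
    mid = f[1:-1]
    return bool(np.any((mid < f[:-2]) & (mid <= f[2:])))


def st2_should_stop(projections, h=None):
    """ST2: stop when no cluster's projected density has a minimiser."""
    return not any(has_minimiser(p, h) for p in projections)
```

A larger h smooths the estimate and can remove a minimiser that a smaller h would reveal, so the same partition may or may not trigger the stop depending on the bandwidth.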

64 Chapter 5. Enhancing Principal Direction Divisive Clustering
