L is rarely known - Improved strategies for distance based clustering of objects on subsets of attributes in high-dimensional data

A common limitation of the results is that L, the number of clusters, was pre-set beforehand and thus assumed known. This limitation did not receive any attention in the previous studies of the clustering algorithms. It is more informative to compare the algorithms under the assumption that L is unknown. To do so, we need a strategy to determine the number of clusters obtained by each algorithm. There are many cluster validation measures that can guide the researcher in determining the number of clusters; the silhouette width (Rousseeuw, 1987) is just one example. For overviews of this ongoing topic of cluster validation research, see Dudoit and Fridlyand (2002), Tibshirani and Walther (2005), Arbelaitz et al. (2013), Lee and Olafsson (2013), and Hancer et al. (2017). However, apart from Tibshirani and Walther (2005), these overviews pay no attention to either regularized attribute weighting clustering algorithms or high-dimensional data settings. In this section we fill this gap.

To select L for each algorithm, we apply the Gap statistic to the criterion, as a function of L, of each of the L-groups clustering algorithms. For SAS, SPARCL, and EWKM this is a natural choice, since the Gap statistic performs well for algorithms whose objective functions tend to find convex-like shaped clusters in the data (Lee & Olafsson, 2013). However, in specific high-dimensional data settings it may be outperformed by the prediction strength method of Tibshirani and Walther (2005). This latter ‘better’ approach, however, introduces the new problem of setting a threshold for the prediction strength, and it comes with extra computational costs, because the cross-validation procedure has to be repeated a number of times that must itself be specified. Since we perform MVPIN on the COSA-λNN and COSA-KNN distances, we apply the Gap statistic procedure to the objective of MVPIN (see Chapter 5, equation 5.17).
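For reference, the Gap statistic in its standard form (Tibshirani, Walther and Hastie, 2001) can be written as below. This is a sketch under assumptions rather than the exact variant used in this chapter: W_L stands for the value of an algorithm's criterion at L clusters, assumed positive and to be minimized, W*_{L,b} is the same criterion on the b-th of B reference (here: permuted) data sets, and the symbols B, W and s_L are notation introduced for this illustration only.

```latex
\begin{align*}
\mathrm{Gap}(L) &= \frac{1}{B}\sum_{b=1}^{B} \log W^{*}_{L,b} \;-\; \log W_{L}, \\
s_{L} &= \sqrt{1 + \tfrac{1}{B}}\;
         \sqrt{\frac{1}{B}\sum_{b=1}^{B}\Bigl(\log W^{*}_{L,b}
               - \frac{1}{B}\sum_{b'=1}^{B}\log W^{*}_{L,b'}\Bigr)^{2}}, \\
\hat{L} &= \min\bigl\{\, L : \mathrm{Gap}(L) \ge \mathrm{Gap}(L+1) - s_{L+1} \,\bigr\}.
\end{align*}
```

For a criterion that is maximized instead of minimized, the sign of the difference in the first line would be reversed.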

6.3.1 Results when assuming L unknown

We applied the Gap statistic for all the algorithms on a grid L ∈ {2, 3, ..., 10}, using the same 25 permuted data sets that we used for selecting the optimal tuning parameters for the attribute weighting. Letting the Gap statistic estimate the number of clusters L resulted in the Rand indices presented in Figure 6.4.
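The selection step over a grid of candidate values for L can be sketched as follows. This is a minimal illustration, assuming a positive criterion that is minimized and the standard selection rule of Tibshirani, Walther and Hastie (2001); the function names `gap_curve` and `select_L` and the toy criterion values are hypothetical and are not taken from the study.

```python
import math
from statistics import mean, pstdev


def gap_curve(observed, permuted):
    """Gap value and standard error for every candidate L.

    observed: dict mapping L to the criterion value on the real data.
    permuted: dict mapping L to a list of criterion values, one per
              permuted data set (assumption: criterion is minimized, > 0).
    """
    gaps, ses = {}, {}
    for L, value in observed.items():
        logs = [math.log(v) for v in permuted[L]]
        gaps[L] = mean(logs) - math.log(value)
        # Standard error inflated for the finite number of permutations.
        ses[L] = pstdev(logs) * math.sqrt(1 + 1 / len(logs))
    return gaps, ses


def select_L(gaps, ses):
    """Smallest L on the grid with Gap(L) >= Gap(L + 1) - s(L + 1)."""
    grid = sorted(gaps)
    for L, L_next in zip(grid, grid[1:]):
        if gaps[L] >= gaps[L_next] - ses[L_next]:
            return L
    return grid[-1]  # no candidate satisfied the rule: fall back to the largest L


# Toy criterion values (made up for illustration only).
observed = {2: 100.0, 3: 60.0, 4: 55.0}
permuted = {2: [110.0, 112.0, 108.0],
            3: [100.0, 98.0, 102.0],
            4: [88.0, 90.0, 92.0]}
gaps, ses = gap_curve(observed, permuted)
chosen_L = select_L(gaps, ses)
```

In the setting described above the grid would run from 2 to 10 and each list in `permuted` would hold 25 values, one per permuted data set.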

[Figure: Rand Index (RI, scale 0.00 to 1.00) by Method (SASdgs, SASgss, SPARCL, EWKM, COSA-KNN, COSA-λNN) and Data (ApoE3, DLBCL, Lung1, Leuk, Colon, Brain, SRBCT, Lung2, Breast2, Su, Prostate); image and cell values not reproduced.]

Figure 6.4: A comparison of the algorithms assuming L is unknown; for each algorithm the number of clusters is optimized by the Gap statistic.

Overall, the Rand Indices became slightly lower (mean difference = 0.017) compared to the situation in Figure 6.3, where L was pre-specified to equal the number of clusters. The largest changes in the Rand Index were observed for the data sets with more than two clusters. In the current results, COSA-λNN and SASdgs remain the best performers on the data sets. While for SASdgs the Rand Indices became slightly lower on average (mean = 0.685, previously 0.699), the Rand Indices of COSA-λNN remained about the same on average (mean = 0.716).
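To make the reported quantity concrete, the (unadjusted) Rand Index of Rand (1971) is the proportion of object pairs on which two partitions agree, that is, pairs placed together in both or apart in both. The sketch below is a plain illustration; the function name `rand_index` and the toy labels are hypothetical. It also shows how estimating fewer clusters than the reference partition contains changes the index.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Proportion of object pairs on which two partitions agree."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both partitions must label the same objects")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        raise ValueError("at least two objects are required")
    agreements = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agreements / len(pairs)


# Toy example: a reference partition with three clusters and an estimate
# that merges two of them into one cluster.
reference = [0, 0, 1, 1, 2, 2]
merged = [0, 0, 0, 0, 1, 1]
ri_merged = rand_index(reference, merged)

# Relabelling the clusters does not change the index.
ri_relabelled = rand_index(reference, [2, 2, 0, 0, 1, 1])
```

Because the index counts pairs and not labels, a partition with the ‘wrong’ number of clusters can still score well when most pairs are kept together or apart correctly.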

With the Gap statistic procedure applied to the criterion values of the two SAS algorithms, only two clusters were consistently selected (see Table 6.2). While for the DLBCL data selecting a lower number of clusters results in a higher Rand Index (from 0.693 to 0.915), for the Brain and SRBCT data sets it results in lower Rand Index values (from 0.787 to 0.595 and from 0.654 to 0.474, respectively). The performance of SASdgs, as well as COSA-λNN, on the DLBCL data will be discussed in further detail in Section 6.4.

Table 6.2: The estimated number of clusters L̂ based on the Gap statistic, and the supposedly true number of clusters L in brackets behind the name of each data set. The estimate L̂ in red indicates a better Rand Index value than obtained with L; in blue, the estimate L̂ indicates a lower Rand Index value than obtained with L.

Data      L  SASdgs  SASgss  SPARCL  EWKM  COSA-KNN  COSA-λNN
ApoE3     2  2       2       3       5     2         2
Brain     5  2       2       2       6     6         3
Breast    2  2       2       3       6     2         2
Colon     2  2       2       2       6     5         2
Leukemia  2  2       2       3       6     3         2
Lung1     2  2       2       6       5     6         3
Lung2     2  2       2       3       5     2         3
DLBCL     3  2       2       2       6     4         3
Prostate  2  2       2       2       4     2         4
SRBCT     4  2       2       3       6     2         6
SuCancer  2  2       2       2       6     3         3

For SPARCL the Gap statistic procedure ‘worsened’ the results (mean difference = -0.044) on the clusterable data sets. Large decreases in the Rand Indices occurred for the Brain (0.81 to 0.562), Leukemia (0.945 to 0.842), Lung1 (0.785 to 0.48), and DLBCL (0.945 to 0.915) data sets. However, the Rand Index improved for the ApoE3 data (0.492 to 0.788). For EWKM the application of the Gap statistic procedure resulted in slightly better Rand Index values (mean difference = 0.012). For COSA-KNN the results are worse than those in Figure 6.3 (mean difference = -0.041).

6.3.2 Discussion of the Results So Far

Although COSA-KNN and COSA-λNN are not designed to find L groups automatically, in combination with MVPIN they can be used for this purpose. In general, COSA-λNN performs better than COSA-KNN when used in combination with MVPIN. Moreover, when assuming L is unknown, the application of COSA-λNN with MVPIN arguably gives the best clustering performance. In general, COSA-λNN and SAS with the default grid search provided the best clustering results on the benchmark data. However, it remains difficult to claim that COSA-λNN and SAS outperform the other algorithms: for every data set, there is an alternative with a comparable, if not better, performance.

When we optimize the pre-set number L with the Gap statistic procedure, the performances of the algorithms on the data sets change. Whereas estimating the ‘wrong’ number of clusters led on average towards a better recovery of the clustering structure for EWKM, COSA-KNN, and COSA-λNN, it did not lead to a better average performance for SPARCL or the SAS algorithms.

those in the original study (Arias-Castro & Pu, 2017). A first explanation for the lower Rand Index values of SAS (and SPARCL) could be that we have hit a bad starting state of the (pseudo) random number generator. The ‘seed’ we used for our random number generator represented the date of the start of the comparative study, i.e., set.seed(20180830). Re-running the algorithm on the DLBCL data set did not result in better performances (results not shown here). Note, however, that the results of SPARCL are better than those presented in Jin and Wang (2016).

We conjecture that the differences in the results may be due to the volatile sensitivity of the SAS and SPARCL algorithms to random starts. The regularized attribute weighting K-means (or C-means) type algorithms that perform attribute weighting in high-dimensional data settings are more prone to local minima than the original K-means algorithm. In the subspace clustering literature this problem is even larger, since every cluster has its own subset of attributes; EWKM therefore automatically comes with the ‘advice’ to re-run the algorithm a multiple (of ten) times on each data set.
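The restart strategy mentioned above can be illustrated with a small sketch: run the algorithm from several random starts and keep the solution with the lowest objective value. The code uses plain K-means (Lloyd's algorithm) as a stand-in, not SAS, SPARCL, or EWKM, and the function names, the toy data, and the choice of ten seeds are assumptions made for this illustration.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(points, centers):
    """Index of the nearest center for every point."""
    return [min(range(len(centers)), key=lambda c: sq_dist(p, centers[c]))
            for p in points]


def kmeans(points, k, seed, max_iter=100):
    """Plain K-means from one random start; returns (objective, labels)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # random start: k distinct observations
    for _ in range(max_iter):
        labels = assign(points, centers)
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            # Keep the old center if a cluster happens to be empty.
            new_centers.append(
                tuple(sum(col) / len(members) for col in zip(*members))
                if members else centers[c]
            )
        if new_centers == centers:  # converged to a (possibly local) minimum
            break
        centers = new_centers
    labels = assign(points, centers)
    objective = sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))
    return objective, labels


def best_of_restarts(points, k, seeds):
    """Run K-means once per seed and keep the run with the lowest objective."""
    return min((kmeans(points, k, s) for s in seeds), key=lambda run: run[0])


# Toy data: three Gaussian groups in two dimensions (illustration only).
data_rng = random.Random(0)
points = [(data_rng.gauss(cx, 0.5), data_rng.gauss(cy, 0.5))
          for cx, cy in [(0, 0), (4, 0), (2, 3)] for _ in range(20)]
seeds = range(10)
best_objective, best_labels = best_of_restarts(points, 3, seeds)
```

Reporting the spread of the objective over the restarts, and not only the best run, would give an impression of how sensitive an algorithm is to its starting state.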

Consistent with the original studies by Arias-Castro and Pu (2017) and Jin and Wang (2016), none of the algorithms recovers the clustering structure of the Breast2, Lung2, Prostate, and SuCancer data sets. It is noteworthy that even for supervised learning algorithms, where the group label was used as the outcome variable, these data sets were difficult (Dettling, 2004; Yousefi et al., 2010). Another reason may be that the division into two groups of the Breast2, Lung2, and SuCancer data by Yousefi et al. (2010) is not a good, or not the only, representation of the original grouping structure.

A first strong point of this extension of the original study is that we compared SAS and SPARCL with the ‘subspace clustering’ algorithm EWKM. Moreover, this comparative study also gives a better understanding of the average performance of EWKM under a selection strategy for the value of the tuning parameter that is independent of the clustering structure. In Deng et al. (2013; 2016) the tuning parameters were ‘ideally’ adjusted towards the highest average value of the Rand Index with respect to the clustering structure. This strategy cannot be implemented in practice, since the cluster labels are not known. In our study, EWKM does not belong to the best performers when its average Rand Index values are compared with the (single) Rand Index values of the other algorithms.

The second strong point of this comparative study is that we have compared the algorithms while not considering L, the number of clusters, to be known. However, this strong point coincides with a limitation: the ongoing debate on how to decide on L clusters (Hancer et al., 2017). When we use the Gap statistic to select the number of clusters in combination with SAS, we always seem to obtain two clusters. When the attribute weights are the same for all clusters, it seems that the stricter the attribute weighting, the more the Gap statistic procedure points towards a smaller estimated number of clusters. This is a hypothesis that seems to be consistent with findings in Kou (2014), and the issue may be resolved by applying the robust-GUD statistic to choose the number of clusters.
