4.3 Experiments
4.3.2 Penalised Error Rate
There is no single, universally agreed measure of clustering quality. As true class labels are available for all datasets, and are not used during clustering, one possible measure of cluster quality is the agreement of clusters with classes. One would expect better accuracies with more clusters, as it should be easier to find small class-pure clusters than large ones. One caveat is that when the majority class of each cluster is taken as its “label”, clusters with only one example will automatically be correct; the degenerate case is a clustering in which every cluster contains a single instance, which would be treated as perfect. For this reason a Penalised Error Rate was used, which treats instances in single-instance clusters as errors.
The trends visible for penalised error rates follow reasonable expectations: in general, higher numbers of clusters lead to smaller penalised error rates. The exception is the Musk1 dataset, where the comparatively small number of instances leads to more single-instance clusters as the number of clusters generated increases. Indeed, the penalised error rate begins to increase at around 30 clusters for Musk1, as shown in Figure 4.1. On the Musk1 dataset, Rrr-c and Rsd performed very similarly, with Rrr-c(Wide) performing slightly worse than the other algorithm-coverage pairs.
Figure 4.1: Penalised error rates on Musk1
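The Penalised Error Rate described above can be illustrated with a minimal sketch. It assumes that the rate is the number of errors divided by the total number of instances, that each cluster with more than one instance is labelled with its majority class, and that every instance in a single-instance cluster counts as an error. The function name and the toy data are illustrative only and are not taken from the experiments.

```python
from collections import Counter, defaultdict


def penalised_error_rate(cluster_ids, class_labels):
    """Error rate under majority-class labelling, with every instance in a
    single-instance cluster counted as an error (assumed normalisation:
    errors divided by the total number of instances)."""
    if len(cluster_ids) != len(class_labels):
        raise ValueError("cluster_ids and class_labels must have equal length")
    if not cluster_ids:
        raise ValueError("at least one instance is required")

    # Group the true class labels by the cluster each instance belongs to.
    members = defaultdict(list)
    for cluster, label in zip(cluster_ids, class_labels):
        members[cluster].append(label)

    errors = 0
    for labels in members.values():
        if len(labels) == 1:
            # A singleton is trivially class-pure, so it is penalised.
            errors += 1
        else:
            # Instances outside the cluster's majority class are errors.
            majority_count = Counter(labels).most_common(1)[0][1]
            errors += len(labels) - majority_count
    return errors / len(cluster_ids)


# Toy example: cluster 0 has one minority instance, cluster 1 is pure,
# and cluster 2 is a singleton, so 2 of the 6 instances are errors.
clusters = [0, 0, 0, 1, 1, 2]
classes = ["a", "a", "b", "b", "b", "a"]
print(penalised_error_rate(clusters, classes))
```

Under this definition the degenerate clustering, in which every cluster holds one instance, receives a rate of 1 rather than being treated as perfect.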
For MutagenesisRF and MutagenesisAll (see Figures 4.2 and 4.3), the penalised error rates for Rrr-c and Rsd are also very similar across all coverage ranges. On both datasets, Rrr-c performs slightly better than Rsd for smaller numbers of clusters, but as the number of clusters is increased the difference between the penalised error rates decreases.
On Carcinogenesis, Rrr-c and Rsd again perform similarly. For smaller numbers of clusters, Rrr-c(25%-75%) performs slightly worse than the others. On all four of the above datasets, Rkm performed worst of the three systems. This is at least partially due to the tendency of Rkm to produce both larger clusters (less likely to be class-pure) and more single-instance clusters (automatic errors) than either of the other two algorithms, both of which increase the penalised error rate. An example of this for a 10-cluster run on the Musk1 dataset is shown in Table 4.2.
Figure 4.2: Penalised error rates on MutagenesisRF
Table 4.2: Size of clusters generated by Rkm on Musk1
Cluster Size    Quantity
80              1
4               1
1               8
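The effect of the single-instance clusters in Table 4.2 can be checked with a short calculation. The sketch below uses only the sizes and quantities from the table, and assumes (as above) that the penalised error rate is normalised by the total number of instances. The variable names are illustrative.

```python
# (cluster size, number of clusters of that size), copied from Table 4.2.
size_counts = [(80, 1), (4, 1), (1, 8)]

n_clusters = sum(quantity for _, quantity in size_counts)
n_instances = sum(size * quantity for size, quantity in size_counts)
n_singletons = sum(quantity for size, quantity in size_counts if size == 1)

# Every singleton instance is an automatic error, so this fraction is a
# lower bound on the penalised error rate of this clustering.
singleton_floor = n_singletons / n_instances

print(n_clusters, n_instances, n_singletons)
print(round(singleton_floor, 4))
```

For this run, 8 of the 92 instances (about 8.7%) are errors before the purity of the 80-instance cluster is even considered.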
On the Diterpenes datasets, the trends displayed are somewhat different. For these datasets, Rkm and Rrr-c generally perform better than Rsd, although Rrr-c(Wide) consistently produced a higher penalised error rate than the other Rrr-c experiments. The four coverage ranges for Rsd produce penalised error rates that are very similar to each other, with only the minor exception that Rsd(25%) performs slightly worse than the other coverage ranges on Diterpenes54,3.
Rkm produces a much smaller number of single-instance clusters on these datasets than on the non-Diterpenes datasets, which may contribute to the improvement in its penalised error rate relative to the other algorithms. The difference between the penalised error rates for Rrr-c and Rsd may be due to Rsd’s restriction on rule generation: Rsd will not accept rules that can be decomposed into two or more distinct rules.
Figure 4.4: Penalised error rates on Diterpenes52,54
Although the penalised error rates on Diterpenes52,54 are very low (as shown in Figure 4.4; the penalised error rates for the other two subsets follow a similar pattern), the penalised error rate for DiterpenesAll is substantially higher, as shown in Figure 4.5. This occurs because DiterpenesAll is a 23-class dataset with a skewed class distribution in which three classes make up over 75% of the dataset. Most of the generated clusters are dominated by one of the three major classes, so instances of the remaining classes that fall in those clusters are counted as errors.
Figure 4.5: Penalised error rates on DiterpenesAll
The poorer performance of the Wide coverage range for Rrr-c, compared to the other coverage ranges, may be related to the fact that when this coverage range is used, a high proportion of the generated rules have coverage in the range (two instances – 5% of instances), as shown in Table 4.3.
As Carcinogenesis has somewhat similar proportions of low-coverage attributes to Diterpenes under Rrr-c, but does not display this behaviour, it may be that the number of instances in the dataset is also a factor, given that the Diterpenes datasets are 2-4 times larger than the Carcinogenesis dataset. The Wide coverage range is the only one bounded by an absolute number of instances, rather than a proportion, and two instances is a much smaller proportion of the Diterpenes datasets than of the smaller datasets, resulting in rules with correspondingly low coverage. It may even be the case that this behaviour is the result of some unknown property of the Diterpenes data.
Table 4.3: Proportion of rules generated by Rrr-c that cover (2 instances – 5% of instances)
Dataset            Proportion
Diterpenes52,3     0.7204
Carcinogenesis     0.7180
Diterpenes52,54    0.7104
Diterpenes54,3     0.7039
DiterpenesAll      0.6961
MutagenesisRF      0.6372
MutagenesisAll     0.6353
Musk1              0.2946
rules are generated using the Wide coverage range – rulesets generated with a higher minimum coverage for rules appear to perform better for clustering.
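The quantity reported in Table 4.3 could be computed along the following lines. This is a sketch under stated assumptions: rule coverage is taken to be the number of instances a rule covers, both ends of the band are treated as inclusive, and the function name, the coverage counts, and the 1000-instance dataset size are hypothetical. The only figure drawn from the text is the 92-instance total for Musk1 implied by Table 4.2.

```python
def low_coverage_proportion(rule_coverages, n_instances,
                            min_count=2, max_fraction=0.05):
    """Proportion of rules whose coverage lies between min_count instances
    and max_fraction of the dataset (bounds assumed inclusive)."""
    if not rule_coverages:
        raise ValueError("at least one rule is required")
    upper = max_fraction * n_instances
    in_band = sum(1 for c in rule_coverages if min_count <= c <= upper)
    return in_band / len(rule_coverages)


# Hypothetical coverage counts for ten rules on a 200-instance dataset:
# the band is 2 to 10 instances, and five of the ten rules fall inside it.
coverages = [2, 3, 5, 8, 9, 12, 40, 75, 120, 150]
print(low_coverage_proportion(coverages, n_instances=200))

# Two instances as a fraction of the dataset shrinks as the dataset grows.
# 92 is the Musk1 total implied by Table 4.2; 1000 is a hypothetical size.
for n in (92, 1000):
    print(n, 2 / n)
```

The second loop illustrates why an absolute lower bound of two instances admits proportionally lower-coverage rules on larger datasets.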
The Normalised Mutual Information (NMI) measure [51] for cluster evaluation was also investigated, but the relative performance of the algorithms was very similar to that observed using the Penalised Error Rate.
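For comparison with the Penalised Error Rate, NMI between a clustering and the class labels can be sketched as follows. Several normalisations of mutual information are in use, and the text does not state which one was applied, so the geometric-mean normalisation, I(U;V) / sqrt(H(U) H(V)), is an assumption here. Returning 0 when either partition has a single group is likewise a convention chosen for this sketch.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of a labelling."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information (natural log) between two labellings."""
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    joint = Counter(zip(a, b))
    return sum(
        (n_ab / n) * math.log(n * n_ab / (count_a[x] * count_b[y]))
        for (x, y), n_ab in joint.items()
    )


def nmi(cluster_ids, class_labels):
    """NMI with the (assumed) geometric-mean normalisation."""
    if len(cluster_ids) != len(class_labels) or not cluster_ids:
        raise ValueError("inputs must be non-empty and of equal length")
    h_u, h_v = entropy(cluster_ids), entropy(class_labels)
    if h_u == 0.0 or h_v == 0.0:
        # Convention for this sketch: a one-group partition carries no information.
        return 0.0
    return mutual_information(cluster_ids, class_labels) / math.sqrt(h_u * h_v)


# Clusters that match the classes exactly score (approximately) 1;
# clusters that are independent of the classes score 0.
print(nmi([0, 0, 1, 1], ["a", "a", "b", "b"]))
print(nmi([0, 1, 0, 1], ["a", "a", "b", "b"]))
```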