Silhouette Width - Random Relational Rules

4.3 Experiments

4.3.3 Silhouette Width

Another measure used to compare clusterings is the average silhouette width [70]. The silhouette value $s_i$ for an instance $i$ is calculated according to the following formula:

\[
s_i = \frac{b_i - a_i}{\max(a_i, b_i)} \tag{4.1}
\]

Where:

$a_i = md(i, c_i)$ ($c_i$ = the cluster containing $i$)

$b_i = \min_j \, md(i, c_j)$ over all clusters $c_j$ not containing instance $i$

$md(i, c)$ = the mean distance from instance $i$ to all instances in cluster $c$

The silhouette value is thus a measure of clustering quality that is independent of the class labels of the data, instead using the distance measure to determine whether an instance has been optimally clustered. It compares the average distance from a given instance i to each other instance in its cluster to the average distance from i to each instance in the closest cluster (the closest cluster being that with the smallest average distance to i across all instances it contains). Higher silhouette values therefore arise from tighter clusters (smaller intra-cluster distances) and more separated clusters (larger inter-cluster distances). Silhouette values lie between -1 and +1, with lower values indicating an increasing likelihood that the instance could have been better placed in the cluster represented by $b_i$. A silhouette value of zero indicates that the instance could be equally well clustered in the cluster represented by $b_i$ as in its current cluster.
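As a quick illustration of Equation 4.1, the following worked values use made-up mean distances (they are not taken from the experiments). They show the three regimes described above: an instance closer to its own cluster, an instance closer to the neighbouring cluster, and an instance equally close to both.

```latex
\[
a_i = 2,\ b_i = 5: \quad s_i = \frac{5 - 2}{\max(2, 5)} = \frac{3}{5} = 0.6
\]
\[
a_i = 5,\ b_i = 2: \quad s_i = \frac{2 - 5}{\max(5, 2)} = \frac{-3}{5} = -0.6
\]
\[
a_i = 3,\ b_i = 3: \quad s_i = \frac{3 - 3}{\max(3, 3)} = 0
\]
```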

In the case where a cluster contains only one instance, the silhouette value of that instance is defined to be zero, again to avoid overly positive evaluation of single-instance clusters. Under propositionalisation, it is possible for two or more instances to have identical attribute values. This occurs when these instances produce the same Boolean values for each of the rules generated by Rrr-c or Rsd. When a cluster is composed of instances with identical attribute values, the silhouette is calculated as in Equation 4.1, but with a value of 0 for $a_i$ (because there is no distance between the instances), which gives a result of 1 for each instance in the cluster. This is shown in Equation 4.2 (which assumes that $b_i$ is positive – this holds except in the pathological case that all instances in the dataset are identical). The effects of this property are further discussed later in this section.

\[
s_i = \frac{b_i - a_i}{\max(a_i, b_i)} = \frac{b_i - 0}{\max(0, b_i)} = \frac{b_i}{b_i} = 1 \tag{4.2}
\]

Where:

$a_i = 0$, as all instances in the cluster have identical attributes

$b_i = \min_j \, md(i, c_j)$ over all clusters $c_j$ not containing instance $i$

$md(i, c)$ = the mean distance from instance $i$ to all instances in cluster $c$
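The definitions above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the experiments. The function name, the distance-matrix input and the one-dimensional toy data are all assumptions made for the example. Following the prose description, the within-cluster mean is taken over the other members of the cluster. The values returned when only one cluster exists, or when every distance is zero, are also assumptions, since the text leaves those cases open.

```python
from collections import defaultdict


def silhouette_values(dist, labels):
    """Return the silhouette value (Equation 4.1) of every instance.

    dist   -- square list-of-lists of pairwise distances (hypothetical input format)
    labels -- labels[i] is the cluster identifier of instance i
    """
    clusters = defaultdict(list)
    for idx, lab in enumerate(labels):
        clusters[lab].append(idx)

    def mean_dist(i, members):
        return sum(dist[i][j] for j in members) / len(members)

    values = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        others = [m for other, m in clusters.items() if other != lab]
        if not own or not others:
            # Single-instance cluster: defined as 0 in the text.
            # Only one cluster overall: 0 is an assumption of this sketch.
            values.append(0.0)
            continue
        a = mean_dist(i, own)
        b = min(mean_dist(i, m) for m in others)
        denom = max(a, b)
        # denom == 0 only in the pathological all-identical case (assumed 0 here).
        values.append((b - a) / denom if denom > 0 else 0.0)
    return values


# Toy data: two identical instances, a loose pair, and a singleton.
points = [0, 0, 4, 10, 20]
dist = [[abs(p - q) for q in points] for p in points]
labels = [0, 0, 1, 1, 2]
print(silhouette_values(dist, labels))
```

In this toy example the two identical instances in cluster 0 have a within-cluster distance of 0 and therefore a silhouette value of 1, as in Equation 4.2, while the instance that forms cluster 2 on its own receives 0.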

To evaluate the quality of a clustering, the average silhouette width is used, which is the average of the silhouette values for all instances in a dataset (shown in Equation 4.3).

\[
\text{Average silhouette width} = \frac{\sum_{i=1}^{n} s_i}{n} \tag{4.3}
\]

Where:

$s_i$ = the silhouette value for the $i$th instance in the dataset

$n$ = the number of instances in the dataset

Rousseeuw’s interpretation of the average silhouette width is described in Table 4.4.

Table 4.4: Interpretation of silhouette width

Average Sil. Width    Interpretation
0.71-1.00             Strong structure
0.51-0.70             Reasonable structure
0.26-0.50             Weak structure
up to 0.25            No substantial structure
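The averaging step and the lookup in Table 4.4 can be sketched as follows. The function names are hypothetical. The table gives two-decimal bands, so the treatment of values that fall between bands (for example 0.705) is an assumption of this sketch: each band is taken to begin just above the upper limit of the band below it.

```python
def average_silhouette_width(values):
    """Mean of the per-instance silhouette values (Equation 4.3)."""
    if not values:
        raise ValueError("at least one silhouette value is required")
    return sum(values) / len(values)


def interpret(width):
    """Map an average silhouette width to the categories of Table 4.4.

    Boundary handling between the two-decimal bands is an assumption.
    """
    if width > 0.70:
        return "Strong structure"
    if width > 0.50:
        return "Reasonable structure"
    if width > 0.25:
        return "Weak structure"
    return "No substantial structure"


# Illustrative per-instance values (not experimental results).
width = average_silhouette_width([1.0, 1.0, 0.4, 0.0])
print(round(width, 2), interpret(width))
```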

The average silhouette width follows similar trends for the MutagenesisRF and MutagenesisAll datasets (shown in Figures 4.6 and 4.7) for both Rrr-c and Rsd – a slow increase as the number of clusters increases, although with an initial peak for a very small number of clusters, followed by a drop, for MutagenesisRF.

Figure 4.6: Average silhouette widths for MutagenesisRF

The silhouette values for the Mutagenesis datasets show clear differences – Rsd produces higher silhouette values than Rrr-c. The silhouette value for Rsd(Wide) is lower than that for the other coverage ranges, and similarly, Rrr-c(Wide) produces worse silhouette values than the other Rrr-c runs. Frequently the Rsd silhouette values are ordered by minimum coverage – Rsd(25%) performing better than Rsd(10%), and so on – although for some numbers of clusters the values are very similar. Rkm performs substantially worse than both. The high number of single-instance clusters generated by Rkm explains its low silhouette width – not only do instances in single-instance clusters have silhouette values of zero themselves, they can also significantly lower the silhouette widths of instances in larger clusters that lie in close proximity. In addition to this, properties of the non-Euclidean Ribl distance measure may also affect silhouettes. The Wide coverage range tends to generate more single-instance clusters than the other coverage ranges, explaining the slightly worse silhouettes obtained by both Rrr-c(Wide) and Rsd(Wide). By Rousseeuw’s interpretation (in Table 4.4) Rsd produces clusterings that range from ‘weak structure’ to ‘reasonable structure’, while Rrr-c produces ‘weak structure’. The comparatively low silhouette widths for the Wide coverage runs fall into the ‘weak structure’ range for Rsd and ‘no substantial structure’ for Rrr-c.

Figure 4.7: Average silhouette widths for MutagenesisAll

On Musk1, as shown in Figure 4.8, Rkm produces poor silhouette values, while the silhouette values for Rrr-c and Rsd are very similar, with Rrr-c(Wide) and Rsd(Wide) producing slightly lower silhouette values than the other coverage ranges. As the number of clusters increases, all of the silhouette values tend towards zero, as the number of single-instance clusters generated also increases – and, as mentioned above, on the comparatively small Musk1 dataset, all of the algorithms produce a greater number of single-instance clusters. The structure found is initially in the ‘weak’ range for most coverage ranges, but drops to ‘no substantial structure’ as the number of clusters is increased.

Figure 4.8: Average silhouette widths for Musk1

On the Carcinogenesis dataset, the non-Wide Rrr-c and Rsd results stay within a narrow band of values as the number of clusters increases – in the high end of ‘no substantial structure’ and the low end of ‘weak structure’. For higher numbers of clusters, Rrr-c(25%-75%) shows a slight improvement over the others. Both Rrr-c(Wide) and Rsd(Wide) have distinctly worse silhouette widths than their non-Wide counterparts, with Rsd(Wide) slightly outperforming Rrr-c(Wide). Rkm once again has a very low silhouette value.

On the Diterpenes datasets, the silhouette values produced show a distinct relationship to the coverage settings for both Rrr-c and Rsd – the silhouette values for Diterpenes52,54 are shown in Figure 4.9. For each algorithm, as the minimum coverage for rules increases, so do the silhouette values produced. The silhouette values for Rkm are improved from the results on the previous datasets. The silhouette values for Rsd are substantially higher than for Rrr-c (except for Rsd(Wide)), falling in the category of ‘weak structure’, and at their peak ‘reasonable structure’, as opposed to ‘no substantial structure’ and the low end of ‘weak structure’ for Rrr-c. For both algorithms, the Wide coverage range performs substantially worse than the other coverage ranges.

On DiterpenesAll (shown in Figure 4.10), Rrr-c has slightly lower silhouette values (‘no substantial structure’) than on the Diterpenes subsets, but the ordering of those values across the coverage ranges is the same. Rsd behaves slightly differently, with Rsd(25%) now producing worse silhouette values than Rsd(5%). Rkm has a particularly high silhouette width on DiterpenesAll for low numbers of clusters, in the ‘weak structure’ range.

Table 4.5: Number of unique instances under propositionalisation

Dataset           Number of    Unique instances
                  instances    Rrr-c (25%-75%)   Rsd (25%)   Rrr-c (Wide)   Rsd (Wide)
Carcinogenesis    330          313.2             301         312.6          314
Diterpenes52,3    801          796.0             599         796.3          798
Diterpenes52,54   804          796.0             593         797.0          798
Diterpenes54,3    709          703.0             537         702.6          704
DiterpenesAll     1503         1492.3            1066        1491.0         1503
Musk1             92           92                92          92             92
MutagenesisAll    230          175.0             115         172.2          141
MutagenesisRF     188          149.5             98          145.7          118

One factor contributing to the high silhouette values produced by Rsd on the Mutagenesis and Diterpenes datasets may be the larger numbers of instances that have duplicates under Rsd’s propositionalisation than under that of Rrr-c (some examples of this are shown in Table 4.5). Each instance in a cluster consisting only of duplicated instances will have a silhouette value of 1 (as the average within-cluster distance is 0), as previously shown in Equation 4.2. Even in clusters that do not consist solely of duplicated instances, duplicated instances contribute to lower intra-cluster distances, which leads to higher silhouette values.

On the Musk and Carcinogenesis datasets, where Rsd and Rrr-c have very similar silhouette values, they also produce very similar numbers of duplicate instances.
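Counting unique instances in a propositionalised dataset, as reported in Table 4.5, amounts to counting distinct Boolean attribute vectors. The sketch below shows one way this could be done. The function names and the tiny table of rule outcomes are hypothetical and do not come from the experiments.

```python
from collections import Counter


def unique_instance_count(table):
    """Number of distinct attribute vectors in a propositionalised dataset."""
    return len({tuple(row) for row in table})


def instances_with_duplicates(table):
    """Number of instances whose attribute vector occurs more than once."""
    counts = Counter(tuple(row) for row in table)
    return sum(c for c in counts.values() if c > 1)


# Hypothetical Boolean outcomes of three rules on five instances.
table = [
    (1, 0, 1),
    (1, 0, 1),
    (0, 1, 1),
    (0, 0, 0),
    (1, 0, 1),
]
print(unique_instance_count(table), instances_with_duplicates(table))
```

Here three of the five instances share one attribute vector, so they are at distance 0 from each other and would each receive a silhouette value of 1 if they formed a cluster on their own.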

Although in general both the penalised error rate and the average silhouette width improve as the number of clusters increases for most of the datasets and coverage ranges, they are measuring different things. The penalised error rate depends only on the agreement of class labels with clusters, and the average silhouette width only takes into account the relative groupings of clusters, ignoring class labels.

Figure 4.9: Average silhouette widths for Diterpenes52,54


In particular, the silhouette value only examines the structure of the propositionalised representation of the dataset, and does not consider the relationship of the propositionalised instances to their class labels. For an extreme example, consider a single-attribute propositionalisation of a dataset, where each instance is represented by a randomly-assigned single Boolean value. Such a propositionalisation would have a perfect silhouette value when clustered, as each instance would have zero distance from each other instance in its cluster. However (unless the single attribute corresponded directly to the class of each instance) this propositionalisation would certainly not have a perfect Penalised Error Rate. Additionally, the silhouette value was originally intended for purposes such as determining the ‘best’ number of clusters to use in clustering a particular dataset, rather than cross-representation comparison.
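The extreme example above can be reproduced with a small sketch. Several assumptions apply. A fixed alternating pattern stands in for the random Boolean attribute so that the result is reproducible. The class labels are invented. A plain majority-class error is used as a simple stand-in for label agreement; it is not the Penalised Error Rate defined earlier in the chapter. All function names are hypothetical.

```python
from collections import Counter


def silhouette_values(dist, labels):
    """Equation 4.1 per instance; assumes at least two clusters and b > 0."""
    clusters = {lab: [j for j, l in enumerate(labels) if l == lab]
                for lab in set(labels)}
    values = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            values.append(0.0)  # single-instance cluster
            continue
        a = sum(dist[i][j] for j in own) / len(own)
        b = min(sum(dist[i][j] for j in m) / len(m)
                for other, m in clusters.items() if other != lab)
        values.append((b - a) / max(a, b))
    return values


def majority_class_error(labels, classes):
    """Fraction of instances that are not in the majority class of their cluster."""
    wrong = 0
    for lab in set(labels):
        members = [c for l, c in zip(labels, classes) if l == lab]
        wrong += len(members) - Counter(members).most_common(1)[0][1]
    return wrong / len(labels)


n = 8
attribute = [i % 2 for i in range(n)]        # the single Boolean attribute
classes = [(i // 2) % 2 for i in range(n)]   # invented labels, unrelated to it
dist = [[abs(x - y) for y in attribute] for x in attribute]
labels = attribute                           # cluster by attribute value

print(silhouette_values(dist, labels))        # every value is 1.0
print(majority_class_error(labels, classes))  # 0.5
```

Every instance sits at distance 0 from the rest of its cluster, so the average silhouette width is 1, yet half of the instances are in a cluster dominated by the other class.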

The Penalised Error Rate can be said to reflect to some extent the quality of propositionalisation. Instances that are mutually similar should be grouped together by clustering, and with a ‘good’ propositionalisation instances of the same class should be similar (assuming that these similarities exist in the original data).

This divergence between Penalised Error Rate and silhouette value can be observed in the Diterpenes results. Rsd has a substantially higher silhouette value than Rrr-c on these datasets, but also a substantially higher Penalised Error Rate. This indicates that while Rsd’s clustering has created clusters that are more clearly separated than those produced by Rrr-c, those clusters are not as class-pure. Furthermore, although the information obtained from the silhouette value is of interest, in the two-step setting where a propositional representation is generated and then clustered, a measure of clustering that takes into account the relation of the propositionalisation to the original class labels should be preferred to one that does not, as this is a definite indicator that groupings in the propositionalisation reflect groupings in the original data.
