
CHAPTER 3. CLUSTER QUALITY MEASURES EXPERIMENTATION

For this study, we constructed a large number of synthetic data sets following a technique used by Milligan [107]. In that study, one hundred and eight synthetic data sets were produced by identifying three parameters in the data generation process and combining them. We introduce an additional parameter that allows us to introduce outliers, so we explore the following: data sets with different numbers of clusters; different dimensionality of the data; different sizes of clusters; and different numbers of outliers. We also extend the data construction method by using larger ranges of possible values for each parameter. Milligan has explained in detail the method for generating the data sets [108]. It is briefly summarised here and then expanded upon.

To generate data objects we must first identify the boundaries of each cluster. Points are generated within these boundaries. The boundaries of the clusters must not overlap in the first dimension. The length of each boundary is selected from a uniform distribution running from ten to forty. The centroid of each cluster is then determined. The value of the centroid for a given dimension is the midpoint of its boundary for that dimension. The standard deviation of a cluster for a given dimension is defined as a third of the length of its boundary for that dimension. Points are generated using a multivariate normal distribution whose centre is the centroid of the cluster to be generated. The diagonal entries of the variance-covariance matrix are set to the standard deviations of each dimension of the cluster. Each point that is generated must be within 1.5 standard deviations of the centroid. Because the standard deviation is a third of the boundary length, 1.5 standard deviations is half of that length, so this limit measured from the midpoint coincides with the cluster boundary. The process is repeated for each cluster that is to be generated.
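The generation of a single cluster can be sketched as follows. This is a minimal illustration rather than the procedure actually used: the function name `generate_cluster`, the rejection-sampling loop and the choice of NumPy are assumptions, and the dimensions are sampled independently with the stated per-dimension standard deviation, which is only one reading of the covariance description above.

```python
import numpy as np

def generate_cluster(lower, upper, n_points, rng, max_sd=1.5):
    """Sample n_points for one cluster bounded by [lower, upper] in each dimension."""
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    centroid = (lower + upper) / 2.0   # midpoint of each boundary
    sd = (upper - lower) / 3.0         # a third of each boundary length
    accepted, n_accepted = [], 0
    while n_accepted < n_points:
        # Assumption: independent normal draws per dimension with scale sd.
        batch = rng.normal(centroid, sd, size=(4 * n_points, centroid.size))
        # Keep only points within max_sd standard deviations in every dimension.
        keep = np.all(np.abs(batch - centroid) <= max_sd * sd, axis=1)
        accepted.append(batch[keep])
        n_accepted += int(keep.sum())
    return np.vstack(accepted)[:n_points]

rng = np.random.default_rng(0)
lengths = rng.uniform(10, 40, size=3)   # boundary lengths drawn from U(10, 40)
lower = np.zeros(3)                     # hypothetical lower boundary
points = generate_cluster(lower, lower + lengths, 50, rng)
```

Rejection sampling is used here only because it is the simplest way to enforce the 1.5 standard deviation limit; in high dimensions it discards most candidates.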

In our experimentation, the first parameter is the number of clusters in a data set: values between two and forty are used. The second parameter is the number of dimensions: values range from two to twenty dimensions within Euclidean space, constructed so that no one dimension dominates the others. The third parameter is the proportion of objects that are members of each cluster. For this, we use three possible designs: in the first design, the objects are evenly distributed between all of the clusters; in the second design, one cluster consists of 10% of the objects and the rest are as evenly distributed as possible; in the third design, one cluster consists of 60% of the objects and the rest are as evenly distributed as possible. Finally, the fourth parameter is the proportion of objects that are generated as outliers.

Outliers are defined as objects that lie within 9 standard deviations of the centroid of a cluster. The proportion of outliers is 0%, 20% or 40% of the objects generated.
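The three cluster-size designs can be illustrated with a short sketch. The function name `cluster_sizes`, the rounding rule and the choice to place the fixed-share cluster first are assumptions made for illustration; the text only states that the remaining objects are distributed as evenly as possible.

```python
def cluster_sizes(n_objects, n_clusters, fixed_share=None):
    """Return cluster sizes that sum to n_objects.

    fixed_share=None gives the even design; 0.10 or 0.60 gives one cluster
    that share of the objects and spreads the rest as evenly as possible.
    """
    if fixed_share is None:
        base, extra = divmod(n_objects, n_clusters)
        return [base + (i < extra) for i in range(n_clusters)]
    first = round(n_objects * fixed_share)
    base, extra = divmod(n_objects - first, n_clusters - 1)
    return [first] + [base + (i < extra) for i in range(n_clusters - 1)]

even = cluster_sizes(500, 3)              # [167, 167, 166]
small = cluster_sizes(500, 3, 0.10)       # [50, 225, 225]
dominant = cluster_sizes(500, 3, 0.60)    # [300, 100, 100]
```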

The variation of these factors produces six thousand six hundred and sixty-nine different data set designs. Each design is then generated three times, resulting in a final set of twenty thousand and seven data sets. Each data set consists of five hundred objects.
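The design count can be checked by enumerating the parameter grid. The sketch below assumes a full factorial design in which every integer number of clusters (2 to 40) and dimensions (2 to 20) is used; that assumption reproduces the totals stated above (39 × 19 × 3 × 3 = 6669 designs, and 6669 × 3 = 20007 data sets). The variable names are illustrative.

```python
from itertools import product

cluster_counts = range(2, 41)          # 39 values
dimensions = range(2, 21)              # 19 values
size_designs = ("even", "one cluster 10%", "one cluster 60%")
outlier_proportions = (0.0, 0.2, 0.4)

designs = list(product(cluster_counts, dimensions, size_designs, outlier_proportions))
n_designs = len(designs)               # 6669
n_data_sets = n_designs * 3            # each design generated three times: 20007
```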

For each data set, twenty clustering solutions are generated. The first represents the optimal clustering solution, and those that follow are copies in which an increasing proportion of the objects has been randomly misclassified, in 1% steps. That is, in the first solution all objects are correctly assigned to their clusters; in the second solution 1% of objects are misclassified, then 2%, and so on. The quality of the clustering solutions should decrease as more objects are misclassified. For each solution, the value of each internal quality measure is calculated. This process produces a set of results for each internal quality measure. Each set of results for an internal quality measure associated with a data set is normalised to the range 0 to 1. Finally, minimisation measures are inverted so that they become maximisation measures, which makes the results easier to interpret.
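The degradation and normalisation steps can be sketched as below. Several details are assumptions, since the text does not specify them: a misclassified object is moved to a different cluster chosen uniformly at random, normalisation is min-max scaling, and inversion is taken as one minus the normalised value. The function names `misclassify` and `normalise` are hypothetical.

```python
import numpy as np

def misclassify(labels, proportion, rng):
    """Return a copy of labels with the given proportion of objects reassigned."""
    labels = np.asarray(labels)
    degraded = labels.copy()
    n_wrong = int(round(proportion * labels.size))
    chosen = rng.choice(labels.size, size=n_wrong, replace=False)
    cluster_ids = np.unique(labels)
    for i in chosen:
        # Assumption: pick uniformly among the clusters other than the true one.
        others = cluster_ids[cluster_ids != labels[i]]
        degraded[i] = rng.choice(others)
    return degraded

def normalise(values, minimise=False):
    """Min-max scale to [0, 1]; invert minimisation measures (assumed as 1 - x)."""
    values = np.asarray(values, dtype=float)
    span = values.max() - values.min()
    scaled = (values - values.min()) / span if span > 0 else np.zeros_like(values)
    return 1.0 - scaled if minimise else scaled

rng = np.random.default_rng(0)
truth = np.repeat(np.arange(5), 100)    # 500 objects in 5 even clusters
solutions = [misclassify(truth, step / 100, rng) for step in range(20)]
```

Here `solutions[0]` is the optimal solution and `solutions[k]` has k% of the objects misclassified.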

For each solution, the Rand Index [107] is calculated in relation to the optimal solution. It is expected that as objects are misclassified the Rand Index should deteriorate in value. The correlation between each set of average results for a measure and the set of average results of the Rand Index for a given data set is calculated using Pearson’s correlation coefficient [128] across the 20 clustering solutions. Measures that are robust with respect to the deterioration of a solution should correlate well with the Rand Index, and the correlation should be consistent. In Figure 3.1 we show a simple example of the decrease in the value of the Rand Index as we apply the misclassification process to the classic Iris data set [2]. We start with the standard three-cluster solution and then degrade this solution. This is included for illustration only, as we do not use this data set in this work. In this example we have misclassified the objects in 1% steps as it is a very small data set.

[Plot: value of the Rand Index (y-axis, 0.7 to 1) against misclassification percentage (x-axis, 0 to 25).]

Figure 3.1: Example of the change of the Rand Index on the Iris data set as it is misclassified.
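The comparison against the optimal solution and the correlation step can be sketched as follows. The Rand Index is computed here directly from its pairwise definition (the fraction of object pairs on which two partitions agree), and Pearson's correlation is taken from `numpy.corrcoef`. The data are illustrative only: the degradation rule is a simplified stand-in, and `toy_measure` is an invented series standing in for a normalised internal quality measure, not a result from this work.

```python
import numpy as np

def rand_index(labels_a, labels_b):
    """Fraction of object pairs treated the same way by both partitions."""
    a = np.asarray(labels_a)
    b = np.asarray(labels_b)
    same_a = a[:, None] == a[None, :]
    same_b = b[:, None] == b[None, :]
    upper = np.triu_indices(a.size, k=1)   # each unordered pair once
    return float(np.mean(same_a[upper] == same_b[upper]))

def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length series."""
    return float(np.corrcoef(x, y)[0, 1])

rng = np.random.default_rng(0)
truth = np.repeat(np.arange(5), 100)       # 500 objects, 5 clusters
rand_series = []
for step in range(20):                     # 0% to 19% misclassified
    degraded = truth.copy()
    chosen = rng.choice(truth.size, size=5 * step, replace=False)
    degraded[chosen] = (truth[chosen] + 1) % 5   # simplified: move to the next cluster
    rand_series.append(rand_index(truth, degraded))

toy_measure = np.linspace(1.0, 0.0, 20)    # invented values, for illustration only
correlation = pearson(toy_measure, rand_series)
```

The pairwise formulation uses memory quadratic in the number of objects, which is acceptable for data sets of five hundred objects; a contingency-table formulation would scale better.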

As each data set design was generated three times, we averaged the correlations from the three generations of each design in order to assess each criterion (outliers, dimensions, density factor and number of clusters). This allows us to examine the results and isolate the behaviour that results from each of these factors.
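The averaging step can be sketched as a simple group-and-average over correlation records. The record layout, the function name `average_by` and the correlation values below are all invented for illustration; they are not results from this work.

```python
from collections import defaultdict
from statistics import mean

def average_by(records, factor):
    """Average the correlations of all records sharing a level of one factor."""
    groups = defaultdict(list)
    for record in records:
        groups[record[factor]].append(record["correlation"])
    return {level: mean(values) for level, values in sorted(groups.items())}

# Toy records: two designs, each generated three times (values are made up).
records = [
    {"outliers": 0.0, "dimensions": 2, "replicate": 1, "correlation": 1.0},
    {"outliers": 0.0, "dimensions": 2, "replicate": 2, "correlation": 0.75},
    {"outliers": 0.0, "dimensions": 2, "replicate": 3, "correlation": 0.5},
    {"outliers": 0.4, "dimensions": 2, "replicate": 1, "correlation": 0.5},
    {"outliers": 0.4, "dimensions": 2, "replicate": 2, "correlation": 0.25},
    {"outliers": 0.4, "dimensions": 2, "replicate": 3, "correlation": 0.0},
]
by_outliers = average_by(records, "outliers")
```

Grouping by a single factor while averaging over the others is what isolates the effect of that factor; because every design has the same number of generations, averaging per design first and then per factor level gives the same result.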