Chapter 2 Literature Review
2.3 Data Analysis
2.3.3 Data Mining and Machine Learning Methods
2.3.3.6 Clustering
Clustering attempts to identify patterns and trends in data by finding which groupings of cases exist and how those groups are defined (Jain, Murty, & Flynn, 1999). In doing so, it can derive classifications that were previously unknown, and a comparative analysis of the members of each cluster can then provide conditions for determining the cluster membership of future cases. This ability to find genuinely new classifications makes clustering a powerful data mining tool. It should be noted, however, that any new classifications discovered have no associated meaning: they simply provide evidence that classes exist within the data, and it is an expert’s responsibility to determine why this should be so and what its implications are (Jain et al., 1999).
k-Means Clustering
k-Means clustering is one of the simplest clustering methods, and many other clustering methods use the k-means approach as a template (Berkhin, 2006). k-Means works by repeated passes of the same procedure: each case in the dataset is assigned to the cluster that it is closest to, based on a multi-dimensional plot of all cases in which each attribute in the dataset describes one dimension. “Closeness” is a complex term in clustering, and is where most of the differentiation between methods lies: in k-means, closeness is decided by the distance between the case under consideration and the mean of all cases currently in the cluster, with the per-attribute differences combined as an average, total, or maximum depending on the implementation (Hartigan, 1975; MacQueen, 1967). Initially, the clusters are seeded by randomly assigning one case from the dataset to each cluster. Once all cases have been assigned to a cluster, the process is repeated, with the mean point of each cluster serving as the centre of a new, empty cluster. This is repeated until no case moves from one cluster to another between successive passes, at which point the clusters are deemed to have stabilised and the results are presented (Hartigan, 1975; MacQueen, 1967). The number of clusters (k) to create is chosen by the person running the clustering method. If a suitable value is not known, the clustering process can be run multiple times with different numbers of clusters in an attempt to find the best results (Hartigan, 1975; MacQueen, 1967).
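The procedure described above can be illustrated with a minimal sketch. It is not taken from any of the cited works: the function names, the use of squared Euclidean distance as the measure of closeness, and the max_passes safeguard are all assumptions made for illustration, and real implementations differ in each of these respects.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two cases (assumed closeness measure)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(cases):
    """Attribute-wise mean of a non-empty list of cases."""
    n = len(cases)
    return tuple(sum(column) / n for column in zip(*cases))


def k_means(cases, k, seed=None, max_passes=100):
    """Return (assignment, centres) for a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    # Seed each cluster with one randomly chosen case from the dataset.
    centres = rng.sample(cases, k)
    assignment = None
    for _ in range(max_passes):
        # Assign every case to the cluster whose centre is closest.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(case, centres[c]))
            for case in cases
        ]
        # Stop once no case has moved between successive passes.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # The mean point of each cluster becomes the centre for the next pass.
        for c in range(k):
            members = [case for case, a in zip(cases, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centre
                centres[c] = mean_point(members)
    return assignment, centres
```

Calling k_means(cases, k=2, seed=0) on a list of numeric tuples returns one cluster label per case together with the final cluster centres. Changing the seed changes which cases seed the clusters, and so may change the result.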
Clustering Limitations
The biggest flaw of k-means, and of most clustering methods, is that while they can find completely new class groupings, they require the user to specify how many of these groups to look for (Hartigan, 1975; Witten & Frank, 2005). This requires some understanding of what the results are likely to be before the process is run, which severely dampens the benefit of discovering new class groupings. The problem is exacerbated by the second major flaw of clustering methods: they are expensive in terms of time and processing power, particularly if the number and nature of the clusters being looked for are uncertain (Hartigan, 1975; Witten & Frank, 2005). Clustering works effectively and relatively efficiently with a small number of attributes, cases and clusters; however, as these numbers rise the processing time increases dramatically, in many cases to the point where the method becomes unusable (Witten & Frank, 2005). The time required falls as more becomes known about the clusters being searched for: restricting the search space to a small number of attributes, or weighting important attributes more heavily than others, will substantially improve the speed and efficiency of the process, as will reducing the number of cases examined or specifying the approximate number of clusters to search for.
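For the k-means procedure specifically, a rough operation count illustrates why each of these parameters matters. This is a back-of-the-envelope sketch rather than a figure from the cited works, and other clustering methods scale differently. The symbols are assumptions introduced here: n cases, d attributes, k clusters, p passes until the clusters stabilise, and K as a hypothetical upper bound on the number of clusters tried when k is unknown.

```latex
% One pass compares every case with every cluster centre on every attribute.
\[
C_{\text{pass}} \approx n \cdot k \cdot d,
\qquad
C_{\text{run}} \approx p \cdot n \cdot k \cdot d
\]
% If k is unknown and each candidate value from 1 to K is run once,
% with p_k passes needed for the run using k clusters:
\[
C_{\text{search}} \approx \sum_{k=1}^{K} p_k \cdot n \cdot k \cdot d
\]
```

On this estimate, reducing the number of attributes, cases or candidate cluster counts each reduces the total work, and uncertainty about k multiplies the cost of a single run.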
Clustering methods are usually non-deterministic, as the random assignment of cases to the initial clusters determines how the final clusters are formed. However, with stringent stabilisation requirements it is generally assured that if clusters exist within the data being analysed, they will be discovered (Jain et al., 1999). This remains a drawback of the method, as there is no guarantee that the results are the best possible, and some doubt will always remain.
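One common way of reducing, though not removing, this doubt is to repeat the clustering from several random starting points and keep the tightest result. The sketch below is illustrative only. The names within_cluster_ss and best_of_restarts are hypothetical, as is cluster_fn, which stands for any clustering routine that accepts (cases, k, seed) and returns an (assignment, centres) pair. The within-cluster sum of squares is assumed as the measure of tightness.

```python
def within_cluster_ss(cases, assignment, centres):
    """Total squared distance from each case to the centre of its cluster."""
    return sum(
        sum((x - c) ** 2 for x, c in zip(case, centres[label]))
        for case, label in zip(cases, assignment)
    )


def best_of_restarts(cluster_fn, cases, k, seeds):
    """Run cluster_fn once per seed and keep the lowest-scoring result.

    cluster_fn is a hypothetical callable: cluster_fn(cases, k, seed)
    must return (assignment, centres).
    """
    best = None
    for seed in seeds:
        assignment, centres = cluster_fn(cases, k, seed)
        score = within_cluster_ss(cases, assignment, centres)
        if best is None or score < best[0]:
            best = (score, assignment, centres)
    return best
```

The result is only the best of the runs that were actually tried. It is still not guaranteed to be the best possible clustering, which is the limitation described above.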
A further problem is that clustering can often be inconclusive, as methods generally provide little distinction between obvious, strong clusters and weak ones. The clustering algorithm runs only until the clusters have stabilised and been defined; it does not report how well these clusters would hold over larger amounts of data, or how reliably they can be defined. This must be judged in advance by considering the domain and concluding whether strong clusters are likely to exist (Jain et al., 1999). The end result of these drawbacks is that, for a clustering method to be effective, there usually needs to be a significant level of expert involvement and application of domain knowledge. Without this, a clustering method becomes a blind search, likely to take significant time only to discover clusters that are uncertain and uninformative. Unfortunately, this means that while clustering can find new knowledge and entirely new classifications, doing so effectively requires that much of the nature of those classifications is already known. Clustering is therefore a method that lends itself to quantifying relationships that are known or expected to exist, rather than to discovering new knowledge.