2.2 Data Analytics
2.2.3 Algorithms used in MBA
2.2.3.2 Algorithms used in clustering
Clustering remains a fundamental process in data mining today and is widely used in several applications [33][94][119]. However, based on the discussion in Section 2.2.2.2, algorithms used for item or product clustering were not considered further in this study; the focus of this section is therefore on algorithms for customer or transaction clustering. Several studies, including [70], [119], [162], and [175], noted that the K-Means (KM) and Fuzzy C-Means (FCM) clustering algorithms remain the most popular and widely used approaches today. KM and FCM are therefore the two algorithms reviewed here.
K-Means Clustering
K-Means clustering is well documented in [160]. The underlying principle of KM is as follows: a number of clusters is chosen, and an initial value is defined for each cluster centroid. Each data point is then assigned to its closest centroid, and an iterative process begins in which the centroids are recalculated after each assignment step and the data points are reassigned. The process stops when no data point changes cluster [160][175].
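The iteration described above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the implementation used in this study or in [160]; the names `kmeans`, `initial_centroids` and `max_iter` are hypothetical, and leaving an empty cluster's centroid unchanged is a simplifying assumption.

```python
import math


def kmeans(points, initial_centroids, max_iter=100):
    """Basic K-Means sketch: returns (centroids, labels)."""
    centroids = [list(c) for c in initial_centroids]
    k = len(centroids)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its closest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Stop when no data point changes cluster.
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[j] = [sum(axis) / len(members) for axis in zip(*members)]
    return centroids, labels


# Illustrative data: two well-separated groups of three points each.
data = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
centroids, labels = kmeans(data, initial_centroids=[data[0], data[3]])
```

Here the initial centroids are passed in explicitly, which keeps the sketch deterministic and leaves the choice of initialisation strategy to the caller.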
While implementing a KM-based algorithm may be relatively straightforward, several previous works, including [33], [140], [160], and [175], have noted that the method has three main drawbacks, which can sometimes be resolved:
1. Randomly choosing the initial centroids: choosing the initial centroids randomly generally produces poor results, and this can be exacerbated by performing multiple runs with the same data [140][160]. Whilst there are several techniques to overcome this issue, the two best techniques were: incremental updates of centroids, where the centroid is updated after each additional point rather than once all points are added, and K-Means bisection to produce the initial centroids, where the initial data set is divided into two clusters and then each cluster is bisected further [160].
2. Empty clusters: it is possible to have empty clusters as a result of the initial choice of centroids, and this is unlikely to resolve itself during the iteration process. One way of overcoming this is to manually remove that choice of centroid and select a replacement centroid from the cluster that has the largest data spread, thus splitting this cluster and compacting the overall clustering [160].
3. Outliers: outliers can influence the effectiveness of the clustering process, and the typical approach to this problem has been to find outlying data points in advance and eliminate them [160]. However, outliers may be significant in some applications, and care should be taken not to eliminate these [160]. Everyday examples include financial analysis, where highly profitable customers or fraudulent transactions may show up as outliers, and potential criminal activity, where a large purchase of an item that is otherwise bought in smaller quantities (e.g. nails or screws) may be treated as an anomaly to be eliminated when in reality it is a signal of a major threat in progress.
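The empty-cluster remedy in item 2 can be sketched as follows. This is one possible reading of that description, not the procedure given in [160]: it assumes that "data spread" is measured as the sum of squared distances to the centroid, and that the replacement centroid is the point of the widest cluster that lies farthest from that cluster's centroid. The function name `repair_empty_clusters` is hypothetical.

```python
import math


def repair_empty_clusters(points, centroids, labels):
    """Return a copy of centroids with empty clusters re-seeded."""
    centroids = [list(c) for c in centroids]
    k = len(centroids)
    used = set()  # indices of points already reused as replacement centroids
    for j in range(k):
        if j in labels:
            continue  # cluster j has members, nothing to repair
        # Spread of every cluster, measured here as SSE (an assumption).
        sse = [0.0] * k
        for p, lab in zip(points, labels):
            sse[lab] += math.dist(p, centroids[lab]) ** 2
        widest = max((c for c in range(k) if c in labels), key=lambda c: sse[c])
        candidates = [
            i for i, lab in enumerate(labels) if lab == widest and i not in used
        ]
        if not candidates:
            continue  # nothing left to borrow from the widest cluster
        # Re-seed the empty cluster with the widest cluster's farthest point.
        far = max(candidates, key=lambda i: math.dist(points[i], centroids[widest]))
        centroids[j] = list(points[far])
        used.add(far)
    return centroids
```

After the repair, the assignment step would be run again so that the widest cluster is split between its old centroid and the new one.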
Fuzzy C-Means Clustering
FCM, first introduced in [18] and based on the work in [44], was created to overcome some of the problems commonly associated with the crisp clustering approach of KM. These include those noted earlier, as well as the need for multiple passes to improve clustering accuracy [33][70][94]. Unlike KM, FCM uses a soft clustering approach in which data points on the boundaries are not forced into a single cluster; instead, they are allowed to be members of multiple clusters with varying degrees of membership, such that the total membership of a data point across all clusters equals one. This approach not only improves clustering accuracy, but also closely resembles everyday life [33][119]. Several other fuzzy clustering algorithms exist, but FCM remains popular as it is relatively stable, reliable and fast [119]. However, there are three main, well-known problems with FCM:
1. CPU usage as a result of speed: the speed benefits of FCM were noted to come at a high computational cost, in particular for large data sets, and several variations of FCM have been proposed over the years to improve on this [119]. One approach, taken in [32] to optimise CPU usage, proved to be effective in reducing the CPU usage by a sixth. This was achieved by using a look-up table to determine an approximate value for the Euclidean distance calculations as opposed to computing the exact value.
2. Too many iterations as a result of sub-optimally selecting the fuzzifier “m” [119][162][174]. The generally used value of 2 for the fuzzifier “m” is not optimal for all applications, and a sub-optimal “m” can be time-consuming due to the increased number of iterations required to reach convergence. However, a large number of applications use “m” = 2, and this value will be used in this study as well [33][162].
3. Choosing too many initial clusters: Winkler et al. [174] noted that the performance of FCM was weak when the number of initial clusters was high, at approximately 100. To combat this, a polynomial was introduced to FCM, which was shown to reduce the impact of choosing a large number of initial clusters [174]. It should be noted that for this study, the total number of clusters is not expected to exceed twelve; hence the issue of too many clusters is not going to pose a problem.
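To make the soft-membership idea and the role of the fuzzifier “m” concrete, the standard FCM update rules can be sketched as below. This is a minimal standard-library illustration under stated assumptions, not the implementation used in this study: the function names, the tolerance `tol` and the explicit `initial_centers` argument are hypothetical, and none of the speed-ups from [32] or [174] are included.

```python
import math


def fcm_memberships(points, centers, m=2.0):
    """Membership of every point in every cluster; each row sums to one."""
    power = 2.0 / (m - 1.0)  # requires m > 1
    rows = []
    for p in points:
        dists = [math.dist(p, c) for c in centers]
        if 0.0 in dists:
            # Point coincides with a centre: full membership there.
            hits = [1.0 if d == 0.0 else 0.0 for d in dists]
            rows.append([h / sum(hits) for h in hits])
        else:
            inv = [d ** -power for d in dists]
            rows.append([v / sum(inv) for v in inv])
    return rows


def fcm_centers(points, memberships, m=2.0):
    """Each centre is the mean of all points, weighted by membership ** m."""
    dims = len(points[0])
    centers = []
    for j in range(len(memberships[0])):
        weights = [row[j] ** m for row in memberships]
        total = sum(weights)
        centers.append([
            sum(w * p[d] for w, p in zip(weights, points)) / total
            for d in range(dims)
        ])
    return centers


def fuzzy_c_means(points, initial_centers, m=2.0, tol=1e-6, max_iter=100):
    """Alternate the two updates until the centres stop moving."""
    centers = [list(c) for c in initial_centers]
    for _ in range(max_iter):
        u = fcm_memberships(points, centers, m)
        new_centers = fcm_centers(points, u, m)
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return centers, fcm_memberships(points, centers, m)


# Illustrative data: two groups plus one boundary point at (4.5, 4.5).
data = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (4.5, 4.5),
        (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
centers, u = fuzzy_c_means(data, initial_centers=[(1.0, 1.0), (8.0, 8.0)])
```

In this example the boundary point receives a membership of roughly one half in each cluster rather than being forced into one of them, while points deep inside a group have a membership close to one in their own cluster.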
Based on the above, it was concluded that whilst FCM is not without its problems, its accuracy is superior to that of KM, and FCM hence formed the basis for clustering in this study [33].