Flat or Partitional Clustering

2.2.4.2 Flat or Partitional Clustering

Flat or partitional clustering attempts to determine a number of partitions that optimise an objective function, or a cluster quality measure. Cluster optimisation is an iterative process [Jain et al., 1999]. Unlike hierarchical methods, where clusters are formed in a single pass, partitioning algorithms operate on a gradual improvement mechanism. Depending on the quality of the clusters formed, further iterations are computed, a process which continues until either a maximum number of iterations has been reached or the improvement between iterations falls below a specified threshold. This approach returns higher-quality clusters, but it is more computationally expensive than the single-pass hierarchical approach. There are two main subtypes in the field of partitional clustering, namely centroid-based clustering and probabilistic clustering [Jain and Dubes, 1988].

Centroid-based clustering, K-Means: In centroid-based clustering a number of partitions are generated, and for this configuration of partitions an objective function is used which represents overall cluster quality. Seeds are used as initial cluster placements. Each of the remaining data points is then allocated to a “seed point”, thus forming the clusters. The most popular clustering algorithm of this type is the K-Means algorithm.

The term “K-Means” was first used by MacQueen [MacQueen, 1967], with the original idea itself being proposed earlier by Steinhaus [Steinhaus, 1956]. In K-Means, the objective is to minimise the squared distances from the mean. K centroids are chosen, each representing a seed point, with each seed point also being a cluster centre, hence the name K-Means. For a set of observations, or data points, $x_1, x_2, \ldots, x_N$, with each observation being a d-dimensional real vector, the nearest cluster centroid is found using a distance function. This determines cluster membership. In most cases the distance function is the Euclidean distance. Other variations apply the Minkowski or Mahalanobis metrics (section 2.2.2.2). This process is repeated until all points have been assigned to a centroid. When this occurs, K new centroids are calculated [Grira et al., 2004]. Thus N data points are partitioned into K disjoint subsets $S_j$, each containing $N_j$ data points, in such a way that the sum of squared errors (SSE) is reduced to a minimum, with the objective function expressed as equation (2.19):

$$ J = \sum_{j=1}^{K} \sum_{n \in S_j} \| x_n - \mu_j \|^2 \tag{2.19} $$

where $x_n$ is the vector representing the n-th data point and $\mu_j$ is the geometric centroid of the data points in $S_j$.
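The assignment and update steps described above, together with the objective J of equation (2.19), can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than a reference implementation: the function names (k_means, sse and the helpers), the convergence tolerance tol, the iteration cap max_iter, and the choice to keep the old centroid when a cluster becomes empty are all assumptions introduced here and do not come from the cited sources. Seeds are drawn at random from the data and the Euclidean distance is used.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def mean_vector(points):
    """Component-wise mean (geometric centroid) of a non-empty list of vectors."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def sse(points, centroids, labels):
    """Objective J of equation (2.19): squared distance of each point to its centroid."""
    return sum(squared_distance(p, centroids[j]) for p, j in zip(points, labels))


def k_means(points, k, max_iter=100, tol=1e-9, seed=0):
    """Batch K-Means sketch; returns (centroids, labels, final SSE)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random seed points taken from the data
    labels = []
    previous_j = float("inf")
    current_j = float("inf")
    for _ in range(max_iter):
        # Assignment step: every point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[j] = mean_vector(members)
        current_j = sse(points, centroids, labels)
        # Stop once the improvement between iterations is below the threshold.
        if previous_j - current_j < tol:
            break
        previous_j = current_j
    return centroids, labels, current_j


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
            (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
    centres, assignment, j_value = k_means(data, k=2)
    print(centres, assignment, j_value)
```

Both stopping rules mentioned earlier appear here: the loop ends either after max_iter iterations or when the decrease in J falls below tol.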

K-Means is a straightforward algorithm. The way clustering is performed in K-Means means that the order in which the data are entered does not influence the final clusters formed. Despite this, it suffers from a number of shortcomings, mainly revolving around the pre-selection of the seed points. In the standard approach, random seeds are selected; however, numerous studies demonstrate that improved results and quicker convergence are achieved with an appropriate seed selection mechanism (see [Pavan et al., 2012]). Different algorithms produce different seed selections, potentially causing a combinatorial explosion problem. K-Means also requires real-valued data, lacks scalability, is sensitive to outliers, and has an objective function that can be misleading when contrasted with the entire spatial context [Berkhin, 2006]; all of these are considered deficiencies of the algorithm. Other variations modify the SSE objective, the most popular being Fuzzy C-Means [Bezdek et al., 1984]. These fuzzy methods tend to be more successful at avoiding local minima.
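To illustrate how a fuzzy method differs from the hard assignment of K-Means, the sketch below computes the degrees of membership of a single point in every cluster, using the membership update commonly given for Fuzzy C-Means. It is an illustration under assumptions, not the formulation of the cited paper verbatim: the function name fuzzy_memberships, the default fuzzifier m = 2, and the handling of a point that coincides with a centre are choices made here.

```python
import math


def fuzzy_memberships(point, centres, m=2.0):
    """Degrees of membership of one point in every cluster; they sum to 1.

    Uses u_j = 1 / sum_k (d_j / d_k) ** (2 / (m - 1)), where d_j is the
    Euclidean distance from the point to centre j and m > 1 is the fuzzifier.
    """
    dists = [math.dist(point, c) for c in centres]
    # Assumption: a point lying exactly on one or more centres belongs wholly to them.
    zero = [j for j, d in enumerate(dists) if d == 0.0]
    if zero:
        return [1.0 / len(zero) if j in zero else 0.0 for j in range(len(centres))]
    power = 2.0 / (m - 1.0)
    return [
        1.0 / sum((d_j / d_k) ** power for d_k in dists)
        for d_j in dists
    ]


if __name__ == "__main__":
    # The point is three times closer to the first centre than to the second.
    print(fuzzy_memberships((1.0, 0.0), [(0.0, 0.0), (4.0, 0.0)]))
```

Where K-Means would assign this point entirely to the first cluster, the fuzzy membership gives roughly 0.9 to the first cluster and 0.1 to the second, so every point contributes to every centroid in proportion to its membership.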

Other algorithms operate on a similar iterative mechanism. However, the allocation of a data point to a cluster is based on a probability distribution rather than the distance from the mean.

Probabilistic Clustering: In Expectation Maximization (EM) each data point has a probability of belonging to each cluster [Dempster et al., 1977]. The algorithm presumes that there is a statistical distribution, that is, a probability density function (PDF), that can be approximated over a cluster distribution. The EM algorithm is also an iterative procedure; it computes the Maximum Likelihood (ML) estimate in the presence of missing or hidden data. The ML estimate gives the model parameters which are most likely for the data points presented. Each EM iteration consists of two steps: an expectation step, which probabilistically assigns points to clusters, and a maximisation step, which estimates the model parameters that maximise the likelihood for the given assignment of points.

Convergence is assured since the algorithm is guaranteed not to decrease the likelihood at any iteration [Jain and Dubes, 1988], although the solution reached may be only a local maximum. The key lies within the ML estimation, whose aim is to find the parameters that maximise the likelihood of the data, yielding the PDFs which best describe (approximate) the clusters being sought.
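The two EM steps can be made concrete with a small sketch for a one-dimensional mixture of Gaussians. Several things here are assumptions made for illustration and are not taken from the cited sources: the choice of Gaussian components, the function name em_gmm_1d, the stopping tolerance tol, and the small variance floor that guards against a component collapsing onto a single point. The function also records the log-likelihood at every iteration, so the behaviour described above can be inspected: the recorded values should never decrease.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def em_gmm_1d(data, means, variances, weights, max_iter=200, tol=1e-8):
    """EM sketch for a 1-D Gaussian mixture.

    Returns (means, variances, weights, trace), where trace holds the
    log-likelihood evaluated at the start of each iteration.
    """
    k = len(means)
    means, variances, weights = list(means), list(variances), list(weights)
    trace = []
    for _ in range(max_iter):
        # Expectation step: probability of each point belonging to each cluster.
        resp = []
        log_lik = 0.0
        for x in data:
            joint = [weights[j] * gaussian_pdf(x, means[j], variances[j])
                     for j in range(k)]
            total = sum(joint)
            log_lik += math.log(total)
            resp.append([v / total for v in joint])
        trace.append(log_lik)
        if len(trace) > 1 and trace[-1] - trace[-2] < tol:
            break
        # Maximisation step: re-estimate parameters from the soft assignments.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            spread = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(spread, 1e-6)  # assumption: floor avoids zero variance
            weights[j] = nj / len(data)
    return means, variances, weights, trace


if __name__ == "__main__":
    sample = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
    mu, var, pi, history = em_gmm_1d(sample, [0.0, 6.0], [1.0, 1.0], [0.5, 0.5])
    print(mu, var, pi)
    print(history)
```

As with K-Means, the result depends on the initial parameters, and the procedure can settle in a local maximum of the likelihood.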

Traditional partitioning clustering algorithms tend to define a cluster by proximity to the locus of the cluster, a point where each parameter value is the mean of the corresponding parameter values of all the points in the cluster. This approach favours recognition of spherical shapes, whilst its weakness is the recognition of non-spherical shapes and outliers. Other methodologies exist which compensate for the drawbacks of both the K-Means and EM algorithms; these include density-based algorithms.