Partitioning Clustering Algorithms - Knowledge discovery in high dimensional data

Hierarchical clustering algorithms have been shown to result in high-quality partitions, especially for applications involving the clustering of text collections. Nonetheless, their high computational requirements usually prevent their usage in real-life applications, where the number of samples and their dimensionality are expected to be high (the computational cost is quadratic in the number of samples).

Amongst the class of divisive hierarchical clustering algorithms, the Principal Direction Divisive Partitioning (PDDP) algorithm is of great value due to its very low computational complexity in comparison with other algorithms of the same class. We will further analyse the PDDP algorithm in Section 4.1.5.

§ 3.2 Partitioning Clustering Algorithms

Partitioning clustering algorithms create a flat partitioning of the data instead of building hierarchies. The main advantage of this class of algorithms is that they can be applied to very large data sets.

The most popular partitioning clustering algorithm is k-means, which, starting from k centres, iteratively assigns each data point to the cluster whose centroid is closest to the point in Euclidean distance [HW79].

The iteration terminates when none of the data points changes cluster or, equivalently, when the centroids do not change significantly. Spherical k-means [DM01] is a recently proposed modification of the algorithm that reduces k-means to the partitioning of the unit hypersphere by normalising the data points. Algorithms that belong to the same class as k-means can give adequate clustering results at low cost, since the running time of each iteration is proportional to k · n. However, their results depend heavily on their initialisation.

Another similar approach is Gaussian Mixture Models (GMM) [MP00], where k multivariate normal density components are combined, under the assumption that each component represents a cluster. As in k-means, an iterative algorithm, typically Expectation Maximization (EM), is used to fit the parameters of each density to the data. The posterior probability of each component given a data point then indicates the probability that the point belongs to the corresponding cluster. GMM may be more appropriate than the k-means clustering algorithm when clusters have different sizes and correlated variables exist, but more control parameters need to be estimated. This makes their application in high dimensions almost prohibitive.
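As a minimal sketch of the normalisation that spherical k-means relies on, the following Python snippet projects data points onto the unit hypersphere and compares them by cosine similarity. The function names `normalise` and `cosine_similarity` and the sample data are illustrative assumptions and are not taken from [DM01].

```python
import math

def normalise(point):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in point))
    if norm == 0.0:
        return list(point)
    return [x / norm for x in point]

def cosine_similarity(u, v):
    """Dot product of two unit vectors, i.e. the cosine of the angle between them."""
    return sum(a * b for a, b in zip(u, v))

# Hypothetical sample data: the first two points differ only in length.
docs = [[3.0, 4.0], [6.0, 8.0], [4.0, -3.0]]
unit_docs = [normalise(d) for d in docs]

# After normalisation only the direction matters, so the first two points coincide.
print(unit_docs[0], unit_docs[1])
# The third point is orthogonal to the first, so their similarity is (close to) zero.
print(cosine_similarity(unit_docs[0], unit_docs[2]))
```

Once every point has unit length, the magnitude of a data vector (for example, the length of a document) no longer influences which cluster it joins.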

3.2.1 The k-means algorithm

The basic concept of the k-means algorithm is that each cluster is represented by a particular point called its center. To split the data set into k clusters, the k centers $P_j$, $j = 1, \dots, k$, are initialised randomly. Then, the algorithm assigns each data point to the cluster whose center is closest to it. This assignment is based on the following equation for every point $d_i$:

$$
\mu_j(d_i) =
\begin{cases}
1 & \text{if } \|d_i - P_j\| \le \|d_i - P_l\| \quad \forall\, l \neq j, \\
0 & \text{otherwise.}
\end{cases}
\tag{3.2}
$$

Usually, the Euclidean distance is used as the proximity measure when calculating the distance $\|d_i - P_l\|$. When all data points have been assigned to clusters, the algorithm calculates the new centers as the centroids of the data points of each cluster:

$$
P_j = \frac{\sum_{i=1}^{n} \mu_j(d_i)\, d_i}{\sum_{i=1}^{n} \mu_j(d_i)}.
\tag{3.3}
$$

These steps are applied iteratively until the membership of the clusters no longer changes or until the error function E (Equation 3.2.1) no longer changes significantly (convergence). In brief, the k-means algorithm can be summarized as follows:

1. Initialize the k centers in the dataset.

2. Assign each data point to its closest center.

3. Calculate the new centers.

4. Repeat steps 2 and 3 until convergence is achieved.
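The four steps above can be sketched in plain Python as follows. This is a minimal illustration rather than a reference implementation: the function name `kmeans`, the choice of k distinct data points as initial centers, the handling of empty clusters, and the sample data are all assumptions made for the example.

```python
import math
import random

def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def kmeans(points, k, max_iter=100, seed=0):
    """Return (centers, labels) after iterating assignment and update steps."""
    rng = random.Random(seed)
    # Step 1: initialise the k centers with k distinct data points (an assumption).
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Step 2: assign each point to its closest center (Equation 3.2).
        new_labels = [
            min(range(k), key=lambda j: euclidean(p, centers[j])) for p in points
        ]
        # Step 4: stop as soon as no point changes cluster.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 3: move each center to the centroid of its cluster (Equation 3.3).
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = [sum(c) / len(members) for c in zip(*members)]
    return centers, labels

# Hypothetical data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, labels = kmeans(points, k=2)
print(centers)
print(labels)
```

On data this well separated, every choice of two distinct starting points leads to the same two groups. On less clear-cut data the result can change with `seed`.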

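In general, the algorithm can be described as an optimization procedure for the objective function, which for the Euclidean distance is the sum of squared distances between each point and the center of its cluster:

$$
E = \sum_{j=1}^{k} \sum_{i=1}^{n} \mu_j(d_i)\, \|d_i - P_j\|^2 .
$$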

Although it can be proved that the process always converges, the partitioning it converges to does not necessarily correspond to the global minimum of the objective function.
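A small worked example makes the gap between convergence and global optimality concrete. The sketch below, with hypothetical helper names and toy data, takes the four corners of a 4 × 1 rectangle and checks two different 2-cluster partitions. In both, no point is closer to the other cluster's centroid, so k-means would stop at either one, yet their objective values differ by a factor of 16.

```python
def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    return tuple(sum(c) / len(cluster) for c in zip(*cluster))

def sq_dist(p, q):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def objective(clusters):
    """Sum of squared distances of every point to its own cluster centroid."""
    return sum(sq_dist(p, centroid(c)) for c in clusters for p in c)

def is_fixed_point(clusters):
    """True if no point is strictly closer to another cluster's centroid."""
    centers = [centroid(c) for c in clusters]
    return all(
        sq_dist(p, centers[j]) <= min(sq_dist(p, c) for c in centers)
        for j, cluster in enumerate(clusters)
        for p in cluster
    )

# Corners of a 4 x 1 rectangle, split into k = 2 clusters in two ways.
left_right = [[(0.0, 0.0), (0.0, 1.0)], [(4.0, 0.0), (4.0, 1.0)]]
bottom_top = [[(0.0, 0.0), (4.0, 0.0)], [(0.0, 1.0), (4.0, 1.0)]]

for name, clusters in [("left/right", left_right), ("bottom/top", bottom_top)]:
    print(name, objective(clusters), is_fixed_point(clusters))
# left/right 1.0 True
# bottom/top 16.0 True
```

Which of the two partitions is reached depends only on the initial centers, which is why k-means is often run several times from different initialisations.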

Figure 3.4 displays a simple 2-dimensional example of the k-means algorithmic procedure. The actual clusters are found in four algorithmic steps.

The k-means algorithm is simple and quite efficient in most cases. There are also many variants that improve its performance and are less susceptible to initialization problems. However, k-means is not suitable in cases where clusters are not globular or vary in size and density. In addition, k-means has trouble dealing with data sets that contain outliers. For this reason, outlier removal methods are employed in such cases.


Figure 3.4: Four iterations of the k-means algorithmic procedure.

3.2.2 The Fuzzy c-means algorithm

One of the most widely used fuzzy clustering algorithms is the Fuzzy C-Means (FCM) algorithm. FCM is a clustering method that assigns each data point to a cluster to some degree that is specified by a membership grade. This technique was originally introduced by Jim Bezdek in [Bez81] as an improvement on earlier clustering methods and attempts to partition a finite collection of data vectors into a collection of fuzzy clusters with respect to some given criterion. A theoretical discussion of FCM can be found in [Cox05].

Given a finite set of data, the algorithm returns a list of cluster centers and a partition matrix indicating the degree to which each element belongs to a given cluster. Like the k-means algorithm, FCM aims to minimize an objective function such as the following:

$$
J_m = \sum_{i=1}^{N} \sum_{j=1}^{C} u_{ij}^{m}\, \|d_i - P_j\|^2, \qquad 1 \le m < \infty,
\tag{3.6}
$$

where m is any real number greater than 1, N is the number of data points, C is the number of clusters, $u_{ij}$ is the degree of membership of $d_i$ in cluster j, $d_i$ is the i-th a-dimensional data point, $P_j$ is the a-dimensional center of cluster j, and $\|\cdot\|$ is any norm expressing the similarity between a data point and a center. Fuzzy partitioning is carried out through an iterative optimization of the objective function shown above, with the memberships $u_{ij}$ and the cluster centers $P_j$ updated by:

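$$
u_{ij} = \frac{1}{\sum_{l=1}^{C} \left( \dfrac{\|d_i - P_j\|}{\|d_i - P_l\|} \right)^{\frac{2}{m-1}}},
\tag{3.7}
$$

where

$$
P_j = \frac{\sum_{i=1}^{N} u_{ij}^{m}\, d_i}{\sum_{i=1}^{N} u_{ij}^{m}}.
$$

This iteration stops when

$$
\max_{ij} \left\{ \left| u_{ij}^{(l+1)} - u_{ij}^{(l)} \right| \right\} \le \varepsilon,
\tag{3.8}
$$

where l is the iteration number and $\varepsilon$ is a constant between 0 and 1 that controls the termination of the algorithm. This procedure converges to a local minimum or a saddle point of $J_m$.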
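The FCM iteration can be sketched in plain Python as below. This is an illustrative sketch under stated assumptions, not the formulation of [Bez81]: the function name `fuzzy_c_means`, the random initialisation of the membership matrix, the default values `m=2.0` and `eps=1e-5`, the special case for a point that coincides with a center, and the sample data are all choices made for the example.

```python
import math
import random

def dist(p, q):
    """Euclidean norm of p - q (any other norm could be substituted)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def fuzzy_c_means(data, c, m=2.0, eps=1e-5, max_iter=300, seed=0):
    """Return (centers, u), where u[i][j] is the membership of data[i] in cluster j."""
    rng = random.Random(seed)
    n = len(data)
    # Assumed initialisation: random memberships, each row normalised to sum to 1.
    u = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(c)]
        total = sum(row)
        u.append([x / total for x in row])
    centers = []
    power = 2.0 / (m - 1.0)
    for _ in range(max_iter):
        # Center update: mean of the data weighted by u_ij^m.
        centers = []
        for j in range(c):
            w = [u[i][j] ** m for i in range(n)]
            total = sum(w)
            centers.append([sum(wi * x for wi, x in zip(w, col)) / total
                            for col in zip(*data)])
        # Membership update (Equation 3.7).
        new_u = []
        for p in data:
            d = [dist(p, ctr) for ctr in centers]
            if min(d) == 0.0:
                # The point coincides with a center: full membership there (assumption).
                hit = d.index(0.0)
                new_u.append([1.0 if j == hit else 0.0 for j in range(c)])
            else:
                new_u.append([1.0 / sum((d[j] / d[l]) ** power for l in range(c))
                              for j in range(c)])
        # Termination test (Equation 3.8): largest change in any membership.
        change = max(abs(a - b) for old, new in zip(u, new_u) for a, b in zip(old, new))
        u = new_u
        if change <= eps:
            break
    return centers, u

# Hypothetical data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, u = fuzzy_c_means(points, c=2)
for p, row in zip(points, u):
    print(p, [round(x, 3) for x in row])
```

Each row of `u` sums to 1, and a hard partition can be read off by taking the cluster with the largest membership in each row. Larger values of `m` make the memberships softer.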
