A Review: An Approach of Different Types of Clustering Methods for Data Mining


Biswajit Basak, IJRIT 100 International Journal of Research in Information Technology (IJRIT)


www.ijrit.com

ISSN 2001-5569


1Biswajit Basak, 2Soumya Bag, 3Sourabh Biswas, 4Sudipto Basak

1Assistant Professor, Dept. of ECE, Hooghly Engg. & Technology College, Hooghly, India
2,3B.Tech student, Dept. of ECE, Hooghly Engg. & Technology College, Hooghly, India
4M.Tech student, Computer Science & Engineering, University of Kalyani, Kalyani, India

ABSTRACT: Clustering is now widely used in research fields such as classification and system modeling, and several well-known data clustering algorithms are available. Clustering is an approach to unsupervised learning that generates class representatives, or prototypical objects, for the subsequent development of a decision system. The number of classes is, in general, less than or equal to the number of clusters; that is, a single class of objects may comprise more than one cluster. In this paper we compare data clustering algorithms, namely K-Means, Fuzzy C-Means (FCM) and Maximum of Minimum (Maximin), and demonstrate the results experimentally using MATLAB.

Keywords: K-Means, FCM, Maximin

I. Introduction

Clustering techniques can be used for learning as well as for classification in an unsupervised manner. The basic objective of any clustering technique is to divide the data points of the feature space into a number of groups, or clusters, so that some clustering criteria are satisfied. The criteria usually include inter-class distance, intra-class distance and the density of points within a class. A lucid and thorough discussion of clustering techniques may be found in [Tou and Gonzalez (1974), Andrews (1972), Duda and Hart (1973)] and other books on pattern recognition.

Here we present two widely used methods: the maximin algorithm and the K-means (or c-means) algorithm. The first determines the number of possible clusters that satisfy some clustering criteria, while the second, given the number of clusters, finds the best possible grouping of the data into that many groups by satisfying some criteria.

Figure 1: Block Diagram of Clustering Method


In Fig. 1, the Feature Selection block identifies the most effective subset of the original features to use in clustering. The Feature Extraction block applies transformations to the input features to produce new salient features. Inter-pattern similarity is measured by a distance function defined on pairs of patterns. Grouping places similar patterns in the same cluster.
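The inter-pattern similarity stage can be illustrated with a small sketch. It assumes Euclidean distance as the distance function and uses three made-up two-dimensional patterns; the function names `euclidean` and `distance_matrix` are illustrative and do not come from the paper.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def distance_matrix(patterns):
    """Symmetric matrix of pairwise distances between all patterns."""
    return [[euclidean(p, q) for q in patterns] for p in patterns]


# Three illustrative patterns (made-up values, not data from the paper).
patterns = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
dist = distance_matrix(patterns)
print(dist[0][1], dist[0][2])  # 5.0 10.0
```

A small distance between two patterns means they are similar, so the grouping stage would tend to place them in the same cluster.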

II. Maximin algorithm

The maximin (maximum-of-minimum) distance algorithm is a heuristic method that may be used to determine the number of clusters to which the data points belong, so that the inter-class distance is maximized. Suppose there is a collection $\Phi$ of $n$ feature vectors corresponding to $n$ objects or patterns, i.e. $\Phi = \{f_0, f_1, \ldots, f_{n-1}\}$, and $K$ classes $c_0, c_1, \ldots, c_{K-1}$, where $K \ll n$.

Algorithm:-

Input: Set of feature vectors $\Phi = \{f_0, f_1, \ldots, f_{n-1}\}$ and a constant $p$.

Step 1: Select an arbitrary feature vector from the set $\{f_i\}$ as the first cluster center $v_0$.

Step 2: Find the feature vector farthest from $v_0$ and select it as the second cluster center $v_1$.

Repeat Steps 3 to 6 until no new cluster center is found:

Step 3: Calculate the average distance $d$ between the cluster centers $\{v_i \mid i = 0, 1, \ldots\}$.

Step 4: For each of the remaining feature vectors, find the minimum distance from it to the cluster centers $\{v_i \mid i = 0, 1, \ldots\}$.

Step 5: Find the maximum of these minimum distances (computed in Step 4) and the corresponding feature vector $f_j$. Denote this distance by $d_m$.

Step 6: If $d_m$ is greater than a significant fraction $p$ (for example, $p = 0.5$) of $d$, i.e. $d_m > p\,d$, select the feature vector $f_j$ as a new cluster center.

Output: The number ($K$) of clusters formed and their centers $v_0^{(0)}, v_1^{(0)}, \ldots, v_{K-1}^{(0)}$.

Thus the number of distinct cluster centers obtained by this method is taken as the number of clusters in the given sample set and is used as input to more sophisticated clustering algorithms.
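The maximin steps above can be sketched in a few lines of Python. This is a minimal illustration, not the MATLAB code used for the experiments. It assumes Euclidean distance, takes the first feature vector as the arbitrary starting center, and returns indices into the input list. The function name `maximin_centers` and the sample points are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def maximin_centers(features, p=0.5, first=0):
    """Return the indices of the cluster centers chosen by the maximin heuristic."""
    n = len(features)
    centers = [first]  # Step 1: arbitrary first center (assumed: index `first`)
    # Step 2: the vector farthest from the first center becomes the second center.
    second = max(range(n), key=lambda i: euclidean(features[i], features[first]))
    if euclidean(features[second], features[first]) == 0.0:
        return centers  # all vectors coincide, so there is a single cluster
    centers.append(second)
    while True:
        # Step 3: average distance d over all pairs of current centers.
        pairs = [(a, b) for i, a in enumerate(centers) for b in centers[i + 1:]]
        d = sum(euclidean(features[a], features[b]) for a, b in pairs) / len(pairs)
        # Step 4: distance from each remaining vector to its nearest center.
        remaining = [i for i in range(n) if i not in centers]
        if not remaining:
            break
        nearest = {i: min(euclidean(features[i], features[c]) for c in centers)
                   for i in remaining}
        # Step 5: the maximum of these minimum distances, d_m, and its vector f_j.
        j = max(nearest, key=nearest.get)
        d_m = nearest[j]
        # Step 6: accept f_j as a new center only if d_m > p * d.
        if d_m > p * d:
            centers.append(j)
        else:
            break
    return centers


# Nine made-up points forming three well-separated groups.
features = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
            (10.0, 0.0), (10.0, 1.0), (11.0, 0.0),
            (0.0, 10.0), (1.0, 10.0), (0.0, 11.0)]
centers = maximin_centers(features, p=0.5)
print(len(centers))  # 3
```

The number of indices returned plays the role of K, and the corresponding vectors can serve as initial centers for K-means or FCM.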

III. K-means algorithm

As stated earlier, given the number of clusters, the K-means algorithm finds the clusters through an iterative procedure that minimizes intra-cluster distances. Suppose the number of clusters to which the given data belong is K. Given the data, the algorithm picks any K feature vectors as the K initial cluster centers, then iteratively forms clusters around the centers and refines the centers until convergence. Finally, these cluster centers are used as class representatives, or prototypes, in the decision process. The algorithm may be described as follows:

Algorithm:-

Input: Set of feature vectors $\Phi = \{f_0, f_1, \ldots, f_{n-1}\}$, the number of classes $K$ and a predefined threshold $\epsilon$.


Step 1: Choose the first $K$ distinct feature vectors as the $K$ initial cluster centers $v_0^{(0)}, v_1^{(0)}, \ldots, v_{K-1}^{(0)}$. The superscript denotes the iteration number. Set $E_t \leftarrow$ a large number.

Repeat Steps 2 to 4 until $E_t < \epsilon$:

Step 2: At the $t$-th iteration, assign each feature vector $f_i$ to cluster $C_k$ if $d(f_i, v_k^{(t-1)}) = \min_j \{d(f_i, v_j^{(t-1)})\}$ for $j = 0, 1, \ldots, K-1$, where $d$ is a distance measure such as the Euclidean distance.

Step 3: Compute the new cluster centers by minimizing intra-class distances. If $v_k^{(t)}$ is the center of $C_k$ at the $t$-th iteration, it is calculated by minimizing

$$D_k = \sum_{f_i \in C_k} \| f_i - v_k^{(t)} \|^2 \qquad (1)$$

for $k = 0, 1, 2, \ldots, K-1$. The cluster center obtained by minimizing $D_k$ in equation (1) is the mean of the feature vectors belonging to $C_k$ at the $t$-th iteration, i.e.

$$v_k^{(t)} = \frac{1}{n_k^{(t)}} \sum_{f_i \in C_k} f_i \qquad (2)$$

where $n_k^{(t)}$ is the number of feature vectors assigned to $C_k$ at the $t$-th iteration.

Step 4: Calculate the error $E_t = \sum_{k=0}^{K-1} \| v_k^{(t)} - v_k^{(t-1)} \|$. If $\epsilon$ is very small, $E_t < \epsilon$ means that the changes in the cluster centers are insignificant or absent, and the algorithm terminates.

Output: The cluster centers $v_0^{(t)}, v_1^{(t)}, \ldots, v_{K-1}^{(t)}$.

Clusters thus obtained have minimum intra-class distance for a given set of feature vectors and number of classes.

The cluster centers represent prototypes for certain classes and can be used to design the classifiers.
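The K-means steps can be sketched as follows. This is a minimal Python illustration rather than the MATLAB implementation used for the experiments. It assumes Euclidean distance, measures the error in Step 4 as the summed displacement of the centers, and adds a `max_iter` safety cap that is not part of the algorithm as stated. The function names and sample points are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean(vectors):
    """Component-wise mean of a non-empty list of vectors, as in equation (2)."""
    n = len(vectors)
    return tuple(sum(column) / n for column in zip(*vectors))


def k_means(features, k, eps=1e-6, max_iter=100):
    """Return (centers, labels) after iterating Steps 2 to 4."""
    # Step 1: the first K feature vectors (assumed distinct) are the initial centers.
    centers = [tuple(f) for f in features[:k]]
    labels = [0] * len(features)
    for _ in range(max_iter):  # max_iter is an added safeguard (assumption)
        # Step 2: assign every feature vector to its nearest center.
        labels = [min(range(k), key=lambda j: euclidean(f, centers[j]))
                  for f in features]
        # Step 3: each new center is the mean of the vectors assigned to it.
        new_centers = []
        for j in range(k):
            members = [f for f, lab in zip(features, labels) if lab == j]
            new_centers.append(mean(members) if members else centers[j])
        # Step 4: error taken as the total displacement of the centers (assumption).
        error = sum(euclidean(a, b) for a, b in zip(new_centers, centers))
        centers = new_centers
        if error < eps:
            break
    return centers, labels


# Made-up points; the first three come from three different groups.
features = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0),
            (0.0, 1.0), (1.0, 0.0), (10.0, 1.0),
            (11.0, 0.0), (1.0, 10.0), (0.0, 11.0)]
centers, labels = k_means(features, k=3)
print(labels)  # [0, 1, 2, 0, 0, 1, 1, 2, 2]
```

Because the initial centers are simply the first K vectors, the result depends on the order of the input; a different ordering may converge to a different local minimum.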

IV. Fuzzy C-Means Algorithm

In statistical methods of classification, the probability that a pattern belongs to a class is determined, and the pattern is then assigned to the class for which this probability is maximum (in the maximum likelihood approach). Determining a probability means assigning some sort of uncertainty to the pattern with respect to the classes.

When the uncertainty is instead expressed through fuzzy membership, the method is called fuzzy mathematical pattern recognition. Probability is mostly determined by the chance of occurrence of an object, while fuzzy membership is assigned based on the quality of the object. For example, suppose we are interested in classifying the pixels of a satellite image as waterbody or land area. A pixel covering, say, 10 m × 10 m may contain only land, only water, or both. In statistical pattern recognition each pixel is assigned to exactly one class, either waterbody or land, even when the probability is much less than one. The same is true of K-means clustering, where every pattern is assigned to one cluster or another. In the case of mixed content, however, the pixel is eligible to be assigned to both classes, each with a membership less than one. This kind of uncertainty may be better handled by fuzzy mathematical recognition techniques. One of the most widely used fuzzy recognition methods is clustering by the fuzzy c-means algorithm, also called soft clustering, as opposed to hard clustering by the K-means algorithm using a decision-theoretic approach.
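The difference between hard and soft assignment for a mixed pixel can be illustrated with a toy example. The sketch assumes a one-dimensional feature, two hypothetical cluster centers named "water" and "land" with made-up values, and the standard fuzzy c-means membership formula with fuzziness exponent m = 2. None of these numbers come from the paper.

```python
def hard_label(value, centers):
    """Crisp assignment: the name of the nearest center."""
    return min(centers, key=lambda name: abs(value - centers[name]))


def fuzzy_memberships(value, centers, m=2.0):
    """Soft assignment: memberships in [0, 1] that sum to 1 (centers assumed distinct)."""
    dist = {name: abs(value - c) for name, c in centers.items()}
    if any(d == 0.0 for d in dist.values()):
        # The value coincides with a center: full membership in that cluster.
        return {name: float(d == 0.0) for name, d in dist.items()}
    exponent = 2.0 / (m - 1.0)
    return {name: 1.0 / sum((d / other) ** exponent for other in dist.values())
            for name, d in dist.items()}


centers = {"water": 0.2, "land": 0.8}  # hypothetical feature values
mixed_pixel = 0.4                      # closer to water, but not pure water

print(hard_label(mixed_pixel, centers))  # water
soft = fuzzy_memberships(mixed_pixel, centers)
print(soft)
```

The hard label discards the fact that the pixel also resembles land, whereas the soft memberships keep a non-zero degree of belonging to both classes.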

Algorithm:-

Step 1: Choose the initial cluster centers arbitrarily. The first $c$ distinct feature vectors are taken as the initial cluster centers, i.e.

$v_i^{(0)} = f_i, \quad i = 0, 1, \ldots, c-1$


Set $t = 1$ and choose the matrix $A$. For Euclidean distance, $A$ is the identity matrix. Also choose $T$, the maximum number of iterations allowed.

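Step 2: Calculate the memberships $u_{ik}^{(t)}$ from the cluster centers $v_i^{(t-1)}$.

Step 3: Calculate $v_i^{(t)}$ from $u_{ik}^{(t)}$.

Step 4: Calculate the error at the $t$-th iteration as $E_t = \sum_{i=0}^{c-1} \| v_i^{(t)} - v_i^{(t-1)} \|_A$.

Step 5: If $E_t > \epsilon$, a predefined error threshold, and $t < T$, then increment $t$ and go to Step 2; else stop with $U^* = U^{(t)}$ and $V^* = V^{(t)}$.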
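The steps above do not spell out how the memberships and centers are computed, so the following Python sketch fills them in with the standard fuzzy c-means update formulas. That choice is an assumption, as are the Euclidean norm (A taken as the identity matrix), the fuzziness exponent m = 2, and summing the center displacements as the error. The function names and sample points are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance, i.e. the A-norm with A equal to the identity matrix."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def memberships(features, centers, m):
    """u[i][k] is the membership of feature k in cluster i; each column sums to 1."""
    c = len(centers)
    u = [[0.0] * len(features) for _ in range(c)]
    for k, f in enumerate(features):
        d = [euclidean(f, v) for v in centers]
        zeros = [i for i in range(c) if d[i] == 0.0]
        if zeros:  # the feature coincides with a center: full membership there
            for i in zeros:
                u[i][k] = 1.0 / len(zeros)
            continue
        for i in range(c):
            u[i][k] = 1.0 / sum((d[i] / d[j]) ** (2.0 / (m - 1.0)) for j in range(c))
    return u


def update_centers(features, u, m):
    """Each center is the mean of all features weighted by membership ** m."""
    dims = range(len(features[0]))
    centers = []
    for row in u:
        weights = [x ** m for x in row]
        total = sum(weights)
        centers.append(tuple(sum(w * f[dim] for w, f in zip(weights, features)) / total
                             for dim in dims))
    return centers


def fuzzy_c_means(features, c, m=2.0, eps=1e-6, max_iter=100):
    """Return (U, V): the membership matrix and cluster centers at termination."""
    centers = [tuple(f) for f in features[:c]]  # Step 1: first c vectors
    u = memberships(features, centers, m)
    for _ in range(max_iter):  # max_iter plays the role of T
        u = memberships(features, centers, m)          # Step 2
        new_centers = update_centers(features, u, m)   # Step 3
        error = sum(euclidean(a, b) for a, b in zip(new_centers, centers))  # Step 4
        centers = new_centers
        if error <= eps:                               # Step 5
            break
    return u, centers


# Made-up points; the first three come from three different groups.
features = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0),
            (0.0, 1.0), (1.0, 0.0), (10.0, 1.0),
            (11.0, 0.0), (1.0, 10.0), (0.0, 11.0)]
u, centers = fuzzy_c_means(features, c=3)
# Harden the result: each point goes to the cluster of maximum membership.
hard = [max(range(3), key=lambda i: u[i][k]) for k in range(len(features))]
print(hard)
```

Every point keeps a membership in all three clusters; taking the maximum per point recovers a hard partition comparable to the K-means output.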

Finally, the degree of belongingness of a pattern to the clusters may be ordered by its membership values. Alternatively, a pattern may be said to belong to a particular cluster if its membership with respect to that cluster is the maximum and exceeds a decision threshold. The algorithm is a generalization of hard c-means and approaches hard c-means as the fuzziness exponent $m \to 1^+$.

V. Experimental Result

In this section we demonstrate our comparison using random sets of input data. Figure 2 plots the initial unclustered input data. The subsequent figures show the output of the different clustering algorithms. We use 3 cluster centers for every approach and plot these centers using markers. The outputs of the Maximin, K-Means and FCM algorithms are shown in Figures 3, 4 and 5 respectively. Finally, Figure 6 combines the three outputs into one figure, which concludes our comparison.

Figure 2


Figure 3

Figure 4


Figure 5

Figure 6

VI. Conclusion

Cluster analysis groups objects based on their similarity and has wide applications. A measure of similarity can be computed for various types of data. Clustering algorithms can be categorized into partitioning methods, hierarchical methods, density-based methods, etc. Clustering can also be used for outlier detection, which is useful for fraud detection. In this paper we have compared well-known clustering algorithms used in various research fields.

