• No results found

Centronit: Initial Centroid Designation Algorithm for K-Means Clustering

N/A
N/A
Protected

Academic year: 2021

Share "Centronit: Initial Centroid Designation Algorithm for K-Means Clustering"

Copied!
13
0
0

Loading.... (view fulltext now)

Full text

(1)

EMITTER International Journal of Engineering Technology Vol.2, No.1, June 2014 ISSN:2443-1168

Copyright © 2014 EMITTER International Journal of Engineering Technology - Published by EEPIS

50

Centronit: Initial Centroid Designation Algorithm

for K-Means Clustering

Ali Ridho Barakbah1 andKohei Arai2

1Electronic Engineering Polytechnic Institute of Surabaya

Jalan Raya ITS Surabaya 60111, Indonesia E-mail: ridho@pens.ac.id

2Department of Information Science

Saga University, 1 Honjo, Saga 840-8502, Japan E-mail: arai@is.saga-u.ac.jp

Abstract

Clustering performance of the K-means highly depends on the correctness of initial centroids. Usually initial centroids for the K-means clustering are determined randomly so that the determined initial centers may cause to reach the nearest local minima, not the global optimum. In this paper, we propose an algorithm, called as Centronit, for designation of initial centroidoptimization of K-means clustering. The proposed algorithm is based on the calculation of the average distance of the nearest data inside region of the minimum distance. The initial centroids can be designated by the lowest average distance of each data. The minimum distance is set by calculating the average distance between the data. This method is also robust from outliers of data. The experimental results show effectiveness of the proposed method to improve the clustering results with the K-means clustering.

Keywords: K-means clustering, initial centroids,

K-meansoptimization.

1. Introduction

Clustering is an effort to classify similar objects in the same groups. Cluster analysis constructs good cluster when the members of a cluster have a high degree of similarity each other (internal homogeneity) and are not like members of other clusters (external homogeneity) [1][2]. It means the process to define a mapping f:DC from some data D={d1,d2, ...,dn} to some

clusters C={c1,c2, ...,cn} on similarity between di. The applications of clustering

is diversely in many fields such as data mining, pattern recognition, image classification, biological sciences, marketing, city-planning, document retrievals, etc.

(2)
(3)
(4)
(5)
(6)
(7)
(8)
(9)
(10)
(11)
(12)

Volume 2, No. 1, June 2014

EMITTER International Journal of Engineering Technology, ISSN:2443-1168 61

Table 9. Comparison of Error ratio for New Thyroid dataset

Clustering Algorithm Error (%)

Single Linkage 29.7674

Centroid Linkage 27.907

Complete Linkage 28.3721

Average Linkage 26.0465

Fuzzy c-means 14.4186

K-means (random init.) 20.9842

K-means using Centronit 13.9535

6. CONCLUSION

It is widely reported that the K-means clustering algorithm suffers from initial centroids. The main purpose is to optimize the designation of the initial centroids for K-means clustering. Therefore, in this paper we proposed Centronit as a new algorithm of initial centroid designation algorithm for K-Means Clustering. This algorithm is based on the calculating the average distance of the nearest data inside region of the minimum distance. The initial centroids can be designated by the lowest average distance of each data. The minimum distance is set by calculating the average distance between the data. This algorithm creates the unique clustering results because it does not involve the probabilistic calculation. Moreover, because of observing the data distribution, Centronit is robust for outliers. Experimental results with normal random data distribution and real world datasets perform the accuracy and improved clustering results as compared to some clustering methods.

REFERENCES

[1] G.A. Growe, Comparing Algorithms and Clustering Data: Components of The Data Mining Process, Thesis, Department of Computer Science and Information Systems, Grand Valley State University, 1999.

[2] V.E. Castro, Why So Many Clustering Algorithms-A Position Paper,

ACM SIGKDD Explorations Newsletter, Volume 4, Issue 1, pp. 65-75,

2002.

[3] H. Ralambondrainy, A Conceptual Version of The K-Means Algorithm, Pattern Recognition Letters 16, pp. 1147-1157, 1995.

[4] YM. Cheung, k*-Means: A New Generalized K-Means Clustering

Algorithm, Pattern Recognition Letters 24, pp. 2883-2893, 2003.

[5] S.S. Khan, A. Ahmad, Cluster Center Initialization Algorithm for K-Means Clustering, Pattern Recognition Letters, 2004.

[6] B. Kövesi, JM. Boucher, S. Saoudi, Stochastic K-Means Algorithm for Vector Quantization, Pattern Recognition Letters 22, pp. 603-610, 2001.

(13)

Volume 2, No. 1, June 2014

EMITTER International Journal of Engineering Technology, ISSN:2443-1168 62

[7] P.S. Bradley, U.M. Fayyad, Refining Initial Points for K-Means Clustering, Proc. 15th International Conference on Machine Learning

(ICML’98), 1998.

[8] J.M. Penã, J.A. Lozano, P. Larrañaga, An Empirical Comparison of The Initilization Methods for The K-Means Algorithm, Pattern

Recognition Letters 20, pp. 1027-1040, 1999.

[9] C.J. Veenman, M.J.T. Reinders, E. Backer, A Maximum Variance Cluster Algorithm, IEEE Transactions on Pattern Analysis and Machine

Intelligence, Vol. 24, No. 9, pp. 1273-1280, 2002.

[10] S. Ray, R.H. Turi, Determination of Number of Clusters in K-Means Clustering and Application in Colthe Image Segmentation, Proc. 4th

ICAPRDT, pp.137-143, 1999.

[11] W.H. Ming, C.J. Hou, Cluster Analysis and Visualization, Workshop on

Statistics and Machine Learning, Institute of Statistical Science, Academia

Sinica, 2004.

[12] Ali Ridho Barakbah, Kohei Arai, Identifying Moving Variance to Make Automatic Clustering for Normal Dataset, Proc. IECI Japan Workshop

2004 (IJW 2004), Musashi Institute of Technology, Tokyo, 2004.

[13] UCIRepository (http://www.sgi.com/tech/mlc/db/).

[14] C. Yi-tsuu, Interactive Pattern Recognition, Marcel Dekker Inc., New York and Basel, 1978.

[15] R.K. Pearson, T. Zylkin, J.S. Schwaber, G.E. Gonye, Quantitative Evaluation of Clustering Results Using Computational Negative Controls, Proc. 2004 SIAM International Conference on Data Mining, Lake Buena Vista, Florida, 2004.

Figure

Table 9. Comparison of Error ratio for New Thyroid dataset  Clustering Algorithm  Error (%)

References

Related documents

National Conference on Technical Vocational Education, Training and Skills Development: A Roadmap for Empowerment (Dec. 2008): Ministry of Human Resource Development, Department

consequence of this section is that any transaction between connected persons with a tax-haven is potentially subject to the application of the GAAR unless the taxpayer can prove

The addition of manyw , then, does not save bottom-of-the-scale numerals from yielding contradictory truth conditions, and so we conclude that our account of the bottom-of-the-

Other difficulties with a polydactyly theory are: embryos from nonavian theropods are not available for study; no adult archosaur has six dis- tinct digits; there is no evidence for

In view of the spread of drug resistance among the Staphylococcus aureus strains observed, it may be necessary to embark on multidrug therapy or combined

نیا هک اجنآ زا تتسا رکذ لباق همانتشتسرپ نایوجتشناد یور نر م هدتش یندومزآ یور رب همانشسرپ یارجا زا لبق ،دوب ،شهوژپ یاه همانشتسرپ ققحم دروآرب روونم هب ار یونعم شوه ی رابتعا

The study of psychological pressure (stress) and u ’ high school managements in Tehran training and education center (zone

Contribution of the long-term care insurance certificate for predicting 1-year all-cause readmission compared with validated risk scores in elderly patients with