
Chapter 3. Clustering Algorithms

4.2 Cluster Validation

Once a clustering procedure has finished, producing a fixed number of clusters and an assignment of each data point to a cluster, the final step is to evaluate the quality of the resulting clusters. This step is known as cluster validation, and it is usually tied to the process of determining the number of clusters.

Cluster validation is motivated by several concerns: avoiding finding clusters in noise, comparing different clusterings, and comparing the effectiveness of different clustering algorithms on a specific dataset. One potentially useful validation technique is cross-validation.

For cross-validation, first randomly split the observations into two sets, then choose one clustering technique and perform cluster analysis on each set. If similar clusters emerge, the clustering result is potentially acceptable; if different clusters appear, the result does not generalize. A variation on this method is to perform cluster analysis (specifically, with the K-means algorithm) on the first set of observations and then use its cluster centroids as seeds to cluster the second set. This forces the same number of clusters in both sets. If the centroids from the first set yield similar assignments and clusters in the second set, with small within-cluster errors and large between-cluster errors, the clustering is considered good.
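The seeded K-means variation can be sketched in base R as follows. This is only an illustration under stated assumptions: the iris measurements, the choice of three clusters, the 50/50 split, and the helper `nearest_centroid` are hypothetical choices made for the example and are not part of the analyses in this report.

```r
# Split-half validation sketch with K-means (base R only).
set.seed(1)
x <- scale(iris[, 1:4])                  # illustrative data, standardized
k <- 3                                   # assumed number of clusters

idx   <- sample(nrow(x), nrow(x) / 2)    # random split into two halves
half1 <- x[idx, ]
half2 <- x[-idx, ]

# Cluster the first half, then seed the second half with its centroids.
fit1 <- kmeans(half1, centers = k, nstart = 20)
fit2 <- kmeans(half2, centers = fit1$centers)

# Hypothetical helper: index of the nearest centroid for each point.
nearest_centroid <- function(points, centers) {
  n_c <- nrow(centers)
  d <- as.matrix(dist(rbind(centers, points)))[-(1:n_c), 1:n_c]
  unname(apply(d, 1, which.min))
}

# Labels the first-half centroids would give to the second half.
pred2 <- nearest_centroid(half2, fit1$centers)

agreement <- mean(pred2 == fit2$cluster)   # share of matching assignments
explained <- fit2$betweenss / fit2$totss   # between-cluster share of total SS
```

Because the second fit starts from the first-half centroids, cluster labels correspond across the two halves, so `agreement` can be computed without relabeling. Values of `agreement` and `explained` close to 1 would point toward a reproducible clustering.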

Halkidi et al. [28] introduced the fundamental concepts of cluster validity, such as compactness and separation, and gave a systematic analysis of how cluster validity indices are used in cluster validation, including external criteria, internal criteria, and relative criteria.
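As a rough illustration of compactness and separation, the sketch below computes one simple version of each for a K-means result in base R. The specific definitions used here (total within-cluster sum of squares for compactness, smallest centroid-to-centroid distance for separation) and the iris data are assumptions made for the example; they are not the particular indices analyzed in [28].

```r
# Compactness and separation for a K-means clustering (base R only).
set.seed(1)
x   <- as.matrix(iris[, 1:4])            # illustrative data
fit <- kmeans(x, centers = 3, nstart = 20)

# Compactness: sum of squared distances of points to their own centroid.
compactness <- sum((x - fit$centers[fit$cluster, ])^2)

# Separation: smallest Euclidean distance between any two centroids.
separation <- min(dist(fit$centers))
```

With these definitions, `compactness` matches the `tot.withinss` component returned by `kmeans`; smaller compactness together with larger separation suggests tighter, better separated clusters.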

Brock et al. [29] developed the R package clValid, which contains functions for validating clustering results. Three main types of cluster validation measures are available: "internal", "stability", and "biological". The package can evaluate cluster analyses produced by up to 9 clustering algorithms, including hierarchical, K-means, self-organizing maps (SOM), model-based clustering, etc.

Appendix A

R Code for Figure 2.2 & Clustering Analysis on Romano-British Pottery

## Figure 2.2 Clusters for iris Dataset on Page 6
library(devtools)
install_github('sinhrks/ggfortify')
library(ggplot2)
library(ggfortify)
library(cluster)
set.seed(1)
autoplot(fanny(iris[-5], 4), frame = TRUE)

## Clustering Analysis on Romano-British Pottery
## on Page 14-16
library(HSAUR)
kiln <- rep(1:5, c(21, 12, 2, 5, 5))
# Name the column explicitly so the "kiln" references below resolve
pottery$kiln <- kiln
pottery_dist <- dist(pottery[, colnames(pottery) != "kiln"],
                     method = "euclidean")

# Figure 3.1: Image of Euclidean Distance based
# Dissimilarity Matrix on Pottery Data
library(lattice)
levelplot(as.matrix(pottery_dist), xlab = "Number of Pot",
          ylab = "Number of Pot")

pottery_single <- hclust(pottery_dist, method = "single")
pottery_complete <- hclust(pottery_dist, method = "complete")
pottery_average <- hclust(pottery_dist, method = "average")

# Table 3.2: Relations between Clusters and Kiln Sites
# for Average Link
clusters <- cutree(pottery_average, h = 4)
xtabs(~ clusters + kiln, data = pottery)

# Figure 3.2: Dendrogram of Hierarchical Clustering
# using Euclidean Distance
par(mfrow = c(1, 3))
plot(pottery_single, main = "Single Link", sub = "", xlab = "")
plot(pottery_complete, main = "Complete Link", sub = "", xlab = "")
plot(pottery_average, main = "Average Link", sub = "", xlab = "")

Appendix B

R Code for K-means Experimental Study on Pottery Data

## K-means Experimental Study on Pottery Data on
## Page 17-19
library(ggplot2)
library(HSAUR)
library(HSAUR2)
set.seed(13)
res.kmeans <- lapply(1:10, function(i) {
  kmeans(pottery[, 1:9], centers = i)
})

# Within SS for each cluster (1 cluster to 10 clusters)
lapply(res.kmeans, function(x) x$withinss)

# Table 3.3: Total Within-cluster Sum of Squared Distance
res.within.ss <- sapply(res.kmeans, function(x) sum(x$withinss))
res.within.ss

# Figure 3.4: Relations between SSD and Number of Clusters
ggplot(data.frame(No.of.Clusters = 1:10, SSD = res.within.ss),
       aes(No.of.Clusters, SSD)) +
  geom_point() + geom_line() +
  scale_x_continuous(breaks = 0:10)

# Table 3.4: Cluster Means for Each Cluster &
# using K-means
res.kmeans[3]

## Table 4.1: Percentage of Explained Variance against
## Number of Clusters on Page 35
res.between.ss <- sapply(res.kmeans, function(x) (x$betweenss) / (x$totss))
res.between.ss

# Figure 4.1: Explained Variance by Clustering against
# Number of Clusters on Page 34
ggplot(data.frame(No.of.Clusters = 1:10, BSS_in_TOTSS = res.between.ss),
       aes(No.of.Clusters, BSS_in_TOTSS)) +
  geom_point() + geom_line() +
  scale_x_continuous(breaks = 0:10)

Appendix C

R Code for Experimental Analysis on iris Dataset

## Experimental Analysis on Iris Dataset on Page 28-30
library(mclust)
imclust <- Mclust(iris[, 1:2])

# Table 3.6: Brief Results of EM Clustering &
# Table 3.7: Parameter Estimates of Mixing Probabilities
# and Means
# Table 3.8: Parameter Estimates of Covariances
summ <- summary(imclust, parameters = TRUE)
summ
imclust$BIC
imclust$classification

# Figure 3.6: The Bivariate iris Dataset
plot(iris$Sepal.Length, iris$Sepal.Width, xlab = "Sepal.Length",
     ylab = "Sepal.Width", pch = "o")

# Figure 3.7: Density Estimate for Bivariate iris Dataset
irisDens <- densityMclust(iris[, 1:2])
plot(irisDens, type = "persp", col = grey(0.8))

# Figure 3.8: Plots Associated with the Function Mclust
# for iris Dataset
plot(imclust)

Bibliography

[1] Wikipedia, “Cluster analysis - wikipedia, the free encyclopedia,” 2015, [Online; accessed 15-February-2015]. [Online]. Available: http://en.wikipedia.org/wiki/Cluster_analysis

[2] R. C. Tryon, Cluster analysis: correlation profile and orthometric (factor) analysis for the isolation of unities in mind and personality. Edwards brother, Incorporated, lithoprinters and publishers, 1939.

[3] R. B. Cattell, “The description of personality: basic traits resolved into clusters.” The journal of abnormal and social psychology, vol. 38, no. 4, p. 476, 1943.

[4] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern classification. John Wiley & Sons, 2012.

[5] O. Chapelle, B. Schölkopf, A. Zien et al., Semi-supervised learning. MIT Press Cambridge, 2006.

[6] T. Lange, M. H. Law, A. K. Jain, and J. M. Buhmann, “Learning with constrained and unlabelled data,” in Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, vol. 1. IEEE, 2005, pp. 731–738.

[7] S. Basu, A. Banerjee, and R. J. Mooney, “Active semi-supervision for pairwise constrained clustering.” in SDM, vol. 4. SIAM, 2004, pp. 333–344.

[8] M. Bilenko, S. Basu, and R. J. Mooney, “Integrating constraints and metric learning in semi-supervised clustering,” in Proceedings of the twenty-first international conference on Machine learning. ACM, 2004, p. 11.

[9] A. K. Jain, “Data clustering: 50 years beyond k-means,” Pattern recognition letters, vol. 31, no. 8, pp. 651–666, 2010.

[10] G. James, D. Witten, T. Hastie, and R. Tibshirani, An Introduction to Statistical Learning: with Applications in R. Springer, 2013.

[11] B. S. Everitt, “Unresolved problems in cluster analysis,” Biometrics, pp. 169–181, 1979.

[12] T. Hastie, R. Tibshirani, and J. Friedman, The elements of statistical learning. Springer, 2009, vol. 2, no. 1.

[13] T. Hothorn and B. S. Everitt, A handbook of statistical analyses using R. CRC Press, 2014.

[14] Wikipedia, “Atomic absorption spectroscopy - wikipedia, the free encyclopedia,” 2015, [Online; accessed 07-March-2015]. [Online]. Available: http://en.wikipedia.org/wiki/Atomic_absorption_spectroscopy

[15] A. Tubb, A. Parker, and G. Nickless, “The analysis of romano-british pottery by atomic absorption spectrophotometry,” Archaeometry, vol. 22, no. 2, pp. 153–171, 1980.

[16] G. H. Ball and D. J. Hall, “Isodata, a novel method of data analysis and pattern classification,” DTIC Document, Tech. Rep., 1965.

[17] E. W. Forgy, “Cluster analysis of multivariate data: efficiency versus interpretability of classifications,” Biometrics, vol. 21, pp. 768–769, 1965.

[18] J. C. Dunn, “A fuzzy relative of the isodata process and its use in detecting compact well-separated clusters,” 1973.

[19] J. C. Bezdek, Pattern recognition with fuzzy objective function algorithms. Kluwer Academic Publishers, 1981.

[20] P. Smyth, “A guided tour of finite mixture models: From pearson to the web.” Presented as the ICML 01 Keynote Talk at Williams College, 2001.

[21] C. Fraley, A. E. Raftery, T. B. Murphy, and L. Scrucca, “mclust version 4 for r: Normal mixture modeling for model-based clustering, classification, and density estimation,” 2012.

[22] D. J. Ketchen and C. L. Shook, “The application of cluster analysis in strategic management research: an analysis and critique,” Strategic management journal, vol. 17, no. 6, pp. 441–458, 1996.

[23] C. Goutte, P. Toft, E. Rostrup, F. Å. Nielsen, and L. K. Hansen, “On clustering fmri time series,” NeuroImage, vol. 9, no. 3, pp. 298–310, 1999.

[24] T. Hothorn and B. S. Everitt, A handbook of statistical analyses using R. CRC Press, 2014.

[25] C. Goutte, L. K. Hansen, M. G. Liptrot, and E. Rostrup, “Feature-space clustering for fmri meta-analysis,” Human brain mapping, vol. 13, no. 3, pp. 165–183, 2001.

[26] G. W. Milligan and M. C. Cooper, “An examination of procedures for determining the number of clusters in a data set,” Psychometrika, vol. 50, no. 2, pp. 159–179, 1985.

[27] M. C. Cooper and G. W. Milligan, The effect of measurement error on determining the number of clusters in cluster analysis. Springer, 1988.

[28] M. Halkidi, Y. Batistakis, and M. Vazirgiannis, “On clustering validation techniques,” Journal of Intelligent Information Systems, vol. 17, no. 2-3, pp. 107–145, 2001.

[29] G. Brock, V. Pihur, S. Datta, and S. Datta, “clvalid, an r package for cluster validation,” Journal of Statistical Software, vol. 25, no. 4, 2008.

Vita

Lihao Zhang was born in Liaocheng, China in 1990. He received the Bachelor of Science degree in Mathematics and Applied Mathematics from Shandong University, China in 2013. He was accepted to the Master’s program in Statistics in The University of Texas at Austin in 2013, and then he started his graduate studies.

Permanent address: [email protected]

This report was typeset with LaTeX† by the author.

† LaTeX is a document preparation system developed by Leslie Lamport as a special version of Donald Knuth’s TeX Program.

In document Statistical clustering of data (Page 46-59)