Dependent data - Finite mixture models and model-based clustering

In this section, we consider an approach for analyzing dependent data that are marginally distributed from the mixture model (2). Suppose that we have n observations Y = (Y1, Y2, . . . , Yn)^′ consisting of univariate normally distributed observations following an autoregressive AR(1) model. The AR(1) model as-sumes a correlation structure given by

R(ρ) =







1 ρ ρ² . . . ρⁿ⁻¹

ρ 1 ρ . . . ρⁿ⁻²

ρ² ρ 1 . . . ρⁿ⁻³ . . . . ρⁿ⁻¹ ρⁿ⁻² ρⁿ⁻³ . . . 1





 .

Also assume that there are K components with means µk, k = 1, 2, . . . , K and common (marginal) variance σ². The origin of every observation, however, is not known. We again introduce missing information – group memberships, which can be given in the form of the matrix

X =







I11 I12 . . . I1K

I21 I22 . . . I2K

. . . . In1 In2 . . . InK





 ,

where every Iik represents the indicator function I(Yi∈ k − th cluster). Then, the entire sample can be written in the form Y ∼ M V N (Xβ, σ²R(ρ)) if the class memberships of observations are known. Thus, the complete likelihood as well as Q-function can be obtained and corresponding expressions for the M-step of the EM algorithm can be derived, however expressions for the EM iterations are more complicated and involve taking derivatives of R⁻¹(ρ) with respect to ρ, for which closed-form expressions may not be available. As a result, while the EM algorithm can be set up and used for parameter estimation in the same way as usual, estimation becomes far more difficult. This is especially true for the case when the dependence between observations is of a form more complicated than an AR(1) structure. Model-fitting presents another challenge as does variance estimation: note also that the methods detailed in Section2.1.3 are for independent identically distributed observations and are inapplicable for dependent data. Thus, new approaches are needed.

8. Conclusions

This paper provides a detailed overview of mixture models with specific reference to model-based clustering. In addition to descriptions of several existing and

well-known results and methods, we provide details on simulation and evaluation of clustering algorithms as well as on graphical illustration of mixtures. Two applications involving non-Gaussian mixtures are presented. We also list some available software in the field. Finally, some additional topics such as semi-supervised clustering, constrained clustering, massive datasets, diagnostics and dependent observations are presented and unresolved challenges outlined. As seen here, the field has attracted a lot of interest, but there are still many questions and issues that have to be addressed. Therefore, we hope that this survey will provide readers with a good understanding of the issues involved and spur further interest and development in this field.

Acknowledgments

The authors thank two anonymous reviewers whose detailed comments and suggestions greatly improved the quality of this survey.

References

[1] Aitkin, M., Anderson, D., and Hinde, J. (1981). Statistical mod-elling of data on teaching styles (with discussion). Journal of the Royal Statistical Society B 144, 419–461.

[2] Aitkin, M. and Rubin, D. (1985). Estimation and hypothesis testing in finite mixture models. Journal of the Royal Statistical Society B 47, 67–75.

[3] Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In Second international symposium on information theory. 267–281.MR0483125

[4] Anderson, E. (1935). The Irises of the Gaspe peninsula. Bulletin of the American Iris Society 59, 2–5.

[5] Atlas, R. and Overall, J. (1994). Comparative evaluation of two superior stopping rules for hierarchical cluster analysis. Psychometrika 59, 581–591.MR1309659

[6] Azzalini, A. (1985). A class of distributions which includes the normal ones. Scandinavian Journal of Statistics 12, 171–178.MR0808153 [7] Azzalini, A. (2005). The skew-normal distribution and related

multi-variate families (with discussion). Scandinavian Journal of Statistics 32, 159–200.MR2188669

[8] Azzalini, A. and Dalla Valle, A. (1996). The multivariate skew-normal distribution. Biometrika 83, 715–726.MR1440039

[9] Baddeley, A. J. and Møller, J. (1989). Nearest-neighbour Markov point processes and random sets. International Statistical Review 2, 89–

121.

[10] Banerjee, A., Dhillon, I. S., Ghosh, J., and Sra, S. (2005). Cluster-ing on the unit hypesphere usCluster-ing von Mises-Fisher distributions. Journal of Machine Learning Research 6, 1345–1382.MR2249858

[11] Banfield, J. D. and Raftery, A. E. (1993). Model-based Gaussian and non-Gaussian clustering. Biometrics 49, 803–821.MR1243494 [12] Basu, S., Banerjee, A., and Mooney, R. (2002). Semi-supervised

clustering by seeding. In Proceedings of the 19th International Conference on Machine Learning. 19–26.

[13] Basu, S., Banerjee, A., and Mooney, R. (2004). Active semi-supervision for pairwise constrained clustering. In Proceedings of the SIAM International Conference on Data Mining. MR2388453

[14] Baudry, J.-P., Raftery, A., Celeux, G., Lo, K., and Gottardo, R. G.(2010). Combining mixture components for clustering. Journal of Computational and Graphical Statistics, to appear.

[15] Biernacki, C., Celeux, G., and Gold, E. M. (2000). Assessing a mixture model for clustering with the integrated completed likelihood.

IEEE Transactions on Pattern Analysis and Machine Intelligence 22, 719–

725.

[16] Biernacki, C., Celeux, G., and Govaert, G. (1999). An improvement of the NEC criterion for assessing the number of clusters in a mixture model. Pattern Recognition Letters 20, 267–272.

[17] Biernacki, C., Celeux, G., and Govaert, G. (2003). Choosing start-ing values for the EM algorithm for gettstart-ing the highest likelihood in mul-tivariate Gaussian mixture models. Computational Statistics and Data Analysis 413, 561–575.MR1968069

[18] Biernacki, C., Celeux, G., Govaert, G., and Langrognet, F.

(2006). Model-based clustering and discriminant analysis with the MIX-MOD software. Computational Statistics and Data Analysis 51/2, 587–

600.MR2297473

[19] Blashfield, R. K. (1976). Mixture model tests of cluster analysis – Ac-curacy of 4 agglomerative hierarchical methods. Psychological Bulletin 83, 377–388.

[20] B¨ohning, D., Dietz, E., Schaub, R., Schlattmann, P., and Lind-say, B. (1994). The distribution of the likelihood ratio for mixtures of densities from the one-parameter exponential family. Annals of the Insti-tute of Statistical Mathematics 46(2), 373–388.

[21] B¨ohning, D., Dietz, E., and Schlattmann, P. (1998). Recent devel-opments in computer-assisted analysis of mixtures. Annals of the Institute of Statistical Mathematics 54, 2, 525–536.

[22] Box, G. E. P. and Draper, N. R. (1987). Empirical Model-Building and Response Surfaces. John Wiley, New York, NY.MR0861118

[23] Boyles, R. A. (1983). On the convergence of the EM algorithm. Journal of the Royal Statistical Society, Series B 45, 47–50.MR0701075

[24] Bradley, P., Fayyad, U., and Reina, C. (1998). Scaling clustering algorithms to large databases. In Proc. of the Fourth International Confer-ence on Knowledge Discovery and Data Mining. Menlo Park, CA: AAAI Press, 9–15.

[25] Brodatz, P. (1966). A Photographic Album for Artists and Designers.

Dover, New York.

[26] Campbell, N. A. and Mahon, R. J. (1974). A multivariate study of variation in two species of rock crab of Genus Leptograsus. Australian Journal of Zoology 22, 417–25.

[27] Chen, C., Forbes, F., and Francois, O. (2006). FASTRUCT: Model-based clustering made faster. Molecular Ecology Notes 6, 980–983.

[28] Chen, J. and Li, P. (2008). Hypothesis testing for normal mixture models: the EM approach. submitted to Annals of Statistics.

[29] Chen, W.-C., Maitra, R., and Melnykov, V. (2010). Model-based semi-supervised clustering. In preparation.

[30] Chipman, H. and Tibshirani, R. (2006). Hybrid hierarchical clustering with applications to microarray data. Biostatistics 7(2), 286–301.

[31] Chung, A. C. S. and Noble, J. A. (1999). Statistical 3d vessel seg-mentation using a Rician distribution. In MICCAI. 82–89.

[32] Cramer, H. (1946). Mathematical methods of statistics. Princeton Uni-versity Press, Princeton, New Jersey.MR0016588

[33] Dasgupta, A. and Raftery, A. E. (1998). Detecting features in spatial point processes with clutter via model-based clustering. Journal of the American Statistical Association 93, 294–302.

[34] Dasgupta, S. (1999). Learning mixtures of Gaussians. In Proc. IEEE Symposium on Foundations of Computer Science. New York, 633–644.

MR1917603

[35] Day, N. (1969). Estimating the components of a mixture of two normal distributions. Biometrika 56, 463–474.MR0254956

[36] Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood for incomplete data via the EM algorithm (with discussion).

Jounal of the Royal Statistical Society, Series B 39, 1–38.MR0501537 [37] Dortet-Bernadet, J. and Wicker, N. (2008). Model-based

cluster-ing on the unit sphere with an illustration uscluster-ing gene expression profiles.

Biostatistics 9, 1, 66–80.

[38] Fayyad, U. and Smyth, P. (1999). Cataloging and mining massive datasets for science data analysis. Journal of Computational and Graphical Statistics 8, 589–610.

[39] Feng, Z. and McCulloch, C. (1996). Using bootstrap likelihood ratio in finite mixture models. Journal of the Royal Statistical Society B 58, 609–617.

[40] Figueiredo, M. A. T. and Jain, A. K. (2002). Unsupervised learning of finite mixture models. IEEE Transactions on Pattern Analysis and Machine Intelligence 24, 3, 381–396. http://dx.doi.org/

http://doi.ieeecomputersociety.org/10.1109/34.990138.

[41] Finch, S., Mendell, N., and Thode, H. (1989). Probabilistic measures of adequacy of a numerical search for a global maximum. Journal of the American Statistical Association 84, 1020–1023.

[42] Forina, M. e. a. (1991). Parvus - an extendible package for data ex-ploration, classification and correlation. Institute of Pharmaceutical and Food Analysis and Technologies, Via Brigata Salerno.

[43] Fraley, C., Raftery, A., and Wehrens, R. (2005). Incremental model-based clustering for large datasets with small clusters. Journal of Computational and Graphical Statistics 14, 529–546.MR2170200 [44] Fraley, C. and Raftery, A. E. (2006). MCLUST version 3 for R:

Normal mixture modeling and model-based clustering. Tech. Rep. 504, University of Washington, Department of Statistics, Seattle, WA. 2006.

[45] Fr¨uhwirth-Schnatter, S.(2006). Finite Mixture and Markov Switch-ing Models. SprSwitch-inger, New York.MR2265601

[46] Gabriel, K. R. (1971). The biplot graphical display of matrices with application to principal component analysis. Biometrika 58, 453–467.

MR0312645

[47] Ghosh, J. and Sen, P. (1985). On the asymptotic performance of the loglikelihood ratio statistic for the mixture model and related results. Pro-ceedings of the Berkeley Conference in Honor of Jerzy Neyman and Jack Kiefer 2, 789–806.MR0822065

[48] Gilks, W., Richardson, S., and Spiegelhalter, D. (1996). Markov Chain Monte Carlo in Practice. Chapman & Hall, London.

[49] Gold, E. M. and Hoffman, P. J. (1976). Flange detection cluster analysis. Multivariate Behavioral Research 11, 217–235.

[50] Goldberger, J. and Roweis, S. (2004). Hierarchical clustering of a mixture model. NIPS 2004 .

[51] Gupta, A., Gonzalez-Farias, G., and Dominguez-Molina, A.

(2002). A multivariate skew normal distribution. Journal of Multivariate Analysis 89, 181–190.

[52] Hartigan, J. (1985). Statistical theory in clustering. Journal of Classi-fication 2, 63–76.MR0800514

[53] Hathaway, R. J. (1985). A constrained formulation of maximum-likelihood estimation for normal mixture distributions. Statistics & Prob-ability Letters 4, 53–56.MR0790575

[54] Haughton, D. (1997). Packages for estimating finite mixtures: a review.

The American Statistician 51, 194–205.

[55] Henderson, H. and Searle, S. (1979). Vec and Vech operators for matrices, with some uses in Jacobians and multivariate statistics. The Canadian Journal of Statistics 7, 65–81.MR0549795

[56] Hennig, C. (2010). Methods for merging Gaussian mixture components.

Advances in Data Analysis and Classification 4, 3–34.

[57] Huang, J.-T. and Hasegawa-Johnson, M. (2009). On semi-supervised learning of Gaussian mixture models for phonetic classification. In NAACL HLT workshop on semi-supervised learning.

[58] Inoue, M. and Ueda, N. (2003). Exploitation of unlabeled sequences in hidden Markov models. IEEE Transactions On Pattern Analysis and Machine Intelligence 25, 1570–1581.

[59] Inselberg, A. (1985). The plane with parallel coordinates. The Visual Computer 1, 69–91.

[60] Kass, R. E. and Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association 90, 773–795.

[61] Keribin, C. (2000). Consistent estimation of the order of finite mixture models. Sankhy¯a 62, 49–66.MR1769735

[62] Kiefer, N. M. (1978). Discrete parameter variation: efficient estimation of a switching regression model. Econometrica 46, 427–434.MR0483200 [63] Kuiper, F. K. and Fisher, L. (1975). A Monte Carlo comparison of

six clustering procedures. Biometrics 31, 777–783.

[64] Li, J. (2005). Clustering based on multi-layer mixture model. Journal of Computational and Graphical Statistics 14(3), 547–568.MR2170201 [65] Li, J., Ray, S., and Lindsay, B. (2007). A nonparametric statistical

approach to clustering via mode identification. The Journal of Machine Learning Research 8, 1687–1723.MR2332445

[66] Li, J. and Zha, H. (2006). Two-way Poisson mixture models for si-multaneous document classification and word clustering. Computational Statistics and Data Analysis 50, 1, 163–180.MR2196228

[67] Li, P., Chen, J., and Marriott, P. (2008). Non-finite Fisher informa-tion and homogeneity: an EM approach. Biometrika, 1–15.

[68] Likas, A., Vlassis, N., and Verbeek, J. J. (2003). The global k-means clustering algorithm. Pattern Recognition 36, 451–461.

[69] Lin, T.-C. and Lin, T.-I. (2009). Supervised learning of multivariate skew normal mixture models with missing information. Computational Statistics.

[70] Lin, T. I. (2009). Maximum likelihood estimation for multivariate skew normal mixture models. Journal of Multivariate Analysis 100, 257–265.

MR2474772

[71] Lin, T. I., Lee, J. C., and Yen, S. Y. (2007). Finite mixture mod-elling using the skew normal distribution. Statistica Sinica 17, 909–927.

MR2408641

[72] Lindsay, B. (1983). The geometry of mixture likelihoods: a general the-ory. The Annals of Statistics 11, 1, 86–94.MR0684866

[73] Lindsay, B. (1995). Mixture models: Theory, Geometry and Applications.

[74] Louis, T. A. (1982). Finding the observed information matrix when using the EM algorithm. Journal of Royal Statistical Society, B 44, 226–233.

MR0676213

[75] Lu, Z. and Leen, T. (2005). Semi-supervised learning with penalized probabilistic clustering. In Advances in NIPS. Vol. 17.

[76] Magnus, J. and Neudecker, H. (1999). Matrix differential calculus with applications in statistics and econometrics, 2 ed. Wiley, New York.

MR1698873

[77] Maitra, R. (2001). Clustering massive datasets with applications to soft-ware metrics and tomography. Technometrics 43, 3, 336–346.MR1943188 [78] Maitra, R. (2009). Initializing partition-optimization algorithms.

IEEE/ACM Transactions on Computational Biology and Bioinformat-ics 6, 144–157.

[79] Maitra, R. and Faden, D. (2009). Noise estimation in magnitude MR datasets. IEEE Transactions on Medical Imaging 28, 10, 1615–1622.

[80] Maitra, R. and Melnykov, V. (2010a). Assessing significance in finite mixture models. Tech. Rep. 10-01, Department of Statistics, Iowa State University.

[81] Maitra, R. and Melnykov, V. (2010b). Simulating data to study performance of finite mixture modeling and clustering algorithms. Journal of Computational and Graphical Statistics, in press.

[82] Maugis, C., Celeux, G., and Martin-Magniette, M.-L. (2009).

Variable selection for clustering with Gaussian mixture models. Biomet-rics 65, 3, 701–709.

[83] McCulloch, C. (1982). Symmetric matrix derivatives with applications.

Journal of the American Statistical Association 77, 679–682.MR0675898 [84] McIntyre, R. M. and Blashfield, R. K. (1980). A nearest-centroid technique for evaluating the minimum-variance clustering procedure. Mul-tivariate Behavioral Research 15, 225–238.

[85] McLachlan, G. (1987). On bootstrapping the likelihood ratio test statis-tic for the number of components in a normal mixture. Applied Statis-tics 36, 318–324.

[86] McLachlan, G. and Krishnan, T. (1997). The EM Algorithm and Extensions. Wiley, New York.MR1417721

[87] McLachlan, G. and Peel, D. (2000). Finite Mixture Models. John Wiley and Sons, Inc., New York.MR1789474

[88] McLachlan, G., Peel, G., Basford, K., and Adams, P. (1999). Fit-ting of mixtures of normal and t-components. Journal of Statistical Soft-ware 4:2.

[89] McLachlan, G. J. and Basford, K. E. (1988). Mixture Models:

Inference and Applications to Clustering. Marcel Dekker, New York.

MR0926484

[90] Melnykov, V. and Maitra, R. (2010). CARP: Software for fishing out good clustering algorithms. Journal of Machine Learning Research, submitted.

[91] Melnykov, V., Maitra, R., and Nettleton, D. (2010). Accounting for spot matching uncertainty in the analysis of proteomics data from two-dimensional gel electrophoresis. In preparation.

[92] Milligan, G. W. (1985). An algorithm for generating artificial test clusters. Psychometrika 50, 123–127.

[93] Minnotte, M. and Scott, D. (1993). The mode tree: a tool for visual-ization of nonparametric density features. Journal of Computational and Graphical Statistics 2(1), 51–68.

[94] Morris, J. S., Clark, B. N., and Gutstein, H. B. (2008). Pinnacle: a fast, automatic and accurate method for detecting and quantifying protein spots in 2-dimensional gel electrophoresis data. Bioinformatics 24, 529–

536.

[95] Newcomb, S. (1886). A generalized theory of the combination of observa-tions so as to obtain the best result. American Journal of Mathematics 8, 343–366.MR1505430

[96] Pan, W. and Shen, X. (2006). Penalized model-based clustering with application to variable selection. Journal of Machine Learning Research 8, 1145–1164.

[97] Pan, W., Shen, X., Jiang, A., and Hebbel, R. (2006). Semisupervised learning via penalized mixture model with application to microarray sam-ple classification. Bioinformatics 22(19), 2388–2395.

[98] Pearson, K. (1894). Contribution to the mathematical theory of evolu-tion. Philosophical Transactions of the Royal Society 185, 71–110.

[99] Peel, D. and McLachlan, G. (2000). Robust mixture modeling using the t-distribution. Statistics and Computing 10, 339:348.

[100] Price, L. J. (1993). Identifying cluster overlap with normix population membership probabilities. Multivariate Behavioral Research 28, 235–262.

[101] Qiu, W. and Joe, H. (2006a). Generation of random clusters with specified degree of separation. Journal of Classification 23, 315–334.

MR2295925

[102] Qiu, W. and Joe, H. (2006b). Separation index and partial membership for clustering. Computational Statistics and Data Analysis 50, 585–603.

MR2196285

[103] Raftery, A. E. and Dean, N. (2006). Variable selection for model-based clustering. Journal of the American Statistical Association 101, 168–178.MR2268036

[104] Ray, S. and Lindsay, B. (2005). The topography of multivariate normal mixtures. Annals of Statistics 33(5), 2042–2065.MR2211079

[105] Ray, S. and Lindsay, B. (2008). Model selection in high dimensions: a quadratic-risk-based approach. Journal of Royal Statistical Society (B) 70, 95–118.MR2412633

[106] Robert, C. and Casella, G. (1999). Monte Carlo Statistical Methods.

Springer-Verlag, New York.MR1707311

[107] Roeder, K. and Wasserman, L. (1997). Practical Bayesian density estimation using mixtures of normals. Journal of the American Statistical Association 92, 894–902.MR1482121

[108] Ruspini, E. (1970). Numerical methods for fuzzy clustering. Information Science 2, 319–350.

[109] Schwarz, G. (1978). Estimating the dimensions of a model. Annals of Statistics 6, 461–464.MR0468014

[110] Shental, N., Bar-Hillel, A., Hertz, T., and Weinshall, D. (2003).

Computing Gaussian mixture models with EM using equivalence con-straints. In Advances in NIPS. Vol. 15.

[111] Steinley, D. and Henson, R. (2005). Oclus: An analytic method for generating clusters with known overlap. Journal of Classification 22, 221–

250.MR2231173

[112] Stuetzle, W. and Nugent, R. (2010). A generalized single linkage method for estimating the cluster tree of a density. Journal of Computa-tional and Graphical Statistics, in press.

[113] Titterington, D., Smith, A., and Makov, U. (1985). Statistical Anal-ysis of Finite Mixture Distributions. John Wiley & Sons, Chichester, U.K.

MR0838090

[114] Vardi, Y., Shepp, L. A., and Kaufman, L. A. (1985). A statisti-cal model for Positron Emission Tomography. Journal of the American Statistical Association 80, 8–37.MR0786595

[115] Verbeek, J., Vlassis, N., and Krose, B. (2003). Efficient greedy learning of Gaussian mixture models. Neural Computation 15, 469–485.

[116] Verbeek, J., Vlassis, N., and Nunnink, J. (2003). A variational EM algorithm for large-scale mixture modeling. Annual Conference of the Advanced School for Computing and Imaging, 1–7.

[117] Wang, H., Segal, E., and Koller, D. (2003). Discovering molecular pathways from protein interaction and gene expression data. Bioinfor-matics 19, 264–272.

[118] Wang, S. and Zhu, J. (2008). Variable selection for model-based high-dimensional clustering and its application to microarray data. Biomet-rics 64, 440–448.MR2432414

[119] Wang, S. J., Woodward, W. A., Gray, H. L., Wiechecki, S., and Satin, S. R.(1997). A new test for outlier detection from a multivariate mixture distribution. Journal of Computational and Graphical Statistics 6, 285–299.MR1466869

[120] Wang, T. and Lei, T. (1994). Statistical analysis of MR imaging and its application in image modeling. In Proceedings of the IEEE International Conference on Image Processing and Neural Networks. Vol. 1. 866–870.

[121] Wegman, E. (1990). Hyperdimensional data analysis using parallel co-ordinates. Journal of the American Statistical Association 85, 664–675.

[122] Windham, M. P. and Cutler, A. (1992). Information ratios for validat-ing mixture analyses. Journal of the American Statistical Association 87, 1188–1192.

[123] Wolfe, J. H. (1967). NORMIX: Computatinal methods for estimating the parameters of multivariate normal mixture distributions. Technical bulletin USNPRA SRM 68-2.

[124] Wolfe, J. H. (1970). Pattern clustering by multivariate mixture analysis.

Multivariate Behavioral Research 5, 329–350.

[125] Wu, C. F. J. (1983). On convergence properties of the EM algorithm.

The Annals of Statistics 11, 95–103.MR0684867

[126] Xie, B., Pan, W., and Shen, X. (2008). Variable selection in penalized model-based clustering via regularization on grouped parameters. Bioin-formatics 64, 921–930.

[127] Xu, R. and Wunsch, D. C. (2009). Clustering. John Wiley and Sons, Inc, NJ, Hoboken.

In document Finite mixture models and model-based clustering (Page 31-39)