Experimental Study of the Approximation - Clustering on the Unit Hypersphere using von Mises-Fi

In this section we provide a brief experimental study to assess the quality of our approximation of the concentration parameterκ. Recall that our approximation (4.4) attempts to solve the implicit non-linear equation

Id/2(κ)

Id/2−1(κ)

=¯r. (A.17)

We previously mentioned that for large values of ¯r (¯r close to 1), approximation (4.1) is reason- able; for small values of ¯r (usually for ¯r<0.2) estimate (4.2) is quite good; Eqn. (4.4) yields good approximations for most values of ¯r.

A particular value of ¯r may correspond to many different combinations ofκand d values. Thus, to assess the quality of various approximations, we need to evaluate their performance across the (κ,d)plane. However, such an assessment is difficult to illustrate through 2-dimensional plots. To supplement Table 1, which showed how the three approximations behave on a sampling of points from the(κ,d) plane, in this section we present experimental results on some slices of this plane, where we either keep d fixed and varyκ, or we keepκfixed and vary d. For all our evaluations, the ¯r values were computed using (A.17).

We begin by holding d fixed at 1000, and allowκto vary from 10 to 5010. Figure 7 shows the values of computed ˆκ(estimation of κ) using the three approximations. From this figure one can see that (4.1) overestimates the trueκ, while (4.2) underestimates it. However, our approximation (4.4) is very close to the trueκvalues.

Next we illustrate the quality of approximation whenκis held fixed and d is allowed to vary. Figure 8 illustrates how the various approximations behave as the dimensionality d is varied from

d =4 till d=1454. The concentration parameterκ was set at 500 for this experiment. We see that (4.2) catches up with the true value ofκafter approximately d_≥2κ(because the associated ¯r values become small), whereas (4.4) remains accurate throughout.

Since all the approximations depend on ¯r (which implicitly depends onκand d), it is illustra- tive to also plot the approximation errors as ¯r is allowed to vary. Figure 9 shows how the three

0 1000 2000 3000 4000 5000 6000 0 1000 2000 3000 4000 5000 6000 Kappa (κ) Estimated κ

Comparison of various kappa approximations (d = 1000)

True κ

(4.1) (4.2) (4.4)

Figure 7: Comparison of true and approximatedκvalues, with d=1000

0 500 1000 1500 0 200 400 600 800 1000 1200 Dimension d Estimated value of κ

Comparisions of approximations for κ=500, d varying

κ=500 (4.1) (4.2) (4.4)

Figure 8: Comparison of approximations for varying d,κ=500.

approximations perform as ¯r ranges from 0.05 to 0.95. Let f(d,¯r), g(d,¯r), and h(d,¯r)represent the approximations toκusing (4.1), (4.2) and (4.4), respectively. Figure 9 displays_|Ad(f(d,¯r))−¯r|,

|Ad(g(d,¯r))−¯r|, and|Ad(h(d,¯r))−¯r|for the varying ¯r values. Note that the y-axis is on a log-scale

to appreciate the differences between the three approximations. We see that up to ¯r_≈0.18 (dashed line on the plot), the approximation yielded by (4.2) has lower error. Thereafter, approximation (4.4) becomes better.

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 10−10 10−8 10−6 10−4 10−2 100

Approximation error comparison for (4.1), (4.2) and (4.4)

Average resultant parameter "r"

Log of |A d ( κ approximation) − r| (4.1) (4.2) (4.4)

Figure 9: Comparison of approximations for varying ¯r (with d=1000)

References

M. Abramowitz and I. A. Stegun, editors. Handbook of Mathematical Functions. Dover Publ. Inc., New York, 1974.

S. I. Amari. Information geometry of the EM and em algorithms for neural networks. Neural

Networks, 8(9):1379–1408, 1995.

A. Banerjee and J. Ghosh. Frequency sensitive competitive learning for clustering on high- dimensional hyperspheres. In Proceedings International Joint Conference on Neural Networks, pages 1590–1595, May 2002.

A. Banerjee and J. Ghosh. Frequency Sensitive Competitive Learning for Scalable Balanced Clus- tering on High-dimensional Hyperspheres. IEEE Transactions on Neural Networks, 15(3):702– 719, May 2004.

A. Banerjee and J. Langford. An objective evaluation criterion for clustering. In Proc. 10th Inter-

national Conference on Knowledge Discovery and Data Mining (KDD), pages 515–520, 2004.

A. Banerjee, S. Merugu, I. Dhillon, and J. Ghosh. Clustering with Bregman divergences. In Proc. of

4th SIAM International Conference on Data Mining, pages 234–245, 2004.

J. Bilmes. A Gentle Tutorial on the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models. Technical Report ICSI-TR-97-021, University of Berkeley, 1997.

A. Blum and J. Langford. PAC-MDL bounds. In Proc. 16th Annual Conference on Learning Theory

P. S. Bradley, K. P. Bennett, and A. Demiriz. Constrained k-means clustering. Technical report, Microsoft Research, May 2000.

D. Coleman, X. Dong, J. Hardin, D. Rocke, and D. Woodruf. Some computational issues in cluster analysis with no apriori metric. Computional Statistics and Data Analysis, 31:1–11, 1999. M. Collins. The EM algorithm. In fulfillment of Written Preliminary Exam II requirement, Septem-

ber 1997.

T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley-Interscience, 1991.

S. Dasgupta. Learning mixtures of Gaussians. In IEEE Symposium on Foundations of Computer

Science, 1999.

A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39:1–38, 1977.

I. S. Dhillon, J. Fan, and Y. Guan. Efficient clustering of very large document collections. In V. Kumar R. Grossman, C. Kamath and R. Namburu, editors, Data Mining for Scientific and

Engineering Applications. Kluwer Academic Publishers, 2001.

I. S. Dhillon, Y. Guan, and J. Kogan. Iterative clustering of high dimensional text data augmented by local search. In Proceedings of The 2002 IEEE International Conference on Data Mining, 2002a.

I. S. Dhillon, E. M. Marcotte, and U. Roshan. Diametrical clustering for identifying anti-correlated gene clusters. Technical Report TR 2002-49, Department of Computer Sciences, University of Texas, September 2002b.

I. S. Dhillon and D. S. Modha. Concept decompositions for large sparse text data using clustering.

Machine Learning, 42(1):143–175, 2001.

I. S. Dhillon and S. Sra. Modeling data using directional distributions. Technical Report TR-03-06, Department of Computer Sciences, University of Texas at Austin, Austin, TX, 2003.

B. E. Dom. An information-theoretic external cluster-validity measure. Technical Report RJ 10219, IBM Research Report, 2001.

M. B. Eisen, P. T. Spellman, P. O. Brown, and D. Botstein. Cluster analysis and display of genome- wide expression patterns. Proc. National Academy of Science, 95:14863–14868, 1998.

N. I. Fisher. Statistical Analysis of Circular Data. Cambridge University Press, 1996.

J. Ghosh. Scalable clustering methods for data mining. In Nong Ye, editor, Handbook of Data

Mining, pages 247–277. Lawrence Erlbaum, 2003.

GoMiner03. http://discover.nci.nih.gov/gominer/.

T. R. Hughes, M. J. Marton, A. R. Jones, C. J. Roberts, R. Stoughton, C. D. Armour, H. A. Bennett, E. Coffey, H. Dai, D. D. Shoemaker, D. Gachotte, K. Chakraburtty, J. Simon, M. Bard, and S. H. Friend. Functional discovery via a compendium of expression profiles. Cell, 102:109–126, 2000.

P. Indyk. A sublinear-time approximation scheme for clustering in metric spaces. In 40th Sympo-

sium on Foundations of Computer Science, 1999.

T. Jaakkola and D. Haussler. Exploiting generative models in discriminative classifiers. In M. S. Kearns, S. A. Solla, and D. D. Cohn, editors, Advances in Neural Information Processing Systems, volume 11, pages 487–493. MIT Press, 1999.

A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, New Jersey, 1988. R. Kannan, S. Vempala, and A. Vetta. On clusterings—good, bad and spectral. In 41st Annual IEEE

Symposium Foundations of Computer Science, pages 367–377, 2000.

M. Kearns, Y. Mansour, and A. Ng. An information-theoretic analysis of hard and soft assignment methods for clustering. In 13th Annual Conf. on Uncertainty in Artificial Intelligence (UAI97), 1997.

I. Lee, S. V. Date, A. T. Adai, and E. M. Marcotte. A probabilistic functional network of yeast genes. Science, 306(5701):1555–1558, 2004.

K. V. Mardia. Statistical Distributions in Scientific Work, volume 3, chapter “Characteristics of directional distributions”, pages 365–385. Reidel, Dordrecht, 1975.

K. V. Mardia and P. Jupp. Directional Statistics. John Wiley and Sons Ltd., 2nd edition, 2000. G. J. McLachlan. The classification and mixture maximum likelihood approaches to cluster analysis.

In P. R. Krishnaiah and L. N. Kanal, editors, Handbook of Statistics, volume 2, pages 199–208. North-Holland, 1982.

G. J. McLachlan and T. Krishnan. The EM Algorithm and Extensions. Wiley-Interscience, 1997. G. J. McLachlan and D. Peel. Finite Mixture Models. Wiley series in Probability and Mathematical

Statistics: Applied Probability and Statistics Section. John Wiley & Sons, 2000. M. Meil˘a. Comparing clusterings by the variation of information. In COLT, 2003.

J. A. Mooney, P. J. Helms, and I. T. Jolliffe. Fitting mixtures of von Mises distributions: a case study involving sudden infant death syndrome. Computational Statistics & Data Analysis, 41: 505–513, 2003.

R. M. Neal and G. E. Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants. In M. I. Jordan, editor, Learning in Graphical Models, pages 355–368. MIT Press, 1998.

20 Newsgroups. http://kdd.ics.uci.edu/databases/20newsgroups/20newsgroups.html.

K. Nigam, A. K. Mccallum, S. Thrun, and T. Mitchell. Text classification from labeled and unla- beled documents using EM. Machine Learning, 39(2/3):103–134, 2000.

D. Peel, W. J. Whiten, and G. J. McLachlan. Fitting mixtures of Kent distributions to aid in joint set identification. Journal of American Statistical Association, 96:56–63, 2001.

J. H. Piater. Visual Feature Learning. PhD thesis, University of Massachussets, June 2001. C. R. Rao. Linear Statistical Inference and its Applications. Wiley, New York, 2nd edition, 1973. E. Rasmussen. Clustering algorithms. In W. Frakes and R. Baeza-Yates, editors, Information Re-

trieval: Data Structures and Algorithms, pages 419–442. Prentice Hall, New Jersey, 1992.

G. Salton and C. Buckley. Term-weighting approaches in automatic text retrieval. Information

Processing & Management, 4(5):513–523, 1988.

G. Salton and M. J. McGill. Introduction to Modern Retrieval. McGraw-Hill Book Company, 1983. B. M. Sarwar, G. Karypis, J. A. Konstan, and J. Reidl. Item-based collaborative filtering recom-

mendation algorithms. In World Wide Web, 10, pages 285–295, 2001. B. Sch¨olkopf and A. Smola. Learning with Kernels. MIT Press, 2001.

E. Segal, A. Battle, and D. Koller. Decomposing gene expression into cellular processes. In Proc. of

8th Pacific Symposium on Biocomputing (PSB), 2003.

R. Sharan and R. Shamir. CLICK: A clustering algorithm with applications to gene expression analysis. In Proceedings of 8th International Conference on Intelligent Systems for Molecular

Biology (ISMB), pages 307–316. AAAI Press, 2000.

K. Shimizu and K. Iida. Pearson type VII distributions on spheres. Communications in Statistics:

Theory & Methods, 31(4):513–526, 2002.

J. Sinkkonen and S. Kaski. Clustering based on conditional distributions in an auxiliary space.

Neural Computation, 14:217–239, 2001.

P. Smyth. Clustering sequences with hidden Markov models. In M. C. Mozer, M. I. Jordan, and T. Petsche, editors, Advances in Neural Information Processing, volume 9, pages 648–654. MIT Press, 1997.

M. Steinbach, G. Karypis, and V. Kumar. A comparison of document clustering techniques. In

KDD Workshop on Text Mining, 2000.

A. Strehl, J. Ghosh, and R. Mooney. Impact of similarity measures on web-page clustering. In Proc

7th Natl Conf on Artificial Intelligence : Workshop of AI for Web Search (AAAI 2000), pages

58–64. AAAI, July 2000.

C. S. Wallace and D. L. Dowe. MML clustering of multi-state, Poisson, von Mises circular and Gaussian distributions. Statistics and Computing, 10(1):73–83, January 2000.

G. N. Watson. A treatise on the theory of Bessel functions. Cambridge University Press, 2nd edition, 1995.

A. T. A. Wood. Simulation of the von-Mises Distribution. Communications of Statistics, Simulation

and Computation, 23:157–164, 1994.

Y. Zhao and G. Karypis. Empirical and theoretical comparisons of selected criterion functions for document clustering. Machine Learning, 55(3):311–331, June 2004.

S. Zhong and J. Ghosh. A Unified Framework for Model-based Clustering. Journal of Machine

Learning Research, 4:1001–1037, November 2003a.

S. Zhong and J. Ghosh. A comparative study of generative models for document clustering. In

Workshop on Clustering High Dimensional Data : Third SIAM Conference on Data Mining,

In document Clustering on the Unit Hypersphere using von Mises-Fisher Distributions (Page 32-38)