Comparing applicability of prevalent Clustering Algorithms for Document Clustering

(1)

Comparing applicability of prevalent Clustering Algorithms

for Document Clustering

Bachelor’s Thesis submitted to

Prof. Dr. Wolfgang K. H¨ardle

and

Prof. Dr. Nadja Klein

Humboldt-Universit¨at zu Berlin School of Business and Economics Ladislaus von Bortkiewicz Chair of Statistics

by

Luisa Krawczyk

(573814)

in partial fulfillment of the requirements for the degree of

Bachelor of Science in Business Administration

(2)

Abstract

In statistical data analysis, Clustering Algorithms are supposed to find groups of similar points in a set of objects. I present two algorithms, k-means and hierarchical clustering, and compare their applicability for clustering high-dimensional sparse data as often dealt with when cluster-ing documents. To demonstrate possible cases of application, I provide clustercluster-ing applications on an example data set (the Quantlet data set) in programming language Python. This paper’s objective is to guide the reader from a casual understanding of basic statistical concepts, such as those typically acquired in undergraduate studies, to a general understanding of Clustering Algorithms, while focusing on Document Clustering applicability.

Keywords: Clustering Algorithms, k-means, Hierarchical Clustering, Document Clustering, Vector Space Model, Clustering Evaluation

(3)

List of Figures

4.1 Number of Keywords Distribution . . . 11 4.2 Number of observations per cluster when using k-means with k=15 . . . 12 4.3 Sum of squared distances as a function of the number of clusters . . . 13 4.4 Number of observations per cluster for a Hierarchical Clustering using Complete

Linkage . . . 14 4.5 Dendrogram as a result of an agglomerative hierarchical Clustering using

Com-plete Linkage for the Quantlet data set computed in R . . . 15 4.6 Silhouette Score for k-means and Hierarchical Clusterings . . . 17 4.7 Calinski-Harabaz Score for k-means and Hierarchical Clustering . . . 18

(5)

1. Introduction

Clustering has widespread applications in business analytics and market research, bioinformat-ics, medicine and many more fields. In Customer segmentation, grouping clients based on their purchase history enables retailers to target them for specific campaigns. In medicine, clustering patients might be helpful to predict the probability of having a heart attack for each subgroup. In document clustering, textual documents are clustered into groups of similar documents in terms of topics or keywords. Applications include web document clustering for search users, automatic document organization or topic extraction.

I present two well established algorithms often used in document clustering - k-means and hierarchical clustering - and apply them on an example data set which was provided to me by the Ladislaus von Bortkiewicz Chair of Statistics. The application part can be categorized as ‘keyword’ clustering, which however is similar to document clustering, when leaving out its text preprocessing. It therefore only deals with ‘the second step’ of document clustering, the clustering application on the term-document matrix.

This work aims at creating a theoretical foundation for understanding clustering algorithms and how they can be used to group documents. While detailed literature is available for most of the concepts mentioned in this paper, I want to contribute to the understanding of clustering applications by presenting all concepts essential to applying clustering algorithms on a data set, in particular a set of documents, alltogether with concepts of data preprocessing and choice of the right parameters and an appropriate distance measure.

This paper is organized as follows. Chapter 2 presents the two Clustering techniques that are to be compared, Partitional k-means and Hierarchical Agglomerative Clustering, in a general context. Chapter 3 introduces the vector-space model for document clustering and outlines basic preprocessing techniques. In chapter 4, I present application results of clusterings that I developed in Python and go into detail about practical questions as the choice of number of clusters and cluster performance evaluation. Chapter 5 contains a conclusion.

(6)

2. Overview Partitional and Hierarchical Clustering

Tech-niques

2.1 Partitional Clustering : The k-means algorithm

Partitional Clustering Techniques have in common that the number of clusters k has to be determined by the user.

The most prominent partitional Clustering technique is the k-means algorithm.

First described by polish mathematician Hugo Steinhaus in 1951 [Florek, K., Lukaszewicz, J., Perkal, J., Steinhaus, Hugo, Zubrzycki, S., 1951], the k-means algorithm is a very simple and intuitive way to partition data. The term “k-means” was first used by [MacQueen et al., 1967]. In spite of the fact that it was proposed over 50 years ago and thousands of clustering algorithms have been published since then, k-means is still widely used [Jain, 2010]. Its goal is to minimize the distance between data points and their barycentre in each cluster. It is hence a so-called distance-based clustering algorithm.

The algorithm has the following four steps:

Given a set of n observations with d dimensions, so X1, ..., Xn with each Xi ∈Rd arranged in

a d×n matrix 1 Xd,n =       x1,1 x1,2 · · · x1,n x2,1 x2,2 · · · x2,n .. . ... . .. ... xd,1 xd,2 · · · xd,n      

1. The user has to randomly assign k centroids (data points in the data space, which can but do not have to correspond to the observations). The k centroidsC1, ..., Ck are thus each d×1

vectors. The centroid matrix is

Cd,k=       c1,1 c1,2 · · · c1,k c2,1 c2,2 · · · c2,k .. . ... . .. ... cd,1 cd,2 · · · cd,k      

In case it is possible to select logical centroids, the algorithm will work faster. 2

1_{more on the representation of observations in the vector-space model in Section 3.1} 2_{k-means++ is a variation of k-means choosing the initial centroids “cleverly”}

(7)

2. For each data point Xi for i = 1, ..., n we calculate the distance to each of the k centroids.

For simplicity, let us assume the euclidean distance. 3

d(Xi, Xj) = v u u t d X m=1 (xm,i−xm,j)2

For each data point those k distances are being compared and the smallest is chosen as centroid that Xi belongs to.

∀i:L(i) = arg min

l=1,...,k

kXi−clk2

with ∀l :Cl = [i|L(i) =l]

At the end of this step, we have a label vector L(i) of length n that assigns to each Xi his

centroid cl ∈ [|1, ..., k|] as well as a Size vector S(l) = Card(Cl) of length k which tells the

number of data points assigned to each cluster.

3. For each one of the k computed clusters, we calculate a new centroid as the mean, or barycentre, of all the data points which belong to it.

∀l :Cl = 1 Card(Cl) X i∈Cl Xi

We define the sum of squared distances of iteration t to be

N(t) = n X i=1 k X l=1 kX_i(l)−Clk22

whereX_i(l) denotes allXi that belong to cluster Cl and k.k22 is the squared Euclidean distance.

4. Repeat step 2 and 3 until the reduction in sum of squared distances is smaller than a certain

i.e. until

t such that N(t−1)−N(t)< .

K-Means can easily be computed with the help of Scikit-learn, a free software machine learning library for Python programming language. However, this [ manual k-means] algorithm might be helpful to understand what is happening behind the scenes.

(8)

2.2 Hierarchical Clustering

Hierarchical Clustering techniques are connectivity-based. In contrast to partitional techniques, they do not require a predefined value for the number of clusters (k) but provide a tree of different cluster divisions, called dendrogram 4_{, for any number of clusters between 1 and n.}

We distinguish two types of hierarchical Clustering algorithms:

Agglomerative Clustering which starts with n clusters (i.e. every data point has its own cluster) and then merges the closest pair of clusters at each step.

Divisive Clusteringwhich starts with one all-inclusive cluster and then splits the groups with the largest distance at each step.

Agglomerative techniques are more common and these are the ones that I will compare to k-means.

The algorithm of an Agglomerative Clustering is composed of the following steps: 1. Calculate the distance matrix

Dn,n =       d(X1, X1) d(X1, X2) · · · d(X1, Xn) d(X2, X1) d(X2, X2) · · · d(X2, Xn) .. . ... . .. ... d(Xn, X1) d(Xn, X2) · · · d(Xn, Xn)       where d(Xi, Xj) = s d P m=1

(xm,i−xm,j)2 is the euclidean distance between Xi and Xj. 5

D is a symmetric matrix with the diagonal values being 0. Now we define each data point as being one cluster, we thus start having n clusters.

2. Identify the most similar i.e. closest two clusters (r) and (s)

For now as one cluster is equivalent to one data point: Later with (i) and (j) being clusters:

d(Xr, Xs) = min i,j∈[|1,...,n|] i6=j d(Xi, Xj) d((r),(s)) = min i,j∈[|1,...,n|] i6=j d((i),(j))

which corresponds to finding the smallest values in our distance matrix.

3. Merge clusters (r) and (s) into a single cluster (rs) and update the distance matrix by deleting the row and column corresponding to Xr and Xs as well as adding a row and column

corresponding to the newly formed cluster (rs).

4_{An example of a dendrogram is Figure 4.5 shown in Section 4.2}

(9)

Dn−1,n−1=                   d(X1, X1) · · · d(X1, Xr−1) d(X1, Xr+1)· · · d(X1, Xs−1) d(X1, Xs+1) · · · d(X1, Xn) d(X1,(rs)) d(X2, X1) · · · d(X2, Xr−1) d(X2, Xr+1)· · · d(X2, Xs−1) d(X2, Xs+1) · · · d(X2, Xn) d(X2,(rs)) . . . . . . . . . . . . . . . . . . . .. ... . . . d(Xn, X1) · · · d(Xn, Xn) d(Xn,(rs)) d((rs), X1) · · · d((rs), Xn) d(Xrs,(rs))                  

where d((rs), X1) can be calculated using different linkage criteria, for example

d((rs), X1) = min

r,s [d(Xi, Xr), d(Xi, Xs)] for single linkage.

6

4. Repeat from step 2 unless all objects are in one cluster, then stop.

In contrast to partitional techniques, where data points can change cluster affiliation from iteration to iteration, in hierarchical clustering each affiliation decision is definitive; data points that belong to a child cluster also belong to the parent cluster.

Linkage criteria

The key element of this algorithm is its dissimilarity measure, also referred to as proximity measure, which consists of two things: a distance metric between pairs of observations 7 _{and a}

linkage criterion which stipulates the dissimilarity/ proximity of two sets as a function of the pairwise distances of observations in the sets.

The most prominent linkage criteria:

Simple Linkage d(A, B) = min

Xi∈A,Xj∈Bd(Xi, Xj)

Complete Linkage d(A, B) = max

Xi∈A,Xj∈B d(Xi, Xj) Average Linkage d(A, B) = _|_A_|·|1_B_| P Xi∈A P Xj∈B

d(Xi, Xj) where |A|=Card(A)

Centroid Linkage d(A, B) =d(cA, cB) where cA is A’s centroid

Ward’s method d(A, B) = d(c1A, cB)

|A|·

1

|B|

which corresponds to the variance increase when merging the clusters.

When using hierarchical clustering, some linkage criteria can be more suitable than others, i.e. achieve a segmentation where similarity between points in one group as well as dissimilarity between groups is greater.

6_{More about this in the following section.}

(10)

3. Document Clustering

This chapter intends to clarify the elements of Document Clustering and situate it within the Data Science context. It describes how documents are transformed to a numerical matrix prior to clustering.

3.1 Concepts of Information Retrieval

3.1.1 Data preprocessing

The preprocessing techniques used prior to Document Clustering are methods of Information Retrieval. They are common in Text Mining applications, which deal with unstructured or semistructured data and serve to quantify the nature of a textual document. The tools used are referred to as Natural Language Processing.

Those preprocessing techniques include

• Tokenization: the process of parsing text data into smaller units (tokens) such as words and phrases,

• eliminating stop words: words like “the”, “and”, “or” are not very helpful in revealing the essential content of a document,

• recognizing stems of words:“cluster” and “clustering” carry very similar information and should thus be treated as the same word

• distinguishing important words from unimportant words: e.g. “according” reveals much less information about the content of an Econometrics course document than “heteroscedas-ticity”,

• dealing with synonyms (different words with the same meaning) and homonyms (words with the same spelling but with distinct meanings) etc.

Only after this often extensive preprocessing, it is possible to obtain “descriptors” i.e. sets of words that describe the content of each document.

3.1.2 Vector-space model

Documents are represented using the vector-space model. The descriptors obtained after pre-processing are then arranged in a so-called Term-Document-Matrix or extensions of it (e.g. TF-IDF for term frequency-inverse document frequency).

(11)

Definition Term-document matrix [Karypis et al., 2000] The term-document matrix is defined as: (fi,j)i=1,...,n

j=1,...,d

where fi,j is the frequency with which

term i appears in document j, nis the vocabulary size and dis the number of documents. Each column of the matrix corresponds to the jth _{document vector.}

An example is given in Table 4.1 of the next chapter.

As this paper deals with the “second step” of document clustering, I do not further detail any extensions or Text Mining Preprocessing techniques. To gain a deeper understanding of those, see [Baeza-Yates et al., 2011].

3.2 Document Clustering Techniques

Based on the term-document matrix described above, documents can finally be clustered into groups of similar documents. Documents in one group will have more terms in common with each other than with documents in other groups.

An important feature of document clustering as opposed to other clustering applications is the number of variables that often exceeds many thousand. Furthermore, term-document matrices can be very sparse, i.e. contain many zeros. Applying clustering algorithms to high-dimensional sparse data is a challenge tackled by the choice of an appropriate document clustering technique. The most common techniques used are k-means and hierarchical agglomerative clustering, which is why I will compare those two in the application part. The numerous other algorithms introduced include density-based algorithms as DBSCAN, Spectral Clustering or hybrid models that combine features from both partitional and agglomerative approaches as in [Zhao et al., 2005]. Beyond that there are numerous novel algorithms presented by researchers as [Joel Larocca Neto and Alexandre D. Santos and Celso A.A. Kaestner and Neto Alexandre and D. Santos and Celso A. A and Kaestner Alex and Alex A. Freitas and Catolica Parana, 2000] who present the Autoclass Algorithm for clustering and summarizing documents.

3.3 Distance measures

The distance between two documents in the vector-space model can be calculated in differ-ent ways. The choice of the right distance or similarity measure can be crucial in clustering applications. There is no “correct” measure, it always depends on the data set.

NB.: A similarity measure is simply the ‘opposite’ of the distance measure.

3.3.1 Euclidean distance

The most intuitive distance measure is often thought to be the Euclidean norm, or L2-Norm,

(12)

points Xi and Xj the Euclidean distance is defined as dE(Xi, Xj) = kXi−Xjk2 = v u u t d X m=1 (xm,i−xm,j)2 3.3.2 Manhattan distance

The Manhattan distance is in fact the L1-Norm defined as

dM(Xi, Xj) = kXi−Xjk1 = d X m=1 |xm,i−xm,j| 3.3.3 Cosine similarity

Introduced by [Salton et al., 1975], the cosine similarity is often thought to be more suitable for document clustering as the length of the vector, i.e. the number of terms in a document, does not impact the distance measure, but the relative distribution of terms will matter. The cosine similarity is defined as

sC(Xi, Xj) =

Xi×Xj

kXik2kXjk2

where × denotes vector multiplication and kk2 is the L2-Norm.

The Cosine distance accordingly is defined as

dC(Xi, Xj) = 1−sC(Xi, Xj)

3.3.4 Jaccard Coefficient

The Jaccard Coefficient is similar to the Cosine similarity. It compares the weight of terms documents share with the weight of their overall terms. The Jaccard similarity is defined as

dJ(Xi, Xj) =

Xi×Xj

kXik22+kXjk22−Xi×Xj

More on distance measures can be found in [Strehl et al., 2000] and [Huang, 2008] who have experimented with different measures and discussed their effectiveness in text document clus-tering. The general opinion is Cosine similarity and the Jaccard Coefficient are most suitable for Document Clustering.

(13)

4. Application

In this chapter I will present my results when using k-means and hierarchical Clustering on the Quantlet data set.

4.1 The data set

The documents I cluster are codes and programs produced by researchers at the Ladislaus von Bortkiewicz Chair of Statistics of Humboldt University Berlin. Those codes and pro-grams, called Quantlets, are published on the Chair’s GitHub page, see GitHub.com/Quantlet. Quantlets are accompanied by a metainfo file, which is a text file containing a name of the Quantlet, the name of the author, a description, some keywords and where it was published. When translating into a text mining environment, you could say that the most relevant words from each document have been carved out by the author stating his keywords. For my Clustering I thus use the (on average) 9 keywords per document which provide a term-document matrix:

document basis graphical orthogonal plot probability

ADM/HermPolyPlot/Metainfo.txt 1 2 1 1 1

ARR/ARRboxage/Metainfo.txt 0 1 0 1 0

ARR/ARRboxgscit/Metainfo.txt 0 1 0 1 0

ARR/ARRboxhb/Metainfo.txt 0 1 0 1 0

ARR/ARRcormer/Metainfo.txt 0 1 0 1 0

Table 4.1: ”Head” of the quantlet data set; first five terms for the first five documents

The resulting document clusters are supposed to share more keywords with those documents in their own cluster than with documents in other clusters.

The Quantlet term-document-matrix is one of size 2064×1286, d= 1286 terms (i.e. variables) and n = 2064 documents.

4.2 Specific properties of the Quantlet data set

The Quantlet term-document matrix has some specific properties different from usual document clustering problems.

1. Not as many terms per document as in the case of extraction by preprocessing text data. Rarely more than 1 same term per document so that computing variations of the term-document matrix, as for example a TF-IDF matrix, becomes redundant.

(14)

Figure 4.1: Number of Keywords Distribution NumberOfKeywords

2. Unequally distributed number of terms per document as each author had a different understanding of how many keywords were appropriate. As visible in Figure 4.1, most documents have between 5 and 15 keywords (88 % of the documents to be exact). In document clustering this would correspond to different lengths of each document.

Implications

Doubts concerning the applicability of the Euclidean distance can be dropped as the Quantlet term-document matrix is nearly a binary matrix with few entries different than 1. That is why I did not apply other similarity measures, as the cosine similarity or the Jaccard coefficient. Comparing their performance could still be content for further analyses.

4.3 k-means

The k-means algorithm runs at a computationally very low cost which generally makes it the choice number one clustering method. K-means performed quite well on the Quantlet data set, finding relatively equally sized clusters making sense at the first glance.

Table 4.2 shows three of ten clusters that I generated using my [ manual k-means algorithm] with Euclidean distance and random centroid initialization. As our observations i.e. file names do not reveal information about the nature of its content, I decided to look at the final centroid values (after 7 iterations). The length d centroid vector has most of its elements equal to or close to zero; namely for the terms which do not appear in any of its cluster’s documents. The highest centroid vector values identify the terms appearing in most of the documents’ keywords. For example, the term “option” appears nearly 2 times on average in every document of Cluster 1, which I labelled being a “Finance Cluster” given the prevalent terms of its documents. The

(15)

Cluster 1: “Finance” Cluster 2: “Visualization of data” Cluster 3: “Time series”

option 1.96 visualization 1.11 plot 0.24

price 1.22 plot 0.93 time 0.17

financial 0.59 graphical 0.91 series 0.15

black 0.55 representation 0.91 model 0.12

scholes 0.54 data 0.77 simulation 0.11

Table 4.2: For 3 of 10 clusters computed with k-means: The terms having highest centroid-values and their shortened centroid-centroid-values

[ k-means centroids]

Figure 4.2: Number of observations per cluster when using k-means with k=15 [ Cluster sizes]

generally lower centroid values in Cluster 3 are partly due to the higher number of observations contained in that cluster.

4.3.1 Determining the optimal number of clusters

The most often cited disadvantage of k-means as opposed to hierarchical clustering techniques is the necessity of choosing the number of clusters k beforehand. However, there are different methods to solve this problem.

Elbow Method

One of those is theElbow Method which consists in comparing the sum of within-cluster squared distances of a clustering for different values of k. When there is a steep decrease of sum of squared distances from k= ˜k−1 to k = ˜k , then ˜k is the optimal number of clusters.

I recall that the total sum of squared distances between data points and their cluster’s barycen-tre is n X i=1 k X j=1 kX_i(j)−cjk22

(16)

where cj is the cluster j , X

(j)

i all Xi that belong to cluster j and k.k22 the squared Euclidean

distance.

Figure 4.3: Sum of squared distances as a function of the number of clusters [ sum of squared distances]

However, this approach does not always yield an obvious and unique number of clusters. Figure 4.3 shows the total sum of squared distances as a function of the number of clusters for the Quantlet Clustering. We cannot clearly identify an elbow, thoughk = 3, k= 15, k = 22, k = 25 are possible candidates. The choice of number of clusters will depend on the purpose and the assessment of the user.

Silhouette Score

Another method is checking the Silhouette Score (or other performance scores) for different values of k and choose one value with a high score. 1

4.3.2 Centroid initialization problem

A more serious problem than determining the adequate number of clusters lies within the initialization of centroids. When running the hand-written algorithm described in section 2.1, each run with a different random initialization returns different clusters. Possible solutions to this problem are:

Consensus Clustering which consists of generating a set of clusterings from the same data set - using different techniques or the same algorithm run with different initializations - and combining them into a final clustering, where each data point’s cluster affiliation is decided

(17)

upon by majority vote. The goal of this combination process is to improve the quality of individual data clusterings. [VEGA-PONS and RUIZ-SHULCLOPER, 2011]

k-means++which is an extension of classical k-means proposed by [Arthur, 2007]. It contains a seeding method for “cleverly” choosing the first centroids. The initialization of k-means++ works as follows:

1. One data point is chosen randomly to be a cluster centroidc1. Let ˆk be the initialization

step. For now, ˆk = 1.

2. We computed(Xi) =d(Xi, c1) ∀i= 1, ..., n

3. Choose another data point to be a new cluster centroid c2 where the probability that Xi

is chosen is proportional to d(Xi)2. By definition, data points far away from c1 are more

likely to be chosen. Now we have ˆk = 2.

4. Repeat steps 2 and 3 while now computing d(Xi) = min c=c1,...,cˆk

d(Xi, c) as the distance ofXi

to its closest already computed centroid until ˆk =k and k centroids have been chosen. 5. Proceed with standard k-means.

4.4 Hierarchical Agglomerative Clustering

Applying standard agglomerative hierarchical clustering on the Quantlet data did not yield satisfying results as the algorithm produced one big cluster containing most of the observations while allocating all remaining clusters only few observations.

1 2 3 4 5 6 7 8 9 10 12 14

Hierarchical Clustering 15 clusters

Cluster

Number of obser

v

ations per cluster

0 500 1000 1500 (a) 15 clusters 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39

Hierarchical Clustering 40 clusters

Cluster

Number of obser

v

ations per cluster

0

200

600

1000

(b) 40 clusters

Figure 4.4: Number of observations per cluster for a Hierarchical Clustering using Complete Linkage

(18)

0 10 20 30 40 10 194 2 641395 738 74022 5827912 17815130 43028 23 27 42 47 16 46301 772 213 6164025834122 328 413 121 42362562 119615 125 425 42755 57144 147 148 265 343 344 347 117 145 364 298 297 29954329 46296294 11099694 692 693 360 461 467 396 570 698 134 180 13298284 283 406 422 546 582 585 269 270 756 758 101 597 187 272 5957342182349 877 878 941 98811 1473044 45727 765 37629211 715 278 596 835 317102436156 76917 561 737 293 1 393 95189 30463 409 175 410 593 594 989 640 792 4201027160 179 345 214 216 6481104773 7741061 1105414 415 762 763 9 13 18452 407 454 604 457 458 266 112 127 181 19478768 4311065440 263 697 571 615 617 108 5781258 1260 1259 1256 1257107 745 880 742 744 383 324 334 212 164 165 599 601 302 303 728 177 387 77732 33 97228 703 714 718 282 331 280 295 634 796 118 333 766 767 683 109 764 133 684 220 438 439 3941195114 159 161 210 980 978 979 290 977 386 94636 1191 1004 1006787 785 786 784 788 641 149 1461206691 618 620 126 306 424 382 685 629 252 548 953 956 826 918 943 958 959 281 750 713 818 820 819 821 573 493 3671215971 910 3781214574 7331085 1083 10847711086 1087572 576 472 478 884 883 888 338 4491286 1284 1285322 166 948 541 455 492 445 453 502 192 257 882 577 749 575 9961200 1274261 426 779 170 120 143 171 695 696 836 940 538 751 256 428 699 49611566611155 1158419 6081001375 719 315 622 627 225 621 623 369 584 321 1911216 1217138 27625 24806 805 80783131 162 8001043654 665 652 653 551 530 874 876 429 491 487 484 488 262 840 341 330 247 323 339 342 340 335 336 755 757 1881097 92 103 105 13791 93 1197 949 950 9421052915 976 753 954 955 565 563 564 416 893 894 897 528 885 348 913 327 7161235991 924 720 775 707 589 580 2961243 1244 10666691072 1070 50 1071 1090 1053960 9001091776 735 736 700 947 539 55511855531188494 804 446 459 957 352 354 351 353 289 291 250 911 197 1981182 1183 1012 1018965 972100510029629611023 1025 1232 1233879 867 870 868 869 872 871 873 864 866 863 865 844 846 781 769 748 731 732 725 726 993 994 4001029936 937 920 922 919 921 388 389 390 346 3371063 1062 1064332 312 313 201 163 227 780 154 155 795 793 113 441 794 1061187 1122 1186 1074970 898 905 904 908 909 802 7911101770 754 747 676 678 677 675 673 674 639 633 853 852 850 851 517 519 518 516 512 513 514 515 497 498 481 729 456 495 469 926 925 927 358 356 357 359 231 656 657 186 224 23081 31 1239 1237 1238 1225 1224 1211 1212 1207 1220 1221 1088 1030 1026963 964 929 895 896 9391124 1279891 889 886 887 890 892 881 859 858 857 856 854 855 803 799 798 760 761 752 702 650 651 643 644 645 6301153 1271 1262 1263581 560 558 559 544 545 499 529 527 526 524 525 477 474 475 476 471 473 405 385 403 399 397 402 398 401 326 448 451 447 450 314 2601131 1133 1134 1135 1136254 207 203 204 173 172 176 167 168 169 70968 69 1282 1280 1281 1209 1210 1198 1194 1179 1177 1175 1176 1181 1178 1180 1171 1172 1173 1170 1168 1169 1157 1162 1141 1142 1137 1125 1126 1127 1119 1174 1120 1121 1112 1113 1114 1107 1102 1103 1098 1094 1128 1129 1093 1089 1092 1080 1081 1073 1068 1067 1099 1100 1059 1060 1056 1054 1045 1050 1049 1048 1046 1047 1041 1038 1039 1033 1199 1019 1022 1020 1021 1007 1003998 999 997 995 9921017 1016 1015 1011 1014 1013 1036 1037982 944 945 938 933 932 930 931 839 837 838 831 830 825 827 790 789 782 778 724 722 723 721 712 686 689 687 688 6801044679 6811140 1167 1165 11666631138 1139658 659 655 666 667 834 649 690 638 637 632 808 809 592 923 552 554 534 537 536 535 533 531 532 522 523 520 521 510 509 511 508 506 507 486 490 485 489 464 468 465 466 443 444 442 433 437 434 435 432 500 501 421 380 381 3791253 1251 1252377 362 363 316 311 3051151 1150 1149 1148 1146 11473101230 1231267 178 153 152 150 151 128 129 249 248 246 245 244 243 242 241 240 239 238 237 236 235 234 232 23390325 82264 65 66 1229 51 48 49 393 1283 1278 1276 1277 1275 1272 1273 1270 1269 1267 1268 1265 1266 1264 1261 1254 1255 1250 1249 1247 1248 1245 1246 1242 1240 1241 1234 1228 1226 1227 1223 1222 1219 1218 1213 1208 1205 1204 1203 1201 1202 1196 1192 1193 1190 1189 1184 1163 1164 1161 1160 1159 1154 1152 1145 1143 1144 1130 1132 1123 1118 1117 1115 1116 1110 1111 1108 1109 1106 1096 1095 1082 1079 1078 1077 1076 1075 1069 1057 1058 1055 1051 1042 1040 1034 1035 1031 1032 1028 1010 1009 1008 1000990 986 984 985 983 981 975 973 974 966 967 952 934 935 928 916 912 914 906 907 903 902 899 901 875 861 862 860 849 848 847 845 842 843 841 832 833 824 823 817 816 815 814 812 813 811 810 801 797 743 741 717 710 711 706 708 704 705 701 682 672 670 671 668 664 662 660 647 642 635 636 628 624 626 619 614 613 609 610 606 607 605 603 600 591 590 588 586 587 583 579 569 567 568 556 557 549 550 547 542 543 540 505 503 504 483 480 482 479 470 460 418 412 411 408 404 391 392 373 374 372 371 370 368 365 366 355 350 320 318 292 288 285 286 277 268 271 264 255 251 253 229 223 226 221 222 219 217 218 208 202 199 200 196 193 195 189 190 182 183 174 158 157 156 142 141 139 140 135 136 116 124 115 104 100 10294 95 87 88 86 84 85 79 80 77 75 72 73 71 70 59 60 35 52 53646 829 611 612 566 968 969 828 625 987 417 631 319 598 783 436 602 273 274 275 287 209 73963307 3841236308 309 259 111 746201847418526759 215 205 2064330038123 767 37304

Figure 4.5: Dendrogram as a result of an agglomerative hierarchical Clustering using Complete Linkage for the Quantlet data set computed in R

[ whole dendrogram]

Figure 4.4 shows the allocation of observations among clusters for an Agglomerative Hierarchical Clustering using Complete Linkage and 15 or respectively 40 clusters. As the aim of our clustering is to represent Quantlets in coherent groups, this repartition is not helpful. Now supposing those remote observations are outliers, we can remove them and cluster again. Nearly equivalent to this approach, is to choose a higher number of clusters and exclude those with few observations; though this method is even better for possibly joining observations that we would have classified as outliers. But even when choosing 40 clusters, we remain with a 1441-observation cluster and one 139-1441-observation-cluster, while the other 38 clusters have less than 51 observations.

In a dendrogram, the y-axis marks the distance at which clusters merge, while the observations are placed along the x-axis for reasons of clarity. The dendrogram of Figure 4.5 shows that the first split results in 4 observations being put in one cluster and 2060 in the other. Even the fourth split leaves only 9 observations out of the large all-inclusive cluster. Cutting the dendrogram at 60 clusters still leaves 1028, so half of the documents, in one single cluster. In fact, we cannot find a satisfying split, even when ignoring outliers. The dendrogram can be represented in a truncated form; the user can either decide on a truncation distance or a maximum number of clusters to be displayed. An example code in R is provided for [ hierarchical clustering truncation].

The phenomenon of unequal cluster sizes even intensifies when choosing single or average linkage instead of complete linkage. This is due to complete linkage generally producing rather equally sized clusters.

(19)

4.5 Clustering Performance Evaluation

I already evaluated the k-means algorithm “manually” by examining the cluster centers and checking for a common topic for each cluster. An “indirect” evaluation of the two clusterings concerns their utility in their intended application. As the goal of clustering our Quantlets is to represent them in coherent groups on the GitHub web site, clusters with more or less equal sizes are a lot more appropriate. This is why the manual and indirect evaluation suggests that the k-means algorithm performs better than hierarchical clustering.

In order to evaluate the quality of a clustering quantitatively, it would be best to have true

labels for our documents, i.e. categories they belong to, as “portfolio optimisation”, “risk theory” or “Clustering”. This form of “external” evaluation would leave no doubt about which clustering techniques lead to the most correct result.

Unfortunately, real-world data, often do not provide suchtrue labels. The absence of category information distinguishes data clustering(unsupervised learning) from classification or dis-criminant analysis (supervised learning). The aim of clustering is to find structure in data and it is therefore exploratory in nature. [Jain, 2010]

The Quantlet data set does not provide the above-mentioned labels. Therefore, they require “internal” evaluation, where the clustering is summarized to a single quality score. I will present two such indices; the Silhouette Score and the Calinski-Harabaz Score.

4.5.1 Silhoutte Score

The Silhouette Score is a measure of how similar an observation is to its own cluster in com-parison to other clusters. It was introduced by [Rousseeuw, 1987].

The Silhouette value of a data point i which was attributed to Cluster A is

s(i) =      b(i)−a(i) max{a(i), b(i)} if Card(A)>1 0 if Card(A) = 1 where a(i) = 1 Card(A)−1 X j∈A,i6=j

d(i, j) is the average within-cluster dissimilarity

and b(i) = min

C6=A

X

j∈C

d(i, j) the average dissimilarity to the next closest cluster.

Therefore we have s ∈ [−1,1] where s = 1 represents a very dense and separable clustering,

s = 0 overlapping clusters and s <0 an incorrect clustering. The higher the score, the better is our clustering performance.

(20)

Figure 4.6: Silhouette Score for k-means and Hierarchical Clusterings [ SilhouetteScore]

As we can infer from Figure 4.6, the Hierarchical Clustering with Average Linkage performed best in terms of Silhouette Score. Second best is the Hierarchical Clustering with Complete Linkage, while especially performing well for small values of k. K-means, K-means++ as well as the Hierarchical Clustering with Ward Linkage seem to perform worse.

Keeping in mind the very unequal distribution of observations among clusters for the Hierar-chical Clustering with Average Linkage, the Silhouette Score seems inappropriate in choosing the “correct” Clustering technique.

4.5.2 Calinski-Harabaz Score

The Calinski-Harabaz Score, also known as Variance Ratio Criterion, is the ratio of the mean between cluster dispersion (i.e. cluster variance) and the within-cluster dispersion. [Caliski and Harabasz, 1974] The score s is defined per clustering.

Let k be the number of clusters. Then

s(k) = T r(Bk)

T r(Wk)

× N−k k−1

where N is the number of observations in the data set and Tr() the trace of matrix Bk and Wk

respectively.

The within-cluster dispersion matrix is Wk = k P j=1 P Xi∈Cj (Xi−cj)(Xi−cj)T

(21)

and the between-cluster dispersion matrix is Bk =P j

nj(cj−c)(cj −c)T

whereCj denotes the set of data points within cluster j , cj its centroid vector, nj =Card(Cj)

and c is the barycentre of the whole data set.

The score is lower bounded by 0 but has no upper bound. Higher scores indicate denseness and good separability of the clusters.

Figure 4.7: Calinski-Harabaz Score for k-means and Hierarchical Clustering [ CalinskiHarabazScore]

Figure 4.7 shows the Calinski Harabaz Score for K-means and three different linkage criteria of the Hierarchical Clustering. Interestingly, the results of this analysis are the exact opposite of those proposed by the Silhouette Score. K-means and Hierarchical Ward Linkage perform best while Complete and Average Linkage perform worse. We also observe that the score generally decreases with the number of clusters.

4.6 Results

For this specific data set it seems more adequate to rely on a manual and indirect evaluation as the two quantitative scores presented are contradictory and also the hierarchical clustering does not satisfy the intended application. The k-means algorithm appears more appropriate. I also found that for hierarchical clustering, while finding very unequal cluster sizes for the Quantlet data set, complete linkage is still rather computing more equal cluster sizes than single or average linkage.

(22)

5. Conclusion

Clustering is a powerful tool for understanding the structure of a data set, in particular a set of documents. In the previous sections I presented two commonly used clustering algorithms, k-means and agglomerative hierarchical Clustering, and introduced concepts crucial to the understanding of a document’s numerical representation in the vector-space model.

I presented an application of both, k-means and agglomerative hierarchical clustering, on the Quantlet data set and discussed their performance on two different levels; a purely manual and indirect evaluation prefers the k-means algorithm over any kind of hierarchical clustering as it returns clusters with more or less even cluster sizes while a purely quantitative internal evaluation suggests different algorithms and linkage methods depending on the score used. While the application is similar to document clustering, it should rather be labeled “keyword clustering” as the term-document matrix is built by keywords instead of term frequencies. As part of the application of k-means, I presented some numerical approaches of determining the number of clusters and discussed the initialization problem.

Summarizing literature about document clustering, most find that beyond having relatively low computational requirements, partitional techniques are better suited for document clustering than hierarchical ones. [Aggarwal et al., 1999]

(23)

References

Aggarwal, C. C., Gates, S. C., and Yu, P. S. (1999). On the merits of building categorization systems by supervised clustering. In Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 352–356. ACM.

Arthur, D.; Vassilvitskii, S. (2007). ”k-means++: the advantages of careful seeding”. In

Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, pages 1027–1035. Society for Industrial and Applied Mathematics Philadelphia, PA, USA.

Baeza-Yates, R., Ribeiro, B. d. A. N., et al. (2011). Modern information retrieval. New York: ACM Press; Harlow, England: Addison-Wesley,.

Caliski, T. and Harabasz, J. (1974). A dendrite method for cluster analysis. Communications in Statistics, 3(1):1–27.

Florek, K., Lukaszewicz, J., Perkal, J., Steinhaus, Hugo, Zubrzycki, S. (1951). Sur la liaison et la division des points d’un ensemble fini. Colloquium Mathematicum, 2(3-4):282–285.

http://eudml.org/doc/209969.

Huang, A. (2008). Similarity measures for text document clustering. InProceedings of the sixth new zealand computer science research student conference (NZCSRSC2008), Christchurch, New Zealand, volume 4, pages 9–56.

Jain, A. K. (2010). Data clustering: 50 years beyond k-means. Pattern recognition letters, 31(8):651–666.

Joel Larocca Neto and Alexandre D. Santos and Celso A.A. Kaestner and Neto Alexandre and D. Santos and Celso A. A and Kaestner Alex and Alex A. Freitas and Catolica Parana (2000). Document clustering and text summarization.

Karypis, M. S. G., Kumar, V., and Steinbach, M. (2000). A comparison of document clustering techniques. InTextMining Workshop at KDD2000 (May 2000).

Kowalski, G. J. (2007). Information retrieval systems: theory and implementation, volume 1. Springer.

MacQueen, J. et al. (1967). Some methods for classification and analysis of multivariate ob-servations. In Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, volume 1, pages 281–297. Oakland, CA, USA.

(24)

Rousseeuw, P. J. (1987). Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of computational and applied mathematics, 20:53–65.

Salton, G., Wong, A., and Yang, C.-S. (1975). A vector space model for automatic indexing.

Communications of the ACM, 18(11):613–620.

Strehl, A., Ghosh, J., and Mooney, R. (2000). Impact of similarity measures on web-page clustering. In Workshop on artificial intelligence for web search (AAAI 2000), volume 58, page 64.

VEGA-PONS, S. and RUIZ-SHULCLOPER, J. (2011). A survey of clustering ensemble al-gorithms. In International Journal of Pattern Recognition and Artificial Intelligence, vol-ume 25, pages 337–372. Advanced Technologies Application Center, 7A 21812, Siboney, Havana 12200, Cuba, World Scientific Publishing Company. https://doi.org/10.1142/ S0218001411008683.

Zhao, Y., Karypis, G., and Fayyad, U. (2005). Hierarchical clustering algorithms for document datasets. Data mining and knowledge discovery, 10(2):141–168.

(25)

Statutory Declaration

I declare that I have developed and written the enclosed Bachelors Thesis completely by myself, and have not used sources or means without declaration in the text. Any thoughts from others or literal quotations are clearly marked. The Bachelors Thesis was not used in the same or in a similar version to achieve an academic grading or is being published elsewhere. I understand that violations of these principles will result in proceedings regarding deception or attempted deception.

Luisa Krawczyk Paris, April 23, 2019

Comparing applicability of prevalent Clustering Algorithms for Document Clustering