
Clustering Algorithms for High-Dimensional Data



ÁNGEL POC LÓPEZ

CLUSTERING ALGORITHMS FOR HIGH-DIMENSIONAL DATA

Bachelor of Science Thesis

Examiner: Prof. Tapio Elomaa

Examiner and topic approved on 16 January 2018


ABSTRACT

ÁNGEL POC LÓPEZ: Clustering Algorithms for High-Dimensional Data
Tampere University of Technology
Bachelor of Science Thesis, 22 pages
May 2018
Bachelor's Degree Programme in Information Technology
Major: Computer Science (University of Zaragoza)
Examiner: Tapio Elomaa

Keywords: cluster analysis, high-dimensional data, curse of dimensionality

More and more data are produced every day. Some clustering techniques have been developed to process these data automatically; however, when the data are characteristically high-dimensional, conventional algorithms do not perform well. In this thesis, problems related to the curse of dimensionality are discussed, as well as some algorithms that approach the problem. Finally, some empirical tests have been run to check the behavior of such approaches. Most algorithms do not really cope well with high-dimensional data. DBSCAN, some of its derivations and, surprisingly, 𝑘-means seem to be the best approaches.


PREFACE

First, I would like to thank my supervisor Tapio Elomaa for guiding me throughout this process and for always being available to meet and discuss the progress and the results of the thesis. Next, I would also like to thank professor Javier Fabra for making all the Erasmus paperwork between universities easier and for always being attentive whenever I had any doubt about the process. Last, I would like to thank my Erasmus friends, my friends from university and from school, and above all my family for always being as supportive as they are.

Tampere, 15.5.2018


CONTENTS

1. INTRODUCTION
2. BACKGROUND
3. RELATED WORK
3.1 𝑲-means algorithm
3.2 Density Based Spatial Clustering of Applications with Noise (DBSCAN)
3.3 Hubness application to Cluster Analysis
4. ALGORITHMS FOR HIGH-DIMENSIONAL DATA
4.1 Axis-Parallel Subspaces
4.2 Pattern Recognition Approaches
4.3 Arbitrarily-Oriented Subspaces
5. EXPERIMENTAL TESTS AND RESULTS
6. CONCLUSIONS


LIST OF TABLES AND FIGURES

Table 1: F-measure results for each algorithm with the different databases. Green color highlights the best results, orange denotes running time problems and finally red symbolizes Java memory exceptions.

Figure 1: Sample databases for Density Based Clustering [3].

Figure 2: Graphical example. 𝐶 marks the centroid, 𝑀 the medoid and the green circles mark points with high hubness [4].


LIST OF SYMBOLS AND ABBREVIATIONS

4C  Computing Correlation Connected Clusters algorithm
CASH  Clustering in Arbitrarily Oriented Subspaces based on the Hough transform algorithm
CLIQUE  Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications algorithm
COPAC  COrrelation PArtition Clustering
DBSCAN  Density-Based Spatial Clustering of Applications with Noise algorithm
DOC  Density-based Optimal projective Clustering algorithm
DiSH  Detecting Subspace cluster Hierarchies algorithm
ERiC  Exploring Relationships among Correlation clusters algorithm
ORCLUS  arbitrarily ORiented projected CLUSter generation algorithm
PCA  Principal Component Analysis
P3C  Projected Clustering via Cluster Cores algorithm
PreDeCon  subspace PREference weighted DEnsity CONnected clustering
PROCLUS  fast algorithm for PROjected CLUStering
SUBCLU  density-connected SUBspace CLUstering
𝑘  number of clusters or the 𝑘-th cluster
𝑘-means  algorithm for data clustering
𝑘-NN  𝑘 Nearest Neighbors
𝑥  data sample; depending on the index it can refer to a specific sample of the database or to one of its features


1. INTRODUCTION

According to Jain [1], we can define cluster analysis as the compilation of methods and techniques for grouping or clustering data according to the mathematical characteristics or patterns of the data, which can reveal underlying relations. In machine learning, supervised learning and unsupervised learning can be differentiated. In the former, data are labelled according to how we want the model to be trained. In the latter, where cluster analysis belongs, there is no previous knowledge of how the data must be labelled.

People nowadays produce increasingly more data. Scientists are able to run experiments that create such an amount of information that they cannot process it manually, let alone extract conclusions at a glance. In the Internet era, society creates data with every step taken on a web site and every time something is shared on a social network. The total amount of data in the world was 4.4 zettabytes in 2013, in 2014 we produced 2.5 exabytes of data every day, and the amount of data is expected to rise to 44 zettabytes by 2020 [2].

It is important to analyze the data because it may hide some important characteristics that can be useful for further research. In biology, clustering has been widely used to extract information from microarray data, also known as gene expression data. Depending on the way the microarray data are clustered, the result can be either genes with common functions or different genetic relationships or disorders. Another important application everybody uses day-to-day is customer recommendation systems like those found in online shops or online TV platforms. It should be noted that companies spend a lot of money to get to know the different groups of people that buy their products so that they can address each of those groups individually. Again, the technique typically used to distinguish the different customers is data clustering, but those applications are only some examples among the wide variety that can be found in real life. In 2007 there were 1660 entries published in Google Scholar including the phrase data clustering [1].

This thesis aims to unify the existing information about data clustering for high-dimensional data and to clarify the different approaches that exist to this day for solving a problem with the mentioned characteristics. There will also be some empirical tests to compare the performance of such approaches. This work does not intend to be another paper on 𝑘-means but to be a brief document for learning how to handle the data clustering problem in general and for helping the reader quickly find a suitable method to solve his or her problem. This thesis takes as its main reference an article about data clustering by Kriegel, Kröger, and Zimek [3].

In the following section a context for cluster analysis is created, together with some definitions and notation needed to understand the thesis. In Section 3, some basic algorithms are described, because they will be further exploited in Section 4, as well as some approaches that lie outside the pursued classification. In Section 4, a classification of clustering algorithms is presented, with descriptions and some sample algorithms. Section 5 follows with a set of experimental tests on selected algorithms. Finally, Section 6 presents the conclusions the thesis has reached.


2. BACKGROUND

Machine learning is a branch of computer science and artificial intelligence that attempts to teach computers how to learn a series of behaviors. It is used to predict a set of outputs given some pre-arranged inputs or features of interest. The outputs can be widely different depending on the objective of the desired application, from an integer or a set of integers to a Boolean value expressing membership of a class or different ways of grouping some data.

We can mainly distinguish two classes of machine learning: supervised learning and unsupervised learning. Supervised learning consists of developing a function that maps the inputs to some desired output given a certain known label that discriminates the samples. In unsupervised learning the purpose is to find out or learn some properties of the data from the inputs without having a predefined label that separates them.

Data clustering or cluster analysis is a special case of unsupervised learning in which samples are classified into different groups given certain properties of their features. Depending on the technique used, groups may or may not overlap each other.

A classical method used in cluster analysis is 𝑘-means [1]. This method divides the data into 𝑘 clusters depending on the closeness of each data point to a given set of means. The 𝑘-means algorithm is well known, but it shows poor performance in more complicated scenarios such as high-dimensional data clustering. In those cases, conventional distance measures are not useful anymore and there may also be some redundant or noisy features that cannot be manually extracted. In addition, we should keep in mind that the number of samples needed to represent the different cluster distributions grows with the number of dimensions [4]. To overcome the last problem, we can use feature selection and dimensionality reduction techniques. These techniques automatically analyze the different dimensions of the data and consequently select some features or extract new ones, removing other ones that introduce redundancies or, directly, noise.

In the following, some notation is presented to be used during the rest of the thesis. A database, 𝒟, is a set of data points (samples), which are points accommodated in a 𝑑-dimensional space and presented as ⟨𝑥1, 𝑥2, …, 𝑥𝑑−1, 𝑥𝑑⟩. A feature is any of the characteristics 𝑥𝑖 of a data point that defines the point in some way. It must be said that 𝑥𝑖 can also be used to denote any of the 𝑛 samples of the database, and 𝑥𝑖𝑗 will describe the 𝑗-th feature of the 𝑖-th data point of such a database.

This work focuses on the clustering problem when there are high-dimensional data, which means that we do not have only a few features but are working on the scale of dozens, hundreds or even thousands of dimensions. Whenever a 𝑘 appears in the text, it will be associated either with the final number of clusters after processing the data or with a specific cluster.

One of the most important things in data clustering is the selection of the distance measure; it will always be used, either directly on the data points or after some transformations, to measure similarities. The most important one, because of its simplicity and its widely accepted use, is the Euclidean distance, which is the distance along a straight line connecting two points. To describe the Euclidean distance, we must first present the Euclidean norm or ℓ2 norm, which indicates the length of a vector in a 𝑑-dimensional space:

\[ \|x\| = \sqrt{x_1^2 + x_2^2 + \cdots + x_{d-1}^2 + x_d^2} \, . \]

Therefore, if we want to measure the distance between the samples 𝑥1 and 𝑥2, i.e., calculate the Euclidean distance, we must subtract the two vectors describing the points 𝑥1 and 𝑥2:

\[ d(x_1, x_2) = \sqrt{\sum_{i=1}^{d} (x_{1i} - x_{2i})^2} \, . \]

However, that is not the only distance function that can be used; we can also use the Manhattan distance, based on the Manhattan norm or ℓ1 norm, which only measures distances along axis-parallel directions, that is, not following a straight line between points as above.
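For concreteness, here is a small Python sketch (illustrative only, not part of the thesis) computing both distances for two 𝑑-dimensional samples with NumPy:

```python
# Illustrative computation of the Euclidean (l2) and Manhattan (l1)
# distances between two d-dimensional samples.
import numpy as np

x1 = np.array([1.0, 2.0, 3.0, 4.0])
x2 = np.array([2.0, 0.0, 3.0, 1.0])

euclidean = np.sqrt(np.sum((x1 - x2) ** 2))   # straight-line distance
manhattan = np.sum(np.abs(x1 - x2))           # sum of axis-parallel differences
print(euclidean, manhattan)                   # about 3.742 and 6.0
```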

These distance measures are only useful if only numerical data are present in the datasets. However, there can also be nominal or Boolean attributes, as well as strings or images. In the case of nominal attributes some transformations should be made, or some specific measure used that considers all possible values of the variable to be equidistant. When working with strings, edit distance measures can be used, in a similar way to how dissimilarities between DNA strings are calculated. If images must be treated, point-to-point differences could be used, or some hash function if it is desired to reduce the computational time.

The choice among these different distance measures, or any other one in the spectrum of existing measures, must be carefully studied, because depending on the distance function used, the shape of the final clusters may vary [3] [5].

Finally, a couple of concepts related to 𝑘-means-like algorithms are going to be described. A medoid is a concept similar to the mean, being the point of the dataset closest to the mean, whereas a centroid is the point in the 𝑑-dimensional space that represents the mean of the cluster and does not need to belong to the dataset.

3. RELATED WORK

In this section some classic approaches to the cluster analysis problem are going to be introduced, as well as some innovative approaches that had not previously been applied to our field of interest and that seem to perform well specifically in high-dimensional data clustering (but not in the low-dimensional case).

3.1 𝑲-means algorithm

This method [1] divides the data into 𝑘 clusters depending on the closeness of each data point to a given set of means; it tries to minimize the squared error between the samples and each cluster mean.

Let 𝜇𝑘 be the mean of the 𝑘-th cluster and let 𝑐𝑘 be the set of points in such a cluster; then the squared error is

\[ J(c_k) = \sum_{x_i \in c_k} \| x_i - \mu_k \|^2 \, , \]

and the goal is to minimize the global squared error 𝐽(𝐶) over all the clusters:

\[ J(C) = \sum_{k=1}^{K} \sum_{x_i \in c_k} \| x_i - \mu_k \|^2 \, . \]

This problem is known to be NP-hard, and therefore the algorithm used to optimize the function follows a greedy approach, which implies that the final solution might only be a local minimum.

There are three steps in this algorithm:

1. Initialize the 𝑘 means randomly.
2. Assign each sample to the closest mean, i.e. cluster, using a given distance function.
3. Update each cluster mean given the newly assigned samples.

One should iteratively repeat steps 2 and 3 until the clusters remain stable.

As a distance function is used, there may be problems in high-dimensional data sets because the distances between samples become less distinguishable as the number of features increases [4].
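As a minimal illustration of these three steps, the following Python sketch implements the basic Lloyd iteration; it is a didactic example with random initialization, not the implementation used in the experiments of Section 5.

```python
# Minimal sketch of the k-means (Lloyd) iteration described above.
import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: initialize the k means by picking random samples.
    means = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each sample to the closest mean (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        # Step 3: update each cluster mean from its newly assigned samples.
        new_means = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                              else means[j] for j in range(k)])
        if np.allclose(new_means, means):     # clusters remain stable
            break
        means = new_means
    return labels, means

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
labels, means = k_means(X, k=2)
print(means)
```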


3.2 Density Based Spatial Clustering of Applications with Noise (DBSCAN)

DBSCAN [5] is based on the notion that points within a cluster have higher density than points outside clusters or points in noisy areas, as the human eye can perceive from Figure 1. Each point that is a member of a cluster must have a certain number of points within a specific radius. The selection of the distance function between points is of high importance because it will determine the shape of the cluster. A very common choice in cluster analysis is the Euclidean distance, because of its simplicity. The method defines a series of properties that must be met to create a cluster.

Figure 1: Sample databases for Density Based Clustering [3].

A direct approach would require having a minimum number of points (MinPts) in the Eps-neighborhood, 𝑁𝐸𝑝𝑠(𝑝), of a certain point 𝑝. The Eps-neighborhood of a point is the set of points within distance Eps of the sample. This approach fails because there are two types of points within a cluster, the core points (inside the cluster) and the border points (at the frontier of the cluster). Using this approach on border points would certainly fail to correctly cluster the samples. Because of that, we are going to describe some other properties of density-based systems.

A point 𝑝 is directly density-reachable from 𝑞 if 𝑁𝐸𝑝𝑠(𝑞) contains 𝑝 and the number of Eps-neighbors of 𝑞 is at least MinPts. In addition, a point 𝑝 is density-reachable from 𝑞 if those points are joined by a chain of directly density-reachable points. Moreover, a point 𝑝 is density-connected to 𝑞 if there is a point 𝑜 from which both 𝑝 and 𝑞 are density-reachable. Given these definitions, we can finally define a cluster as a subset of points of a database meeting the following conditions:

1. For all points 𝑝 and 𝑞, if 𝑝 is a member of a cluster 𝐶 and 𝑞 is density-reachable from 𝑝, then 𝑞 belongs to 𝐶.
2. All points within a cluster are density-connected to each other.

We can implement an algorithm to create clusters meeting the previous conditions; in any case, in high-dimensional situations it is difficult to see and distinguish low-density regions from the rest due to the data sparsity. Some approaches focused on the high-dimensional problem are based on this DBSCAN clustering system, as we will show later.
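To illustrate the two parameters involved (Eps and MinPts), the following Python sketch runs scikit-learn's DBSCAN on synthetic data; the parameter values are arbitrary placeholders and not the settings used in the experiments later in this thesis.

```python
# Sketch of density-based clustering with scikit-learn's DBSCAN.
# eps plays the role of Eps and min_samples the role of MinPts;
# both values below are arbitrary and must be tuned per dataset.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
blob_a = rng.normal(loc=0.0, scale=0.3, size=(100, 2))   # dense cluster
blob_b = rng.normal(loc=3.0, scale=0.3, size=(100, 2))   # dense cluster
noise = rng.uniform(low=-2.0, high=5.0, size=(30, 2))    # sparse background
X = np.vstack([blob_a, blob_b, noise])

labels = DBSCAN(eps=0.5, min_samples=5, metric="euclidean").fit_predict(X)
# Label -1 marks noise points; the remaining labels are cluster indices.
print("clusters:", len(set(labels) - {-1}),
      "noise points:", int(np.sum(labels == -1)))
```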

3.3 Hubness application to Cluster Analysis

Hubness is a phenomenon that had been observed in high-dimensional data, but nobody had tried to take advantage of it by applying it to the cluster analysis problem. Instead of trying to reduce dimensionality, we will take advantage of some characteristics of high-dimensional data spaces.

Hubness is the propensity of some data points in high-dimensional data spaces to appear with higher frequency in the 𝑘-nearest neighbor (𝑘-NN) lists of other points than the rest of the samples of the dataset [4]. 𝑘-nearest neighbor lists have previously been used to build 𝑘-NN graphs, on which graph clustering can be applied, or in density-based clustering as shown before. Hubness is an implicit property of the spaces we are dealing with; therefore it cannot be applied to low-dimensional datasets.

Data lie on various hyperspheres centered at the distribution mean and, in high-dimensional data, the variance is low. In a practical case this implies that points closer to the mean will be closer to all other points and, therefore, will appear more frequently in 𝑘-NN lists. Elements with low hubness are probably further from the rest and will be outliers. It should also be kept in mind that some hubness points might be near two or more different clusters.

Figure 2: Graphical example. 𝐶 marks the centroid, 𝑀 the medoid and the green circles mark points with high hubness [4].

Then, after seeing how hubness points behave, we could think that they are the medoids of the cluster, or perhaps a good approximation to them, as we can see in Figure 2. Experiments have indeed demonstrated that some hubness points are medoids. So, this approach is based on looking for the hubness points in order to use them as approximations of the data centers.

The algorithm proposed by Tomasev et al. [4] frequently shows an improvement over some centroid-based approaches on high-dimensional datasets, on both synthetic and real data, also behaving well under noisy conditions.
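To make the notion of hubness concrete, the Python sketch below (an illustration, not the algorithm of Tomasev et al.) counts how often each point occurs in the 𝑘-NN lists of the other points; the value 𝑘 = 10 and the data dimensions are arbitrary assumptions.

```python
# Sketch of hubness scores: for each point, count its occurrences in the
# k-NN lists of all other points. Points with unusually high counts are hubs.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 100))          # 500 samples in 100 dimensions
k = 10                                   # assumed neighborhood size

# Ask for k + 1 neighbors because each point is its own nearest neighbor.
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
_, idx = nn.kneighbors(X)
neighbor_lists = idx[:, 1:]              # drop the self-neighbor column

# Hubness score N_k(x): occurrences of each point across all k-NN lists.
counts = np.bincount(neighbor_lists.ravel(), minlength=X.shape[0])
hubs = np.argsort(counts)[::-1][:5]      # indices of the 5 strongest hubs
print("top hubs:", hubs, "counts:", counts[hubs])
```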


4. ALGORITHMS FOR HIGH-DIMENSIONAL DATA

This section goes deeper into the clustering problem, distinguishing different types of clustering problems for high-dimensional data and different ways to address them. As we have commented before, high-dimensional data clustering has some intrinsic problems, such as local feature relevance or local feature correlation, meaning that different subsets of the features or different correlations are more important for different clusters, respectively.

Feature selection techniques are a way to address the high-dimensional problem; however, techniques like Principal Component Analysis (PCA) choose only one subspace in which the clustering is subsequently done, and that contradicts the principles discussed above.

Figure 3: Problems to be solved in high-dimensional data clustering [3].

In the cluster analysis problem we have to solve both the cluster search and the subspace search at the same time; a visual reference can be found in Figure 3. Both problems have a huge search space (in the case of subspace searching it can be infinite) and some heuristics must be developed to reduce the time consumption. As a consequence of applying heuristics, the solution can be non-optimal. Naturally, we discard the brute-force approach of checking each possible subspace projection because the number of subspaces can be infinite.

Three types of cluster searching approaches are going to be discussed: finding clusters in axis-parallel subspaces, pattern recognition approaches and, finally, finding clusters in arbitrarily oriented subspaces. A great portion of the knowledge for this part has been acquired from the survey by Kriegel et al. [3].

4.1 Axis-Parallel Subspaces

One simple idea when working with high-dimensional data is to restrict the problem to axis-parallel subspaces. In a dataset with 𝑑 dimensions we pass from infinitely many subspaces to 2^𝑑 − 1 possible subspaces; nevertheless, with the size of 𝑑 dealt with in this thesis, some heuristics still have to be applied.

We can categorize the algorithms in two ways: on the one hand, depending on how the algorithm is implemented, i.e. a bottom-up or a top-down approach, or, on the other hand, based on the presumptions or simplifications made about the problem. Top-down algorithms start from the full-dimensional space and try to refine and determine the features that constitute each subspace and that, after projecting, will create a cluster. There are some constraints related to the problem, and to break them researchers tend to use the locality assumption, which leads to the assumption that the subspace that establishes a cluster can be learned from the local neighbors of its members.

Bottom-up algorithms start from all one-dimensional subspaces, and most of them evolve relying on the reverse of the monotonicity property: if a space T does not contain a cluster, then no space containing T will contain a cluster.

Now, according to the other classification of algorithms, the first subset to be analyzed is the projected clustering algorithms. These algorithms attempt to bind each data sample uniquely to a cluster; they usually work like normal clustering algorithms, but the difference resides in a specialized distance function that depicts the subspaces of the clusters, finding the projection to be made. These algorithms rely on the fact that some data points are close when they are projected using a certain subset of the features, but they are not close anymore when they are projected using the rest of them. Projected clustering algorithms can use full-dimensional, partitional, hierarchical and density-based techniques [6]. PROCLUS [7] is a 𝑘-medoid algorithm similar to the one discussed in Section 3, the 𝑘-means algorithm. It initializes 𝑀 medoids, of which it will choose 𝑘 to start working with. The current medoids are refined by minimizing the standard deviation of the distance from each medoid to its neighbors; after that, the samples are assigned to the medoids keeping in mind the important features of each medoid, i.e., the subspace. If one of the current medoids can be replaced by one of the remaining medoids while improving the results, it will be changed. After optimizing the system, noisy samples are detected and taken out of the clusters. There are some variations of this algorithm. PreDeCon [8] is based on the previously mentioned DBSCAN; however, in this case it uses a specialized function to detect the desired subspaces. As in DBSCAN, it requires a number of parameters which are difficult to guess.


A second class of algorithms, or a subclass of the previous one, is the "soft" projected clustering algorithms, which assume that the number of clusters is already known and the only thing to be done is to optimize a function in which all dimensions play a role and none of them is discarded.

To continue, we present the subspace clustering algorithms. We must distinguish between this subset and the projected clustering algorithms because in the clustering literature there have frequently been some confusions or misconceptions. Subspace clustering algorithms aim to find all clusters in all possible subspaces. This is a problem because of the aforementioned dimensionality. They start with all the one-dimensional clusters and keep merging them; all of the approaches are developed as bottom-up algorithms. CLIQUE [9] is the first approach proposed in subspace clustering, and it divides the space into units of the same size, 𝜉, along all dimensions. Only the partitions that include a minimum number of 𝜏 points will be considered dense, and therefore a cluster will be a group of adjacent dense partitions. SUBCLU [10], in contrast to CLIQUE, can find clusters of arbitrary shapes (in the relevant subspace, not the full-dimensional one). It is based on the density-connection principle of DBSCAN, but in addition it adds some more definitions to apply restrictions to the subspace search. It outputs the same clusters as applying DBSCAN to those resulting subspaces.
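To give a flavor of the bottom-up, grid-based idea behind CLIQUE, here is a Python sketch (not the original algorithm, and with assumed example values for 𝜉 and 𝜏) that finds dense one-dimensional units and builds two-dimensional candidates only from them, which is exactly where the monotonicity pruning enters.

```python
# Illustrative sketch of the CLIQUE-style bottom-up grid idea:
# 1) partition every dimension into xi equal-width units,
# 2) keep the 1-D units holding at least tau points (dense units),
# 3) form 2-D candidates only from dense 1-D units (monotonicity pruning).
from itertools import combinations
import numpy as np

def grid_bins(X, xi):
    """Assign every point to one of xi equal-width units per dimension."""
    bins = np.empty(X.shape, dtype=int)
    for d in range(X.shape[1]):
        edges = np.linspace(X[:, d].min(), X[:, d].max(), xi + 1)
        bins[:, d] = np.clip(np.digitize(X[:, d], edges) - 1, 0, xi - 1)
    return bins

def dense_units(X, xi=10, tau=20):
    bins = grid_bins(X, xi)
    dense_1d = {d: {u for u in range(xi)
                    if np.sum(bins[:, d] == u) >= tau}
                for d in range(X.shape[1])}
    dense_2d = [((d1, u1), (d2, u2))
                for d1, d2 in combinations(range(X.shape[1]), 2)
                for u1 in dense_1d[d1] for u2 in dense_1d[d2]
                if np.sum((bins[:, d1] == u1) & (bins[:, d2] == u2)) >= tau]
    return dense_1d, dense_2d

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 5))
units_1d, units_2d = dense_units(X)
print("dense 1-D units per dimension:", {d: len(u) for d, u in units_1d.items()})
print("dense 2-D candidate units:", len(units_2d))
```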

Finally, we have the hybrid algorithms, which neither try to create a unique binding between a point and a cluster nor try to find all clusters in all subspaces; indeed, they sometimes produce overlapping clusters. Other algorithms just focus on interesting subspaces. DOC [11] is a density-based approach relying on hypercubes of a given size that uses a Monte Carlo algorithm to approximate the clusters. Each time DOC is called, it will probably output only one cluster, so it has to be called iteratively several times. If the points of the previous clusters are not excluded, DOC clusters may overlap. It does not usually compute all clusters in all subspaces as subspace clustering proposes. DiSH [12] is able to discover the hierarchies between subspaces via hierarchical subspace clustering, and it can detect clusters of different dimensionality, size and density. Finally, P3C [13] follows a bottom-up strategy starting with all one-dimensional intervals, which are a restriction on the values of the attributes, and reduces the problem to an Expectation-Maximization problem used to refine the clusters. It ends up with a matrix containing the probabilities for each data sample to belong to any of the clusters.

4.2 Pattern Recognition Approaches

Pattern recognition algorithms follow a different approach: if closeness has been defined until now as the Euclidean distance after projecting the samples, now closeness will be related to similar behavior in axis-parallel subspaces. It has also been noted that rows were mainly used for points and columns for features, but in pattern recognition approaches one can generally transpose the data without affecting the output. Most of the work on the so-called biclustering algorithms, which are going to be presented later, has been done by Madeira and Oliveira [14], focusing on the application to biological data.

Let A be a matrix with a set of rows X and a set of columns Y. Now, if we talk about the element 𝑎𝑥𝑦, we will be talking about the element of the matrix placed at row 𝑥 and column 𝑦. On the other side, if we talk about 𝐴𝐼𝐽, we are defining a subset of A in which 𝐼 ⊆ 𝑋 and 𝐽 ⊆ 𝑌. Biclustering algorithms seek to find subsets of a matrix, which will from now on be called biclusters, that follow a given set of properties. These algorithms are supported by some statistical definitions like the mean. The mean of a row can be calculated as follows:

\[ a_{iJ} = \frac{1}{|J|} \sum_{j \in J} a_{ij} \, . \]

The mean of a column is analogously:

\[ a_{Ij} = \frac{1}{|I|} \sum_{i \in I} a_{ij} \, . \]

And finally, the mean of a bicluster is:

\[ a_{IJ} = \frac{1}{|I||J|} \sum_{i \in I, j \in J} a_{ij} \, . \]

We can distinguish among four different types of biclusters: constant biclusters, biclusters with constant values in rows or columns, biclusters with coherent values and, finally, biclusters with coherent evolutions.

Constant biclusters are those biclusters whose values are perfectly identical. If the problem is relaxed, it can be said that the values are similar instead of identical. It is a clear example of an axis-parallel case. Block clustering [15] is an algorithm that solves the given problem. If the variance of every bicluster 𝐴𝐼𝐽 is calculated as

\[ \mathrm{VAR}(A_{IJ}) = \sum_{i \in I, j \in J} (a_{ij} - a_{IJ})^2 \, , \]

the perfect bicluster is the one with variance equal to zero, and hence we can split the matrix into two partitions minimizing the variance of the obtained biclusters, iteratively repeating this splitting process until 𝑘 clusters are obtained.

Biclusters with constant values on rows or columns are another kind of categorization, in which the values of rows or columns follow a certain pattern. In the case of constant rows, they can be calculated with the formula 𝑎𝑖𝑗 = 𝜇 · 𝑟𝑖, where 𝜇 is a constant value for the bicluster which is modified by 𝑟𝑖 in each row, and the dot means addition or multiplication. It must be pointed out that the multiplicative model is equivalent to the additive one when taking logarithms. In the case of constant columns, the values can be calculated analogously as 𝑎𝑖𝑗 = 𝜇 · 𝑐𝑗, where again 𝑐𝑗 is a variable that depends on the column. The most common approach is to apply a transformation to reduce the problem to a constant-biclusters one and then apply block clustering directly there.

Biclusters with coherent values present a real step forward with respect to the biclusters previously mentioned because they present coherent values on both rows and columns at the same time. We can define them by combining the equations that characterized biclusters with constant values on rows and constant values on columns:

\[ a_{ij} = \mu \cdot r_i \cdot c_j \, . \]

It has to be remembered that patterns with negative correlation cannot have coherent values. Cheng and Church [16] proposed a similarity score, the mean squared residue H, to create a model defining biclusters. In that model, a submatrix 𝐴𝐼𝐽 is a 𝛿-bicluster if its similarity measure is lower than 𝛿. We consider that a bicluster is perfect if 𝛿 is equal to zero, and under those conditions an element of the bicluster is defined as

\[ a_{ij} = a_{iJ} + a_{Ij} - a_{IJ} \, , \]

which agrees perfectly with the previous equation defining coherent biclusters. H is defined as

\[ H(I, J) = \frac{1}{|I||J|} \sum_{i \in I, j \in J} (a_{ij} - a_{iJ} - a_{Ij} + a_{IJ})^2 \, . \]

The objective is to calculate all the biclusters simultaneously, minimizing the following function:

\[ \sum_{I, J} H(I, J) \, , \]

that is, to minimize the residues given by the similarity measure. However, this algorithm has some limitations, like the obligation to indicate the number of clusters before running the algorithm, and some inefficiencies due to the fact that some areas must be masked in order to continue processing the rest of the biclusters. FLOC [17] is an algorithm based on 𝛿-biclusters, but in this case some random movements are introduced to add or remove rows or columns from some biclusters, also allowing overlapping between clusters.
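As an illustration of the mean squared residue (a Python sketch, not code from Cheng and Church), the function below scores a candidate bicluster given its row index set I and column index set J:

```python
# Sketch of the mean squared residue H(I, J) of a candidate bicluster.
import numpy as np

def mean_squared_residue(A, I, J):
    """Return H(I, J) for the submatrix of A selected by rows I and columns J."""
    sub = A[np.ix_(I, J)]
    row_means = sub.mean(axis=1, keepdims=True)   # a_iJ
    col_means = sub.mean(axis=0, keepdims=True)   # a_Ij
    total_mean = sub.mean()                       # a_IJ
    residue = sub - row_means - col_means + total_mean
    return float(np.mean(residue ** 2))

# A perfectly coherent (additive) bicluster has H = 0.
A = np.add.outer(np.array([0.0, 1.0, 2.0]), np.array([10.0, 20.0, 30.0]))
print(mean_squared_residue(A, [0, 1, 2], [0, 1, 2]))   # prints 0.0
```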


Biclusters with coherent evolutions differ from the rest of the aforementioned biclusters in that they detect changes between pairs of rows or columns regardless of the magnitude of the change, caring only about the fact that it follows a common pattern. Ben-Dor et al. [18] define the biclustering problem as an order-preserving submatrix (OPSM) problem, in which the idea is to find a permutation of the columns of a matrix so that the values of the rows of a given submatrix are ordered in a strictly increasing order. Other algorithms, like the one proposed by Liu and Wang [19], follow the same approach, but in their case some elements are allowed to have similar values.

The results of this pattern-based categorization might be interesting sometimes, but these results do not usually have a spatial interpretation, like clustered points lying on the same subspace.

4.3 Arbitrarily-Oriented Subspaces

Previous clustering systems had some limitations, like the obvious one of the axis-parallel subspaces, or the fact that pattern matching methods were rather simplistic and only some types of correlations were allowed. Now we are going to present another type of algorithms, which we will call oriented clustering, generalized subspace/projected clustering or correlation clustering algorithms, which do not have any constraint on the location or orientation of the clusters, i.e., they can be in any subspace of ℝ𝑑. Here, clusters become hyperplanes of different dimensionalities, where pattern recognition approaches are not suitable anymore.

Points lying on a common hyperplane seem to follow certain linear dependencies or correlations among the dimensions of such a plane. If a density-based algorithm were used, some points might be clustered in an interesting subspace, but if correlations are introduced into the problem, it may be noticed that those points do not really belong to the same cluster, and hence a new approach should be devised.

A good technique to seek relevant directions of high variance, in any orientation, is Principal Component Analysis (PCA). PCA is applied locally to discover clusters in different subspaces, assuming that the hyperplane on which the points lie is well defined by a local set of points.

To apply PCA, the covariance matrix of the dataset 𝒟 ⊂ ℝ𝑑 has to be built first:

\[ \Sigma_{\mathcal{D}} = \frac{1}{|\mathcal{D}|} \cdot \sum_{x \in \mathcal{D}} (x - \bar{x}_{\mathcal{D}}) \cdot (x - \bar{x}_{\mathcal{D}})^{T} \, , \]

where \( \bar{x}_{\mathcal{D}} \) denotes the mean of the dataset.

In the resulting matrix of size 𝑑 × 𝑑, each element 𝜎𝑖𝑗 of Σ𝒟 represents the covariance between the dimensions 𝑖 and 𝑗, and each diagonal element represents the variance of the corresponding dimension. Covariance and correlation are related by 𝜎𝑖𝑗 = 𝜌𝑖𝑗𝜎𝑖𝜎𝑗, where 𝜌𝑖𝑗 is the correlation between the dimensions 𝑖 and 𝑗. Now, eigenvalue decomposition can be applied to Σ𝒟 to obtain

\[ \Sigma_{\mathcal{D}} = V_{\mathcal{D}} E_{\mathcal{D}} V_{\mathcal{D}}^{T} \, , \]

where 𝐸𝒟 is the eigenvalue matrix of Σ𝒟, containing its eigenvalues in decreasing order, and 𝑉𝒟 is the eigenvector matrix, representing a new orthonormal basis where the vectors are ordered corresponding to their eigenvalues. 𝐸𝒟 can also be understood as the covariance matrix of the system using the new basis; in such a system there are no correlations among dimensions anymore, as a consequence of all non-diagonal elements being equal to zero. While the first 𝜆 eigenvectors (the strong eigenvectors) span a hyperplane on which the points of 𝒟 are accommodated, the last 𝑑 − 𝜆 eigenvectors (the weak eigenvectors) span a subspace where the points cluster densely when projected.
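To make the strong/weak eigenvector split concrete, the Python sketch below decomposes the covariance matrix of a local point set and separates the eigenvectors by an assumed explained-variance threshold; it is an illustration of the general idea, not the selection criterion of any particular algorithm.

```python
# Sketch of local PCA: decompose the covariance matrix of a neighborhood
# and split its eigenvectors into strong (high variance) and weak ones.
import numpy as np

def strong_weak_eigenvectors(points, alpha=0.85):
    """alpha: assumed fraction of variance the strong eigenvectors must explain."""
    centered = points - points.mean(axis=0)
    cov = centered.T @ centered / len(points)
    eigvals, eigvecs = np.linalg.eigh(cov)          # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]               # reorder to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    explained = np.cumsum(eigvals) / np.sum(eigvals)
    lam = int(np.searchsorted(explained, alpha)) + 1
    return eigvecs[:, :lam], eigvecs[:, lam:]       # strong, weak

# Points lying close to a 2-D plane embedded in a 10-D space.
rng = np.random.default_rng(3)
basis = rng.normal(size=(2, 10))
points = rng.normal(size=(200, 2)) @ basis + 0.01 * rng.normal(size=(200, 10))
strong, weak = strong_weak_eigenvectors(points)
print("strong eigenvectors:", strong.shape[1], "weak eigenvectors:", weak.shape[1])
```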

This information can be used to build algorithms for data clustering; there are some different approaches, differing in the selection of 𝜆 or of the local points. ORCLUS [20] follows a similar path to PROCLUS [7], also initializing 𝑀 (> 𝑘) seeds, and it uses a specialized distance function that measures distances among points when projected through the weak eigenvectors. The number of seeds is iteratively reduced until reaching the selected 𝑘 by merging the most similar (closest in the projected subspace) clusters. There is a trade-off in the choice of 𝑀: a greater 𝑀 will lead to better performance but will consequently increase the execution time. On the other hand, 4C [21] follows a density-based approach like DBSCAN [5]: some seeds are iteratively initialized and they evolve meeting some density-based properties based on distance functions created by applying PCA to the Eps-neighborhood of the seed. COPAC [22] follows ideas related to 4C, but in its case it clusters points with the same number of strong eigenvectors, increasing efficiency. ERiC [23] follows a similar path to COPAC with the addition of a subspace hierarchy.

However, not all the approaches of generalized subspace clustering are attached to PCA. There are other techniques based on the Hough transform, such as CASH [24]. The Hough transform is widely used in digital image processing and maps points from the spatial domain to the parameter domain; one of its applications is line fitting in object recognition problems. Each point in the spatial domain is mapped to an infinite set of points in the parameter domain through a line or, more commonly, a trigonometric function. The intersection of two lines or curves in the parameter domain corresponds to a line through two points in the spatial domain. The key for clustering is to find intersections in the parameter domain, getting rid of the sparsity of taking measures in the spatial domain. The idea of CASH is to find high-density regions in the parameter domain through a grid-based system and later search for intersections. This algorithm has a poor runtime performance, which is exponential in 𝑑 in the worst case.


There are also some other ideas departing from PCA analysis, like exploiting self-similarity through fractal dimensions, also based on the locality assumption. Nevertheless, clustering algorithms based only on fractal dimensions do not appear to obtain good results. Finally, there are also some systems based on random sampling, consisting of iteratively taking 𝑑 + 1 random samples 𝑛 times, choosing 𝑛 such that there is a certain probability of getting a sample from each of the existing clusters [25]. There is another approach based on RANSAC [26], used for model fitting and with some applications in image processing as well, such as line fitting. Anyway, the latter approaches do not seem to yield either efficient or effective results [3].


5. EXPERIMENTAL TESTS AND RESULTS

Once all the different approaches have been examined and described, and considering that Kriegel et al. [3] did not run any empirical tests on any of the mentioned algorithms in their survey, it was decided to run such a series of tests in this thesis. Some algorithms are going to be selected and tested with datasets of different dimensionalities to see their real behavior.

The framework selected to run the experiments is the ELKI system¹, which is an environment for knowledge data mining applications focused on clustering techniques. Its developers have also developed some of the algorithms commented on in this thesis. The platform is implemented in Java and uses some specific indexing structures to organize the memory space better according to each approach and therefore improve the running time. The platform has an intuitive and simple layout with some ready-to-use algorithms and distance metrics. After running the desired experiments, ELKI also offers some visualization tools to help the user understand how the points cluster in the data space. Besides, the platform has a great number of performance measures based on different groups of evaluation metrics.

Three datasets have been selected to work with:

• Pov-2: a toy dataset, only 2-dimensional, to see how the algorithms perform in typical and well-studied clustering problems. This dataset is composed of three clusters whose dimensions are drawn from different Gaussian distributions with different parameters. There are 150 samples and none of them belongs to noise. This dataset is provided by the ELKI group.
• Subspaces-10: a 10-dimensional dataset, to test how the algorithms work when the dimensions start to increase. Here, the data is composed mainly of Gaussian distributions with the addition of some uniform ones. There are four clusters and 350 samples, of which 14% are noise. This dataset also belongs to ELKI.
• Synthetic-control: a 62-dimensional dataset. It can already be said that this one has high-dimensional characteristics. The data is drawn from synthetically generated control charts. There are 600 samples belonging to 6 classes with no noise. This dataset has been collected from a web repository of datasets for machine learning research.

¹ ELKI: Environment for Developing KDD-Applications Supported by Index-Structures - https://elki-project.github.io/ - [Accessed 16 05 2018]


Datasets of higher dimensionality, on the order of hundreds of dimensions, have also been tried, but the framework did not seem to respond correctly when working with such data, and no conclusion could be reached.

The ELKI framework offers five main groups of evaluation metrics based on the idea of having a ground truth to compare with: pair-counting measures, entropy-based measures, BCubed-based measures, set-matching-based measures and editing-distance measures. According to Amigó et al. [27], the evaluation metrics must meet some properties in order to react to any change in a cluster that diminishes its quality. Cluster homogeneity states that evaluation metrics should reward items of the same cluster being similar. Complementing this, cluster completeness says that two groups of items belonging to the same category should be grouped into the same cluster. Next, rag bag states that it is better to introduce noise into an ill-conditioned cluster than into a clean one. Last, cluster size versus quantity states that it is preferable to have a small error in a big cluster than to have multiple errors in small clusters.

The only group of evaluation metrics fulfilling the mentioned properties is the one containing the BCubed-based measures. BCubed metrics include precision, which is the proportion of samples in a cluster that are grouped according to their ground truth cluster, and recall, which is the proportion of items from a sample's ground truth category that appear in its corresponding cluster. The F-measure relates both recall and precision, with 0 being the lower bound showing the maximum error and 1 the upper bound indicating a perfect match. It must be noted that a random classifier can easily get an F-measure of 0.5, so a good clustering method is expected to have a larger value. The algorithms we are going to use can create overlapping clusters, but extended BCubed metrics can deal with this problem. The F-measure from the BCubed evaluation metrics is the one that is going to be used in the proposed tests.
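For reference, the Python sketch below computes the plain (non-extended) BCubed precision, recall and F-measure for hard, non-overlapping clusterings; the extended variant that ELKI applies to overlapping clusters is not reproduced here.

```python
# Sketch of BCubed precision, recall and F-measure for hard clusterings.
import numpy as np

def bcubed_f_measure(labels_pred, labels_true):
    labels_pred = np.asarray(labels_pred)
    labels_true = np.asarray(labels_true)
    precisions, recalls = [], []
    for i in range(len(labels_pred)):
        same_cluster = labels_pred == labels_pred[i]
        same_class = labels_true == labels_true[i]
        correct = np.sum(same_cluster & same_class)
        precisions.append(correct / np.sum(same_cluster))
        recalls.append(correct / np.sum(same_class))
    p, r = np.mean(precisions), np.mean(recalls)
    return 2 * p * r / (p + r)

# Example: one ground-truth class is split into two clusters.
truth = [0, 0, 0, 0, 1, 1, 1, 1]
pred  = [0, 0, 1, 1, 2, 2, 2, 2]
print(round(bcubed_f_measure(pred, truth), 3))
```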

Table 1: F-measure results for each algorithm with the different databases. Green color highlights the best results, orange denotes running time problems and finally red symbolizes Java memory exceptions.

In Table 1 the F-measure results of the experiments can be seen. In the last column, the rank is computed using the average of the individual ranks for each dataset. The total average has not been used, as the distributions of the data cannot be directly compared.

The parameters for the tests have been manually selected to obtain the best results possible. All the algorithms have used the Euclidean distance to compute closeness, except those algorithms based on special distance functions.

As can be appreciated in the table, the best method is, surprisingly, the simplest one, 𝑘-means, but this result is somewhat unfair because the number of desired clusters was selected beforehand, which could not be done in a real unsupervised case. The next best methods are DBSCAN and derivations of it such as PreDeCon, 4C or COPAC, with the last one being the only one able to outperform its predecessor. One can also see that some methods start to fail with only 10 dimensions and that almost all of the supposed high-dimensional approaches fail to solve the problem for 62 dimensions. It should also be noted that most of the algorithms are very hard to parametrize.

The orange boxes indicate that there was a waiting time of more than 30 minutes and the parameters could not be improved, or that results could not even be retrieved. It is interesting to note that one of the algorithms failing for this reason is SUBCLU, which follows the subspace clustering approach aiming to find all clusters in all subspaces. The red box symbolizes a memory exception, making P3C unfit to work with high-dimensional datasets.

It can be baffling, but some of these algorithms, like PROCLUS [7], were only tested in the original research paper with datasets of 7 to 10 dimensions, and hence the behavior seen in the table can be understood. Others, like SUBCLU [10], were tested with up to 50 dimensions, but bad parametrization makes it difficult to obtain results.


6. CONCLUSIONS

In this work the characteristics of the problem of cluster analysis for high-dimensional data have been analyzed. In addition, some algorithmic approaches have been selected and studied to overcome this problem, starting from the more simplistic or basic ones and advancing to more specific solutions, which have been further classified into different groups and subgroups.

Some empirical tests have been run on different datasets with different characteristics to see the real behavior of each approach, and most of the algorithms did not cope with real high-dimensional data. It must be remembered, as said at the beginning of this thesis, that distance measures have problems in such a sparse space, and even though these algorithms try to approach the high-dimensional problem, all of them make use of distance measures.

Surprisingly, 𝑘-means has had the best results in the tests performed. To continue the work of this thesis, some more tests with this approach should be performed, and some research should be done on using specialized distance measures with 𝑘-means. DBSCAN has also shown good results, and specifically COPAC, one of its derivations. It would be interesting to keep working with more derivations of DBSCAN to see if they can be improved, as well as with COPAC itself.

Something else that has been noted is the difficulty of setting the parameters of each algorithm. Some approaches should be further studied to create solutions that automatically set the parameters according to some intrinsic characteristics of the data.

Finally, this thesis has focused on testing the data while having a ground truth to compare with. However, in a real unsupervised learning problem that information is not available, and despite the fact that there are some metrics based on the separability of the data, those metrics are also based on distance measures, and here the curse of dimensionality is faced again. Therefore, some metrics should be developed for new data without any kind of labelling available.


REFERENCES

[1] A. K. Jain, "Data clustering: 50 years beyond K-means," Pattern Recognition Letters, vol. 31, no. 8, pp. 651-666, 2010.

[2] M. Khoso, "Northeastern University - How Much Data is Produced Every Day?," 13 05 2016. [Online]. Available: http://www.northeastern.edu/levelblog/2016/05/13/how-much-data-produced-every-day/. [Accessed 05 04 2018].

[3] H.-P. Kriegel, P. Kröger and A. Zimek, "Clustering High-Dimensional Data: A Survey on Subspace Clustering, Pattern-Based Clustering, and Correlation Clustering," ACM Transactions on Knowledge Discovery from Data, vol. 3, no. 1, 2009.

[4] N. Tomasev, M. Radovanovic, D. Mladenic and M. Ivanovic, "The Role of Hubness in Clustering," IEEE Transactions on Knowledge and Data Engineering, vol. 26, no. 3, pp. 739-751, 2013.

[5] M. Ester, H.-P. Kriegel, J. Sander and X. Xu, "A Density-Based Algorithm for Discovering Clusters," Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, pp. 226-231, 1996.

[6] G. Moise, A. Zimek, P. Kröger and H.-P. Kriegel, "Subspace and projected clustering: experimental evaluation and analysis," Knowledge and Information Systems, vol. 21, no. 3, 2009.

[7] C. C. Aggarwal, C. Procopiuc, J. L. Wolf and J. S. Park, "Fast Algorithms for Projected Clustering," ACM SIGMOD, vol. 28, no. 2, pp. 61-72, 1999.

[8] C. Böhm, K. Kailing, H.-P. Kriegel and P. Kröger, "Density Connected Clustering with Local Subspace Preferences," Proceedings of the Fourth IEEE International Conference on Data Mining, pp. 27-34, 2004.

[9] R. Agrawal, J. Gehrke, D. Gunopulos and P. Raghavan, "Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications," ACM SIGMOD, vol. 27, no. 2, pp. 94-105, 1998.


[10] K. Kailing, H.-P. Kriegel and P. Kröger, "Density-Connected Subspace Clustering for High-Dimensional Data," in International Conference on Data Mining, 2004.

[11] C. M. Procopiuc, M. Jones, P. K. Agarwal and T. M. Murali, "A Monte Carlo algorithm for fast projective clustering," Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data, pp. 418-427, 2002.

[12] E. Achtert, C. Böhm, H.-P. Kriegel, P. Kröger, I. Müller-Gorman and A. Zimek, "Detection and Visualization of Subspace Cluster Hierarchies," Advances in Databases: Concepts, Systems and Applications, pp. 152-163, 2007.

[13] G. Moise, J. Sander and M. Ester, "P3C: A Robust Projected Clustering Algorithm," Proceedings of the 6th International Conference on Data Mining (ICDM), pp. 414-425, 2006.

[14] S. C. Madeira and A. L. Oliveira, "Biclustering Algorithms for Biological Data Analysis: A Survey," IEEE Transactions on Computational Biology and Bioinformatics, vol. 1, no. 1, 2004.

[15] J. A. Hartigan, "Direct Clustering of a Data Matrix," Journal of the American Statistical Association, vol. 67, no. 337, pp. 123-129, 1972.

[16] Y. Cheng and G. Church, "Biclustering of Expression Data," Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology, pp. 93-103, 2000.

[17] Y. Yang, W. Wang, H. Wang and P. Yu, "δ-Clusters: Capturing Subspace Correlation in a Large Data Set," Proceedings 18th International Conference on Data Engineering, pp. 517-528, 2002.

[18] A. Ben-Dor, B. Chor, R. Karp and Z. Yakhini, "Discovering Local Structure in Gene Expression Data: The Order-Preserving Submatrix Problem," Journal of Computational Biology, vol. 10, no. 3-4, 2003.

[19] J. Liu and W. Wang, "OP-Cluster: Clustering by Tendency in High Dimensional Space," Proceedings - IEEE International Conference on Data Mining, pp. 187-194, 2003.

[20] C. C. Aggarwal and P. S. Yu, "Finding generalized projected clusters in high dimensional spaces," Proceedings of the 2000 ACM SIGMOD international conference on Management of data, vol. 29, no. 2, pp. 70-81, 2000.


[21] C. Böhm, K. Kailing, P. Kröger and A. Zimek, "Computing Clusters of Correlation Connected objects," Proceedings of the 2004 ACM SIGMOD international conference on Management of data, pp. 455-466, 2004.

[22] E. Achtert, C. Böhm, H.-P. Kriegel, P. Kröger and A. Zimek, "Robust, Complete, and Efficient Correlation Clustering," Proceedings of the SIAM International Conference on Data Mining, pp. 413-418, 2007.

[23] E. Achtert, C. Böhm, H.-P. Kriegel, P. Kröger and A. Zimek, "On Exploring Complex Relationships of Correlation Clusters," in Proc. 19th International Conference on Scientific and Statistical Database Management, pp. 7-7, 2007.

[24] E. Achtert, C. Böhm, J. David, P. Kröger and A. Zimek, "Robust Clustering in Arbitrarily Oriented Subspaces," Proceedings of the 2008 SIAM International Conference on Data Mining, pp. 763-774, 2008.

[25] R. Haralick and R. Harpaz, "Linear Manifold Clustering," in Machine Learning and Data Mining in Pattern Recognition, Springer, 2005, pp. 132-141.

[26] M. A. Fischler and R. C. Bolles, "Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography," Communications of the ACM, vol. 24, no. 6, pp. 381-395, 1981.

[27] E. Amigó, J. Gonzalo, J. Artiles and F. Verdejo, "A comparison of extrinsic clustering evaluation metrics based on formal constraints," Information Retrieval, vol. 12, no. 4, pp. 461-486, 2009.
