2.2 Open Problems in Clustering
2.2.1 High Dimensionality
Modern computing capabilities allow the generation and storage of increasingly large datasets. As a result, real-world datasets commonly contain observations with many attributes (dimensions). The difficulties this causes are well documented (Hinneburg and Keim, 1999; Agrawal et al., 1998; Kriegel et al., 2009), and are commonly referred to as “the curse of dimensionality”. The problems associated with clustering this type of data go beyond the computational complexity of analysing large datasets, and undermine the fundamental assumptions required for cluster detection. Steinbach et al. (2004) present this problem intuitively by considering a fixed number of uniformly distributed points contained in a grid with cells of fixed size as dimensionality increases. The number of grid cells in the space grows exponentially with dimensionality; hence, unless the number of observations increases at the same rate, the proportion of empty cells also increases. Thus, high-dimensional datasets are very sparse.
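The sparsity argument can be checked numerically. The following is a minimal sketch, assuming points drawn uniformly on the unit hypercube and a grid with 10 cells per axis; the function name `empty_cell_fraction` and its parameters are illustrative and are not taken from Steinbach et al. (2004).

```python
import random

def empty_cell_fraction(n_points, dim, bins_per_axis=10, seed=0):
    """Fraction of grid cells in [0, 1)^dim that contain no points."""
    rng = random.Random(seed)
    occupied = set()
    for _ in range(n_points):
        # Index of the grid cell containing one uniformly drawn point.
        cell = tuple(int(rng.random() * bins_per_axis) for _ in range(dim))
        occupied.add(cell)
    total_cells = bins_per_axis ** dim  # grows exponentially with dim
    return 1.0 - len(occupied) / total_cells

# The number of points is fixed while the dimensionality increases.
for d in (1, 2, 3, 4, 6):
    print(d, round(empty_cell_fraction(1000, d), 4))
```

With 1000 points, at most 1000 cells can be occupied, so in four dimensions (10,000 cells) at least 90% of the cells are empty whatever the sample.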
Further, it is likely that some features are strongly correlated with others, or do not contain relevant information for clustering. Along these irrelevant dimensions, the data appear uniform and i.i.d., which hinders accurate cluster identification. Practically, in sparse high-dimensional datasets with large numbers of irrelevant dimensions, measures of spatial proximity and probability density, which are commonly used to define similarity between observations, are not meaningful. This is because, when computed over all dimensions, the pairwise distances between observations that should belong to the same cluster are not significantly smaller than the pairwise distances between observations that should belong to different clusters. Further, clustering algorithms which rely on the specification or estimation of a probability density function cannot be applied, as the density is approximately zero everywhere. Therefore, discarding irrelevant features through dimensionality reduction is necessary to make cluster detection possible. This may be done as a pre-processing step or locally, as part of the partitioning procedure. The latter approach is more common, since different features are often relevant for the detection of different clusters, making a global dimensionality reduction inappropriate.
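The loss of contrast between within-cluster and between-cluster distances can be illustrated with a small simulation. This is a sketch under assumed settings: two clusters separated in two relevant dimensions (Gaussian with standard deviation 0.05 around 0.25 and 0.75), padded with irrelevant dimensions that are uniform on [0, 1]. The function name `contrast` and all parameter values are hypothetical.

```python
import math
import random

def contrast(n_irrelevant, n_per_cluster=30, seed=0):
    """Mean between-cluster distance divided by mean within-cluster distance."""
    rng = random.Random(seed)

    def sample(centre):
        # Two relevant coordinates carry the cluster structure ...
        relevant = [rng.gauss(centre, 0.05) for _ in range(2)]
        # ... and the remaining coordinates are uniform noise.
        noise = [rng.random() for _ in range(n_irrelevant)]
        return relevant + noise

    a = [sample(0.25) for _ in range(n_per_cluster)]
    b = [sample(0.75) for _ in range(n_per_cluster)]

    within = [math.dist(p, q)
              for c in (a, b) for i, p in enumerate(c) for q in c[i + 1:]]
    between = [math.dist(p, q) for p in a for q in b]
    return (sum(between) / len(between)) / (sum(within) / len(within))

for m in (0, 10, 50, 200):
    print(m, round(contrast(m), 3))
```

A ratio well above one means the clusters are distinguishable by distance alone. As irrelevant dimensions are added the ratio approaches one, which is the effect described above.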
Subspace clustering (Parsons et al., 2004) typically refers to methods which assume that a subset (or subsets) of features is relevant for cluster detection. This restricts attention to axis-parallel subspaces in which clusters are sought. Across the different approaches to subspace clustering, it is generally assumed that dimensions which allow the location of compact clusters should be retained. A $k$-medoid approach to this problem is adopted by the PROCLUS algorithm (Aggarwal et al., 1999). In this algorithm, the subspaces are built to have minimal standard deviation in the distances between the points and their closest medoid along each dimension. Distances are only calculated in the relevant subspace for each cluster. This approach tends to produce equally sized clusters with a spherical shape in their subspaces.
This underlying idea of building up subspaces in which clusters are identifiable may also be applied to alternative cluster definitions. Non-parametric density-based clustering is fundamentally limited to low-dimensional spaces, but has advantages such as the ability to locate clusters of diverse shapes, so subspace clustering algorithms relying on this cluster definition are attractive. A number of such algorithms locate subspaces in which the clusters are sufficiently dense. The main differences between these algorithms lie in how sufficient density is defined and in how the subspaces are subsequently constructed.
PreDeCon (Böhm et al., 2004b) is a subspace variant of DBSCAN, which applies a modified distance measure, capturing the subspace of each cluster. This distance measure incorporates the subspace preference of each cluster at each point $x_i$. A given dimension is considered relevant in the subspace of $x_i$ if the variance of points in the Euclidean $\epsilon$-neighbourhood of $x_i$ along that dimension is below a pre-determined threshold. The subspace modified distance measure is then a weighted Euclidean distance along the dimensions in the relevant subspace.
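The preference-weighted distance can be sketched as follows. This is an illustration only, and it rests on several assumptions that are not stated in the text above: low-variance dimensions receive a large constant weight `kappa` and all other dimensions weight 1, the variance is measured around $x_i$ itself, and the distance is symmetrised by taking the maximum of the two directed distances. The names `preference_weights`, `preference_distance`, `delta` and `kappa` are hypothetical.

```python
import numpy as np

def preference_weights(X, i, eps, delta, kappa=100.0):
    """Per-dimension weights for point i: kappa where the neighbourhood is tight."""
    diffs = X - X[i]
    # Euclidean eps-neighbourhood of x_i (it always contains x_i itself).
    neigh = np.linalg.norm(diffs, axis=1) <= eps
    # Mean squared deviation from x_i along each dimension.
    var = (diffs[neigh] ** 2).mean(axis=0)
    return np.where(var <= delta, kappa, 1.0)

def preference_distance(X, i, j, eps, delta, kappa=100.0):
    """Symmetrised weighted Euclidean distance between points i and j."""
    sq = (X[i] - X[j]) ** 2
    d_ij = np.sqrt((preference_weights(X, i, eps, delta, kappa) * sq).sum())
    d_ji = np.sqrt((preference_weights(X, j, eps, delta, kappa) * sq).sum())
    return max(d_ij, d_ji)
```

Under these assumptions, two points that differ along a dimension preferred by either of them are pushed far apart, so density-connected sets tend to stay within a common subspace.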
SubClu (Kailing et al., 2004) determines dense clusters in the same way as DBSCAN, by setting a lower threshold on the number of points in the $\epsilon$-neighbourhood of each datum. This definition of dense clusters is similar to that applied in PreDeCon; however, in SubClu, the relevant subspaces for each cluster are built iteratively. This process begins with all one-dimensional dense clusters. The dimensionality of the subspace for each cluster is determined such that if a $(\delta+1)$-dimensional subspace (where $\delta$ is an arbitrary dimensionality) contains a $\delta$-dimensional subspace that is not dense, the $(\delta+1)$-dimensional subspace cannot be considered dense, hence the dimensionality of the subspace is not increased further. Likewise, the CLIQUE algorithm (Agrawal et al., 1998) constructs dense subspaces for clusters using the same iterative procedure. However, this algorithm relies on an alternative definition of regions of high density, which uses an equally spaced axis-parallel grid over the observations. Any grid unit containing at least $\tau$ points is considered dense. This grid-based approach reduces the computational cost compared to SubClu, but is often less accurate. All of these density-based approaches have attractive properties, such as the ability to locate clusters of diverse shapes and to estimate their number. However, the input parameters are not intuitive to set.
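The bottom-up construction with a grid-based density definition can be sketched as below. This is a simplified illustration and not the CLIQUE algorithm itself. It assumes the data are scaled to [0, 1], it prunes at the level of whole subspaces rather than individual grid units, and it does not merge adjacent dense units into clusters. The names `dense_units`, `dense_subspaces`, `bins` and `tau` are hypothetical.

```python
import random
from collections import Counter
from itertools import combinations

def dense_units(points, dims, bins, tau):
    """Grid cells of the subspace `dims` that hold at least `tau` points."""
    counts = Counter(
        tuple(min(int(p[d] * bins), bins - 1) for d in dims) for p in points
    )
    return {cell for cell, c in counts.items() if c >= tau}

def dense_subspaces(points, bins=10, tau=20):
    """Bottom-up search for axis-parallel subspaces containing a dense unit."""
    n_dims = len(points[0])
    level = [(d,) for d in range(n_dims) if dense_units(points, (d,), bins, tau)]
    found = list(level)
    while level:
        current = set(level)
        candidates = set()
        for a, b in combinations(level, 2):
            merged = tuple(sorted(set(a) | set(b)))
            if len(merged) == len(a) + 1:
                # Monotonicity: every lower-dimensional projection must be dense.
                if all(s in current for s in combinations(merged, len(a))):
                    candidates.add(merged)
        level = [s for s in sorted(candidates) if dense_units(points, s, bins, tau)]
        found.extend(level)
    return found

# Toy data: a cluster confined to one grid cell in dimensions 0 and 1,
# uniform in dimension 2, plus uniform background noise.
rng = random.Random(0)
cluster = [(rng.uniform(0.21, 0.29), rng.uniform(0.21, 0.29), rng.random())
           for _ in range(150)]
background = [(rng.random(), rng.random(), rng.random()) for _ in range(150)]
found = dense_subspaces(cluster + background, bins=10, tau=60)
print(found)
```

Candidate subspaces of dimension $\delta+1$ are only tested when all of their $\delta$-dimensional projections are dense, which is the pruning rule described above.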
In practice, axis-parallel subspaces may be too restrictive for some datasets. A variety of algorithms extend the concepts adopted by the aforementioned subspace algorithms without this constraint, and instead permit the detection of clusters in arbitrarily oriented subspaces. We refer to such approaches as projective clustering algorithms. This is a convention of this thesis; in the literature, the terms projective clustering and subspace clustering are used interchangeably.
The most common dimensionality reduction technique for projective clustering is principal component analysis (PCA), which projects the data, $X = \{x_i\}_{i=1}^{n}$, such that maximal variability is retained and reconstruction error is minimised (Tipping and Bishop, 1999). This is done using the covariance matrix,
\[ \Sigma = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)(x_i - \mu)^\top, \]
where $\mu \in \mathbb{R}^d$ is the mean vector. The eigen-decomposition of $\Sigma$,
\[ \Sigma = V \Lambda V^\top, \]
gives an orthonormal basis, $V$, whose columns correspond to directions of decreasing variance in $X$. Since $\Lambda$ is a diagonal matrix, any correlation structure is removed in the projected data, $XV$, where $X$ is the $n \times d$ data matrix. The majority of projective clustering algorithms rely on PCA, either on subsets of points or on the whole dataset. The ORCLUS algorithm (Aggarwal and Yu, 2000) is a $k$-medoid approach to projective clustering, and is an extension of PROCLUS to arbitrarily oriented subspaces. It clusters objects by minimising the distances between each data point and its closest medoid along the directions of low variability for each cluster.
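The eigen-decomposition route to PCA can be written in a few lines. This is a minimal sketch using NumPy, assuming the $1/n$ normalisation of the covariance matrix given above and centring the data before projection (centring shifts the projected data but does not change its covariance). The function name `pca` is illustrative.

```python
import numpy as np

def pca(X):
    """Eigen-decomposition of the (1/n) sample covariance of an n x d matrix X."""
    mu = X.mean(axis=0)
    Xc = X - mu                          # centred data
    Sigma = Xc.T @ Xc / X.shape[0]       # covariance matrix, 1/n normalisation
    eigvals, V = np.linalg.eigh(Sigma)   # eigh returns ascending eigenvalues
    eigvals, V = eigvals[::-1], V[:, ::-1]  # reorder to decreasing variance
    return eigvals, V, Xc @ V            # Lambda (diagonal), basis, projections
```

The covariance of the returned projections is the diagonal matrix of eigenvalues, which is the sense in which the correlation structure is removed.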
Likewise, the density-based subspace approach may be extended to arbitrarily oriented subspaces by algorithms such as 4C (Böhm et al., 2004a), which extends the approach of PreDeCon. In this algorithm, the similarity between two points is determined by the similarity of the eigen-systems of their $\epsilon$-neighbourhoods. If two points are connected by a similar correlation of attributes, they are assumed to belong in each other’s correlation neighbourhoods.
In this thesis, we focus on projective methods which rely on one-dimensional subspaces for clustering. The principal direction divisive partitioning algorithm (PDDP) (Boley, 1998) is a divisive algorithm, which recursively projects the data onto the first principal component (direction of maximal variability), and then bi-partitions $X$ at the mean of these projections. This continues until the maximum scatter value in each of the clusters does not exceed the scatter value of the centroids of all the clusters found so far. Two extensions of this algorithm are proposed by Tasoulis et al. (2010) to incorporate a more explicit cluster definition. Both algorithms project the data onto the first principal component as in PDDP. However, interval PDDP (iPDDP) splits at the point of maximal distance between two consecutive projections, and density enhanced PDDP (dePDDP) constructs a kernel density estimate over the projections and splits at the global minimiser of this estimated density in the range between the two outer-most modes. Both of these algorithms rely on the low-density cluster separation assumption, and locate separating hyperplanes orthogonal to the first principal component which result in the largest margin and lowest density separations respectively. For datasets with compact, convex clusters, projecting onto the direction of maximal variability enables accurate clustering results, since along this direction, the clusters are likely to be well-separated (Boley, 1998). PDDP and its extensions have been shown to produce high-quality clustering results for applications such as gene expression clustering and text mining.
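The three split rules differ only in where the projections are cut, which the following sketch makes explicit for a single bi-partition. It is not a full implementation: the recursion and the stopping rule are omitted. It also makes assumptions that are not specified above: iPDDP cuts at the midpoint of the largest gap, and dePDDP uses a Gaussian kernel with a rule-of-thumb bandwidth evaluated on a regular grid. All function names are hypothetical.

```python
import numpy as np

def first_pc_projections(X):
    """Project the centred data onto the first principal component."""
    Xc = X - X.mean(axis=0)
    # Right singular vectors of the centred data are the covariance eigenvectors.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[0]

def pddp_split(p):
    """PDDP: cut at the mean of the projections."""
    return p.mean()

def ipddp_split(p):
    """iPDDP: cut inside the largest gap between consecutive projections."""
    s = np.sort(p)
    i = np.argmax(np.diff(s))
    return 0.5 * (s[i] + s[i + 1])       # midpoint of the gap (assumption)

def depddp_split(p, grid_size=512):
    """dePDDP: cut at the lowest estimated density between the outermost modes."""
    h = 1.06 * p.std() * p.size ** (-0.2)   # rule-of-thumb bandwidth (assumption)
    grid = np.linspace(p.min(), p.max(), grid_size)
    # Unnormalised Gaussian kernel density estimate on the grid.
    dens = np.exp(-0.5 * ((grid[:, None] - p[None, :]) / h) ** 2).sum(axis=1)
    # Interior grid points that are strict local maxima are treated as modes.
    is_mode = (dens[1:-1] > dens[:-2]) & (dens[1:-1] > dens[2:])
    modes = np.flatnonzero(is_mode) + 1
    if modes.size < 2:
        return None                      # unimodal estimate: no split proposed
    lo, hi = modes[0], modes[-1]
    return grid[lo + np.argmin(dens[lo:hi + 1])]
```

In each case the resulting partition is `p > split`, which corresponds to a hyperplane orthogonal to the first principal component.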
Although PCA projections can be useful for cluster detection in a number of areas, it is trivial to construct examples where directions of high variability are not suitable for cluster detection. Projection pursuit (PP) algorithms (Friedman and Tukey, 1974; Huber, 1985) encompass the search for low-dimensional spaces that are appropriate for pattern recognition as a more general concept. PP methods aim to locate optimal linear projections of high-dimensional datasets, based on some measure of “interestingness” (known as the projection index) of a projection direction for the specified learning task (Jones and Sibson, 1987). This approach has been applied to locate low-dimensional subspaces for clustering (Friedman and Tukey, 1974), regression (Friedman and Stuetzle, 1981b), classification (Friedman and Stuetzle, 1981a) and density estimation (Friedman et al., 1984). The definition of an interesting projection direction is not universally accepted, and therefore the majority of classical dimensionality reduction techniques may be thought of within the projection pursuit framework. For example, PCA is equivalent to PP, where the projection index is defined as the variance along the selected projection direction.
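The equivalence between PCA and PP with a variance index follows from a short, standard argument, sketched here in the notation of the covariance matrix $\Sigma$ and mean vector $\mu$ defined earlier. The symbol $I(v)$ for the projection index is introduced only for this illustration.

```latex
% Projection index: sample variance of the data projected onto a unit vector v.
\[
  I(v) = \frac{1}{n} \sum_{i=1}^{n} \left( v^\top x_i - v^\top \mu \right)^2
       = v^\top \Sigma v,
  \qquad \|v\|_2 = 1 .
\]
% Stationary points of the Lagrangian v^T Sigma v - lambda (v^T v - 1):
\[
  \Sigma v = \lambda v
  \quad \Longrightarrow \quad
  I(v) = v^\top \Sigma v = \lambda .
\]
% Hence the maximiser of I is the eigenvector of Sigma with the largest
% eigenvalue, which is the first principal component.
```

Other choices of $I(v)$ lead to different optimal directions, which is why the choice of projection index determines the kind of structure a PP method can reveal.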
More recently, Pavlidis et al. (2016) proposed a PP algorithm called the minimum density hyperplane (MDH), which defines the projection index based on the minimum of the estimated density of the projections of the data along a univariate projection direction. This method aims to locate projection directions which are optimal for the separation of clusters, following the density-based approach to clustering, by locating minimum density boundaries between high-density regions associated with clusters. We discuss this in detail in later chapters.