
In this chapter, we investigated three methods for producing continuous representations of mixed datasets, and their appropriateness for clustering with projective density-based and well-established algorithms. The methods investigated were MDS, mPPCA and CSE.

[Figure 3.4 plot: axes are regret (0.00 to 0.75) and continuous representation (MDS, CSE, mPPCA); legend: dePDDP, GMM, k_means, MDH_ens, MDH_hier]

Figure 3.4: Boxplot of regret based on NMI for continuous representations produced by MDS, CSE and mPPCA.

Through a systematic simulation study, we have shown that if the generative model assumed by mPPCA is satisfied, all three continuous representations are appropriate for effective cluster detection. However, under different generative models, the representation from mPPCA is much less competitive than those from CSE and MDS, which make no assumptions about the data-generating process. Over real benchmark datasets with varying characteristics, CSE produced the most appropriate continuous representation, while MDS and mPPCA had more varied performance. In general, the real datasets were challenging, so consistently high-quality results were not possible for any of the continuous representations. Instead, the ability to locate meaningful clusters was dependent on both the continuous representation and the choice of clustering algorithm.

4 Combining Hyperplane Separators for Clustering

Abstract

We propose approaches to density-based clustering of high-dimensional datasets that may contain diverse (mixed) attributes. The approaches are able to identify clusters in arbitrarily oriented subspaces and to estimate their number. For mixed datasets, we first obtain an appropriate continuous representation. We then perform projection pursuit on the continuous data, or on the continuous representation of the mixed data, to locate low-density linear separators that partition high-density regions associated with clusters. By combining binary partitions from multiple separators, we obtain a divisive and a partitional clustering algorithm, each of which produces a complete clustering. The resulting clusters concentrate around the modes of the estimated density of the data (or its continuous representation where necessary). Through empirical evaluation across simulated and real-world benchmark datasets with varying characteristics, we show that the proposed algorithms produce consistently high-quality results, and that their performance is competitive with alternative density-based and other state-of-the-art clustering algorithms.

4.1 Introduction

Given a set of observations $\mathcal{X} = \{x_i\}_{i=1}^{n}$, the objective of clustering is to partition $\mathcal{X}$ into a number of homogeneous subsets, or clusters, so that observations allocated to the same cluster are more similar to each other than observations allocated to different clusters. As there is no unique and universally accepted definition of a cluster, there are a number of approaches to clustering, each relying on a different definition.

The non-parametric statistical approach to clustering, commonly referred to as density-based clustering, assumes that $\mathcal{X}$ is a sample of realisations of a continuous random variable $X$ with unknown probability density function. Clusters are then defined as regions of high probability density surrounding the modes of the density function (Hartigan, 1975; Menardi, 2016).

Since the true density function is unlikely to be known in practice, its modes must be located using a non-parametric density estimate. This limits the applicability of density-based clustering in a number of practical settings. Firstly, density estimation is unreliable in even moderate dimensions. This problem, commonly referred to as the curse of dimensionality, makes the detection of dense regions associated with clusters challenging, unless the clusters are very well separated (Rinaldo and Wasserman, 2010). In addition, if the observations contain any non-continuous attributes, which is common in many applications, the construction of a continuous density estimate is inappropriate. If one were to construct a continuous estimator over such data, subsequent cluster detection would trivially separate observations according to their combinations of outcomes in the discrete dimensions. We propose an approach to overcome these restrictions. We consider an alternative formulation of density-based clustering, which remains applicable in high dimensions, and we use continuous representations of mixed data so that this methodology can be applied to datasets with large numbers of diverse attributes.
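The effect of a discrete attribute on a continuous density estimate can be seen in a small sketch. The example is illustrative only: the Gaussian kernel, the fixed bandwidth `h = 0.1`, the simulated binary attribute, and the helper name `kde_1d` are assumptions made here, not choices taken from this chapter. A binary attribute that carries no cluster information still produces two sharply separated modes, so a mode-based method would split the data on that attribute alone.

```python
import numpy as np

def kde_1d(points, grid, h):
    """Gaussian kernel density estimate of `points`, evaluated on `grid`."""
    z = (grid[:, None] - points[None, :]) / h
    norm = len(points) * h * np.sqrt(2 * np.pi)
    return np.exp(-0.5 * z**2).sum(axis=1) / norm

rng = np.random.default_rng(0)
# Hypothetical binary attribute taking values in {0, 1}.
binary = rng.integers(0, 2, size=200).astype(float)

# Evaluate the estimate at the two outcomes and at the midpoint between them.
grid = np.array([0.0, 0.5, 1.0])
dens = kde_1d(binary, grid, h=0.1)
print(dens)  # the middle value is close to zero relative to the other two
```

The near-zero density at the midpoint is a property of the discreteness of the attribute, not of any cluster structure in the data.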

A direct consequence of defining clusters around the modes of a probability density function is that cluster boundaries pass through contiguous regions of low probability density that separate the modes. This alternative formulation, known as the low-density separation assumption, underpins well-established algorithms such as maximum margin clustering (MMC) (Xu et al., 2004) and semi-supervised support vector machines (Joachims, 1999). These methods extend the maximum margin hyperplane approach, and have proved very successful in clustering and semi-supervised classification respectively. The justification for using the maximum margin hyperplane to partition unlabelled data is that it approximates the hyperplane that passes through the sparsest regions of the empirical density (Chapelle and Zien, 2005; Chapelle et al., 2006).

Ben-David et al. (2009) were the first to consider the learning problem associated with estimating the hyperplane which intersects the region of lowest probability density, under the minimal set of assumptions that $\mathcal{X}$ is an i.i.d. sample from an unknown probability distribution over $\mathbb{R}^d$ with continuous density. The authors quantify the density on a hyperplane as the integral of the probability density function along the hyperplane, and study the existence of universally consistent algorithms to estimate the hyperplane with minimum density. They find that the maximum hard margin classifier is a consistent estimator of the hyperplane with minimum density only in one-dimensional problems, while in higher dimensions only a soft-margin algorithm is consistent. Pavlidis et al. (2016) propose a method to compute the hyperplane with minimum density for a finite high-dimensional sample using one-dimensional projections of the data, and establish an asymptotic connection between this hyperplane and the maximum hard margin hyperplane.
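To make the notion of density on a hyperplane concrete, the following minimal sketch estimates it for a hyperplane $\{x : v^\top x = b\}$ with unit normal $v$. It relies on a standard property of the isotropic Gaussian kernel: integrating the $d$-dimensional kernel density estimate over the hyperplane gives the one-dimensional kernel estimate of the projections $v^\top x_i$ evaluated at $b$. The function name `hyperplane_density`, the bandwidth `h = 0.5`, and the simulated two-cluster data are assumptions for illustration. The sketch evaluates a fixed direction only; it does not perform the optimisation over $v$ and $b$ described in the works cited above.

```python
import numpy as np

def hyperplane_density(X, v, b, h):
    """Estimated density on the hyperplane {x : v.x = b}.

    Computed as a 1-D Gaussian kernel density estimate of the
    projections X @ v, evaluated at the offset b.
    """
    v = v / np.linalg.norm(v)          # use a unit-length normal vector
    proj = X @ v                       # one-dimensional projections
    z = (b - proj) / h
    return np.exp(-0.5 * z**2).sum() / (len(proj) * h * np.sqrt(2 * np.pi))

# Hypothetical data: two Gaussian clusters in 50 dimensions that
# differ only in the mean of the first coordinate (-3 versus +3).
rng = np.random.default_rng(1)
d = 50
left = rng.normal(0.0, 1.0, size=(100, d))
right = rng.normal(0.0, 1.0, size=(100, d))
left[:, 0] -= 3.0
right[:, 0] += 3.0
X = np.vstack([left, right])

v = np.zeros(d)
v[0] = 1.0
between = hyperplane_density(X, v, b=0.0, h=0.5)  # hyperplane between the clusters
through = hyperplane_density(X, v, b=3.0, h=0.5)  # hyperplane through one cluster
print(between, through)
```

Only a one-dimensional density estimate is ever computed, so the cost and reliability of the estimate do not degrade with the ambient dimension in the way a full $d$-dimensional estimate would.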

[...] is Azzalini and Menardi (2016). This work first applies multi-dimensional scaling (MDS) (Borg and Groenen, 2005) to produce a low-dimensional continuous representation, and then applies the pdfCluster algorithm (Menardi and Azzalini, 2014). We also consider continuous representations produced by mixed probabilistic principal component analysis (mPPCA) (Khan et al., 2010) and constant shift embedding (CSE) (Roth et al., 2003). Due to our alternative formulation of density-based clustering, we also remove the restriction to a low-dimensional continuous embedding, which makes the approach more appropriate for datasets with larger numbers of clusters.
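As a rough illustration of how a continuous representation of mixed data can be obtained with MDS, the sketch below computes a dissimilarity matrix over numeric and categorical attributes and embeds it with classical (Torgerson) MDS. The Gower-style dissimilarity, the function names, the integer coding of categories, and the toy data are all assumptions made for this example; the works cited above may use a different dissimilarity or a different MDS variant.

```python
import numpy as np

def gower_dissimilarity(num, cat):
    """Gower-style dissimilarity (an assumed choice): range-scaled absolute
    difference on numeric columns, simple mismatch on categorical columns,
    averaged over all columns."""
    ranges = num.max(axis=0) - num.min(axis=0)
    ranges[ranges == 0] = 1.0  # avoid division by zero for constant columns
    d_num = np.abs(num[:, None, :] - num[None, :, :]) / ranges
    d_cat = (cat[:, None, :] != cat[None, :, :]).astype(float)
    return np.concatenate([d_num, d_cat], axis=2).mean(axis=2)

def classical_mds(D, k):
    """Classical MDS: double-centre the squared dissimilarities and keep
    the k leading eigenvectors, scaled by the square roots of their
    eigenvalues (negative eigenvalues are clipped to zero)."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D**2) @ J
    vals, vecs = np.linalg.eigh(B)
    order = np.argsort(vals)[::-1][:k]
    vals, vecs = vals[order], vecs[:, order]
    return vecs * np.sqrt(np.clip(vals, 0.0, None))

# Hypothetical mixed data: two numeric attributes and one categorical
# attribute stored as integer codes.
num = np.array([[1.0, 10.0], [2.0, 12.0], [8.0, 30.0], [9.0, 33.0]])
cat = np.array([[0], [0], [1], [1]])

D = gower_dissimilarity(num, cat)
Z = classical_mds(D, k=2)  # continuous representation, one row per observation
print(Z)
```

The rows of `Z` can then be passed to any clustering algorithm that expects continuous input. The embedding dimension `k` is a free parameter here; nothing in this sketch requires it to be small.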

In this chapter, we address the aforementioned limitations of density-based clustering associated with high-dimensional and mixed data. We develop a divisive hierarchical clustering algorithm and a partitional ensemble clustering algorithm, which use low-density separators to identify dense clusters associated with the modes of the estimated continuous probability density function. The separators are obtained through one-dimensional projections of the data, which makes the approach applicable in high-dimensional settings where the construction of an estimated density over all dimensions is infeasible. In the case of mixed observations, we first obtain a continuous representation before attempting to identify clusters. Our algorithms can identify clusters in different arbitrarily oriented subspaces, as well as estimate their number.

The remainder of this chapter is organised as follows. Section 4.2 presents the methodology for the proposed algorithms. First, we formulate the problem of projective density-based clustering for bi-partitioning, and then present our approaches for producing a full clustering based on these binary partitions. Next, Section 4.3 considers the production of a continuous representation of mixed data, allowing our algorithms to be applied to such datasets. Section 4.4 provides a comparative evaluation of the proposed algorithms against alternative density-based and state-of-the-art clustering algorithms on simulated and real-world datasets. The chapter ends with conclusions in Section 4.5.