
2.2 Open Problems in Clustering

2.2.2 Mixed Data

Although there are a variety of different definitions of a cluster, it is common to assume that dissimilarity between observations is related to a measure of spatial separation, usually Euclidean distance. However, in real-world applications, observations often have attributes of diverse types (mixed data). In datasets with ordinal and nominal variables, the discrete features can make standard continuous distance metrics, such as Euclidean distance, inappropriate for defining dissimilarity between observations.

This poses a significant challenge for the majority of approaches to clustering. In centroid-based clustering, non-numeric attributes make it impossible to compute the cluster centroids for the $k$-means algorithm, and even for discrete numeric data, the spatial distance between an observation and its assigned cluster centroid is not interpretable in the same way as for continuous data. Likewise, the notion of nearest neighbours or $\epsilon$-neighbourhoods, used to construct the adjacency matrix of the graph in spectral clustering, becomes invalid when considering spatial separation alone. Similarly, the definition of clusters as regions of high probability density requires an appropriate, continuous measure of spatial separation between the observations to construct the estimated density $\hat{p}$. Therefore, clustering non-continuous data using algorithms that rely on spatial proximity between observations is inappropriate.

One naive approach is to discard any non-continuous features, making the assumption that the clustering structure is evident in the continuous dimensions. However, this risks removing information which is necessary for cluster detection. Another naive approach is to treat all features as if they were continuous and proceed with a conventional clustering technique. This is also problematic, as any observations with the same combination of possible outcomes in the discrete dimensions will have low spatial separation, introducing an inherent grouping structure, which may not be truly indicative of the clusters present.

In the literature, there are two main approaches to incorporating mixed data for clustering. The first of these is to use an alternative distance metric to define pairwise dissimilarities between observations. The most well-known distance metric for mixed variables is the Gower distance (Gower, 1971), where the pairwise dissimilarity between observations $x_i$ and $x_j$ is defined as

$$D_{ij} = \frac{\sum_{k=1}^{d} w_k \, d_{ij,k}}{\sum_{k=1}^{d} w_k}, \tag{2.5}$$

where

$$d_{ij,k} = \frac{|x_{i,k} - x_{j,k}|}{\max(x_{\cdot,k}) - \min(x_{\cdot,k})} \tag{2.6}$$

for continuous and ordinal attributes and

$$d_{ij,k} = \begin{cases} 1, & x_{i,k} \neq x_{j,k} \\ 0, & x_{i,k} = x_{j,k} \end{cases} \tag{2.7}$$

for binary and categorical attributes. Also, $x_{i,k}$ is the $k$th dimension of the $i$th observation, $x_{\cdot,k} = (x_{1,k}, \dots, x_{n,k})$ and $w_k$ is the user-defined weight for each variable in $x$, which is typically set to $w_k = 1 \ \forall k$. Using this metric, it is possible to apply any clustering algorithm which relies only on pairwise distances between observations, such as the hierarchical clustering algorithms discussed in Section 2.1.1, PAM or spectral clustering.
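As an illustration of Eqs. (2.5) to (2.7), the following is a minimal sketch of a Gower dissimilarity matrix in plain Python. The function name `gower_matrix`, the `is_numeric` flag list and the treatment of a constant numeric column (contributing zero dissimilarity) are assumptions made for this example and are not taken from Gower (1971).

```python
def gower_matrix(rows, is_numeric, weights=None):
    """Pairwise Gower dissimilarities for a list of equal-length records.

    is_numeric[k] is True for continuous/ordinal attributes (Eq. 2.6)
    and False for binary/categorical attributes (Eq. 2.7).
    """
    n, d = len(rows), len(rows[0])
    weights = weights or [1.0] * d  # w_k = 1 for all k by default

    # Range max(x_.k) - min(x_.k) of every numeric attribute.
    ranges = []
    for k in range(d):
        if is_numeric[k]:
            column = [row[k] for row in rows]
            ranges.append(max(column) - min(column))
        else:
            ranges.append(None)

    total_weight = sum(weights)
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            acc = 0.0
            for k in range(d):
                if is_numeric[k]:
                    # Assumption: a constant column contributes 0.
                    d_k = abs(rows[i][k] - rows[j][k]) / ranges[k] if ranges[k] else 0.0
                else:
                    d_k = 0.0 if rows[i][k] == rows[j][k] else 1.0
                acc += weights[k] * d_k
            D[i][j] = D[j][i] = acc / total_weight
    return D


rows = [(1.0, "a"), (3.0, "b"), (5.0, "a")]
D = gower_matrix(rows, is_numeric=[True, False])
print(D[0][1], D[0][2])  # 0.75 0.5
```

The resulting matrix can then be passed to any method that only needs pairwise dissimilarities, which is the property the text relies on.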

A similar approach has also been proposed for $k$-means clustering with categorical variables in Huang (1997, 1998). In this work, the distance between an observation $x_i$ with continuous and discrete features $(x_i^C, x_i^D)$ and a cluster centroid $c_j = (c_j^C, c_j^D)$ is given by

$$D_{ij} = \sum_{k=1}^{d_C} \left(x_{i,k}^C - c_{j,k}^C\right)^2 + w_j \sum_{k=1}^{d_D} d_{ij,k}, \tag{2.8}$$

where $w_j$ is the weight of the categorical data for cluster $j$, $d_C$ and $d_D$ are the number of continuous and discrete variables respectively, and $d_{ij,k}$ is defined in Eq. (2.7), replacing $x_j$ with $c_j$. The algorithm then aims to minimise the sum of distances between the observations and their assigned centroid, as in the classical $k$-means algorithm.
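The distance in Eq. (2.8) and the resulting assignment step can be sketched as follows. This is only an illustration under stated assumptions: the function names `mixed_distance` and `assign_to_centroid` and the representation of a centroid as a pair of lists are hypothetical, and the centroid update step of the algorithm is not shown.

```python
def mixed_distance(x_cont, x_disc, c_cont, c_disc, w_j):
    """Eq. (2.8): squared Euclidean part plus weighted mismatch count."""
    squared = sum((x - c) ** 2 for x, c in zip(x_cont, c_cont))
    # Eq. (2.7) with the centroid in place of the second observation.
    mismatches = sum(1 for x, c in zip(x_disc, c_disc) if x != c)
    return squared + w_j * mismatches


def assign_to_centroid(x_cont, x_disc, centroids, weights):
    """Index of the centroid with the smallest mixed distance.

    centroids is a list of (continuous_part, discrete_part) pairs and
    weights[j] is the categorical weight w_j of cluster j.
    """
    distances = [
        mixed_distance(x_cont, x_disc, c_cont, c_disc, w)
        for (c_cont, c_disc), w in zip(centroids, weights)
    ]
    return min(range(len(distances)), key=distances.__getitem__)


centroids = [([0.0, 0.0], ["a", "c"]), ([1.0, 2.0], ["b", "b"])]
print(mixed_distance([1.0, 2.0], ["a", "b"], *centroids[0], w_j=0.5))  # 5.5
print(assign_to_centroid([1.0, 2.0], ["a", "b"], centroids, [0.5, 0.5]))  # 1
```

In the example, the observation matches the second centroid exactly in the continuous attributes and differs in one categorical attribute, so its distance to that centroid is 0.5 and it is assigned there.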

This work was extended by Ahmad and Dey (2007) by weighting each of the distances for the continuous attributes, based on the pairwise separations of the observations in that attribute. This assumes that attributes showing high levels of separation are more relevant for clustering than those with low levels of separation. In addition, the distance between categorical attributes is not a binary outcome; instead, the probability distribution of co-occurrence of values in each attribute is considered. Ahmad and Dey (2011) also add a local weight for each attribute in each cluster to the distance function. This can be thought of as a subspace algorithm, as the distances are weighted differently along different dimensions for each cluster.

It is also possible to apply model-based clustering to mixed datasets by assuming an appropriate finite mixture model over the clusters. Everitt (1988) takes this approach, assuming a parametric model for a set of realisations of a mixed variable $x$ with $d_C$ continuous and $d_D$ discrete components, denoted $x^C$ and $x^D$ respectively. The parametric model is given by

$$p(x) = \sum_{i=1}^{k} \zeta_i \, \mathrm{MVN}_{d_C + d_D}(\mu_i, \Sigma_i),$$

where $\zeta = (\zeta_1, \dots, \zeta_k)$ is the vector of mixing proportions such that $\sum_{i=1}^{k} \zeta_i = 1$ and $\mathrm{MVN}_{d_C + d_D}(\mu_i, \Sigma_i)$ denotes a $(d_C + d_D)$-dimensional multivariate normal distribution with mean $\mu_i$ and covariance $\Sigma_i$. However, the $d_D$-dimensional multivariate normal random variables associated with the discrete attributes cannot be observed directly. Instead, the discrete observation vector is modelled as a threshold-discretised form of a multivariate normal random variable. This discretisation requires multiple integrals of multivariate normal distributions, which are computationally expensive. Thereafter, however, parameter estimation to fit the model and locate the clusters is a standard maximum likelihood estimation problem.
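To make the threshold construction concrete, the sketch below simulates data from a model of this kind: a latent vector is drawn from a normal mixture, the first $d_C$ coordinates are kept as they are, and the remaining coordinates are replaced by the index of the interval they fall into. It covers only the generative side and not the estimation procedure of Everitt (1988). The function name, the use of user-supplied Cholesky factors $L_i$ with $\Sigma_i = L_i L_i^\top$, and the threshold values are assumptions made for this example.

```python
import bisect
import random


def sample_thresholded_mixture(n, zeta, mus, chols, d_c, thresholds, seed=0):
    """Draw n observations as (continuous_part, discrete_part) pairs.

    zeta: mixing proportions; mus[i]: mean of component i;
    chols[i]: lower-triangular Cholesky factor of Sigma_i (assumed given);
    thresholds[m]: sorted cut points for the m-th discrete attribute.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        # Choose component i with probability zeta_i.
        i = rng.choices(range(len(zeta)), weights=zeta)[0]
        # z = mu_i + L_i e with e standard normal has covariance L_i L_i^T.
        e = [rng.gauss(0.0, 1.0) for _ in mus[i]]
        z = [m + sum(l * v for l, v in zip(row, e))
             for m, row in zip(mus[i], chols[i])]
        continuous = z[:d_c]
        # The latent discrete coordinates are never observed directly;
        # only the interval each one falls into is recorded.
        discrete = [bisect.bisect_right(cuts, v)
                    for cuts, v in zip(thresholds, z[d_c:])]
        data.append((continuous, discrete))
    return data


sample = sample_thresholded_mixture(
    n=5,
    zeta=[0.5, 0.5],
    mus=[[0.0, -2.0], [5.0, 2.0]],
    chols=[[[1.0, 0.0], [0.5, 1.0]], [[1.0, 0.0], [0.5, 1.0]]],
    d_c=1,
    thresholds=[[0.0]],  # one binary attribute, cut at zero
)
```

Fitting the model reverses this process: the likelihood of an observed discrete value is the probability that the latent normal variable lies in the corresponding interval, which is where the multivariate normal integrals mentioned above arise.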

In document Low Density Cluster Separators for Large, High Dimensional, Mixed and Non Linearly Separable Data (Page 39-42)