
2.2 Open Problems in Clustering

2.2.2 Mixed Data

Although there are a variety of different definitions of a cluster, it is common to assume that dissimilarity between observations is related to a measure of spatial separation, usually Euclidean distance. However, in real-world applications, observations often have attributes of diverse types (mixed data). In datasets with ordinal and nominal variables, discrete features can make standard continuous distance metrics, such as Euclidean distance, inappropriate for defining dissimilarity between observations.

This poses a significant challenge for the majority of approaches to clustering. In centroid-based clustering, non-numeric attributes make it impossible to compute the cluster centroids for the $k$-means algorithm, and even for discrete numeric data, the spatial distances between observations and their assigned cluster centroid are not interpretable in the same way as for continuous data. Likewise, the notion of nearest neighbours or $\epsilon$-neighbourhoods, used to construct the adjacency matrix of the graph in spectral clustering, becomes invalid when considering spatial separation alone. Similarly, the definition of clusters as regions of high probability density requires an appropriate, continuous measure of spatial separation between the observations to construct the estimated density $\hat{p}$. Therefore, clustering non-continuous data using algorithms that rely on spatial proximity between observations is inappropriate.

One naive approach is to discard any non-continuous features, making the assumption that the clustering structure is evident in the continuous dimensions. However, this risks removing information which is necessary for cluster detection. Another naive approach is to treat all features as if they were continuous and proceed with a conventional clustering technique. This is also problematic, as any observations with the same combination of possible outcomes in the discrete dimensions will have low spatial separation, introducing an inherent grouping structure, which may not be truly indicative of the clusters present.

In the literature, there are two main approaches to incorporating mixed data for clustering. The first of these is to use an alternative distance metric to define pairwise dissimilarities between observations. The most well-known distance metric for mixed variables is the Gower distance (Gower, 1971), where the pairwise dissimilarity between observations $x_i$ and $x_j$ is defined as

$$D_{ij} = \frac{\sum_{k=1}^{d} w_k \, d_{ij,k}}{\sum_{k=1}^{d} w_k}, \tag{2.5}$$

where

$$d_{ij,k} = \frac{|x_{i,k} - x_{j,k}|}{\max(x_{\cdot,k}) - \min(x_{\cdot,k})} \tag{2.6}$$

for continuous and ordinal attributes and

$$d_{ij,k} = \begin{cases} 1, & x_{i,k} \neq x_{j,k} \\ 0, & x_{i,k} = x_{j,k} \end{cases} \tag{2.7}$$

for binary and categorical attributes. Here, $x_{i,k}$ is the $k$th dimension of the $i$th observation, $x_{\cdot,k} = (x_{1,k}, \ldots, x_{n,k})$ and $w_k$ is the user-defined weight for each variable in $x$, which is typically set to $w_k = 1 \ \forall k$. Using this metric, it is possible to apply any clustering algorithm which relies only on pairwise distances between observations, such as the hierarchical clustering algorithms discussed in Section 2.1.1, PAM or spectral clustering.
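As an illustration of Eqs. (2.5) to (2.7), the following is a minimal NumPy sketch of the pairwise Gower dissimilarity matrix. The function name `gower_matrix`, the split of the data into a numeric array and a categorical array, and the handling of constant attributes (treated as contributing zero) are assumptions made for this example and are not taken from Gower (1971).

```python
import numpy as np

def gower_matrix(X_num, X_cat, weights=None):
    """Pairwise Gower dissimilarities following Eqs. (2.5)-(2.7).

    X_num   : (n, d_num) continuous/ordinal attributes (hypothetical layout).
    X_cat   : (n, d_cat) binary/categorical labels.
    weights : optional vector w_k of length d_num + d_cat (default: all ones).
    """
    X_num = np.asarray(X_num, dtype=float)
    X_cat = np.asarray(X_cat)
    d_num, d_cat = X_num.shape[1], X_cat.shape[1]
    if weights is None:
        w = np.ones(d_num + d_cat)
    else:
        w = np.asarray(weights, dtype=float)

    # Eq. (2.6): absolute difference scaled by the range of each attribute.
    ranges = X_num.max(axis=0) - X_num.min(axis=0)
    ranges[ranges == 0] = 1.0  # assumption: a constant attribute contributes 0
    d_numeric = np.abs(X_num[:, None, :] - X_num[None, :, :]) / ranges

    # Eq. (2.7): 1 if the categories differ, 0 if they match.
    d_categorical = (X_cat[:, None, :] != X_cat[None, :, :]).astype(float)

    # Eq. (2.5): weighted average of the per-attribute dissimilarities.
    d_all = np.concatenate([d_numeric, d_categorical], axis=2)
    return (d_all * w).sum(axis=2) / w.sum()
```

The resulting matrix can be passed to any routine that accepts precomputed pairwise dissimilarities. The intermediate arrays have size n × n × d, so this sketch is only suitable for small datasets.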

A similar approach has also been proposed for $k$-means clustering with categorical variables in Huang (1997, 1998). In these papers, the distance between an observation $x_i$ with continuous and discrete features $(x_i^C, x_i^D)$ and a cluster centroid $c_j = (c_j^C, c_j^D)$ is given by

$$D_{ij} = \sum_{k=1}^{d_C} \left(x^C_{i,k} - c^C_{j,k}\right)^2 + w_j \sum_{k=1}^{d_D} d_{ij,k}, \tag{2.8}$$

where $w_j$ is the weight of the categorical data for cluster $j$, $d_C$ and $d_D$ are the number of continuous and discrete variables respectively, and $d_{ij,k}$ is defined in Eq. (2.7), replacing $x_j$ with $c_j$. The algorithm then aims to minimise the sum of distances between the observations and their assigned centroid, as in the classical $k$-means algorithm.
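The distance in Eq. (2.8) and the resulting assignment step can be sketched as follows. This is an illustrative example only: the function names, argument layout and default weight are assumptions, and the centroid update step of the full algorithm is not shown.

```python
import numpy as np

def k_prototypes_distance(x_num, x_cat, c_num, c_cat, w_j=1.0):
    """Mixed distance of Eq. (2.8) between one observation and one centroid."""
    x_num = np.asarray(x_num, dtype=float)
    c_num = np.asarray(c_num, dtype=float)
    # Continuous part: squared Euclidean distance to the centroid.
    squared_euclidean = np.sum((x_num - c_num) ** 2)
    # Discrete part: number of mismatching categories, as in Eq. (2.7).
    mismatches = np.sum(np.asarray(x_cat) != np.asarray(c_cat))
    return float(squared_euclidean + w_j * mismatches)

def assign_to_centroids(X_num, X_cat, C_num, C_cat, w=1.0):
    """Assign each observation to the centroid minimising Eq. (2.8).

    w may be a scalar or one weight per cluster (hypothetical interface).
    """
    n_clusters = len(C_num)
    w = np.broadcast_to(np.asarray(w, dtype=float), (n_clusters,))
    labels = []
    for x_num, x_cat in zip(X_num, X_cat):
        dists = [
            k_prototypes_distance(x_num, x_cat, C_num[j], C_cat[j], w[j])
            for j in range(n_clusters)
        ]
        labels.append(int(np.argmin(dists)))
    return np.array(labels)
```

The weight $w_j$ controls the trade-off between the two parts: with $w_j = 0$ the categorical attributes are ignored, while a large $w_j$ lets category mismatches dominate the assignment.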

This work was extended by Ahmad and Dey (2007) by weighting each of the distances for the continuous attributes, based on the pairwise separations of the observations in that attribute. This assumes that attributes showing high levels of separation are more relevant for clustering than those with low levels of separation. In addition, the distance between categorical attributes is not a binary outcome; instead, the probability distribution of co-occurrence of values in each attribute is considered. Ahmad and Dey (2011) also add a local weight for each attribute in each cluster to the distance function. This can be thought of as a subspace algorithm, as the distances are weighted differently along different dimensions for each cluster.

It is also possible to apply model-based clustering to mixed datasets by assuming an appropriate finite mixture model over the clusters. Everitt (1988) takes this approach, assuming a parametric model for a set of realisations of a mixed variable $x$ with $d_C$ continuous and $d_D$ discrete components, denoted $x^C$ and $x^D$ respectively. The parametric model is given by

$$p(x) = \sum_{i=1}^{k} \zeta_i \, \mathrm{MVN}_{d_C + d_D}(\mu_i, \Sigma_i),$$

where $\zeta = (\zeta_1, \ldots, \zeta_k)$ is the vector of mixing proportions such that $\sum_{i=1}^{k} \zeta_i = 1$ and $\mathrm{MVN}_{d_C + d_D}(\mu_i, \Sigma_i)$ denotes a $(d_C + d_D)$-dimensional multivariate normal distribution with mean $\mu_i$ and covariance $\Sigma_i$. However, the $d_D$-dimensional multivariate normal random variables associated with the discrete attributes cannot be observed directly. Instead, the discrete observation vector is modelled as a threshold-discretised form of a multivariate normal random variable. Evaluating the likelihood under this discretisation requires multiple integrals of multivariate normal densities, which is computationally expensive. Thereafter, however, parameter estimation to fit the model and locate the clusters is a standard maximum likelihood estimation problem.
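To make the threshold discretisation concrete, the sketch below simulates data from a single mixture component: a latent multivariate normal vector is drawn, its first $d_C$ coordinates are kept as the continuous attributes, and the remaining $d_D$ coordinates are cut at fixed thresholds to give the observed discrete attributes. The function names and the threshold values are hypothetical; the text above does not specify Everitt's parameterisation of the thresholds, and this example covers only the generative direction, not the likelihood integrals or the fitting procedure.

```python
import numpy as np

def discretise_latent(z_latent, thresholds):
    """Cut each latent normal column at its thresholds.

    thresholds : list with one increasing 1-D array of cut points per column
                 (hypothetical values chosen by the user).
    Returns integer category codes 0, 1, ..., len(cut points) per column.
    """
    z_latent = np.asarray(z_latent, dtype=float)
    columns = [
        np.digitize(z_latent[:, k], thresholds[k])
        for k in range(z_latent.shape[1])
    ]
    return np.column_stack(columns)

def sample_component(mean, cov, d_c, thresholds, n, rng):
    """Draw n mixed observations from one latent MVN component."""
    z = rng.multivariate_normal(mean, cov, size=n)
    x_continuous = z[:, :d_c]                            # observed directly
    x_discrete = discretise_latent(z[:, d_c:], thresholds)  # observed via cuts
    return x_continuous, x_discrete

# Example: two continuous attributes and one binary attribute.
rng = np.random.default_rng(0)
x_c, x_d = sample_component(
    mean=np.zeros(3), cov=np.eye(3), d_c=2,
    thresholds=[np.array([0.0])], n=5, rng=rng,
)
```

Because only the category codes are observed, the probability of a discrete outcome is the integral of the latent normal density over the corresponding region between thresholds, which is the source of the computational cost noted above.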