Mixed Type Attributes Data - Wackersreuther, Bianca (2011): Efficient Knowledge Extracti

2.2 Mixed Type Attributes Data

A prominent characteristic of data mining is that it deals with very large and complex datasets. These data often contain millions of objects described by various types of attributes or variables, e.g. numerical, categorical, ratio or binary. Hence, integrative data mining algorithms have to be scalable and capable of dealing with different types of attributes. In terms of clustering, we are interested in algorithms that can efficiently cluster large datasets containing both numeric and categorical values, because such datasets are frequently encountered in data mining applications. The traditional approach of treating categorical attributes as numeric does not always produce meaningful results, because many categorical domains are not ordered.

One simple idea for performing integrative clustering on heterogeneous data is to combine k-means based methods with the k-modes algorithm. The k-prototypes algorithm takes advantage of this combination and is one of the first approaches towards integrative clustering.

2.2.1 The k-means Algorithm for Clustering Numerical Data

The k-means algorithm [72] is one of the most widely used partitioning (non-hierarchical) clustering approaches. The procedure groups a given dataset DS consisting of n numeric objects into a fixed number of k (< n) clusters. First, k centroids are defined, one for each cluster. Then, the algorithm takes each point of DS and associates it with the nearest centroid, until no point is pending. Afterwards, the k centroids are recalculated as the mean value µ over the coordinates of all data points belonging to the respective cluster, so that a new association has to be determined between the points of DS and the nearest new centroid. These two steps are repeated until the centroids no longer change location. In this way, k-means aims at minimizing an objective function, in this case the within-cluster sum of squared errors (WCSS):

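$$WCSS = \sum_{j=1}^{k} \sum_{i=1}^{n} \left( dist(x_{i,j}, \mu_j) \right)^2,$$

where $dist(x_{i,j}, \mu_j)$ is a chosen distance function between a data point $x_{i,j}$ and the centroid $\mu_j$ of cluster $C_j$. The WCSS is an indicator of the distance of the n data points from their respective cluster centroids.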

22 2. Related Work

The main advantage of k-means is its efficiency. It can be proven that the procedure converges after a finite number of iterations, each of which requires O(nk) distance computations. However, this method has four major drawbacks. First, the number of clusters k has to be specified in advance. Second, the cluster compactness measure WCSS, and thus the clustering result, is very sensitive to noise and outliers. Third, k-means implicitly assumes a Gaussian data distribution and is thus restricted to detecting spherically compact clusters. Fourth, and most importantly, it is applicable only to numeric data.
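The two alternating steps of k-means described in this section can be sketched in Python as follows. This is a minimal illustration, not the implementation referenced in [72]: the function names, the random choice of initial centroids among the data points, the iteration cap max_iter and the use of the squared Euclidean distance as dist are all assumptions made for this example.

```python
import random


def squared_euclidean(x, y):
    """Squared Euclidean distance between two numeric points."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(data, k, max_iter=100, seed=0):
    """Return (centroids, labels, wcss) for a list of numeric tuples."""
    rng = random.Random(seed)
    # Assumption: the initial centroids are k randomly chosen data points.
    centroids = rng.sample(data, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: associate each point with its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_euclidean(x, centroids[j]))
            for x in data
        ]
        if new_labels == labels:
            break  # no assignment changed, so the centroids stay where they are
        labels = new_labels
        # Update step: recompute each centroid as the mean of its cluster.
        for j in range(k):
            members = [x for x, lab in zip(data, labels) if lab == j]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[j] = mean_point(members)
    # Objective value: within-cluster sum of squared errors.
    wcss = sum(squared_euclidean(x, centroids[lab]) for x, lab in zip(data, labels))
    return centroids, labels, wcss


data = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
centroids, labels, wcss = k_means(data, k=2)
print(labels, round(wcss, 3))
```

On this toy dataset the two well-separated groups end up in different clusters. The cluster numbering itself depends on the initial centroids, which illustrates the sensitivity to initialization.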

2.2.2 Conceptual Clustering Algorithms for Categorical Data

In principle, the formulation of the WCSS in Section 2.2.1 is also valid for categorical and mixed type objects. The reason why the k-means algorithm cannot cluster categorical objects is that the mean value is not defined for categorical data. This limitation can be removed by the following modifications:

• Use a simple matching distance function for categorical objects.

• Replace means as cluster representatives by modes.

• Use a frequency-based method to find the modes.

A Distance Function for Categorical Objects

Let x, y be two categorical objects described by m categorical attributes. The distance between these two objects can be defined as the total number of mismatches between the corresponding attribute categories of the two objects. The smaller the number of mismatches, the more similar x and y are. This measure is often referred to as simple matching [56]. Formally, this distance function is defined as follows:

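$$dist_{Simple}(x, y) = \sum_{i=1}^{m} \delta(x_i, y_i), \quad \text{where} \quad \delta(x_i, y_i) = \begin{cases} 0, & \text{if } x_i = y_i, \\ 1, & \text{if } x_i \neq y_i. \end{cases}$$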
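As a small illustration, the simple matching distance can be written in a few lines of Python. The function name dist_simple and the representation of categorical objects as tuples of strings are assumptions made for this sketch.

```python
def dist_simple(x, y):
    """Number of attributes on which the categorical objects x and y differ."""
    if len(x) != len(y):
        raise ValueError("objects must have the same number of attributes")
    # delta(x_i, y_i) is 1 for a mismatch and 0 for a match.
    return sum(1 for xi, yi in zip(x, y) if xi != yi)


x = ("red", "small", "round")
y = ("red", "large", "round")
print(dist_simple(x, y))  # 1, because only the second attribute differs
```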


Modes as Cluster Representatives

Consider a set X of n categorical objects described by the categorical attributes $A_1, A_2, \ldots, A_m$. The mode of this set is an array $q = [q_1, q_2, \ldots, q_m]$ of length m that minimizes the following formula:

$$D(X, q) = \sum_{i=1}^{n} dist_{Simple}(X_i, q)$$

Calculation of the Modes

Let $n_{c_{k,i}}$ be the number of objects having the k-th category $c_{k,i}$ in attribute $A_i$, and let $p(A_i = c_{k,i} \mid X) = n_{c_{k,i}} / n$ be the relative frequency of category $c_{k,i}$ in X. Then the function $D(X, q)$ is minimized if and only if $p(A_i = q_i \mid X) \geq p(A_i = c_{k,i} \mid X)$ for $q_i \neq c_{k,i}$ for all $i \in \{1, \ldots, m\}$ [48].

This theorem defines a way to find the mode q from a given set of categorical objects X. It is important because it allows the k-means paradigm to be used to cluster categorical data.
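The frequency-based construction of the mode can be sketched as follows: for every attribute, a most frequent category is selected. The helper names find_mode and total_distance are hypothetical, and breaking ties in favor of the category encountered first is an assumption of this sketch, since any most frequent category satisfies the condition above.

```python
from collections import Counter


def dist_simple(x, y):
    """Simple matching distance: number of mismatching attributes."""
    return sum(1 for xi, yi in zip(x, y) if xi != yi)


def find_mode(objects):
    """Pick, for every attribute, a category with the highest frequency."""
    # zip(*objects) yields one tuple of values per attribute (column).
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*objects))


def total_distance(objects, q):
    """D(X, q): sum of simple matching distances from all objects to q."""
    return sum(dist_simple(obj, q) for obj in objects)


X = [("a", "x"), ("b", "y"), ("a", "z"), ("c", "y")]
q = find_mode(X)
print(q)                     # ('a', 'y')
print(total_distance(X, q))  # 4
```

In this example the mode ('a', 'y') is not itself an element of X, which shows that a mode is a representative of the set rather than necessarily one of its objects.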

The Algorithm k-modes

Conceptual clustering algorithms like k-modes [49] implement the idea of clustering categorical data by using $dist_{Simple}$ as the distance function together with the aforementioned modifications of k-means clustering. The procedure of the k-modes algorithm can be summarized as follows:

1. Select k initial modes, one for each cluster.

2. Allocate an object to the cluster whose mode is nearest to it according to the distance function $dist_{Simple}$; update the mode of the cluster after each allocation according to the theorem presented in Section 2.2.2.

3. After all objects have been allocated to clusters, retest the distance of objects against the current modes. If an object is found whose nearest mode belongs to a cluster other than its current one, reallocate the object to that cluster and update the modes of both clusters.

4. Repeat step (3) until no object has changed clusters after a full cycle test of the whole dataset.
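The four steps above can be sketched in Python as shown below. This is an illustrative reading of the procedure, not the reference implementation from [49]. Several details are assumptions of the sketch: the initial modes are passed in by the caller, ties between equally frequent categories or equally near modes are resolved in favor of the first candidate, an emptied cluster keeps its last mode, and max_passes is a hypothetical safety cap.

```python
from collections import Counter


def dist_simple(x, y):
    """Simple matching distance: number of mismatching attributes."""
    return sum(1 for xi, yi in zip(x, y) if xi != yi)


def find_mode(objects):
    """Most frequent category per attribute (first encountered wins ties)."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*objects))


def nearest(x, modes):
    """Index of the mode with the smallest simple matching distance to x."""
    return min(range(len(modes)), key=lambda j: dist_simple(x, modes[j]))


def k_modes(data, initial_modes, max_passes=100):
    """Return (modes, labels) for a list of categorical tuples."""
    modes = list(initial_modes)                 # step 1: k initial modes
    clusters = [[] for _ in modes]              # object indices per cluster
    labels = [None] * len(data)

    # Step 2: allocate each object and update the affected mode immediately.
    for idx, x in enumerate(data):
        j = nearest(x, modes)
        clusters[j].append(idx)
        labels[idx] = j
        modes[j] = find_mode([data[i] for i in clusters[j]])

    # Steps 3 and 4: retest all objects until a full pass moves nothing.
    for _ in range(max_passes):
        moved = False
        for idx, x in enumerate(data):
            j, old = nearest(x, modes), labels[idx]
            if j != old:
                clusters[old].remove(idx)
                clusters[j].append(idx)
                labels[idx] = j
                modes[j] = find_mode([data[i] for i in clusters[j]])
                if clusters[old]:  # an emptied cluster keeps its last mode
                    modes[old] = find_mode([data[i] for i in clusters[old]])
                moved = True
        if not moved:
            break
    return modes, labels


data = [
    ("a", "x", "p"), ("a", "x", "q"), ("a", "y", "p"),
    ("b", "z", "r"), ("b", "z", "s"), ("c", "z", "r"),
]
modes, labels = k_modes(data, initial_modes=[data[0], data[3]])
print(modes)   # [('a', 'x', 'p'), ('b', 'z', 'r')]
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Because modes are updated after every single allocation, the outcome can depend on the order in which the objects are processed.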

Like the k-means algorithm, the k-modes algorithm produces locally optimal solutions that depend on the initial modes and the order of objects in the dataset. Its efficiency relies on good search strategies. For data mining problems, which often involve many concepts and very large object spaces, concept-based search methods can become a handicap when these algorithms have to deal with extremely large datasets.

2.2.3 The k-prototypes Algorithm for Integrative Clustering

One of the first algorithms for integrative clustering is k-prototypes [47]. The distance between two mixed type objects x and y, which are described by m attributes $A^n_1, A^n_2, \ldots, A^n_p, A^c_{p+1}, A^c_{p+2}, \ldots, A^c_m$, is measured by the following formula:

$$dist_{Mixed}(x, y) = \sum_{i=1}^{p} (x_i - y_i)^2 + \gamma \sum_{i=p+1}^{m} \delta(x_i, y_i).$$

The first term is the squared Euclidean distance measure on the p numeric attributes $A^n_i$ and the second term is the simple matching distance function on the m − p categorical attributes $A^c_i$. The weight γ is used to avoid favoring either type of attribute. The influence of this parameter on the clustering process is discussed in the publication by Huang [47].
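A minimal sketch of the mixed distance is given below. It assumes that each object is stored as a tuple whose first p entries are numeric and whose remaining entries are categorical. The function name dist_mixed and the value gamma = 0.5 are chosen only for illustration and are not recommendations from [47].

```python
def dist_mixed(x, y, p, gamma):
    """Mixed distance: squared Euclidean part plus weighted mismatch count.

    The first p entries of x and y are treated as numeric attributes,
    all remaining entries as categorical attributes.
    """
    numeric_part = sum((a - b) ** 2 for a, b in zip(x[:p], y[:p]))
    categorical_part = sum(1 for a, b in zip(x[p:], y[p:]) if a != b)
    return numeric_part + gamma * categorical_part


x = (1.0, 2.0, "red", "small")
y = (2.0, 4.0, "red", "large")
# Numeric part: (1 - 2)^2 + (2 - 4)^2 = 5; categorical part: 1 mismatch.
print(dist_mixed(x, y, p=2, gamma=0.5))  # 5.5
```

With gamma = 0 the categorical attributes are ignored entirely, while a large gamma lets them dominate the numeric part, which is why the choice of the weight matters.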

In document Wackersreuther, Bianca (2011): Efficient Knowledge Extraction from Structured Data. Dissertation, LMU München: Fakultät für Mathematik, Informatik und Statistik (Page 38-42)