Agglomerative clustering methods - Data Mining A Heuristic Approach Abbass HA (2002) pdf

Agglomerative clustering methods begin with each item in its own cluster, and then, in a bottom-up fashion, repeatedly merge the two closest groups to form a new cluster. To support this merge process, nearest-neighbor searches are conducted. Agglomerative clustering methods are often referred to as hierarchical methods for this reason.
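The bottom-up merge loop described above can be sketched as follows. This is a minimal, deliberately naive illustration rather than any particular published algorithm: the function names (`agglomerate`, `single_link`, `complete_link`) and the sample points are assumptions made for the example, and the closest pair of clusters is found by brute force rather than by nearest-neighbor search structures.

```python
from itertools import combinations
from math import dist


def single_link(a, b):
    """Group distance = distance between the closest pair of points."""
    return min(dist(p, q) for p in a for q in b)


def complete_link(a, b):
    """Group distance = distance between the farthest pair of points."""
    return max(dist(p, q) for p in a for q in b)


def agglomerate(points, group_distance=single_link):
    """Start with one cluster per point and repeatedly merge the two
    closest clusters. Returns the merge history (left, right, distance)."""
    clusters = [[p] for p in points]
    history = []
    while len(clusters) > 1:
        # Brute-force search for the closest pair of clusters (i < j).
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: group_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        d = group_distance(clusters[i], clusters[j])
        history.append((sorted(clusters[i]), sorted(clusters[j]), d))
        merged = clusters[i] + clusters[j]
        # Delete the larger index first so the smaller one stays valid.
        del clusters[j]
        del clusters[i]
        clusters.append(merged)
    return history


# Hypothetical sample data for illustration only.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.5), (10.0, 10.0)]
history = agglomerate(points)
for left, right, d in history:
    print(left, "+", right, "at distance", round(d, 2))
```

The recorded merge history is what gives the result its hierarchical character: reading it from first merge to last traces the hierarchy from the individual items up to a single all-inclusive cluster. Passing `complete_link` instead of `single_link` changes how the distance between items is extended to a distance between groups.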

A classical example of agglomerative clustering is the iterative determination of the closest pair of points belonging to different clusters, followed by the merging of their corresponding clusters. This process results in the minimum spanning tree (MST) structure. Computing an MST can be performed very quickly. However, because the decision to merge two clusters is based only on information provided by a single pair of points, the MST generally provides clusters of poor quality.

The first agglomerative algorithm to require sub-quadratic expected time, albeit in low-dimensional settings, is DBSCAN (Ester, Kriegel, Sander, & Xu, 1996). The algorithm is regulated by two parameters, which specify the density of the clusters to be retrieved. The algorithm achieves its claimed performance in an amortized sense, by placing the points in an R*-tree and using the tree to perform u-nearest-neighbor queries, where u is typically 4. Additional effort is made to help users determine the density parameters, by presenting them with a profile of the distances between data points and their 4th-nearest neighbors. It is the responsibility of the user to find a valley in the distribution of these distances; the position of this valley determines the boundaries of the clusters. Overall, the method requires Θ(n log n) time, given n data points of fixed dimension.
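The distance profile offered to the user can be illustrated with a short sketch. It assumes a brute-force neighbor search in place of the R*-tree, so it runs in quadratic time and does not reflect DBSCAN's claimed performance; the function name `k_distance_profile` and the sample points are hypothetical. The profile is sorted here in descending order, so that a sharp drop separates sparse points from points in dense regions.

```python
from math import dist


def k_distance_profile(points, k=4):
    """Distance from each point to its k-th nearest neighbor,
    sorted in descending order."""
    if len(points) <= k:
        raise ValueError("need more than k points")
    profile = []
    for i, p in enumerate(points):
        # Brute-force neighbor search (an assumption of this sketch;
        # a spatial index would be used for large data sets).
        neighbor_distances = sorted(
            dist(p, q) for j, q in enumerate(points) if j != i
        )
        profile.append(neighbor_distances[k - 1])
    return sorted(profile, reverse=True)


# Hypothetical data: five tightly packed points and one distant outlier.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5),
          (10.0, 10.0)]
profile = k_distance_profile(points, k=4)
print([round(d, 2) for d in profile])
```

In this toy data set the outlier's 4th-nearest-neighbor distance is roughly ten times larger than that of any clustered point, which is the kind of break the user is expected to locate by eye.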

Another subfamily of clustering methods imposes a grid structure on the data (Chiu, Wong & Cheung, 1991; Schikuta, 1996; Wang et al., 1997; Zhang, Ramakrishnan, & Livny, 1996). The idea is a natural one: grid boxes containing a large number of points would indicate good candidates for clusters. The difficulty is in determining an appropriate granularity. Maximum entropy discretization (Chiu et al., 1991) allows for the automatic determination of the grid granularity, but the size of the grid generally grows quadratically in the number of data points. Later, the BIRCH method saw the introduction of a hierarchical structure for the economical storage of grid information, called a Clustering Feature Tree (CF-Tree) (Zhang et al., 1996).
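The basic grid idea, and its sensitivity to granularity, can be shown with a small sketch. It is not an implementation of any of the cited methods: the fixed square cells, the `min_points` threshold, the function name `dense_cells`, and the sample points are all assumptions made for illustration.

```python
from collections import Counter
from math import floor


def dense_cells(points, cell_size, min_points):
    """Count the points falling in each square grid cell and keep the
    cells holding at least min_points points."""
    counts = Counter(
        (floor(x / cell_size), floor(y / cell_size)) for x, y in points
    )
    return {cell: n for cell, n in counts.items() if n >= min_points}


# Hypothetical two-dimensional data for illustration only.
points = [(0.1, 0.2), (0.4, 0.9), (0.8, 0.3),
          (3.2, 3.1), (3.9, 3.7), (7.5, 0.5)]

coarse = dense_cells(points, cell_size=1.0, min_points=2)
fine = dense_cells(points, cell_size=0.5, min_points=2)
print(coarse)  # two dense cells
print(fine)    # no dense cells at this granularity
```

With unit cells, two candidate clusters appear; halving the cell size scatters the same points across separate cells and no cell reaches the threshold, which is the granularity problem noted above.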

The recent STING method (Wang et al., 1997) combines aspects of these two approaches, again in low-dimensional spatial settings. STING constructs a hierarchical data structure whose root covers the region of analysis. The structure is a variant of a quadtree (Samet, 1989). However, in STING, all leaves are at equal depth in the structure, and represent areas of equal size in the data domain. The structure is built by finding information at the leaves and propagating it to the parents according to arithmetic formulae. STING’s data structure is similar to that of a multidimensional database, and thus can be queried by OLAP users using an SQL-like language. When used for clustering, the query proceeds from the root down, using information about the distribution to eliminate branches from consideration. As only those leaves that are reached are relevant, the data points under these leaves can be agglomerated. It is claimed that once the search structure is in place, the time taken by STING to produce a clustering will be sub-linear. However, determining the depth of the structure is problematic.
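The bottom-up propagation of leaf information to parent cells can be sketched as follows. This is an illustration of the general idea only, not STING's actual data structure: the choice of statistics (count, mean, minimum, maximum of a single attribute), the names `CellStats`, `leaf_stats`, `combine`, and `build_levels`, and the sample values are assumptions. Each parent summarizes its four children arithmetically, without revisiting the data points.

```python
from dataclasses import dataclass


@dataclass
class CellStats:
    count: int
    mean: float
    low: float
    high: float


EMPTY = CellStats(0, 0.0, float("inf"), float("-inf"))


def leaf_stats(values):
    """Summarize the attribute values of the points in one leaf cell."""
    if not values:
        return EMPTY
    return CellStats(len(values), sum(values) / len(values),
                     min(values), max(values))


def combine(children):
    """Derive a parent's statistics from its children alone."""
    total = sum(c.count for c in children)
    if total == 0:
        return EMPTY
    mean = sum(c.count * c.mean for c in children) / total
    return CellStats(total, mean,
                     min(c.low for c in children),
                     max(c.high for c in children))


def build_levels(leaf_grid):
    """leaf_grid is a square grid (side a power of two) of CellStats.
    Returns the list of grids from the leaves up to the 1x1 root."""
    levels = [leaf_grid]
    grid = leaf_grid
    while len(grid) > 1:
        half = len(grid) // 2
        grid = [
            [combine([grid[2 * r][2 * c], grid[2 * r][2 * c + 1],
                      grid[2 * r + 1][2 * c], grid[2 * r + 1][2 * c + 1]])
             for c in range(half)]
            for r in range(half)
        ]
        levels.append(grid)
    return levels


# Hypothetical attribute values in a 2x2 grid of leaf cells.
raw = [[[1.0, 3.0], []],
       [[10.0], [2.0, 4.0, 6.0]]]
leaves = [[leaf_stats(cell) for cell in row] for row in raw]
levels = build_levels(leaves)
root = levels[-1][0][0]
print(root)
```

Because all leaves sit at the same depth, the number of levels is fixed by the leaf grid's side length, which is one way to see why choosing the depth of the structure matters.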

STING is a statistical parametric method, and as such can only be used in limited applications. It assumes the data follow a mixture model and works best with knowledge of the distributions involved. However, under these conditions, non-agglomerative methods such as EM (Dempster, Laird & Rubin, 1977), AutoClass (Cheeseman et al., 1988), MML (Wallace & Freeman, 1987) and Gibbs sampling are perhaps more effective.

For clustering two-dimensional points, O(n log n) time is possible (Krznaric & Levcopoulos, 1998), based on a data structure called a dendrogram or proximity tree, which can be regarded as capturing the history of a merge process based on nearest-neighbor information. Unfortunately, such hierarchical approaches have generally been disregarded for knowledge discovery in spatial databases, since it is often unclear how to use the proximity tree to obtain associations (Ester et al., 1996).

While variants emerge from the different ways in which the distance between items is extended to a distance between groups, the agglomerative approach as a whole has three fundamental drawbacks. First, agglomeration does not provide clusters naturally; some other criterion must be introduced in order to halt the merge process and to interpret the results. Second, for large data sets, the shapes of clusters formed via agglomeration may be very irregular, so much so that they defy any attempts to derive characterizations of their member data points. Third, and perhaps the most serious for data mining applications, hierarchical methods usually require quadratic time when applied in general dimensions. This is essentially because agglomerative algorithms must repeatedly extract the smallest distance from a dynamic set that originally has a quadratic number of values.
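The first and third drawbacks can be made concrete with a sketch of closest-pair merging driven by a priority queue. It is an illustrative assumption, not a method from the cited works: the heap initially holds all n(n-1)/2 pairwise distances, the smallest is extracted repeatedly, and an externally supplied `num_clusters` acts as the halting criterion. The function name `single_linkage`, the union-find bookkeeping, and the sample points are hypothetical.

```python
import heapq
from itertools import combinations
from math import dist


def single_linkage(points, num_clusters):
    """Merge the clusters of the closest remaining pair of points until
    only num_clusters clusters are left."""
    n = len(points)
    # The quadratic set of candidate distances: one entry per pair.
    heap = [(dist(points[i], points[j]), i, j)
            for i, j in combinations(range(n), 2)]
    heapq.heapify(heap)

    parent = list(range(n))  # union-find forest over point indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    remaining = n
    while heap and remaining > num_clusters:
        _, i, j = heapq.heappop(heap)  # smallest remaining distance
        root_i, root_j = find(i), find(j)
        if root_i != root_j:  # pair spans two different clusters
            parent[root_j] = root_i
            remaining -= 1

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(points[i])
    return sorted(sorted(group) for group in groups.values())


# Hypothetical sample data for illustration only.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.5), (10.0, 10.0)]
clusters = single_linkage(points, num_clusters=3)
print(clusters)
```

Without the `num_clusters` argument the loop would simply run until everything is merged into one cluster, and building the heap alone already costs time and space quadratic in the number of points.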

In document Data Mining A Heuristic Approach Abbass HA (2002) pdf (Page 34-36)