6.2 Correlation Clustering Algorithms
6.2.1 PCA Based Approaches
The broad majority of correlation clustering approaches are based on an application of PCA to subsets of points (like the results of range queries or k-nearest neighbor queries).
As the first approach to generalized projected clustering, Aggarwal and Yu [11] proposed the algorithm ORCLUS, using ideas similar to the axis-parallel approach PROCLUS [10]. ORCLUS is a k-means-like approach: it first picks $k_c > k$ seeds and assigns the database objects to these seeds according to a distance function that is based on the eigensystem of the corresponding cluster and assesses the distance along the weak eigenvectors only (i.e., the distance in the projected subspace where the cluster objects exhibit high density).
Figure 6.4: 4C: distance between two points
The eigensystem is iteratively adapted to the current state of the updated cluster. The number $k_c$ of clusters is reduced iteratively by merging closest pairs of clusters until the user-specified number k is reached. The closest pair of clusters is the pair with the least average distance in the projected space (spanned by the weak eigenvectors) of the eigensystem of the merged clusters (cf. Figure 6.3). Starting with a higher $k_c$ increases the effectiveness, but also the runtime. The method proposed in [39] is a slight variant of ORCLUS designed for enhancing multi-dimensional indexing. Another, presumably more efficient, variant is proposed in [91].
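The projected distance and the merge criterion described above can be sketched as follows. This is a minimal illustration, not the published ORCLUS implementation: the function names and the parameter `l` (the number of weak eigenvectors to keep) are assumptions introduced here, and the average distance is measured to the centroid of the merged cluster, which is one plausible reading of the criterion.

```python
import numpy as np

def weak_eigenvectors(points, l):
    """Return the l eigenvectors (as columns) with the smallest eigenvalues
    of the covariance matrix of `points` (one point per row)."""
    cov = np.cov(points, rowvar=False)
    _, eigvecs = np.linalg.eigh(cov)  # eigh sorts eigenvalues in ascending order
    return eigvecs[:, :l]

def projected_distance(x, centroid, weak):
    """Euclidean distance of x to centroid, measured only along the weak eigenvectors."""
    return float(np.linalg.norm((x - centroid) @ weak))

def merge_cost(cluster_a, cluster_b, l):
    """Average projected distance to the centroid of the merged cluster,
    using the weak eigenvectors of the merged cluster (assumed reading)."""
    merged = np.vstack([cluster_a, cluster_b])
    centroid = merged.mean(axis=0)
    weak = weak_eigenvectors(merged, l)
    return float(np.mean(np.linalg.norm((merged - centroid) @ weak, axis=1)))

# Two clusters on the same line y = x merge cheaply; a cluster on y = -x does not.
t = np.linspace(0.0, 1.0, 50)
a = np.column_stack([t, t])
b = np.column_stack([t + 2.0, t + 2.0])
c = np.column_stack([t, -t])
print(merge_cost(a, b, l=1), merge_cost(a, c, l=1))
```

In each merging round, the pair of clusters with the smallest such cost would be merged, and the eigensystem of the merged cluster would be recomputed.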
In contrast to ORCLUS, the algorithm 4C [35] is based on a density-based clustering paradigm [47]. Thus, the number of clusters is not decided beforehand but clusters grow from a seed as long as a density criterion is fulfilled. Otherwise, another seed is picked to start a new cluster. The density criterion is a required minimal number of points within the neighborhood of a point, where the neighborhood is ascertained based on distance matrices computed from the eigensystems of two points. The eigensystem of a point $\vec{p}$ is based on the covariance matrix of the ε-neighborhood of $\vec{p}$ in Euclidean space. A parameter δ discerns large from small eigenvalues. In the eigenvalue matrix $E_{\vec{p}}$, large eigenvalues are then replaced by 1 and small eigenvalues by a value $\kappa \gg 1$. Using the adapted eigenvalue matrix $E'_{\vec{p}}$, a correlation similarity matrix for $\vec{p}$ is obtained by $V_{\vec{p}} \cdot E'_{\vec{p}} \cdot V_{\vec{p}}^{T}$. This matrix is then used to derive the distance of two points, $\vec{q}$ and $\vec{p}$, w.r.t. $\vec{p}$, as the general quadratic form distance:
$$\sqrt{(\vec{p}-\vec{q})^{T} \cdot V_{\vec{p}} \cdot E'_{\vec{p}} \cdot V_{\vec{p}}^{T} \cdot (\vec{p}-\vec{q})}. \tag{6.4}$$
Applying this measure symmetrically to $\vec{q}$ and choosing the maximum of both distances helps to decide whether both points are connected by a similar
correlation of attributes and, thus, are similar and belong to each other's correlation neighborhood. Figure 6.4 illustrates this idea. The ellipsoids represent the correlation neighborhoods of some sample objects. In the left example of Figure 6.4, p and q are not connected because q does not find p in its correlation neighborhood. On the right-hand side, the points p and q are connected because they find one another in their correlation neighborhood.
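The quadratic form distance of Equation (6.4) and its symmetric use can be sketched in a few lines. This is an illustrative reading of the description above rather than the original 4C code; the function names are assumptions, and the example values for δ and κ are chosen arbitrarily.

```python
import numpy as np

def eps_neighborhood(data, p, eps):
    """All rows of `data` within Euclidean distance eps of p."""
    return data[np.linalg.norm(data - p, axis=1) <= eps]

def correlation_similarity_matrix(neighbors, delta, kappa):
    """V * E' * V^T, where E' holds 1 for large eigenvalues (> delta)
    and kappa for small ones."""
    cov = np.cov(neighbors, rowvar=False)
    eigvals, V = np.linalg.eigh(cov)
    adapted = np.where(eigvals > delta, 1.0, kappa)
    return V @ np.diag(adapted) @ V.T

def correlation_distance(p, q, M_p):
    """Distance of q from p w.r.t. p, as in Equation (6.4)."""
    diff = p - q
    return float(np.sqrt(diff @ M_p @ diff))

def symmetric_correlation_distance(p, q, M_p, M_q):
    """Maximum of the two one-sided distances."""
    return max(correlation_distance(p, q, M_p), correlation_distance(q, p, M_q))

# p lies on a line y = x, q on a line y = -x (toy neighborhoods, assumed for illustration).
t = np.linspace(-1.0, 1.0, 21)
M_p = correlation_similarity_matrix(np.column_stack([t, t]), delta=0.01, kappa=50.0)
M_q = correlation_similarity_matrix(np.column_stack([t, -t]), delta=0.01, kappa=50.0)
p, q = np.zeros(2), np.array([1.0, 1.0])
print(correlation_distance(p, q, M_p))                  # q lies along p's strong direction
print(symmetric_correlation_distance(p, q, M_p, M_q))   # but p lies along q's weak direction
```

The toy example mirrors the left-hand case of Figure 6.4: the distance is small from the perspective of p but large from the perspective of q, so taking the maximum keeps the two points apart.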
As a hierarchical approach, HiCO [7] defines the distance between points according to their local correlation dimensionality and subspace orientation and uses hierarchical density-based clustering [16] to derive a hierarchy of correlation clusters.
COPAC [6] is based on ideas similar to those of 4C but disposes of some of its problems, such as meaningless similarity matrices due to sparse ε-neighborhoods, by taking a fixed number k of neighbors instead. This raises the question of how to choose a good value for k, but at least choosing k > λ ensures a meaningful definition of a λ-dimensional hyperplane. The main point of COPAC, however, is a considerable speed-up achieved by partitioning the data set, based on the observation that a correlation cluster should consist of points exhibiting the same local correlation dimensionality (i.e., the same number of strong eigenvectors in the covariance matrix of the k nearest neighbors). Thus, the search for clusters involves only the points with equal local correlation dimensionality. By creating one partition for each occurring correlation dimensionality, the time complexity decreases considerably on average, as a squared factor $d^2$ is eliminated for a d-dimensional data set.
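The partitioning step can be illustrated with a short sketch. The text above does not state how strong eigenvectors are separated from weak ones, so the explained-variance threshold `alpha` used here is an assumption made for illustration, as are the function names; the brute-force neighbor search is likewise only a placeholder.

```python
import numpy as np
from collections import defaultdict

def local_correlation_dimensionality(data, i, k, alpha=0.85):
    """Number of strong eigenvectors of the covariance matrix of the k nearest
    neighbors of data[i] (the point itself is counted among its neighbors).
    Assumption: the strong eigenvectors are the fewest that together explain
    at least a fraction alpha of the total variance."""
    dists = np.linalg.norm(data - data[i], axis=1)
    knn = data[np.argsort(dists)[:k]]
    eigvals = np.linalg.eigvalsh(np.cov(knn, rowvar=False))[::-1]  # descending
    eigvals = np.clip(eigvals, 0.0, None)  # remove tiny negative round-off
    total = eigvals.sum()
    if total == 0.0:
        return 0  # all neighbors coincide
    explained = np.cumsum(eigvals) / total
    return int(np.searchsorted(explained, alpha) + 1)

def partition_by_dimensionality(data, k, alpha=0.85):
    """Map each occurring local correlation dimensionality to the indices
    of the points exhibiting it."""
    partitions = defaultdict(list)
    for i in range(len(data)):
        partitions[local_correlation_dimensionality(data, i, k, alpha)].append(i)
    return dict(partitions)

# Toy data: 36 points on a line and 36 points on a plane, far apart in 3D.
t = np.linspace(0.0, 1.0, 36)
line = np.column_stack([t, t, t])
g = np.linspace(10.0, 11.0, 6)
gx, gy = np.meshgrid(g, g)
plane = np.column_stack([gx.ravel(), gy.ravel(), np.full(36, 5.0)])
parts = partition_by_dimensionality(np.vstack([line, plane]), k=36)
print({dim: len(idx) for dim, idx in parts.items()})
```

A clustering step would then run separately inside each partition, so that points of differing local correlation dimensionality are never compared with one another.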
Another related algorithm is ERiC [5], which also derives a local eigensystem for a point based on the k nearest neighbors in Euclidean space. Here, the neighborhood criterion for two points in a DBSCAN-like procedure is an approximate linear dependency and the affine distance of the correlation hyperplanes as defined by the strong eigenvectors of each point. As in COPAC, the property of clusters to consist of points exhibiting an equal local correlation dimensionality is exploited for the sake of efficiency. Furthermore, the resulting set of clusters is also ordered hierarchically to provide the user with a hierarchy of subspace clusters. In finding and correctly assigning complex
patterns of intersecting clusters, COPAC and ERiC improve considerably over ORCLUS and 4C.
Another PCA-based approach, CURLER [136], is said to find even non-linear correlation clusters. It seems, however, not to be restricted to correlations of attributes: according to its restrictions, it finds any narrow trajectory, and it does not provide a model describing its findings.
PCA is a mature technique and allows the construction of a broad range of similarity measures that grasp local correlation of attributes and, therefore, the detection of arbitrarily oriented subspace clusters. A major intrinsic drawback common to all mentioned approaches is the notorious locality assumption. This assumption is widely accepted. But note that this innocent-looking little (and often tacit) assumption boldly contradicts the basic problem statement: to find clusters in high dimensional space that is doomed by the curse of dimensionality. To address problems occurring due to varying density in the local neighborhood, a framework for selecting a suitable neighborhood range and for stabilizing the PCA by weighting the points has been proposed in [90]. This framework allows the integration of all existing PCA-based correlation clustering approaches and shows considerable enhancements in effectiveness. However, this is not the ultimate solution for problems of high dimensional data spaces. As we will discuss in more detail in Chapter 7, the curse of dimensionality condemns all distances to look alike and, thus, renders nearest neighbor queries rather meaningless in high dimensional data. Thus, successfully employing PCA for correlation clustering in really high dimensional data spaces may require even more effort.
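The effect of distances "looking alike" can be made tangible with a small numerical experiment. The sketch below is an illustration only and not part of any of the cited methods: it draws uniformly distributed points, which is an assumption made for simplicity, and reports the relative contrast between the farthest and the nearest neighbor of a query point. The function name and its parameters are hypothetical.

```python
import numpy as np

def relative_contrast(n_points, dim, seed=0):
    """(d_max - d_min) / d_min for the Euclidean distances from one random
    query point to n_points uniformly distributed points in [0, 1]^dim."""
    rng = np.random.default_rng(seed)
    data = rng.random((n_points, dim))
    query = rng.random(dim)
    dists = np.linalg.norm(data - query, axis=1)
    return float((dists.max() - dists.min()) / dists.min())

for dim in (2, 10, 100, 1000):
    print(dim, relative_contrast(500, dim))
```

Under these assumptions, the contrast shrinks markedly as the dimensionality grows, which is why a neighborhood determined in the full-dimensional space becomes an increasingly unreliable basis for a local PCA.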
The algorithms 4C, COPAC, HiCO, and ERiC as well as the framework for stabilizing PCA-based approaches are contributions of the author and will be discussed in more detail in Part III.