9.2 Foundations of Connected Correlation Clustering
9.2.2 Clusters as Correlation Connected Sets
A correlation connected cluster can be regarded as a maximal set of density-connected points that exhibit uniform correlation. We can formalize the concept of correlation connected sets by merging the concepts of density connected sets (cf. Definition 2.7) and correlation sets (cf. Definition 9.1). The intuition of our formalization is to consider those points as core points of a cluster which have an appropriate correlation dimension in their neighborhood. Therefore, we associate each point p with a similarity matrix $M_p$ which is determined by PCA of the points in the ε-neighborhood of p. For convenience, we call $V_p$ and $E_p$ the eigenvectors and eigenvalues of p, respectively. A point p is inserted into a cluster if it has the same or a similar similarity matrix as the points in the cluster. To achieve this goal, our algorithm looks for points that are close to the principal axis (or axes) of those points which are already in the cluster. We will define a similarity measure $\hat{M}_p$ for the efficient search of such points.
We start with the formal definition of the covariance matrix $M_p$ associated with a point p.
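Definition 9.3 (covariance matrix) Let $p \in \mathcal{D}$. The matrix $M_p = [m_{ij}]$ with
$$m_{ij} = \sum_{q \in \mathcal{N}_\varepsilon(p)} \pi_{\{a_i\}}(q) \cdot \pi_{\{a_j\}}(q) \qquad (1 \le i, j \le d)$$
is called the covariance matrix of the point p. $V_p$ and $E_p$ (with $M_p = V_p E_p V_p^T$), as determined by PCA of $\mathcal{N}_\varepsilon(p)$, are called the eigenvectors and eigenvalues of the point p, respectively.

Figure 9.3: Correlation ε-neighborhood of a point p according to (a) $M_p$ and (b) $\hat{M}_p$.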
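As an illustration of Definition 9.3, the following minimal NumPy sketch builds $M_p$ from the ε-neighborhood of a point and decomposes it. The function names (`eps_neighborhood`, `covariance_matrix`, `eigen_decomposition`) and the toy data are assumptions made for this example only. The sketch follows the formula exactly as printed, i.e. it sums products of attribute values without subtracting a mean, and it assumes the Euclidean distance for the ε-neighborhood.

```python
import numpy as np

def eps_neighborhood(data, p, eps):
    """Rows of `data` within Euclidean distance eps of p (assumed form of N_eps(p))."""
    return data[np.linalg.norm(data - p, axis=1) <= eps]

def covariance_matrix(neighbors):
    """M_p = [m_ij] with m_ij = sum over q in N_eps(p) of q_i * q_j (Definition 9.3)."""
    return neighbors.T @ neighbors

def eigen_decomposition(m_p):
    """Return (V_p, e) with M_p = V_p diag(e) V_p^T, eigenvalues in descending order."""
    e, v = np.linalg.eigh(m_p)       # eigh: for symmetric matrices, ascending order
    order = np.argsort(e)[::-1]      # reorder so the major axis comes first
    return v[:, order], e[order]

# Hypothetical toy data: points scattered around the line y = x
data = np.array([[0.0, 0.0], [1.0, 1.1], [2.0, 1.9], [3.0, 3.0], [-1.0, -1.0]])
p = data[0]
nbrs = eps_neighborhood(data, p, eps=3.0)
M_p = covariance_matrix(nbrs)
V_p, e_p = eigen_decomposition(M_p)
```

The first column of `V_p` is the direction of highest variance in the neighborhood, which is the axis along which the method later searches for further cluster members.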
We can now define the new similarity measure which searches points in the direction of the highest variance of $M_p$ (the major axes). Theoretically, $M_p$ could be directly used as a similarity measure, i.e. $dist_{M_p}(p, q) = \sqrt{(p-q) \cdot M_p \cdot (p-q)^T}$ where $p, q \in \mathcal{D}$.
Figure 9.3(a) shows the set of points which lie in the ε-neighborhood of the point p, using $M_p$ as similarity measure. The distance measure puts high weights on those axes with a high variance, whereas directions with a low variance are associated with low weights. This is usually desired in similarity search applications where directions of high variance have a high distinguishing power and, in contrast, directions of low variance are negligible.
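A small numeric illustration of this weighting, assuming a hypothetical diagonal matrix $M_p$ whose major axis is the first coordinate (the helper name `dist_m` and the values are made up for the example):

```python
import numpy as np

def dist_m(m, p, q):
    """sqrt((p - q) M (p - q)^T) for a symmetric positive semi-definite matrix M."""
    diff = np.asarray(p, dtype=float) - np.asarray(q, dtype=float)
    return float(np.sqrt(diff @ m @ diff))

# Hypothetical M_p: high variance along the first axis, low along the second
M_p = np.diag([4.0, 0.01])
origin = [0.0, 0.0]

along_major = dist_m(M_p, origin, [1.0, 0.0])   # unit step along the high-variance axis
along_minor = dist_m(M_p, origin, [0.0, 1.0])   # unit step along the low-variance axis
```

Under $M_p$, a unit step along the high-variance axis counts as a distance of 2, while a unit step along the low-variance axis counts as only 0.1. Points along the major axis therefore appear far apart, which is the opposite of what is needed for following a correlation line.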
Obviously, for our purpose of detecting correlation clusters, we need quite the opposite. We want to search for points in the direction of highest variance of the data set. Therefore, we need to assign low weights to the direction of highest variance in order to shape the ellipsoid such that it reflects the data distribution (cf. Figure 9.3(b)). The solution is to change large eigenvalues
into smaller ones and vice versa. We use two fixed values, 1 and a parameter $\kappa \gg 1$, rather than, for example, inverting the eigenvalues, in order to avoid problems with singular covariance matrices. The number 1 is a natural choice because the corresponding semi-axes of the ellipsoid then have length ε. The parameter κ controls the "thickness" of the λ-dimensional correlation line or plane, i.e. the tolerated deviation.
This is formally captured in the following definition:
Definition 9.4 (correlation similarity matrix of a point)
Let $p \in \mathcal{D}$ and $V_p$, $E_p$ the corresponding eigenvectors and eigenvalues of the point p. Let $\kappa \in \mathbb{R}$ be a constant with $\kappa \gg 1$. The new eigenvalue matrix $\hat{E}_p$ with entries $\hat{e}_i$ ($i = 1, \ldots, d$) is computed from the eigenvalues $e_1, \ldots, e_d$ in $E_p$ according to the following rule:
$$\hat{e}_i = \begin{cases} 1 & \text{if } \Omega(e_i) > \delta \\ \kappa & \text{if } \Omega(e_i) \le \delta \end{cases}$$
where Ω is the normalization of the eigenvalues onto [0, 1] as described above. The matrix $\hat{M}_p = V_p \hat{E}_p V_p^T$ is called the correlation similarity matrix of point p. The correlation similarity measure associated with point p is denoted by
$$dist_p(p, q) = \sqrt{(p-q) \cdot \hat{M}_p \cdot (p-q)^T}.$$
Figure 9.3(b) shows the ε-neighborhood according to the correlation similarity matrix $\hat{M}_p$. As described above, the parameter κ specifies how much deviation from the correlation is allowed. The greater the parameter κ, the tighter and clearer the correlations which will be computed. It empirically turned out that our algorithm presented in Section 9.3.1 is rather insensitive to the choice of κ. A good suggestion is to set κ = 50 in order to achieve satisfying results; for the sake of simplicity, we therefore omit the parameter κ in the following.
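The eigenvalue replacement of Definition 9.4 can be sketched as follows. This is a minimal sketch under stated assumptions: the normalization Ω is defined outside this section, so the code assumes division by the largest eigenvalue, and the names `correlation_similarity_matrix` and `dist_p` as well as the example matrix are hypothetical.

```python
import numpy as np

def correlation_similarity_matrix(m_p, delta, kappa=50.0):
    """Build M^_p = V_p E^_p V_p^T, swapping large eigenvalues for 1 and small ones for kappa."""
    e, v = np.linalg.eigh(m_p)
    e = np.clip(e, 0.0, None)                  # guard against tiny negative round-off
    e_max = e.max()
    # ASSUMPTION: Omega normalizes by the largest eigenvalue (not specified in this section)
    omega = e / e_max if e_max > 0 else np.zeros_like(e)
    e_hat = np.where(omega > delta, 1.0, kappa)
    return v @ np.diag(e_hat) @ v.T

def dist_p(m_hat_p, p, q):
    """Correlation similarity measure of Definition 9.4 for a given M^_p."""
    diff = np.asarray(p, dtype=float) - np.asarray(q, dtype=float)
    return float(np.sqrt(max(diff @ m_hat_p @ diff, 0.0)))

# Hypothetical M_p with its major axis along the first coordinate
M_p = np.diag([4.0, 0.01])
M_hat = correlation_similarity_matrix(M_p, delta=0.1, kappa=50.0)

d_major = dist_p(M_hat, [0.0, 0.0], [1.0, 0.0])   # step along the correlation line
d_minor = dist_p(M_hat, [0.0, 0.0], [0.0, 1.0])   # step away from the correlation line
```

In this example the weights are reversed compared with $M_p$: a unit step along the major axis now has distance 1, whereas a unit step orthogonal to it has distance $\sqrt{50} \approx 7.07$. The resulting ε-ellipsoid is long (semi-axis ε) along the correlation line and thin (semi-axis $\varepsilon/\sqrt{\kappa}$) across it.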
Using this similarity measure, we can define the notions of correlation core points and correlation reachability. However, in order to define correlation connectivity as a symmetric relation, we face the problem that the similarity measure in Definition 9.4 is not symmetric, because $dist_p(p, q) = dist_q(q, p)$ does in general not hold (cf. Figure 9.4(b)). Symmetry, however, is important to avoid ambiguity of the clustering result. If an asymmetric similarity measure is used in DBSCAN, a different clustering result can be obtained depending on the order of processing (e.g. which point is selected as the starting point), because the symmetry of density connectivity depends on the symmetry of direct density reachability for core points. Although the result is typically not seriously affected by this ambiguity, we can easily avoid the problem by extending our similarity measure so that it becomes symmetric. The trick is to consider both similarity measures $dist_p(p, q)$ and $dist_q(q, p)$ and to combine them by a suitable arithmetic operation such as the maximum of the two.

Figure 9.4: Symmetry of the correlation ε-neighborhood: (a) $p \in \mathcal{N}^{\hat{M}_q}_\varepsilon(q)$. (b) $p \notin \mathcal{N}^{\hat{M}_q}_\varepsilon(q)$.
Definition 9.5 (general correlation distance)
The general correlation distance between two points $p, q \in \mathcal{D}$, denoted by $dist_{corr}$, is defined as the maximum of the correlation similarity measure between p and q according to p and according to q, formally:
$$dist_{corr}(p, q) = \max\{dist_p(p, q),\ dist_q(q, p)\}.$$
Lemma 9.1 The general correlation distance as defined in Definition 9.5 is symmetric.
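The lemma is stated without proof. A short derivation that uses only Definition 9.5 and the commutativity of the maximum might read as follows:

```latex
\begin{align*}
dist_{corr}(q, p)
  &= \max\{dist_q(q, p),\ dist_p(p, q)\}
     && \text{(Definition 9.5 with the roles of $p$ and $q$ exchanged)} \\
  &= \max\{dist_p(p, q),\ dist_q(q, p)\}
     && \text{(commutativity of $\max$)} \\
  &= dist_{corr}(p, q).
\end{align*}
```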
Based on this new symmetric similarity measure $dist_{corr}$, we define the correlation ε-neighborhood as a symmetric concept.
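Definition 9.6 (correlation ε-neighborhood)
Let $\varepsilon \in \mathbb{R}$. The correlation ε-neighborhood of a point $o \in \mathcal{D}$, denoted by $\mathcal{N}^{\hat{M}_o}_\varepsilon(o)$, is defined by:
$$\mathcal{N}^{\hat{M}_o}_\varepsilon(o) = \{x \in \mathcal{D} \mid dist_{corr}(o, x) \le \varepsilon\}.$$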
The symmetry of the correlation ε-neighborhood is illustrated in Figure 9.4. A point p is only contained in $\mathcal{N}^{\hat{M}_q}_\varepsilon(q)$ if q is also contained in $\mathcal{N}^{\hat{M}_p}_\varepsilon(p)$.
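The symmetric neighborhood of Definitions 9.5 and 9.6 can be sketched as below. The helper names (`similarity_matrices`, `dist_corr`, `corr_neighborhood`), the toy data, and the form of Ω (division by the largest eigenvalue) are assumptions for illustration. The ε-neighborhood used for the local PCA is assumed to be Euclidean.

```python
import numpy as np

def similarity_matrices(data, eps, delta, kappa=50.0):
    """One correlation similarity matrix M^_p per point (Definitions 9.3 and 9.4)."""
    mats = []
    for p in data:
        nbrs = data[np.linalg.norm(data - p, axis=1) <= eps]
        e, v = np.linalg.eigh(nbrs.T @ nbrs)        # M_p as printed in Definition 9.3
        e = np.clip(e, 0.0, None)
        # ASSUMPTION: Omega divides by the largest eigenvalue
        omega = e / e.max() if e.max() > 0 else np.zeros_like(e)
        mats.append(v @ np.diag(np.where(omega > delta, 1.0, kappa)) @ v.T)
    return mats

def dist_corr(data, mats, i, j):
    """max{dist_p(p, q), dist_q(q, p)} for p = data[i], q = data[j] (Definition 9.5)."""
    diff = data[i] - data[j]
    d_i = np.sqrt(max(diff @ mats[i] @ diff, 0.0))
    d_j = np.sqrt(max(diff @ mats[j] @ diff, 0.0))
    return max(d_i, d_j)

def corr_neighborhood(data, mats, i, eps):
    """Indices of all points x with dist_corr(data[i], x) <= eps (Definition 9.6)."""
    return {j for j in range(len(data)) if dist_corr(data, mats, i, j) <= eps}

# Hypothetical toy data: four points on a line plus one point off the line
data = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [2.0, 0.5]])
mats = similarity_matrices(data, eps=2.0, delta=0.1)
hoods = [corr_neighborhood(data, mats, i, eps=2.0) for i in range(len(data))]
```

Because `dist_corr` takes the maximum of both directed measures, membership is mutual by construction: index `j` is in `hoods[i]` exactly when `i` is in `hoods[j]`.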
Correlation core points can now be defined as follows.
Definition 9.7 (correlation core point)
Let $\varepsilon, \delta \in \mathbb{R}$ and $\mathit{MinPts}, \lambda \in \mathbb{N}$. A point $o \in \mathcal{D}$ is called correlation core point w.r.t. ε, MinPts, δ, and λ (denoted by $Core^{cor}_{den}(o)$) if its ε-neighborhood is a λ-dimensional linear correlation set and its correlation ε-neighborhood contains at least MinPts points, formally:
$$Core^{cor}_{den}(o) \Leftrightarrow CorSet^{\lambda}_{\delta}(\mathcal{N}_\varepsilon(o)) \wedge |\mathcal{N}^{\hat{M}_o}_\varepsilon(o)| \ge \mathit{MinPts}.$$
Let us note that in $Core^{cor}_{den}$ the acronym cor refers to the correlation parameters δ and λ, while den refers to the density parameters ε and MinPts. In the following, we omit the parameters ε, MinPts, δ, and λ wherever the context is clear and use den and cor instead.
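The core point condition can be written as a small predicate. Note that $CorSet^{\lambda}_{\delta}$ is defined in Definition 9.1, which lies outside this section. The sketch therefore assumes that a set is a λ-dimensional linear correlation set if at most λ of its normalized eigenvalues exceed δ, and that Ω divides by the largest eigenvalue. Both function names are hypothetical.

```python
import numpy as np

def correlation_dimension(eigenvalues, delta):
    """ASSUMED CorDim: number of eigenvalues whose normalized value exceeds delta."""
    e = np.clip(np.asarray(eigenvalues, dtype=float), 0.0, None)
    omega = e / e.max() if e.max() > 0 else np.zeros_like(e)   # assumed form of Omega
    return int(np.sum(omega > delta))

def is_correlation_core(eigenvalues, corr_neighborhood_size, min_pts, delta, lam):
    """Core^cor_den(o): correlation set condition AND enough correlation neighbors."""
    return (correlation_dimension(eigenvalues, delta) <= lam
            and corr_neighborhood_size >= min_pts)

# Eigenvalues of a neighborhood that is stretched along one axis (a line in 2D)
line_like = [4.0, 0.01]
# Eigenvalues of a neighborhood without a dominant direction
blob_like = [4.0, 3.0]
```

With δ = 0.1 and λ = 1, the line-like neighborhood qualifies as soon as the correlation ε-neighborhood holds at least MinPts points, whereas the blob-like one fails the correlation condition regardless of its size.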
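Definition 9.8 (direct correlation reachable)
Let $\varepsilon, \delta \in \mathbb{R}$ and $\mathit{MinPts}, \lambda \in \mathbb{N}$. A point $p \in \mathcal{D}$ is direct correlation reachable from a point $q \in \mathcal{D}$ w.r.t. ε, MinPts, δ, and λ (denoted by $DirReach^{cor}_{den}(q, p)$) if q is a correlation core point, the correlation dimension of $\mathcal{N}_\varepsilon(p)$ is at most λ, and $p \in \mathcal{N}^{\hat{M}_q}_\varepsilon(q)$, formally: $DirReach^{cor}_{den}(q, p) \Leftrightarrow$
(1) $Core^{cor}_{den}(q)$
(2) $CorDim(\mathcal{N}_\varepsilon(p)) \le \lambda$
(3) $p \in \mathcal{N}^{\hat{M}_q}_\varepsilon(q)$.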
Direct correlation reachability is symmetric only for pairs of correlation core points. Both points p and q must find the other point in their corresponding correlation ε-neighborhood.
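Definition 9.9 (correlation reachable)
Let $\varepsilon, \delta \in \mathbb{R}$ ($\delta \approx 0$) and $\mathit{MinPts}, \lambda \in \mathbb{N}$. A point $p \in \mathcal{D}$ is correlation reachable from a point $q \in \mathcal{D}$ w.r.t. ε, MinPts, δ, and λ (denoted by $Reach^{cor}_{den}(q, p)$) if there is a chain of points $p_1, \ldots, p_n$ such that $p_1 = q$, $p_n = p$ and $p_{i+1}$ is direct correlation reachable from $p_i$, formally:
$$Reach^{cor}_{den}(q, p) \Leftrightarrow \exists p_1, \ldots, p_n \in \mathcal{D}: p_1 = q \wedge p_n = p \wedge \forall i \in \{1, \ldots, n-1\}: DirReach^{cor}_{den}(p_i, p_{i+1}).$$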
It is easy to see that correlation reachability is the transitive closure of direct correlation reachability.
Definition 9.10 (correlation connected)
Let $\varepsilon, \delta \in \mathbb{R}$ ($\delta \approx 0$) and $\mathit{MinPts}, \lambda \in \mathbb{N}$. A point $p \in \mathcal{D}$ is correlation connected to a point $q \in \mathcal{D}$ w.r.t. ε, MinPts, δ, and λ (denoted by $Connect^{cor}_{den}(q, p)$) if there is a point $o \in \mathcal{D}$ such that both p and q are correlation reachable from o, formally:
$$Connect^{cor}_{den}(q, p) \Leftrightarrow \exists o \in \mathcal{D}: Reach^{cor}_{den}(o, q) \wedge Reach^{cor}_{den}(o, p).$$
Correlation connectivity is a symmetric relation. A correlation connected cluster can now be defined as a maximal correlation connected set.
Definition 9.11 (correlation connected cluster)
Let $\varepsilon, \delta \in \mathbb{R}$ and $\mathit{MinPts}, \lambda \in \mathbb{N}$. A non-empty subset $C \subseteq \mathcal{D}$ is called a correlation connected cluster w.r.t. ε, MinPts, δ, and λ if all points in C are correlation connected and C is maximal w.r.t. correlation reachability, formally:
$Cluster^{cor}_{den}(C) \Leftrightarrow$
(1) Connectivity: $\forall o, q \in C: Connect^{cor}_{den}(o, q)$
(2) Maximality: $\forall p, q \in \mathcal{D}: q \in C \wedge Reach^{cor}_{den}(q, p) \Rightarrow p \in C$.
The following two lemmata are important for validating the correctness of our clustering algorithm. Intuitively, they state that we can discover a correlation connected set for a given parameter setting in a two-step approach, analogous to DBSCAN. First, choose an arbitrary correlation core point o from the database. Second, retrieve all points that are correlation reachable from o. This approach yields the correlation connected cluster containing o.

Lemma 9.2
Let $p \in \mathcal{D}$. If p is a correlation core point, then the set of points which are correlation reachable from p is a correlation connected cluster, formally:
$$Core^{cor}_{den}(p) \wedge C = \{o \in \mathcal{D} \mid Reach^{cor}_{den}(p, o)\} \Rightarrow Cluster^{cor}_{den}(C).$$
Proof.
(1) $C \ne \emptyset$:
By assumption, $Core^{cor}_{den}(p)$ and thus $CorDim(\mathcal{N}_\varepsilon(p)) \le \lambda$.
$\Rightarrow DirReach^{cor}_{den}(p, p)$
$\Rightarrow Reach^{cor}_{den}(p, p)$
$\Rightarrow p \in C$.
(2) Maximality:
Let $x \in C$ and $y \in \mathcal{D}$ and $Reach^{cor}_{den}(x, y)$.
$\Rightarrow Reach^{cor}_{den}(p, x) \wedge Reach^{cor}_{den}(x, y)$
$\Rightarrow Reach^{cor}_{den}(p, y)$ (since correlation reachability is a transitive relation).
$\Rightarrow y \in C$.
(3) Connectivity:
$\forall x, y \in C: Reach^{cor}_{den}(p, x) \wedge Reach^{cor}_{den}(p, y)$
$\Rightarrow Connect^{cor}_{den}(x, y)$ (by Definition 9.10, with o = p). □
Lemma 9.3
Let $C \subseteq \mathcal{D}$ be a correlation connected cluster. Let $p \in C$ be a correlation core point. Then C equals the set of points which are correlation reachable from p, formally:
$$Cluster^{cor}_{den}(C) \wedge p \in C \wedge Core^{cor}_{den}(p) \Rightarrow C = \{o \in \mathcal{D} \mid Reach^{cor}_{den}(p, o)\}.$$
Proof.
Let $\bar{C} = \{o \in \mathcal{D} \mid Reach^{cor}_{den}(p, o)\}$. We have to show that $\bar{C} = C$:
(1) $\bar{C} \subseteq C$: obvious from definition of $\bar{C}$.
(2) $C \subseteq \bar{C}$: Let $q \in C$. By assumption, $p \in C$ and $Cluster^{cor}_{den}(C)$.
$\Rightarrow \exists o \in C: Reach^{cor}_{den}(o, p) \wedge Reach^{cor}_{den}(o, q)$
$\Rightarrow Reach^{cor}_{den}(p, o)$ (since both o and p are correlation core points, and correlation reachability is symmetric for correlation core points)
$\Rightarrow Reach^{cor}_{den}(p, q)$ (transitivity of correlation reachability)
$\Rightarrow q \in \bar{C}$. □
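The two-step procedure justified by Lemmata 9.2 and 9.3 (pick a correlation core point, then collect everything correlation reachable from it) can be sketched as a breadth-first expansion. This is a minimal illustration, not the algorithm of Section 9.3.1. The names `local_pca` and `expand_cluster` are hypothetical, Ω is assumed to divide by the largest eigenvalue, and the correlation set condition is assumed to mean "at most λ normalized eigenvalues exceed δ". The toy line passes through the origin because the covariance formula of Definition 9.3 is used as printed, without mean-centering.

```python
import numpy as np
from collections import deque

def local_pca(data, i, eps):
    """Normalized eigenvalues and eigenvectors of M_p for the point data[i]."""
    nbrs = data[np.linalg.norm(data - data[i], axis=1) <= eps]
    e, v = np.linalg.eigh(nbrs.T @ nbrs)             # M_p as in Definition 9.3
    e = np.clip(e, 0.0, None)
    omega = e / e.max() if e.max() > 0 else np.zeros_like(e)   # ASSUMED Omega
    return omega, v

def expand_cluster(data, start, eps, min_pts, delta, lam, kappa=50.0):
    """Indices of all points correlation reachable from `start`; empty set if not core."""
    n = len(data)
    pca = [local_pca(data, i, eps) for i in range(n)]
    cor_dim = [int(np.sum(om > delta)) for om, _ in pca]        # ASSUMED CorDim
    m_hat = [v @ np.diag(np.where(om > delta, 1.0, kappa)) @ v.T for om, v in pca]

    def dist_corr(i, j):                              # Definition 9.5
        diff = data[i] - data[j]
        return max(np.sqrt(max(diff @ m_hat[i] @ diff, 0.0)),
                   np.sqrt(max(diff @ m_hat[j] @ diff, 0.0)))

    def corr_nbrs(i):                                 # Definition 9.6
        return [j for j in range(n) if dist_corr(i, j) <= eps]

    def is_core(i):                                   # Definition 9.7
        return cor_dim[i] <= lam and len(corr_nbrs(i)) >= min_pts

    if not is_core(start):
        return set()
    cluster, queue = {start}, deque([start])
    while queue:
        q = queue.popleft()
        if not is_core(q):
            continue                                  # non-core members do not expand
        for p in corr_nbrs(q):                        # Definition 9.8, conditions (2), (3)
            if p not in cluster and cor_dim[p] <= lam:
                cluster.add(p)
                queue.append(p)
    return cluster

# Hypothetical toy data: eight points on the line y = x and one distant outlier
line = np.array([[float(t), float(t)] for t in range(8)])
data = np.vstack([line, [[10.0, -10.0]]])
params = dict(eps=2.0, min_pts=2, delta=0.1, lam=1)

from_first = expand_cluster(data, 0, **params)
from_other = expand_cluster(data, 5, **params)
from_outlier = expand_cluster(data, 8, **params)
```

On this data, starting from index 0 and starting from index 5 yield the same set of eight line points, in line with Lemma 9.3, while the outlier is not a correlation core point and therefore yields no cluster.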