6.3 The Notion of Correlation-Connected Clusters
6.3.2 Clusters as Correlation-Connected Sets
A correlation-connected cluster can be regarded as a maximal set of density-connected points that exhibit uniform correlation. We can formalize the concept of correlation-connected sets by merging two concepts: density-connected clusters (cf. Definition 2.7) and correlation sets (cf. Definition 6.1). The intuition of our formalization is to consider as core points of a cluster those points that have an appropriate correlation dimension in their neighborhood. Therefore, we associate each point $P$ with a similarity matrix $M_P$, which is determined by PCA of the points in the $\varepsilon$-neighborhood of $P$.
For convenience, we call $V_P$ and $E_P$ the eigenvectors and eigenvalues of $P$, respectively. A point $P$ is inserted into a cluster if its similarity matrix is the same as, or similar to, those of the points in the cluster. To achieve this goal, our algorithm looks for points that are close to the principal axis (or axes) of those points which are already in the cluster. We will define a similarity measure $\hat{M}_P$ for the efficient search of such points.
We start with the formal definition of the covariance matrix $M_P$ associated with a point $P$.
Definition 6.3 (covariance matrix)
Let $P \in DB$. The matrix $M_P = [m_{ij}]$ with
$$m_{ij} = \sum_{S \in \mathcal{N}_\varepsilon(P)} s_i s_j \qquad (1 \le i, j \le d)$$
is called the covariance matrix of the point $P$. $V_P$ and $E_P$ (with $M_P = V_P E_P V_P^T$) as determined by PCA of $M_P$ are called the eigenvectors and eigenvalues of the point $P$, respectively.
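As a rough illustration of Definition 6.3, the following sketch computes $M_P$ and its decomposition with NumPy. It takes the formula literally (a sum of products $s_i s_j$ over the neighbors, without mean-centering) and assumes a Euclidean $\varepsilon$-neighborhood; the function names `eps_neighborhood`, `point_covariance`, and `pca` are hypothetical and do not come from the text.

```python
import numpy as np

def eps_neighborhood(db, p, eps):
    """Rows of db within Euclidean distance eps of p (assumed neighborhood)."""
    dists = np.linalg.norm(db - p, axis=1)
    return db[dists <= eps]

def point_covariance(db, p, eps):
    """M_P with m_ij = sum over S in N_eps(P) of s_i * s_j (formula taken literally)."""
    nbrs = eps_neighborhood(db, p, eps)
    return nbrs.T @ nbrs

def pca(m):
    """Return (V, e) with M = V diag(e) V^T, eigenvalues in decreasing order."""
    evals, evecs = np.linalg.eigh(m)      # eigh returns ascending eigenvalues
    order = np.argsort(evals)[::-1]
    return evecs[:, order], evals[order]

# Toy data on the line y = x; the point (2, 2) lies outside the neighborhood.
db = np.array([[-1.0, -1.0], [0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
m_p = point_covariance(db, np.array([0.0, 0.0]), eps=1.5)   # [[2, 2], [2, 2]]
v_p, e_p = pca(m_p)                                         # eigenvalues 4 and 0
```

In this toy case the only non-zero eigenvalue belongs to the direction $(1, 1)/\sqrt{2}$, i.e. the line on which the neighbors lie.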
We can now define the new similarity measure $\hat{M}_P$, which searches for points in the direction of highest variance of $M_P$ (the major axes). Theoretically, $M_P$ could be used directly as a similarity measure, i.e.
$$\mathrm{dist}_{M_P}(P, Q) = \sqrt{(P - Q)\, M_P\, (P - Q)^T} \quad \text{where } P, Q \in DB.$$
Figure 6.4(a) shows the set of points that lie in an $\varepsilon$-neighborhood of $P$ using $M_P$ as similarity measure. The distance measure puts high weights on axes with a high variance, whereas directions with a low variance are associated with low weights. This is usually desired in similarity search applications where directions of high variance have a high distinguishing power and, in contrast, directions of low variance are negligible.
Figure 6.4: $\varepsilon$-neighborhood of a point $P$ according to $M_P$ (a) and $\hat{M}_P$ (b). [Figure label: major axis of the data set.]
Obviously, for our purpose of detecting correlation clusters, we need quite the opposite. We want to search for points in the direction of highest variance of the data set. Therefore, we need to assign low weights to the direction of highest variance in order to shape the ellipsoid such that it reflects the data distribution (cf. Figure 6.4(b)). The solution is to change large eigenvalues into smaller ones and vice versa. We use two fixed values, 1 and a parameter $\kappa \gg 1$, rather than, for example, inverting the eigenvalues, in order to avoid problems with singular covariance matrices. The number 1 is a natural choice because then the length of the corresponding semi-axes of the ellipsoid is $\varepsilon$. The parameter $\kappa$ controls the "thickness" of the $\lambda$-dimensional correlation line or plane, i.e. the tolerated deviation. This is formally captured in the following definition:
Definition 6.4 (correlation similarity matrix)
Let $P \in DB$ and let $V_P$, $E_P$ be the corresponding eigenvectors and eigenvalues of the point $P$. Let $\kappa \in \mathbb{R}^+$ be a constant with $\kappa \gg 1$. The new eigenvalue matrix $\hat{E}_P$ with entries $\hat{e}_i$ ($i = 1, \ldots, d$) is computed from the eigenvalues $e_1, \ldots, e_d$ in $E_P$ according to the following rule:
$$\hat{e}_i = \begin{cases} 1 & \text{if } \Omega(e_i) > \delta \\ \kappa & \text{if } \Omega(e_i) \le \delta \end{cases}$$
where $\Omega$ is the normalization of the eigenvalues onto $[0, 1]$ as described above. The matrix $\hat{M}_P = V_P \hat{E}_P V_P^T$ is called the correlation similarity matrix. The correlation similarity measure associated with point $P$ is denoted by
$$\mathrm{dist}_P(P, Q) = \sqrt{(P - Q)\, \hat{M}_P\, (P - Q)^T}.$$
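The eigenvalue replacement rule of Definition 6.4 can be sketched as follows. The normalization $\Omega$ is defined outside this passage; the sketch assumes $\Omega(e_i) = e_i / \max_j e_j$, and the names `correlation_similarity_matrix` and `corr_dist` are hypothetical.

```python
import numpy as np

def correlation_similarity_matrix(v, e, delta, kappa=50.0):
    """Build M_hat = V diag(e_hat) V^T with e_hat_i = 1 if Omega(e_i) > delta, else kappa."""
    e = np.asarray(e, dtype=float)
    e_max = e.max()
    # Assumed normalization onto [0, 1]: divide by the largest eigenvalue.
    omega = e / e_max if e_max > 0 else np.zeros_like(e)
    e_hat = np.where(omega > delta, 1.0, kappa)
    return v @ np.diag(e_hat) @ v.T

def corr_dist(m_hat, p, q):
    """dist_P(P, Q) = sqrt((P - Q) M_hat (P - Q)^T)."""
    diff = p - q
    return float(np.sqrt(max(diff @ m_hat @ diff, 0.0)))

# Eigenvectors of a point whose neighbors lie on the line y = x.
v = np.array([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2.0)
e = np.array([4.0, 0.0])
m_hat = correlation_similarity_matrix(v, e, delta=0.1, kappa=50.0)

p = np.array([0.0, 0.0])
along = corr_dist(m_hat, p, np.array([1.0, 1.0]))    # along the line: sqrt(2)
across = corr_dist(m_hat, p, np.array([1.0, -1.0]))  # across the line: 10
```

Both query points have Euclidean distance $\sqrt{2}$ from $P$, but the one off the correlation line is penalized by the factor $\sqrt{\kappa}$.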
Figure 6.5: Correlation $\varepsilon$-neighborhood. (a) $P \in \mathcal{N}_\varepsilon^{\hat{M}_Q}(Q)$. (b) $P \notin \mathcal{N}_\varepsilon^{\hat{M}_Q}(Q)$.
Figure 6.4(b) shows the $\varepsilon$-neighborhood according to the correlation similarity matrix $\hat{M}_P$. As described above, the parameter $\kappa$ specifies how much deviation from the correlation is allowed. The greater the parameter $\kappa$, the tighter and clearer the correlations that will be computed. Empirically, our algorithm presented in Section 6.4 turned out to be rather insensitive to the choice of $\kappa$. A good suggestion is to set $\kappa = 50$ in order to achieve satisfying results. Thus, for the sake of simplicity, we omit the parameter $\kappa$ in the following.
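To make the effect of $\kappa$ concrete, the shape of the neighborhood ellipsoid can be worked out from the quadratic form of the correlation similarity measure. The coordinates $y$ below are an auxiliary notation introduced only for this derivation.

```latex
\[
\bigl\{\, Q : (P - Q)\,\hat{M}_P\,(P - Q)^T \le \varepsilon^2 \,\bigr\}
\quad\text{with } \hat{M}_P = V_P \hat{E}_P V_P^T,\ y = (Q - P)\,V_P
\quad\Longrightarrow\quad
\sum_{i=1}^{d} \hat{e}_i\, y_i^2 \le \varepsilon^2 .
\]
\[
\text{length of semi-axis } i = \frac{\varepsilon}{\sqrt{\hat{e}_i}} =
\begin{cases}
\varepsilon & \text{if } \hat{e}_i = 1,\\[2pt]
\varepsilon / \sqrt{\kappa} & \text{if } \hat{e}_i = \kappa,
\end{cases}
\qquad
\kappa = 50:\ \ \varepsilon/\sqrt{50} \approx 0.141\,\varepsilon .
\]
```

Directions whose eigenvalue was replaced by 1 keep the full radius $\varepsilon$, while the remaining directions shrink by the factor $1/\sqrt{\kappa}$, which is why a larger $\kappa$ yields tighter correlations.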
Using this similarity measure, we can define the notions of correlation core points and correlation-reachability. However, in order to define correlation-connectivity as a symmetric relation, we face the problem that the similarity measure in Definition 6.4 is not symmetric, i.e. $\mathrm{dist}_P(P, Q) = \mathrm{dist}_Q(Q, P)$ does not hold in general (cf. Figure 6.5(b)). Symmetry, however, is important to avoid ambiguity of the clustering result. If an asymmetric similarity measure is used in DBSCAN, a different clustering result can be obtained depending on the order of processing, e.g. on which point is selected as the starting point. Although the result is typically not seriously affected by this ambiguity, we can easily avoid the problem by extending our similarity measure. The trick is to consider both similarity measures, $\mathrm{dist}_P(P, Q)$ as well as $\mathrm{dist}_Q(Q, P)$, and to combine them by a suitable arithmetic operation such as the maximum of the two. Based on these considerations, we define the correlation $\varepsilon$-neighborhood as a symmetric concept:
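Definition 6.5 (correlation $\varepsilon$-neighborhood)
Let $\varepsilon \in \mathbb{R}_0^+$. The correlation $\varepsilon$-neighborhood of a point $O \in DB$, denoted by $\mathcal{N}_\varepsilon^{\hat{M}_O}(O)$, is defined by:
$$\mathcal{N}_\varepsilon^{\hat{M}_O}(O) = \{\, X \in DB \mid \max\{\mathrm{dist}_O(O, X),\ \mathrm{dist}_X(X, O)\} \le \varepsilon \,\}.$$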
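The symmetric neighborhood of Definition 6.5 can be sketched as a linear scan, assuming the correlation similarity matrix of every point has already been computed. The names `corr_dist` and `correlation_eps_neighborhood` and the toy matrices `a` and `b` are hypothetical and chosen only to show the asymmetry of the underlying measure.

```python
import numpy as np

def corr_dist(m_hat, a, b):
    """sqrt((a - b) M_hat (a - b)^T) for the matrix M_hat of the first point."""
    diff = a - b
    return float(np.sqrt(max(diff @ m_hat @ diff, 0.0)))

def correlation_eps_neighborhood(db, m_hats, o, eps):
    """Indices X with max(dist_O(O, X), dist_X(X, O)) <= eps."""
    result = []
    for x in range(len(db)):
        d_o = corr_dist(m_hats[o], db[o], db[x])   # measured with O's matrix
        d_x = corr_dist(m_hats[x], db[x], db[o])   # measured with X's matrix
        if max(d_o, d_x) <= eps:
            result.append(x)
    return result

# Points 0 and 1 follow the direction (1, 1); point 2 follows (1, -1).
db = np.array([[0.0, 0.0], [1.0, 1.0], [1.0, -1.0]])
a = np.array([[25.5, -24.5], [-24.5, 25.5]])   # weight 1 along (1, 1), 50 across
b = np.array([[25.5, 24.5], [24.5, 25.5]])     # weight 1 along (1, -1), 50 across
m_hats = [a, a, b]

nbrs_of_0 = correlation_eps_neighborhood(db, m_hats, 0, eps=2.0)   # [0, 1]
nbrs_of_2 = correlation_eps_neighborhood(db, m_hats, 2, eps=2.0)   # [2]
```

Point 2 sees point 0 at distance $\sqrt{2}$ under its own matrix, but point 0 sees point 2 at distance 10, so taking the maximum excludes the pair in both directions.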
The correlation ε-neighborhood is illustrated in Figure 6.5. Correlation core points can now be defined as follows.
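Definition 6.6 (correlation core point)
Let $\varepsilon, \delta \in \mathbb{R}_0^+$ and $k, \lambda \in \mathbb{N}$. A point $O \in DB$ is called a correlation core point w.r.t. $\varepsilon$, $k$, $\delta$, and $\lambda$ (denoted by $\mathrm{Core}_{\varepsilon,k}^{\lambda,\delta}(O)$) if its $\varepsilon$-neighborhood is a $\lambda$-dimensional linear correlation set and its correlation $\varepsilon$-neighborhood contains at least $k$ points, formally:
$$\mathrm{Core}_{\varepsilon,k}^{\lambda,\delta}(O) \Leftrightarrow \mathrm{CorSet}_\delta^\lambda(\mathcal{N}_\varepsilon(O)) \,\wedge\, |\mathcal{N}_\varepsilon^{\hat{M}_O}(O)| \ge k.$$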
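Definition 6.7 (direct correlation-reachability)
Let $\varepsilon, \delta \in \mathbb{R}_0^+$ and $k, \lambda \in \mathbb{N}$. A point $P \in DB$ is direct correlation-reachable from a point $Q \in DB$ w.r.t. $\varepsilon$, $k$, $\delta$, and $\lambda$ (denoted by $\mathrm{DirReach}_{\varepsilon,k}^{\lambda,\delta}(Q, P)$) if $Q$ is a correlation core point, the correlation dimension of $\mathcal{N}_\varepsilon(P)$ is at most $\lambda$, and $P \in \mathcal{N}_\varepsilon^{\hat{M}_Q}(Q)$, formally:
$\mathrm{DirReach}_{\varepsilon,k}^{\lambda,\delta}(Q, P) \Leftrightarrow$
(1) $\mathrm{Core}_{\varepsilon,k}^{\lambda,\delta}(Q)$
(2) $\mathrm{CorDim}(\mathcal{N}_\varepsilon(P)) \le \lambda$
(3) $P \in \mathcal{N}_\varepsilon^{\hat{M}_Q}(Q)$.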
Correlation-reachability is symmetric for correlation core points. Both points $P$ and $Q$ must find the other point in their corresponding correlation $\varepsilon$-neighborhood.
Definition 6.8 (correlation-reachability)
Let $\varepsilon, \delta \in \mathbb{R}_0^+$ ($\delta \approx 0$) and $k, \lambda \in \mathbb{N}$. A point $P \in DB$ is correlation-reachable from a point $Q \in DB$ w.r.t. $\varepsilon$, $k$, $\delta$, and $\lambda$ (denoted by $\mathrm{Reach}_{\varepsilon,k}^{\lambda,\delta}(Q, P)$) if there is a chain of points $P_1, \ldots, P_n$ such that $P_1 = Q$, $P_n = P$, and $P_{i+1}$ is direct correlation-reachable from $P_i$, formally:
$$\mathrm{Reach}_{\varepsilon,k}^{\lambda,\delta}(Q, P) \Leftrightarrow \exists P_1, \ldots, P_n \in DB : P_1 = Q \,\wedge\, P_n = P \,\wedge\, \forall i \in \{1, \ldots, n-1\} : \mathrm{DirReach}_{\varepsilon,k}^{\lambda,\delta}(P_i, P_{i+1}).$$
Correlation-reachability is the transitive closure of direct correlation-reachability.
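Definition 6.9 (correlation-connectivity)
Let $\varepsilon, \delta \in \mathbb{R}_0^+$ and $k, \lambda \in \mathbb{N}$. A point $P \in DB$ is correlation-connected to a point $Q \in DB$ if there is a point $O \in DB$ such that both $P$ and $Q$ are correlation-reachable from $O$, formally:
$$\mathrm{Connect}_{\varepsilon,k}^{\lambda,\delta}(Q, P) \Leftrightarrow \exists O \in DB : \mathrm{Reach}_{\varepsilon,k}^{\lambda,\delta}(O, Q) \,\wedge\, \mathrm{Reach}_{\varepsilon,k}^{\lambda,\delta}(O, P).$$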
Correlation-connectivity is a symmetric relation. A correlation-connected cluster can now be defined as a maximal correlation-connected set.
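Definition 6.10 (correlation-connected set)
Let $\varepsilon, \delta \in \mathbb{R}_0^+$ and $k, \lambda \in \mathbb{N}$. A non-empty subset $\mathcal{C} \subseteq DB$ is called a correlation-connected set w.r.t. $\varepsilon$, $k$, $\delta$, and $\lambda$ if all points in $\mathcal{C}$ are correlation-connected and $\mathcal{C}$ is maximal w.r.t. correlation-reachability, formally:
$\mathrm{ConSet}_{\varepsilon,k}^{\lambda,\delta}(\mathcal{C}) \Leftrightarrow$
(1) Connectivity: $\forall O, Q \in \mathcal{C} : \mathrm{Connect}_{\varepsilon,k}^{\lambda,\delta}(O, Q)$
(2) Maximality: $\forall P, Q \in DB : Q \in \mathcal{C} \,\wedge\, \mathrm{Reach}_{\varepsilon,k}^{\lambda,\delta}(Q, P) \Rightarrow P \in \mathcal{C}$.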
The following two lemmata are important for validating the correctness of our clustering algorithm. Intuitively, they state that we can discover a correlation-connected set for a given parameter setting in a two-step approach: First, choose an arbitrary correlation core point $O$ from the database. Second, retrieve all points that are correlation-reachable from $O$. This approach yields the correlation-connected set containing $O$.
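Lemma 6.1
Let $P \in DB$. If $P$ is a correlation core point, then the set of points which are correlation-reachable from $P$ is a correlation-connected set, formally:
$$\mathrm{Core}_{\varepsilon,k}^{\lambda,\delta}(P) \,\wedge\, \mathcal{C} = \{\, O \in DB \mid \mathrm{Reach}_{\varepsilon,k}^{\lambda,\delta}(P, O) \,\} \Rightarrow \mathrm{ConSet}_{\varepsilon,k}^{\lambda,\delta}(\mathcal{C}).$$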
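Proof.
(1) $\mathcal{C} \neq \emptyset$:
By assumption, $\mathrm{Core}_{\varepsilon,k}^{\lambda,\delta}(P)$ and thus $\mathrm{CorDim}(\mathcal{N}_\varepsilon(P)) \le \lambda$.
$\Rightarrow \mathrm{DirReach}_{\varepsilon,k}^{\lambda,\delta}(P, P)$
$\Rightarrow \mathrm{Reach}_{\varepsilon,k}^{\lambda,\delta}(P, P)$
$\Rightarrow P \in \mathcal{C}$.
(2) Maximality:
Let $X \in \mathcal{C}$ and $Y \in DB$ and $\mathrm{Reach}_{\varepsilon,k}^{\lambda,\delta}(X, Y)$.
$\Rightarrow \mathrm{Reach}_{\varepsilon,k}^{\lambda,\delta}(P, X) \,\wedge\, \mathrm{Reach}_{\varepsilon,k}^{\lambda,\delta}(X, Y)$
$\Rightarrow \mathrm{Reach}_{\varepsilon,k}^{\lambda,\delta}(P, Y)$ (since correlation-reachability is a transitive relation)
$\Rightarrow Y \in \mathcal{C}$.
(3) Connectivity:
$\forall X, Y \in \mathcal{C} : \mathrm{Reach}_{\varepsilon,k}^{\lambda,\delta}(P, X) \,\wedge\, \mathrm{Reach}_{\varepsilon,k}^{\lambda,\delta}(P, Y)$
$\Rightarrow \mathrm{Connect}_{\varepsilon,k}^{\lambda,\delta}(X, Y)$ (via $P$).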
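The two-step approach behind Lemma 6.1 (pick a correlation core point, then collect everything correlation-reachable from it) can be sketched as a breadth-first expansion. This is only an illustration, not the algorithm of Section 6.4: the callbacks `is_core` and `direct_reach` are hypothetical stand-ins for Definitions 6.6 and 6.7, and the toy predicates at the end are invented for the example.

```python
from collections import deque

def expand_cluster(start, n_points, is_core, direct_reach):
    """Return the set of point indices correlation-reachable from `start`.

    is_core(i)         -> bool, stand-in for Core(i) (Definition 6.6)
    direct_reach(q, p) -> bool, stand-in for DirReach(q, p) (Definition 6.7);
                          it must be False whenever q is not a core point.
    """
    if not is_core(start):
        raise ValueError("start must be a correlation core point")
    cluster = {start}
    queue = deque([start])
    while queue:
        q = queue.popleft()
        for p in range(n_points):
            # Follow one more link of the chain P_1, ..., P_n of Definition 6.8.
            if p not in cluster and direct_reach(q, p):
                cluster.add(p)
                queue.append(p)
    return cluster

# Toy setting: points 0..5 on a line, points 0, 1, 2 are core points.
core = {0, 1, 2}
is_core = lambda i: i in core
direct_reach = lambda q, p: q in core and abs(q - p) <= 1

cluster = expand_cluster(0, 6, is_core, direct_reach)   # {0, 1, 2, 3}
```

Point 3 joins the cluster as a non-core member reachable from core point 2, while point 4 stays outside because its only neighbor in the cluster is not a core point. Starting from core point 1 or 2 instead yields the same set, which is the statement of Lemma 6.2 in this toy setting.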
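Lemma 6.2
Let $\mathcal{C} \subseteq DB$ be a correlation-connected set. Let $P \in \mathcal{C}$ be a correlation core point. Then $\mathcal{C}$ equals the set of points which are correlation-reachable from $P$, formally:
$$\mathrm{ConSet}_{\varepsilon,k}^{\lambda,\delta}(\mathcal{C}) \,\wedge\, P \in \mathcal{C} \,\wedge\, \mathrm{Core}_{\varepsilon,k}^{\lambda,\delta}(P) \Rightarrow \mathcal{C} = \{\, O \in DB \mid \mathrm{Reach}_{\varepsilon,k}^{\lambda,\delta}(P, O) \,\}.$$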
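Proof.
Let $\bar{\mathcal{C}} = \{\, O \in DB \mid \mathrm{Reach}_{\varepsilon,k}^{\lambda,\delta}(P, O) \,\}$. We have to show that $\bar{\mathcal{C}} = \mathcal{C}$:
(1) $\bar{\mathcal{C}} \subseteq \mathcal{C}$: obvious from the definition of $\bar{\mathcal{C}}$.
(2) $\mathcal{C} \subseteq \bar{\mathcal{C}}$: Let $Q \in \mathcal{C}$. By assumption, $P \in \mathcal{C}$ and $\mathrm{ConSet}_{\varepsilon,k}^{\lambda,\delta}(\mathcal{C})$.
$\Rightarrow \exists O \in \mathcal{C} : \mathrm{Reach}_{\varepsilon,k}^{\lambda,\delta}(O, P) \,\wedge\, \mathrm{Reach}_{\varepsilon,k}^{\lambda,\delta}(O, Q)$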