[Figure 3.3, two panels: (a) p and q are density-connected via o in AB; (b) p and q are not density-connected in AB.]
Figure 3.3: Monotonicity of density-connectivity (the circles indicate the ε-neighborhoods, k = 4).
3.4 The Algorithm SUBCLU
SUBCLU is based on a bottom-up, greedy algorithm to detect the density-connected clusters in all subspaces of high-dimensional data. The algorithm is presented in Figure 3.4. The following data structures are used:
• $DB^S$ denotes the database $DB$ projected onto the subspace $S$.
• $C^S$ denotes the set of all density-connected clusters of $DB$ in the subspace $S$ w.r.t. $\varepsilon$ and $k$. It can be computed by the method DBSCAN, i.e. $C^S := \mathrm{DBSCAN}(DB^S, \varepsilon, k)$. Note that we assume here that the noise set is not included in $C^S$.
• $S_l$ denotes the set of all $l$-dimensional subspaces containing at least one cluster, i.e. $S_l := \{S \subseteq A \mid |S| = l \text{ and } C^S \neq \emptyset\}$.
• $C_l$ denotes the set of the sets of all clusters in $l$-dimensional subspaces, i.e. $C_l := \{C^S \mid S \subseteq A \text{ and } |S| = l\}$.
We begin by generating all one-dimensional clusters, applying DBSCAN to each one-dimensional subspace (STEP 1 in Figure 3.4).
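STEP 1 can be sketched in Python as follows. This is only an illustrative sketch, not the implementation used here: the function names `dbscan` and `one_dimensional_clusters` are hypothetical, the neighborhood search is a brute-force scan with Euclidean distance, and subspaces are represented as tuples of attribute positions.

```python
from collections import deque


def dbscan(points, eps, k, attrs):
    """Minimal DBSCAN on `points` projected onto the attributes `attrs`.

    Returns a list of clusters (sets of point indices); noise is omitted.
    Assumption: Euclidean distance, brute-force neighborhood scan.
    """
    n = len(points)

    def neighbors(i):
        # eps-neighborhood of point i in the projection (includes i itself)
        return [j for j in range(n)
                if sum((points[i][a] - points[j][a]) ** 2 for a in attrs) <= eps ** 2]

    labels = [None] * n      # None = unvisited, -1 = noise, >= 0 = cluster id
    clusters = []
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < k:   # i is not a core point
            labels[i] = -1
            continue
        cid = len(clusters)
        clusters.append(set())
        queue = deque(seeds)
        while queue:
            j = queue.popleft()
            if labels[j] is not None and labels[j] != -1:
                continue     # j already belongs to a cluster
            was_noise = labels[j] == -1
            labels[j] = cid
            clusters[cid].add(j)
            if was_noise:
                continue     # known non-core point: border point, do not expand
            reachable = neighbors(j)
            if len(reachable) >= k:
                queue.extend(reachable)
    return clusters


def one_dimensional_clusters(points, eps, k):
    """STEP 1: run DBSCAN in every 1-D subspace, keep those with clusters."""
    subspaces, clusters = [], {}
    for a in range(len(points[0])):
        found = dbscan(points, eps, k, (a,))
        if found:            # at least one cluster in subspace {a}
            subspaces.append((a,))
            clusters[(a,)] = found
    return subspaces, clusters


points = [(1.0, 0.0), (1.1, 5.0), (1.2, 10.0), (1.3, 15.0), (9.0, 20.0)]
s1, c1 = one_dimensional_clusters(points, eps=0.5, k=3)
```

In this toy data set only the first attribute contains a dense group, so only the subspace `(0,)` is kept for the next iteration.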
For each detected cluster, we have to check whether this cluster is (or parts of it are) still existent in higher-dimensional subspaces. Due to Lemma 3.1, no other cluster can exist in higher-dimensional subspaces. Thus, for each $l$-dimensional subspace $S \in S_l$, we search all other $l$-dimensional subspaces $T \in S_l$ having $(l-1)$ attributes in common and join them to generate $(l+1)$-dimensional candidate subspaces (STEP 2.1.1 of the procedure GenerateCandidates in Figure 3.5). The set of $(l+1)$-dimensional candidate subspaces is denoted by $CandS_{l+1}$.

SUBCLU(SetOfPoints DB, Real ε, Integer k)
  // STEP 1: Generate all 1-D clusters
  S_1 := ∅;    // set of 1-D subspaces containing clusters
  C_1 := ∅;    // set of all sets of clusters in 1-D subspaces
  for each a_i ∈ A do
    C^{a_i} := DBSCAN(DB^{a_i}, ε, k);    // set of all clusters in subspace {a_i}
    if C^{a_i} ≠ ∅ then                   // at least one cluster in subspace {a_i} found
      S_1 := S_1 ∪ {a_i};
      C_1 := C_1 ∪ C^{a_i};
    end if
  end for
  // STEP 2: Generate (l+1)-D clusters from l-D clusters
  l := 1;
  while C_l ≠ ∅
    // STEP 2.1: Generate (l+1)-D candidate subspaces
    CandS_{l+1} := GenerateCandidates(S_l);
    // STEP 2.2: Test candidates and generate (l+1)-D clusters
    for each cand ∈ CandS_{l+1} do
      // Search l-dim subspace of cand with minimal number of points in the clusters
      bestSubspace := ArgMin_{s ∈ S_l ∧ s ⊆ cand} Σ_{C_i ∈ C^s} |C_i|;
      C^{cand} := ∅;
      for each cluster cl ∈ C^{bestSubspace} do
        C^{cand} := C^{cand} ∪ DBSCAN(cl^{cand}, ε, k);
        if C^{cand} ≠ ∅ then
          S_{l+1} := S_{l+1} ∪ cand;
          C_{l+1} := C_{l+1} ∪ C^{cand};
        end if
      end for
    end for
    l := l + 1;
  end while
Figure 3.4: The algorithm SUBCLU.

GenerateCandidates(SetOfSubspaces S_l)
  // STEP 2.1.1: Generate (l+1)-D candidate subspaces
  CandS_{l+1} := ∅;
  for each s1 ∈ S_l do
    for each s2 ∈ S_l do
      if s1.attr_1 = s2.attr_1 ∧ ... ∧ s1.attr_{l-1} = s2.attr_{l-1} ∧ s1.attr_l < s2.attr_l then
        insert {s1.attr_1, ..., s1.attr_l, s2.attr_l} into CandS_{l+1};
      end if
    end for
  end for
  // STEP 2.1.2: Prune irrelevant candidate subspaces
  for each cand ∈ CandS_{l+1} do
    for each s ⊂ cand with |s| = l do
      if s ∉ S_l then
        delete cand from CandS_{l+1};
      end if
    end for
  end for
Figure 3.5: The procedure GenerateCandidates.
A candidate subspace $S \in CandS_{l+1}$ can contain a cluster only if $S_l$ contains every $l$-dimensional subspace $T \subset S$, $|T| = l$ (cf. Lemma 3.1). Consequently, we can prune all candidates having at least one $l$-dimensional subspace not included in $S_l$ (STEP 2.1.2 of the procedure GenerateCandidates in Figure 3.5). This reduces the number of $(l+1)$-dimensional candidate subspaces.
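The join and prune steps of GenerateCandidates can be illustrated with a short Python sketch. It assumes, purely for illustration, that each subspace is a sorted tuple of attribute indices; the name `generate_candidates` is hypothetical.

```python
from itertools import combinations


def generate_candidates(subspaces_l):
    """Join l-dimensional subspaces into (l+1)-dimensional candidates, then prune.

    `subspaces_l` is an iterable of sorted attribute tuples, all of length l.
    """
    s_l = set(subspaces_l)
    if not s_l:
        return []
    l = len(next(iter(s_l)))

    # STEP 2.1.1: join two subspaces that agree on their first l-1 attributes
    candidates = set()
    for s1 in s_l:
        for s2 in s_l:
            if s1[:-1] == s2[:-1] and s1[-1] < s2[-1]:
                candidates.add(s1 + (s2[-1],))

    # STEP 2.1.2: keep a candidate only if all of its l-dimensional
    # subspaces are themselves in S_l (monotonicity, cf. Lemma 3.1)
    return sorted(c for c in candidates
                  if all(sub in s_l for sub in combinations(c, l)))


two_dim = [(0, 1), (0, 2), (1, 2), (1, 3)]
three_dim_candidates = generate_candidates(two_dim)
```

Here the join produces `(0, 1, 2)` and `(1, 2, 3)`, and the second is pruned because its subspace `(2, 3)` contains no cluster.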
In the last step (STEP 2.2 in Figure 3.4), we generate the $(l+1)$-dimensional clusters and the corresponding $(l+1)$-dimensional subspaces containing these clusters. To do so, we use the $l$-dimensional clusters and the list of $(l+1)$-dimensional candidate subspaces. For each candidate subspace $cand \in CandS_{l+1}$, we take one $l$-dimensional subspace $T \subset cand$ and call the procedure DBSCAN($cl^{cand}$, $\varepsilon$, $k$) for each cluster $cl$ in $T$ ($cl \in C^T$) to generate $C^{cand}$. To minimize the cost of the runs of DBSCAN in $cand$, we choose as $T$ the subspace bestSubspace in which the minimum number of points is in the clusters, i.e.

$$bestSubspace := \operatorname{ArgMin}_{s \in S_l \wedge s \subseteq cand} \sum_{C_i \in C^s} |C_i|.$$

This minimizes the number of necessary range queries during the runs of DBSCAN in $cand$. If $C^{cand} \neq \emptyset$, we add it to $C_{l+1}$ and add $cand$ to $S_{l+1}$.
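The choice of bestSubspace can be sketched as a plain arg-min over the $l$-dimensional subspaces of a candidate. The sketch below assumes a hypothetical dictionary `clusters_l` that maps each $l$-dimensional subspace (a sorted tuple of attribute indices) to its list of clusters, each cluster being a set of point ids.

```python
from itertools import combinations


def best_subspace(cand, clusters_l):
    """Return the l-dimensional subspace of `cand` whose clusters
    contain the fewest points in total."""
    l = len(cand) - 1
    # After pruning, every l-dimensional subspace of cand should be present;
    # the membership check is only a defensive filter.
    subspaces = [s for s in combinations(cand, l) if s in clusters_l]
    return min(subspaces, key=lambda s: sum(len(c) for c in clusters_l[s]))


clusters_l = {
    (0, 1): [{1, 2, 3, 4}, {5, 6, 7}],   # 7 clustered points
    (0, 2): [{1, 2, 3}],                 # 3 clustered points
    (1, 2): [{1, 2, 3, 4, 5}],           # 5 clustered points
}
chosen = best_subspace((0, 1, 2), clusters_l)
```

Running DBSCAN only on the points of the clusters in the chosen subspace keeps the number of points, and therefore the number of range queries, as small as possible for this candidate.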
Steps 2.1 and 2.2 are repeated as long as the set of $l$-dimensional subspaces containing clusters is not empty.
The most time-consuming part of our algorithm is the execution of all the partial range queries on arbitrary subspaces of the data space. As DBSCAN is applied to different subspaces, an index structure for the full-dimensional data space is not applicable. Therefore, we apply the approach of inverted files. Our algorithm provides efficient index support for range queries on each single attribute in logarithmic time. For range queries on more than one attribute, we apply the range query to each separate attribute (index structure) and intersect all intermediate results to obtain the final result.
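The inverted-file idea can be illustrated with one sorted list per attribute and a binary search for the interval boundaries. This is a minimal sketch under stated assumptions: the class `AttributeIndex` and its methods are hypothetical names, and the final Euclidean refinement in `eps_neighborhood` is an addition of this sketch, since intersecting per-attribute intervals yields a box that can be larger than a Euclidean ε-ball.

```python
from bisect import bisect_left, bisect_right


class AttributeIndex:
    """One sorted (value, point id) list per attribute."""

    def __init__(self, points):
        self.points = points
        self.index = []
        for a in range(len(points[0])):
            pairs = sorted((p[a], i) for i, p in enumerate(points))
            self.index.append(([v for v, _ in pairs], [i for _, i in pairs]))

    def range_1d(self, attr, low, high):
        """Ids of all points with low <= value <= high on one attribute.
        The boundaries are located by binary search."""
        values, ids = self.index[attr]
        return set(ids[bisect_left(values, low):bisect_right(values, high)])

    def range_query(self, attrs, center, eps):
        """Intersect the single-attribute results for the subspace `attrs`."""
        result = None
        for a in attrs:
            hits = self.range_1d(a, center[a] - eps, center[a] + eps)
            result = hits if result is None else result & hits
        return result

    def eps_neighborhood(self, attrs, center, eps):
        """Assumed refinement step: filter the box result by Euclidean distance."""
        return {i for i in self.range_query(attrs, center, eps)
                if sum((self.points[i][a] - center[a]) ** 2 for a in attrs) <= eps ** 2}


points = [(0.0, 0.0), (1.0, 0.0), (0.9, 0.9), (5.0, 0.1)]
idx = AttributeIndex(points)
box = idx.range_query((0, 1), (0.0, 0.0), 1.0)
ball = idx.eps_neighborhood((0, 1), (0.0, 0.0), 1.0)
```

The same per-attribute lists serve every subspace, which is why no full-dimensional index is needed.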