The Algorithm SUBCLU - Kailing, Karin (2004): New Techniques for Clustering Complex Obje

[Figure 3.3, panel (a): p and q are density-connected via o in AB. Panel (b): p and q are not density-connected in AB.]

Figure 3.3: Monotonicity of density-connectivity (the circles indicate the $\varepsilon$-neighborhoods, $k = 4$).

3.4 The Algorithm SUBCLU

SUBCLU is based on a bottom-up, greedy algorithm to detect the density-connected clusters in all subspaces of high-dimensional data. The algorithm is presented in Figure 3.4. The following data structures are used:

• $DB_S$ denotes the database $DB$ projected onto the subspace $S$.

• $C_S$ denotes the set of all density-connected clusters of $DB$ in the subspace $S$ w.r.t. $\varepsilon$ and $k$, and can be computed by the method DBSCAN, i.e. $C_S := \mathrm{DBSCAN}(DB_S, \varepsilon, k)$. Note that we assume here that the noise set is not included in $C_S$.

• $S_l$ denotes the set of all $l$-dimensional subspaces containing at least one cluster, i.e. $S_l := \{S \subseteq A \mid |S| = l \text{ and } C_S \neq \emptyset\}$.

• $C_l$ denotes the set of sets of all clusters in $l$-dimensional subspaces, i.e. $C_l := \{C_S \mid S \subseteq A \text{ and } |S| = l\}$.

We begin by generating all one-dimensional clusters, applying DBSCAN to each one-dimensional subspace (STEP 1 in Figure 3.4).
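STEP 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the thesis: the names `region_query`, `dbscan_subspace` and `one_dimensional_clusters` are hypothetical, points are tuples of floats, subspaces are tuples of attribute indices, the distance is assumed to be Euclidean restricted to the subspace, and a point is treated as a core point when its neighborhood (including itself) holds at least k points.

```python
from math import dist


def region_query(points, ids, subspace, idx, eps):
    """Ids of all points within eps of point idx, measured in subspace only."""
    p = [points[idx][a] for a in subspace]
    return [j for j in ids
            if dist(p, [points[j][a] for a in subspace]) <= eps]


def dbscan_subspace(points, ids, subspace, eps, k):
    """Plain DBSCAN on the projection of points[ids] onto subspace.

    Returns a list of clusters (sets of point ids); noise is not returned.
    """
    labels = {}      # point id -> cluster number, or -1 for noise
    clusters = []
    for i in ids:
        if i in labels:
            continue
        neighbors = region_query(points, ids, subspace, i, eps)
        if len(neighbors) < k:
            labels[i] = -1               # provisional noise, may become border
            continue
        cid = len(clusters)
        clusters.append({i})
        labels[i] = cid
        seeds = list(neighbors)
        while seeds:
            j = seeds.pop()
            if j in labels and labels[j] != -1:
                continue                 # already assigned to a cluster
            previously_noise = labels.get(j) == -1
            labels[j] = cid
            clusters[cid].add(j)
            if previously_noise:
                continue                 # border point, known not to be core
            j_neighbors = region_query(points, ids, subspace, j, eps)
            if len(j_neighbors) >= k:
                seeds.extend(j_neighbors)  # j is a core point, keep expanding
    return clusters


def one_dimensional_clusters(points, eps, k):
    """STEP 1: run DBSCAN in every 1-D subspace, keep those with clusters."""
    ids = range(len(points))
    clusters_by_subspace = {}
    for a in range(len(points[0])):
        found = dbscan_subspace(points, ids, (a,), eps, k)
        if found:                        # at least one cluster in subspace (a,)
            clusters_by_subspace[(a,)] = found
    return clusters_by_subspace


# Made-up data: attribute 0 has two dense groups, attribute 1 is spread out.
example_points = [(1.0, 0.0), (1.1, 10.0), (1.2, 20.0),
                  (5.0, 30.0), (5.1, 40.0), (5.2, 50.0)]
one_d = one_dimensional_clusters(example_points, eps=0.25, k=3)
```

In this sketch the dictionary keys play the role of $S_1$ and the dictionary values the role of $C_1$; only attribute 0 ends up as a key for the made-up data.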

For each detected cluster, we have to check whether this cluster (or parts of it) still exists in higher-dimensional subspaces. Due to Lemma 3.1, no other clusters can exist in higher-dimensional subspaces. Thus, we

SUBCLU(SetOfPoints DB, Real ε, Integer k)

  // STEP 1: Generate all 1-D clusters
  S_1 := ∅;   // set of 1-D subspaces containing clusters
  C_1 := ∅;   // set of all sets of clusters in 1-D subspaces
  for each a_i ∈ A do
    C_{a_i} := DBSCAN(DB_{a_i}, ε, k);   // set of all clusters in subspace {a_i}
    if C_{a_i} ≠ ∅ then                  // at least one cluster in subspace {a_i} found
      S_1 := S_1 ∪ {a_i};
      C_1 := C_1 ∪ C_{a_i};
    end if
  end for

  // STEP 2: Generate (l+1)-D clusters from l-D clusters
  l := 1;
  while C_l ≠ ∅
    // STEP 2.1: Generate (l+1)-D candidate subspaces
    CandS_{l+1} := GenerateCandidates(S_l);

    // STEP 2.2: Test candidates and generate (l+1)-D clusters
    for each cand ∈ CandS_{l+1} do
      // Search l-dim subspace of cand with minimal number of points in the clusters
      bestSubspace := ArgMin_{s ∈ S_l ∧ s ⊆ cand} Σ_{C_i ∈ C_s} |C_i|;
      C_cand := ∅;
      for each cluster cl ∈ C_bestSubspace do
        C_cand := C_cand ∪ DBSCAN(cl_cand, ε, k);
        if C_cand ≠ ∅ then
          S_{l+1} := S_{l+1} ∪ cand;
          C_{l+1} := C_{l+1} ∪ C_cand;
        end if
      end for
    end for
    l := l + 1
  end while

Figure 3.4: The algorithm SUBCLU.


GenerateCandidates(SetOfSubspaces S_l)

  // STEP 2.1.1: Generate (l+1)-D candidate subspaces
  CandS_{l+1} := ∅;
  for each s1 ∈ S_l do
    for each s2 ∈ S_l do
      if s1.attr_1 = s2.attr_1 ∧ ... ∧ s1.attr_{l-1} = s2.attr_{l-1} ∧ s1.attr_l < s2.attr_l then
        insert {s1.attr_1, ..., s1.attr_l, s2.attr_l} into CandS_{l+1};
      end if
    end for
  end for

  // STEP 2.1.2: Prune irrelevant candidate subspaces
  for each cand ∈ CandS_{l+1} do
    for each s ⊂ cand with |s| = l do
      if s ∉ S_l then
        delete cand from CandS_{l+1};
      end if
    end for
  end for

Figure 3.5: The procedure GenerateCandidates.

search for each $l$-dimensional subspace $S \in S_l$ all other $l$-dimensional subspaces $T \in S_l$ having $(l-1)$ attributes in common and join them to generate $(l+1)$-dimensional candidate subspaces (STEP 2.1.1 of the procedure GenerateCandidates in Figure 3.5). The set of $(l+1)$-dimensional candidate subspaces is denoted by $\mathit{CandS}_{l+1}$.

For each candidate subspace $S \in \mathit{CandS}_{l+1}$, $S_l$ must contain each $l$-dimensional subspace $T \subset S$, $|T| = l$ (cf. Lemma 3.1). Consequently, we can prune all candidates having at least one $l$-dimensional subspace not included in $S_l$ (STEP 2.1.2 of the procedure GenerateCandidates in Figure 3.5). This reduces the number of $(l+1)$-dimensional candidate subspaces.
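The join and prune steps of GenerateCandidates can be sketched in Python as below. The function name `generate_candidates` and the representation of a subspace as a sorted tuple of attribute indices are assumptions made for this illustration only.

```python
from itertools import combinations


def generate_candidates(subspaces_l):
    """Join l-D subspaces sharing their first l-1 attributes, then prune.

    subspaces_l: iterable of sorted tuples of attribute indices, all of
    length l. Returns the sorted list of (l+1)-D candidate subspaces.
    """
    known = set(subspaces_l)
    if not known:
        return []
    l = len(next(iter(known)))

    # STEP 2.1.1: join s1 and s2 if they agree on the first l-1 attributes
    # and the last attribute of s1 is smaller than the last attribute of s2.
    candidates = set()
    for s1 in known:
        for s2 in known:
            if s1[:-1] == s2[:-1] and s1[-1] < s2[-1]:
                candidates.add(s1 + (s2[-1],))

    # STEP 2.1.2: keep a candidate only if every l-D subset of it is in S_l.
    return sorted(c for c in candidates
                  if all(sub in known for sub in combinations(c, l)))


# Made-up S_2: the join yields (0, 1, 2) and (1, 2, 3), but (1, 2, 3) is
# pruned because its subset (2, 3) contains no cluster.
candidates_3d = generate_candidates([(0, 1), (0, 2), (1, 2), (1, 3)])
```

Because the tuples are sorted, `combinations` produces each $l$-dimensional subset in the same canonical order in which it would be stored in $S_l$, so a plain set lookup is enough for the pruning test.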

In the last step (STEP 2.2 in Figure 3.4), we generate the $(l+1)$-dimensional clusters and the corresponding $(l+1)$-dimensional subspaces containing these clusters. To do so, we use the $l$-dimensional subclusters and the list of $(l+1)$-dimensional candidate subspaces. For each candidate subspace $\mathit{cand} \in \mathit{CandS}_{l+1}$, we take one $l$-dimensional subspace $T \subset \mathit{cand}$ and simply call the procedure $\mathrm{DBSCAN}(cl_{\mathit{cand}}, \varepsilon, k)$ for each cluster $cl$ in $T$ ($cl \in C_T$) to generate $C_{\mathit{cand}}$. To minimize the cost of the runs of DBSCAN, we choose as $T$ the subspace $\mathit{bestSubspace}$ in which the minimum number of points is in the clusters, i.e.

$$\mathit{bestSubspace} := \operatorname*{ArgMin}_{s \in S_l \wedge s \subseteq \mathit{cand}} \sum_{C_i \in C_s} |C_i|.$$

This minimizes the number of necessary range queries during the runs of DBSCAN in the candidate subspace $S$. If $C_S \neq \emptyset$, we add it to $C_{l+1}$ and add $S$ to $S_{l+1}$.
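The choice of bestSubspace and the refinement of its clusters can be illustrated with the following sketch. The names `best_subspace` and `clusters_in_candidate` are hypothetical, as is the `dbscan(point_ids, subspace)` callable, which stands for any routine that returns the clusters (sets of point ids) found among the given points in the given subspace. Clusters are assumed to be stored in a dictionary keyed by subspace (a sorted tuple of attribute indices).

```python
def best_subspace(cand, clusters_by_subspace):
    """Pick the l-D subspace of cand whose clusters contain the fewest points.

    Assumes cand survived pruning, so at least one l-D subset is present.
    """
    cand_attrs = set(cand)
    l = len(cand) - 1
    options = [s for s in clusters_by_subspace
               if len(s) == l and set(s) <= cand_attrs]
    return min(options,
               key=lambda s: sum(len(c) for c in clusters_by_subspace[s]))


def clusters_in_candidate(cand, clusters_by_subspace, dbscan):
    """STEP 2.2 for one candidate: re-cluster each cluster of the best subspace.

    dbscan(point_ids, subspace) is an assumed callable, not a library API.
    """
    best = best_subspace(cand, clusters_by_subspace)
    found = []
    for cl in clusters_by_subspace[best]:
        found.extend(dbscan(cl, cand))   # only the points of cl are examined
    return found


# Made-up 1-D clusters: subspace (1,) holds 3 clustered points, (0,) holds 5.
example_clusters = {
    (0,): [{0, 1, 2, 3, 4}],
    (1,): [{0, 1, 2}],
    (2,): [{5, 6, 7, 8}],
}
chosen = best_subspace((0, 1), example_clusters)
```

Starting from the subspace with the fewest clustered points means that DBSCAN in the candidate subspace touches as few points as possible, which is where the saving in range queries comes from.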

Steps 2.1 and 2.2 are repeated as long as the set of $l$-dimensional subspaces containing clusters is not empty.

The most time-consuming part of our algorithm is the execution of all the partial range queries on arbitrary subspaces of the data space. As DBSCAN is applied to different subspaces, an index structure for the full-dimensional data space is not applicable. Therefore, we apply the approach of inverted files. Our algorithm provides efficient index support for range queries on each single attribute in logarithmic time. For range queries on more than one attribute, we apply the range query to each separate attribute (index structure) and generate the intersection of all intermediate results to obtain the final result.
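A minimal sketch of this inverted-file idea is given below, assuming one sorted list per attribute searched with binary search. The class name `InvertedFileIndex` and its methods are hypothetical and do not reproduce the index structure used in the thesis. Note that intersecting per-attribute ranges yields an axis-parallel box around the query point; for Euclidean distance, a final distance check on the surviving points would still be needed.

```python
from bisect import bisect_left, bisect_right


class InvertedFileIndex:
    """One sorted (value, point id) list per attribute."""

    def __init__(self, points):
        n_attrs = len(points[0]) if points else 0
        self.columns = []
        for a in range(n_attrs):
            pairs = sorted((p[a], i) for i, p in enumerate(points))
            values = [v for v, _ in pairs]
            ids = [i for _, i in pairs]
            self.columns.append((values, ids))

    def range_1d(self, attr, low, high):
        """Ids of points with low <= value <= high on a single attribute.

        Both interval boundaries are located by binary search.
        """
        values, ids = self.columns[attr]
        start = bisect_left(values, low)
        stop = bisect_right(values, high)
        return set(ids[start:stop])

    def range_query(self, subspace, center, eps):
        """Intersect the ranges [c - eps, c + eps] of all attributes in subspace."""
        result = None
        for a in subspace:
            hits = self.range_1d(a, center[a] - eps, center[a] + eps)
            result = hits if result is None else result & hits
            if not result:
                break                    # intersection is already empty
        return result if result is not None else set()


# Made-up data: four 2-D points, queried around the first one.
example_points = [(1.0, 1.0), (1.5, 9.0), (2.0, 1.2), (8.0, 1.1)]
index = InvertedFileIndex(example_points)
box_hits = index.range_query((0, 1), example_points[0], eps=1.0)
```

The same index serves every subspace, since a query on a subspace only consults the sorted lists of the attributes that subspace contains.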

In document Kailing, Karin (2004): New Techniques for Clustering Complex Objects. Dissertation, LMU München: Fakultät für Mathematik, Informatik und Statistik (Page 55-58)