
ious algorithms such as DBSCAN [EKSX96], DBCLASD [XEKS98], DENCLUE [HK98], and OPTICS [ABKS99]. All these methods search for regions of high density in a feature space that are separated by regions of lower density.

2.2 Foundations of Density-Based Clustering 19

Figure 2.2: Convex (left) and arbitrarily (right) shaped clusters.

The approaches presented in this thesis are particularly based on the formal definitions of density-connected clusters underlying the algorithm DBSCAN [EKSX96]. As illustrated in Figure 2.2, the density-connected clustering notion is capable of finding clusters of arbitrary shapes. In the following, we introduce the concepts necessary to find all density-connected clusters of a given data set.

Definition 2.1 (ε-neighborhood)

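Let $\varepsilon \in \mathbb{R}_0^+$ and $o \in \mathit{DB}$. The $\varepsilon$-neighborhood of $o$, denoted by $\mathcal{N}_\varepsilon(o)$, is defined by

$$\mathcal{N}_\varepsilon(o) = \{x \in \mathit{DB} \mid \mathit{dist}(o, x) \le \varepsilon\}.$$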
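Definition 2.1 can be sketched in a few lines of Python. This is a minimal sketch, assuming that objects are coordinate tuples and that dist is the Euclidean distance (the definition itself permits any distance function); the names `eps_neighborhood` and `db` are illustrative and do not come from the source.

```python
import math

def eps_neighborhood(db, o, eps):
    """Return the indices of all objects x in db with dist(db[o], x) <= eps.

    Assumption: objects are coordinate tuples and dist is Euclidean.
    """
    return [i for i, x in enumerate(db) if math.dist(db[o], x) <= eps]

db = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]
print(eps_neighborhood(db, 0, 1.0))  # [0, 1]
```

Note that the neighborhood of an object always contains the object itself, since dist(o, o) = 0.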

Based on the two input parameters (ε and k), dense regions can be defined by means of core objects:

Definition 2.2 (core object)

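Let $\varepsilon \in \mathbb{R}_0^+$ and $k \in \mathbb{N}$. An object $o \in \mathit{DB}$ is called a core object, denoted by $\mathit{Core}_{\varepsilon,k}(o)$, if its $\varepsilon$-neighborhood contains at least $k$ objects, formally:

$$\mathit{Core}_{\varepsilon,k}(o) \Leftrightarrow |\mathcal{N}_\varepsilon(o)| \ge k.$$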
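The core object condition translates directly into a size check on the ε-neighborhood. The following is a minimal sketch under the same assumptions as before (coordinate tuples, Euclidean distance); `is_core` and the sample data are hypothetical names chosen for illustration.

```python
import math

def eps_neighborhood(db, o, eps):
    """Indices of all objects within distance eps of db[o] (Euclidean, assumed)."""
    return [i for i, x in enumerate(db) if math.dist(db[o], x) <= eps]

def is_core(db, o, eps, k):
    """Core_{eps,k}(o) holds iff the eps-neighborhood of o has at least k objects."""
    return len(eps_neighborhood(db, o, eps)) >= k

db = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (2.0, 0.0), (9.0, 9.0)]
print([is_core(db, i, 1.0, 3) for i in range(len(db))])
# [True, True, False, False, False]
```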

Clusters contain core objects, located inside a cluster, and border objects, located at the border of the cluster (see Figure 2.3(a) for an illustration). In addition, a cluster should form a dense region and thus, all objects within a cluster should be “connected”. Using the concept of connectivity, any core object o can be used to expand a cluster. To find all density-connected objects of o, the following concepts are used.

Definition 2.3 (direct density-reachability)

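Let $\varepsilon \in \mathbb{R}_0^+$ and $k \in \mathbb{N}$. An object $p \in \mathit{DB}$ is directly density-reachable from $q \in \mathit{DB}$ if $q$ is a core object and $p$ is an element of $\mathcal{N}_\varepsilon(q)$, formally:

$$\mathit{DirReach}_{\varepsilon,k}(q, p) \Leftrightarrow \mathit{Core}_{\varepsilon,k}(q) \wedge p \in \mathcal{N}_\varepsilon(q).$$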

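The concept of direct density-reachability is depicted in Figure 2.3(b). Direct density-reachability is symmetric for pairs of core objects, but not if a border object is involved: a border object is directly density-reachable from a core object in whose neighborhood it lies, whereas the core object is not directly density-reachable from the border object. As we want to be independent of the order of processing, direct density-reachability alone is therefore not sufficient to define clusters.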
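The asymmetry for border objects can be checked on a small example. This is a minimal sketch, assuming Euclidean distance on coordinate tuples; `dir_reach` and the sample points are illustrative and not taken from the source.

```python
import math

def eps_neighborhood(db, o, eps):
    """Indices of all objects within distance eps of db[o] (Euclidean, assumed)."""
    return [i for i, x in enumerate(db) if math.dist(db[o], x) <= eps]

def dir_reach(db, q, p, eps, k):
    """DirReach_{eps,k}(q, p): q is a core object and p lies in N_eps(q)."""
    nq = eps_neighborhood(db, q, eps)
    return len(nq) >= k and p in nq

# With eps = 1 and k = 3, objects 0 and 1 are core objects; object 2 is a border object.
db = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (2.0, 0.0), (9.0, 9.0)]
print(dir_reach(db, 0, 2, 1.0, 3))  # True: border object 2 from core object 0
print(dir_reach(db, 2, 0, 1.0, 3))  # False: object 2 is not a core object
print(dir_reach(db, 0, 1, 1.0, 3), dir_reach(db, 1, 0, 1.0, 3))  # True True
```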

To find all density-connected objects, we can now build the transitive closure of direct density-reachability.

Definition 2.4 (density-reachability)

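Let $\varepsilon \in \mathbb{R}_0^+$ and $k \in \mathbb{N}$. An object $p \in \mathit{DB}$ is density-reachable from $q \in \mathit{DB}$ if there is a sequence of objects $p_1, \ldots, p_n$ with $p_1 = q$ and $p_n = p$ such that $p_{i+1}$ is directly density-reachable from $p_i$, formally:

$$\mathit{Reach}_{\varepsilon,k}(q, p) \Leftrightarrow \exists p_1, \ldots, p_n \in \mathit{DB} : p_1 = q \wedge p_n = p \wedge \forall i \in \{1, \ldots, n-1\} : \mathit{DirReach}_{\varepsilon,k}(p_i, p_{i+1}).$$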

Density-reachability is illustrated in Figure 2.3(c). Density-reachability is still not symmetric in general. Thus, we finally introduce the concept of density-connectivity.
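Since density-reachability is the transitive closure of direct density-reachability, the set of objects reachable from q can be computed by a breadth-first search that only continues through core objects. The sketch below assumes Euclidean distance and, as an interpretive assumption, requires a chain to contain at least one step (n ≥ 2), so that a non-core object reaches nothing; `reachable_from` is a hypothetical helper name.

```python
import math
from collections import deque

def eps_neighborhood(db, o, eps):
    """Indices of all objects within distance eps of db[o] (Euclidean, assumed)."""
    return [i for i, x in enumerate(db) if math.dist(db[o], x) <= eps]

def reachable_from(db, q, eps, k):
    """Set of all p with Reach_{eps,k}(q, p), found by breadth-first search.

    Assumption: a chain has at least one step, so a non-core q reaches nothing.
    """
    reached, expanded, queue = set(), {q}, deque([q])
    while queue:
        cur = queue.popleft()
        nbrs = eps_neighborhood(db, cur, eps)
        if len(nbrs) < k:
            continue  # cur is not a core object, so no chain continues through it
        for p in nbrs:
            reached.add(p)  # p is directly density-reachable from cur
            if p not in expanded:
                expanded.add(p)
                queue.append(p)
    return reached

# Five points on a line plus one far-away point; eps = 1, k = 3.
db = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (9, 9)]
print(sorted(reachable_from(db, 2, 1.0, 3)))  # [0, 1, 2, 3, 4]
print(sorted(reachable_from(db, 0, 1.0, 3)))  # []
```

Object 0 is density-reachable from the core object 2, but nothing is density-reachable from the border object 0, which illustrates the missing symmetry.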

Definition 2.5 (density-connectivity)

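Let $\varepsilon \in \mathbb{R}_0^+$ and $k \in \mathbb{N}$. An object $p \in \mathit{DB}$ is density-connected to an object $q \in \mathit{DB}$ if there is an object $o$ such that both $p$ and $q$ are density-reachable from $o$, formally:

$$\mathit{Connect}_{\varepsilon,k}(q, p) \Leftrightarrow \exists o \in \mathit{DB} : \mathit{Reach}_{\varepsilon,k}(o, q) \wedge \mathit{Reach}_{\varepsilon,k}(o, p).$$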

Density-connectivity is a symmetric relation. Thus, searching for all density-connected points is independent from the order of processing. The concept is visualized in Figure 2.3(d).
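Density-connectivity can be tested naively by searching for a common object o from which both arguments are density-reachable. This is a minimal and deliberately inefficient sketch, assuming Euclidean distance and chains with at least one step; `reachable_from` and `density_connected` are illustrative names.

```python
import math
from collections import deque

def eps_neighborhood(db, o, eps):
    """Indices of all objects within distance eps of db[o] (Euclidean, assumed)."""
    return [i for i, x in enumerate(db) if math.dist(db[o], x) <= eps]

def reachable_from(db, q, eps, k):
    """All objects density-reachable from q (chains need at least one step)."""
    reached, expanded, queue = set(), {q}, deque([q])
    while queue:
        cur = queue.popleft()
        nbrs = eps_neighborhood(db, cur, eps)
        if len(nbrs) < k:
            continue  # only core objects extend a chain
        for p in nbrs:
            reached.add(p)
            if p not in expanded:
                expanded.add(p)
                queue.append(p)
    return reached

def density_connected(db, q, p, eps, k):
    """Connect_{eps,k}(q, p): some object o reaches both q and p."""
    for o in range(len(db)):
        r = reachable_from(db, o, eps, k)
        if q in r and p in r:
            return True
    return False

db = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (9, 9)]
print(density_connected(db, 0, 4, 1.0, 3))  # True: both border objects are reachable from object 2
print(density_connected(db, 4, 0, 1.0, 3))  # True: the relation is symmetric
print(density_connected(db, 0, 5, 1.0, 3))  # False: object 5 is isolated
```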


Figure 2.3: Concepts of DBSCAN (all panels drawn for k = 5). (a) q is a core object and p a border object. (b) p is directly density-reachable from q; q is not directly density-reachable from p. (c) p is density-reachable from q. (d) p and q are density-connected.

Definition 2.6 (density-connected set)

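Let $\varepsilon \in \mathbb{R}_0^+$ and $k \in \mathbb{N}$. A non-empty subset $C \subseteq \mathit{DB}$ is called a density-connected set if all objects in $C$ are pairwise density-connected, formally:

$$\mathit{ConSet}_{\varepsilon,k}(C) \Leftrightarrow \forall o, q \in C : \mathit{Connect}_{\varepsilon,k}(o, q).$$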

Finally, a density-connected cluster is defined as a set of density-connected objects which is maximal w.r.t. density-reachability.

Definition 2.7 (density-connected cluster)

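Let $\varepsilon \in \mathbb{R}_0^+$ and $k \in \mathbb{N}$. A non-empty subset $C \subseteq \mathit{DB}$ is called a density-connected cluster w.r.t. $\varepsilon$ and $k$ if all objects in $C$ are density-connected and $C$ is maximal w.r.t. density-reachability, formally:

$\mathit{ConCluster}_{\varepsilon,k}(C) \Leftrightarrow$

(1) Connectivity: $\forall o, q \in C : \mathit{Connect}_{\varepsilon,k}(o, q)$.

(2) Maximality: $\forall p, q \in \mathit{DB} : q \in C \wedge \mathit{Reach}_{\varepsilon,k}(q, p) \Rightarrow p \in C$.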
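For small data sets, the two requirements on a density-connected cluster can be verified by brute force. The sketch below assumes Euclidean distance, chains with at least one step, and reads maximality as "every object that is density-reachable from a member of C also belongs to C"; the function names are hypothetical.

```python
import math
from collections import deque

def eps_neighborhood(db, o, eps):
    """Indices of all objects within distance eps of db[o] (Euclidean, assumed)."""
    return [i for i, x in enumerate(db) if math.dist(db[o], x) <= eps]

def reachable_from(db, q, eps, k):
    """All objects density-reachable from q (chains need at least one step)."""
    reached, expanded, queue = set(), {q}, deque([q])
    while queue:
        cur = queue.popleft()
        nbrs = eps_neighborhood(db, cur, eps)
        if len(nbrs) < k:
            continue  # only core objects extend a chain
        for p in nbrs:
            reached.add(p)
            if p not in expanded:
                expanded.add(p)
                queue.append(p)
    return reached

def is_density_connected_cluster(db, C, eps, k):
    """Check non-emptiness, connectivity and maximality of the index set C."""
    reach = [reachable_from(db, o, eps, k) for o in range(len(db))]
    # Connectivity: every pair in C is reachable from some common object.
    connectivity = all(any(o in r and q in r for r in reach) for o in C for q in C)
    # Maximality: whatever is reachable from a member of C lies in C.
    maximality = all(reach[q] <= C for q in C)
    return bool(C) and connectivity and maximality

db = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (9, 9)]
print(is_density_connected_cluster(db, {0, 1, 2, 3, 4}, 1.0, 3))  # True
print(is_density_connected_cluster(db, {1, 2, 3}, 1.0, 3))        # False: not maximal
print(is_density_connected_cluster(db, {5}, 1.0, 3))              # False: not connected
```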

DBSCAN(SetOfObjects DB, Real ε, Integer k)
    // each point in DB is marked as unclassified
    generate new clusterID cid;
    for each p ∈ DB do
        if p.clusterID = UNCLASSIFIED then
            if ExpandCluster(DB, p, cid, ε, k) then
                cid := cid + 1;
            end if
        end if
    end for

Figure 2.4: The algorithm DBSCAN.

Using these concepts, DBSCAN is able to detect arbitrarily shaped clusters in a single pass over the data. To do so, DBSCAN uses the fact that a density-connected cluster can be detected by finding one of its core objects o and computing all objects which are density-reachable from o. The pseudocode of DBSCAN is depicted in Figure 2.4.

The method ExpandCluster, which computes the density-connected cluster starting from a given core point, is shown in Figure 2.5.
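The main loop of Figure 2.4 can be sketched in Python as follows. Figure 2.5 is not reproduced in this section, so the body of `expand_cluster` below is an assumption: it follows the common queue-based formulation and leaves objects that already belong to a cluster untouched. The Euclidean distance, the linear-scan neighborhood query, and the label constants are likewise illustrative choices.

```python
import math
from collections import deque

UNCLASSIFIED, NOISE = -2, -1  # illustrative label constants

def eps_neighborhood(db, o, eps):
    """Linear scan; an index structure would replace this in practice."""
    return [i for i, x in enumerate(db) if math.dist(db[o], x) <= eps]

def expand_cluster(db, labels, p, cid, eps, k):
    """Assumed variant of ExpandCluster: grow cluster cid from p if p is core."""
    if len(eps_neighborhood(db, p, eps)) < k:
        labels[p] = NOISE  # provisional; p may later turn out to be a border object
        return False
    labels[p] = cid
    queue = deque([p])
    while queue:
        cur = queue.popleft()
        nbrs = eps_neighborhood(db, cur, eps)
        if len(nbrs) < k:
            continue  # border object: member of the cluster, but not expanded
        for x in nbrs:
            if labels[x] == UNCLASSIFIED:
                labels[x] = cid
                queue.append(x)
            elif labels[x] == NOISE:
                labels[x] = cid  # former noise is a border object of this cluster
    return True

def dbscan(db, eps, k):
    labels = [UNCLASSIFIED] * len(db)
    cid = 0
    for p in range(len(db)):
        if labels[p] == UNCLASSIFIED:
            if expand_cluster(db, labels, p, cid, eps, k):
                cid += 1
    return labels

db = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0),
      (10, 0), (11, 0), (12, 0), (20, 20)]
print(dbscan(db, 1.0, 3))  # [0, 0, 0, 0, 0, 1, 1, 1, -1]
```

In this sketch, objects 0 and 5 are first marked as noise and later absorbed as border objects of their clusters, while the isolated object 8 remains noise.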

The correctness of DBSCAN can be formally proven (cf. lemmata 1 and 2 in [EKSX96], proofs in [SEKX98]). Although DBSCAN is not deterministic in a strong sense (the run of the algorithm depends on the order in which the points are stored), both the run time and the result (number of detected clusters and association of core objects to clusters) are determinate. Note that, according to the definitions above, an object may be a border object of more than one density-connected cluster. In this version, a border object is added to the first cluster in which it is a border object. Depending on the application domain, other solutions are possible. The worst-case time complexity of DBSCAN is $O(n \log n)$ assuming an efficient index, and $O(n^2)$ if no index exists.
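The dependence on the processing order can be observed on a small example in which one border object lies within ε of the core objects of two different clusters. The sketch below is an assumption-laden illustration: it uses Euclidean distance, inlines a queue-based cluster expansion, and adds a hypothetical `order` parameter that is not part of Figure 2.4.

```python
import math
from collections import deque

UNCLASSIFIED, NOISE = -2, -1  # illustrative label constants

def eps_neighborhood(db, o, eps):
    """Indices of all objects within distance eps of db[o] (Euclidean, assumed)."""
    return [i for i, x in enumerate(db) if math.dist(db[o], x) <= eps]

def dbscan(db, eps, k, order):
    """DBSCAN sketch that visits the objects in the given order (hypothetical parameter)."""
    labels = [UNCLASSIFIED] * len(db)
    cid = 0
    for p in order:
        if labels[p] != UNCLASSIFIED:
            continue
        if len(eps_neighborhood(db, p, eps)) < k:
            labels[p] = NOISE  # provisional; may become a border object later
            continue
        labels[p] = cid
        queue = deque([p])
        while queue:
            cur = queue.popleft()
            nbrs = eps_neighborhood(db, cur, eps)
            if len(nbrs) < k:
                continue  # border objects are not expanded
            for x in nbrs:
                if labels[x] == UNCLASSIFIED:
                    labels[x] = cid
                    queue.append(x)
                elif labels[x] == NOISE:
                    labels[x] = cid
                # objects already assigned to a cluster stay where they are
        cid += 1
    return labels

# Objects 2 and 4 are core objects (k = 4); object 3 is a border object of both clusters.
db = [(-1, 1), (-1, -1), (-1, 0), (0, 0), (1, 0), (1, 1), (1, -1)]
n = len(db)
fwd = dbscan(db, 1.0, 4, range(n))
rev = dbscan(db, 1.0, 4, reversed(range(n)))
print(fwd)  # [0, 0, 0, 0, 1, 1, 1]
print(rev)  # [1, 1, 1, 0, 0, 0, 0]
```

In both runs two clusters are found and the core objects 2 and 4 are separated, but the shared border object 3 joins whichever cluster is expanded first.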