2.3 Foundations of Density-Based Clustering
2.3.1 Clusters as Density Connected Sets
The key idea of “flat” density-based clustering is that for each point of a cluster the neighborhood of a given radius has to contain a minimum number of points, i.e. the density in the neighborhood has to exceed a density threshold. This threshold is determined by two user-defined input parameters: ε, specifying the size of the neighborhood, and MinPts, specifying the minimum number of points the neighborhood must contain.
Definition 2.1 (ε-neighborhood)
Let ε ∈ R. The ε-neighborhood of a point p ∈ D, denoted by Nε(p), is defined by
Nε(p) = {o ∈ D | dist(p, o) ≤ ε}.
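As a minimal sketch of Definition 2.1, the ε-neighborhood can be computed by a linear scan over the database. The function name eps_neighborhood, the toy database D and the choice of the Euclidean distance for dist are assumptions made only for illustration.

```python
import math

def eps_neighborhood(data, p, eps):
    """Return all points o in data with dist(p, o) <= eps.

    Euclidean distance is assumed here; p itself is included if it is in data.
    """
    return [o for o in data if math.dist(p, o) <= eps]

# Hypothetical toy database of 2-D points.
D = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (3.0, 3.0)]
neighbors = eps_neighborhood(D, (0.0, 0.0), eps=1.0)
print(neighbors)  # [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
```

Note that the neighborhood is closed (≤ ε), so points at distance exactly ε belong to it, and that a point is always contained in its own neighborhood.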
As claimed above, a point should be inside a cluster if its neighborhood contains at least a given number of points.
Definition 2.2 (core point)
A point q ∈ D is a core point w.r.t. ε ∈ R and MinPts ∈ N, denoted by Coreden(q), if its ε-neighborhood contains at least MinPts points, formally:
Coreden(q) ⇔ |Nε(q)| ≥ MinPts.
Note that the subscript den in the definition refers to the density parameters ε and MinPts. In the following, we omit the parameters ε and MinPts wherever the context is clear and use den instead. The core point concept is visualized in Figure 2.4(a).
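The core point property of Definition 2.2 can be sketched as a simple count over the ε-neighborhood. The names is_core and eps_neighborhood, the toy database and the Euclidean distance are illustrative assumptions.

```python
import math

def eps_neighborhood(data, p, eps):
    """All points of data within distance eps of p (Euclidean distance assumed)."""
    return [o for o in data if math.dist(p, o) <= eps]

def is_core(data, q, eps, min_pts):
    """Core(q): the eps-neighborhood of q contains at least min_pts points."""
    return len(eps_neighborhood(data, q, eps)) >= min_pts

# Hypothetical toy database of 2-D points.
D = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (2.0, 0.0), (5.0, 5.0)]
core_points = [q for q in D if is_core(D, q, eps=1.0, min_pts=3)]
print(core_points)  # [(0.0, 0.0), (1.0, 0.0)]
```

In this toy example (0.0, 1.0) and (2.0, 0.0) each have only two points in their neighborhood, so they are not core points although they lie next to one.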
A naive approach could require the core point property for each member of a cluster. However, this approach fails because there are some points on the border of the cluster (border points) that do not satisfy the core point property but are intuitively part of the cluster. In fact, a cluster has two properties: density and connectivity. The first one is captured through the core point property. The second one is captured through the following concepts.
Definition 2.3 (directly density reachable)
A point p ∈ D is directly density reachable w.r.t. ε ∈ R and MinPts ∈ N from q ∈ D, denoted by DirReachden(q, p), if q is a core point and p is in the ε-neighborhood of q, formally:
DirReachden(q, p) ⇔ Coreden(q) ∧ p ∈ Nε(q).
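The concept of direct density reachability is depicted in Figure 2.4(b). Obviously, direct density reachability is a symmetric relation for pairs of core points. However, it is not symmetric in general: a border point p in the ε-neighborhood of a core point q is directly density reachable from q, but q is not directly density reachable from p because p is not a core point.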
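The asymmetry of Definition 2.3 can be illustrated with a small sketch. The helper names, the toy database and the Euclidean distance are assumptions chosen for illustration, and both arguments of dir_reach are assumed to be points of the database.

```python
import math

def eps_neighborhood(data, p, eps):
    """All points of data within distance eps of p (Euclidean distance assumed)."""
    return [o for o in data if math.dist(p, o) <= eps]

def is_core(data, q, eps, min_pts):
    return len(eps_neighborhood(data, q, eps)) >= min_pts

def dir_reach(data, q, p, eps, min_pts):
    """DirReach(q, p): q is a core point and p lies in the eps-neighborhood of q."""
    return is_core(data, q, eps, min_pts) and math.dist(q, p) <= eps

# Hypothetical toy database: (0, 0) is a core point, (0, 1) is a border point.
D = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (2.0, 0.0), (5.0, 5.0)]
core, border = (0.0, 0.0), (0.0, 1.0)
forward = dir_reach(D, core, border, eps=1.0, min_pts=3)   # border from core
backward = dir_reach(D, border, core, eps=1.0, min_pts=3)  # core from border
print(forward, backward)  # True False
```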
Definition 2.4 (density reachable)
A point p ∈ D is density reachable from q ∈ D w.r.t. ε ∈ R and MinPts ∈ N, denoted by Reachden(q, p), if there is a chain of points p1, . . . , pn ∈ D, p1 = q, pn = p such that pi+1 is directly density reachable from pi, formally:
Reachden(q, p) ⇔ ∃p1, . . . , pn ∈ D : p1 = q ∧ pn = p ∧ ∀i ∈ {1, . . . , n−1} : DirReachden(pi, pi+1).
Density reachability is illustrated in Figure 2.4(c). It is the transitive closure of direct density reachability, but it is not symmetric in general (again, only for pairs of core points). Thus, we have captured the connectivity of core points so far. However, two border points of the same cluster C are not density reachable from each other, because neither of them is a core point. There must, however, be a core point in C from which both border points are reachable. Therefore, the following definition captures general connectivity of points within a cluster.
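Since density reachability is built from chains of direct steps, the set of points reachable from q can be sketched as a breadth-first search that only continues through core points. The function names and the toy database are illustrative assumptions. The sketch also considers only chains with at least one direct step, which is one possible reading of the trivial case n = 1.

```python
import math
from collections import deque

def eps_neighborhood(data, p, eps):
    """All points of data within distance eps of p (Euclidean distance assumed)."""
    return [o for o in data if math.dist(p, o) <= eps]

def reachable_from(data, q, eps, min_pts):
    """Set of points density reachable from q via at least one direct step."""
    reached = set()
    queue = deque([q])
    while queue:
        current = queue.popleft()
        neighbors = eps_neighborhood(data, current, eps)
        if len(neighbors) < min_pts:
            continue  # not a core point: no chain can continue from here
        for o in neighbors:
            if o not in reached:
                reached.add(o)
                queue.append(o)
    return reached

# Hypothetical toy database: five points on a line; the two end points are border points.
D = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (4.0, 0.0)]
print((4.0, 0.0) in reachable_from(D, (1.0, 0.0), eps=1.0, min_pts=3))  # True
print((1.0, 0.0) in reachable_from(D, (4.0, 0.0), eps=1.0, min_pts=3))  # False
```

The border point (4.0, 0.0) is reached from the core point (1.0, 0.0) through the chain of core points in between, whereas nothing is reachable from the border point itself.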
Definition 2.5 (density connected)
A point q ∈ D is density-connected to another point p ∈ D w.r.t. ε ∈ R
and MinPts ∈ N, denoted by Connectden(q,p), if there is an object o ∈ D
such that both p and q are density reachable from o, formally:
Connectden(q, p)⇔
∃o ∈ D : Reachden(o, q) ∧ Reachden(o, p).
Density connectivity is a symmetric relation. The concept is visualized in Figure 2.4(d).
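Definition 2.5 can be sketched by searching for a point o from which both p and q are density reachable. The helper names and the toy database are assumptions for illustration, the brute-force search over all o is chosen for clarity rather than efficiency, and only chains with at least one direct step are considered.

```python
import math
from collections import deque

def eps_neighborhood(data, p, eps):
    """All points of data within distance eps of p (Euclidean distance assumed)."""
    return [o for o in data if math.dist(p, o) <= eps]

def reachable_from(data, q, eps, min_pts):
    """Set of points density reachable from q via at least one direct step."""
    reached, queue = set(), deque([q])
    while queue:
        neighbors = eps_neighborhood(data, queue.popleft(), eps)
        if len(neighbors) >= min_pts:  # chains only continue through core points
            for o in neighbors:
                if o not in reached:
                    reached.add(o)
                    queue.append(o)
    return reached

def connected(data, q, p, eps, min_pts):
    """Connect(q, p): some o in data reaches both q and p."""
    return any({q, p} <= reachable_from(data, o, eps, min_pts) for o in data)

# Hypothetical toy database: a line of five points plus one far-away point.
D = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (4.0, 0.0), (10.0, 0.0)]
left, right, far = (0.0, 0.0), (4.0, 0.0), (10.0, 0.0)
print(connected(D, left, right, eps=1.0, min_pts=3))  # True
print(connected(D, left, far, eps=1.0, min_pts=3))    # False
```

The two border points left and right are not density reachable from each other, yet they are density connected through any of the core points between them.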
Now, the density-based notion of a cluster can be defined using the introduced concepts. Intuitively, a cluster is defined to be a set of density connected points which is maximal w.r.t. density reachability. The points in D not belonging to any of its density connected sets are defined as noise.
[Figure 2.4: Illustration of density-based clustering concepts (MinPts = 5 in all panels): (a) Coreden(q), (b) DirReachden(q, p), (c) Reachden(q, p), (d) Connectden(q, p).]
Definition 2.6 (density connected set)
A subset C ⊆ D is called a density connected set w.r.t. ε ∈ R and MinPts ∈ N, denoted by ConSetden(C), if all objects in C are density-connected, formally:
ConSetden(C) ⇔ ∀p, q ∈ C : Connectden(p, q).
Definition 2.7 (density connected cluster)
A non-empty subset C ⊆ D is called a density connected cluster w.r.t. ε ∈ R and MinPts ∈ N, denoted by Clusterden(C), if C is a density connected set and C is maximal w.r.t. density-reachability, formally:
Clusterden(C) ⇔
(1) Connectivity: ConSetden(C)
(2) Maximality: ∀p, q ∈ D : q ∈ C ∧ Reachden(q, p) ⇒ p ∈ C.
We will use the terms “density-based” and “density connected” interchangeably throughout the rest of the thesis for the clustering notion defined in Definition 2.7. Note that the density connected clustering notion is able to detect clusters of arbitrary shape and size as long as they exceed the density threshold. A flat density-based decomposition of a database is defined as follows.
algorithm DBSCAN(SetOfObjects D, Real ε, Integer MinPts)
    // each point in D is marked as UNCLASSIFIED
    generate new clusterID cid;
    for each p ∈ D do
        if p.clusterID = UNCLASSIFIED then
            if ExpandCluster(D, p, cid, ε, MinPts) then
                cid := cid + 1
            end if
        end if
    end for
Figure 2.5: The DBSCAN algorithm.
Definition 2.8 (flat density-based decomposition)
Let ε ∈ R and MinPts ∈ N. A flat density-based decomposition of D w.r.t. ε and MinPts is a decomposition Dden of D into k ≥ 1 subsets, such that k−1 subsets are density connected clusters and the k-th (possibly empty) set contains the noise points, formally:
Dden = {C1, . . . , Ck−1, N} where
¬Clusterden(N) ∧ ∀i ∈ {1, . . . , k−1} : Ci ≠ ∅ ∧ Clusterden(Ci).
Using the previously described concepts, the algorithm DBSCAN, proposed in [EKSX96], computes a flat density-based decomposition w.r.t. the user-specified parameters ε and MinPts in a single pass over the data. For that purpose, DBSCAN uses the fact that a density connected set can be detected by finding one of its core points p and computing all objects which are density reachable from p. The pseudo code of DBSCAN is depicted in Figure 2.5. The method ExpandCluster, which computes the density connected cluster starting from a given core point, is given in Figure 2.6.
The correctness of DBSCAN can be formally proven (cf. Lemmata 1 and 2 in [EKSX96], proofs in [SEKX98]). Although DBSCAN is not deterministic in a strict sense (the run of the algorithm depends on the order in which the points are stored), both the run time and the result (number of detected clusters and assignment of core objects to clusters) are determinate. The worst case time complexity of DBSCAN is O(n log n) assuming an efficient spatial index (e.g. [BKK96] or [BBJ+00]) and O(n²) if no index
boolean ExpandCluster(SetOfObjects D, Object start, Integer cid, Real ε, Integer MinPts)
    SetOfObjects seeds := Nε(start);
    if |seeds| < MinPts then
        start.clusterID := NOISE;
        return false;
    end if
    for each o ∈ seeds do
        o.clusterID := cid;
    end for
    remove start from seeds;
    while seeds ≠ ∅ do
        o := first point in seeds;
        neighbors := Nε(o);
        if |neighbors| ≥ MinPts then
            for each p ∈ neighbors do
                if p.clusterID ∈ {UNCLASSIFIED, NOISE} then
                    if p.clusterID = UNCLASSIFIED then
                        insert p into seeds;
                    end if
                    p.clusterID := cid;
                end if
            end for
        end if
        remove o from seeds;
    end while
    return true;
Figure 2.6: Method ExpandCluster.
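The pseudo code of Figures 2.5 and 2.6 can be transcribed into a short runnable sketch. This is not the implementation of [EKSX96]. The integer labels UNCLASSIFIED and NOISE, the label list used in place of a clusterID attribute, the linear-scan region_query (i.e. no spatial index) and the Euclidean distance are all assumptions made for illustration.

```python
import math

UNCLASSIFIED, NOISE = -2, -1  # hypothetical sentinel labels

def region_query(data, i, eps):
    """Indices of all points within eps of data[i] (linear scan, no spatial index)."""
    return [j for j in range(len(data)) if math.dist(data[i], data[j]) <= eps]

def expand_cluster(data, labels, start, cid, eps, min_pts):
    seeds = region_query(data, start, eps)
    if len(seeds) < min_pts:          # start is not a core point
        labels[start] = NOISE
        return False
    for o in seeds:                   # as in Figure 2.6: all seeds receive label cid
        labels[o] = cid
    seeds.remove(start)
    while seeds:
        o = seeds.pop(0)              # take and remove the first point in seeds
        neighbors = region_query(data, o, eps)
        if len(neighbors) >= min_pts: # only core points extend the cluster
            for p in neighbors:
                if labels[p] in (UNCLASSIFIED, NOISE):
                    if labels[p] == UNCLASSIFIED:
                        seeds.append(p)
                    labels[p] = cid
    return True

def dbscan(data, eps, min_pts):
    labels = [UNCLASSIFIED] * len(data)
    cid = 0
    for i in range(len(data)):
        if labels[i] == UNCLASSIFIED:
            if expand_cluster(data, labels, i, cid, eps, min_pts):
                cid += 1
    return labels

# Hypothetical toy database: two unit squares and one isolated point.
D = [(0, 0), (0, 1), (1, 0), (1, 1),
     (10, 10), (10, 11), (11, 10), (11, 11),
     (5, 5)]
labels = dbscan(D, eps=1.0, min_pts=3)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

Because region_query scans the whole database for every point, this sketch corresponds to the case without an index. A spatial index would replace only that function.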