
8.4 Density-Based Subspace Ranking

8.4.3 Ranking Interesting Subspaces

Our approach to rating the interestingness of subspaces is based on the core point property, which can be used to decide whether a subspace is interesting. Obviously, if a subspace contains no core point, it contains no dense region (cluster) and therefore no relevant information.

Observation 8.1 The number of core points of a data set D (w.r.t. a given ε and MinPts) is proportional to the number of different clusters in D and/or the size of the clusters in D and/or the density of clusters in D.

This observation can be used to rate the interestingness of subspaces. However, simply counting all the core points of each subspace does not deliver enough information. Even if two subspaces contain the same number of core points, their quality may differ considerably. Dense regions also contain border points, i.e. points which are not core points themselves but lie within the ε-neighborhood of a core point and are thus a vital part of the dense region. Therefore, it matters not only how many core points a subspace contains but also how many objects lie within the ε-neighborhoods of these core points.

Definition 8.9 (count-value of a subspace)

The count-value (w.r.t. ε ∈ R and MinPts ∈ N) of a subspace S ⊆ A, denoted by count[S], is the sum of all points lying in the ε-neighborhoods of all core points (w.r.t. ε ∈ R and MinPts ∈ N) in the subspace S, formally:

\[
\mathrm{count}[S] = \sum_{p \in D,\ \mathrm{Core}^{S}_{\mathrm{den}}(p)} \left| N^{S}_{\varepsilon}(p) \right|.
\]
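The count-value can be sketched with a brute-force range query per point. This is only an illustration, not the implementation used in this work: the names `project` and `count_value` are hypothetical, a subspace is represented as a tuple of attribute indices, the Euclidean distance is assumed, and a point is taken to be a core point if its ε-neighborhood (including the point itself) contains at least MinPts objects.

```python
from math import dist  # Euclidean distance (Python 3.8+)


def project(point, subspace):
    """Project a point onto the attribute indices listed in `subspace`."""
    return tuple(point[a] for a in subspace)


def count_value(data, subspace, eps, min_pts):
    """Sum of the eps-neighborhood sizes over all core points of `subspace`."""
    projected = [project(p, subspace) for p in data]
    total = 0
    for p in projected:
        # Brute-force range query; the neighborhood includes p itself.
        neighborhood_size = sum(1 for x in projected if dist(p, x) <= eps)
        # Assumed core point property: at least min_pts objects in range.
        if neighborhood_size >= min_pts:
            total += neighborhood_size
    return total


# Toy data: three points are close in attribute 0 but far apart in attribute 1.
data = [(0.0, 0.0), (0.1, 5.0), (0.2, 9.0), (5.0, 5.0)]
print(count_value(data, (0,), eps=0.25, min_pts=3))    # 1-dimensional subspace
print(count_value(data, (0, 1), eps=0.25, min_pts=3))  # full 2-dimensional space
```

In the toy data, the one-dimensional subspace contains three core points with three neighbors each, whereas the full space contains no core point at all, so its count-value is zero.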

If we measure the interestingness of a subspace by its count[S] value and rank all subspaces according to this quality value, a severe problem remains unaddressed.

Recall from Observation 7.2 in Section 7.1 that the ε-neighborhoods of the core points tend to exceed the boundaries of the data space with increasing dimensionality. As a consequence, the number of points expected in the ε-neighborhood of a core point decreases with each added dimension. Thus, this naive quality value favors lower dimensional subspaces over higher dimensional ones. A first solution to this problem is to introduce a scaling coefficient that takes the dimensionality of the subspace into account. We take the ratio between the count[S] value and the "virtual" count value of S we would get if all data objects were uniformly distributed in S.

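For that purpose, we compute the volume of a d-dimensional ε-neighborhood, denoted by $\mathrm{Vol}^{d}_{\varepsilon}$. If dist is the L∞-norm, $\mathrm{Vol}^{d}_{\varepsilon}$ is a hypercube and can be computed as

\[
\mathrm{Vol}^{d}_{\varepsilon} = (2\varepsilon)^{d},
\]

or, if dist is the Euclidean distance (L2-norm), $\mathrm{Vol}^{d}_{\varepsilon}$ is a hypersphere and can be computed as

\[
\mathrm{Vol}^{d}_{\varepsilon} = \frac{\sqrt{\pi^{d}}}{\Gamma(d/2 + 1)} \cdot \varepsilon^{d},
\]

where Γ(x + 1) = x · Γ(x), Γ(1) = 1 and Γ(1/2) = √π.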
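Both volume formulas can be evaluated directly with the gamma function from the Python standard library. The following is a minimal sketch; the function name `eps_volume` and the string labels for the two norms are assumptions made for illustration.

```python
from math import gamma, pi, sqrt


def eps_volume(d, eps, norm="l2"):
    """Volume of a d-dimensional eps-neighborhood for the given norm."""
    if norm == "linf":
        # Hypercube with side length 2 * eps.
        return (2.0 * eps) ** d
    if norm == "l2":
        # Hypersphere with radius eps.
        return sqrt(pi ** d) / gamma(d / 2 + 1) * eps ** d
    raise ValueError("norm must be 'linf' or 'l2'")


# For d = 2 and eps = 1 the hypersphere is the unit disc with area pi.
print(eps_volume(2, 1.0, "l2"))
# For d = 3 and eps = 0.5 the hypercube has side length 1.
print(eps_volume(3, 0.5, "linf"))
```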

If we further assume that the points are normalized within $[0, \mathrm{MAX}]^{d}$, i.e. MAX is the maximum value of each attribute, and are uniformly distributed, we expect n objects in the volume $\mathrm{MAX}^{d}$. The number of points expected to be in $\mathrm{Vol}^{d}_{\varepsilon}$ is

\[
\frac{\mathrm{Vol}^{d}_{\varepsilon} \cdot n}{\mathrm{MAX}^{d}}.
\]

Since the number of range queries in a particular subspace is n, we scale the count-value by n times the number of points expected to be in $\mathrm{Vol}^{d}_{\varepsilon}$.

Thus, the quality of a subspace can be computed as given in the following definition.

Definition 8.10 (subspace quality)

The quality of a subspace S ⊆ A, measuring the interestingness of S, is defined by:

\[
\mathrm{Quality}(S) = \frac{\mathrm{count}[S]}{n \cdot \dfrac{\mathrm{Vol}^{\dim[S]}_{\varepsilon} \cdot n}{\mathrm{MAX}^{\dim[S]}}}.
\]
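The definition translates into a few lines of code. The sketch below is illustrative only: `eps_volume` and `quality` are hypothetical names, the count-value is passed in as a precomputed number, and the boundary effects of the data space are ignored.

```python
from math import gamma, pi, sqrt


def eps_volume(d, eps, norm="l2"):
    """Volume of a d-dimensional eps-neighborhood (hypercube or hypersphere)."""
    if norm == "linf":
        return (2.0 * eps) ** d
    return sqrt(pi ** d) / gamma(d / 2 + 1) * eps ** d


def quality(count, n, dim, eps, max_value, norm="l2"):
    """Count-value scaled by the count expected under a uniform distribution."""
    # Expected number of points in one eps-neighborhood of uniform data.
    expected_per_query = eps_volume(dim, eps, norm) * n / max_value ** dim
    # n range queries are issued per subspace, one for each object.
    return count / (n * expected_per_query)


# Hypothetical values: count[S] = 9, n = 4 objects, attributes in [0, 10].
print(quality(count=9, n=4, dim=1, eps=0.25, max_value=10.0, norm="linf"))
```

Under these assumptions, a value close to 1 means the subspace looks like uniformly distributed data, while values well above 1 indicate neighborhoods that are denser than expected.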

This quality value would still favor lower dimensional subspaces. Due to the above-mentioned phenomenon, the ε-neighborhoods of many points most likely exceed the boundaries of the data space when the dimensionality increases. As a consequence, the estimation of the volume of these ε-neighborhoods using $\mathrm{Vol}^{\dim[S]}_{\varepsilon}$ is inadequate in higher dimensional spaces.

In [BBKK97] the authors show that the average volume of the intersection of the data space and a hypersphere with radius ε can be expressed as the integral of a piecewise defined function, integrated over all possible positions of the ε-neighborhood, i.e. the core points. For our implementation, we choose a less complex, commonly used heuristic to eliminate this effect, based on periodic extensions of the data space (cf. Section 8.4.4 for details). Using this heuristic, the quality criterion is robust against the dimensionality of the subspace.

For two subspaces U, V ⊆ A with U ⊃ V this quality criterion has two complementary effects which are summarized in the following lemmata:

Lemma 8.6

Let U ⊃ V. Then the following inequality holds:

count[U] ≤ count[V].

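Proof. ∀ p, x ∈ D:

\[
\begin{aligned}
& \mathrm{Core}^{U}_{\mathrm{den}}(p) \wedge x \in N^{U}_{\varepsilon}(p) \\
\Leftrightarrow\quad & \mathrm{Core}^{U}_{\mathrm{den}}(p) \wedge \mathrm{dist}(\pi_{U}(p), \pi_{U}(x)) \le \varepsilon \\
\overset{U \supset V,\ \text{Lemma 8.1}}{\Longrightarrow}\quad & \mathrm{Core}^{V}_{\mathrm{den}}(p) \wedge \mathrm{dist}(\pi_{V}(p), \pi_{V}(x)) \le \varepsilon \\
\Leftrightarrow\quad & \mathrm{Core}^{V}_{\mathrm{den}}(p) \wedge x \in N^{V}_{\varepsilon}(p).
\end{aligned}
\]

Thus, each object x contributing to count[U] also contributes to count[V]. On the other hand, the reverse implication obviously does not hold in general. In summary, we have count[U] ≤ count[V]. □

Lemma 8.7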

Let U ⊃ V. Then

count[U] = count[V] ⇒ Quality(U) ≥ Quality(V).

Proof. Through simple algebraic transformations we get

\[
\mathrm{Quality}(S) = \frac{\mathrm{count}[S] \cdot \mathrm{MAX}^{\dim[S]}}{n^{2} \cdot \mathrm{Vol}^{\dim[S]}_{\varepsilon}}.
\]

Since U ⊃ V, and we can assume MAX ≥ 2ε, it follows that $\mathrm{MAX}^{\dim[S]}$ grows faster with increasing dimensionality than $\mathrm{Vol}^{\dim[S]}_{\varepsilon}$. Thus, we can conclude from the assumption (count[U] = count[V]) that Quality(U) ≥ Quality(V). □
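The growth argument can be made explicit for the L∞-norm. The following worked step is a sketch under additional assumptions that are not stated in the proof: dist is the L∞-norm, so the ε-neighborhood is a hypercube of volume (2ε)^d, count[U] = count[V] > 0, and k = dim[U] − dim[V] ≥ 1.

```latex
\[
\frac{\mathrm{Quality}(U)}{\mathrm{Quality}(V)}
  = \frac{\mathrm{count}[U] \cdot \mathrm{MAX}^{\dim[U]}}{n^{2} \cdot (2\varepsilon)^{\dim[U]}}
    \cdot
    \frac{n^{2} \cdot (2\varepsilon)^{\dim[V]}}{\mathrm{count}[V] \cdot \mathrm{MAX}^{\dim[V]}}
  = \left( \frac{\mathrm{MAX}}{2\varepsilon} \right)^{k}
  \ge 1
  \quad \text{since } \mathrm{MAX} \ge 2\varepsilon .
\]
```

For the Euclidean distance the same conclusion holds, because the volume of the hypersphere grows by a factor of at most 2ε with each added dimension.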

The lemmata state that while navigating through the subspaces bottom-up, the count value decreases (cf. Lemma 8.6) until, at a certain point, the core points lose their core point property due to the addition of irrelevant features. The consequence of adding irrelevant features is that the quality decreases. On the other hand, as long as this is not the case, i.e. the count values are stable, the features are relevant for the clusters and the quality increases (cf. Lemma 8.7). Obviously, this is a desirable behavior of the quality measure.