4.8 Other Uses of Histograms
5.3.1 The Optimal RR for Selectivity Estimation
Histogram buckets span (axis-aligned hyper-)rectangles in attribute-value space. A cluster is a set of points: our aim is to transform it to a rectangle while making sure that the transformation “falsifies” the cluster as little as possible.
Figure 5.1: A cluster and a candidate RR
We denote the set of all rectangles in the data space as $\Re$. $\Re$ can be finite or infinite, depending on the data domain. The transformed rectangles serve as histogram buckets. In the histogram we essentially substitute the cluster C with RR. Figure 5.1 shows a cluster C and a candidate RR. We now look at clusters not as a discrete set of points but as regions with an extent in space and density, to bring rectangles and clusters into the same domain.
Definition 5.1 (Cluster Density)
Given a cluster C, we denote by |C| the volume of its extent. The density of the cluster, dens(C), is the number of objects in the cluster divided by |C|. □
5.3. CLUSTER TRANSFORMATION
Because C ≠ RR in general, substituting C with RR introduces an estimation error. Suppose that the density inside the cluster is dens(C), that the density outside the cluster is roughly 0, and that the density assigned to RR is dens(RR). Then, as a result of substituting C with RR, the following density changes occur:
• RR − C has density 0, but we estimate its density to be dens(RR).
• C − RR has density dens(C), but we estimate its density to be 0.
• C ∩ RR has density dens(C), but we estimate its density to be dens(RR).

The overall estimation error resulting from the substitution of a fixed C with RR is given by the function $\varepsilon(RR, \mathrm{dens}(RR))$. It is the sum of the errors over the three regions mentioned above:

$$
\begin{aligned}
\varepsilon(RR, \mathrm{dens}(RR)) &= \int_{RR \cup C} |\mathrm{est}(u) - \mathrm{real}(u)| \, du \\
&= \int_{RR \cap C} |\mathrm{dens}(RR) - \mathrm{dens}(C)| \, du + \int_{RR - C} \mathrm{dens}(RR) \, du + \int_{C - RR} \mathrm{dens}(C) \, du \\
&= |\mathrm{dens}(RR) - \mathrm{dens}(C)| \cdot |RR \cap C| + \mathrm{dens}(RR) \cdot |RR - C| + \mathrm{dens}(C) \cdot |C - RR| \qquad (5.1)
\end{aligned}
$$
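As an illustration, Equation (5.1) can be evaluated directly once the volumes of the three regions are known. The following is a minimal Python sketch under that assumption; the function name `substitution_error` and its parameter names are hypothetical and do not appear in the original text, and the sample numbers are invented purely for the example.

```python
def substitution_error(dens_rr, dens_c, vol_inter, vol_rr_only, vol_c_only):
    """Evaluate the last line of Equation (5.1).

    dens_rr     -- density assigned to the rectangle RR
    dens_c      -- density of the cluster C
    vol_inter   -- volume |RR ∩ C|
    vol_rr_only -- volume |RR − C|
    vol_c_only  -- volume |C − RR|
    (All names are illustrative assumptions, not the author's notation.)
    """
    values = (dens_rr, dens_c, vol_inter, vol_rr_only, vol_c_only)
    if min(values) < 0:
        raise ValueError("densities and volumes must be non-negative")
    return (abs(dens_rr - dens_c) * vol_inter  # error inside RR ∩ C
            + dens_rr * vol_rr_only            # error inside RR − C
            + dens_c * vol_c_only)             # error inside C − RR


# Invented sample values: cluster density 10, |RR ∩ C| = 6,
# |RR − C| = 2, |C − RR| = 1.
for d in (0.0, 10.0, 12.0):
    print(d, substitution_error(d, 10.0, 6.0, 2.0, 1.0))
```

For these sample values the error is smallest at dens(RR) = 10, that is, at the cluster density, and it grows again once dens(RR) exceeds dens(C).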
Definition 5.2 (Optimal RR)
A rectangle RR with density dens(RR) is called optimal (w.r.t. $\Re$), denoted by $RR = \mathrm{opt}(\Re)$, if

$$
\varepsilon(RR, \mathrm{dens}(RR)) = \min_{r \in \Re} \varepsilon(r, \mathrm{dens}(r))
$$

□

We first prove that the density of $\mathrm{opt}(\Re)$ is upper-bounded by the density of the cluster:
Lemma 5.3.1. For any cluster C with density dens(C), if $RR = \mathrm{opt}(\Re)$, then dens(RR) ≤ dens(C).
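Proof. Assume the opposite: RR is optimal, yet dens(RR) = dens(C) + α for some α > 0. Then

$$
\varepsilon(RR, \mathrm{dens}(RR)) = \alpha \, |RR \cap C| + \mathrm{dens}(C) \, |C - RR| + (\mathrm{dens}(C) + \alpha) \, |RR - C|
$$

is minimal. Take dens′(RR) = dens(C) − α instead:

$$
\varepsilon'(RR, \mathrm{dens}'(RR)) = \alpha \, |RR \cap C| + \mathrm{dens}(C) \, |C - RR| + (\mathrm{dens}(C) - \alpha) \, |RR - C|
$$

The two expressions differ only in the last term, by $2\alpha \, |RR - C|$, hence $\varepsilon'(RR, \mathrm{dens}'(RR)) < \varepsilon(RR, \mathrm{dens}(RR))$, which contradicts the assumption that $\varepsilon$ is minimal.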
Figure 5.1 illustrates why dens(RR) should not exceed dens(C). RR possibly contains regions which are not in C, and does not necessarily cover all of C. So instead of some part of C with high density, RR contains a part which has density 0.
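Using Lemma 5.3.1, we can simplify Equation (5.1). Since dens(RR) ≤ dens(C), we have |dens(RR) − dens(C)| = dens(C) − dens(RR), and since |RR ∩ C| + |C − RR| = |C|, we obtain

$$
\varepsilon(RR, \mathrm{dens}(RR)) = \mathrm{dens}(C) \cdot |C| + \mathrm{dens}(RR) \cdot (|RR - C| - |RR \cap C|) \qquad (5.2)
$$

We can now find the expression for the optimal value of dens(RR).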
Lemma 5.3.2. For a fixed rectangle RR, the value of dens(RR) which minimizes $\varepsilon(RR, \mathrm{dens}(RR))$ is given by:

$$
\mathrm{dens}(RR) =
\begin{cases}
\mathrm{dens}(C) & \text{if } |RR \cap C| > |RR - C| \\
0 & \text{otherwise}
\end{cases}
$$
Proof. In Equation (5.2), the part depending on dens(RR) is

$$
\mathrm{dens}(RR) \cdot (|RR - C| - |RR \cap C|)
$$

In case |RR − C| > |RR ∩ C|, the coefficient of dens(RR) is positive, so to minimize the term we put dens(RR) = 0. In case |RR ∩ C| > |RR − C|, the coefficient is negative, and we put dens(RR) = dens(C), which is the largest admissible value for dens(RR) according to Lemma 5.3.1.
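The case distinction of Lemma 5.3.2 can be checked numerically. The sketch below is only an illustration under stated assumptions: the function names (`optimal_density`, `simplified_error`, `brute_force_density`) are hypothetical, the volumes are assumed to be given, and the brute-force search simply scans a grid of candidate densities in [0, dens(C)] using Equation (5.2).

```python
def optimal_density(dens_c, vol_inter, vol_rr_only):
    """Closed form of Lemma 5.3.2: dens(C) if |RR ∩ C| > |RR − C|, else 0."""
    return dens_c if vol_inter > vol_rr_only else 0.0


def simplified_error(dens_rr, dens_c, vol_c, vol_inter, vol_rr_only):
    """Equation (5.2); only valid for 0 <= dens_rr <= dens_c."""
    return dens_c * vol_c + dens_rr * (vol_rr_only - vol_inter)


def brute_force_density(dens_c, vol_c, vol_inter, vol_rr_only, steps=100):
    """Scan a grid of densities in [0, dens_c] and return the minimizer."""
    candidates = [dens_c * i / steps for i in range(steps + 1)]
    return min(
        candidates,
        key=lambda d: simplified_error(d, dens_c, vol_c, vol_inter, vol_rr_only),
    )


# Invented sample values: first a useful rectangle, then a useless one.
for vol_inter, vol_rr_only in ((6.0, 2.0), (2.0, 6.0)):
    closed = optimal_density(10.0, vol_inter, vol_rr_only)
    scanned = brute_force_density(10.0, 7.0, vol_inter, vol_rr_only)
    print(vol_inter, vol_rr_only, closed, scanned)
```

Because Equation (5.2) is linear in dens(RR), the minimum over the interval always sits at one of its two endpoints, which is why the grid scan and the closed form agree.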
The first implication of this lemma is that if |RR − C| > |RR ∩ C|, then the rectangle RR is not useful and can be omitted: RR is useless when the space contained in RR but not belonging to C is larger than the common part of C and RR (Figure 5.1). However, when |RR ∩ C| > |RR − C|, the best strategy is to minimize the estimation error for the region RR ∩ C, which is achieved by putting dens(RR) = dens(C). Below, we always consider RRs which satisfy |RR ∩ C| > |RR − C| and whose density equals dens(C). Finding the optimal RR is not straightforward, however. Before turning to optimal RRs, we discuss some “obvious” RRs, such as the minimal bounding rectangle.
Definition 5.3 (Enclosing Rectangles)
We denote the set of all rectangles which enclose C by $\Re^{+}_{C}$:

$$
\Re^{+}_{C} = \{ R \mid R \in \Re,\ C \subseteq R \} \qquad (5.3)
$$

□

Obviously, the minimal bounding rectangle of C is in $\Re^{+}_{C}$.
Definition 5.4 (Enclosed Rectangles)
We denote the set of all rectangles enclosed in C by $\Re^{-}_{C}$:

$$
\Re^{-}_{C} = \{ R \mid R \in \Re,\ R \subseteq C \}
$$

□
The maximal inbound rectangle of a cluster is in $\Re^{-}_{C}$.

Lemma 5.3.3. $\Re^{+}_{C}$ contains a unique optimal RR or is empty.
Proof. We construct a rectangle $R_0$ such that $R_0 \subseteq R$ for all $R \in \Re^{+}_{C}$. For each dimension j, project all points of the cluster onto j and find the minimum and maximum; these give the positions of the two sides of the rectangle that bound it along dimension j. Repeating this for all dimensions yields the rectangle. Obviously, any rectangle in $\Re^{+}_{C}$ contains $R_0$. If $R_0$ satisfies the condition $|R_0 \cap C| > |R_0 - C|$, then $R_0 = \mathrm{opt}(\Re^{+}_{C})$; otherwise $\Re^{+}_{C}$ does not contain any RRs.
Consider again Figure 5.1 for an example in 2-dimensional space. To find the minimal rectangle in $\Re^{+}_{C}$, take the topmost point of the cluster and draw a line parallel to the x-axis; do the same with the lowest point. Now take the rightmost point and draw a line parallel to the y-axis, and do the same with the leftmost point. The rectangle bounded by those 4 lines is $R_0$.
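The construction of the minimal bounding rectangle amounts to taking per-dimension minima and maxima over the cluster's points. A minimal Python sketch, assuming the cluster is given as a non-empty list of equal-length coordinate tuples; the function names and the sample points are hypothetical and serve only as an illustration.

```python
import math


def minimal_bounding_rectangle(points):
    """Return (lows, highs): per-dimension minima and maxima of the points."""
    if not points:
        raise ValueError("the cluster must contain at least one point")
    dims = len(points[0])
    lows = tuple(min(p[j] for p in points) for j in range(dims))
    highs = tuple(max(p[j] for p in points) for j in range(dims))
    return lows, highs


def rectangle_volume(lows, highs):
    """Volume of an axis-aligned rectangle: product of its side lengths."""
    return math.prod(h - l for l, h in zip(lows, highs))


# Invented 2-dimensional sample cluster.
cluster = [(1, 4), (3, 2), (2, 5), (4, 3)]
lows, highs = minimal_bounding_rectangle(cluster)
print(lows, highs, rectangle_volume(lows, highs))
```

The same code works for any number of dimensions, since each dimension is handled independently of the others.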
We now proceed as follows: We first present an algorithm which finds $\mathrm{opt}(\Re^{-}_{C})$. It constructs a convex hull of the cluster and fits the largest RR into it. In practice, this approach has limitations. In particular, it is too expensive for large clusters. As an alternative, we describe a heuristic which is both fast and effective.