6.3 Decompositioning of High-Resolution Spatial Objects
6.3.4 Grouping into Gray Containers
Our grouping algorithm takes the expected access cost of the gray containers into account. The expected cost cost(Cgray) of a gray container Cgray depends on the average access probability of Cgray and on the cost of evaluating the exact byte sequence B(Cgray).
First, the access probability is computed under the assumption that the average query distribution is known for each dimension. Then, the evaluation cost is introduced, which depends heavily on the data compressor used. Finally, we introduce our cost-based grouping algorithm GroupCon, which is used for storing complex objects in an ORDBMS.
Query Distribution. For many application areas, e.g. in the field of CAD and GIS, the average query distribution can be predicted very well. Rather dense areas, e.g. a cockpit in an airplane or a big city like New York, are queried much more frequently than less dense areas. Furthermore, small, selective queries are often posed. This assumed distribution function influences our decompositioning algorithm.
First, we transform an arbitrary d-dimensional box query into a point of the 2·d-dimensional normalized data space D* (cf. Figure 71 for one-dimensional query intervals Qi). We start by normalizing the coordinates of our d-dimensional query container to ensure that all data lies within the hyper cuboid $\times_{i=1}^{d} [0,1]$. For clarity, we first examine the one-dimensional case, looking at intervals and their point transformation into the upper triangle $D^* := \{(x,y) \in [0,1]^2 \mid x \le y\}$ of the two-dimensional hyper cuboid. An interval Q = [x, y] therefore corresponds to the point $(x,y)$ with $x \le y$. Examples are visualized in Figure 71. To each of these two-dimensional points Q = (x, y) we assign a numerical value P(Q), where $0 \le P(Q) \le 1$ holds. As the probability that a query is located somewhere in the upper triangle D* is equal to one, the following equation has to hold:

$$\iint_{D^*} P(x,y)\,dx\,dy = 1$$

Figure 71: Query distribution functions Pi(x,y).
a) Complex query distribution P1(x,y), b) Simple query distribution P2(x,y)
Figure 71 shows two different query distribution functions. A potential query Q2 is very unlikely in Figure 71a and does not occur at all in Figure 71b. On the other hand, query Q1 is very likely in both cases.
Note that we used the simple query distribution function of Figure 71b throughout our experiments. In all considered application areas, the common query objects comprise only a very small portion of the data space D*. Therefore, we introduce the parameter k*, which restricts the extension of the possible query objects. For the computation of the access probability, we only consider query objects whose extensions do not exceed k*·D* in each dimension.
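The normalization, the point transformation, and the restriction by k* described above can be sketched as follows. This is a minimal illustration only: the function names, the representation of a box as a list of (lower, upper) pairs, and the coordinate order (x1, y1, ..., xd, yd) of the resulting point are assumptions, not taken from the text.

```python
from typing import List, Sequence, Tuple

Interval = Tuple[float, float]


def normalize_box(box: Sequence[Interval], space: Sequence[Interval]) -> List[Interval]:
    """Scale each [lo, hi] of `box` into [0, 1] relative to the data-space extent."""
    if len(box) != len(space):
        raise ValueError("box and space must have the same dimensionality")
    normalized = []
    for (lo, hi), (s_lo, s_hi) in zip(box, space):
        if s_lo >= s_hi or not (s_lo <= lo <= hi <= s_hi):
            raise ValueError("box must lie inside a non-degenerate data space")
        width = s_hi - s_lo
        normalized.append(((lo - s_lo) / width, (hi - s_lo) / width))
    return normalized


def to_point(norm_box: Sequence[Interval]) -> Tuple[float, ...]:
    """Map a normalized d-dimensional box to one 2*d-dimensional point
    (assumed coordinate order: x1, y1, ..., xd, yd)."""
    return tuple(value for pair in norm_box for value in pair)


def within_k_star(norm_box: Sequence[Interval], k_star: float) -> bool:
    """True if the query extension does not exceed k* in any dimension."""
    return all(y - x <= k_star for x, y in norm_box)


# Hypothetical 2-D query box in a data space of extent [0, 100] x [0, 200].
query = normalize_box([(10, 20), (0, 50)], [(0, 100), (0, 200)])
point = to_point(query)
```

Under these assumptions, a query is considered for the access probability only if `within_k_star` holds for it.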
Access Probability. The access probability P(Cgray) related to a container object Cgray denotes the probability that an arbitrary query object intersects the d-dimensional hull H(Cgray). In the one-dimensional case, all possible query intervals that intersect a container C0 = [a0, b0] are visualized by the shaded area A(C0) in Figure 72a. The area displays all intervals whose lower bounds are smaller than or equal to b0 and whose upper bounds are larger than or equal to a0. These query intervals are exactly the ones that have a non-empty intersection with C0. The probability that an interval C0 = [a0, b0] is intersected by an arbitrary query interval is:

$$P(C_0) = \frac{\iint_{A(C_0)} P(x,y)\,dx\,dy}{\iint_{D^*} P(x,y)\,dx\,dy} = \iint_{A(C_0)} P(x,y)\,dx\,dy$$

Assuming that the dimensions of the data space are independent of each other, the access probability derived for the one-dimensional data space can easily be extended to an arbitrary number of dimensions. The probability for the multi-dimensional case is equal to the product of the one-dimensional probabilities, which can be derived for each dimension individually.
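The ratio of the two integrals can be approximated numerically. The sketch below assumes, for illustration only, that the simple query distribution weights all query intervals of extension at most k* equally and all others with zero; the function names and the midpoint-rule integration are choices of this sketch, not part of the text.

```python
from math import prod
from typing import Sequence, Tuple

Interval = Tuple[float, float]


def access_probability_1d(a0: float, b0: float, k_star: float = 1.0,
                          steps: int = 10_000) -> float:
    """Weight of A(C0) divided by the weight of D* for C0 = [a0, b0].

    Assumption: uniform weight on queries [x, y] with y - x <= k_star.
    """
    if not (0.0 <= a0 <= b0 <= 1.0 and 0.0 < k_star <= 1.0):
        raise ValueError("expected 0 <= a0 <= b0 <= 1 and 0 < k_star <= 1")
    h = 1.0 / steps
    hit = total = 0.0
    for i in range(steps):
        x = (i + 0.5) * h                   # lower bound of the query interval
        y_max = min(1.0, x + k_star)        # largest admissible upper bound
        total += (y_max - x) * h            # all admissible queries starting at x
        if x <= b0:                         # lower bound must not exceed b0
            hit += max(0.0, y_max - max(x, a0)) * h   # upper bound must reach a0
    return hit / total


def access_probability(hull: Sequence[Interval], k_star: float = 1.0) -> float:
    """Product of the per-dimension probabilities (dimensions assumed independent)."""
    return prod(access_probability_1d(a, b, k_star) for a, b in hull)
```

With k* = 1, a container covering the whole normalized space is hit by every query, and a smaller hull yields a smaller access probability, which is the effect the grouping algorithm trades off against the evaluation cost.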
Evaluation Cost. Furthermore, the expected query cost depends on the cost of evaluating the byte sequence stored in the BLOB of an intersected gray container Cgray. Evaluating the BLOB content requires loading the BLOB from disk and decompressing the data. Consequently, the evaluation cost depends on both the size V(Cgray) of the uncompressed BLOB and the size Vcomp(Cgray) << V(Cgray) of the compressed data. Additionally, the evaluation cost $cost_{eval}$ depends on a constant $c_{load}^{I/O}$ related to the retrieval of the BLOB from secondary storage, a constant $c_{decomp}^{cpu}$ related to the decompression of the BLOB, and a constant $c_{test}^{cpu}$ related to the intersection test. The costs $c_{decomp}^{cpu}$ and $c_{load}^{I/O}$ depend heavily on how we organize B(Cgray) within our BLOB, i.e. on the compression algorithm used. A highly effective but not very time-efficient packer, e.g. BZIP2, would cause low loading cost but high decompression cost. In contrast, using no compression technique leads to very high loading cost but no decompression cost. Our QSDC is an effective and very efficient compression algorithm which yields a good trade-off between the loading and decompression cost. Finally, $c_{test}^{cpu}$ depends solely on the system used. The overall evaluation cost is defined by the following formula:

$$cost_{eval}(C_{gray}) = \dots$$

Figure 72: Computation of average access probabilities of gray containers. a) Intersection area A(C0) for the one-dimensional container C0=[a0,b0], b) Intersection areas A(C1) and A(C2) for the decomposed container objects C1 and C2
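As a rough illustration of how the two sizes and the three constants could combine, the following sketch assumes a linear model in which loading and decompression scale with the compressed size Vcomp and the intersection test scales with the uncompressed size V. This linear form, the names, and all numeric values are assumptions made for illustration, not the formula from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostConstants:
    c_load_io: float     # assumed cost per compressed byte loaded from disk
    c_decomp_cpu: float  # assumed cost per compressed byte decompressed
    c_test_cpu: float    # assumed cost per uncompressed byte tested


def cost_eval(v: int, v_comp: int, c: CostConstants) -> float:
    """Assumed linear model: load and decompress on v_comp, test on v."""
    if v < 0 or v_comp < 0:
        raise ValueError("sizes must be non-negative")
    return v_comp * c.c_load_io + v_comp * c.c_decomp_cpu + v * c.c_test_cpu


# Hypothetical settings for a 1000-byte uncompressed byte sequence.
no_compression = cost_eval(1000, 1000, CostConstants(1.0, 0.0, 0.125))
strong_but_slow = cost_eval(1000, 100, CostConstants(1.0, 8.0, 0.125))
balanced = cost_eval(1000, 200, CostConstants(1.0, 1.0, 0.125))
```

With these made-up numbers, the uncompressed variant pays mostly for loading, the strong but slow packer pays mostly for decompression, and the balanced compressor ends up cheapest, which mirrors the trade-off described above.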
Grouping Algorithm. Orenstein [Ore 89] introduced the size-bound and the error-bound decomposition approaches. Our first grouping rule, "the number of gray containers should be small", can be met by applying the size-bound approach, while applying the error-bound approach results in the second rule, "the dead area of all gray containers should be small". To fulfill both rules, we introduce the following top-down grouping algorithm for gray containers, called GroupCon (cf. Figure 73). GroupCon is a recursive algorithm which starts with an approximation Ogray = (id, 〈Cgray〉), i.e. we approximate the object by one gray container. In each step of the algorithm, we look for the maximum gap g within the bounding box of the current gray container. We carry out the split along this gap if the average query cost caused by the decomposed containers is smaller than the cost caused by our input container Cgray. The expected cost of a gray container Cgray is the product of its access probability P(Cgray) and its evaluation cost costeval(Cgray), both computed as described in the preceding paragraphs. A gray container that is reported by GroupCon is stored in the database and is no longer taken into account in the next recursion step. Data compressors with a high compression rate and a fast decompression method result in an early stop of GroupCon, generating a small number of gray containers. Our experimental evaluations suggest that this grouping algorithm yields results which are very close to the optimal ones for many combinations of index structures, data compression techniques and data space resolutions.
ALGORITHM GroupCon (Cgray, P)
BEGIN
  container_list := split_at_maximum_gap(Cgray);
  costgray := P(Cgray) • costeval(Cgray);
  costdec := 0;
  FOR EACH c IN container_list DO
    costdec := costdec + P(c) • costeval(c);
  END FOR;
  IF costgray > costdec THEN
    FOR EACH c IN container_list DO
      GroupCon (c, P);
    END FOR;
  ELSE
    report (Cgray);
  END IF;
END.
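The recursion of GroupCon can be sketched in Python for the one-dimensional case. Everything beyond the control flow of the pseudocode is an assumption of this sketch: a container is modelled as a sorted tuple of disjoint covered pieces, the access probability and evaluation cost are passed in as functions, the reported containers are returned as a list instead of being stored in a database, and the two toy functions are stand-ins chosen only to make the example runnable.

```python
from typing import Callable, List, Tuple

Interval = Tuple[float, float]     # one covered piece [lo, hi] of the object
Container = Tuple[Interval, ...]   # sorted, disjoint pieces in one gray container


def hull_length(c: Container) -> float:
    """Extent of the hull, from the first lower bound to the last upper bound."""
    return c[-1][1] - c[0][0]


def split_at_maximum_gap(c: Container) -> List[Container]:
    """Split at the widest gap between neighbouring pieces; no split for one piece."""
    if len(c) < 2:
        return [c]
    gaps = [c[i + 1][0] - c[i][1] for i in range(len(c) - 1)]
    widest = max(range(len(gaps)), key=gaps.__getitem__)
    return [c[: widest + 1], c[widest + 1:]]


def group_con(c: Container,
              access_prob: Callable[[Container], float],
              cost_eval: Callable[[Container], float]) -> List[Container]:
    """Recursively split c while the decomposition has lower expected cost."""
    parts = split_at_maximum_gap(c)
    cost_gray = access_prob(c) * cost_eval(c)
    cost_dec = sum(access_prob(p) * cost_eval(p) for p in parts)
    if cost_gray > cost_dec:
        reported: List[Container] = []
        for p in parts:
            reported.extend(group_con(p, access_prob, cost_eval))
        return reported
    return [c]                      # corresponds to report(Cgray)


def toy_access_prob(c: Container) -> float:
    """Stand-in: grows with the hull extent, capped at 1 (illustrative only)."""
    return min(1.0, hull_length(c) + 0.2)


def toy_cost_eval(c: Container) -> float:
    """Stand-in: fixed overhead plus a term linear in the hull extent."""
    return 10.0 + 50.0 * hull_length(c)


obj = ((0.0, 0.1), (0.12, 0.2), (0.8, 0.9))
groups = group_con(obj, toy_access_prob, toy_cost_eval)
```

In this toy setting the wide gap between 0.2 and 0.8 is split, while the two pieces separated by the narrow gap stay in one container because the fixed overhead of a second container outweighs the saved dead area.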