Grouping into Gray Containers - Decompositioning of High-Resolution Spatial Objects

6.3 Decompositioning of High-Resolution Spatial Objects

6.3.4 Grouping into Gray Containers

Our grouping algorithm takes the expected access cost of the gray containers into account. The expected cost cost(C_gray) related to a gray container C_gray depends on the average access probability of C_gray and on the cost of evaluating the exact byte sequence B(C_gray).

First, the access probability is computed under the assumption that the average query distribution is known for each dimension. Then, the evaluation cost is introduced, which heavily depends on the data compressor used. Finally, we introduce our cost-based grouping algorithm GroupCon, which is used for storing complex objects in an ORDBMS.

Query Distribution. For many application areas, e.g. in the field of CAD and GIS, the average query distribution can be predicted very well. It is obvious that rather dense areas, e.g. a cockpit in an airplane or a big city like New York, are queried much more frequently than less dense areas. Furthermore, small selective queries are often posed. This assumed distribution function influences our decompositioning algorithm.

First, we transform an arbitrary d-dimensional box query into a 2·d-dimensional normalized data space D* (cf. Figure 71 for one-dimensional query intervals Q_i). We start by normalizing the coordinates of our d-dimensional query container to ensure that all data lies within the hyper cuboid $\times_{i=1}^{d} [0,1]$. For clarity, we first examine the one-dimensional case, looking at intervals and their point transformation into the upper triangle $D^* := \{(x,y) \in [0,1]^2 \mid x \le y\}$ of the two-dimensional hyper cuboid. An interval Q = [x, y] therefore corresponds to the point (x, y) with x ≤ y. Examples are visualized in Figure 71. To each of these two-dimensional points Q = (x, y) we assign a numerical value P(Q), where 0 ≤ P(Q) ≤ 1 holds. As the probability that a query is located somewhere in the upper triangle D* is equal to one, the following equation has to hold:

$$\iint_{D^*} P(x,y)\,dx\,dy = 1$$

Figure 71: Query distribution functions P_i(x,y). a) Complex query distribution P₁(x,y), b) Simple query distribution P₂(x,y)

Figure 71 shows two different query distribution functions. A potential query Q₂ is very unlikely in Figure 71a and does not occur at all in Figure 71b. On the other hand, query Q₁ is very likely in both cases.

Note that we used the simple query distribution function of Figure 71b throughout our experiments. In all considered application areas, the common query objects comprise only a very small portion of the data space D*. Therefore, we introduce the parameter k*, which restricts the extension of the possible query objects. For the computation of the access probability, we only consider query objects whose extensions do not exceed k*·D* in each dimension.
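The simple query distribution can be made concrete with a small sketch. It assumes a discretized one-dimensional data space with integer interval bounds and a uniform probability mass over all query intervals whose extension stays within the bound; the uniformity, the function name `simple_query_distribution`, and the parameters `resolution` and `max_extent` (a discrete stand-in for k*) are assumptions made for illustration only.

```python
from fractions import Fraction

def simple_query_distribution(resolution, max_extent):
    """Assumed discrete analogue of P2: uniform mass over all query
    intervals [x, y] with 0 <= x <= y < resolution and y - x <= max_extent.
    Intervals missing from the returned dict have probability 0."""
    allowed = [(x, y)
               for x in range(resolution)
               for y in range(x, resolution)
               if y - x <= max_extent]
    mass = Fraction(1, len(allowed))
    return {q: mass for q in allowed}

p2 = simple_query_distribution(resolution=8, max_extent=2)

# The probabilities over the upper triangle sum to one.
total_mass = sum(p2.values())

# A short interval receives positive mass, a long one does not occur.
p_short = p2.get((1, 2), Fraction(0))
p_long = p2.get((0, 7), Fraction(0))
print(total_mass, p_short, p_long)
```

Exact fractions are used so that the normalization condition can be checked without rounding error.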

Access Probability. The access probability P(C_gray) related to a container object C_gray denotes the probability that an arbitrary query object intersects the d-dimensional hull H(C_gray). All possible query intervals that intersect C₀ are visualized by the shaded area A(C₀) in Figure 72a. The area displays all intervals whose lower bounds are smaller than or equal to b₀ and whose upper bounds are larger than or equal to a₀. These query intervals are exactly the ones that have a non-empty intersection with C₀. The probability that an interval C₀ = [a₀, b₀] is intersected by an arbitrary query interval is:

$$P(C_0) = \frac{\iint_{A(C_0)} P(x,y)\,dx\,dy}{\iint_{D^*} P(x,y)\,dx\,dy} = \iint_{A(C_0)} P(x,y)\,dx\,dy$$

Assuming that the dimensions of the data space are independent of each other, the access probability derived for the one-dimensional data space can easily be extended to an arbitrary number of dimensions. The probability for the multi-dimensional case is equal to the product of the one-dimensional probabilities, which can be derived for each dimension individually.
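As a minimal sketch of this computation, the following code counts the query intervals that fall into the intersection area of a container and multiplies the per-dimension results. It assumes a discretized data space and a uniform distribution over all query intervals of bounded extension; the function names and the parameters `resolution` and `max_extent` are hypothetical.

```python
from fractions import Fraction
from math import prod

def access_probability_1d(a, b, resolution, max_extent):
    """Probability that a query interval [x, y] intersects [a, b], assuming
    a uniform distribution over all integer intervals with
    0 <= x <= y < resolution and y - x <= max_extent."""
    total = hits = 0
    for x in range(resolution):
        for y in range(x, min(resolution, x + max_extent + 1)):
            total += 1
            # Intersection condition: lower bound <= b and upper bound >= a.
            if x <= b and y >= a:
                hits += 1
    return Fraction(hits, total)

def access_probability(box, resolution, max_extent):
    """Multi-dimensional case: product of the one-dimensional probabilities,
    relying on the independence assumption for the dimensions."""
    return prod(access_probability_1d(a, b, resolution, max_extent)
                for a, b in box)

p_1d = access_probability_1d(3, 4, resolution=8, max_extent=2)
p_2d = access_probability([(3, 4), (3, 4)], resolution=8, max_extent=2)
p_all = access_probability_1d(0, 7, resolution=8, max_extent=2)
print(p_1d, p_2d, p_all)
```

A container that covers the whole data space is hit by every query, so its access probability is one.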

Figure 72: Computation of average access probabilities of gray containers. a) Intersection area for the one-dimensional container C₀=[a₀,b₀], b) Intersection area for the decomposed container objects C₁ and C₂

Evaluation Cost. Furthermore, the expected query cost depends on the cost of evaluating the byte sequence stored in the BLOB of an intersected gray container C_gray. Evaluating the BLOB content requires loading the BLOB from disk and decompressing the data. Consequently, the evaluation cost depends on both the size V(C_gray) of the uncompressed BLOB and the size V_comp(C_gray) << V(C_gray) of the compressed data. Additionally, the evaluation cost cost_eval depends on a constant c_load^{I/O} related to the retrieval of the BLOB from secondary storage, a constant c_decomp^{cpu} related to the decompression of the BLOB, and a constant c_test^{cpu} related to the intersection test. The costs c_load^{I/O} and c_decomp^{cpu} heavily depend on how we organize B(C_gray) within our BLOB, i.e. on the compression algorithm used. A highly effective but not very time-efficient packer, e.g. BZIP2, would cause low loading cost but high decompression cost. In contrast, using no compression technique leads to very high loading cost but no decompression cost. Our QSDC is an effective and very efficient compression algorithm which yields a good trade-off between the loading and decompression cost. Finally, c_test^{cpu} solely depends on the system used. The overall evaluation cost is defined by the following formula:

cost_eval(C_gray) = …
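How the two sizes and the three constants interact can be illustrated with a small sketch. It assumes, purely for illustration, a linear model in which loading scales with the compressed size while decompression and testing scale with the uncompressed size. This combination, the class name `CostConstants`, and all numeric values are assumptions and are not taken from the text.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CostConstants:
    c_load_io: float     # cost per byte loaded from secondary storage
    c_decomp_cpu: float  # cost per byte decompressed
    c_test_cpu: float    # cost per byte of the intersection test

def cost_eval(v_uncompressed, v_compressed, k):
    """Assumed linear cost model (illustrative only): loading is charged on
    the compressed size, decompression and test on the uncompressed size."""
    return (v_compressed * k.c_load_io
            + v_uncompressed * (k.c_decomp_cpu + k.c_test_cpu))

# Hypothetical profiles for a BLOB of 1000 uncompressed bytes.
no_compression = cost_eval(1000, 1000, CostConstants(10, 0, 1))
strong_packer = cost_eval(1000, 100, CostConstants(10, 4, 1))
print(no_compression, strong_packer)
```

With these made-up numbers the packer trades a higher decompression cost for a much lower loading cost, which mirrors the trade-off described for BZIP2 versus storing the data uncompressed.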


Grouping Algorithm. Orenstein [Ore 89] introduced the size-bound and the error-bound decomposition approach. Our first grouping rule, “the number of gray containers should be small”, can be met by applying the size-bound approach, while applying the error-bound approach results in the second rule, “the dead area of all gray containers should be small”. To fulfill both rules, we introduce the following top-down grouping algorithm for gray containers, called GroupCon (cf. Figure 73). GroupCon is a recursive algorithm which starts with an approximation O_gray = (id, 〈C_gray〉), i.e. we approximate the object by one gray container. In each step of the algorithm, we look for the maximum gap g within the bounding box of the current gray container. We carry out the split along this gap if the average query cost caused by the decomposed containers is smaller than the cost caused by our input container C_gray. The expected cost related to a gray container C_gray can be computed as described in the preceding paragraphs. A gray container which is reported by the GroupCon algorithm is stored in the database and no longer taken into account in the next recursion step. Data compressors which have a high compression rate and a fast decompression method result in an early stop of the GroupCon algorithm, generating a small number of gray containers. Our experimental evaluations suggest that this grouping algorithm yields results which are very close to the optimal ones for many combinations of index structures, data compression techniques and data space resolutions.

ALGORITHM GroupCon (C_gray, P)
BEGIN
    container_list := split_at_maximum_gap(C_gray);
    cost_gray := P(C_gray) · cost_eval(C_gray);
    cost_dec := 0;
    FOR EACH c in container_list DO
        cost_dec := cost_dec + P(c) · cost_eval(c);
    END FOR;
    IF cost_gray > cost_dec THEN
        FOR EACH c in container_list DO
            GroupCon (c, P);
        END FOR;
    ELSE
        report (C_gray);
    END IF;
END.
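The recursion can be tried out with a small one-dimensional sketch. It represents a gray container as a sorted list of occupied cells and uses a toy cost model; the constants `RESOLUTION`, `QUERY_EXTENT` and `OVERHEAD`, the concrete forms of `access_prob` and `cost_eval`, and the explicit stop for containers without a gap are assumptions added for illustration and are not part of the pseudocode above.

```python
from fractions import Fraction

RESOLUTION = 100    # hypothetical size of the one-dimensional data space
QUERY_EXTENT = 10   # hypothetical average query extension
OVERHEAD = 20       # hypothetical fixed cost per BLOB access

def extent(cells):
    """Length of the hull of a sorted list of occupied cells."""
    return cells[-1] - cells[0] + 1

def access_prob(cells):
    """Toy access probability: grows with the hull extent."""
    return Fraction(min(extent(cells) + QUERY_EXTENT, RESOLUTION), RESOLUTION)

def cost_eval(cells):
    """Toy evaluation cost: fixed overhead plus hull extent."""
    return OVERHEAD + extent(cells)

def split_at_maximum_gap(cells):
    """Split at the widest gap between neighboring cells; None if no gap."""
    gaps = [(cells[i + 1] - cells[i], i) for i in range(len(cells) - 1)]
    if not gaps:
        return None
    width, i = max(gaps, key=lambda g: g[0])
    if width <= 1:
        return None
    return [cells[:i + 1], cells[i + 1:]]

def group_con(cells, report):
    """Recursive top-down grouping following the GroupCon pseudocode."""
    parts = split_at_maximum_gap(cells)
    if parts is None:  # assumed base case: nothing left to split
        report(cells)
        return
    cost_gray = access_prob(cells) * cost_eval(cells)
    cost_dec = sum(access_prob(c) * cost_eval(c) for c in parts)
    if cost_gray > cost_dec:
        for c in parts:
            group_con(c, report)
    else:
        report(cells)

result = []
group_con([1, 2, 3, 40, 41, 45, 90], result.append)
print(result)
```

In this toy setting the two wide gaps are split, while the small gap between cells 41 and 45 is kept inside one container because the fixed per-access overhead outweighs the saved dead area.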

In document Pfeifle, Martin (2004): Spatial Database Support for Virtual Engineering. Dissertation, LMU München: Fakultät für Mathematik, Informatik und Statistik (Page 152-156)