4.1 Basic Concepts of UB-Trees

The UB-Tree [Bay97, Mar99] is a clustering index for multidimensional point data, which inherits all the properties of the B+-Tree [BM72]. Logarithmic performance guarantees are given for the basic operations of insertion, deletion, and point query, and a page utilization of 50% is guaranteed. It uses the Z-curve to map the multidimensional space to a one-dimensional space, partitions this space, and indexes the partitions with a B+-Tree.
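
The mapping performed by the Z-curve can be illustrated by bit interleaving. The following is a minimal sketch, not the implementation used in the cited work; the function names and the bit order (dimension 0 supplies the least significant bit of each bit group) are assumptions made for illustration.

```python
def z_address(point, bits):
    """Interleave the coordinate bits of `point` into a single Z-address.

    Assumption (not fixed by the text): dimension 0 supplies the least
    significant bit of each group of interleaved bits.
    """
    dims = len(point)
    z = 0
    for bit in range(bits):
        for dim, coord in enumerate(point):
            # Move bit `bit` of this coordinate to its interleaved position.
            z |= ((coord >> bit) & 1) << (bit * dims + dim)
    return z


def z_inverse(z, dims, bits):
    """Recover the coordinates from a Z-address produced by z_address."""
    coords = [0] * dims
    for bit in range(bits):
        for dim in range(dims):
            coords[dim] |= ((z >> (bit * dims + dim)) & 1) << bit
    return tuple(coords)
```

Under this bit order the cells (0, 0), (1, 0), (0, 1), and (1, 1) receive the addresses 0, 1, 2, and 3, which traces the characteristic Z shape. The resulting integer can serve directly as the key of a one-dimensional index.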

Its sophisticated algorithms for multidimensional range queries [Mar99, Ram02] offer excellent properties for multidimensional applications like DWH, GIS, archiving systems, temporal data management, etc. We have shown that integrating the UB-Tree into an RDBMS providing a B+-Tree can be done with reasonably small effort [RMF+00, Ram02], since the UB-Tree is the multidimensional extension of the B+-Tree. The integration offers the advantages already discussed in Section 3.2.3.

Figure 4.1: A UB-Tree Partitioning the World into Z-regions

The Tetris algorithm [MZB99, Zir03] for sorted reading of multidimensional ranges allows efficient pipelining and avoids external sorting, thus delivering the first results instantly and reducing the overall processing time. Building a UB-Tree from pre-sorted input is handled by the Temptris algorithm [Zir03]. Both algorithms use a sweep line to separate the data into processed (stable) and to-be-processed parts. Combining them enables the efficient calculation of the aggregation networks required by DWHs.

Efficient processing of data organized in hierarchies is discussed in [MRB99, Pie03]. [Pie03] describes a generic model, its application to the UB-Tree, the integration of the technique into Transbase, and performance evaluations.

Thus, the term UB-Tree refers not only to the mapping and partitioning of a multidimensional universe, but also to the advanced query processing algorithms described above. The combination of these allows for efficient management and query processing of multidimensional point data.

4.1.1 Z-regions

The UB-Tree introduces the new idea of partitioning the data space into disjoint but adjacent Z-intervals, which are mapped to data pages. The Z-intervals are indexed by a B+-Tree. This allows for region-based access to the data, which is also the major difference compared to former Z-curve-based indexing techniques, e.g., [OM84, Ore90].

The Z-intervals partition the universe completely and without overlap. It is therefore sufficient to index either the ending or the starting Z-address of each Z-interval, since the starting address of a region can be calculated by incrementing the ending address of the previous region. In the following we assume that we are indexing the last address within a region. A Z-interval represents a subspace of the universe and we will refer to this space as a Z-region (see Definition 3.8 on page 31). Furthermore, as the two terms are equivalent, we will use the more intuitive one where appropriate.
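
Indexing only the last address of each region can be illustrated as follows. This is a minimal sketch in which a sorted Python list stands in for the index part of the B+-Tree; the names `find_region` and `region_interval` are hypothetical, and the first region is assumed to start at address 0.

```python
from bisect import bisect_left


def find_region(region_ends, z):
    """Return the index of the Z-region whose interval contains address z.

    `region_ends` is the sorted list of the last Z-address of every region;
    its final entry is assumed to be the largest address of the universe.
    """
    i = bisect_left(region_ends, z)  # first region whose end is >= z
    if i == len(region_ends):
        raise ValueError("address lies outside the universe")
    return i


def region_interval(region_ends, i):
    """Reconstruct (start, end) of region i from the stored end addresses."""
    # The start is the end of the previous region plus one (0 for region 0).
    start = 0 if i == 0 else region_ends[i - 1] + 1
    return start, region_ends[i]
```

For the end addresses `[15, 63, 170, 255]`, address 16 falls into region 1, whose interval is reconstructed as (16, 63) without any stored start address.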

Page overflows during insertion are handled by splitting the affected page in the middle with respect to the tuples, i.e., we calculate a splitting Z-address σ that partitions the tuples on the page into two sets of equal size according to their Z-addresses. The first half of the tuples from the old page is moved to a newly created page corresponding to the Z-interval indexed by the separating Z-address σ. The old page is updated, the new page is stored, and the new split address referring to the new page is inserted into the B+-Tree. An underflow during deletion is handled by merging two adjacent regions, i.e., by the default algorithm of the B+-Tree.
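
The split step can be sketched as follows. The function `split_page` is a hypothetical helper; it assumes that the tuples on a page are kept sorted by Z-address, that all Z-addresses on the page are distinct, and, as the simplest possible choice, that σ is the largest Z-address of the lower half.

```python
def split_page(page, old_end):
    """Split an overflowing page in the middle with respect to its tuples.

    `page` is a list of (z_address, row) pairs sorted by z_address and
    `old_end` is the last Z-address of the region stored on that page.
    Returns ((sigma, new_page), (old_end, old_page)), where the new page
    takes the first half of the tuples and is indexed by sigma.
    """
    if len(page) < 2:
        raise ValueError("a page needs at least two tuples to be split")
    mid = len(page) // 2
    lower, upper = page[:mid], page[mid:]
    # Simplest valid separator (an assumption): the last address that is
    # still covered by the lower half.
    sigma = lower[-1][0]
    return (sigma, lower), (old_end, upper)
```

After the split, the pair `(sigma, new_page)` is what gets inserted into the index, while the old page keeps its previous end address.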

Formally, the structure of the UB-Tree can be defined as a set of k Z-intervals as follows:

Definition 4.1 (UB-Tree-Structure)

A UB-Tree is the partitioning of a multidimensional universe Ω into a set of intervals $U(\Omega) = \{[\sigma_1, \epsilon_1], [\sigma_2, \epsilon_2], \ldots, [\sigma_k, \epsilon_k]\}$ on the Z-curve, where $\sigma_{i+1} = \epsilon_i + 1 \;\; \forall i \in [1, k-1]$.

The intervals are indexed by a B+-Tree on the interval end and correspond to data pages.

This is a B+-Tree, where the key of a point $\vec{p}$ is calculated by the Z-function $Z(\vec{p})$.

Furthermore, the split addresses (the separators stored in the index part) are chosen to take the region shape into account, i.e., to create regions with as few fringes as possible.
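
One conceivable way of taking the region shape into account is sketched below. The heuristic is an assumption made for illustration and is not confirmed by the cited work: among all admissible separators it prefers the one with the most trailing one-bits, since a region ending at such an address ends exactly at the border of a large aligned block of the Z-curve. The name `choose_separator` is hypothetical.

```python
def choose_separator(last_lower, first_upper, bits):
    """Pick a split address sigma with last_lower <= sigma < first_upper.

    `last_lower` is the largest Z-address of the lower half of the page and
    `first_upper` the smallest Z-address of the upper half.
    Heuristic (an assumption): maximize the number of trailing one-bits.
    """
    if not last_lower < first_upper:
        raise ValueError("the two halves must have distinct Z-addresses")
    for k in range(bits, -1, -1):
        # Smallest address >= last_lower whose lowest k bits are all set.
        candidate = last_lower | ((1 << k) - 1)
        if candidate < first_upper:
            return candidate
```

For example, with 4-bit addresses and tuples at 5 and 11 on either side of the split, the sketch selects 7 (binary 0111) rather than 5, so the lower region ends at the border of an aligned block of eight addresses.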

The basic operations of insertion, deletion, and update of a tuple $\vec{p}$ are handled by calculating $Z(\vec{p})$ and performing a B+-Tree search to find the data page containing $\vec{p}$, i.e., the page with $Z(\vec{p}) \in [\sigma_i, \epsilon_i]$. Storing the tuples on data pages in address order allows for efficient binary search and splitting, while adding some cost for maintaining the order during insertion and update.
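
The trade-off of keeping a page in address order can be sketched with two parallel lists, one holding the Z-addresses and one holding the tuples. The helpers `page_insert` and `page_lookup` are hypothetical and only illustrate the idea under the assumption that a page fits in memory as a Python list.

```python
from bisect import bisect_left


def page_insert(keys, rows, z, row):
    """Insert `row` with Z-address `z` while keeping the page sorted.

    The position is found by binary search; shifting the list entries is
    the additional maintenance cost mentioned in the text.
    """
    pos = bisect_left(keys, z)
    keys.insert(pos, z)
    rows.insert(pos, row)


def page_lookup(keys, rows, z):
    """Return the row stored under Z-address `z`, or None if absent."""
    pos = bisect_left(keys, z)
    if pos < len(keys) and keys[pos] == z:
        return rows[pos]
    return None
```

Because the keys stay sorted, the middle tuple needed for a split is available by index, without sorting the page first.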

Example 4.1 (Insertion into a UB-Tree)

Insertion of a new tuple at position (0, 15) (the bottom left corner, labeled n) into the two-dimensional UB-Tree on a 16 × 16 universe as depicted in Figure 4.2(a) is performed as follows. We calculate the Z-address of this point and then locate the Z-region containing this point, which is region 6 (Figure 4.2(b)). Now we retrieve the page corresponding to this region and insert the point. With a page capacity of two points, a split is necessary, creating region 10. The separator between points e and s is chosen to create two regions with as few fringes as possible, i.e., a square for region 10 and a rectangle for region 6. Finally, the two pages are stored and a new separator for region 10 is inserted (Figure 4.2(c)).

Figure 4.2: Z-regions of a UB-Tree for a given Data Distribution

4.1.2 Range Query Processing

In order to query a UB-Tree we need additional algorithms utilizing its structure.

Range query processing is performed as described in Section 3.2.3. The core of the processing is the efficient NJI-algorithm developed by [Mar99] and described and analyzed in detail by [Ram02]. Additionally, [Ram02] provides an NJO-algorithm to avoid post-filtering of Z-regions lying completely within the query box. The complexity of both algorithms is linear in the address length, enabling efficient processing of range queries independently of the number of dimensions.
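
The overall control flow of a range query can be sketched as follows. This is not the algorithm of [Mar99, Ram02]: the step that finds the next Z-address inside the query box is replaced by a brute-force scan, `next_jump_in_naive`, which is exponential in the address length and serves only to make the sketch self-contained. All function names and the bit order of the Z-address are assumptions.

```python
from bisect import bisect_left


def z_address(point, bits):
    """Bit interleaving; dimension 0 is least significant (an assumption)."""
    dims = len(point)
    z = 0
    for bit in range(bits):
        for dim, coord in enumerate(point):
            z |= ((coord >> bit) & 1) << (bit * dims + dim)
    return z


def z_inverse(z, dims, bits):
    """Recover the coordinates from a Z-address produced by z_address."""
    coords = [0] * dims
    for bit in range(bits):
        for dim in range(dims):
            coords[dim] |= ((z >> (bit * dims + dim)) & 1) << bit
    return tuple(coords)


def in_box(point, lo, hi):
    """True if `point` lies inside the query box [lo, hi] (inclusive)."""
    return all(l <= c <= h for c, l, h in zip(point, lo, hi))


def next_jump_in_naive(z, lo, hi, bits):
    """Smallest Z-address > z inside the box, or None (brute-force stand-in)."""
    dims = len(lo)
    z_max = z_address(hi, bits)  # no address in the box exceeds Z(hi)
    for candidate in range(z + 1, z_max + 1):
        if in_box(z_inverse(candidate, dims, bits), lo, hi):
            return candidate
    return None


def regions_for_range_query(region_ends, lo, hi, bits):
    """Indices of all Z-regions intersecting the box, visited in Z-order."""
    hits = []
    current = z_address(lo, bits)  # smallest address inside the box
    while current is not None:
        i = bisect_left(region_ends, current)  # region containing `current`
        hits.append(i)
        # Skip the rest of this region and re-enter the box behind it.
        current = next_jump_in_naive(region_ends[i], lo, hi, bits)
    return hits
```

Each region is visited at most once and in Z-order; regions that do not intersect the query box are skipped entirely. The tuples on a visited page still have to be filtered against the box.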

An algorithm handling a set of range queries was developed by the author as part of his master's thesis [FMB99]. Whenever an attribute is restricted to more than one interval, this results in a set of query boxes. Processing them sequentially can cause repeated accesses to the same pages and accesses that are not in Z-order. The presented algorithm avoids this by processing all query boxes of the set simultaneously.
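
How interval restrictions give rise to a set of query boxes can be sketched as a Cartesian product over the per-attribute intervals. The helper `query_boxes` is hypothetical and covers only the generation of the boxes, not the simultaneous processing of [FMB99].

```python
from itertools import product


def query_boxes(intervals_per_dim):
    """Expand per-attribute interval lists into a list of query boxes.

    `intervals_per_dim[d]` is the list of (low, high) intervals allowed for
    attribute d. Every combination of one interval per attribute yields one
    box, returned as a (lower_corner, upper_corner) pair.
    """
    boxes = []
    for combination in product(*intervals_per_dim):
        lower = tuple(interval[0] for interval in combination)
        upper = tuple(interval[1] for interval in combination)
        boxes.append((lower, upper))
    return boxes
```

Two intervals on the first attribute and one on the second already produce two boxes, and the number of boxes grows with the product of the interval counts, which is why evaluating them one after another may touch the same pages repeatedly.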