6.12.9 Maintenance Performance

Maintenance performance depends on:

Point search

Locality of page split and page merge (page underflow handling)

When providing complexities we usually include constants where possible in order to bound them more exactly.

Point search: Insertion and deletion for the BUB-Tree are both bounded by the I/O complexity O(h), and deletion might be O(1) in the best case, when all page reads are served from cache and only a write is caused. As index entries are stored in address order, binary search can be utilized, thus requiring O(2log₂n) comparisons in the worst case. If a match or gap is found earlier, the cost is even lower. The factor of 2 stems from comparing the address with each bound of the SFC-segment, while for the UB-Tree there is only one comparison with the separator [Ram02].
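The comparison count can be illustrated with a minimal sketch. It assumes index entries of the form (low, high, page link), sorted by address and non-overlapping; the names `Entry` and `find_segment` are hypothetical and not taken from RFDBMS. Each probe compares the address with up to two bounds of the SFC-segment, which is where the factor of 2 comes from.

```python
from typing import List, Optional, Tuple

# Hypothetical index entry: (low address, high address, page link).
Entry = Tuple[int, int, str]


def find_segment(entries: List[Entry], address: int) -> Tuple[Optional[str], int]:
    """Binary search over sorted, non-overlapping SFC-segments.

    Returns (page link, comparisons); the page link is None if the
    address falls into a gap between two segments.
    """
    lo, hi = 0, len(entries) - 1
    comparisons = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        low, high, link = entries[mid]
        comparisons += 1              # compare with the lower bound
        if address < low:
            hi = mid - 1
            continue
        comparisons += 1              # compare with the upper bound
        if address > high:
            lo = mid + 1
            continue
        return link, comparisons      # low <= address <= high
    return None, comparisons          # address lies in a gap


entries = [(0, 9, "p0"), (20, 29, "p1"), (40, 49, "p2"), (60, 69, "p3")]
print(find_segment(entries, 25))      # match found on the first probe
print(find_segment(entries, 35))      # address in the gap between p1 and p2
```

A UB-Tree style search would need only the single comparison with the separator per probe, so the same loop would use about half the comparisons.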

As R-Tree index entries are not sorted, all entries of an index page have to be inspected during search, thus its cost is always O(n).

Considering the I/O performance without a split or merge, the BUB-Tree requires h page reads and one data write, and when adjusting the bounds of index entries, in the worst case h − 1 index page writes. Thus the worst case I/O performance is O(2h).

Without overlapping MBBs, the R-Tree also requires h+1 I/Os. However, with overlap, in the worst case the whole index has to be read. Only inspecting each intersecting MBB ensures that there are no duplicates during insertion, and for deletion all intersecting nodes have to be visited anyway.

Page Split I/O: The actual page split has been discussed before, so we focus on the I/O here. Without reinsertion, the resulting I/O is the same for both indexes.

After splitting a page the two resulting pages are written and one index entry is updated and a new one inserted into the parent node. If there are no further splits and adjustments, this causes 3 page writes. If adjustments propagate up to the root, h+1 writes are necessary.

If page splits propagate to the root, it results in the worst case, i.e., 2h + 1 page writes, two on each level and one for the additional root page.
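The I/O counts given above can be collected in a small cost model. This is only a sketch: the function names are hypothetical, h is taken to be the number of levels including the data page level, and the values for splits that stop at an intermediate level are an extrapolation that the text does not state explicitly.

```python
def insert_io_without_split(h: int, adjust_bounds: bool = False) -> int:
    """Page accesses for a BUB-Tree insertion that causes no split."""
    if h < 1:
        raise ValueError("h must be at least 1")
    reads = h                                  # one path from root to data page
    writes = 1 + (h - 1 if adjust_bounds else 0)  # data page + adjusted index pages
    return reads + writes


def split_page_writes(h: int, split_levels: int = 1,
                      adjust_to_root: bool = False) -> int:
    """Page writes caused by a split cascade starting at the data page."""
    if not 1 <= split_levels <= h:
        raise ValueError("split_levels must be in 1..h")
    if split_levels == h:
        return 2 * h + 1                       # two pages per level + new root
    levels_above = h - split_levels            # index levels that did not split
    parent_writes = levels_above if adjust_to_root else 1
    return 2 * split_levels + parent_writes


h = 4
print(insert_io_without_split(h, adjust_bounds=True))  # worst case 2h
print(split_page_writes(h))                            # 3 writes
print(split_page_writes(h, adjust_to_root=True))       # h + 1 writes
print(split_page_writes(h, split_levels=h))            # 2h + 1 writes
```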

However, the reinsertion of R∗-Trees causes additional I/O, i.e., a random insertion for each reinserted tuple, where the number of reinserted tuples is determined by the reinsertion rate, e.g., n · 30%.

Page Underflow I/O: To the best of our knowledge, the only paper addressing a page merge for the R-Tree is the original paper [Gut84]. The proposed solution for handling page underflows is the removal of the page with the underflow and reinsertion of its tuples.

This causes the I/O and CPU load for ⌊n · U⌋ − 1 tuples, in the worst case reading the whole index for each tuple or causing a page split for each.
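As a worked example of the tuple counts involved, the following sketch assumes that n is the page capacity in tuples and U the minimum page utilization, so that an underflowing page holds one tuple less than the minimum. The function names and the use of integer percentages are illustrative choices, not part of the original text.

```python
def forced_reinsert_count(n: int, rate_percent: int = 30) -> int:
    """Tuples reinserted by an R*-Tree on overflow, e.g. 30% of capacity n."""
    return n * rate_percent // 100


def underflow_reinsert_count(n: int, min_util_percent: int = 40) -> int:
    """Tuples reinserted when an R-Tree page underflows: floor(n * U) - 1."""
    return n * min_util_percent // 100 - 1


n = 100  # assumed page capacity in tuples
print(forced_reinsert_count(n))     # 30 random insertions per overflow
print(underflow_reinsert_count(n))  # 39 random insertions per underflow
```

Each of these reinsertions is itself a random insertion, so its cost is at least one root-to-leaf traversal.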

In contrast to this, page underflows for the UB-Tree and also the BUB-Tree are handled gracefully, i.e., they are local operations. An underflow is handled by moving tuples from a neighboring page to the page with the underflow. If all tuples fit on one page, the other page can be deleted. In the best case three page writes are necessary, two for the data pages and one for the index page requiring adjustments of the bounds to the data pages. In the worst case 2h − 1 pages have to be written, i.e., the data pages belong to different subtrees only sharing the root page.
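The local underflow handling can be sketched as follows. The sketch assumes that pages are lists of tuples in address order and that the neighbor directly follows the underflowing page; `handle_underflow` and the even redistribution policy are hypothetical and only illustrate the idea.

```python
from typing import List, Tuple


def handle_underflow(page: List[int], neighbor: List[int],
                     capacity: int) -> Tuple[List[int], List[int]]:
    """Merge or rebalance an underflowing page with its right neighbor.

    Returns the new contents of both pages; an empty second list means
    the neighbor page can be deleted.
    """
    combined = page + neighbor          # still in address order
    if len(combined) <= capacity:
        return combined, []             # everything fits on one page
    half = (len(combined) + 1) // 2     # otherwise split the tuples evenly
    return combined[:half], combined[half:]


print(handle_underflow([1], [5, 6, 7], capacity=4))
print(handle_underflow([1, 2], [5, 6, 7, 8, 9, 10], capacity=6))
```

Only the two data pages and the index entries bounding them change, which is why the operation stays local to one or two root-to-leaf paths.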

Summary: Taking all this into account, the R-Tree cannot be regarded as the multidimensional extension of the B-Tree! It fails w.r.t. maintenance! It is not suited for dynamic applications, as already pointed out by [Ram02]!

6.12.10 Query Performance

[ABH⁺04] introduces the PR-Tree, the first R-Tree variant with guaranteed worst-case performance. For a result set of size r and k data pages, it causes O(k^(1−1/d) + r/n) data page accesses in the worst case. A d-dimensional PR-Tree maps MBBs to 2d-dimensional space and utilizes a kd-Tree for an overlap-free partitioning. Levels of the kd-Tree [Ben75, Ben79] are split along successive dimensions at the indexed points. Thus, the upper bound comes directly from this partitioning, and the PR-Tree is the mapping of the kd-Tree to secondary storage.
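To get a feeling for this bound, the following sketch evaluates it for sample values. It assumes the bound reads k^(1−1/d) + r/n, with n interpreted as the number of tuples per data page; this reading, the function name and the sample numbers are assumptions, and constant factors hidden by the O-notation are ignored.

```python
def pr_tree_bound(k: int, d: int, r: int, n: int) -> float:
    """Evaluate k^(1 - 1/d) + r/n, ignoring constant factors.

    k: number of data pages, d: dimensionality,
    r: result set size, n: tuples per data page (assumed meaning).
    """
    return k ** (1 - 1 / d) + r / n


k, d, r, n = 10_000, 2, 500, 50
print(pr_tree_bound(k, d, r, n))   # roughly 110 page accesses
print(k)                           # compared to k pages for a full scan
```

For d = 2 the first term is the square root of k; with growing d the exponent 1 − 1/d approaches 1, so the guarantee gets closer to reading all k pages.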

However, the PR-Tree is not meant for dynamic applications, i.e., performance guarantees cannot be given anymore after random insertions.

Standard R-Trees cannot provide a better worst-case guarantee than that all pages might have to be read. This holds for both point and range queries and stems from the possibility of overlapping MBBs. In the best case only one page has to be read, the root page. The average performance highly depends on the actual overlap of MBBs and the partitioning of the data.

For the BUB-Tree, a point query requires at most h page reads, i.e., one path to a data page, and in the best case a single page read, i.e., the point does not exist and only the root page is accessed. For range queries, in the worst case the complete database has to be read once, even for an empty result. This happens when a query decomposes into a set of SFC-segments intersecting all regions of the tree.

Example 6.7 (Pathological UB-Tree/BUB-Tree)

A UB-Tree/BUB-Tree has to access all pages if all data points are located along a hyperplane and the query is also a hyperplane of the same orientation, but slightly shifted. However, a query corresponding to an orthogonal hyperplane will only intersect a single region, thus causing h page accesses.

This data distribution is pathological, as points do not differ w.r.t. one or more dimensions. Therefore, one should reduce the dimensionality of the UB-Tree/BUB-Tree, i.e., attributes which do not differ should not be indexed. If there are more hyperplanes containing points, the partitioning will improve, as regions will extend and will no longer be aligned w.r.t. the hyperplane of the data.

Also, selecting dimensions for indexing usually requires taking the query workload into account, i.e., only restricted dimensions should be indexed.
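Example 6.7 can be reproduced with a small simulation. The sketch below makes several assumptions that are not taken from the text: a 2-dimensional universe of 256 × 256 cells, Z-addresses built by bit interleaving with the x bit as the less significant bit of each pair, data pages of 8 tuples, and BUB-Tree style regions stored as tight intervals [first address, last address] of each page. All function names are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[int, int]


def z_address(x: int, y: int, bits: int = 8) -> int:
    """Interleave the bits of x and y (x bit i at position 2i, y at 2i+1)."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z


def build_regions(points: List[Point], capacity: int) -> List[Tuple[int, int]]:
    """Pack points in Z-order into pages; return one tight interval per page."""
    addrs = sorted(z_address(x, y) for x, y in points)
    return [(addrs[i], addrs[min(i + capacity, len(addrs)) - 1])
            for i in range(0, len(addrs), capacity)]


def regions_hit(regions: List[Tuple[int, int]], query: List[Point]) -> int:
    """Count regions containing the Z-address of at least one query cell."""
    qaddrs = [z_address(x, y) for x, y in query]
    return sum(1 for lo, hi in regions if any(lo <= q <= hi for q in qaddrs))


data = [(4, y) for y in range(256)]          # all points on the line x = 4
regions = build_regions(data, capacity=8)    # 32 data pages

shifted = [(5, y) for y in range(256)]       # same orientation, shifted by one
orthogonal = [(x, 7) for x in range(256)]    # orthogonal line y = 7

print(len(regions), regions_hit(regions, shifted), regions_hit(regions, orthogonal))
```

Under these assumptions the shifted query touches all 32 regions although its result is empty, while the orthogonal query touches a single region.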

Figure 6.8: Partitioning of Skewed Data for different Indexes. Panels: (a) UB-Tree, (b) BUB-Tree, (c) R∗-Tree

Figure 6.8 depicts such a data distribution for UB-Tree, BUB-Tree and R∗-Tree. The R∗-Tree in this figure consists of a stack of non-overlapping MBBs. In fact, the MBBs of the R-Tree are just line segments.

6.13 Experiments

In the following we present experiments covering the different aspects of performance. The tested indexes are the UB-Tree, BUB-Tree and R∗-Tree as implemented in RFDBMS (Section 4.2.3 on page 68). All indexes use the same code for tuple handling, page processing, post filtering of data page content, parsing of flat files, etc., and differ only in their actual tree structure and related algorithms. An overview of the labels used for measures is in Appendix C.

Where not explicitly stated, BUB-Tree and R∗-Tree have a minimum page utilization of 40%, conforming to [BKS⁺90], in order to allow them to perform a better partitioning of the data during a split. The BUB-Tree thresholds (Section 6.6 on page 132) were set as a percentage of the address length, e.g., Rmin = 50%, where 50% corresponds to the prefix of the address covering 50% of the address length. Specifying the threshold for the region size w.r.t. the address prefix is more convenient than specifying it in relation to the universe, and there is a one-to-one relation between them: each bit in the address corresponds to halving the data space w.r.t. its parent space. A best-split is only considered for regions that have at least one bit set in their prefix. The other BUB-Tree thresholds were G^R_min = 50%, and G^Ω_min was ignored.
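The relation between a prefix threshold and the size of the corresponding part of the universe can be made concrete with a short sketch. The function names, the rounding down to whole bits and the 32-bit address length are assumptions for illustration only.

```python
from fractions import Fraction


def prefix_bits(address_bits: int, threshold_percent: int) -> int:
    """Length of the address prefix for a threshold given in percent."""
    return address_bits * threshold_percent // 100


def universe_fraction(bits: int) -> Fraction:
    """Each prefix bit halves the space, so the fraction is 1 / 2^bits."""
    return Fraction(1, 2 ** bits)


bits = prefix_bits(address_bits=32, threshold_percent=50)
print(bits)                     # 16 prefix bits
print(universe_fraction(bits))  # 1/65536 of the universe
```

Because every additional prefix bit halves the space, a threshold given in prefix bits maps to exactly one fraction of the universe and vice versa.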

For the experiments we used a BUB-Tree implementation within RFDBMS. Z-regions were stored as intervals in the index pages, i.e., index entries look like ([σ, ], pagelink).

Best-splits were only used for data pages, but not for index pages. Chances to prune much dead space become fewer in higher levels of the index, since only gaps between lower level nodes can occur. Thus, allowing a best-split for index pages would not significantly improve the ability to prune dead space, but would only increase the number of index pages and thus the cache requirements. Therefore, we have decided to avoid this and keep the number of index pages small.

Indexes are labeled as listed in Table 6.4.

Label    Description
ub       UB-Tree
bub      BUB-Tree
rs       R∗-Tree without reinsertion
rs-30    R∗-Tree with reinsertion of 30% conforming to [BKS⁺90]

Table 6.4: Labels used for indexes

In document Advanced Concepts and Applications of the UB-Tree (Page 163-167)