Page Split - The BUB-Tree - Advanced Concepts and Applications of the UB-Tree

The BUB-Tree

6.6 Page Split

In order to minimize the covered dead space in a BUB-Tree it is important to take special care of page splits caused by a page overflow during insertion or during bulk loading.

Figure 6.4: Growth of a BUB-Tree with Segments (panels (a) to (e))

The goal is to minimize the dead space covered by the regions of the index. Splitting in the middle of a page might not satisfy this goal, since the greatest gap can lie somewhere else. However, splitting at the best position might result in a page utilization below 50% for one of the resulting pages. So, there is a tradeoff between minimizing the index-covered dead space and maximizing the page utilization. Handling this tradeoff efficiently is the topic of this section.

A gap on a data page is defined as the difference of the addresses of two consecutive points $\vec{p}_i$ and $\vec{p}_{i+1}$ which are neighbors w.r.t. address order, i.e., $\nexists\, \vec{p}_k : S(\vec{p}_i) < S(\vec{p}_k) < S(\vec{p}_{i+1})$:

$\Delta^D_i = S(\vec{p}_{i+1}) - S(\vec{p}_i)$ (6.1)

For index pages with the entries $[\sigma_i, \epsilon_i]$, a gap is defined as the difference between the end of one region and the start of the next region, i.e.:

$\Delta^I_i = \sigma_{i+1} - \epsilon_i$ (6.2)
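The two gap definitions (6.1) and (6.2) can be illustrated with a minimal sketch. It assumes that addresses are plain integers already sorted in address order and that an index entry is a `(start, end)` pair; the function names are hypothetical and not taken from the original text.

```python
from typing import List, Tuple


def data_page_gaps(addresses: List[int]) -> List[int]:
    """Gaps between consecutive tuple addresses on a data page (Eq. 6.1).

    Assumes `addresses` is sorted ascending, so neighbors in the list
    are neighbors in address order.
    """
    return [right - left for left, right in zip(addresses, addresses[1:])]


def index_page_gaps(regions: List[Tuple[int, int]]) -> List[int]:
    """Gaps between the end of one region and the start of the next (Eq. 6.2).

    Assumes each entry is a (start, end) pair and entries are sorted.
    """
    return [nxt[0] - cur[1] for cur, nxt in zip(regions, regions[1:])]
```

A page with n entries yields n − 1 gaps in both cases, which is why the split search later only has to look at neighboring pairs.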

A page with $n$ tuples resp. index entries, including the newly inserted one, is processed as follows. In the case of $n = 2$ the split is trivial, as there is no choice: each tuple goes on its own page. For $n > 2$, and honoring a minimum page utilization $U_{min}$ with $0 \le U_{min} \le 50\%$, it is necessary to calculate the gaps between tuples in the range $[s, e]$, where $s$ and $e$ are positions of tuples on the page and are defined as follows:

$s = 1 + \lfloor (n - 2) \cdot U_{min} \rfloor$ (6.3)

$e = n - \lfloor (n - 2) \cdot U_{min} \rfloor$ (6.4)

The best split is then between the tuples resp. index entries at positions $p$ and $p + 1$, with $p$ defined as:

$p \mid \Delta_p = \max\{\Delta_s, \ldots, \Delta_{e-1}\}$ (6.5)

If there is more than one maximum gap, we prefer the one causing fewer fringes, similar to the path selection during insertion. Again, this can be accomplished by comparing the addresses, i.e., choosing the gap whose bounding addresses share a longer common prefix. This is basically the $\epsilon$-split algorithm of [Mar99]. If we are not able to make a decision w.r.t. fringes, we choose the gap which is nearer to the page middle, i.e., which will result in a more even page utilization.
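The selection of the split position, including both tie-breaking rules, might be sketched as follows for a data page. This is an illustration under stated assumptions, not the original implementation: addresses are sorted integers of a fixed width `bits`, the minimum utilization is passed as an integer percentage (`u_min_percent`, chosen here to keep the floor computation exact), and all function names are hypothetical.

```python
from typing import List


def common_prefix_len(a: int, b: int, bits: int) -> int:
    """Number of leading bits shared by two `bits`-wide addresses."""
    return bits - (a ^ b).bit_length()


def best_split_position(addresses: List[int], u_min_percent: int, bits: int) -> int:
    """Return the 1-based position p; the page is split between p and p + 1.

    Assumes `addresses` is sorted ascending and 0 <= u_min_percent <= 50.
    """
    n = len(addresses)
    if n < 2:
        raise ValueError("a split needs at least two tuples")
    if not 0 <= u_min_percent <= 50:
        raise ValueError("u_min_percent must lie in [0, 50]")

    # floor((n - 2) * U_min), computed with integers to avoid float rounding
    skip = (n - 2) * u_min_percent // 100
    s, e = 1 + skip, n - skip  # Eq. 6.3 and Eq. 6.4

    def priority(p: int):
        left, right = addresses[p - 1], addresses[p]
        gap = right - left                          # larger gap wins
        prefix = common_prefix_len(left, right, bits)  # then longer prefix
        imbalance = abs(2 * p - n)                  # then nearer to the middle
        return (gap, prefix, -imbalance)

    # Candidate gaps are Delta_s .. Delta_{e-1} (Eq. 6.5)
    return max(range(s, e), key=priority)
```

For the page `[1, 2, 3, 40, 41, 42, 43, 44, 45, 100]` with 8-bit addresses, the sketch returns 9 for `u_min_percent=0` (the gap of 55 before the last tuple), 3 for `u_min_percent=20` (the gap of 37, since the candidate range shrinks to positions 2 to 8), and 5 for `u_min_percent=50` (the plain middle split).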

In the worst case, i.e., for $U_{min} = 0$, calculating the best split position requires inspecting each pair of neighboring tuples. Note that even with $U_{min} = 0$ there will always be at least one tuple on a page, since splits occur only in the gaps between tuples and $U_{min}$ is only a threshold, not the actual page utilization.

So far we have not addressed how to deal with the tradeoff between page utilization and covered dead space. Setting $U_{min} = 0$ can lead to an index that is highly degenerated w.r.t. page utilization, consisting of pages that each hold a single tuple plus one filled page.

Example 6.4 (Pathological Degeneration of a BUB-Tree)

Starting with an empty BUB-Tree we insert tuples in address order as follows. We start with $\alpha_1 = \frac{|\Omega|}{2^1}$ as the address of the first tuple. The next tuple, $i = 2$, has the address $\alpha_2 = \alpha_1 + \frac{|\Omega|}{2^2}$, for the third tuple the increment is $\frac{|\Omega|}{2^3}$, etc. Inserting tuples in this order and with these addresses while allowing $U_{min} = 0$ will always split between the first and second tuple and will cause a split with every second insertion after the first page has been filled to 100%.
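A short worked derivation shows why the largest gap always sits at the front of the page in this example. It assumes the addresses follow the pattern read from the example, i.e., the first address is |Ω|/2 and the i-th insertion adds |Ω|/2^i to the previous address.

```latex
\alpha_i = \sum_{j=1}^{i} \frac{|\Omega|}{2^{j}} = |\Omega| \left( 1 - 2^{-i} \right),
\qquad
\Delta_i = \alpha_{i+1} - \alpha_i = \frac{|\Omega|}{2^{i+1}} = 2\,\Delta_{i+1}
```

Each gap is twice as large as its successor, so on any page the gap between the first and second tuple is the maximum, and an unrestricted best split cuts off a single tuple every time.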

The tradeoff is handled gracefully when the following rules are additionally taken into account to decide whether to trigger a split at the best position instead of the page middle. With $[\sigma, \epsilon]$ bounding the region corresponding to the page that should be split, consider the following measures:

1. $\frac{\epsilon - \sigma}{|\Omega|} \ge R_{min}$: Search for a best split only if the region exceeds a certain size w.r.t. the universe size. For small regions, the normal UB-Tree split is performed without incurring the overhead of searching for a best split position.

2. When running a search for the best split position honoring $U_{min}$:

(a) $\frac{\Delta_i}{\epsilon - \sigma} \ge G^R_{min}$: Trigger a best split only if the gap covers more than a given percentage of the region size. If the points in a region are uniformly distributed w.r.t. their addresses, then all gaps have approximately the same size $\Delta_i \approx \frac{\epsilon - \sigma}{n}$. In this case there is no need to give up the 50% page utilization guarantee and thus a 50% split is performed.

(b) $\frac{\Delta_i}{|\Omega|} \ge G^\Omega_{min}$: Trigger a split only if the gap size exceeds a certain minimum size w.r.t. the universe size. This is for fine-tuning the splits, but it is not generally necessary.

(c) If no best split is triggered, a 50% split will be performed. This preserves the page utilization guarantees of UB-Trees where possible.
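The rules above amount to a chain of threshold checks, which can be sketched as a single predicate. The function name, the parameter names, and the default of 0 for the optional universe-relative threshold are assumptions made for this illustration; the thresholds are given as fractions between 0 and 1.

```python
def should_best_split(
    sigma: int,
    epsilon: int,
    gap: int,
    universe_size: int,
    r_min: float,
    g_r_min: float,
    g_omega_min: float = 0.0,
) -> bool:
    """Decide whether to split at the largest gap instead of the page middle.

    sigma, epsilon: start and end address of the region to be split
    gap:            size of the largest candidate gap found within [s, e]
    Returns False when the ordinary 50% split should be used (rule 2c).
    """
    region_size = epsilon - sigma
    if region_size <= 0:
        return False
    # Rule 1: the region must be large relative to the universe.
    if region_size / universe_size < r_min:
        return False
    # Rule 2(a): the gap must cover enough of the region.
    if gap / region_size < g_r_min:
        return False
    # Rule 2(b): optional fine tuning relative to the universe size.
    if gap / universe_size < g_omega_min:
        return False
    return True
```

In a real implementation rule 1 would be evaluated before the gaps are computed at all, since its purpose is to avoid the search overhead for small regions; the sketch folds it into the same function only for brevity.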

The overall cost of finding the best split position is bounded by the number of elements $n$ on a page and decreases linearly as the desired page utilization grows, i.e., it is $O(1)$ for $U_{min} = 50\%$ and $O(n)$ for $U_{min} = 0$. The worst-case complexity depends on the number of gaps to inspect, i.e., it is $O(e - s)$.

It is also possible to limit the described split algorithm to data pages, but by doing this we would lose chances to prune search paths earlier during an index traversal.

Finally, while being filled a page might have accommodated more than just one big gap. Therefore, one could allow a multi-split, i.e., not only one but $\lfloor \frac{1}{U_{min}} \rfloor$ splits. Having obtained the gaps, we perform splits according to their priority, i.e., first at the biggest gap, then another split on each of the two new pages if the region size and minimum page utilization allow it.

The gaps are calculated only once, but the region size is recalculated for each new region.

6.7 Deletion

Deletion is handled by a point query and on success we delete the point from the found data page. Deletion of the first or last tuple of a page triggers an update of the index entry leading to this page when minimal dead space coverage is desired. If optimal coverage after deletion is not necessary, the bounds may also be kept as they are until a page merge occurs.

Page underflows are not triggered by a page utilization below 50%, but when it is below the minimal page utilization U_min of the specific BUB-Tree.

A merge should be made with the neighboring region that is nearer w.r.t. address order, i.e., for a page $i$ we compare the distance to the previous page, $\sigma_i - \epsilon_{i-1}$, and the distance to the next page, $\sigma_{i+1} - \epsilon_i$, and choose the page with the smaller distance. The complete algorithm for selecting the right page for a merge is given in Table 6.2.
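The distance comparison can be sketched as follows. This is not the algorithm of Table 6.2, which is not reproduced here; it only illustrates the nearest-neighbor rule. Regions are assumed to be sorted `(start, end)` pairs, the function name is hypothetical, and preferring the previous page on equal distances is an arbitrary choice made for the sketch.

```python
from typing import List, Optional, Tuple


def choose_merge_neighbor(regions: List[Tuple[int, int]], i: int) -> Optional[int]:
    """Return the index of the neighbor of page i that is nearer in address order.

    Returns None if page i has no neighbor at all.
    """
    has_prev = i > 0
    has_next = i + 1 < len(regions)
    if not has_prev and not has_next:
        return None
    if not has_prev:
        return i + 1
    if not has_next:
        return i - 1
    dist_prev = regions[i][0] - regions[i - 1][1]  # start of i minus end of i-1
    dist_next = regions[i + 1][0] - regions[i][1]  # start of i+1 minus end of i
    # Tie goes to the previous page (an assumption of this sketch).
    return i - 1 if dist_prev <= dist_next else i + 1
```

Merging with the nearer neighbor keeps the gap that is absorbed into the merged region as small as possible, which is consistent with the goal of minimizing covered dead space.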

There are two cases after the merge. If the merged page has a utilization greater than 100%, a page split is performed and the referring index entries are updated.

In the other case, all tuples fit on the merged page; the index entry referring to it is corrected to the new bounds of the page and the other entry is removed. The update and removal may propagate up to the parent level of the index when they occur at the first resp. last entry of an index page. If the new page exceeds the maximum page size, it is split again and the index entries of both pages are updated.

In document Advanced Concepts and Applications of the UB-Tree (Page 148-151)