

Figure 5.3: When inserting a new data object (3, 8, 7) with TID 42 into the BB-Tree from Figure 5.1, the regular BB 2 morphs into a super BB that contains k regular nodes and partitions data objects according to dimension 2.

Example

Consider again the example from Figure 5.1. When we insert the data object (3, 8, 7) into the BB-Tree, bucket 2 overflows and morphs into the super BB shown in Figure 5.3. The super BB chooses dimension 2 as its delimiter dimension and partitions the data objects into k = 3 regular BB according to the delimiter values 2 and 6.
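The routing decision inside such a super BB can be sketched as a binary search over its delimiter values. This is only an illustration: the function name `route` is hypothetical, dimensions are assumed to be numbered from zero, and we assume that a value equal to a delimiter belongs to the right-hand bucket, which the text does not specify.

```python
from bisect import bisect_right

def route(obj, delimiter_dim, delimiters):
    """Return the index of the regular BB inside a super BB that holds obj.

    Assumption (not stated in the text): a value equal to a delimiter
    is routed to the bucket on the right of that delimiter.
    """
    return bisect_right(delimiters, obj[delimiter_dim])

# Values from the example: delimiter dimension 2, delimiter values 2 and 6,
# hence k = 3 regular BB with indices 0, 1, and 2.
bucket = route((3, 8, 7), 2, [2, 6])
print(bucket)
```

Under these assumptions, the object (3, 8, 7) has the value 7 in dimension 2 and therefore falls into the last of the three regular BB.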

5.3 Building and Reorganizing BB-Trees

Initially, BB-Trees consist of one regular BB and an empty IST, as no inner nodes are needed in the case of a single leaf node. After bmax objects have been inserted, this regular BB overflows and morphs into a super BB holding k regular BB, while the IST remains empty. With (k − 1) ∗ bmax more inserts, this super BB also overflows and triggers a rebuild of the index, creating the first level of inner nodes.
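The two thresholds follow directly from the capacities of a regular BB and a super BB. A minimal sketch of the arithmetic, in which the function name and the sample values bmax = 100 and k = 3 are illustrative assumptions rather than values from the text:

```python
def overflow_thresholds(b_max, k):
    """Return the insert counts at which the initial BB-Tree changes shape.

    morph_at:   the single regular BB is full and morphs into a super BB.
    rebuild_at: the super BB (k regular BB of capacity b_max each) is full
                and triggers the first index rebuild.
    """
    morph_at = b_max
    rebuild_at = morph_at + (k - 1) * b_max  # equals k * b_max
    return morph_at, rebuild_at

# Hypothetical parameters, chosen only for illustration.
morph_at, rebuild_at = overflow_thresholds(b_max=100, k=3)
print(morph_at, rebuild_at)
```

In other words, the first rebuild occurs once k ∗ bmax objects have been inserted in total.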

All operations on the BB-Tree, except for the very first, operate on a structure that was the result of an index rebuild. A rebuild of the IST consists of the following four steps:

• In the first step, we determine how many regular BB are needed to manage the current amount of indexed data while leaving enough capacity for new inserts. From this number, we also derive the necessary number of levels of inner nodes. By default, when rebuilding, we set the number of BB to n/(10% ∗ bmax), which allows each leaf node to ingest a further 90% ∗ bmax data objects before morphing into a super BB. This parameter (10% by default) may be changed depending on the expected workload. For insert-heavy workloads, we recommend a low value, which leads to infrequent rebuilds and thus high write performance, whereas read-heavy workloads can benefit from a high value, which leads to frequent rebuilds and ensures that the IST always reflects the current data distribution.

• We randomly sample Rsamples ∗ n data objects to obtain representatives of the whole data set. By scanning the sampled data, we estimate the number of distinct values of each dimension. The dimensions are sorted by their cardinality and assigned to the h levels of the new IST in descending order. If h is larger than the dimensionality of the data space, we assign dimensions multiple times in a round-robin fashion. For instance, when indexing a two-dimensional data set, where the second dimension contains more distinct values than the first one, with a BB-Tree of height five, we assign the dimensions to the IST levels as follows: (1, 0, 1, 0, 1).

• We determine the delimiter values for the inner nodes, starting at the root node and recursively working down to the lower tree levels. To this end, using the sample data, we compute an equi-depth frequency histogram covering the values of the delimiter dimension of the current level. We choose delimiter values such that each interval covers an almost-equal number of objects. We use an efficient greedy assignment algorithm that processes the values from left to right and always picks the next delimiter value such that approximately 1/k of the objects are covered by the current interval.

Note that, by using an equi-depth histogram, we can find partitions of almost-equal size even for dimensions in which some values occur with a higher frequency than others. However, if the differences in the frequencies of the values are very large, we inevitably end up with intervals of different size. Clearly, this procedure also fails for low-cardinality dimensions containing fewer than k distinct values, as described in Section 5.4.

• In the last step of the rebuild, according to the derived IST, all data objects are inserted into their new respective BB.
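The first three rebuild steps can be sketched as follows. This is one possible reading of the description, not the authors' implementation. All function and parameter names are hypothetical. Two details are assumptions the text leaves open: the IST height is taken as the smallest h with k^h at least the number of BB, and a delimiter is interpreted as the first value of the interval to its right.

```python
from collections import Counter

def plan_size(n, b_max, k, fill_percent=10):
    """Step 1: number of regular BB and number of inner-node levels."""
    per_bucket = fill_percent * b_max
    buckets = (n * 100 + per_bucket - 1) // per_bucket  # ceil(n / (fill * b_max))
    height = 0
    while k ** height < buckets:  # assumption: inner nodes have fan-out k
        height += 1
    return buckets, height

def assign_dimensions(sample, height):
    """Step 2: sort dimensions by estimated cardinality (descending) and
    assign them to the IST levels round-robin."""
    dims = range(len(sample[0]))
    cardinality = {d: len({obj[d] for obj in sample}) for d in dims}
    order = sorted(dims, key=lambda d: cardinality[d], reverse=True)
    return [order[level % len(order)] for level in range(height)]

def equi_depth_delimiters(values, k):
    """Step 3: greedy left-to-right scan over the frequency histogram.

    A new interval starts at the first value seen after the intervals
    so far cover about i/k of the objects (i = 1 .. k-1).
    """
    histogram = sorted(Counter(values).items())
    target = len(values) / k
    delimiters, covered = [], 0
    for value, count in histogram:
        if len(delimiters) < k - 1 and covered >= target * (len(delimiters) + 1):
            delimiters.append(value)
        covered += count
    return delimiters

# Two-dimensional sample in which dimension 1 has more distinct values.
sample = [(0, 0), (0, 1), (1, 2), (1, 3)]
print(assign_dimensions(sample, 5))
print(equi_depth_delimiters([1, 2, 3, 4, 5, 6, 7, 8, 9], 3))
print(equi_depth_delimiters([1, 1, 2, 2], 3))
```

For the two-dimensional sample, the level assignment reproduces the pattern (1, 0, 1, 0, 1) from the text. The last call illustrates the low-cardinality case: with only two distinct values and k = 3, the scan yields a single delimiter and therefore fewer than k intervals.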

A periodic reorganization of the IST is mandatory to support updates, because BB-Trees store the entire IST in an immutable array. However, index reorganization is an expensive operation: a random sample must be drawn and scanned multiple times, a new IST must be constructed, and data objects must be moved to their new locations. We chose pragmatic and fast methods for these steps, which come with certain drawbacks. First, splitting a subtree by one dimension into intervals of equal size is not always possible, as in the case of low-cardinality dimensions (see Section 5.4). Second, we assign dimensions to tree levels globally, which can also lead to imbalances when dimensions are strongly correlated. Third, we compute the IST structure only on a sample. If the sample is small, the tree is built quickly yet might not optimally represent the data. Conversely, if the sample is large, building the tree takes more time yet probably leads to a better tree structure.

We make two notes regarding these issues. First, they are shared by most other updateable MDIS. For instance, the structure of kd-trees strongly depends on the order of the insertions. The K-D-B-tree turns kd-trees into balanced search trees, but at the price of complicated and slow update operations. Second, though we cannot give formal guarantees, for the data sets we used in our evaluation, we never observed any notable imbalance. We are thus confident that unbalanced BB-Trees with regions largely differing in terms of covered objects, which are possible in theory, remain very rare in practice.