BOSC-Based B+ Tree - Efficient Metadata Update Techniques for Storage Systems

We have successfully ported three tree-based database index implementations (B+ tree, R tree and K-D-B tree) from TPIE [161, 162] and an existing hash table-based database index implementation [163] to the BOSC storage system prototype. TPIE is a software environment written in C++ that is designed specifically to minimize the disk I/O cost in the face of very large data sets.

The porting efforts for these database index implementations share the following steps:

• Constructing a data structure that contains all the necessary information required to modify and to query a target disk block,

• Developing an update commit function that performs a requested modification, which could be a delete, an insert or an update operation, on a disk block that is brought into memory, and

• Developing a query function that scans the per-block request queues before retrieving target disk blocks when servicing a query request.

The actual data structure layout and the internal logic of the commit and scan functions differ from one index implementation to another. In general, however, they can be adapted from their original implementations without significant changes.
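The three porting steps can be illustrated with a minimal sketch. It is not the TPIE or BOSC code: the names `UpdateRequest`, `commit`, and `query` are hypothetical, a disk block is modeled as a plain dictionary, and the per-block request queue as a list in arrival order.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class UpdateRequest:
    """Hypothetical request record: all that is needed to modify one block."""
    op: str                      # "insert", "update" or "delete"
    key: int
    value: Optional[str] = None


def commit(block: Dict[int, str], req: UpdateRequest) -> None:
    """Commit function: apply one modification to a block that is in memory."""
    if req.op == "delete":
        block.pop(req.key, None)
    else:
        # In this sketch insert and update both store the new value.
        block[req.key] = req.value


def query(key: int,
          queue: List[UpdateRequest],
          read_block: Callable[[], Dict[int, str]]) -> Optional[str]:
    """Query function: scan the request queue first, newest request first,
    and fetch the block from disk only if no pending request mentions the key."""
    for req in reversed(queue):
        if req.key == key:
            return None if req.op == "delete" else req.value
    return read_block().get(key)
```

A query for a key that has a pending request is answered from the queue alone; `read_block` is called only when the queue holds nothing for that key.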

We focus on the B+ tree implementation below; the other index structures can be adapted to BOSC in a similar way.

To service a modification (write) command, a database index implementation first determines the disk block holding the target index page, then constructs an update request record and finally calls BOSC’s update API with the target disk block’s address, the associated update request record and its commit function as input arguments. To service a query (read) command, a database index implementation first determines the disk block holding the target index page, and then it constructs a query request record and calls BOSC’s query API with the target disk block’s address, the query request record and its query function as input arguments.

The BOSC-based B+ tree assumes that all internal tree nodes and a small subset of leaf nodes are memory-resident. To service a modification query that inserts, deletes, or updates an index record, the BOSC-based B+ tree first traverses the internal nodes to identify the leaf node containing the target index record, then constructs a disk update request record, and finally calls BOSC’s disk update API using the target leaf node’s disk block address, the associated update request record and the corresponding commit function as input arguments. Upon receiving such a disk update request, BOSC first logs the request to the log disks. It then commits the update to the target leaf node immediately if that node is currently cached in memory; otherwise, it queues the update request record in the in-memory request queue associated with the target leaf node.
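The update path described above (log first, commit immediately if the leaf is cached, queue otherwise) can be sketched as follows. This is an illustration under stated assumptions, not BOSC's actual API: the class `BoscSketch`, its methods, and the in-memory list standing in for the log disks are all hypothetical.

```python
from collections import defaultdict


class BoscSketch:
    """Hypothetical model of the disk update path; not the real BOSC API."""

    def __init__(self):
        self.log = []                    # stands in for the log disks
        self.cache = {}                  # block address -> in-memory block
        self.queues = defaultdict(list)  # block address -> pending requests

    def update(self, addr, request, commit_fn):
        # 1. Log the request before anything else.
        self.log.append((addr, request))
        if addr in self.cache:
            # 2. Target block is cached: commit the update right away.
            commit_fn(self.cache[addr], request)
        else:
            # 3. Otherwise defer it in the block's in-memory request queue.
            self.queues[addr].append((request, commit_fn))

    def load(self, addr, block):
        """Background path: a block arrives in memory, so its queued
        requests are committed in arrival order and the block is cached."""
        for request, commit_fn in self.queues.pop(addr, []):
            commit_fn(block, request)
        self.cache[addr] = block


def put(block, request):
    """Example commit function: the request is a (key, value) pair."""
    key, value = request
    block[key] = value
```

The commit function travels with the request, so the same `update` entry point can serve any index structure that supplies its own commit logic.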

To ensure atomicity, the BOSC-based B+ tree acquires a lock on a leaf node before modifying it, and releases the lock after BOSC logs the associated disk update request and queues it in the associated request queue. It is safe to release the lock on the target leaf node of a modification query before the requested modification is physically committed to disk, because BOSC guarantees that the effects of a modification query’s disk update request are visible to all subsequent queries that access the same leaf node, even in the presence of power failures.
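The locking discipline can be sketched as below, assuming one lock per leaf node. The names `modify_leaf`, `request_log`, and `request_queues` are hypothetical, and Python lists stand in for the log disks and the request queues; the point is only that the lock covers logging and queuing, not the later physical commit.

```python
import threading
from collections import defaultdict

_registry_lock = threading.Lock()   # protects creation of per-leaf locks
_leaf_locks = {}                    # leaf address -> lock (hypothetical)
request_log = []                    # stands in for the log disks
request_queues = defaultdict(list)  # leaf address -> pending requests


def _lock_for(addr):
    """Return the lock of a leaf, creating it on first use."""
    with _registry_lock:
        return _leaf_locks.setdefault(addr, threading.Lock())


def modify_leaf(addr, request):
    """Hold the leaf lock only while the request is logged and queued."""
    with _lock_for(addr):
        request_log.append((addr, request))
        request_queues[addr].append(request)
    # The lock is released here. The physical commit to disk happens later,
    # when the leaf's block is brought into memory.
```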

An implicit assumption underlying the design of BOSC is that each disk update request modifies only its target disk block. However, this assumption does not always hold for the BOSC-based B+ tree, because a modification to a tree node, e.g., an insertion of a new index record, may trigger a restructuring of the tree and thus modifications to other tree nodes. If a disk update request that triggers additional disk updates is not processed at the time it is queued but is deferred until it is committed to disk, a disk block’s in-memory request queue may grow without bound, because the triggered restructuring may be recursive. This makes the update commit processing time of a disk block less predictable, and it increases the response time of read query requests, because servicing a read query requires scanning the per-block update request queues.

To mitigate the performance overhead due to disk update requests that trigger additional disk updates, the BOSC-based B+ tree maintains a count of the number of index records in each leaf node, and proactively triggers the split of a leaf or internal node when the number of records in a tree node exceeds a threshold. If the leaf node to be split does not have any index records on disk, all the node’s index records are in the associated update request queue, and the BOSC-based B+ tree performs the split without incurring any disk accesses. If the leaf node to be split has some index records on disk, the BOSC-based B+ tree defers the split operation until these records are brought into memory by the background BOSC thread.
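The split decision in this paragraph depends only on per-leaf counters, which is why it needs no disk access. A minimal sketch, with a hypothetical `LeafCounters` record and hypothetical return labels:

```python
from dataclasses import dataclass


@dataclass
class LeafCounters:
    """Hypothetical per-leaf bookkeeping kept in memory."""
    on_disk: int   # index records already stored in the leaf's disk block
    queued: int    # index records still waiting in the leaf's request queue


def split_decision(leaf: LeafCounters, threshold: int) -> str:
    """Decide what to do after a new record has been counted for the leaf."""
    if leaf.on_disk + leaf.queued <= threshold:
        return "none"                # node is not over the threshold
    if leaf.on_disk == 0:
        return "split_in_memory"     # every record is in the queue: no I/O
    return "defer_until_loaded"      # wait for the background BOSC thread
```

For example, with a threshold of 4, a leaf with 5 queued records and none on disk is split in memory, while a leaf with 3 records on disk and 2 queued has its split deferred.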

Consider the B+ tree as an example. Its update commit function includes a component that examines the target disk block’s pending update request queue to determine whether an update request will trigger a structural change. If so, the component enacts the change by generating additional update requests as necessary, for instance, allocating a new block, modifying another block to point to the new block, or copying part of the current block to the new block. Note that the B+ tree’s commit function does not need to perform any disk I/O while enacting a structural change associated with a disk update request, not even to fetch the request’s target disk block. This is possible only if additional application-specific metadata about a disk block is stored with its update request queue, for example, the remaining free capacity of the disk block in the case of the B+ tree.

To support structural changes to a database index triggered by a modification command, the commit function of each update request comprises two components: the first component modifies the target disk block and is invoked when the target disk block is brought into memory, and the second component performs synchronous structural modification triggered by an update request and is invoked at the time when the request is queued. The second component ensures that all additional update requests generated by an update request X are reflected in their associated queues immediately after X is queued. For example, for an insert operation to a B+ tree, its first component updates the index page into which the new record is inserted, and its second component is responsible for splitting an index page into two when the number of its pending update requests exceeds the capacity of the index page. For an insert operation to a hash table, the first component updates the page containing the target bucket, and the second component re-queues the request if the page that is supposed to hold the target bucket is already full.
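The two components of an insert's commit function can be sketched for the B+ tree case as follows. This is an insert-only illustration under assumptions: the class `TwoPartInsert`, the page numbering (page 0 is the initial leaf, new pages are numbered from 1), and the `("add_child", ...)` request format are all hypothetical.

```python
from collections import defaultdict


class TwoPartInsert:
    """Hypothetical insert-only sketch of a two-component commit function."""

    def __init__(self, capacity):
        self.capacity = capacity              # records one index page can hold
        self.leaf_queues = defaultdict(list)  # page id -> pending insert keys
        self.parent_requests = []             # requests generated by splits
        self.next_page_id = 1                 # page 0 is the initial leaf

    def queue(self, page_id, key):
        """Queue an insert, then run the second component synchronously."""
        self.leaf_queues[page_id].append(key)
        return self._restructure(page_id)

    def _restructure(self, page_id):
        """Second component: split the page's queue once it holds more
        pending inserts than the page can store. No disk I/O is involved."""
        pending = self.leaf_queues[page_id]
        if len(pending) <= self.capacity:
            return None
        keys = sorted(pending)
        mid = len(keys) // 2
        new_id = self.next_page_id
        self.next_page_id += 1
        self.leaf_queues[page_id] = keys[:mid]
        self.leaf_queues[new_id] = keys[mid:]
        # Additional request generated by the split, recorded immediately.
        self.parent_requests.append(("add_child", keys[mid], new_id))
        return new_id

    def apply(self, page_id, page_records):
        """First component: runs when the page is in memory and moves the
        queued keys into the page's record list."""
        page_records.extend(self.leaf_queues.pop(page_id, []))
        page_records.sort()
```

Because the split is enacted when the overflowing request is queued, every request generated by it is already in place before the next request arrives, which keeps the per-page queues bounded by the page capacity in this sketch.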

3.4 Performance Evaluation
