Fast Random Updates to On-Disk Data Structures

A common approach to improving the performance of small disk writes is to temporarily buffer the disk writes in a fast storage medium such as NVRAM, and then asynchronously submit the buffered writes to data disks. Such buffering provides two benefits: disk writes can be scheduled more flexibly, and multiple writes to the same target can be combined. However, NVRAM is expensive, and for workloads with poor locality, a high update rate and a large working set, such as TPC-C [75], a small amount of NVRAM can only mask the delay for a finite number of disk writes, because the sustained write performance is eventually bottlenecked by the speed at which writes are propagated to disks. The write-only disk cache [76] mitigates the performance problem due to buffer flushing by injecting disk writes between consecutive disk reads. However, a single buffer page is still required to hold the result of each disk read, and read operations can still perform poorly if the input workload has poor data locality. In contrast, BOSC’s low-latency logging technique can accommodate a much larger number of disk writes, its use of sequential disk I/O to commit pending updates greatly improves the sustained disk update throughput, and it does not rely on NVRAM to ensure data durability.
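The two benefits of write buffering mentioned above, combining writes to the same target and choosing the order in which buffered writes reach the disk, can be illustrated with a minimal Python sketch. It is not taken from any of the cited systems: the names WriteBuffer and Disk are hypothetical, an in-memory dictionary stands in for NVRAM, and flushing in ascending block order is assumed as a simple stand-in for a disk scheduling policy.

```python
class Disk:
    """Hypothetical stand-in for a data disk; records the order of block writes."""

    def __init__(self):
        self.blocks = {}   # block number -> data
        self.log = []      # block numbers in the order they were written

    def write(self, block, data):
        self.blocks[block] = data
        self.log.append(block)


class WriteBuffer:
    """Hypothetical write buffer; a dict stands in for NVRAM."""

    def __init__(self, disk, capacity):
        self.disk = disk
        self.capacity = capacity   # maximum number of distinct buffered blocks
        self.pending = {}          # block number -> latest buffered data

    def write(self, block, data):
        # A later write to the same block overwrites the buffered copy,
        # so only one disk write is issued for that block at flush time.
        self.pending[block] = data
        if len(self.pending) > self.capacity:
            self.flush()

    def flush(self):
        # Assumed policy: issue buffered writes in ascending block order.
        for block in sorted(self.pending):
            self.disk.write(block, self.pending[block])
        self.pending.clear()
```

For example, buffering writes to blocks 7, 3, 7 and 5 and then flushing issues three disk writes, in the order 3, 5, 7, and block 7 holds the data of its second write. The buffer only defers the disk writes; once it fills, a flush to disk is unavoidable.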

There has been a long line of research on efficient file system metadata update techniques that ensure metadata consistency with minimal performance overhead. HyLog [77] further reduces the performance overhead associated with LFS’s cleaning [78] by treating hot and cold pages separately. The soft update technique [79, 80] avoids synchronous metadata writes by exploiting dependencies among metadata updates, and makes it possible to aggregate updates as much as possible to improve disk I/O efficiency. One problem with soft updates is that the technique is metadata-specific and thus needs to be tailored to each type of file system. Also, the above metadata update techniques focus mainly on the latency rather than the throughput of metadata updates.

Efficient file system metadata update techniques that ensure metadata consistency with minimal performance overhead have received significant attention in the last two decades. WAL (Write-Ahead Logging) [81, 82] and shadow paging [77, 78, 83–85] group related metadata updates and commit them atomically to ensure metadata consistency. Performance benefits of WAL mainly come from sequential disk writes and group commit.
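The two sources of WAL's performance benefit named above, sequential writes and group commit, can be sketched as follows. This is an illustrative Python sketch, not the design of any cited system: the class name GroupCommitLog, the JSON record format and the syncs counter are assumptions made for the example.

```python
import json
import os


class GroupCommitLog:
    """Hypothetical append-only log that commits pending records as one group."""

    def __init__(self, path):
        self.file = open(path, "ab")   # append mode: writes go to the end of the log
        self.pending = []              # encoded records not yet on disk
        self.syncs = 0                 # number of fsync calls issued so far

    def append(self, record):
        # Records are only queued here; nothing is forced to disk yet.
        self.pending.append(json.dumps(record).encode("utf-8") + b"\n")

    def commit(self):
        # Group commit: one sequential write and one fsync cover every
        # record queued since the previous commit.
        if not self.pending:
            return
        self.file.write(b"".join(self.pending))
        self.file.flush()
        os.fsync(self.file.fileno())
        self.syncs += 1
        self.pending.clear()

    def close(self):
        self.commit()
        self.file.close()
```

Appending three records and then calling commit() once leaves three lines in the log file after a single fsync, rather than one fsync per record.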

Much work [86–90] has been done to optimize the disk I/O performance for inserting and querying index data structures. One particularly interesting line of research in this area is the cache-oblivious data structures and algorithms [91–94]. Take a binary tree B of height H for example. This tree is abstracted into a 2-level abstract tree AB, whose root corresponds to the first H/2 levels of B, and each of whose leaf nodes corresponds to an H/2-level subtree of B. Each node in AB is then recursively abstracted in the same way until the size of each final abstract tree node is smaller than a pre-defined threshold T. This linearization strategy for tree data structures, known as the van Emde Boas scheme, substantially reduces the number of disk accesses required in the tree look-up process if T is smaller than the cache line (page) size. The performance improvement of cache-oblivious data structures mainly comes from the fact that they put portions of a tree that are likely to be accessed together during the look-up process in the same units which are transferred in the memory hierarchy. With this set-up, when a transfer unit is fetched into the main memory, it is expected to service multiple accesses to the unit before it is evicted.
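The recursive layout described above can be sketched in Python for a complete binary tree. The sketch makes several assumptions not stated in the text: nodes are identified by heap-style indices (root 1, children 2i and 2i+1), the recursion continues down to single nodes rather than stopping at a threshold T, odd heights are split with the smaller half on top, and the helper names veb_order and blocks_touched are hypothetical.

```python
def veb_order(root, height):
    """List the heap indices of a complete subtree in van Emde Boas order.

    `root` is the heap index of the subtree root and `height` is the
    number of levels in the subtree.
    """
    if height == 1:
        return [root]
    top = height // 2          # levels in the top abstract node
    bottom = height - top      # levels in each bottom subtree
    order = veb_order(root, top)
    # Descendants of `root` exactly `top` levels down are the roots
    # of the bottom subtrees; they occupy consecutive heap indices.
    first = root << top
    for sub_root in range(first, first + (1 << top)):
        order.extend(veb_order(sub_root, bottom))
    return order


def blocks_touched(order, leaf, block_size):
    """Count the distinct transfer units on the root-to-leaf path for a layout."""
    position = {node: i for i, node in enumerate(order)}
    blocks = set()
    node = leaf
    while node >= 1:
        blocks.add(position[node] // block_size)
        node //= 2             # parent in heap indexing
    return len(blocks)
```

For a tree of height 4, veb_order(1, 4) gives [1, 2, 3, 4, 8, 9, 5, 10, 11, 6, 12, 13, 7, 14, 15]. With transfer units of three nodes, the look-up path to leaf 8 touches two units in this layout, compared with three when the nodes are stored in breadth-first order.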

There have been several research efforts on the bulk update problem, which attempts to speed up index updates in the presence of a continuous stream of inputs to a database that require real-time updates to its indexes. Arge et al. [95, 96] propose a bulk update mechanism for dynamic R-trees, whereas Procopiuc et al. [97] describe a scalable bulk update algorithm for kd-trees. The basic idea behind these schemes is to hold the inserted input records in the internal nodes as long as possible and copy them sequentially to grow the tree when the internal nodes are filled up. In the buffer tree technique [91], incoming updates to a B+ tree are written to the smallest B+ tree that can fit into the main memory. Merging is implemented as a background operation to take advantage of large sequential writes. However, read query performance is again sacrificed because multiple B+ trees have to be queried before the final result can be computed. Graefe [98] proposes a novel technique to improve the de-fragmentation and reorganization performance of B+ trees. The idea is to use a logical pointer called a fence, instead of a physical pointer, to reference sibling B+ tree leaf nodes, so as to limit the performance overhead of migrating B+ tree leaf nodes. However, this scheme optimizes the performance of insert operations but not update in-place operations, because the latter need to fetch target leaf nodes before modifying them.
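The read-side cost noted above, that several trees must be consulted per query when updates are absorbed by a small in-memory tree, can be illustrated with a simplified Python sketch. It is a model built on assumptions rather than the technique of [91]: sorted lists stand in for B+ trees, the background merge is omitted, and the names TieredIndex, memory_limit and the probe counter are hypothetical.

```python
import bisect


class TieredIndex:
    """Hypothetical index: one in-memory component plus flushed sorted runs."""

    def __init__(self, memory_limit):
        self.memory_limit = memory_limit   # entries held in memory before a flush
        self.memory = {}                   # in-memory component
        self.runs = []                     # newest first; each run is (keys, values)

    def insert(self, key, value):
        self.memory[key] = value
        if len(self.memory) >= self.memory_limit:
            # Flush the whole in-memory component as one sorted run,
            # which models a single large sequential write.
            keys = sorted(self.memory)
            values = [self.memory[k] for k in keys]
            self.runs.insert(0, (keys, values))
            self.memory = {}

    def lookup(self, key):
        """Return (value, number of components probed); value is None on a miss."""
        probes = 1
        if key in self.memory:
            return self.memory[key], probes
        for keys, values in self.runs:     # newest run first
            probes += 1
            i = bisect.bisect_left(keys, key)
            if i < len(keys) and keys[i] == key:
                return values[i], probes
        return None, probes
```

With memory_limit set to 2, inserting keys 1 through 5 leaves two flushed runs plus one in-memory entry, so looking up key 1 probes three components while key 5 is found in the first.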

HDFS [3] uses append-only writing to mitigate random writes, but that comes at the expense of low locality in reads. It is optimized for batch processing systems like MapReduce [14]. HBASE [13] is built over HDFS to improve real-time read/write access.

BOSC is different from these database-index optimization schemes in three ways. First, BOSC is application-independent and requires only minor modifications to the database indexes built on top of it. Second, BOSC speeds up the disk access performance through request batching and sequential commit, without requiring any additional data structure copying. Third, BOSC can handle arbitrary index modifications, i.e., insert, delete and in-place update, but most bulk update schemes are optimized for streaming inserts.
