3.2 Batching Modifications with Sequential Commit
3.2.1 Low-Latency Synchronous Logging
[Figure 3.1 (diagram not reproduced). Labels in the figure: Update-Aware Disk Access Interface (Modify, Read), BOSC, Disk Queue, Sequential I/O Thread, Low-Latency Disk Logging, Data Logging, Data Commit, Block Interface, Logging Disk Blocks, Data Disk Blocks, Index Data Structure.]
Figure 3.1: BOSC associates with each disk block an in-memory update request queue. When BOSC receives a disk update request, it logs the request to disk, queues the request in the update request queue associated with its target disk block, and then performs the update operation only when the target disk block is brought into memory. BOSC fetches target disk blocks using sequential disk I/O.
Logging an update request to disk synchronously and then asynchronously performing whatever operations the update triggers is a well-known technique. BOSC takes the same approach to deliver high sustained disk update throughput with the same durability guarantee as synchronous disk updates. In BOSC, each log record for a disk update request contains the following information:
• A copy of the data structure pointed to by ptr modification,
• A global sequence number for the current disk update request,
• A back pointer to the disk location of the log record that immediately precedes this record in time,
• A global frontier, which corresponds to the global sequence number for the youngest disk update request before which all disk update requests have been committed to disk, and
• A local frontier, which corresponds to the global sequence number for the youngest disk update request before which all disk update requests to the target (local) disk block of the current disk update request have been committed to disk.
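The record layout listed above can be pictured with a minimal sketch. The class name, field names, and Python types are hypothetical and chosen only for illustration; the text does not specify an on-disk encoding.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LogRecord:
    """Illustrative in-memory view of one log record (names are assumptions)."""
    payload: bytes            # copy of the data describing the modification
    seq: int                  # global sequence number of this update request
    back_ptr: Optional[int]   # disk location of the preceding log record, if any
    global_frontier: int      # global frontier at the time the record was prepared
    local_frontier: int       # local frontier of the request's target disk block


# Example values only: a record with sequence number 7 whose target block
# has local frontier 2 while the global frontier is 4.
record = LogRecord(payload=b"new-value", seq=7, back_ptr=None,
                   global_frontier=4, local_frontier=2)
```

The back pointer is optional here because the very first record has no predecessor; how that case is encoded on disk is not stated in the text.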
Upon receiving a disk update request, BOSC increments the current global sequence number and assigns the result to the request. It then prepares the request's log record by extracting the data structure pointed to by ptr modification and copying the global frontier, the local frontier associated with the specified target disk block, and the disk location of the last log record.
To illustrate how BOSC maintains the system-wide global frontier and the per-block local frontiers, consider the following update request sequence: 1(10), 2(15), 3(10), 4(2000), 5(30), 6(10), 7(15), where the number outside the parentheses is a request's global sequence number and the number inside is its target disk block number. Suppose a system failure occurs immediately after Request 7 is logged to disk, and at that point only the effects of Requests 1, 2, 3, 5 and 6 have been committed to disk. At that instant, the global frontier is 4, and the local frontiers for Blocks 10, 15, 30 and 2000 are 6, 2, 5 and 0, respectively. (Request 4 is the earliest uncommitted request; Block 15 stays at 2 because Request 7 is not yet committed, and Block 2000 stays at 0 because its only request, Request 4, is uncommitted.) BOSC uses the global frontier to determine which log records can be recycled at run time and to reduce the number of log records that need to be examined at recovery time. The per-block local frontiers can further cut down the recovery processing load, as is explained in Section 3.2.3.
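The frontier values in this example can be reproduced with a short sketch. It rests on one reading of the definitions, stated here as an assumption: the global frontier is taken to be the earliest uncommitted sequence number, and a block's local frontier is taken to be the latest committed request to that block with no earlier uncommitted request to the same block (0 if there is none). The function names and the value returned when every request is committed are hypothetical.

```python
def global_frontier(requests, committed):
    """Earliest sequence number that is not yet committed (assumed reading)."""
    for seq, _block in sorted(requests):
        if seq not in committed:
            return seq
    # Assumption: if everything is committed, report one past the last request.
    return max(seq for seq, _block in requests) + 1


def local_frontiers(requests, committed):
    """Per-block frontier: last committed request before the block's first gap."""
    frontiers = {}
    stalled = set()  # blocks that already have an uncommitted request
    for seq, block in sorted(requests):
        frontiers.setdefault(block, 0)
        if block in stalled:
            continue
        if seq in committed:
            frontiers[block] = seq
        else:
            stalled.add(block)
    return frontiers


# The request sequence from the text: (sequence number, target block).
requests = [(1, 10), (2, 15), (3, 10), (4, 2000), (5, 30), (6, 10), (7, 15)]
committed = {1, 2, 3, 5, 6}

gf = global_frontier(requests, committed)   # 4
lf = local_frontiers(requests, committed)   # {10: 6, 15: 2, 2000: 0, 30: 5}
```

Under this reading, any log record with a sequence number below the global frontier is no longer needed for recovery, which is what allows it to be recycled.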
Because the end-to-end throughput of BOSC is bounded by its synchronous disk logging performance, BOSC extends a low-latency disk logging technique [12] to be both low-latency and space-efficient, to support aggressive disk request batching, and to work on a commodity disk array [160]. The key idea in this low-latency logging technique is to write a disk block to wherever the disk head happens to be. More concretely, BOSC maintains a separate disk request queue for each disk in the log disk array. At any point in time, one of the log disks serves as the active disk. Initially, BOSC randomly chooses one of the log disks as the active disk. Once a log disk becomes the active disk, it remains the active disk until the waiting time of the oldest pending request in its queue exceeds a threshold, Twait. Whenever a new disk write request arrives, BOSC inserts the request into the active disk's queue as long as the waiting time of its oldest pending request is smaller than Twait and there is enough free space in the current track to accommodate the new request; otherwise BOSC dispatches the request batch currently in the active disk's queue, chooses another log disk as the active disk, and inserts the new request into its queue.
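The queueing rule for the active disk can be sketched as follows. This is a simplified illustration under several assumptions: the class and field names are hypothetical, free track space is modeled as a fixed per-disk byte budget, the next active disk is picked round-robin rather than by estimated write time, and the Twait check runs only when a request arrives (a real system would also need a timer).

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class LogDisk:
    name: str
    track_free: int  # hypothetical: bytes still free in the current track
    queue: List[Tuple[float, int]] = field(default_factory=list)  # (arrival, size)

    def oldest_wait(self, now: float) -> float:
        """Waiting time of the oldest pending request, 0 if the queue is empty."""
        return now - self.queue[0][0] if self.queue else 0.0

    def queued_bytes(self) -> int:
        return sum(size for _, size in self.queue)


class LogScheduler:
    def __init__(self, disks: List[LogDisk], t_wait: float):
        self.disks = disks
        self.t_wait = t_wait
        self.active = 0  # index of the active disk (the text picks it randomly)
        self.dispatched: List[Tuple[str, List[int]]] = []  # (disk, batch sizes)

    def submit(self, now: float, size: int) -> str:
        """Queue a write request and return the name of the disk that holds it."""
        disk = self.disks[self.active]
        within_wait = disk.oldest_wait(now) < self.t_wait
        fits = disk.queued_bytes() + size <= disk.track_free
        if not (within_wait and fits):
            # Dispatch the pending batch as one physical write, then switch disks.
            if disk.queue:
                self.dispatched.append((disk.name, [s for _, s in disk.queue]))
                disk.queue.clear()
            self.active = (self.active + 1) % len(self.disks)  # assumption
            disk = self.disks[self.active]
        disk.queue.append((now, size))
        return disk.name
```

For example, with two disks of 100 free bytes each and Twait of 5.0, three 40-byte requests arriving at times 0, 1 and 2 put the first two on the first disk and push the third to the second disk, because the third no longer fits in the track.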
To select a new active disk for an incoming write request, BOSC computes the earliest time at which the write request could be written to each log disk, and selects the disk that can write the request earliest. When computing a write request's write time on a log disk, BOSC takes into account the current position of the log disk's head and the possibility of batching the new request with others already in the disk's queue. For those log disks that are currently idle, BOSC only needs to consider the delay due to batching. A key design decision in BOSC is to dispatch a new write request to a log disk that allows as many disk write requests as possible to be batched into one physical disk write operation, rather than to the disk with the earliest write time for that request. However, BOSC uses Twait to limit the batch size and to ensure that the latency experienced by each incoming disk write request is always bounded.
Because of batching, multiple log records may be merged into a single physical disk request when they are written to disk. In addition, the actual disk location of each log record is known only at the last moment, i.e., right when the record is written to disk, so BOSC must track this information accurately in order to chain log records together through their back pointers.
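One way to picture this late binding of disk locations is the sketch below. It assumes, purely for illustration, that the records of a batch are laid out contiguously, that sizes are counted in sectors, and that a record's location is the LBA of its first sector; the function name and parameters are hypothetical.

```python
def chain_batch(record_sectors, batch_start_lba, last_location):
    """Assign a disk location and a back pointer to each record in a batch.

    record_sectors:  size of each record in sectors, in log order
    batch_start_lba: LBA where the batch is actually written (known at dispatch)
    last_location:   location of the most recent record written before this batch
    Returns (placements, new_last_location), where placements is a list of
    (lba, back_ptr) pairs in log order.
    """
    placements = []
    lba = batch_start_lba
    back_ptr = last_location
    for sectors in record_sectors:
        placements.append((lba, back_ptr))
        back_ptr = lba   # the next record points back to this one
        lba += sectors   # assumption: records are contiguous within the batch
    return placements, back_ptr


placements, last = chain_batch([2, 1, 3], batch_start_lba=1000, last_location=400)
# placements == [(1000, 400), (1002, 1000), (1003, 1002)], last == 1003
```

The returned last location is carried into the next batch, so the chain continues across physical writes and across log disks.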
To track each log disk's head position, BOSC statically extracts the physical disk geometry from every log disk and continuously tracks each disk's head position at run time. More concretely, after a physical disk write completes, BOSC records the LBA (Logical Block Address) of its last sector, LBA0, and its completion timestamp, T0. Assuming the disk head stays in the same track, when the next write arrives at T1, BOSC estimates the disk head's current position, CurrentLBA, using the following formula:

\[
\mathit{CurrentLBA} = \frac{\mathit{SPT} \cdot \left( (T_1 - T_0) \bmod \mathit{RoTime} \right)}{\mathit{RoTime}} + \mathit{LBA}_0 \tag{3.1}
\]

where SPT is the number of sectors in the current track and RoTime is the disk's full rotation time. The final predicted position, DestinationLBA, is CurrentLBA + Lookahead, where Lookahead is an empirical value chosen to account for delays such as the controller delay and to avoid a full rotation delay caused by tracking errors. The detailed design and analysis of low-latency synchronous logging can be found in Chapter 4.
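Equation 3.1 and the Lookahead adjustment translate directly into code. The sketch below uses hypothetical function and parameter names, truncates the sector offset to an integer (an assumption, since the text does not say how fractions are handled), and, like the formula itself, does not model wrap-around at the end of the track.

```python
def current_lba(lba0: int, t0: float, t1: float, spt: int, rot_time: float) -> int:
    """Estimate the head position per Equation 3.1.

    lba0:     LBA of the last sector of the previous write
    t0, t1:   completion time of that write and arrival time of the next one
    spt:      sectors per track
    rot_time: full rotation time, in the same unit as t0 and t1
    """
    elapsed_in_rotation = (t1 - t0) % rot_time
    sectors_passed = int(spt * elapsed_in_rotation / rot_time)  # truncation assumed
    return lba0 + sectors_passed


def destination_lba(lba0, t0, t1, spt, rot_time, lookahead):
    """Predicted write position: CurrentLBA + Lookahead."""
    return current_lba(lba0, t0, t1, spt, rot_time) + lookahead


# Illustrative numbers only: 500 sectors per track, 8.0 ms per rotation.
# 10.0 ms after the last write the platter is 2.0 ms into a rotation,
# i.e. a quarter turn, so 125 sectors have passed under the head.
cur = current_lba(lba0=10000, t0=0.0, t1=10.0, spt=500, rot_time=8.0)   # 10125
dest = destination_lba(10000, 0.0, 10.0, 500, 8.0, lookahead=20)        # 10145
```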