3.4 Performance Evaluation
3.4.7 Logging and Recovery Performance
Figure 3.11: The total recovery time for a 64-GB B+ tree index and the number of uncommitted pending update requests in the replay window as the input update request rate is varied before the crash. (X axis: input rate in records per second; left Y axis: end-to-end recovery time in seconds; right Y axis: number of uncommitted records.)
BOSC relies on low-latency logging to provide the same durability guarantee as synchronous disk updates. The average latency of logging a 4-Kbyte block to an IDE disk array is under 0.5 msec, about an order of magnitude smaller than that of conventional disk logging implementations and the fastest ever reported in the literature. In addition, through aggressive disk request batching, BOSC is able to log more than 50000 per-insertion-request log records per second, or about 20 µs per log record. Finally, even with such high logging efficiency, BOSC is able to keep the log disks' space utilization above 70%.
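As a quick sanity check on the logging figures quoted above, the following minimal sketch converts them into a per-record cost and a batching factor. The assumption that 4-Kbyte block writes are issued one at a time at the quoted latency is ours and is made purely for illustration; the function names are hypothetical.

```python
# Back-of-the-envelope check of the logging numbers quoted in the text.
BLOCK_LATENCY_S = 0.5e-3   # upper bound from the text: under 0.5 msec per 4-Kbyte block
RECORDS_PER_S = 50_000     # lower bound from the text: more than 50000 log records/s


def per_record_cost_us(records_per_s: float) -> float:
    """Amortized logging cost per log record, in microseconds."""
    return 1e6 / records_per_s


def min_records_per_block(records_per_s: float, block_latency_s: float) -> float:
    """Records that must share one block write to sustain the given rate.

    ASSUMPTION (illustrative only): block writes are serialized, so at most
    1 / block_latency_s block writes complete per second.
    """
    return records_per_s * block_latency_s


if __name__ == "__main__":
    print(f"{per_record_cost_us(RECORDS_PER_S):.1f} us per log record")
    print(f"{min_records_per_block(RECORDS_PER_S, BLOCK_LATENCY_S):.1f} records per block write")
```

Under this serialized-write assumption, roughly 25 log records would have to share each block write, which is consistent with the role the text assigns to aggressive request batching.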
There are two major steps in BOSC's recovery procedure: (1) identifying the youngest log record and (2) reconstructing the in-memory per-block request queues by analyzing the log records between the youngest log record and its associated global frontier. Because Step (1) uses a binary search through the logging disk array, it typically takes between 0.8 and 0.9 seconds to complete. Table 3.2 shows the time required by each of these two steps when recovering four database index implementations. In addition, Table 3.2 shows the number of log records in the replay window between the youngest log record and its global frontier, and the number of log records that are actually put into the per-block request queues. The difference between the two is the number of log records in the replay window that had already been committed before the crash.
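Step (1) can be illustrated with a minimal sketch. The text only states that a binary search over the logging disk array is used; the log layout below (a circular array of slots, each carrying a globally increasing sequence number, with an `EMPTY` marker for slots never written) and the names `find_youngest` and `read_seq` are assumptions made for illustration, not BOSC's actual on-disk format.

```python
from typing import Callable, Optional

EMPTY = -1  # hypothetical marker for a log slot that was never written


def find_youngest(read_seq: Callable[[int], int], num_slots: int) -> Optional[int]:
    """Return the slot index of the youngest log record, or None if the log is empty.

    Assumed layout: slots 0..k hold increasing sequence numbers ending at the
    youngest record; every later slot is either EMPTY or holds an older record
    left over from the previous pass around the circular log. The predicate
    "read_seq(i) >= read_seq(0)" is therefore true for a prefix of the slots
    and false afterwards, so the last true slot can be found by binary search.
    """
    if num_slots == 0:
        return None
    first = read_seq(0)
    if first == EMPTY:
        return None
    lo, hi = 0, num_slots - 1          # invariant: the predicate holds at lo
    while lo < hi:
        mid = (lo + hi + 1) // 2       # round up so the range always shrinks
        if read_seq(mid) >= first:
            lo = mid                   # youngest record is at mid or later
        else:
            hi = mid - 1               # mid belongs to the older/empty suffix
    return lo


if __name__ == "__main__":
    wrapped_log = [8, 9, 10, 4, 5, 6, 7]   # the log has wrapped around once
    print(find_youngest(wrapped_log.__getitem__, len(wrapped_log)))
```

The search reads only a logarithmic number of slots, so under these assumptions its cost is dominated by a handful of disk reads rather than by the size of the log.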
The time required by Step (2) depends on the number of uncommitted pending updates, which in turn depends on the input request rate. To evaluate how the total recovery time scales with the input rate, we ran a random update workload with varying input request rates to update records in a 64-GB B+ tree configured with 256 MB of buffer memory and 16-byte index records. In each run, we issued about 64 million update requests, shut down the B+ tree machine, restarted it, and measured its recovery time.
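| Index Structure | Locating the Youngest Log Record (seconds) | Reconstructing Per-Block RQs (seconds) | Number of Log Records in Replay Window | Number of Log Records Put Into RQs |
|-----------------|--------------------------------------------|----------------------------------------|----------------------------------------|------------------------------------|
| B+ Tree         | 0.87                                       | 32                                     | 498370                                 | 448504                             |
| R Tree          | 0.9                                        | 27                                     | 376753                                 | 338909                             |
| K-D-B Tree      | 0.91                                       | 29                                     | 609834                                 | 548758                             |
| Hash Table      | 0.86                                       | 23                                     | 1897640                                | 1364983                            |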
Table 3.2: The break-down of the recovery processing time for four database index implementations, and the number of log records that are in the replay window and that are actually put into per-block Request Queues (RQ).
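The number of replay-window records that were already committed before the crash is not listed in Table 3.2, but it follows directly from the last two columns. The short sketch below computes it; the variable and function names are ours.

```python
# Rows of Table 3.2: (index structure, records in replay window, records put into RQs)
TABLE_3_2 = [
    ("B+ Tree", 498370, 448504),
    ("R Tree", 376753, 338909),
    ("K-D-B Tree", 609834, 548758),
    ("Hash Table", 1897640, 1364983),
]


def committed_before_crash(in_window: int, put_into_rqs: int) -> int:
    """Replay-window records that had already been committed before the crash."""
    return in_window - put_into_rqs


if __name__ == "__main__":
    for name, in_window, queued in TABLE_3_2:
        committed = committed_before_crash(in_window, queued)
        share = committed / in_window
        print(f"{name:<11} committed={committed:>7} ({share:.1%} of replay window)")
```

For the three tree indexes, about 10% of the records in the replay window had already been committed; for the hash table the share is about 28%.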
Figure 3.11 shows that the total recovery time of a BOSC-based B+ tree implementation indeed increases with the input request rate, because a higher input request rate populates the per-block request queues faster and accumulates more uncommitted pending updates in the request queues by the time the system is shut down. These pending updates need to be scanned and reconstructed in Step (2) of the recovery process. As expected, the increase in total recovery time is roughly proportional to the increase in the number of uncommitted pending updates, which is plotted on the right Y axis of Figure 3.11.
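The linear relationship can be seen in a minimal sketch of Step (2): a single pass over the replay window that re-queues every uncommitted record under its target block. The record layout and the `committed` flag are hypothetical stand-ins; how BOSC actually distinguishes committed from uncommitted records is not described in this section.

```python
from collections import defaultdict, deque
from dataclasses import dataclass
from typing import Deque, Dict, Iterable


@dataclass(frozen=True)
class LogRecord:
    """Hypothetical log record layout used only for this illustration."""
    seq: int         # position of the record in the log
    block_id: int    # index block that the update targets
    payload: bytes   # the update itself
    committed: bool  # True if the update reached disk before the crash


def rebuild_request_queues(replay_window: Iterable[LogRecord]) -> Dict[int, Deque[LogRecord]]:
    """Rebuild the in-memory per-block request queues from the replay window.

    Records are visited in log order so that each queue preserves the original
    update order; committed records are skipped.
    """
    queues: Dict[int, Deque[LogRecord]] = defaultdict(deque)
    for rec in sorted(replay_window, key=lambda r: r.seq):
        if rec.committed:
            continue
        queues[rec.block_id].append(rec)
    return dict(queues)


if __name__ == "__main__":
    window = [
        LogRecord(3, 7, b"c", False),
        LogRecord(1, 7, b"a", False),
        LogRecord(2, 9, b"b", True),
    ]
    for block, queue in rebuild_request_queues(window).items():
        print(block, [r.seq for r in queue])
```

In this sketch each uncommitted record costs one queue append, so the work of Step (2) grows in proportion to the number of pending updates, in line with the trend reported for Figure 3.11.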
Chapter 4
Continuous Data Protection (CDP)
4.1 System Architecture
As shown in Figure 4.1, a Mariner storage system consists of six types of nodes. A client node, which could be a file or database server, accesses data in a virtual storage device through the iSCSI protocol. The current data of a virtual storage device is stored on a master storage node and replicated on a local mirror storage node. The virtual storage device's historical versions are maintained on a logging node (called a Trail node from this point on), which also serves as a control gateway for remote replication. Data writes are first committed to remote logging nodes and then propagated to remote storage nodes. A manager node is used for system configuration, administration, monitoring and failure recovery. A typical Mariner system contains multiple client nodes, storage nodes, Trail nodes, remote logging nodes and remote storage nodes, but only one manager node. A Trail node can be shared by multiple master and mirror nodes.
Figure 4.1: A Mariner storage system consists of six types of nodes: client nodes that issue data access requests, manager nodes for system configuration and administration, storage nodes that hold local replicas of current data, Trail or logging nodes that maintain historical data and serve as a gateway for remote replication, and remote logging/storage nodes that keep a remote copy of current data.
With CDP, Mariner allows users to roll back a virtual storage device to any point in time within the protection window, the period during which any update can be undone. Users can read and write the current snapshot of a virtual storage device but can only read its historical snapshots; that is, Mariner does not support version branching. To maintain file system consistency for a particular point-in-time storage snapshot, Mariner may need to perform an fsck-like recovery procedure on the snapshot to return a storage view with consistent file system metadata. This recovery procedure needs to modify a historical storage snapshot, but the associated disk writes are held in a temporary buffer and are thrown away when the snapshot is no longer needed.
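The idea of a read-only historical snapshot with a discardable repair buffer can be sketched as follows. This is a simplified illustration under our own assumptions: the write log is modeled as an in-memory list of (timestamp, block number, data) tuples, the snapshot is materialized by replaying the whole log, and the class and method names are hypothetical. It is not a description of how Mariner stores or indexes historical data.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical log layout: (timestamp, block number, data)
WriteLog = List[Tuple[float, int, bytes]]


class SnapshotView:
    """View of a virtual device as of time `as_of`, with a throwaway write buffer."""

    def __init__(self, log: WriteLog, as_of: float) -> None:
        # Replay every logged write up to `as_of`; the last write to a block wins.
        self._base: Dict[int, bytes] = {}
        for ts, block, data in sorted(log, key=lambda entry: entry[0]):
            if ts <= as_of:
                self._base[block] = data
        # Writes issued by the fsck-like repair go here, never into the log.
        self._scratch: Dict[int, bytes] = {}

    def read(self, block: int) -> Optional[bytes]:
        """Return repaired data if present, else the historical data, else None."""
        if block in self._scratch:
            return self._scratch[block]
        return self._base.get(block)

    def repair_write(self, block: int, data: bytes) -> None:
        """Record a consistency-repair write in the temporary buffer only."""
        self._scratch[block] = data

    def close(self) -> None:
        """Discard all repair writes once the snapshot is no longer needed."""
        self._scratch.clear()


if __name__ == "__main__":
    log: WriteLog = [(1.0, 0, b"a"), (2.0, 0, b"b"), (3.0, 1, b"c")]
    view = SnapshotView(log, as_of=2.5)
    print(view.read(0), view.read(1))
```

Because repair writes land only in the scratch buffer, the logged history itself is never modified, which matches the absence of version branching described above.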
Read requests for the current data on a virtual storage device are serviced by its associated storage nodes. Write requests for the current data on a virtual storage device are serviced by its associated Trail node and storage nodes. More specifically, a logical disk write request is first sent to the corresponding Trail node, which logs it to disk and returns an OK reply to the requesting client. Then the client writes it to one or multiple storage nodes, depending on the degree of local mirroring supported. As far as a Mariner client is concerned, a disk write is completed when it receives an OK reply from the Trail node. Because of track-based logging, Mariner clients experience very low disk write latency. To reduce the performance penalty associated with sending a disk write's payload to multiple nodes, Mariner uses TRM to duplicate the payload packet in the network.
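The ordering of the write path described above can be sketched with a toy in-process model. All class and method names are hypothetical, the nodes are plain Python objects rather than networked iSCSI targets, and TRM packet duplication is not modeled; the sketch only shows that a write counts as complete once the Trail node has logged it, after which it is propagated to the storage nodes.

```python
from typing import Dict, List, Tuple


class TrailNode:
    """Toy stand-in for a Trail node: appends each write to a log."""

    def __init__(self) -> None:
        self.log: List[Tuple[int, bytes]] = []

    def log_write(self, block: int, data: bytes) -> str:
        self.log.append((block, data))   # stands in for the on-disk log write
        return "OK"


class StorageNode:
    """Toy stand-in for a master or mirror storage node holding current data."""

    def __init__(self) -> None:
        self.blocks: Dict[int, bytes] = {}

    def write(self, block: int, data: bytes) -> None:
        self.blocks[block] = data


class Client:
    def __init__(self, trail: TrailNode, replicas: List[StorageNode]) -> None:
        self.trail = trail
        self.replicas = replicas

    def write(self, block: int, data: bytes) -> bool:
        # 1. Log the write on the Trail node; its OK reply completes the write
        #    from the client's point of view.
        completed = self.trail.log_write(block, data) == "OK"
        # 2. Then update the storage nodes (one per level of local mirroring).
        #    Done inline here for simplicity.
        if completed:
            for node in self.replicas:
                node.write(block, data)
        return completed


if __name__ == "__main__":
    trail, master, mirror = TrailNode(), StorageNode(), StorageNode()
    client = Client(trail, [master, mirror])
    print(client.write(7, b"payload"), trail.log, master.blocks, mirror.blocks)
```

In this model the write latency seen by the client is determined by step 1 alone, which is why fast logging on the Trail node translates into low write latency for clients.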
The Trail node of a virtual storage device services all read and write requests for that device's historical data, and batches multiple disk writes to replicate them to a remote site more efficiently. Because of space constraints, the details of remote replication are omitted in this technical report.