Portable and Efficient Continuous Data Protection for Network File Servers


Ningning Zhu

Tzi-cker Chiueh

Computer Science Department

Stony Brook University

Abstract

Continuous data protection, which logs every update to a file system, is an enabling technology for protecting file systems against malicious attacks and user mistakes, because it makes each file update undoable. Existing implementations of continuous data protection work either at the disk access interface or within the file system. Despite their implementation complexity, their performance overhead is significant when compared with file systems that do not support continuous data protection. Moreover, such kernel-level file update logging implementations are complex and cannot be easily ported to other operating systems. This paper describes the design and implementation of four user-level continuous data protection implementations for NFS servers, all of which work on top of the NFS protocol and thus can be easily ported to any operating system that supports NFS. Measurements obtained from running standard benchmarks and real-world NFS traces on these user-level continuous data protection systems demonstrate a surprising result: the performance of NFS servers protected by pure user-level continuous data protection schemes is comparable to that of unprotected vanilla NFS servers.

1 Introduction

Data in a file system can be lost or corrupted in the face of natural disasters, hardware/software failures, human mistakes, or malicious attacks. While replication and mirroring are effective defenses against hardware and site failures, they cannot protect file system data from human mistakes, software failures, and malicious attacks, against which conventional data backup systems provide only limited protection. Advanced multi-snapshot backup systems [12] decrease the amount of potential data loss, but still cannot completely prevent it. The most effective way to prevent these types of data loss is continuous data protection (CDP), or comprehensive versioning, which logs every modification to the file system and makes each file update operation undoable. CDP allows users to roll back their file system to any point in time in the past. As per-byte disk storage cost continues to drop precipitously and the financial penalty of data loss and system downtime increases significantly over time, CDP has emerged as a critical file system feature.

The key technical challenge for CDP is how to minimize the bandwidth and latency penalty associated with the file update logging it requires. Because file update logs are mainly for repair purposes, they are not expected to be accessed frequently. Therefore, it is possible to minimize the run-time performance overhead of CDP at the expense of increased access delay at repair time.

Previous versioning file systems [16, 11, 21] are based on kernel-level implementations, and thus are both complex and non-portable. Wayback [8] is a user-level versioning system and requires only a small kernel module. All of these systems incur non-trivial performance overhead. Some commercial products [3, 4] support continuous snapshotting at the user level but are tailored to specific applications such as Microsoft Exchange and Microsoft SQL rather than being a general CDP solution for the entire file system.

[Diagram: four panels. UCDP-O and UCDP-A each show an NFS client, a bridge device with a traffic interceptor, an NFS server holding the primary image, and a logging server holding the mirror image; the logging server runs a naive logger in UCDP-O and a non-overwrite logger in UCDP-A. UCDP-I shows an NFS client and an integrated server (NFS processor/non-overwrite logger, file system image). UCDP-K shows the same integrated server with an added kernel module.]

Figure 1. Comparison among the system architectures of the four user-level CDP schemes studied in this paper. Both UCDP-O and UCDP-A require a Traffic Interceptor to intercept NFS commands and responses, process them, and log them asynchronously. They differ in how they log write requests to disk. UCDP-I integrates NFS packet processing and file update logging into the protected NFS server and eliminates the need for a separate logging server. UCDP-K includes an in-kernel packet interception module to reduce context switching and memory copying overhead.

The goal of this research is to develop user-level CDP implementations that incur minimal performance overhead and are portable across multiple platforms. Reparable File Service (RFS) [23] is designed to transparently augment existing NFS servers with user-level file update logging and automatic data repair upon detection of user mistakes or malicious attacks. RFS logs file updates in terms of NFS commands/responses, and can inter-operate with the existing IT infrastructure without requiring any modifications. In addition to the portability advantage, logging NFS commands/responses also leads to a more compact log and a simpler design, because one NFS operation can result in multiple inode/indirect-block/data-block updates. For example, an NFS create request involves the following local file system operations on the NFS server: (1) a new inode for the created file is generated, (2) an entry for the created file is added to the current directory, (3) the current directory file may be expanded with a new block, (4) the block pointer of the current directory file is updated to point to the new block, and (5) the inode of the directory file is updated with new attributes. As a consequence, a create operation may generate multiple log records at the inode/block level but only one NFS-level log record.

The file update logging scheme in RFS, called UCDP-O (user-level continuous data protection using overwriting), requires a separate logging server that contains a mirror file system of the protected NFS server, and thus can log each file update asynchronously to minimize the performance impact on the protected NFS server. UCDP-O incurs non-negligible performance overhead only in the face of a long burst of file write operations. For each file write, UCDP-O (1) reads the before-image of the written block to compose an undo operation, (2) applies the write operation in place to the mirror file system, and (3) flushes the undo record to disk. The target file block is overwritten in step (2), hence the name UCDP-O.
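The three-step procedure can be illustrated with a minimal in-memory sketch. This is not the RFS code: the class and field names (`OverwriteLogger`, `UndoRecord`, `mirror`, `undo_log`) are hypothetical, the mirror file system is modeled as a dictionary, and the undo log as a Python list rather than an on-disk structure.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class UndoRecord:
    timestamp: int
    fid: str
    blkno: int
    before_image: Optional[bytes]  # None means the block did not exist before


@dataclass
class OverwriteLogger:
    # (fid, blkno) -> block contents; stands in for the mirror file system
    mirror: dict = field(default_factory=dict)
    # stands in for the on-disk undo log
    undo_log: list = field(default_factory=list)

    def log_write(self, timestamp, fid, blkno, data):
        before = self.mirror.get((fid, blkno))   # (1) read the before-image
        self.mirror[(fid, blkno)] = data          # (2) overwrite in place
        self.undo_log.append(                     # (3) record the undo entry
            UndoRecord(timestamp, fid, blkno, before))

    def undo_last(self):
        # Roll back the most recent write using its saved before-image.
        rec = self.undo_log.pop()
        if rec.before_image is None:
            self.mirror.pop((rec.fid, rec.blkno), None)
        else:
            self.mirror[(rec.fid, rec.blkno)] = rec.before_image
        return rec
```

Every logged write touches the block twice (one read, one write) and the log once, which is the cost the append approach discussed next tries to avoid.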

Although in-place update preserves on-disk data proximity, it requires an expensive three-step procedure for each file write: reading the before-image, writing the current image, and writing the before-image. One way to solve this problem is to use an append strategy to log file updates, where a new version is written to a different disk location than the old version. The append approach to file update logging, while more efficient, requires significant modification to file system metadata, as is the case with existing kernel-level versioning file systems [13, 11]. UCDP-A is a user-level continuous data protection scheme that uses an append approach but does not require any OS modification. UCDP-I improves upon UCDP-A by integrating the file update logging functionality directly into the protected NFS server, thus doing away with a separate logging server. In both UCDP-O and UCDP-A, the file update logging module only needs to process write requests, but not read requests. In contrast, the logging module in UCDP-I needs to process both read and write requests and send their replies to NFS clients. Even though user-level file update logging is more portable, it also incurs additional performance overhead in the form of additional data copying and context switching. UCDP-K improves upon UCDP-I by incorporating kernel-level optimizations that can eliminate most of these overheads at the expense of portability. Figure 1 compares the system architectures used by these four CDP schemes.

A complete CDP system consists of a run-time logging component and a repair-time restoration component. Due to space constraints, this paper focuses mainly on the efficiency of the logging component, as it is the dominant factor in run-time performance. More specifically, this work makes the following three research contributions:

• The first known user-level continuous data protection system that uses an append approach to file update logging and is portable across multiple platforms,

• A comprehensive comparison among four user-level continuous data protection implementations based on empirical measurements of their performance under various workloads, and

• A fully operational prototype that demonstrates the feasibility of portable and efficient user-level continuous data protection systems that can provide point-in-time rollback while incurring minimal performance overhead, and thus can be readily incorporated into mainstream file servers.

The rest of this paper is organized as follows. Section 2 provides a comprehensive survey of previous file versioning and continuous data protection systems. Section 3 describes the design and implementation of the four user-level continuous data protection schemes studied in this work. Section 4 presents the results of a comprehensive performance evaluation study of these CDP implementations and their analysis. Section 5 concludes this paper with a summary of its major research results.

2 User-Level Continuous Data Protection

2.1 UCDP-O: Overwriting Before Image

RFS [23] uses a mirror file system that is an exact replica of the protected file server. The mirror file system is accessed using NFS commands over a loop-back interface. RFS's undo log consists of a list of undo records, each of which is essentially an NFS command and contains all the necessary information to undo a file update operation, e.g., the before-image (or a link to it) of an updated file block. Undo records also contain a timestamp and are kept for a period of time called the logging window.

RFS classifies update requests into three categories: file block updates, directory updates, and file attribute updates. To log a file block update, RFS first reads the before-image of the target block, updates the target block, and then appends the before-image to the undo log. For directory updates, RFS does not need to save the old directory explicitly. For example, the undo operation for create is remove, which RFS can put directly into the undo log without reading any before-image. The same holds for mkdir, rmdir, and symlink/link, whose corresponding undo operations are rmdir, mkdir, and remove, respectively. The only exception is remove, whose undo operation depends on the object being deleted. If the object is a hard link, the undo operation is link. If the object is a symbolic link, the undo operation is symlink. If the object is a regular file, the undo operation is to create a new file and write it to its full length; hence the logging system needs to read the whole file and append its content to the undo log entry. For a file attribute update, i.e., setattr, RFS saves the old attribute value to the undo log and updates the attribute accordingly. Because the NFS protocol already includes the old attribute value in the NFS reply, RFS does not need to issue another getattr request to get the before-image. For a file truncate operation, RFS needs to read the truncated data and write it into the undo log before truncating the file. File block update, file truncate, and regular file delete are the most expensive NFS commands in terms of logging overhead and thus represent promising targets for performance optimization.

Figure 3 shows the four data structures used in UCDP-O:

• Protected file system stores the current file system image, which is managed by the underlying kernel file system.

• Mirror file system stores the mirrored file system image, which is also managed by the underlying kernel file system.

• Undo log is managed by a user-level file update logging daemon and consists of a list of time-stamped undo records, each of which stores the old image necessary to perform an undo operation, including old data blocks, directory entries, and attributes.

• File handle map associates the file system objects in the protected file system with those in the mirror file system.
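The directory-update rules described earlier in this subsection amount to a small lookup table. The sketch below encodes only the mappings stated in the text; the function names, the `object_kind` labels, and the `"create+write"` label for restoring a regular file are hypothetical conveniences, not RFS identifiers.

```python
# Undo operations for directory updates that need no before-image.
SIMPLE_UNDO = {
    "create": "remove",
    "mkdir": "rmdir",
    "rmdir": "mkdir",
    "symlink": "remove",
    "link": "remove",
}

# The undo operation for remove depends on the kind of object deleted.
REMOVE_UNDO = {
    "hard_link": "link",
    "symbolic_link": "symlink",
    "regular_file": "create+write",  # re-create the file and write it back
}


def undo_operation(op, object_kind=None):
    """Return the name of the undo operation for a directory update."""
    if op in SIMPLE_UNDO:
        return SIMPLE_UNDO[op]
    if op == "remove":
        if object_kind not in REMOVE_UNDO:
            raise ValueError(f"unknown object kind: {object_kind!r}")
        return REMOVE_UNDO[object_kind]
    raise ValueError(f"not a directory update: {op!r}")


def needs_file_contents(op, object_kind=None):
    """Only deleting a regular file forces the logger to read the whole file."""
    return op == "remove" and object_kind == "regular_file"
```

For example, `undo_operation("remove", "hard_link")` yields `"link"`, and `needs_file_contents` is true only for the regular-file delete case that the text identifies as expensive.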

2.2 UCDP-A: Leaving Before Image Intact

2.2.1 Overview

UCDP-A uses a non-overwrite or append-only file update logging strategy to reduce the three-step file update logging procedure used in UCDP-O to one step. When a file block is updated, UCDP-A allocates a new file block to hold the new version, and stores a pointer to the old version in the corresponding undo record. Unlike kernel-level versioning file systems, which can directly modify file metadata (such as the inode) to point to the new version, UCDP-A needs to maintain a separate user-level metadata structure called the block map to achieve the same purpose. Old data is kept intact during the logging window and recycled only when the corresponding undo record expires.

The first version of every file block is stored in UCDP-A's base image, which is similar to the mirror file system in UCDP-O. It has the same directory hierarchy and inode attribute values (except the file length attribute) as the protected file system, but is not an exact replica of the protected file system. UCDP-A uses a separate disk block pool, called the overwrite pool, to hold the second and later versions of each file block. Each file block in the overwrite pool is a virtual block that is uniquely identified by a vblkno. The pool is physically organized into multiple regular files in the local file system. UCDP-A uses a block usage map to keep track of the overwrite pool's usage and to store the obsolete time of each virtual block.

Each virtual block becomes obsolete when its associated logical file block is overwritten. Each virtual block in the overwrite pool can be free, contain the newest version of some file block, or contain an older version of some file block. The obsolete time of a free virtual block is 0. The obsolete time of a virtual block containing the newest version is infinity. If a virtual block contains an older version, its obsolete time corresponds to the timestamp of the undo record of the file update operation that obsoletes it. Any block with an obsolete time smaller than the lower bound of the logging window can be reused. For each block in the base image that contains an old version, there is an entry in the block map of the form <timestamp, fid, blkno, vblkno>, which indicates that the newest version of block blkno of file fid is stored at the virtual block vblkno. If vblkno is -1, the target file block has been truncated.

In summary, when a logical file block is created, it is created in the base image. When a logical file block is overwritten for the first time, a virtual block is allocated from the overwrite pool, and a mapping entry is added to the block map to maintain the mapping between the logical file block and its location in the overwrite pool. When a logical file block is overwritten for the second time, the vblkno number currently in its block map entry is stored in an undo log record, a new virtual block is allocated from the overwrite pool, and the block map entry is updated with the newly allocated block's virtual block number.
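The write path summarized above can be sketched with dictionaries standing in for the base image, the overwrite pool, and the block map. All names here (`AppendLogger`, `_allocate`, the `("base", ...)`/`("pool", ...)` pointer tuples) are hypothetical. The allocator is a plain counter with no reclamation, and the undo record for the initial creation of a block is omitted; both are simplifying assumptions rather than UCDP-A behavior.

```python
class AppendLogger:
    def __init__(self):
        self.base = {}         # (fid, blkno) -> first version of the block
        self.pool = {}         # vblkno -> second and later versions
        self.block_map = {}    # (fid, blkno) -> vblkno of the newest version
        self.undo_log = []     # (timestamp, fid, blkno, pointer to old version)
        self._next_vblkno = 0  # naive allocator; reclamation is not modeled

    def _allocate(self, data):
        vblkno = self._next_vblkno
        self._next_vblkno += 1
        self.pool[vblkno] = data
        return vblkno

    def write(self, timestamp, fid, blkno, data):
        key = (fid, blkno)
        if key not in self.base:
            # Block creation: the first version goes to the base image.
            self.base[key] = data
            return
        if key in self.block_map:
            # Second or later overwrite: the old version lives in the pool.
            old = ("pool", self.block_map[key])
        else:
            # First overwrite: the old version is the base image block.
            old = ("base", blkno)
        # The undo record holds only a pointer; no data is copied.
        self.undo_log.append((timestamp, fid, blkno, old))
        self.block_map[key] = self._allocate(data)

    def read(self, fid, blkno):
        key = (fid, blkno)
        if key in self.block_map:
            return self.pool[self.block_map[key]]
        return self.base[key]
```

Each overwrite costs a single data write plus a small metadata update, and every old version stays where it was written.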

[Diagram: day-by-day timeline showing the base image, the overwrite pool, and the delete pool. Day 1: create file, append blocks 0 and 1 (real len = base len = 8192). Days 2 to 5: block 1 is overwritten each day, producing versions 2 to 5 in the overwrite pool while older versions expire. Day 6: truncate block 1 (base len 8192, real len 4096). Day 7: delete file (real len 4096, base len 4096).]

Figure 2. This figure illustrates the design of UCDP-A by showing the evolution of a two-block file over a period of 7 days with a logging window of 3 days. Block 0 (sharp-corner box) of the file is never overwritten. Block 1 (arc box) is written five times. Shaded boxes represent the newest versions. Since the logging window is 3 days, the first version of Block 1 is created at day 1 and expires at day 4. At day 6, the file's userlen attribute in the file length map is modified due to a truncate operation, but the file in the base image remains intact. The file in the base image is physically truncated at day 7 when the fourth version of Block 1 expires. Due to a delete operation, the file is moved to the delete pool and physically deleted at day 10.

Essentially, UCDP-A distinguishes between write-once file blocks and overwritten file blocks. When a file contains only write-once file blocks, all its blocks are stored in the base image. However, as soon as some of them are overwritten, they are stored in the overwrite pool. As a result, this design reduces the block map's size and improves its access performance because of improved hashing and caching efficiency. For write-once file blocks, this scheme also preserves the disk proximity among adjacent file blocks.

For a file truncate operation, UCDP-A leaves the truncated data alone in the base image and decreases only the file's length attribute. Because UCDP-A does not physically truncate files in the base image, it needs to maintain a file length map to distinguish the file length of a logical file from that of the corresponding physical file in the base image. Each file length map entry is of the form <fid, userlen, baselen>. The truncate operation is executed as an update to the userlen field. Note that with non-overwrite logging, whether a write operation is an "append" or an "overwrite" is based on baselen and not userlen. Although baselen is available from the base image, it is stored in this map because it is frequently accessed. All other file attributes in the base image except file length are correct. The undo record for a file truncate operation again contains only a pointer to the truncated data.

A file delete operation is replaced by a rename operation that moves the deleted file to a special directory called the delete pool, which has a flat structure and assigns each new file inserted into it a unique name generated from the file's inode number. Accordingly, the undo operation for a file delete operation is another rename operation that brings the file back to its original directory. The change time attribute of a deleted file serves as the obsolete time and determines when the file will be deleted from the delete pool.
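The role of the file length map in truncation can be sketched as follows. The entry fields follow the <fid, userlen, baselen> form given above; the function names and the rule that a write starting at or beyond baselen counts as an append are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class FileLength:
    userlen: int  # length that NFS clients see
    baselen: int  # physical length of the file in the base image


def truncate(entry, new_len):
    # Logical truncate: only userlen changes, the base image is untouched.
    entry.userlen = new_len


def classify_write(entry, offset):
    # Append versus overwrite is decided by baselen, not userlen, because the
    # truncated data still occupies the base image.
    return "append" if offset >= entry.baselen else "overwrite"


def visible_size(entry):
    # The size reported to clients comes from userlen.
    return entry.userlen
```

With the two-block file of Figure 2 (8192 bytes), truncating to 4096 leaves `baselen` at 8192, so a later write at offset 4096 is still classified as an overwrite and the truncated data is preserved for undo.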

Figure 2 illustrates the lifetime of a two-block file, starting from the time when it is created, appended, overwritten, and truncated, until it is eventually deleted. UCDP-A's undo log is much smaller than UCDP-O's because it contains pointers to old versions rather than the actual data. Finally, just like UCDP-O, UCDP-A also needs a file handle map to maintain the mapping between the file handles in the base image and those in the protected file system. Figure 3 shows the data structures used in UCDP-A:

• Protected file system is the same as that in UCDP-O.

• Base image contains the current file directory hierarchy and most file attributes except file length, but its blocks could be current or obsolete.

• Overwrite pool is an extension of the base image that could also hold both current and obsolete blocks.

• Delete pool holds deleted files, i.e., old data and attributes.

• Block map contains the location of each file block and its timestamp, which is used for storage reclamation.

• Block usage map is used for allocation of virtual blocks in the overwrite pool.

• File length map stores the file length attribute of every logical file.

• Undo log is similar to that of UCDP-O except that each undo record contains a pointer to an old data block rather than the data block itself.

• File handle map is the same as that in UCDP-O.

2.2.2 Storage Reclamation

Each deleted file, truncated file block, and overwritten file block is pointed to by some undo record. The obsolete time of these blocks and files is the timestamp of the corresponding undo records. These timestamps allow UCDP-A to reclaim these blocks after they fall off the logging window. The obsolete time of each block in the overwrite pool is stored in its corresponding entry in the block usage map. Because allocation of new virtual blocks is also based on the block usage map, it is straightforward to integrate virtual block reclamation with new block allocation by examining each block's obsolete time when scanning through the block usage map. This way, the expired virtual blocks can be re-allocated.
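A minimal sketch of this combined allocation and reclamation follows. It uses the obsolete-time conventions stated earlier (0 for a free block, infinity for a block holding the newest version). The class name, the linear scan, and the assumption that all timestamps and the window's lower bound are positive are illustrative choices, not details from UCDP-A.

```python
import math


class BlockUsageMap:
    def __init__(self, nblocks):
        # One obsolete time per virtual block; 0 marks a free block.
        self.obsolete_time = [0.0] * nblocks

    def allocate(self, window_start):
        """Return a virtual block that is free or has expired.

        window_start is the lower bound of the logging window (assumed > 0).
        """
        for vblkno, t in enumerate(self.obsolete_time):
            if t < window_start:
                # Free (0) or fell off the logging window: reuse it. The block
                # now holds a newest version, so its obsolete time is infinity.
                self.obsolete_time[vblkno] = math.inf
                return vblkno
        raise RuntimeError("overwrite pool exhausted")

    def mark_obsolete(self, vblkno, timestamp):
        # Called when the logical block stored here is overwritten; timestamp
        # is that of the undo record that points to this block.
        self.obsolete_time[vblkno] = timestamp
```

A block marked obsolete at time 12 cannot be reused while the window's lower bound is 11, but becomes allocatable once the lower bound passes 12.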

[Diagram: three panels with a legend distinguishing data, (kernel-level) metadata, and user-level metadata. UCDP-O: an NFS server with the protected file system, and a logging server with the mirror file system, the file handle map, and an undo log with data. UCDP-A: an NFS server with the protected file system, and a logging server with the base image, overwrite pool, delete pool, block map, block usage map, file length map, file handle map, and an undo log with data pointers. UCDP-I/UCDP-K: a single logging/NFS server with the base image, overwrite pool, delete pool, block map, block usage map, file length map, and an undo log with data pointers.]

Figure 3. This figure compares the data structures of UCDP-O, UCDP-A, and UCDP-I. There is no difference between UCDP-I and UCDP-K in their data structures. There are three kinds of data structures: data, kernel-level metadata, and user-level metadata. Data refers to the regular file block data. Kernel-level metadata includes the super-block, inodes, indirect blocks, and directories, and is maintained inside the kernel. User-level metadata refers to the auxiliary data structures maintained by the user-level CDP system.

The obsolete time of a base image file block is kept in the corresponding block map entry, if it exists. When a file block is overwritten, UCDP-A checks the corresponding block map entry to see if the associated file block in the base image has already expired, and overwrites the base image block if that is the case, essentially reclaiming this base image block. However, this approach cannot reclaim an old file block in the base image if it never gets overwritten after it expires. One way to reclaim blocks that never get overwritten is to have a background cleaner periodically check whether any block in the base image has expired, and move the current version of the corresponding logical block from the overwrite pool into the expired base image block. This block migration incurs extra overhead, and therefore should be done only when the system load is light and when the file block is not going to be overwritten any time soon.

The background cleaner also checks the last update time attribute of files in the delete pool. Expired files are physically deleted and the corresponding entries in the file handle map and file length map are freed. Finally, the background cleaner periodically scans through the block map to look for any entry with a virtual block number of -1, which indicates that the associated block has been logically truncated. If a truncated virtual block is a file's last logical block according to the file's baselen attribute, the file's base image is physically truncated and its baselen is modified accordingly.

The undo log itself is recycled in a cyclic fashion. Because undo log entries contain old versions of file attributes, those attributes are reclaimed together with the undo log entries that hold them.

2.2.3 File System Consistency and Fault Tolerance

After a machine crashes, UCDP-A restores the following three types of consistency: (a) consistency of the local file system on the logging server and on the protected NFS server, (b) consistency of user-level metadata, and (c) consistency between a logging server and the NFS server it protects. After restart, the standard local file system recovery (fsck) is first performed to guarantee (a), and then user-level file system recovery is performed for (b) and (c).

Similar to the standard file system journaling technique, an operation journal recorded by the traffic interceptor and the undo log on the logging server facilitate the user-level file system recovery. The “fsck” algorithm of UCDP-A works as follows:

1. Traverse the base image, and check (1) whether for each file system object on the logging server, there is an object on the protected NFS server, an entry in the file handle map, and an entry in the file length map; (2) whether the size of each object on the protected NFS server is equal to the userlen field of its corresponding file length map entry, and the size of an object's base image is equal to the baselen attribute in the corresponding file length map entry.

2. Examine each <timestamp, fid, blkno, vblkno> entry in the block map. If vblkno is not -1, check that blkno is within the userlen of the file fid, and that the block usage map entry for the virtual block number vblkno has an obsolete time of infinity. If the virtual block number is -1, check whether the file block is a truncated file block according to the userlen of the file.

3. Check that for each entry in the block usage map with an obsolete time of infinity, there is a corresponding entry in the block map.
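Steps 2 and 3 of the algorithm can be sketched as a pure function over in-memory copies of the maps. The function name, the 4096-byte block size, the dictionary/list representations, and the interpretation of "within userlen" as `blkno * BLOCK_SIZE < userlen` are assumptions made for illustration; step 1 is omitted because it requires walking real file systems.

```python
import math

BLOCK_SIZE = 4096  # assumed block size


def check_user_metadata(block_map, usage, userlen):
    """Cross-check the block map against the block usage and file length maps.

    block_map: list of (timestamp, fid, blkno, vblkno) entries
    usage:     dict vblkno -> obsolete time (math.inf means newest version)
    userlen:   dict fid -> logical file length in bytes
    Returns a list of human-readable inconsistency messages.
    """
    errors = []
    mapped = set()
    for timestamp, fid, blkno, vblkno in block_map:
        within = blkno * BLOCK_SIZE < userlen[fid]
        if vblkno != -1:
            mapped.add(vblkno)
            if not within:
                errors.append(f"{fid}:{blkno} mapped beyond userlen")
            if usage.get(vblkno) != math.inf:
                errors.append(f"vblk {vblkno} mapped but not marked newest")
        elif within:
            # vblkno == -1 means truncated, so the block must lie past userlen.
            errors.append(f"{fid}:{blkno} marked truncated but within userlen")
    # Step 3: every block marked newest must be referenced by the block map.
    for vblkno, obsolete in usage.items():
        if obsolete == math.inf and vblkno not in mapped:
            errors.append(f"vblk {vblkno} marked newest but unmapped")
    return errors
```

An empty result means the two maps agree; each message identifies one entry that a real recovery pass would have to repair.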

The "fsck" algorithms for UCDP-O, UCDP-I, and UCDP-K are simpler and can be easily derived from UCDP-A's "fsck" algorithm. UCDP-O only needs to maintain consistency among the protected file system, the mirror file system, and the file handle map. The "fsck" algorithms for UCDP-I and UCDP-K are largely the same as UCDP-A's, except that there is no logging server or file handle map, and therefore no need to maintain the consistency related to them.

[Diagram: request and reply paths for (A) NFS, (B) UCDP-I, and (C) UCDP-K, drawn across the user/kernel boundary through the TCP/IP stack, the in-kernel NFSD, the user-level logging module (NFS proxy), and, in (C), a kernel module. Path elements are labeled a through h. Bold lines indicate path elements on which a large data payload may appear.]

Figure 4. This figure illustrates the NFS packet processing path in UCDP-I and its overhead due to context switching, memory copying, and user-level processing. In UCDP-K, a kernel packet interception module reduces the overhead by providing a short-cut path and eliminating the copying of large data payloads.

With a separate logging server, UCDP-A not only provides file update logging, but also serves as a mirroring system that can tolerate a single node failure. Upon a system failure, if the disks on both the protected NFS server and the logging server fail, data is lost; if the disk on the protected NFS server fails, data can be copied from the logging server; if the disk on the logging server fails, current data can be copied from the protected NFS server but the old data is lost; if both disks are working but have lost synchronization with each other, they need to run "fsck" to guarantee local file system consistency.

2.3 UCDP-I: Integrating Logging with Protected File Server

Both UCDP-O and UCDP-A log file updates on a dedicated logging server, and thus are more transparent to the protected file server in terms of performance impact and deployment simplicity. In contrast, UCDP-I integrates file update logging into an existing network file server without requiring additional hardware. There are three design changes in the transition from UCDP-A to UCDP-I: (1) UCDP-I does not need the file handle map because there is only one copy of the protected file system in the UCDP-I architecture. (2) UCDP-I needs to process both read and write requests as well as their responses, because the protected network file server is logically built on top of the file update logging module of UCDP-I. In contrast, UCDP-O and UCDP-A only need to process write requests, and do not need to touch the replies to read or write requests. (3) The undo logging in UCDP-I has to be done synchronously, because a request cannot be serviced before its before-image is saved. As a result, the logging overhead is added to the latency of normal request processing.

Logically, each incoming NFS request first goes to UCDP-I's user-level file update logging module or NFS proxy, which modifies the request properly and sends it to the local NFS daemon in the kernel, which in turn sends a reply back. The file update logging module converts the reply into a response packet and sends it back to the requesting NFS client. In this design, as shown in Figure 4(B), the file update logging module acts like an NFS proxy.

If an NFS request involves only one data block, UCDP-I needs to determine whether the request should be directed to the base image or to the overwrite pool. If it should go to the overwrite pool, the request's parameters (file handle, offset, count) need to be modified first. If the request involves more than one block, UCDP-I needs to check each block and, if necessary, split the request into multiple requests. After receiving a reply, UCDP-I may need to modify the file handle and attribute information if the request has been directed to the overwrite pool. If an incoming request is split into multiple requests, UCDP-I reassembles their replies into one reply and sends the whole reply back to the requesting NFS client. In case some of these replies are successful and some are not, UCDP-I resolves the inconsistency and returns a coherent reply.
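The per-block check and request splitting can be sketched as a function that maps a byte range onto sub-requests. This is an illustration under several assumptions: a 4096-byte block size, a block map keyed by (fid, blkno), an overwrite pool addressed as one flat space where virtual block v starts at byte v * BLOCK_SIZE (the paper organizes the pool as multiple regular files), and no truncated (-1) entries. The function name and tuple layout are hypothetical.

```python
BLOCK_SIZE = 4096  # assumed block size


def split_request(fid, offset, count, block_map):
    """Split a byte range into sub-requests against the base image or pool.

    block_map maps (fid, blkno) -> vblkno for overwritten blocks.
    Returns a list of (target, target_offset, length) tuples, where target is
    "base" or "pool". Adjacent pieces that stay contiguous on the same target
    are merged, so an untouched file yields a single sub-request.
    """
    parts = []
    pos, end = offset, offset + count
    while pos < end:
        blkno, in_blk = divmod(pos, BLOCK_SIZE)
        length = min(BLOCK_SIZE - in_blk, end - pos)
        vblkno = block_map.get((fid, blkno))
        if vblkno is None:
            target, toff = "base", pos
        else:
            target, toff = "pool", vblkno * BLOCK_SIZE + in_blk
        if parts and parts[-1][0] == target and parts[-1][1] + parts[-1][2] == toff:
            # Extend the previous sub-request instead of emitting a new one.
            prev = parts[-1]
            parts[-1] = (target, prev[1], prev[2] + length)
        else:
            parts.append((target, toff, length))
        pos += length
    return parts
```

A three-block read whose middle block has been overwritten turns into three sub-requests, whose replies would then need to be reassembled in order before answering the client.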

Figure 3 shows the data structures used in UCDP-I, which are similar to those in UCDP-A except it does not need a mirror file system or file handle map.

2.4 UCDP-K: Reducing Context Switching and Memory Copying Overhead

With a user-level implementation, an NFS request and its reply are passed between the kernel and the user-level file update logging module multiple times in UCDP-I. UCDP-K introduces a special kernel module to reduce this context switching and memory copying overhead. Figure 4 illustrates the difference in packet processing path between UCDP-I and UCDP-K. When the kernel module receives an NFS request/reply, UCDP-K processes it in one of the following three ways:

• Path-0: Forwarding the request/reply to the in-kernel NFS daemon/NFS client directly (a→d / e→h in Figure 4(C)), if the user-level NFS proxy does not need to modify the request/reply, e.g., the readdir command.

• Path-1: Forwarding the request/reply to the in-kernel NFS daemon/NFS client and, in parallel, to the user-level NFS proxy (a→b and a→d in parallel / e→f and e→h in parallel in Figure 4(C)), if the request/reply does not need to be modified but needs to be recorded, e.g., the create command.

• Path-2: Forwarding the request/reply to the user-level NFS proxy if the request/reply potentially needs to be modified (a→b→c→d→e→f→g→h in Figure 4(C)), e.g., the read or write command.

Path-0 represents the zero-overhead path, which is as fast as normal kernel-level NFS processing. Path-2 involves two context switches/memory copies because the original request and reply have to be sent to the user-level NFS proxy, and after user-level processing, the modified request or reply has to be sent back to the kernel and forwarded to the NFS daemon or the requesting NFS client. The additional overhead affects not only the CPU utilization but also the end-to-end latency experienced by the NFS requests. Path-1 incurs only one additional context switch/memory copy for sending a packet to the user-level daemon. The overhead affects only the CPU utilization but not the end-to-end latency because the user-level processing is not on the critical path of NFS packet processing.
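The three-way demultiplexing decision and its cost can be summarized in a few lines. The function and table names are hypothetical, and the two boolean inputs are a simplification of whatever per-command logic the kernel module actually applies; the cost figures restate the paragraph above.

```python
def choose_path(may_need_modification, needs_recording):
    """Pick the UCDP-K processing path for an NFS request or reply."""
    if may_need_modification:
        return 2  # via the user-level NFS proxy, on the critical path
    if needs_recording:
        return 1  # copy to the proxy in parallel with the in-kernel path
    return 0      # straight to the in-kernel NFS daemon or the NFS client


# Extra kernel/user crossings (context switch plus memory copy) per path,
# and whether they add to end-to-end latency, as described in the text.
EXTRA_CROSSINGS = {0: 0, 1: 1, 2: 2}
ADDS_LATENCY = {0: False, 1: False, 2: True}
```

With the examples from the list above, readdir corresponds to `choose_path(False, False)`, create to `choose_path(False, True)`, and read or write to a call with `may_need_modification=True`.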

The intelligent demultiplexing scheme forwards directly to the NFS daemon those NFS requests and replies that are unrelated to continuous data protection. However, in many cases an NFS reply requires only a very simple modification. For example, a getattr reply is correct in every field except the file length, which needs to be changed from baselen to userlen according to the file length map. The same holds for many of the replies to read and write requests that are directed to the base image. Therefore we introduce another optimization into UCDP-K called in-kernel reply modification. When the user-level CDP system sends a request to the NFS daemon (step c in Figure 4(C)), whenever possible it also gives the kernel module specific instructions on how to perform the simple modification when the corresponding reply arrives (step e). With this optimization, many NFS replies that used to take Path-2 can now take the less expensive Path-0 or Path-1. This optimization is particularly effective for read replies that contain a large data payload.
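A minimal sketch of in-kernel reply modification follows, assuming that pending instructions are matched to replies by the RPC transaction id (xid) and that a reply can be modeled as a dictionary of fields. The class ReplyPatcher, its methods and the field name "size" are hypothetical; the paper does not describe the actual interface between the proxy and the kernel module.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class ReplyPatcher:
    # xid -> (field name, new value); stands in for kernel-resident state.
    pending: Dict[int, Tuple[str, int]] = field(default_factory=dict)

    def register(self, xid: int, attr: str, value: int) -> None:
        """Proxy side (step c): leave an instruction for the matching reply."""
        self.pending[xid] = (attr, value)

    def on_reply(self, xid: int, reply: dict) -> Tuple[dict, bool]:
        """Kernel side (step e): patch the reply if an instruction exists.

        Returns (reply, handled_in_kernel). When no instruction was left,
        the reply is returned untouched and is assumed to take Path-2.
        """
        instruction = self.pending.pop(xid, None)
        if instruction is None:
            return reply, False
        attr, value = instruction
        patched = dict(reply)
        patched[attr] = value
        return patched, True


if __name__ == "__main__":
    baselen, userlen = 8192, 5000  # illustrative lengths only
    patcher = ReplyPatcher()
    patcher.register(xid=42, attr="size", value=userlen)
    print(patcher.on_reply(42, {"type": "getattr", "size": baselen}))
```

The instruction is consumed when the reply arrives, so a later reply that happens to reuse the same xid is not patched by mistake.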

The last optimization in UCDP-K is write payload decoupling, which reduces the memory copying overhead of write requests. A write request always needs to be processed by the user-level NFS proxy. However, because the user-level processing usually does not touch the payload, the kernel module can save a write request inside the kernel and forward only the request header to the user-level module (step b). When the NFS proxy sends the modified header back, the kernel module replaces the old header with the new header. If the NFS proxy does need the payload because the request has to be split, it makes another system call to explicitly retrieve the request's data payload.
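Write payload decoupling can be pictured with the following sketch, which models the kernel module as an object that keeps each payload and hands only the header to the proxy. The class PayloadStash, its method names and the xid key are assumptions made for illustration and do not reflect the real system call interface.

```python
from typing import Dict, Tuple


class PayloadStash:
    """Toy model of the kernel side of write payload decoupling."""

    def __init__(self) -> None:
        self._payloads: Dict[int, bytes] = {}  # xid -> payload kept in the kernel

    def on_write_request(self, xid: int, header: dict, payload: bytes) -> dict:
        """Keep the payload and pass only the header to user level (step b)."""
        self._payloads[xid] = payload
        return header

    def fetch_payload(self, xid: int) -> bytes:
        """Explicit extra call, needed only when the proxy must split a request."""
        return self._payloads[xid]

    def on_header_return(self, xid: int, new_header: dict) -> Tuple[dict, bytes]:
        """Reattach the stored payload to the header rewritten by the proxy."""
        return new_header, self._payloads.pop(xid)


if __name__ == "__main__":
    stash = PayloadStash()
    data = b"x" * 4096
    header = stash.on_write_request(1, {"proc": "write", "offset": 0}, data)
    rewritten = dict(header, offset=1 << 20)  # proxy redirects the write
    print(stash.on_header_return(1, rewritten)[0])
```

In the common case the payload never crosses the kernel/user boundary, which is where the saving in memory copying comes from.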

3 Performance Evaluation

In this section, we evaluate and compare the run-time performance overheads of the four user-level continuous data protection schemes using micro benchmarks, the Harvard NFS traces [9], and the SPECsfs 3.0 benchmark [5]. By default all the machines are equipped with the same hardware configuration (1.4GHz Pentium IV CPU, 500 MB memory, 100Mbps Ethernet card) and OS platform (Redhat 7.2 with Linux kernel 2.4.7-10). The base case for performance comparison is the vanilla NFS server on the same platform, which sets the lower bound on the performance overhead of all CDP implementations.

3.1 Effects of Non-Overwrite Logging

A vanilla NFS server services write requests using in-place updates, whereas UCDP-A, UCDP-I and UCDP-K use the non-overwrite strategy. Under the non-overwrite strategy, random overwrite operations are turned into sequential writes to the overwrite pool if there are enough contiguous free virtual blocks. At the same time, sequential reads may become random reads if the file blocks have been overwritten randomly and dispersed in the overwrite pool. As a result, the non-overwrite strategy may perform better than in-place updates for workloads dominated by random writes, but worse for workloads dominated by sequential reads after random writes.
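The following toy simulation illustrates this trade-off. It is a sketch under simplifying assumptions (the overwrite pool always has contiguous free blocks, one block per overwrite, no cleaning), and the function names simulate and discontinuities are hypothetical.

```python
import random
from typing import Dict, List, Tuple

Location = Tuple[str, int]  # (area, block offset), area is "base" or "pool"


def simulate(num_blocks: int, num_overwrites: int,
             seed: int = 0) -> Tuple[Dict[int, Location], List[int]]:
    """Apply random single-block overwrites under a non-overwrite policy."""
    rng = random.Random(seed)
    # The file is first written sequentially into the base image.
    location = {i: ("base", i) for i in range(num_blocks)}
    pool_offsets = []
    next_free = 0
    for _ in range(num_overwrites):
        block = rng.randrange(num_blocks)      # random logical target
        location[block] = ("pool", next_free)  # new version appended to the pool
        pool_offsets.append(next_free)
        next_free += 1
    return location, pool_offsets


def discontinuities(location: Dict[int, Location], num_blocks: int) -> int:
    """Count the disk jumps a sequential logical read of the file would make."""
    jumps = 0
    for i in range(1, num_blocks):
        prev_area, prev_off = location[i - 1]
        area, off = location[i]
        if area != prev_area or off != prev_off + 1:
            jumps += 1
    return jumps


if __name__ == "__main__":
    loc, offsets = simulate(num_blocks=1000, num_overwrites=500)
    print("pool writes are sequential:", offsets == list(range(500)))
    print("jumps during a sequential read:", discontinuities(loc, 1000))
```

The pool offsets form a strictly sequential run regardless of which logical blocks were overwritten, while a sequential read of the file afterwards has to alternate between the base image and scattered pool locations.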

To quantify the performance impact of the non-overwrite logging strategy, we constructed the following micro benchmark for the vanilla NFS and UCDP-K. The experiments use a server with 256MB RAM and a client with 128MB RAM. The server may run vanilla NFS or UCDP-K. The client is a generic NFS client. First we created a 500MB file on the server through sequential write from the client. In this setup, there is no cache hit on either the client side or the server side. The sequential write throughput for both vanilla NFS and UCDP-K is 11MB/sec.

Then we performed a sequence of 4096-byte overwrite operations at random offsets in the 500MB file until the size of the overwrite pool reached 2GB, which produces a sufficient disk layout difference between vanilla NFS and UCDP-K. Under vanilla NFS the disk utilization is 100% and the write throughput is 1.54MB/sec. Under UCDP-K, the disk utilization is 22.5% and the write throughput is 10.23MB/sec. Overall, the disk access efficiency of UCDP-K is 30 times higher than that of the vanilla NFS. This result shows that the non-overwrite strategy behaves similarly to a log-structured file system [20], which is designed specifically to convert random writes into sequential writes.

Figure 5. Throughput comparison as the percentage of write requests in the input workload varies. (x-axis: percentage of update requests, 0 to 96%; y-axis: throughput in ops/sec; curves: NFS & UCDP-A, UCDP-O, UCDP-I, UCDP-K; SPEC load 700.)

Figure 6. Average request processing latency comparison as the percentage of write requests in the input workload varies. (x-axis: percentage of update requests, 0 to 96%; y-axis: average per-request latency in msec; curves: NFS, UCDP-I, UCDP-K at SPEC loads 500 and 200.)

Finally, we performed a sequence of sequential read operations against the same 500MB file with a request size of 4096 bytes. Under vanilla NFS, the disk utilization is 18.6% and the read throughput is 4.76 MB/sec. Under UCDP-K, the disk utilization is 94.4% and the read throughput is 0.87 MB/sec. Overall, the disk access efficiency of UCDP-K is 28 times worse than that of the vanilla NFS server. Periodic cleaning can mitigate the loss of sequential read locality by moving the current versions of those file blocks that have become read-only from the overwrite pool to the base image.
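The efficiency ratios quoted in this subsection can be reproduced from the reported measurements if disk access efficiency is taken to mean throughput per unit of disk utilization. That definition is an assumption, since the text does not spell it out, but it matches both reported ratios. A short check:

```python
def efficiency(throughput_mb_s: float, disk_utilization: float) -> float:
    """Throughput obtained per unit of disk busy time (assumed definition)."""
    return throughput_mb_s / disk_utilization


# Random 4096-byte overwrites: UCDP-K (22.5% busy) vs. vanilla NFS (100% busy).
write_ratio = efficiency(10.23, 0.225) / efficiency(1.54, 1.00)

# Sequential reads afterwards: vanilla NFS (18.6% busy) vs. UCDP-K (94.4% busy).
read_ratio = efficiency(4.76, 0.186) / efficiency(0.87, 0.944)

if __name__ == "__main__":
    print(f"write efficiency, UCDP-K over NFS: {write_ratio:.1f}x")
    print(f"read efficiency, NFS over UCDP-K: {read_ratio:.1f}x")
```

Under this definition the two ratios come out at roughly 29.5 and 27.8, consistent with the rounded factors of 30 and 28 given in the text.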

3.2 Comparison among Continuous Data Protection Schemes

SPECsfs has a default NFS request mix, where 12% of the requests update the file system and the remaining requests are read-only. The file update logging overhead is more pronounced when the proportion of write requests is high. To stress-test the CDP schemes, we varied the write request percentage from 12% to 96%. The distribution among different types of write requests and the distribution among different types of read-only requests remain fixed throughout the experiments. We also varied the input load, and set the initial working set size to be proportional to the input load, for example, 7GB for the load of 700 ops/sec and 5GB for the load of 500 ops/sec.

The measured peak throughput of a vanilla unprotected NFS server under SPECsfs decreases as the write percentage increases. At 12%, it is about 700 ops/sec and goes down to 500 ops/sec at 96%. Figure 5 shows that all four continuous data protection schemes yield almost the same throughput when the percentage of write requests in the input workload is less than 12%. UCDP-A performs the same as the vanilla NFS server because the logging server in UCDP-A is not the system bottleneck, and the protected NFS server in UCDP-A is identical to the vanilla NFS server. As the write request percentage increases, UCDP-O is limited by the logging server due to the expensive three-step file update logging procedure (Section 1). Surprisingly, even with all the extra processing due to versioning, UCDP-I and UCDP-K actually outperform the vanilla NFS server in throughput by around 7%. In this case, the disk is the system bottleneck, and the advantage of the non-overwrite strategy in processing random write requests, as discussed in Section 3.1, is significant enough that it results in a small but distinct overall performance improvement. Another block-level versioning system, Clotho [17], reported a similar performance gain due to the use of a non-overwrite strategy. Because the kernel module in UCDP-K has no effect on disk access efficiency, UCDP-K performs roughly the same as UCDP-I in all cases.

Figure 6 shows the average per-request latency of the vanilla NFS server, UCDP-I, and UCDP-K. Each latency number represents the average of ten measurements. The upper three curves correspond to a SPEC load of 500 ops/sec, while the lower three curves correspond to a SPEC load of 200 ops/sec. The latencies of UCDP-O and UCDP-A are similar to that of the vanilla NFS because the reply to each NFS request actually comes from their primary NFS server. The results for UCDP-I and UCDP-K are similar, because the kernel module has no significant impact on latency. When the write request percentage is no more than 36%, the average per-request latency of UCDP-I/UCDP-K is similar to that of the vanilla NFS server. As the write request percentage increases further, the per-request latency of UCDP-I/UCDP-K becomes higher than that of the vanilla NFS server, and the latency gap also increases. In the worst case, when the write request percentage is at 96% of the SPEC load of 200 ops/sec, the gap is about 7 msec, which represents a 200% latency overhead. However, this latency gap decreases as the load increases. For example, the additional latency overhead is reduced to 10% to 30% at load 500.

Figure 7. CPU utilization comparison as the percentage of write requests in the input workload varies. (x-axis: percentage of update requests, 0 to 96%; y-axis: CPU utilization in %; curves: UCDP-I, UCDP-K, NFS; SPEC load 500.)

The latency overhead of UCDP-I/UCDP-K compared with the vanilla NFS server comes from the extra processing associated with file update logging. With the non-overwrite strategy, to serve a client request, multiple requests may need to be issued to the local NFS daemon, for purposes such as reassembling read/write requests, reading the before image when a write request is not aligned, and checking the file type of an object to be deleted. These requests do not impose much additional disk bandwidth requirement, but they do increase the request processing latency. As the write request percentage grows, more and more blocks are overwritten and reside in the overwrite pool, more reassembling of read or write requests is needed, the probability of issuing multiple local requests per client request becomes higher, and eventually the per-request latency of UCDP-I/UCDP-K increases.

Figure 7 compares the CPU utilization of NFS, UCDP-I and UCDP-K. UCDP-O and UCDP-A are excluded because both of them require a separate server. When the input SPECsfs load is 500 ops/sec, the throughputs of vanilla NFS, UCDP-I and UCDP-K are comparable, but the CPU utilization of UCDP-I and UCDP-K is about 170% and 85%, respectively, higher than that of vanilla NFS. These results suggest that file update logging indeed consumes additional CPU resources, and that the kernel module in UCDP-K effectively reduces the CPU consumption by eliminating a large portion of the context switching and memory copying overhead. When the CPU is the system bottleneck, UCDP-K should outperform UCDP-I in terms of overall throughput. To substantiate this claim, we modified the SPEC workload so that it is read-only and has a high buffer hit ratio, and upgraded the network connection from a 100 Mbps to a 1000 Mbps network. As a result, disk and network are no longer the system bottleneck. With an initial working set size of 300MB and a SPECsfs input load of 7000 ops/sec, the measured throughputs of NFS, UCDP-I and UCDP-K are 6560, 4166 and 5441 ops/sec, respectively. UCDP-K indeed outperforms UCDP-I by 30%.
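As a quick check of the CPU-bound result, the relative gain of UCDP-K over UCDP-I and the fraction of vanilla NFS throughput each scheme retains follow directly from the three reported throughputs. The helper name relative_gain is ours; the numbers are those in the text.

```python
throughput = {"NFS": 6560, "UCDP-I": 4166, "UCDP-K": 5441}  # ops/sec, CPU-bound run


def relative_gain(new: float, old: float) -> float:
    """Fractional improvement of `new` over `old`."""
    return (new - old) / old


gain_k_over_i = relative_gain(throughput["UCDP-K"], throughput["UCDP-I"])
fraction_of_nfs = {name: ops / throughput["NFS"] for name, ops in throughput.items()}

if __name__ == "__main__":
    print(f"UCDP-K over UCDP-I: {gain_k_over_i:.1%}")
    for name, frac in fraction_of_nfs.items():
        print(f"{name}: {frac:.1%} of vanilla NFS throughput")
```

The gain is about 30.6%, in line with the 30% stated above. In this CPU-bound setting UCDP-I and UCDP-K reach about 63.5% and 82.9% of the vanilla NFS throughput.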

4 Related Work

WAFL [12] is a general-purpose high performance file system with snapshot support developed by Network Appliance. WAFL is not optimized for fine-grained file update logging, and allows only a limited number (32 originally) of snapshots. Each snapshot is taken at a coarse granularity and the cost is amortized over hundreds of file updates.

Wayback [8] is a user-level comprehensive versioning file system for Linux. For each data update, Wayback uses an undo logging scheme similar to UCDP-O. For each metadata update, Wayback incurs a higher overhead than all of our logging schemes. For normal file system updates, the performance of Wayback is quite poor compared with traditional file systems. When compared with EXT3, the data read/write overhead ranges from -2% to 70%, and the metadata update overhead ranges from 100% to 400%. In contrast, the performance of UCDP-I and UCDP-K is comparable to that of a generic NFS server running on top of EXT3.

CVFS [11] is a kernel-level comprehensive versioning file system that is optimized for metadata logging efficiency. Journal-based metadata is used for inode/indirect-block updates and a multiversion B-tree [10] is used for directory updates. S4 [16] is a secure network-attached object store designed to withstand malicious attacks. S4 logs every update and minimizes the space explosion. S4 uses a log-structured design to avoid overwriting old data. S4 improves the inode/indirect-block logging efficiency by encoding their changes in a logging record. Although the overhead of S4's logging scheme is low, its cleaning cost can be as high as 50%.

The log-structured file system (LFS) [20] was designed to reduce the disk access penalty of small random writes, yet it shares the non-overwrite property with file update logging: logging systems avoid overwriting in order to preserve old data, whereas LFS writes data to new locations in large batches to improve write performance. In LFS, cleaning is essential to preserve the on-disk locality of files, but its overhead is often high. This issue is less pronounced in the proposed user-level file update logging system because we use a base image (Section 4.3.1) to maintain disk locality, and our cleaning cycle is much longer, so the cost is amortized.

Elephant [13] is a kernel-level versioning file system that creates a new version only when a file is closed. Therefore, it does not distinguish among the updates made between a file open and the corresponding file close operation. VersionFS [19] is a versioning file system implemented using the stackable file system technique [22]. Similar to Elephant [13], its versions are based on open-close sessions and its versioning policy is flexible. VersionFS also provides a friendly interface for users to access old versions and to customize the versioning policies. VersionFS still incurs a non-negligible performance overhead of about 100% when measured by the Postmark benchmark.

Clotho [17] is a versioning system at the disk block level. Compared with file system versioning, block-level versioning is less complex due to its simpler interface. However, it is more difficult for users to directly manage versions of disk blocks. Usually another layer on top of block-level versioning is required to provide easy version access. Clotho aggregates updates that take place within a period of time (e.g., a minute) and creates new versions for them, and therefore does not support CDP.

5 Conclusion

Continuous data protection is a critical building block for quickly repairing damage to a file system due to malicious attacks or innocent human errors. So far it has not been incorporated into mainstream file servers because of concerns about additional storage requirements, performance overhead, and implementation complexity. Given the dramatic improvements in the per-byte cost of magnetic disk technology, disk cost is no longer an issue. Measurements on a real-world NFS trace show that a $200 200GB disk can easily support a one-month logging window for a large NFS server whose size is 400GB and whose average load is 34 requests/sec [23]. The performance overhead and implementation complexity associated with continuous data protection, however, remain significant barriers to its deployment in practice. This paper describes a user-level continuous data protection architecture that is both efficient and portable, and thus completely eliminates these barriers. We have implemented four variants of this user-level CDP architecture, and compared their latency, throughput, and CPU usage characteristics using standard benchmarks, NFS traces, and synthetic workloads. The main lessons from this implementation effort and performance study are:

• User-level continuous data protection based on the NFS protocol that is portable across multiple operating system platforms is feasible and relatively simple to implement.

• UCDP-A incurs close to zero latency and throughput penalty compared with an unprotected vanilla NFS server, and is thus the best choice for IT environments where performance is the main concern, mirroring the file system image is desirable, and minimum disruption to the primary NFS server is important.

• User-level continuous data protection, when embedded into an NFS server, can have throughput comparable to that of the unprotected NFS server, although it does incur 3∼5 msec of latency penalty when the write request percentage is above 36%. The write request percentage in typical NFS sites, as specified in the SPECsfs benchmark, is less than 12%.

• If portability can be slightly compromised, simple in-kernel optimizations could significantly decrease the CPU overhead due to context switching and memory copying associated with user-level CDP. But they do not produce noticeable latency and throughput improvement when the write request percentage in the input workload is lower than 12%.

• Logging updates at a higher level of abstraction, such as NFS requests and replies, tends to produce a much more compact log than logging at a lower level of abstraction, such as disk accesses and responses, and is also more portable and flexible.

Acknowledgement

This research is supported by NSF awards SCI-0401777, CNS-0410694 and CNS-0435373.

References

[1] The Advanced Maryland Automatic Network Disk Archiver. (http://www.amanda.org/).

[2] Concurrent Versions System. (http://www.cvshome.org/).

[3] Enterprise Rewinder: Product Suite for Continuous Data Protection (CDP). (http://www.xosoft.com/).

[4] RealTime - Near-Instant Recovery to Any Point in Time. (http://www.mendocinosoft.com/).

[5] System File Server Benchmark SPEC SFS97 R1 V3.0. Standard Performance Evaluation Corporation. (http://www.specbench.org/sfs97r1/).

[6] NFS: Network file system protocol specification. Sun Microsystems, Mar 1989.

[7] A. Chervenak, V. Vellanki, and Z. Kurmas. Protecting file systems: A survey of backup techniques. In Proceedings Joint NASA and IEEE Mass Storage Conference, March 1998.

[8] Brian Cornell, Peter A. Dinda, and Fabián E. Bustamante. Wayback: A user-level versioning file system for linux. In USENIX 2004 Annual Technical Conference (Freenix).

[9] D. Ellard, J. Ledlie, P. Malkani, and M. Seltzer. Passive nfs tracing of email and research workloads. In 2nd USENIX Conference on File and Storage Technologies, Mar 2003.

[10] B. Becker et al. An asymptotically optimal multiversion b-tree. Very Large Data Bases Journal, 1996.

[11] C.A.N. Soules et al. Metadata efficiency in a comprehensive versioning file system. In 2nd USENIX Conference on File and Storage Technologies, Mar 2003.

[12] D. Hitz et al. File system design for an nfs file server appliance. In USENIX winter 1994 conference, pages 235–246, Chateau Lake Louise, Banff, Canada, 1994.

[13] D. S. Santry et al. Deciding when to forget in the elephant file system. In Proceedings of the Seventeenth ACM Symposium on Operating Systems Principles, pages 110–123, December 12-15, 1999.

[14] G. W. Dunlap et al. Revirt: Enabling intrusion analysis through virtual-machine logging and replay. In Proceedings of 5th Symposium on Operating Systems Design and Implementation, Dec 2002.

[15] Hugo Patterson et al. Snapmirror: file system based asynchronous mirroring for disaster recovery. In Conference on File and Storage Technologies, pages 28–30, Monterey, CA, January 2002.

[16] J. Strunk et al. Self-securing storage: Protecting data in compromised systems. In Proceedings of the 2000 OSDI Conference, October 2000.

[17] Michail D. Flouris and Angelos Bilas. Clotho: Transparent data versioning at the block i/o level. In 21st IEEE Conference on Mass Storage Systems and Technologies, April 2004.

[18] D. Mazières. A toolkit for user-level file systems. In Proceedings of the 2001 USENIX Technical Conference, pages 261–274, June 2001.

[19] K. Muniswamy-Reddy, C. P. Wright, A. Himmer, and E. Zadok. A Versatile and User-Oriented Versioning File System. In Proceedings of the Third USENIX Conference on File and Storage Technologies (FAST 2004), pages 115–128, San Francisco, CA, March/April 2004.

[20] M. Rosenblum and J. K. Ousterhout. The design and implementation of a log-structured file system. In ACM Transactions on Computer Systems, 1991.

[21] Michael Rowan. Continuous data protection: A technical overview. In http://www.revivio.com/index.asp?p=tech white papers, 2004.

[22] E. Zadok and J. Nieh. FiST: A Language for Stackable File Systems. In Proceedings of the Annual USENIX Technical Conference, pages 55–70, June 2000.

[23] N. Zhu and T. Chiueh. Design, implementation, and evaluation of repairable file service. In The International Conference on Dependable Systems and