Related Work of Data De-duplication - Efficient Metadata Update Techniques for Storage Systems

Venti [151] pioneered content-addressable storage (CAS): it computes the fingerprint (i.e., the SHA-1 hash value) of a data block and uses that fingerprint, instead of a logical block number, to address the block. Data blocks are de-duplicated because blocks with the same content have the same fingerprint. However, de-duplication is not Venti's focus; more effort is spent on ensuring the write-once-read-many property. Concretely, Venti does not address two performance problems associated with fingerprint-based de-duplication. Firstly, access locality is lost because adjacent data blocks have very different fingerprint values. Secondly, in a large-scale storage system the fingerprint index cannot fit into main memory, so looking up the fingerprint index during de-duplication incurs extra disk I/Os and a significant performance overhead [27, 28].
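The addressing idea can be illustrated with a minimal sketch. This is not Venti's implementation; the class name ContentAddressableStore and the in-memory dictionary that stands in for the disk are assumptions made for illustration only.

```python
import hashlib


class ContentAddressableStore:
    """Toy in-memory block store addressed by SHA-1 fingerprint (illustrative only)."""

    def __init__(self):
        self.blocks = {}  # fingerprint (hex string) -> block bytes

    def put(self, block: bytes) -> str:
        fp = hashlib.sha1(block).hexdigest()
        # Identical content maps to the same key, so a duplicate is stored once.
        self.blocks.setdefault(fp, block)
        return fp

    def get(self, fp: str) -> bytes:
        return self.blocks[fp]


store = ContentAddressableStore()
a = store.put(b"block A")
b = store.put(b"block A")  # same content, hence same address
c = store.put(b"block B")
```

Note that the two addresses returned for b"block A" and b"block B" are unrelated hash values, which is exactly the loss of access locality described above.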

Based on whether de-duplication sits on the critical I/O path, de-duplication techniques can be categorized into two camps: inline de-duplication (referred to as online de-duplication in the rest of this section) and out-of-line de-duplication. In inline de-duplication [27, 28], each incoming write is checked for duplicates before it reaches the disk. If a write payload with the same content has been stored previously, the incoming write does not need to be written to disk. Otherwise, the new write payload is written to disk. In contrast, out-of-line de-duplication techniques do not de-duplicate write payloads on the fly. Instead, each data payload is first written to disk, and a background procedure later checks the newly written payloads for duplicates.
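The difference between the two camps can be sketched as follows. The classes InlineStore and OutOfLineStore, their fields, and the in-memory "disk" are hypothetical and only show where the duplicate check happens relative to the write.

```python
import hashlib


def fp(payload: bytes) -> str:
    return hashlib.sha1(payload).hexdigest()


class InlineStore:
    """Checks every write for duplicates before it reaches the (simulated) disk."""

    def __init__(self):
        self.disk = {}        # fingerprint -> payload
        self.disk_writes = 0

    def write(self, payload: bytes) -> None:
        key = fp(payload)
        if key not in self.disk:      # duplicate check on the critical write path
            self.disk[key] = payload
            self.disk_writes += 1


class OutOfLineStore:
    """Writes every payload first and removes duplicates in a background pass."""

    def __init__(self):
        self.disk = []        # payloads in arrival order, duplicates included
        self.disk_writes = 0

    def write(self, payload: bytes) -> None:
        self.disk.append(payload)     # no check on the critical write path
        self.disk_writes += 1

    def background_dedup(self) -> int:
        seen, kept = set(), []
        for payload in self.disk:
            key = fp(payload)
            if key not in seen:
                seen.add(key)
                kept.append(payload)
        reclaimed = len(self.disk) - len(kept)
        self.disk = kept
        return reclaimed              # number of duplicate payloads removed


payloads = [b"x", b"y", b"x", b"x"]
inline, offline = InlineStore(), OutOfLineStore()
for p in payloads:
    inline.write(p)
    offline.write(p)
reclaimed = offline.background_dedup()
```

In this sketch the inline store never writes a duplicate, while the out-of-line store temporarily holds all four payloads and reclaims the duplicates later.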

Kai Li et al. [27] proposed an online de-duplication technique for a disk-to-disk (D2D) [22] data backup system. Because the fingerprint index cannot fit into memory and resides on disk, the paper proposed two techniques to minimize the I/O overhead of accessing the fingerprint index: (1) a Bloom filter-based [152] summary vector to avoid unnecessary fingerprint lookups, and (2) a locality-preserving data placement scheme to leverage the spatial locality of the input fingerprints.

As the first optimization technique, the auxiliary Bloom filter covers the fingerprint values of all data blocks and accounts for a non-negligible amount of memory. On the one hand, a miss in the Bloom filter indicates that there is no such fingerprint value in the index structure. On the other hand, a hit in the Bloom filter is not decisive in determining whether the fingerprint value of interest exists in the fingerprint index, so a search of the index structure is still required. The second optimization technique preserves locality by loading and evicting fingerprints at the granularity of a container rather than a contiguous range of fingerprint values, because neighboring fingerprint values do not reflect data locality. A container corresponds to a contiguous range of logical blocks from the input backup stream and contains all metadata related to that range, including the fingerprints and physical locations of these blocks. A cache miss on one fingerprint in a container triggers the loading of all fingerprints in that container, on the prediction that the other fingerprints in the container will be queried in subsequent fingerprint lookups. In most cases, the prediction is correct due to data locality in the input backup stream.
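The interplay of the two techniques can be sketched as follows. This is a simplified illustration, not the scheme of [27]: the class names, the Bloom filter parameters, the dictionaries standing in for the on-disk index and containers, and the absence of cache eviction are all assumptions.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, fp: bytes):
        # Derive one bit position per salt by re-hashing the fingerprint.
        for i in range(self.num_hashes):
            digest = hashlib.sha1(bytes([i]) + fp).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, fp: bytes) -> None:
        for pos in self._positions(fp):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, fp: bytes) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(fp))


class DedupIndex:
    def __init__(self):
        self.summary = BloomFilter()
        self.on_disk_index = {}  # fingerprint -> container id (simulated disk)
        self.containers = {}     # container id -> fingerprints (simulated disk)
        self.cache = {}          # fingerprint -> container id, filled per container
        self.disk_lookups = 0

    def store_container(self, cid, fingerprints):
        self.containers[cid] = list(fingerprints)
        for fp in fingerprints:
            self.on_disk_index[fp] = cid
            self.summary.add(fp)

    def is_duplicate(self, fp: bytes) -> bool:
        if not self.summary.might_contain(fp):
            return False             # definite miss: no index lookup needed
        if fp in self.cache:
            return True              # served from a previously loaded container
        self.disk_lookups += 1       # a Bloom filter hit is not decisive
        cid = self.on_disk_index.get(fp)
        if cid is None:
            return False             # false positive of the Bloom filter
        for other in self.containers[cid]:
            self.cache[other] = cid  # load every fingerprint of the container
        return True


fps = [hashlib.sha1(b"block %d" % i).digest() for i in range(8)]
index = DedupIndex()
index.store_container(0, fps)
results = [index.is_duplicate(fp) for fp in fps]
lookups_after_stream = index.disk_lookups
unseen = index.is_duplicate(hashlib.sha1(b"never stored").digest())
```

In the usage lines, only the first of the eight queries goes to the simulated on-disk index; the remaining seven are answered from the container that the first miss loaded.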

The sparse indexing scheme [28] uses a sampled fingerprint index instead of a full fingerprint index to further reduce the memory usage of de-duplication in an online de-duplication system for D2D data backup. The insight behind the scheme is that duplicated data blocks tend to occur in consecutive ranges of non-trivial length, so a match of the sampled fingerprint values in a range indicates, with high probability, a match of the whole range. Among all matched ranges, a champion is chosen to de-duplicate against. The sampling ratio can be used to trade off de-duplication quality against memory usage. At one extreme, if all fingerprint values in the range are sampled, the de-duplication is most efficient. At the other extreme, if only one fingerprint value in the range is sampled, the de-duplication algorithm can err in choosing the champion range, and the de-duplication efficiency therefore drops. Although segment matching based on sampled fingerprints works well for the examined workloads, it is not clear how effective it is for other backup workloads, including changed blocks within a file or an email, which our de-duplication techniques focus on.
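A minimal sketch of sampling and champion selection follows. It is not the algorithm of [28]: the sampling rule (fingerprint value modulo SAMPLE_MOD), the choice of a single champion by vote count, and all names are assumptions made for illustration.

```python
import hashlib
from collections import Counter

SAMPLE_MOD = 4  # assumption: roughly one in four fingerprints is kept as a sample


def fingerprint(block: bytes) -> bytes:
    return hashlib.sha1(block).digest()


def is_sample(fp: bytes) -> bool:
    return int.from_bytes(fp[-4:], "big") % SAMPLE_MOD == 0


class SparseIndex:
    def __init__(self):
        self.sample_to_segments = {}  # in memory: sampled fingerprint -> segment ids
        self.segments = {}            # simulated disk: segment id -> all fingerprints

    def add_segment(self, seg_id, fps):
        self.segments[seg_id] = set(fps)
        for fp in fps:
            if is_sample(fp):
                self.sample_to_segments.setdefault(fp, set()).add(seg_id)

    def choose_champion(self, fps):
        # The stored segment sharing the most sampled fingerprints wins.
        votes = Counter()
        for fp in fps:
            if is_sample(fp):
                for seg_id in self.sample_to_segments.get(fp, ()):
                    votes[seg_id] += 1
        return votes.most_common(1)[0][0] if votes else None

    def deduplicate(self, fps):
        # Only the champion's full fingerprint set is consulted.
        known = self.segments.get(self.choose_champion(fps), set())
        return [fp for fp in fps if fp not in known]  # fingerprints to store as new


fps_a = [fingerprint(b"A-%d" % i) for i in range(64)]
fps_b = [fingerprint(b"B-%d" % i) for i in range(64)]
index = SparseIndex()
index.add_segment("A", fps_a)
index.add_segment("B", fps_b)

incoming = fps_a + [fingerprint(b"new-%d" % i) for i in range(4)]
champion = index.choose_champion(incoming)
new_fps = index.deduplicate(incoming)
```

Raising SAMPLE_MOD shrinks the in-memory map but leaves fewer samples per range to vote with, which is the trade-off between memory usage and de-duplication quality described above.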

Many live storage systems [153–155] opt for out-of-line de-duplication because the performance overhead associated with online de-duplication is not acceptable for these systems. In these systems, the de-duplication functionality is taken out of the critical data write path and scheduled in the background. In particular, data de-duplication improves the storage efficiency of distributed file systems [153–156] because the hosts in a cluster tend to store similar files or operating systems. For data de-duplication in distributed systems, the performance of metadata lookup and maintenance is not a serious concern because the metadata can be distributed across all participating hosts.

HYDRAstor [157] is a content-addressable, distributed, near-line storage system designed with fault tolerance in mind. Backup images are chunked into segments, and each segment is routed to a peer machine based on its fingerprint. Because each peer machine has the full hash key information for all segments stored on it, the peer machine can de-duplicate the segment in an online fashion. Unlike in Venti, segments are stored contiguously on commodity storage devices to preserve data locality, so that both reads and writes of segments are I/O-efficient. HYDRAstor employs the mark-and-sweep garbage collection technique to reclaim physical blocks. Because the typical storage capacity of a peer machine in HYDRAstor is on the scale of ~10 TB, the counters of all physical blocks can still fit into memory and the marking process can be very fast. As the per-peer storage capacity grows, however, mark-and-sweep garbage collection does not scale, because the counters of all physical blocks can no longer fit into memory and I/O becomes the performance bottleneck.
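The mark-and-sweep idea with in-memory per-block counters can be sketched as follows. This is a generic illustration, not HYDRAstor's algorithm; the functions mark and sweep, the backup names, and the dictionaries standing in for backup metadata and the block store are hypothetical.

```python
from collections import Counter


def mark(live_backups):
    """Mark phase: count references to each block fingerprint from live backups."""
    counters = Counter()
    for block_fps in live_backups.values():
        counters.update(block_fps)
    return counters


def sweep(block_store, counters):
    """Sweep phase: reclaim every stored block that no live backup references."""
    reclaimed = [fp for fp in block_store if counters[fp] == 0]
    for fp in reclaimed:
        del block_store[fp]
    return reclaimed


block_store = {"f1": b"a", "f2": b"b", "f3": b"c"}
live_backups = {"monday": ["f1", "f2"], "tuesday": ["f2"]}

counters = mark(live_backups)
reclaimed = sweep(block_store, counters)
```

The counters table holds one entry per referenced block, which is why the approach stays fast only while that table fits in memory.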

However, HYDRAstor showcases the complicated practical considerations involved in designing a garbage collection and de-duplication scheme that is robust against failures when machine failure is the norm in a distributed environment. For example, during garbage collection, the marking of deleted blocks needs to survive machine failures so that future writes do not treat the to-be-deleted blocks as stored duplicates. Their solution to this problem is to make the system read-only first, mark all to-be-deleted blocks in one shot, and then make the system read-write again. For our de-duplication system, we can learn from how they deal with each corner case in handling failures.

Foundation [158] leverages commodity USB external hard drives to archive digital files in a fashion similar to Venti. Unlike Venti but similar to the Data Domain scheme, a 16 MB segment is stored contiguously to preserve locality for sequential reads and fingerprint caching. Each fresh write incurs up to three disk accesses: (1) the lookup of the fingerprint, (2) the appending of the data payload to the end of the data log, and (3) the update of the on-disk fingerprint index. A Bloom filter is employed to filter out unnecessary lookups in (1). For (3), similar to BOSC, a buffer is employed to accumulate fingerprint updates, and a single sequential scan is used to commit the updates when the buffer is full, which shows the effectiveness of BOSC in de-duplication. However, the fingerprint updates are not logged to disk and their durability is not ensured, which can cause problems when the fingerprint update buffer is not filled for a long time.
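The three steps of a fresh write and the buffered index update can be sketched as follows. This is not Foundation's code: the class ArchiveStore, the buffer_limit parameter, and the dictionaries standing in for the on-disk index are assumptions, and the Bloom filter in front of step (1) is omitted for brevity.

```python
import hashlib


class ArchiveStore:
    def __init__(self, buffer_limit=4):
        self.data_log = bytearray()  # append-only data log
        self.disk_index = {}         # stands in for the on-disk fingerprint index
        self.update_buffer = {}      # pending index updates, held in memory only
        self.buffer_limit = buffer_limit
        self.commits = 0

    def write(self, block: bytes) -> int:
        fp = hashlib.sha1(block).digest()
        # (1) Look up the fingerprint, consulting pending updates first.
        offset = self.update_buffer.get(fp, self.disk_index.get(fp))
        if offset is not None:
            return offset            # duplicate: nothing is appended
        # (2) Append the payload to the end of the data log.
        offset = len(self.data_log)
        self.data_log += block
        # (3) Buffer the index update instead of writing it immediately.
        self.update_buffer[fp] = offset
        if len(self.update_buffer) >= self.buffer_limit:
            self.commit()
        return offset

    def commit(self) -> None:
        # Apply buffered updates in fingerprint order, i.e. one sequential pass.
        for fp in sorted(self.update_buffer):
            self.disk_index[fp] = self.update_buffer[fp]
        self.update_buffer.clear()
        self.commits += 1


store = ArchiveStore(buffer_limit=2)
first = store.write(b"aaaa")
again = store.write(b"aaaa")   # duplicate of the first block
second = store.write(b"bbbb")  # fills the buffer and triggers a commit
```

Until commit runs, the pending updates exist only in update_buffer, which mirrors the durability gap noted above.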

Online de-duplication techniques in backup storage systems (i.e., D2D backup systems) and out-of-line de-duplication techniques in live storage systems differ in four aspects. Firstly, online de-duplication saves disk space up front because data de-duplication is done on the fly. In contrast, out-of-line de-duplication first stores duplicates and de-duplicates the data in the background. Secondly, online de-duplication incurs performance overhead on the critical data path, while out-of-line de-duplication offloads the performance overhead to the background. Thirdly, online de-duplication in backup storage systems is conducted periodically, while out-of-line de-duplication techniques need to deal with duplicates produced by dynamic changes. Fourthly, out-of-line de-duplication techniques employ the file as the basic unit of data de-duplication, while online de-duplication techniques in backup storage systems use block streams as the basic unit of data de-duplication.

Although different in many aspects, out-of-line and online de-duplication each fit well in their respective arenas. For example, because the primary concern of a live storage system is to provide low-latency access to live data, and temporary space utilization is not the top concern, deferring the data de-duplication strikes a reasonable balance between performance overhead and space utilization. Conversely, backup systems focus more on backup throughput than on the latency of individual backup requests, and storage utilization is the first priority. Therefore, online de-duplication is preferable for backup systems.

Our proposed de-duplication technique differs from those of the Data Domain paper [27] and the sparse indexing paper [28] in three aspects. Firstly, the input backup stream consists only of the blocks changed since the last backup. Secondly, spatial locality is captured based on the file. Thirdly, the sampling rate of fingerprint selection is varied based on the de-duplication history to capture the temporal locality of the input backup stream. The first difference requires our technique to squeeze out the duplicates by fully exploiting their spatial and temporal locality.

In [159], de-duplication is based on files. Each file has a representative fingerprint. Entries in the fingerprint index are distributed to K nodes one by one based on a modulo operation or another distributed hash table function. The whole container corresponding to a fingerprint index entry is placed on the same node as that entry. If two fingerprint index entries happen to have the same container but are distributed to two different nodes, the container is duplicated on both nodes. When a file is backed up, only the representative fingerprint is used to route the file to a node; the other fingerprints of the file are not used for routing. In contrast, in our proposed technique, all fingerprints of a segment are used to query the fingerprint index on the corresponding node. More importantly, containers are also distributed to all K nodes using the hash of their fingerprints, which eliminates the duplication of containers.
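The two routing policies can be contrasted with a minimal sketch. It is an illustration under stated assumptions, not the scheme of [159] or our implementation: K = 4, the modulo placement in node_for, and the choice of the smallest fingerprint as the representative are all assumed here.

```python
import hashlib

K = 4  # assumed number of nodes


def fingerprint(block: bytes) -> bytes:
    return hashlib.sha1(block).digest()


def node_for(fp: bytes) -> int:
    # Modulo placement; any distributed hash table function could be used instead.
    return int.from_bytes(fp, "big") % K


def route_by_representative(block_fps):
    # Assumption: the smallest fingerprint serves as the representative.
    representative = min(block_fps)
    return {node_for(representative): list(block_fps)}


def route_by_fingerprint(block_fps):
    # Every fingerprint is sent to the node that owns its index entry.
    placement = {}
    for fp in block_fps:
        placement.setdefault(node_for(fp), []).append(fp)
    return placement


fps = [fingerprint(b"chunk %d" % i) for i in range(16)]
by_representative = route_by_representative(fps)
by_fingerprint = route_by_fingerprint(fps)
```

Under the first policy the whole file lands on a single node chosen by one fingerprint; under the second, each fingerprint has exactly one home node, so the same block can never be indexed on two nodes.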

In [159], each file has a whole-file fingerprint, which means that a match of the whole-file fingerprint indicates that the whole file is a duplicate. In our proposed de-duplication technique, each segment has a whole-segment fingerprint, which is more flexible. However, to save RAM space, it is desirable to use file-level information, including the file identifier and file offsets, to represent the segment instead of individual physical block addresses.

Chapter 3 Batching mOdification and Sequential Commit (BOSC)
