Parallelization Techniques for Deduplication and Garbage Col-

Historically, the size of a VD has increased from a few kilo bytes to several giga bytes over the last few years and continuing with this trend, the amount of data required to backup a VD is expected to increase, as well. However, the

duration of the backup process cannot increase at the same rate for obvious reasons. Hence the only way a backup operation can process large amounts of data without increasing the duration of the backup process is to increase the backup throughput using efficient parallelization techniques.

The sequential version of Sungem processes each input segment using the following steps:

1. Look up every fingerprint in the SFI,

2. Fetch into memory all containers referenced by all the hit responses from the SFI,

3. Perform fingerprint-by-fingerprint comparison between the input segment and all relevant stored segments in the fetched containers,

4. Identify new stored segments,

5. Put the new stored segments into proper containers, 6. Sample the new stored segment, and

7. Insert the sampled fingerprints into the SFI.

A simple strategy for K-way parallel data deduplication is to partition the fingerprint space into K portions, for example using a modulo K function, and assign each fingerprint in an input segment to one of the K nodes using the same partitioning function, where each node runs the sequential version of Sungem. This strategy is fully parallelized and relatively simple to implement, but it has a major flaw: the fingerprints associated with each data sharing unit are likely to form K stored segments. This means the total number of stored segments is increased by a factor of K, and more seriously the number of container-related disk I/Os required by processing of an input segment is also increased by a factor of K. Because the most likely bottleneck of a data deduplication engine is disk accesses associated with container fetches, a parallelization strategy that significantly increases the number of required disk I/Os is unacceptable.

3.4.1 Distributed Deduplication Algorithm Design

To overcome the limitation of the simple parallelization strategy discussed above, Sungem uses the following parallelization strategy, which requires a master node and K slave nodes. Given an input segment, the master node broadcasts it to all K slave nodes, each of which looks up every fingerprint in

its share (e.g. using some variants of modulo K partitioning function on the physical block address-space) in its local SFI, and returns to the master node those fingerprints that hit in its SFI and the associated hit responses. Due to the modulo K partitioning scheme, a slave node containing a SFI entry for a fingerprint need not hold the container storing that fingerprint. Hence the SFI entry is modified to hold additional information regarding the target slave node address where the container storing the given fingerprint is stored. Lets say the number of slave nodes that are addressed in the positive hit responses are K2, where K2 is independent of K. The master node again broadcasts the accumulated hit responses to the K2 slave nodes, each of which then fetches containers that are referenced in the hit responses and stored in its local disks, performs fingerprint-by-fingerprint comparison, and returns to the master node the hit/miss status of the input fingerprints it processes. The master node completes stored segment processing, and broadcasts to all the K slave nodes informing about the new stored segments and their associated containers. Each of the K slave nodes update their corresponding SFI accordingly.

With reference to the sequence of steps in the sequential version of Sungem described above, in Sungem’s parallelization strategy, all K slave nodes are involved in step 1 and 7, only K2 slave nodes are involved in steps 1, 2, 3, 5 and 7, and only the master node is involved in steps 4 and 6. Though the fingerprints are processed largely in a partitioned fashion, the processing at Steps 1, 2, 3, 5 and 7 are data-driven, i.e., whichever nodes hold the needed stored segment perform the associated computation. This strategy forms a single stored segment for each data sharing unit and stores it in only one of the K2 slave nodes. In order to scale-up the system to match higher demands in either deduplication throughput or the size of the storage system, this parallelization strategy could be further developed using minor changes to incorporate mul- tiple master nodes. Additionally, this parallelization strategy doesn’t increase the disk I/O activity compared to the sequential version, because it uses the same number of container-related disk I/Os for each input segment as the sequential version. However, the parallel version incurs additional inter-node communications cost, and may result in potential load imbalance.

3.4.2 Distributed GC Design

Similar to the K-way parallelization scheme mentioned above, the distributed GC partitions the P-array into equal-sized K portions using the same modulo function that the SFI uses to partition the physical address-space. In the last step of the K-way parallelization scheme, after each of the K slave nodes update their respective SFI, each of those slave nodes also update their local P-array. This parallelized GC operation is similar to the standalone GC al-

gorithm described previously except for the management of stale containers described in Section 3.3.2. When a fingerprint is marked to be recycled, since both the GC and SFI are partitioned on a common key (physical block number), the fingerprint is removed locally in that slave node’s SFI However, the lazy approach scheme in the distributed GC can neither delete the fingerprint directly from the in-memory container structure nor queue the fingerprints in the stale-container list to lazily delete that fingerprint from its container, because the slave node holding the metadata of the given block need not es- sentially hold the container storing the fingerprint of that block. Therefore each slave node aggregates a list of fingerprints to be recycled and forwards such a list to the master node, and waits for acknowledgement from the master node. The master node broadcasts the list to K slave nodes and returns the acknowledgement that it receives from those slave node to the slave node that generated the list. When a slave node receives a request from the master node to delete a list of fingerprints, it can either delete the fingerprint from the in-memory container or queue the fingerprint in the stale-container list to lazily delete it at a later time when the container is brought into memory. However, this could cause a potential race condition which is better explained in the following example scenario. A stored fingerprint F1 is referenced as a duplicate by some incoming fingerprint F2, and immediate to that action the block corresponding to F1 is marked to be recycled on some other slave node. Since the slave nodes aren’t synced, it is quite possible for the GC to recycle the block corresponding to F1 and then get a request to increment the reference count on that block because it was referenced by F2. This could lead to data corruption if the block corresponding to F2 is already recycled presuming F1 to serve as a duplicate to F2. To be theoretically correct in fixing such a race condition, the slave nodes could be synced for every GC operation, but its impractical, because the throughput of the entire deduplication process would then drop down drastically.

A simple alternate solution that the distributed GC adopts, is to maintain a timestamp in each stored fingerprint in a container, that indicates the time at which that stored fingerprint was last referenced as a duplicate. When a slave node receives a request from master node to delete the fingerprint, it would do so only if the corresponding container is present in memory, and the timestamp in the stored fingerprint is larger than a threshold T. Only if both these conditions satisfy, the fingerprint is deleted and recorded as ”deleted” in the acknowledgement, otherwise the fingerprint is marked as ”not deleted” in acknowledgement. The threshold T is chosen large enough to ensure that the fingerprint is successfully recycled in the GC, since it was last referenced as a duplicate to any other fingerprint. When a slave node receives the ac-

knowledgement from the master node indicating the status of whether the fingerprints in the recycle list are deleted or not, the slave node proceeds to recycle only those fingerprints that are successfully acknowledged as ”deleted” and ignores the ”not deleted” fingerprints. The ignored fingerprints if are not referenced as duplicates by any fingerprint, but are still marked as ”not deleted” in the acknowledgement, will eventually be garbage collected when- ever the container holding that fingerprint is fetched into memory.

In document Efficient Implementation Techniques for Block-Level Cloud Storage Systems (Page 80-84)