Scalable Garbage Collection - Efficient Implementation Techniques for Block-Level Cloud Storage

In a data backup system like DISCO that supports data deduplication, a physical block may be referenced by multiple backup snapshots. Because a backup snapshot typically has a finite retention period, the number of references to a physical block varies over time. When a physical block is no longer referenced by any backup snapshot, it should be reclaimed and reused.

There are two general approaches to identifying physical blocks in a data backup system that are no longer needed. The first approach is global mark and sweep, which freezes all active backup snapshot representations, scans each of them, marks the physical blocks that these snapshots reference, and finally sweeps all the physical blocks in the entire storage system to garbage collect those that were not marked in the mark phase. The second approach is local metadata bookkeeping, which maintains certain metadata for each physical block and locally updates a physical block’s metadata whenever it is referenced by a new backup snapshot or de-referenced by an expired backup snapshot. The first approach does not incur any run-time performance overhead but may require an extended pause time that is proportional to the storage system size, and thus is not appropriate for petabyte-scale data backup systems. A number of commercial products have adopted modified mark-and-sweep approaches that greatly reduce the pause time and seem to be reasonably effective. However, such approaches are still batch-oriented rather than incremental, as our algorithm is. Consequently, Sungem takes the second approach, which incurs run-time performance overhead due to metadata bookkeeping. How to minimize this metadata bookkeeping overhead is an important design consideration of Sungem’s garbage collection (GC) algorithm. A detailed comparison of these approaches is given in Section 2.3.

DISCO maintains a backup snapshot table for every VD that it needs to back up. The backup snapshot table consists of several columns: the first column represents the logical addresses of the blocks in the VD, and each subsequent column represents the corresponding physical addresses of the blocks in one of that VD’s successive snapshots. Since a large majority of the blocks remain unmodified between successive snapshots of a VD, the table is optimized to record entries only for those logical blocks that are modified in a particular snapshot with respect to the previous snapshot of the given VD. Effectively, we can assume that every incremental backup snapshot is represented by a logical-to-physical (L2P) map, which maps logical addresses in the incremental backup snapshot to their corresponding physical addresses. For a full backup snapshot, the L2P map contains mappings for all the actively referenced blocks in the VD. In addition, the garbage collector maintains a physical block array (P-array) that holds metadata for each physical block in the entire SDDS system.
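As a rough illustration of the incremental L2P representation described above, the following Python sketch stores one map per snapshot and records only the blocks modified in that snapshot. The `resolve` helper, the walk-back lookup rule, and the sample block numbers are assumptions made for illustration; the text does not specify how DISCO resolves an unmodified block.

```python
from typing import Optional

# Each snapshot is an L2P map: logical block number -> physical block number.
# Assumption: snapshots[0] is a full backup and later ones are incremental.
Snapshot = dict


def resolve(snapshots: list, index: int, lbn: int) -> Optional[int]:
    """Return the physical block that `lbn` maps to in snapshot `index`.

    Walks back from the requested snapshot to the most recent snapshot
    that recorded an entry for the logical block.
    """
    for snap in reversed(snapshots[: index + 1]):
        if lbn in snap:
            return snap[lbn]
    return None  # the logical block was never written


snapshots = [
    {0: 100, 1: 101, 2: 102},  # full backup: every referenced block
    {1: 205},                  # incremental: only logical block 1 changed
    {2: 310},                  # incremental: only logical block 2 changed
]
```

With this layout the storage cost of a snapshot is proportional to the number of modified blocks, at the price of a lookup that may touch several earlier snapshots.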

3.3.1 Hybrid GC: Our Approach

The main weakness of the reference count-based [70–72] and expiration time-based [73] garbage collection schemes is that their performance overhead at backup time is proportional to the full size of the VD being backed up, rather than the size of the backup snapshot, which corresponds to the changes to the VD. The performance overhead is much smaller if incremental backups are taken. We propose a hybrid garbage collection algorithm specifically targeted at incremental backup systems. It maintains both a reference count and an expiration time for each physical block, and its performance overhead at backup time is proportional to the size of an incremental backup snapshot rather than the size of the snapshot’s underlying VD.

Figure 3.2: Metadata updates in garbage collection at backup time. (Diagram labels: incoming fingerprints from Sungem to be updated in the GC database; BOSC logging; fast logging disk; GC Thread 1; GC Thread 2; Update Bucket N; Update Bucket K; metadata P-array; GC Disk 1; GC Disk 2.)

An incremental backup snapshot consists of a set of entries each of which corresponds to a logical block that has been modified since the last backup. Each incremental backup snapshot entry thus consists of a logical block number (LBN), a before image physical block number (BPBN) that points to the physical block to which the logical block LBN was mapped in the last backup, and a current image physical block number (CPBN) that points to the physical block to which the logical block LBN is currently mapped. At backup time, given an entry ⟨LBN, BPBN, CPBN⟩, the reference count of BPBN is decremented, the reference count of CPBN is incremented, and the expiration time of BPBN is set to the maximum of its current value and the current time plus the retention period of the VD being backed up. With this design, when the reference count of BPBN reaches zero, the physical block BPBN together with its expected expiration time is put into a recycle list. At garbage collection time, the physical blocks in the recycle list are scanned and those whose expiration time is less than the current time are garbage blocks.
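The backup-time update rule and the recycle-list scan described above can be sketched in Python as follows. This is a minimal in-memory model under stated assumptions, not the Sungem implementation: the class and method names are hypothetical, the P-array is a plain dictionary, a BPBN of `None` is assumed to mark a logical block written for the first time, and a block that is referenced again is assumed to leave the recycle list.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class BlockMeta:
    ref_count: int = 0        # number of current images pointing at the block
    expiration: float = 0.0   # time after which no backup snapshot needs it


@dataclass
class HybridGC:
    p_array: dict = field(default_factory=dict)      # PBN -> BlockMeta
    recycle_list: set = field(default_factory=set)   # PBNs with ref_count == 0

    def _meta(self, pbn: int) -> BlockMeta:
        return self.p_array.setdefault(pbn, BlockMeta())

    def apply_entry(self, bpbn: Optional[int], cpbn: int,
                    now: float, retention: float) -> None:
        """Process one <LBN, BPBN, CPBN> entry (the LBN itself is not needed)."""
        if bpbn is not None:  # assumption: None marks a first-time write
            before = self._meta(bpbn)
            before.ref_count -= 1
            before.expiration = max(before.expiration, now + retention)
            if before.ref_count == 0:
                self.recycle_list.add(bpbn)
        current = self._meta(cpbn)
        current.ref_count += 1
        # Assumption: a block that is referenced again leaves the recycle list.
        self.recycle_list.discard(cpbn)

    def collect(self, now: float) -> list:
        """Return recycled blocks whose expiration time has passed."""
        garbage = sorted(pbn for pbn in self.recycle_list
                         if self.p_array[pbn].expiration < now)
        for pbn in garbage:
            self.recycle_list.discard(pbn)
            del self.p_array[pbn]
        return garbage


gc = HybridGC()
gc.apply_entry(None, 10, now=0, retention=30)   # block 10 becomes current
gc.apply_entry(10, 11, now=5, retention=30)     # block 10 replaced by block 11
```

Each entry touches at most two P-array records, so the work at backup time grows with the number of entries in the incremental snapshot rather than with the size of the VD.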

In general, a VD has a current image and multiple backup snapshots. The reference count of a physical block in this algorithm keeps track of the number of current images, but not their associated backup snapshots, that currently point to it. The expiration time of a physical block records the time after which no backup snapshot will reference it. If there is at least one current image pointing to a physical block, this physical block cannot be a garbage block and its expiration time could be ignored. Whenever a logical block in a current image is modified, the current image no longer points to the physical block associated with the logical block before the modification, and the expiration time of this before-image physical block is updated to incorporate the retention time requirement of the current image’s associated VD.

3.3.2 Batched Updates to P-Array

Implementing the hybrid GC algorithm poses two major challenges. First, its accesses to the P-array, which is too large to fit into main memory, are largely random and therefore could incur significant disk I/O overhead. Second, after a physical block is chosen to be recycled, the physical block’s fingerprint needs to be removed from the rest of the deduplication engine, including the SFI and the container holding the fingerprint. This fingerprint-removing overhead is a significant part of the garbage collection process regardless of the actual algorithm used to determine which physical blocks are recyclable.

To address the first problem, Sungem adopts the BOSC (Batched mOdifications with Sequential Commit) mechanism [74] (explained in detail in Chapter 5) to modify the on-disk P-array. More specifically, Sungem partitions the P-array into a set of chunks, and allocates a per-chunk queue for each such chunk. Each update to the P-array is put into the per-chunk queue associated with the P-array entry to be updated. Multiple background threads are used to sequentially scan the on-disk P-array, by fetching to memory each chunk whose per-chunk queue is non-empty, committing all updates in the chunk’s per-chunk queue to the chunk, and writing the chunk back to disk. Using BOSC, Sungem requires mostly sequential disk accesses to update the P-array. Figure 3.2 gives an overview of the garbage collection setup.
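The per-chunk queueing idea above can be illustrated with a small single-threaded Python sketch. It is a simplification under stated assumptions: the on-disk P-array is simulated by in-memory lists, each entry holds only a reference count, there is no logging or concurrency, and the class name, method names, and `chunk_reads` counter are hypothetical.

```python
from collections import defaultdict


class BatchedPArray:
    def __init__(self, num_entries: int, chunk_size: int):
        self.chunk_size = chunk_size
        num_chunks = -(-num_entries // chunk_size)  # ceiling division
        # Stand-in for the on-disk P-array: one list of entries per chunk.
        self.disk = [[0] * chunk_size for _ in range(num_chunks)]
        # Per-chunk queues of pending (offset, delta) updates.
        self.queues = defaultdict(list)
        self.chunk_reads = 0

    def update(self, pbn: int, delta: int) -> None:
        """Queue a reference-count change instead of touching the disk."""
        chunk, offset = divmod(pbn, self.chunk_size)
        self.queues[chunk].append((offset, delta))

    def commit(self) -> None:
        """Scan chunks in address order, applying each non-empty queue."""
        for chunk in sorted(self.queues):
            entries = list(self.disk[chunk])   # fetch the chunk into memory
            self.chunk_reads += 1
            for offset, delta in self.queues[chunk]:
                entries[offset] += delta
            self.disk[chunk] = entries         # write the chunk back
        self.queues.clear()


p = BatchedPArray(num_entries=8, chunk_size=4)
for pbn, delta in [(5, 1), (0, 1), (6, 1), (5, -1), (1, 2)]:
    p.update(pbn, delta)
p.commit()
```

In this run, five updates issued in random order are committed with two chunk reads performed in ascending chunk order, which is the access pattern the batching is meant to produce.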

To address the second problem, Sungem uses a lazy update approach to removing a recycled block’s fingerprint from the deduplication engine. If the container holding the recycled block’s fingerprint is memory-resident, Sungem deletes the fingerprint from it immediately; otherwise Sungem marks the container as stale in a stale-container list, and queues the fingerprint to be deleted in the corresponding stale container entry in the stale-container list. When a container is brought into memory because of some HIT fingerprint processing, the stale-container list is first searched, and if the container is found there, the fingerprints in its queue are permanently deleted from that container. However, Sungem deletes the recycled block’s fingerprint from the SFI immediately, because the SFI is always memory-resident. When a chunk of the P-array is brought in, the garbage collector scans the entries in the chunk to identify those whose reference count is zero and whose expiration time has passed, and puts them in the free list. This mechanism piggy-backs garbage collection on P-array accesses and thus keeps the garbage collection overhead to a minimum.
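The lazy fingerprint removal described above can be sketched as follows. This is an illustrative model only: containers and the SFI are represented as in-memory Python sets, and the class, method, and attribute names are hypothetical rather than taken from Sungem.

```python
class LazyFingerprintRemover:
    def __init__(self, containers: dict):
        # Stand-in for on-disk containers: container id -> set of fingerprints.
        self.on_disk = {cid: set(fps) for cid, fps in containers.items()}
        self.in_memory = {}   # containers currently memory-resident
        # Assumption: the SFI is modeled as a flat in-memory fingerprint set.
        self.sfi = {fp for fps in containers.values() for fp in fps}
        self.stale = {}       # stale-container list: id -> pending deletions

    def remove(self, cid: int, fp: str) -> None:
        """Remove a recycled block's fingerprint, lazily if needed."""
        self.sfi.discard(fp)                  # SFI is always memory-resident
        if cid in self.in_memory:
            self.in_memory[cid].discard(fp)   # delete immediately
        else:
            self.stale.setdefault(cid, set()).add(fp)  # defer the deletion

    def load(self, cid: int) -> set:
        """Bring a container into memory, applying any deferred deletions."""
        container = set(self.on_disk[cid])
        pending = self.stale.pop(cid, None)
        if pending:
            container -= pending
        self.in_memory[cid] = container
        return container


r = LazyFingerprintRemover({1: {"a", "b"}, 2: {"c"}})
r.load(2)
r.remove(2, "c")   # container 2 is resident: deleted right away
r.remove(1, "a")   # container 1 is on disk: queued in the stale list
```

The deferred deletion costs no extra disk read, because it is applied only when the container is loaded for some other reason.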

3.4 Parallelization Techniques for Deduplica-
