To simplify the comparison among various garbage collection (GC) algorithms, let us use a reference backup system with the following configuration and assumptions. The reference backup system has 1PB worth of physical blocks with an 8KB block size and supports four 32TB VDs, each of which is fully utilized. Assume one backup is taken for each VD every day and each backup snapshot is kept for 32 days and then discarded. Also assume that every block is accessed 64 times before it is evicted. Although this last assumption is somewhat arbitrary, we adopt it only to compare quantitatively the relative performance of the GC algorithms considered below. In order to represent an incremental backup snapshot, the backup system maintains a logical to physical translation (L2P) map, where the logical addresses correspond to the logical disk blocks in the VD and the physical addresses correspond to the target locations of those disk blocks on the respective storage devices. In addition, the garbage collector maintains a physical block array (P-array) that holds metadata for each physical block in the entire storage system. With four VDs and 32 retained snapshots per VD, there are 128 backup snapshots in this system at any point in time, and the change between consecutive incremental backup snapshots of a VD is assumed to be 5% of the VD’s size.
| GC Scheme      | Lookup cost |
|----------------|-------------|
| Mark and Sweep | 512 billion |
| Ref Count      | 16 billion  |
| Expiry Time    | 8 billion   |
| Hybrid RC/ET   | 0.4 billion |

Table 2.1: Comparison of the lookup cost overheads for four GC algorithms using a reference data backup system whose detailed configuration is described in the text.
2.3.1 Mark and Sweep
A naive mark and sweep approach [68] scans the L2P maps of all active backup snapshots for all VDs and marks those physical blocks that are actively referenced. Upon completion of the mark phase, the sweep phase begins, in which all non-marked blocks are garbage collected. For the reference backup system, one needs to look up $128 \times \frac{32\,\text{TB}}{8\,\text{KB}} = 512$ billion L2P map entries in a largely sequential fashion. The P-array needs to accommodate $\frac{1\,\text{PB}}{8\,\text{KB}} = 128$ billion entries, where each entry is a 1-bit flag that marks whether the corresponding physical block is referenced. The total storage space required by the P-array is therefore $\frac{128 \text{ billion}}{8} = 16$ GB. Main memory cannot accommodate both of these structures in full, most notably the L2P maps, and hence a large majority of the entries have to be stored on disk. Such a naive mark and sweep GC algorithm has the following major drawbacks:
1. Even though the accesses to both the L2P map and P-array are largely sequential, both structures are stored mostly on disk, so the large number of disk access requests results in very low overall throughput of the GC process.
2. For the overall duration of the mark and sweep phases, the entire VD has to be frozen, or else a block referenced during an ongoing mark phase could potentially miss being captured and the sweep phase could garbage collect such an active block, leading to data corruption.
Drawback 1 results in large delays, while drawback 2 leads to long pause times; either one hurts the overall GC performance. Hence the naive mark and sweep approach is impractical for a large-scale storage system.
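The naive scheme described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the implementation of any system cited here: the names `mark_and_sweep`, `l2p_maps`, and `num_physical_blocks` are hypothetical, each L2P map is modelled as an in-memory dict from logical to physical block number, and a `bytearray` bitmap stands in for the 1-bit-per-block P-array.

```python
def mark_and_sweep(l2p_maps, num_physical_blocks):
    """Return (garbage_blocks, lookups) for a toy mark and sweep pass.

    l2p_maps: iterable of dicts {logical_block: physical_block}, one per
              active backup snapshot (hypothetical in-memory stand-in).
    """
    # P-array as a bitmap: one bit per physical block.
    marks = bytearray((num_physical_blocks + 7) // 8)

    # Mark phase: visit every L2P entry of every active snapshot.
    lookups = 0
    for l2p in l2p_maps:
        for physical in l2p.values():
            marks[physical >> 3] |= 1 << (physical & 7)
            lookups += 1

    # Sweep phase: every block whose bit is still clear is garbage.
    garbage = [
        block
        for block in range(num_physical_blocks)
        if not marks[block >> 3] & (1 << (block & 7))
    ]
    return garbage, lookups


snap_a = {0: 0, 1: 1, 2: 2}
snap_b = {0: 0, 1: 5, 2: 2}  # logical block 1 was rewritten to physical block 5
garbage, lookups = mark_and_sweep([snap_a, snap_b], num_physical_blocks=8)
print(garbage, lookups)  # [3, 4, 6, 7] 6
```

Note that the number of lookups equals the total number of L2P entries across all active snapshots, regardless of how little changed between them, which is what drives the cost in the reference system.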
HYDRAstor [32] employs a variation of the mark and sweep GC technique in which, instead of freezing all I/O activity on every VD, all the VDs are marked read-only. However, the mark phase can still be prohibitively long if the VDs are dominated by write I/Os. Fanglu et al. [69] propose a group mark and sweep (GMS) mechanism, whose key idea is to avoid touching every file in the mark phase and every container in the sweep phase in order to make GC scalable and fast. However, the GMS technique operates at the file-system level to track modified files, and hence groups a set of modified files to perform mark and sweep on selected areas in the storage system.
2.3.2 Reference Count based
The simplest example of the local metadata bookkeeping approach is reference counting [70–72], which maintains a reference count for each physical block to record the number of backup snapshots that point to it. When a backup snapshot of a VD is taken, the reference count of every physical block the snapshot references is incremented. When a backup snapshot of a VD is retired, the reference count of every physical block the snapshot references is decremented. When a physical block’s reference count reaches 0, it is collected and put in the free pool. Assuming each P-array entry keeps a 2-byte reference count, the number of lookups in the P-array is $\frac{32\,\text{TB}}{8\,\text{KB}} \times \frac{1}{64} \times 128 \times 2 = 16$ billion, where the factor of 2 arises because the reference count of every block is updated at both the creation and the deletion of a snapshot, and the factor $\frac{1}{64}$ reflects the assumed degree of reuse of every fetched block. We account for all 128 snapshots because we are comparing with the mark and sweep approach, which can be scheduled to run after aggregating multiple snapshot creation and deletion events. Although the number of lookups is much smaller than in the mark and sweep approach, updating 16 billion entries through random disk I/O accesses will still bottleneck the system.
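The bookkeeping described above can be sketched as follows. This is a minimal in-memory model, not the code of any cited system: the class name `RefCountGC` and its methods are hypothetical, a `Counter` stands in for the on-disk P-array, and each snapshot is modelled as a dict from logical to physical block number.

```python
from collections import Counter


class RefCountGC:
    """Toy reference-count garbage collector (illustrative only)."""

    def __init__(self):
        self.refcount = Counter()  # P-array: physical block -> reference count
        self.free_pool = set()     # blocks whose count has dropped to 0
        self.lookups = 0           # number of P-array entries touched

    def take_snapshot(self, l2p):
        # Increment the count of every block the new snapshot references.
        for physical in l2p.values():
            self.refcount[physical] += 1
            self.lookups += 1

    def retire_snapshot(self, l2p):
        # Decrement the count of every block the retired snapshot referenced.
        for physical in l2p.values():
            self.refcount[physical] -= 1
            self.lookups += 1
            if self.refcount[physical] == 0:
                del self.refcount[physical]
                self.free_pool.add(physical)


rc = RefCountGC()
snap_a = {0: 0, 1: 1, 2: 2}
snap_b = {0: 0, 1: 5, 2: 2}  # logical block 1 was rewritten to physical block 5
rc.take_snapshot(snap_a)
rc.take_snapshot(snap_b)
rc.retire_snapshot(snap_a)
print(sorted(rc.free_pool), rc.lookups)  # [1] 9
```

Every snapshot is walked twice over its lifetime, once at creation and once at retirement, which corresponds to the factor of 2 in the lookup count.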
2.3.3 Expiry Time based
The retention period of a VD is configured at the time when the backup snapshot is created, and since it is known beforehand, it is possible to determine the last moment at which a backup snapshot continues to reference a physical block. Suppose a backup snapshot is created at time T and its retention period is R; then this snapshot will no longer reference any of its physical blocks after T + R. Assume we maintain an expiration time for every physical block, which indicates the time after which the block can be freed. When a backup snapshot of a VD is taken, the expiration time of every physical block the backup snapshot references is set to the larger of the current expiration time and the current time plus the snapshot’s retention period. With this arrangement, no additional actions need to be taken when a backup snapshot of a VD is retired. To reclaim garbage blocks, one scans the P-array, each entry of which in this case maintains a 2-byte expiration time; those physical blocks whose expiration time is less than the current time are garbage blocks.
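A minimal sketch of this update rule and of the reclaiming scan is given below. The names `ExpiryTimeGC`, `take_snapshot`, and `reclaim` are hypothetical, time is measured in whole days purely for illustration, and a dict stands in for the on-disk P-array.

```python
class ExpiryTimeGC:
    """Toy expiration-time garbage collector (illustrative only)."""

    def __init__(self):
        self.expiry = {}  # P-array: physical block -> expiration time (days)

    def take_snapshot(self, l2p, now, retention):
        # Keep the larger of the current expiration time and T + R.
        for physical in l2p.values():
            current = self.expiry.get(physical, 0)
            self.expiry[physical] = max(current, now + retention)

    def reclaim(self, now):
        # Scan the P-array; blocks whose expiration time is less than
        # the current time are garbage.
        expired = sorted(b for b, t in self.expiry.items() if t < now)
        for block in expired:
            del self.expiry[block]
        return expired


et = ExpiryTimeGC()
et.take_snapshot({0: 0, 1: 1, 2: 2}, now=0, retention=32)
et.take_snapshot({0: 0, 1: 5, 2: 2}, now=1, retention=32)
print(et.reclaim(now=32))  # []
print(et.reclaim(now=33))  # [1]
```

No method is needed for retiring a snapshot: block 1, which only the first snapshot referenced, simply becomes reclaimable once its expiration time has passed.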
Unlike reference-count based garbage collectors, expiration-time based garbage collectors [73] cannot immediately reclaim a physical block that is no longer referenced by any logical block; instead, they have to wait until the block’s expiration time has passed. In exchange, a key advantage of the expiration time-based scheme over the reference count-based scheme is that no actions need to be taken at the time when a backup snapshot is retired. An asynchronous scanning process can be scheduled at any time after a snapshot expires to reclaim all the expired blocks. Therefore, for the reference backup system, the total number of lookups in the P-array required to create and retire backup snapshots at the end of each day is $\frac{32\,\text{TB}}{8\,\text{KB}} \times \frac{1}{64} \times 128 = 8$ billion. The factors in this equation are the same as those in the reference counting approach, except that we no longer need to account for any action when a snapshot is retired. Hence the number of lookups in the expiry time based approach is half of that in the reference count approach. However, a limitation of this scheme is that the retention period of a backup snapshot cannot be modified after the snapshot is taken.
2.3.4 Summary of GC comparisons
Table 2.1 shows a detailed comparison among the four GC algorithms discussed in this section. In the first approach, batched GC algorithms such as mark and sweep run periodically, require system pause, touch a fixed amount of metadata in each activation that is independent of the interval time between successive mark and sweep invocations, and incur largely sequential disk accesses for a huge number of P-array and L2P map accesses. In the second approach, incremental GC algorithms such as reference count and expiration time run incrementally, do not require system pause, touch an amount of metadata within a time interval that grows with the interval’s length, and incur largely random disk accesses.
Consequently, we propose a hybrid approach, where Sungem takes the second approach, which incurs run-time performance overhead due to metadata bookkeeping. To minimize this metadata bookkeeping overhead, Sungem adopts the BOSC (Batched mOdifications with Sequential Commit) mechanism [74] (explained in detail in Chapter 5) to modify the on-disk P-array. The main advantage of the proposed hybrid GC algorithm is that the number of P-array entries that the GC needs to modify is proportional to the number of modified blocks in an input snapshot. Therefore, for the reference backup system, the total number of lookups in the P-array required to manage the snapshots in a given day is $\frac{32\,\text{TB}}{8\,\text{KB}} \times \frac{1}{64} \times 128 \times 0.05 = 0.4$ billion. The major factor that brings down the lookup count in this equation is that the algorithm operates on the 5% delta list of changed blocks instead of the complete list of blocks in a snapshot. The total amount of metadata that the proposed hybrid GC algorithm needs to touch is proportional to the amount of block-level change, and hence it can be easily shown that its total metadata update overhead is no worse than that of any known mark and sweep variant. Therefore, the proposed hybrid GC algorithm is the first known GC algorithm that is both incremental, in terms of not requiring system pause, and minimal, in terms of total metadata update overhead.
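The four lookup costs in Table 2.1 can be reproduced with a short calculation. The variable names below are illustrative, and the sketch assumes binary units (1 TB = 2^40 bytes, 1 KB = 2^10 bytes) with "billion" read as 2^30, which is what makes the figures come out as round numbers.

```python
KB, TB = 2**10, 2**40
BILLION = 2**30  # assumption: "billion" is read as 2**30

block_size = 8 * KB
vd_size = 32 * TB
snapshots = 4 * 32   # four VDs, 32 retained daily snapshots each
reuse = 64           # assumed number of accesses per fetched block
change_rate = 0.05   # assumed change between consecutive snapshots

blocks_per_vd = vd_size // block_size  # L2P entries per snapshot

lookup_cost = {
    "Mark and Sweep": snapshots * blocks_per_vd,
    "Ref Count": blocks_per_vd // reuse * snapshots * 2,
    "Expiry Time": blocks_per_vd // reuse * snapshots,
    "Hybrid RC/ET": blocks_per_vd // reuse * snapshots * change_rate,
}

for scheme, cost in lookup_cost.items():
    print(f"{scheme:15s} {cost / BILLION:6.1f} billion")
```

Each incremental scheme differs from the one before it by a single factor: expiry time drops the factor of 2 for retirement, and the hybrid scheme additionally scales by the 5% change rate.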