To examine the tradeoffs between compute and storage cost with the addition of de-duplication, we need a method to estimate the instance type required to execute a given de-duplication algorithm. Since the main factor impacting the computing cost is the memory available in different types of instances, we developed a methodology for predicting the memory requirements for executing a given algorithm on a particular dataset.
Estimation Method
With de-duplication there is no one-size-fits-all configuration. Depending on the application and the resources available, certain algorithms might be more effective than others. One resource consideration is the total memory required to store the index of unique data signatures. Both fixed and variable block exact matching implementations use an in-memory index for data signature lookups to determine duplicate blocks. Some techniques use similarity signatures that increase the chunk size to control the memory index size; the tradeoff is an increased reliance on the speed and responsiveness of the de-duplication store for data comparisons.
Therefore, our focus when utilizing cloud resources is the exact matching techniques where the memory size for the index is a concern.
Providing a means to estimate the index size is important when sizing system requirements for a de-duplication implementation. Memory size estimates for both fixed and variable block algorithms follow the same formula, only varying in how the specific variables are derived. We provide the following formula for the basic memory requirements estimations:
Memory Size Estimates =
(Data Size / Chunk Size) * (1 – De-duplication %) * (Signature Bytes)
To estimate the index memory size for a given dataset, the following variables have to be determined or estimated. Data size is the total size of the dataset targeted for de-duplication. Chunk size is, for a fixed chunking algorithm, the size of the chunk used in the de-duplication implementation. De-duplication percentage is based on the percentage seen during a sample run on a subset of the dataset. In our testing, a sample size of 10% to 15% provided a good sample for the various data types we tested. The de-duplication percentage estimates are on par with similar measurement results in other studies for the given data types [4] [7]. Signature bytes refers to the number of bytes in a chunk's signature hash. Most implementations use a 20-byte SHA-1 hash signature for each chunk because of the collision-resistant properties that SHA-1 provides [6].
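The basic formula can be sketched in a few lines of Python. This is a minimal illustration rather than the tooling used in our tests: the function name and the example inputs (a 100GB dataset, 8KB fixed chunks, 5% de-duplication) are assumptions chosen only to show the arithmetic.

```python
def fixed_block_index_bytes(data_size, chunk_size, dedup_fraction, signature_bytes=20):
    """Estimate index memory in bytes for fixed-block exact matching.

    Implements: (Data Size / Chunk Size) * (1 - De-duplication %) * Signature Bytes.
    dedup_fraction is the de-duplication percentage as a fraction (0.05 for 5%).
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0.0 <= dedup_fraction <= 1.0:
        raise ValueError("dedup_fraction must be between 0 and 1")
    total_chunks = data_size / chunk_size
    unique_chunks = total_chunks * (1.0 - dedup_fraction)
    return unique_chunks * signature_bytes


if __name__ == "__main__":
    # Illustrative inputs only (not measured values).
    estimate = fixed_block_index_bytes(
        data_size=100 * 1024**3,   # 100GB in bytes
        chunk_size=8 * 1024,       # 8KB fixed chunks
        dedup_fraction=0.05,       # 5% de-duplication
    )
    print(f"Estimated index size: {estimate:,.0f} bytes")
```

The 20-byte default corresponds to a SHA-1 signature; a different hash would change only the last factor.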
Variable block index memory estimates are more complex because the chunk sizes are not static but follow a distribution between the minimum and maximum chunk sizes set at execution time. Figure 8 shows the distribution based on the testing performed using the fs-c algorithm [7] on multiple datasets. The fs-c algorithm uses the TTTD approach to variable block de-duplication. The CDC32 (content defined chunking) algorithm has an expected (average) block size of 32KB, a lower threshold (Tmin) of 8KB, and an upper threshold (Tmax) of 128KB. The threshold proportions remain consistent for the CDC16, CDC8, and CDC4 algorithms.
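The threshold proportions can be expressed as a small helper. This sketch assumes the CDC32 ratios (Tmin at one quarter of the expected size, Tmax at four times the expected size) carry over unchanged to the other variants; the CDC4 values it produces are inferred from that proportion and are not quoted from the fs-c tool. The function name is hypothetical.

```python
KB = 1024


def cdc_thresholds(expected_kb):
    """Return (t_min, expected, t_max) in bytes for a CDC variant.

    Assumes Tmin = expected / 4 and Tmax = expected * 4, the proportions
    stated for CDC32 (8KB / 32KB / 128KB).
    """
    expected = expected_kb * KB
    return expected // 4, expected, expected * 4


if __name__ == "__main__":
    for size_kb in (32, 16, 8, 4):
        t_min, expected, t_max = cdc_thresholds(size_kb)
        print(f"CDC{size_kb}: Tmin={t_min // KB}KB, "
              f"expected={expected // KB}KB, Tmax={t_max // KB}KB")
```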
The following table outlines the different fixed and variable algorithms used in the fs-c algorithm [7] tests.
Table 1: fs-c Algorithm Chunk Selection (column header: Chunker Type)
Figure 8: FS-C Chunk Distributions
Based on the distributions in Figure 8, the chunks between the minimum and average block size account for 50% to 55% of the total unique chunks, which corresponds to 20% to 25% of the total data size. We can derive the total number of chunks from these figures. In the worst case scenario (the largest number of chunks), we assume that 25% of the data is chunked at the minimum block size and the remaining 75% is chunked just above the average chunk size. In the best case scenario (the smallest number of chunks), 25% of the data is chunked at the average chunk size and the remaining 75% at the maximum chunk size.
Worst Case Total Chunks =
((.25 * DataSize) / Min Block Size) + ((.75 * DataSize) / Average Block Size)
Best Case Total Chunks =
((.25 * DataSize) / Average Block Size) + ((.75 * DataSize) / Max Block Size)
As an example, for a 100GB dataset (107374182400 bytes) chunked with a variable average block size of 16KB (4KB lower threshold, 64KB upper threshold), an estimated de-duplication percentage of 25%, and a signature size of 20 bytes, the memory requirements range is:
Worst Case Total Chunks =
((.25 * 107374182400) / 4096 ) + ((.75 * 107374182400) / 16384) = 11468800 Chunks
Best Case Total Chunks =
((.25 * 107374182400) / 16384 ) + ((.75 * 107374182400) / 65536) = 2867200 Chunks
From the worst and best case chunk estimates, we can now utilize our memory estimation formula presented earlier to estimate the minimum and maximum memory requirements for the index when running the CDC16 algorithm against the 100GB dataset.
Minimum Memory Requirements =
2867200 * (1 - .25) * (20) = 43008000 bytes ~ 42MB
Maximum Memory Requirements =
11468800 * (1 - .25) * (20) = 172032000 bytes ~ 165MB
Therefore, the memory requirements for our 100GB dataset are in the range of 42MB to 165MB.
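The worked example can be reproduced with the following Python sketch, which combines the best and worst case chunk counts with the memory formula. The function and parameter names are illustrative assumptions; the 25%/75% split is the one derived from the Figure 8 distributions and is exposed here as a parameter named small_share.

```python
def variable_chunk_bounds(data_size, t_min, average, t_max, small_share=0.25):
    """Return (best_case_chunks, worst_case_chunks) for variable block chunking.

    Worst case: small_share of the data chunks at t_min, the rest at average.
    Best case:  small_share of the data chunks at average, the rest at t_max.
    """
    large_share = 1.0 - small_share
    worst = small_share * data_size / t_min + large_share * data_size / average
    best = small_share * data_size / average + large_share * data_size / t_max
    return best, worst


def variable_index_memory_range(data_size, t_min, average, t_max,
                                dedup_fraction, signature_bytes=20):
    """Return (min_bytes, max_bytes) for the signature index."""
    best, worst = variable_chunk_bounds(data_size, t_min, average, t_max)
    per_chunk = (1.0 - dedup_fraction) * signature_bytes
    return best * per_chunk, worst * per_chunk


if __name__ == "__main__":
    # Values from the 100GB CDC16 example in the text.
    low, high = variable_index_memory_range(
        data_size=107374182400,
        t_min=4096, average=16384, t_max=65536,
        dedup_fraction=0.25,
    )
    print(f"Index memory range: {low:,.0f} to {high:,.0f} bytes")
```

Keeping the chunk bounds separate from the memory calculation makes it straightforward to substitute a different distribution assumption if another dataset does not match the Figure 8 shape.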
Validation of Method
We performed experiments with small (150GB or less) and large (500GB or more) datasets with both the fixed and variable algorithms to test how well the memory estimation formula applies to real world scenarios.
Fixed Block Index Memory Requirements
Table 2: Fixed Index Memory Estimates vs. Actual
Using the fs-c [7] fixed chunking algorithms, we tested an office type dataset extracted from a corporate office file share environment. To obtain our memory estimates we assumed the de-duplication percentage to be at or around the 5% mark for both the small and large datasets. This percentage was obtained from a sample run of the fixed algorithm on a dataset a fraction of the size. Additionally, the SHA-1 [6] data signature size of 20 bytes was selected at execution time. Based on our assumptions of the de-duplication percentage and the parameters selected at run time (signature size, average chunk size), the memory estimates calculated from the formula presented previously were within 8% of the actual memory requirements. The estimate error depends on the difference between the assumed and actual de-duplication percentages, and can only be reduced by using a larger sample size for the de-duplication percentage estimate [30].
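To illustrate how a de-duplication percentage can be measured on a sample, the sketch below splits a byte string into fixed-size chunks, hashes each chunk with SHA-1, and reports the fraction of chunks whose signature was already seen. It is a simplified stand-in for a sample run, not the fs-c implementation; the function name and the toy input are assumptions.

```python
import hashlib


def fixed_block_dedup_fraction(data: bytes, chunk_size: int) -> float:
    """Return the fraction of fixed-size chunks that duplicate an earlier chunk."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    signatures = set()   # in-memory index of unique 20-byte SHA-1 signatures
    total_chunks = 0
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        signatures.add(hashlib.sha1(chunk).digest())
        total_chunks += 1
    if total_chunks == 0:
        return 0.0
    return 1.0 - len(signatures) / total_chunks


if __name__ == "__main__":
    # Toy sample: three identical 4KB chunks followed by one distinct chunk.
    sample = b"A" * 4096 * 3 + b"B" * 4096
    print(f"De-duplication fraction: {fixed_block_dedup_fraction(sample, 4096):.2f}")
```

The set of signatures in this sketch plays the role of the in-memory index, so its size multiplied by 20 bytes is the quantity the estimation formula approximates.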
The variable block chunking experiments used the same dataset as the fixed block experiments and assumed the chunk distribution discussed previously to obtain the estimate range for the index memory. The de-duplication percentage estimates used for the CDC16 and CDC8 algorithms were 15% and 20% respectively. These estimates were obtained from local sample runs on the dataset. The SHA-1 data signature size was again set to 20 bytes at execution [6]. The minimum and maximum block thresholds set by the fs-c [7] algorithms were 4KB and 64KB respectively for the CDC16 algorithm and 2KB and 32KB respectively for the CDC8 algorithm.
Variable Block Index Memory Requirements
Table 3: Variable Index Memory Estimates vs. Actual
Based on the assumptions, the chunk distributions, and the parameters set at execution time, the actual memory requirements for the variable block executions on the small and large datasets were within the estimated range for the index memory, trending toward the higher end of the range for both datasets. For the variable algorithm, the index memory estimates are based not only on the de-duplication percentage estimate, but also on the best and worst case chunk distribution estimates.
Resource considerations regarding cloud instance type selection around the required index memory have been examined in relation to the chunking algorithm selected for duplicate detection. A methodology for estimating memory requirements was presented and tested against real world datasets. In our real world tests on the corporate file share datasets, the index memory estimates for both fixed and variable block algorithms were good enough to size the compute instance required to perform de-duplication at sub-file granularity. We can now proceed with our experimental evaluation of the cost tradeoffs between compute and storage when introducing de-duplication algorithms in a cloud environment.