Experimental Evaluation - Cloud De-duplication Cost Model THESIS

In our experimental evaluation of de-duplication in a cloud-based environment, we consider three factors: the dataset size, the cloud compute instance requirements, and the length of time the data will be retained in the cloud. Together these let us analyze the potential cost avoidance from performing fixed and variable de-duplication detection on a given dataset.

We performed our analysis on Amazon Web Services, using Elastic Compute Cloud (EC2) as the compute platform and Simple Storage Service (S3) as the storage infrastructure. The standard small and large instance types, along with the High-CPU medium instance type, were used in our testing. Below is a recap of the resource specifications:

o Small Instance (Default) 1.7 GB of memory, 1 EC2 Compute Unit (1 virtual core with 1 EC2 Compute Unit), 160 GB of local instance storage, 32-bit platform [26]

o Large Instance 7.5 GB of memory, 4 EC2 Compute Units (2 virtual cores with 2 EC2 Compute Units each), 850 GB of local instance storage, 64-bit platform [26]

o High-CPU Medium Instance 1.7 GB of memory, 5 EC2 Compute Units (2 virtual cores with 2.5 EC2 Compute Units each), 350 GB of local instance storage, 32-bit platform [26]

Amazon defines one EC2 Compute Unit (ECU) as providing the equivalent CPU capacity of a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor [26]. Additionally, Amazon’s Linux AMI operating system was selected for the instance builds. The cost analysis is based on Amazon’s pricing for the US East region, where our testing was performed.

We used the fs-c algorithm developed by [7], modified for reporting and statistics gathering, as our de-duplication engine. The fs-c algorithm has both fixed and variable chunking options. Chunk size options range from 2KB to 32KB for both the fixed and variable algorithms. The variable chunking approach is based on the two-threshold two-divisor algorithm [8], using Rabin fingerprinting [1] to determine the natural content boundaries. Additionally, fs-c is an out-of-band approach to de-duplication, which allows the data to be analyzed in place.
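To make the fixed versus variable distinction concrete, the following is a minimal sketch and not the fs-c implementation. It assumes SHA-1 chunk fingerprints, and it substitutes a Gear-style rolling hash with a single divisor for Rabin fingerprinting and the two-threshold two-divisor scheme. The function names and the table seed are hypothetical.

```python
import hashlib
import random

# Hypothetical table of 256 pseudo-random 64-bit values for the Gear-style hash.
_rng = random.Random(7)
_GEAR = [_rng.getrandbits(64) for _ in range(256)]


def fixed_chunks(data: bytes, size: int) -> list:
    """Split data into fixed-size chunks (the last chunk may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def cdc_chunks(data: bytes, min_size: int, avg_size: int, max_size: int) -> list:
    """Simplified content-defined chunking; avg_size must be a power of two."""
    mask = avg_size - 1
    chunks, start, n = [], 0, len(data)
    while start < n:
        end = min(start + max_size, n)
        cut = end                      # forced cut at max_size or end of data
        h = 0
        for i in range(start, end):
            # Rolling hash: older bytes shift out of the 64-bit value.
            h = ((h << 1) + _GEAR[data[i]]) & 0xFFFFFFFFFFFFFFFF
            if i - start + 1 >= min_size and (h & mask) == 0:
                cut = i + 1            # natural, content-defined boundary
                break
        chunks.append(data[start:cut])
        start = cut
    return chunks


def dedup_percent(chunks: list) -> float:
    """Percentage of bytes that duplicate an already-seen chunk."""
    total = sum(len(c) for c in chunks)
    unique = {hashlib.sha1(c).digest(): len(c) for c in chunks}
    return 100.0 * (total - sum(unique.values())) / total if total else 0.0


# Three 4KB blocks, the third repeating the first: one third is redundant.
sample = b"A" * 4096 + b"B" * 4096 + b"A" * 4096
print(dedup_percent(fixed_chunks(sample, 4096)))
```

Fixed chunking cuts at byte offsets, so an insertion early in a file shifts every later chunk. The content-defined variant cuts where the data itself satisfies the hash condition, which is why the variable algorithms can detect redundancy that the fixed algorithms miss.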

Our initial evaluation centered on small datasets extracted from a corporate file share environment that were 300GB or less in size. We grouped our datasets around the following data classifications:

Office data types: Microsoft Word (doc, docx), Excel (xls, xlsx), PowerPoint (ppt, pptx), Adobe’s Portable Document Format (pdf), rich text documents (rtf)

Database file types: Microsoft SQL master database files (mdf), Microsoft Access (mdb)

Virtual Machine Data files: VMware virtual machine (vmdk) files

Media Files: JPEG, GIF, PNG, MP3, MP4, WAV

As a first step we performed testing on a local system on the above datasets to gauge de-duplication percentages and instance type requirements. This allowed us to determine the dataset to focus on when moving the testing to cloud resources.

Below is a summary of the results from the first dataset of each type against both the fixed and variable (CDC) algorithms using various block sizes. Algorithm names consist of the chunk type followed by a number that indicates the fixed or average chunk size in KB. For example, the cdc8 algorithm uses an average chunk size of 8KB with a lower threshold of 2KB and an upper threshold of 32KB. For the other variable (CDC) algorithms, the lower and upper thresholds keep the same proportion to the average chunk size. Refer to Table 1 for algorithm specifications.
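The naming convention can be expressed as a small helper. This is an illustrative sketch: the function name is hypothetical, and the 1/4 and 4x threshold ratios are an assumption generalized from the cdc8 example (2KB, 8KB, 32KB) rather than values taken from Table 1.

```python
import re


def parse_algorithm(name: str) -> dict:
    """Map a name such as 'cdc8' or 'fixed16' to chunking parameters in KB."""
    m = re.fullmatch(r"(fixed|cdc)(\d+)", name.lower())
    if m is None:
        raise ValueError(f"unrecognized algorithm name: {name!r}")
    kind, size_kb = m.group(1), int(m.group(2))
    if kind == "fixed":
        return {"type": "fixed", "chunk_kb": size_kb}
    # Assumption: thresholds scale as avg/4 and avg*4, as in the cdc8 example.
    return {"type": "cdc", "avg_kb": size_kb,
            "min_kb": size_kb / 4, "max_kb": size_kb * 4}


print(parse_algorithm("cdc8"))
print(parse_algorithm("fixed16"))
```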

The local system specifications used in our initial testing were as follows:

Hardware Brand : HP DL580 G5

CPU : 2 Dual Core Intel(R) Xeon(R) CPU @ 3.00GHz

Memory: 8GB

Hard Drives : 2 X 73GB 10K SAS drives RAID 1 for OS

OS : Ubuntu 11.0.4 (x64)

Data Storage: EMC VNXe 3100

The datasets were stored on an EMC VNXe 3100 and accessed via NFS. Each dataset was run in isolation from the others, eliminating any competition for resources.

fixed8   35923993  181.7139233  73.48%  64 min  274.08G
fixed16  17962030  98.59985798  71.22%  63 min  274.08G
fixed32  8981049   51.73257442  69.80%  64 min  274.08G

cdc4     21235250  187.2049818  53.78%  45 min  159.77G
cdc8     11162546  98.42767746  53.77%  44 min  159.77G
cdc16    6015166   53.05123822  53.76%  45 min  159.77G
cdc32    3363838   29.66763861  53.76%  44 min  159.77G
fixed8   20940832  184.6494033  53.77%  46 min  159.77G
fixed16  10470416  92.32470163  53.77%  47 min  159.77G
fixed32  5235208   46.16235081  53.77%  46 min  159.77G

Media

Table 4: Small Dataset Results

As expected, the variable algorithms were able to find more redundancy overall within each dataset type. The office dataset had the largest change in de-duplication percentage between the fixed and variable block algorithms. Surprisingly, the execution time did not vary when changing the chunking granularity or when switching between the fixed and variable block algorithms. We examined this more closely and discovered that the bottleneck was not the CPU processing the fixed or variable block chunks but the disk I/O incurred when processing the data out-of-band. We recorded high I/O wait times during each execution, which caused the CPU to wait for the I/O to finish. This explains the consistent execution time regardless of the algorithm. Additionally, the VMDK de-duplication percentage is the highest because of the data redundancy inherent across similar operating system builds. The DB percentage remains the same for all the tests performed because the SQL database files were extracted from a system that had the allocation unit size set to 64K; therefore no additional duplicates would be discovered by reducing the chunk size below 64K. Finally, as expected based on our research, the more random data types, such as media formats, produced the lowest de-duplication percentages.

The small dataset memory requirements are within the resources available on the small and medium cloud compute instance types. Additionally, from our local testing the office dataset provides the most interesting analysis given its range of de-duplication percentages; therefore, moving forward we focus solely on office type datasets. To ensure consistent results, we also collected another office dataset of roughly the same size for our remaining small dataset testing.

After completing the initial testing on our local system, our remaining testing uses Amazon’s cloud resources. With the small dataset, our testing focuses on the small and medium instance types, which differ in the number of available ECUs [26]. The dataset was transferred to Amazon S3 storage in its original form to perform the out-of-band de-duplication testing. Our motivation for the small dataset test using cloud resources is to gauge the execution time difference between the small and medium instance types and analyze any cost savings. Again, all tests were run in isolation on separate instances, and only a single test was accessing the S3 storage bucket [26] at one time.

fixed4   29205380  540.1141443  3.04%   140 min  111.35G
fixed8   14612039  271.2054281  2.69%   138 min  111.35G
fixed16  7313593   136.0362223  2.48%   145 min  111.35G
fixed32  3662809   68.28364404  2.26%   150 min  111.35G

Office2

cdc4     24453549  305.4548119  34.51%  203 min  113.61G
cdc8     12056195  155.8396025  32.23%  206 min  113.61G
cdc16    6073155   81.40970867  29.72%  205 min  113.61G
cdc32    3072219   42.46005797  27.54%  212 min  113.61G
fixed4   29825063  471.1933076  17.17%  205 min  113.61G
fixed8   14934682  237.6557387  16.57%  210 min  113.61G
fixed16  7489809   119.7711156  16.16%  206 min  113.61G
fixed32  3767320   60.54580368  15.74%  214 min  113.61G

Table 5: EC2 m1.small Instance Small Dataset Results

fixed4   29825063  471.1933076  17.17%  69 min  113.61G
fixed8   14934682  237.6557387  16.57%  67 min  113.61G
fixed16  7489809   119.7711156  16.16%  70 min  113.61G
fixed32  3767320   60.54580368  15.74%  70 min  113.61G

Table 6: EC2 c1.medium Instance Small Dataset Results

Based on the results of the cloud testing on the small office dataset, the execution time difference when going from the small instance to the medium instance is in line with the cost difference under Amazon’s EC2 pricing at the time of this publication. Also, since the memory resources are the same on the small and medium instance types, a more aggressive algorithm cannot be used as a differentiator in terms of space and cost savings.

Therefore, there is little to no cost savings when comparing the execution times and the related compute cost differences of the small and medium instances on a small dataset. One interesting aspect of this testing is the relative consistency in the percentage of additional redundancy detected between the fixed and variable block algorithms for both office datasets.

Transitioning to the larger datasets of 500GB and above, we again focus our attention on an office type dataset extracted from a corporate file share environment. The goal of the large dataset is to examine more aggressive algorithms whose global chunk index exhausts the memory resources available in the small and medium instance types. This allows us to explore the cost model and tradeoffs associated with choosing a more aggressive algorithm and a large instance type versus a less aggressive algorithm and a smaller instance type over varying storage durations.

Using a dataset size of 764GB on the small and medium instance types, fixed16 and cdc16 were the most aggressive algorithms that could be run within the instances’ 1.7GB memory constraint once memory for the operating system and the execution of the de-duplication algorithm had been allocated. The execution times within a particular instance type are again dominated by the large I/O wait time experienced while processing the data. We again see notable increases in duplicate detection with the variable algorithms over the fixed. When using CDC4, a more aggressive algorithm, on the larger instance, roughly an additional 5 percent of redundancy was detected over that of the CDC16 algorithm on the smaller instances. This translates into approximately 41GB of additional redundant data eliminated. The execution times on the large instance with the more aggressive algorithms are slightly longer than on the medium instance.
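The 41GB figure follows directly from the de-duplication percentages reported in Table 7. The short check below is only a worked version of that arithmetic; the function name is hypothetical.

```python
DATASET_GB = 764.04
CDC16_PCT = 22.00   # cdc16 de-duplication percentage, Table 7
CDC4_PCT = 27.37    # cdc4 de-duplication percentage, Table 7


def extra_gb_removed(dataset_gb: float, base_pct: float, aggressive_pct: float) -> float:
    """Additional redundant data removed by the more aggressive algorithm."""
    return dataset_gb * (aggressive_pct - base_pct) / 100.0


extra = extra_gb_removed(DATASET_GB, CDC16_PCT, CDC4_PCT)
print(f"{extra:.1f} GB")  # 5.37 percentage points of 764.04GB
```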

m1.Small

fixed16  43828883   835.9696007  12.66%  1552.283333  764.04G
cdc16    30729169   586.1123848  22.00%  1515.566667  764.04G

c1.Medium

fixed16  43828883   835.9696007  12.66%  875.7333333  764.04G
cdc16    30729169   586.1123848  22.00%  896.2833333  764.04G

m1.Large

fixed16  43828883   835.9696007  12.66%  916.5666667  764.04G
fixed8   87026664   1659.901886  13.10%  964.7333333  764.04G
cdc16    30729169   586.1123848  22.00%  940.5333333  764.04G
cdc8     58972915   1124.819088  25.05%  853.1333333  764.04G
cdc4     116120895  2214.830303  27.37%  959.8833333  764.04G

Table 7: EC2 Instance Large Dataset Results

Using these results, we can now construct and analyze a cost model for the tradeoff between selecting a smaller instance type with a less aggressive algorithm and selecting a larger instance type with a more aggressive algorithm. We also looked at the cost model when storing the data for varying lengths of time, from one month to one year, and the effect storage duration has on the cost savings and on the choice of instance type.

To recap the factors of our cost analysis: instance type (m1_small, c1_medium, m1_large), de-duplication algorithm (fixed16, fixed8, cdc16, cdc8, cdc4), and storage duration (1 month, 3 months, 6 months, 1 year). Comparisons are made between the small and medium instance types on one side and the large instance type on the other. As discovered in the small dataset testing, the cost savings between the small and medium instances are nonexistent or insignificant, so those two are not compared against each other.

To start, we look at the cost breakdowns of the Amazon EC2 and S3 offerings. The Amazon EC2 compute costs are based on the instance hours used and on data transfer into and out of the EC2 environment. Partially consumed instance hours are billed as full hours, so all execution times are rounded up to the next hour for cost comparison. The cost of data transfer into the EC2 environment is excluded from our analysis since it does not change with the instance type selected.
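The hour-rounding rule can be sketched as follows. The helper names are hypothetical; the execution times are taken from Table 7 and the hourly rate from Table 8.

```python
import math


def billed_hours(execution_minutes: float) -> int:
    """Partially consumed instance hours are billed as full hours."""
    return math.ceil(execution_minutes / 60)


def compute_cost(execution_minutes: float, hourly_rate: float) -> float:
    """One-off compute cost in dollars for a single de-duplication run."""
    return billed_hours(execution_minutes) * hourly_rate


# cdc16 on c1.medium: 896.2833333 minutes at $0.165 per hour.
print(billed_hours(896.2833333), round(compute_cost(896.2833333, 0.165), 2))
```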

Amazon’s S3 cost model is based on the following factors: standard storage pricing, which is the price for the amount of storage used; request pricing, the cost for the number of PUT, COPY, POST, LIST, or GET operations performed on the S3 storage bucket; and data transfer cost, the cost to transfer data into and out of S3. Since we are using EC2 to communicate with S3, there is no data transfer charge, because data transferred between Amazon EC2 and Amazon S3 within the same region is free; in our case both EC2 and S3 are within the US East region [26]. Additionally, we focus on the standard storage pricing as opposed to reduced redundancy storage, which introduces the risk of data loss. With de-duplication, data protection is critical because a large percentage of files can reference a single data block. The reduced redundancy option is available as a lower-cost option for data that is reproducible [26]. Below are tables with the breakdown of the EC2 and S3 pricing for the US East region at the time of publication.

AWS EC2 Compute Pricing

Type $ Cost/Hr

Small (m1_small) 0.08

Medium (c1_medium) 0.165

Large (m1_large) 0.32

Table 8: AWS EC2 Pricing

AWS S3 Storage Pricing

Based on the execution times seen on the three instance types, we begin by breaking out the compute and storage costs associated with the small instance type running the CDC16 algorithm. For the compute cost we look at the execution time, which is 1515.566667 minutes (Table 7) and translates to 26 hours after rounding up to the next full hour.

The compute cost is a straight calculation: the 26 hours multiplied by the small instance type’s rate of $0.08 per hour, which equals $2.08. The storage cost has two components. The first is the storage charge of $0.125 per GB per month for the first TB stored. After running the CDC16 algorithm, 22% of the data was removed as redundant, leaving approximately 596GB, which has an associated cost of $74.50 per month. The second component is the request pricing, which is based on PUT, COPY, POST, LIST, or GET requests. PUT, COPY, POST, or LIST requests cost $0.01 per 1,000 requests, while GET and other requests cost $0.01 per 10,000 requests [26]. To calculate the number of requests, we need the number of files that make up our dataset, which translates into the number of object PUT requests. The 764.04GB dataset comprises 450,990 files and directories, which translates to an estimated 450,990 PUT operations with an associated cost of $4.51.
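The breakdown above can be checked with a few lines of arithmetic. This is a sketch using the prices quoted in the text; the function and constant names are hypothetical.

```python
import math

EC2_SMALL_PER_HOUR = 0.08   # m1_small, Table 8
S3_PER_GB_MONTH = 0.125     # first TB stored, as quoted in the text
PUT_PER_1000 = 0.01         # PUT, COPY, POST, or LIST requests


def first_month_cost(exec_minutes, hourly_rate, stored_gb, put_requests):
    """Return the compute, storage, and request components in dollars."""
    compute = math.ceil(exec_minutes / 60) * hourly_rate
    storage = stored_gb * S3_PER_GB_MONTH
    requests = put_requests / 1000 * PUT_PER_1000
    return {"compute": round(compute, 2),
            "storage": round(storage, 2),
            "requests": round(requests, 2)}


# CDC16 on the small instance: cdc16 row of Table 7, 596GB retained,
# one PUT per file or directory.
cost = first_month_cost(1515.566667, EC2_SMALL_PER_HOUR, 596, 450_990)
print(cost)
```

The storage and request components together give the $79.01 first-month storage cost used for CDC16 in the cost assessment.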

Using these figures we can calculate the cost for one month, three month, six month, and one year storage periods. The calculations are the same for the medium and large instance types, except for the values of execution time and de-duplication percentage. The request cost remains the same because the dataset and the number of files are the same across instance tests.

Algorithm / Instance Storage TimeFrame

CDC16 on Small Instance 1 Month 3 Month 6 Month 1 Year

Compute Cost $2.08 $2.08 $2.08 $2.08

Storage Cost $79.01 $237.03 $474.06 $948.12

CDC16 on Medium Instance 1 Month 3 Month 6 Month 1 Year

Compute Cost $2.48 $2.48 $2.48 $2.48

Storage Cost $79.01 $237.03 $474.06 $948.12

CDC8 on Large Instance 1 Month 3 Month 6 Month 1 Year

Compute Cost $4.80 $4.80 $4.80 $4.80

Storage Cost $76.01 $228.03 $456.06 $912.12

CDC4 on Large Instance 1 Month 3 Month 6 Month 1 Year

Compute Cost $5.12 $5.12 $5.12 $5.12

Storage Cost $73.76 $221.28 $442.56 $885.12

Table 10: Instance Cost Assessment
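The duration columns of Table 10 can be reproduced from the first-month figures: the compute cost is incurred once, while the first-month storage figure (request charge included) is scaled by the number of months. The sketch below assumes exactly that scaling, as read off the table; the names are hypothetical.

```python
MONTHS = {"1 Month": 1, "3 Month": 3, "6 Month": 6, "1 Year": 12}


def cost_over_durations(compute_cost: float, monthly_storage_cost: float) -> dict:
    """Map each storage time frame to (compute cost, storage cost) in dollars."""
    return {label: (compute_cost, round(monthly_storage_cost * months, 2))
            for label, months in MONTHS.items()}


# First-month values from Table 10.
cdc16_small = cost_over_durations(2.08, 79.01)
cdc4_large = cost_over_durations(5.12, 73.76)
for label in MONTHS:
    print(label, cdc16_small[label], cdc4_large[label])
```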

With the varying storage durations the following assumptions were made:

Compute Cost: the compute cost was calculated only for the initial data de-duplication process; any subsequent data accesses are not taken into account. The data access frequency is independent of the instance type. Another aspect not taken into account is data being added to or deleted from the cloud instance over the storage duration. Based on our research, data additions have a positive impact on the cost savings seen with the more aggressive algorithms.

Looking at the cost savings of the large instance over the small instance, using the large instance type with the aggressive CDC4 algorithm over a one year storage time frame produces a cost savings of 6.15% or $58.37 compared with running CDC16 on the small instance type. The shorter storage time frames also produce cost savings, ranging from 2.57% for the first month to 5.82% at the six month mark. With CDC8 on the large instance, a meaningful cost savings is not realized immediately, with the first month savings at less than 1%. Therefore, to maximize the cost savings in the shortest amount of time, the CDC4 algorithm should be run on the larger instance.

Small vs Large 1 Month 3 Month 6 Month 1 Year

$ Savings CDC8 vs CDC16 0.18375 5.99125 14.7025 32.125

% Savings 0.23% 2.51% 3.09% 3.39%

$ Savings CDC4 vs CDC16 2.0775 12.3125 27.665 58.37

% Savings 2.57% 5.16% 5.82% 6.15%

Table 11: m1_small vs. m1_large Instance Cost
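The savings comparison can be written as a small cost model: a one-off compute cost plus a monthly storage cost that depends on how much data remains after de-duplication. The sketch below is an approximation under stated assumptions. It works from the rounded percentages and execution times in Table 7 and omits the request charge, which is identical for every configuration and cancels out, so its output tracks Table 11 closely but not to the cent. All names are hypothetical.

```python
import math

S3_PER_GB_MONTH = 0.125   # first TB stored, as quoted in the text
DATASET_GB = 764.04


def total_cost(exec_minutes, hourly_rate, dedup_pct, months):
    """Compute cost (billed in whole hours) plus storage for the period."""
    compute = math.ceil(exec_minutes / 60) * hourly_rate
    stored_gb = DATASET_GB * (1 - dedup_pct / 100)
    return compute + months * stored_gb * S3_PER_GB_MONTH


def savings(baseline, candidate, months):
    """Dollars saved by choosing the candidate configuration over the baseline."""
    return total_cost(*baseline, months) - total_cost(*candidate, months)


# (execution minutes, $ per hour, de-duplication %) from Tables 7 and 8.
SMALL_CDC16 = (1515.566667, 0.08, 22.00)
LARGE_CDC8 = (853.1333333, 0.32, 25.05)
LARGE_CDC4 = (959.8833333, 0.32, 27.37)

for months in (1, 3, 6, 12):
    print(months,
          round(savings(SMALL_CDC16, LARGE_CDC8, months), 2),
          round(savings(SMALL_CDC16, LARGE_CDC4, months), 2))
```

The structure makes the tradeoff explicit: the large instance costs more up front, and each additional month of storage adds a fixed saving, so longer retention favors the more aggressive algorithm. Substituting the medium instance figures (896.2833333 minutes at $0.165 per hour) as the baseline gives the corresponding medium versus large comparison.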

The cost analysis between the medium and large instances is similar to that of the small and large. The medium instance, having the same memory as the small instance type, is again limited to the CDC16 algorithm. Therefore, the amount of duplicates detected is the same as on the small instance, but with an improvement in execution time. The execution time on the medium instance is about 40% less than on the small instance, but the per-instance-hour cost of the medium instance is roughly double that of the small instance ($0.165 versus $0.08 in Table 8), so the medium instance still generates a higher compute cost than the small instance. For example, in the first month the small instance has a cost of $81.09 compared with $81.49 for the same execution on the medium instance type. Because of this higher compute cost for the medium instance, the savings from moving to the large instance are correspondingly greater. For example, the first month savings with CDC4 are 2.57% for small to large compared with 3.04% for medium to large.

Medium vs Large 1 Month 3 Month 6 Month 1 Year

$ Savings CDC8 vs CDC16 0.57875 6.38625 15.0975 32.52

% Savings 0.71% 2.67% 3.17% 3.43%

$ Savings CDC4 vs CDC16 2.4725 12.7075 28.06 58.765

% Savings 3.04% 5.32% 5.90% 6.19%

Table 12: c1_medium vs. m1_large Instance Cost

Dataset size is an important factor in the potential cost savings when selecting an appropriate instance type. Initially our testing focused on small datasets of 100GB to 300GB for various data types. This testing examined the memory requirements for the most aggressive variable algorithm, CDC4. It was determined that the small and medium instance memory resources were sufficient to handle the CDC4 algorithm. Therefore, the execution time improvement on the medium instance produced little to no cost avoidance for choosing the medium instance over the small instance.

For a larger dataset of 764.04GB, cost savings can be realized by selecting a larger instance type and running a more aggressive duplicate detection algorithm. This assumes that additional redundancy continues to be detected as the algorithm chunks a given dataset more finely. The redundancy factor is data dependent, so sample tests need to be performed locally to gauge the expected percentage of savings and to select the appropriate instance type using the memory estimation methodology presented. Also, simply selecting a larger instance is not always the most cost effective method, as shown by the small dataset testing.
