Cloud De-duplication Cost Model THESIS


Presented in Partial Fulfillment of the Requirements for the Degree Master of Science in the Graduate School of The Ohio State University

By

Christopher Scott Hocker

Graduate Program in Computer Science and Engineering

The Ohio State University 2012

Master's Examination Committee:


Copyright by Christopher Scott Hocker


Abstract

De-duplication algorithms have been used extensively in the backup and recovery realm to exploit the temporal data redundancy created by the versioning nature of backup applications. More recently, de-duplication algorithms have progressed into the primary storage area, where more spatial data exists; in either case they increase the efficiency of the usable storage capacity. In parallel, another industry trend is the use of cloud resources to provide a utility model for software and infrastructure services. Cloud resources aid in decreasing the provisioning time for applications and infrastructure, while increasing the scalability to meet the elastic nature of application and infrastructure requirements. Using de-duplication algorithms within cloud resources is the next logical step to increase the efficiency and reduce the cost of cloud computing.

Cloud environments involve pricing models for both computing and storage. In this work, we examine the tradeoffs between computing costs and reduction in storage costs (due to higher de-duplication) for a number of popular de-duplication algorithms. Since the main factor impacting the computing cost is the memory available in different types of instances, we also develop a methodology for predicting the memory requirements for executing a given algorithm on a particular dataset.


Dedication

Dedicated to those who supported me throughout my academic career… my Wife, Parents, Brother, Sister and Friends.


Acknowledgments

First, I would like to thank my advisor, Dr. Gagan Agrawal, for challenging me and providing guidance from the beginning of my time at Ohio State. The support and advice he provided were invaluable.

Additionally, I would like to thank my thesis committee member Dr. Christopher Stewart for his time and participation during this work.

I would also like to thank my Wife for her support and understanding of the long hours during this work and throughout my entire academic career. Finally, I would like to thank the rest of my support system, my Parents, Brother, Sister, and Friends who were always there to provide a word of encouragement and motivation.


Vita

2001 ... Vandalia Butler High School

2007 ... B.S. CS, Wright State University

2008 to present ... M.S. CSE, The Ohio State University

Fields of Study

Table of Contents

Abstract

Dedication

Acknowledgments

Vita

Fields of Study

List of Tables

List of Figures

Chapter 1: Introduction

Chapter 2: De-duplication Algorithms

Chapter 3: Memory Prediction

Chapter 4: Experimental Evaluation

Chapter 5: Related Research


List of Tables

Table 1: fs-c Algorithm Chunk Selection

Table 2: Fixed Index Memory Estimates vs. Actual

Table 3: Variable Index Memory Estimates vs. Actual

Table 4: Small Dataset Results

Table 5: EC2 m1.small Instance Small Dataset Results

Table 6: EC2 c1.medium Instance Small Dataset Results

Table 7: EC2 Instance Large Dataset Results

Table 8: AWS EC2 Pricing

Table 9: AWS S3 Pricing

Table 10: Instance Cost Assessment

Table 11: m1_small vs. m1_large Instance Cost


List of Figures

Figure 1: Out-of-Band vs. In-band De-duplication

Figure 2: Basic Sliding Window Algorithm [8]

Figure 3: TTTD Pseudo Code [8]

Figure 4: Chunk Distribution of TTTD algorithm [9]

Figure 5: TTTD-S Chunk Distribution Improvements [9]

Figure 6: TTTD-S algorithm pseudo code

Figure 7: De-duplication ratio and percent savings [19]


Chapter 1: Introduction

De-duplication algorithms have been used extensively in the backup and recovery realm to exploit the temporal data redundancy created by the versioning nature of backup applications. More recently, de-duplication algorithms have progressed into the primary storage area, where more spatial data exists; in either case they increase the efficiency of the usable storage capacity.

In parallel, another industry trend is the use of cloud resources to provide a utility model for software and infrastructure services. Cloud resources aid in decreasing the provisioning time for applications and infrastructure, while increasing the scalability to meet the elastic nature of application and infrastructure requirements. As companies begin to transition their data and infrastructure to the cloud, the methods of cost savings they are accustomed to, such as de-duplication, should transition as well. The increased efficiency in usable capacity gained through the use of de-duplication will translate into a positive impact on the cloud pay-as-you-go model. De-duplication does come with a tradeoff: additional compute resources are required to analyze the data for duplicates, so selecting the right instance type to run a given de-duplication algorithm is important. Therefore, we will examine the resources and cost factors related to cloud environments and how de-duplication can be effectively integrated.

Cloud environments involve pricing models for both computing and storage. In this work, we examine the tradeoffs between computing costs and reduction in storage costs (due to higher de-duplication) for a number of popular de-duplication algorithms. Since the main factor impacting the computing cost is the memory available in different types of instances, we also develop a methodology for estimating the memory requirements for executing a given algorithm on a particular dataset.

Through experiments, we show that running a more aggressive de-duplication algorithm on a larger cloud compute instance can yield greater cost savings than running a less aggressive algorithm on a smaller instance type. In some cases the dataset size does not warrant a larger instance type to run more granular de-duplication algorithms, since the index memory requirements are satisfied by the entry level instance types. In these situations there is no benefit to choosing a larger instance type in an effort to reduce cloud resource cost.

Thesis Statement

“Integrating de-duplication effectively and efficiently into a cloud environment requires an understanding of the resource requirements, specifically the memory requirements and the tradeoff in compute cost for processing data for duplicates at various levels of granularity.”

Contributions

This thesis makes the following contributions:

1. Proposes a methodology to predict the required cloud instance type based on the memory requirements to run popular de-duplication algorithms on a given dataset.

2. Analyzes cloud compute requirements for running de-duplication algorithms at varying chunk granularity.

3. Evaluates cost factors associated with running de-duplication in a cloud environment, including compute instance types, de-duplication algorithm chunk granularity, and data storage durations.

Chapter 2: De-duplication Algorithms

The implementation of data de-duplication technologies varies in terms of de-duplication placement, the timing of the de-duplication of data, and the granularity at which data is analyzed to find duplicate data. The placement of the de-duplication process can occur at the client or target device [4]. Additionally, a hybrid approach using both the client and target also exists. The timing of the de-duplication process is either in-band, as the data is received/sent, or out-of-band at a scheduled time.

Figure 1: Out-of-Band vs. In-band De-duplication

The placement and timing are key components of de-duplication algorithms; however, in this research we will focus on client based de-duplication algorithms and out-of-band timing, which will allow us to integrate de-duplication into an existing cloud environment and analyze the data in place. We will explore in more detail the granularity at which duplicates are detected, specifically at the fixed and variable block levels. Finally, we will provide a brief explanation of de-duplication ratios and the contributing factors to the ratio.

Duplicate Detection

For duplicate detection there are three main approaches: whole file hashing, often called single instance storage; sub file chunking, which comprises fixed and variable block hashing; and delta encoding.

For whole file, entire files are given a hash signature using MD5 or SHA-1 [6]. Files with identical hash signatures are assigned a pointer to the single file instance previously stored. In certain algorithms a byte by byte comparison is performed to eliminate the potential for hash collisions, which is often a concern with hash based comparisons [27]. Lower de-duplication ratios are generally obtained because the entire file has to match; any small change in the file will alter the hash and defeat any previous match.

The second and more popular approach is sub file chunking [4]. Fixed and variable block hashing are the two types of sub file chunking. Fixed block de-duplication chunks and hashes a byte stream based on a fixed block size. The hash signatures are stored and referenced in a global index, which is implemented using a Bloom filter type data structure to quickly identify new unique block segments [14]. If the block signature already exists in the index, a pointer to the existing block is created; otherwise the signature is stored and the block is written to disk. In contrast, variable block algorithms use methods such as Rabin Fingerprinting [1], a hashing algorithm that uses a sliding window to determine the natural block boundaries with the highest probability of matching other blocks. Variable block algorithms still employ the Bloom filter based data structure for the in-memory index [11]. Variable block proves to be a more efficient approach compared to fixed and whole file hashing, regardless of the slight variations or offsets that exist due to modifications in similar files and blocks [4].
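The fixed block lookup described above can be illustrated with a minimal sketch. It assumes a plain Python dictionary as the global index in place of the Bloom filter based structure, and the names `dedup_fixed` and `rebuild` are hypothetical helpers introduced only for this illustration.

```python
import hashlib


def dedup_fixed(data: bytes, block_size: int = 8192):
    """Split data into fixed-size blocks and keep each unique block once."""
    index = {}    # SHA-1 digest -> position of the block in `store`
    store = []    # unique blocks, standing in for blocks written to disk
    recipe = []   # one pointer per input block, enough to rebuild the stream
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        digest = hashlib.sha1(block).digest()
        if digest not in index:
            # New signature: record it and "write" the block.
            index[digest] = len(store)
            store.append(block)
        # Duplicate or not, the stream only keeps a pointer to the block.
        recipe.append(index[digest])
    return store, recipe


def rebuild(store, recipe):
    """Reassemble the original byte stream from the pointers."""
    return b"".join(store[i] for i in recipe)


block_a = b"A" * 8192
block_b = b"B" * 8192
stream = block_a + block_b + block_a + block_a
store, recipe = dedup_fixed(stream)
```

In this toy stream only two of the four blocks are stored; the other two become pointers to the first copy of `block_a`.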

Exact and similarity matching techniques are used in the sub file hashing algorithms. Exact matching algorithms examine the chunk hash index for an exact signature match. For exact matching algorithms, the block size has a direct impact on the hash index size, which can present a problem when storing the index in memory [29]. An example was provided in [2], where the index for 1 Petabyte of de-duplicated data, assuming an average block size of 8KB, would require at least 128GB of memory. Similarity matching techniques address the index size by increasing the block size, to 16MB in [2], which in turn reduces the index size to 4GB for the 1 Petabyte of data. A similarity signature consists of a number of block signatures based on a subset of the chunk bytes. If a similarity signature matches more than some threshold of block signatures, then there is a reasonable probability that the two chunks share common block based signatures [2]. Thus a match is found, and the new chunk is compared and de-duplicated against the similar chunks [29]. The tradeoff with the similarity matching technique is that de-duplication performance is highly dependent upon the speed of the de-duplication repository storage, as the repository is referenced for similarity block signature matches. Also, since each chunk is only compared against a limited number of other chunks in similarity matching, duplicates are occasionally stored [29].

Finally, delta encoding (also called data differencing) processes files against a reference file for differences, storing the deltas in patch files [17]. Selecting the previously stored reference file is a key operation in delta encoding algorithms, and the reference is often selected based on a fingerprinting technique similar to that of the whole file and sub file chunking algorithms [4].

The sub file chunking approach with exact matching, used with variable block algorithms, is able to detect the varying offsets of data blocks. This addresses the boundary shifting concerns of fixed or whole file algorithms. The tradeoff comes in terms of the additional resources required to maintain the metadata associated with the increased number of chunks seen with variable block approaches. Additionally, index lookups increase during the variable chunk detection process. Even with the increased resource requirements, the majority of current algorithms use variable block sub file hashing with exact matching to maximize efficiency and overall de-duplication ratios.

De-duplication Implementations

The process is straightforward for whole file hashing and fixed block hashing: perform the hash index lookups based on the SHA-1 [6] or MD5 hash value of the file or fixed block to determine if the data is unique. If a duplicate is detected in the hash lookup, optionally perform a byte by byte comparison, then modify the file metadata to reference the previously stored data. If the data is unique, record the hash value, perform local compression (e.g. LZ, gzip) on the unique block, and store only that data [13].

For the variable block otherwise known as content based chunk algorithms, there are several algorithms that vary in their implementation, overall performance and


block algorithms is elimination of the boundary shifting problem [18]. If a small modification is made to a file, the chunk boundaries for whole file and fixed block chunking shift, causing a poor duplicate detection against the file or data.

The low bandwidth network filesystem [18] first introduced the basic sliding window (BSW) algorithm, which takes three parameters as inputs: a fixed window size W, an integer divisor D, and an integer remainder R. The window shifts one byte at a time, up to the maximum window size of W, from the beginning of the file to the end. Fingerprints (h) of the window contents are generated with Rabin fingerprinting. Rabin introduced the idea of detecting the natural block boundaries in a byte stream and assigning the variable chunks a signature [1]. The algorithm then tests whether (h mod D) = R; if true, a D-match has been found and the current position is set as a breakpoint for that chunk. The parameter D can be configured to make the chunk size as close to the data expectations as possible to maximize the de-duplication. The parameter R must be between 0 and D-1, and most often is configured as D-1. Figure 2 provides a visual representation of the basic sliding window approach.
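A minimal sketch of the (h mod D) = R test follows. It is illustrative only: a simple polynomial rolling hash stands in for Rabin fingerprinting, no maximum chunk size is imposed, and the function name `bsw_breakpoints` and the default parameter values are assumptions rather than values taken from [18].

```python
import random


def bsw_breakpoints(data: bytes, window: int = 48, divisor: int = 1024):
    """Return chunk end offsets where the window hash satisfies h mod D == R."""
    base, mod = 257, (1 << 61) - 1      # assumed rolling-hash constants
    remainder = divisor - 1             # R configured as D-1
    top = pow(base, window - 1, mod)    # weight of the oldest byte in the window
    h = 0
    breakpoints = []
    for i, byte in enumerate(data):
        if i >= window:
            # Drop the byte that slides out of the window.
            h = (h - data[i - window] * top) % mod
        h = (h * base + byte) % mod
        # Only test once the window is full.
        if i + 1 >= window and h % divisor == remainder:
            breakpoints.append(i + 1)
    if not breakpoints or breakpoints[-1] != len(data):
        breakpoints.append(len(data))   # close the final chunk
    return breakpoints


rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(50_000))
shifted = b"!" + original               # one byte inserted at the front
cuts_original = bsw_breakpoints(original)
cuts_shifted = bsw_breakpoints(shifted)
# Every original breakpoint reappears one byte later in the modified stream.
realigned = {cut + 1 for cut in cuts_original} <= set(cuts_shifted)
```

Because each breakpoint depends only on the bytes inside the window, inserting a byte at the front moves the breakpoints with the content instead of invalidating every later chunk, which is the behavior fixed block chunking lacks.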


Problems presented by the basic sliding window approach include the large chunk sizes generated when a match is not detected and the data is chunked at the window size. This leads to boundary shifting problems when small modifications are made, making the large chunk matches more difficult.

Additional improvements were made which resulted in the introduction of the two divisor (TD) algorithm [8]. This algorithm addressed the issue with the basic sliding window algorithm by introducing a second divisor (S) that is smaller than D which increases the chance of a match. Both D and S are calculated at each byte shift to increase the chances of chunk match, decreasing the number of large chunks.

Using the BSW or the TD algorithms, the chunk size is only upper bounded, so chunks can vary greatly in size. Small chunk sizes greatly increase the number of chunks, which is directly related to the memory overhead required for exact matching techniques [29]. The two threshold two divisor (TTTD) algorithm was developed to address the range of chunk sizes generated during duplicate detection [8]. TTTD added two threshold parameters to the BSW and TD algorithms which control the upper (Tmax) and lower (Tmin) bounds of chunk sizes [8]. Data fingerprints are not generated until the minimum byte threshold is met, addressing the overhead issues related to small chunk sizes while still addressing the boundary shifting concerns of chunking at large chunk sizes.

int p=0; //current position
int l=0; //position of last breakpoint
int backupBreak=0; //position of backup breakpoint

for (;!endOfFile(input);p++){
    unsigned char c=getNextByte(input);
    unsigned int hash=updateHash(c);
    if (p - l<Tmin){
        //not at minimum size yet
        continue;
    }
    if ((hash % Ddash)==Ddash-1){
        //secondary divisor
        backupBreak=p;
    }
    if ((hash % D) == D-1){
        //we found a breakpoint
        //before maximum threshold.
        addBreakpoint(p);
        backupBreak=0;
        l=p;
        continue;
    }
    if (p-l<Tmax){
        //we have failed to find a breakpoint,
        //but we are not at the maximum yet
        continue;
    }
    //when we reach here, we have
    //not found a breakpoint with
    //the main divisor, and we are
    //at the threshold. If there
    //is a backup breakpoint, use it.
    //Otherwise impose a hard threshold.
    if (backupBreak!=0){
        addBreakpoint(backupBreak);
        l=backupBreak;
        backupBreak=0;
    }
    else{
        addBreakpoint(p);
        l=p;
        backupBreak=0;
    }
}

Figure 3: TTTD Pseudo Code [8]
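For readers who want to experiment with the control flow in Figure 3, the following is a runnable sketch under stated assumptions: a polynomial rolling hash replaces Rabin fingerprinting, the threshold and divisor defaults are illustrative rather than taken from [8], and `tttd_breakpoints` is a hypothetical name.

```python
import random


def tttd_breakpoints(data: bytes, t_min=2048, t_max=32768,
                     main_d=8192, backup_d=4096, window=48):
    """Return chunk end offsets using two thresholds and two divisors."""
    base, mod = 257, (1 << 61) - 1      # assumed rolling-hash constants
    top = pow(base, window - 1, mod)
    h = 0
    last = 0        # offset of the last breakpoint
    backup = 0      # offset of the backup breakpoint, 0 when none
    breakpoints = []
    for i, byte in enumerate(data):
        if i >= window:
            h = (h - data[i - window] * top) % mod
        h = (h * base + byte) % mod
        pos = i + 1                     # offset just past the current byte
        if pos - last < t_min:
            continue                    # not at minimum size yet
        if h % backup_d == backup_d - 1:
            backup = pos                # remember a secondary-divisor match
        if h % main_d == main_d - 1:
            breakpoints.append(pos)     # main-divisor match before Tmax
            last, backup = pos, 0
            continue
        if pos - last < t_max:
            continue                    # keep looking for a match
        # At the maximum threshold: use the backup breakpoint if one exists,
        # otherwise cut at the current position.
        cut = backup if backup else pos
        breakpoints.append(cut)
        last, backup = cut, 0
    if last < len(data):
        breakpoints.append(len(data))   # final, possibly short, chunk
    return breakpoints


rng = random.Random(1)
sample = bytes(rng.getrandbits(8) for _ in range(200_000))
cuts = tttd_breakpoints(sample)
sizes = [end - start for start, end in zip([0] + cuts[:-1], cuts)]
```

In this sketch every chunk except possibly the last falls between `t_min` and `t_max` bytes, which is the bounding behavior the two thresholds are meant to provide.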

Additional studies found that when the maximum threshold (Tmax) of TTTD was reached, only the last secondary divisor (S) match was used for chunking. All other secondary divisor matches calculated were not considered, causing a wide distribution of chunk sizes. See Figure 4 for the chunk distribution of the TTTD algorithm.


Figure 4: Chunk Distribution of TTTD algorithm [9]

The chunk distribution contains two groupings (as seen in Figure 4): the first around the chunk size expected from the main divisor (D), and the second near the max chunk threshold, where a match was not discovered and the previous secondary divisor (S) match was used. TTTD-S [9] narrows this wide distribution by introducing a switchP value that is set to 1.6 times the expected chunk size [9]. Once the current chunk length exceeds switchP, the divisors are halved to shorten the match process. This in turn helps to find a breakpoint before the max chunk threshold is reached. Additionally, it can improve the distribution and bring the second chunk grouping closer to the average chunk size detected by the main divisor. Figure 5 illustrates the improvements in the chunk distribution made by the switchP parameter introduced in the TTTD-S algorithm, further reducing the chances of boundary shifting conditions with data modifications.


Figure 5: TTTD-S Chunk Distribution Improvements [9]

int currP = 0, lastP = 0, backupBreak = 0;

for ( ; !endOfFile( input ) ; currP++ ) {
    unsigned char c = getNextByte( input );
    unsigned int hash = updateHash( c );
    if ( currP - lastP < minT ) {
        //not at minimum size yet
        continue;
    }
    if ( currP - lastP > switchP ) {
        //past switchP: halve the divisors
        switchDivisor( );
    }
    if (( hash % secondD ) == secondD - 1 ) {
        //secondary divisor match, keep as backup
        backupBreak = currP;
    }
    if (( hash % mainD ) == mainD - 1 ) {
        //main divisor match before maximum threshold
        addBreakpoint( currP );
        backupBreak = 0;
        lastP = currP;
        resetDivisor( );
        continue;
    }
    if ( currP - lastP < maxT ) {
        continue;
    }
    //at the maximum threshold: use the backup
    //breakpoint if there is one, otherwise cut here
    if ( backupBreak != 0 ) {
        addBreakpoint( backupBreak );
        lastP = backupBreak;
        backupBreak = 0;
        resetDivisor( );
    } else {
        addBreakpoint( currP );
        lastP = currP;
        backupBreak = 0;
        resetDivisor( );
    }
}

Figure 6: TTTD-S algorithm pseudo code

Additional algorithms exist that use a hybrid approach by incorporating variable and fixed block techniques, as well as small chunk merging techniques to reduce the number of small chunks, thereby reducing the overhead associated with a large number of small chunks. Compression is also often used in conjunction with de-duplication to increase the storage space utilization.

Data de-duplication algorithms have been extensively researched. The techniques available today vary depending on the de-duplication placement, timing of the detection process, and the granularity at which duplicates are discovered [4]. Regardless of the techniques, the overall effectiveness of any of the de-duplication algorithms remains data dependent [30]. The fs-c algorithm [7] used in our research is based on the TTTD algorithm using Rabin [1] fingerprinting for generating the natural chunk boundaries.

De-duplication Savings

In addition to the inner workings of the de-duplication algorithms, the data characteristics help in understanding the space savings obtained by the various algorithms. Several factors, such as data type, scope of the de-duplication, and data storage period, play a role in overall de-duplication savings. De-duplication savings are often stated in terms of ratios, defined as the number of bytes input to the de-duplication process divided by the number of bytes output [19]. Figure 7 depicts the conversion of de-duplication ratios to percentages. In our studies we will use percentages to eliminate any confusion and make the overall savings more apparent.
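Given the ratio definition above (bytes in divided by bytes out), the corresponding percent savings follows directly as 1 minus the inverse of the ratio. A small sketch of that conversion, with `percent_savings` as a hypothetical helper name:

```python
def percent_savings(ratio: float) -> float:
    """Convert a de-duplication ratio (bytes in / bytes out) to percent saved."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    # bytes saved / bytes in = (ratio - 1) / ratio
    return 100 * (ratio - 1) / ratio


savings_by_ratio = {ratio: percent_savings(ratio) for ratio in (2, 4, 5, 10, 20)}
```

The conversion shows why large ratios can be misleading: moving from 2:1 to 4:1 adds 25 percentage points of savings, while moving from 10:1 to 20:1 adds only 5.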


Figure 7: De-duplication ratio and percent savings [19]

Data file types are one component that affects the expected de-duplication savings. For example, files generated by humans in applications, such as text documents, spreadsheets, and presentations, often contain a large amount of redundant data, while data generated by a computer system, such as images, media, and archived files, often has less redundancy due to the random nature of the data [19].

The scope of de-duplication refers to the range of datasets examined during duplicate detection. For example, “global” de-duplication allows for detection of duplicates across multiple data sources, which can span multiple storage systems or locations [19]. Conversely, de-duplication across just a single appliance or within a single client’s data only looks at the data contained within that appliance or client, creating silos of de-duplication stores. In general, the larger the data scope for duplicate detection, the higher the expected space savings.

Data storage periods have an effect on the de-duplication savings by increasing the chance of exploiting temporal redundancy. For example, in a backup type scenario where temporal data is accumulated over time due to the versioning nature of backup applications, ratios are expected to be higher. In a primary storage scenario, spatial data exists across a broad spectrum of data types, which leads to lower de-duplication ratios overall.

This outlines data de-duplication approaches in terms of de-duplication placement, timing, and the granularity at which data is analyzed to find duplicates. These approaches combined with the algorithms outlined above provide a foundation for

Chapter 3: Memory Prediction

To examine the tradeoffs between compute and storage cost with the addition of de-duplication, we need a method to estimate the instance type required to execute a given de-duplication algorithm. Since the main factor impacting the computing cost is the memory available in different types of instances, we developed a methodology for predicting the memory requirements for executing a given algorithm on a particular dataset.

Estimation Method

With de-duplication there is no one-size-fits-all configuration. Depending on the application and resources available, certain algorithms might be more effective than others. One resource consideration is the total memory required to store the index of unique data signatures. Both fixed and variable block exact matching implementations use an in-memory index for data signature lookups to determine duplicate blocks. Some techniques use similarity signatures and increase the chunk size to control the memory index size, the tradeoff being an increased reliance on the speed and responsiveness of the de-duplication store for data comparisons. Therefore, our focus when utilizing cloud resources is on the exact matching techniques, where the memory size for the index is a concern.

Providing a means to estimate the index size is important when sizing system requirements for a de-duplication implementation. Memory size estimates for both fixed and variable block algorithms follow the same formula, only varying in how the specific variables are derived. We provide the following formula for the basic memory requirements estimations:

Memory Size Estimates =

(Data Size / Chunk Size) * (1 – De-duplication %) * (Signature Bytes)

To estimate the index memory size for a given dataset, the following variables have to be determined or estimated. Data size is the total size of the dataset targeted for de-duplication. Chunk size is, for a fixed chunking algorithm, the size of the chunk used in the de-duplication implementation. De-duplication percentage is based on the percentage seen during a sample run on a subset of the dataset; in our testing, a sample of 10 to 15% of the dataset provided a good estimate for the various data types we tested. The de-duplication percentage estimates are on par with similar measurement results in other studies for the given data types [4] [7]. Signature bytes refers to the number of bytes in a chunk’s signature hash. Most implementations use a 20 byte SHA-1 hash signature for each chunk because of the collision resistant properties that SHA-1 provides [6].
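As a sketch of how the formula can be applied to a fixed block algorithm, the snippet below evaluates it for the 111.35GB office dataset with 16KB fixed chunks, a 5% de-duplication estimate, and 20 byte signatures, the inputs used for the FIXED16 row in the validation tables. The function name `index_memory_bytes` and the binary unit constants are assumptions made for this illustration.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def index_memory_bytes(data_size_bytes, chunk_size_bytes,
                       dedup_fraction, signature_bytes=20):
    """(Data Size / Chunk Size) * (1 - De-duplication %) * Signature Bytes."""
    total_chunks = data_size_bytes / chunk_size_bytes
    unique_chunks = total_chunks * (1 - dedup_fraction)
    return unique_chunks * signature_bytes


# 111.35GB dataset, 16KB fixed chunks, 5% de-duplication, SHA-1 signatures.
estimate_mib = index_memory_bytes(111.35 * GIB, 16384, 0.05) / MIB
```

Halving the chunk size doubles the number of chunks and therefore roughly doubles the index estimate, which is why finer granularity pushes a workload toward instance types with more memory.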

Variable block index memory estimates are more complex since the chunk sizes are not static but follow a distribution between the minimum and maximum chunk sizes set at execution time. Figure 8 shows the distribution based on the testing performed using the fs-c algorithm [7] on multiple datasets. The fs-c algorithm uses the TTTD approach to variable block de-duplication. The CDC32 (content defined chunking) algorithm has an expected (average) block size of 32KB, a lower threshold (Tmin) set at 8KB, and an upper threshold (Tmax) set at 128KB. The threshold proportions remain consistent with the CDC16, CDC8, and CDC4 algorithms.

The following table outlines the different fixed and variable algorithms used in the fs-c algorithm [7] tests.

Chunker   Type       Average Chunk Size (bytes)   Minimum Chunk Size (bytes)   Maximum Chunk Size (bytes)
Fixed8    Fixed      8192                         -                            -
Fixed16   Fixed      16384                        -                            -
Fixed32   Fixed      32768                        -                            -
CDC4      Variable   4096                         1024                         16384
CDC8      Variable   8192                         2048                         32768
CDC16     Variable   16384                        4096                         65536
CDC32     Variable   32768                        8192                         131072

Table 1: fs-c Algorithm Chunk Selection

[Figure 8, first panel: percentage of chunks (0% to 25%) by block size (Min, Average, Max) for the Office 1 dataset under CDC32, CDC16, CDC8, and CDC4]

Figure 8: FS-C Chunk Distributions

Based on the Figure 8 distributions, the data chunks between the minimum and average block size make up 50% to 55% of the total unique chunks, which in terms of the total data size is 20-25%. We can derive the total number of chunks based on these calculations. In the worst case scenario (largest number of chunks), we assume that 25% of the data is chunked at the minimum block size and the remaining 75% of the data is chunked just above the average chunk size. In the best case scenario (smallest number of chunks), 25% of the data is chunked at the average chunk size and the remaining 75% at the max chunk size.

Worst Case Total Chunks =

((.25 * DataSize) / Min Block Size) + ((.75 * DataSize) / Average Block Size)

Best Case Total Chunks =

((.25 * DataSize) / Average Block Size) + ((.75 * DataSize) / Max Block Size)

[Figure 8, second panel: percentage of chunks (0% to 30%) by block size (Min, Average, Max) for the Office 3 dataset under CDC32, CDC16, CDC8, and CDC4]

As an example, for a dataset size of 100GB (107374182400 bytes), chunking at a variable block size of 16KB (4KB lower threshold, 64KB upper threshold), with an estimated de-duplication percentage of 25% and a signature size of 20 bytes, the memory requirements range is:

Worst Case Total Chunks =

((.25 * 107374182400) / 4096 ) + ((.75 * 107374182400) / 16384) = 11468800 Chunks

Best Case Total Chunks =

((.25 * 107374182400) / 16384 ) + ((.75 * 107374182400) / 65536) = 2867200 Chunks

From the worst and best case chunk estimates, we can now utilize our memory estimation formula presented earlier to estimate the minimum and maximum memory requirements for the index when running the CDC16 algorithm against the 100GB dataset.

Minimum Memory Requirements =

2867200 * (1 - .25) * (20) = 43008000 bytes ~ 42MB

Maximum Memory Requirements =

11468800 * (1 - .25) * (20) = 172032000 bytes ~ 165MB

Therefore, the memory requirements for our 100GB dataset are in the range of 42MB to 165MB.
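The worked example can be reproduced with a short sketch. The 25%/75% split comes from the chunk distribution assumption stated above; the function names `chunk_count_range` and `index_memory_range` are hypothetical.

```python
GIB = 1024 ** 3


def chunk_count_range(data_size, min_block, avg_block, max_block,
                      small_share=0.25):
    """Best and worst case chunk counts for a variable block algorithm."""
    large_share = 1 - small_share
    worst = small_share * data_size / min_block + large_share * data_size / avg_block
    best = small_share * data_size / avg_block + large_share * data_size / max_block
    return best, worst


def index_memory_range(data_size, min_block, avg_block, max_block,
                       dedup_fraction, signature_bytes=20):
    """Minimum and maximum index memory in bytes."""
    best, worst = chunk_count_range(data_size, min_block, avg_block, max_block)
    unique_share = 1 - dedup_fraction
    return (best * unique_share * signature_bytes,
            worst * unique_share * signature_bytes)


# CDC16 on a 100GB dataset: 4KB minimum, 16KB average, 64KB maximum,
# 25% estimated de-duplication, 20 byte signatures.
best_chunks, worst_chunks = chunk_count_range(100 * GIB, 4096, 16384, 65536)
low_bytes, high_bytes = index_memory_range(100 * GIB, 4096, 16384, 65536,
                                           dedup_fraction=0.25)
```

The byte counts returned here match the 43008000 and 172032000 byte figures in the example; the spread between them comes from the chunk size distribution assumption, not from the data size or the signature length.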


Validation of Method

We performed experiments with small (150GB or less) and large (500GB or more) datasets with both the fixed and variable algorithms to test how well the memory estimation formula applies to real world scenarios.

Fixed Block Index Memory Requirements

Dataset Size (GB)   Alg       Estimated Minimum Memory (MB)   Actual Memory (MB)   % Error
111.35              FIXED16   132.228125                      139.4957161          5%
111.35              FIXED8    267.5920949                     278.7025261          4%
924.41              FIXED16   1097.736875                     1034.24              6%
924.41              FIXED8    2221.507036                     2058.24              8%

Table 2: Fixed Index Memory Estimates vs. Actual
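The error column in Table 2 can be checked with a few lines. This sketch assumes the error is the absolute difference between estimate and actual, taken relative to the actual memory and rounded to a whole percent; the name `percent_error` is hypothetical.

```python
def percent_error(estimated_mb: float, actual_mb: float) -> float:
    """Absolute estimation error as a percentage of the actual value."""
    return abs(estimated_mb - actual_mb) / actual_mb * 100


# (algorithm, dataset size in GB, estimated MB, actual MB) from Table 2.
table_2 = [
    ("FIXED16", 111.35, 132.228125, 139.4957161),
    ("FIXED8", 111.35, 267.5920949, 278.7025261),
    ("FIXED16", 924.41, 1097.736875, 1034.24),
    ("FIXED8", 924.41, 2221.507036, 2058.24),
]
errors = [round(percent_error(est, act)) for _, _, est, act in table_2]
```

Under that assumption the rounded errors agree with the table, with the estimate falling below the actual memory on the small dataset and above it on the large one.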

Using the fs-c [7] fixed chunking algorithms, we tested an office type dataset extracted from a corporate office file share environment. To obtain our memory estimates we assumed the de-duplication percentage to be at or around the 5% mark for the small and large datasets. This percentage was obtained from a sample run of the fixed algorithm on a dataset a fraction of the size. Additionally, the SHA-1 [6] data signature size of 20 bytes was selected at execution time. Based on our assumptions of the de-duplication percentage and the parameters selected at run time (signature size, average chunk size), the memory estimates calculated from the formula presented previously were within 8% of the actual memory requirements. The estimate error depends on the difference between the assumed and actual de-duplication percentage, and can be improved only by using a larger sample size in the de-duplication percentage estimate [30].


The variable block chunking experiments used the same dataset as the fixed block experiments and assumed the chunk distribution discussed previously to obtain the estimate range for the index memory. The de-duplication percentage estimates used for the CDC16 and CDC8 algorithms were 15% and 20% respectively. These estimates were obtained from local sample runs on the dataset. The SHA-1 data signature size was again set to 20 bytes at execution [6]. The minimum and maximum block thresholds set by the fs-c [7] algorithms were 4KB and 64KB respectively for the CDC16 algorithm and 2KB and 32KB respectively for the CDC8 algorithm.

Variable Block Index Memory Requirements

Dataset Size (GB)   Alg     Minimum Memory (MB)   Maximum Memory (MB)   Actual Memory (MB)
111.35              CDC16   51.76035156           118.309375            115.6596375
111.35              CDC8    98.09142787           222.7                 224.4574738
924.41              CDC16   429.7062109           982.185625            741.23
924.41              CDC8    814.3394417           1848.82               1433.417511

Table 3: Variable Index Memory Estimates vs. Actual

Based on the assumptions, the chunk distributions, and the parameters set at execution time, the actual memory requirements for the variable block executions on the small and large datasets fell within the estimated range for the index memory, trending toward the higher end of the range for both datasets. The one exception is CDC8 on the small dataset, where the actual requirement (224.46 MB) exceeded the estimated maximum (222.7 MB) by less than 1%. For the variable algorithm, the index memory estimates are based not only on the de-duplication


Resource considerations for cloud instance type selection, centered on the required index memory, have been examined in relation to the chunking algorithm selected for duplicate detection. A methodology for estimating memory requirements was presented and tested against real-world datasets. In our tests on the corporate file share datasets, the index memory estimates for both fixed and variable block algorithms were close enough to the actual requirements to size the compute instance needed for sub-file de-duplication. We can now proceed with our experimental evaluation of the cost tradeoffs between compute and storage when introducing de-duplication algorithms in a cloud environment.


Chapter 4: Experimental Evaluation

In our experimental evaluation of de-duplication in a cloud-based environment we consider three factors: the dataset size, the cloud compute instance requirements, and the length of time the data is retained in the cloud. Together these determine the potential cost avoidance from performing fixed or variable de-duplication on a given dataset.

We performed our analysis on the Amazon Web Services offerings, using Elastic Compute Cloud (EC2) for the compute platform and Simple Storage Service (S3) for the storage infrastructure. The standard small and large instance types, along with the High-CPU Medium instance type, were used in our testing. Below is a recap of the resource specifications:

o Small Instance (Default): 1.7 GB of memory, 1 EC2 Compute Unit (1 virtual core with 1 EC2 Compute Unit), 160 GB of local instance storage, 32-bit platform [26]

o Large Instance: 7.5 GB of memory, 4 EC2 Compute Units (2 virtual cores with 2 EC2 Compute Units each), 850 GB of local instance storage, 64-bit platform [26]

o High-CPU Medium Instance: 1.7 GB of memory, 5 EC2 Compute Units (2 virtual cores with 2.5 EC2 Compute Units each), 350 GB of local instance storage, 32-bit platform [26]

Amazon defines one EC2 Compute Unit (ECU) as providing the equivalent CPU capacity of a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor [26]. Additionally, Amazon’s Linux AMI operating system was selected for the instance builds. The cost analysis is based on Amazon’s pricing for the US East region, where our testing was performed.

We used the fs-c algorithm developed by [7], with reporting and statistics-gathering modifications, as our de-duplication engine. The fs-c algorithm has both fixed and variable chunking options. Chunk size options vary from 2KB to 32KB for both fixed and variable algorithms. The variable chunking approach is based on the two thresholds, two divisors (TTTD) algorithm [8], using Rabin fingerprinting [1] to determine the natural content boundaries. Additionally, fs-c takes an out-of-band approach to de-duplication, which allows the data to be analyzed in place.
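The boundary search behind variable chunking can be illustrated with a simplified two thresholds, two divisors scan. This is a sketch and not the fs-c implementation: it substitutes a simple polynomial rolling hash for true Rabin fingerprinting, and the window width, hash constants, divisors, and function name are assumptions chosen for illustration. Only the 2KB and 32KB thresholds correspond to the cdc8 configuration described in this work.

```python
WINDOW = 48          # sliding-window width in bytes (illustrative)
BASE = 257           # polynomial base for the rolling hash (illustrative)
MOD = (1 << 61) - 1  # large prime modulus (illustrative)


def tttd_chunks(data, t_min=2048, t_max=32768, d_main=4096, d_backup=2048):
    """Yield (start, end) byte offsets of chunks using a simplified TTTD scan."""
    top = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving the window
    start, n = 0, len(data)
    while start < n:
        end = min(start + t_max, n)   # never exceed the upper threshold
        cut, backup, h = end, None, 0
        for i in range(start, end):
            if i - start >= WINDOW:   # slide the window: drop the oldest byte
                h = (h - data[i - WINDOW] * top) % MOD
            h = (h * BASE + data[i]) % MOD
            if i - start + 1 < t_min:  # ignore boundaries below the lower threshold
                continue
            if h % d_backup == d_backup - 1:
                backup = i + 1         # remember a fallback boundary
            if h % d_main == d_main - 1:
                cut = i + 1            # main divisor matched: cut here
                break
        else:
            # No main-divisor match. At the upper threshold, prefer the
            # backup boundary over a forced cut.
            if backup is not None and end - start == t_max:
                cut = backup
        yield start, cut
        start = cut
```

Because boundaries depend on content rather than on fixed offsets, an insertion early in a file shifts only nearby chunk boundaries, which is why the variable algorithms tend to find more redundancy than the fixed ones.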

Our initial evaluation centered on small datasets extracted from a corporate file share environment that were 300GB or less in size. We grouped our datasets around the following data classifications:

o Office data types: Microsoft Word (doc, docx), Excel (xls, xlsx), PowerPoint (ppt, pptx), Adobe’s Portable Document Format (pdf), rich text documents (rtf)

o Database file types: Microsoft SQL master database files (mdf), Microsoft Access (mdb)

o Virtual machine data files: VMware virtual machine (vmdk) files

o Media files: JPEG, GIF, PNG, MP3, MP4, WAV

As a first step we performed testing on a local system on the above datasets to gauge de-duplication percentages and instance type requirements. This allowed us to determine the dataset to focus on when moving the testing to cloud resources.
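A local sample run of this kind can be approximated with a short script. The sketch below assumes the de-duplication percentage is the share of bytes that belong to chunks already seen, uses fixed-size chunks with SHA-1 signatures, and works on an in-memory byte string rather than on files as fs-c does; the function name is hypothetical.

```python
import hashlib


def fixed_dedup_percentage(data, chunk_size=8192):
    """Percentage of bytes removed by fixed-size chunk de-duplication."""
    if not data:
        return 0.0
    seen = set()        # index of chunk signatures seen so far
    unique_bytes = 0
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        digest = hashlib.sha1(chunk).digest()  # 20-byte signature
        if digest not in seen:
            seen.add(digest)
            unique_bytes += len(chunk)
    return (1 - unique_bytes / len(data)) * 100


# Four 8KB chunks, one of which repeats an earlier chunk.
sample = b"A" * 8192 + b"B" * 8192 + b"A" * 8192 + b"C" * 8192
print(fixed_dedup_percentage(sample))
```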


Below is a summary of the results from the first dataset of each type against both fixed and variable (CDC) algorithms using various block sizes. Algorithm names consist of the chunk type followed by a number that indicates the average or fixed chunk size in KB. For example, the cdc8 algorithm uses an average chunk size of 8KB with a lower threshold of 2KB and an upper threshold of 32KB. The lower and upper thresholds remain proportional to the average chunk size for the other variable (CDC) algorithms. Refer to Table 1 for algorithm specifications.
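The naming convention can be expressed as a small parser. This sketch assumes the proportionality implied by the cdc8 and cdc16 settings (lower threshold one quarter of the average, upper threshold four times the average); the function and key names are illustrative.

```python
import re


def parse_algorithm(name):
    """Split a name such as 'cdc8' or 'fixed16' into its chunking parameters (KB)."""
    match = re.fullmatch(r"(cdc|fixed)(\d+)", name.lower())
    if match is None:
        raise ValueError(f"unrecognized algorithm name: {name!r}")
    kind, size_kb = match.group(1), int(match.group(2))
    if kind == "fixed":
        return {"type": "fixed", "chunk_kb": size_kb}
    return {
        "type": "cdc",
        "avg_kb": size_kb,
        "min_kb": size_kb / 4,  # assumed ratio, from the cdc8 (2KB) and cdc16 (4KB) settings
        "max_kb": size_kb * 4,  # assumed ratio, from the cdc8 (32KB) and cdc16 (64KB) settings
    }


print(parse_algorithm("cdc8"))
print(parse_algorithm("fixed16"))
```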

The local system specifications used in our initial testing were as follows:

Hardware Brand: HP DL580 G5

CPU: 2 Dual Core Intel(R) Xeon(R) CPU @ 3.00GHz

Memory: 8GB

Hard Drives: 2 X 73GB 10K SAS drives RAID 1 for OS

OS: Ubuntu 11.04 (x64)

Data Storage: EMC VNXe 3100

The datasets were stored on an EMC VNXe 3100 and accessed via NFS. Each dataset was run in isolation from the others, eliminating any competition for resources.

Algorithm   # of Chunks   Memory Requirements (MB)   % Deduplication   Execution Time   Total Size

Office
cdc4        23582739      365.286681                 18.79%            34 min           111.35G
cdc8        11768036      186.9730756                16.70%            32 min           111.35G
cdc16       6063896       98.99308369                14.41%            32 min           111.35G
cdc32       2980966       49.93786693                12.17%            33 min           111.35G
fixed4      29205380      540.1141443                3.04%             32 min           111.35G
fixed8      14612039      271.2054281                2.69%             32 min           111.35G
fixed16     7313593       136.0362223                2.48%             33 min           111.35G
fixed32     3662809       68.28364404                2.26%             32 min           111.35G

VMDK
cdc4        32066793      116.5758275                80.94%            57 min           274.08G
cdc8        15870107      61.29639945                79.75%            64 min           274.08G
cdc16       7952461       32.62661669                78.49%            63 min           274.08G
cdc32       4007976       17.36854834                77.28%            64 min           274.08G
fixed8      35923993      181.7139233                73.48%            64 min           274.08G
fixed16     17962030      98.59985798                71.22%            63 min           274.08G
fixed32     8981049       51.73257442                69.80%            64 min           274.08G

DB
cdc4        21235250      187.2049818                53.78%            45 min           159.77G
cdc8        11162546      98.42767746                53.77%            44 min           159.77G
cdc16       6015166       53.05123822                53.76%            45 min           159.77G
cdc32       3363838       29.66763861                53.76%            44 min           159.77G
fixed8      20940832      184.6494033                53.77%            46 min           159.77G
fixed16     10470416      92.32470163                53.77%            47 min           159.77G
fixed32     5235208       46.16235081                53.77%            46 min           159.77G

Media
cdc4        27980371      492.8564571                7.65%             30 min           129.20G
cdc8        13944798      245.8149397                7.58%             35 min           129.20G
cdc16       6990914       123.3405199                7.50%             31 min           129.20G
cdc32       3525336       62.29155103                7.36%             34 min           129.20G
fixed8      16978508      302.7250152                6.52%             35 min           129.20G
fixed16     8508362       151.7843433                6.47%             35 min           129.20G
fixed32     4279530       76.38519619                6.42%             35 min           129.20G

Table 4: Small Dataset Results

As expected, the variable algorithms found more redundancy within each dataset type overall. The office dataset had the largest change in de-duplication percentage between fixed and variable block algorithms. Surprisingly, the execution time did not vary with the chunking granularity or between the fixed and variable block algorithms. We examined this more closely and discovered that the bottleneck was not the CPU processing the fixed or variable block chunks but the disk I/O incurred when processing the data out-of-band. We recorded high I/O wait times during each execution, which caused the CPU to wait for the I/O to finish. This explains the consistent execution time regardless of the algorithm. Additionally, the VMDK de-duplication percentage is the highest, owing to the data redundancy inherent across similar operating system builds. The DB percentage remains essentially the same for all the tests performed because the SQL database files were extracted from a system that had the allocation unit size set to 64K; no additional duplicates are discovered by reducing the chunk size below 64K. Finally, as expected based on our research, the more random data types, such as media formats, produced the lowest de-duplication percentages.
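The text does not state how the I/O wait times were recorded. One way to observe the same effect on a Linux system is to compare two snapshots of the aggregate `cpu` line in `/proc/stat`, where the fifth counter is time spent waiting on I/O. The sketch below takes the snapshot text as arguments so it can be run anywhere; the function names are illustrative.

```python
def cpu_counters(stat_text):
    """Return the counters of the aggregate 'cpu' line of a /proc/stat snapshot."""
    for line in stat_text.splitlines():
        if line.startswith("cpu "):
            return [int(value) for value in line.split()[1:]]
    raise ValueError("no aggregate cpu line found")


def iowait_fraction(stat_before, stat_after):
    """Fraction of CPU time spent in iowait between two /proc/stat snapshots."""
    before, after = cpu_counters(stat_before), cpu_counters(stat_after)
    # Use the first eight fields (user, nice, system, idle, iowait, irq,
    # softirq, steal); guest time is already included in user time.
    deltas = [a - b for a, b in zip(after[:8], before[:8])]
    total = sum(deltas)
    return deltas[4] / total if total else 0.0


before = "cpu  100 0 50 800 50 0 0 0 0 0\n"
after = "cpu  200 0 100 1000 700 0 0 0 0 0\n"
print(iowait_fraction(before, after))
```

On a live system the two snapshots would come from reading `/proc/stat` at the start and end of a run; a high fraction indicates the CPU was mostly idle waiting for disk or network storage.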

The small dataset memory requirements are within the resources available on the small and medium cloud compute instance types. Additionally, in our local testing the office dataset provides the most interesting analysis given its range of de-duplication percentages; moving forward we therefore focus solely on office-type datasets. To check the consistency of the results, we also collected another office dataset of roughly the same size for our remaining small dataset testing.

After completing the initial testing on our local system, our remaining testing uses Amazon’s cloud resources. With the small dataset our testing focuses on the small and medium instance types, which differ in the number of available ECUs [26]. The dataset was transferred to Amazon S3 storage in its original form to perform the out-of-band de-duplication testing. Our motivation for the small dataset test using cloud resources is to gauge the execution time difference between the small and medium instance types and to analyze any cost savings. Again, all tests were run in isolation on separate instances, and only a single test accessed the S3 storage bucket [26] at one time.

Algorithm   # of Chunks   Memory Required (MB)   % Deduplication   Execution Time   Total Size

Office1
cdc4        23582739      365.286681             18.79%            152 min          111.35G
cdc8        11768036      186.9730756            16.70%            137 min          111.35G
cdc16       6063896       98.99308369            14.41%            147 min          111.35G
cdc32       2980966       49.93786693            12.17%            147 min          111.35G
fixed4      29205380      540.1141443            3.04%             140 min          111.35G
fixed8      14612039      271.2054281            2.69%             138 min          111.35G
fixed16     7313593       136.0362223            2.48%             145 min          111.35G
fixed32     3662809       68.28364404            2.26%             150 min          111.35G

Office2
cdc4        24453549      305.4548119            34.51%            203 min          113.61G
cdc8        12056195      155.8396025            32.23%            206 min          113.61G
cdc16       6073155       81.40970867            29.72%            205 min          113.61G
cdc32       3072219       42.46005797            27.54%            212 min          113.61G
fixed4      29825063      471.1933076            17.17%            205 min          113.61G
fixed8      14934682      237.6557387            16.57%            210 min          113.61G
fixed16     7489809       119.7711156            16.16%            206 min          113.61G
fixed32     3767320       60.54580368            15.74%            214 min          113.61G

Table 5: EC2 m1.small Instance Small Dataset Results

Algorithm   # of Chunks   Memory Required (MB)   % Deduplication   Execution Time   Total Size

Office1
cdc4        23582739      365.286681             18.79%            50 min           111.35G
cdc8        11768036      186.9730756            16.70%            50 min           111.35G
cdc16       6063896       98.99308369            14.41%            49 min           111.35G
cdc32       2980966       49.93786693            12.17%            46 min           111.35G
fixed4      29205380      540.1141443            3.04%             50 min           111.35G
fixed8      14612039      271.2054281            2.69%             40 min           111.35G
fixed16     7313593       136.0362223            2.48%             46 min           111.35G
fixed32     3662809       68.28364404            2.26%             46 min           111.35G

Office2
cdc4        24453549      305.4548119            34.51%            67 min           113.61G
cdc8        12056195      155.8396025            32.23%            68 min           113.61G
cdc16       6073155       81.40970867            29.72%            69 min           113.61G
cdc32       3072219       42.46005797            27.54%            69 min           113.61G
fixed4      29825063      471.1933076            17.17%            69 min           113.61G
fixed8      14934682      237.6557387            16.57%            67 min           113.61G
fixed16     7489809       119.7711156            16.16%            70 min           113.61G
fixed32     3767320       60.54580368            15.74%            70 min           113.61G

Table 6: EC2 c1.medium Instance Small Dataset Results

Based on the results of the cloud testing on the small office datasets, the execution time difference between the small and medium instances is in line with their cost difference under Amazon’s EC2 pricing at the time of this publication. Also, since the memory resources are the same on the small and medium instance types, a more aggressive algorithm cannot be used as a differentiator in terms of space and cost savings. Therefore there is little to no cost savings when comparing the execution times and the related compute costs of the small and medium instances on a small dataset. One interesting aspect of this testing is the relative consistency in the percentage of additional redundancy detected by the variable block algorithms over the fixed block algorithms for both office datasets.


Moving to datasets of 500GB and larger, we again focus our attention on an office-type dataset extracted from a corporate file share environment. The goal of the large dataset is to examine more aggressive algorithms whose global chunk index exhausts the memory resources available in the small and medium instance types. This allows us to explore the cost model and tradeoffs associated with choosing a more aggressive algorithm and a larger instance type versus a less aggressive algorithm and a smaller instance type over varying storage durations.

Using a dataset size of 764GB on the small and medium instance types, fixed16 and cdc16 were the most aggressive algorithms that could run within the 1.7GB memory constraint of these instances once memory for the operating system and the execution of the de-duplication algorithm was allocated. The execution times within a particular instance type are again controlled by the large I/O wait time experienced while processing the data. We again see notable increases in duplicate detection with the variable algorithms over the fixed. Using CDC4, a more aggressive algorithm, on the large instance detected an additional 5 percentage points of redundancy over the CDC16 algorithm on the smaller instances. This translates into approximately 41GB of additional redundant data eliminated. The execution times on the large instance with the more aggressive algorithms are, with the exception of CDC8, slightly longer than those on the medium instance.

Chunker     # of Chunks   Memory Requirements (MB)   % Deduplication   Execution Time (min)   Total Size

m1.Small
fixed16     43828883      835.9696007                12.66%            1552.283333            764.04G
cdc16       30729169      586.1123848                22.00%            1515.566667            764.04G

c1.Medium
fixed16     43828883      835.9696007                12.66%            875.7333333            764.04G
cdc16       30729169      586.1123848                22.00%            896.2833333            764.04G

m1.Large
fixed16     43828883      835.9696007                12.66%            916.5666667            764.04G
fixed8      87026664      1659.901886                13.10%            964.7333333            764.04G
cdc16       30729169      586.1123848                22.00%            940.5333333            764.04G
cdc8        58972915      1124.819088                25.05%            853.1333333            764.04G
cdc4        116120895     2214.830303                27.37%            959.8833333            764.04G

Table 7: EC2 Instance Large Dataset Results
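The instance choice implied by Table 7 can be sketched as a memory fit check. The instance memory sizes come from the specifications listed earlier; the amount reserved for the operating system and the de-duplication process is not stated in this work, so `RESERVED_MB` below is a hypothetical value chosen only because it is consistent with which algorithms ran on which instance.

```python
# (instance name, memory in MB) from the instance specifications
INSTANCES = [
    ("m1.small", 1.7 * 1024),
    ("c1.medium", 1.7 * 1024),
    ("m1.large", 7.5 * 1024),
]

RESERVED_MB = 768  # hypothetical OS and process overhead, not a measured value


def instances_that_fit(index_mb, reserved_mb=RESERVED_MB):
    """Names of the instance types whose memory can hold the chunk index."""
    return [name for name, memory_mb in INSTANCES
            if index_mb + reserved_mb <= memory_mb]


# Index memory requirements taken from Table 7
for algorithm, index_mb in [("fixed16", 835.9696007), ("cdc16", 586.1123848),
                            ("cdc8", 1124.819088), ("cdc4", 2214.830303)]:
    print(algorithm, instances_that_fit(index_mb))
```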

Using these results we can now construct and analyze a cost model for the tradeoff between selecting a smaller instance type with a less aggressive algorithm and selecting a larger instance type with a more aggressive algorithm. We also looked at the cost model when storing the data for lengths of time varying from one month to one year, and at the effect storage duration has on the cost savings and on the choice of instance type.

To recap the factors of our cost analysis: instance type (m1_small, c1_medium, m1_large), de-duplication algorithm (fixed16, fixed8, cdc16, cdc8, cdc4), and storage duration (1 month, 3 months, 6 months, 1 year). Comparisons are performed using the small and medium instance types against the large instance type. As discovered with the small dataset testing, the cost savings between the small and medium instances are nonexistent or insignificant, so those two are not compared against each other.

To start, we look at the cost breakdowns of the Amazon EC2 and S3 offerings. Amazon EC2 compute costs are based on the instance hours used and on data transfer into and out of the EC2 environment. Partially consumed instance hours are billed as full hours, so all execution times are rounded up to the next hour for cost comparison. The cost of data transfer into the EC2 environment is excluded from our analysis since it does not change with the instance type selected. Amazon’s S3 cost model is based on the following factors: standard storage pricing, which is the price for the amount of storage used; request pricing, which is the cost for the number of PUT, COPY, POST, LIST, or GET operations performed on the S3 storage bucket; and data transfer cost, which is the cost to transfer data into and out of S3. There is no data transfer charge for data transferred between Amazon EC2 and Amazon S3 within the same region, and in our case both EC2 and S3 are within the US East region [26]. Additionally, we focus on standard storage pricing as opposed to reduced redundancy storage, which introduces a risk of data loss. With de-duplication, data protection is critical because a large percentage of files can reference a single data block. The reduced redundancy option is available as a lower-cost option for data that is reproducible [26]. The tables below break down the EC2 and S3 pricing for the US East region at the time of publication.

AWS EC2 Compute Pricing

Type $ Cost/Hr

Small (m1_small) 0.08

Medium (c1_medium) 0.165

Large (m1_large) 0.32

Table 8: AWS EC2 Pricing

AWS S3 Storage Pricing

Tiers $ Cost/GB

First 1TB / Month 0.125

Next 49TB 0.11

Next 450TB 0.095

Request Cost per 1,000 0.01
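The tiered storage prices above can be applied with a short function. This is a sketch that assumes 1TB is counted as 1024GB and that each tier's price applies only to the portion of data falling inside that tier; the function name is illustrative.

```python
# (tier size in GB, $ per GB per month), from the S3 pricing table above
S3_TIERS = [
    (1 * 1024, 0.125),    # first 1TB
    (49 * 1024, 0.11),    # next 49TB
    (450 * 1024, 0.095),  # next 450TB
]


def s3_monthly_storage_cost(stored_gb):
    """Monthly storage cost in dollars for the given amount of stored data."""
    remaining = stored_gb
    cost = 0.0
    for tier_gb, price_per_gb in S3_TIERS:
        used = min(remaining, tier_gb)  # portion of the data billed in this tier
        cost += used * price_per_gb
        remaining -= used
        if remaining <= 0:
            break
    if remaining > 0:
        raise ValueError("stored size exceeds the tiers listed in the table")
    return cost


print(s3_monthly_storage_cost(596))   # all of the data falls in the first tier
print(s3_monthly_storage_cost(2048))  # spans the first two tiers
```

The datasets in this work stay below 1TB, so only the first tier price of $0.125 per GB applies to them.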


Based on the execution times seen on the three instance types, we begin by breaking out the compute and storage costs associated with the small instance type running the CDC16 algorithm. For the compute cost we look at the execution time, which is 1515.566667 minutes and translates to 26 hours after rounding up to the next full hour. The compute cost is a straight calculation of the 26 hours multiplied by the small instance cost of $.08 per hour, which equals $2.08. The storage cost has two components. The first is the storage price of $.125 per GB for the first TB stored. After running the CDC16 algorithm, 22% of the data was removed as redundant, leaving approximately 596GB, which has an associated cost of $74.50 per month. The second component is the request pricing, which is based on PUT, COPY, POST, LIST, and GET requests. PUT, COPY, POST, and LIST requests are priced at $.01 per 1,000 requests, while GET and other requests are $.01 per 10,000 requests [26]. To calculate the number of requests we need the number of files that make up our dataset, which translates into the number of object PUT requests. The 764.04GB dataset comprises 450,990 files and directories, which translates to an estimated 450,990 PUT operations with an associated cost of $4.51.
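The arithmetic in this worked example can be reproduced as follows. The sketch uses the CDC16 execution time on the small instance from Table 7 and the prices quoted in the text; the function names are illustrative.

```python
import math


def compute_cost(execution_minutes, hourly_rate):
    """EC2 compute cost, billing partially consumed hours as full hours."""
    billed_hours = math.ceil(execution_minutes / 60)
    return billed_hours * hourly_rate


def request_cost(put_requests, price_per_1000=0.01):
    """S3 request cost for the PUT operations needed to store the dataset."""
    return put_requests / 1000 * price_per_1000


compute = compute_cost(1515.566667, 0.08)  # CDC16 on m1.small (Table 7): 26 billed hours
storage = 596 * 0.125                      # about 596GB left after 22% de-duplication
requests = request_cost(450_990)           # one PUT per file or directory

print(round(compute, 2), round(storage, 2), round(requests, 2))
print(round(storage + requests, 2))        # first-month storage cost
```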

Using these figures we are able to calculate the cost for a one month, three month, six month, and one year storage period. The calculations are the same for the medium and large instance types with the exception of the values for execution time and de-duplication percentage. The request cost remains the same, as the dataset and number of files remain consistent across instance tests.


Algorithm / Instance Storage TimeFrame

CDC16 on Small Instance 1 Month 3 Month 6 Month 1 Year

Compute Cost $2.08 $2.08 $2.08 $2.08

Storage Cost $79.01 $237.03 $474.06 $948.12

CDC16 on Medium Instance 1 Month 3 Month 6 Month 1 Year

Compute Cost $2.48 $2.48 $2.48 $2.48

Storage Cost $79.01 $237.03 $474.06 $948.12

CDC8 on Large Instance 1 Month 3 Month 6 Month 1 Year

Compute Cost $4.80 $4.80 $4.80 $4.80

Storage Cost $76.01 $228.03 $456.06 $912.12

CDC4 on Large Instance 1 Month 3 Month 6 Month 1 Year

Compute Cost $5.12 $5.12 $5.12 $5.12

Storage Cost $73.76 $221.28 $442.56 $885.12

Table 10: Instance Cost Assessment

With the varying storage durations the following assumptions were made. Compute cost: the compute cost was only calculated for the initial data de-duplication process; any subsequent data accesses are not taken into account. The data access frequency is independent of the instance type. Another aspect not taken into account is data being added to or deleted from cloud storage over the storage duration. Based on our research, data additions have a positive impact on the cost savings seen with the more aggressive algorithms.

Looking at the cost savings of the large instance over the small instance, using the large instance type with the aggressive CDC4 algorithm over a one-year storage time frame produces a cost savings of 6.15%, or $58.37, compared with running CDC16 on the smaller instance type. The shorter storage time frames also produce a cost savings, ranging from 2.57% for the first month to 5.82% at the six-month mark. With CDC8 on the large instance a cost savings is not realized immediately, with the first-month savings at less than 1%. Therefore, running the CDC4 algorithm on the larger instance maximizes the cost savings in the shortest amount of time.

Small vs Large               1 Month   3 Month   6 Month   1 Year

$ Savings CDC8 vs CDC16      0.18375   5.99125   14.7025   32.125

% Savings CDC8 vs CDC16      0.23%     2.51%     3.09%     3.39%

% Savings CDC4 vs CDC16      2.57%     5.16%     5.82%     6.15%

Table 11: m1_small vs. m1_large Instance Cost
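The comparison can also be expressed as a break-even calculation: the number of months of storage needed before the lower monthly storage cost repays the higher one-time compute cost of the large instance. The sketch below uses the rounded compute and monthly storage costs from Table 10; the function names are illustrative.

```python
# (compute cost, storage cost per month) in dollars, copied from Table 10
CDC16_SMALL = (2.08, 79.01)
CDC8_LARGE = (4.80, 76.01)
CDC4_LARGE = (5.12, 73.76)


def total_cost(option, months):
    compute, monthly_storage = option
    return compute + monthly_storage * months


def savings(baseline, candidate, months):
    """Return (dollar savings, percent savings) of candidate relative to baseline."""
    base = total_cost(baseline, months)
    diff = base - total_cost(candidate, months)
    return diff, diff / base * 100


def break_even_months(baseline, candidate):
    """Months of storage before the candidate's extra compute cost is repaid."""
    extra_compute = candidate[0] - baseline[0]
    monthly_saving = baseline[1] - candidate[1]
    if monthly_saving <= 0:
        return float("inf")  # the candidate never catches up
    return extra_compute / monthly_saving


for name, option in [("CDC8", CDC8_LARGE), ("CDC4", CDC4_LARGE)]:
    print(name, break_even_months(CDC16_SMALL, option), savings(CDC16_SMALL, option, 12))
```

With these inputs both large-instance options break even within the first month, which agrees with the positive first-month savings in Table 11. The dollar and percentage figures computed this way differ somewhat from Table 11, presumably because that table was derived from unrounded values.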

The cost analysis between the medium and large instances is similar to that of the small and large. The medium instance, having the same amount of memory as the small instance type, is again limited to the CDC16 algorithm. Therefore, the amount of duplicates detected is the same as on the small instance but with an improvement in execution time. The execution time on the medium instance is 40% less than that on the small instance, but because the per-instance-hour cost of the medium instance is roughly double that of the small instance ($0.165 versus $0.08), the medium instance only generates a higher compute cost for the execution. For example, in the first month the small instance has a total cost of $81.09 compared with $81.49 for the same execution on the medium instance type. Because of the increased compute cost of the medium instance, the savings of the large instance relative to the medium instance are correspondingly higher. For example, the first-month savings are 2.57% for small to large compared with 3.04% for medium to large.


Medium vs Large              1 Month   3 Month   6 Month   1 Year

% Savings CDC8 vs CDC16      0.71%     2.67%     3.17%     3.43%

% Savings CDC4 vs CDC16      3.04%     5.32%     5.90%     6.19%

Table 12: c1_medium vs. m1_large Instance Cost

Dataset size is an important factor in the potential cost savings when selecting an appropriate instance type. Initially our testing focused on small datasets of 100GB to 300GB for various data types. This testing examined the memory requirements for the most aggressive variable algorithm, CDC4. It was determined that the small and medium instance memory resources were sufficient to handle the CDC4 algorithm. The execution time improvement on the medium instance therefore produced little to no cost avoidance for choosing the medium instance over the small instance.

Looking at a larger dataset of 764.04GB, our results support the claim that cost savings can be realized by selecting a larger instance type and running a more aggressive duplicate detection algorithm. This assumes that additional redundancy continues to be detected as the algorithm chunks a given dataset more finely. The redundancy factor is data dependent, so sample tests need to be performed locally to gauge the expected percentage of savings and to select the appropriate instance type using the memory estimation methodology presented. Also, simply selecting a larger instance is not always the most cost-effective method, as shown by the small dataset testing.


Chapter 5: Related Research

Cloud cost analysis has been performed for various cloud offerings, including data caching models [31], where, depending on the average unit-data size, total cache size, and cache requests per month, different cloud offerings provide the best cost per unit of speedup. In [34] the authors looked at the cost efficiency of using the Amazon S3 offering for large-scale science projects, concluding that S3 is a cost-effective option only for large dataset sizes. Additional studies looked at the total cost of ownership and the utilization cost of cloud resources from a solution provider’s point of view, comparing against the cost of the traditional infrastructure model to evaluate the economic efficiency and cost optimization of the cloud [32]. Other studies continue the theme of total cost of ownership but from the consumer’s point of view, specifically in the education system [33], examining the capital and operating costs of acquiring, operating, and maintaining the infrastructure versus those of a pure cloud model. That research showed support for significant cost savings using cloud resources over the traditional model.


Chapter 6: Conclusion

In this research we provided a review of de-duplication technologies. We examined the various approaches in terms of the timing, placement, and granularity of the duplicate detection. In addition, we explored de-duplication implementations and presented an index memory estimation formula to assist in sizing the cloud compute instance type when integrating a fixed or variable block de-duplication algorithm into a cloud environment. A cloud resource cost model was also developed by examining the cost saving potential based on the instance type, de-duplication chunk granularity, and data storage period.

Through experiments we have shown support for more aggressive de-duplication algorithms to maximize cost savings on a larger cloud compute instance versus that of a smaller instance type and less aggressive de-duplication algorithm. In some cases the dataset size does not warrant a larger instance type to run more granular de-duplication algorithms since the index memory requirements are satisfied by the entry level instance types. In these situations there is no benefit to choosing a larger instance type in an effort to reduce cloud resource cost.

Our cost model focused on the Amazon cloud offering, which continues to evolve, with prices continuing to improve across both the EC2 and S3 offerings. Based on our analysis, price changes appear to happen in the same proportion across all instance types and even storage offerings. Therefore, from a cost savings percentage point of view, our analysis will remain valid in the midst of the price changes.


Additionally, our analysis focused on a large dataset of less than 1TB in size, while much larger datasets exist in real-world applications. One of the factors related to the overall de-duplication ratio is the dataset size, and with the increased scope for duplicate detection the ratio is likely to increase [19]. An increase in the de-duplication ratio would have a direct impact on the overall cost savings. We therefore expect that as larger datasets come into scope for de-duplication in the cloud, the cost savings will be higher, even as the instance type requirements increase beyond what was presented in our testing.


References:

[1] M. O. Rabin - Fingerprinting by random polynomials, Center for Research in Computing Technology, Harvard University, Tech Report TRCSE-03-01, 2006, 1981

[2] L. Aronovich, R. Asher, E. Bachmat, H. Bitner, M. Hirsch, S. Klein - The Design of a Similarity Based De-duplication System, 2009

[3] G. Dal Bianco, R. Galante, C. Heuser - A fast approach for parallel de-duplication on multicore processors, 2011

[4] N. Mandagere, P. Zhou, M. Smith, S. Uttamchandani - Demystifying Data De-duplication, 2008

[5] X. Li, M. Lillibridge, M. Uysal - Reliability Analysis of Deduplicated and Erasure-Coded Storage

[6] National Institute of Standards and Technology - FIPS Publication 180-1: Secure Hash Standard, 1995

[7] D. Meister, A. Brinkmann - Multi-Level Comparison of Data Deduplication in a Backup Scenario, 2009

[8] K. Eshghi, H. Tang - A Framework for Analyzing and Improving Content Based Chunking Algorithms, 2005

[9] T. Moh, B. Chang - A Running Time Improvement for the Two Thresholds Two Divisors Algorithm, 2010

[10] Data Domain - Data Invulnerability Architecture

[12] B. Romanski, L. Heldt, W. Kilian, K. Lichota, C. Dubnicki - Anchor Driven Subchunk Deduplication, 2011

[13] P. Nath, B. Urgaonkar, A. Sivasubramaniam - Evaluating the Usefulness of Content Addressable Storage for High-Performance Data Intensive Applications, 2008

[14] S. Bhattacherjee, A. Narang, V. Garg - High Throughput Data Redundancy Removal Algorithm with Scalable Performance, 2011

[15] W. Curtis Preston, Backup Central - The Rehydration Myth, 2009

[16] About Restore - Defragmentation, Rehydration and Deduplication, 2009

[17] Wikipedia - Delta Encoding, 2012

[18] A. Muthitacharoen, B. Chen, D. Mazieres - A Low-bandwidth Network File System

[19] M. Dutch, L. Freeman - Understanding data de-duplication ratios, 2008

[20] D. Meyer, W. Bolosky - A Study of Practical Deduplication, 2012

[21] EMC Avamar - http://www.emc.com/backup-and-recovery/avamar/avamar.htm

[22] Symantec Pure Disk - http://www.symantec.com/netbackup-puredisk

[23] Data Domain - http://www.datadomain.com/

[24] IBM ProtecTIER - http://www-03.ibm.com/systems/storage/tape/enterprise/virtual.html

[25] Exagrid - http://www.exagrid.com/

[26] Amazon AWS - http://aws.amazon.com/

[28] Cloud Computing - Wikipedia

[29] M. Lillibridge, K. Eshghi, D. Bhagwat, V. Deolalikar, G. Trezise, P. Camble - Sparse Indexing: Large Scale, Inline Deduplication Using Sampling and Locality

[30] C. Constantinescu, M. Lu - Quick Estimation of Data Compression and De-Duplication for Large Storage Systems

[31] D. Chiu, G. Agrawal - Evaluating Caching and Storage Options on the Amazon Web Services Cloud

[32] X. Li, Y. Li, T. Liu, J. Qiu, F. Wang - The Method and Tool of Cost Analysis for Cloud Computing

[33] D. Kondo, B. Javadi, P. Malecot, F. Cappello, D. Anderson - Cost-Benefit Analysis of Cloud Computing versus Desktop Grids

[34] M. Palankar, M. Ripeanu, S. Garfinkel - Amazon S3 for Science Grids: a Viable Solution?, 2008