High Availability Replication Strategy for Deduplication Storage System


Zhengda Zhou, Jingli Zhou

College of Computer Science and Technology, Huazhong University of Science and Technology

Abstract

As the amount of digital data grows explosively, data deduplication has become an attractive technique for reducing the storage and network requirements of mass storage systems. However, deduplication achieves data compression at the cost of error resilience. This paper proposes a high-availability data replication strategy to improve data availability in deduplication storage systems. First, a data availability optimization model is proposed to calculate the optimal replication degree for each individual data object. Then, several acceleration techniques are proposed to reduce the computational cost of the model and make the scheme feasible and effective. Finally, evaluation experiments demonstrate that the proposed strategy significantly improves data availability in deduplication storage systems.

Keywords: Deduplication, Data Availability, Storage System

1. Introduction

As the volume of stored digital data expands explosively, storage systems face the challenge of storing and managing massive digital data efficiently. By 2020, the amount of digital information is projected to be 44 times as large as it was in 2009 [1]. To meet this exponentially increasing storage requirement, deduplication technology offers a solution to the storage efficiency problem of modern storage systems.

Stored data contains plenty of duplicates, namely identical files or sub-file regions, and eliminating these duplicates reduces the storage space overhead and improves storage efficiency. Deduplication is a specialized data compression technique that improves storage and network utilization. Before storing data on the physical media, the storage system divides it into non-overlapping chunks or storage objects and assigns each one a digital fingerprint computed by a cryptographic hash function, typically SHA-1 or MD5. Chunks or storage objects with identical contents therefore share the same fingerprint, so duplicates can be detected and eliminated easily.
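The chunk-and-fingerprint process described above can be illustrated with a minimal sketch. It assumes fixed-size chunking and an in-memory dictionary as the chunk store; the function names `deduplicate` and `restore` and the 4-byte chunk size are illustrative choices, not part of the system described in this paper.

```python
import hashlib


def deduplicate(data: bytes, chunk_size: int = 4):
    """Split data into fixed-size chunks and keep one copy per distinct chunk."""
    store = {}    # fingerprint -> chunk bytes (a single stored copy)
    recipe = []   # ordered fingerprints needed to rebuild the original data
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        fingerprint = hashlib.sha1(chunk).hexdigest()
        store.setdefault(fingerprint, chunk)  # duplicates are not stored again
        recipe.append(fingerprint)
    return store, recipe


def restore(store, recipe) -> bytes:
    """Rebuild the original data from the chunk store and the recipe."""
    return b"".join(store[fp] for fp in recipe)


original = b"AAAABBBBAAAACCCC"
store, recipe = deduplicate(original)
```

Here the chunk `AAAA` occurs twice in the input but is stored once, while the recipe still lists it twice so that the data can be rebuilt.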

However, a deduplication storage system achieves its storage space savings at the cost of data availability. Client data is divided into chunks during deduplication, and only one copy of each duplicate chunk is stored. Since a data chunk can be shared by many files or clients, the loss of a single chunk may cause far more client data loss than it would without deduplication. Different data chunks therefore have different influence on the availability and integrity of the total data set. Specifically, data chunks with higher commonality tend to have more influence on the total data set.

In this paper, our objective is to achieve an optimum level of data availability at minimum storage space overhead. We argue that exploiting the statistical characteristics of chunk commonality is an efficient way to improve overall data availability in a deduplication storage system. To this end, we build a data availability optimization model in the context of the deduplication storage system, and we propose several acceleration techniques to improve the feasibility and efficiency of the proposed scheme.

The remainder of this paper is organized as follows. Section 2 discusses the motivation for our research. Section 3 proposes a novel replication scheme for deduplication storage systems. Section 4 proposes acceleration techniques that improve the feasibility and efficiency of the scheme. Section 5 presents the experimental methodology and data sets and discusses the experimental results. Section 6 reviews related work. Section 7 concludes the paper.


2. Motivation

Deduplication helps a storage system increase storage utilization by eliminating data redundancy. Because duplicates are eliminated, the actual storage capacity demand is reduced, and a lower TCO (total cost of ownership) of the storage infrastructure is the immediate benefit. However, deduplication achieves high storage space efficiency at the cost of data availability.

The issue of data availability becomes more interesting once the target data sets have been deduplicated. Duplicate data chunks are detected by their fingerprints, and only one copy is stored on the physical media. The data chunks are thus reorganized so that files partially overlap. Consequently, the loss of one data chunk may make a disproportionately large amount of client data or files unavailable. In other words, the more files refer to a specific data chunk, the greater the loss to the total data set if that chunk becomes unavailable. Figure 1 illustrates the problem.

[Figure 1: File 1, File 2 and File 3 are divided into chunks; after deduplication only the unique chunks A, B, C, D, E and F are stored, with reference counts 3, 2, 1, 1, 1 and 1 respectively.]

Figure 1. Illustration of relationship between files and data chunks in deduplication storage systems

As shown in Figure 1, File 1, File 2 and File 3 are divided into chunks, labeled with the capital letters A to F. After deduplication, the unique data chunks are stored on the physical storage media. If data chunk A is lost, all three files become unavailable. If data chunk C is lost, only File 1 becomes unavailable. The data chunks are thus not equally valuable in terms of data availability. We therefore seek to improve the overall availability of the data sets stored in a deduplication storage system by exploiting the commonality of data chunks.
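The effect of losing a single chunk can be reproduced with a small sketch. The exact chunk layout of the three files is not recoverable from the text, so the layout below is a hypothetical one chosen only to match the reference counts of Figure 1 (A: 3, B: 2, C to F: 1 each); the helper names are likewise illustrative.

```python
from collections import Counter

# Hypothetical layout, chosen to match the reference counts in Figure 1.
files = {
    "File 1": ["A", "B", "C"],
    "File 2": ["A", "D", "E"],
    "File 3": ["A", "B", "F"],
}


def reference_counts(files):
    """Count how many files refer to each unique chunk."""
    counts = Counter()
    for chunks in files.values():
        counts.update(set(chunks))
    return counts


def unavailable_files(files, lost_chunk):
    """Return the files that cannot be rebuilt once lost_chunk is gone."""
    return sorted(name for name, chunks in files.items() if lost_chunk in chunks)


refs = reference_counts(files)
lost_with_a = unavailable_files(files, "A")  # every file refers to chunk A
lost_with_c = unavailable_files(files, "C")  # only File 1 refers to chunk C
```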

3. A novel data replication scheme

Since failures are inevitable, the common way to improve availability is to keep data redundant through replication. The proposed scheme aims to maximize data availability at minimal storage space overhead by calculating the optimal replication degree for every individual data object in the deduplication storage system. The scheme is suitable for both file-level and chunk-level deduplication. The basic deduplication units, files in file-level mode and chunks in chunk-level mode, are referred to as storage objects in this paper.

The proposed scheme assumes that storage node failures are independent and identically distributed. Replicas are placed randomly in a distributed storage system, and no two replicas of the same object are stored on the same storage node. All storage nodes are assumed to share the same availability distribution, so the expected availability of a storage node is a constant in the model. Moreover, failures occur on storage nodes independently and randomly, so the binomial distribution gives the probability that exactly i of the storage nodes holding a specific data object are available. The model considers only conventional replication, so a data object is available if at least one replica can be accessed. The availability function for an individual data object is given as follows:

$$A = \sum_{i=1}^{k} C_k^i \, \mu^i (1-\mu)^{k-i} = 1 - (1-\mu)^k \tag{1}$$

where μ is the availability of a storage node and k is the number of replicas. Given the replication degree k, the availability function yields the availability of an individual data object in the distributed storage system.
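As a quick numerical check of the availability function, the sketch below evaluates the binomial sum for "at least one of k replicas is available" and compares it with the closed form 1 - (1 - μ)^k that follows from the binomial theorem. The function names are illustrative, and the node availability of 0.9 is an arbitrary example value.

```python
from math import comb, isclose


def availability_sum(mu: float, k: int) -> float:
    """Probability that at least one of k replicas is on an available node."""
    return sum(comb(k, i) * mu**i * (1 - mu) ** (k - i) for i in range(1, k + 1))


def availability_closed(mu: float, k: int) -> float:
    """Same probability via the complement: not all k replicas are unavailable."""
    return 1 - (1 - mu) ** k


# With node availability 0.9, three replicas give an object availability of 0.999.
a3 = availability_sum(0.9, 3)
```

Each additional replica multiplies the unavailability by (1 - μ), which is why the benefit of extra replicas diminishes quickly.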


In a deduplication storage system, a data object is likely to be shared by a number of files or clients as a result of deduplication. According to the analysis above, data objects are not equally valuable in terms of the availability of the data set: losing a data object with higher commonality causes more data loss than losing one with lower commonality. To evaluate the value of a data object in a deduplication storage system, we define a cost function as follows:

$$P(k) = \frac{r}{N}(1 - A) \tag{2}$$

where N is the total number of files and r is the number of references; the factor (1-A) is the probability that the data object with k replicas is unavailable, and the factor r/N is a weight based on the commonality of the data object in the data set. The cost function for the whole data set is therefore:

$$P(k_1, k_2, \ldots, k_N) = \sum_{i=1}^{N} \frac{r_i}{N}(1 - A_i) \tag{3}$$

The size of each data object is given and denoted by s, so the total storage space overhead S is:

$$S = \sum_{i=1}^{N} k_i s_i \tag{4}$$

The problem is then to find the value of k for each data object that minimizes the cost function under a given tolerable storage space overhead. To this end, we employ the method of Lagrange multipliers. First, the maximum value of S is given and denoted by $S_{\max}$. Then a Lagrange multiplier λ is introduced, and the Lagrange function is constructed to combine the cost function and the constraint as follows:
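The cost function and the storage overhead are simple weighted sums, as the following sketch shows. It assumes that an object with k replicas is unavailable with probability (1 - μ)^k; the function names and the six-object example (unit sizes, reference counts 3, 2, 1, 1, 1, 1 over three files, node availability 0.9) are illustrative assumptions.

```python
from math import isclose


def dataset_cost(mu, degrees, refs, n_files):
    """Sum over objects of (r_i / N) * (1 - A_i), with 1 - A_i = (1 - mu) ** k_i."""
    return sum((r / n_files) * (1 - mu) ** k for k, r in zip(degrees, refs))


def storage_overhead(degrees, sizes):
    """Total space consumed when object i is stored k_i times."""
    return sum(k * s for k, s in zip(degrees, sizes))


refs = [3, 2, 1, 1, 1, 1]     # reference counts of six objects
sizes = [1, 1, 1, 1, 1, 1]    # unit-size objects, for illustration only
uniform = [2] * 6             # uniform replication with degree 2

cost = dataset_cost(0.9, uniform, refs, n_files=3)
space = storage_overhead(uniform, sizes)
```

The optimization problem is to choose the degrees so that `dataset_cost` is as small as possible while `storage_overhead` stays within the budget.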

$$\Lambda(k_1, k_2, \ldots, k_N, \lambda) = \sum_{i=1}^{N} \frac{r_i}{N}(1 - A_i) + \lambda\left(\sum_{i=1}^{N} k_i s_i - S_{\max}\right) = \sum_{i=1}^{N} \frac{r_i}{N}(1-\mu)^{k_i} + \lambda\left(\sum_{i=1}^{N} k_i s_i - S_{\max}\right) \tag{5}$$

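The critical point of Λ is obtained where its gradient equals 0, which yields the following N+1 equations:

$$\begin{cases} \dfrac{\partial \Lambda}{\partial k_1} = \dfrac{r_1}{N}(1-\mu)^{k_1}\ln(1-\mu) + \lambda s_1 = 0 \\ \qquad \vdots \\ \dfrac{\partial \Lambda}{\partial k_N} = \dfrac{r_N}{N}(1-\mu)^{k_N}\ln(1-\mu) + \lambda s_N = 0 \\ \dfrac{\partial \Lambda}{\partial \lambda} = \sum_{i=1}^{N} k_i s_i - S_{\max} = 0 \end{cases} \tag{6}$$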

Accordingly, each k in (6) satisfies the following equations:

$$\frac{r_i}{N}(1-\mu)^{k_i}\ln(1-\mu) = -\lambda s_i \;\Longrightarrow\; k_i = \log_{1-\mu}\frac{s_i}{r_i} + \log_{1-\mu}\left(-\frac{\lambda N}{\ln(1-\mu)}\right) \tag{7}$$

Let

$$L(\lambda) = \log_{1-\mu}\left(-\frac{\lambda N}{\ln(1-\mu)}\right) \tag{8}$$

Applying (8) to the constraint in (6), we have the following equation:

$$S_{\max} = \sum_{i=1}^{N} k_i s_i = \sum_{i=1}^{N} s_i\left(L(\lambda) + \log_{1-\mu}\frac{s_i}{r_i}\right) \tag{9}$$

The original total storage space overhead is denoted by $S_{orig}$ and computed as follows:

$$S_{orig} = \sum_{i=1}^{N} s_i \tag{10}$$

Combining (9) and (10), we obtain the following equation for L(λ):

$$L(\lambda) = \frac{S_{\max}}{S_{orig}} - \frac{1}{S_{orig}}\sum_{i=1}^{N} s_i \log_{1-\mu}\frac{s_i}{r_i} \tag{11}$$

Applying (11) to (7), we obtain the solution for k:

$$k_i = \log_{1-\mu}\frac{s_i}{r_i} + \frac{S_{\max}}{S_{orig}} - \frac{1}{S_{orig}}\sum_{j=1}^{N} s_j \log_{1-\mu}\frac{s_j}{r_j} \tag{12}$$

According to equation (12), the solution for k consists of three parts. The first part, $\log_{1-\mu}(s_i/r_i)$, depends on the properties of data object i (the value of μ is regarded as a constant in the model, as mentioned above). The second part, $S_{\max}/S_{orig}$, is the redundancy degree, which expresses the storage space overhead after replication relative to the original storage space overhead. Its value is a constant once the constraint in (6) is fixed. The third part depends on the properties of the target data set. The influence of any individual data object on the third part is very small, so it can be regarded as a constant when calculating k for an individual data object.
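The three-part solution can be evaluated directly, as in the sketch below. It computes real-valued degrees from the object sizes, reference counts, node availability and storage budget; turning them into integer replica counts (rounding and clamping) is a separate step not shown here. The function name `optimal_degrees` and the example values are illustrative assumptions.

```python
from math import isclose, log


def optimal_degrees(mu, sizes, refs, s_max):
    """Real-valued replication degrees: per-object part + redundancy degree - data set part."""
    base = log(1 - mu)                     # change of base: log_{1-mu}(x) = ln(x) / ln(1 - mu)
    s_orig = sum(sizes)
    first = [log(s / r) / base for s, r in zip(sizes, refs)]       # per-object part
    third = sum(s * f for s, f in zip(sizes, first)) / s_orig      # data-set-wide part
    return [f + s_max / s_orig - third for f in first]


sizes = [1.0] * 6
refs = [3, 2, 1, 1, 1, 1]
degrees = optimal_degrees(0.9, sizes, refs, s_max=12.0)
used = sum(k * s for k, s in zip(degrees, sizes))
```

With equal sizes, the object referenced by three files receives the largest degree and the singly referenced objects the smallest, while the weighted sum of degrees meets the budget exactly.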

The cost function (3) is strictly convex in the replication degrees, and the constraint in (5) is linear. The solution is therefore the global minimum, since a critical point of a strictly convex function under a linear constraint is also its global minimum point.

4. Acceleration techniques

In theory, the optimization model yields the set of optimal replication degrees for all data objects; in practice, however, several challenges arise when it is applied to real systems. Because data sets in real storage systems are very large, the model can impose a heavy computational burden on the system. Unless the computational cost is reduced, the negative effect on system performance can make the model impractical in many applications. To relieve the computational burden of implementing the model, we propose the two techniques below.

4.1. Sample-based learning algorithm

The purpose of the sample-based learning algorithm is to reduce the computational burden by exploiting the statistical characteristics of the data objects. According to equation (12), the value of the third part depends on the statistical characteristics of the target data set. In other words, this value can be regarded as a constant if the distributions of object size and reference count are stationary. Based on this observation, our idea is to apply the model to a sufficiently large sample of the target data set instead of the whole data set. The remaining data objects then estimate their optimal replication degrees from the empirical value obtained from the sample. Because the model is applied only to the sample, the algorithm obtains the replication degree of each data object at much lower computational cost.

Choosing a proper subset of the target data set is important for the proposed algorithm. The subset must be representative, sharing the same statistical distribution as the entire data set. Specifically, the empirical distributions of object size and reference count obtained from the subset must approximate the true distributions of the entire data set adequately; otherwise, excessive sampling error leads to inaccurate results. To reduce the sampling error, the sampling method must satisfy two requirements. First, the samples must be chosen randomly and independently, so that the empirical distribution converges uniformly to the true distribution. Second, the sample size must be large enough to keep the sampling error small.

In a deduplication storage system, the data is divided into data objects whose IDs are assigned by hash functions. Because the hash functions employed behave like pseudo-random number generators, the object IDs are mapped uniformly over the address range. Random and independent sampling can therefore be implemented easily by a prefix-based ID sampling method.

The object ID, a hash key generated by the hash function, can be viewed as a fixed-size bit string in the space of all possible such strings. A prefix can be used to select a sample zone. If the object ID is an N-bit key, a given M-bit prefix (M < N) $p_1 p_2 \ldots p_M$ (where each p denotes a specific bit) determines the sample zone containing all object IDs that start with this prefix, i.e. IDs of the form

$$p_1 p_2 \ldots p_M \, b_{M+1} b_{M+2} \ldots b_N$$

where each b denotes a variable bit. The prefix-based ID sampling method has two advantages. First, since object IDs are uniformly distributed, the method is simple and efficient. Second, most distributed storage systems also use a prefix-based method to distribute data objects, which makes the proposed algorithm more feasible and scalable.
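Prefix-based ID sampling amounts to comparing the leading bits of each object ID with a fixed prefix. The sketch below assumes 160-bit SHA-1 object IDs; the names `object_id` and `in_sample_zone`, the 3-bit prefix and the synthetic object contents are illustrative assumptions.

```python
import hashlib

ID_BITS = 160  # SHA-1 produces 160-bit object IDs


def object_id(data: bytes) -> int:
    """Object ID as an integer, derived from the SHA-1 digest of the content."""
    return int.from_bytes(hashlib.sha1(data).digest(), "big")


def in_sample_zone(oid: int, prefix: int, prefix_bits: int) -> bool:
    """True if the leading prefix_bits bits of the ID equal the given prefix."""
    return oid >> (ID_BITS - prefix_bits) == prefix


# Sample the zone selected by the 3-bit prefix 101 (about 1/8 of a uniform ID space).
objects = [f"object-{i}".encode() for i in range(4096)]
sample = [o for o in objects if in_sample_zone(object_id(o), 0b101, 3)]
```

A longer prefix selects a smaller zone, so the sample size can be tuned by choosing the prefix length M.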

To determine a proper sample size, we propose an error control method. According to equation (12), only the third part depends on the statistical distribution of the data set. Hence, we define an observed parameter T as follows:

$$T(n) = \frac{\sum_{i=1}^{n} s_i \log_{1-\mu}\dfrac{s_i}{r_i}}{\sum_{i=1}^{n} s_i} \tag{13}$$

where n denotes the sample size. As n grows large, the value of T(n) converges to T(N), where N is the total number of data objects in the data set. Our goal is to obtain an approximate value of T(N) within a tolerable error. We propose a heuristic algorithm that finds a suitable approximation of T(N) at an appropriate sample size. The algorithm estimates T(N) by successive approximation, using a moving average and variance analysis.

Figure 1. The Flowchart of the proposed heuristic method

Figure 1 shows the flowchart of the proposed method. The method is summarized as follows:

Step 1: Initialize the parameters. The initial sample size and the step size are given and denoted by $n_0$ and m respectively; hence, the ith sample size is $n_i = n_0 + m \cdot i$.

Step 2: The size of the moving average window is denoted by w. The values $T(n_i)$ are calculated by (13).

Step 3: Calculate the simple moving average of the subset $T(n_0), T(n_1), \ldots, T(n_{w-1})$ as follows:

$$MA = \frac{1}{w}\sum_{i=0}^{w-1} T(n_i) \tag{14}$$

Step 4: Calculate the variance as follows and compare it with the threshold K:

$$\sigma^2 = \frac{1}{w}\sum_{i=0}^{w-1}\left[T(n_i) - MA\right]^2 \tag{15}$$

If $\sigma^2$ is less than K, the optimal sample size is set to $n_{w-1}$ and the approximate value of T(N) is set to $T(n_{w-1})$. Otherwise, $n_0$ is incremented by $n_0 = n_0 + m$, the new $T(n_{w-1})$ is calculated by (13), and the method returns to Step 3 to recompute MA and $\sigma^2$. The proposed heuristic method significantly improves the scalability and feasibility of the optimization model.
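The steps above can be sketched as a sliding-window loop. This is a minimal illustration rather than the authors' implementation: it recomputes every T value in the window from scratch instead of reusing earlier results, it assumes the sampled objects arrive as two parallel lists, and it returns `None` when the sample is exhausted before the variance drops below the threshold. All names are illustrative.

```python
from math import isclose, log


def t_value(sizes, refs, mu):
    """Observed parameter T over the given sampled objects (size-weighted mean)."""
    base = log(1 - mu)
    weighted = sum(s * log(s / r) / base for s, r in zip(sizes, refs))
    return weighted / sum(sizes)


def estimate_t(sizes, refs, mu, n0, m, w, threshold):
    """Slide a window of w sample sizes until the variance of T falls below threshold."""
    start = n0
    while start + m * (w - 1) <= len(sizes):
        ns = [start + m * i for i in range(w)]                 # n_0, ..., n_{w-1}
        window = [t_value(sizes[:n], refs[:n], mu) for n in ns]
        ma = sum(window) / w                                   # simple moving average
        variance = sum((t - ma) ** 2 for t in window) / w
        if variance < threshold:
            return ns[-1], window[-1]                          # sample size and estimate
        start += m                                             # shift the window by one step
    return None                                                # T did not stabilize


# Identical objects make T constant, so the first window already satisfies the threshold.
sizes = [4.0] * 100
refs = [2] * 100
result = estimate_t(sizes, refs, 0.9, n0=10, m=10, w=3, threshold=1e-6)
```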

4.2. Fast decision table

The fast decision table is another method for reducing the computational burden of the optimization model. After the statistical characteristics of the target data set are obtained by the sample-based learning algorithm, the other data objects in the data set estimate their optimal replication degrees by equation (12). According to the analysis above, only the first part of equation (12) varies between individual objects, so the formula for the optimal replication degree can be written as follows:

$$K(s, r) = \log_{1-\mu}\frac{s}{r} + C \tag{16}$$

where C is a constant. Since the number of data objects is very large, calculating the optimal replication degree for every data object is still a heavy computing task. To reduce the computational complexity further, we propose an optimized method that uses a fast decision table to look up the optimal replication degree instead of calculating it directly by equation (12).

The proposed method has advantages over direct calculation. Because the logarithm in equation (12) is computationally expensive, the fast decision table achieves the same accuracy through lookup operations at a lower computational cost. In addition, direct calculation can easily introduce round-off and representation errors into the intermediate results, which can accumulate in ill-conditioned cases and make the result meaningless.

To establish the fast decision table efficiently, the proposed method must satisfy two requirements. First, the table must achieve adequate accuracy in value estimation. Second, the table must be small enough, even for large-scale data sets, that the storage system can afford its storage overhead and the computational cost of lookup operations.

The fast decision table is in fact feasible and efficient. On the one hand, since only conventional replication is considered as the data redundancy scheme, the feasible optimal replication degrees are integers. The calculations in the model use floating-point arithmetic, and rounding the result easily achieves adequate accuracy. On the other hand, formula (16) is monotonic in both the data object size s and the number of references r, which facilitates the construction of the fast decision table.

The proposed method does the following steps to establish the fast decision table:

Step 1: Set the range of the final solutions. The lower bound of the solution is theoretically 1, and an upper bound U is given according to the system requirements. The final solution is thus an integer in the range [1, U].

Step 2: Calculate the corresponding value of s/r for every possible solution by the inverse function of (16), which can be written as follows:

$$\frac{s}{r} = (1-\mu)^{K(s, r) - C} \tag{17}$$

Step 3: Choose a subset R from the value range of r on a binary logarithmic scale. The subset of r is thus $\{1, 2, 4, 8, \ldots, 2^N\}$, where the maximum of r lies in the range $[2^{N-1}, 2^N]$.

Step 4: Calculate the corresponding value of s from the given r and s/r, and establish the fast decision table.

Step 5: Merge adjacent intervals of r and s that share the same replication degree.

The method to establish the table is presented above. In addition, the method can be optimized further by exploiting the statistical distribution of the r and s. Since the optimization depends on the intrinsic property of the data-set itself, we do not discuss this issue in detail here.
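The idea behind the table can be illustrated with a simplified one-dimensional variant. Instead of the two-dimensional table over r and s built in the steps above, the sketch precomputes the s/r thresholds at which the rounded replication degree changes (using the inverse function of (16)) and then finds the degree by binary search. The round-half-up rule, the clamping to [1, U], and all names are assumptions made for this illustration.

```python
from bisect import bisect_left
from math import floor, log


def build_table(mu, c, upper):
    """Ascending s/r thresholds that separate the degrees upper, ..., 2, 1."""
    return sorted((1 - mu) ** (k + 0.5 - c) for k in range(1, upper))


def lookup_degree(table, s, r):
    """Replication degree by binary search; no logarithm is evaluated."""
    upper = len(table) + 1
    return upper - bisect_left(table, s / r)


def direct_degree(mu, c, upper, s, r):
    """Reference result: evaluate the formula, round half up, clamp to [1, upper]."""
    k = log(s / r) / log(1 - mu) + c
    return min(upper, max(1, floor(k + 0.5)))


table = build_table(mu=0.9, c=2.0, upper=4)
pairs = [(1, 1), (1, 10), (100, 1), (1, 1000), (5, 2)]
looked_up = [lookup_degree(table, s, r) for s, r in pairs]
```

Because the formula is monotonic in s/r, the table needs only U - 1 thresholds, which keeps both its size and the lookup cost small.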

5. Evaluation

To evaluate the feasibility and effectiveness of the proposed scheme, we built a simulator that allows us to experiment with several important parameters. The simulator can perform both file-level and chunk-level deduplication on target data sets. It applies the SHA-1 algorithm to generate the data object IDs and groups them by prefix. In the experiments, each group of data objects is treated as a separate storage node, and the group ID is set to the corresponding prefix. The simulator places the replicas of a data object in successive groups; for instance, if a data object t with replication degree n is assigned to group G, the simulator replicates it to groups G+1 through G+n-1.
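The placement rule of the simulator can be sketched as follows. The group of an object is taken from the leading bits of its SHA-1 ID, and replicas go to the following groups. Wrapping around past the last group is an assumption of this sketch (the text does not say what happens at the end of the group range), and the function names are illustrative.

```python
import hashlib


def home_group(data: bytes, prefix_bits: int) -> int:
    """Group ID taken from the leading prefix_bits bits of the 160-bit SHA-1 ID."""
    oid = int.from_bytes(hashlib.sha1(data).digest(), "big")
    return oid >> (160 - prefix_bits)


def replica_groups(group: int, degree: int, num_groups: int):
    """Groups G, G+1, ..., G+degree-1, wrapping around (an assumption of this sketch)."""
    return [(group + offset) % num_groups for offset in range(degree)]


# Eight groups correspond to a 3-bit prefix.
placement = replica_groups(6, 3, 8)
group = home_group(b"example object", 3)
```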

We applied the simulator to two real-world data sets. The first data set consists of all document files in pdf, doc and txt formats and all media files in mp3, avi, mkv and rmvb formats collected from the desktop PCs of 30 graduate students in our research team. It contains 200,823 files amounting to 1.86 TB of data. The second data set was collected from backups of source code made in our lab over the course of two years. The backup policy is to perform a full backup whenever a new version is developed. It contains 2.17 million files amounting to 826.75 GB of data.

The first data set was processed by the simulator in file-level deduplication mode, and the proposed scheme was then applied. For comparison, uniform replication, in which all data objects share the same replication degree, was also applied to the data set.

Figure 2. Comparison of the rate of available files in file-level deduplication mode

To evaluate the effectiveness of the proposed scheme, we compare the rate of available files under the same number of lost storage nodes. The redundancy degree R is set to 2 or 3 in the experiments. In Figure 2, the vertical axis represents the average client file lost rate and the horizontal axis represents the number of unavailable storage nodes. For a specific number of unavailable storage nodes, there can be several possible combinations of failed nodes. In the experiment, the total number of storage nodes was set to 8; all possible combinations of lost nodes were enumerated exhaustively, and the client file lost rates were averaged by arithmetic mean, since every combination is equally probable. As shown in Figure 2, the average client file lost rate of the proposed scheme is lower than that of uniform replication in most cases. However, since the replication degrees of some files under the proposed scheme are less than the redundancy degree R, the average client file lost rate of the proposed scheme is slightly higher than that of uniform replication when the number of unavailable storage nodes is less than R. We also measured the actual storage overhead of both replication schemes. Table 1 compares the actual storage overhead (after deduplication) of the two schemes under different redundancy degrees. The data in Table 1 show that the difference in space overhead between the two schemes is tolerable.

Table 1. The actual storage overhead for the two replication schemes

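| Redundancy degree | The proposed scheme | The uniform replication | Δ (proposed − uniform) | ΔRate (proposed / uniform) |
|---|---|---|---|---|
| 1 | 1587.86 GB | 1512.25 GB | +75.61 GB | 105.00% |
| 2 | 3200.21 GB | 3024.51 GB | +175.7 GB | 105.81% |
| 3 | 4721.37 GB | 4536.76 GB | +184.61 GB | 104.07% |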
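The averaging procedure used for the file lost rate (enumerate every combination of failed nodes, compute the fraction of lost files, take the arithmetic mean) can be sketched as follows. The data structures are assumptions of this sketch: `file_objects` maps each file to the objects it needs, and `placement` maps each object to the nodes holding its replicas. A file counts as lost when every replica of at least one of its objects is on a failed node.

```python
from itertools import combinations
from math import isclose


def file_lost_rate(file_objects, placement, failed):
    """Fraction of files that need an object whose replicas are all on failed nodes."""
    failed = set(failed)
    lost = sum(
        1 for objects in file_objects.values()
        if any(set(placement[o]) <= failed for o in objects)
    )
    return lost / len(file_objects)


def average_lost_rate(file_objects, placement, num_nodes, num_failed):
    """Arithmetic mean of the lost rate over all equally likely failure combinations."""
    rates = [file_lost_rate(file_objects, placement, combo)
             for combo in combinations(range(num_nodes), num_failed)]
    return sum(rates) / len(rates)


# Toy example: object A has two replicas, object B only one.
file_objects = {"f1": ["A"], "f2": ["A", "B"]}
placement = {"A": [0, 1], "B": [2]}
rate_one_failure = average_lost_rate(file_objects, placement, 4, 1)
rate_two_failures = average_lost_rate(file_objects, placement, 4, 2)
```

For a large number of nodes, exhaustive enumeration becomes infeasible, and a random subset of combinations can be averaged instead.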

The second data set was processed by the simulator in chunk-level deduplication mode, and the proposed scheme was then applied. For comparison, uniform replication was also applied to the data set.


Figure 3. Comparison of the rate of available files in chunk-level deduplication mode

In Figure 3, the vertical axis represents the average file lost rate and the horizontal axis represents the number of unavailable storage nodes. Since a file is split into many chunks for deduplication, the loss of a single chunk can render the whole file useless. File availability is therefore more vulnerable under chunk-level deduplication. In the experiment, the total number of storage nodes was set to 256. To reduce the computational cost and keep the experiment feasible, we randomly chose 100 combinations for each number of lost nodes and averaged the file lost rates by arithmetic mean. As shown in Figure 3, the file lost rate of the proposed scheme is lower than that of uniform replication in most cases. Table 2 compares the actual storage overhead of the two replication schemes under different redundancy degrees. The data in Table 2 show that the difference in space overhead between the two schemes is tolerable.

Table 2. The actual storage overhead for the two replication schemes

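| Redundancy degree | The proposed scheme | The uniform replication | Δ (proposed − uniform) | ΔRate (proposed / uniform) |
|---|---|---|---|---|
| 1 | 81.11 GB | 78.66 GB | +2.45 GB | 103.11% |
| 2 | 162.27 GB | 157.32 GB | +4.95 GB | 103.15% |
| 3 | 242.98 GB | 235.97 GB | +7.01 GB | 102.97% |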

The experimental results above show that the proposed scheme improves data availability at a similar storage overhead in both file-level and chunk-level deduplication modes.

6. Related works

As data has been growing explosively, data deduplication technology has been widely employed in modern storage systems. A number of solutions have been proposed and developed in both industry and academia.

Venti [2] is a content-addressable network storage system. Venti provides an inherent integrity checking mechanism for detecting data corruption and employs RAID to improve data reliability. HYDRAstor [3] is a more recent commercial implementation of content-addressable storage that delivers global deduplication. HYDRAstor improves data reliability by employing general replication and erasure coding methods. In both systems, all data shares the same redundancy scheme regardless of the properties of the data itself.

Deep Store [4] is also a block-level storage system for archival data, which employs virtual content-addressable storage with multiple methods for inter-file and intra-file compression. The paper [4] discusses the side effects of data compression on data reliability, which is closely related to our research, and proposes a replication scheme based on chunk weight. In [5], a high-reliability provision mechanism for large-scale deduplication archival storage systems is proposed. It packs variable-length data chunks into fixed-size objects, uses ECC codes to encode the objects, and distributes them among the storage nodes in a redundancy group. However, without a theoretical model, the replication schemes in [4] and [5] cannot obtain the optimal replication degree for every individual data chunk. The paper [6] outlines several reliability analysis problems that arise from deduplication in an erasure-coded key-value store, but does not discuss solutions in detail.

In addition, several other academic studies have investigated deduplication techniques. Some focus on fast duplicate detection [7], some on parallel deduplication processing [8][9], and others on content-defined chunking [10][11]. Most of them, however, ignore the data availability issues of deduplication storage and simply leave the problem to the operating system or the hardware.

Unlike the related work above, we focus on the data availability issues in deduplication storage systems and propose an optimized scheme that exploits the properties of the data itself. By analyzing data compression and data redundancy together, the proposed scheme aims to achieve high data availability at the least storage overhead.

7. Conclusion

This paper presents a data availability optimization model in the context of deduplication storage systems. A novel replication scheme based on the model is proposed to improve overall data availability by exploiting the statistical characteristics of data object commonality in the deduplication storage system. To make the scheme more feasible, several fast algorithms are proposed to relieve its computational burden. To verify the feasibility and effectiveness of the proposed scheme, we performed evaluation experiments, and the results show that the proposed strategy significantly improves data availability.

8. References

[1] "The Digital Universe Decade – Are You Ready?" IDC white paper, May 2010.

[2] Sean Quinlan, Sean Dorward. "Venti: a new approach to archival storage", In Proceedings of the 1st USENIX conference on File and Storage Technologies, pp.89-101, 2002.

[3] Cezary Dubnicki, Leszek Gryz, Lukasz Heldt, Michal Kaczmarczyk, Wojciech Kilian, Przemyslaw Strzelczak, Jerzy Szczepkowski, Cristian Ungureanu, Michal Welnicki, "HYDRAstor: a Scalable Secondary Storage", in Proceedings of the 7th USENIX Conference on File and Storage Technologies, pp.197-210, 2009.

[4] Deepavali Bhagwat, Kristal Pollack, Darrell D. E. Long, Thomas Schwarz, Ethan L. Miller, "Providing High Reliability in a Minimum Redundancy Archival Storage System", in Proceedings of the 14th IEEE International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems, pp. 413-421, 2006.

[5] Chuanyi Liu, Yu Gu, Linchun Sun, Bin Yan, Dongsheng Wang, "R-ADMAD: High reliability provision for large-scale de-duplication archival storage systems", In Proceedings of the 23rd international conference on Supercomputing, pp. 370–379, 2009.

[6] Xiaozhou Li, Mark Lillibridge, "Reliability Analysis of Deduplicated and Erasure-Coded Storage", ACM SIGMETRICS Performance Evaluation Review, vol.38, no.3, pp.4-9, 2011.

[7] Benjamin Zhu, Kai Li, Hugo Patterson, "Avoiding the Disk Bottleneck in the Data Domain Deduplication File System", In Proceedings of the 6th USENIX Conference on File and Storage Technologies, pp. 269-282, 2008.

[8] Wei Dong, Fred Douglis, Kai Li, Hugo Patterson, Sazzala Reddy, Philip Shilane, "Tradeoffs in scalable data routing for deduplication clusters", In Proceedings of the 9th USENIX Conference on File and Storage Technology, pp.15-29, 2011.

[9] Yujuan Tan, Dan Feng, Fangting Huang, Zhichao Yan, "SORT: A Similarity-Ownership Based Routing Scheme to Improve Data Read Performance for Deduplication Clusters", IJACT, Vol. 3, No. 9, pp. 270-277, 2011.

[10] Erik Kruus, Cristian Ungureanu, Cezary Dubnicki, "Bimodal content defined chunking for backup streams", In Proceedings of the 8th Conference on File and Storage Technologies, pp. 18-31, 2010.

[11] Jiansheng Wei, Ke Zhou, Lei Tian, Hua Wang, Dan Feng, "A Fast Dual-level Fingerprinting Scheme for Data Deduplication", JDCTA, Vol. 6, No. 1, pp. 271-282, 2012.
