Study of existing cloud data storage Techniques with respect to optimized Duplication: Deduplication


Nipun Chhabra^a*, Manju Bala^b

^a Ph.D. Research Scholar, I. K. Gujral Punjab Technical University, Kapurthala (Punjab), India.

^b Director, Khalsa College of Engineering & Technology, Amritsar (Punjab), India.

Abstract

Data deduplication techniques were developed to eliminate duplicate data so that only a single copy of each piece of data is stored. Deduplication reduces the disk space required to store backups by tracking and eliminating redundant copies of data inside the storage unit. Only the first instance of the data is stored; each subsequent instance is replaced by a reference pointer to the stored original. In a big data storage environment, huge amounts of data need to be secured, so proper management, fraud detection, and analysis of data privacy are important topics to consider. This paper examines and evaluates the prevailing deduplication techniques and presents them in tabular form. The survey finds that the confidentiality and safety of data are compromised at many levels in prevalent deduplication techniques. Although much research is being carried out in various areas of cloud computing, work pertaining to this topic is still scant.

Keywords: Data Deduplication, Data Security, cloud computing, Storage Complexity in Cloud, Cloud Service Provider, Privacy Preserving in cloud, inline deduplication, Bandwidth, Source based deduplication, post-process deduplication, target based deduplication, local deduplication.

I. INTRODUCTION

In recent years, with the advent of cloud computing, industries, organizations, and many business fields have come to rely on the cloud for storing their valuable data. Cloud computing is a new way to provide services and data in a shared manner over the internet. It is preferred over other storage systems [1] for hosting applications because of reduced capital expenditure and operational overhead, and better IT responsiveness and efficiency. Cloud computing has empowered the individual user by providing seemingly unlimited storage space and access to data anytime and anywhere.

Applications are purchased, licensed, and streamed over the cloud network [2] rather than run from the user's desktop, and payment is made only for usage. Cloud providers, however, use a great deal of storage space for backups and for recovering the huge amounts of outsourced data after data disasters. This consumes large amounts of storage space and bandwidth, which lowers the efficiency and throughput of the system. By incorporating data deduplication into cloud storage, cloud service providers can increase their effective storage capacity.

Storage deduplication [1,2] reduces the amount of storage needed for a given set of files. Network deduplication decreases the number of bytes moved between endpoints, so less bandwidth is required. Together they improve storage utilization and provide an efficient way of handling identical data [3, 4].

In this paper, the different techniques of data deduplication that can be applied to different data types are discussed in Section II. Section III surveys related work and is followed by a comparison table of the different techniques. Section IV presents observations, including a direction for secure deduplication, and Section V concludes the paper.

II. TYPES OF DEDUPLICATION

Deduplication is essentially a compression technique that removes redundant data to improve storage efficiency in large-scale storage systems. A file is broken down into fixed-size or variable-size blocks prior to the deduplication process [5].

In data deduplication, blocks are compared and duplicates are removed. Each unique block is stored and the index table is updated. Deduplication can be generalized in four steps as follows.

• For each chunk of data, a key value is calculated using a cryptographic hash function [6,7].

• The hash value of the chunk is compared with the hash values already present in the index.

• A matching hash value indicates a duplicate chunk, and a logical reference pointer is assigned to the chunk that already exists in storage.

• If the chunk is new [2], it is inserted and the index table is updated.
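The four steps above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the class name `DedupStore`, the fixed 4-byte chunk size, the in-memory dictionary used as the index table, and the choice of SHA-256 are all hypothetical and are not taken from any of the surveyed systems.

```python
import hashlib


class DedupStore:
    """Toy in-memory chunk store (hypothetical; for illustration only)."""

    def __init__(self, chunk_size=4):
        self.chunk_size = chunk_size
        self.index = {}  # index table: fingerprint -> stored chunk

    def put(self, data: bytes) -> list:
        """Store data and return its recipe, a list of chunk fingerprints."""
        recipe = []
        for i in range(0, len(data), self.chunk_size):
            chunk = data[i:i + self.chunk_size]
            fp = hashlib.sha256(chunk).hexdigest()  # step 1: hash the chunk
            if fp not in self.index:                # step 2: compare with index
                self.index[fp] = chunk              # step 4: insert new chunk
            recipe.append(fp)                       # step 3: keep a reference
        return recipe

    def get(self, recipe: list) -> bytes:
        """Rebuild the original data by following the references."""
        return b"".join(self.index[fp] for fp in recipe)


store = DedupStore(chunk_size=4)
recipe = store.put(b"ABCDABCDEFGHABCD")
```

Here the 16-byte input is split into four chunks, but only the two distinct chunks are kept in `store.index`; `store.get(recipe)` reconstructs the original bytes from the references.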

Data deduplication techniques can be classified on the following criteria.

• Deduplication based on source or target level

• Deduplication based on inline or post-process

• Deduplication based on local or distributed/global level

Fig. 1. Deduplication Techniques [8,9]

A) Location-based deduplication: depending on where deduplication is performed:

1. Source-based deduplication: Deduplication is applied before the data is transferred to the backup target. Duplicate copies are removed before transmission to the target machine, so less bandwidth is used [8] because only unique data is transferred. Storage and time are also reduced. It requires very little hardware but more processing resources at the source (client) side.

2. Target-based deduplication: Deduplication is done on the target backup servers [4]. Backup data is transmitted to the target machine without deduplication, and redundant data is eliminated there.

This method increases bandwidth cost but gives good performance in comparison to source-based deduplication [6]. It imposes no extra operating cost on the data source. This technique is used for big data storage [3].
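The bandwidth trade-off between the two locations can be made concrete with a small sketch. It assumes a hypothetical protocol in which the source sends a 32-byte SHA-256 fingerprint for every chunk and the chunk payload only when the server lacks it; the function names and the 1024-byte chunks are illustrative choices, not part of any cited system.

```python
import hashlib


def fingerprint(chunk: bytes) -> bytes:
    """32-byte SHA-256 digest used as the chunk fingerprint."""
    return hashlib.sha256(chunk).digest()


def target_based_backup(chunks, server_index):
    """Send every chunk; the server removes duplicates on arrival."""
    sent = 0
    for chunk in chunks:
        sent += len(chunk)  # the full payload always crosses the network
        server_index.setdefault(fingerprint(chunk), chunk)
    return sent


def source_based_backup(chunks, server_index):
    """Hash at the client and send only chunks the server lacks."""
    sent = 0
    for chunk in chunks:
        fp = fingerprint(chunk)  # hashing cost is paid at the source
        sent += len(fp)          # the fingerprint is always sent
        if fp not in server_index:
            sent += len(chunk)   # payload is sent only for new chunks
            server_index[fp] = chunk
    return sent


chunks = [b"a" * 1024, b"b" * 1024, b"a" * 1024, b"a" * 1024]
target_index, source_index = {}, {}
target_bytes = target_based_backup(chunks, target_index)
source_bytes = source_based_backup(chunks, source_index)
```

With these inputs the target-based path transfers 4096 bytes and the source-based path 2176 bytes (four fingerprints plus two unique chunks), while both servers end up storing the same two chunks.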

B) Deduplication based on time:

1. Inline deduplication: This is applied at the client side, i.e., data is deduplicated before it is stored on disk. Only the unique data set is transmitted to the target side (the server) for storage. It reduces network overhead but requires high processing power for deduplication [10,11] at the source machine.

2. Post-process deduplication: Deduplication is applied at the server side. All incoming data is first stored on disk without eliminating any redundant data; later, the server retains only a unique copy of the data in storage. Post-process deduplication performs better than inline deduplication, but it needs more disk space for data storage [10] and requires a fast disk cache.
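The difference in disk space between the two timings can be illustrated as follows. This is a simplified sketch: the function names are hypothetical, storage is modeled as Python containers, and "peak" counts only the bytes held before the post-process pass reclaims space.

```python
import hashlib


def inline_dedup(chunks):
    """Deduplicate before writing; returns (store, peak bytes on disk)."""
    store = {}
    for chunk in chunks:
        # Only chunks with an unseen fingerprint are ever written.
        store.setdefault(hashlib.sha256(chunk).hexdigest(), chunk)
    peak = sum(len(c) for c in store.values())
    return store, peak


def post_process_dedup(chunks):
    """Write everything first, deduplicate in a later pass."""
    staging = list(chunks)                # all incoming data lands on disk
    peak = sum(len(c) for c in staging)   # space needed before the pass runs
    store = {}
    for chunk in staging:                 # later background pass
        store.setdefault(hashlib.sha256(chunk).hexdigest(), chunk)
    return store, peak


chunks = [b"x" * 100, b"y" * 100, b"x" * 100]
inline_store, inline_peak = inline_dedup(chunks)
post_store, post_peak = post_process_dedup(chunks)
```

Both approaches end with the same two stored chunks, but the post-process variant needs 300 bytes of space up front against 200 bytes for the inline variant.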

C) Deduplication based on local or global level:

1. Local-level deduplication: It removes duplicates at the local level, i.e., long-haul network.

Because it works on a single VM to remove duplicates, it has a major performance drawback and cannot remove duplication on a large scale. In a multi-node environment [12,13,14], it shows better performance if coupled with indexing and run in parallel on multiple nodes.

2. Global-level deduplication: It is performed across multiple data sets in a distributed environment. It is also called multi-node deduplication because groups or clusters of many nodes work together as a unit. It compares the data sets of all the nodes to achieve deduplication in distributed storage with many storage servers. It incurs additional hashing overhead.

Data type is an important factor to consider in developing deduplication techniques [14, 15].

Techniques are detailed on the basis of data type: image, text, and video data are categorized into different types according to their storage formats. The data format of stored information is crucial for searching for matching information. Data deduplication provides efficient storage systems with improved bandwidth usage, but security [16, 17] and privacy of data remain major issues.
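The effect of the index scope can be shown with a short sketch. It assumes each node holds a list of chunks and that a "local" scheme keeps one index per node while a "global" scheme shares a single index; the function names and sample data are hypothetical.

```python
import hashlib


def fp(chunk: bytes) -> str:
    return hashlib.sha256(chunk).hexdigest()


def local_dedup(nodes):
    """Each node deduplicates against its own index only."""
    return sum(len({fp(c) for c in chunks}) for chunks in nodes)


def global_dedup(nodes):
    """All nodes deduplicate against one shared index."""
    shared = set()
    for chunks in nodes:
        shared.update(fp(c) for c in chunks)
    return len(shared)


nodes = [
    [b"A", b"B", b"A"],
    [b"B", b"C"],
    [b"A", b"C", b"D"],
]
local_count = local_dedup(nodes)
global_count = global_dedup(nodes)
```

Local deduplication leaves 7 chunks stored across the three nodes because duplicates that span nodes go unnoticed, whereas the shared index keeps only the 4 distinct chunks, at the cost of every node consulting the common index.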

III. LITERATURE SURVEY

Feng et al. use a near-exact defragmentation (NED) scheme in which fragments are identified and rewritten in cloud backup; segment reference analysis is used for defragmentation. Restore performance of cloud backups is improved.

Liu et al. describe a sliding blocking algorithm with backtracking sub-blocks. Deduplication is done at chunk level; the Adler-32 checksum and the MD5 hash algorithm provide the weak and strong hash values, respectively. Duplicate data in sub-blocks is thereby detected efficiently.
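The weak-then-strong hash check can be sketched with Python's standard `zlib.adler32` and `hashlib.md5`. This is not the authors' algorithm: the backtracking of sub-blocks is omitted, the Adler-32 value is recomputed for each window rather than rolled incrementally, and the function names and block size are assumptions made for illustration.

```python
import hashlib
import zlib


def build_index(stored: bytes, block_size: int) -> dict:
    """Map weak checksum -> set of strong digests for each stored block."""
    index = {}
    for i in range(0, len(stored) - block_size + 1, block_size):
        block = stored[i:i + block_size]
        index.setdefault(zlib.adler32(block), set()).add(
            hashlib.md5(block).digest()
        )
    return index


def find_duplicates(new: bytes, index: dict, block_size: int) -> list:
    """Slide a window over new data; return offsets of already-stored blocks."""
    matches = []
    i = 0
    while i + block_size <= len(new):
        window = new[i:i + block_size]
        strong_set = index.get(zlib.adler32(window))  # cheap weak check first
        if strong_set and hashlib.md5(window).digest() in strong_set:
            matches.append(i)   # strong hash confirms the duplicate
            i += block_size     # jump past the matched block
        else:
            i += 1              # no match: slide the window by one byte
    return matches


index = build_index(b"HELLOWORLD", block_size=5)
offsets = find_duplicates(b"xxHELLOyWORLD", index, block_size=5)
```

The MD5 digest is computed only when the Adler-32 value already matches a stored block, so most windows are rejected by the inexpensive check. In this example the two stored blocks are found at offsets 2 and 8 even though inserted bytes have shifted them.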

Lim Sh. et al. use DeFFS, the Duplicated eliminated Flash File System, which increases flexibility by using variable-size blocks and a non-overlapping duplicate chunking algorithm, and decreases duplicate storage, extending flash memory cycles.

Kiliyan et al. propose deduplication based on stream and disk context. The context-based rewriting (CBR) method rewrites extremely fragmented replicas, which improves storage capacity and bandwidth utilization.

According to Li et al., the data domain deduplication file system (DDFS) uses Bloom filters to identify segments that are new. Stream-related metadata is pre-fetched, and locality-conserved caching preserves the locality of fingerprints.
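How a Bloom filter can flag new segments is sketched below. This is a generic Bloom filter, not the DDFS implementation; the class name, the 1024-bit size, the three hash positions, and the use of seeded SHA-256 are all assumptions chosen for brevity.

```python
import hashlib


class BloomFilter:
    """Generic Bloom filter (illustrative; parameters are arbitrary)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item: bytes):
        # Derive several bit positions by prefixing a one-byte seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(bytes([seed]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: bytes) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: bytes) -> bool:
        """False means definitely new; True means possibly seen before."""
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


seen = BloomFilter()
seen.add(b"segment-fingerprint-1")
```

A `False` answer from `might_contain` is always correct, so a segment can be treated as new without consulting the on-disk index; a `True` answer may be a false positive and still needs a lookup in the full index.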

Ferreira et al. focus on a technique involving hash challenges, in which less metadata is required to identify redundant chunks, without added complexity. Communication overheads are also reduced.

Pinkas et al. use a cross-user deduplication technique. It is a hybrid approach that preserves the privacy of user data with reduced risk of data leakage.

Greenan et al. use convergent encryption. Encryption keys are created from fragments of data, so similar fragments are encrypted to the same code. A unique key is used to encrypt each file. The authors use asymmetric key pairs for secure data transfer, organized for deduplication that is secure over distributed storage.
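The core idea of convergent encryption, deriving the key from the data itself, can be illustrated as follows. The cipher here is a toy XOR keystream built from SHA-256 and is NOT secure; it stands in for a real block cipher only so that the example runs with the standard library. All function names are hypothetical, and the asymmetric key handling described by the authors is not shown.

```python
import hashlib


def derive_key(data: bytes) -> bytes:
    """Convergent key: the hash of the plaintext itself."""
    return hashlib.sha256(data).digest()


def _keystream(key: bytes, length: int) -> bytes:
    # Toy keystream for illustration only; not a secure cipher.
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])


def toy_xor(key: bytes, data: bytes) -> bytes:
    """XOR data with the keystream; applying it twice restores the data."""
    return bytes(a ^ b for a, b in zip(data, _keystream(key, len(data))))


def convergent_encrypt(data: bytes):
    key = derive_key(data)
    return key, toy_xor(key, data)


def convergent_decrypt(key: bytes, ciphertext: bytes) -> bytes:
    return toy_xor(key, ciphertext)


key_a, ct_a = convergent_encrypt(b"same file contents")
key_b, ct_b = convergent_encrypt(b"same file contents")
key_c, ct_c = convergent_encrypt(b"a different, longer file")
```

Because the key depends only on the content, two users who encrypt the same data independently obtain identical ciphertexts (`ct_a == ct_b`), which the storage provider can deduplicate without seeing the plaintext, while different data yields a different ciphertext.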

Zhou et al. investigate whether global or local deduplication should be done in big data environments and analyze energy consumption with a view to cutting execution time. The degree of redundancy is an important feature that significantly affects deduplication efficiency in big data storage.

Vivek et al. use hash functions for deduplication. Encryption is supported through convergent encryption, which results in a token.

Yan et al. use attribute-based encryption to deduplicate encrypted data stored in the cloud while supporting secure access to the data.

Li et al. check simple data files for integrity and safe deduplication with SecCloud, whereas for encrypted files, integrity auditing and deduplication are performed by SecCloud+.

The MapReduce method was used to generate data tags before uploading the data and to check the integrity of data during storage in the cloud.

COMPARATIVE ANALYSIS OF THE PREVALENT DEDUPLICATION TECHNIQUES

| Technique | Authors | Description | Findings |
|---|---|---|---|
| Near-exact defragmentation (NED) technique | Feng et al. | Fragments are identified and rewritten in cloud backup; defragmentation is done on the basis of segment reference analysis. | Helps in restoring cloud backups. |
| Sliding blocking with backtracking sub-blocks (SBBS) | Liu et al. | Deduplication is done at chunk level; the Adler-32 checksum is used for the weak hash check and MD5 for the strong hash check. | Checking of hash values lets duplicate data in sub-blocks be detected efficiently. |
| Duplicated eliminated Flash File System (DeFFS) | Lim Sh et al. | Increased flexibility through variable-size blocks; non-overlapping duplicate chunking algorithm. | Decreased redundant storage and extended flash memory cycles. |
| Context-based rewriting technique | Kiliyan et al. | Stream and disk context are the basis of deduplication; the context-based rewriting (CBR) method rewrites extremely fragmented replicas. | Improves storage capacity and bandwidth utilization. |
| Data domain deduplication file system (DDFS) | Li et al. | Bloom filters identify segments that are new; stream-related metadata is pre-fetched; locality-conserved caching preserves the locality of fingerprints. | High cache ratio. |
| Hash challenges | Ferreira et al. | Less metadata is required to identify redundant chunks, without added complexity. | Decreased communication overheads. |
| Cross-user deduplication | Pinkas et al. | Cross-user deduplication technique; a hybrid approach that conserves confidentiality of user data. | Conserves confidentiality of user data. |
| Convergent encryption | Greenan et al. | Encryption keys are created from fragments of data, so similar fragments are encrypted to the same code. A unique key is used to encrypt each file. Asymmetric key pairs are used for secure data transfer, organized for deduplication that is secure over distributed storage. | Secure deduplication for distributed storage systems. |
| Hash indexing | Zhou et al. [I] | Analysis of whether global or local deduplication should be done in big data environments; energy consumption is low if the redundancy level is high enough to cut short execution time. | Degree of redundancy is an important feature that affects deduplication efficiency in big data storage. |
| Deduplication applied at block level | Puzio et al. [J] | Convergent encryption is used. | ClouDedup has been proposed. |
| Token generation technique | Vivek et al. [K] | Hash functions are used for deduplication; encryption is done using convergent encryption and then a token is generated. | Convergent encryption is used and a token is generated. |
| Attribute-based encryption | Yan et al. [L] | Attribute-based encryption deduplicates encrypted data stored in the cloud and supports secure access to data. | Attributes are used to encrypt the file with security in mind. |
| MapReduce technique | Li et al. [M] | Simple data files are checked for integrity and safe deduplication by SecCloud, whereas for encrypted files, integrity auditing and deduplication are performed by SecCloud+. | MapReduce cloud method was used to generate data tags before uploading the data and to check the integrity of data during storage in the cloud. Proof of Ownership protocol was used for secure deduplication. |

IV. OBSERVATIONS

This paper presents recent research work concerned with deduplication techniques, in addition to the already existing survey work. These techniques are investigated in detail based on several categories. Different categories of deduplication techniques, depending on the data type (text, image, or video), time, and location, can be chosen according to the user's need. An attempt is made here to focus on some issues that remain unaddressed in deduplication techniques for big storage systems. Recently, numerous deduplication techniques have been considered for storage systems, and new technologies are being developed. The literature survey indicates that there is a huge scope for research on deduplication techniques for cloud storage. Data deduplication provides efficient storage systems with improved bandwidth usage, but security and privacy of data remain major issues. Attribute-based encryption is one of the safest ways to manage and control file sharing in the cloud, with its special feature of processing user-dependent attributes as parameters. It can be categorized as public-key encryption in which the user's secret key and the ciphertext depend on a number of attributes, and decryption is possible only if the attributes of the user's key match the attributes of the ciphertext.

V. CONCLUSION

In this paper, prevalent data deduplication techniques have been studied, compared, and arranged in tabular form for critical evaluation. Deduplication is an important technique for reducing storage cost, bandwidth, and energy utilization. Because these techniques have complete access to storage, the confidentiality of users' data is at risk at every level.

As many organizations shift to cloud environments, they use ever larger data sets with enormous numbers of VMs and employ different replication mechanisms for security, yet security is still being breached at many levels.

More research can be done to provide better security at the customer and cloud levels while providing efficient deduplication strategies.

REFERENCES

[1] E. Manogar and S. Abirami, "A Study on Data Deduplication Techniques for Optimized Storage", 2014 Sixth International Conference on Advanced Computing (ICoAC), IEEE, 2014, pp. 161-166.

[2] Anil Lamba, 2014. "Uses of Cluster Computing Techniques to Perform Big Data Analytics for Smart Grid Automation System", International Journal for Technological Research in Engineering, Volume 1, Issue 7, pp. 5804-5808, 2347-4718.

[3] Zhou, Ruijin, Ming Liu, and Tao Li. "Characterizing the Efficiency of Data Deduplication for Big Data Storage Management." 2013 IEEE International Symposium on Workload Characterization (IISWC) (2013), pp. 98-108.

[4] Luo, Shengmei, Guangyan Zhang, Chengwen Wu, Samee Khan, and Keqin Li. "Boafft: Distributed Deduplication for Big Data Storage in the Cloud." IEEE Transactions on Cloud Computing (2015), pp. 1-13.

[5] Q. Liu, Y. Fu, G. Ni, R. Hou, "Hadoop Based Scalable Cluster Deduplication for Big Data", 2016 IEEE 36th International Conference on Distributed Computing Systems Workshops.

[6] Xu, J., Zhang, W., Ye, S., Wei, J., & Huang, T. (2014). A Lightweight Virtual Machine Image Deduplication Backup Approach in Cloud Environment. 2014 IEEE 38th Annual Computer Software and Applications Conference, 503-508. doi:10.1109/compsac.2014.73.

[7] Waghmare, V., & Kapse, S. (2016). Authorized Deduplication: An Approach for Secure Cloud Environment. Procedia Computer Science, 78, 815-823. doi:10.1016/j.procs.2016.02.063.

[8] Puzio, P., Molva, R., Onen, M., & Loureiro, S. (2013). ClouDedup: Secure Deduplication with Encrypted Data for Cloud Storage. 2013 IEEE 5th International Conference on Cloud Computing Technology and Science, 363-370. doi:10.1109/cloudcom.2013.54.

[9] Yan, Zheng, Mingjun Wang, Yuxiang Li, and Athanasios V. Vasilakos. "Encrypted Data Management with Deduplication in Cloud Computing." IEEE Cloud Computing 3.2 (2016), pp. 138-150.

[10] Wen, Mi, Kejie Lu, Jingsheng Lei, Fengyong Li, and Jing Li. "BDO-SD: An Efficient Scheme for Big Data Outsourcing with Secure Deduplication." 2015 IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS) (2015), pp. 214-219.

[11] Hur, J., Koo, D., Shin, Y., & Kang, K. (2016). Secure Data Deduplication with Dynamic Ownership Management in Cloud Storage. IEEE Transactions on Knowledge and Data Engineering, 28(11), 3113-3125. doi:10.1109/tkde.2016.2580139.


[12] Kirubakaran, R., Prathibhan, C. M., & Karthika, C. (2015). A cloud based model for deduplication of large data. 2015 IEEE International Conference on Engineering and Technology (ICETECH). doi:10.1109/icetech.2015.7275007.

[13] Leesakul, W., Townend, P., & Xu, J. (2014). Dynamic Data Deduplication in Cloud Storage. 2014 IEEE 8th International Symposium on Service Oriented System Engineering, 321-325. doi:10.1109/sose.2014.46.

[14] Li, J., Li, J., Xie, D., & Cai, Z. (2016). Secure Auditing and Deduplicating Data in Cloud. IEEE Transactions on Computers, 65(8), 2386-2396. doi:10.1109/tc.2015.2389960.

[15] Chen, R., Mu, Y., Yang, G., & Guo, F. (2015). BL-MLE: Block-Level Message-Locked Encryption for Secure Large File Deduplication. IEEE Transactions on Information Forensics and Security, 10(12), 2643-2652. doi:10.1109/tifs.2015.2470221.

[16] A. Venish and K. Siva Sankar, "Study of Chunking Algorithm in Data Deduplication", Springer India, 2016.

[17] Anil Lamba, 2018. "Protecting 'cyber security & resiliency' of nation's critical infrastructure - energy, oil & gas", International Journal of Current Research, 10, (12), 76865-76876.