A SECURED DATA DEDUPLICATION IN CLOUD STORAGE WITH CHUNKING AND HASHING

Vol. 28, No. 13, (2019), pp. 333-343

Saravanan K¹, N.R. Rajalakshmi²

¹Department of Computer Science and Engineering, Anna University Regional Campus, Tirunelveli

²Department of Computer Science and Engineering, Vel Tech Dr. R.R. & Dr. S.R. Technical University

Abstract

Cloud computing offers storage services in which data are stored, managed and archived on remote virtual machines and made available to users over the Internet. Data duplication is the creation of exact copies of the same data, which occupy additional space in cloud storage. Data deduplication reduces this storage overhead by eliminating redundant copies of data within and between files. Chunking is identified as the initial step in data deduplication: it splits a file into chunks. A content-defined chunking algorithm splits a file into variable-length chunks based on the file content. A public key cryptography algorithm is used to encrypt the file and preserve privacy. A novel chunking algorithm finds cut-points and splits files into chunks, which are then passed to a Counting Bloom Filter; the filter returns a hash type, according to which one of two non-cryptographic hash functions is used. The results, presented as graphs, show that the proposed method takes the least time in all cases (hashing time, chunking time and deduplication time) and gives high throughput and deduplication ratio.

Keywords: Cloud storage, Content defined chunking, counting bloom filter, secure deduplication.

1. INTRODUCTION

Cross-user deduplication compares each file or block on the disk with the data of other users on the server and deduplicates it if an identical copy is already available at the server (Harnik et al., 2010). Deduplication can be exploited in two ways: i) as a side channel and ii) as a covert channel. The countermeasure they propose greatly reduces the risk of data leakage and provides higher privacy, at the cost of a slight reduction in bandwidth savings.

When files are encrypted and decrypted using Convergent Encryption (CE), the disadvantage is that decryption time is always higher than encryption time. Vaishali & Varsha (2017) therefore propose a secret sharing scheme that takes an input file and divides it into a number of blocks called shares, which are stored on different servers. A recovery technique then regroups the shares into the file and returns it to the requesting user. The secret sharing scheme thus achieves data reliability and high security.

Jianfeng et al. (2015) developed a convergent encryption method for secure deduplication in the cross-user scenario. Most current solutions lack user traceability for identifying malicious users, and this method is applicable to specific scenarios. Bloom filters are used in many database applications as well as in the networking literature (Andrei and Michael, 2003). Where space is at a premium, a Bloom filter is often the best solution because it avoids keeping an explicit list. Its drawback is that it introduces false positives.

Yucheng et al. (2015) discuss the challenges faced by Rabin-based and MAXP-based CDC. Both have low chunking throughput and high chunk-size variance. MAXP has a high computational cost and has difficulty eliminating low-entropy strings. Rabin requires 1 OR, 2 XORs, 2 shifts and 2 array lookups per byte, plus a conditional branch to judge the chunk boundary. Mihir et al. (2013) proposed a message-locked symmetric encryption method for achieving secure deduplication. It is generally adopted for cross-client deduplication, but it generates a large amount of metadata at both the user and the cloud server.

2. LITERATURE SURVEY

Zhihua et al. (2016) introduce an encryption scheme with a flexible multi-keyword ranked search method that supports dynamic update operations. A Greedy Depth-First Search (GDFS) algorithm is implemented with a tree-based index structure, and the search process is parallelized to reduce the time cost. However, the scheme makes the data owner responsible for manipulating the data and updating it on the cloud server.

Tao et al. (2016) use an interactive protocol based on static and dynamic deduplication decision trees. The static tree is built from static data and cannot be updated, whereas the dynamic tree is a self-generated tree that allows insertion, deletion and modification. The disadvantage of R-MLE2 is that it uses an equality-testing algorithm to identify duplicate ciphertexts, which is inefficient. To overcome this, a scheme combining the static and dynamic approaches, named μR-MLE2, is proposed. Its advantage is that continuous interaction with the client greatly reduces the time complexity of the deduplication equality test on the server side.

Modular exponentiation is considered the resource-expensive operation in discrete-logarithm-based cryptographic protocols and is highly demanding for resource-limited devices (Xiaofeng et al., 2014). Hence, an efficient outsource-secure algorithm is proposed to outsource these operations to a cloud server using concurrent modular exponentiations. Here, the ‘Rand’ subroutine is used to increase computational speed. A simultaneous modular exponentiation is implemented using two modular exponentiations. The advantage of this algorithm is that it is superior in both checkability and efficiency.

Saranya et al. (2014) describe the RSA algorithm, in which two types of keys are used. The public key is known to everyone, whereas the private key is kept secret. RSA is used for secure data transmission. If encryption is performed with a low encryption exponent, the ciphertext can be decrypted easily, so a higher encryption exponent is preferable for achieving higher security.
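The roles of the two keys can be illustrated with a toy textbook RSA computation. This is a minimal sketch and not the implementation used in this work: the primes, the exponent and the message are small illustrative values chosen for readability, and the function names `encrypt` and `decrypt` are hypothetical. Practical RSA uses moduli of 2048 bits or more together with a padding scheme.

```python
# Toy textbook RSA: illustrative only, far too small to be secure.
p, q = 61, 53                 # assumed small primes, chosen for readability
n = p * q                     # public modulus: 3233
phi = (p - 1) * (q - 1)       # Euler's totient of n: 3120
e = 17                        # public exponent, coprime with phi
d = pow(e, -1, phi)           # private exponent (modular inverse, Python 3.8+): 2753


def encrypt(m: int) -> int:
    """Encrypt an integer 0 <= m < n with the public key (e, n)."""
    return pow(m, e, n)


def decrypt(c: int) -> int:
    """Decrypt a ciphertext with the private key (d, n)."""
    return pow(c, d, n)


message = 65
cipher = encrypt(message)     # 2790
recovered = decrypt(cipher)   # 65
```

Anyone holding `(e, n)` can produce `cipher`, but only the holder of `d` can recover `message`.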

AE can achieve a better deduplication ratio by finding the local extreme value in an asymmetric (variable-size) window without any backtracking; hence it requires 1 comparison and 2 conditional branch operations for each scanned byte (Yucheng et al., 2015). The simplicity of AE makes it fast, and it can eliminate low-entropy strings.
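A rough sketch of the asymmetric-extremum rule is given below, assuming the commonly described formulation: the chunk ends a fixed number of bytes (`window`) after a byte that is strictly greater than everything before it in the chunk and not smaller than the `window` bytes after it. The function names, the sample bytes and the convention that the cut-point byte closes the chunk are assumptions made for illustration.

```python
from typing import List


def ae_chunk_length(data: bytes, window: int) -> int:
    """Length of the first chunk of `data` under the AE rule (window >= 1)."""
    if not data:
        return 0
    max_value = data[0]
    max_pos = 0
    for i in range(1, len(data)):
        if data[i] <= max_value:
            # One comparison per byte; cut when the fixed window
            # to the right of the maximum has been fully scanned.
            if i == max_pos + window:
                return i + 1
        else:
            # New local maximum: the fixed window restarts here.
            max_value = data[i]
            max_pos = i
    return len(data)          # no cut-point found: rest of data is one chunk


def ae_chunks(data: bytes, window: int) -> List[bytes]:
    """Split `data` into variable-length chunks, scanning once, no backtracking."""
    chunks = []
    start = 0
    while start < len(data):
        length = ae_chunk_length(data[start:], window)
        chunks.append(data[start:start + length])
        start += length
    return chunks


sample = bytes([1, 5, 2, 3, 9, 1, 1, 1, 4])   # hypothetical input
chunks = ae_chunks(sample, window=3)
# The maximum 9 sits at index 4, so the first chunk ends at index 7.
```

Because the boundary depends only on byte values, inserting bytes early in a file shifts later boundaries with the content rather than with fixed offsets.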

Message-Locked Encryption (MLE) is another cryptographic scheme, in which both the encryption and decryption keys are derived from the message itself. MLE can be viewed as a generalization of Convergent Encryption (CE) (Mihir et al., 2013). The difficulty with convergent encryption is managing the convergent keys as the number of users increases. Convergent encryption is also vulnerable to brute-force attacks.
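The idea of deriving the key from the message can be sketched as follows. This is an illustration under stated assumptions, not a secure construction: the key is taken as the SHA-256 digest of the message, and a toy keystream built from SHA-256 in counter mode stands in for a real symmetric cipher. All function names are hypothetical.

```python
import hashlib


def convergent_key(message: bytes) -> bytes:
    """Derive the key from the message itself: K = H(M)."""
    return hashlib.sha256(message).digest()


def _keystream(key: bytes, length: int) -> bytes:
    """Toy keystream (SHA-256 in counter mode); stand-in for a real cipher."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])


def ce_encrypt(message: bytes):
    """Return (key, ciphertext, tag); the tag is what the server compares."""
    key = convergent_key(message)
    stream = _keystream(key, len(message))
    cipher = bytes(m ^ k for m, k in zip(message, stream))
    tag = hashlib.sha256(cipher).digest()
    return key, cipher, tag


def ce_decrypt(key: bytes, cipher: bytes) -> bytes:
    """XOR with the same keystream recovers the plaintext."""
    stream = _keystream(key, len(cipher))
    return bytes(c ^ k for c, k in zip(cipher, stream))


# Two users encrypting the same content obtain the same ciphertext and tag,
# so the server can deduplicate without seeing the plaintext.
key_a, cipher_a, tag_a = ce_encrypt(b"same file contents")
key_b, cipher_b, tag_b = ce_encrypt(b"same file contents")
```

The same determinism is the source of the brute-force weakness: an attacker who can guess a candidate message can encrypt it and compare tags.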

Block-Level Message-Locked Encryption (BL-MLE) offers dual-level deduplication, handling deduplication at both file level and block level (Rongmao et al., 2015). It enables efficient deduplication of large encrypted files. It uses a small set of metadata and achieves space-efficient storage in the cloud. Because it employs public-key cryptography in constructing the tags, it takes more computation time than the MLE schemes.

Bloom filters are used for summarizing content, for locating resources, and for infrastructure measurement tasks such as recording heavy flows and IP traceback (Andrei and Michael, 2003). They are used in web cache sharing, as a basic routing mechanism in P2P networks and in geographic routing. They are also used in packet routing for loop detection in Icarus, in queue management for Stochastic Fair Blue, and in multicast.

Deduplication can be done by chunking the data or file, generating hash values, and then reducing redundancy. Data deduplication is a type of data compression. Data compression can be classified into delta compression and XDelta compression. For generating hash values, either MD5 or SHA can be used (Naresh et al., 2016). The MD5 algorithm accepts an input data stream of variable length and converts it into a fixed-size cryptographic hash value of 128 bits, whereas SHA1 converts it into a hash value of 160 bits.
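The digest sizes mentioned above can be checked with Python's standard `hashlib` module. The chunk contents below are a hypothetical placeholder.

```python
import hashlib

chunk = b"example chunk contents"          # hypothetical chunk data

md5_digest = hashlib.md5(chunk).digest()
sha1_digest = hashlib.sha1(chunk).digest()

md5_bits = len(md5_digest) * 8             # 128
sha1_bits = len(sha1_digest) * 8           # 160

# The digest length is fixed regardless of the input length.
longer_bits = len(hashlib.md5(chunk * 1000).digest()) * 8   # 128
```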

Byung et al. (2016) describe PHISA (Partial Hash Information String Algorithm), a fixed-size chunking algorithm that uses the MD5 hash function to produce chunk digests. Each MD5 digest has a fixed length of 128 bits. In the PHISA scheme, only the leftmost 12 bits are considered for hashing rather than the whole 128-bit hash value. The 12-bit hash values are concatenated to form the partial hash string, known as the file digest, which is used to measure file similarity. This makes file storage management efficient while achieving high performance. Secure file storage in the cloud using access policies, combined with a deduplication strategy, is introduced by Nishana & Saravanan (2013). Jianfeng et al. (2015) proposed a novel data deduplication approach, based on proof of ownership, that improves user traceability. This interactive randomized convergent encryption scheme uses traceable signatures, which are generated for each user when uploading a data copy to the cloud. When a duplicate copy is identified, the tracing agent discloses the identity of the malicious users, while the identity of other users is kept secret.

3. ARCHITECTURE

As shown in Fig. 1, the user first registers. After a successful login, the user selects a file for uploading; the file and its filename are then encrypted with the RSA public key to preserve privacy. The input file is split into chunks using a content-defined chunking algorithm, either RAM or AE.

The chunks of the input file are given as input to the Counting Bloom Filter (CBF), which fixes the vector size. The CBF also returns a hash type of -1, 0 or 1. If the hash type is 0, the Jenkins hash is called; if it is 1, the Murmur hash is called. Jenkins or Murmur computes the hash values, and the CBF vector is filled. At the same time, the chunks are passed to a checksum function, and the resulting hash values are recorded in the index table. If the file is new, it is decrypted with the RSA private key and stored in the user's folder. When a duplicate file is given as input, its chunks are found and their hash values calculated again; if the values are already present in the table, the file is known to be a duplicate. To increase the confidence that it is a duplicate, it is also checked against the CBF by calculating hash values as before. Once it is confirmed that the file is already present in the drive, the existing file is overwritten, thereby saving space. If multiple users upload the same file, the first user keeps the file, and for the other users the file is placed in a separate folder named duplicate.
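The index-table lookup in this flow can be sketched as below. This is a simplified illustration, not the system's code: an in-memory dictionary stands in for the index table, SHA-1 is assumed as the checksum, and the names `ChunkStore` and `upload` are hypothetical.

```python
import hashlib


class ChunkStore:
    """In-memory stand-in for the index table and the chunk storage."""

    def __init__(self):
        self.index = {}          # fingerprint -> stored chunk bytes
        self.logical_bytes = 0   # bytes submitted by users
        self.stored_bytes = 0    # bytes actually kept

    def put(self, chunk: bytes) -> bool:
        """Store a chunk; return True if it was already present."""
        fingerprint = hashlib.sha1(chunk).hexdigest()
        self.logical_bytes += len(chunk)
        if fingerprint in self.index:
            return True          # duplicate: nothing new is written
        self.index[fingerprint] = chunk
        self.stored_bytes += len(chunk)
        return False


def upload(store: ChunkStore, chunks) -> int:
    """Upload a sequence of chunks; return how many were duplicates."""
    return sum(store.put(chunk) for chunk in chunks)


store = ChunkStore()
first = upload(store, [b"aa", b"bb", b"aa"])    # one repeated chunk
second = upload(store, [b"aa", b"bb", b"aa"])   # same file uploaded again
```

After both uploads the store holds only the two distinct chunks, while the logical size counts all six submitted chunks.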

Fig.1 : Architecture diagram for secure data deduplication

4. MODULES DESCRIPTION

REGISTRATION

A new user enters the required data in the registration form: first name, last name, email id and password. These data are stored in the database for future authentication. MySQL is configured as the database server for maintaining the local files in the system.

FILE UPLOADING

Users can upload different types of files, such as PDF, PPTX, Word documents, text documents and image formats. The files uploaded by each user are stored in a folder named after the corresponding mail id. The database contains details of all users and of the files uploaded by them.

RSA ALGORITHM FOR ENCRYPTION

The public key is used to encrypt the file and can be known by everyone. The filename is passed in to encrypt a file, and the file path is also encrypted. While a file is being encrypted, it is renamed filename_1. The encrypted file is larger than the original and cannot be read; after encryption, its name is changed back to the original name.

RAM & AE CUT-POINTS

Fig. 2: AE working with string of 14 bytes

Fig. 3: RAM working with string of 14 bytes

In the AE algorithm, for example in Fig. 2, the bytes are scanned from 0x89 to 0xEA. The cut-point is identified while scanning 0xEA because it is larger than any byte in the fixed-size window.

In the RAM algorithm, as shown in Fig. 3, 0xA1 is the maximum-valued byte in the fixed-size window. After all the bytes in the fixed-size window have been scanned, RAM compares each subsequently scanned byte with this maximum value. If a scanned byte is found to be bigger than the maximum value, a cut-point is identified. Here 0xEA is the cut-point because it is bigger than 0xA1.
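The RAM rule described above can be sketched as follows. The sketch assumes a strict "bigger than" comparison, following the wording of the description, and treats the cut-point byte as the last byte of the chunk. The function names are hypothetical, and apart from 0xA1 and 0xEA the sample bytes are invented for illustration rather than taken from Fig. 3.

```python
from typing import List


def ram_chunk_length(data: bytes, window: int) -> int:
    """Length of the first chunk of `data` under the RAM rule."""
    if len(data) <= window:
        return len(data)
    # Step 1: maximum byte value inside the fixed-size window.
    max_value = max(data[:window])
    # Step 2: scan the bytes after the window until one exceeds the maximum.
    for i in range(window, len(data)):
        if data[i] > max_value:
            return i + 1     # the cut-point byte closes the chunk
    return len(data)         # no larger byte found: rest is one chunk


def ram_chunks(data: bytes, window: int) -> List[bytes]:
    """Split `data` into variable-length chunks using the RAM rule."""
    chunks = []
    start = 0
    while start < len(data):
        length = ram_chunk_length(data[start:], window)
        chunks.append(data[start:start + length])
        start += length
    return chunks


# Hypothetical byte string: 0xA1 is the window maximum, 0xEA the cut-point.
sample = bytes([0x10, 0xA1, 0x33, 0x20, 0x89, 0xEA, 0x05, 0x07])
chunks = ram_chunks(sample, window=3)
```

With a window of 3 bytes, the first chunk ends at 0xEA and the two remaining bytes form a final short chunk.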

COUNTING BLOOM FILTER

Counting Bloom Filter (CBF) is a space-efficient probabilistic data structure that is generally used to test whether an element is a member of a set. A CBF is constructed as an array of small counters addressed by hash functions. It implements a fast set that supports membership queries. It uses a vector of counters instead of a bit vector, i.e. small counters replace the 0s and 1s. An element is inserted by incrementing the corresponding counters and deleted by decrementing them. The counters should be large enough to avoid overflow; 4 bits per counter are used.
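A minimal counting Bloom filter along these lines is sketched below. It is an illustration only: salted SHA-256 digests are assumed as the hash functions to keep the example self-contained, the vector size and number of hashes are arbitrary defaults, and the class and method names are hypothetical.

```python
import hashlib


class CountingBloomFilter:
    """Counting Bloom filter with saturating 4-bit counters."""

    MAX_COUNT = 15  # largest value a 4-bit counter can hold

    def __init__(self, size: int = 1024, num_hashes: int = 3):
        self.size = size
        self.num_hashes = num_hashes
        self.counters = [0] * size

    def _positions(self, item: bytes):
        # One counter position per hash function, derived by salting the input.
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(bytes([salt]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: bytes) -> None:
        for pos in self._positions(item):
            if self.counters[pos] < self.MAX_COUNT:   # saturate, never overflow
                self.counters[pos] += 1

    def remove(self, item: bytes) -> None:
        if item not in self:
            return
        for pos in self._positions(item):
            # Saturated counters are left alone to avoid false negatives.
            if 0 < self.counters[pos] < self.MAX_COUNT:
                self.counters[pos] -= 1

    def __contains__(self, item: bytes) -> bool:
        # May report false positives, but never false negatives.
        return all(self.counters[pos] > 0 for pos in self._positions(item))


cbf = CountingBloomFilter()
cbf.add(b"chunk-1")
present = b"chunk-1" in cbf
cbf.remove(b"chunk-1")
present_after_removal = b"chunk-1" in cbf
```

Deletion is what the counters buy over a plain bit vector: clearing a bit could erase evidence of other elements, whereas decrementing a counter does not.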

MURMUR HASHING

Murmur is a non-cryptographic hash function used for hash-based lookup. It performs multiplication, rotation and XOR operations. Murmur hash usually works on 4 bytes at a time. It takes the data (in bytes) and a seed as input, and calculates the hash over bytes 0 to length using the provided seed value, which is a random number. Since it passes both the chi-square and avalanche tests, it is efficient and accurate.
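The multiply, rotate and XOR steps can be seen in the following pure-Python sketch of the 32-bit MurmurHash3 variant. It is written from the publicly documented algorithm for illustration; which Murmur variant the system actually uses is not stated, so the choice of the 32-bit version is an assumption.

```python
def murmur3_32(data: bytes, seed: int = 0) -> int:
    """32-bit MurmurHash3 of `data` with the given seed."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    mask = 0xFFFFFFFF

    def rotl(x: int, r: int) -> int:
        return ((x << r) | (x >> (32 - r))) & mask

    h = seed & mask
    length = len(data)
    body_end = length - (length % 4)

    # Body: consume the input 4 bytes at a time (little-endian blocks).
    for i in range(0, body_end, 4):
        k = int.from_bytes(data[i:i + 4], "little")
        k = (k * c1) & mask
        k = rotl(k, 15)
        k = (k * c2) & mask
        h ^= k
        h = rotl(h, 13)
        h = (h * 5 + 0xE6546B64) & mask

    # Tail: mix in the remaining 1 to 3 bytes, if any.
    tail = data[body_end:]
    if tail:
        k = int.from_bytes(tail, "little")
        k = (k * c1) & mask
        k = rotl(k, 15)
        k = (k * c2) & mask
        h ^= k

    # Finalization: force all bits of the state to avalanche.
    h ^= length
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & mask
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & mask
    h ^= h >> 16
    return h


empty_hash = murmur3_32(b"", seed=0)      # 0
seeded_hash = murmur3_32(b"", seed=1)     # 0x514E28B7
```

Changing the seed changes the result for the same input, which is how several independent-looking hash functions can be obtained from one routine.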

JENKINS HASHING

Jenkins is a hash algorithm that generally yields a 32-bit or 64-bit hash value that can be used for hash table lookups. It takes 2n instructions per byte for mixing instead of 3n. In general, keys can be character strings, numbers, bit arrays, etc. In Jenkins lookup3, keys are unaligned variable-length byte arrays with an average key length ranging from 8 bytes to 200 bytes. hashword() hashes a variable-length array of 4-byte integers. For a byte array (such as a character string), hashlittle() is used, which hashes a variable-length key into a 32-bit value. Mixing followed by rotation helps Jenkins achieve maximum avalanche.
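The lookup3 routines are lengthy, so the sketch below instead shows Jenkins's much shorter one-at-a-time hash as an illustration of the add, shift and XOR mixing style. It is not lookup3 and not hashlittle(); the function name is a hypothetical label.

```python
def jenkins_one_at_a_time(key: bytes) -> int:
    """Bob Jenkins's one-at-a-time hash, returning a 32-bit value."""
    mask = 0xFFFFFFFF
    h = 0
    for byte in key:
        # Mix each input byte into the state.
        h = (h + byte) & mask
        h = (h + (h << 10)) & mask
        h ^= h >> 6
    # Final avalanche so that every input bit affects the output.
    h = (h + (h << 3)) & mask
    h ^= h >> 11
    h = (h + (h << 15)) & mask
    return h


digest = jenkins_one_at_a_time(b"a")   # 0xCA2E9442
```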

RSA ALGORITHM FOR DECRYPTION

The private key is used to decrypt the file and should be kept secret. While a file is being decrypted, it is again renamed filename_1. The encrypted file, together with the private key, is used to decrypt the file path and the file. The decrypted file can be read; it has the original file size and the original name.

5. RESULTS

FILE SELECTION

Fig 4: File Selection for uploading

After a successful login, every user can select different types of files, such as PDF, PPT, text documents and Word documents, for uploading to the cloud. PHP is used for front-end web development and MySQL for back-end storage. A server system with an Intel Xeon E3-1225 v5, 16 GB RAM and a 10 TB SATA hard disk is used for experimentation.

Fig 5: RSA encryption of file

The selected file is encrypted with the RSA algorithm. The public key is loaded when encryption starts. The file is renamed during encryption (here, baby has been renamed baby.jpg_1), and after encryption the original name is restored. The file can no longer be read.

RAM Chunks

Fig 6: File split into chunks after finding cut-point

Using the RAM algorithm, cut-points are found so that the file can be split into a number of smaller files called chunks.

Fig 7: RSA decryption of file

Files are decrypted by loading the RSA private key. The file is again renamed baby.jpg_1 during decryption and, once decryption is over, renamed back to its original name. The file can now be opened and read.

Duplicate identification

Fig 8: Uploading same file

Whenever the same user or a different user uploads the same file, file comparison takes place and the file is identified as a duplicate. A warning message is also sent immediately to the list of configured emails.

Fig 9: Duplicates stored separately

A separate folder is created and the duplicate files are moved to it, so that cloud space is saved efficiently. After some time, these duplicate files are archived in the backup storage and subsequently deleted upon approval of the system admin. The proposed method significantly reduces the cloud space used, saving memory as well as indexing effort in cloud storage. The proposed approach saves 20% of cloud storage if it is run daily.

6. PERFORMANCE METRICS

Fig 10: Table data for plotting the graph

The values in the above table are used to plot five graphs; each graph compares all the techniques, and RAM & MURMUR appears to be the best.

Fig 11: File size Vs Hashing time

Even as file size increases, the hashing time of RAM & MURMUR remains lower than that of RAM & Jenkins, AE & MURMUR and AE & Jenkins. AE & Jenkins takes a higher hashing time, which is not preferable. Since RAM & MURMUR takes the least time, it is the best.

Fig 12: File size Vs Chunking time

Even as file size increases, the chunking time of RAM & MURMUR remains lower than that of RAM & Jenkins, AE & MURMUR and AE & Jenkins. AE & Jenkins takes a higher chunking time, which makes it a poor choice because chunking is the first step of data deduplication.

Fig 13: File size Vs Deduplication time

Even as file size increases, the deduplication time of RAM & MURMUR remains lower than that of RAM & Jenkins, AE & MURMUR and AE & Jenkins. AE & Jenkins takes a higher deduplication time, so it is not preferred.

Fig 14: File size Vs Throughput

Fig 15: Data deduplication Ratio

7. CONCLUSION

Thus a user can log in and select different types of files, such as PDF, PPT, Word documents, text documents and JPG images. The file is encrypted using the RSA public key. The encrypted file is then split into chunks with a content-defined chunking algorithm, either RAM or AE, and the chunks are passed to the Counting Bloom Filter, which returns a hash type according to which either MURMUR or JENKINS hashing is used. By comparing the values in the index table, it can be determined whether the file is a duplicate. If it is a unique file, it is saved. If the file is a duplicate, the original file is overwritten, thereby saving space for a single user; if duplicate files exist between multiple users, one user keeps the file and, for the remaining users, the files are removed and moved to a separate folder. Finally, with the RSA private key loaded, files can be decrypted and used by the user. The graphs of file size vs hashing time, file size vs chunking time, file size vs deduplication time, file size vs throughput, and data deduplication ratio clearly indicate that the RAM & MURMUR combination works better in all respects.

References

Anthopoulos, L., & Fitsilis, P. (2010, July). From digital to ubiquitous cities: Defining a common architecture for urban development. In Intelligent Environments (IE), 2010 Sixth International Conference on (pp. 301-306). IEEE.

Bellare, M., Keelveedhi, S., & Ristenpart, T. (2013, May). Message-locked encryption and secure deduplication. In Annual International Conference on the Theory and Applications of Cryptographic Techniques (pp. 296-312). Springer, Berlin, Heidelberg.

Broder, A., & Mitzenmacher, M. (2004). Network applications of bloom filters: A survey. Internet mathematics, 1(4), 485-509.

Chen, R., Mu, Y., Yang, G., & Guo, F. (2015). BL-MLE: block-level message-locked encryption for secure large file deduplication. IEEE Transactions on Information Forensics and Security, 10(12), 2643-2652.

Chen, X., Li, J., Ma, J., Tang, Q., & Lou, W. (2014). New algorithms for secure outsourcing of modular exponentiations. IEEE Transactions on Parallel and Distributed Systems, 25(9), 2386-2396.

Dhokne, M. V. K., & Patil, V. (2017). Secure Data Deduplication System with Tag Consistency. IJETT, 1(2).

Harnik, D., Pinkas, B., & Shulman-Peleg, A. (2010). Side channels in cloud services: Deduplication in cloud storage. IEEE Security & Privacy, 8(6), 40-47.

Jiang, T., Chen, X., Wu, Q., Ma, J., Susilo, W., & Lou, W. (2017). Secure and efficient cloud data deduplication with randomized tag. IEEE Transactions on Information Forensics and Security, 12(3), 532-543.

Kim, B. K., Oh, S. J., Jang, S. B., & Ko, Y. W. (2017). File similarity evaluation scheme for multimedia data using partial hash information. Multimedia Tools and Applications, 76(19), 19649-19663.

Kumar, N., Malik, P., Bhardwaj, S., & Jain, S. C. (2016, December). Comparative analysis of deduplication techniques for enhancing storage space. In Parallel, Distributed and Grid Computing (PDGC), 2016 Fourth International Conference on (pp. 480-487). IEEE.

Nishana Rahim & Saravanan. K, "Secured Image Sharing and Deletion in the Cloud Storage Using Access Policies", International Journal on Computer Science and Engineering (IJCSE), Engg Journals Publications, vol.5, no.4, pages-230, year-2013.

Saranya, V. (2014). Vasumathi, “A Study on RSA Algorithm for Cryptography”. International Journal of Computer Science and Information Technologies, 5(4).

Wang, J., Chen, X., Li, J., Kluczniak, K., & Kutylowski, M. (2015, November). A new secure data deduplication approach supporting user traceability. In Broadband and Wireless Computing, Communication and Applications (BWCCA), 2015 10th International Conference on (pp. 120-124). IEEE.

Xia, Z., Wang, X., Sun, X., & Wang, Q. (2016). A secure and dynamic multi-keyword ranked search scheme over encrypted cloud data. IEEE transactions on parallel and distributed systems, 27(2), 340-352.

Zhang, Y., Jiang, H., Feng, D., Xia, W., Fu, M., Huang, F., & Zhou, Y. (2015, April). AE: An asymmetric extremum content defined chunking algorithm for fast and bandwidth-efficient data deduplication. In Computer Communications (INFOCOM), 2015 IEEE Conference on (pp. 1337-1345). IEEE.

AUTHOR BIOGRAPHY

Dr. K. Saravanan is working as a Senior Assistant Professor in the Department of Computer Science & Engineering at Anna University, Regional Campus, Tirunelveli, Tamilnadu. He has published papers in 12 international conferences and 25 international journals. He has also written 6 book chapters and edited three books with international publishers. He is an active researcher and academician, and a reviewer for many reputed journals from Elsevier, IEEE, etc.

N.R. Rajalakshmi is currently working as an Associate Professor in the School of Computing at Vel Tech University, Tamilnadu, India. She received her doctorate in the field of Cloud Computing from Anna University, Chennai, and completed her Master's degree in Software Engineering in 2007. She has 14 years of teaching and research experience and has published many papers in international journals and conferences.