

SECURE DEDUPLICATION OF DATA IN CLOUD STORAGE

1Babaso D. Aldar, 2Vidyullata Devmane

1,2Computer Engg. Dept, Shah and Anchor Kutchhi Engg. College, Mumbai University.

1babaaldar@gmail.com, 2devmane.vidyullata@gmail.com

Abstract—Cloud storage is a service model in which data is maintained, backed up remotely and made available to users over a network. Deduplication is a method for eliminating duplicate copies of repeated data, and it is used in cloud storage to reduce the amount of storage space and to make data management scalable. To better protect data security, deduplication should be authorized: unlike conventional deduplication systems, the differential privileges of users are considered in the duplicate check, and a user is only allowed to perform the duplicate check for files marked with the corresponding privileges. The system can be further strengthened by encrypting files with differential privilege keys. To protect the privacy of sensitive data while supporting deduplication, the convergent encryption technique is used to encrypt the data before outsourcing; the data is encrypted with a convergent key, which is the hash value of the data itself.

Keywords— Deduplication, Convergent encryption, Authorized duplicate check, Privacy, Authenticity.

I. INTRODUCTION

Storage as a service is a business model in which a cloud service provider rents highly available storage space and massively parallel computing resources to smaller companies or individuals. The advent of cloud storage motivates enterprises and organizations to outsource data storage to third-party cloud providers. An increasing quantity of data is being stored in cloud storage and shared by users with particular privileges, which define the access rights to the stored data. Google Drive is an example of cloud storage that most of us use regularly. A major problem for cloud storage services is the management of this ever-increasing quantity of data. Deduplication is a technique that makes data management scalable in cloud storage. Data deduplication is a data compression method for eliminating duplicate copies of repeated data in storage [1]. Ideally, Information Technology organizations would protect only the distinctive information in their backups: instead of saving everything again and again, only the new or unique information is saved [2].

Data deduplication is also known as single instancing or intelligent compression.

Instead of keeping multiple data copies with the same content in cloud storage, data deduplication eliminates duplicate data by keeping only one physical copy and referring other redundant data to that copy, which improves storage utilization. Previous deduplication systems cannot support differential authorization duplicate check [2], which is important in many applications. In an authorized deduplication system, each user is given a set of privileges during system initialization. Each data file uploaded to the storage server is also bound to a set of privileges that specify which kind of users is permitted to perform the duplicate check and access the file. Every data file at the cloud site has a unique token. The token of a new file is compared with the tokens already stored at the cloud site; if it matches an existing token, the file is a duplicate.

Otherwise, the new file and its token information are stored at the cloud site. The deduplication system should support differential authorization and authorized duplicate check in order to solve the problem of privacy-preserving deduplication of data in cloud storage. In differential authorization [2], each authorized user is able to obtain individual tokens for his/her files to perform duplicate checks based on his/her privileges. In authorized duplicate check [2], the authorized user is able to generate a query for a certain file, according to the privileges he/she owns, with the help of a trusted third party, while the cloud server performs the duplicate check directly and tells the user whether there is any duplicate. The security requirements considered are the security of the file token and the security of the data file. Unauthorized users without proper privileges or the file should be prevented from getting or generating the file tokens for the duplicate check of any file stored at the cloud service provider. A user who has not queried the trusted third party for a file token should not be able to extract any useful information from the token, including the file information or the privilege information. For the protection of the data file, illegitimate users without appropriate privileges or files, including the cloud service provider and the trusted third party, should be prevented from accessing the encrypted file stored at the cloud service provider.

The convergent encryption [2] key is the hash value of the file itself and is used to encrypt that file. Some commercial cloud storage providers, such as Bitcasa, also deploy convergent encryption [1], [2]. Security and privacy are maintained by uploading encrypted data files; when needed, the encrypted files are downloaded and decrypted with a symmetric encryption algorithm. Deduplication can also decrease the amount of network bandwidth required for backup processes, and in some cases it can speed up backup and recovery. Deduplication is actively used by a number of cloud backup providers (e.g. Bitcasa) as well as various cloud service providers (e.g. Dropbox) [1]. According to Jin et al. [2], today's commercial cloud storage services, such as Mozy and Memopal, have been applying deduplication to user data to save maintenance cost.

Conventional encryption requires different users to encrypt their data with their own keys, so identical data copies of different users lead to different ciphertexts, making deduplication impossible [3]. A solution for balancing confidentiality and efficiency in deduplication, called convergent encryption, was described by M. Bellare et al. [4]. It has been proposed to enforce data privacy while allowing deduplication: it encrypts/decrypts a data file with a convergent key, which is obtained by computing the cryptographic hash value of the content of the data copy itself.

II. REVIEW OF LITERATURE

There are different types of data deduplication: inline deduplication, post-process deduplication, source-based deduplication, target-based deduplication and global deduplication. Inline deduplication is performed at the time data is stored on the storage system. It reduces the raw disk space needed in the system and increases CPU overhead in the production environment, but it limits the total amount of data ultimately transferred to backup storage.

This technique allows most of the hard work to be done in RAM, which minimizes I/O overhead. Post-process deduplication refers to systems in which software filters the redundant data from a data set only after it has already been transferred to the storage location.

It writes the backup data into a disk cache before it starts the deduplication process. It does not necessarily write the full backup to disk before starting; once the data starts to hit the disk, the deduplication process begins. Because the deduplication process is separate from the backup process, the data can be deduplicated outside the backup window without degrading backup performance. The following table summarizes the operation of the inline and post-process methods:

TABLE I. SUMMARY OF THE INLINE AND POST-PROCESS METHODS

Source-side deduplication, i.e. client-side deduplication, uses client software to compare new data blocks on the primary storage device with previously backed-up data blocks [5]. It differs from all other forms of deduplication in that duplicate data is identified before it is sent over the network; previously stored data blocks are not transmitted. Source-based deduplication therefore uses less bandwidth for data transmission. This places an extra burden on the client CPU but reduces the load on the network. In Fig. 1 there are three copies each of files A and B; only the unique data (one copy of file A and one copy of file B) is sent over the network to cloud storage, which saves network bandwidth at the cost of higher CPU utilization on the client. This technique requires backup software on each client where it is used; the clients may be in a data center, in a remote data center, or even a laptop. The software determines that a file already exists on the storage server and therefore does not need to be uploaded.

Fig. 1 Source-Based Data Deduplication [5].

Target-based deduplication [5], [6] removes the redundancies from a backup transmission as it passes through an appliance placed between the source and the target. Target deduplication appliances represent the high end of modern deduplication.

These deduplication appliances sit between the network and the storage and provide fast, high-capacity deduplication for the data. Unlike source deduplication, target deduplication does not reduce the total amount of data that needs to be transferred across a WAN or LAN during the backup, but it reduces the amount of storage space required.

Fig. 2 Target-Based Data Deduplication [5].

However, because target deduplication provides faster performance for large (multi-terabyte) data sets, it is typically employed by companies with larger data sets and fewer bandwidth constraints. Data deduplication at the source is best suited to remote offices backing up to a central data centre or the cloud, and it is usually a feature of the backup application. Target deduplication is best suited to the data centre for the reduction of very large data sets and is usually carried out with a hardware appliance. Global data deduplication [6] eliminates redundant data when backing up to multiple deduplication devices. This situation might require backing up data to more than one target deduplication system or, in the case of source deduplication, backing up to multiple backup nodes which themselves back up multiple clients. When data is sent from one node to another, the second node recognizes that the first node already has a copy of the data and does not make an additional copy.

There are two granularities of deduplication: file-level deduplication and block-level deduplication. File-level deduplication eliminates duplicate copies of the same file and is commonly known as single-instance storage. A file that has to be archived or backed up is compared, by checking all its attributes against an index, with files that have already been stored. The index is updated and the file stored only if the file is unique; otherwise only a pointer to the existing stored file is recorded. As a result, only a single instance of the file is saved, and subsequent copies are replaced by a "stub" which points to the original file [7]. For example, Structure-Aware File and Email Deduplication (SAFE) [8] first uses file-level deduplication for unstructured file types such as multimedia and encrypted content, and Duplicate Data Elimination (DDE) is designed for IBM Storage Tank, a heterogeneous, scalable SAN file system [9]. Block-level data deduplication operates at the sub-file level: as the name implies, the file is broken into segments (blocks or chunks) that are examined for redundancy against previously stored information. Fixed-size chunking [7] is conceptually simple and fast; chunks can also be of variable size. The popular approach to detect redundant data is to assign an identifier to each chunk of data using a hash algorithm. Files can be broken into fixed-length blocks or variable-length blocks.
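As an illustration of block-level deduplication with fixed-size chunking, the following sketch splits a file into fixed-length blocks and fingerprints each block with a SHA-256 hash; blocks whose fingerprints are already known are skipped. This is a minimal, hypothetical example: the 4 KB chunk size and the in-memory fingerprint set are assumptions for illustration, not details of any of the surveyed systems.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class FixedSizeChunker {
    private static final int CHUNK_SIZE = 4096;            // assumed fixed block size
    private final Set<String> knownFingerprints = new HashSet<>();

    // Returns the number of chunks that were new (i.e., would actually be stored).
    public int deduplicate(String path) throws IOException, NoSuchAlgorithmException {
        MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
        int newChunks = 0;
        try (FileInputStream in = new FileInputStream(path)) {
            byte[] buffer = new byte[CHUNK_SIZE];
            int read;
            while ((read = in.read(buffer)) != -1) {
                byte[] chunk = Arrays.copyOf(buffer, read);
                String fingerprint = toHex(sha256.digest(chunk));  // chunk identifier via hash
                if (knownFingerprints.add(fingerprint)) {
                    newChunks++;                                   // unseen chunk: store it
                }                                                  // duplicate chunk: keep only a reference
            }
        }
        return newChunks;
    }

    private static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) sb.append(String.format("%02x", b));
        return sb.toString();
    }
}
```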

Metadata Aware Deduplication [9] is an accurate deduplication network backup service which works at both the file level and the block level.

III. PROPOSED SYSTEM

The proposed system eliminates duplicate data from cloud storage. Its objective is to solve the problem of deduplication of data with differential privileges in cloud storage. In this deduplication system, a cloud architecture is introduced to solve the problem. The system supports differential authorization and authorized duplicate check [10] to achieve privacy-preserving deduplication of data in cloud storage. In differential authorization, each authorized user is able to obtain an individual token for his/her file to perform the duplicate check. In authorized duplicate check, the authorized user is able to generate a query for a certain file, according to the privileges he/she owns, with the help of the trusted third party, while the cloud server performs the duplicate check directly and tells the user whether there is any duplicate. To get a file token, the user sends a request to the trusted third party, which also checks the user's identity before issuing the corresponding file token. The authorized duplicate check for this file can then be performed by the user with the cloud server before uploading the file. Based on the results of the duplicate check, the user either uploads the file or runs Proof of Ownership (PoW). An identification protocol ∏ = (Proof, Verify) is also defined, where Proof and Verify are the proof and verification algorithms, respectively. The security requirements considered are the security of the file token and the security of the data files.

A. SECURITY OF FILE TOKEN

For the security of the file token, two aspects are defined: unforgeability and indistinguishability of the file token. Unforgeability of the file token/duplicate-check token [10] means that unauthorized users without appropriate privileges or the file should be prevented from getting or generating the file tokens for the duplicate check of any file stored at the Cloud Service Provider (CSP). Indistinguishability of the file token/duplicate-check token [10] requires that a user who has not queried the trusted third party for a file token cannot extract any useful information from the token, including the file information or the privilege information.

B. SECURITY OF DATA FILES

Unauthorized users without appropriate privileges or files, including the CSP and the trusted third party, should be prevented from accessing the underlying plaintext stored at the CSP.

The convergent encryption technique [11], [12] has been proposed to encrypt the data before outsourcing it to the cloud storage server. M. Bellare et al. [12] design a system, DupLESS, that combines a CE-type (convergent encryption) scheme with the ability to obtain message-derived keys. It encrypts/decrypts a data copy with a convergent key, which is obtained by computing the cryptographic hash value of the content of the data copy. After key generation and data encryption, users retain the keys and send the ciphertext to the cloud server. Since the encryption operation is deterministic and the key is derived from the data content, identical data copies generate the same convergent key and hence the same ciphertext. A user can download the encrypted file with the pointer from the server; it can only be decrypted by the corresponding data owners with their convergent keys. Thus, convergent encryption allows the cloud server to perform deduplication on the ciphertexts, while the proof of ownership prevents unauthorized users from accessing the file.
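The following is a minimal sketch of this idea, assuming AES-128 in CBC mode with both the key and the IV derived from the SHA-256 hash of the file content; it illustrates convergent encryption in general, not the exact DupLESS construction. The deterministic IV is deliberate, since deduplication needs identical files to produce identical ciphertexts, and it reflects the usual security trade-off of convergent encryption.

```java
import java.security.MessageDigest;
import java.util.Arrays;
import javax.crypto.Cipher;
import javax.crypto.SecretKey;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

public class ConvergentCipher {

    // KeyGenerationCE: the convergent key is derived from the file content itself.
    public static SecretKey keyFromContent(byte[] fileContent) throws Exception {
        byte[] digest = MessageDigest.getInstance("SHA-256").digest(fileContent);
        return new SecretKeySpec(Arrays.copyOf(digest, 16), "AES");   // AES-128 key from the hash
    }

    // EncryptionCE: deterministic, so identical files give identical ciphertexts.
    public static byte[] encrypt(SecretKey key, byte[] fileContent) throws Exception {
        Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
        cipher.init(Cipher.ENCRYPT_MODE, key, deterministicIv(key));
        return cipher.doFinal(fileContent);
    }

    public static byte[] decrypt(SecretKey key, byte[] ciphertext) throws Exception {
        Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
        cipher.init(Cipher.DECRYPT_MODE, key, deterministicIv(key));
        return cipher.doFinal(ciphertext);
    }

    // IV derived from the key rather than random: required for deduplication to work.
    private static IvParameterSpec deterministicIv(SecretKey key) throws Exception {
        byte[] d = MessageDigest.getInstance("SHA-256").digest(key.getEncoded());
        return new IvParameterSpec(Arrays.copyOf(d, 16));
    }
}
```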

C. SYSTEM MODEL

There are three entities in the system: the Owner/User, the trusted third party and the Cloud Service Provider. The CSP performs deduplication by checking whether the contents of two files are the same and stores only one of them. The access right to a file is defined based on a set of privileges; the exact definition of a privilege varies across applications. Each privilege is represented in the form of a short message called a token. Each file is associated with some file tokens, which denote the tag with specified privileges. The Owner/User is the entity that wants to outsource his/her data and later access it. The owner uploads only unique data and does not upload any duplicate data, to save upload bandwidth; this data may be owned by the same user or by different users. Users have access to the trusted third party, which aids in performing deduplicable encryption by generating file tokens for the requesting users. Each file is protected with the convergent encryption key and privilege keys to realize authorized deduplication with differential privileges. The Trusted Third Party (TTP) is an entity introduced to facilitate the users' secure usage of the cloud service. Since the computing resources at the data owner/user side are restricted and the cloud server is not a fully trusted party, the TTP provides the data owner/user with an execution environment and infrastructure, working as an interface between the owner/user and the CSP.

The secret keys for the privileges are managed by the TTP, which answers the file token requests from the users; the users' tokens are issued by the TTP. A user cannot extract any useful information from a token, including the file information or the privilege information. The interface offered by the TTP allows users to submit files and queries to be securely stored and computed, respectively. The Cloud Service Provider (CSP) is an entity that provides a data storage service. The CSP provides the data outsourcing service and stores data on behalf of the owner. To reduce the storage cost, the CSP eliminates the storage of redundant data via deduplication and keeps only unique data.

The Owner/User can upload files to and download files from the CSP, while the TTP provides the security for that data; that is, only an authorized person can upload and download files from the cloud server. The tag F_tag of a file F is determined by the owner of file F. A secret key kp is bound to a privilege p to generate a file token. Let µ′F,p = TokenGenerate(F_tag, kp) denote the token of F that may only be accessed by the owner/user with privilege p; in other words, the token µ′F,p can only be computed by the owner. If a file has been uploaded by a user with a duplicate token µ′F,p, then a duplicate check sent from another user will be successful if and only if he/she also has the file F and the privilege p. Such a token generation function could easily be implemented as H(F, kp), where H( ) denotes a cryptographic hash function.
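A token generation function of the form H(F, kp) could, for instance, be realized as an HMAC of the file tag under the privilege key, as sketched below; the class and method names, and passing the key as a raw byte array, are illustrative assumptions rather than the prototype's exact interface.

```java
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

public class TokenGenerator {
    // Token for tag F_tag under privilege key kp, i.e. one possible instantiation of H(F, kp).
    public static byte[] tokenGenerate(byte[] fileTag, byte[] privilegeKey) throws Exception {
        Mac hmac = Mac.getInstance("HmacSHA1");                // HMAC-SHA-1, as used by the TTP prototype
        hmac.init(new SecretKeySpec(privilegeKey, "HmacSHA1"));
        return hmac.doFinal(fileTag);                          // token µ′F,p
    }
}
```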

(i) File Uploading

The data owner of a file F wants to upload that file to the cloud server with privilege p. First, he/she computes the tag using µ′′F,p = TagGenerate(F).

Fig. 3 Architecture of Secure deduplication of Data

After calculating the tag value, the owner of the file sends it to the TTP to compute the token. This token value is unique for a file containing unique data. The generated token is sent by the owner to the cloud service provider for the duplicate check. If a duplicate is found by the CSP, the user runs the proof of ownership for this file with the CSP; if the proof passes, the user is given a pointer that allows him/her to access the file. Otherwise, if no duplicate is found, the user computes the encrypted file CF = EncryptionCE(kc, FILE) with the convergent key kc = KeyGenerationCE(FILE) and uploads the encrypted file along with the file ID, token and privileges to the cloud server. The convergent key kc is stored by the user locally.
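The upload steps above can be summarized in a client-side sketch like the following. The helper methods are abstract placeholders for the Owner/User operations described in Section IV, and the ConvergentCipher class is the sketch given earlier; this is an illustration under those assumptions, not a transcription of the prototype.

```java
// Hypothetical client-side upload flow for a file with a given privilege.
public abstract class UploadClient {

    public void upload(byte[] file, String fileId, String privilege) throws Exception {
        byte[] tag   = tagGenerate(file);                 // µ''F,p = TagGenerate(F), e.g. SHA-1 of F
        byte[] token = requestToken(tag, privilege);      // TTP computes the token for this tag/privilege

        if (duplicateCheck(token)) {
            // Duplicate found: run proof of ownership instead of re-uploading the content.
            if (runProofOfOwnership(file)) {
                storePointer(fileId, token);              // CSP hands back a pointer to the stored copy
            }
        } else {
            // No duplicate: encrypt with the convergent key and upload.
            javax.crypto.SecretKey kc = ConvergentCipher.keyFromContent(file); // kc = KeyGenerationCE(FILE)
            byte[] cf = ConvergentCipher.encrypt(kc, file);                    // CF = EncryptionCE(kc, FILE)
            uploadEncrypted(fileId, cf, token, privilege);                     // FileUpReq(FileID, F, Token)
            saveKeyLocally(fileId, kc);                                        // the user keeps kc locally
        }
    }

    protected abstract byte[] tagGenerate(byte[] file) throws Exception;
    protected abstract byte[] requestToken(byte[] tag, String privilege) throws Exception;
    protected abstract boolean duplicateCheck(byte[] token) throws Exception;
    protected abstract boolean runProofOfOwnership(byte[] file) throws Exception;
    protected abstract void storePointer(String fileId, byte[] token) throws Exception;
    protected abstract void uploadEncrypted(String fileId, byte[] cf, byte[] token, String privilege) throws Exception;
    protected abstract void saveKeyLocally(String fileId, javax.crypto.SecretKey kc) throws Exception;
}
```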

(ii) File Downloading

When a user wants to download a file, he/she first obtains the privilege key from the TTP. After getting the privilege key, the user requests the file from the cloud server.

Fig. 4 Procedure for file upload.

Fig. 5 Procedure for file download

The cloud server verifies the user against the privileges. If the user is entitled to retrieve the file, the CSP provides the encrypted file to the user; otherwise it aborts the communication. If the user succeeds in downloading the encrypted file, he/she decrypts it using the convergent key stored locally and saves the decrypted file on the local disk.

IV. IMPLEMENTATION

We implement a prototype of the proposed secure deduplication system, in which we model the three entities as separate JSP programs. An Owner/User program models the data users carrying out the file upload process. A TTP program models the trusted third party, which handles the file token computation. A CSP program models the cloud storage server, which stores and deduplicates files. We implement the cryptographic operations of hashing and encryption with the OpenSSL library [13]. We also implement the communication between the entities over HTTP; thus, users can issue HTTP POST requests to the servers (a sketch of such a request appears after the method list below). Our implementation of the Owner/User provides the following methods to support token generation and deduplication along the file upload process.

• FileTag(F): The tag of a file F is generated using the SHA-1 algorithm.
• TknReq(Tag, UserID): A token request is generated by providing the tag and the user ID of a user.
• DupChkReq(Token): Used to check whether the file is already duplicated on the cloud server.
• FileEncryptAlgo(F): The file F is encrypted with the AES algorithm. The encryption uses the convergent key, which is the hash value of the original file generated with SHA-256.
• FileUpReq(FileID, F, Token): If the file is not a duplicate on the cloud server, requests that the file be stored in the cloud by providing the file ID, the encrypted file, the token and the privilege key.
• ShareTknReq(Token, Privilege): The user requests the trusted third party to share the file token with other users. The other users receive the file token according to their privileges; using this token and privilege, they can check for the duplicate file or access the file.
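Since the entities communicate over HTTP POST, a duplicate-check request from the Owner/User could be issued roughly as follows; the endpoint path and form field name are assumptions for illustration, not the prototype's actual URLs.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class DupChkClient {
    // Sends the token to the CSP and returns its reply ("File Duplicate" / "No File Duplicate").
    public static String dupChkReq(String cspUrl, String tokenHex) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(cspUrl + "/dupchk"))                        // hypothetical endpoint
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString("token=" + tokenHex))
                .build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }
}
```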

Our implementation of the TTP includes corresponding request handlers for the token generation.

• TknGen(Tag, UserID): The TTP takes the tag value received from the owner/user and generates the token using the HMAC-SHA-1 algorithm.
• ShareTkn(Token, Priv): On the owner/user's request, the TTP shares the token among the specified users.

Our implementation of the CSP provides deduplication and data storage, and maintains a map between existing files and their associated tokens.

• DupChk(Token): Compares the generated token against the existing tokens for the duplicate check. If a duplicate is found, it replies to the owner/user with "File Duplicate"; otherwise it replies with "No File Duplicate".
• FileStore(FileID, F, Token): Stores the encrypted file F, the file ID and the token, and updates the map.
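On the CSP side, the map between existing files and their tokens can be kept, in the simplest case, as an in-memory hash table, as in the hedged sketch below; a production system would persist this index alongside the stored files.

```java
import java.util.HashMap;
import java.util.Map;

public class CspIndex {
    // Maps a token (hex-encoded) to the ID of the stored encrypted file.
    private final Map<String, String> tokenToFileId = new HashMap<>();

    // DupChk(Token): true if a file with this token is already stored.
    public synchronized boolean dupChk(String token) {
        return tokenToFileId.containsKey(token);
    }

    // FileStore(FileID, F, Token): record the mapping after the encrypted file is written to storage.
    public synchronized void fileStore(String fileId, String token) {
        tokenToFileId.put(token, fileId);
    }

    // Returns the file ID for a token, e.g. when handing back a pointer after a successful PoW.
    public synchronized String lookup(String token) {
        return tokenToFileId.get(token);
    }
}
```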

(A) PROOF OF OWNERSHIP (POW)

The notion of proof of ownership addresses the problem of using a small hash value as a proxy for the entire file in data deduplication [10]. PoW is a protocol that enables users to prove their ownership of data copies to the storage server; it is implemented as an interactive algorithm run by a user and a storage server. The storage server derives a short value ∏ from a data copy M. The user needs to send ∏′ and run a proof algorithm with the storage server. The formal security definition for PoW roughly follows the threat model in a content distribution network, where an attacker does not know the entire file but has accomplices who have the file. To solve the problem of using a small hash value as a proxy for the entire file, the client must prove to the server that it indeed has the file. If ∏ = ∏′, the storage server allows access; otherwise it aborts the communication.
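The ∏ = ∏′ comparison can be illustrated with the toy sketch below, in which the server derives a short value from its copy and a fresh nonce, and the client must derive the same value from the copy it claims to own. This is only a simplified illustration of the comparison step under these assumptions; practical PoW protocols are designed so the server does not have to reprocess the whole file for every challenge.

```java
import java.security.MessageDigest;
import java.security.SecureRandom;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

public class SimplePow {
    // Server side: derive the short value ∏ from its copy M and a fresh random nonce.
    public static byte[] serverChallenge(byte[] storedCopy, byte[] nonce) throws Exception {
        return shortValue(storedCopy, nonce);
    }

    // Client side: derive ∏′ from the copy it claims to own, using the server's nonce.
    public static byte[] clientResponse(byte[] claimedCopy, byte[] nonce) throws Exception {
        return shortValue(claimedCopy, nonce);
    }

    // Server accepts ownership only if ∏ = ∏′ (constant-time comparison).
    public static boolean verify(byte[] pi, byte[] piPrime) {
        return MessageDigest.isEqual(pi, piPrime);
    }

    public static byte[] freshNonce() {
        byte[] nonce = new byte[16];
        new SecureRandom().nextBytes(nonce);
        return nonce;
    }

    private static byte[] shortValue(byte[] data, byte[] nonce) throws Exception {
        Mac hmac = Mac.getInstance("HmacSHA256");
        hmac.init(new SecretKeySpec(nonce, "HmacSHA256"));
        return hmac.doFinal(data);
    }
}
```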

V. EVALUATION

We conducted testbed experiments on this prototype. Our evaluation covers the file token generation, share token generation, convergent encryption and file upload steps. We evaluate the overhead by varying different factors, including the file size and the number of stored files. We conducted the experiments on three machines equipped with an Intel Core 2 Duo 1.66 GHz CPU and 1.5 GB RAM, running the Windows 7 Home Premium 64-bit operating system and connected by a 100 Mbps LAN. The upload process is divided into six steps: tag generation, token generation, duplicate file check, encryption of the file, transfer of the encrypted file and sharing of the token with users. For each step, we recorded the starting and ending time and obtained the difference between them. Fig. 6 shows the average time taken by each step. To find the time spent on the different steps, we chose files of different sizes; we uploaded 50 files and recorded the time required for each step.
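Recording start and end times around each step can be done, in a Java prototype, with a small helper such as the one below; the method name and the use of System.nanoTime are illustrative assumptions, not the exact instrumentation used in the experiments.

```java
public class StepTimer {
    // Measures the wall-clock time of one upload step, e.g. tag generation or encryption.
    public static <T> T timed(String stepName, java.util.concurrent.Callable<T> step) throws Exception {
        long start = System.nanoTime();
        T result = step.call();
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(stepName + " took " + elapsedMs + " ms");
        return result;
    }
}
```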

The time spent on tag generation, file encryption and file transfer increases linearly with the file size, because these steps involve the actual file content and file I/O.


Fig. 6 Time required for different file sizes

Token generation, the duplicate check and token sharing use only metadata (i.e. data about the file) for their computation, so their time increases slowly compared with the other steps. Duplicate checking is done against a hash table of tokens and, to avoid collisions, the token search was linear.

VI. CONCLUSION

The notion of secure data deduplication is to protect and deduplicate files using the differential privileges of the users. Owners/users deduplicate files by calculating the tag value, and tokens are calculated by the trusted third party. Using the token concept, our system easily identifies duplicate files and stores only unique files. If a person wants to upload a file that has already been stored by another person, the cloud server runs the proof of ownership protocol to verify ownership and returns a reference to the stored file. To protect uploaded files from internal and external attacks, they are encrypted with the convergent encryption technique; the convergent key used to encrypt a file is derived from the file itself. To avoid brute-force attacks, only authorized users are allowed to issue duplicate check requests. As a proof of concept, we implemented a system according to the proposed secure duplicate check scheme and conducted testbed experiments on it. Secure deduplication is currently done at the file level, but it could also be performed at the block level; if two files are duplicates, their blocks are also the same.

REFERENCES

[1] X. Alphonseinbaraj, "Authorized data deduplication check in hybrid cloud with cluster as a service", November 2014.
[2] Jin Li, Yan Kit Li, Xiaofeng, P. Lee, and W. Lou, "A Hybrid Cloud Approach for Secure Authorised Deduplication", IEEE Transactions on Parallel and Distributed Systems, 2014.
[3] Boga Venkatesh, Anamika Sharma, Gaurav Desai, Dadaram Jadhav, "Secure Authorised Deduplication by Using Hybrid Cloud Approach", November 2014.
[4] Pasquale Puzio, Refik Molva, Melek Onen, Sergio Loureiro, "ClouDedup: Secure Deduplication with Encrypted Data for Cloud Storage".
[5] T. Y. J. NagaMalleswari, D. Malathi, "Deduplication Techniques: A Technical Survey", International Journal for Innovative Research in Science & Technology, December 2014.
[6] Pooja S. Dodamani, Pradeep Nazareth, "A Survey on Hybrid Cloud with De-Duplication", International Journal of Innovative Research in Computer and Communication Engineering, December 2014.
[7] Usha Dalvi, Sonali Kakade, Priyanka Mahadik, Arati Chavan, "A Secured and Authenticated Mechanism for Cloud Data Deduplication Using Hybrid Clouds", International Journal of Engineering Research and Review, December 2014.
[8] Kim, Sejun Song, Baek-Young Choi, "SAFE: Structure-Aware File and Email Deduplication for Cloud-based Storage Systems".
[9] Deepak Mishra, Dr. Sanjeev Sharma, "Comprehensive study of data de-duplication", International Conference on Cloud, Big Data and Trust 2013, Nov. 13-15.
[10] Jin Li, Xiaofeng Chen, M. Li, J. Li, P. Lee, and W. Lou, "Secure Deduplication with Efficient and Reliable Convergent Key Management", IEEE Transactions on Parallel and Distributed Systems, June 2014.
[11] P. Gokulraj, K. Kiruthika Devi, "Deduplication Avoidance Using Convergent Key Management in Cloud", International Journal On Engineering Technology and Sciences, Volume I, Issue III, July 2014.
[12] M. Bellare, S. Keelveedhi, and T. Ristenpart, "Message-locked encryption and secure deduplication", IACR Cryptology ePrint Archive, 2012.
[13] OpenSSL Project. http://www.openssl.org/, April 2015.
