ENSURING DATA INTEGRITY AND DATA AVAILABILITY IN DISTRIBUTED CLOUD STORAGE

(1)

203 | P a g e

ENSURING DATA INTEGRITY AND DATA

AVAILABILITY IN DISTRIBUTED CLOUD STORAGE

J.Janila

¹

, J. John Jino

²

, J. John Femila

³

1Computer Science and Engineering, Mar Ephraem College of Engineering and Technology, (India)

2Electronics and Communication Engineering, PSN Engineering College, (India)

3Computer Science and Engineering, St.Xavier’s Catholic College of Engineering, (India)

ABSTRACT

Cloud storage enables users to outsource their data to remote cloud and access them wherever and whenever necessary in a pay-as-you-go manner. Though outsourcing data reduces the burden of local data storage management for the user, it indispensably brings new security challenges related to data integrity and data availability. To restore security assurances eroded by cloud environments, researchers have proposed several approaches to client verification of file availability and integrity. But their capabilities of handling confidential data and multi-user auditing remains vague, which inevitably limits their full applicability in cloud storage scenarios. This paper proposes a flexible distributed multi-user integrity and availability auditing mechanism for confidential data, utilizing unique identifiers, Advanced Encryption Standard, Reed Solomon Code, and homomorphic token. The proposed scheme is both efficient and flexible in dealing with confidential data and resilient against byzantine failure, malicious data modification attack, and even server colluding attacks.

Keywords: Cloud Storage, Data Availability, Data integrity, Error Recovery, Homomorphic Token

I. INTRODUCTION

Cloud storage is becoming a long-familiar buzzword at present. It is opening up a new era of data outsourcing. It is a model of networked online storage where data is stored in virtualized pools of storage which are generally hosted by third parties[9]. Several organizations are looking towards cloud storage to outsource their data. Cloud user relinquishes the control over their data to cloud service provider. Though data outsourcing eases the user from the burden of local data storage management, it still obviates their physical control of data integrity and data availability. Hence the users are at the clemency of cloud service provider for the availability and integrity of their outsourced data. This plight becomes gloomier when the outsourced data is more confidential. The trailblazers of cloud storage are Amazon, Salesforce, Windows Azure and the like. Recently these cloud storages have faced a stern downtime due to several internal and external threats leading to outages and data loss incidents persisting for a certain extent[16]. Besides that, the cloud service provider may chuck out seldomly accessed data for memory considerations or may hide the data loss incidents from the cloud user for retaining their reputation. The malicious insider or employee at provider‟s site can steal the confidential data as well [11].

These cases bestow more focus to data integrity and data availability problem of cloud storage.

Traditionally, verifying integrity and availability of outsourced data is the responsibility of the data owner. But the data owner reckons it as a time consuming and abominable task. Hence the cloud users prefer to delegate the control of verification to third party auditors(TPA). Third party auditor performs the data integrity and

(2)

204 | P a g e availability check on behalf of data owner. The whole data file is not divulged to third party auditor while auditing. Only a part of file or some information regarding the file is provided to TPA for auditing. However TPAs should not bring in new vulnerability or threat to cloud security and should not impose additional online burden to cloud user as well. In case of confidential data, auditing should not disclose any part of data or any information that reveals the data to any third party.

Several approaches have been introduced to assist the client verification of data integrity and availability in cloud storage [1], [2], [3], [4], [5], [6], [7], [8], [10], [12]. But their applicability for confidential data is unclear.

Some of these approaches lack dynamic property (insertion, deletion, updation) and multi-server support as well. Also as the users resort to third party auditing for data integrity and availability verification, several approaches pertinent to TPA have been proposed [13], [14], [15]. Though some of these approaches are privacy preserving, they are not meant for confidential data. Besides that, the auditing process carried out by these approaches are intended only for single user. In cloud ambience, there may be possibility of more than one user reaching out the third party auditor for verification. When multiple user access the third party auditor, the above stated approaches miscarry the request for verification and entail ambiguous predicament.

In this paper, we propose a multi-user auditing mechanism which verifies the integrity and availability of confidential data outsourced in the cloud. To avert unauthorized disclosure of confidential data, Advanced Encryption Standard is employed before outsourcing the data. For accomplishing redundancy and error recovery, Reed Solomon Coding is adopted. This assures the data dependability against byzantine failure and so forth. A homomorphic token is utilized to check the integrity of data and to localize the error whenever data corruption has been discovered. Inorder to assist multi-user auditing we rely on unique identifiers which identifies the respective data owner and cloud server of the data while auditing. This excludes the ambiguity when more than one user access the third party auditor for auditing.

The remainder of the paper is structured as following. Section 2 depicts the related work. The multi-user auditing scheme for confidential data is portrayed in Section 3. Section 4 presents the analysis. Section 5 concludes the paper.

II. RELATED WORK

Ateniese et al.[1] depicted a model called provable data possession (PDP). Provable Data Possession(PDP) is a cryptographic technique that permits users to store their data at an untrusted server and have probabilistic assurances that the server possesses the original data. In the PDP model, the data is preprocessed by the client and is moved to the untrusted server for storage, while keeping a small amount of meta-data. The client later requests the server to prove that the stored data has not been tampered with or deleted (without downloading the actual data). Subsequently, Ateniese et al.[2] modified this scheme with somewhat limited dynamism. They have developed a dynamic PDP solution called Scalable PDP. Later on Chris Erway et al.[7] rendered a definitional framework and efficient constructions for dynamic provable data possession (DPDP), which exserted the PDP model to aid updates on the stored data. Reza Curtmola et al.[5] stretched PDP to apply to multiple replicas so that a client that initially stores t replicas can later obtain an assurance that the storage system can develop t replicas, each of which can be used to rebuild the original file data. The major drawback of PDP is that it checks only the possession of data and it does not recover data in case of a failure(does not support data availability).

(3)

205 | P a g e Juels and Kaliski Jr.[8] proposed a scheme called Proof Of Retrievability(POR). A POR is a challenge-response protocol that facilitates a prover (Cloud Service Provider) to exhibit to a verifier (client) that a file F is retrievable, i.e., recoverable without any loss or corruption. This J-K POR Scheme utilizes special blocks hidden among other blocks in the data. During the verification phase, the client requests for randomly selected sentinels and verifies whether they are intact. If the server modifies or deletes parts of the data, then sentinels would also be affected with a certain probability. Bowers et al. [4] portrayed a theoretical framework for the blueprint of PORs. Shacham et al. [12] introduced the first secured proof-of-retrievability schemes. Nevertheless these are bounded use schemes. Dodis et al.[6] constructed the first unbounded-use POR scheme. The difference between PDP and POR is that POR checks the possession of data and it can recover data in case of any failure. The major drawback of POR is that it is focusing only on single server scenario.

Bowers et al.[3] introduced HAIL (High-Availability and Integrity Layer), a distributed cryptographic system that allows a set of servers to prove to a client that a stored file is intact and retrievable. HAIL verifies and reactively reapportions file shares. It is robust against an active, mobile adversary. The weakness of this scheme is it is mainly focusing on static data and it is underengineered as well. Lanxiang Chen [10] proposed an algebraic signature based Remote Data Possession Checking(RDPC) scheme. Algebraic signature can improve efficiency and the running of algebraic signature can achieve tens to hundreds of megabytes per second. It allows verification without the need for the challenger to compare against the original data. Though these above mentioned scheme has several advantages, they are not meant for confidential data.

Cloud users aspire to treat the cloud storage as local storage. Mehul et al.[13] believed that third party auditing would diminish the burden of correctness verification and brought the concept of third party auditing into cloud storage. Wang et al.[14] applied the concept of public auditing for dynamic data. The introduction of public auditing obviated the involvement of the client in auditing, which contributes more to the economies of scale for cloud computing. Wang et al.[15] proposed a privacy preserving auditing mechanism merging the homomorphic linear authenticator with random masking technique. However these approaches do not support multi-user auditing.

III. PROPOSED WORK

In cloud storage, data owners move their data in the cloud and no more hold the data topically. Among these data, the data owners may also store their confidential data in the cloud. Hence the integrity and availability of the ordinary data and/or confidential data stored in the cloud storage should be assured. In our proposed scheme, Advanced Encryption Standard is utilized to protect the confidential data from unauthorized disclosure. To ensure the confidential data integrity and availability, homomorphic token is used. Also to assist multi-user auditing, each token generated by the cloud service provider tagged with unique identifiers for data owners and cloud servers holding the respective data.

3.1 File Confidentiality

As the data to be stored on the cloud storage is confidential data, it should be hidden from unauthorized user.

Hence Advanced Encryption Standard is used here for data confidentiality. AES[17] is a symmetric block cipher. The algorithm commences with an add round key stage succeeded by 9 rounds of four stages and a tenth

(4)

206 | P a g e round of three stages. The decryption algorithm is the inverse of its counterpart in the encryption algorithm. The input block is of 128 bit size and the secret key size may be of 128, 192 or 256.

The four stages of the Advanced Encryption Standard are as follows:

Step 1: Substitute bytes. Every byte in the state is replaced by another one.

Step 2: Shift rows. Every row in the 4x4 array is shifted a certain amount to the left.

Step 3: Mix Columns. This is for mixing up of the bytes in each column separately. The goal is to further scramble up the 128-bit input block.

Step 4: Add Round Key. This is for adding the round key to the output of the previous step.

The tenth round leaves out the Mix Columns stage. The first nine rounds of the decryption algorithm consist of the following:

1. Inverse Shift rows 2. Inverse Substitute bytes 3. Inverse Add Round Key 4. Inverse Mix Columns

3.2 File Distribution Preparation

Reed Solomon codes[18] are used to tolerate multiple failures in distributed storage systems. In cloud data storage, this technique is used to disperse the data file F redundantly across a set of n = m+ k distributed servers.

An (m, k) Reed-Solomon erasure-correcting code is used to create k redundancy parity vectors from m data vectors in such a way that the original m data vectors can be reconstructed from any m out of the m + k data and parity vectors. By placing each of the m + k vectors on a different server, the original data file can survive the failure of any k of the m + k servers without any data loss, with a space overhead of k/m.

Let there be m storage devices, D₁, D₂, ...., D_m, each of which holds l bytes. These are called the “Data Devices”. Let there be k more storage devices C1,C₂,...,C_k , each of which also holds l bytes. These are called the “Checksum Devices”. The contents of each checksum device will be calculated from the contents of the data devices. The goal is to define the calculation of each C_i such that if any k of D₁,D₂,...,D_m,C₁,C₂,...,C_k fail, then the contents of the failed devices can be reconstructed from the non-failed devices.

1) Calculating and Maintaining ChecksumWords

Consider the data and checksum words as the vectors D and C and the functions F_i as rows of the matrix F, then the state of the system adheres to the following equation:

FD=C (1) Where, F is a m x n Vandermonde matrix: f_i,j=j^i-1.

2) Recovering From Failures

To explain recovery from errors, consider the following,

AD=E (2)

Each device in the system as having a corresponding row of the matrix A and the vector E . When a device fails, the failure is reflected by deleting the device‟s row from A and from E . It results in a new matrix A‟and a new vector E‟ that adhere to the equation:

A‟D=E‟ (3)

Suppose exactly m devices fail. Then A‟ is a n x m matrix. Because matrix A‟ is defined to be a Vandermonde matrix, every subset of n rows of matrix A is guaranteed to be linearly independent. Thus, the matrix A‟ is non-

(5)

207 | P a g e singular, and the values of D may be calculated from A‟D=E‟ using Gaussian Elimination. Hence all data devices can be recovered.

3.3 Challenge Token Precomputation

In order to achieve assurance of data storage correctness and data error localization simultaneously, this scheme entirely relies on the precomputed verification tokens. The main idea is as follows: before file distribution the user encrypts the individual vector G^(j) using Advanced Encryption Standard and precomputes a certain number of short verification tokens on individual vector G‟^(j) (j Є {1,2,...,n}), each token covering a random subset of data blocks. Later, when the user wants to make sure the storage correctness for the data in the cloud, he challenges the cloud servers with a set of randomly generated block indices. Upon receiving challenge, each cloud server computes a short “signature” over the specified blocks and returns them to the user. The values of these signatures should match the corresponding tokens precomputed by the user.

Suppose the user wants to challenge the cloud servers t times to ensure the correctness of data storage. Then, the user must precompute t verification tokens for each G‟^(j) (j Є {1,2,...,n}), using a PRF f(.), a PRP Φ, a challenge key k_chal, and a master permutation key K_PRP .

Procedure

Choose parameters l, n and function f,Φ;

Choose the number t of tokens;

Choose the number r of indices per verification;

Generate master key K_PRP and challenge key k_chal; for vector G‟(j), j ←1, n do

for round i← 1, t do

Derive α_i = f_kchal(i) and k⁽ⁱ⁾_prp from K_PRP Compute v_i^(j) = Σ^r_q=1 α_i^q * G‟^(j)[Φkⁱprp (q)]

end for end for

Store all the v_i‟s locally.

end procedure

Fig. 1 Algorithm for Challenge Token Precomputation

3.4 Correctness Verification and Error Localization

When multiple users approach the third party auditor for data verification, the auditor asks the tokens for corresponding data from the cloud server by issuing the unique identifier of user. On receiving the requested tokens, the auditor differentiates among various data‟s token by the owner‟s identifier which is tagged with the token. Error localization is a key prerequisite for eliminating errors in storage systems. This checks whether the data is modified or not. If any modififcation detected, it is localized by the identifier of corresponding server for further recovery.

(6)

208 | P a g e

Fig. 2 Algorithm for Correctness Verification and Error Localization

3.5 File Retrieval and Error Recovery

Whenever the data corruption is detected, the comparison of precomputed tokens and received response values can guarantee the identification of misbehaving server(s). Therefore, the user can always ask servers to send back blocks of the r rows specified in the challenge and regenerate the correct blocks by erasure correction. The newly recovered blocks can then be redistributed to the misbehaving servers to maintain the correctness of storage.

Procedure

% Assume the block corruptions have been detected among the specified r rows;

Download r rows of blocks from servers;

Treat s servers as erasures and recover the blocks.

Resend the recovered blocks to corresponding servers.

end procedure

Fig 3 Algorithm for File Retrieval and Error Recovery

procedure CHALLENGE(i)

Recompute α_i = f_kchal(i) and k⁽ⁱ⁾_prp from K_PRP ; Send { α_i, k⁽ⁱ⁾_prp} to all cloud servers;

Receive from servers:

{ Ri(j)

= Σ^rq=1 αiq

* G‟^(j)[Φkⁱprp (q)]|1<j<n;

ID_owner=ID; ID_server=ID;

}

for (j← m + 1, n) do

R(j)← R(j)- Σ^r_q=1 fk_j(S_Iq,j). αiq, Iq = Φk⁽ⁱ⁾_prp(q)

end for

if ((R_i⁽¹⁾,...R_i^(m)) . P == (R_i^(m+1),...R_i⁽ⁿ⁾)) then Accept and ready for the next challenge.

else

for (j ← 1, n) do if (R_i^(j) != v_i^(j)) then return ID_server end if

end for end if end procedure

(7)

209 | P a g e

IV. RESULTS AND EVALUATION

The extend of achieving integrity and availability of confidential data using multi-user auditing is analysed in two scenarios i.e., with data corruption and without data corruption. When multiple users approach the cloud server for integrity verification, the token received will be distinguished by the unique identifier of the owner.

Also the auditing mechanism used here is made unbounded by merging random numbers with the token.

4.1 Without Data Corruption

The confidential data stored in the cloud was not corrupted. In this scenario, the token received from the cloud server will be same as the original token. Hence the data integrity will be verified.

4.2 With data corruption

The confidential data stored in the cloud was corrupted. In this scenario, the token received from the cloud server will not match the original token. Hence data corruption will be detected. The server in which the data is corrupted is identified by the unique identifier of server tagged with the token. Then using the Reed Solomon decoding, the data which is corrupted will be recovered. Hence data availability will be guaranteed.

V. CONCLUSION

A flexible multi-user confidential data integrity and availability auditing mechanism, utilizing unique identifiers, Advanced Encryption Standard, Reed Solomon Code and homomorphic token is proposed to ensure data integrity and availability in cloud storage. The unique identifiers is introduced to identify the data owner and cloud server individually during verification process which assist multi-user auditing.

Advanced Encryption Standard is used to protect the data from unwanted disclosure. Using Reed Solomon Code, n+k encoding symbols can be produced from n source symbols and is able to recover the source block from any set of encoding symbols only slightly more in number than the number of source symbols.

By utilizing the homomorphic token with distributed verification of erasure coded data, our proposed scheme achieves the integration of storage correctness insurance and data error localization. The proposed scheme is resilient against byzantine failure, malicious data modification attack, and even server colluding attacks.

REFERENCES

[1] Ateniese, R. Burns, R. Curtmola, J. Herring, L. Kissner, Z. Peterson, and D.Song,(Oct. 2007) „Provable Data Possession at Untrusted Stores‟, Proc. 14th ACM Conf. Computer and Comm. Security (CCS ‟07), pp. 598-609.

[2] Ateniese, R.D. Pietro, L.V. Mancini, and G. Tsudik,(2008) „Scalable and Efficient Provable Data Possession‟, Proc. Fourth Int‟l Conf. Security and Privacy in Comm. Netowrks (SecureComm ‟08), pp. 1- 10.

[3] Bowers, A. Juels, and A. Oprea, (2009) „HAIL: A High-Availability and Integrity Layer for Cloud Storage‟, Proc. ACM Conf. Computer and Comm. Security (CCS ‟09), pp. 187-198.

[4] Bowers, A. Juels, and A. Oprea,(2009) „Proofs of Retrievability: Theory and Implementation‟, Proc.

ACM Workshop Cloud Computing Security (CCSW‟09), pp. 43-54.

(8)

210 | P a g e [5] Curtmola, O. Khan, R. Burns, and G. Ateniese, (2008) „MR-PDP: Multiple-Replica Provable Data

Possession‟, Proc. IEEE 28th Int‟l Conf. Distributed Computing Systems (ICDCS ‟08), pp. 411-420.

[6] Dodis, S. Vadhan, and D. Wichs,(Mar. 2009) „Proofs of Retrievability via Hardness Amplification‟, Proc.

Sixth Theory of Cryptography Conf. (TCC ‟09).

[7] Erway, A. Kupcu, C. Papamanthou, and R. Tamassia, (2009) „Dynamic Provable Data Possession‟, Proc.

16th ACM Conf. Computer and Comm. Security (CCS ‟09), pp. 213-222.

[8] Juels and B.S. Kaliski Jr., (Oct. 2007) „PORs: Proofs of Retrievability for Large Files‟, Proc. 14th ACM Conf. Computer and Comm. Security (CCS ‟07), pp. 584-597.

[9] Klaus Julisch and Michael Hall, (2010) „Security and Control in the Cloud‟, Information Security Journal: A Global Perspective(299-309).

[10] Lanxiang Chen, (2012) „Using algebraic signatures to check data possession in cloud storage‟, Future Generation Computer Systems.

[11] Mervat Adib Bamiah, Sarfraz Nawaz Brohi, (2011), „Seven Deadly Threats and Vulnerabilities in Cloud Computing‟, International Journal Of Advanced Engineering Sciences And Technologies Vol No. 9, Issue No. 1, 087 – 090

[12] Shacham and B. Waters, (2008) „Compact Proofs of Retrievability‟, Proc. 14th Int‟l Conf. Theory and Application of Cryptology and Information Security: Advances in Cryptology (Asiacrypt ‟08), pp. 90- 107.

[13] Shah, M. Baker, J.C. Mogul, and R. Swaminathan, (2007) „Auditing to Keep Online Storage Services Honest‟, Proc. 11th USENIX Workshop Hot Topics in Operating Systems (HotOS ‟07), pp. 1-6.

[14] Wang, C. Wang, K. Ren, W. Lou, and J. Li, (2011) „Enabling Public Auditability and Data Dynamics for Storage Security in Cloud Computing‟, IEEE Trans. Parallel and Distributed Systems, vol. 22, no. 5, pp.

847-859.

[15] Wang, S.S.M. Chow, Q. Wang, K. Ren, and W. Lou, (2012), „Privacy- Preserving Public Auditing for Secure Cloud Storage‟, IEEE Trans. Computers, preprint.

[16] www.iwgcr.org

[17] Advanced Encryption Standard, http://csrc.nist.gov/ publications/ fips/fips197/ fips-197.pdf [18] Reed Solomon Codes, http://www.cs.utk.edu/plank/ papers/CS-03-504.html