Secure Deduplication of Encrypted Data without Additional Servers
Jian Liu
Aalto University
[email protected]
N. Asokan
Aalto University and University of Helsinki
[email protected]
Benny Pinkas
Bar Ilan University
[email protected]
Abstract
Encrypting data on the client-side before uploading it to cloud storage is essential for protecting users’ privacy. How- ever client-side encryption is at odds with the standard prac- tice of deduplication in cloud storage services. Reconciling client-side encryption with cross-user deduplication has been an active research topic. In this paper, we present the first secure cross-user deduplication scheme that supports client- side encryption without requiring any additional independent servers. We demonstrate that our scheme provides better se- curity guarantees than previous work. We also demonstrate that its performance is reasonable via simulations using re- alistic datasets.
1. INTRODUCTION
Cloud storage is a service that enables people to store their data remotely with a storage provider. With a rapid growth in user base, storage providers tend to save storage costs via cross-user deduplication: if two clients want to up- load the same file, the storage server detects the duplication and stores only a single copy. Deduplication achieves high storage savings [20] and is adopted by many cloud service providers. It is also adopted widely in backup systems for enterprise workstations.
Clients who care about privacy want to encrypt their data on the client-side using semantically secure encryption mech- anisms. However, na¨ıve application of encryption thwarts deduplication. Reconciling deduplication and encryption is an active research topic. The current solutions are either dependent on the aid of additional independent servers [3, 23, 25] or only suitable for Peer-to-Peer (P2P) systems [13].
Reliance on independent servers is a strong assumption that is very difficult to meet in commercial contexts, and since most of the popular cloud storage services (e.g. Dropbox, Google Drive) are client-server systems, turning them into P2P systems is unrealistic.
In this paper, we present the first single-server scheme for secure cross-user deduplication with client-encrypted data.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
Copyright 20XX ACM X-XXXXX-XX-X/XX/XX ...$15.00.
Our scheme allows two clients uploading the same file to the cloud storage to securely agree on the same file encryption key. We do this by building upon a well-known crypto- graphic primitive known as password authenticated key ex- change (PAKE) [6]. A PAKE allows two parties to agree on a session key if and only if they share a short secret (“pass- word”), which might have low entropy. In our deduplication scheme, PAKE protocol runs are facilitated by the storage server. It avoids the need for any additional independent servers and thus can be adopted easily into popular cloud storage services. In terms of security, our scheme makes use of per-file rate limits (namely, limiting the number of queries allowed per file), rather than per-client rate limits (limiting the number of queries allowed for each client, as was used in [3]). This change makes the scheme significantly more resistant against brute-force guessing attacks by colluding clients or a malicious storage server that attempts to mas- querade as multiple clients.
We make the following contributions:
• presenting the first single-server scheme that si- multaneously enables the server to do cross-user dedu- plication and clients to encrypt their data on client-side using semantically secure encryption keys. (Section 5);
• showing that our scheme has better security guar- antees than previous work (Section 6). As far as we know, our proposal is the first scheme that can prevent online brute-force guessing attacks by malicious clients or storage server, without the aid of an identity server;
• demonstrating, via simulations with realistic datasets, that our scheme provides its privacy benefits while re- taining a utility level (in terms of deduplication effec- tiveness) on par with standard industry practice.
Our use of per-file rate limits implies that an incom- ing file will not be checked against all existing files in storage. Surprisiingly, our scheme still achieves high deduplication effectiveness (Section 7); and
• implementing our scheme to show that it incurs min- imal overhead (computation/communication) com- pared to traditional deduplication schemes (Section 8).
2. PRELIMINARIES 2.1 Deduplication
Deduplication strategies can be categorized according to
the basic data units they handle. One category is file-level
deduplication which eliminates redundant files [12]. The
other is block-level deduplication, in which files are segmented
into blocks and duplicated ones are eliminated [24]. In this paper we use the term ‘file’ to refer both files and blocks.
Deduplication strategies can also be categorized according to the side in which deduplication happens. In server-side deduplication, all files are uploaded to the storage server S, which then deletes the duplicated files. Clients C are un- aware of deduplication. This strategy saves storage but not bandwidth. In client-side deduplication, C checks the ex- istence of its files on S (by sending cryptographic hashes) before uploading them. Duplicates are not uploaded to the S. This strategy saves both storage and bandwidth.
The effectiveness of deduplication is customarily expressed by the deduplication ratio, defined as the “the number of bytes input to a data deduplication process divided by the number of bytes output” [14]. In the case of a cloud storage service, this is the ratio between total size of files uploaded by all clients and the total storage space used by the service.
Dedpulication percentage (sometimes referred to as “space reduction percentage”) [14, 17] is 1 −
deduplication ratio1. A perfect deduplication scheme will detect all duplicates. Hence its deduplication percentage will be close to 100%.
The level of deduplication achieveable depends on a num- ber of factors. In common business settings, deduplication ratios in the range of 4:1 (75%) to 500:1 (99.8%) are typi- cal. Wendt et al [26] suggest that a figure in the range 10:1 (90%) to 20:1 (95%) is a realistic expectation.
Although deduplication benefits storage providers (and hence, indirectly, their users), it also constitutes a privacy threat for users. For example, a cloud storage server that supports client-side, cross-user deduplication can be exploited as an oracle that asks queries of the form “did anyone upload this file?”. An adversary A can do so by uploading a file and observing whether deduplication takes place [17]. For a pre- dictable file that is known to be drawn from a small space, A can construct all possible files from that space, upload them and observe which causes deduplication so that A can iden- tify the target file. Harnik et al. [17] propose a randomized threshold approach to address such online brute-force guess- ing attacks. Specifically, for each file F , S keeps a random threshold t
F(t
F≥ 2) and a counter c
Fthat indicates the number of Cs that have previously uploaded F . Client-side deduplication happens only if c
F≥ t
F. Otherwise, S does a server-side deduplication.
2.2 Hash Collisions
A hash function H: Φ → {0, 1}
nis a function that deter- ministically maps a binary string in Φ of arbitrary length to a binary string h of fixed length n, where h is called the hash value or simply the hash, and n is called the hash length of H. A cryptographic hash function can map distinct ele- ments to a hash value such that it is infeasible to invert the function and find collisions (namely, the function is collision resistant).
If a hash function is not collision resistant, its collisions can be calculated as
|Φ|2n, on condition that the hash values are evenly distributed across the available range. While a hash function with a long hash length has small collisions, a hash function with a small hash length will have high col- lisions. We will use such a short hash to protect clients’
privacy. For a predictable file F , drawn from a small space Φ, if A knows H(F ), she can easily perform an offline brute- force guessing attack. But if A only knows SH(F ), where SH() is a short hash function, it is hard for A to guess the
exact content of F due to the high collision rate of SH().
2.3 Additively Homomorphic Encryption
A public key encryption scheme is additively homomor- phic if given two ciphertexts c
1= Enc(pk, m
1; r
1) and c
2= Enc(pk, m
2; r
2) (i.e., encryptions of the messages m
1, m
2using the same public key pk but different random values r
1, r
2), it is possible to efficiently compute Enc(pk, m
1+ m
2; r) with independent randomness r, even without knowl- edge of the corresponding private key. Examples of such encryption are Paillier’s encryption [21], or El Gamal en- cryption [15] where addition is done in the exponent.
In this paper, we use E()/D() to denote symmetric en- cryption/decryption, and use Enc()/Dec() to denote addi- tively homomorphic encryption/decryption. We abuse the notation and use Enc(pk, m) to denote the random variable induced by Enc(pk, m; r) where r is chosen uniformly at ran- dom. In addition, we use ⊕ and to denote homomorphic addition and subtraction respectively.
2.4 Password Authenticated Key Exchange
Password-based protocols are commonly used for user au- thentication. However, such protocols are vulnerable to of- fline brute-force guessing attacks (also referred to as “dic- tionary attacks”) since users tend to choose passwords with relatively low entropy that are hence guessable. Bellovin and Merritt [6] were the first to propose a password authen- ticated key exchange (PAKE) protocol, in which an attacker making a password guess cannot verify the guess without an online attempt to authenticate itself with the password. The protocol is based on using the password as a symmetric key to encrypt the messages of a standard key exchange protocol (e.g., Diffie-Hellman [11]), so that two parties with the same password will successfully generate a session key without re- vealing their passwords.
Following this seminal work, many protocols were pro- posed to improve PAKE in several aspects, e.g., describ- ing provably secure protocols [5, 7], weakening the assump- tion (i.e., working in standard model without random ora- cles) [16, 2], achieving a stronger proof model [10, 9] and improving the round efficiency [18, 5, 19].
We use PAKE as a building block of our deduplication protocol. The ideal functionality F
pakeis shown in Fig- ure 1. Alice and Bob hold passwords pw
aand pw
brespec- tively. F
pakeoutputs the same key to both of them if and only if pw
a= pw
b. Otherwise, it outputs two independent random keys to Alice and Bob.
We use PAKE as a building block for our deduplication protocol. The chosen PAKE protocol should satisfy the fol- lowing requirements:
• Implicit key exchange: It must be implicit, meaning that at the end of the protocol each party outputs a key but neither party learns if the passwords matched or not. In fact, many PAKE protocols were designed to be explicit where parties will know whether the ex- change succeeded or not at the end of the protocol.
• Single round: It must be single-round so that it can be easily facilitated by the storage server.
• Pseudo-random keys: It must guarantee that if the
passwords are different than neither party can distin-
guish the key output by the other party from a random
key.
Inputs:
• Alice’s input is a password pw
a;
• Bob’s input is a password pw
b. Outputs:
• Alice’s output is k
a;
• Bob’s output is k
b.
where if pw
a= pw
bthen k
a= k
b, and if pw
a6= pw
bthen Bob (resp. Alice) cannot distinguish k
a(resp. k
b) from a random string of the same length.
Figure 1: The ideal functionality F
pakefor password authenticated key exchange.
• Concurrent executions: It must allow multiple PAKE instances to run in parallel. There are two common security notions for PAKE under concurrent execu- tions. One notion is “concurrent PAKE”, defined by [5, 7]. The other stronger notion is that of “UC secure PAKE” [10], which guarantees security for composition with arbitrary protocols, and with arbitrary, unknown and possibly correlated password distributions. Pro- tocols secure according to the notion of UC security (e.g. in [10]) are currently much less efficient than protocols secure according to the notion of concurrent PAKE. We therefore use a concurrent PAKE protocol in our work.
Our deduplication scheme uses the SPAKE2 protocol of Ab- dalla and Pointcheval [1], which is described in Figure 2.
This protocol is secure in the concurrent setting, in the ran- dom oracle model. It is very efficient, requiring each party to compute just three exponentiations, and send just a sin- gle group element to the other party. Theorem 5.1 in [1]
states that this protocol is secure assuming that the com- putational Diffie-Hellman problem is hard in the group used by the protocol.
3. PROBLEM STATEMENT 3.1 General Setting
Our generic setting for cloud storage systems consists of a storage server S and a set of clients (Cs) who store their files on S. Cs never communicate directly, but exchange messages with S, and S processes the messages and/or for- wards them as needed. In addition, third parties (T Ps) can be introduced to the system to assist deduplication. For ex- ample, T Ps can be used for key generation [3], keeping track of file popularities [25] and even encryption/decryption [23].
But independent T Ps are expensive to deploy in practice and can be bottlenecks for both security and performance.
Introducing a T P which is independent from S is often un- realistic in commercial setting.
We assume that parties communicate through secure chan- nels in the system, so that an adversary A cannot eavesdrop and tamper any data in the channel. Thus we assume that S can authenticate and authorize Cs, and Cs can authenticate S as well.
We introduce new notations as needed. A summary of
Inputs:
• The input of the uploading client C is F ;
• Each existing client C
ihas inputs F
iand k
Fi;
• S’s input is a set of existing encrypted files {E(k
Fi, F
i)}.
Outputs:
• C learns a key k
Fwith which to encrypt its file. If F is identical to an existing file F
ithen k
F= k
Fi. Otherwise k
Fis random;
• Each C
i’s output is empty;
• S learns the encryption of F with the key k
F. Note that if there exists F
i= F then this en- cryption is identical to the encryption of F
i. Figure 3: The ideal functionality F
dedupof dedupli- cating encrypted data.
notations appears in Appendix A.
3.2 Ideal Model
We define an ideal model that attempts to capture the ideal functionality F
dedupof deduplicating encrypted data.
The model has three types of participants: the storage server S, an uploading client C attempting to upload a new file, and existing clients {C
i} who have already uploaded a file. The ideal functionality is shown in Figure 3.
A deduplication system for encrypted files is secure if it implements F
dedup. In particular, no participant learns more information than is defined in the output of the functional- ity. Our system implements this functionality with the fol- lowing relaxation: S learns a short hash of F (say, 13 bits long), and other clients that have previously uploaded files with this hash value might learn that a new file with this short hash is being uploaded. In a large scale system this short hash agrees with many files and therefore leaks limited information about the uploaded file.
3.3 Design Goals
Threat Model: An adversary A might control any subset of Cs of the system. It might also have control over S: it can gain access to the data stored at S, perform offline brute- force guessing attacks, and might even mount active attacks by deviating from the deduplication protocol and potentially masquerading as Cs of the system. A may also control any third party T P used by the deduplication scheme.
When client-side deduplication takes place then A mas-
querading as C uploading a file, and wishing to identify
whether a specific file was already uploaded by a different
C, can apply an online brute-force attack as described by
Harnik et al. [17]. The randomized approach of [17], de-
scribed in Section 2.1, can mitigate that attack. However,
even if the deduplication scheme is secure against offline
brute-force guessing attacks, an A controlling both S and
some Cs can still mount a similar online brute-force attack
by running the deduplication protocol for every “guess” of a
predictable file and checking if deduplication occurs. This
attack is “online” since, as will become clearer in Section 4,
Public information: A finite cyclic group G of prime order p generated by an element g. Public elements M
u∈ G associated with user u. A hash function H modeled as a random oracle.
Secret information: User u has a password pw
u. The protocol is run between Alice and Bob:
1. Each side performs the following computation:
• Alice chooses x ∈
RZ
pand computes X = g
x. She defines X
∗= X · (M
A)
pwA.
• Bob chooses y ∈
RZ
pand computes Y = g
y. He defines Y
∗= Y · (M
B)
pwB. 2. Alice sends X
∗to Bob. Bob sends Y
∗to Alice.
3. Each side computes the shared key:
• Alice computes K
A= (Y
∗/(M
B)
pwA)
x. She then computes her output as SK
A= H(A, B, X
∗, Y
∗, pw
A, K
A).
• Bob computes K
B= (X
∗/(M
A)
pwB)
y. He then computes his output as SK
B= H(A, B, X
∗, Y
∗, pw
B, K
B).
Figure 2: The SPAKE2 protocol of Abdalla and Pointcheval [1].
A has to run a protocol invocation with the uploading C for every “guess” of the content of the uploaded file.
Security Goals: With the above threat model in mind, we aim to propose a deduplication scheme that both enables S to do deduplication and Cs to encrypt their files with a semantically secure encryption scheme. Our security goals are as follows.
S1 Avoid relying on any T P.
S2 Prevent brute-force guessing attacks:
S2.1 online by compromised Cs S2.2 offline by compromised passive S
S2.3 online by compromised active S (masquerading as multiple Cs)
Functional goals: In addition, our protocol should also meet certain functional goals.
F1 Maximize deduplication effectiveness (exceed realistic expectations, as discussed in Section 2.1).
F2 Minimize computational and communication overhead (i.e., the computation/communication costs should be comparable to storage systems without deduplication).
4. OVERVIEW OF THE SOLUTION
We first motivate salient design decisions in our dedupli- cation protocol before describing the details in Section 5.
When an uploader C wants to upload a file F to S, we need to address two problems: (a) determining if S already has an encrypted copy of F in its storage and (b) if so, then securely arranging to have the encryption key transferred to C from some C
iwho uploaded the original encrypted copy of F .
In traditional client-side deduplication, when C wants to upload a file F to S, she first sends a cryptographic hash h of F to S so that it can check for the existence of the file in its storage. Na¨ıvely adapting this approach to the case of encrypted storage is insecure because a compromised S can easily do offline brute-force guessing attacks on h if F is predictable. Therefore, instead of h, we let C send a short hash sh = SH(F ). Due to the high collision rate of the short hash function SH(), S cannot reliably guess the content of F offline.
Now, suppose that another C
i, previously uploaded E(k
Fi, F
i) using k
Fias the symmetric file encryption key for F
iand that sh = sh
i. Our protocol needs to determine if this hap- pened because F = F
iand, in that case, arrange to have k
Fisecurely transferred from C
ito C. We do this by having C and C
iengage in an oblivious key sharing protocol which al- lows C to receive k
Fiiff F = F
iand a random key otherwise.
We say that C
iplays the role of a checker in this protocol.
The oblivious key sharing protocol could be implemented using generic solutions for secure two-party computation, such as Yao’s protocol [28], which express the desired func- tionality as a boolean circuit and then securely compute it.
Protocols of this type have been demonstrated to be very efficient, even with security against malicious adversaries.
In our setting the circuit representation is actually quite compact, but the problem in using this approach is that the inputs of the parties are relatively long (say, 288 bits long, comprising of a full-length hash value and a key), and known protocols require an invocation of oblivious transfer, namely of public-key operations, for each input bit. There are known solutions for oblivious transfer extension, which use a preprocessing step to reduce the online computation time of oblivious transfer. However, in our setting the secure computation is run between two clients that do not have any pre-existing relationship, and therefore preprocessing cannot be computed before the protocol is run.
Our solution for an efficient oblivious key sharing protocol is based on having C and C
irun a PAKE instance, with the hash values of their files, namely h and h
i, as their respective input “passwords”. The protocol results in C
igetting k
iand C getting k
0i, which are equal if h = h
iand are independent otherwise. The next step of the protocol, which will be described in Section 5, uses these keys to deliver to C a key k
F, which is equal to k
Fiiff k
i= k
0i. C uses this key to encrypt her uploaded file, and S can deduplicate that file if this file is equal to the file uploaded by C
i. Several additional issues need to be solved:
1. A compromised S can perform an online brute-force
guessing attack where it initiates many interactions
with C/C
ito identify F /F
i. Each protocol interaction
essentially enables a single guess about the content of
the relevant file. Cs therefore use a per-file rate limiting
strategy to prevent such attacks. Specifically, they set
root
sh
1E(k
F1, F
1)E(k
F2, F
2) ...
C
1C
2...
sh
2...
Figure 4: S’s record structure.
a bound on the maximum number of PAKE requests they would service as a checker or an uploader for each file. (See Section 5.1.) Our experiments with real data sets in Section 7 show that this rate limiting does not affect the success of deduplication.
2. What if C
iis not online? If S has a large number of clients, it can find enough online checkers. If there are not enough online checkers, we can let the uploader run PAKE with the currently available checkers and with additional dummy checkers to hide the number of available checkers. (See Section 5.2.) Again, our experiments in Section 7 show that this does not affect the success of deduplication (since it is likely to find checkers for popular files).
3. How can S do client-side deduplication securely? S will know whether C and C
ihave the same file from the results of PAKE. If they do have the same file, S just inform C that she need not upload her file. In order to solve the problem of online brute-force guessing attacks by C, when client-side deduplication takes place, we can use the same randomized threshold strategy as [17]
(see Section 5.3).
5. DEDUPLICATION PROTOCOL
In this section we describe our deduplication protocol in detail. The data structure maintained by S is shown in Fig- ure 4. Clients Cs who want to upload a file also upload the corresponding short hash sh. Since different files may have the same short hash, a hash sh is associated with the list of different encrypted files whose plaintext has this hash value.
S also keeps track of clients (C
1, C
2, . . .) who have uploaded the same encrypted file (therefore, with overwhelming prob- ability, these clients have the same plaintext file encrypted with the same key).
Figure 5 shows a basic secure deduplication protocol. Each C
ithat uploads a file F
istores at S a short hash sh
iof that file (Step 0). When a client C wishes to upload a file F it sends the short hash of this file, sh, to S (Step 1). S identifies clients {C
i} that have uploaded files with the same short hash (Step 2), and runs the following protocol with each of them (in the more refined version of the system, there is a bound on the number of these protocol invocations).
Consider a specific C
iwho uploaded a file F
i. It runs a PAKE protocol with C, where their inputs are the cryp- tographic hash values of F
iand F respectively, and their
outputs are keys k
iand k
0irespectively (Step 3). We as- sume that the keys are sufficiently long so that k
ican be divided to a left key and right key, k
i= k
iL||k
iR, where k
iLis long enough so that collisions are unlikely (namely,
|k
iL| >> 2 log n where n is the number of clients), and k
iRcan be used for symmetric encryption. The same also holds for k
0i.
At this point, a naive solution would have been for C
ito send to C an encryption of k
Fiwith k
i, namely E(k
Fi, k
i).
However, this would enable a subtle attack by the C to iden- tify whether F was previously uploaded. Therefore, the protocol continues as follows: each C
isends to S the pair k
iL, (k
iR+ k
Fi), where addition is done in a field whose size will be defined next. C sends to S the pair k
0iLand Enc(pk, k
0iR+ r), as well as its public key pk, where this is an additively homomorphic encryption (Steps 4-5).
After receiving these messages from all C
is, S looks for a pair i for which k
iL= k
0iL. This equality happens iff F
i= F (except with negligible probability). If S finds such a pair it sends to C the value e = Enc(pk, k
iR+ k
Fi) Enc(pk, k
0iR+ r) = Enc(k
Fi− r). Otherwise it sends e = Enc(r
0), where r
0is chosen randomly by S (Step 6). C calculates k
F= Dec(e) + r, and sends E(k
F, F ) to S (Step 7). Note that if F was already uploaded to the server then E(k
F, F ) is equal to the previously stored encrypted version of the file.
This basic protocol has several issues that are addressed in the following sections: it allows for online brute-force guess- ing attacks (which we prevent using rate-limiting in Sec- tion 5.1); it does not specify how checkers are selected by S, if too many checkers are available (Section 5.2), and, if client-side deduplication is used, it leaks to the client whether F was already uploaded (this is solved in Section 5.3 using a randomized threshold approach).
Theorem 1. The deduplication protocol in Figure 5 im- plements F
dedupin malicious model if the PAKE protocol is secure against malicious adversaries, the additively ho- momorphic encryption is semantically secure and the hash function is modeled as a random oracle.
Proof. We assume there is a trusted party that imple- ments F
dedupin ideal model. We then construct a simula- tor, and show that the execution of F
dedupin ideal model is computationally indistinguishable from the execution of the deduplication protocol in real model. For the purpose of the proof we assume that the PAKE protocol is implemented as an oracle to which the parties sends their inputs and re- ceive their outputs. (Note that the PAKE functionality is assumed to be implemented using a fully-secure protocol; in fact, it could be implemented using generic secure computa- tion protocols. Therefore this assumption is justified.)
A corrupt uploader C: We first assume that S and the C
is are honest and construct a simulator for the uploader C.
The simulator operates as follows: The simulator records the calls that C makes to the hash function, which is modeled as a random oracle, and records tuples of the form (F
j, h
j, sh
j) of the inputs and outputs for these calls. The simulator first receives a value sh from the C. The simulator now obtains C’s inputs to the different PAKE invocations. If such an input is equal to an h
jvalue which was in the same tuple as sh the simulator invokes F
dedupwith F
jand receives a key k
Ffrom F
dedup.
Note that an honest C uses the same h value in all PAKE
invocations. A corrupt C might use different h values. Also,
0. When each C
iuploaded its file F
i, it uploads sh
iand E(k
Fi, F
i).
1. C attempts to upload a file F . It calculates the cryptographic hash h and a short hash sh of F , and sends sh to S.
2. S finds the clients {C
i} which uploaded files {F
i} whose short hash sh
iis equal to sh, and lets them run PAKE protocols with C.
aC’s input is h and C
i’s input is h
i.
3. After the invocation of the PAKE protocols, each C
igets a session key k
iand C gets a set of session keys {k
i0} corresponding to the different C
i’s.
b4. Each C
isplits k
ito k
iL||k
iRand sends k
iLand (k
iR+ k
Fi) to S.
5. C splits each k
0ito k
0iL||k
iR0, and sends to S its public key pk, {k
iL0} and {Enc(pk, k
iR0+ r)} where r is chosen randomly.
6. After receiving these messages from all C
is, S checks if there is an index i for which k
iL= k
0iL.
(a) If this is the case then it uses the homomorphic properties of the encryption to send to C a value e = Enc(pk, k
iR+ k
Fi) Enc(pk, k
0iR+ r) = Enc(k
Fi− r);
c(b) Otherwise it sends e = Enc(r
0), where r
0is chosen randomly by S.
7. C calculates k
F= Dec(e) + r, and sends E(k
F, F ) to S.
8. If S already stores E(k
F, F ) it deletes E(k
F, F ) and allows C to access the stored file E(k
Fi, F
i). Otherwise, S stores E(k
F, F ).
a
The PAKE is run via S. There is no direct interaction between the C and each C
i. C sends its first message of the PAKE protocol in Step 1.
b
Note that, with overwhelming probability, k
i= k
0iiff F
i= F .
c
If more than a single k
iLmatches, then the server picks at random one of the matching values and computes the encryption using the corresponding k
iR. This event does not happen if C is honest, but might happen if C uses in the PAKE protocols different h values that agree with the same short hash sh.
Figure 5: The deduplication protocol.
since sh is a short hash, there might be several (F
j, h
j, sh
j) tuples with the same sh value and therefore several legiti- mate h values for C to use. The simulator might therefore learn several k
Fvalues. The simulator stores for each h value a key k
F, which is equal to the corresponding result from F
dedup, or is random if h was not in a tuple with sh.
The simulator sends back to C a random key k
ifor each PAKE invocation. It splits this key as k
i= k
iL||k
iR, and stores the result together with the h value used in this PAKE invocation, and the key k
Fthat corresponds to h.
The simulator receives from C a set of pairs ({k
0iL, Enc(k
iR0+ r)}. Let K = {k
iL0| k
iL0= k
iL}, namely the set of k
0iLvalues that agree with the values stored by the simulator. Let K
0be a maximal subset of K where each k
0iL∈ K
0corresponds to a different h value. Then if K
0is not empty the simulator chooses at random an element k
iL0∈ K
0and sends back to C the encryption Enc((k
iR+ k
F) − (k
0iR+ r)), computed using the homomorphic properties. Otherwise it sends to C an en- cryption of a random value. (Note that if F is not identical to any of the files F
ithen the key k
Fthat C will retrieve is random, even if the simulator found a match of a the k
iLvalue.)
A corrupt C
i: In the ideal model C
ialways provides an input to the functionality. We prove security with relation to the relaxed functionality described in Section 3.2, where C
ialso learns whether the uploaded file has the same short hash as F
i.
The simulator interacts with C
iin the real protocol and should extract C
i’s input to the functionality in the ideal model, namely (F
i, k
Fi). We can assume that C
ihas previ- ously sent E(k
Fi, F
i) to the server, and therefore the simu-
lator should only find k
Fiand use it to compute F
i. The simulator first observes whether sh matches the short hash of C
i. If that is not the case then the simulator pro- vides a random k
Fito the ideal functionality. Otherwise, the simulator observes C
i’s input h
ito the PAKE protocol, its output k
i, and the message (k
iL, k
iR+ k
Fi) that C
isends to the server. If k
iLis different than the left part of k
ithen the simulator provides to the functionality a random value k
Fi. Otherwise, it subtracts the right part of k
i, namely k
iR, from k
iR+ k
Fi, and provides the result as the input to the functionality.
A corrupt Server: In the protocol the server learns whether the k
iLvalue matched or not. This should be done in the simulation, too, but this information can be deduced from the output in the ideal model.
Now we discuss several extensions to the basic protocol to account for the types of issues we alluded to in Section 4.
5.1 Rate Limiting
A compromised active S can do online brute-force guess- ing attacks on either C or C
i. Specifically, if F
iis predictable, S can pretend to be uploader and send PAKE requests to C
iattempting to guess F
i. S can also pretend to be checkers and send PAKE responses to C to guess F . Both upload- ers and checkers should limit the number of PAKE runs for each file in their respective roles. We use RL
cto denote the rate limit for checkers, i.e., each C
ican process at most RL
cPAKE requests for F
iand will ignore further requests.
Similarly, RL
uis the rate limit for uploaders: i.e. an C will
send at most RL
uPAKE requests to upload F . This strat-
egy can both improve security (shown in Section 6.3) and
reduce overhead (shown in Section 7).
5.2 Checker Selection
On receiving an upload request, S selects matching files in descending order of popularity (in terms of number of Cs who own the file) and then choose checkers for that file in ascending order of engagement (in terms of the number of PAKE requests they have serviced so far as checkers for that file). However, the number of PAKE instances an uploader is asked to run for a given file may reveal information about the popularity of that file. To avoid this, S may need participate in fake PAKE runs with C using random inputs.
More concretely, after receiving an upload request from C, S generates a random value r
1∈ [1, RL
u− 1], and sets r
2as RL
u− r
1.
• If no short hash matches sh
a, S runs r
1times PAKE immediately using random inputs.
• If S finds a sh
ithat matches sh and there are x files under sh
iand y of them have owners whose clients are currently online,
– if y < r
1, S runs (r
1− y) times PAKE using ran- dom inputs; Then, for each of the y files, selects an online checker to run PAKE, and marks the other (x − y) files as unchecked ;
– if r
1≤ y, S selects r
1of the y files based on popularity; and for each of them, selects an online checker to run PAKE. Then marks the other (x − r
1) files as unchecked.
At the end of the above procedure, if no match has been found, S picks a random PAKE instance and ask C to use the corresponding key to encrypt the file she wants to up- load. Assume there are z unchecked files after the above procedure,
• if z < r
2, S runs (r
2−z) times PAKE using random in- puts at random time interval, and whenever a checker for any of the z files comes online, sends a PAKE re- quest to him;
• if r
2≤ z, S selects r
2of the z unchecked files based on popularity, and whenever a checker for any of the r
2files comes online, sends a PAKE request to it.
During the above procedure, whenever a match is found, S stops sending PAKE requests to other checkers. But it still uses random inputs to make sure C has run PAKE r
1times before uploading and r
2times after uploading.
5.3 Randomized Threshold
The protocol in Figure 5 is a sever-side deduplication scheme. To save bandwidth, we use the randomized thresh- old approach in [17] to transform it to client-side dedupli- cation. For each file F , S keeps a random threshold t
F(t
F≥ 2), and a counter c
Fthat indicates the number of Cs that have previously uploaded F .
In step 6 of the deduplication protocol,
• In the case of 6a, if c
Fi< t
Fi, S tells C to upload E(k
F, F ) and then deletes it locally. Otherwise, S just informs C that the file is duplicated and there is no need to upload it.
• In the case of 6b, if there are still unchecked files, it asks C to run further PAKE instances when checkers owning those files come online.
– if S finally finds k
jL= k
jL0, S puts C as an owner of E(k
Fj, F
j),
∗ if c
Fj≥ t
Fj, S tells C to switch the key of F to k
Fjand then deletes E(k
F, F ) from its local storage;
∗ otherwise, S treats F as another copy of F
jand keeps Enc(pk, k
Fj− r). If c
Fjexceeds t
Fjlater, it sends Enc(pk, k
Fj− r) to C, tells her to switch the key of F and then deletes E(k
F, F ) from its local storage.
– Otherwise, S puts E(k
F, F ) under sh
aand puts C as an owner of E(k
F, F ).
6. SECURITY ANALYSIS
In this section, we demonstrate how our deduplication scheme satisfies the security requirements in section 3.3. Our scheme doesn’t rely on any T P, so it satisfies S1.
6.1 Compromised Client Only
A can compromise one or more legitimate Cs, and attempt online brute-force guessing attacks. From Section 5.2, we know that a C receives r
1PAKE replies immediately after sending an upload request. Then C is either required to upload the encrypted file or is given access to an already existing encdrypted file. If C was asked to upload the en- crypted file, it cannot still learn how many other Cs, if any, have already uploaded the same file. If C was not required to upload the encrypted file, or if after being asked to run a further r
2PAKE instances, C was informed that the encryp- tion key has changed, then C can only conclude that its file is sufficiently popular (i.e., has more than the specific ran- dom threshold number of owners). Therefore, our protocol satisfies S2.1.
6.2 Compromised Passive Server
Next, we consider the case where A has compromised S such that A can gain access to each message received during the protocol as well as each file stored in S. A can thus attempt offline brute-force guessing attacks on predictable files. But all sensitive data (i.e., files, hashes and keys) are encrypted using a semantically secure encryption scheme with keys known only to the participating Cs. The only thing that leaks information is the short hash. We can set its length to make it leak as little information as required.
As long as the range of the short hash function is sufficiently smaller than the plaintext space from which a file is drawn, A cannot succeed in a brute-force guessing via a compro- mised passive S. Therefore, our scheme satisfies S2.2.
6.3 Compromised Active Server
Finally, we consider the case where A has compromised S and can make it masqurade as as many Cs as she wants (this is equivalent to A compromising S and the required number of legitimate Cs). Such an adversary can attempt to mount online brute-force guessing attacks on predictable files. As introduced in Section 5.1, we use a per-file rate limiting strategy to prevent such attacks. Recall that RL
u(RL
c) is the rate limit for uploaders (checkers), n is the
length of the short hash and Φ is the space of a predictable file F , then A can uniquely identify F only if
|Φ| ≤ 2
n· x · (RL
u+ RL
L) (1) where x is the number of Cs who potentially own F .
As a comparison, DupLESS [3] uses a per-client rate lim- iting strategy to prevent such online brute-force guessing attacks from one C. Since they need to choose the rate limit so that it does not degrade usability for a C that needs to le- gitimately upload a large number of files within a brief time interval (such as backing up a local file system), the authors of DupLEss choose the large bound 825 000 as the total number of requests a single C can make during one week.
That is to say, for a predictable file whose size of file space
|Φ| is 825 000, a compromised C has to use at least one week to uniquely identify it. But if A has compromised multiple Cs, this time lag will be linearly decreased. Recall that a compromised active S can masquerade as any number of Cs needed.
To prevent a compromised active S from masquerading multiple Cs, authors in [25] introduce another T P called identity server. When Cs first join the system, the identity server is responsible for verifying their identity and issuing credentials to them. But an identity server is hard to deploy in real world. As far as we know, our scheme is the first deduplication protocol that satisfies S2.3 without the aid of an identity server.
7. SIMULATION
Our approach incurs some extra computation and com- munication due to the number of PAKE runs whenever a C wants to upload a file. Furthermore, the use of rate lim- its and randomized thresholds can impact deduplication ef- fectiveness. In this section, we use realistic simulations to study the effect of various parameter choices in our protocol on its computation/communication overhead and dedupli- cation effectiveness.
Datasets. We want to consider two types of storage envi- ronments. The first consists predominantly of media files, such as audio and video files. We did not have access to a dataset from such an environment. Instead, we use a dataset comprising Android application prevalence data to represent an environment with media files. This is based on the assumption that the popularity of Android appli- cations is likely to be similar to that of media files: both are created and published by a small number of authors (artists/developers), made available on online stores or other distribution sites, and are acquired by consumers either for free or for a fee. We call this the media dataset. We use a publicly available dataset.
1It consists of data collected from 77 782 Android devices. For each device, the dataset identi- fies the set of (anonymized) application identifiers found on that device. We treat each application identifier as a “file”
and presence of a app on a device as an “upload request” to add the corresponding “file” to the storage. This dataset has 7 396 235 “upload requests” in total, of which 178 396 are for distinct files.
The second is the type storage environments found in an enterprise backup systems. We use data gathered by Debian Popularity Contest
2to approximate such an environment.
1
https://se-sy.org/projects/malware/
2
http://popcon.debian.org
The popularity-contest package on a Debian device regularly reports the list of packages installed on that device. The resulting data consists of a list of debian packages along with the number of devices which reported that package. We took a snapshot of this data on Nov 27, 2014. It consists of data collected from 175 903 Debian users and. From this, we generate our enterprise dataset : it has 217 927 332 “upload requests” (debian package installations) of which 143 949 are for distinict files (unique debian packages).
Figure 6 shows the file popularity distribution (i.e., the number of upload requests for each file) in logarithmic scale for both datasets. We map each dataset to a stream of upload requests by generating the requests in random order.
Specifically, for a file that has x copies, we generate x upload requests for it at random time intervals.
File ID
100 101 102 103 104 105 106
Number of Upload Requests
100 101 102 103 104 105 106
File popularity in media dataset File popularity in enterprise dataset
Figure 6: File popularity in both datasets.
Parameters. To facilitate comparison with DupLESS [3], we set |Φ| as 825 000. We then set the length of the short hash n to 13. Inequality 1 in Section 6.3 then implies a lower bound of 100 for (RL
u+RL
c): with these parameter choices, A can mount a successful online brute-force guessing attack to uniquely identify a predictable file held by only one C.
We use these parameter values in our simulations.
We measure overhead as the average number of PAKE runs
3, which can be calculated as:
µ = T otal number of PAKE runs
T otal number of upload requests (2) We measure deduplication effectiveness using deduplication percentage, introduced in Section 2.1. We assume that all files are of equal size so that deduplication percentage ρ is:
ρ = 1 − N umber of all f iles in storage
T otal number of upload requests (3) First we assume that all Cs are online during the simula- tion, and study the impact of rate limits. Once we choose specific rate limits, we evaluate how absent Cs and the use of randomized thresholds impact deduplication effectiveness.
3
We do not include fake PAKE runs by S (Section 5.2) since
we are interested in estimating the average number of real
PAKE runs. In a full implementation, an uploader will al-
ways run r
1PAKE runs, where r
1∈ [1, RL
u− 1].
Rate limiting. Having selected RL
u+RL
cto be 100, we now see how the average number of PAKE runs and the deduplication effectiveness change when divide this total number in different ways between RL
uand RL
c. Figure 7 shows the average number of PAKE runs resulting from dif- ferent values of RL
u(and hence RL
c) in both datasets. We also ran the simulation without any rate limits, which led to an average of 26.88 PAKE runs in media dataset and 13.19 PAKE runs in enterprise dataset. They are signifi- cantly larger than the results with rate limiting. Figure 8 shows the deduplication percentage resulting from different rate limit choices in both datasets. We see that setting RL
u(RL
c) to 30 (70), maximizes the deduplication percentage to values (97.58% and 99.9332%) close to perfect deduplication (97.59% and 99.9339%) in both datasets.
RLu(RL
c)
Average Number of PAKE Runs
1.25 1.3 1.35 1.4 1.45 1.5 1.55 1.6 1.65 1.7 1.75
10(90) 20(80) 30(70) 40(60) 50(50) 60(40) 70(30) 80(20) 90(10) Average number of PAKE runs in media dataset
Average number of PAKE runs in enterprise dataset
Figure 7: Average number of PAKE runs VS. rate limits.
We conclude that the use of rate limits not only improves security, but can also reduce the overhead without negatively impacting deduplication effectiveness.
RLu(RLc)
Deduplication Percentage %
96.5 97 97.5 98 98.5 99 99.5 100
10(90) 20(80) 30(70) 40(60) 50(50) 60(40) 70(30) 80(20) 90(10) Deduplication percentage in media dataset Deduplication percentage in enterprise dataset Perfect deduplication in media dataset Perfect deduplication in enterprise dataset
Figure 8: Deduplication percentage VS. rate limits.
Randomized threshold and offline rate. The possibility
of some Cs being offline may adversely impact deduplication effectiveness. For example, as we described in Sections 5.2–
5.3, if S finds that the upload request from C does not match any online checker available at the time of the upload, it will ask C to upload its encrypted file. S will continue to ask C to run r
2PAKE instances afterwards. If one of these finds a checker whose existing file F matches C’s file but c
Fnever exceeds t
F, S can no longer deduplicate C’s copy of F . Had the original uploader of F been online at the time C wanted to upload its file, S would have successfully con- ducated server-side deduplication of F . In other words, the possibility of clients being offline will impact deduplication effectiveness.
To estimate this impact, we assign an offline rate to each C as its probability to be offline during one upload request.
Using the chosen rate limit values (RL
u= 30 and RL
c= 70) and fixing the upper limit of the random threshold (i.e., 2 ≤ t
F≤ 10), we measured the deduplication percentage by varying the offline rate value. The results for both datasets are shown in Figure 9. It shows that the deduplication per- centage is still high when the offline rate is as high as 90%.
Offline Rate
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
Deduplication Percentage %
96.5 97 97.5 98 98.5 99 99.5 100
Deduplication percentage in media dataset Deduplication percentage in enterprise dataset
Figure 9: Deduplication percentage VS. offline rates.
Evolution of deduplication effectiveness: Deduplica- tion percentage will continue to increase as more files are added to the storage. Figure 10 shows that the deduplication percentage achieved by our scheme meets the realistic expec- tation (95%) in the very beginning (after receiving 304 160 upload requests in the media dataset, and 121 110 upload requests in enterprise dataset). So our system satisfies F1.
However, our use of rate limits may result in a somewhat slower increase compared to schemes with perfect dedupli- cation. Figure 11 shows how the difference between perfect deduplication and the deduplication percentage achieved by our scheme evolves as more files are added to the storage.
Figure 13 (in Appendix C) shows the secondary deviation.
The difference become stable as more files are uploaded to the storage.
This phenomenon can be explained by the Zipf ’s law [29].
As uploaders set a rate limit for their requests and S always
selects files in descending order of popularity, our system is
similar to web proxy caching, which is a temporary storage
for popular web pages. Breslau et al. observe that page
Number of Upload Requests #106
0 1 2 3 4 5 6 7 8
Deviation in Deduplication Percentage %
0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45
Deviation from perfect deduplication in media dataset
Number of Upload Requests #108
0 0.5 1 1.5 2 2.5
Deviation in Deduplication Percentage %
#10-3
0 1 2 3 4 5 6 7 8 9
Deviation from perfect deduplication in enterprise dataset
Figure 11: Deviation from perfect deduplication VS. Number of upload requests.
Number of Upload Requests (x30 in enterprise dataset) #106
0 1 2 3 4 5 6 7 8
Deduplication Percentage %
0 10 20 30 40 50 60 70 80 90 100
Evolution of deduplication percentage in media dataset Evolution of deduplication percentage in enterprise dataset
Figure 10: Deduplication percentage VS. Number of upload requests.
requests distribution follows the Zipf’s law, where the fre- quency of a request for the mth most popular page can be calculated as
PN1/mαi=1(1/iα)
, where N is the size of the cache, and α is the value of the exponent characterising the distri- bution [8]. As a result, most of the requested pages can be found in the cache. In our case, most of the upload requests can find a matched file within the rate limit if the file has been uploaded already.
8. PERFORMANCE EVALUATION 8.1 Prototype implementation
Our prototype consists of two parts: (1) a server pro- gram which simulates S and (2) a client program which sim- ulates C (performing file uploading/downloading, encryp- tion/decryption, and assissting S in deduplication). We use
Node.js
4for the implementation of both parts and Redis
5for the implementation of S’s data structure. Furthermore, we use SHA-256 as the cryptographic hash functions and AES with 256-bit keys as the symmetric encryption scheme, both of which are provided by the Crypto module in Node.js.
We truncate the cryptographic hash to get the short hash.
We use the GNU multiple precision arithmetic library
6to implement the public key operations.
8.2 Test setting and methodology
We run the server side program on an Amazon EC2 in- stance (Intel Xeon with 1 CPU at 2.5 GHz)
7. and the client side program on an Intel Core i7 machine with 4 2.2 GHz cores. We measure the running time using the Date mod- ule in javascript and measure the bandwidth usage using TCPdump.
As the downloading phase in our protocol is simply down- loading an encrypted file, we only consider the uploading phase. We set 13 for the length of short hash and 30 for RL
u. We consider the worst case where we let the uploader run PAKE with 30 checkers. So we simulate the uploading phase in our protocol as:
1. An uploader C sends the short hash of the file it wants to upload to server S;
2. S sends requests to 30 checkers C
iand lets them run PAKE with C;
3. S waits to get responses of all instances back from C and {C
i};
4. S chooses one instance and sends the result to C 5. C uses the resulting key to encrypt its file with AES
and uploads it to S.
We measure both the running time and bandwidth usage during the whole procedure above. We compare the results
4
http://nodejs.org
5
http://redis.io
6
https://gmplib.org
7
http://aws.amazon.com/ec2/
to two baselines: (1) simply uploading a file without en- cryption and (2) uploading a file with AES encryption. As in [3], we repeat our experiment using files of size 2
2iKB for i ∈ {0, 1, ..., 8}, which provides a file size range of 1KB to 64 MB. For each file, we upload it 100 times and cal- culate the mean values. For files that are larger than the computer buffer, we do loading, encryption and uploading at the same time by piping the data stream. As a result, uploading encrypted files uses almost the same amount of time with uploading plain files.
8.3 Results
Figure 12 reports the uploading time and bandwidth us- age in our protocol compared to two baselines. For files that are smaller than 1 MB, the extra overhead introduced by our deduplication protocol is relatively high. For exam- ple, it takes 587 ms (2 436 bytes bandwidth) to upload a 1 KB plain file and 594 ms (2 585 bytes) to upload a 1 KB encrypted file, while it takes 1 784 ms (14 7308 bytes) to up- load the same file in our protocol. But the extra overhead introduced by our protocol is independent of the file size, and it become negligible when the file is large enough. So our system meets F2.
9. DISCUSSION
Incentives: In our scheme Cs have to run several PAKE instances as both uploaders and checkers. This imposes a cost on each C. S is the direct beneficiary of the protocol. Cs may indirectly benefit from deduplication in that the ability to effectively deduplicate makes the storage system more efficient and can thus potentially lower the cost of using the storage incurred by each C. Nevertheless, it is desirable to have more direct incentive mechanisms to encourage Cs to do do PAKE checks. For example, if a file uploaded by C is found to be shared by other Cs, S could reward the Cs owning that file by giving them small increases in their respective storage quotas.
Deduplication effectiveness: We can improve dedupli- cation effectiveness by introducing additional checks. For example, an uploader can indicate the (approximate) file size (which will be revealed any way) so that S can limit the selection of checkers to those whose files are of a simi- lar size. Similarly, S can keep track of similarities between Cs based on the number of files they share and use this in- formation while selecting checkers by prioritizing checkers who are similar to the uploader. Nevertheless, as discussed in Section 7, our schemes achieve surpass what is consid- ered as realistic levels of deduplication in the very beginning.
Whitehouse [27] reported that when selecting a deduplica- tion scheme, enterprise administraotrs rated considerations such as ease of deployment and use being more important as deduplication ratio. Therefore, we argue that the small sacrifice in deduplication ratio is offset by the significant advantage of ensuring user privacy without having to use independent third party servers.
Block-level deduplication: Our scheme can be applied to both file-level and block-level deduplication. But block-level deduplication will incur more overhead.
10. RELATED WORK
One technique proposed to enable deduplication with en- crypted data is convergent encryption (CE) [12], which de-
rives the file encryption key deterministically from the file contents. As a result, the same file will produce the same ciphertext, so that S can deduplicate ciphertexts. But if A compromises S, it can easily perform an offline brute-force guessing attack. Bellare et al. recently proposed “message- locked encryption” (MLE), which uses a semantically-secure encryption scheme but produces a tag in a deterministic way [4]. So it still suffers from the same attack.
Puzio et al. propose a block-level deduplication scheme that uses a third party T P [23]. C first encrypts each block with CE and then sends the ciphertexts to T P, who adds an additional layer of encryption to the block. During file retrieval, blocks are decrypted by T P first and sent back to C. If A compromises both S and Cs, she can perform an online brute-force guessing attack by letting multiple Cs upload all of possible files and monitoring S’s storage.
Stanek et al. propose a scheme that only deduplicates popular files [25]. Cs encrypt their files with two layers of encryption: the inner layer is obtained through CE and the outer layer is obtained through a semantically secure thresh- old encryption scheme. S can decrypt the outer layer of a file iff the number of Cs who have uploaded this file reaches the threshold. They introduce a T P as an index server to keep track of the popularity of each file, but the index server has to know the content or hash of each file. In addition, they introduce another T P as an identity server to prevent online brute-force guessing attacks by multiple Cs.
Both of these solutions [23, 25] are vulnerable to offline brute-force guessing attacks if A compromises the T Ps.
Bellare et al. propose DupLESS that uses an additional secret in the key generation process of CE [3]. This secret is identical for all Cs and is provided by a T P. To prevent T P from conducting offline brute-force guessing attacks, Cs run an oblivious PRF protocol with T P to generate the key which guarantees that Cs who input the same file will get the same encryption key without revealing their files. But if A compromises both S and T P, S can get the secret from T P and the scheme will be reduced to normal CE.
Duan proposes a distributed architecture that generates the file encryption key by a threshold number of all Cs [13].
But this scheme is only suitable for peer-to-peer scemarios:
a threshold number of Cs must be online and interact with one another to generate the key. A malicious S can easily generates more than a threshold number of Cs.
In Table 1, we summarize the resilience of these schemes with respect to the design goals from Section 3.3.
XX XX
XX XX X
Schemes
Threat
Compromised
C S (pas.) S (act.), Cs T Ps S, T Ps
[12], [4] √
X X − −
[23] √ √
X X X
[25] √ √ √
X X
[3] √ √
X √
X
[13] √ √
X − −
Our work √ √ √
− −
Table 1: Resilience of deduplication schemes.
Acknowledgments
This work was supported in part by the Academy of Finland
project “Cloud Security Services” (283135).
File Size (KB)
Time Usage (Millisecond)
21 22 23 24 25 26 27 28 29 210 211
20 22 24 26 28 210 212 214 216
Our protocol Upload encrypted files Upload plain files
File Size (KB)
Bandwidth Usage (Byte)
210 212 214 216 218 220 222 224 226 228
20 22 24 26 28 210 212 214 216
Our protocol Upload encrypted files Upload plain files