Secure Deduplication of Encrypted Data without Additional Servers

(1)

Secure Deduplication of Encrypted Data without Additional Servers

Jian Liu

Aalto University

N. Asokan

Aalto University and University of Helsinki

Benny Pinkas

Bar Ilan University

Abstract

Encrypting data on the client-side before uploading it to cloud storage is essential for protecting users’ privacy. How- ever client-side encryption is at odds with the standard prac- tice of deduplication in cloud storage services. Reconciling client-side encryption with cross-user deduplication has been an active research topic. In this paper, we present the first secure cross-user deduplication scheme that supports client- side encryption without requiring any additional independent servers. We demonstrate that our scheme provides better se- curity guarantees than previous work. We also demonstrate that its performance is reasonable via simulations using re- alistic datasets.

1. INTRODUCTION

Cloud storage is a service that enables people to store their data remotely with a storage provider. With a rapid growth in user base, storage providers tend to save storage costs via cross-user deduplication: if two clients want to up- load the same file, the storage server detects the duplication and stores only a single copy. Deduplication achieves high storage savings [20] and is adopted by many cloud service providers. It is also adopted widely in backup systems for enterprise workstations.

Clients who care about privacy want to encrypt their data on the client-side using semantically secure encryption mech- anisms. However, na¨ıve application of encryption thwarts deduplication. Reconciling deduplication and encryption is an active research topic. The current solutions are either dependent on the aid of additional independent servers [3, 23, 25] or only suitable for Peer-to-Peer (P2P) systems [13].

Reliance on independent servers is a strong assumption that is very difficult to meet in commercial contexts, and since most of the popular cloud storage services (e.g. Dropbox, Google Drive) are client-server systems, turning them into P2P systems is unrealistic.

In this paper, we present the first single-server scheme for secure cross-user deduplication with client-encrypted data.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

Copyright 20XX ACM X-XXXXX-XX-X/XX/XX ...$15.00.

Our scheme allows two clients uploading the same file to the cloud storage to securely agree on the same file encryption key. We do this by building upon a well-known crypto- graphic primitive known as password authenticated key ex- change (PAKE) [6]. A PAKE allows two parties to agree on a session key if and only if they share a short secret (“pass- word”), which might have low entropy. In our deduplication scheme, PAKE protocol runs are facilitated by the storage server. It avoids the need for any additional independent servers and thus can be adopted easily into popular cloud storage services. In terms of security, our scheme makes use of per-file rate limits (namely, limiting the number of queries allowed per file), rather than per-client rate limits (limiting the number of queries allowed for each client, as was used in [3]). This change makes the scheme significantly more resistant against brute-force guessing attacks by colluding clients or a malicious storage server that attempts to mas- querade as multiple clients.

We make the following contributions:

• presenting the first single-server scheme that si- multaneously enables the server to do cross-user dedu- plication and clients to encrypt their data on client-side using semantically secure encryption keys. (Section 5);

• showing that our scheme has better security guar- antees than previous work (Section 6). As far as we know, our proposal is the first scheme that can prevent online brute-force guessing attacks by malicious clients or storage server, without the aid of an identity server;

• demonstrating, via simulations with realistic datasets, that our scheme provides its privacy benefits while re- taining a utility level (in terms of deduplication effec- tiveness) on par with standard industry practice.

Our use of per-file rate limits implies that an incom- ing file will not be checked against all existing files in storage. Surprisiingly, our scheme still achieves high deduplication effectiveness (Section 7); and

• implementing our scheme to show that it incurs min- imal overhead (computation/communication) com- pared to traditional deduplication schemes (Section 8).

2. PRELIMINARIES 2.1 Deduplication

Deduplication strategies can be categorized according to

the basic data units they handle. One category is file-level

deduplication which eliminates redundant files [12]. The

other is block-level deduplication, in which files are segmented

(2)

into blocks and duplicated ones are eliminated [24]. In this paper we use the term ‘file’ to refer both files and blocks.

Deduplication strategies can also be categorized according to the side in which deduplication happens. In server-side deduplication, all files are uploaded to the storage server S, which then deletes the duplicated files. Clients C are un- aware of deduplication. This strategy saves storage but not bandwidth. In client-side deduplication, C checks the ex- istence of its files on S (by sending cryptographic hashes) before uploading them. Duplicates are not uploaded to the S. This strategy saves both storage and bandwidth.

The effectiveness of deduplication is customarily expressed by the deduplication ratio, defined as the “the number of bytes input to a data deduplication process divided by the number of bytes output” [14]. In the case of a cloud storage service, this is the ratio between total size of files uploaded by all clients and the total storage space used by the service.

Dedpulication percentage (sometimes referred to as “space reduction percentage”) [14, 17] is 1 −

deduplication ratio¹

. A perfect deduplication scheme will detect all duplicates. Hence its deduplication percentage will be close to 100%.

The level of deduplication achieveable depends on a num- ber of factors. In common business settings, deduplication ratios in the range of 4:1 (75%) to 500:1 (99.8%) are typi- cal. Wendt et al [26] suggest that a figure in the range 10:1 (90%) to 20:1 (95%) is a realistic expectation.

Although deduplication benefits storage providers (and hence, indirectly, their users), it also constitutes a privacy threat for users. For example, a cloud storage server that supports client-side, cross-user deduplication can be exploited as an oracle that asks queries of the form “did anyone upload this file?”. An adversary A can do so by uploading a file and observing whether deduplication takes place [17]. For a pre- dictable file that is known to be drawn from a small space, A can construct all possible files from that space, upload them and observe which causes deduplication so that A can iden- tify the target file. Harnik et al. [17] propose a randomized threshold approach to address such online brute-force guess- ing attacks. Specifically, for each file F , S keeps a random threshold t

F

(t

F

≥ 2) and a counter c

F

that indicates the number of Cs that have previously uploaded F . Client-side deduplication happens only if c

F

≥ t

F

. Otherwise, S does a server-side deduplication.

2.2 Hash Collisions

A hash function H: Φ → {0, 1}

ⁿ

is a function that deter- ministically maps a binary string in Φ of arbitrary length to a binary string h of fixed length n, where h is called the hash value or simply the hash, and n is called the hash length of H. A cryptographic hash function can map distinct ele- ments to a hash value such that it is infeasible to invert the function and find collisions (namely, the function is collision resistant).

If a hash function is not collision resistant, its collisions can be calculated as

^|Φ|₂n

, on condition that the hash values are evenly distributed across the available range. While a hash function with a long hash length has small collisions, a hash function with a small hash length will have high col- lisions. We will use such a short hash to protect clients’

privacy. For a predictable file F , drawn from a small space Φ, if A knows H(F ), she can easily perform an offline brute- force guessing attack. But if A only knows SH(F ), where SH() is a short hash function, it is hard for A to guess the

exact content of F due to the high collision rate of SH().

2.3 Additively Homomorphic Encryption

A public key encryption scheme is additively homomor- phic if given two ciphertexts c

1

= Enc(pk, m

1

; r

1

) and c

2

= Enc(pk, m

2

; r

2

) (i.e., encryptions of the messages m

1

, m

2

using the same public key pk but different random values r

1

, r

2

), it is possible to efficiently compute Enc(pk, m

1

+ m

2

; r) with independent randomness r, even without knowl- edge of the corresponding private key. Examples of such encryption are Paillier’s encryption [21], or El Gamal en- cryption [15] where addition is done in the exponent.

In this paper, we use E()/D() to denote symmetric en- cryption/decryption, and use Enc()/Dec() to denote addi- tively homomorphic encryption/decryption. We abuse the notation and use Enc(pk, m) to denote the random variable induced by Enc(pk, m; r) where r is chosen uniformly at ran- dom. In addition, we use ⊕ and to denote homomorphic addition and subtraction respectively.

2.4 Password Authenticated Key Exchange

Password-based protocols are commonly used for user au- thentication. However, such protocols are vulnerable to of- fline brute-force guessing attacks (also referred to as “dic- tionary attacks”) since users tend to choose passwords with relatively low entropy that are hence guessable. Bellovin and Merritt [6] were the first to propose a password authen- ticated key exchange (PAKE) protocol, in which an attacker making a password guess cannot verify the guess without an online attempt to authenticate itself with the password. The protocol is based on using the password as a symmetric key to encrypt the messages of a standard key exchange protocol (e.g., Diffie-Hellman [11]), so that two parties with the same password will successfully generate a session key without re- vealing their passwords.

Following this seminal work, many protocols were pro- posed to improve PAKE in several aspects, e.g., describ- ing provably secure protocols [5, 7], weakening the assump- tion (i.e., working in standard model without random ora- cles) [16, 2], achieving a stronger proof model [10, 9] and improving the round efficiency [18, 5, 19].

We use PAKE as a building block of our deduplication protocol. The ideal functionality F

pake

is shown in Fig- ure 1. Alice and Bob hold passwords pw

a

and pw

b

respec- tively. F

pake

outputs the same key to both of them if and only if pw

a

= pw

b

. Otherwise, it outputs two independent random keys to Alice and Bob.

We use PAKE as a building block for our deduplication protocol. The chosen PAKE protocol should satisfy the fol- lowing requirements:

• Implicit key exchange: It must be implicit, meaning that at the end of the protocol each party outputs a key but neither party learns if the passwords matched or not. In fact, many PAKE protocols were designed to be explicit where parties will know whether the ex- change succeeded or not at the end of the protocol.

• Single round: It must be single-round so that it can be easily facilitated by the storage server.

• Pseudo-random keys: It must guarantee that if the

passwords are different than neither party can distin-

guish the key output by the other party from a random

key.

(3)

Inputs:

• Alice’s input is a password pw

a

;

• Bob’s input is a password pw

b

. Outputs:

• Alice’s output is k

a

;

• Bob’s output is k

b

.

where if pw

a

= pw

b

then k

a

= k

b

, and if pw

a

6= pw

b

then Bob (resp. Alice) cannot distinguish k

a

(resp. k

b

) from a random string of the same length.

Figure 1: The ideal functionality F

pake

for password authenticated key exchange.

• Concurrent executions: It must allow multiple PAKE instances to run in parallel. There are two common security notions for PAKE under concurrent execu- tions. One notion is “concurrent PAKE”, defined by [5, 7]. The other stronger notion is that of “UC secure PAKE” [10], which guarantees security for composition with arbitrary protocols, and with arbitrary, unknown and possibly correlated password distributions. Pro- tocols secure according to the notion of UC security (e.g. in [10]) are currently much less efficient than protocols secure according to the notion of concurrent PAKE. We therefore use a concurrent PAKE protocol in our work.

Our deduplication scheme uses the SPAKE2 protocol of Ab- dalla and Pointcheval [1], which is described in Figure 2.

This protocol is secure in the concurrent setting, in the ran- dom oracle model. It is very efficient, requiring each party to compute just three exponentiations, and send just a sin- gle group element to the other party. Theorem 5.1 in [1]

states that this protocol is secure assuming that the com- putational Diffie-Hellman problem is hard in the group used by the protocol.

3. PROBLEM STATEMENT 3.1 General Setting

Our generic setting for cloud storage systems consists of a storage server S and a set of clients (Cs) who store their files on S. Cs never communicate directly, but exchange messages with S, and S processes the messages and/or for- wards them as needed. In addition, third parties (T Ps) can be introduced to the system to assist deduplication. For ex- ample, T Ps can be used for key generation [3], keeping track of file popularities [25] and even encryption/decryption [23].

But independent T Ps are expensive to deploy in practice and can be bottlenecks for both security and performance.

Introducing a T P which is independent from S is often un- realistic in commercial setting.

We assume that parties communicate through secure chan- nels in the system, so that an adversary A cannot eavesdrop and tamper any data in the channel. Thus we assume that S can authenticate and authorize Cs, and Cs can authenticate S as well.

We introduce new notations as needed. A summary of

Inputs:

• The input of the uploading client C is F ;

• Each existing client C

i

has inputs F

i

and k

F_i

;

• S’s input is a set of existing encrypted files {E(k

F_i

, F

i

)}.

Outputs:

• C learns a key k

F

with which to encrypt its file. If F is identical to an existing file F

i

then k

F

= k

F_i

. Otherwise k

F

is random;

• Each C

i

’s output is empty;

• S learns the encryption of F with the key k

F

. Note that if there exists F

i

= F then this en- cryption is identical to the encryption of F

i

. Figure 3: The ideal functionality F

dedup

of dedupli- cating encrypted data.

notations appears in Appendix A.

3.2 Ideal Model

We define an ideal model that attempts to capture the ideal functionality F

dedup

of deduplicating encrypted data.

The model has three types of participants: the storage server S, an uploading client C attempting to upload a new file, and existing clients {C

i

} who have already uploaded a file. The ideal functionality is shown in Figure 3.

A deduplication system for encrypted files is secure if it implements F

dedup

. In particular, no participant learns more information than is defined in the output of the functional- ity. Our system implements this functionality with the fol- lowing relaxation: S learns a short hash of F (say, 13 bits long), and other clients that have previously uploaded files with this hash value might learn that a new file with this short hash is being uploaded. In a large scale system this short hash agrees with many files and therefore leaks limited information about the uploaded file.

3.3 Design Goals

Threat Model: An adversary A might control any subset of Cs of the system. It might also have control over S: it can gain access to the data stored at S, perform offline brute- force guessing attacks, and might even mount active attacks by deviating from the deduplication protocol and potentially masquerading as Cs of the system. A may also control any third party T P used by the deduplication scheme.

When client-side deduplication takes place then A mas-

querading as C uploading a file, and wishing to identify

whether a specific file was already uploaded by a different

C, can apply an online brute-force attack as described by

Harnik et al. [17]. The randomized approach of [17], de-

scribed in Section 2.1, can mitigate that attack. However,

even if the deduplication scheme is secure against offline

brute-force guessing attacks, an A controlling both S and

some Cs can still mount a similar online brute-force attack

by running the deduplication protocol for every “guess” of a

predictable file and checking if deduplication occurs. This

attack is “online” since, as will become clearer in Section 4,

(4)

Public information: A finite cyclic group G of prime order p generated by an element g. Public elements M

u

∈ G associated with user u. A hash function H modeled as a random oracle.

Secret information: User u has a password pw

u

. The protocol is run between Alice and Bob:

1. Each side performs the following computation:

• Alice chooses x ∈

R

Z

p

and computes X = g

^x

. She defines X

^∗

= X · (M

A

)

^pw^A

.

• Bob chooses y ∈

R

Z

p

and computes Y = g

^y

. He defines Y

^∗

= Y · (M

B

)

^pw^B

. 2. Alice sends X

^∗

to Bob. Bob sends Y

^∗

to Alice.

3. Each side computes the shared key:

• Alice computes K

A

= (Y

^∗

/(M

B

)

^pw^A

)

^x

. She then computes her output as SK

A

= H(A, B, X

^∗

, Y

^∗

, pw

A

, K

A

).

• Bob computes K

B

= (X

^∗

/(M

A

)

^pw^B

)

^y

. He then computes his output as SK

B

= H(A, B, X

^∗

, Y

^∗

, pw

B

, K

B

).

Figure 2: The SPAKE2 protocol of Abdalla and Pointcheval [1].

A has to run a protocol invocation with the uploading C for every “guess” of the content of the uploaded file.

Security Goals: With the above threat model in mind, we aim to propose a deduplication scheme that both enables S to do deduplication and Cs to encrypt their files with a semantically secure encryption scheme. Our security goals are as follows.

S1 Avoid relying on any T P.

S2 Prevent brute-force guessing attacks:

S2.1 online by compromised Cs S2.2 offline by compromised passive S

S2.3 online by compromised active S (masquerading as multiple Cs)

Functional goals: In addition, our protocol should also meet certain functional goals.

F1 Maximize deduplication effectiveness (exceed realistic expectations, as discussed in Section 2.1).

F2 Minimize computational and communication overhead (i.e., the computation/communication costs should be comparable to storage systems without deduplication).

4. OVERVIEW OF THE SOLUTION

We first motivate salient design decisions in our dedupli- cation protocol before describing the details in Section 5.

When an uploader C wants to upload a file F to S, we need to address two problems: (a) determining if S already has an encrypted copy of F in its storage and (b) if so, then securely arranging to have the encryption key transferred to C from some C

i

who uploaded the original encrypted copy of F .

In traditional client-side deduplication, when C wants to upload a file F to S, she first sends a cryptographic hash h of F to S so that it can check for the existence of the file in its storage. Na¨ıvely adapting this approach to the case of encrypted storage is insecure because a compromised S can easily do offline brute-force guessing attacks on h if F is predictable. Therefore, instead of h, we let C send a short hash sh = SH(F ). Due to the high collision rate of the short hash function SH(), S cannot reliably guess the content of F offline.

Now, suppose that another C

i

, previously uploaded E(k

F_i

, F

i

) using k

F_i

as the symmetric file encryption key for F

i

and that sh = sh

i

. Our protocol needs to determine if this hap- pened because F = F

i

and, in that case, arrange to have k

Fi

securely transferred from C

i

to C. We do this by having C and C

i

engage in an oblivious key sharing protocol which al- lows C to receive k

F_i

iff F = F

i

and a random key otherwise.

We say that C

i

plays the role of a checker in this protocol.

The oblivious key sharing protocol could be implemented using generic solutions for secure two-party computation, such as Yao’s protocol [28], which express the desired func- tionality as a boolean circuit and then securely compute it.

Protocols of this type have been demonstrated to be very efficient, even with security against malicious adversaries.

In our setting the circuit representation is actually quite compact, but the problem in using this approach is that the inputs of the parties are relatively long (say, 288 bits long, comprising of a full-length hash value and a key), and known protocols require an invocation of oblivious transfer, namely of public-key operations, for each input bit. There are known solutions for oblivious transfer extension, which use a preprocessing step to reduce the online computation time of oblivious transfer. However, in our setting the secure computation is run between two clients that do not have any pre-existing relationship, and therefore preprocessing cannot be computed before the protocol is run.

Our solution for an efficient oblivious key sharing protocol is based on having C and C

i

run a PAKE instance, with the hash values of their files, namely h and h

i

, as their respective input “passwords”. The protocol results in C

i

getting k

i

and C getting k

⁰i

, which are equal if h = h

i

and are independent otherwise. The next step of the protocol, which will be described in Section 5, uses these keys to deliver to C a key k

F

, which is equal to k

F_i

iff k

i

= k

⁰i

. C uses this key to encrypt her uploaded file, and S can deduplicate that file if this file is equal to the file uploaded by C

i

. Several additional issues need to be solved:

1. A compromised S can perform an online brute-force

guessing attack where it initiates many interactions

with C/C

i

to identify F /F

i

. Each protocol interaction

essentially enables a single guess about the content of

the relevant file. Cs therefore use a per-file rate limiting

strategy to prevent such attacks. Specifically, they set

(5)

root

sh

1

E(k

F1

, F

1

)E(k

F2

, F

2

) ...

C

1

C

2

...

sh

2

...

Figure 4: S’s record structure.

a bound on the maximum number of PAKE requests they would service as a checker or an uploader for each file. (See Section 5.1.) Our experiments with real data sets in Section 7 show that this rate limiting does not affect the success of deduplication.

2. What if C

i

is not online? If S has a large number of clients, it can find enough online checkers. If there are not enough online checkers, we can let the uploader run PAKE with the currently available checkers and with additional dummy checkers to hide the number of available checkers. (See Section 5.2.) Again, our experiments in Section 7 show that this does not affect the success of deduplication (since it is likely to find checkers for popular files).

3. How can S do client-side deduplication securely? S will know whether C and C

i

have the same file from the results of PAKE. If they do have the same file, S just inform C that she need not upload her file. In order to solve the problem of online brute-force guessing attacks by C, when client-side deduplication takes place, we can use the same randomized threshold strategy as [17]

(see Section 5.3).

5. DEDUPLICATION PROTOCOL

In this section we describe our deduplication protocol in detail. The data structure maintained by S is shown in Fig- ure 4. Clients Cs who want to upload a file also upload the corresponding short hash sh. Since different files may have the same short hash, a hash sh is associated with the list of different encrypted files whose plaintext has this hash value.

S also keeps track of clients (C

1

, C

2

, . . .) who have uploaded the same encrypted file (therefore, with overwhelming prob- ability, these clients have the same plaintext file encrypted with the same key).

Figure 5 shows a basic secure deduplication protocol. Each C

i

that uploads a file F

i

stores at S a short hash sh

i

of that file (Step 0). When a client C wishes to upload a file F it sends the short hash of this file, sh, to S (Step 1). S identifies clients {C

i

} that have uploaded files with the same short hash (Step 2), and runs the following protocol with each of them (in the more refined version of the system, there is a bound on the number of these protocol invocations).

Consider a specific C

i

who uploaded a file F

i

. It runs a PAKE protocol with C, where their inputs are the cryp- tographic hash values of F

i

and F respectively, and their

outputs are keys k

i

and k

⁰_i

respectively (Step 3). We as- sume that the keys are sufficiently long so that k

i

can be divided to a left key and right key, k

i

= k

iL

||k

iR

, where k

iL

is long enough so that collisions are unlikely (namely,

|k

iL

| >> 2 log n where n is the number of clients), and k

iR

can be used for symmetric encryption. The same also holds for k

⁰_i

.

At this point, a naive solution would have been for C

i

to send to C an encryption of k

F_i

with k

i

, namely E(k

F_i

, k

i

).

However, this would enable a subtle attack by the C to iden- tify whether F was previously uploaded. Therefore, the protocol continues as follows: each C

i

sends to S the pair k

iL

, (k

iR

+ k

F_i

), where addition is done in a field whose size will be defined next. C sends to S the pair k

⁰_iL

and Enc(pk, k

⁰_iR

+ r), as well as its public key pk, where this is an additively homomorphic encryption (Steps 4-5).

After receiving these messages from all C

i

s, S looks for a pair i for which k

iL

= k

⁰_iL

. This equality happens iff F

i

= F (except with negligible probability). If S finds such a pair it sends to C the value e = Enc(pk, k

iR

+ k

F_i

) Enc(pk, k

⁰_iR

+ r) = Enc(k

F_i

− r). Otherwise it sends e = Enc(r

⁰

), where r

⁰

is chosen randomly by S (Step 6). C calculates k

F

= Dec(e) + r, and sends E(k

F

, F ) to S (Step 7). Note that if F was already uploaded to the server then E(k

F

, F ) is equal to the previously stored encrypted version of the file.

This basic protocol has several issues that are addressed in the following sections: it allows for online brute-force guess- ing attacks (which we prevent using rate-limiting in Sec- tion 5.1); it does not specify how checkers are selected by S, if too many checkers are available (Section 5.2), and, if client-side deduplication is used, it leaks to the client whether F was already uploaded (this is solved in Section 5.3 using a randomized threshold approach).

Theorem 1. The deduplication protocol in Figure 5 im- plements F

dedup

in malicious model if the PAKE protocol is secure against malicious adversaries, the additively ho- momorphic encryption is semantically secure and the hash function is modeled as a random oracle.

Proof. We assume there is a trusted party that imple- ments F

dedup

in ideal model. We then construct a simula- tor, and show that the execution of F

dedup

in ideal model is computationally indistinguishable from the execution of the deduplication protocol in real model. For the purpose of the proof we assume that the PAKE protocol is implemented as an oracle to which the parties sends their inputs and re- ceive their outputs. (Note that the PAKE functionality is assumed to be implemented using a fully-secure protocol; in fact, it could be implemented using generic secure computa- tion protocols. Therefore this assumption is justified.)

A corrupt uploader C: We first assume that S and the C

i

s are honest and construct a simulator for the uploader C.

The simulator operates as follows: The simulator records the calls that C makes to the hash function, which is modeled as a random oracle, and records tuples of the form (F

^j

, h

^j

, sh

^j

) of the inputs and outputs for these calls. The simulator first receives a value sh from the C. The simulator now obtains C’s inputs to the different PAKE invocations. If such an input is equal to an h

^j

value which was in the same tuple as sh the simulator invokes F

dedup

with F

^j

and receives a key k

F

from F

dedup

.

Note that an honest C uses the same h value in all PAKE

invocations. A corrupt C might use different h values. Also,

(6)

0. When each C

i

uploaded its file F

i

, it uploads sh

i

and E(k

F_i

, F

i

).

1. C attempts to upload a file F . It calculates the cryptographic hash h and a short hash sh of F , and sends sh to S.

2. S finds the clients {C

i

} which uploaded files {F

i

} whose short hash sh

i

is equal to sh, and lets them run PAKE protocols with C.

^a

C’s input is h and C

i

’s input is h

i

.

3. After the invocation of the PAKE protocols, each C

i

gets a session key k

i

and C gets a set of session keys {k

i⁰

} corresponding to the different C

i

’s.

^b

4. Each C

i

splits k

i

to k

iL

||k

iR

and sends k

iL

and (k

iR

+ k

F_i

) to S.

5. C splits each k

⁰_i

to k

⁰_iL

||k

_iR⁰

, and sends to S its public key pk, {k

_iL⁰

} and {Enc(pk, k

_iR⁰

+ r)} where r is chosen randomly.

6. After receiving these messages from all C

i

s, S checks if there is an index i for which k

iL

= k

⁰iL

.

(a) If this is the case then it uses the homomorphic properties of the encryption to send to C a value e = Enc(pk, k

iR

+ k

F_i

) Enc(pk, k

⁰_iR

+ r) = Enc(k

F_i

− r);

^c

(b) Otherwise it sends e = Enc(r

⁰

), where r

⁰

is chosen randomly by S.

7. C calculates k

F

= Dec(e) + r, and sends E(k

F

, F ) to S.

8. If S already stores E(k

F

, F ) it deletes E(k

F

, F ) and allows C to access the stored file E(k

F_i

, F

i

). Otherwise, S stores E(k

F

, F ).

a

The PAKE is run via S. There is no direct interaction between the C and each C

i

. C sends its first message of the PAKE protocol in Step 1.

b

Note that, with overwhelming probability, k

i

= k

⁰i

iff F

i

= F .

c

If more than a single k

iL

matches, then the server picks at random one of the matching values and computes the encryption using the corresponding k

iR

. This event does not happen if C is honest, but might happen if C uses in the PAKE protocols different h values that agree with the same short hash sh.

Figure 5: The deduplication protocol.

since sh is a short hash, there might be several (F

^j

, h

^j

, sh

^j

) tuples with the same sh value and therefore several legiti- mate h values for C to use. The simulator might therefore learn several k

F

values. The simulator stores for each h value a key k

F

, which is equal to the corresponding result from F

dedup

, or is random if h was not in a tuple with sh.

The simulator sends back to C a random key k

i

for each PAKE invocation. It splits this key as k

i

= k

iL

||k

iR

, and stores the result together with the h value used in this PAKE invocation, and the key k

F

that corresponds to h.

The simulator receives from C a set of pairs ({k

⁰_iL

, Enc(k

_iR⁰

+ r)}. Let K = {k

_iL⁰

| k

_iL⁰

= k

iL

}, namely the set of k

⁰_iL

values that agree with the values stored by the simulator. Let K

⁰

be a maximal subset of K where each k

⁰_iL

∈ K

⁰

corresponds to a different h value. Then if K

⁰

is not empty the simulator chooses at random an element k

_iL⁰

∈ K

⁰

and sends back to C the encryption Enc((k

iR

+ k

F

) − (k

⁰_iR

+ r)), computed using the homomorphic properties. Otherwise it sends to C an en- cryption of a random value. (Note that if F is not identical to any of the files F

i

then the key k

F

that C will retrieve is random, even if the simulator found a match of a the k

iL

value.)

A corrupt C

i

: In the ideal model C

i

always provides an input to the functionality. We prove security with relation to the relaxed functionality described in Section 3.2, where C

i

also learns whether the uploaded file has the same short hash as F

i

.

The simulator interacts with C

i

in the real protocol and should extract C

i

’s input to the functionality in the ideal model, namely (F

i

, k

F_i

). We can assume that C

i

has previ- ously sent E(k

F_i

, F

i

) to the server, and therefore the simu-

lator should only find k

F_i

and use it to compute F

i

. The simulator first observes whether sh matches the short hash of C

i

. If that is not the case then the simulator pro- vides a random k

F_i

to the ideal functionality. Otherwise, the simulator observes C

i

’s input h

i

to the PAKE protocol, its output k

i

, and the message (k

iL

, k

iR

+ k

F_i

) that C

i

sends to the server. If k

iL

is different than the left part of k

i

then the simulator provides to the functionality a random value k

Fi

. Otherwise, it subtracts the right part of k

i

, namely k

iR

, from k

iR

+ k

F_i

, and provides the result as the input to the functionality.

A corrupt Server: In the protocol the server learns whether the k

iL

value matched or not. This should be done in the simulation, too, but this information can be deduced from the output in the ideal model.

Now we discuss several extensions to the basic protocol to account for the types of issues we alluded to in Section 4.

5.1 Rate Limiting

A compromised active S can do online brute-force guess- ing attacks on either C or C

i

. Specifically, if F

i

is predictable, S can pretend to be uploader and send PAKE requests to C

i

attempting to guess F

i

. S can also pretend to be checkers and send PAKE responses to C to guess F . Both upload- ers and checkers should limit the number of PAKE runs for each file in their respective roles. We use RL

c

to denote the rate limit for checkers, i.e., each C

i

can process at most RL

c

PAKE requests for F

i

and will ignore further requests.

Similarly, RL

u

is the rate limit for uploaders: i.e. an C will

send at most RL

u

PAKE requests to upload F . This strat-

egy can both improve security (shown in Section 6.3) and

(7)

reduce overhead (shown in Section 7).

5.2 Checker Selection

On receiving an upload request, S selects matching files in descending order of popularity (in terms of number of Cs who own the file) and then choose checkers for that file in ascending order of engagement (in terms of the number of PAKE requests they have serviced so far as checkers for that file). However, the number of PAKE instances an uploader is asked to run for a given file may reveal information about the popularity of that file. To avoid this, S may need participate in fake PAKE runs with C using random inputs.

More concretely, after receiving an upload request from C, S generates a random value r

1

∈ [1, RL

u

− 1], and sets r

2

as RL

u

− r

1

.

• If no short hash matches sh

a

, S runs r

1

times PAKE immediately using random inputs.

• If S finds a sh

i

that matches sh and there are x files under sh

i

and y of them have owners whose clients are currently online,

– if y < r

1

, S runs (r

1

− y) times PAKE using ran- dom inputs; Then, for each of the y files, selects an online checker to run PAKE, and marks the other (x − y) files as unchecked ;

– if r

1

≤ y, S selects r

1

of the y files based on popularity; and for each of them, selects an online checker to run PAKE. Then marks the other (x − r

1

) files as unchecked.

At the end of the above procedure, if no match has been found, S picks a random PAKE instance and ask C to use the corresponding key to encrypt the file she wants to up- load. Assume there are z unchecked files after the above procedure,

• if z < r

2

, S runs (r

2

−z) times PAKE using random in- puts at random time interval, and whenever a checker for any of the z files comes online, sends a PAKE re- quest to him;

• if r

2

≤ z, S selects r

2

of the z unchecked files based on popularity, and whenever a checker for any of the r

2

files comes online, sends a PAKE request to it.

During the above procedure, whenever a match is found, S stops sending PAKE requests to other checkers. But it still uses random inputs to make sure C has run PAKE r

1

times before uploading and r

2

times after uploading.

5.3 Randomized Threshold

The protocol in Figure 5 is a sever-side deduplication scheme. To save bandwidth, we use the randomized thresh- old approach in [17] to transform it to client-side dedupli- cation. For each file F , S keeps a random threshold t

F

(t

F

≥ 2), and a counter c

F

that indicates the number of Cs that have previously uploaded F .

In step 6 of the deduplication protocol,

• In the case of 6a, if c

F_i

< t

F_i

, S tells C to upload E(k

F

, F ) and then deletes it locally. Otherwise, S just informs C that the file is duplicated and there is no need to upload it.

• In the case of 6b, if there are still unchecked files, it asks C to run further PAKE instances when checkers owning those files come online.

– if S finally finds k

jL

= k

_jL⁰

, S puts C as an owner of E(k

F_j

, F

j

),

∗ if c

F_j

≥ t

F_j

, S tells C to switch the key of F to k

F_j

and then deletes E(k

F

, F ) from its local storage;

∗ otherwise, S treats F as another copy of F

j

and keeps Enc(pk, k

F_j

− r). If c

F_j

exceeds t

F_j

later, it sends Enc(pk, k

F_j

− r) to C, tells her to switch the key of F and then deletes E(k

F

, F ) from its local storage.

– Otherwise, S puts E(k

F

, F ) under sh

a

and puts C as an owner of E(k

F

, F ).

6. SECURITY ANALYSIS

In this section, we demonstrate how our deduplication scheme satisfies the security requirements in section 3.3. Our scheme doesn’t rely on any T P, so it satisfies S1.

6.1 Compromised Client Only

A can compromise one or more legitimate Cs, and attempt online brute-force guessing attacks. From Section 5.2, we know that a C receives r

1

PAKE replies immediately after sending an upload request. Then C is either required to upload the encrypted file or is given access to an already existing encdrypted file. If C was asked to upload the en- crypted file, it cannot still learn how many other Cs, if any, have already uploaded the same file. If C was not required to upload the encrypted file, or if after being asked to run a further r

2

PAKE instances, C was informed that the encryp- tion key has changed, then C can only conclude that its file is sufficiently popular (i.e., has more than the specific ran- dom threshold number of owners). Therefore, our protocol satisfies S2.1.

6.2 Compromised Passive Server

Next, we consider the case where A has compromised S such that A can gain access to each message received during the protocol as well as each file stored in S. A can thus attempt offline brute-force guessing attacks on predictable files. But all sensitive data (i.e., files, hashes and keys) are encrypted using a semantically secure encryption scheme with keys known only to the participating Cs. The only thing that leaks information is the short hash. We can set its length to make it leak as little information as required.

As long as the range of the short hash function is sufficiently smaller than the plaintext space from which a file is drawn, A cannot succeed in a brute-force guessing via a compro- mised passive S. Therefore, our scheme satisfies S2.2.

6.3 Compromised Active Server

Finally, we consider the case where A has compromised S and can make it masqurade as as many Cs as she wants (this is equivalent to A compromising S and the required number of legitimate Cs). Such an adversary can attempt to mount online brute-force guessing attacks on predictable files. As introduced in Section 5.1, we use a per-file rate limiting strategy to prevent such attacks. Recall that RL

u

(RL

c

) is the rate limit for uploaders (checkers), n is the

(8)

length of the short hash and Φ is the space of a predictable file F , then A can uniquely identify F only if

|Φ| ≤ 2

ⁿ

· x · (RL

u

+ RL

L

) (1) where x is the number of Cs who potentially own F .

As a comparison, DupLESS [3] uses a per-client rate lim- iting strategy to prevent such online brute-force guessing attacks from one C. Since they need to choose the rate limit so that it does not degrade usability for a C that needs to le- gitimately upload a large number of files within a brief time interval (such as backing up a local file system), the authors of DupLEss choose the large bound 825 000 as the total number of requests a single C can make during one week.

That is to say, for a predictable file whose size of file space

|Φ| is 825 000, a compromised C has to use at least one week to uniquely identify it. But if A has compromised multiple Cs, this time lag will be linearly decreased. Recall that a compromised active S can masquerade as any number of Cs needed.

To prevent a compromised active S from masquerading multiple Cs, authors in [25] introduce another T P called identity server. When Cs first join the system, the identity server is responsible for verifying their identity and issuing credentials to them. But an identity server is hard to deploy in real world. As far as we know, our scheme is the first deduplication protocol that satisfies S2.3 without the aid of an identity server.

7. SIMULATION

Our approach incurs some extra computation and com- munication due to the number of PAKE runs whenever a C wants to upload a file. Furthermore, the use of rate lim- its and randomized thresholds can impact deduplication ef- fectiveness. In this section, we use realistic simulations to study the effect of various parameter choices in our protocol on its computation/communication overhead and dedupli- cation effectiveness.

Datasets. We want to consider two types of storage envi- ronments. The first consists predominantly of media files, such as audio and video files. We did not have access to a dataset from such an environment. Instead, we use a dataset comprising Android application prevalence data to represent an environment with media files. This is based on the assumption that the popularity of Android appli- cations is likely to be similar to that of media files: both are created and published by a small number of authors (artists/developers), made available on online stores or other distribution sites, and are acquired by consumers either for free or for a fee. We call this the media dataset. We use a publicly available dataset.

¹

It consists of data collected from 77 782 Android devices. For each device, the dataset identi- fies the set of (anonymized) application identifiers found on that device. We treat each application identifier as a “file”

and presence of a app on a device as an “upload request” to add the corresponding “file” to the storage. This dataset has 7 396 235 “upload requests” in total, of which 178 396 are for distinct files.

The second is the type storage environments found in an enterprise backup systems. We use data gathered by Debian Popularity Contest

²

to approximate such an environment.

1

https://se-sy.org/projects/malware/

2

http://popcon.debian.org

The popularity-contest package on a Debian device regularly reports the list of packages installed on that device. The resulting data consists of a list of debian packages along with the number of devices which reported that package. We took a snapshot of this data on Nov 27, 2014. It consists of data collected from 175 903 Debian users and. From this, we generate our enterprise dataset : it has 217 927 332 “upload requests” (debian package installations) of which 143 949 are for distinict files (unique debian packages).

Figure 6 shows the file popularity distribution (i.e., the number of upload requests for each file) in logarithmic scale for both datasets. We map each dataset to a stream of upload requests by generating the requests in random order.

Specifically, for a file that has x copies, we generate x upload requests for it at random time intervals.

File ID

10⁰ 10¹ 10² 10³ 10⁴ 10⁵ 10⁶

Number of Upload Requests

10⁰ 10¹ 10² 10³ 10⁴ 10⁵ 10⁶

File popularity in media dataset File popularity in enterprise dataset

Figure 6: File popularity in both datasets.

Parameters. To facilitate comparison with DupLESS [3], we set |Φ| as 825 000. We then set the length of the short hash n to 13. Inequality 1 in Section 6.3 then implies a lower bound of 100 for (RL

u

+RL

c

): with these parameter choices, A can mount a successful online brute-force guessing attack to uniquely identify a predictable file held by only one C.

We use these parameter values in our simulations.

We measure overhead as the average number of PAKE runs

³

, which can be calculated as:

µ = T otal number of PAKE runs

T otal number of upload requests (2) We measure deduplication effectiveness using deduplication percentage, introduced in Section 2.1. We assume that all files are of equal size so that deduplication percentage ρ is:

ρ = 1 − N umber of all f iles in storage

T otal number of upload requests (3) First we assume that all Cs are online during the simula- tion, and study the impact of rate limits. Once we choose specific rate limits, we evaluate how absent Cs and the use of randomized thresholds impact deduplication effectiveness.

3

We do not include fake PAKE runs by S (Section 5.2) since

we are interested in estimating the average number of real

PAKE runs. In a full implementation, an uploader will al-

ways run r

1

PAKE runs, where r

1

∈ [1, RL

u

− 1].

(9)

Rate limiting. Having selected RL

u

+RL

c

to be 100, we now see how the average number of PAKE runs and the deduplication effectiveness change when divide this total number in different ways between RL

u

and RL

c

. Figure 7 shows the average number of PAKE runs resulting from dif- ferent values of RL

u

(and hence RL

c

) in both datasets. We also ran the simulation without any rate limits, which led to an average of 26.88 PAKE runs in media dataset and 13.19 PAKE runs in enterprise dataset. They are signifi- cantly larger than the results with rate limiting. Figure 8 shows the deduplication percentage resulting from different rate limit choices in both datasets. We see that setting RL

u

(RL

c

) to 30 (70), maximizes the deduplication percentage to values (97.58% and 99.9332%) close to perfect deduplication (97.59% and 99.9339%) in both datasets.

RLu(RL

c)

Average Number of PAKE Runs

1.25 1.3 1.35 1.4 1.45 1.5 1.55 1.6 1.65 1.7 1.75

10(90) 20(80) 30(70) 40(60) 50(50) 60(40) 70(30) 80(20) 90(10) Average number of PAKE runs in media dataset

Average number of PAKE runs in enterprise dataset

Figure 7: Average number of PAKE runs VS. rate limits.

We conclude that the use of rate limits not only improves security, but can also reduce the overhead without negatively impacting deduplication effectiveness.

RL_u(RL_c)

Deduplication Percentage %

96.5 97 97.5 98 98.5 99 99.5 100

10(90) 20(80) 30(70) 40(60) 50(50) 60(40) 70(30) 80(20) 90(10) Deduplication percentage in media dataset Deduplication percentage in enterprise dataset Perfect deduplication in media dataset Perfect deduplication in enterprise dataset

Figure 8: Deduplication percentage VS. rate limits.

Randomized threshold and offline rate. The possibility

of some Cs being offline may adversely impact deduplication effectiveness. For example, as we described in Sections 5.2–

5.3, if S finds that the upload request from C does not match any online checker available at the time of the upload, it will ask C to upload its encrypted file. S will continue to ask C to run r

2

PAKE instances afterwards. If one of these finds a checker whose existing file F matches C’s file but c

F

never exceeds t

F

, S can no longer deduplicate C’s copy of F . Had the original uploader of F been online at the time C wanted to upload its file, S would have successfully con- ducated server-side deduplication of F . In other words, the possibility of clients being offline will impact deduplication effectiveness.

To estimate this impact, we assign an offline rate to each C as its probability to be offline during one upload request.

Using the chosen rate limit values (RL

u

= 30 and RL

c

= 70) and fixing the upper limit of the random threshold (i.e., 2 ≤ t

F

≤ 10), we measured the deduplication percentage by varying the offline rate value. The results for both datasets are shown in Figure 9. It shows that the deduplication per- centage is still high when the offline rate is as high as 90%.

Offline Rate

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9

96.5 97 97.5 98 98.5 99 99.5 100

Deduplication percentage in media dataset Deduplication percentage in enterprise dataset

Figure 9: Deduplication percentage VS. offline rates.

Evolution of deduplication effectiveness: Deduplica- tion percentage will continue to increase as more files are added to the storage. Figure 10 shows that the deduplication percentage achieved by our scheme meets the realistic expec- tation (95%) in the very beginning (after receiving 304 160 upload requests in the media dataset, and 121 110 upload requests in enterprise dataset). So our system satisfies F1.

However, our use of rate limits may result in a somewhat slower increase compared to schemes with perfect dedupli- cation. Figure 11 shows how the difference between perfect deduplication and the deduplication percentage achieved by our scheme evolves as more files are added to the storage.

Figure 13 (in Appendix C) shows the secondary deviation.

The difference become stable as more files are uploaded to the storage.

This phenomenon can be explained by the Zipf ’s law [29].

As uploaders set a rate limit for their requests and S always

selects files in descending order of popularity, our system is

similar to web proxy caching, which is a temporary storage

for popular web pages. Breslau et al. observe that page

(10)

Number of Upload Requests #10⁶

0 1 2 3 4 5 6 7 8

Deviation in Deduplication Percentage %

0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45

Deviation from perfect deduplication in media dataset

Number of Upload Requests _#10⁸

0 0.5 1 1.5 2 2.5

Deviation in Deduplication Percentage %

#10^-3

0 1 2 3 4 5 6 7 8 9

Deviation from perfect deduplication in enterprise dataset

Figure 11: Deviation from perfect deduplication VS. Number of upload requests.

Number of Upload Requests (x30 in enterprise dataset) _#10⁶

0 1 2 3 4 5 6 7 8

0 10 20 30 40 50 60 70 80 90 100

Evolution of deduplication percentage in media dataset Evolution of deduplication percentage in enterprise dataset

Figure 10: Deduplication percentage VS. Number of upload requests.

requests distribution follows the Zipf’s law, where the fre- quency of a request for the mth most popular page can be calculated as

P_N^1/m^α

i=1(1/i^α)

, where N is the size of the cache, and α is the value of the exponent characterising the distri- bution [8]. As a result, most of the requested pages can be found in the cache. In our case, most of the upload requests can find a matched file within the rate limit if the file has been uploaded already.

8. PERFORMANCE EVALUATION 8.1 Prototype implementation

Our prototype consists of two parts: (1) a server pro- gram which simulates S and (2) a client program which sim- ulates C (performing file uploading/downloading, encryp- tion/decryption, and assissting S in deduplication). We use

Node.js

⁴

for the implementation of both parts and Redis

⁵

for the implementation of S’s data structure. Furthermore, we use SHA-256 as the cryptographic hash functions and AES with 256-bit keys as the symmetric encryption scheme, both of which are provided by the Crypto module in Node.js.

We truncate the cryptographic hash to get the short hash.

We use the GNU multiple precision arithmetic library

⁶

to implement the public key operations.

8.2 Test setting and methodology

We run the server side program on an Amazon EC2 in- stance (Intel Xeon with 1 CPU at 2.5 GHz)

⁷

. and the client side program on an Intel Core i7 machine with 4 2.2 GHz cores. We measure the running time using the Date mod- ule in javascript and measure the bandwidth usage using TCPdump.

As the downloading phase in our protocol is simply down- loading an encrypted file, we only consider the uploading phase. We set 13 for the length of short hash and 30 for RL

u

. We consider the worst case where we let the uploader run PAKE with 30 checkers. So we simulate the uploading phase in our protocol as:

1. An uploader C sends the short hash of the file it wants to upload to server S;

2. S sends requests to 30 checkers C

i

and lets them run PAKE with C;

3. S waits to get responses of all instances back from C and {C

i

};

4. S chooses one instance and sends the result to C 5. C uses the resulting key to encrypt its file with AES

and uploads it to S.

We measure both the running time and bandwidth usage during the whole procedure above. We compare the results

4

http://nodejs.org

5

http://redis.io

6

https://gmplib.org

7

http://aws.amazon.com/ec2/

(11)

to two baselines: (1) simply uploading a file without en- cryption and (2) uploading a file with AES encryption. As in [3], we repeat our experiment using files of size 2

²ⁱ

KB for i ∈ {0, 1, ..., 8}, which provides a file size range of 1KB to 64 MB. For each file, we upload it 100 times and cal- culate the mean values. For files that are larger than the computer buffer, we do loading, encryption and uploading at the same time by piping the data stream. As a result, uploading encrypted files uses almost the same amount of time with uploading plain files.

8.3 Results

Figure 12 reports the uploading time and bandwidth us- age in our protocol compared to two baselines. For files that are smaller than 1 MB, the extra overhead introduced by our deduplication protocol is relatively high. For exam- ple, it takes 587 ms (2 436 bytes bandwidth) to upload a 1 KB plain file and 594 ms (2 585 bytes) to upload a 1 KB encrypted file, while it takes 1 784 ms (14 7308 bytes) to up- load the same file in our protocol. But the extra overhead introduced by our protocol is independent of the file size, and it become negligible when the file is large enough. So our system meets F2.

9. DISCUSSION

Incentives: In our scheme Cs have to run several PAKE instances as both uploaders and checkers. This imposes a cost on each C. S is the direct beneficiary of the protocol. Cs may indirectly benefit from deduplication in that the ability to effectively deduplicate makes the storage system more efficient and can thus potentially lower the cost of using the storage incurred by each C. Nevertheless, it is desirable to have more direct incentive mechanisms to encourage Cs to do do PAKE checks. For example, if a file uploaded by C is found to be shared by other Cs, S could reward the Cs owning that file by giving them small increases in their respective storage quotas.

Deduplication effectiveness: We can improve dedupli- cation effectiveness by introducing additional checks. For example, an uploader can indicate the (approximate) file size (which will be revealed any way) so that S can limit the selection of checkers to those whose files are of a simi- lar size. Similarly, S can keep track of similarities between Cs based on the number of files they share and use this in- formation while selecting checkers by prioritizing checkers who are similar to the uploader. Nevertheless, as discussed in Section 7, our schemes achieve surpass what is consid- ered as realistic levels of deduplication in the very beginning.

Whitehouse [27] reported that when selecting a deduplica- tion scheme, enterprise administraotrs rated considerations such as ease of deployment and use being more important as deduplication ratio. Therefore, we argue that the small sacrifice in deduplication ratio is offset by the significant advantage of ensuring user privacy without having to use independent third party servers.

Block-level deduplication: Our scheme can be applied to both file-level and block-level deduplication. But block-level deduplication will incur more overhead.

10. RELATED WORK

One technique proposed to enable deduplication with en- crypted data is convergent encryption (CE) [12], which de-

rives the file encryption key deterministically from the file contents. As a result, the same file will produce the same ciphertext, so that S can deduplicate ciphertexts. But if A compromises S, it can easily perform an offline brute-force guessing attack. Bellare et al. recently proposed “message- locked encryption” (MLE), which uses a semantically-secure encryption scheme but produces a tag in a deterministic way [4]. So it still suffers from the same attack.

Puzio et al. propose a block-level deduplication scheme that uses a third party T P [23]. C first encrypts each block with CE and then sends the ciphertexts to T P, who adds an additional layer of encryption to the block. During file retrieval, blocks are decrypted by T P first and sent back to C. If A compromises both S and Cs, she can perform an online brute-force guessing attack by letting multiple Cs upload all of possible files and monitoring S’s storage.

Stanek et al. propose a scheme that only deduplicates popular files [25]. Cs encrypt their files with two layers of encryption: the inner layer is obtained through CE and the outer layer is obtained through a semantically secure thresh- old encryption scheme. S can decrypt the outer layer of a file iff the number of Cs who have uploaded this file reaches the threshold. They introduce a T P as an index server to keep track of the popularity of each file, but the index server has to know the content or hash of each file. In addition, they introduce another T P as an identity server to prevent online brute-force guessing attacks by multiple Cs.

Both of these solutions [23, 25] are vulnerable to offline brute-force guessing attacks if A compromises the T Ps.

Bellare et al. propose DupLESS that uses an additional secret in the key generation process of CE [3]. This secret is identical for all Cs and is provided by a T P. To prevent T P from conducting offline brute-force guessing attacks, Cs run an oblivious PRF protocol with T P to generate the key which guarantees that Cs who input the same file will get the same encryption key without revealing their files. But if A compromises both S and T P, S can get the secret from T P and the scheme will be reduced to normal CE.

Duan proposes a distributed architecture that generates the file encryption key by a threshold number of all Cs [13].

But this scheme is only suitable for peer-to-peer scemarios:

a threshold number of Cs must be online and interact with one another to generate the key. A malicious S can easily generates more than a threshold number of Cs.

In Table 1, we summarize the resilience of these schemes with respect to the design goals from Section 3.3.

XX XX

XX XX X

Schemes

Threat

Compromised

C S (pas.) S (act.), Cs T Ps S, T Ps

[12], [4] √

X X − −

[23] √ √

X X X

[25] √ √ √

X X

[3] √ √

X √

X

[13] √ √

X − −

Our work √ √ √

− −

Table 1: Resilience of deduplication schemes.

Acknowledgments

This work was supported in part by the Academy of Finland

project “Cloud Security Services” (283135).

(12)

File Size (KB)

Time Usage (Millisecond)

2¹ 2² 2³ 2⁴ 2⁵ 2⁶ 2⁷ 2⁸ 2⁹ 2¹⁰ 2¹¹

2⁰ 2² 2⁴ 2⁶ 2⁸ 2¹⁰ 2¹² 2¹⁴ 2¹⁶

Our protocol Upload encrypted files Upload plain files

File Size (KB)

Bandwidth Usage (Byte)

2¹⁰ 2¹² 2¹⁴ 2¹⁶ 2¹⁸ 2²⁰ 2²² 2²⁴ 2²⁶ 2²⁸

2⁰ 2² 2⁴ 2⁶ 2⁸ 2¹⁰ 2¹² 2¹⁴ 2¹⁶

Our protocol Upload encrypted files Upload plain files

Figure 12: Time (left) and bandwidth usage (right) VS. file size.

11. REFERENCES

[1] M. Abdalla and D. Pointcheval. Simple

password-based encrypted key exchange protocols. In A. Menezes, editor, Topics in Cryptology - CT-RSA 2005, The Cryptographers’ Track at the RSA Conference 2005, San Francisco, CA, USA, February 14-18, 2005, Proceedings, volume 3376 of Lecture Notes in Computer Science, pages 191–208. Springer, 2005.

[2] B. Barak, R. Canetti, Y. Lindell, R. Pass, and T. Rabin. Secure computation without authentication.

In V. Shoup, editor, Advances in Cryptology - CRYPTO 2005, volume 3621 of Lecture Notes in Computer Science, pages 361–377. Springer Berlin Heidelberg, 2005.

[3] M. Bellare, S. Keelveedhi, and T. Ristenpart. Dupless:

Server-aided encryption for deduplicated storage. In Proceedings of the 22Nd USENIX Conference on Security, SEC’13, pages 179–194, Berkeley, CA, USA, 2013. USENIX Association.

[4] M. Bellare, S. Keelveedhi, and T. Ristenpart.

Message-locked encryption and secure deduplication.

In T. Johansson and P. Nguyen, editors, Advances in Cryptology - EUROCRYPT 2013, volume 7881 of Lecture Notes in Computer Science, pages 296–312.

Springer Berlin Heidelberg, 2013.

[5] M. Bellare, D. Pointcheval, and P. Rogaway.

Authenticated key exchange secure against dictionary attacks. In Preneel [22], pages 139–155.

[6] S. Bellovin and M. Merritt. Encrypted key exchange:

password-based protocols secure against dictionary attacks. In Research in Security and Privacy, 1992.

Proceedings., 1992 IEEE Computer Society Symposium on, pages 72–84, May 1992.

[7] V. Boyko, P. D. MacKenzie, and S. Patel. Provably secure password-authenticated key exchange using diffie-hellman. In Preneel [22], pages 156–171.

[8] L. Breslau, P. Cao, L. Fan, G. Phillips, and S. Shenker. Web caching and zipf-like distributions:

evidence and implications. In INFOCOM ’99.

Eighteenth Annual Joint Conference of the IEEE Computer and Communications Societies. Proceedings.

IEEE, volume 1, pages 126–134 vol.1, Mar 1999.

[9] R. Canetti, D. Dachman-Soled, V. Vaikuntanathan, and H. Wee. Efficient password authenticated key exchange via oblivious transfer. In M. Fischlin,

J. Buchmann, and M. Manulis, editors, Public Key Cryptography - PKC 2012, volume 7293 of Lecture Notes in Computer Science, pages 449–466. Springer Berlin Heidelberg, 2012.

[10] R. Canetti, S. Halevi, J. Katz, Y. Lindell, and P. D.

MacKenzie. Universally composable password-based key exchange. In R. Cramer, editor, Advances in Cryptology - EUROCRYPT 2005, 24th Annual International Conference on the Theory and Applications of Cryptographic Techniques, Aarhus, Denmark, May 22-26, 2005, Proceedings, volume 3494 of Lecture Notes in Computer Science, pages 404–421.

Springer, 2005.

[11] W. Diffie and M. E. Hellman. New directions in cryptography. Information Theory, IEEE Transactions on, 22(6):644–654, 1976.

[12] J. Douceur, A. Adya, W. Bolosky, P. Simon, and M. Theimer. Reclaiming space from duplicate files in a serverless distributed file system. In Distributed Computing Systems, 2002. Proceedings. 22nd International Conference on, pages 617–624, 2002.

[13] Y. Duan. Distributed key generation for encrypted deduplication: Achieving the strongest privacy. In Proceedings of the 6th Edition of the ACM Workshop on Cloud Computing Security, CCSW ’14, pages 57–68, New York, NY, USA, 2014. ACM.

[14] M. Dutch. Understanding data deduplication ratios.

SNIA Data Management Forum, 2008.

http://storage.ctocio.com.cn/imagelist/2009/

222/l3pm284d8r1s.pdf.

[15] T. ElGamal. A public key cryptosystem and a signature scheme based on discrete logarithms. In G. Blakley and D. Chaum, editors, Advances in Cryptology, volume 196 of Lecture Notes in Computer Science, pages 10–18. Springer Berlin Heidelberg, 1985.

[16] O. Goldreich and Y. Lindell. Session-key generation using human passwords only. In J. Kilian, editor, Advances in Cryptology - CRYPTO 2001, volume 2139 of Lecture Notes in Computer Science, pages 408–432.

Springer Berlin Heidelberg, 2001.

[17] D. Harnik, B. Pinkas, and A. Shulman-Peleg. Side channels in cloud services: Deduplication in cloud storage. Security Privacy, IEEE, 8(6):40–47, Nov 2010.

Secure Deduplication of Encrypted Data without Additional Servers