Secure and Efficient Search over Outsourced Databases

by

Weipeng Lin

B.Sc., Xiamen University, China, 2012

Thesis Submitted in Partial Fulfillment of the Requirements for the Degree of

Doctor of Philosophy

in the

School of Computing Science

Faculty of Applied Science

© Weipeng Lin 2019

SIMON FRASER UNIVERSITY

Summer 2019

Copyright in this work rests with the author. Please ensure that any reproduction or re-use is done in accordance with the relevant national copyright legislation.

Approval

Name: Weipeng Lin

Degree: Doctor of Philosophy (Computer Science)

Title: Secure and Efficient Search over Outsourced Databases

Examining Committee:

Chair: Qianping Gu, Professor

Ke Wang, Senior Supervisor, Professor

Andrei Bulatov, Supervisor, Professor

Tianzheng Wang, Internal Examiner, Assistant Professor

Robert H. Deng, External Examiner, Professor, School of Information Systems, Singapore Management University

Abstract

The current trend towards outsourcing data storage and management to the cloud has largely been driven by its perceived simplicity and cost-effectiveness. Encrypting sensitive data before outsourcing preserves data privacy, but makes it difficult to delegate search capabilities to the server. Symmetric searchable encryption (SSE) addresses this issue by allowing an untrusted server to answer queries over encrypted data while protecting the confidentiality of plaintext data and queries. An SSE scheme has three core design goals: a strong security guarantee, efficient search performance, and support for rich types of queries. Usually, an SSE scheme must give up some security or efficiency to support more expressive queries. In this thesis, we study how to strike trade-offs among these design goals.

To motivate strong security notions, we first study the potential risks of constructing SSE schemes based on an ad hoc security notion in order to perform more expressive queries efficiently. We demonstrate several previously unknown security risks of widely used ad hoc secure SSE schemes, showing that an ad hoc security notion leaves room for unpredicted information leakage. To address this problem, our next two contributions focus on constructing practical SSE schemes with a formal security definition.

Ciphertext indistinguishability is a notable formal security notion. SSE schemes achieving this notion provide a strong security guarantee, but often at the cost of efficiency or query expressiveness. In order to support equality conjunction queries with sub-linear search performance, we argue that this notion is not necessary in practice and formalize a relaxed notion of ciphertext indistinguishability. We propose a novel SSE scheme for equality conjunction queries that meets the relaxed notion while minimizing the client and communication cost.

Most SSE schemes achieving ciphertext indistinguishability leak the access pattern (i.e., which documents are searched) and the search pattern (i.e., whether two queries pertain to the same keyword) for the sake of efficiency. However, these patterns could lead to serious security risks. Our final contribution is the first practical dynamic SSE scheme that hides access and search patterns for single keyword search.

Dedication

Acknowledgements

First and foremost I would like to thank my senior supervisor Dr. Ke Wang, for his encouragement, tremendous academic support and all his contributions of time and funding during my Ph.D. study and research. It would not have been possible to conduct this thesis without his precious support. He was and remains my best role model for a mentor, researcher and professor. It has been an honor to be his Ph.D. student. I am also pleased to say thank you to my supervisor Dr. Andrei Bulatov, who provided insightful comments, useful advice and discussions. He was instrumental in helping me with writing this thesis.

Besides my supervisors, I would like to express my sincere gratitude to all other committee members: Dr. Qianping Gu, Dr. Tianzheng Wang and Dr. Robert H. Deng for their patience, enthusiasm and guidance.

I would also like to thank my labmates: Chao Han, Hongwei Liang, Ryan McBride, Zhilin Zhang, Jiaxi Tang, Yue Wang, Lovedeep Gondara, Andy Zhang and Yingwen Ren for the wonderful moments we have spent together. I also thank my friends, colleagues and collaborators for providing friendship and support throughout my life and study.

Finally, I would like to give special thanks to my parents, my wife, as well as her family for their continuous and unconditional encouragement and love. I would not have made it this far without their support. Thank you.

List of Tables

Table 2.1 Summary of notations

Table 3.1 Summary of attack algorithms

Table 3.2 SNMF’s precision (P), recall (R) and runtime for different settings on synthetic data sets

Table 3.3 The frequency distribution of five most frequent documents $P_i$ in $O_{2000}$. This distribution is completely preserved on the indexes $I_i$ and the reconstructed indexes $I_i^*$

Table 4.1 Query time (seconds) vs dimensionality d on US Flight data (20M records)

Table 5.1 Asymptotical complexity comparison. $k$ = key size for the Paillier cryptosystem (typically set as 2048); Exp = exponentiation; $V$ = number of document/keyword pairs in the database; $\beta$ = average padding percentage; $n_w$ = number of documents containing $w$; $n_g = \sum_{i=1}^{\ell} n_{w_i}$; $\ell$ = group size; $G$ = total number of groups; $D = |D|$; $\alpha$ = average number of ones in each non-zero sub-matrix; $V_\alpha$ = number of non-zero sub-matrices

Table 5.2 Disclosure comparison. “full” = full disclosure, “uniform” = disclosure of uniform access pattern (i.e., no disclosure), “block” = disclosure of block-level access pattern, “padded” = disclosure of padded access pattern, and “group” = disclosure of group-level search pattern

Table 5.3 Statistics of data sets. D is the document number; W is the keyword number; V is the number of 1’s in the (W × D) binary matrix M and θ is the percentage of 1’s in M

Table 5.4 The proportion of queries in four query result size bins on Enron and Wiki data sets

Table 6.1 Summary of trade-offs among query expressiveness, security and efficiency design goals studied in this thesis

List of Figures

Figure 1.1 Key entities in a typical SSE system

Figure 3.1 A COA adversary’s knowledge on observed ciphertexts encrypted by MK-FSE

Figure 3.2 SNMF’s precision and recall vs the number of ciphertexts (m = n) on Enron data sets (d = 500, ρ ∈ [5%, 35%])

Figure 4.1 Bucketization: the client searches for relevant bucket ids, the server returns all records in the buckets, and the client filters false positives

Figure 4.2 Proposed scheme: the client encrypts the query predicate, the server searches for a candidate set and filters false positives, and the client decrypts the query result

Figure 4.3 Data encryption time

Figure 4.4 Query time vs data cardinality

Figure 4.5 Query time of CLASS vs data cardinality

Figure 4.6 Query time of CLASS vs query dimensionality q (50M Records)

Figure 4.7 Query time of CLASS vs value class size κ (50M Records)

Figure 4.8 Query time of CLASS vs query distribution α (50M Records)

Figure 4.9 Query time of CLASS vs random noise interval size U − L (50M Records, L = 1000)

Figure 5.1 Example of embedded sum in BES

Figure 5.2 Setup protocol in BES

Figure 5.3 Search protocol in BES

Figure 5.4 Update protocol in BES

Figure 5.5 Example of the index in LES protocols

Figure 5.6 Client decryption time vs query result size range

Figure 5.7 Communication cost normalized by the cost of LES vs query result size range

Chapter 1

Introduction

Outsourcing data storage and management to cloud-based service providers has become the current trend and has received wide attention in capital markets. According to [97], in 2018, cloud storage services attracted over 1.9 billion users who upload tremendous amounts of personal data daily. Moreover, Amazon Web Services generated revenues of around 17.5 billion U.S. dollars in 2017 [96]. Moving data from local devices to a remote cloud server offers great convenience and capital savings to data owners, but raises security and privacy concerns at the same time. Outsourced data are under threat of being accessed, used, or shared by the cloud server and other unauthorized parties for their own benefit without the data owner’s knowledge, resulting in privacy breaches. For example, a collection of private photos of Hollywood actresses stored on Apple iCloud was leaked in 2014 [36]. On the one hand, the sensitive and confidential nature of the data requires that outsourced data be stored in encrypted form to preserve privacy. On the other hand, outsourcing encrypted data precludes the data owner from delegating query processing tasks that depend on plaintext information to the cloud server, and thus induces inefficiency. Clearly, simply downloading all encrypted data for each query and then performing the search locally after decryption is impractical for most applications that deal with a large amount of data.

A promising solution to the above problem is searchable encryption [83], which allows the server to answer search queries directly over encrypted data on a data owner’s behalf while protecting the confidentiality of plaintext data and queries. Searchable encryption schemes can generally be classified into two categories [10, 54]: asymmetric searchable encryption (ASE) and symmetric searchable encryption (SSE). ASE is suitable for data sharing applications, such as retrieving emails via a cloud email server, where the party generating data (e.g., the sender) is different from the one who carries out search queries (e.g., the receiver). SSE refers to the scenario where the data owner is also the one who queries the database, which is the data outsourcing scenario considered here. For this reason, we focus our discussion on SSE schemes in this thesis. It is worth noting that in some data outsourcing applications, the data owner can also grant authorized users the right to query the outsourced data. The problem of how to authorize users’ access rights is out of the scope of this thesis. Please refer to [79, 46] for more details on access control issues.

Figure 1.1: Key entities in a typical SSE system (the trusted client sends encrypted queries, also called trapdoors, to the semi-trusted server, which holds the searchable encrypted index and returns query results)

For simplicity, a typical SSE system consists of two key parties: the client (i.e., the data owner) and the server (i.e., a single cloud service provider), as shown in Figure 1.1. The client is in the trusted perimeter while the server is considered “semi-trusted” or “honest-but-curious”. Specifically, this means that the server will follow protocols honestly but may passively attempt to learn information about plaintext data and queries. Therefore, the server is not fully trusted and is treated as the adversary. The terms “server” and “adversary” are used interchangeably in this thesis. In order to enable the server to perform search queries over encrypted data, the client builds a searchable encrypted index and outsources it to the server. Later on, the client can issue a search query by generating a search token, also called a trapdoor, for the query. With the trapdoor, the server can search over the searchable encrypted index to obtain query results without decryption. An SSE scheme is called dynamic if the searchable encrypted index supports search queries even after updates (additions and deletions) of data. Otherwise, it is called static. The design goals of SSE are listed as follows.

• Security refers to the degree to which an SSE scheme protects the content of the client’s plaintext data and queries against a semi-trusted server. A formal security definition comprises a precise description of the information about plaintext data and queries leaked to the server. Clearly, under a stronger security guarantee, the server learns less plaintext information.

• Efficiency is measured by the communication cost, client-side cost and server-side overhead. Under the assumption that most clients only have limited computational resources and network bandwidth, minimizing the communication and client-side cost is an essential requirement. A sub-linear search performance (i.e., not having to examine all outsourced data for evaluating a trapdoor) on the server-side is a desired property for dealing with a large database.

• Query expressiveness refers to the types of search queries supported by an SSE scheme. To provide a better and more flexible search experience to the client, an SSE scheme that supports more expressive search queries beyond simple single keyword search is preferred.

An SSE scheme can be constructed by applying powerful techniques such as Oblivious RAM [41, 85, 30] and fully homomorphic encryption [38, 90, 15] to support general types of queries without leaking any information about plaintext data and queries to the server. However, solutions based on these techniques suffer from prohibitive communication and computation costs [25, 70, 12, 84]. For this reason, most existing work focuses on practical SSE schemes that strike trade-offs among efficiency, search queries and security notions.

1.1 Research Questions and Contributions

In this thesis, we study three research questions on how to strike trade-offs among the above design goals in order to construct practical SSE schemes. To motivate strong security notions, we first study the risks of constructing SSE schemes based on an ad hoc security notion in order to enable the server to perform more expressive queries efficiently. In order to provide sufficient security, our second and third contributions address the construction of SSE schemes with a formal security definition that precisely specifies the information leakage.

1.1.1 Research Question on Ad hoc Secure SSE

To support more expressive search queries and achieve better efficiency, a growing number of recent SSE schemes adopt an ad hoc security definition (e.g., [98, 92]). Specifically, an ad hoc security definition [81] assumes an adversary with certain background knowledge and a specific way of attacking, and checks whether the proposed SSE scheme resists attacks by such adversaries. However, ad hoc secure SSE schemes provide no guarantee of thwarting attacks launched in other ways. This raises the following research question.

RQ1: Do existing ad hoc secure SSE schemes provide sufficient security?

In Chapter 3, we demonstrate several previously unknown security risks of a widely used ad hoc secure SSE scheme, called asymmetric scalar-product preserving encryption (ASPE) [98], and its variants. Our finding implies that constructing SSE schemes based on an ad hoc security notion cannot provide sufficient security and often ends in disaster. To avoid the unpredicted information leakage caused by adopting an ad hoc security definition, an SSE scheme needs a formal security definition that precisely specifies the information leaked to the server. Usually, formal security refers to provable security [81, 10, 25], which shows that any adversary with specific capabilities who breaks the SSE scheme can be transformed into an efficient algorithm that breaks a well-studied “hardness problem” (e.g., distinguishing the outputs of a pseudo-random function from those of a truly random function). In this way, an SSE scheme is secure if the underlying “hardness problem” cannot be solved in reasonable time. It is often the case that an SSE scheme with a formal security definition is less efficient or supports less expressive queries. For these reasons, our next two contributions focus on constructing practical SSE schemes with a formal security definition via reduced query expressiveness.

1.1.2 Research Question on Outsourcing Equality Conjunction Search

In this contribution, we consider a relational database over a fixed set of attributes, and equality conjunction queries defined by a conjunction of one or more equalities. Equality conjunction search allows the client to selectively retrieve the records matching all equalities in the query. For example, suppose a doctor, Alice, outsources the following relational table of patients to the remote cloud server.

PATIENT (name,sex,age,city,country,disease).

Through the following equality conjunction search, Alice can easily retrieve medical records from the table PATIENT for all patients in Vancouver who are diagnosed with HIV.

SELECT * FROM PATIENT AS P

WHERE P.city = 'Vancouver' AND P.disease = 'HIV';

Answering an equality conjunction query by searching for each equality separately often results in inefficient and insecure query processing [21]. Taking the above query as an example, the server first retrieves all records matching the equality (P.city = Vancouver) and all records matching the equality (P.disease = HIV), and then intersects the two results. This method reveals more information than what the query asks for, because it discloses the matching results of every individual equality. This shortcoming of per-equality search underlines the need for SSE schemes that support equality conjunction search directly. On the one hand, a strong security guarantee such as ciphertext indistinguishability [40, 25, 42] is preferred. On the other hand, sub-linear search performance is an important requirement for dealing with large applications. Unfortunately, achieving ciphertext indistinguishability is at odds with the sub-linear search performance requirement. The efficiency consideration (i.e., sub-linear search performance) requires that the server can effectively prune the records not necessary for evaluating a query, whereas the security consideration (i.e., ciphertext indistinguishability) requires that the server should not be able to distinguish such records from others without searching. This raises the following research question.

RQ2: Can we design a practical SSE scheme that meets both ciphertext indistinguishability and sub-linear search performance for equality conjunction search?

Given the above motivation, in Chapter 4, we argue that ciphertext indistinguishability is not necessary in practice. We formalize a relaxed notion of ciphertext indistinguishability, called class indistinguishability, to enable sub-linear search on the server side. In addition, we propose a novel SSE scheme called CLASS for outsourcing equality conjunction search that meets class indistinguishability. We evaluate the proposed CLASS scheme on two real-world data sets. Extensive experimental results show that CLASS outperforms the state-of-the-art.

1.1.3 Research Question on Outsourcing Single Keyword Search

Most SSE schemes achieving ciphertext indistinguishability are restricted to single keyword search. Traditionally, these schemes focus on ciphertext privacy and consider leaking the search pattern and access pattern a necessary price to pay for efficiency [70, 25, 19]. Search pattern refers to the disclosure of whether two searches pertain to the same search keyword, which reveals the search frequency of a keyword. Access pattern refers to the disclosure of which documents are returned for a search query. Nevertheless, leaking the search pattern and access pattern could lead to serious real-world consequences. Recent work [69, 11] shows that the server can exploit the search pattern and auxiliary knowledge (e.g., the search frequency of known keywords) to reveal the underlying search keyword of a query trapdoor. The potential security risks of access pattern disclosure have been studied in [52, 19, 107]: with the help of auxiliary knowledge, the pattern of documents accessed by a search allows the server to learn the context of encrypted queries and documents. For this reason, in addition to achieving ciphertext indistinguishability, a stronger security guarantee needs to mitigate access and search pattern disclosures as well. It is worth noting that access pattern hiding is more difficult in dynamic SSE, which supports client updates, because each update (i.e., adding or deleting a document-keyword pair) changes the query result (i.e., the access pattern). This raises the following research question.

RQ3: Can we design a practical dynamic SSE scheme that meets ciphertext indistinguishability and mitigates access and search pattern disclosures, by restricting to single keyword search?

In Chapter 5, we further protect access and search patterns by considering single keyword search over a collection of text documents. We propose the first practical dynamic SSE scheme, called LES, for outsourcing single keyword search while mitigating access and search pattern disclosures. We give a formal leakage profile to capture all information leaked to the server. We evaluate the efficiency of the proposed schemes on two real-world data sets.

1.2 Thesis Organization

The remainder of this thesis is organized as follows.

• In Chapter 2, we first present the necessary background knowledge on SSE, including notations and definitions. Then, we discuss general methods to construct SSE.

• In Chapter 3, we study the potential security risks of adopting an ad hoc security definition. In particular, we demonstrate several previously unknown security risks of the widely used ad hoc secure scheme, called ASPE, and its variants. We propose efficient attack algorithms to show that the original security claim about ASPE does not hold. The effectiveness of the proposed attacks is verified on both real and synthetic data sets.

• In Chapter 4, motivated by the discussion above, we formalize class indistinguishability and propose a novel approach, CLASS, to meet both security and efficiency goals for outsourcing equality conjunction search. We present the detailed construction of CLASS and formally prove its security. The efficiency of CLASS is studied on a US census data set and a US flight data set.

• In Chapter 5, we study the problem of hiding both access pattern and search pattern for outsourcing single keyword search. We present a practical dynamic SSE scheme called LES to mitigate both access and search pattern disclosures, including the motivation, the detailed constructions, the security analysis, and a comprehensive empirical evaluation on the Enron emails data set and the Wiki articles data set.

• In Chapter 6, we provide the conclusion of this thesis and suggestions for some future directions that are worth exploring.

Chapter 2

Background

In this chapter, we first present the necessary background knowledge on SSE. Then we discuss general approaches to construct SSE.

2.1 Preliminaries

2.1.1 Notations

In the following, we use $x \leftarrow A$ to denote the output $x$ of an algorithm $A$. We use $x \xleftarrow{\$} X$ to denote that $x$ is sampled uniformly at random from a finite set $X$, and $|X|$ to represent the number of elements in $X$. We write $x \| y$ to denote the concatenation of two strings $x$ and $y$. A look-up table $L$ of capacity $n$ consists of $n$ entries $\langle \text{Label}, \text{Value} \rangle$, where Value can be retrieved for a given Label. Value = L.get(Label) denotes getting Value from the entry with the label Label. The Value can be a simple value or another look-up table. Please refer to Table 2.1 for frequently used notations in this thesis.

Notation    Description
ψ           security parameter
negl(ψ)     negligible function in security parameter ψ
L           leakage profile
K           secret key
D           a database $D = \{P_1, \cdots, P_{|D|}\}$ with |D| data
Q           the search query
upd         update information
M[i, j]     (i, j)-th element of a matrix M
I           searchable encrypted index for a database D
T           query trapdoor for a query Q
utn         update token for the update information upd

Table 2.1: Summary of notations

In addition, we use $F$ to denote pseudo-random functions (PRF). A PRF is a polynomial-time computable function which is indistinguishable from random functions by any polynomial-time adversary. Following the standard security definition [43], we say a PRF $F$ is secure if $\mathrm{Adv}^{\mathrm{PRF}}_{A,F} \le \mathrm{negl}(\psi)$ for all probabilistic polynomial-time algorithms $A$, where $\mathrm{negl}(\psi)$ is a negligible function of the security parameter $\psi$.
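As a concrete illustration of how a PRF is used in the rest of this chapter, the following minimal sketch instantiates F with HMAC-SHA256. This instantiation is an assumption made for illustration only (the thesis does not fix a particular PRF here), and the names `gen`, `prf` and the 32-byte key length are hypothetical.

```python
import hashlib
import hmac
import secrets


def gen(psi_bytes: int = 32) -> bytes:
    """Sample a secret key uniformly at random (key length is an assumed parameter)."""
    return secrets.token_bytes(psi_bytes)


def prf(key: bytes, message: bytes) -> bytes:
    """F(K, x): HMAC-SHA256 used here as a stand-in for a PRF."""
    return hmac.new(key, message, hashlib.sha256).digest()


key = gen()
label_a = prf(key, b"vancouver")
label_b = prf(key, b"vancouver")  # same key and input give the same output
label_c = prf(key, b"toronto")    # a different input gives an unrelated-looking output
```

The determinism of F under a fixed key is what lets a server match a trapdoor against index labels, while the key keeps the labels meaningless to anyone who does not hold it.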

2.1.2 Symmetric Searchable Encryption

Consider a database $D = \{P_1, \cdots, P_{|D|}\}$ of $|D|$ data. Let $id$ be the identifier for the data $P_{id}$. Most existing SSE schemes deal with document databases and relational databases. For a document database, each data $P_i \in D$ refers to a document containing a set of keywords. For a relational database, each data $P_i \in D$ refers to a record with a set of attributes that may occur in a query predicate. Let $Q$ denote a search query. We borrow the definition of SSE from [25] as follows.

Definition 1. (SSE scheme [25]) A symmetric searchable encryption (SSE) scheme is a collection of four polynomial-time algorithms SSE = (Gen, Enc, Trpdr, Search):

• $K \leftarrow \mathrm{Gen}(1^{\psi})$: A probabilistic algorithm that takes as input a security parameter $\psi$ and outputs a secret key $K$.

• $(I, EDB) \leftarrow \mathrm{Enc}(K, D)$: A probabilistic algorithm that takes as input a secret key $K$ and a database $D$. The output contains encrypted data $EDB$ and a searchable encrypted index $I$.

• $T \leftarrow \mathrm{Trpdr}(K, Q)$: An algorithm that takes as input a secret key $K$ and a query $Q$. The output is a query trapdoor $T$.

• $E \leftarrow \mathrm{Search}(I, T)$: A deterministic algorithm that takes as input a searchable encrypted index $I$ and a query trapdoor $T$. The output is a set of data IDs $E$.

The Gen, Enc and Trpdr functions are run by the client, and the Search function is run by the server. An SSE scheme is correct if for all $\psi \in \mathbb{N}$, for all $K$ output by $\mathrm{Gen}(1^{\psi})$, for all $D$, for all $I$ output by $\mathrm{Enc}(K, D)$, for all queries $Q$ and $T$ output by $\mathrm{Trpdr}(K, Q)$, the output of $\mathrm{Search}(I, T)$ is the set of IDs for the data in $D$ satisfying $Q$.

EDB can be generated by a traditional encryption method such as the advanced encryption standard (AES) [27]. We omit EDB (and the corresponding decryption function) from now on because EDB is not involved in query testing Search(I, T). Therefore, in this thesis, ciphertexts refer to the searchable encrypted index I and the query trapdoor T. It is worth noting that the searchable encrypted index I could be either a collection of encrypted “indexes” $I = \{I_1 \leftarrow \mathrm{Enc}(K, P_1), \cdots, I_{|D|} \leftarrow \mathrm{Enc}(K, P_{|D|})\}$, where each $I_i$ ($i \in [1, |D|]$) is generated with the data $P_i$ as input, or a single encrypted index which is built with the plaintext database D as input.
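To make the four algorithms of Definition 1 concrete, here is a minimal toy sketch for single keyword search over a document database, with EDB omitted as in the text. It is not a construction from this thesis and makes no security claim: it assumes HMAC-SHA256 as the PRF, stores data IDs in the clear, and builds a single index over the whole database. All function and variable names are hypothetical.

```python
import hashlib
import hmac
import secrets
from collections import defaultdict


def gen(psi_bytes=32):
    """Gen: sample a secret key K (key length is an assumed parameter)."""
    return secrets.token_bytes(psi_bytes)


def _prf(key, keyword):
    """Assumed PRF instantiation: HMAC-SHA256 over the UTF-8 keyword."""
    return hmac.new(key, keyword.encode("utf-8"), hashlib.sha256).digest()


def enc(key, database):
    """Enc: database maps data ID -> set of keywords; returns the index I."""
    index = defaultdict(set)
    for data_id, keywords in database.items():
        for w in keywords:
            index[_prf(key, w)].add(data_id)
    return dict(index)


def trpdr(key, query):
    """Trpdr: the trapdoor T for a single-keyword query Q."""
    return _prf(key, query)


def search(index, trapdoor):
    """Search: run by the server, returns the set of matching data IDs E."""
    return set(index.get(trapdoor, set()))


D = {1: {"flu", "vancouver"}, 2: {"hiv", "vancouver"}, 3: {"flu", "toronto"}}
K = gen()
I = enc(K, D)
E = search(I, trpdr(K, "vancouver"))
```

In this toy run, E contains the IDs 1 and 2, which is the correctness condition of Definition 1 for the query "vancouver". The server never sees the keyword itself, but it does see the index structure and the returned IDs, which is exactly the kind of leakage the security definitions in Section 2.1.3 set out to describe.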

Recall that an SSE scheme is called dynamic SSE if it supports search queries even after updates (additions and deletions) of data. Let upd denote the update information. Usually, a dynamic SSE scheme is defined by setup, search and update protocols. In the following, we borrow the definition of dynamic SSE from [84].

Definition 2. (Dynamic SSE scheme [84][57]) A dynamic SSE (DSSE) scheme is a tuple of three polynomial-time protocols DSSE = (Setup, Search, Update):

Setup Protocol.

• $K \leftarrow \mathrm{Gen}(1^{\psi})$: A probabilistic algorithm that takes as input a security parameter $\psi$ and outputs a secret key $K$.

• $I \leftarrow \mathrm{Enc}(K, D)$: A probabilistic algorithm that takes as input a secret key $K$ and a database $D$, and outputs a searchable encrypted index $I$.

Search Protocol.

• $T \leftarrow \mathrm{Trpdr}(K, Q)$: An algorithm that takes as input a secret key $K$ and a query $Q$, and outputs a query trapdoor $T$.

• $E \leftarrow \mathrm{Search}(I, T)$: A deterministic algorithm that takes as input an encrypted index $I$ and a query trapdoor $T$, and outputs a set of data IDs $E$ that satisfy $Q$.

• $R \leftarrow \mathrm{Resolve}(K, E)$ (optional): A deterministic algorithm that takes as input a secret key $K$ and a set of encrypted data IDs $E$, and outputs the decrypted results $R$. This function is needed when the returned data IDs $E$ are encrypted.

Update Protocol.

• $utn \leftarrow \mathrm{UpdToken}(K, upd)$: An algorithm that takes as input a secret key $K$ and an update information $upd$, and outputs an update token $utn$.

• $I_{new} \leftarrow \mathrm{Update}(utn, I)$: A deterministic algorithm that takes as input an update token $utn$ and an encrypted index $I$, and outputs the updated encrypted index $I_{new}$.

The Gen, Enc, Trpdr, Resolve and UpdToken functions are run by the client, and the Search and Update functions are run by the server. A DSSE scheme is correct if for all $\psi \in \mathbb{N}$, for all databases $D$, for all secret keys $K$ and encrypted index structures $I$ generated by the setup protocol, and for all sequences of update operations through the update protocol, the search protocol always returns the correct results for the query $Q$.
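The three protocols of Definition 2 can be sketched with a toy dynamic index, shown below. This is an illustrative assumption and not the BES or LES construction of Chapter 5: it uses HMAC-SHA256 as the PRF, returns data IDs in the clear (so Resolve is not needed), and makes no attempt to hide what updates reveal. The update format `(op, data_id, keyword)` and all names are hypothetical.

```python
import hashlib
import hmac
import secrets


def gen(psi_bytes=32):
    """Setup: sample a secret key K."""
    return secrets.token_bytes(psi_bytes)


def _prf(key, keyword):
    """Assumed PRF instantiation: HMAC-SHA256."""
    return hmac.new(key, keyword.encode("utf-8"), hashlib.sha256).digest()


def enc(key, database):
    """Setup: build the index I from a dict of data ID -> set of keywords."""
    index = {}
    for data_id, keywords in database.items():
        for w in keywords:
            index.setdefault(_prf(key, w), set()).add(data_id)
    return index


def trpdr(key, query):
    """Search (client side): trapdoor T for a single keyword."""
    return _prf(key, query)


def search(index, trapdoor):
    """Search (server side): set of matching data IDs E."""
    return set(index.get(trapdoor, set()))


def upd_token(key, upd):
    """Update (client side): upd is a hypothetical triple (op, data_id, keyword)."""
    op, data_id, keyword = upd
    return (op, _prf(key, keyword), data_id)


def update(utn, index):
    """Update (server side): return the updated index I_new, leaving I unchanged."""
    op, label, data_id = utn
    new_index = {lab: set(ids) for lab, ids in index.items()}
    if op == "add":
        new_index.setdefault(label, set()).add(data_id)
    elif op == "del":
        new_index.get(label, set()).discard(data_id)
    else:
        raise ValueError("unknown update operation: " + str(op))
    return new_index


K = gen()
I = enc(K, {1: {"flu"}, 2: {"hiv"}})
I = update(upd_token(K, ("add", 3, "flu")), I)
I = update(upd_token(K, ("del", 1, "flu")), I)
result = search(I, trpdr(K, "flu"))
```

After one addition and one deletion, searching for "flu" returns only ID 3, matching the correctness requirement that search reflects every preceding update. Note that each update token in this toy changes the result set of a later search, which is the reason given in Section 1.1.3 for access pattern hiding being harder in the dynamic setting.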

2.1.3 Security Definitions

The “gold standard” security definition for cryptosystems is semantic security. Semantic security is captured by ciphertext indistinguishability under chosen plaintext attack (IND-CPA) [44]: given the ciphertext of one of two plaintext records chosen by the adversary, the adversary cannot tell which record was encrypted. Intuitively, semantic security aims to make sure that ciphertexts do not leak any information about the plaintext. Note that semantic security was originally introduced for traditional cryptosystems without search capability. In recent years, the notion of ciphertext indistinguishability has been used to formalize SSE security definitions [40, 25].

Before going ahead, we first summarize the potential leakages of an SSE scheme. Based on the SSE definitions introduced above, potential information leaked by SSE schemes can be classified into four categories:

• Access pattern. This disclosure refers to the knowledge learned by the adversary from the query results (i.e., E) such as the size of the result and which data are returned for a query.

• Search pattern. This disclosure refers to the knowledge learned by the adversary to determine whether two trapdoors (i.e., T) contain the same query predicate.

• Ciphertexts information. This disclosure refers to the knowledge learned by the adversary from trapdoors T and the searchable encrypted index I directly.

• Side information. This disclosure refers to all other potential information learned by the adversary beyond the above leakages. For example, if an SSE scheme requires the server to compute the results for each equality in an equality conjunction query and then take the intersection, the server learns the matching result of each equality in the query, which reveals more information than the access pattern and search pattern.
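The first, second and fourth leakage categories above can be made concrete by looking at what a server could record while answering queries. The sketch below is a hypothetical illustration only: it assumes deterministic trapdoors (so equal queries yield equal trapdoors), uses opaque strings such as "t_city" in place of real trapdoors, and the function names are invented for this example.

```python
def search_pattern(trapdoors):
    """m x m matrix whose entry [i][j] is True when trapdoors i and j are identical."""
    return [[ti == tj for tj in trapdoors] for ti in trapdoors]


def access_pattern(results):
    """For each query, the returned data IDs together with the result size."""
    return [(sorted(ids), len(ids)) for ids in results]


def naive_conjunction(index, trapdoors):
    """Evaluate each equality separately, then intersect the per-equality results."""
    per_equality = [set(index.get(t, set())) for t in trapdoors]
    answer = set.intersection(*per_equality) if per_equality else set()
    return answer, per_equality


# Hypothetical server-side index: opaque trapdoor label -> matching record IDs.
index = {"t_city": {1, 2, 5}, "t_disease": {2, 7}}

# Side information: the server sees both per-equality results, not just the answer.
answer, side_info = naive_conjunction(index, ["t_city", "t_disease"])

# Search and access patterns over a sequence of three single-equality queries.
observed = ["t_city", "t_disease", "t_city"]
sp = search_pattern(observed)
ap = access_pattern([index[t] for t in observed])
```

Here the conjunction answer is the single record 2, yet `side_info` exposes five distinct record IDs across the two equalities, which is the extra disclosure described in the last item. The matrix `sp` shows that the first and third queries repeat, and `ap` lists which records each query touched.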

For efficiency consideration, all practical SSE schemes expose some information about plaintext data and queries to the server [19]. A formal SSE security definition should comprise a precise description of the information about plaintext data and queries leaked to a specific adversary. To capture such leakages and the adversary’s power, we introduce two common techniques to define a formal security definition for SSE as follows.

Game-based Definitions

One way to define a formal security definition is formalizing a probabilistic game (or experiment) between an adversary and a challenger. The state-of-the-art static SSE definitions for dealing with single keyword search are formalized in this way as follows.

Indistinguishability Against Adaptive Chosen Keyword Attack (IND1-CKA). Goh [40] introduced the first semantic security notion in the context of SSE for single keyword search. IND1-CKA requires that the adversary cannot deduce any information about the plaintext data from the searchable encrypted index I. Specifically, IND1-CKA is formalized by the following probabilistic game between a challenger C and an adversary A (i.e., the server).

Let $|V|$ be the number of words in the set $V$. The adversary A sends two non-empty sets of words $V_0$ and $V_1$ to the challenger C, where $|V_0 - V_1| \neq 0$, $|V_1 - V_0| \neq 0$, and $|V_0| = |V_1|$. The challenger C picks a bit $b \in \{0, 1\}$ at random, generates the searchable encrypted index $I_b$ for $V_b$, and returns $I_b$ to A. After receiving $I_b$, A can ask C to return query trapdoors for any word, except for keywords in $(V_1 - V_0) \cup (V_0 - V_1)$. This restriction excludes the possibility of trivially distinguishing the sets by the query capability. IND1-CKA requires that the adversary cannot determine the value of $b$ with a probability significantly larger than 1/2 given $I_b$, trapdoors and search capability.

Chang et al. [23] and Goh [40] strengthened IND1-CKA with a stronger definition, called IND2-CKA, under which an adversary cannot even distinguish the indexes of two documents with different sizes.

Non-adaptive/Adaptive Indistinguishability Security (IND-CKA1/IND-CKA2). The main limitation of IND1/2-CKA is that they do not take trapdoor security into account. To protect the trapdoors as well, Curtmola et al. [25] proposed the state-of-the-art security definitions IND-CKA1 (non-adaptive) and IND-CKA2 (adaptive).

Consider two histories of interaction between a client and a server, $H_0$ and $H_1$,

$$H_0 = \{D_0, Q_0 = \{Q_{0,1}, \cdots, Q_{0,m}\}\},$$
$$H_1 = \{D_1, Q_1 = \{Q_{1,1}, \cdots, Q_{1,m}\}\},$$

where for $i \in \{0, 1\}$, $D_i$ is a database and $Q_i$ is a sequence of queries. Let $\tau(H)$ be the trace of a history, which includes the number of documents, the access pattern, and the search pattern. Non-adaptive indistinguishability is formalized by the following probabilistic game.

Definition 3. (Non-adaptive indistinguishability [25]) Let SSE = (Gen, Enc, Trpdr, Search) be an SSE scheme and $\psi \in \mathbb{N}$. Let $A = (A_1, A_2)$ be an adversary. Consider the following probabilistic game $\mathbf{Ind}_{SSE,A}(\psi)$:

1. $K \leftarrow \mathrm{Gen}(1^{\psi})$
2. $(st_A, H_0, H_1) \leftarrow A_1(1^{\psi})$
3. $b \xleftarrow{\$} \{0, 1\}$
4. parse $H_b$ as $(D_b, Q_b)$
5. let $I_b = \mathrm{Enc}(K, D_b)$
6. for $1 \le j \le m$
7. $\quad T_{b,j} \leftarrow \mathrm{Trpdr}(K, Q_{b,j})$
8. let $T_b = (T_{b,1}, \cdots, T_{b,m})$
9. $b' \leftarrow A_2(st_A, I_b, T_b)$
10. if $b' = b$, output 1
11. otherwise output 0

with the restriction that $\tau(H_0) = \tau(H_1)$, and where $st_A$ is a string that captures $A_1$’s state. SSE is secure in the sense of non-adaptive indistinguishability if for all polynomial-size adversaries $A = (A_1, A_2)$,

$$\Pr[\mathbf{Ind}_{SSE,A}(\psi) = 1] \le \frac{1}{2} + \mathrm{negl}(\psi),$$

where $\mathrm{negl}(\psi)$ is a negligible function of $\psi$ and the probability is taken over the choice of $b$ and the coins of Gen and Enc.

In the above probabilistic game, the challenger randomly chooses a bit $b$ from $\{0, 1\}$, generates the encrypted index $I_b$ for $D_b$ and the query trapdoors for $Q_b$, and then releases them to the adversary (Lines 2-8). Finally, A needs to output a guess $b'$ (Line 9). The game outputs 1 (i.e., $\mathbf{Ind}_{SSE,A}(\psi) = 1$) if $b' = b$, and 0 otherwise. In other words, an SSE scheme is secure if no adversary can win the above probabilistic game with a probability significantly greater than that of an adversary who must guess randomly. The adaptive indistinguishability security IND-CKA2 allows the adversary in the probabilistic game to choose each next query $Q_{b,j+1}$ adaptively after accessing $I_b$ and the trapdoors $\{T_{b,1}, \cdots, T_{b,j}\}$ for the previous queries $\{Q_{b,1}, \cdots, Q_{b,j}\}$. The rest of the experiment is identical to IND-CKA1.
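The non-adaptive game can be sketched as a small test harness. The Python sketch below is illustrative only: the toy scheme (`toy_gen`, `toy_enc`, `toy_trpdr`), the two adversary functions, and the use of HMAC-SHA256 as a stand-in PRF are assumptions made for this example and do not come from [25]. The toy index exposes the number of keywords per document, which is not part of the trace as described above, so an adversary who submits two histories with equal traces still wins every run.

```python
import hashlib
import hmac
import secrets

def prf(key, word):
    """HMAC-SHA256 used as a stand-in PRF (an assumption of this sketch)."""
    return hmac.new(key, word.encode(), hashlib.sha256).digest()

# Hypothetical toy scheme: the index stores one PRF tag per keyword per
# document, so it leaks how many keywords each document contains.
def toy_gen():
    return secrets.token_bytes(32)

def toy_enc(key, database):
    return {doc_id: sorted(prf(key, w) for w in words)
            for doc_id, words in database.items()}

def toy_trpdr(key, query):
    return prf(key, query)

def ind_game(gen, enc, trpdr, choose_histories, guess):
    """One run of the non-adaptive game; returns 1 if the adversary wins."""
    key = gen()                                              # line 1
    state, h0, h1 = choose_histories()                       # line 2
    b = secrets.randbelow(2)                                 # line 3
    database, queries = (h0, h1)[b]                          # line 4
    index = enc(key, database)                               # line 5
    trapdoors = [trpdr(key, q) for q in queries]             # lines 6-8
    return 1 if guess(state, index, trapdoors) == b else 0   # lines 9-11

# Adversary A1: two histories with the same trace (one document, which is
# returned by the single query "a" in both histories).
def choose_histories():
    h0 = ({"d1": {"a"}}, ["a"])
    h1 = ({"d1": {"a", "b"}}, ["a"])
    return None, h0, h1

# Adversary A2: read off the leaked keyword count.
def guess(state, index, trapdoors):
    return 0 if len(index["d1"]) == 1 else 1

wins = sum(ind_game(toy_gen, toy_enc, toy_trpdr, choose_histories, guess)
           for _ in range(20))
```

Because the adversary wins all 20 runs rather than about half of them, the toy scheme fails the definition; a secure scheme would have to hide the per-document keyword count or include it in the trace.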

Simulation-based Definitions

Another way to formalize a security definition is to follow the real/ideal simulation paradigm [20, 25]. In particular, a simulation-based definition is parameterized by a collection of leakage functions that precisely specify the information leaked to the server, so that the scheme discloses no information beyond what can be inferred from the leakage functions. Simulation-based definitions are widely used to formalize security for SSE schemes that disclose more information than the access and search patterns. In the following, we borrow the security definition of dynamic SSE from [61][84].

Let $\Pi$ = (Setup, Search, Update) be a dynamic SSE scheme and let $\mathcal{L}$ be a leakage profile defined as

$$\mathcal{L} = (\mathcal{L}_{Setup}, \mathcal{L}_{Search}, \mathcal{L}_{Update}),$$

where $\mathcal{L}_{Setup}$, $\mathcal{L}_{Search}$ and $\mathcal{L}_{Update}$ represent the leakage from running each protocol. The security definition requires that the adversary cannot distinguish the transcripts generated by the real protocols from the transcripts generated by a simulator that takes the leakage $\mathcal{L}$ as input. The real and ideal experiments for a probabilistic polynomial-time adversary A and an efficient simulator S are defined as follows.

$\mathbf{Real}^{\Pi}_{A}(\psi)$: A chooses a database D. The experiment runs $K \leftarrow \mathrm{Gen}(1^{\psi})$ and gives $I \leftarrow \mathrm{Enc}(K, D)$ to A. Then, A adaptively specifies a query Q or an update upd. A receives the query trapdoor $T \leftarrow \mathrm{Trpdr}(K, Q)$, or the update token $utn \leftarrow \mathrm{UtnToken}(K, upd)$. Finally, A outputs the value 1 if it guesses that all received transcripts are generated by the real protocols, or the value 0 otherwise.

$\mathbf{Ideal}^{\Pi}_{A,S}(\psi)$: A chooses a database D. The experiment runs $S(\mathcal{L}_{Setup}(D))$ and gives its output to A. Then, A adaptively specifies a query Q or an update upd. A receives the query trapdoor generated from $S(\mathcal{L}_{Search}(Q))$, or the update token generated from $S(\mathcal{L}_{Update}(upd))$. Finally, A outputs the value 1 if it guesses that all received transcripts are generated by the real protocols, or the value 0 otherwise.


Definition 4. A dynamic SSE scheme $\Pi$ = (Setup, Search, Update) is $\mathcal{L}$-adaptively-secure if for all probabilistic polynomial-time adversaries A, there exists an efficient simulator S such that

$$|\Pr[\mathbf{Real}^{\Pi}_{A}(\psi) = 1] - \Pr[\mathbf{Ideal}^{\Pi}_{A,S}(\psi) = 1]| \le \mathrm{negl}(\psi),$$

where $\mathrm{negl}(\psi)$ is a negligible function of $\psi$.

2.2 Related Work

In this section, we discuss general approaches to constructing SSE for data outsourcing applications. The work related to each of the three proposed research questions is discussed in the corresponding chapter.

2.2.1 Symmetric Key Encryption

Most SSE schemes are constructed based on symmetric key encryption. The term “symmetric” means that the same secret key K is used to generate both the searchable encrypted index and the query trapdoors. Normally, the secret key should be kept by the data owner only. The most widely used symmetric key primitive is the pseudo-random function (PRF) [43], because its output can be computed efficiently and is indistinguishable from that of a truly random function. While SSE schemes based on symmetric key primitives are efficient, most of them are limited to single keyword search.

The first practical SSE scheme was proposed by Song et al. [83] (SWP) for single keyword search on an encrypted document database. The complexity of SWP is linear in the total number of documents in the database. The basic idea of SWP is to leverage a PRF with the secret key $K$ (denoted as $F_K$) and the XOR operation, denoted by $\oplus$, to generate the searchable encrypted index $\mathcal{I}_i = \{I_1, \cdots, I_m\}$ for a document $P_i$ containing $m$ keywords. In particular, each keyword $w_u$ ($u \in [1, m]$) in the document $P_i$ is encoded as $I_u = \langle Info \rangle \oplus F_K(w_u)$, where $\langle Info \rangle$ indicates whether the document contains the keyword or not. The query trapdoor $T = F_K(w)$ for a keyword $w$ is generated by the same PRF $F_K$, which enables the server to extract $\langle Info \rangle$ by testing against each document’s searchable encrypted index $\mathcal{I}$ through the XOR operation ($I_u \oplus T$) for each $I_u$ in $\mathcal{I}$. Because SWP is the first practical searchable encryption scheme, it adopts IND-CPA as its security definition rather than the SSE-specific security definitions, which were proposed afterward. For this reason, SWP offers a weak security guarantee. For example, SWP leaks the position of the matching word in the document, which makes it vulnerable to statistical analysis.
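The masking idea described above can be illustrated with a toy Python sketch. It is not the full SWP construction: HMAC-SHA256 as $F_K$, the layout of $\langle Info \rangle$ as a random nonce followed by a publicly checkable hash tag, and all function names are assumptions made for this example.

```python
import hashlib
import hmac
import secrets

def prf(key, word):
    """F_K(w): HMAC-SHA256 as a stand-in PRF (32-byte output)."""
    return hmac.new(key, word.encode(), hashlib.sha256).digest()

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def make_info():
    """<Info> = S || SHA-256(S)[:16]: a random nonce with a checkable tag."""
    s = secrets.token_bytes(16)
    return s + hashlib.sha256(s).digest()[:16]

def is_info(block):
    """True if the 32-byte block has the <Info> structure."""
    return hashlib.sha256(block[:16]).digest()[:16] == block[16:]

def build_index(key, words):
    """I_u = <Info> XOR F_K(w_u) for each keyword position u."""
    return [xor(make_info(), prf(key, w)) for w in words]

def trapdoor(key, word):
    """T = F_K(w)."""
    return prf(key, word)

def search(index, t):
    """Server side: positions u where I_u XOR T unmasks to a valid <Info>."""
    return [u for u, block in enumerate(index) if is_info(xor(block, t))]

key = secrets.token_bytes(32)
index = build_index(key, ["cloud", "search", "cloud", "index"])
hits = search(index, trapdoor(key, "cloud"))      # matching positions
misses = search(index, trapdoor(key, "privacy"))  # keyword not present
```

The list `hits` contains the exact positions of the queried keyword, which makes the position leakage mentioned above concrete.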

Goh [40] proposed the first IND1-CKA secure SSE scheme, called Secure Index, for outsourcing single keyword search. Each keyword $w$ in the document $P_i$ is inserted into a Bloom filter [16] $B_i$ using multiple PRFs. A Bloom filter is a compact structure that supports inserting a word $w$ into a set and testing whether a word $w$ is included in the set. A query is tested against each document in the database using the Bloom filter inclusion test. Therefore, the complexity of Secure Index is linear in the total number of documents in the database. Another drawback of Secure Index is that a Bloom filter may introduce false positives, which means that a query trapdoor may return some documents that do not match the query.
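A per-document keyed Bloom filter of this kind can be sketched in a few lines of Python. This is a simplified illustration rather than Goh's exact construction: HMAC-SHA256 as the PRF, the filter size `M_BITS`, the demo keys, and the way trapdoor values are bound to a document identifier are assumptions of this sketch.

```python
import hashlib
import hmac

M_BITS = 1 << 16  # Bloom filter size in bits (illustrative)

def prf(key, msg):
    return hmac.new(key, msg, hashlib.sha256).digest()

def trapdoor(keys, word):
    """One PRF output per secret key: (F_k1(w), ..., F_kr(w))."""
    return [prf(k, word.encode()) for k in keys]

def positions(td, doc_id):
    """Bind each trapdoor value to a document so that the same word sets
    different bit positions in different documents."""
    return [int.from_bytes(prf(x, doc_id.encode()), "big") % M_BITS
            for x in td]

def build_index(keys, doc_id, words):
    """Bloom filter B_i, stored as the set of positions whose bit is 1."""
    bits = set()
    for w in words:
        bits.update(positions(trapdoor(keys, w), doc_id))
    return bits

def contains(index, doc_id, td):
    """Inclusion test: every position derived from the trapdoor is set."""
    return all(p in index for p in positions(td, doc_id))

keys = [b"k1-demo", b"k2-demo", b"k3-demo"]   # hypothetical secret keys
docs = {"doc1": ["secure", "index"], "doc2": ["bloom", "filter"]}
indexes = {d: build_index(keys, d, ws) for d, ws in docs.items()}

td = trapdoor(keys, "bloom")
matches = [d for d in docs if contains(indexes[d], d, td)]
```

The server has to run `contains` once per document, which is the linear search cost noted above. With a much smaller `M_BITS`, the same test would start returning false positives.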

The first IND-CKA1/2 secure SSE scheme for outsourcing single keyword search was proposed by Curtmola et al. [25] (CGK). CGK leverages the inverted index² to materialize the results of all single-keyword queries and thereby achieves sub-linear search performance. The list for each distinct keyword in the inverted index is implemented as a linked list L. Each node in the linked list L is then encrypted by a PRF with the secret key $K$ and stored in a look-up table. The header node of the list for the keyword $w$ is encrypted as $F_K(w)$. The trapdoor for the keyword $w$ is also generated as $F_K(w)$; thus, the server can find the header of the linked list and follow the pointer to the next node until the stop indicator shows up. The inverted index used in the CGK scheme provides the optimal search complexity for single keyword search. However, the encrypted look-up table has a fixed size and does not support update operations. To support efficient updates by the client, several dynamic SSE schemes have been proposed [57, 91, 20, 71] for single keyword search.

The Oblivious Cross-Tags Protocol (OXT) [21] is the first sub-linear SSE scheme supporting conjunctive keyword queries. OXT adopts IND-CKA2 and uses a leakage profile to capture the additional information disclosed by query processing. However, as will be discussed in Section 4.2, it is difficult to capture and model low-level disclosures leaked by a sub-linear search process in the security definition. The recent work [55] has the same problem as OXT.

2.2.2 Asymmetric Key Encryption

In general, SSE schemes can be constructed using asymmetric key encryption as well. The term “asymmetric” means that different keys are used to generate the searchable encrypted index and the query trapdoors. In particular, the data owner generates a key pair $(K_{pub}, K_{priv})$, where the public key $K_{pub}$ is publicly available for generating the searchable encrypted index, and the private key $K_{priv}$ is known only to a single user for query trapdoor generation. For this reason, asymmetric key encryption (also called public key encryption) is more suitable for data sharing applications. When asymmetric key encryption is applied to construct SSE schemes for data outsourcing applications, both the public key and the private key are kept secret. Most public key primitives are based on computational hardness problems, such as the Discrete Logarithm Problem (DLP)³ and the Decisional Diffie-Hellman (DDH) problem⁴. Readers interested in these hardness problems are referred to [7, 9, 73]. Compared to symmetric key encryption schemes, SSE schemes based on asymmetric key encryption provide more functionality, such as support for conjunctive keyword searches, at the cost of higher computation overhead.

² Inverted index. https://en.wikipedia.org/wiki/Inverted_index

³ Discrete logarithm. https://en.wikipedia.org/wiki/Discrete_logarithm

⁴ Decisional Diffie-Hellman. https://en.wikipedia.org/wiki/Decisional_Diffie-Hellman_assumption

The solutions [45, 4] pioneered the construction of conjunctive keyword search. The work of Golle et al. [45] is the first on conjunctive keyword search over specified keyword fields; it leverages a PRF and a public key primitive to implement the searchable encryption functions. Ballard et al. [4] improved the efficiency by using secret sharing. However, both schemes incur prohibitive communication overhead because the query trapdoor size is linear in the total number of documents. Another drawback of these two schemes is that the number of keywords in the query is disclosed to the adversary. Byun et al. [17] proposed an SSE scheme for conjunctive keyword queries using bilinear maps [9] to reduce the expensive communication overhead of [45], at the cost of higher computation cost. Wang et al. [93] presented the first keyword-field-free scheme for conjunctive keyword searches on encrypted data. However, the approach proposed in [93] discloses single keyword information to the adversary. Conjunctive keyword search has also been studied based on Hidden Vector Encryption (HVE) [65], but this approach suffers from prohibitive computation and communication costs.

2.2.3 Other Approaches

There are several other approaches related to SSE that are not suitable for data outsourcing applications, either because they are too impractical to deploy or because they fall outside the scope of data outsourcing.

Oblivious RAM (ORAM)

ORAM [41, 85, 30] is a cryptographic primitive for accessing a block of memory by its address without leaking any information. The access pattern is hidden by continuously shuffling and re-encrypting the data whenever it is accessed. However, existing ORAM-based SSE schemes [70, 37] suffer from prohibitive communication and computational overhead [70].

Trusted Hardware Based Methods

Searchable encryption can also rely on hardware approaches to achieve data security, such as [3, 1]. Recently, [58] proposed a new direction that aims for efficient encrypted databases and query processing with provable security properties, but its security definition requires cloud-side trusted, tamper-resistant hardware. Note that this direction is different from the searchable encryption schemes implemented by algorithmic methods, which are the ones discussed in this thesis.

Differential Privacy

Differential privacy [31] addresses the data publishing scenario where the adversary is also the data recipient and is therefore allowed to access plaintext data in some anonymized form. This setting is not acceptable in cloud-based computing, where disclosing any plaintext information to the cloud is a security breach. Therefore, differential privacy is not suitable for data outsourcing applications.


Chapter 3

Security Risks of Ad hoc Secure SSE

To achieve better efficiency and support more expressive queries, many recent SSE schemes have been proposed based on an ad hoc security notion; that is, they assume certain background knowledge of the adversary and a specific way of attacking, and then check whether the proposed scheme resists those attacks. While these solutions meet both the efficiency and query expressiveness design goals, the ad hoc nature of such a security notion leaves room for unpredicted information leakage. This immediately raises the following question: Do existing ad hoc secure SSE schemes provide sufficient security? In this chapter, we answer this question by demonstrating previously unknown security risks of widely used ad hoc secure SSE schemes. Our findings imply that ad hoc secure SSE schemes cannot provide sufficient security.

3.1 Motivations and Contributions

Asymmetric scalar-product preserving encryption (ASPE) [98] is one of the most popular ad hoc secure SSE schemes. ASPE supports the k-nearest neighbour query, which allows the server to efficiently find the k data points in the outsourced database that are nearest to a given query point Q. The popularity and convenience of ASPE come from its inner product preserving property, which makes it particularly suitable for designing sub-linear search algorithms for distance and similarity-based queries. For this reason, a lot of recent work has adopted ASPE to support a variety of important similarity-based queries efficiently, including range queries [94], ranking queries [18][87][103][64], fuzzy multi-keyword match queries [92][35], privacy preserving computation of inner products [67], secure image search [105] and privacy-preserving biometric identification queries [104][95].

On the one hand, ASPE and its variants enable the server to perform expressive search queries efficiently, which meets both the query expressiveness and efficiency design goals. On the other hand, the security property of ASPE has not been carefully studied. In particular, ASPE claims that it is resilient to an adversary who has the ability to acquire some plaintext-ciphertext pairs for data in the outsourced database, provided that the adversary cannot infer the secret key used for encrypting plaintext data and queries. However, this security claim leaves room for unpredicted information leakage. For example, instead of inferring the secret key, the adversary can launch attacks in other ways to compromise the privacy of outsourced data and queries.

It is worth noting that ASPE has been widely used in previous work for performing various queries, and most of this work takes ASPE’s security claim for granted without scrutinizing it. This motivates us to revisit these widely used schemes in order to demonstrate the potential risks of ad hoc secure SSE schemes. To better explain our contributions, we borrow from [29] the following adversary models, which are based on the additional information, besides ciphertexts (i.e., the searchable encrypted index and query trapdoors), available to the adversary (i.e., the server).

• Ciphertext only attack (COA): In this model, the adversary has only access to the ciphertexts. This is the weakest adversary model.

• Known plaintext attack (KPA): In addition to the ciphertexts, the adversary also has the ability to acquire (or observe) plaintext-ciphertext pairs for some of the data in the database. For example, if the adversary knows that someone joins a club and observes a new encrypted record afterward, the adversary can associate this person’s plaintext record with the newly observed ciphertext.

• Chosen plaintext attack (CPA): A more powerful adversary has the ability to obtain the pairs of plaintext-ciphertext for some data of the adversary’s choice. For example, the adversary may be able to influence the client to encrypt certain plaintext data.

In the above list, the adversary is getting more powerful because of having access to more information. Therefore, a security risk under a less powerful adversary is also a security risk under a more powerful adversary.

Contributions

The main contributions of this work are stated as follows.

• First of all, in Section 3.2, we show that ASPE is subject to the complete disclosure of plaintext information to a KPA adversary. In particular, we propose an efficient algorithm, called the Linear Equation Program (LEP) attack, to reconstruct the plaintext of the entire database and all processed queries once the adversary acquires plaintext-ciphertext pairs for a small number of data. Furthermore, we point out that the previous attack in [101], which assumes a CPA adversary, is not effective and does not imply our finding.

• Then, in Section 3.3, we demonstrate the vulnerability of a variant of ASPE, called MKFSE [92], to a COA adversary, which is the weakest adversary. In particular, we propose an efficient algorithm, the Sparse Non-negative Matrix Factorization (SNMF) attack, which enables a COA adversary that has access only to ciphertexts to reconstruct the plaintext encrypted by MKFSE. Unlike the LEP attack, the SNMF attack may not reconstruct all plaintext. The SNMF attack is strong in that the adversary does not need any information beyond ciphertexts.

• Finally, in Section 3.4, we present a comprehensive experimental study of the proposed SNMF attack. Specifically, we conduct an empirical study on both synthetic and real world data sets to examine what portion of the plaintext can be reconstructed by the SNMF attack. Our results show that the SNMF attack is able to reconstruct plaintexts with high accuracy.

To our knowledge, this is the first work to systematically demonstrate the vulnerability of the widely used ad hoc secure SSE scheme, ASPE, and its variants. Our study implies that constructing SSE schemes based on an ad hoc security notion often leads to serious vulnerabilities and thus cannot provide sufficient security.

3.2 Known Plaintext Attack

In this section, we first revisit the ASPE construction and then present our efficient attack to show that ASPE suffers from a complete disclosure under a KPA adversary.

3.2.1 Revisit ASPE Construction

In a nutshell, the main idea of ASPE is to employ a secret matrix multiplication to diffuse the patterns in plaintexts while preserving the inner product between plaintext data and queries. Consider a data point $P_i$ and a query $Q_j$, represented by $d$-dimensional column vectors. ASPE first encodes $P_i$ and $Q_j$ into an encoded data $I_i$ and an encoded query $T_j$, respectively, as

$$I_i = (P_i^T, -0.5\|P_i\|^2)^T, \qquad T_j = r_j (Q_j^T, 1)^T, \tag{3.1}$$

where $v^T$ denotes the transpose of the vector $v$, $\|P_i\|$ is the Euclidean length of $P_i$, and $r_j$ is a value randomly chosen for each $Q_j$. Since $I_i$ and $P_i$ can be derived from each other and $Q_j$ can be derived from $T_j$, all of $I_i$, $T_j$, $P_i$, $Q_j$ are considered sensitive. To generate the encrypted index $\mathcal{I}_i$ for an encoded data $I_i$ and the trapdoor $\mathcal{T}_j$ for an encoded query $T_j$, ASPE has two schemes: Scheme 1 (i.e., the basic one) and Scheme 2 (i.e., the enhanced one). We give a brief introduction to these two schemes here. For more details, please refer to [98].

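Scheme 1. This scheme uses a $(d + 1) \times (d + 1)$ invertible matrix $M$ (i.e., $MM^{-1}$ is the identity matrix) as the secret key to encrypt $I_i$ and $T_j$ as

$$\mathcal{I}_i = M^T I_i, \qquad \mathcal{T}_j = M^{-1} T_j. \tag{3.2}$$

The inner product between the ciphertexts $\mathcal{I}_i$ and $\mathcal{T}_j$ preserves the inner product between $I_i$ and $T_j$, i.e.,

$$\mathcal{I}_i^T \mathcal{T}_j = (M^T I_i)^T M^{-1} T_j = I_i^T M M^{-1} T_j = I_i^T T_j. \tag{3.3}$$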

Using this equality, Theorem 3 in [98] shows that a data point $P_1$ is nearer to a query point $Q_j$ than another data point $P_2$ if and only if $(\mathcal{I}_1^T - \mathcal{I}_2^T)\mathcal{T}_j > 0$, where $\mathcal{I}_i$ and $\mathcal{T}_j$ denote the ciphertexts of $P_i$ and $Q_j$. Thus, the server can rank the points $P_i$ by their distance to $Q_j$ using only the ciphertexts $\mathcal{I}_i$ and $\mathcal{T}_j$. On the other hand, the asymmetric encryption does not preserve the inner product between two encrypted indexes or between two trapdoors, so the server cannot compare the distance between two data points or two queries, a property that is considered sensitive. Theorem 4 in [98] shows that Scheme 1 does not resist a KPA adversary: the adversary can find the $(d + 1) \times (d + 1)$ secret matrix $M$ using Equation (3.2) if he can acquire $(P_i, \mathcal{I}_i)$ pairs for $(d + 1)$ data $P_i$ whose encoded vectors are linearly independent.
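The inner-product preservation of Scheme 1 and the resulting distance comparison can be checked numerically. The following Python sketch uses exact rational arithmetic; the matrix `M`, the sample points, and all function names are hypothetical values chosen for illustration, and the trapdoor $M^{-1}T_j$ is obtained by solving a linear system instead of inverting `M` explicitly.

```python
from fractions import Fraction

def solve(A, b):
    """Solve A x = b exactly by Gauss-Jordan elimination (A square, invertible)."""
    n = len(A)
    aug = [[Fraction(v) for v in row] + [Fraction(bv)] for row, bv in zip(A, b)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if aug[r][col] != 0), None)
        if pivot is None:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pv = aug[col][col]
        aug[col] = [v / pv for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n] for row in aug]

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def encode_point(p):
    """Encoded data (P^T, -0.5 ||P||^2)^T, as in Equation (3.1)."""
    return [Fraction(x) for x in p] + [Fraction(-1, 2) * dot(p, p)]

def encode_query(q, r):
    """Encoded query r (Q^T, 1)^T, as in Equation (3.1)."""
    return [Fraction(r) * x for x in q] + [Fraction(r)]

def encrypt_index(M, I):
    """Encrypted index M^T I, as in Equation (3.2)."""
    n = len(M)
    return [sum(M[k][i] * I[k] for k in range(n)) for i in range(n)]

def encrypt_trapdoor(M, T):
    """Trapdoor M^{-1} T, computed by solving M x = T."""
    return solve(M, T)

M = [[2, 1, 0], [1, 3, 1], [0, 1, 4]]        # secret key for d = 2 (det = 18)
i1, i2 = encode_point([1, 1]), encode_point([5, 5])
t = encode_query([2, 1], 3)                  # query point (2, 1), r = 3

enc_i1, enc_i2 = encrypt_index(M, i1), encrypt_index(M, i2)
enc_t = encrypt_trapdoor(M, t)

preserved = dot(enc_i1, enc_t) == dot(i1, t)
p1_is_nearer = dot([a - b for a, b in zip(enc_i1, enc_i2)], enc_t) > 0
```

Here the point (1, 1) is at squared distance 1 from the query (2, 1) and the point (5, 5) at squared distance 25, so the ciphertext comparison agrees with the plaintext distances.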

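Scheme 2. To fix the above problem, Scheme 1 is enhanced by employing the following two tricks. The first trick expands the $(d + 1)$-dimensional vectors $I_i$ and $T_j$ to $d'$-dimensional ($d' > d + 1$) vectors $\hat{I}_i$ and $\hat{T}_j$, so that $\hat{I}_i^T \hat{T}_j = I_i^T T_j$. The second trick uses a secret splitting configuration to randomly split $\hat{I}_i$ and $\hat{T}_j$ into $(\hat{I}_{ia}, \hat{I}_{ib})$ and $(\hat{T}_{ja}, \hat{T}_{jb})$, respectively, in such a way that

$$\hat{I}_i^T \hat{T}_j = \hat{I}_{ia}^T \hat{T}_{ja} + \hat{I}_{ib}^T \hat{T}_{jb}, \tag{3.4}$$

where each of $\hat{I}_{ia}$, $\hat{I}_{ib}$, $\hat{T}_{ja}$ and $\hat{T}_{jb}$ is a $d'$-dimensional column vector. These vectors are then encrypted by two $(d' \times d')$ invertible matrices $M_1$ and $M_2$ as

$$\mathcal{I}_{ia} = M_1^T \hat{I}_{ia}, \quad \mathcal{I}_{ib} = M_2^T \hat{I}_{ib}, \quad \mathcal{T}_{ja} = M_1^{-1} \hat{T}_{ja}, \quad \mathcal{T}_{jb} = M_2^{-1} \hat{T}_{jb}. \tag{3.5}$$

Let $\mathcal{I}_i$ denote the ciphertext $(\mathcal{I}_{ia}, \mathcal{I}_{ib})$ and $\mathcal{T}_j$ denote the ciphertext $(\mathcal{T}_{ja}, \mathcal{T}_{jb})$. We have

$$\mathcal{I}_i^T \mathcal{T}_j = \mathcal{I}_{ia}^T \mathcal{T}_{ja} + \mathcal{I}_{ib}^T \mathcal{T}_{jb} = \hat{I}_{ia}^T \hat{T}_{ja} + \hat{I}_{ib}^T \hat{T}_{jb} = I_i^T T_j \; (= r_j (P_i^T Q_j - 0.5\|P_i\|^2)), \tag{3.6}$$

which preserves the inner product between $I_i$ and $T_j$ in the same way as Scheme 1. The authors of [98] claim that Scheme 2 is resilient to a KPA adversary, as quoted below.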

Theorem 6 in [98] claims: “Scheme 2 is resilient to a level-3 attack if the attacker cannot derive the splitting configuration.”

The level-3 attack mentioned in the above security claim is exactly the KPA adversary model. Below, we consider only Scheme 2 (because it enhances the security of Scheme 1) and refer to it as ASPE. The rest of this chapter focuses on the security risks of ASPE and its variants in the literature.
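The effect of the splitting trick on the inner product can be illustrated with a short Python sketch. The concrete splitting rule used here (a configuration bit of 1 splits the data entry into two random shares and copies the query entry, a bit of 0 does the opposite) and the zero padding of the data vector are assumptions of this sketch; the text above only requires that the identity in Equation (3.4) holds. The matrix encryption step is omitted.

```python
import random

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def split_data(i_hat, config, rng):
    """Split an expanded data vector according to the secret configuration."""
    ia, ib = [], []
    for x, bit in zip(i_hat, config):
        if bit == 1:                 # random additive shares of the entry
            a = rng.randint(-100, 100)
            ia.append(a)
            ib.append(x - a)
        else:                        # entry copied to both halves
            ia.append(x)
            ib.append(x)
    return ia, ib

def split_query(t_hat, config, rng):
    """Split an expanded query vector with the complementary rule."""
    ta, tb = [], []
    for y, bit in zip(t_hat, config):
        if bit == 1:                 # entry copied to both halves
            ta.append(y)
            tb.append(y)
        else:                        # random additive shares of the entry
            a = rng.randint(-100, 100)
            ta.append(a)
            tb.append(y - a)
    return ta, tb

rng = random.Random(2019)
config = [1, 0, 0, 1]                # hypothetical secret splitting configuration
i_hat = [1, 1, -1, 0]                # encoded data padded with a zero entry
t_hat = [6, 3, 3, 9]                 # encoded query padded with an arbitrary entry

ia, ib = split_data(i_hat, config, rng)
ta, tb = split_query(t_hat, config, rng)
split_sum = dot(ia, ta) + dot(ib, tb)
```

For every position, one side contributes two shares that sum to the original entry while the other side contributes the same value twice, so `split_sum` equals `dot(i_hat, t_hat)` regardless of the random shares.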

3.2.2 Proposed Attack

In this section, we present an efficient algorithm for a KPA adversary to reconstruct the plaintext of the entire database and all processed queries.


Main Idea

The proof of the above ASPE security claim (i.e., Theorem 6 in [98]) is based on the fact that it is infeasible to find $M_1$ and $M_2$ from Equation (3.5). The reason is that $(\hat{I}_{ia}, \hat{I}_{ib})$ and $(\hat{T}_{ja}, \hat{T}_{jb})$ are generated based on the secret splitting configuration, which is kept from the adversary. The main problem of the proof is that it assumes a specific way of attack (i.e., trying to find $M_1$ and $M_2$). The KPA adversary, however, can launch attacks in other ways, as we demonstrate below.

From Equation (3.6), we observe that the inner product between the ciphertexts $\mathcal{I}_i$ and $\mathcal{T}_j$ preserves the inner product between the encoded data $I_i$ and the encoded query $T_j$. In other words, the tricks introduced in the enhanced scheme cannot prevent a KPA adversary from exploiting the acquired plaintext-ciphertext pairs and the observed query trapdoors to infer the unknown encoded query $T_j$ in Equation (3.3). Based on this basic idea, we propose the following linear equation program attack, which enables a KPA adversary to reconstruct the plaintext of the entire database and all processed queries by solving linear equation systems.

Linear Equation Program (LEP) Attack

Under the KPA adversary model, we assume that the adversary acquires plaintext-ciphertext pairs $\{(P_1, \mathcal{I}_1), \cdots, (P_{d+1}, \mathcal{I}_{d+1})\}$, and therefore the pairs of encoded data and ciphertexts $\{(I_1, \mathcal{I}_1), \cdots, (I_{d+1}, \mathcal{I}_{d+1})\}$, for $(d + 1)$ data $O = \{P_1, \cdots, P_{d+1}\}$ in $D$, where $\{I_1, \cdots, I_{d+1}\}$ are linearly independent. We further assume that the adversary has access to the ciphertexts of all processed queries, denoted by $\mathcal{T}$. Moreover, we assume that $\mathcal{T}$ contains $(d + 1)$ query trapdoors $\{\mathcal{T}_1, \cdots, \mathcal{T}_{d+1}\}$ whose corresponding encoded queries $\{T_1, \cdots, T_{d+1}\}$ are linearly independent; this assumption is reasonable because the cloud is expected to process many queries over time.

Algorithm 1 describes the detailed steps of our LEP attack, in which the above KPA adversary exploits his knowledge of the acquired plaintext-ciphertext pairs to reconstruct the encoded query $T_j$ of each processed query in $\mathcal{T}$ and all previously unknown data $P_i \in (D - O)$.

Step 1. The purpose of the first step in LEP is to uncover $(d + 1)$ linearly independent encoded queries $\{T_1, \cdots, T_{d+1}\}$, which can then be used to infer the encoded data of unknown plaintext data. In particular, the adversary can find the encoded query $T_j$ for any observed query trapdoor $\mathcal{T}_j \in \mathcal{T}$ by solving Equation (3.7), where $T_j$ is a $(d + 1)$-dimensional column vector with $(d + 1)$ unknown variables, whereas the encoded data $I_i$ and the ciphertexts $\mathcal{I}_i$ and $\mathcal{T}_j$ in the equation system are known. Equation (3.7) has a unique solution for $T_j$ because $\{I_1, \cdots, I_{d+1}\}$ are linearly independent and thus form a $(d + 1) \times (d + 1)$ invertible matrix. The adversary repeats this computation for each query trapdoor $\mathcal{T}_j \in \mathcal{T}$ until finding $(d + 1)$ encoded queries $\{T_1, \cdots, T_{d+1}\}$ that are linearly independent. At the end of this step, the adversary obtains $(d + 1)$ plaintext-ciphertext pairs $\{(T_1, \mathcal{T}_1), \cdots, (T_{d+1}, \mathcal{T}_{d+1})\}$.

Step 2. In this step, the adversary exploits the $(d + 1)$ linearly independent encoded queries $\{T_1, \cdots, T_{d+1}\}$ obtained from the first step to recover the encoded data $I_i$ of any unknown plaintext data $P_i \in (D - O)$. Specifically, the adversary solves Equation (3.8) for the $(d + 1)$ unknown variables of $I_i$, for every remaining data $P_i \in (D - O)$. Since $\{T_1, \cdots, T_{d+1}\}$ are linearly independent, as mentioned above, Equation (3.8) has a unique solution for $I_i$.

Algorithm 1: Linear Equation Program (LEP)

Adversary model: a KPA adversary

Require: (1) Acquired pairs of encoded data and ciphertexts $\{(I_1, \mathcal{I}_1), \cdots, (I_{d+1}, \mathcal{I}_{d+1})\}$ for $(d + 1)$ data $O = \{P_1, \cdots, P_{d+1}\} \subseteq D$, where $\{I_1, \cdots, I_{d+1}\}$ are linearly independent; (2) the set of query trapdoors $\mathcal{T}$ for all processed queries; (3) the ciphertexts $\mathcal{I}_i$ for data $P_i \in (D - O)$.

Procedure:

Step 1:
    while $(d + 1)$ linearly independent encoded queries $\{T_1, \cdots, T_{d+1}\}$ have not been found do
        get a query trapdoor $\mathcal{T}_j \in \mathcal{T}$ and set $\mathcal{T} = \mathcal{T} \setminus \{\mathcal{T}_j\}$
        solve the following linear equation system to get the encoded query $T_j$:

$$\mathcal{I}_1^T \mathcal{T}_j = I_1^T T_j, \quad \ldots, \quad \mathcal{I}_{d+1}^T \mathcal{T}_j = I_{d+1}^T T_j \tag{3.7}$$

    return $(d + 1)$ plaintext-ciphertext pairs $\{(T_1, \mathcal{T}_1), \cdots, (T_{d+1}, \mathcal{T}_{d+1})\}$

Step 2:
    for each $\mathcal{I}_i$ of an unknown data $P_i \in (D - O)$ do
        solve the following linear equation system to get the encoded data $I_i$:

$$\mathcal{I}_i^T \mathcal{T}_1 = I_i^T T_1, \quad \ldots, \quad \mathcal{I}_i^T \mathcal{T}_{d+1} = I_i^T T_{d+1} \tag{3.8}$$

    return the encoded data $I_i$ for all $P_i \in (D - O)$

Algorithm 1 enables a KPA adversary to infer the encoded query $T_j$ for every processed query $Q_j$ and to uncover the encoded data $I_i$ for every $P_i \in (D - O)$, which leads to a complete disclosure. The following statement summarizes this security risk.

Security Risk 1. ASPE (i.e., Scheme 2) is vulnerable to a KPA adversary: if the adversary acquires $(d + 1)$ plaintext-ciphertext pairs $\{(P_1, \mathcal{I}_1), \cdots, (P_{d+1}, \mathcal{I}_{d+1})\}$, where $P_i \in D$ ($i \in [1, d + 1]$) and the corresponding encoded data $\{I_1, \cdots, I_{d+1}\}$ are linearly independent, the adversary can recover the plaintext of the entire database $D$ and all processed queries.
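The two steps of the LEP attack can be traced on a small example. The Python sketch below is an illustration under stated assumptions, not the thesis implementation: the attack code only uses inner products between ciphertexts, so a single-matrix encryption (Scheme 1 style, with a hypothetical key `M`) stands in for Scheme 2, which preserves the same inner products. The sample points, queries, and function names are chosen for the example, and the first $d + 1$ trapdoors are assumed to correspond to linearly independent encoded queries.

```python
from fractions import Fraction

def solve(A, b):
    """Solve A x = b exactly by Gauss-Jordan elimination (A square, invertible)."""
    n = len(A)
    aug = [[Fraction(v) for v in row] + [Fraction(bv)] for row, bv in zip(A, b)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if aug[r][col] != 0), None)
        if pivot is None:
            raise ValueError("vectors are not linearly independent")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pv = aug[col][col]
        aug[col] = [v / pv for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n] for row in aug]

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def encode_point(p):
    return [Fraction(x) for x in p] + [Fraction(-1, 2) * dot(p, p)]

def encode_query(q, r):
    return [Fraction(r) * x for x in q] + [Fraction(r)]

# Stand-in encryption with a hypothetical secret key (d = 2).
M = [[2, 1, 0], [1, 3, 1], [0, 1, 4]]

def encrypt_index(I):
    return [sum(M[k][i] * I[k] for k in range(len(M))) for i in range(len(M))]

def encrypt_trapdoor(T):
    return solve(M, T)

def lep_attack(known_I, known_cI, trapdoors, hidden_cI):
    """known_I / known_cI: encoded data and ciphertexts of the d+1 known points."""
    # Step 1, Equation (3.7): one linear system per observed trapdoor.
    T = [solve(known_I, [dot(cI, cT) for cI in known_cI]) for cT in trapdoors]
    # Step 2, Equation (3.8): one linear system per unknown encrypted index.
    I = [solve(T, [dot(cI, cT) for cT in trapdoors]) for cI in hidden_cI]
    return T, I

d = 2
known_points = [[1, 0], [0, 1], [2, 0]]           # acquired by the KPA adversary
hidden_points = [[4, 7], [-3, 2]]                 # never seen in plaintext
queries = [([1, 0], 3), ([0, 1], 5), ([0, 0], 2)]  # (Q_j, r_j), unknown to adversary

known_I = [encode_point(p) for p in known_points]
known_cI = [encrypt_index(I) for I in known_I]
hidden_cI = [encrypt_index(encode_point(p)) for p in hidden_points]
trapdoors = [encrypt_trapdoor(encode_query(q, r)) for q, r in queries]

T, I = lep_attack(known_I, known_cI, trapdoors, hidden_cI)
recovered_points = [vec[:d] for vec in I]
recovered_queries = [[x / t[d] for x in t[:d]] for t in T]
```

The attack never touches `M`. It recovers each query by dividing the first $d$ entries of the recovered $T_j$ by its last entry $r_j$, and each hidden point as the first $d$ entries of the recovered $I_i$.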

Compared to Previous Attack

Recent work [101] shows that ASPE is vulnerable to an attack where the adversary has the ability to obtain the ciphertext for some plaintext queries chosen by the adversary (i.e., a CPA adversary). In particular, if the adversary knows the pairs $(Q_j, \mathcal{T}_j)$ encrypted by ASPE for $d$ queries $Q_j$, $1 \le j \le d$, he can learn the unknown encoded data $I_i$ of a data $P_i$ with $d$ unknown variables (and thus learn the plaintext information) by solving the following “$d$ linear equations”, $1 \le j \le d$:

$$\mathcal{I}_i^T \mathcal{T}_j = I_i^T T_j, \tag{3.9}$$

where $\mathcal{I}_i$ is the ciphertext of $P_i$ and the first $d$ elements of $I_i$ are the $d$ unknown variables. Refer to Equation (3.1) for the definitions of $I_i$ and $T_j$. However, a closer look reveals that this attack cannot be executed as suggested in [101], for the following reasons.

• First of all, each $T_j$ involves a random value $r_j$ unknown to the adversary. Thus, the above $d$ equations actually contain $2d$ unknown variables (instead of $d$ unknown variables): $d$ unknown variables for the random values $\{r_1, \cdots, r_d\}$ generated to randomize $\{Q_1, \cdots, Q_d\}$, and $d$ unknown variables for the first $d$ elements of $I_i$.

• Secondly, the $(d + 1)$-th element of $I_i$ is a quadratic term $-0.5\|P_i\|^2$; thus, these equations are not a linear system.

• Thirdly, this attack assumes a CPA adversary, which is much more powerful than the KPA adversary considered in [98].

• Finally, this attack assumes that the adversary obtains the plaintext-ciphertext pairs for a query. In this case, the adversary can learn the plaintext of the query result because he has the ability to compute the query result, and the result must match the known plaintext of the query. In fact, no SSE can defend against such adversaries because the attack does not depend on the encryption scheme.

Discussion. Security Risk 1 advances the state-of-the-art because ASPE was previously claimed to be resilient to a KPA adversary (i.e., Theorem 6 in [98]). For a $d$-dimensional database, with $(d + 1)$ plaintext-ciphertext pairs, the adversary can solve each linear equation system in Algorithm 1 using Gaussian elimination with complexity $O((d + 1)^3)$ to recover the plaintext of the entire database and all processed queries encrypted by ASPE. As discussed above, the previous attack shown in [101] is ineffective. To our knowledge, the LEP attack (i.e., Algorithm 1) is the first work demonstrating a complete disclosure under exactly the same adversary model as in [98].

3.3 Ciphertext Only Attack

In this section, we study the security risk of a recent work based on ASPE, called multi-keyword fuzzy search over encrypted data (MKFSE) [92], for outsourcing multi-keyword fuzzy search queries. In particular, we show that MKFSE is vulnerable under the ciphertext only attack (COA) adversary model, in which the adversary knows only the ciphertexts of documents and queries. Since a COA adversary knows less than a KPA adversary, this result does not follow from the previous section. MKFSE encodes plaintext documents and queries into secret binary vectors before applying ASPE for encryption. Binary domain attributes are very common. For example, a text document is usually represented as a binary vector over keywords where 0/1 means the absence/presence of a keyword, and a Bloom filter is a compressed representation of a set of items and is widely used in document search [40][5]. Our study shows that the vulnerability of MKFSE can lead to the disclosure of the secret binary vectors, which further leads to the disclosure of plaintext documents or queries.

3.3.1 Revisit MKFSE Construction

First of all, we revisit the MKFSE construction. To support fuzzy keyword search, each keyword in a document $P_i$ is transformed into a bigram set that contains all contiguous sequences of 2 letters in the keyword. For example, the bigram set for “simon” is {si, im, mo, on}. Then, each element in the bigram set is inserted into a $d$-dimensional binary vector (e.g., a Bloom filter) using $l$ locality-sensitive hash (LSH) functions [75]. The encoded data $I_i$ for data $P_i$ is then generated by applying a pseudo-random function $F$ to this $d$-dimensional binary vector. The encoded query $T_j$ of a query $Q_j$ is generated in the same way. We represent the encoded data $I_i$ and encoded query $T_j$ in [92] as

$$I_i = F(LSH(P_i), K), \qquad T_j = F(LSH(Q_j), K), \tag{3.10}$$

where $K$ is the secret key for the pseudo-random function $F$, and $I_i$ and $T_j$ are both $d$-dimensional binary vectors. It is worth noting that the dimensionality $d$ of the encoded data and query is public because the generation algorithm is public, but their contents are secret. Intuitively, $F$ can be thought of as permuting the positions of the 0/1 string $LSH(P_i)$ or $LSH(Q_j)$, with the permutation determined by the secret key $K$. Given the encoded data $I_i$ and encoded query $T_j$, MKFSE generates the ciphertexts $\mathcal{I}_i$ and $\mathcal{T}_j$ in the same way as ASPE using Equations (3.4)-(3.5). The following equation holds:

$$\mathcal{I}_i^T \mathcal{T}_j = I_i^T T_j, \tag{3.11}$$

which allows the server to approximate the relevance score of $P_i$ to $Q_j$ by computing $\mathcal{I}_i^T \mathcal{T}_j$.
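The encoding pipeline can be sketched in Python to show why the inner product of two encoded binary vectors acts as a relevance score. This is a rough illustration only: plain SHA-256 hashes stand in for the LSH functions (they are not locality-sensitive), a seeded shuffle stands in for the keyed function $F$, and the constants `D` and `NUM_HASHES` and all names are assumptions of this sketch. The ASPE encryption step is omitted because it preserves the inner product computed here.

```python
import hashlib
import random

D = 64          # dimensionality of the binary vectors (illustrative)
NUM_HASHES = 2  # number of hash functions per bigram (illustrative)

def bigrams(word):
    """All contiguous 2-letter sequences of a keyword."""
    return {word[i:i + 2] for i in range(len(word) - 1)}

def encode(words, key):
    """Binary vector over bigrams, then a key-dependent permutation."""
    bits = [0] * D
    for w in words:
        for g in bigrams(w):
            for k in range(NUM_HASHES):
                h = hashlib.sha256(f"{k}:{g}".encode()).digest()
                bits[int.from_bytes(h, "big") % D] = 1
    perm = list(range(D))
    random.Random(key).shuffle(perm)   # stand-in for F with secret key K
    return [bits[perm[i]] for i in range(D)]

def score(i_vec, t_vec):
    """Inner product of two binary vectors: number of shared 1-positions."""
    return sum(a * b for a, b in zip(i_vec, t_vec))

KEY = 7  # hypothetical secret key
doc_vec = encode(["simon", "fraser"], KEY)
exact_vec = encode(["simon"], KEY)
fuzzy_vec = encode(["simom"], KEY)     # misspelled query keyword

exact_score = score(doc_vec, exact_vec)
fuzzy_score = score(doc_vec, fuzzy_vec)
```

Because the same permutation is applied to documents and queries, every 1-position of an exact query keyword is also set in the document vector, and the misspelled keyword still shares the bigrams si, im and mo with the document.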

3.3.2 Proposed Attack

Given the above brief introduction of the MKFSE scheme, the goal of our attack is to use Equation (3.11) to recover the encoded data $I_i$ and the encoded query $T_j$, which are binary vectors. While learning $I_i$ and $T_j$ does not directly lead to the disclosure of the plaintext $P_i$ or $Q_j$, the deterministic property of LSH and $F$ implies that the similarity between encoded vectors reflects the similarity between the plaintexts. For example, similar $I_1$ and $I_2$ will be generated from similar $P_1$ and $P_2$ with a high probability. Therefore, if the adversary learns the plaintext content of $P_1$, he can also infer the plaintext content of $P_2$. In addition, the
