VERIFIABLE SEARCHABLE SYMMETRIC ENCRYPTION

BY

ZACHARY A. KISSEL

B.S. MERRIMACK COLLEGE (2005)

M.S. NORTHEASTERN UNIVERSITY (2007)

SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

COMPUTER SCIENCE

UNIVERSITY OF MASSACHUSETTS LOWELL

Signature of Author: Date:

Signature of Dissertation Chair: Dr. Jie Wang

Signatures of Other Dissertation Committee Members

Committee Member Signature: Dr. Xinwen Fu

Committee Member Signature: Dr. Tingjian Ge

Committee Member Signature:

BY

ZACHARY A. KISSEL

ABSTRACT OF A DISSERTATION SUBMITTED TO THE FACULTY OF THE DEPARTMENT OF COMPUTER SCIENCE

IN PARTIAL FULFILLMENT OF THE REQUIREMENTS

FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

COMPUTER SCIENCE

UNIVERSITY OF MASSACHUSETTS LOWELL 2013

Dissertation Supervisor: Jie Wang, Ph.D.

convenient platform for users to store data that can be accessed from anywhere at any time without the cost of maintaining a storage infrastructure. However, cloud storage is inherently insecure, hindering general acceptance of the paradigm shift. To make use of storage services provided by a cloud, users would need to place their trust, at least implicitly, in the provider. There have been a number of attempts to alleviate the need for this trust through cryptographic methods. An immediate approach would be to encrypt each file before uploading it to the cloud. This approach, however, calls for a new searching mechanism over encrypted data stored in the cloud.

This dissertation considers a solution to this problem using Symmetric Searchable Encryption (SSE). SSE allows users to offload search queries to the cloud. The cloud is then responsible for returning the encrypted files that match the search queries (also encrypted). Most previous work was focused on keyword search in the Honest-but-Curious (HBC) cloud model, while some more recent work has considered searching on phrases. Recently, a new cloud model was introduced that supersedes the HBC model. This new model, called Semi-Honest but Curious (SHBC), is less restrictive over the actions a cloud can take. In this dissertation, we present three systems that are secure under this new SHBC model. Two systems provide phrase search and the other provides hierarchical access control over keyword search.

I would like to begin by thanking the person responsible most for the success of this dissertation, my advisor, Prof. Jie Wang. Prof. Wang provided me with the unique opportunity to look at the problems that interested me, providing encouragement and guidance as I progressed. I would also like to thank my committee members Professors Xinwen Fu, Tingjian Ge, and Yan Luo. Together, they provided helpful comments that improved this work. In particular, the article that became Chapter 5 was in preparation at the time of the proposal; their comments around investigating access control over searching validated the need to submit that work.

While completing the last year of my PhD studies, I was fortunate to have the opportunity to join the faculty at Merrimack College as a visiting professor. This appointment gave me a chance to branch out in all facets of academia. I am most indebted to the friendships and hallway conversations with Lisa Michaud, Vance Poteat, and Chris Stuetzle. In particular, I wish to thank Chris Stuetzle for early reviews of the material that would become Chapter 3. I would also like to thank Vance Poteat for serving as a mentor for my transition from industry to teaching this year, and for sparking my interest in networking and security many years ago.

I would like to thank my parents Dan and Deb for their continued love, support, and encouragement over all these years, and specifically for demonstrating to me the most important lesson: with hard work there are no limits. To Wendy, thank you for sharing this journey with me. Thank you for all the encouragement and understanding for

List of Figures

1 Introduction
1.1 Applications of Searchable Encryption
1.2 Overview of Results
1.3 Dissertation Structure

2 Background
2.1 Background on Probability
2.2 Background on Cryptography
2.2.1 Pseudo-Random Primitives
2.2.2 Symmetric Encryption
2.2.3 Cryptographic Hash Functions
2.3 Searchable Encryption Framework
2.4 Index Data Structures
2.5 Models of Clouds and Security
2.6 Previous Work
2.6.1 A First Solution
2.6.2 Early Indexed Approaches
2.6.3 Improved SSE Constructions
2.6.4 Phrase Searching

3 Verifiable Phrase Search
3.1 Verifiable Encrypted Phrase Search
3.1.1 Verifiable Keyword Search
3.1.2 Verified Phrase Searching
3.1.3 Correctness
3.2 Conclusion

4 Verifiable Phrase Search in a Single Phase
4.1 Notations
4.1.1 Notations
4.2 Background
4.2.1 Background on Next-Word Indexing
4.2.2 Secure Linked Lists
4.3 Basic Construction
4.3.1 Constructing an Encrypted Next-Word Index
4.3.2 An SSE Construction
4.3.3 Security and Efficiency
4.4 Adding Verification
4.4.1 Discussion of Security Guarantees
4.5 Conclusion

5 Hierarchical Access Control
5.1 Model
5.2 Key Regression
5.3 Construction of HAC-SSE and Security
5.3.1 Security Guarantees of HAC-SSE
5.4 Adding Revocation and Verification
5.5 Conclusion

6 Conclusion
6.1 Results
6.2 Future Work

Bibliography

Biography

List of Figures

2.1 A secure linked list on the set $\{D_1, D_3, D_5, D_6\}$
2.2 An example of a phase two table based index.
4.1 An example next-word index.
4.2 Example arrays A and N for $\Delta = \{w_1, w_2, w_3\}$. The arcs represent a logical connection.
5.1 An annotated trie for dictionaries $\Delta_1 = \{cat, dog\}$ and $\Delta_2 = \{car, do\} \cup \Delta_1$
5.2 Final trie based on Figure 5.1. The value $P_h$ denotes the parent’s hash value and $\ell$ denotes the current node’s level.
5.3 Modification to the BuildIndex algorithm to add verification support to the trie
5.4 The HVerify algorithm
5.5 The HRevokeUser algorithm

Chapter 1

Introduction

Imagine for the moment that Alice has a large collection of documents, D, that she wishes to store in a distributed storage environment owned by Bob. Bob has been known to be nosy, which means Alice must encrypt all the documents in her document collection before uploading them to Bob’s distributed storage environment. Assume, now, that Alice wants to read the documents in D that contain a certain word or phrase. What does she do? Trivially, she could ask Bob to send her all the files, decrypt them locally, and then search for the documents that contain the information she is looking for. Retrieving all the files and then decrypting them, however, will incur a great cost in both communication and time. It would be far more efficient, for Alice, if Bob could perform the search and only send the documents that match her query. Alice’s problem is known as the searchable encryption problem.

Song, Wagner, and Perrig offered the first glimpse of a solution to Alice’s problem [1]. They introduced Searchable Symmetric Encryption (SSE). This new SSE construction allows Alice to ask Bob to query the encrypted document collection for a specific word or phrase. Alice enables Bob to perform the search by providing Bob, at query time, with some special information known as a trapdoor. Bob then returns the results of the query to Alice. The guarantees that they provided are that the queries remain unknown to Bob (query privacy) and any information beyond the number of results and size of the encrypted documents is unknown to Bob (query result privacy).

Though this was not its original intention, searchable encryption can be adapted to cloud storage. We assume that a collection of encrypted documents, D, is stored in the cloud such that a search query can be executed over all the documents in the collection. The cloud is responsible for both executing the query and returning the results. We have the added security guarantee that the cloud should be unable to learn the nature of the query. If one uses only symmetric cryptography in the solution, the problem is called the Symmetric Searchable Encryption (SSE) problem. While there do exist asymmetric forms of searchable encryption [2], we will only consider the SSE problem, for it is more efficient in comparison to asymmetric solutions to the searchable encryption problem.

1.1 Applications of Searchable Encryption

Searchable Encryption over phrases can be used to support a large number of diverse applications. For example, in human resource management, one may want to look for a series of phrases that assess the performance of an employee. In medical record management, a doctor may want to retrieve all records in which certain ailments occur next to each other. At an educational institution, an instructor may want to search for student information based on phrases related to course performance. All of these applications share the common need of querying for phrases that are not necessarily known in advance.

If we have access to a hierarchical access control mechanism over encrypted keyword search, even more applications become possible. For example, a company can outsource its data to the cloud and give different employees different access: only members of the finance department should be able to search for financial information and only the members of the engineering department should be able to search for blueprint information. In the area of parental controls, envision a search engine where you do not have to forgo query privacy for filtering of explicit content. All the applications presented share common needs: confidentiality of data, query privacy, and query result privacy. Thus, they are perfect for the application of searchable encryption.

1.2 Overview of Results

In this dissertation we provide efficient solutions to two problems in Symmetric Searchable Encryption. Both solutions exhibit the property of verifiability. By verifiability we mean that the client, in an SSE scheme, can detect if the cloud has returned incomplete or inaccurate results. For verifiability to be meaningful, the cloud model must therefore allow the cloud to fabricate results that are inconsistent with the truth about the document collection. This can be achieved by considering SSE solutions under the model developed by Chai and Gong in [3]. The model is called the Semi-Honest but Curious (SHBC) model. In this model, the cloud does the following: (1) honestly store data; (2) honestly execute the search operations or a fraction of them; (3) return a non-zero fraction of the query results honestly; and (4) try to learn as much information as possible. If a solution has the property of verifiability over its returned results, we say that we have a solution to the Verifiable SSE problem.

Our first result is structured around providing a verifiable phrase search mechanism. This result is based on the two phase protocol presented in [4]. Given a phrase, p, the first phase finds all the documents in D that contain all the words in p. The second phase, using the results of the first, determines which documents in D contain all the words in p, ordered according to p.
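The two phases can be illustrated on plaintext documents. The following Python sketch is only an illustration of the search logic, not the encrypted protocol of [4]; the function names `phase_one` and `phase_two` and the sample collection are hypothetical, and "ordered according to p" is interpreted here as the words of p appearing consecutively.

```python
def phase_one(docs, phrase):
    """Return the ids of documents that contain every word of the phrase."""
    words = set(phrase.split())
    return {doc_id for doc_id, text in docs.items()
            if words <= set(text.split())}

def phase_two(docs, candidates, phrase):
    """Among the candidates, keep documents where the words of the phrase
    appear consecutively and in the order given by the phrase."""
    words = phrase.split()
    matches = set()
    for doc_id in candidates:
        tokens = docs[doc_id].split()
        windows = (tokens[i:i + len(words)]
                   for i in range(len(tokens) - len(words) + 1))
        if any(window == words for window in windows):
            matches.add(doc_id)
    return matches

# Hypothetical sample collection, used only for illustration.
docs = {
    1: "the quick brown fox",
    2: "brown quick the fox",
    3: "quick fox",
}
candidates = phase_one(docs, "quick brown")
results = phase_two(docs, candidates, "quick brown")
print(sorted(candidates), sorted(results))
```

Document 2 survives the first phase because it contains both words, but it is rejected in the second phase because the words appear in the wrong order.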

Our second result improves on the first by presenting a single phase search protocol. This new single phase protocol reduces both the communication complexity and the work that must be performed by the client to complete a successful search. Like our first result, the second result is also verifiable.

In a second vein, we investigate an efficient verifiable searchable encryption scheme which provides access control over keywords that appear in a document collection. The most trivial access control is creating one group of users and allowing dynamic changes to the group. This problem has a good constructive solution provided by Curtmola et al. in [5]. We demonstrate a hierarchical access control mechanism where we divide the users into numbered groups such that if a user in group i has the ability to successfully search for a particular search term, then any user in any group j > i can also successfully search for the same search term.

1.3 Dissertation Structure

The remainder of this dissertation is structured as follows. In Chapter 2 we will discuss the cryptography, theory, and data structures needed to realize SSE. We will conclude that chapter with a discussion of existing work on SSE. In Chapter 3 we will present a verifiable phrase search SSE scheme. In Chapter 4 we will improve on the system of Chapter 3 by introducing a single phase protocol. In Chapter 5 we will present a hierarchical access control mechanism for SSE. We conclude in Chapter 6 by discussing future directions based on the results presented.

Chapter 2

Background

Song, Wagner and Perrig posed the question [1]: Given an encrypted document, how does one search for a word in that document? They created a system known as Searchable Symmetric Encryption (SSE) to answer just this question. In this chapter we present all the background information necessary to understand SSE. We start by reviewing a few details from probability and cryptography. We proceed to discuss two formal models of clouds and the existing security models for SSE. We conclude by discussing the existing work in the area.

2.1 Background on Probability

In order to understand modern cryptography, one needs a firm grasp on probability theory. In this section we will review the probability theory needed to understand Section 2.2. The ideas that must be understood are the notions of probability distributions, statistical distance, and computational indistinguishability.

We begin by discussing the idea of negligible functions. In cryptography we do not require that the adversary always fail, but that the adversary only succeeds with some very small non-zero probability. Formally, we call this small non-zero probability negligible, denoted by negl. This is an asymptotic notion which we formally define in Definition 2.1.1.

Definition 2.1.1 (Negligible Function [6]). A function f(n) is called negligible if for every polynomial function poly(n) there exists an $n_0$ such that for all $n > n_0$ we have $f(n) < \frac{1}{\mathrm{poly}(n)}$. If the bound holds, we denote f(n) by negl(n).
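As a small numerical illustration of Definition 2.1.1, the following Python sketch compares $f(n) = 2^{-n}$ against $1/n^c$ for a few fixed exponents c. The function names and the tested range are assumptions made for this example; a finite check of this kind illustrates the definition but cannot prove that a function is negligible.

```python
from fractions import Fraction

def f(n: int) -> Fraction:
    """Candidate negligible function f(n) = 2^(-n), kept exact with Fraction."""
    return Fraction(1, 2 ** n)

def threshold(c: int, limit: int = 200) -> int:
    """Return the smallest n0 such that f(n) < 1/n^c for all n0 < n <= limit.

    Only the finite range 1..limit is inspected.
    """
    n0 = 0
    for n in range(1, limit + 1):
        if not f(n) < Fraction(1, n ** c):
            n0 = n  # the bound fails at n, so n0 must be at least n
    return n0

for c in (1, 2, 3, 5):
    print(f"c = {c}: 2^-n < 1/n^{c} for all {threshold(c)} < n <= 200")
```

For each polynomial $n^c$ a different threshold $n_0$ is found, which is exactly the order of quantifiers in the definition: the threshold may depend on the polynomial.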

We are interested in making statements about probability distributions. Define a sample space S as the set of possible outcomes of some experiment and an event A as a subset of S. A probability distribution is defined as follows:

Definition 2.1.2 (Probability Distribution [7]). A probability distribution Pr(·) on a sample space S is a mapping from events of S to real numbers satisfying the following axioms:

1. Pr(A) ≥ 0 for any event A.

2. Pr(S) = 1.

3. Pr(A ∪ B) = Pr(A) + Pr(B) for any two mutually exclusive events A and B. More generally, for any (finite or countably infinite) sequence of events $A_1, A_2, \ldots$ that are pairwise mutually exclusive,

\[ \Pr\Bigl(\bigcup_i A_i\Bigr) = \sum_i \Pr(A_i). \]

The notation Pr(A) also denotes the probability of event A.

A random variable is a function $X : S \to \mathbb{R}$, where S is a sample space. Given Definition 2.1.2 and the notion of a random variable we can define the notion of a probability ensemble. A probability ensemble is a, possibly infinite, collection of probability distributions. Formally, we define them as follows:

Definition 2.1.3 (Probability Ensemble [6]). Let I be a countable set. A probability ensemble indexed by I is a collection of random variables $\{X_i\}_{i \in I}$.

Several cryptographic discussions rely on the notion of one probability distribution being computationally indistinguishable from another. What this means is that one cannot construct a probabilistic polynomial-time algorithm that can distinguish one distribution from another with more than a negligible probability. Given Definition 2.1.3 we define computational indistinguishability formally as follows:

Definition 2.1.4 (Computational Indistinguishability [6]). Two probability ensembles $X = \{X_n\}_{n \in \mathbb{N}}$ and $Y = \{Y_n\}_{n \in \mathbb{N}}$ are computationally indistinguishable, denoted $X \equiv_c Y$, if for every probabilistic polynomial-time distinguisher D there exists a negligible function negl(n) such that

\[ \left| \Pr(D(1^n, X_n) = 1) - \Pr(D(1^n, Y_n) = 1) \right| \le \mathrm{negl}(n), \]

where $D(1^n, X_n)$ means to choose x according to distribution $X_n$, and then run $D(1^n, x)$.

2.2 Background on Cryptography

Searchable Symmetric Encryption is based on several cryptographic primitives. The necessary primitives are pseudo-random generators, pseudo-random functions, pseudo-random permutations, symmetric key encryption, and cryptographic hash functions. For discussions of these primitives please see, for example, [8, 6, 9].

2.2.1 Pseudo-Random Primitives

We consider a pseudo-random generator (PRG). A pseudo-random generator is a function provided with an n-bit input that expands its input to a longer sequence in a way that the distribution generated by the pseudo-random generator is computationally indistinguishable from the uniform distribution. The precise definition appears in Definition 2.2.1.

Definition 2.2.1 (Pseudo-Random Generator [6]). Let $\ell(\cdot)$ be a polynomial and G a deterministic polynomial-time algorithm such that for any input $s \in \{0,1\}^n$, algorithm G outputs a string of length $\ell(n)$. We say that G is a pseudo-random generator if the following two conditions hold:

1. For every n it holds that $\ell(n) > n$.

2. For any probabilistic polynomial-time distinguisher D, there exists a negligible function negl(n) such that

\[ \left| \Pr(D(r) = 1) - \Pr(D(G(s)) = 1) \right| \le \mathrm{negl}(n), \]

where r is chosen uniformly at random from $\{0,1\}^{\ell(n)}$, the seed, s, is chosen uniformly at random from $\{0,1\}^n$, and the probabilities are taken over the random coin tosses used by D and the choice of r and s.
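The interface of a pseudo-random generator can be sketched in a few lines of Python. The sketch below assumes that the extendable-output function SHAKE-256 from the standard `hashlib` module behaves like a PRG; this is a heuristic stand-in chosen for illustration, not a construction proven to satisfy Definition 2.2.1, and the seed and output lengths are arbitrary choices.

```python
import hashlib
import secrets

def G(seed: bytes, out_len: int) -> bytes:
    """Deterministically stretch `seed` to `out_len` bytes using SHAKE-256.

    Condition 1 of the definition (expansion) is enforced explicitly.
    """
    if out_len <= len(seed):
        raise ValueError("a PRG must expand its input: need out_len > len(seed)")
    return hashlib.shake_256(seed).digest(out_len)

seed = secrets.token_bytes(16)   # n = 128 bits, chosen uniformly at random
stream = G(seed, 64)             # l(n) = 512 bits of output
print(len(seed) * 8, "seed bits stretched to", len(stream) * 8, "output bits")
```

Because G is deterministic, all of the randomness in its output comes from the seed; the same seed always yields the same output string.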

A stronger pseudo-random primitive comes in the form of a pseudo-random function (PRF). A pseudo-random function is a member of a family of functions where the behavior of one function, drawn randomly from the family, is computationally indistinguishable from that of a truly random function. We define a family of functions as a set of keyed functions $F : \{0,1\}^k \times \{0,1\}^n \to \{0,1\}^l$, where k, n, l > 1. If k = n = l and each $F(K, \cdot)$ is a bijection, then we have a pseudo-random permutation (PRP). Formally, a pseudo-random function is defined by Definition 2.2.2.

Definition 2.2.2 (Pseudo-Random Function). A keyed function $F : \{0,1\}^k \times \{0,1\}^n \to \{0,1\}^l$ is pseudo-random if for any probabilistic polynomial-time distinguisher D, given oracle access to $F_K = F(K, \cdot)$, there exists a negligible function negl(n) such that

\[ \left| \Pr\left(D^{F_K(\cdot)}(1^n) = 1\right) - \Pr\left(D^{f(\cdot)}(1^n) = 1\right) \right| \le \mathrm{negl}(n), \]

where $K \stackrel{R}{\leftarrow} \{0,1\}^k$ is chosen uniformly at random and f is chosen uniformly at random from all functions that map $\{0,1\}^n$ to $\{0,1\}^l$.
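The experiment in Definition 2.2.2 can be made concrete with a keyed function that is deliberately not pseudo-random. In the Python sketch below, `weak_F` (the hypothetical function $F(K, x) = K \oplus x$), the distinguisher `D`, and the block length are all assumptions made for illustration. The truly random function f is sampled lazily: each new input receives a fresh uniform output, and repeated inputs receive the same answer.

```python
import secrets

N_BYTES = 16  # block length n = 128 bits (illustrative choice)

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def weak_F(key: bytes, x: bytes) -> bytes:
    """A deliberately insecure keyed function: F(K, x) = K xor x."""
    return xor(key, x)

def random_function():
    """Lazily sample a uniformly random function f: {0,1}^n -> {0,1}^n."""
    table = {}
    def f(x: bytes) -> bytes:
        if x not in table:
            table[x] = secrets.token_bytes(N_BYTES)
        return table[x]
    return f

def D(oracle) -> int:
    """Distinguisher with oracle access: output 1 if the two answers
    satisfy the relation that always holds for weak_F."""
    x0 = bytes(N_BYTES)
    x1 = b"\x01" * N_BYTES
    return int(xor(oracle(x0), oracle(x1)) == xor(x0, x1))

def estimate_advantage(trials: int = 1000) -> float:
    """Estimate |Pr[D^{F_K} = 1] - Pr[D^f = 1]| by repeated sampling."""
    hits_keyed = hits_random = 0
    for _ in range(trials):
        key = secrets.token_bytes(N_BYTES)
        hits_keyed += D(lambda x, k=key: weak_F(k, x))
        hits_random += D(random_function())
    return abs(hits_keyed - hits_random) / trials

print("estimated advantage:", estimate_advantage())
```

With two oracle queries the distinguisher separates `weak_F` from a random function almost perfectly, so its advantage is far from negligible and `weak_F` is not a pseudo-random function.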

If we have a family of length preserving functions, then we get a PRP. We say a function is length preserving if |F(k, x)| = |x| = |k|. Formally, this is given by Definition 2.2.3.

Definition 2.2.3 (Pseudo-Random Permutation [6]). Let $F : \{0,1\}^* \times \{0,1\}^* \to \{0,1\}^*$ be an efficient, length-preserving, keyed function. We say that F is a pseudo-random permutation if for any probabilistic polynomial-time distinguisher D, there exists a negligible function negl(n) such that

\[ \left| \Pr\left(D^{F_K(\cdot)}(1^n) = 1\right) - \Pr\left(D^{f(\cdot)}(1^n) = 1\right) \right| \le \mathrm{negl}(n), \]

where $K \stackrel{R}{\leftarrow} \{0,1\}^n$ is chosen uniformly at random and f is chosen uniformly at random from the set of functions mapping $\{0,1\}^n$ to $\{0,1\}^n$.

Notationally, $D^{f(\cdot)}(\cdot)$ means that D uses f as an oracle and D can query f a polynomial number of times.

2.2.2 Symmetric Encryption

Given a set M known as the message space, a set C known as the ciphertext space, and a set K known as the key space, we define symmetric encryption as a tuple (G, E, D) of probabilistic polynomial-time algorithms.

$G : 1^\lambda \to K$: The key generation algorithm takes a security parameter, $1^\lambda$, and selects a key k ∈ K.

E : M × K → C: The encryption algorithm takes a message and a key as input and outputs a string of ciphertext.

D : C × K → M: The decryption algorithm takes a string of ciphertext and a key as input and outputs the plaintext if, and only if, the ciphertext was encrypted with the key. Otherwise, ⊥ is returned.

There is one correctness guarantee, namely, $D_k(E_k(m)) = m$ must hold for all keys k and messages m. Notationally, we will write the key used for encryption and decryption as a subscript of the function, not as an argument.
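The tuple (G, E, D) and its correctness guarantee can be sketched with only the Python standard library. The construction below is an assumption made for illustration: it uses HMAC-SHA256 as a keystream generator in counter mode and a second HMAC as an authentication tag so that decryption can return ⊥ (represented by `None`) when the key does not match. The helper names `_subkey` and `_keystream` are hypothetical, and the code is not a vetted cipher.

```python
import hashlib
import hmac
import secrets
from typing import Optional

def G(security_parameter: int = 32) -> bytes:
    """Key generation: sample a uniformly random key of the given byte length."""
    return secrets.token_bytes(security_parameter)

def _subkey(key: bytes, label: bytes) -> bytes:
    """Derive independent-looking subkeys for encryption and authentication."""
    return hmac.new(key, label, hashlib.sha256).digest()

def _keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Counter-mode keystream built from HMAC-SHA256."""
    blocks = []
    for counter in range((length + 31) // 32):
        block_input = nonce + counter.to_bytes(8, "big")
        blocks.append(hmac.new(key, block_input, hashlib.sha256).digest())
    return b"".join(blocks)[:length]

def E(key: bytes, message: bytes) -> bytes:
    """Encrypt: fresh nonce, XOR with the keystream, then append a tag."""
    nonce = secrets.token_bytes(16)
    stream = _keystream(_subkey(key, b"enc"), nonce, len(message))
    body = bytes(m ^ s for m, s in zip(message, stream))
    tag = hmac.new(_subkey(key, b"mac"), nonce + body, hashlib.sha256).digest()
    return nonce + body + tag

def D(key: bytes, ciphertext: bytes) -> Optional[bytes]:
    """Decrypt, or return None (standing in for the symbol ⊥) if the tag
    does not verify under this key."""
    if len(ciphertext) < 48:
        return None
    nonce, body, tag = ciphertext[:16], ciphertext[16:-32], ciphertext[-32:]
    expected = hmac.new(_subkey(key, b"mac"), nonce + body, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        return None
    stream = _keystream(_subkey(key, b"enc"), nonce, len(body))
    return bytes(c ^ s for c, s in zip(body, stream))

k = G()
c = E(k, b"searchable encryption")
print(D(k, c), D(G(), c))
```

Decrypting with the key used for encryption recovers the message, while decrypting with an unrelated key yields `None`.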

The simplest practical security guarantee that a symmetric encryption scheme can exhibit is that of semantic security, meaning that an attacker is unable to learn anything about the plaintext except what is leaked by the ciphertext (e.g., the length of the message). In other words, the probability of finding the plaintext from the ciphertext is not much different from that of guessing the plaintext without the ciphertext. Formally, this can be defined as follows:

Definition 2.2.4 (Semantic Security for Symmetric Encryption [6]). A symmetric encryption scheme (G, E, D) is semantically secure in the presence of an eavesdropper if for every probabilistic polynomial-time algorithm A, there exists a probabilistic polynomial-time algorithm $A'$, such that for all efficiently-sampleable distributions $X = (X_1, \ldots)$ and all polynomial-time computable functions f and h, there exists a negligible function negl(n) such that

\[ \left| \Pr(A(1^n, E_k(m), h(m)) = f(m)) - \Pr(A'(1^n, h(m)) = f(m)) \right| \le \mathrm{negl}(n), \]

where m is chosen according to distribution $X_n$, and the probabilities are taken over the choice of m and the key k, and any random coins used by A, $A'$, and the encryption process.

This definition is based on the pioneering work of Goldwasser and Micali [10]. From Goldwasser and Micali’s work, Bellare, Desai, Jokipii, and Rogaway [11] defined semantic security for symmetric encryption systems.

Using pseudo-random generators, pseudo-random functions, and pseudo-random permutations one can construct symmetric encryption schemes. One-time pad encryption systems can be constructed from pseudo-random generators and block ciphers can be constructed from pseudo-random permutations or pseudo-random functions. In particular, block ciphers can be constructed using the Luby-Rackoff construction [12]. In the remainder of this dissertation we will consider a symmetric encryption system to be modeled as one of the pseudo-random primitives to exhibit its properties.

2.2.3 Cryptographic Hash Functions

We define a hash family H as a family of surjective functions $h_s : \{0,1\}^n \to \{0,1\}^m$ for m < n. We say that the hash function $h_s \in H$ is collision resistant if it is hard to find different strings $x_1, x_2 \in \{0,1\}^n$ that hash to the same value $v \in \{0,1\}^m$. We say that the hash function $h_s$ is pre-image resistant if, given the value $h_s(x)$, an attacker can recover x with only negligible probability. Lastly, we say that the hash function $h_s$ is second pre-image resistant if, given a value $x \in \{0,1\}^n$, an attacker can find, with only negligible probability, an $x' \in \{0,1\}^n$ with $x' \ne x$ such that $h_s(x) = h_s(x')$.
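The role of the output length m in collision resistance can be seen with a toy experiment. The Python sketch below truncates SHA-256 to 16 bits, an illustrative choice that is far too short for any real use, and finds a collision by brute force. The function names are hypothetical. Since there are only $2^{16}$ possible outputs, the pigeonhole principle guarantees that the search stops within $2^{16} + 1$ trials.

```python
import hashlib

def h(x: bytes, out_bits: int = 16) -> int:
    """SHA-256 truncated to its `out_bits` most significant bits."""
    digest = hashlib.sha256(x).digest()
    return int.from_bytes(digest, "big") >> (256 - out_bits)

def find_collision(out_bits: int = 16):
    """Hash the inputs 0, 1, 2, ... until two distinct inputs collide."""
    seen = {}
    counter = 0
    while True:
        x = counter.to_bytes(8, "big")
        v = h(x, out_bits)
        if v in seen:
            return seen[v], x
        seen[v] = x
        counter += 1

x1, x2 = find_collision()
print(x1.hex(), x2.hex(), "collide on", h(x1))
```

With the full 256-bit output the same search would be computationally infeasible, which is the intuition behind treating such a function as collision resistant.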

Cryptographic hash functions are a family of collision, pre-image, and second pre-image resistant hash functions which are used in many areas of cryptography. They consist of a pair of probabilistic polynomial-time functions (G, H). G is used to select, at random, a key s. This key is an index of the hash function in the family. The function $h_s : \{0,1\}^* \to \{0,1\}^{\ell(n)}$ is drawn from H according to s. The output length of $h_s$ (i.e., $\ell(n)$) must be less than, or equal to, the length of the message being hashed. Cryptographic hash functions can be constructed from block ciphers using the Merkle-Damgård construction [13, 14].

Generally, the security of hash functions is modeled in two ways. The first is called the standard security model. In the standard model, one only uses the three properties of cryptographic hash functions stated above. The second model, called the random oracle model, treats a hash function as a random oracle. This random oracle responds with a random value for each query. However, if a query is repeated the oracle will respond with the same value. This model was first proposed by Goldreich, Goldwasser, and Micali in 1985 [15].

2.3 Searchable Encryption Framework

We make use of the following notation for discussing the results of research into SSE. Let $D = \{D_1, D_2, \ldots, D_n\}$ denote a collection of n encrypted documents in the cloud storage, Σ the alphabet over which characters from strings are drawn, and $\Delta = \{w_1, w_2, \ldots, w_d\}$ a dictionary of d words drawn from $\Sigma^*$. We associate with each document in collection D a number used as an index. The function is denoted by $id : D \to \mathbb{Z}$. Let $D(w_i)$ denote the set of identifiers of the documents that contain the word $w_i \in \Delta$. We will use $m_1 \,||\, m_2$ to denote the concatenation of messages $m_1$ and $m_2$.

For the remainder of this dissertation we will define our SSE systems following the rigorous framework of Curtmola, Garay, Kamara, and Ostrovsky in [5]. Their model consists of a tuple of four algorithms (Keygen,BuildIndex,Trapdoor,Search). These algorithms are defined as follows:

Keygen($1^\lambda$): A probabilistic algorithm run by the owner to set up the scheme. It takes a security parameter λ as input and returns a secret key K.

BuildIndex(K, D): A probabilistic algorithm run by the owner to generate the indexes. It takes a key K and a document collection D as input and returns an index I.

Trapdoor(K, w): An algorithm run by the owner to generate a trapdoor $T_w$, given a key K and a word w.

Search(I, $T_w$): An algorithm, run by the cloud, that searches for a keyword in the document collection. It takes an index I and a trapdoor $T_w$ and returns the document identifiers for documents that contain word w.

An index, I, is a data structure, or set of data structures, that tracks keywords and documents that contain those keywords. We note that in some chapters of this dissertation we will assume that the model uses phrases $p \in \Delta^*$ instead of words $w \in \Delta$. This will cause small modifications to both the Trapdoor and Search function inputs.
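To make the four-algorithm interface concrete, the following Python sketch implements a toy scheme with the same shape. It is an assumption made for illustration and is not the construction of [5] or of any later chapter: trapdoors are HMAC-SHA256 values of the keyword, the index maps each trapdoor to a list of document identifiers stored in the clear, and documents are given as plaintext strings to BuildIndex. A real scheme would also hide the index entries, so this sketch leaks far more than the schemes discussed in this dissertation.

```python
import hashlib
import hmac
import secrets

def keygen(security_parameter: int = 32) -> bytes:
    """Keygen: the owner samples a random secret key K."""
    return secrets.token_bytes(security_parameter)

def trapdoor(key: bytes, word: str) -> bytes:
    """Trapdoor: the owner derives T_w from K and the keyword w."""
    return hmac.new(key, word.lower().encode("utf-8"), hashlib.sha256).digest()

def build_index(key: bytes, documents: dict) -> dict:
    """BuildIndex: map each keyword's trapdoor to the ids of documents
    that contain the keyword. `documents` maps id -> plaintext string."""
    index = {}
    for doc_id, text in documents.items():
        for word in set(text.lower().split()):
            index.setdefault(trapdoor(key, word), []).append(doc_id)
    for ids in index.values():
        ids.sort()
    return index

def search(index: dict, t_w: bytes) -> list:
    """Search: run by the cloud, which sees only the index and T_w."""
    return index.get(t_w, [])

# Hypothetical sample collection, used only for illustration.
documents = {1: "the cat sat", 2: "the dog sat", 3: "a cat"}
K = keygen()
I = build_index(K, documents)
print(search(I, trapdoor(K, "cat")))
```

The cloud never receives K or the keyword itself; it can only test whether the trapdoor it is handed appears as a key in the index.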

There are two major forms of indexes used by SSE. They are the inverted index and the per-document index. The inverted index structure, borrowed from the field of information retrieval, is a single data structure that is used to associate each keyword with the set of documents in the document collection that contain the word [16]. The per-document index associates, with each document, a data structure that tracks the keywords stored in that document.
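The difference between the two index forms can be shown on a small plaintext collection. The Python sketch below is only an illustration; the function names and the sample documents are hypothetical, and no encryption is applied.

```python
from collections import defaultdict

# Hypothetical sample collection, used only for illustration.
documents = {1: "the cat sat", 2: "the dog sat", 3: "a cat"}

def inverted_index(docs):
    """One structure for the whole collection: keyword -> set of document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():
            index[word].add(doc_id)
    return dict(index)

def per_document_index(docs):
    """One structure per document: document id -> set of its keywords."""
    return {doc_id: set(text.split()) for doc_id, text in docs.items()}

print(inverted_index(documents)["cat"])
print(per_document_index(documents)[2])
```

A keyword lookup in the inverted index touches a single entry, whereas the per-document index must be consulted once for every document in the collection.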

2.4 Index Data Structures

In this section we will discuss two data structures that permeate the research. These data structures are used to construct both per-document and inverted indexes. Indexes are required to provide two operations, Search and Insert, with a third optional operation, Delete. The Search operation is used to determine if a search key occurs in the data structure. The Insert operation is used to add a new key, with its associated data, to the data structure. The Delete operation is used to remove a key, and associated data, from the data structure. We present two index structures in this section, the trie and the Bloom Filter.

Devised by Fredkin [17], a trie is an index method which supports three main operations: Insert, Search, and Delete; all take a word $w \in \Sigma^*$ as input. A trie is a

Moreover, a root-to-leaf path through the tree denotes a word $w \in \Sigma^*$, which is terminated by a special character $\$ \notin \Sigma$.

The Insert operation appends a $ to the input w. Starting at the root node of the tree, we use w to create a path. The first time we reach a node that does not have the current corresponding letter in w, we add a subpath as a child to the current node. Moreover, we label this subpath appropriately with the remaining letters of w, terminating the path with a $. We note that the insertion time within the trie is O(|w|).

The Search operation uses input w as a path through the tree. The function first adds a $ to the path. If that path ends in a leaf, i.e., the path is a root-to-leaf path, the search is successful. Otherwise, the word does not exist in the dictionary. We note that the search time within the trie is Θ(|w|) in the worst case.

The Delete operation uses input w as a path through the tree. This function will remove all nodes, in a bottom up fashion, according to the path given by w. There is one exception: a node will not be removed if it has children that do not match the symbol indicated by the previous level in w. We note that the Delete time in the trie is Θ(|w|).
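The three operations described above can be sketched as follows. In this Python sketch, which is an illustration rather than the representation used later in the dissertation, each node is a dictionary from edge labels to child nodes, and the class and method names are assumptions.

```python
END = "$"  # terminator symbol, assumed not to occur in the alphabet

class Trie:
    def __init__(self):
        self.root = {}

    def insert(self, word: str) -> None:
        """Follow the path for word + '$', creating missing nodes on the way."""
        node = self.root
        for ch in word + END:
            node = node.setdefault(ch, {})

    def search(self, word: str) -> bool:
        """Succeed only if word + '$' labels a root-to-leaf path."""
        node = self.root
        for ch in word + END:
            if ch not in node:
                return False
            node = node[ch]
        return True

    def delete(self, word: str) -> bool:
        """Remove nodes bottom-up along the path of word + '$', stopping at
        the first node that still has other children."""
        path = [self.root]
        for ch in word + END:
            if ch not in path[-1]:
                return False  # the word is not stored in the trie
            path.append(path[-1][ch])
        for ch, parent in zip(reversed(word + END), reversed(path[:-1])):
            del parent[ch]
            if parent:  # another word still passes through this node
                break
        return True

t = Trie()
for w in ("cat", "car", "dog"):
    t.insert(w)
t.delete("cat")
print(t.search("cat"), t.search("car"), t.search("ca"))
```

Deleting "cat" removes only the nodes below the shared prefix "ca", so "car" remains searchable. Each operation visits at most |w| + 1 nodes.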

In this dissertation we will denote a trie by T and a node of the trie by $T_{i,j}$, where i is the depth of the node and j the left to right placement of the node. We will denote the access to values stored in the node of T by $T_{i,j}[s]$, where s denotes the name of the field.

Devised by B. H. Bloom [18], a Bloom Filter is an index method which consists of a k-bit vector and three hash functions $h_1$, $h_2$, and $h_3$ with range {1, 2, . . . , k} and supports two operations: Insert and Search.

The Insert operation inserts input v by setting positions $h_1(v)$, $h_2(v)$, and $h_3(v)$ in the k-bit vector to 1. The Search operation determines if input v is in the filter. To do this it checks if all locations $h_1(v)$, $h_2(v)$, and $h_3(v)$ in the k-bit vector are 1. If they are, then the value is likely to be in the filter. We say likely as it is possible that the Bloom Filter may give a false positive result. However, the Bloom Filter will never give a false negative. This false positive rate comes with the advantage of Θ(1) Insert and Search time.
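A Bloom Filter with three hash functions can be sketched as follows. In this Python sketch the three hash functions are derived from SHA-256 with different one-byte prefixes, and positions are numbered 0, . . . , k − 1 rather than 1, . . . , k; both choices, as well as the class name, are assumptions made for illustration.

```python
import hashlib

class BloomFilter:
    def __init__(self, k: int = 1024):
        self.k = k              # length of the bit vector
        self.bits = [0] * k

    def _positions(self, value: bytes):
        """Positions h1(v), h2(v), h3(v), derived from prefixed SHA-256."""
        for i in (1, 2, 3):
            digest = hashlib.sha256(bytes([i]) + value).digest()
            yield int.from_bytes(digest, "big") % self.k

    def insert(self, value: bytes) -> None:
        for pos in self._positions(value):
            self.bits[pos] = 1

    def search(self, value: bytes) -> bool:
        """True means 'likely present'; False means 'definitely absent'."""
        return all(self.bits[pos] for pos in self._positions(value))

bf = BloomFilter()
bf.insert(b"cat")
print(bf.search(b"cat"))

# With a one-bit vector every value hashes to the same position, so any
# lookup after a single insertion is a false positive.
tiny = BloomFilter(k=1)
tiny.insert(b"cat")
print(tiny.search(b"dog"))
```

The one-bit filter is a degenerate case chosen to make the false positive visible; in practice k is chosen large relative to the number of inserted values to keep the false positive rate low.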

The traditional Bloom Filter does not support the Delete operation. The reason is this: if you simply flip the hashed locations of the value v to zero, you may introduce false negatives. One way to provide the Delete operation is to use a Counting Bloom Filter [19]. Briefly, a Counting Bloom Filter has every bucket of the Bloom Filter be a k-bit value, for some $k \in \mathbb{Z}^*$. The Insert operation works exactly the same as the Insert for the traditional Bloom Filter, except that 1 is added to the number found at the location in a counting filter. The Search operation, for the counting filter, works similarly to that of the Bloom Filter, except that the search value is considered present in the filter if each hashed location has a number greater than zero. The Delete operation works by subtracting one from each location that the value hashes to. The Delete operation should only be performed if the Search operation finds the value to be deleted.
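The counting variant changes only what is stored in each bucket. The Python sketch below makes the same illustrative assumptions as before (hash functions derived from prefixed SHA-256, zero-based positions, hypothetical class name) and additionally uses unbounded Python integers as counters, so counter overflow is ignored.

```python
import hashlib

class CountingBloomFilter:
    def __init__(self, buckets: int = 1024):
        self.buckets = buckets
        self.counts = [0] * buckets   # one counter per bucket instead of one bit

    def _positions(self, value: bytes):
        """Positions h1(v), h2(v), h3(v), derived from prefixed SHA-256."""
        return [int.from_bytes(hashlib.sha256(bytes([i]) + value).digest(), "big")
                % self.buckets for i in (1, 2, 3)]

    def insert(self, value: bytes) -> None:
        for pos in self._positions(value):
            self.counts[pos] += 1

    def search(self, value: bytes) -> bool:
        return all(self.counts[pos] > 0 for pos in self._positions(value))

    def delete(self, value: bytes) -> bool:
        """Decrement the counters, but only if Search reports the value."""
        if not self.search(value):
            return False
        for pos in self._positions(value):
            self.counts[pos] -= 1
        return True

cbf = CountingBloomFilter()
cbf.insert(b"cat")
cbf.insert(b"cat")
cbf.delete(b"cat")
print(cbf.search(b"cat"))   # one copy of the value is still recorded
```

Guarding Delete with Search prevents counters from dropping below zero, although a false positive in Search can still cause an unrelated value's counters to be decremented.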

2.5 Models of Clouds and Security

In order to talk about the security of systems we must model the abilities of the attacker. In this case, the attacker is the cloud. There are two models prevalent in the research. The first is the Honest-but-Curious (HBC) model and the second is the Semi-Honest-but-Curious (SHBC) model proposed by Chai and Gong in [3].

The HBC model has traditionally been used in the literature to model servers in cryptographic protocols. This model can easily be described in terms of a cloud, which is what we will do here. The HBC model describes a cloud that interacts in the following way:

1. Honestly store the data.

2. Honestly follow the steps of the protocol.

3. Try to learn as much information as possible from the interaction.

The model has served well for many cloud based protocols; however, it does not take into account the desire of the cloud to do as little work as possible. In [3] Chai and Gong captured a more liberal set of requirements which they termed the Semi-Honest-but-Curious model. In the Semi-Honest-but-Curious model the cloud acts in the following way:

1. Honestly store the data.

2. Honestly execute the search operations or a fraction of them.

3. Return a non-zero fraction of the query results honestly.

4. Try to learn as much information as possible from the interaction.

Once a cloud model is selected, we must provide formal definitions of the security guarantees of a constructed SSE system. A potpourri of security guarantees for searchable encryption has evolved in the literature over time. The five models we will review in this section are indistinguishability under chosen keyword attack (IND-CKA and IND2-CKA), due to Goh [20]; Privacy Preserving Search on Encrypted Data (PPSED), due to Chang and Mitzenmacher [21]; and non-adaptive and adaptive indistinguishability, due to Curtmola, Garay, Kamara, and Ostrovsky [5]. We note that Song, Wagner, and Perrig did not define a set of security models beyond the normal cryptographic assumptions.

Goh, in 2006, gave the first two models of security under the names IND-CKA and IND2-CKA. IND-CKA, semantic security against adaptive chosen keyword attacks, captures the notion that an adversary A cannot deduce, beyond what is known from previous queries, the contents of a document from its index. A system is shown to exhibit this security via a game between a challenger C and an adversary A.

Definition 2.5.1 (IND-CKA Game [20]). Given a challenger C and an adversary A we define the IND-CKA game as a tuple of rounds: Setup, Queries, Challenge, and Response.

Setup: The challenger C creates a set $S$ of $q$ words and gives it to A. The adversary A chooses a number of subsets $S^*$ from $S$ and returns $S^*$ to C. Once C receives $S^*$, the challenger C uses BuildIndex to construct an index $I$ of $S^*$ for each document in $D$. The challenger C concludes by sending all indexes with their associated subsets to A.

Queries: The adversary A is allowed to query C on a word $w$ and receive the trapdoor $T_w$ for $w$. With $T_w$, the adversary A can invoke Search on an index $I$ to determine if $w \in I$.

Challenge: After making some Trapdoor requests, the adversary A decides on a challenge by picking a non-empty subset $V_0 \in S^*$ and generating another non-empty subset $V_1$ from $S$ such that $|V_0 - V_1| \neq 0$ and the total length of the words in $V_0$ is equal to the total length of the words in $V_1$. Finally, the adversary A must not have queried C for the trapdoor of any word in $V_0 \cup V_1$. The adversary A then gives $V_0$ and $V_1$ to C, who chooses $b \xleftarrow{R} \{0,1\}$, runs BuildIndex to obtain the index $I_{V_b}$ for $V_b$, and returns $I_{V_b}$ to A. The challenge for A is to determine $b$. After the challenge is issued, the

Response: The adversary A eventually outputs a bit $b'$, representing its guess for $b$. The advantage of A in winning this game is defined by

\[ \mathrm{Adv}_A = \Pr(b = b') - \tfrac{1}{2}. \]

This probability is taken over all the internal coin tosses of A and C.

We say that the adversary A $(t, \epsilon, q)$-breaks the index if $\mathrm{Adv}_A \ge \epsilon$ after A takes at most $t$ time to make $q$ trapdoor queries to the challenger C. An index $I$ is $(t, \epsilon, q)$-IND-CKA secure if no adversary can $(t, \epsilon, q)$-break it.

This game follows the semantic security game introduced by Goldwasser and Micali [10]. We point out that with this model there is no requirement that the trapdoors be secure. This deficiency is later corrected [5]. It should be noted that Goh was not trying to create an SSE system, but a secure index. The notion of a secure index is closely related to, but separate from, SSE.

Goh’s IND2-CKA game (indistinguishability against chosen-keyword attacks) states that, given access to a set of indexes, the adversary is not able to learn any partial information about the encrypted document that cannot be learned from possessing the trapdoor. Moreover, possessing the trapdoor only provides knowledge of whether or not a keyword occurs in the index. Goh further showed that this property holds even if the adversary can trick the client into generating trapdoors. We note that IND2-CKA does not require that trapdoors are kept secure. To obtain the IND2-CKA security guarantee, Goh modified the Challenge step of Definition 2.5.1 as follows: select two non-empty subsets $V_0, V_1 \in S^*$ of possibly different size and word length such that $|V_0 - V_1| \neq 0$.

Chang and Mitzenmacher proposed the PPSED [21], Privacy Preserving Keyword Searches on Remote Encrypted Data, security guarantee preserved between a server

Definition 2.5.2 (PPSED Security [21]). For $k \in \mathbb{N}$, let $C_k$ denote all the communications that server S receives from user U before the $k$th round of the protocol. Let $C_k^* = \{\zeta, Q_0 = \emptyset, Q_1, \ldots, Q_k\}$, where $\zeta$ is the set of encrypted documents and $Q_j$ is an $n$-bit string, for $j \in \{1, 2, \ldots, k-1\}$, such that for $i \in \{1, 2, \ldots, n\}$, we have $Q_j[i] = 1$ if and only if $w_j$ is a keyword in document $m_i$. We say the system is secure if for any $k \in \mathbb{N}$, any probabilistic polynomial-time (PPT) adversary A, any $\delta_k = \{m_1, m_2, \ldots, m_n, w_0 \equiv \emptyset, w_1, \ldots, w_{k-1}\}$, and any function $h$, there exist a PPT algorithm $A^*$ and a negligible function $\mathrm{negl}(A)$ such that

\[ \left| \Pr(A(C_k, 1) = h(\delta_k)) - \Pr(A^*(C_k^*, 1) = h(\delta_k)) \right| \le \mathrm{negl}(A). \]

The definition of $C_k^*$ gives us the information about the interaction from U’s point of view, and $C_k$ gives us all the information that is obtained by watching the messages exchanged between U and S. The PPSED definition, at its core, says that everything that can be computed by A from $C_k$ can also be computed by $A^*$ from $C_k^*$. It turns out that all SSE systems trivially satisfy PPSED [5].

Curtmola, Garay, Kamara, and Ostrovsky [5] presented two precise definitions of security for searchable encryption, which corrected all of the deficiencies present in IND-CKA, IND2-CKA, and PPSED. The deficiencies are due to the fact that no previous definition considers how the queries are issued. It was not until the work of Curtmola et al. that a rigorous and exact security definition was given for SSE. In particular, Curtmola et al. provided formal definitions for non-adaptive, respectively adaptive, indistinguishability notions and showed a reduction from these non-adaptive, respectively adaptive, indistinguishability notions to semantic security. Briefly, in non-adaptive security, privacy is only guaranteed when clients generate all queries at once. In the case of adaptive security, privacy is guaranteed even if the clients generate queries as a function of previous search outcomes. Until this work, any system that was secure was secure in the non-adaptive sense at best. To understand the definitions of these security guarantees we first provide some necessary background.

We qualify the security of an SSE system by what we are willing to leak about a client’s communication with the cloud. In general, we call an SSE system secure if it leaks only the search pattern and the search outcomes. The search pattern describes the queries that were issued and how they may relate (e.g., if the same keyword appears in multiple queries).

As defined by Curtmola et al., we define three sets (the history, the view, and the trace) over an interaction between the client and the cloud. The history is the plaintext for each query and the plaintext for documents returned as a result of issuing the query. The view consists of the encrypted documents, the index, and the trapdoors. In other words, the view consists of everything the cloud can see. The trace is the information about the structure of the interaction. Specifically, the trace consists of the lengths of the documents, the search outcomes, and the search patterns. Given these three sets we can more formally describe the security guarantees. An SSE system is secure if any function of the history that can be computed from the view can also be computed from the trace.

Curtmola et al. first considered security in the case of a non-adaptive adversary. A non-adaptive adversary must make search queries without seeing the outcome of previous search queries. In other words, the adversary must issue the queries in batch form. There exists a stronger adversary, namely an adaptive adversary, who can make search queries based on the outcomes of previous search queries. To prove that a system provides security in the presence of a non-adaptive adversary, we use the notion of computational indistinguishability. The proof is approached in this way so that we can capture all potential adversaries. To do so, we will give the cloud two indexes: one is legitimate and one is fabricated. We will also provide the cloud with the information it would see if the queries were issued over the legitimate index (called the view). The goal of the cloud is to determine which index is legitimate and which index is fabricated. We introduce a proof tool, called a simulator, to construct the fabricated index from exactly the information we are willing to leak about a set of queries. We formalize these notions using the following formal definitions of history, view, and trace [5, 4]:

Definition 2.5.3. Let $D$ be a collection of $n$ documents and $\Delta$ a dictionary. A history $H_q$ is an interaction between a client and a server over $q$ queries, which is denoted by $H_q = (D, w_1, w_2, \ldots, w_q)$, where each $w_i$ is a query (a keyword or a phrase).

An adversary’s view of $H_q$ under secret key $K$ is defined by

\[ V_K(H_q) = (\mathrm{id}(D_1), \ldots, \mathrm{id}(D_n), E(D_1), \ldots, E(D_n), I, T_1, \ldots, T_q), \tag{2.1} \]

where $T_1, \ldots, T_q$ are a series of trapdoors and $I$ is an index.

The trace of $H_q$ is the following sequence:

\[ \mathrm{Tr}(H_q) = (\mathrm{id}(D_1), \ldots, \mathrm{id}(D_n), |D_1|, \ldots, |D_n|, D(w_1), \ldots, D(w_q), \pi_q), \tag{2.2} \]

where $\pi_q$ is the search pattern of the user and $D(w_i)$ denotes the set of identifiers of the documents that contain the word $w_i$.
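To make the trace concrete, the following sketch computes the components of Tr(H_q) from a plaintext history. It is only an illustration under stated assumptions: documents are modeled as a dictionary from integer identifiers to whitespace-separated text, queries are single keywords, and the search pattern is represented as a q-by-q binary matrix whose entry (i, j) is 1 exactly when queries i and j are the same word. The function name trace is hypothetical.

```python
def trace(documents, queries):
    """Return (ids, lengths, outcomes, pattern) for a plaintext history.

    documents: dict mapping a document identifier to its text (assumed layout)
    queries:   list of keyword queries w_1, ..., w_q
    """
    ids = sorted(documents)
    # |D_1|, ..., |D_n|: only the document lengths are leaked, not the contents.
    lengths = [len(documents[doc_id]) for doc_id in ids]
    # D(w_i): identifiers of the documents containing the i-th queried word.
    outcomes = [
        [doc_id for doc_id in ids if word in documents[doc_id].split()]
        for word in queries
    ]
    # Search pattern: which queries repeat an earlier (or later) query.
    pattern = [[int(a == b) for b in queries] for a in queries]
    return ids, lengths, outcomes, pattern
```

Note that the function never returns the query words themselves; the trace reveals only which queries coincide and which documents each one matches.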

We say a system is non-adaptively secure if for any two adversarially constructed histories with equal length and trace, no probabilistic polynomial-time adversary can distinguish the view of one from the view of the other with probability non-negligibly better than 1/2. Formally, this notion is captured by Definition 2.5.4.

Definition 2.5.4 (Non-Adaptive Semantic Security for SSE [5]). An SSE scheme is non-adaptively semantically secure if for all $q \in \mathbb{N}$ and for all (non-uniform) probabilistic polynomial-time adversaries A, there exists a (non-uniform) probabilistic polynomial-time algorithm (called the simulator) S such that for all traces $\mathrm{Tr}_q$ of length $q$, all polynomially sampleable distributions $\mathcal{H}_q$ over

\[ \left\{ H_q \in 2^{2^{\Delta}} \times \Delta^q : \mathrm{Tr}(H_q) = \mathrm{Tr}_q \right\} \]

(where $\Delta$ is the dictionary), all functions $f : \{0,1\}^m \to \{0,1\}^{\ell(m)}$ (where $m = |H_q|$ and $\ell(m) = \mathrm{poly}(m)$), all polynomials $p$, and sufficiently large $k$, we have

\[ \left| \Pr(A(V_K(H_q)) = f(H_q)) - \Pr(S(\mathrm{Tr}(H_q)) = f(H_q)) \right| < \mathrm{negl}(k), \]

where $H_q \xleftarrow{R} \mathcal{H}_q$, $K \leftarrow \mathrm{Keygen}(1^k)$, and the probabilities are taken over $\mathcal{H}_q$ and the internal coin tosses of Keygen, A, S, and the underlying BuildIndex algorithm.

In other words, the intuitive security notion is that if the adversary is unable to learn anything more than what they can learn from the secured index and the encrypted queries then we will say the system is secure. Definition 2.5.4 captures the idea that the system is secure if the simulator can simulate some function of the history that the adversary cannot distinguish, with more than negligible probability, provided that the simulator is given access to only the trace of the history and the adversary is only given access to a view of the history.

A stronger form of security can occur if we allow the adversary to issue queries based on the results of previous queries. In other words, the adversary can adapt to the results produced by the system. This is the strongest notion of security known for SSE.

The stronger notion is called Adaptive Semantic Security. Here the simulator is given access to only the trace of a partial history, and the adversary is given access to only a partial view of the history. The partial history, denoted by $H_q^t$, is composed of the first $t$ elements of the $q$-length history. The partial view, denoted by $V_K^t$, is composed of the first $t$ elements of the $q$-length view. Formally we have,

Definition 2.5.5 (Adaptive Semantic Security for SSE [5]). An SSE scheme is secure in the sense of adaptive semantic security if for all $q \in \mathbb{N}$ and for all (non-uniform) probabilistic polynomial-time adversaries A, there exists a (non-uniform) probabilistic polynomial-time algorithm (the simulator) S such that for all traces $\mathrm{Tr}_q$ of length $q$, all polynomially sampleable distributions $\mathcal{H}_q$ over

\[ \left\{ H_q \in 2^{2^{\Delta}} \times \Delta^q : \mathrm{Tr}(H_q) = \mathrm{Tr}_q \right\} \]

(where $\Delta$ is the dictionary), all functions $f : \{0,1\}^m \to \{0,1\}^{\ell(m)}$ (where $m = |H_q|$ and $\ell(m) = \mathrm{poly}(m)$), all $0 \le t \le q$, all polynomials $p$, and sufficiently large $k$, we have

\[ \left| \Pr(A(V_K^t(H_q^t)) = f(H_q^t)) - \Pr(S(\mathrm{Tr}(H_q^t)) = f(H_q^t)) \right| < \mathrm{negl}(k), \]

where $H_q \xleftarrow{R} \mathcal{H}_q$, $K \leftarrow \mathrm{Keygen}(1^k)$, and the probabilities are taken over $\mathcal{H}_q$ and the internal coin tosses of Keygen, A, S, and the underlying BuildIndex algorithm.

Definition 2.5.5 captures the idea that the system is secure if the simulator can simulate a view of the partial history that the adversary cannot distinguish with more than negligible probability from the actual view.

The remainder of our work will concentrate on the notion of non-adaptive semantic security for SSE with an SHBC cloud. Non-adaptive semantic security is the strongest security guarantee we have been able to achieve with a cloud in the SHBC model.

2.6 Previous Work

2.6.1 A First Solution

Song, Wagner, and Perrig [1] first studied how to search for a keyword in encrypted data. They investigated an indexed approach and a non-indexed approach. They further distinguished between hidden searches and non-hidden searches. In a hidden search the query submitted to the cloud is constructed in such a way that the cloud is unable to ascertain the meaning of the query (i.e., query privacy). In a non-hidden search the query is known to the cloud. All the systems they investigated scale poorly, for they require polynomially many keys. Moreover, their systems are unable to handle compressed files, as pointed out in their paper. Finally, their systems (i.e., Schemes I, II, and III) may also leak the position of a keyword in the text. Note that the problem of position leakage was fixed in later systems designed by others.

The systems of Song et al. rely on the following assumptions:

1. A document d consists of a sequence of words.

2. There exists a family of pseudo-random functions $F_{k_i} : \{0,1\}^{n-m} \to \{0,1\}^{n}$, for any $n$ and $m$.

3. There exists a family of pseudo-random permutations $E_{k_i} : \{0,1\}^{n} \to \{0,1\}^{n}$, for any $n$.

4. There exists a family of pseudo-random generators $G$ with output contained in $\{0,1\}^{m}$, for any $m$.

Using these functions and the document collection D they presented three schemes that provide solutions of varying security guarantees.

Their schemes were built from the foundational scheme: Scheme I. This scheme consists of two main operations, Encrypt and Search. The Encrypt operation encrypts a document in such a way that at a later time the cloud can run Search and obtain an answer to a keyword query.

Encrypt: For each word $w_i$ in the document $d$, generate a pseudo-random value $s_i$ using the pseudo-random generator $G$ with output length $m$. Set $T_i = s_i \,\|\, F_{k_i}(s_i)$ (note: $|T_i| = |w_i|$). Write $C_i = w_i \oplus T_i$ to the file we will upload to the cloud.

Search: Given a keyword $W$ and $\{k_i \mid 1 \le i \le |d|\}$, the cloud looks at every word in the document $d$, computes $T_i = C_i \oplus W$, and parses $T_i$ as $s \,\|\, v$. If $v = F_{k_i}(s)$ we have found the word; otherwise we continue going through the document.

We can clearly see that we require a polynomial number of keys to implement this system. One key is needed for each word in the document. Additionally, the cloud is aware of the search terms, which is not a desirable property. We would like a system that is more efficient and more secure.
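The Encrypt and Search operations of Scheme I can be sketched as below. This is an illustrative sketch rather than the implementation of [1]. It assumes fixed-length words padded to N bytes, a seed s_i of N − M bytes, and a PRF output of M bytes, chosen so that |T_i| = |w_i| holds. HMAC-SHA256 stands in for both the pseudo-random function F and the generator G, and the helper names (prf, pad, xor) are hypothetical.

```python
import hashlib
import hmac
import secrets

N = 16  # bytes per padded word (assumed fixed word length)
M = 8   # bytes of PRF output appended to each seed (assumption)


def prf(key, data, out_len):
    # HMAC-SHA256 truncated to out_len bytes, standing in for F (and for G below).
    return hmac.new(key, data, hashlib.sha256).digest()[:out_len]


def pad(word):
    raw = word.encode("utf-8")
    if len(raw) > N:
        raise ValueError("word longer than N bytes")
    return raw.ljust(N, b"\x00")


def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))


def encrypt(words, keys, stream_seed):
    """Return the list of ciphertext blocks C_i = w_i XOR T_i."""
    ciphertext = []
    for i, (word, k_i) in enumerate(zip(words, keys)):
        s_i = prf(stream_seed, i.to_bytes(8, "big"), N - M)  # i-th output of G
        t_i = s_i + prf(k_i, s_i, M)                         # T_i = s_i || F_{k_i}(s_i)
        ciphertext.append(xor(pad(word), t_i))
    return ciphertext


def search(ciphertext, keyword, keys):
    """Return the positions whose block decodes to a valid (s, F_{k_i}(s)) pair."""
    target = pad(keyword)
    hits = []
    for i, (c_i, k_i) in enumerate(zip(ciphertext, keys)):
        t_i = xor(c_i, target)
        s, v = t_i[:N - M], t_i[N - M:]
        if hmac.compare_digest(v, prf(k_i, s, M)):
            hits.append(i)
    return hits
```

The sketch makes both drawbacks visible: search needs one key per word position, and the keyword is handed to the cloud in the clear.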

Song et al. constructed a second system which seeks to improve upon the inefficiencies around keys present in their first scheme. To achieve their results, they must add the assumption that $f_{k'} : \{0,1\}^* \to K$ is a pseudo-random function that maps arbitrary binary strings to a key space $K$. The Encrypt operation is modified from the first scheme such that the data owner chooses $k_i = f_{k'}(w_i)$, where $k'$ is a secret key that is never revealed. Encryption proceeds as in the first scheme. In order for the Search operation to succeed, we must reveal to the cloud $(w, k = f_{k'}(w))$. With this information, search continues as in the first scheme, but uses $k$ instead of $k_i$.

The second scheme proposed by Song et al. reduces the number of keys that the data owner must store, but still leaks the search term to the cloud. They resolved the search-term leak in their final scheme, building on the results of the second scheme. In particular, they modified the Encrypt operation by pre-encrypting every word $w_i$ in the document as $x_i = E_{k''}(w_i)$. Next, the scheme splits $x_i$ as $L_i \,\|\, R_i$, where $|L_i| = n - m$ and $|R_i| = m$. We complete the changes by constructing the key for the pseudo-random function $F$ by computing $k_i = f_{k'}(L_i)$. The rest of the Encrypt operation remains unaltered. It is necessary to modify the Search operation by sending to the cloud a tuple $(x, k)$, where $x = E_{k''}(w)$ and $k = f_{k'}(L)$. The Search operation

2.6.2 Early Indexed Approaches

There has been work that seeks to improve Song et al.’s results using a keyword index. A keyword index is a data structure used for searching, which associates keywords with documents that contain those keywords. This data structure may be an inverted index or a per-document index.

Goh [20] used Bloom filters to maintain the index of keywords associated with each file. He considered conjunctive and disjunctive queries, as well as updates of document indexes. Goh’s approach, however, used a weaker model of index construction, where the index size is based on the number of distinct words in a given document and therefore leaks information. Goh considered only the HBC adversarial model under the IND2-CKA security definition.

Goh’s index construction is based on the following assumptions:

1. There exists a pseudo-random function $f : \{0,1\}^n \times \{0,1\}^s \to \{0,1\}^s$, and a key set $K = \{ k_i \xleftarrow{R} \{0,1\}^s \mid 1 \le i \le r \}$, where $s$ is a security parameter and $r$ is the number of hash functions used by the Bloom filter.

2. Every document $d$, in a collection of documents $D$, is assigned a unique identifier $\mathrm{id}(d) \in \{0,1\}^n$.

3. The index is constructed as a result of a process involving the generation of a code-word and a trapdoor.

As part of BuildIndex, a codeword set is constructed for every word in the document. A codeword set is defined as $C = \{ y_j = f(\mathrm{id}(d), x_j) \mid 1 \le j \le r \}$. The values $x_j$ are the results from computing the trapdoor set $T_w = \{ x_j = f(w, k_j) \mid 1 \le j \le r \}$ for a word $w$. Finally, every bit location $y_j$ is set to one in the Bloom filter. We complete BuildIndex by blinding the Bloom filter. To achieve this blinding, let $u$ be the upper bound on the number of tokens in $d$. One may assume there is one token for every byte. Denote by $v$ the number of unique words. Blind the index by inserting $(u - v)r$ 1’s uniformly at random in the Bloom filter.

To define the Search operation over Goh’s index we require a trapdoor. The Trapdoor operation constructs the trapdoor.

Trapdoor: Given the set $\{ k_i \xleftarrow{R} \{0,1\}^s \mid 1 \le i \le r \}$ and a word $w$, the data owner constructs the set

\[ T_w = \{ x_j = f(w, k_j) \mid 1 \le j \le r \} \]

for word $w$.

Search: Given a trapdoor $T_w = (x_1, x_2, \ldots, x_r)$ and the Bloom filter (index) for document $\mathrm{id}(d)$, compute the codewords $\{ y_j = f(\mathrm{id}(d), x_j) \mid 1 \le j \le r \}$. Test if the Bloom filter contains 1’s in all $r$ locations denoted by $y_1, \ldots, y_r$. If so, output true. Otherwise, output false.
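The BuildIndex, Trapdoor, and Search operations just described can be sketched as follows. The sketch is a simplified illustration and not Goh’s implementation: HMAC-SHA256 stands in for the pseudo-random function f, the filter size and r are arbitrary choices, codewords are reduced modulo the filter size to obtain bit locations, and one byte is stored per filter bit for readability. The helper f takes the key as its first argument, which is the reverse of the argument order used in the text.

```python
import hashlib
import hmac
import secrets

FILTER_BITS = 1 << 16  # Bloom filter size (assumption)
R = 4                  # number of hash functions r (assumption)


def f(key, data):
    # HMAC-SHA256 standing in for the pseudo-random function f.
    return hmac.new(key, data, hashlib.sha256).digest()


def keygen():
    return [secrets.token_bytes(16) for _ in range(R)]


def trapdoor(keys, word):
    # T_w = { x_j = f(w, k_j) }: one value per key.
    return [f(k_j, word.encode()) for k_j in keys]


def codeword_positions(doc_id, td):
    # y_j = f(id(d), x_j), reduced to a bit location in the filter.
    return [int.from_bytes(f(x_j, doc_id.encode()), "big") % FILTER_BITS for x_j in td]


def build_index(keys, doc_id, text):
    bloom = bytearray(FILTER_BITS)  # one byte per filter bit, for clarity
    words = set(text.split())
    for word in words:
        for pos in codeword_positions(doc_id, trapdoor(keys, word)):
            bloom[pos] = 1
    # Blinding: insert (u - v) * r ones at random, with one token assumed per byte.
    u, v = len(text.encode()), len(words)
    for _ in range((u - v) * R):
        bloom[secrets.randbelow(FILTER_BITS)] = 1
    return bloom


def search(td, doc_id, bloom):
    # True only if all r codeword locations are set (false positives are possible).
    return all(bloom[pos] for pos in codeword_positions(doc_id, td))
```

Because the codewords depend on the document identifier, the same word sets different bit locations in different documents, which prevents the cloud from correlating indexes across documents.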

While Goh’s construction is efficient and secure under the IND2-CKA security definition, the construction suffers from the possibility of false positives. In this sense the cloud may return identifiers for documents that don’t contain the search term. Though the IND2-CKA model does not explicitly require it, the trapdoors for this construction are secure.

Chang and Mitzenmacher [21] devised a separate but related index-based system. Like [1], they considered keywords only. Their system uses bit maps as an index. Each bit in the map represents one possible word. This bit map is later obfuscated by a pseudo-random function $G$ and uploaded to the cloud. The pseudo-random function $G$ serves as the trapdoor information. This trapdoor information is provided by the client to the cloud when a query is made. The security of their system holds only under the HBC model with the PPSED security definition. The PPSED definition guarantees security of both indexes and trapdoors. However, it does not take into account what can be learned by how the queries are executed.

Unlike Goh’s construction, Chang and Mitzenmacher’s construction uses a deterministic search data structure. The index is mapped one-to-one with documents in the document collection. They further require the following assumptions: There exists a keyed pseudo-random permutation

\[ P_k : \{0,1\}^d \to \{0,1\}^d, \]

where $k \in \{0,1\}^t$, a keyed pseudo-random function

\[ F_k : \{0,1\}^d \to \{0,1\}^t, \]

where $k \in \{0,1\}^t$, and a keyed pseudo-random function

\[ G_k : \{1, \ldots, n\} \to \{0,1\}, \]

where $k \in \{0,1\}^t$. Finally, the system requires that there exists a symmetric key encryption system $(G, E, D)$.

The BuildIndex algorithm works as follows: Given a security parameter $t$, choose $s, r \in \{0,1\}^t$ uniformly at random. For each file $j$, prepare an index string $I_j$ of size $2^d$ such that if document $j$ contains the $i$th word $w_i$ in the dictionary, set $I_j[P_s(i)] = 1$; otherwise, set $I_j[P_s(i)] = 0$. Complete the construction by computing $r_i = F_r(i)$ for $i \in \{0, \ldots, 2^d\}$, and for each file $j$, compute a $2^d$-bit masked index string $M_j$. Set each entry in $M_j$ by $M_j[i] = I_j[i] \oplus G_{r_i}(j)$. Once the index is constructed, send $M_j$ for all $j$ to the cloud; the owner keeps $s$, $r$, and the index-word pairs.

To search on this index, the owner must generate a trapdoor and send that trapdoor to the cloud so the cloud can execute the Search operation. To construct a trapdoor for word $w$, the owner runs Trapdoor($w$) to retrieve the corresponding index $\lambda$ for word $w$ and computes $T_w = (p, f)$, where $p = P_s(\lambda)$ and $f = F_r(p)$. The cloud performs the Search operation using the trapdoor $T_w$ over each document index $j$ in the document collection as follows: Compute $I_j[p] = M_j[p] \oplus G_f(j)$. If $I_j[p] = 1$, return the document with identifier $j$.
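The masked bit-map index and its search procedure can be sketched as below. This is a toy illustration under stated assumptions and not the implementation of [21]: the dictionary is a short Python list whose length stands in for 2^d, the pseudo-random permutation P_s is realized by sorting positions according to an HMAC value, and HMAC-SHA256 stands in for both F and G. All function names are hypothetical.

```python
import hashlib
import hmac
import secrets


def prf(key, data):
    return hmac.new(key, data, hashlib.sha256).digest()


def to_bytes(i):
    return i.to_bytes(8, "big")


def make_permutation(s, size):
    # Keyed permutation of range(size): position lam is mapped to P[lam].
    return sorted(range(size), key=lambda i: prf(s, to_bytes(i)))


def G(key, j):
    # One pseudo-random mask bit for document j.
    return prf(key, to_bytes(j))[0] & 1


def build_index(dictionary, documents, s, r):
    size = len(dictionary)  # stands in for 2^d
    P = make_permutation(s, size)
    masked = {}
    for j, words in documents.items():
        index = [0] * size
        for lam, word in enumerate(dictionary):
            if word in words:
                index[P[lam]] = 1                # I_j[P_s(lam)] = 1
        # M_j[i] = I_j[i] XOR G_{r_i}(j), where r_i = F_r(i).
        masked[j] = [index[i] ^ G(prf(r, to_bytes(i)), j) for i in range(size)]
    return masked


def trapdoor(dictionary, word, s, r):
    lam = dictionary.index(word)
    p = make_permutation(s, len(dictionary))[lam]
    return p, prf(r, to_bytes(p))                # T_w = (p, f)


def search(masked, td):
    p, f = td
    # Unmask only bit p of every document index and keep the documents where it is 1.
    return [j for j, M_j in masked.items() if (M_j[p] ^ G(f, j)) == 1]
```

Unlike the Bloom filter index, this structure is deterministic: the unmasked bit is exactly the membership bit, so Search produces no false positives.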

The problem with the above scheme is that there is no guarantee of even a non-adaptive form of security. Non-adaptive security requires security in the case that all search queries are issued at the same time. Moreover, all SSE systems satisfy the PPSED property trivially.

2.6.3 Improved SSE Constructions

Multi-user searchable encryption has also been considered. It was first considered by Curtmola, Garay, Kamara, and Ostrovsky [5]. In fact, they demonstrated a generic construction to obtain a single group access model over any given SSE construction and broadcast encryption system. In that same work, Curtmola et al. described a non-adaptively secure SSE system and an adaptively secure SSE system. For the non-adaptively secure system, the indexing mechanism that they chose to use is essentially an encrypted and permuted linked list. This linked list is a list of the document set $D(w)$. The lists are stored collectively in an array. For their non-adaptively secure system the index also needs a lookup table, indexed by word $w$, used to find the start of $D(w)$ in the linked list. The index used in the adaptively secure system just uses a special lookup table such that each word is stored in the table with its corresponding document identifier. The adaptively secure index structure is significantly more expensive in storage due to this property.

The non-adaptively secure system of Curtmola et al. requires several cryptographic primitives. Let $k$ and $\ell$ be security parameters. Their system needs a semantically secure encryption system, one pseudo-random function, and two pseudo-random permutations. The semantically secure encryption system $(G, E, D)$ has an encryption function $E : \{0,1\}^{\ell} \times \{0,1\}^r \to \{0,1\}^r$, where $r$ is the block size. The pseudo-random function is $f : \{0,1\}^k \times \{0,1\}^p \to \{0,1\}^{\ell + \lg(m)}$, where $m$ is the total size of the plaintext document collection in bytes and $p$ is the size of a word in bits. The two pseudo-random permutations needed are $\pi : \{0,1\}^k \times \{0,1\}^p \to \{0,1\}^p$ and $\psi : \{0,1\}^k \times \{0,1\}^{\lg(m)} \to \{0,1\}^{\lg(m)}$. The Keygen operation will generate the key as a triple of random bit strings needed in the implementation of the system, $s, y, z \xleftarrow{R} \{0,1\}^k$. To construct the index needed for searching, Curtmola et al. present the following algorithm for BuildIndex.

1. Scan the document collection $D$, where each document is identified by an identifier, and build a dictionary $\Delta$ that contains all the distinct words in $D$. Complete by initializing a counter $c$ to 1.

2. For each word $w \in \Delta$, build the document set $D(w)$, which is the set of all documents containing $w$.

3. For each $w_i \in \Delta$, build an encrypted permuted (i.e., secure) linked list containing $D(w_i)$ and store it in array $A$. Select $\kappa_{i,0} \xleftarrow{R} \{0,1\}^{\ell}$ for each $i$.

4. For the $j$th identifier in $D(w_i)$,

(a) Select $\kappa_{i,j} \xleftarrow{R} \{0,1\}^{\ell}$ and create $N_{i,j} = \mathrm{id}(D_{i,j}) \,\|\, \kappa_{i,j} \,\|\, \psi_s(c+1)$, where $\mathrm{id}(D_{i,j})$ is the $j$-th identifier in $D(w_i)$.

(b) Compute $E_{\kappa_{i,j-1}}(N_{i,j})$, and store it in the array $A$ at location $\psi_s(c)$ (i.e., $A[\psi_s(c)] = E_{\kappa_{i,j-1}}(N_{i,j})$).

(c) Increase the counter $c$ by 1.

5. To locate the start of the lists in array $A$, a lookup table $T$ is constructed by the following process for each $w_i \in \Delta$.

(a) Let $v = (\mathrm{addr}(A(N_{i,1})) \,\|\, \kappa_{i,0}) \oplus f_y(w_i)$, where $\mathrm{addr}(A(N_{i,j}))$ denotes the

(b) Set location $\pi_z(w_i)$ of $T$ to $v$. In other words, $T[\pi_z(w_i)] = v$.

6. Set any empty locations in $T$ to random bit strings of the correct size.

An example construction of an index on the document set {D1, D3, D5, D6} appears in Figure 2.1.

Figure 2.1: A secure linked list on the set {D1, D3, D5, D6}

Once the index is constructed, both the Trapdoor and Search operations may be run. The result of running Trapdoor on word $w$ is the ordered pair $(\pi_z(w), f_y(w))$.

To search the index for a word $w$, we invoke Search. The Search operation, given a trapdoor $T_w = (\pi_z(w), f_y(w))$, will locate the start of the document set for word $w$ using $\pi_z(w)$ as an index into $T$. Then the Search operation will walk the list and use the $\kappa$ value it finds at each node to decrypt the subsequent node. The resulting document identifiers are then returned to the user. For the proof of non-adaptive security of this system see [5].
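The encrypted, permuted linked-list index and the list walk performed by Search can be sketched as follows. This is a simplified illustration, not the construction of [5] itself. The assumptions are: HMAC-SHA256 stands in for f, π, and ψ; each node is encrypted by XOR with a pad derived from its one-time key κ; document identifiers are integers; an all-ones address marks the end of a list; and the padding of A and T with random entries is omitted. All names are hypothetical.

```python
import hashlib
import hmac
import secrets

KEY_LEN, ADDR_LEN, ID_LEN = 16, 4, 8
END = b"\xff" * ADDR_LEN  # plays the role of the null pointer (assumption)


def prf(key, data, out_len):
    out, counter = b"", 0
    while len(out) < out_len:
        out += hmac.new(key, counter.to_bytes(4, "big") + data, hashlib.sha256).digest()
        counter += 1
    return out[:out_len]


def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))


def crypt(kappa, node):
    # Each kappa encrypts exactly one node, so one XOR pad serves as both E and D.
    return xor(node, prf(kappa, b"node", len(node)))


def build_index(documents, s, y, z):
    postings = {}
    for doc_id, words in documents.items():
        for word in sorted(words):
            postings.setdefault(word, []).append(doc_id)
    total = sum(len(ids) for ids in postings.values())
    # psi[c] is the pseudo-random array position assigned to counter value c.
    psi = sorted(range(total), key=lambda c: prf(s, c.to_bytes(4, "big"), 16))
    A, T, c = [None] * total, {}, 0
    for word, ids in postings.items():
        kappa = secrets.token_bytes(KEY_LEN)                  # kappa_{i,0}
        head = psi[c].to_bytes(ADDR_LEN, "big") + kappa
        T[prf(z, word.encode(), 16)] = xor(head, prf(y, word.encode(), len(head)))
        for j, doc_id in enumerate(ids):
            next_kappa = secrets.token_bytes(KEY_LEN)         # kappa_{i,j}
            is_last = j == len(ids) - 1
            next_addr = END if is_last else psi[c + 1].to_bytes(ADDR_LEN, "big")
            node = doc_id.to_bytes(ID_LEN, "big") + next_kappa + next_addr
            A[psi[c]] = crypt(kappa, node)                    # encrypted under the previous key
            kappa, c = next_kappa, c + 1
    return A, T


def trapdoor(word, y, z):
    return prf(z, word.encode(), 16), prf(y, word.encode(), ADDR_LEN + KEY_LEN)


def search(A, T, td):
    tag, mask = td
    if tag not in T:
        return []
    head = xor(T[tag], mask)
    addr, kappa = head[:ADDR_LEN], head[ADDR_LEN:]
    found = []
    while addr != END:
        node = crypt(kappa, A[int.from_bytes(addr, "big")])
        found.append(int.from_bytes(node[:ID_LEN], "big"))
        kappa, addr = node[ID_LEN:ID_LEN + KEY_LEN], node[ID_LEN + KEY_LEN:]
    return found
```

Without the trapdoor the cloud cannot unmask the table entry, and without the first key it cannot decrypt any node, so the nodes of different lists are indistinguishable within A.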

The adaptive form is significantly simpler in all algorithms, especially the BuildIndex algorithm. The Keygen operation generates a key $s \xleftarrow{R} \{0,1\}^k$. The BuildIndex algorithm works as follows:

1. Scan the document collection $D$ and build a dictionary $\Delta$ that contains all the distinct words in $D$.

2. For each word $w \in \Delta$, build the document set $D(w)$.

3. For each $w_i \in \Delta$, set $T[\pi_s(w_i \,\|\, j)] = \mathrm{id}(D_{i,j})$, where $\mathrm{id}(D_{i,j})$ is the $j$-th identifier in $D(w_i)$.

4. Set all empty entries of the lookup table $T$ to random binary strings of the correct length.

To construct a trapdoor for this index, the Trapdoor function must output $T_w = (T_{w_1}, T_{w_2}, \ldots, T_{w_{max}}) = (\pi_s(w \,\|\, 1), \pi_s(w \,\|\, 2), \ldots, \pi_s(w \,\|\, max))$, where $max$ is the size, in words, of the longest plaintext document in $D$.

To search the index, given the trapdoor $T_w$, Search proceeds by using each entry in $T_w$ to look up the associated entry in $T$. All of the entries found are collected together and returned to the user. For the proof that this system is adaptively secure, please see [5].
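The lookup-table index of the adaptive construction is short enough to sketch in full. As before, this is an illustration under assumptions rather than the construction of [5]: HMAC-SHA256 stands in for the pseudo-random permutation π_s, the string "word|j" is a hypothetical encoding of w || j, and the padding of T with random entries is omitted. The parameter max_entries plays the role of max in the text.

```python
import hashlib
import hmac
import secrets


def pi(s, word, j):
    # HMAC-SHA256 standing in for pi_s(w || j).
    return hmac.new(s, f"{word}|{j}".encode(), hashlib.sha256).digest()


def build_index(documents, s):
    postings = {}
    for doc_id, words in documents.items():
        for word in words:
            postings.setdefault(word, []).append(doc_id)
    # T[pi_s(w || j)] = id(D_{w,j}) for the j-th document containing w.
    return {
        pi(s, word, j): doc_id
        for word, ids in postings.items()
        for j, doc_id in enumerate(ids, start=1)
    }


def trapdoor(s, word, max_entries):
    return [pi(s, word, j) for j in range(1, max_entries + 1)]


def search(T, td):
    # Collect the identifiers stored under every trapdoor component present in T.
    return [T[tag] for tag in td if tag in T]
```

The trapdoor always has max_entries components regardless of how many documents match, which is one source of the extra cost of the adaptive construction.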

2.6.4 Phrase Searching

In a separate direction, Tang, Gu, Ding, and Lu [4] presented the first phrase search over encrypted data. They solved the problem by presenting a two-phase protocol to handle the search over the encrypted data. In the first phase, the cloud retrieves the document identifiers for documents that contain all the words in the phrase provided by the client, and returns the identifiers to the client. This phase relies on a global index, namely, the index shared among all documents in the cloud. In the second phase, the client sends the phrase query and a list of document identifiers to the cloud. The cloud searches for an exact phrase match for each document in the per-document index (phrase table) and returns to the client the actual encrypted documents that match the phrase. Their protocol uses the HBC model to model the cloud and is non-adaptively secure based on the definition provided in [5].

The most interesting part of the construction is the construction of the phrase table. This per-document table allows the cloud to determine if the phrase occurs in a specific document without learning what the phrase is. To construct the table we must assume the existence of the following three keyed pseudo-random functions:

\[ \Psi : \{0,1\}^{\lambda} \times \{0,1\}^* \to \{0,1\}^n \]

\[ h : \{0,1\}^{\lambda} \times \{0,1\}^* \to \{0,1\}^u \]

\[ f : \{0,1\}^{\lambda} \times \{0,1\}^* \to \{0,1\}^{\lambda} \]

The table has the dimensions $wc \times (d + 1)$, where $wc$ is the number of distinct words in the document and $d$ is the highest frequency (for any word) that occurs in the document collection $D$.

The phrase matching look-up table is constructed using the following process: First, associate a random number $r_i$ with the $i$-th word in the document. Store in the first column of the look-up table the value $\Psi_z(w_i \,\|\, \mathrm{id}(D))$. For each remaining element of the look-up table we store $h_s(r_{i-1}) \,\|\, r_i$ if the word associated with $r_{i-1}$ precedes the word associated with $r_i$. It is required that the key $s$ be distinct for each pair of coherent words (i.e., $s = f_k(w_{i-1} \,\|\, w_i \,\|\, \mathrm{id}(D))$). The first word in the document is handled in a special way by computing $h_s(r^*) \,\|\, r_i$, where $r^*$ is a random number. Once all of the relationships are placed in the table, all unfilled slots are filled with random numbers whose size equals the size of the output of $h_s$ plus the size of the random number. Finally, permute the contents of each row in the table (starting from the second element) and sort the rows based on the first element of each row. An example of the construction is given in Figure 2.2 for document $D_j$.

To search this phrase look-up table, the cloud will use each $\Psi_z(w_i \,\|\, \mathrm{id}(d))$ value in the order it appears. The $\Psi_z(w_i \,\|\, \mathrm{id}(d))$ part of the trapdoors is constructed by the client as part of conducting the phrase search. The cloud proceeds as follows: Use binary search to find the row for $\Psi_z(w_i \,\|\, \mathrm{id}(d))$. Search the row for $f_k(w_{i-1} \,\|\, w_i \,\|\, \mathrm{id}(d))$. If there is a word in the phrase that is not found, the cloud

Figure 2.2: An example of a phase-two table-based index.

Tang et al. were able to show that their system exhibits non-adaptive security. To show this, they gave a proof methodology based on simulation. Their methodology constructs a simulator $S$ that consists of two sub-simulators $S_1$ and $S_2$. They show a simulator $S_1$ that can be used to prove non-adaptive security of phase one. They then show that they can construct a simulator $S_2$ that proves non-adaptive security for the second phase. Finally, they unify the two simulators under a composition argument to demonstrate that the two phases taken together are also non-adaptively secure.

2.6.5 Non-HBC Systems

There were other attempts to move away from the HBC model. In particular, Chai and Gong [3] introduced verifiable symmetric searchable encryption, on keywords, to allow the client to verify that the cloud has returned the correct list of document identifiers. They achieved this through the use of tries [17]. Their system offers two innovations: (1) It does not use per document indexes and (2) it is secure under a more realistic Semi-Honest-but-Curious adversarial model. Chai and Gong’s trie

(44)

based approach to managing the shared keyword index reduces substantially the complexity of previous approaches that resorted to managing per document indexes. Using the trie based index, Chai and Gong were able to devize a verifiable SSE scheme that is non-adaptively secure against a SHBC clouda _[3].

a_{Chai and Gong did not prove their scheme is non-adaptively secure. We offer the proof in}


Chapter 3

Verifiable Phrase Search

In this chapter we present an efficient method to carry out verifiable phrase-search over encrypted data in the setting of Semi-Honest-but-Curious (SHBC) cloud storage. In particular, we devise a two-phase protocol that verifies the result from a search of an encrypted phrase. We achieve this by incorporating, with modifications, a verifiable keyword-search technique and a phrase-search method based on work presented in ICC 2012 and ICDCS 2012. We use added verification tags to provide proof of the query results returned from the cloud.

Our solution presents a two-phase search method based on the indexed search with document identifiers, which differs from the standard approach, but aligns with the work of [4]. In particular, phase one is used to retrieve potential document identifiers from a global index structure. Specifically, it returns document identifiers for documents that contain all the words in a query phrase. Phase two is used to retrieve phrase matches from specific documents using a per-file index structure. In other words, phase one provides potential candidates and phase two refines the search so that the cloud only sends to the client documents that contain exact phrase matches. Moreover, each phase can be independently verified for correctness by the client.


Our main contribution is the construction of verifiable encrypted phrase search in the Semi-Honest-but-Curious (SHBC) model. Our work improves on the recent encrypted phrase-search mechanism of Tang et al. [4] in two respects. First, we provide an encrypted phrase search that is secure against the more powerful SHBC adversary. Second, we provide the ability to verify search results.

3.1 Verifiable Encrypted Phrase Search

We present a two-phase verifiable encrypted phrase search mechanism. We achieve this by augmenting the approach of Tang et al. [4] with the verifiable encrypted keyword search of Chai and Gong [3]. We demonstrate verifiability for the second phase of the protocol of Tang et al. [4] by augmenting it with a verification mechanism. As there are two protocol phases, we have two search indexes: one global and the other per-document.

3.1.1 Verifiable Keyword Search

Chai and Gong achieved verifiable keyword search [3] using a trie [17] over an alphabet Σ as a global index. Each node has a value r_0 that holds the node's symbol in Σ. Chai and Gong further augmented each node with two fields: r_1 stores a globally unique value for the node, and r_2 stores a bitmap of the node's children. We note that when a node is a leaf, the bitmap is actually a list of document identifiers. Chai and Gong used the following algorithm to construct their trie. The algorithm assumes the existence of a pseudo-random function (keyed hash function) g_k : {0,1}* → {0,1}^n, a block cipher s_K that encrypts (n + η) bits of plaintext, and a function ord which, when given a node of the trie, returns the index of the associated character in the alphabet Σ. The node at level j and left-to-right location q is denoted by T_{j,q}.


The trie is constructed by first inserting every word in the document collection D into the trie. Every internal node v at level j sets r_1 = g_k(v || j || parent(v)[r_1]) and r_2 = E(r_1 || b), where b is the bitmap of the node's children. Every leaf node sets r_1 = D || g_k(D), where D is the list of documents that contain the word on the path from the root to the leaf. Note that this stores the un-hashed version of D. The leaf node also sets r_2 = g_k($ || j+1 || parent($)[r_1]). Each node of the trie then has its children permuted and its associated symbol, stored in r_0, removed. Finally, the trie is sent to the cloud along with the encrypted document collection.
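
The construction above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not Chai and Gong's implementation: g_k is modeled by HMAC-SHA256 over length-prefixed inputs, the root tag is assumed to be a single zero byte, the leaf marker $ is assumed not to occur in Σ, and the names build_trie and strip_for_cloud are hypothetical. The encrypted field r_2 = E(r_1 || b) of internal nodes is omitted because it requires a block cipher that the Python standard library does not provide.

```python
import hashlib
import hmac
import random

def g(key: bytes, *parts: bytes) -> bytes:
    # Stand-in for g_k: HMAC-SHA256 over length-prefixed parts (assumption).
    msg = b"".join(len(p).to_bytes(4, "big") + p for p in parts)
    return hmac.new(key, msg, hashlib.sha256).digest()

ROOT_TAG = b"\x00"  # assumed tag for the root node

def build_trie(key, index):
    """index maps each word to the list of ids of documents containing it."""
    root = {"r0": None, "r1": ROOT_TAG, "children": {}}
    for word, doc_ids in index.items():
        node = root
        for j, symbol in enumerate(word, start=1):
            if symbol not in node["children"]:
                # Internal node: r1 = g_k(v || j || parent(v)[r1])
                r1 = g(key, symbol.encode(), str(j).encode(), node["r1"])
                node["children"][symbol] = {"r0": symbol, "r1": r1, "children": {}}
            node = node["children"][symbol]
        d = ",".join(doc_ids).encode()  # assumed encoding of the list D
        # Leaf: r1 = D || g_k(D) and r2 = g_k($ || j+1 || parent($)[r1])
        node["children"]["$"] = {
            "r0": "$",
            "r1": d + g(key, d),
            "r2": g(key, b"$", str(len(word) + 1).encode(), node["r1"]),
            "children": {},
        }
    return root

def strip_for_cloud(node, rng):
    """Remove r0 and permute the children before the trie leaves the client."""
    children = [strip_for_cloud(c, rng) for c in node["children"].values()]
    rng.shuffle(children)
    out = {k: v for k, v in node.items() if k in ("r1", "r2")}
    out["children"] = children
    return out
```

A client would call build_trie on its word-to-documents map and then upload strip_for_cloud(trie, random.SystemRandom()), so that the cloud sees only tags and a shuffled child order.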

The client generates a privacy-preserving query π for the cloud to use in searching the index for the keyword. The query π on a word w is constructed by setting π_i = g_k(w_i || i || π_{i−1}) for i ≥ 1, where w_i is the ith letter of w. The value π_{i−1} is the hash of the previous character in the word, its position in the word, and its parent's hash value. We bootstrap the chain by setting π_0 = 0. Observe that we are essentially building a chained hash along the root-to-leaf path that exists in the nodes of the trie. This query is then sent to the cloud. The cloud searches the index and sends back the list of identifiers of the documents that contain the keyword (the document set), the hash of the document set, and a proof that the data returned is correct and complete. The proof is the series of r_2 values in the nodes of the trie. Once the results are returned, the client can verify that the cloud has behaved honestly by checking the series of r_2 values against the query and the results.
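
The chained-hash query and the cloud's walk through the index can be illustrated with the short Python sketch below. It assumes HMAC-SHA256 as a stand-in for g_k and a zero byte for π_0; the names make_query, insert, and search are hypothetical, and the index is simplified to nested dictionaries keyed by node tag. The r_2 proof values and their verification are left out, so this shows only how the query lets the cloud follow a root-to-leaf path without seeing any letters.

```python
import hashlib
import hmac

def g(key: bytes, *parts: bytes) -> bytes:
    # Stand-in for g_k: HMAC-SHA256 over length-prefixed parts (assumption).
    msg = b"".join(len(p).to_bytes(4, "big") + p for p in parts)
    return hmac.new(key, msg, hashlib.sha256).digest()

PI_0 = b"\x00"  # assumed encoding of pi_0 = 0

def make_query(key, word):
    """Client side: pi_i = g_k(w_i || i || pi_{i-1}) for i >= 1."""
    query, prev = [], PI_0
    for i, letter in enumerate(word, start=1):
        prev = g(key, letter.encode(), str(i).encode(), prev)
        query.append(prev)
    return query

def insert(key, trie, word, doc_ids):
    """Client side: index a word; nodes are keyed by tag, symbols are not stored."""
    node = trie
    for tag in make_query(key, word):
        node = node.setdefault(tag, {})
    node["docs"] = doc_ids

def search(trie, query):
    """Cloud side: follow the tags; return the document list or None."""
    node = trie
    for tag in query:
        node = node.get(tag)
        if node is None:
            return None
    return node.get("docs")
```

Because each node tag is computed the same way as the corresponding π_i, a query for an indexed word matches exactly one path, while a query built with a different key or for a word that is not indexed fails at the first tag that does not match.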

3.1.2 Verified Phrase Searching

Building on the ideas summarized in Sections 2.6.4 and 3.1.1, we now construct a two-phase verified searchable phrase encryption protocol. In the first phase, the cloud returns the identifiers of all documents that contain every word in the phrase submitted by the client, together with a proof of correctness. In the second phase, the per-document indexes are queried for the exact phrase, and a true or false value is returned to the client for each document along with a proof of correctness.

Phase one of our protocol obtains the document identifiers using the verified encrypted keyword search protocol of Chai and Gong [3]. We extend their idea to handle conjunctions of keywords, so that only identifiers of documents in which all the keywords are present are returned to the client. We then augment the work of Tang et al., discussed in Section 2.6.4, to provide verification of phrase matches.

For the problem of conjunctive keyword matching we follow the strategy used in [3, 20, 21] and send multiple query vectors to the cloud. The client takes the document identifiers returned for each query vector, verifies them, and computes the intersection of the resulting sets of document identifiers. The resulting intersection tells us on which documents to perform a deeper phrase search. For verification we use the individual keyword verification algorithm of [3]. We note that the search time here is proportional to the length of the phrase. The size of the index (trie) remains fixed to the size of the dictionary.
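
The client-side step of verifying each keyword's result and intersecting the document sets might look like the Python sketch below. It is a simplified illustration: only the hash of the document set is checked (the g_k(D) value from the leaf), whereas the full verification in [3] also checks the series of r_2 values. HMAC-SHA256 as g_k, the comma-joined encoding of D, and the function names are assumptions made for the example.

```python
import hashlib
import hmac

def g(key: bytes, data: bytes) -> bytes:
    # Stand-in for g_k: HMAC-SHA256 (assumption).
    return hmac.new(key, data, hashlib.sha256).digest()

def encode(doc_ids):
    # Assumed encoding of the document list D; client and index must agree on it.
    return ",".join(doc_ids).encode()

def phase_one_candidates(key, responses):
    """responses holds one (doc_ids, tag) pair per keyword of the phrase."""
    candidates = None
    for doc_ids, tag in responses:
        # Partial check only: the tag must equal g_k(D) for the returned list D.
        if not hmac.compare_digest(g(key, encode(doc_ids)), tag):
            raise ValueError("document set failed verification")
        ids = set(doc_ids)
        candidates = ids if candidates is None else candidates & ids
    return candidates if candidates is not None else set()
```

For a phrase of three words the client would pass three responses and receive the set of documents that contain all three words; these are the candidates handed to phase two.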

The second phase of our protocol matches the desired phrase in a candidate document, using the same query and search mechanism as Tang et al. [4], with an added “verification tag” when building the position index matrix A. The verification tags will allow us to later verify that the cloud correctly answered o