Scalable Private Database Querying
for Arbitrary Formulas
Vladimir Kolesnikov (Bell Labs)
Seung Geol Choi, Angelos Keromytis, Fernando Krell, Tal Malkin, Vasilis Pappas and Binh Vo (Columbia)
Wesley George (UToronto)
Outline
• Problem description
• The cost of secure computation and how to scale
• Our system
COPYRIGHT © 2011 ALCATEL-LUCENT. ALL RIGHTS RESERVED.
IARPA SPAR: Security and Privacy Assurance Research
Required features
100M records, 10TB DB
Preserve query and data privacy
Robust query support:
select * where NAME=Bob AND AGE >20
Boolean query expressions (including at least three conjunctions)
Range queries and inequalities for integer numeric, date/time, etc.
Matching of keywords "close to" a specified value (stemming)
Text fields with many keywords (e.g. 100’s)
Matching of values with wildcards
Matching of values with a specified subsequence
m-of-n conjunctions
Ranking of results …
Allowed up to 2-10x overhead compared to MySQL
Basic Architecture
[Figure: Database Owner, Encrypted Database, Client, Index Server (S)]
• S holds permuted encrypted indexed DB
Overview:
1. Alice prepares encrypted version C’ of C
2. Sends encrypted form x’ of her input x
3. Allows Bob to obtain encrypted form y’ of his input y
4. Bob can compute from C’,x’,y’ the “encryption” z’ of z=C(x,y)
5. Bob sends z’ to Alice and she decrypts and reveals to him z
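The five steps above can be sketched for a circuit C consisting of a single AND gate. This is only an illustration under stated assumptions: the wire labels, the SHA-256-based pad, and the all-zero tag used to recognise the correct row are hypothetical choices, not the construction used in the system, and step 3 (typically done with oblivious transfer) is replaced by handing Bob his label directly.

```python
import hashlib
import secrets

LABEL_LEN = 16           # bytes per wire label (assumed size)
TAG = b"\x00" * 8        # redundancy that marks a correctly decrypted row (assumption)


def _pad(ka: bytes, kb: bytes, n: int) -> bytes:
    """Derive an n-byte one-time pad from two wire labels (hash used as a KDF)."""
    return hashlib.sha256(ka + kb).digest()[:n]


def _xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def garble_and_gate():
    """Step 1 (Alice): pick two random labels per wire, encrypt the AND truth table."""
    wires = {
        w: (secrets.token_bytes(LABEL_LEN), secrets.token_bytes(LABEL_LEN))
        for w in ("x", "y", "z")
    }
    table = []
    for a in (0, 1):
        for b in (0, 1):
            plain = wires["z"][a & b] + TAG
            table.append(_xor(plain, _pad(wires["x"][a], wires["y"][b], len(plain))))
    secrets.SystemRandom().shuffle(table)  # hide which row is which
    return wires, table


def evaluate(table, kx: bytes, ky: bytes) -> bytes:
    """Step 4 (Bob): with one label per input wire, only one row decrypts cleanly."""
    for row in table:
        plain = _xor(row, _pad(kx, ky, len(row)))
        if plain.endswith(TAG):
            return plain[:LABEL_LEN]
    raise ValueError("no row decrypted")


def decode(wires, kz: bytes) -> int:
    """Step 5 (Alice): map the output label z' back to the plain bit z."""
    return wires["z"].index(kz)
```

Bob only ever sees random-looking labels, so he learns z only when Alice decodes the output label for him.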
[Figure: example Boolean circuit of AND, OR, and NOT gates over Alice’s inputs and Bob’s inputs]
Secure Computation: Cost
Circuit encryption includes encryption of truth table of gates
For each gate of C, need to compute and send about 4 encryptions (AES needs 50-150 cycles to encrypt 128 bits)
Very fast for small problems
Does not scale for large functions
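The per-gate figures above translate into a rough back-of-the-envelope estimate. The sketch below only multiplies the numbers quoted on the slide (4 encryptions per gate, 50-150 cycles per 128-bit AES encryption). The function names and the one-million-gate circuit size are assumptions for illustration, not measurements from the system.

```python
def garbling_cycles(num_gates, enc_per_gate=4, cycles_per_enc=(50, 150)):
    """Return (low, high) CPU-cycle estimates for encrypting all gate truth tables."""
    low, high = cycles_per_enc
    return num_gates * enc_per_gate * low, num_gates * enc_per_gate * high


def garbled_table_bytes(num_gates, enc_per_gate=4, block_bytes=16):
    """Bytes sent if each encryption is one 128-bit (16-byte) block."""
    return num_gates * enc_per_gate * block_bytes


# Hypothetical circuit of one million gates:
print(garbling_cycles(1_000_000))       # (200000000, 600000000)
print(garbled_table_bytes(1_000_000))   # 64000000
```

Both quantities grow linearly with the number of gates, which is why a circuit that touches every record of a large database becomes impractical.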
Secure Computation: how to scale
If OK to have some security loss (as efficiency tradeoff):
Identify privacy-critical subroutines and implement them securely
Insecure implementation of the rest
Natural Trade Offs
Deterministic encryption
Because of scale, comparison of encrypted values used in search must be very fast. Not clear how to approach with probabilistic encryption
Access patterns
Clearly not a bad leakage. Seems quite expensive to avoid, so natural to live with it.
Bloom Filter
Constant-time querying
Efficient storage (ca 10 bits per keyword)
Fixed access pattern (same for both match and non-match)
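A minimal plaintext Bloom filter illustrates the three properties listed above. The salted SHA-256 hash family, the class name, and the parameters are assumptions made for this sketch. With about 10 bits per keyword, roughly 7 hash functions is the usual choice.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits: int, num_hashes: int):
        self.m = num_bits
        self.k = num_hashes
        self.bits = [0] * num_bits

    def indices(self, keyword: str):
        """k positions derived from salted SHA-256 (stand-in for the BF hash family)."""
        out = []
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{keyword}".encode()).digest()
            out.append(int.from_bytes(digest[:8], "big") % self.m)
        return out

    def add(self, keyword: str) -> None:
        for idx in self.indices(keyword):
            self.bits[idx] = 1

    def query(self, keyword: str) -> bool:
        # Read all k positions unconditionally (no early exit), so the same
        # number of probes happens for a match and for a non-match.
        probed = [self.bits[idx] for idx in self.indices(keyword)]
        return all(probed)


bf = BloomFilter(num_bits=10 * 100, num_hashes=7)  # sized for ~100 keywords
bf.add("NAME=Bob")
print(bf.query("NAME=Bob"))  # True
```

A query costs k probes regardless of how many keywords are stored, which is the constant-time property.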
Encrypted BF:
Query: C sends Enc(kw), S computes match
OK for single keyword searches
Occluded BF
Idea:
Mask BF with a (pseudo-)random pad
Let Client know the pad (via seed)
Then Client and Server run SFE for computing match, where C inputs pad.
GC is very efficient: 10-20 gates per term, plus gates to implement formula.
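The masking idea can be sketched in the clear as follows. The SHA-256 counter-mode expansion is a stand-in PRG chosen for this example, and `match_circuit` shows only the Boolean function that would be evaluated under SFE: it is run here as ordinary code, not as a garbled circuit, and how the probed positions are selected is outside the sketch.

```python
import hashlib


def prg_bits(seed: bytes, n: int):
    """Expand a seed into n pseudo-random bits (SHA-256 in counter mode, assumed PRG)."""
    bits = []
    counter = 0
    while len(bits) < n:
        block = hashlib.sha256(seed + counter.to_bytes(8, "big")).digest()
        for byte in block:
            for j in range(8):
                bits.append((byte >> j) & 1)
        counter += 1
    return bits[:n]


def occlude(bf_bits, seed: bytes):
    """Mask every BF bit with the pad; this masked filter is what S would store."""
    pad = prg_bits(seed, len(bf_bits))
    return [b ^ p for b, p in zip(bf_bits, pad)]


def match_circuit(server_bits, client_pad_bits) -> int:
    """Function to run under SFE: unmask each probed bit, then AND the results."""
    result = 1
    for s, p in zip(server_bits, client_pad_bits):
        result &= s ^ p
    return result


bf_bits = [0, 1, 1, 0, 1, 0, 0, 1]
seed = b"client-seed"                      # hypothetical seed known to C
masked = occlude(bf_bits, seed)            # held by S
pad = prg_bits(seed, len(bf_bits))         # recomputed by C from the seed
probe = [1, 2, 4]                          # positions of one keyword (assumed)
print(match_circuit([masked[i] for i in probe], [pad[i] for i in probe]))  # 1
```

Each probed position costs one XOR and one AND in this function, which is why the per-term circuit stays small.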
DB Search
[Figure: DB records, S, C]
Solution: Evaluate via Secure Computation
Security Guarantee
We leak to S at most the following access patterns:
- the query pattern of a set of queries (e.g., S can distinguish between simple and complex queries)
- tree search pattern of each query
- returned records access pattern
Above types of leakage seem necessary to achieve efficient sublinear performance.
Advanced Queries Based on AND/OR formulas:
Range Queries
We cover the range of our data type with a collection of intervals
To search for any value within a range,
we search for the smallest covering collection of intervals, using an OR formula
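One way to realise such an interval collection is with dyadic (power-of-two aligned) intervals. The slides do not say which collection is used, so this choice is an assumption made for the sketch. Each stored value is indexed under every dyadic interval that contains it, and a range query becomes an OR over the intervals of its cover. All function names are hypothetical.

```python
def dyadic_cover(lo: int, hi: int, width: int):
    """Cover [lo, hi] with maximal dyadic intervals, returned as (prefix, level).

    The pair (prefix, level) denotes [prefix << level, ((prefix + 1) << level) - 1].
    """
    assert 0 <= lo <= hi < (1 << width)
    cover = []
    while lo <= hi:
        level = 0
        # Grow the block while it stays aligned at lo and inside [lo, hi].
        while (level < width
               and lo % (1 << (level + 1)) == 0
               and lo + (1 << (level + 1)) - 1 <= hi):
            level += 1
        cover.append((lo >> level, level))
        lo += 1 << level
    return cover


def intervals_containing(value: int, width: int):
    """All dyadic intervals containing value: the keywords a record is indexed under."""
    return [(value >> level, level) for level in range(width + 1)]


def range_matches(value: int, lo: int, hi: int, width: int) -> bool:
    """OR formula: the record matches if any cover interval is among its keywords."""
    terms = set(dyadic_cover(lo, hi, width))
    return any(t in terms for t in intervals_containing(value, width))


print(dyadic_cover(3, 12, 4))  # [(3, 0), (1, 2), (2, 2), (12, 0)]
```

For 4-bit values, the range 3..12 is covered by [3,3], [4,7], [8,11], and [12,12], so the query is an OR of four keyword terms.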
Advanced Queries Based on AND/OR formulas:
Negations
Note that the set of points other than some fixed value has a small interval cover.
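Assuming the intervals are dyadic (an assumption of this sketch, not stated on the slide), the cover for "everything except v" is easy to write down. Viewing the domain as a binary tree, it consists of the siblings of the nodes on the path from the root to v, one interval per bit of the data type. The function names are hypothetical.

```python
def complement_cover(value: int, width: int):
    """Dyadic intervals covering every width-bit point except `value`.

    Returned as (prefix, level) pairs; at each level the sibling of the node
    on value's root-to-leaf path is taken.
    """
    assert 0 <= value < (1 << width)
    return [((value >> level) ^ 1, level) for level in range(width)]


def interval_bounds(prefix: int, level: int):
    """Inclusive (low, high) bounds of the dyadic interval (prefix, level)."""
    return prefix << level, ((prefix + 1) << level) - 1


for prefix, level in complement_cover(5, 3):
    print(interval_bounds(prefix, level))
# (4, 4)
# (6, 7)
# (0, 3)
```

A negation over a 3-bit field therefore needs only 3 OR terms, and in general as many terms as the field has bits.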
Experimental Results
Policy Compliance
GC is strategically at the center of our approach because it is easy to compose.
Requirement: secure policy checking:
Policy rejection should look like a query no-match to C and S
Implement policy as a GC computation whose output is an input to the BF tree node GC computation.
Subtlety 1: inexact data representation by BF
Let A, B, C collide under hash functions of BF, s.t. every index of C is an index for either A or B.
Then A ∧ B ⇒ C
Well-known “issue” – BF false positive
Does not reveal knowledge of underlying data, just representation.
Then A ∧ ¬C ⇒ ¬B
Issue: learn B without querying, even in secure eval of A ∧ ¬C
Pertains to original data, not just BF representation
We calculate advantage Adv ≤ n(kn/m)^k
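A toy example makes the inference concrete. The index sets below are hypothetical and hand-picked so that every index of C is an index of A or of B. Real hash functions would produce such a collision only with small probability.

```python
from itertools import combinations

M = 10  # toy filter size (assumption)
# Hypothetical BF positions: every index of C is an index of A or of B.
IDX = {"A": {1, 4, 7}, "B": {2, 5, 8}, "C": {1, 5, 7}}


def bf_from(stored):
    """Build the filter bits for a set of stored keywords."""
    bits = [0] * M
    for kw in stored:
        for i in IDX[kw]:
            bits[i] = 1
    return bits


def bf_contains(bits, kw) -> bool:
    return all(bits[i] for i in IDX[kw])


def eval_a_and_not_c(stored) -> bool:
    """What the query A AND NOT C returns when evaluated on the BF representation."""
    bits = bf_from(stored)
    return bf_contains(bits, "A") and not bf_contains(bits, "C")


# Enumerate every possible record content over {A, B, C}.
for r in range(4):
    for stored in combinations("ABC", r):
        if eval_a_and_not_c(stored):
            print(stored, "B stored:", "B" in stored)
# ('A',) B stored: False
```

Whenever the query returns a match, B cannot be in the record, because A and B together would make C appear present. The querier thus learns a fact about B, a keyword it never asked about.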
0-1 Result Set Size Indistinguishability
Goal: hide from S whether there was a 0 or 1 match.
S is an airline and C is gov’t querying for POI.
Expect 0 hits. S learning of a match can cause panic.
Def 1: Consider probability of bad event, prove it’s small
Def 2: If distinguishable, guarantee that D’s confidence is not very high
- if the a-priori probability of a 1-case is α, then conditioned on any possible view, the a-posteriori probability of a 1-case is at most (1+ε)α.
Solution: C adds p fake tree-traversal paths, where p is a random variable drawn from a distribution like this:
[Figure: distribution of p; label: N paths]
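The effect of the fake paths on S's confidence can be checked with Bayes' rule. The distribution is only shown as a figure, so this sketch assumes a geometric distribution for p and a simplified view in which S observes just the total number of paths (p in the 0-case, p + 1 in the 1-case). The names `alpha` (a-priori probability of a 1-case), `eps` (allowed slack), and `r` (geometric ratio) are hypothetical.

```python
def geometric_pmf(j: int, r: float) -> float:
    """P[p = j] for a geometric distribution on j = 0, 1, 2, ... with ratio r."""
    return (1.0 - r) * r**j if j >= 0 else 0.0


def posterior_one_match(view: int, alpha: float, r: float) -> float:
    """P[1-case | S observes `view` paths], by Bayes' rule."""
    p0 = geometric_pmf(view, r)        # 0-case: all `view` paths are fake
    p1 = geometric_pmf(view - 1, r)    # 1-case: one real path plus view-1 fake ones
    return alpha * p1 / (alpha * p1 + (1.0 - alpha) * p0)


alpha, eps = 0.01, 0.1
r = 1.0 / (1.0 + eps)                  # ratio chosen to meet the (1 + eps) bound
worst = max(posterior_one_match(v, alpha, r) for v in range(50))
print(worst <= (1.0 + eps) * alpha)    # True
```

Under these assumptions the likelihood ratio between the two cases is at most 1/r for every view, so the posterior never exceeds (1 + eps) times the prior. The price is an expected r/(1 - r) = 1/eps extra paths per query.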