Outline
Chapter: “Database Security” in Elmasri and Navathe (chapter 24, 6th Edition)
♦ Brief overview of database security
♦ More detailed study of database privacy
– Statistical databases, and differential privacy to protect data
– Data anonymization: k-anonymity and l-diversity
Why?
♦ A topical issue: privacy and security are big concerns
♦ Connections to computer security, statistics
CS346 Advanced Databases
2
Database Security and Privacy
Database Security and Privacy is a large and complex area, covering:
♦ Legal and ethical requirements for data privacy
– E.g. UK Data Protection Act (1998)
– Determines for how long data can be retained about people
♦ Government and organisation policy issues on data sharing
– E.g. when and how credit reports, medical data can be shared
♦ System-level issues for security management
– How is access to data controlled by the system?
♦ Classification of data by security levels (secret, unclassified)
– How to determine if access to data is permitted?
CS346 Advanced Databases
3
Threats to Databases
Databases faces many threats which must be protected against
♦ Integrity: prevent improper modification of the data
– Caused by intention or accident, insider or outsider threat
– Erroneous data can lead to incorrect decisions, fraud, errors
♦ Availability: is the data available to users and programs?
– Denial of service attacks prevent access, cost revenue/reputation
♦ Confidentiality: protect data from unauthorized disclosure
– Leakage of information could violate law, customer confidence
Sometimes referred to as the CIA triad
CS346 Advanced Databases
4
Control measures
♦ Security is provided by control measures of various types
♦ Access control: affect who can access the data / the system
– Different users may have different levels of access
♦ Inference control: control what data implies about individuals
– Try to make it impossible to infer facts about individuals
♦ Flow control: prevent information from escaping the database
– E.g. can data be transferred to other applications?
– Control covert channels that can be used to leak data
♦ Data Encryption: store encrypted data in the database
– At the field level or at the whole file level
– Trade-off between security and ease of processing
CS346 Advanced Databases
5
Information Security and Information Privacy
♦ The dividing line between security and privacy is hard to draw
♦ Security: prevent unauthorised use of system and data
– E.g. access controls to lock out unauthorised users
– E.g. encryption to hide data from those without the key
♦ Privacy: control the use of data
– Ensure that private information does not emerge from queries
– Ability of individuals to control the use of their data
♦ Will focus on database privacy for the remainder
CS346 Advanced Databases
6
Statistical Databases and Privacy
♦ Statistical databases keep data on large groups
– E.g. national population data (Office of National Statistics)
♦ The raw data in statistical databases is confidential
– Detailed data about individuals e.g. from census
♦ Users are permitted to retrieve statistics from the data
– E.g. averages, sums, counts, maximum values etc.
♦ Providing security for statistical databases is a big challenge
– Many crafty ways to extract private information from them
– The database should prevent queries that leak information
CS346 Advanced Databases
7
Statistical Database challenges
♦ Very specific queries can refer to a single person
– E.g. SELECT AVG(Salary) FROM EMPLOYEE WHERE AGE=22 AND POSTCODE=‘W1A 1AA’ AND DNO=5 : 45,000
– SELECT COUNT(*) FROM EMPLOYEE WHERE AGE=22 AND POSTCODE=‘W1A 1AA’ AND DNO=5 AND SALARY>40000 : 1
♦ How would you detect and reject such queries?
♦ Can arrange queries where the difference is small
– SELECT COUNT(*) FROM EMPLOYEE WHERE AGE>=22 AND DNO=5 AND SALARY>40000 : 12
– SELECT COUNT(*) FROM EMPLOYEE WHERE AGE>=23 AND DNO=5 AND SALARY>40000 : 11
CS346 Advanced Databases
8
Differential Privacy for Statistical Databases
♦ Principle: query answers reveals little about any individual
– Even if adversary knows (almost) everything about everyone else!
♦ Thus, individuals should be secure about contributing their data
– What is learnt about them is about the same either way
♦ Much work on providing differential privacy (DP)
– Simple recipe for some data types (e.g. numeric answers)
– Simple rules allow us to reason about composition of results
– More complex algorithms for arbitrary data
♦ Adopted and used by several organizations:
– US Census, Common Data Project, Facebook (?)
9
Differential Privacy Definition
The output distribution of a differentially private algorithm changes very little whether or not any individual’s data is included in the input
– so you should contribute your data
10
Laplace Mechanism
♦ The Laplace Mechanism adds random noise to query results
– Scaled to mask the contribution of any individual
– Add noise from a symmetric continuous distribution to true answer
– Laplace distribution is a symmetric exponential distribution
– Laplace provides DP for COUNT queries, as shifting the distribution changes the probability by at most a constant factor
11 CS346 Advanced Databases
Sensitivity of Numeric Functions
♦ For more complex functions, we need to calibrate the noise to the influence an individual can have on the output
– The (global) sensitivity of a function F is the maximum (absolute) change over all possible adjacent inputs
– S(F) = maxD, D’ : |D-D’|=1 |F(D) – F(D’)|
– Intuition: S(F) characterizes the scale of the influence of one individual, and hence how much noise we must add
♦ S(F) is small for many common functions
– S(F) = 1 for COUNT
– S(F) = 2 for HISTOGRAM
– Bounded for other functions (MEAN, covariance matrix…)
12
Female Male
CS346 Advanced Databases
CS346 Advanced Databases
13
Data Anonymization
The idea of data anonymization is compelling, has many applications
♦ For Data Sharing
– Give real(istic) data to others to study without compromising privacy of individuals in the data
– Allows third-parties to try new analysis and mining techniques not thought of by the data owner
♦ For Data Retention and Usage
– Various requirements prevent companies from retaining customer information indefinitely
– E.g. Google progressively anonymizes IP addresses in search logs
– Internal sharing across departments (e.g. billing → marketing)
CS346 Advanced Databases
14
Case Study: US Census
♦ Raw data: information about every US household
– Who, where; age, gender, racial, income and educational data
♦ Why released: determine representation, planning
♦ How anonymized: aggregated to geographic areas (Zip code)
– Broken down by various combinations of dimensions
– Released in full after 72 years
♦ Attacks: no reports of successful deanonymization
– Recent attempts by FBI to access raw data rebuffed
♦ Consequences: greater understanding of US population
– Affects representation, funding of civil projects
– Rich source of data for future historians and genealogists
CS346 Advanced Databases
15
Case Study: Netflix Prize
♦ Raw data: 100M dated ratings from 480K users to 18K movies
♦ Why released: improve predicting ratings of unlabeled examples
♦ How anonymized: exact details not described by Netflix
– All direct customer information removed
– Only subset of full data; dates modified; some ratings deleted,
– Movie title and year published in full
♦ Attacks: dataset is claimed vulnerable
– Attack links data to IMDB where same users also rated movies
– Find matches based on similar ratings or dates in both
♦ Consequences: rich source of user data for researchers
– Unclear how serious the attacks are in practice
CS346 Advanced Databases
16
Case Study: AOL Search Data
♦ Raw data: 20M search queries for 650K users from 2006
♦ Why released: allow researchers to understand search patterns
♦ How anonymized: user identifiers removed
– All searches from same user linked by an arbitrary identifier
♦ Attacks: many successful attacks identified individual users
– Ego-surfers: people typed in their own names
– Zip codes and town names identify an area
– NY Times identified user 4417749 as 62yr old widow
♦ Consequences: CTO resigned, two researchers fired
– Well-intentioned effort failed due to inadequate anonymization
♦ Last time:
– generalities about security, privacy; case studies privacy
♦ Next:
– Anonymisation, de-identification, attacks
CS346 Advanced Databases
17
CS346 Advanced Databases
18
Models of Anonymization
♦ Interactive Model (akin to statistical databases)
– Data owner acts as “gatekeeper” to data
– Researchers pose queries in some agreed language
– Gatekeeper gives an (anonymized) answer, or refuses to answer
♦ “Send me your code” model
– Data owner executes code on their system and reports result
– Cannot be sure that the code is not malicious
♦ Offline, aka “publish and be damned” model
– Data owner somehow anonymizes data set
– Publishes the results to the world, and retires
– The model used in most real data releases
CS346 Advanced Databases
19
Objectives for Anonymization
♦ Prevent (high confidence) inference of associations
– Prevent inference of salary for an individual in “census”
– Prevent inference of individual’s viewing history in “video”
– Prevent inference of individual’s search history in “search”
– All aim to prevent linking sensitive information to an individual
♦ Prevent inference of presence of an individual in the data set
– Satisfying “presence” also satisfies “association” (not vice-versa)
– Presence in a data set can violate privacy (e.g., STD clinic patients)
♦ Have to consider what knowledge might be known to attacker
– Background knowledge: facts about the data set (X has salary Y)
– Domain knowledge: broad properties of data (illness Z rare in men)
CS346 Advanced Databases
20
Utility
♦ Anonymization is meaningless if utility of data not considered
– The empty data set has perfect privacy, but no utility
– The original data has full utility, but no privacy
♦ What is “utility”? Depends what the application is…
– For fixed query set, can look at maximum or average error
– Problem for publishing: want to support unknown applications!
– Need some way to quantify utility of alternate anonymizations
CS346 Advanced Databases
21
Definitions of Technical Terms
♦ Identifiers–uniquely identify, e.g. Social Security Number (SSN)
– Step 0: remove all identifiers
– Was not enough for AOL search data
♦ Quasi-Identifiers (QI)—such as DOB, Sex, ZIP Code
– Enough to partially identify an individual in a dataset
– DOB+Sex+ZIP unique for 87% of US Residents [Sweeney 02]
♦ Sensitive attributes (SA)—the associations we want to hide
– Salary in the “census” example is considered sensitive
– Not always well-defined: only some “search” queries sensitive
– In “video”, association between user and video is sensitive
– One SA can reveal others: bonus may identify salary…
CS346 Advanced Databases
22 22
Tabular Data Example
♦ Census data recording incomes and demographics
♦ Releasing SSN → Salary association violates individual’s privacy
– SSN is an identifier, Salary is a sensitive attribute (SA)
SSN DOB Sex ZIP Salary 11-1-111 1/21/76 M 53715 50,000 22-2-222 4/13/86 F 53715 55,000 33-3-333 2/28/76 M 53703 60,000 44-4-444 1/21/76 M 53703 65,000 55-5-555 4/13/86 F 53706 70,000 66-6-666 2/28/76 F 53706 75,000
CS346 Advanced Databases
23 23
Tabular Data Example: De-Identification
♦ Census data: remove SSN to create de-identified table
– Remove an attribute from the data
♦ Does the de-identified table preserve an individual’s privacy?
– Depends on what other information an attacker knows
DOB Sex ZIP Salary 1/21/76 M 53715 50,000 4/13/86 F 53715 55,000 2/28/76 M 53703 60,000 1/21/76 M 53703 65,000 4/13/86 F 53706 70,000 2/28/76 F 53706 75,000
CS346 Advanced Databases
24 24
Tabular Data Example: Linking Attack
♦ De-identified private data + publicly available data
♦ Cannot uniquely identify either individual’s salary
– DOB is a quasi-identifier (QI)
DOB Sex ZIP Salary 1/21/76 M 53715 50,000 4/13/86 F 53715 55,000 2/28/76 M 53703 60,000 1/21/76 M 53703 65,000 4/13/86 F 53706 70,000 2/28/76 F 53706 75,000
SSN DOB 11-1-111 1/21/76 33-3-333 2/28/76
CS346 Advanced Databases
25 25
Tabular Data Example: Linking Attack
♦ De-identified private data + publicly available data
♦ Uniquely identified one individual’s salary, but not the other’s
– DOB, Sex are quasi-identifiers (QI)
DOB Sex ZIP Salary 1/21/76 M 53715 50,000 4/13/86 F 53715 55,000 2/28/76 M 53703 60,000 1/21/76 M 53703 65,000 4/13/86 F 53706 70,000 2/28/76 F 53706 75,000
SSN DOB Sex 11-1-111 1/21/76 M 33-3-333 2/28/76 M
CS346 Advanced Databases
26 26
Tabular Data Example: Linking Attack
♦ De-identified private data + publicly available data
♦ Uniquely identified both individuals’ salaries
– [DOB, Sex, ZIP] is unique for lots of US residents [Sweeney 02]
DOB Sex ZIP Salary 1/21/76 M 53715 50,000 4/13/86 F 53715 55,000 2/28/76 M 53703 60,000 1/21/76 M 53703 65,000 4/13/86 F 53706 70,000 2/28/76 F 53706 75,000
SSN DOB Sex ZIP 11-1-111 1/21/76 M 53715 33-3-333 2/28/76 M 53703
CS346 Advanced Databases
27 27
Tabular Data Example: Anonymization
♦ Anonymization through row suppression / deletion
♦ Cannot link to private table even with knowledge of QI values
– Missing values could take any permitted value
– Looses a lot of information from the data
DOB Sex ZIP Salary
* * * *
4/13/86 F 53715 55,000 2/28/76 M 53703 60,000
* * * *
4/13/86 F 53706 70,000 2/28/76 F 53706 75,000
SSN DOB Sex ZIP 11-1-111 1/21/76 M 53715
CS346 Advanced Databases
28 28
Tabular Data Example: Anonymization
♦ Anonymization through QI attribute generalization
♦ Cannot uniquely identify row with knowledge of QI values
– Fewer possibilities than row suppression
– E.g., ZIP = 537** → ZIP ∈ {53700, …, 53799}
DOB Sex ZIP Salary 1/21/76 M 537** 50,000 4/13/86 F 537** 55,000 2/28/76 * 537** 60,000 1/21/76 M 537** 65,000 4/13/86 F 537** 70,000 2/28/76 * 537** 75,000
SSN DOB Sex ZIP 11-1-111 1/21/76 M 53715 33-3-333 2/28/76 M 53703
CS346 Advanced Databases
29 29
k-Anonymization
♦ k-anonymity: Table T satisfies k-anonymity with respect to quasi- identifier QI if and only if each tuple in (the multiset) T[QI]
appears at least k times
– Protects against “linking attack”
♦ k-anonymization: Table T’ is a k-anonymization of T if T’ is a generalization/suppression of T, and T’ satisfies k-anonymity DOB Sex ZIP Salary
1/21/76 M 53715 50,000 4/13/86 F 53715 55,000 2/28/76 M 53703 60,000 1/21/76 M 53703 65,000 4/13/86 F 53706 70,000 2/28/76 F 53706 75,000
DOB Sex ZIP Salary 1/21/76 M 537** 50,000 4/13/86 F 537** 55,000 2/28/76 * 537** 60,000 1/21/76 M 537** 65,000 4/13/86 F 537** 70,000 2/28/76 * 537** 75,000
→
T’
T
30 30
k-Anonymization and queries
♦ Data Analysis
– Analysis should (implicitly) range over all possible tables
– Example question: what is the salary of individual (1/21/76, M, 53715)? Best guess is 57,500 (average of 50,000 and 65,000)
– Example question: what is the maximum salary of males in 53706?
Could be as small as 50,000, or as big as 75,000
DOB Sex ZIP Salary 1/21/76 M 537** 50,000 4/13/86 F 537** 55,000 2/28/76 * 537** 60,000 1/21/76 M 537** 65,000 4/13/86 F 537** 70,000 2/28/76 * 537** 75,000
CS346 Advanced Databases
CS346 Advanced Databases
31 31
Homogeneity Attack
♦ Issue: k-anonymity requires each tuple in (the multiset) T[QI] to appear ≥ k times, but does not say anything about the SA values
– If (almost) all SA values in a QI group are equal, loss of privacy!
– The problem is with the choice of grouping, not the data
DOB Sex ZIP Salary 1/21/76 M 53715 50,000 4/13/86 F 53715 55,000 2/28/76 M 53703 60,000 1/21/76 M 53703 50,000 4/13/86 F 53706 55,000 2/28/76 F 53706 60,000
DOB Sex ZIP Salary 1/21/76 * 537** 50,000 4/13/86 * 537** 55,000 2/28/76 * 537** 60,000 1/21/76 * 537** 50,000 4/13/86 * 537** 55,000 2/28/76 * 537** 60,000
→
Not Ok!
CS346 Advanced Databases
32 32
Homogeneity Attack
♦ Issue: k-anonymity requires each tuple in (the multiset) T[QI] to appear ≥ k times, but does not say anything about the SA values
– If (almost) all SA values in a QI group are equal, loss of privacy!
– The problem is with the choice of grouping, not the data
– For some groupings, no loss of privacy
DOB Sex ZIP Salary 1/21/76 M 53715 50,000 4/13/86 F 53715 55,000 2/28/76 M 53703 60,000 1/21/76 M 53703 50,000 4/13/86 F 53706 55,000 2/28/76 F 53706 60,000
→
DOB Sex ZIP Salary 76-86 * 53715 50,000 76-86 * 53715 55,000 76-86 * 53703 60,000 76-86 * 53703 50,000 76-86 * 53706 55,000 76-86 * 53706 60,000
Ok!
CS346 Advanced Databases
33 33
Homogeneity
♦ Intuition: A k-anonymized table T’ represents the set of all possible tables Ti s.t. T’ is a k-anonymization of Ti
♦ Lack of diversity of SA values implies that for large fraction of possible tables, some fact is true, which can violate privacy DOB Sex ZIP Salary
1/21/76 * 537** 50,000 4/13/86 * 537** 55,000 2/28/76 * 537** 60,000 1/21/76 * 537** 50,000 4/13/86 * 537** 55,000 2/28/76 * 537** 60,000
SSN DOB Sex ZIP 11-1-111 1/21/76 M 53715
CS346 Advanced Databases
34 34
l-Diversity
♦ l-Diversity Principle: a table is l-diverse if each of its QI groups contains at least l “well-represented” values for the SA
– Frequency l-diversity: for each QI group g, no SA value should occur more than 1/l fraction of the time
♦ Even l-diversity has its weaknesses: an adversary can use machine learning techniques to make inferences about individuals
DOB Sex ZIP Salary 1/21/76 * 537** 50,000 4/13/86 * 537** 50,000 2/28/76 * 537** 60,000 1/21/76 * 537** 55,000 4/13/86 * 537** 55,000 2/28/76 * 537** 65,000
Summary
CS346 Advanced Databases
35
♦ Concepts in database security: integrity, availability, confidentiality
♦ Statistical databases, and differential privacy to protect data
♦ Data anonymization: k-anonymity and l-diversity
♦ Identifiers, Quasi-identifiers, sensitive attributes
Recommended reading:
♦ Chapter: “Database Security” in Elmasri and Navathe
♦ “A Firm Foundation for Private Data Analysis”, Cynthia Dwork
♦ “k-anonymity”, V. Ciriani, S. De Capitani di Vimercati, S. Foresti, and P. Samarati