• No results found

CS346: Advanced Databases

N/A
N/A
Protected

Academic year: 2022

Share "CS346: Advanced Databases"

Copied!
35
0
0

Loading.... (view fulltext now)

Full text

(1)

CS346:

Advanced Databases

Alexandra I. Cristea

[email protected]

Data Security and

Privacy

(2)

Outline

Chapter: “Database Security” in Elmasri and Navathe (chapter 24, 6th Edition)

♦ Brief overview of database security

♦ More detailed study of database privacy

Statistical databases, and differential privacy to protect data

Data anonymization: k-anonymity and l-diversity

Why?

♦ A topical issue: privacy and security are big concerns

♦ Connections to computer security, statistics

CS346 Advanced Databases

2

(3)

Database Security and Privacy

Database Security and Privacy is a large and complex area, covering:

Legal and ethical requirements for data privacy

E.g. UK Data Protection Act (1998)

Determines for how long data can be retained about people

♦ Government and organisation policy issues on data sharing

E.g. when and how credit reports, medical data can be shared

System-level issues for security management

How is access to data controlled by the system?

Classification of data by security levels (secret, unclassified)

How to determine if access to data is permitted?

CS346 Advanced Databases

3

(4)

Threats to Databases

Databases faces many threats which must be protected against

♦ Integrity: prevent improper modification of the data

Caused by intention or accident, insider or outsider threat

Erroneous data can lead to incorrect decisions, fraud, errors

♦ Availability: is the data available to users and programs?

Denial of service attacks prevent access, cost revenue/reputation

♦ Confidentiality: protect data from unauthorized disclosure

Leakage of information could violate law, customer confidence

Sometimes referred to as the CIA triad

CS346 Advanced Databases

4

(5)

Control measures

♦ Security is provided by control measures of various types

♦ Access control: affect who can access the data / the system

Different users may have different levels of access

♦ Inference control: control what data implies about individuals

Try to make it impossible to infer facts about individuals

♦ Flow control: prevent information from escaping the database

E.g. can data be transferred to other applications?

Control covert channels that can be used to leak data

♦ Data Encryption: store encrypted data in the database

At the field level or at the whole file level

Trade-off between security and ease of processing

CS346 Advanced Databases

5

(6)

Information Security and Information Privacy

♦ The dividing line between security and privacy is hard to draw

♦ Security: prevent unauthorised use of system and data

E.g. access controls to lock out unauthorised users

E.g. encryption to hide data from those without the key

♦ Privacy: control the use of data

Ensure that private information does not emerge from queries

Ability of individuals to control the use of their data

♦ Will focus on database privacy for the remainder

CS346 Advanced Databases

6

(7)

Statistical Databases and Privacy

♦ Statistical databases keep data on large groups

E.g. national population data (Office of National Statistics)

♦ The raw data in statistical databases is confidential

Detailed data about individuals e.g. from census

Users are permitted to retrieve statistics from the data

E.g. averages, sums, counts, maximum values etc.

♦ Providing security for statistical databases is a big challenge

Many crafty ways to extract private information from them

The database should prevent queries that leak information

CS346 Advanced Databases

7

(8)

Statistical Database challenges

♦ Very specific queries can refer to a single person

E.g. SELECT AVG(Salary) FROM EMPLOYEE WHERE AGE=22 AND POSTCODE=‘W1A 1AA’ AND DNO=5 : 45,000

SELECT COUNT(*) FROM EMPLOYEE WHERE AGE=22 AND POSTCODE=‘W1A 1AA’ AND DNO=5 AND SALARY>40000 : 1

♦ How would you detect and reject such queries?

♦ Can arrange queries where the difference is small

SELECT COUNT(*) FROM EMPLOYEE WHERE AGE>=22 AND DNO=5 AND SALARY>40000 : 12

SELECT COUNT(*) FROM EMPLOYEE WHERE AGE>=23 AND DNO=5 AND SALARY>40000 : 11

CS346 Advanced Databases

8

(9)

Differential Privacy for Statistical Databases

♦ Principle: query answers reveals little about any individual

Even if adversary knows (almost) everything about everyone else!

♦ Thus, individuals should be secure about contributing their data

What is learnt about them is about the same either way

♦ Much work on providing differential privacy (DP)

Simple recipe for some data types (e.g. numeric answers)

Simple rules allow us to reason about composition of results

More complex algorithms for arbitrary data

♦ Adopted and used by several organizations:

US Census, Common Data Project, Facebook (?)

9

(10)

Differential Privacy Definition

The output distribution of a differentially private algorithm changes very little whether or not any individual’s data is included in the input

– so you should contribute your data

10

(11)

Laplace Mechanism

♦ The Laplace Mechanism adds random noise to query results

Scaled to mask the contribution of any individual

Add noise from a symmetric continuous distribution to true answer

Laplace distribution is a symmetric exponential distribution

Laplace provides DP for COUNT queries, as shifting the distribution changes the probability by at most a constant factor

11 CS346 Advanced Databases

(12)

Sensitivity of Numeric Functions

♦ For more complex functions, we need to calibrate the noise to the influence an individual can have on the output

The (global) sensitivity of a function F is the maximum (absolute) change over all possible adjacent inputs

S(F) = maxD, D’ : |D-D’|=1 |F(D) – F(D’)|

Intuition: S(F) characterizes the scale of the influence of one individual, and hence how much noise we must add

♦ S(F) is small for many common functions

S(F) = 1 for COUNT

S(F) = 2 for HISTOGRAM

Bounded for other functions (MEAN, covariance matrix…)

12

Female Male

CS346 Advanced Databases

(13)

CS346 Advanced Databases

13

Data Anonymization

The idea of data anonymization is compelling, has many applications

♦ For Data Sharing

Give real(istic) data to others to study without compromising privacy of individuals in the data

Allows third-parties to try new analysis and mining techniques not thought of by the data owner

♦ For Data Retention and Usage

Various requirements prevent companies from retaining customer information indefinitely

E.g. Google progressively anonymizes IP addresses in search logs

Internal sharing across departments (e.g. billing → marketing)

(14)

CS346 Advanced Databases

14

Case Study: US Census

♦ Raw data: information about every US household

Who, where; age, gender, racial, income and educational data

♦ Why released: determine representation, planning

♦ How anonymized: aggregated to geographic areas (Zip code)

Broken down by various combinations of dimensions

Released in full after 72 years

♦ Attacks: no reports of successful deanonymization

Recent attempts by FBI to access raw data rebuffed

♦ Consequences: greater understanding of US population

Affects representation, funding of civil projects

Rich source of data for future historians and genealogists

(15)

CS346 Advanced Databases

15

Case Study: Netflix Prize

♦ Raw data: 100M dated ratings from 480K users to 18K movies

♦ Why released: improve predicting ratings of unlabeled examples

♦ How anonymized: exact details not described by Netflix

All direct customer information removed

Only subset of full data; dates modified; some ratings deleted,

Movie title and year published in full

♦ Attacks: dataset is claimed vulnerable

Attack links data to IMDB where same users also rated movies

Find matches based on similar ratings or dates in both

♦ Consequences: rich source of user data for researchers

Unclear how serious the attacks are in practice

(16)

CS346 Advanced Databases

16

Case Study: AOL Search Data

♦ Raw data: 20M search queries for 650K users from 2006

♦ Why released: allow researchers to understand search patterns

♦ How anonymized: user identifiers removed

All searches from same user linked by an arbitrary identifier

♦ Attacks: many successful attacks identified individual users

Ego-surfers: people typed in their own names

Zip codes and town names identify an area

NY Times identified user 4417749 as 62yr old widow

♦ Consequences: CTO resigned, two researchers fired

Well-intentioned effort failed due to inadequate anonymization

(17)

♦ Last time:

generalities about security, privacy; case studies privacy

♦ Next:

Anonymisation, de-identification, attacks

CS346 Advanced Databases

17

(18)

CS346 Advanced Databases

18

Models of Anonymization

♦ Interactive Model (akin to statistical databases)

Data owner acts as “gatekeeper” to data

Researchers pose queries in some agreed language

Gatekeeper gives an (anonymized) answer, or refuses to answer

♦ “Send me your code” model

Data owner executes code on their system and reports result

Cannot be sure that the code is not malicious

♦ Offline, aka “publish and be damned” model

Data owner somehow anonymizes data set

Publishes the results to the world, and retires

The model used in most real data releases

(19)

CS346 Advanced Databases

19

Objectives for Anonymization

♦ Prevent (high confidence) inference of associations

Prevent inference of salary for an individual in “census”

Prevent inference of individual’s viewing history in “video”

Prevent inference of individual’s search history in “search”

All aim to prevent linking sensitive information to an individual

♦ Prevent inference of presence of an individual in the data set

Satisfying “presence” also satisfies “association” (not vice-versa)

Presence in a data set can violate privacy (e.g., STD clinic patients)

♦ Have to consider what knowledge might be known to attacker

Background knowledge: facts about the data set (X has salary Y)

Domain knowledge: broad properties of data (illness Z rare in men)

(20)

CS346 Advanced Databases

20

Utility

♦ Anonymization is meaningless if utility of data not considered

The empty data set has perfect privacy, but no utility

The original data has full utility, but no privacy

What is “utility”? Depends what the application is…

For fixed query set, can look at maximum or average error

Problem for publishing: want to support unknown applications!

Need some way to quantify utility of alternate anonymizations

(21)

CS346 Advanced Databases

21

Definitions of Technical Terms

♦ Identifiers–uniquely identify, e.g. Social Security Number (SSN)

Step 0: remove all identifiers

Was not enough for AOL search data

♦ Quasi-Identifiers (QI)—such as DOB, Sex, ZIP Code

Enough to partially identify an individual in a dataset

DOB+Sex+ZIP unique for 87% of US Residents [Sweeney 02]

♦ Sensitive attributes (SA)—the associations we want to hide

Salary in the “census” example is considered sensitive

Not always well-defined: only some “search” queries sensitive

In “video”, association between user and video is sensitive

One SA can reveal others: bonus may identify salary…

(22)

CS346 Advanced Databases

22 22

Tabular Data Example

♦ Census data recording incomes and demographics

♦ Releasing SSN → Salary association violates individual’s privacy

SSN is an identifier, Salary is a sensitive attribute (SA)

SSN DOB Sex ZIP Salary 11-1-111 1/21/76 M 53715 50,000 22-2-222 4/13/86 F 53715 55,000 33-3-333 2/28/76 M 53703 60,000 44-4-444 1/21/76 M 53703 65,000 55-5-555 4/13/86 F 53706 70,000 66-6-666 2/28/76 F 53706 75,000

(23)

CS346 Advanced Databases

23 23

Tabular Data Example: De-Identification

♦ Census data: remove SSN to create de-identified table

Remove an attribute from the data

♦ Does the de-identified table preserve an individual’s privacy?

Depends on what other information an attacker knows

DOB Sex ZIP Salary 1/21/76 M 53715 50,000 4/13/86 F 53715 55,000 2/28/76 M 53703 60,000 1/21/76 M 53703 65,000 4/13/86 F 53706 70,000 2/28/76 F 53706 75,000

(24)

CS346 Advanced Databases

24 24

Tabular Data Example: Linking Attack

♦ De-identified private data + publicly available data

♦ Cannot uniquely identify either individual’s salary

DOB is a quasi-identifier (QI)

DOB Sex ZIP Salary 1/21/76 M 53715 50,000 4/13/86 F 53715 55,000 2/28/76 M 53703 60,000 1/21/76 M 53703 65,000 4/13/86 F 53706 70,000 2/28/76 F 53706 75,000

SSN DOB 11-1-111 1/21/76 33-3-333 2/28/76

(25)

CS346 Advanced Databases

25 25

Tabular Data Example: Linking Attack

♦ De-identified private data + publicly available data

♦ Uniquely identified one individual’s salary, but not the other’s

DOB, Sex are quasi-identifiers (QI)

DOB Sex ZIP Salary 1/21/76 M 53715 50,000 4/13/86 F 53715 55,000 2/28/76 M 53703 60,000 1/21/76 M 53703 65,000 4/13/86 F 53706 70,000 2/28/76 F 53706 75,000

SSN DOB Sex 11-1-111 1/21/76 M 33-3-333 2/28/76 M

(26)

CS346 Advanced Databases

26 26

Tabular Data Example: Linking Attack

♦ De-identified private data + publicly available data

♦ Uniquely identified both individuals’ salaries

[DOB, Sex, ZIP] is unique for lots of US residents [Sweeney 02]

DOB Sex ZIP Salary 1/21/76 M 53715 50,000 4/13/86 F 53715 55,000 2/28/76 M 53703 60,000 1/21/76 M 53703 65,000 4/13/86 F 53706 70,000 2/28/76 F 53706 75,000

SSN DOB Sex ZIP 11-1-111 1/21/76 M 53715 33-3-333 2/28/76 M 53703

(27)

CS346 Advanced Databases

27 27

Tabular Data Example: Anonymization

♦ Anonymization through row suppression / deletion

♦ Cannot link to private table even with knowledge of QI values

Missing values could take any permitted value

Looses a lot of information from the data

DOB Sex ZIP Salary

* * * *

4/13/86 F 53715 55,000 2/28/76 M 53703 60,000

* * * *

4/13/86 F 53706 70,000 2/28/76 F 53706 75,000

SSN DOB Sex ZIP 11-1-111 1/21/76 M 53715

(28)

CS346 Advanced Databases

28 28

Tabular Data Example: Anonymization

♦ Anonymization through QI attribute generalization

♦ Cannot uniquely identify row with knowledge of QI values

Fewer possibilities than row suppression

E.g., ZIP = 537** → ZIP ∈ {53700, …, 53799}

DOB Sex ZIP Salary 1/21/76 M 537** 50,000 4/13/86 F 537** 55,000 2/28/76 * 537** 60,000 1/21/76 M 537** 65,000 4/13/86 F 537** 70,000 2/28/76 * 537** 75,000

SSN DOB Sex ZIP 11-1-111 1/21/76 M 53715 33-3-333 2/28/76 M 53703

(29)

CS346 Advanced Databases

29 29

k-Anonymization

♦ k-anonymity: Table T satisfies k-anonymity with respect to quasi- identifier QI if and only if each tuple in (the multiset) T[QI]

appears at least k times

Protects against “linking attack”

♦ k-anonymization: Table T’ is a k-anonymization of T if T’ is a generalization/suppression of T, and T’ satisfies k-anonymity DOB Sex ZIP Salary

1/21/76 M 53715 50,000 4/13/86 F 53715 55,000 2/28/76 M 53703 60,000 1/21/76 M 53703 65,000 4/13/86 F 53706 70,000 2/28/76 F 53706 75,000

DOB Sex ZIP Salary 1/21/76 M 537** 50,000 4/13/86 F 537** 55,000 2/28/76 * 537** 60,000 1/21/76 M 537** 65,000 4/13/86 F 537** 70,000 2/28/76 * 537** 75,000

T’

T

(30)

30 30

k-Anonymization and queries

♦ Data Analysis

Analysis should (implicitly) range over all possible tables

Example question: what is the salary of individual (1/21/76, M, 53715)? Best guess is 57,500 (average of 50,000 and 65,000)

Example question: what is the maximum salary of males in 53706?

Could be as small as 50,000, or as big as 75,000

DOB Sex ZIP Salary 1/21/76 M 537** 50,000 4/13/86 F 537** 55,000 2/28/76 * 537** 60,000 1/21/76 M 537** 65,000 4/13/86 F 537** 70,000 2/28/76 * 537** 75,000

CS346 Advanced Databases

(31)

CS346 Advanced Databases

31 31

Homogeneity Attack

♦ Issue: k-anonymity requires each tuple in (the multiset) T[QI] to appear ≥ k times, but does not say anything about the SA values

If (almost) all SA values in a QI group are equal, loss of privacy!

The problem is with the choice of grouping, not the data

DOB Sex ZIP Salary 1/21/76 M 53715 50,000 4/13/86 F 53715 55,000 2/28/76 M 53703 60,000 1/21/76 M 53703 50,000 4/13/86 F 53706 55,000 2/28/76 F 53706 60,000

DOB Sex ZIP Salary 1/21/76 * 537** 50,000 4/13/86 * 537** 55,000 2/28/76 * 537** 60,000 1/21/76 * 537** 50,000 4/13/86 * 537** 55,000 2/28/76 * 537** 60,000

Not Ok!

(32)

CS346 Advanced Databases

32 32

Homogeneity Attack

♦ Issue: k-anonymity requires each tuple in (the multiset) T[QI] to appear ≥ k times, but does not say anything about the SA values

If (almost) all SA values in a QI group are equal, loss of privacy!

The problem is with the choice of grouping, not the data

For some groupings, no loss of privacy

DOB Sex ZIP Salary 1/21/76 M 53715 50,000 4/13/86 F 53715 55,000 2/28/76 M 53703 60,000 1/21/76 M 53703 50,000 4/13/86 F 53706 55,000 2/28/76 F 53706 60,000

DOB Sex ZIP Salary 76-86 * 53715 50,000 76-86 * 53715 55,000 76-86 * 53703 60,000 76-86 * 53703 50,000 76-86 * 53706 55,000 76-86 * 53706 60,000

Ok!

(33)

CS346 Advanced Databases

33 33

Homogeneity

♦ Intuition: A k-anonymized table T’ represents the set of all possible tables Ti s.t. T’ is a k-anonymization of Ti

♦ Lack of diversity of SA values implies that for large fraction of possible tables, some fact is true, which can violate privacy DOB Sex ZIP Salary

1/21/76 * 537** 50,000 4/13/86 * 537** 55,000 2/28/76 * 537** 60,000 1/21/76 * 537** 50,000 4/13/86 * 537** 55,000 2/28/76 * 537** 60,000

SSN DOB Sex ZIP 11-1-111 1/21/76 M 53715

(34)

CS346 Advanced Databases

34 34

l-Diversity

l-Diversity Principle: a table is l-diverse if each of its QI groups contains at least l “well-represented” values for the SA

Frequency l-diversity: for each QI group g, no SA value should occur more than 1/l fraction of the time

Even l-diversity has its weaknesses: an adversary can use machine learning techniques to make inferences about individuals

DOB Sex ZIP Salary 1/21/76 * 537** 50,000 4/13/86 * 537** 50,000 2/28/76 * 537** 60,000 1/21/76 * 537** 55,000 4/13/86 * 537** 55,000 2/28/76 * 537** 65,000

(35)

Summary

CS346 Advanced Databases

35

♦ Concepts in database security: integrity, availability, confidentiality

♦ Statistical databases, and differential privacy to protect data

♦ Data anonymization: k-anonymity and l-diversity

♦ Identifiers, Quasi-identifiers, sensitive attributes

Recommended reading:

♦ Chapter: “Database Security” in Elmasri and Navathe

♦ “A Firm Foundation for Private Data Analysis”, Cynthia Dwork

♦ “k-anonymity”, V. Ciriani, S. De Capitani di Vimercati, S. Foresti, and P. Samarati

References

Related documents

be related to the frequency of Ai in a population defined sufficiently large enough that the descendants of the ancestors of A and B are randomly dis-

Conclusion: The model extends previous research by outlining a process by which clinicians' perceptions shape implementation of lifestyle risk factor management in routine

Surface force analysis with atomic force microscope (AFM) in which a single amino acid residue was mounted on the tip apex of AFM probe was carried out for the first time at

e , f The 18 F-fluorodeoxyglucosepositron emission tomography-computed tomography ( 18 F-FDG PET/CT) scan performed five months after starting cART showed intense accumulation

• Macro/Institutional Level: Macro-level factors that have a significant impact on young Kenyans include: (i) the poor performance of the economy, which in turn limits

3 Sizing EqualLogic Storage for Exchange Server 2010 Mailbox Role in a Virtualized Environment In an Exchange 2010 environment, storage sizing will depend on capacity and

In the present study it was observed that the alcoholic extract of Plumeria rubra pod extract at 200 mg/kg body weight prolonged the estrous cycle and particularly diestrus phase

In this instance, the the master has two different options: (i) running the tasks on the local infrastructure (single cloud case) or (ii) outsourcing (using a