
PRIVACY-PRESERVING DATA ANALYSIS AND DATA SHARING

Academic year: 2021


(1)

PRIVACY-PRESERVING DATA ANALYSIS AND DATA SHARING

Chih-Hua Tai

Dept. of Computer Science and Information Engineering, National Taipei University

(2)

BENEFIT OF DATA ANALYSIS

• Many fields, such as marketing, business, healthcare, software engineering, and government, to name a few, have benefited greatly from data analysis.

• Marketing / Retail
  - Helps marketing
  - Understand customers' behaviors
  - Predict what customers want

• Finance / Banking
  - Predict the risk of a loan
  - Detect fraudulent credit card transactions

(3)

BENEFIT OF DATA SHARING

• The volume of data affects the quality of the results of data analysis.

• Releasing data creates more opportunities for better data analysis.

(4)

PRIVACY-PRESERVING DATA

(5)

PRIVACY ISSUES IN DATA SHARING (EX 1)

DOB      Sex     Zipcode  Disease
1/21/76  Male    53715    Heart Disease
4/13/86  Female  53715    Hepatitis
2/28/76  Male    53703    Bronchitis
1/21/76  Male    53703    Broken Arm
4/13/86  Female  53706    Flu
2/28/76  Female  53706    Hang Nail

(6)

PRIVACY ISSUES IN DATA SHARING (EX 1)

Name   DOB      Sex     Zipcode
Alice  1/21/76  Male    53715
Bob    1/10/81  Female  55410
Carol  10/1/44  Female  90210
Dan    2/21/84  Male    02174

(7)

Name   DOB      Sex     Zipcode
Alice  1/21/76  Female  53715
Bob    1/10/81  Female  55410
Carol  10/1/44  Female  90210
Dan    2/21/84  Male    02174
Ellen  4/19/72  Female  02237

DOB      Sex     Zipcode  Disease
1/21/76  Female  53715    Heart Disease
4/13/86  Female  53715    Hepatitis
2/28/76  Male    53703    Bronchitis
1/21/76  Male    53703    Broken Arm
4/13/86  Female  53706    Flu
2/28/76  Female  53706    Hang Nail
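The linkage between the two tables on this slide can be sketched in a few lines of Python. This is only an illustration, not code from the talk: the `link` helper and the variable names are hypothetical, and the rows are copied from the tables above. It joins the identified table and the released table on the quasi-identifier (DOB, Sex, Zipcode).

```python
# Illustrative sketch of a linkage attack on (DOB, Sex, Zipcode).
# Rows are copied from the two tables on this slide; `link` is a
# hypothetical helper, not part of any library.

public = [  # identified records available to the attacker
    ("Alice", "1/21/76", "Female", "53715"),
    ("Bob", "1/10/81", "Female", "55410"),
    ("Carol", "10/1/44", "Female", "90210"),
    ("Dan", "2/21/84", "Male", "02174"),
    ("Ellen", "4/19/72", "Female", "02237"),
]

released = [  # released records with names removed
    ("1/21/76", "Female", "53715", "Heart Disease"),
    ("4/13/86", "Female", "53715", "Hepatitis"),
    ("2/28/76", "Male", "53703", "Bronchitis"),
    ("1/21/76", "Male", "53703", "Broken Arm"),
    ("4/13/86", "Female", "53706", "Flu"),
    ("2/28/76", "Female", "53706", "Hang Nail"),
]


def link(public_rows, released_rows):
    """Return {name: [diseases]} for every name whose quasi-identifier matches."""
    by_qi = {}
    for dob, sex, zipcode, disease in released_rows:
        by_qi.setdefault((dob, sex, zipcode), []).append(disease)
    return {
        name: by_qi[(dob, sex, zipcode)]
        for name, dob, sex, zipcode in public_rows
        if (dob, sex, zipcode) in by_qi
    }


matches = link(public, released)
print(matches)
```

With these rows, only Alice's quasi-identifier appears in the released table, and it matches exactly one record, so her disease is disclosed even though no name was published.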

(8)

PRIVACY ISSUES IN DATA SHARING (EX 2)

• Attribute Linkage

(9)

PRIVACY ISSUES IN DATA SHARING (EX 3)

Attributes: Name, Salary, …
Links: Friends, Neighborhood, …
Communities: Interests, Activities, …
(10)

PRIVACY ISSUES ON SOCIAL NETWORKS

• Personal information can leak even if the vertex identities are hidden…

Much information can be used to re-associate a vertex with its identity:
  - Vertex degree: k-degree anonymity, …
  - Neighborhood configuration: k-neighborhood anonymity, k-automorphism anonymity, k-isomorphism anonymity, grouping-and-collapsing, …
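As a rough illustration of the vertex-degree case, the sketch below checks k-degree anonymity under its usual definition: every degree value that occurs in the graph is shared by at least k vertices. The adjacency-dictionary representation, the function name, and the toy graph are assumptions made for this example only.

```python
from collections import Counter


def is_k_degree_anonymous(adjacency, k):
    """True if every degree value that occurs is shared by at least k vertices.

    `adjacency` maps each vertex to the set of its neighbors
    (an assumed representation for an undirected graph).
    """
    degree_counts = Counter(len(neighbors) for neighbors in adjacency.values())
    return all(count >= k for count in degree_counts.values())


# Hypothetical toy graph: the path a - b - c - d.
path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}

print(is_k_degree_anonymous(path, 2))
print(is_k_degree_anonymous(path, 3))
```

In the toy path, two vertices have degree 1 and two have degree 2, so an attacker who knows only a victim's degree is left with two candidates.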

(11)

FRIENDSHIP ATTACK

• There is yet another type of information that can be used for vertex re-identification: the friendship attack.

(12)

PRIVACY CONCERNS IN DATA SHARING

• Personal information can leak even if the vertex identities are hidden…
  - E.g., Friendship attacks
  - E.g., Community identification

C.-H. Tai, P. S. Yu, D.-N. Yang, and M.-S. Chen, "Structural Diversity for Privacy in Publishing Social Networks," In SDM, 2011; "Structural Diversity for Resisting Community Identification in Published Social Networks," IEEE Trans. Knowl. Data Eng. 26(1): 235-252 (2014).

C.-H. Tai, P. S. Yu, D.-N. Yang, and M.-S. Chen, "Privacy-preserving Social Network Publication Against Friendship Attacks," In KDD, 2011.

C.-H. Tai, P.-J. Tseng, P. S. Yu, and M.-S. Chen, "Identities Anonymization in Dynamic Social Networks," In ICDM, 2011.

(13)

ENFORCING STRUCTURAL DIVERSITY FOR PRIVACY-PRESERVING RELEASE

(14)

ATTRIBUTE LINKAGE

(15)

L-DIVERSITY

• Distinct l-diversity
  - Within each group of records that share the same set of quasi-identifier (QI) values, the sensitive attribute takes at least l distinct values.
  - QI: attributes that could potentially re-identify records.
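The definition of distinct l-diversity can be checked mechanically. The sketch below is a minimal illustration under stated assumptions: records are plain dictionaries, and the column names and generalized values are hypothetical toy data, not taken from the talk.

```python
from collections import defaultdict


def is_distinct_l_diverse(records, qi_columns, sensitive_column, l):
    """True if every QI group holds at least l distinct sensitive values."""
    groups = defaultdict(set)
    for record in records:
        key = tuple(record[column] for column in qi_columns)
        groups[key].add(record[sensitive_column])
    return all(len(values) >= l for values in groups.values())


# Hypothetical generalized table with two QI groups.
records = [
    {"zip": "537**", "age": "30-39", "disease": "Flu"},
    {"zip": "537**", "age": "30-39", "disease": "Hepatitis"},
    {"zip": "537**", "age": "40-49", "disease": "Flu"},
    {"zip": "537**", "age": "40-49", "disease": "Flu"},
]

print(is_distinct_l_diverse(records, ["zip", "age"], "disease", 1))
print(is_distinct_l_diverse(records, ["zip", "age"], "disease", 2))
```

The second group contains only "Flu", so anyone known to fall in that group has the disease revealed without being singled out; this is why the table fails for l = 2.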

(16)

L-DIVERSITY

(17)

COMMUNITY IDENTIFICATION

- Vertex identification is considered an important privacy issue in publishing social networks.
  ◦ k-degree anonymity, k-neighborhood anonymity, …

- In addition to a vertex identity, each individual is also associated with a community identity.
  ◦ Could be used to infer sensitive information such as political party affiliation or disease.
  ◦ Is a kind of structural information

(18)

COMMUNITY IDENTIFICATION

- Community information is explicitly given:

Ex. Alice has recently learned that…
  - Mark is sick
  - Mark participates in this social network
  - Mark has 3 friends (vertex degree attack)

(19)

COMMUNITY IDENTIFICATION

- Community information is not given:

Ex. Alice knows that Mark participates in this social network and has 3 friends (vertex degree attack).

→ Alice can learn an approximation of Bob's neighborhood.

(20)

NEW PRIVACY MODEL AGAINST COMMUNITY IDENTIFICATION

• L-Structural Diversity
  - To protect against the vertex degree attack, for each vertex there should be other vertices with the same degree located in at least L-1 other communities.

(21)

PRIVACY MODEL: K^W-STRUCTURAL DIVERSITY ANONYMITY

• A release of a social network satisfies L-structural diversity if, for every vertex, there exists an L-shielding group.

• A group θ_d, consisting of all vertices of degree d, is an L-shielding group if there is a vertex subset θ ⊆ θ_d such that
  - (1) |θ| ≥ L, and
  - (2) for any two vertices u and v in θ, C_u ∩ C_v = ∅, where C is the community identity.

Ex. Mary has four friends.

???
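A minimal sketch of checking L-structural diversity follows. It assumes each vertex belongs to exactly one community, so the pairwise-disjointness condition reduces to counting the distinct communities among the vertices of each degree; with overlapping communities a subset search would be needed instead. The graph representation, names, and toy data are hypothetical.

```python
from collections import defaultdict


def satisfies_l_structural_diversity(adjacency, community, l):
    """True if, for every degree d, the degree-d vertices span >= l communities.

    Assumes one community label per vertex, so a subset of l vertices with
    pairwise different communities exists exactly when the degree group
    covers at least l distinct labels.
    """
    communities_by_degree = defaultdict(set)
    for vertex, neighbors in adjacency.items():
        communities_by_degree[len(neighbors)].add(community[vertex])
    return all(len(labels) >= l for labels in communities_by_degree.values())


# Hypothetical toy graph: two separate edges, one per community.
adjacency = {"a": {"b"}, "b": {"a"}, "c": {"d"}, "d": {"c"}}
community = {"a": 1, "b": 1, "c": 2, "d": 2}

print(satisfies_l_structural_diversity(adjacency, community, 2))
print(satisfies_l_structural_diversity(adjacency, community, 3))
```

Here every vertex has degree 1 and the degree-1 group covers both communities, so knowing a victim's degree does not reveal which community the victim belongs to when L = 2.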

(22)

THE ANONYMIZATION

• Problem formulation:
  - Given a graph G(V, E, C) and an integer L, 1 ≤ L ≤ |C|, the problem is to anonymize G to satisfy L-structural diversity such that information distortion is minimized.

• The challenges:
  - Trade-off between data utility and data privacy
  - Scalability of anonymization techniques

(23)

THE PROBLEM IN DYNAMIC SCENARIOS

• A dynamic social network will be released sequentially.

• An attacker can monitor a victim for a period w.

• Therefore, the adversary's knowledge includes:
  - The releases G_{t-w+1}, G_{t-w+2}, …, G_t during w
  - A degree sequence Δ_v^w = (d_v^{t-w+1}, d_v^{t-w+2}, …, d_v^t) of a victim v during w

[Figure: two successive releases, G_1 and G_2]

Ex. John has two friends at time 1, and makes a new friend at time 2.

(24)

PRIVACY MODEL: K^W-STRUCTURAL DIVERSITY ANONYMITY

• Dynamic scenarios of w > 1
  - A consistent group Θ_Δ is the set of vertices that always share the same degree during w.
  - A consistent group Θ_Δ is L-shielding if, at each time instant t in w, there is a vertex subset Θ_t ⊆ Θ_Δ such that
    (1) |Θ_t| ≥ L, and
    (2) for any two vertices u and v in Θ_t, C_u^t ∩ C_v^t = ∅, where C^t is the community identity at time t.

The adversary's knowledge includes:
1. The releases G_{t-w+1}, G_{t-w+2}, …, G_t during w
2. A degree sequence Δ_v^w = (d_v^{t-w+1}, d_v^{t-w+2}, …, d_v^t) of a victim v during w
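The notions of degree sequence and consistent group can be made concrete with a short sketch. Everything here is an assumption made for illustration: each release is an adjacency dictionary, a vertex absent from a release is treated as having degree 0, each vertex has one community label per release, and the toy data only mirrors the "John" example in spirit.

```python
from collections import defaultdict


def degree_sequence(releases, vertex):
    """Degrees of `vertex` across the releases, oldest first."""
    return tuple(len(graph.get(vertex, ())) for graph in releases)


def consistent_groups(releases):
    """Group vertices that share the same degree in every release."""
    vertices = set().union(*releases)  # union of the vertex keys
    groups = defaultdict(set)
    for vertex in vertices:
        groups[degree_sequence(releases, vertex)].add(vertex)
    return dict(groups)


def is_l_shielding(group, communities_per_release, l):
    """True if the group covers >= l communities at every time instant."""
    return all(
        len({labels[vertex] for vertex in group}) >= l
        for labels in communities_per_release
    )


# Hypothetical releases: john has two friends at time 1, three at time 2.
g1 = {"john": {"a", "b"}, "a": {"john"}, "b": {"john"}, "c": set()}
g2 = {"john": {"a", "b", "c"}, "a": {"john"}, "b": {"john"}, "c": {"john"}}
groups = consistent_groups([g1, g2])

print(degree_sequence([g1, g2], "john"))
print(groups[(1, 1)])
```

In this toy data John is the only vertex with the degree sequence (2, 3), so an adversary who monitored him over both releases re-identifies him immediately; the anonymization would have to place him in a larger, L-shielding consistent group.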

(25)

THE ANONYMIZATION

• Problem formulation:
  - Suppose that every vertex in a series of sequential releases G_{t-w+1}, G_{t-w+2}, …, G_{t-1} is protected.
  - Given G_{t-w+1}, G_{t-w+2}, …, G_{t-1} and L, anonymize the current social network G_t such that every vertex is protected in an L-shielding consistent group.

• The challenges:
  - The anonymization depends not only on the current social network but also on the previous w-1 releases.
  - Searching through all the w-1 releases to eliminate

(26)

PRIVACY-PRESERVING DATA

(27)

DATA PRIVACY VS. DATA ANALYSIS

• Data mining has shown great promise in various fields.
  - Business, Medical treatment, Networks, Bioinformatics, ...

• For those who lack expertise in data analysis and/or computing resources...

[Figure: Data Owner and Mining Services Provider (Cloud Computing)]

(28)

DATA PRIVACY VS. DATA ANALYSIS

• Data Is Money!!!

[Figure: Data Owner and Mining Services Provider (Cloud Computing), annotated "Privacy?!"]

(29)
(30)

PRIVACY CONCERNS IN DATA SHARING

• Privacy-preserving outsourcing techniques without sacrificing data utility
  - E.g., Mining patterns
  - E.g., Similarity measurement

Y.-W. Chu, C.-H. Tai, M.-S. Chen, and P. S. Yu, "Privacy-preserving SimRank over Distributed Information Network," In ICDM, 2012.

C.-H. Tai, P. S. Yu, and M.-S. Chen, "k-Support Anonymity based on Pseudo Taxonomy for Outsourcing of Frequent Itemset Mining," In KDD, 2010.

C.-H. Tai, J.-W. Huang, and M.-H. Chung, "Privacy Preserving Frequent Pattern Mining on Multi-Cloud Environment," In SBAST, 2013.

(31)

PRIVACY-PRESERVING OUTSOURCING OF

(32)

FREQUENT ITEMSET MINING (FIM)

• Discover what happens frequently

Trans. ID  Items
1          wine
2          cigar, wine
3          cigar, tea
4          beer, cigar, wine
5          beer, tea, wine

When the threshold is set to 3 (= 60%), {wine} and {cigar} are frequent.
When the threshold is set to 2 (= 40%), {wine}, {cigar}, {tea}, {beer}, {wine, cigar}, and {wine, beer} are frequent.
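The two threshold statements above can be reproduced with a brute-force enumeration. This is only a sketch for a five-transaction example: it tests every candidate itemset, which is exponential in the number of items, whereas practical miners such as Apriori prune the candidates. The function name is hypothetical; the transactions are the ones on this slide.

```python
from itertools import combinations

# The five transactions from the slide.
transactions = [
    {"wine"},
    {"cigar", "wine"},
    {"cigar", "tea"},
    {"beer", "cigar", "wine"},
    {"beer", "tea", "wine"},
]


def frequent_itemsets(transactions, min_support):
    """Map each itemset contained in >= min_support transactions to its support."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            support = sum(1 for t in transactions if set(candidate) <= t)
            if support >= min_support:
                result[frozenset(candidate)] = support
    return result


for threshold in (3, 2):
    found = frequent_itemsets(transactions, threshold)
    print(threshold, sorted(sorted(itemset) for itemset in found))
```

The supports are wine 4, cigar 3, tea 2, and beer 2, with {wine, cigar} and {wine, beer} each appearing in two transactions, which matches the two lists on the slide.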

(33)

THE RISKS OF OUTSOURCING FIM

• An encryption/decryption method is believed to be a possible solution.

[Figure: Data Owner and Mining Services Provider (Cloud Computing)]

How to achieve the encryption and decryption?
  - Privacy protected
  - Correct mining results

(34)

THE RISKS OF OUTSOURCING FIM

Trans. ID  Items
1          wine
2          cigar, wine
3          cigar, tea
4          beer, cigar, wine
5          beer, tea, wine

Encrypt:

Trans. ID  Items
1          a
2          a, c
3          c, d
4          a, b, c
5          a, b, d

(35)

THE RISKS OF OUTSOURCING FIM

• Top frequency attack
  - Wine is the most frequent item → 'a' is 'wine'

• Approximate support attack
  - The support of cigar is about 55%~60% → 'c' is 'cigar'

Trans. ID  Items
1          wine
2          cigar, wine
3          cigar, tea
4          beer, cigar, wine
5          beer, tea, wine

Encrypt:

Trans. ID  Items
1          a
2          a, c
3          c, d
4          a, b, c
5          a, b, d
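Both attacks work because a one-to-one substitution leaves every item's support unchanged. The sketch below illustrates this on the encrypted table from this slide; the function names and the way the attacker's background knowledge is passed in (a support range) are assumptions made for the example.

```python
from collections import Counter

# Encrypted transactions from the slide (one-to-one item substitution).
encrypted = [{"a"}, {"a", "c"}, {"c", "d"}, {"a", "b", "c"}, {"a", "b", "d"}]


def supports(transactions):
    """Relative support of every item."""
    counts = Counter(item for t in transactions for item in t)
    return {item: count / len(transactions) for item, count in counts.items()}


def top_frequency_guess(transactions):
    """Attacker knows which real item is most frequent: pick the top label."""
    item_supports = supports(transactions)
    return max(item_supports, key=item_supports.get)


def approximate_support_guess(transactions, low, high):
    """Attacker knows a real item's support lies in [low, high]."""
    return sorted(
        item for item, s in supports(transactions).items() if low <= s <= high
    )


print(top_frequency_guess(encrypted))
print(approximate_support_guess(encrypted, 0.55, 0.60))
```

Only one label has the highest support and only one falls in the 55% to 60% range, so each guess is unambiguous. Hiding the item names is therefore not enough; the support values themselves have to be made indistinguishable.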

(36)


The Risks of Outsourcing FIM

The support information about the frequent itemsets can be used to effectively reveal the raw data as well as the sensitive information from the anonymized transactions.

T. Mielikäinen, "Privacy Problems with Anonymized Transaction Databases," In Proc. of Discovery Science,

(37)

RELATED WORKS

• Encrypt each real item with a one-to-many mapping function.
  Wong, W. K., Cheung, D. W., Hung, E., Kao, B., Mamoulis, N.: Security in Outsourcing of Association Rule Mining. In: Proc. of VLDB, 2007.

• However, it does not try to anonymize the support information.

• It was recently broken.
  Molloy, I., Li, N., Li, T.: On the (In)Security and (Im)Practicality of Outsourcing Precise Association Rule Mining. In: Proc. of ICDM, 2009.

(38)

K-SUPPORT ANONYMITY & ANONYMIZATION

• For every sensitive item, there are at least k-1 other items with the same support.
  - The probability of an item being correctly re-identified is limited to 1/k, even when the precise support information is known.

• Given a transactional database T, encrypt T into E(T) such that
  - There exists a decryption function D such that MiningResult(T, Δ) = D(MiningResult(E(T), Δ)) for any minimum support Δ.
  - E(T) is k-support anonymous.

(39)

ANONYMIZATION EXAMPLE: ENCRYPTION

Trans. ID  Items
1          wine
2          cigar, wine
3          cigar, tea
4          beer, cigar, wine
5          beer, tea, wine

Encrypt with k=3

[Figure: tree whose nodes carry the labels a to k and the real items beer, wine, cigar, and tea]

Trans. ID  Items
1          c, d, g
2          b, d, g
3          b, h
4          a, b, c
5          a, c, d, h
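The encrypted table in this example can be checked against the k-support anonymity condition with a short sketch. It makes a simplifying assumption: every label that appears in the transactions is treated as sensitive, and the tree structure from the figure is ignored. The function name is hypothetical; the transactions are copied from the encrypted table on this slide.

```python
from collections import Counter

# Encrypted transactions from the k=3 example on this slide.
encrypted_k3 = [
    {"c", "d", "g"},
    {"b", "d", "g"},
    {"b", "h"},
    {"a", "b", "c"},
    {"a", "c", "d", "h"},
]


def is_k_support_anonymous(transactions, k):
    """True if every item shares its support with at least k-1 other items.

    Simplification: all items that occur are treated as sensitive.
    """
    item_support = Counter(item for t in transactions for item in t)
    items_per_support = Counter(item_support.values())
    return all(items_per_support[s] >= k for s in item_support.values())


print(is_k_support_anonymous(encrypted_k3, 3))
print(is_k_support_anonymous(encrypted_k3, 4))
```

In this table the labels b, c, and d each have support 3 and the labels a, g, and h each have support 2, so every support value is shared by three labels and a guess based on exact support succeeds with probability at most 1/3.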

(40)

SUMMARY

• Many fields, such as marketing, business, healthcare, software engineering, and government, to name a few, have benefited greatly from data analysis.

• While enjoying the benefits of data analysis, it is also important to address privacy issues in both the data analysis process and the release of data.

(41)
