PRIVACY-PRESERVING DATA ANALYSIS AND DATA SHARING
Chih-Hua Tai
Dept. of Computer Science and Information Engineering, National Taipei University
BENEFIT OF DATA ANALYSIS
• Many fields, such as marketing, business, healthcare, software engineering, and government, to name a few, have benefited greatly from data analysis.
• Marketing / Retail
  ◦ Helps marketers:
    understand customers' behaviors
    predict what their customers want
• Finance / Banking
  ◦ Predict the risk of a loan
  ◦ Detect fraudulent credit card transactions
BENEFIT OF DATA SHARING
• The volume of data affects the quality of the results of data analysis.
• Releasing data creates more opportunities for better data analysis.
PRIVACY-PRESERVING DATA
PRIVACY ISSUES IN DATA SHARING (EX 1)
DOB      Sex     Zipcode  Disease
1/21/76  Male    53715    Heart Disease
4/13/86  Female  53715    Hepatitis
2/28/76  Male    53703    Bronchitis
1/21/76  Male    53703    Broken Arm
4/13/86  Female  53706    Flu
2/28/76  Female  53706    Hang Nail
PRIVACY ISSUES IN DATA SHARING (EX 1)
Name DOB Sex Zipcode
Alice 1/21/76 Male 53715
Bob 1/10/81 Female 55410
Carol 10/1/44 Female 90210
Dan 2/21/84 Male 02174
Name   DOB      Sex     Zipcode
Alice  1/21/76  Female  53715
Bob    1/10/81  Female  55410
Carol  10/1/44  Female  90210
Dan    2/21/84  Male    02174
Ellen  4/19/72  Female  02237
DOB      Sex     Zipcode  Disease
1/21/76  Female  53715    Heart Disease
4/13/86  Female  53715    Hepatitis
2/28/76  Male    53703    Bronchitis
1/21/76  Male    53703    Broken Arm
4/13/86  Female  53706    Flu
2/28/76  Female  53706    Hang Nail
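The two tables above can be linked on the shared (DOB, Sex, Zipcode) columns. The following is a minimal sketch of that join, not a method taken from the slides; the names `named_records`, `released_records`, and `link` are hypothetical and chosen only for illustration.

```python
# Table with names, as on the slide (DOB, Sex, Zipcode are shared columns).
named_records = [
    ("Alice", "1/21/76", "Female", "53715"),
    ("Bob", "1/10/81", "Female", "55410"),
    ("Carol", "10/1/44", "Female", "90210"),
    ("Dan", "2/21/84", "Male", "02174"),
    ("Ellen", "4/19/72", "Female", "02237"),
]

# Released table without names, as on the slide.
released_records = [
    ("1/21/76", "Female", "53715", "Heart Disease"),
    ("4/13/86", "Female", "53715", "Hepatitis"),
    ("2/28/76", "Male", "53703", "Bronchitis"),
    ("1/21/76", "Male", "53703", "Broken Arm"),
    ("4/13/86", "Female", "53706", "Flu"),
    ("2/28/76", "Female", "53706", "Hang Nail"),
]


def link(named, released):
    """Join both tables on (DOB, Sex, Zipcode); keep only unique matches."""
    diseases_by_key = {}
    for dob, sex, zipcode, disease in released:
        diseases_by_key.setdefault((dob, sex, zipcode), []).append(disease)

    matches = []
    for name, dob, sex, zipcode in named:
        diseases = diseases_by_key.get((dob, sex, zipcode), [])
        if len(diseases) == 1:  # exactly one released row fits this person
            matches.append((name, diseases[0]))
    return matches


reidentified = link(named_records, released_records)
print(reidentified)
```

Removing the name column does not help here, because the remaining three columns together still single out one person.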
PRIVACY ISSUES IN DATA SHARING (EX 2)
• Attribute Linkage
PRIVACY ISSUES IN DATA SHARING (EX 3)
Attributes: Name, Salary, …
Links: Friends, Neighborhood, …
Communities: Interests, Activities, …
PRIVACY ISSUES ON SOCIAL NETWORKS
• Personal information can leak even if the vertex identities are hidden…
Many kinds of information can be used to re-associate a vertex with its identity.
  ◦ Vertex degree: k-degree anonymity, …
  ◦ Neighborhood configuration: k-neighborhood anonymity, k-automorphism anonymity, k-isomorphism anonymity, grouping-and-collapsing, …
FRIENDSHIP ATTACK
• There is still another type of information that can be used for vertex re-identification: the friendship attack.
PRIVACY CONCERNS IN DATA SHARING
• Personal information can leak even if the vertex identities are hidden…
  ◦ E.g., Friendship attacks
  ◦ E.g., Community Identification
C.-H. Tai, P. S. Yu, D.-N. Yang, and M.-S. Chen, "Structural Diversity for Privacy in Publishing Social Networks," In SDM, 2011; "Structural Diversity for Resisting Community Identification in Published Social Networks," IEEE Trans. Knowl. Data Eng. 26(1): 235-252 (2014).
C.-H. Tai, P. S. Yu, D.-N. Yang, and M.-S. Chen, "Privacy-preserving Social Network Publication Against Friendship Attacks," In KDD, 2011.
C.-H. Tai, P.-J. Tseng, P. S. Yu, and M.-S. Chen, "Identities Anonymization in Dynamic Social Networks," In ICDM, 2011.
ENFORCING STRUCTURAL DIVERSITY FOR PRIVACY-PRESERVING RELEASE
ATTRIBUTE LINKAGE: L-DIVERSITY
• Distinct l-diversity
  At least l distinct values of a sensitive attribute must appear within each group of records that share the same quasi-identifier (QI) values.
• QI: attributes that could potentially re-identify records.
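The distinct l-diversity condition can be checked directly on a table. The sketch below is illustrative only: the generalized table, the column names, and the function `is_distinct_l_diverse` are hypothetical and do not come from the slides.

```python
from collections import defaultdict


def is_distinct_l_diverse(records, qi_columns, sensitive_column, l):
    """Return True if every QI group holds at least l distinct sensitive values."""
    values_by_group = defaultdict(set)
    for row in records:
        key = tuple(row[column] for column in qi_columns)
        values_by_group[key].add(row[sensitive_column])
    return all(len(values) >= l for values in values_by_group.values())


# Hypothetical generalized table: birth year and a masked zipcode act as the QI.
records = [
    {"birth_year": "1976", "zipcode": "537**", "disease": "Heart Disease"},
    {"birth_year": "1976", "zipcode": "537**", "disease": "Bronchitis"},
    {"birth_year": "1976", "zipcode": "537**", "disease": "Broken Arm"},
    {"birth_year": "1986", "zipcode": "537**", "disease": "Hepatitis"},
    {"birth_year": "1986", "zipcode": "537**", "disease": "Flu"},
]

qi = ["birth_year", "zipcode"]
two_diverse = is_distinct_l_diverse(records, qi, "disease", 2)
three_diverse = is_distinct_l_diverse(records, qi, "disease", 3)
print(two_diverse, three_diverse)
```

In this hypothetical table the 1986 group has only two distinct diseases, so the table is 2-diverse but not 3-diverse.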
L-DIVERSITY
COMMUNITY IDENTIFICATION
Vertex identification is considered to be an important
privacy issue in publishing social networks.
◦ k-degree anonymity, k-neighborhood anonymity, …
In addition to a vertex identity, each individual is also associated with a community identity.
◦ Could be used to infer the political party affiliation or disease
information sensitive to the public.
◦ Is a kind of structural information
COMMUNITY IDENTIFICATION
Community information is explicitly given:
Ex. Alice has recently learned that…
  ◦ Mark is sick
  ◦ Mark participates in this social network
  ◦ Mark has 3 friends (vertex degree attack)
COMMUNITY IDENTIFICATION
Community information is not given:
Ex.
Alice knows Mark participates in this social network and
has 3 friends. (vertex degree attack)
→ Alice can approximate Mark's neighborhood.
NEW PRIVACY MODEL AGAINST COMMUNITY IDENTIFICATION
• L-Structural Diversity
  To protect against the vertex degree attack, for each vertex there should be other vertices with the same degree located in at least L-1 other communities.
PRIVACY MODEL: KW-STRUCTURAL DIVERSITY ANONYMITY
• A release of a social network satisfies L-structural diversity if, for every node, there exists an L-shielding group.
• A group θ_d, consisting of all vertices of degree d, is an L-shielding group if there is a vertex subset θ ⊆ θ_d s.t.
  (1) |θ| ≥ L, and
  (2) for any two vertices u and v in θ, C_u ∩ C_v = ∅, where C is the community identity.
Ex. Mary has four friends.
???
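The L-shielding condition can be checked mechanically under a simplifying assumption. The sketch below assumes each vertex belongs to exactly one community, so condition (2) reduces to "the degree group spans at least L distinct communities"; with overlapping communities the check would be harder. The graph and the function name `satisfies_l_structural_diversity` are hypothetical.

```python
from collections import defaultdict


def satisfies_l_structural_diversity(edges, community, L):
    """Check that every degree group spans at least L distinct communities.

    Assumption: `community` maps each vertex to exactly one community label.
    """
    degree = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1

    communities_by_degree = defaultdict(set)
    for vertex, label in community.items():
        communities_by_degree[degree[vertex]].add(label)

    return all(len(labels) >= L for labels in communities_by_degree.values())


# Hypothetical graph: one triangle in community A and one in community B.
edges = [("a1", "a2"), ("a2", "a3"), ("a1", "a3"),
         ("b1", "b2"), ("b2", "b3"), ("b1", "b3")]
community = {"a1": "A", "a2": "A", "a3": "A", "b1": "B", "b2": "B", "b3": "B"}
balanced = satisfies_l_structural_diversity(edges, community, 2)

# Adding a4 to community A creates degrees 1 and 3 that occur only in A.
edges_skewed = edges + [("a1", "a4")]
community_skewed = dict(community, a4="A")
skewed = satisfies_l_structural_diversity(edges_skewed, community_skewed, 2)
print(balanced, skewed)
```

In the second graph an attacker who knows a victim has degree 1 or 3 learns that the victim is in community A, which is exactly what the model is meant to prevent.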
THE ANONYMIZATION
• Problem formulation:
  Given a graph G(V, E, C) and an integer L, 1 ≤ L ≤ |C|, the problem is to anonymize G to satisfy L-structural diversity such that information distortion is minimized.
• The challenges:
  ◦ Trade-off between data utility and data privacy
  ◦ Scalability of anonymization techniques
THE PROBLEM IN DYNAMIC SCENARIOS…
• A dynamic social network will be sequentially released.
• An attacker can monitor a victim for a period w.
• Therefore, the adversary knowledge includes:
  ◦ The releases G_{t-w+1}, G_{t-w+2}, …, G_t during w
  ◦ A degree sequence Δ_v^w = (d_v^{t-w+1}, d_v^{t-w+2}, …, d_v^t) of a victim v during w
[Figure: releases G_1 and G_2]
Ex. John has two friends at time 1, and makes a new friend at time 2.
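The effect of observing several releases can be illustrated with a small sketch. It assumes that the anonymized vertex ids stay the same across releases, which the slides do not state; the two toy graphs and the function names are hypothetical. The degree sequence (2, 3) corresponds to the John example above.

```python
def degree_in(edges, vertex):
    """Number of edges in this release that touch the vertex."""
    return sum(1 for u, v in edges if vertex in (u, v))


def candidates(releases, degree_sequence):
    """Vertices whose degrees over the releases equal the known sequence."""
    vertices = {x for edges in releases for edge in edges for x in edge}
    target = tuple(degree_sequence)
    return sorted(
        v for v in vertices
        if tuple(degree_in(edges, v) for edges in releases) == target
    )


# Hypothetical releases at time 1 and time 2 (anonymized integer ids).
g1 = [(1, 2), (1, 3), (2, 3), (4, 5)]
g2 = g1 + [(1, 4)]  # vertex 1 gains one friend at time 2

after_one_release = candidates([g1], [2])
after_two_releases = candidates([g1, g2], [2, 3])
print(after_one_release, after_two_releases)
```

With one release, three vertices fit "two friends"; with both releases, only one vertex fits "two friends, then three", so the victim is singled out.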
PRIVACY MODEL: KW-STRUCTURAL DIVERSITY ANONYMITY
• Dynamic scenarios of w > 1
  ◦ A consistent group Θ_Δ is the set of vertices that always share the same degree during w.
  ◦ A consistent group Θ_Δ is L-shielding if, at each time instant t in w, there is a vertex subset Θ^t ⊆ Θ_Δ s.t.
    (1) |Θ^t| ≥ L, and
    (2) for any two vertices u and v in Θ^t, C_u^t ∩ C_v^t = ∅, where C^t is the community identity at time t.
The adversary knowledge includes:
  1. The releases G_{t-w+1}, G_{t-w+2}, …, G_t during w
  2. A degree sequence Δ_v^w = (d_v^{t-w+1}, d_v^{t-w+2}, …, d_v^t) of a victim v during w
  …
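The dynamic condition can be sketched in the same spirit. The code below assumes that the same vertices appear in every release and that each vertex has exactly one community per time instant, so the disjointness condition reduces to counting distinct communities. All data and names (`consistent_groups`, `is_protected`) are hypothetical.

```python
from collections import defaultdict


def consistent_groups(degree_by_time):
    """Group vertices by their full degree sequence over the window.

    degree_by_time: list with one dict (vertex -> degree) per time instant.
    """
    groups = defaultdict(list)
    for vertex in degree_by_time[0]:
        sequence = tuple(degrees[vertex] for degrees in degree_by_time)
        groups[sequence].append(vertex)
    return groups


def is_protected(degree_by_time, community_by_time, L):
    """Every consistent group must span at least L communities at every instant."""
    for members in consistent_groups(degree_by_time).values():
        for community in community_by_time:
            if len({community[v] for v in members}) < L:
                return False
    return True


# Hypothetical window of w = 2 releases over four vertices.
degrees = [{"a": 2, "b": 2, "c": 1, "d": 1},
           {"a": 3, "b": 3, "c": 1, "d": 1}]
communities = [{"a": "X", "b": "Y", "c": "X", "d": "Y"},
               {"a": "X", "b": "Y", "c": "X", "d": "Y"}]
protected = is_protected(degrees, communities, 2)

# If b moves to community X at the second instant, group (2, 3) is exposed.
communities_moved = [communities[0], {"a": "X", "b": "X", "c": "X", "d": "Y"}]
exposed = is_protected(degrees, communities_moved, 2)
print(protected, exposed)
```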
THE ANONYMIZATION
• Problem formulation:
  Suppose that every vertex in a series of sequential releases G_{t-w+1}, G_{t-w+2}, …, G_{t-1} is protected.
  Given G_{t-w+1}, G_{t-w+2}, …, G_{t-1} and L, anonymize the current social network G_t s.t. every vertex is protected in an L-shielding consistent group.
• The challenges:
  ◦ The anonymization depends not only on the current social network but also on the previous w-1 releases.
  ◦ Searching through all the w-1 releases to eliminate
PRIVACY-PRESERVING DATA
DATA PRIVACY VS. DATA ANALYSIS
• Data mining has shown great promise in various fields.
  Business, medical treatment, networks, bioinformatics, ...
• For those who lack expertise in data analysis and/or computing resources...
[Figure: Data Owner and Mining Services Provider (Cloud Computing)]
DATA PRIVACY VS. DATA ANALYSIS
• Data Is Money!!!
[Figure: Data Owner and Mining Services Provider (Cloud Computing); "Privacy?!"]
PRIVACY CONCERNS IN DATA SHARING
• Privacy-preserving outsourcing techniques without sacrificing data utility
  ◦ E.g., Mining patterns
  ◦ E.g., Similarity measurement
Y.-W. Chu, C.-H. Tai, M.-S. Chen, and P. S. Yu, "Privacy-preserving SimRank over Distributed Information Network," In ICDM, 2012.
C.-H. Tai, P. S. Yu, and M.-S. Chen, "k-Support Anonymity based on Pseudo Taxonomy for Outsourcing of Frequent Itemset Mining," In KDD, 2010.
C.-H. Tai, J.-W. Huang, and M.-H. Chung, "Privacy Preserving Frequent Pattern Mining on Multi-Cloud Environment," In SBAST, 2013.
PRIVACY-PRESERVING OUTSOURCING OF FREQUENT ITEMSET MINING (FIM)
• Discover what happens frequently
Trans. ID  Items
1          wine
2          cigar, wine
3          cigar, tea
4          beer, cigar, wine
5          beer, tea, wine
When the threshold is set to 3 (= 60%), {wine} and {cigar} are frequent.
When the threshold is set to 2 (= 40%), {wine}, {cigar}, {tea}, {beer}, {wine, cigar}, and {wine, beer} are frequent.
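The frequent itemsets listed above can be reproduced with a brute-force count. This is a minimal sketch that enumerates every candidate itemset, which is only practical for tiny examples like this one; the function name `frequent_itemsets` is hypothetical.

```python
from itertools import combinations

# The five transactions from the slide.
transactions = [
    {"wine"},
    {"cigar", "wine"},
    {"cigar", "tea"},
    {"beer", "cigar", "wine"},
    {"beer", "tea", "wine"},
]


def frequent_itemsets(transactions, min_support):
    """Map each itemset with support >= min_support to its support count."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            # Support = number of transactions containing the whole candidate.
            support = sum(1 for t in transactions if set(candidate) <= t)
            if support >= min_support:
                result[frozenset(candidate)] = support
    return result


at_three = frequent_itemsets(transactions, 3)
at_two = frequent_itemsets(transactions, 2)
print(sorted(sorted(s) for s in at_three))
print(sorted(sorted(s) for s in at_two))
```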
THE RISKS OF OUTSOURCING FIM
• Encryption/decryption is believed to be a possible solution.
[Figure: Data Owner and Mining Services Provider (Cloud Computing)]
How to achieve the encryption and decryption?
  ◦ Privacy protected
  ◦ Correct mining results
THE RISKS OF OUTSOURCING FIM
Original transactions:
Trans. ID  Items
1          wine
2          cigar, wine
3          cigar, tea
4          beer, cigar, wine
5          beer, tea, wine
Encrypt →
Trans. ID  Items
1          a
2          a, c
3          c, d
4          a, b, c
5          a, b, d
THE RISKS OF OUTSOURCING FIM
• Top frequency attack
  Wine is the most frequent item → 'a' is 'wine'
• Approximate support attack
  The support of cigar is about 55%~60% → 'c' is 'cigar'
Original transactions:
Trans. ID  Items
1          wine
2          cigar, wine
3          cigar, tea
4          beer, cigar, wine
5          beer, tea, wine
Encrypt →
Trans. ID  Items
1          a
2          a, c
3          c, d
4          a, b, c
5          a, b, d
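Both attacks rely only on item supports, which a one-to-one substitution leaves unchanged. The sketch below replays them on the encrypted table from the slide; the function names are hypothetical, and the attacker's background knowledge (wine is the top item, cigar has 55% to 60% support) is taken from the bullets above.

```python
from collections import Counter

# Encrypted transactions from the slide (one-to-one item substitution).
encrypted = [{"a"}, {"a", "c"}, {"c", "d"}, {"a", "b", "c"}, {"a", "b", "d"}]


def supports(transactions):
    """Relative support of each item: fraction of transactions containing it."""
    counts = Counter(item for t in transactions for item in t)
    total = len(transactions)
    return {item: count / total for item, count in counts.items()}


def top_frequency_attack(transactions):
    """Return the code with the highest support."""
    item_support = supports(transactions)
    return max(item_support, key=item_support.get)


def approximate_support_attack(transactions, low, high):
    """Return all codes whose support lies within [low, high]."""
    return sorted(
        item for item, support in supports(transactions).items()
        if low <= support <= high
    )


wine_code = top_frequency_attack(encrypted)
cigar_codes = approximate_support_attack(encrypted, 0.55, 0.60)
print(wine_code, cigar_codes)
```

Because exactly one code falls in each range, the attacker recovers both items without ever seeing the mapping.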
The Risks of Outsourcing FIM
The support information about the frequent itemsets can be used to effectively reveal the raw data as well as the sensitive information from the anonymized transactions.
T. Mielikäinen. Privacy problems with anonymized transaction databases. In Proc. of Discovery Science,
RELATED WORKS
• Encrypt each real item with a one-to-many mapping function.
  Wong, W. K., Cheung, D. W., Hung, E., Kao, B., Mamoulis, N.: Security in Outsourcing of Association Rule Mining. In: Proc. of VLDB, 2007.
• However, it does not try to anonymize the support information.
• It was recently broken.
  Molloy, I., Li, N., Li, T.: On the (In)Security and (Im)Practicality of Outsourcing Precise Association Rule Mining. In: Proc. of ICDM, 2009.
K-SUPPORT ANONYMITY & ANONYMIZATION
• For every sensitive item, there are at least k-1 other items with the same support.
  ◦ The probability of an item being correctly re-identified is limited to 1/k, even when the precise support information is known.
• Given a transactional database T, encrypt T into E(T) such that
  ◦ there exists a decryption function D such that MiningResult(T, Δ) = D(MiningResult(E(T), Δ)) for any minimum support Δ, and
  ◦ E(T) is k-support anonymous.
ANONYMIZATION EXAMPLE: ENCRYPTION
Original transactions:
Trans. ID  Items
1          wine
2          cigar, wine
3          cigar, tea
4          beer, cigar, wine
5          beer, tea, wine
Encrypt with k=3 →
[Figure: taxonomy with nodes a, f, beer, wine, j, cigar, e, b, i, k, c, d, g, h, tea]
Trans. ID  Items
1          c, d, g
2          b, d, g
3          b, h
4          a, b, c
5          a, c, d, h
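The k-support property of the encrypted table above can be verified by counting. This is a simplified sketch: it treats every item that appears in the transactions as sensitive and ignores the taxonomy nodes, both of which are assumptions made for illustration. The function name `is_k_support_anonymous` is hypothetical.

```python
from collections import Counter


def is_k_support_anonymous(transactions, k):
    """Check that every item shares its support with at least k-1 other items."""
    item_support = Counter(item for t in transactions for item in t)
    items_per_support = Counter(item_support.values())
    return all(items_per_support[s] >= k for s in item_support.values())


# Encrypted transactions from the k=3 example slide.
encrypted_k3 = [
    {"c", "d", "g"},
    {"b", "d", "g"},
    {"b", "h"},
    {"a", "b", "c"},
    {"a", "c", "d", "h"},
]

# Plain one-to-one substitution from the earlier risk slides, for comparison.
substituted = [{"a"}, {"a", "c"}, {"c", "d"}, {"a", "b", "c"}, {"a", "b", "d"}]

k3_ok = is_k_support_anonymous(encrypted_k3, 3)
k4_ok = is_k_support_anonymous(encrypted_k3, 4)
substitution_ok = is_k_support_anonymous(substituted, 2)
print(k3_ok, k4_ok, substitution_ok)
```

In the k=3 table, items b, c, and d each have support 3 and items a, g, and h each have support 2, so no item can be told apart from two others by support alone.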
SUMMARY
• Many fields, such as marketing, business, healthcare, software engineering, and government, to name a few, have benefited greatly from data analysis.
• While enjoying the benefits of data analysis, it is also important to address privacy issues in both the data analysis process and the release of data.