Privacy through Accountability:
A Computer Science Perspective
Anupam Datta Associate Professor
Computer Science, ECE, CyLab Carnegie Mellon University
Research Challenge
Ensure organizations respect privacy expectations in the collection, use, and disclosure of personal
information
Web Privacy
Example privacy policies:
Not use detailed location (full IP address) for advertising
Healthcare Privacy
[Figure: patient information flows among Patient, Physician, Nurse, Hospital, Drug Company, and Auditor]
Example privacy policies:
Use patient health info only for treatment, payment
A Research Area
Formalize Privacy Policies
Precise semantics of privacy concepts
(restrictions on personal information flow)
Enforce Privacy Policies
Audit and Accountability
Detect violations
Blame-assignment
Adaptive audit resource allocation
Related ideas: Barth et al Oakland 2006; May et al CSFW 2006; Weitzner et al CACM 2008, Lampson 2004
Today: Focus on Detection
Healthcare Privacy
Play in two acts
Web Privacy
A covered entity may disclose an individual’s protected health information (phi) to law-enforcement officials for the purpose of identifying an individual if the individual made a statement admitting participating in a violent crime that the covered entity believes may have caused serious physical harm to the victim
Example from HIPAA Privacy Rule
Concepts in privacy policies
Actions: send(p1, p2, m)
Roles: inrole(p2, law-enforcement)
Data attributes: attr_in(prescription, phi)
Temporal constraints: in-the-past(state(q, m))
Purposes: purp_in(u, id-criminal)
Beliefs: believes-crime-caused-serious-harm(p, q, m)
Black-and-white concepts
Detecting Privacy Violations
[Figure: the privacy policy is encoded as a computer-readable privacy policy; the audit takes this policy and the organizational audit log and detects policy violations]
Complete formalization of HIPAA Privacy Rule, GLBA
Automated audit for black-and-white policy concepts
Oracles to audit for grey policy concepts
[Image: the Oracle, the Matrix character (species: computer program; title: a program designed to investigate the human psyche)]
Policy Auditing over Incomplete Logs
With D. Garg (CMU → MPI-SWS) and
L. Jia (CMU)
2011 ACM Conference on Computer and Communications Security
Key Challenge for Auditing
Audit Logs are Incomplete
Future: store only past and current events
Example: Timely data breach notification refers to future event
Subjective: no “grey” information
Example: May not record evidence for purposes and beliefs
Spatial: remote logs may be inaccessible
Example: Logs distributed across different departments of a hospital
Abstract Model of Incomplete Logs
Model all incomplete logs uniformly as 3-valued structures
Define semantics (meanings of formulas) over 3-valued structures
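A minimal sketch of what a 3-valued structure could look like in code. The slides do not give the truth tables, so this assumes Kleene's strong three-valued connectives; the names `TV`, `and3`, `not3`, and `IncompleteLog` are hypothetical and not taken from the original work.

```python
from enum import Enum


class TV(Enum):
    """Three truth values: true, false, and unknown (not in the log)."""
    TRUE = "tt"
    FALSE = "ff"
    UNKNOWN = "uu"


def and3(a, b):
    # Kleene conjunction: false dominates, unknown is contagious otherwise.
    if a is TV.FALSE or b is TV.FALSE:
        return TV.FALSE
    if a is TV.TRUE and b is TV.TRUE:
        return TV.TRUE
    return TV.UNKNOWN


def not3(a):
    # Negation swaps true and false and leaves unknown alone.
    if a is TV.UNKNOWN:
        return TV.UNKNOWN
    return TV.FALSE if a is TV.TRUE else TV.TRUE


class IncompleteLog:
    """Atoms recorded as true or false; every other atom is unknown."""

    def __init__(self, known_true, known_false=()):
        self.known_true = set(known_true)
        self.known_false = set(known_false)

    def lookup(self, atom):
        if atom in self.known_true:
            return TV.TRUE
        if atom in self.known_false:
            return TV.FALSE
        return TV.UNKNOWN


log = IncompleteLog(known_true={("send", "UPMC", "allegeny-police", "M2")})
sent = log.lookup(("send", "UPMC", "allegeny-police", "M2"))
purpose = log.lookup(("purp_in", "id-bank-robber", "id-criminal"))
```

Here the send event is recorded, so it evaluates to true, while the purpose atom is subjective and unrecorded, so it (and any conjunction that depends on it) stays unknown rather than defaulting to false.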
reduce: The Iterative Algorithm
reduce(L, φ) = φ'
[Figure: iterative use of reduce. The policy φ0 and the logs are fed to reduce to give φ1, which is fed to reduce again to give φ2]
Syntax of Policy Logic
First-order logic with restricted quantification over infinite domains (challenge for reduce)
Can express timed temporal properties, “grey” predicates
Example from HIPAA Privacy Rule
∀p1, p2, m, u, q, t. (send(p1, p2, m) ∧ inrole(p2, law-enforcement) ∧ tagged(m, q, t, u) ∧ attr_in(t, phi)) ⊃ (purp_in(u, id-criminal)
∧ ∃m’. (state(q, m’) ∧ is-admission-of-crime(m’) ∧ believes-crime-caused-serious-harm(p1, q, m’)))
A covered entity may disclose an individual’s protected health information (phi) to law-enforcement officials for the purpose of identifying an individual if the individual made a statement admitting participating in a violent crime that the covered entity believes may have caused serious physical harm to the victim
reduce: Formal Definition
c is a formula for which finite satisfying substitutions of x can be computed
General Theorem: If initial policy passes a syntactic mode check, then finite substitutions can be computed
Applications: The entire HIPAA and GLBA Privacy Rules pass this check
φ = ∀p1, p2, m, u, q, t. (send(p1, p2, m) ∧ tagged(m, q, t, u) ∧ attr_in(t, phi)) ⊃ inrole(p2, law-enforcement) ∧ purp_in(u, id-criminal) ∧ ∃ m’. ( state(q, m’) ∧ is-admission-of-crime(m’) ∧ believes-crime-caused-serious-harm(p1, m’))
Example
{ p1 → UPMC, p2 → allegeny-police, m → M2, q → Bob, u → id-bank-robber, t → date-of-treatment } ∧ purp_in(id-bank-robber, id-criminal)
{ m’ → M1 } ∧ is-admission-of-crime(M1) ∧ believes-crime-caused-serious-harm(UPMC, M1)
Log
Jan 1, 2011: state(Bob, M1)
Jan 5, 2011: send(UPMC, allegeny-police, M2), tagged(M2, Bob, date-of-treatment, id-bank-robber)
T
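The example above can be sketched as a small program. This is a simplified illustration of the idea behind reduce, not the published algorithm: it matches the logged premise against the log, finds witnesses for the logged part of the conclusion, and leaves the grey atoms as a residual formula for a human auditor. The helper names are hypothetical, variables are written with a leading `?`, and the `attr_in` and `inrole` facts are assumed background knowledge that the slide's log does not list.

```python
def unify(pattern, fact, subst):
    """Extend subst so that pattern equals fact, or return None."""
    if len(pattern) != len(fact) or pattern[0] != fact[0]:
        return None
    out = dict(subst)
    for p, f in zip(pattern[1:], fact[1:]):
        if p.startswith("?"):
            if out.setdefault(p, f) != f:
                return None
        elif p != f:
            return None
    return out


def match_all(patterns, facts, subst=None):
    """Yield every substitution under which all patterns occur in facts."""
    subst = {} if subst is None else subst
    if not patterns:
        yield subst
        return
    for fact in facts:
        extended = unify(patterns[0], fact, subst)
        if extended is not None:
            yield from match_all(patterns[1:], facts, extended)


def instantiate(atom, subst):
    return tuple(subst.get(term, term) for term in atom)


def reduce_policy(facts, premise, logged_goal, grey_goal):
    """For each logged disclosure, return the grey atoms left to check.

    Each entry is a list of alternatives (one per witness of the logged
    goal); an empty list means no witness exists, i.e. a violation.
    """
    residual = []
    for subst in match_all(premise, facts):
        witnesses = match_all(logged_goal, facts, subst)
        residual.append([[instantiate(a, w) for a in grey_goal]
                         for w in witnesses])
    return residual


facts = [
    ("state", "Bob", "M1"),
    ("send", "UPMC", "allegeny-police", "M2"),
    ("tagged", "M2", "Bob", "date-of-treatment", "id-bank-robber"),
    ("attr_in", "date-of-treatment", "phi"),            # assumed fact
    ("inrole", "allegeny-police", "law-enforcement"),   # assumed fact
]
premise = [("send", "?p1", "?p2", "?m"),
           ("tagged", "?m", "?q", "?t", "?u"),
           ("attr_in", "?t", "phi")]
logged_goal = [("inrole", "?p2", "law-enforcement"), ("state", "?q", "?m1")]
grey_goal = [("purp_in", "?u", "id-criminal"),
             ("is-admission-of-crime", "?m1"),
             ("believes-crime-caused-serious-harm", "?p1", "?m1")]

residual = reduce_policy(facts, premise, logged_goal, grey_goal)
```

The residual contains exactly the three grey atoms shown in the example (purpose, admission of crime, belief), instantiated with UPMC, M1, and id-bank-robber; everything that the log can decide has been discharged.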
Implementation and evaluation over simulated audit logs for compliance with all 84 disclosure-related
clauses of HIPAA Privacy Rule
Performance:
Average time for checking compliance of each disclosure
of protected health information is 0.12s for a 15MB log
Mechanical enforcement:
reduce can automatically check 80% of all the atomic
predicates
Ongoing Transition Efforts
Integration of reduce algorithm into Illinois Health Information Exchange prototype
Joint work with UIUC and Illinois HLN
Auditing logs for policy compliance
Related Work
Distinguishing characteristics
1. General treatment of incompleteness in audit logs
2. Quantification over infinite domains (e.g., messages)
3. First complete formalization of HIPAA Privacy Rule and
GLBA.
Nearest neighbors
Basin et al 2010 (missing 1, weaker 2, cannot handle 3)
Lam et al 2010 (missing 1, weaker 2, cannot handle entire
3)
Weitzner et al (missing 1, cannot handle 3)
Formalizing and Enforcing
Purpose Restrictions
With M. C. Tschantz (CMU → Berkeley) and
J. M. Wing (CMU → MSR)
Goal
Give a semantics to
“Not for” purpose restrictions
“Only for” purpose restrictions that is parametric in the purpose
Provide audit algorithm for detecting violations
for that semantics
[Figure: medical record workflow. States: X-ray taken, X-ray added, Diagnosis by specialist, No diagnosis by drug company. Actions: Add x-ray, Send record]
Med records used only for
[Figure: the same workflow, with outcomes marked "Achieve purpose" and "Not achieve purpose"]
[Figure: the same workflow with probabilistic outcomes. Labels: Specialist fails, 1/4, 3/4, No diagnosis (by drug co. or specialist), Choice point, Best choice]
Planning
Thesis: An action is for a purpose iff that
action is part of a plan for furthering the
purpose
i.e., always makes the best choice for furthering the purpose
Auditing
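[Figure: audit pipeline. Inputs: auditee’s behavior, purpose restriction, decision-making model. An MDP solver computes the optimal actions for each state; the question "Actions optimal?" is combined with the policy implications. Outcomes: Obeyed, Inconclusive, Violated]
Example: restriction "Record only for treatment", behavior [ , send record], actions optimal? No, outcome: Violated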
Summary: A Sense of Purpose
Thesis: An action is for a purpose iff that action
is part of a plan for furthering the purpose
i.e., always makes the best choice for furthering the
purpose
Audit algorithm detects policy violations by
checking if observed behavior could have been
produced by optimal plan
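A minimal sketch of this audit, assuming a tiny hypothetical MDP modeled on the medical-record example: the states, the reward of 1 for reaching a diagnosis, and the assignment of 3/4 to the specialist succeeding are all illustrative assumptions, as is the mapping from the optimality check to the three outcomes. It is not the authors' implementation.

```python
from fractions import Fraction

# Hypothetical MDP: state -> action -> list of (probability, next state).
TRANSITIONS = {
    "x-ray taken": {"add x-ray": [(Fraction(1), "x-ray added")]},
    "x-ray added": {
        "send to specialist": [(Fraction(3, 4), "diagnosis"),
                               (Fraction(1, 4), "no diagnosis")],
        "send to drug company": [(Fraction(1), "no diagnosis")],
    },
    "diagnosis": {},
    "no diagnosis": {},
}
# Assumed reward: the purpose (treatment) is furthered by a diagnosis.
REWARD = {"diagnosis": Fraction(1)}


def q_value(state, action, value):
    """Expected reward of taking action in state, then acting optimally."""
    return sum(p * (REWARD.get(nxt, 0) + value[nxt])
               for p, nxt in TRANSITIONS[state][action])


def optimal_actions(sweeps=10):
    """Value iteration, then the set of optimal actions for each state."""
    value = {s: Fraction(0) for s in TRANSITIONS}
    for _ in range(sweeps):
        for s, acts in TRANSITIONS.items():
            if acts:
                value[s] = max(q_value(s, a, value) for a in acts)
    return {s: {a for a in acts if q_value(s, a, value) == value[s]}
            for s, acts in TRANSITIONS.items()}


def audit(behavior, restriction):
    """behavior: list of (state, action); restriction: 'only for' or 'not for'."""
    best = optimal_actions()
    consistent = all(action in best[state] for state, action in behavior)
    if consistent:
        # Optimal for the purpose, but possibly for other purposes too.
        return "inconclusive"
    return "violated" if restriction == "only for" else "obeyed"


to_drug_company = [("x-ray taken", "add x-ray"),
                   ("x-ray added", "send to drug company")]
to_specialist = [("x-ray taken", "add x-ray"),
                 ("x-ray added", "send to specialist")]
```

Under these assumptions, sending the record to the drug company is never optimal for treatment, so it violates an "only for treatment" restriction; sending it to the specialist is consistent with an optimal plan, which on its own proves nothing either way.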
Today: Focus on Detection
Healthcare Privacy
Play in two acts
Web Privacy
Bootstrapping Privacy Compliance in a
Big Data System
With S. Sen (CMU) and
S. Guha, S. Rajamani, J. Tsai, J. M. Wing (MSR)
2014 IEEE Symposium on Security & Privacy
Privacy Compliance for Bing
Setting:
Two Central Challenges
[Figure: Legal Team (crafts policy), Privacy Champion (interprets policy), Developer (writes code), Audit Team (verifies compliance), connected by meetings]
1. Ambiguous privacy policy
Meaning unclear
2. Huge undocumented codebases & datasets
Connection to policy unclear
1. Legalease
Clean syntax
Layered allow-deny information flow rules with exceptions
Precise semantics
No ambiguity
Focus on usability
User study of Legalease with Microsoft privacy champions promising
Example:
DENY Datatype IPAddress USE FOR PURPOSE Advertising
EXCEPT
ALLOW Datatype IPAddress: Truncated
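A minimal sketch of how a layered allow-deny rule with an exception might be evaluated. This is a simplification for illustration, not the Legalease semantics: it treats `IPAddress:Truncated` as a subtype of `IPAddress` by string prefix, and the `Clause`, `matches`, and `decide` names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Clause:
    effect: str                      # "ALLOW" or "DENY"
    datatype: str                    # e.g. "IPAddress"
    purpose: Optional[str] = None    # None means any purpose
    exceptions: List["Clause"] = field(default_factory=list)


def matches(clause, datatype, purpose):
    # Assumed subtype rule: "IPAddress:Truncated" is a kind of "IPAddress".
    type_ok = (datatype == clause.datatype
               or datatype.startswith(clause.datatype + ":"))
    purpose_ok = clause.purpose is None or clause.purpose == purpose
    return type_ok and purpose_ok


def decide(clause, datatype, purpose):
    """Return "ALLOW", "DENY", or None when the clause does not apply."""
    if not matches(clause, datatype, purpose):
        return None
    # An applicable exception overrides the enclosing clause.
    for exception in clause.exceptions:
        verdict = decide(exception, datatype, purpose)
        if verdict is not None:
            return verdict
    return clause.effect


policy = Clause("DENY", "IPAddress", "Advertising",
                exceptions=[Clause("ALLOW", "IPAddress:Truncated")])

full_ip = decide(policy, "IPAddress", "Advertising")
truncated_ip = decide(policy, "IPAddress:Truncated", "Advertising")
```

With this encoding the full IP address is denied for advertising while the truncated form is allowed, which mirrors the example policy; a use for any other purpose falls outside the clause and returns `None`.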
2. Grok
[Figure: data flow graph of processes (Process 1, Process 2, Process 4, Process 5, ...) and datasets (Dataset A to Dataset J). Jobs: NewAcct, Login, Check, GeoIP, Check Fraud, Reporting. Columns: Name, Age, IPAddress, IDX, Hash, Country, Timestamp]
Data Inventory
Annotate code + data with policy data types
Source labels propagated via data flow graph
Different noisy sources: variable name analysis, developer annotations
2. Grok
[Figure: part of the data flow graph (Datasets D, F, G, H, I, J; Process 4, Process 5; jobs GeoIP, Check Fraud, Reporting; columns IPAddress, IDX, Country)]
Example Policy Violation
IPAddress is used for reporting (advertising)
2. Grok
[Figure: the same part of the data flow graph, with IPAddress from Dataset F truncated before the Reporting job]
Example Fix
IPAddress is truncated before it is passed to reporting (advertising) job
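A minimal sketch of propagating source labels along a data flow graph and spotting the violation and the fix described above. The graph is a hypothetical two- or three-node fragment, the `Truncate` node and the `rewrites` table are illustrative assumptions, and none of this is Grok's actual interface.

```python
from collections import deque


def propagate(edges, source_labels, rewrites=None):
    """Push policy labels forward along edges until nothing changes.

    rewrites maps a node to {incoming label: outgoing label}, modeling a
    step that transforms the data (for example, truncation).
    """
    rewrites = rewrites or {}
    labels = {node: set(ls) for node, ls in source_labels.items()}
    queue = deque(labels)
    while queue:
        node = queue.popleft()
        node_rewrites = rewrites.get(node, {})
        outgoing = {node_rewrites.get(label, label) for label in labels[node]}
        for succ in edges.get(node, ()):
            current = labels.setdefault(succ, set())
            if not outgoing <= current:
                current |= outgoing
                queue.append(succ)
    return labels


sources = {"Dataset F": {"IPAddress"}}

# Violation: the raw column flows straight into the reporting job.
bad = propagate({"Dataset F": ["Reporting"]}, sources)

# Fix: a (hypothetical) truncation step sits in between.
good = propagate(
    {"Dataset F": ["Truncate"], "Truncate": ["Reporting"]},
    sources,
    rewrites={"Truncate": {"IPAddress": "IPAddress:Truncated"}},
)

violation_before = "IPAddress" in bad["Reporting"]
violation_after = "IPAddress" in good["Reporting"]
```

In the first graph the reporting job inherits the `IPAddress` label and would be flagged; in the second it only sees `IPAddress:Truncated`.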
Bootstrapping Works
Pick x% most frequently appearing column names, label them
Then propagate label using Grok flow
Pick the nodes which will label the most of the graph
~200 annotations label 60% of nodes
A small number of annotations is enough to get off the ground.
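One way to read "pick the nodes which will label the most of the graph" is as a greedy coverage problem. The sketch below makes that assumption explicit: a manual annotation on a node is taken to label every node reachable from it, and nodes are picked greedily by how many still-unlabeled nodes they would cover. The graph and function names are hypothetical.

```python
def reachable(edges, start):
    """All nodes reachable from start, including start itself."""
    seen, stack = {start}, [start]
    while stack:
        for succ in edges.get(stack.pop(), ()):
            if succ not in seen:
                seen.add(succ)
                stack.append(succ)
    return seen


def pick_annotations(edges, nodes, budget):
    """Greedily choose up to budget nodes that label the most new nodes."""
    reach = {node: reachable(edges, node) for node in nodes}
    covered, picked = set(), []
    for _ in range(budget):
        best = max(nodes, key=lambda node: len(reach[node] - covered))
        if not reach[best] - covered:
            break  # nothing left to gain
        picked.append(best)
        covered |= reach[best]
    return picked, covered


edges = {"A": ["B", "C"], "B": ["D"], "E": ["F"]}
nodes = ["A", "B", "C", "D", "E", "F"]
picked, covered = pick_annotations(edges, nodes, budget=2)
```

In this toy graph two annotations (on `A` and `E`) are enough to label all six nodes, which is the same effect, at a much smaller scale, as the roughly 200 annotations labeling 60% of nodes reported above.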
Scale
77,000 jobs run each day
By 7000 entities
300 functional groups
1.1 million unique lines of code
21% changes on avg,
daily
46 million table schemas
32 million files
Manual audit infeasible
Information flow
analysis takes ~30 mins
A Streamlined Audit Workflow
[Figure: Legal Team (crafts policy), Privacy Champ (interprets policy), Developer (writes code), Audit Team (verifies compliance)]
Legalease: A Formal Policy Specification Language
Grok: Data Inventory with Policy Datatypes
Encode, Refine
Code analysis, developer annotations
[Figure: the Checker takes annotated code and the Legalease policy and reports potential violations; follow-up steps: fix code, update Grok]
Information Flow Experiments
With Michael Carl Tschantz (CMU → UC Berkeley)
Amit Datta (CMU)
Web Tracking
[Figure: a web tracking system shown as an unknown (?). Labels: User, Search terms, Websites, Ads; confounding inputs: Other users, Advertisers]
Experimental Design
[Figure: a scientist assigns units to an experimental group (drug) and a control group (placebo)]
Information Flow Experiment
[Figure: Group 1 and Group 2 of browser instances. Labels: Black, White, "Arrested?", "Looking for?"]
Information Flow Experiments as Science
Experimental Science | Information Flow
Natural process | System in question
Population of units | Subset of interactions
… | …
Browser Instances are Not Independent
[Figure data: 17 13 13 13 12 11 10 10 8 7]
Our Idea
Use a non-parametric test
Does not require model of Google
Specifically, a permutation test
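A minimal sketch of an exact one-sided permutation test on the difference in group means. The function name and the ad counts below are made up for illustration and are not the experiment's data; the point is only that the p-value comes from relabeling the observed units, with no model of the ad system.

```python
from fractions import Fraction
from itertools import combinations


def permutation_test(experimental, control):
    """Exact one-sided p-value for mean(experimental) - mean(control).

    Enumerates every way of relabeling the pooled units into groups of
    the original sizes and counts how often the relabeled difference is
    at least as large as the observed one.
    """
    pooled = list(experimental) + list(control)
    n, k = len(pooled), len(experimental)
    total = sum(pooled)

    def difference(group_sum):
        return Fraction(group_sum, k) - Fraction(total - group_sum, n - k)

    observed = difference(sum(experimental))
    extreme = count = 0
    for indices in combinations(range(n), k):
        count += 1
        if difference(sum(pooled[i] for i in indices)) >= observed:
            extreme += 1
    return Fraction(extreme, count)


# Hypothetical counts of a given ad seen by each browser instance.
experimental_counts = [9, 11, 12]
control_counts = [1, 2, 3]
p_value = permutation_test(experimental_counts, control_counts)
```

With three units per group there are 20 possible relabelings, and only the observed one reaches the observed difference, so the p-value is 1/20. Full enumeration grows quickly with group size; larger experiments would typically sample random relabelings instead.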
Visiting Car Websites Impacts Ads
[Figure data: 0 0 2 5 6 19 22 30 30 31]
Conclusion
A rigorous methodology for information flow experiments
Connection to causality in natural sciences
Experimental design for causal determination
Significance testing with non-parametric statistics
Future work
Replicate and analyze previous experiments
systematically
Guha et al, Wills and Tatar, Sweeney
Conduct new large-scale experiments systematically
A Research Area
Formalize Privacy Policies
Precise semantics of privacy concepts
(restrictions on personal information flow)
Enforce Privacy Policies
Audit and Accountability
Detect violations
Blame-assignment
Adaptive audit resource allocation
Application Domains
Information Flow Analysis
[Figure: taxonomy of information flow analysis. Access to program? Yes: white box; No: black box. Further labels: Total, Partial, None; Testing, Experimenting, Monitoring]
Google Exhibits Complex Behavior
[Figure: ad id plotted against reload number; axis ticks 0 to 45 and 0 to 200]
Privacy as Contextual Integrity
Context-relative information flow norms
Example contexts: healthcare, friendship
Example norms: confidentiality, purpose, reciprocity
[Nissenbaum 2004; Barth-D-Mitchell-Nissenbaum 2006]
Norms to Policies
Example norm: confidentiality expectations in healthcare
Associated policy: clauses in the HIPAA Privacy Rule
Does policy reflect norm?
[Figure: Privacy Norms and Privacy Policies]