Preventing False Discovery in Adaptive Data Analysis
Aaron Roth
From: [email protected]
Date: 3/8/15
Subject: Gr8 investment opportunity!!!
Hi there. I’m tired of giving out this great advice for free. Let me manage your money, and I’ll continue giving you my stock prediction tips in exchange for a small cut!
Hmm…
• The chance he was right 10 times in a row if he was just randomly guessing is only $2^{-10} = 1/1024$.
• I can reject the null hypothesis that these predictions were luck.
What happened
After 10 days…
• There remain people who have received perfect predictions.
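The mechanics behind the email can be sketched as a short simulation. This is an illustration only: the starting mailing-list size `n_recipients` and the helper name `run_scam` are assumptions, not figures from the talk.

```python
import random

def run_scam(n_recipients, n_days, seed=0):
    """Each day, tell half of the remaining recipients 'up' and half 'down'.

    Only the recipients who were sent the correct call stay on the list.
    Returns how many recipients have seen a perfect track record.
    """
    rng = random.Random(seed)
    remaining = n_recipients
    for _ in range(n_days):
        market_went_up = rng.random() < 0.5  # the sender has no idea either way
        told_up = remaining // 2
        told_down = remaining - told_up
        remaining = told_up if market_went_up else told_down
    return remaining

# Hypothetical list size: 102,400 addresses, 10 days of "predictions".
survivors = run_scam(n_recipients=1024 * 100, n_days=10)
print(survivors)
```

Each survivor sees ten correct calls in a row, yet no prediction skill was involved: the sender simply covered every possible outcome.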
False discovery: a growing concern
“The Multiple Hypothesis Testing Problem”
• A finding with p-value $p \le 0.05$ has only a 5% probability of being realized “by chance”
• But we expect 5 such findings if we test 100 hypotheses, even if they are all wrong…
• Easy enough to account for if we know how many hypotheses were tested
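The "5 findings out of 100" arithmetic can be checked with a small simulation. It is a sketch that assumes every null hypothesis is true and that p-values are continuous, so each one is uniform on [0, 1]. The function name and trial counts are illustrative. The Bonferroni correction shown at the end is one standard way to account for a known number of tests; the slides do not name it.

```python
import random

def count_false_findings(n_hypotheses, alpha, rng):
    """Count 'significant' results when every null hypothesis is true.

    Under a true null, a continuous p-value is uniform on [0, 1],
    so it can be simulated directly with rng.random().
    """
    p_values = [rng.random() for _ in range(n_hypotheses)]
    return sum(p < alpha for p in p_values)

rng = random.Random(0)
n_hypotheses, alpha, n_trials = 100, 0.05, 2000
average = sum(count_false_findings(n_hypotheses, alpha, rng)
              for _ in range(n_trials)) / n_trials
print(f"average false findings per {n_hypotheses} tests: {average:.2f}")

# Bonferroni correction: test each hypothesis at level alpha / n_hypotheses,
# so the chance of *any* false finding is at most alpha (union bound).
bonferroni_level = alpha / n_hypotheses
print(f"per-test level after correction: {bonferroni_level}")
```

The correction only works because `n_hypotheses` is known in advance, which is exactly what adaptivity takes away.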
Preventing false discovery
• Decades-old subject in statistics
• Theory focuses on non-adaptive data analysis
• Powerful results, such as the Benjamini-Hochberg work on controlling the False Discovery Rate
• Lots of tools
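As a concrete instance of the tools mentioned above, here is a minimal sketch of the Benjamini-Hochberg step-up procedure. The example p-values are made up for illustration, and the guarantee assumes the hypotheses were fixed before looking at the data (and, in the basic version, that the tests are independent).

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return the indices of hypotheses rejected by the BH step-up procedure."""
    m = len(p_values)
    # Sort hypothesis indices by p-value, smallest first.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank r (1-based) with p_(r) <= (r / m) * fdr.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            cutoff_rank = rank
    # Reject every hypothesis ranked at or below that cutoff.
    return sorted(order[:cutoff_rank])

# Hypothetical p-values from ten tests chosen in advance.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.36]
print(benjamini_hochberg(p_values, fdr=0.05))
```

Note the "step-up" behavior: a hypothesis can be rejected even if its own p-value misses its threshold, as long as a higher-ranked p-value meets its threshold.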
Non-adaptive data analysis
• Specify exact experimental setup
  • e.g., hypotheses to test
• Collect data
• Run experiment
• Observe outcome
Data analyst can’t reuse data
Adaptive data analysis
Data analyst:
• Specify exact experimental setup
  • e.g., hypotheses to test
• Collect data
• Run experiment
• Observe outcome
Why adaptivity is troublesome
An adaptive data analyst is a decision tree…
[Figure: a tree of queries rooted at 𝑞1; each answer determines which query (𝑞2, 𝑞4, 𝑞5, 𝑞10, …) is asked next.]
[Figure: the same decision tree, annotated with 𝑘.]
Must account not just for the queries actually made, but also for the queries that might have been made.
(One) Formalization of the Problem: Statistical Queries
• A data universe $\mathcal{X}$ (e.g. $\mathcal{X} = \{0,1\}^d$)
• A distribution $\mathcal{D}$ over $\mathcal{X}$
• A statistical query is defined by a predicate $q : \mathcal{X} \to [0,1]$.
• The answer to a statistical query is $q(\mathcal{D}) = \mathbb{E}_{x \sim \mathcal{D}}[q(x)]$
• A statistical query oracle is an algorithm for answering statistical queries: $A : q \mapsto A(q) \in [0,1]$
• A statistical query oracle has accuracy $\epsilon$ on a collection of SQs if for every query $q$ in the collection: $|A(q) - q(\mathcal{D})| \le \epsilon$
• Statistical query oracles are built from data: i.e. $A = A_S$ for a dataset $S \sim \mathcal{D}^n$.
• Main quantity of interest: sample complexity $n$.
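Non-adaptively chosen queries are easy.
• Consider the naïve mechanism $A_S(q) = \mathbb{E}_S[q] = \frac{1}{n}\sum_{x \in S} q(x)$
Theorem: $A_S$ is $\epsilon$-accurate for any set of $k$ non-adaptively chosen queries given a data set of size at least:
$$n \ge \frac{\ln(2k/\delta)}{2\epsilon^2}$$
with probability $1 - \delta$.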
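A minimal sketch of the non-adaptive case, assuming the naive mechanism simply returns the empirical average of the query on the sample, and using Hoeffding's inequality plus a union bound for the sample size. The function names and the toy distribution are illustrative assumptions.

```python
import math
import random

def naive_mechanism(sample, query):
    """Answer a statistical query with its empirical average on the sample."""
    return sum(query(x) for x in sample) / len(sample)

def nonadaptive_sample_size(k, eps, delta):
    """Samples sufficient for k queries fixed in advance.

    Each query maps into [0, 1], so by Hoeffding's inequality
    Pr[|E_S[q] - E_D[q]| > eps] <= 2 * exp(-2 * eps^2 * n).
    Asking for k times that to be at most delta gives
    n >= ln(2k / delta) / (2 * eps^2).
    """
    return math.ceil(math.log(2 * k / delta) / (2 * eps ** 2))

# Toy check: points uniform on [0, 1); the query asks "is x below 0.3?",
# whose true answer under this distribution is 0.3.
rng = random.Random(0)
n = nonadaptive_sample_size(k=1000, eps=0.05, delta=0.05)
sample = [rng.random() for _ in range(n)]
answer = naive_mechanism(sample, lambda x: 1.0 if x < 0.3 else 0.0)
print(n, round(answer, 3))
```

The dependence on the number of queries is only logarithmic: going from a thousand to a million queries less than doubles the required sample.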
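Adaptive Queries are Harder:
Theorem: The naïve mechanism (with answers given to precision $\epsilon$) is $O(\epsilon)$-accurate for any set of $k$ adaptively chosen queries given a data set of size at least:
$$n = O\!\left(\frac{k\log(1/\epsilon) + \log(1/\delta)}{\epsilon^2}\right)$$
with probability $1 - \delta$.
Proof: Each query has at most $O(1/\epsilon)$ possible answers, so the branching factor of the analyst’s decision tree is at most $O(1/\epsilon)$. There are at most $(1/\epsilon)^{O(k)}$ queries he might ever ask. Chernoff bound. Union bound.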
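To see how much the decision-tree argument costs, the sketch below compares two union bounds: one over $k$ queries fixed in advance, and one over every query in a tree of depth $k$. It assumes answers are reported at precision $\epsilon$, giving roughly $1/\epsilon$ branches per node, and it uses Hoeffding-style constants. Both the constants and the function names are illustrative assumptions, not the exact statement from the talk.

```python
import math

def nonadaptive_bound(k, eps, delta):
    # Union bound over the k queries fixed in advance.
    return math.ceil(math.log(2 * k / delta) / (2 * eps ** 2))

def adaptive_tree_bound(k, eps, delta):
    # Union bound over every query in the analyst's decision tree:
    # roughly (1/eps)^k of them, so the log of the count is k * ln(1/eps).
    log_queries = k * math.log(1 / eps)
    return math.ceil((log_queries + math.log(2 / delta)) / (2 * eps ** 2))

for k in (10, 100, 1000):
    print(k, nonadaptive_bound(k, 0.1, 0.05), adaptive_tree_bound(k, 0.1, 0.05))
```

The first bound grows with $\log k$, the second linearly in $k$: ten times as many adaptive queries needs roughly ten times as much data.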
Question:
Differential Privacy
[Dwork-McSherry-Nissim-Smith 06]
[Figure: a dataset containing Alice’s record is fed to an algorithm; for any output r, the ratio of the probabilities Pr[r] is bounded.]
$\mathcal{X}$: The data universe.
$S \in \mathcal{X}^n$: The dataset (one element per person)
Definition: Two datasets $S, S' \in \mathcal{X}^n$ are neighbors if they differ in the data of a single individual.
Definition: An algorithm $A$ is $\epsilon$-differentially private if for all pairs of neighboring datasets $S, S'$, and for all outputs $r$:
$$\Pr[A(S) = r] \le e^{\epsilon} \Pr[A(S') = r]$$
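One standard way to satisfy this definition for a single statistical query is the Laplace mechanism. The slides do not spell it out, so this is a sketch under that assumption. The function names and the toy dataset are illustrative.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def private_query_answer(dataset, query, epsilon, rng):
    """Answer a [0, 1]-valued statistical query with epsilon-differential privacy.

    Changing one of the n records moves the empirical average by at most 1/n,
    so Laplace noise of scale 1 / (n * epsilon) is enough.
    """
    n = len(dataset)
    empirical = sum(query(x) for x in dataset) / n
    return empirical + laplace_noise(1 / (n * epsilon), rng)

# Toy dataset: 10,000 people, each holding one bit.
rng = random.Random(0)
dataset = [i % 2 for i in range(10_000)]
noisy = private_query_answer(dataset, lambda x: x, epsilon=0.5, rng=rng)
print(round(noisy, 4))
```

With many records the added noise is tiny compared to the sampling error, which is why privacy can be nearly free in terms of accuracy.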
Differential Privacy
Intuition
Differential privacy is a stability guarantee:
• Changing one data point doesn’t affect the outcome much
Stability implies generalization
The connection
Theorem (informal): Let $A$ be a query answering mechanism such that:
1. $A$ is differentially private, and
2. For any set of $k$ adaptively chosen queries $q_1, \dots, q_k$: with high probability, every answer $A(q_i)$ is close to the empirical answer $\mathbb{E}_S[q_i]$;
then, so long as the privacy parameters, the accuracy parameters and the sample size $n$ are suitably related, $A$ is also accurate with respect to the distribution $\mathcal{D}$ for any set of $k$ adaptively chosen queries, with high probability. The precise parameters are in the papers listed under “To read more”.
Recent improvement by Nissim and Stemmer, arXiv 2015
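An easier theorem:
Theorem (informal): Let $A$ be a query answering mechanism such that:
1. The output of $A$ can be described with $b$ bits, and
2. For any set of $k$ adaptively chosen queries $q_1, \dots, q_k$: $|A(q_i) - \mathbb{E}_S[q_i]| \le \epsilon$ for every $i$;
then so long as
$$n \ge \frac{b\ln 2 + \ln(2k/\delta)}{2\epsilon^2},$$
$A$ is $2\epsilon$-accurate for any set of $k$ adaptively chosen queries with probability $1 - \delta$.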
Proof of easy theorem
• Fix any data analyst. His interaction with $A$ can be described as a decision tree of depth $k$. Since the outcome of $A$ can always be described with only $b$ bits, there are at most $2^b$ trajectories through the tree that can be realized, for a total of $k \cdot 2^b$ queries that might possibly be asked.
• Chernoff bound. Union bound.
[Figure: the analyst’s decision tree of queries.]
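The "Chernoff bound. Union bound." step can be written out as follows. This is a sketch under stated assumptions: queries take values in $[0,1]$, $b$ is the description length in bits, $k$ is the number of queries, $Q_{\text{tree}}$ (a name introduced here) is the set of queries that can appear on a realizable trajectory, and Hoeffding's inequality is used as the concentration bound. Constants are not optimized.

```latex
\Pr\Big[\exists\, q \in Q_{\text{tree}} :
        \big|\mathbb{E}_S[q] - \mathbb{E}_{\mathcal{D}}[q]\big| > \epsilon\Big]
\;\le\; |Q_{\text{tree}}| \cdot 2e^{-2\epsilon^2 n}
\;\le\; k\,2^{b} \cdot 2e^{-2\epsilon^2 n}
\;\le\; \delta
\quad\text{whenever}\quad
n \;\ge\; \frac{b\ln 2 + \ln(2k/\delta)}{2\epsilon^2}.
```

The key point is that $Q_{\text{tree}}$ is determined by the analyst and the mechanism's output space before the sample $S$ is drawn, so the concentration bound applies to each of its queries even though the analyst chooses adaptively.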
Can we answer lots of queries with short description length?
• Yes, here is one easy way
• Based on [Roth, Roughgarden 2010] but without the privacy.
• Important fact: For any set of $k$ queries, there is a dataset of size $O(\log k / \epsilon^2)$ that answers all queries with $\epsilon$-accuracy.
How to answer queries.
1. Let
2. For each query :
   1. If (easy query)
      • Output
      • Set
   2. Else
      • Output (rounded to nearest multiple of )
      • Set
Some observations
• There can be at most hard queries in any sequence of queries
  • Since each hard query cuts in half, and
• All answers are -accurate.
• The answers to any sequence of queries can be
[Figure: the analyst’s decision tree of queries, annotated with 𝑘.]
To specify the trajectory down the tree, only need to specify the depth of each hard query, and its answer.
Applying the easy theorem:
• Easy corollary: There is an algorithm that can answer any sequence of $k$ adaptively chosen queries, while guaranteeing $\epsilon$-accuracy, and needing sample complexity only:
Applying the main theorem:
(Using improved parameters from [BSSU15] and [NS15] and applying the state-of-the-art private query release mechanism from [HR10])
• Corollary: There is an algorithm that can answer any sequence of $k$ adaptively chosen queries, while guaranteeing $\epsilon$-accuracy, and needing sample complexity only:
A Practical, Efficient Technique for Reusing a Holdout Set
Standard holdout method
[Figure: the data is split into training data and a holdout. The data analyst has unrestricted access to the training data; the holdout is good for one validation.]
One corollary: a reusable holdout
[Figure: the data is split into training data and a reusable holdout. The data analyst has unrestricted access to the training data; the reusable holdout can be used many times adaptively.]
A reusable holdout: Thresholdout
Theorem: Thresholdout gives accurate estimates for any sequence of adaptively chosen queries until overfitting*
* Function $q$ overfits if $|\mathbb{E}_S[q] - \mathbb{E}_{\mathcal{D}}[q]|$ exceeds the accuracy threshold.
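A simplified sketch of how a Thresholdout-style reusable holdout can be implemented, loosely following the description in “Generalization in Adaptive Data Analysis and Holdout Reuse” (listed under “To read more”). The class layout, parameter names (`threshold`, `sigma`, `budget`) and default values are assumptions for illustration, not the authors' reference implementation.

```python
import random

class Thresholdout:
    """Simplified sketch of a reusable holdout (see the assumptions above)."""

    def __init__(self, train, holdout, threshold=0.04, sigma=0.01,
                 budget=100, seed=0):
        self.train, self.holdout = train, holdout
        self.threshold, self.sigma, self.budget = threshold, sigma, budget
        self.rng = random.Random(seed)
        self.noisy_threshold = threshold + self._laplace(2 * sigma)

    def _laplace(self, scale):
        # Laplace(0, scale) as a difference of two exponentials.
        return self.rng.expovariate(1 / scale) - self.rng.expovariate(1 / scale)

    @staticmethod
    def _mean(data, query):
        return sum(query(x) for x in data) / len(data)

    def answer(self, query):
        """Estimate the query's expectation; None once the budget is spent."""
        if self.budget < 1:
            return None
        train_avg = self._mean(self.train, query)
        holdout_avg = self._mean(self.holdout, query)
        gap = abs(holdout_avg - train_avg)
        if gap > self.noisy_threshold + self._laplace(4 * self.sigma):
            # Training and holdout disagree: pay from the budget, refresh the
            # noisy threshold, and release a noisy holdout value.
            self.budget -= 1
            self.noisy_threshold = self.threshold + self._laplace(2 * self.sigma)
            return holdout_avg + self._laplace(self.sigma)
        # Otherwise the training average is already a good estimate, and
        # nothing new about the holdout is revealed.
        return train_avg

# Toy data: the integers 0..999 are "training", 1000..1999 are "holdout".
train, holdout = list(range(1000)), list(range(1000, 2000))
reusable = Thresholdout(train, holdout)
print(reusable.answer(lambda x: x % 2))
```

The design choice worth noting is that the holdout is only consulted in a noisy, thresholded way: as long as the analyst's queries do not overfit the training set, the answers come from the training data and the holdout stays essentially untouched.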
An illustrative experiment
• Data set with $2n = 20{,}000$ rows and $d = 10{,}000$ variables. Class labels in $\{-1, 1\}$
• Analyst performs stepwise variable selection:
  1. Split data into training/holdout of size $n$
No correlation between data and labels
• Data are random Gaussians
• Labels are drawn independently at random from {-1,1}
High correlation
• 20 attributes are highly correlated with target
• Remaining attributes are uncorrelated
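The "no correlation" setting can be reproduced at a much smaller scale to show how a standard holdout overfits once it is reused for variable selection. The selection rule below (keep variables whose training and holdout correlations agree in sign and both exceed $1/\sqrt{n}$) and the scaled-down sizes are assumptions for illustration, since the exact steps of the experiment are not given above.

```python
import random

def make_data(n_rows, n_vars, rng):
    """Random Gaussian attributes with independent random labels in {-1, 1}."""
    rows = [[rng.gauss(0, 1) for _ in range(n_vars)] for _ in range(n_rows)]
    labels = [rng.choice((-1, 1)) for _ in range(n_rows)]
    return rows, labels

def correlations(rows, labels):
    """Average of label * attribute for every variable (a correlation proxy)."""
    n, d = len(rows), len(rows[0])
    return [sum(rows[i][j] * labels[i] for i in range(n)) / n for j in range(d)]

def accuracy(rows, labels, signed_vars):
    """Accuracy of the classifier sign(sum of sign_j * x_j) over chosen variables."""
    correct = 0
    for x, y in zip(rows, labels):
        score = sum(s * x[j] for j, s in signed_vars)
        correct += (1 if score >= 0 else -1) == y
    return correct / len(rows)

rng = random.Random(0)
n, d = 100, 1000                     # scaled down from n = 10,000, d = 10,000
train = make_data(n, d, rng)
holdout = make_data(n, d, rng)
fresh = make_data(500, d, rng)       # data the analyst never touches

c_train, c_holdout = correlations(*train), correlations(*holdout)
cutoff = 1 / n ** 0.5
# Adaptive step: keep variables whose correlation looks strong on the training
# set AND is "confirmed", with the same sign, on the holdout.
selected = [(j, 1 if c_train[j] > 0 else -1) for j in range(d)
            if abs(c_train[j]) > cutoff and abs(c_holdout[j]) > cutoff
            and c_train[j] * c_holdout[j] > 0]

holdout_acc = accuracy(*holdout, selected)
fresh_acc = accuracy(*fresh, selected)
print(len(selected), holdout_acc, fresh_acc)
```

There is nothing to learn in this data, so accuracy on fresh data stays near chance, while the reused holdout reports a much higher accuracy. The holdout was consulted during selection and no longer gives an honest estimate.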
Lots of Future Directions
Theory
• Is it possible to obtain improved bounds? (We’ve basically hit the limits via differential privacy, but there may be other approaches)
• i.e. is differential privacy the best way to achieve generalization?
• If not, is there an information-theoretic measure that characterizes generalization for adaptive analyses? (See DFHPRR15 for a partial result)
Practice
• Design practical methods for real data analysis problems
• Thresholdout is a step in this direction, but there is more to data analysis than holdout sets.
To read more
The main theorem:
• “Preserving Statistical Validity in Adaptive Data Analysis”, Dwork, Feldman, Hardt, Pitassi, Reingold, Roth. http://arxiv.org/abs/1411.2664 2014.
Improvements (Improved parameters and more general settings):
• “More General Queries and Less Generalization Error in Adaptive Data Analysis”, Bassily, Smith, Steinke, Ullman. http://arxiv.org/abs/1503.04843 2015.
• “On the Generalization Properties of Differential Privacy”, Nissim, Stemmer. http://arxiv.org/abs/1504.05800 2015.
Description length and more general measures, as well as Thresholdout and experiments:
• “Generalization in Adaptive Data Analysis and Holdout Reuse”, Dwork, Feldman, Hardt, Pitassi, Reingold, Roth. http://arxiv.org/abs/1506.02629 2015.
For more on Differential Privacy:
• “The Algorithmic Foundations of Differential Privacy”, Dwork, Roth.