aaronroth.pptx

(1)

Preventing False Discovery in Adaptive Data Analysis

Aaron Roth

(2)

From: [email protected]

To: [email protected]

Date: 2/27/15

Subject: Gr8 investment tip!!!

(3)

Date: 2/28/15

Subject: Gr8 investment tip!!!

(4)

Date: 3/1/15

(5)

Date: 3/2/15

(6)

Date: 3/3/15

(7)

Date: 3/4/15

(8)

Date: 3/5/15

(9)

Date: 3/6/15

(10)

Date: 3/7/15

(11)

Date: 3/8/15

(12)

Date: 3/8/15

Subject: Gr8 investment opportunity!!!

Hi there. I’m tired of giving out this great advice for free. Let me manage your money, and I’ll continue giving you my stock prediction tips in exchange for a small cut!

(13)

Hmm…

• The chance he was right 10 times in a row if he was just randomly guessing is only $2^{-10} \approx 0.001$.

I can reject the null hypothesis that these predictions were luck.

(14)

What happened

(15)

What happened

(16)

What happened

(17)

After 10 days…

• There remain people who have received perfect predictions.
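A minimal sketch of the scheme behind these emails, assuming the sender starts from a hypothetical mailing list and each day tells half of the remaining recipients that the stock will go up and the other half that it will go down. The list size and the function name are illustrative and do not come from the talk.

```python
def perfect_streak_survivors(n_recipients: int, days: int) -> int:
    """Recipients who have seen only correct predictions after `days` days.

    Each day half of the remaining recipients are told "up" and half "down",
    so whichever way the stock moves, half of them saw a correct call.
    """
    remaining = n_recipients
    for _ in range(days):
        remaining //= 2  # the half that received the wrong prediction drops out
    return remaining


# Hypothetical list size; any large number behaves the same way.
print(perfect_streak_survivors(1_024_000, 10))  # 1000
print(0.5 ** 10)  # chance of 10 correct coin-flip guesses: 0.0009765625
```

Each survivor sees a record that looks significant in isolation, even though no prediction skill was involved.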

(18)

False discovery: a growing concern

(19)

“The Multiple Hypothesis Testing Problem”

• A finding with p-value 0.05 has only a 5% probability of being realized “by chance”
• But we expect 5 such findings if we test 100 hypotheses, even if they are all wrong…
• Easy enough to account for if we know how many
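The arithmetic above can be checked with a short simulation. This is a sketch under the assumption that every null hypothesis is true, so each p-value is uniform on [0, 1]. The function name, the seed, and the Bonferroni correction shown for comparison are additions for illustration.

```python
import random


def count_false_discoveries(num_hypotheses: int, threshold: float, seed: int = 0) -> int:
    """Count p-values below `threshold` when every null hypothesis is true.

    Under a true null hypothesis a continuous p-value is uniform on [0, 1].
    """
    rng = random.Random(seed)
    p_values = [rng.random() for _ in range(num_hypotheses)]
    return sum(p < threshold for p in p_values)


num_hypotheses = 100
alpha = 0.05

expected = num_hypotheses * alpha               # expected number of false findings
bonferroni_threshold = alpha / num_hypotheses   # per-test level once the count is known

print(expected)
print(count_false_discoveries(num_hypotheses, alpha))
print(count_false_discoveries(num_hypotheses, bonferroni_threshold))
```

The correction only works because the number of hypotheses is fixed in advance, which is exactly what adaptive analysis breaks.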

(20)

Preventing false discovery

Decades-old subject in Statistics

Theory focuses on non-adaptive data analysis

Powerful results such as the Benjamini-Hochberg work on controlling the False Discovery Rate

Lots of tools:
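As one concrete instance of the tools mentioned on this slide, here is a minimal sketch of the Benjamini-Hochberg step-up procedure. It assumes the p-values come from a fixed, non-adaptively chosen set of hypotheses; the function name and the example p-values are made up for illustration.

```python
def benjamini_hochberg(p_values: list[float], fdr: float = 0.05) -> list[bool]:
    """Return, for each hypothesis, whether it is rejected at FDR level `fdr`."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) with p_(k) <= (k / m) * fdr.
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            cutoff = rank
    # Reject the hypotheses with the `cutoff` smallest p-values.
    rejected = [False] * m
    for idx in order[:cutoff]:
        rejected[idx] = True
    return rejected


print(benjamini_hochberg([0.01, 0.04, 0.03, 0.5]))  # [True, False, False, False]
```

The thresholds depend on the total number of hypotheses `m`, so the procedure needs that number to be known before the data is examined.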

(21)

Non-adaptive data analysis

Data analyst:
• Specify exact experimental setup
  • e.g., hypotheses to test
• Collect data
• Run experiment
• Observe outcome

Can’t reuse data

(22)

Adaptive data analysis

Data analyst:
• Specify exact experimental setup
  • e.g., hypotheses to test
• Collect data
• Run experiment
• Observe outcome

(23)

Why adaptivity is troublesome

(24)

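An adaptive data analyst is a decision tree…

[Figure: decision tree of queries rooted at $q_1$, with nodes $q_2$, $q_4$, $q_5$, $q_{10}$, $q_{11}$, $q_{13}$, $q_{14}$, $q_{16}$, $q_{24}$, $q_{37}$, …]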

(25)

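An adaptive data analyst is a decision tree…

[Figure: decision tree of queries of depth $k$, rooted at $q_1$, with nodes $q_2$, $q_4$, $q_5$, $q_{10}$, $q_{11}$, $q_{13}$, $q_{14}$, $q_{16}$, $q_{24}$, $q_{37}$, …]

Must account not just for the queries actually made, but the queries that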

(26)

(One) Formalization of the Problem: Statistical Queries

• A data universe $\mathcal{X}$ (e.g. $\mathcal{X} = \{0,1\}^d$)
• A distribution $\mathcal{D}$ over $\mathcal{X}$

(27)

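(One) Formalization of the Problem: Statistical Queries

• A statistical query is defined by a predicate $q : \mathcal{X} \to [0,1]$ on the data universe $\mathcal{X}$.
• The answer to a statistical query is $\mathbb{E}_{\mathcal{D}}[q] = \mathbb{E}_{x \sim \mathcal{D}}[q(x)]$, where $\mathcal{D}$ is the data distribution.
• A statistical query oracle is an algorithm $A$ for answering statistical queries: given $q$, it returns an estimate $A(q) \in [0,1]$.
• A statistical query oracle has accuracy $\alpha$ on a collection of SQs if for every query $q$ in the collection, $|A(q) - \mathbb{E}_{\mathcal{D}}[q]| \le \alpha$.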

(28)

(One) Formalization of the Problem: Statistical Queries

• Statistical query oracles are built from data:
  • i.e. $A = A(S)$ for a dataset $S$ of $n$ points drawn i.i.d. from the distribution $\mathcal{D}$.
• Main quantity of interest: sample complexity.

(29)

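Non-adaptively chosen queries are easy.

• Consider the naïve mechanism $A(q) = \mathbb{E}_S[q] = \frac{1}{n}\sum_{x \in S} q(x)$, the empirical average of $q$ on the dataset $S$.

Theorem: $A$ is $\alpha$-accurate for any set of $k$ non-adaptively chosen queries, with probability $1-\beta$, given a data set of size at least:

$$n = O\!\left(\frac{\log(k/\beta)}{\alpha^2}\right)$$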
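A minimal sketch of the naïve mechanism and of the sample size that a Chernoff (Hoeffding) bound plus a union bound gives for a fixed set of queries. The names `k`, `alpha`, and `beta` (number of queries, accuracy, failure probability) and the explicit constants are assumptions made here; the slide's own formula did not survive extraction.

```python
import math
from typing import Callable, Sequence

Query = Callable[[int], float]  # maps a data point to a value in [0, 1]


def naive_answer(sample: Sequence[int], query: Query) -> float:
    """Answer a statistical query with its empirical average on the sample."""
    return sum(query(x) for x in sample) / len(sample)


def nonadaptive_sample_size(k: int, alpha: float, beta: float) -> int:
    """Hoeffding + union bound over k fixed queries: n >= ln(2k / beta) / (2 alpha^2)."""
    return math.ceil(math.log(2 * k / beta) / (2 * alpha ** 2))


print(naive_answer([0, 1, 1, 0], lambda x: float(x)))   # 0.5
print(nonadaptive_sample_size(k=1000, alpha=0.1, beta=0.05))  # 530
```

The required sample size grows only logarithmically in the number of queries, which is why the non-adaptive case is considered easy.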

(30)

Adaptive Queries are Harder:

Theorem: The naïve mechanism $A$ is $\alpha$-accurate for any set of $k$ adaptively chosen queries, with probability $1-\beta$, given a data set of size at least:

$$n = O\!\left(\frac{k\log(1/\alpha) + \log(1/\beta)}{\alpha^2}\right)$$

Proof: Each query has at most $1/\alpha$ possible answers, so the branching factor of the analyst’s decision tree is at most $1/\alpha$. There are at most $(1/\alpha)^k$ queries he might ever ask. Chernoff bound. Union bound.
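To make the gap between the two settings concrete, the sketch below applies the same Hoeffding-plus-union-bound calculation to both. It assumes answers are rounded to multiples of `alpha`, so the analyst's decision tree has branching factor `1/alpha` and depth `k`. The function names and constants are illustrative, not taken from the talk.

```python
import math


def union_bound_sample_size(log_num_queries: float, alpha: float, beta: float) -> int:
    """n >= (ln(#queries) + ln(2 / beta)) / (2 alpha^2), via Hoeffding + union bound."""
    return math.ceil((log_num_queries + math.log(2 / beta)) / (2 * alpha ** 2))


def nonadaptive_n(k: int, alpha: float, beta: float) -> int:
    # Only the k queries fixed in advance need to be covered.
    return union_bound_sample_size(math.log(k), alpha, beta)


def adaptive_n(k: int, alpha: float, beta: float) -> int:
    # Every query in the tree must be covered: at most (1/alpha)^k of them.
    return union_bound_sample_size(k * math.log(1 / alpha), alpha, beta)


for k in (10, 100, 1000):
    print(k, nonadaptive_n(k, 0.1, 0.05), adaptive_n(k, 0.1, 0.05))
```

Under these assumptions the adaptive bound grows linearly in `k`, while the non-adaptive one grows logarithmically.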

(31)

Question:

(32)

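Differential Privacy
[Dwork-McSherry-Nissim-Smith 06]

[Figure: an algorithm is run on a dataset with and without Alice’s data; for every output r, the ratio of the two probabilities Pr[r] is bounded.]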

(33)

Differential Privacy

$\mathcal{X}$: The data universe.

$S \in \mathcal{X}^n$: The dataset (one element per person)

Definition: Two datasets $S, S' \in \mathcal{X}^n$ are neighbors if they differ in the data of a single individual.

(34)

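Differential Privacy

$\mathcal{X}$: The data universe.

$S \in \mathcal{X}^n$: The dataset (one element per person)

Definition: An algorithm $A$ is $\epsilon$-differentially private if for all pairs of neighboring datasets $S, S'$, and for all outputs $r$:

$$\Pr[A(S) = r] \le e^{\epsilon} \cdot \Pr[A(S') = r]$$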
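A small worked instance of the definition, using randomized response on a single person's bit. Randomized response is a standard textbook mechanism chosen here for illustration and is not discussed in the talk. The two "neighboring datasets" are the bit being 0 or 1, and the probabilities are computed exactly so the ratio bound can be checked directly.

```python
import math
import random


def randomized_response(bit: int, epsilon: float, rng: random.Random) -> int:
    """Report the true bit with probability e^eps / (1 + e^eps), else flip it."""
    p_truth = math.exp(epsilon) / (1 + math.exp(epsilon))
    return bit if rng.random() < p_truth else 1 - bit


def output_probability(bit: int, output: int, epsilon: float) -> float:
    """Exact Pr[randomized_response(bit) = output]."""
    p_truth = math.exp(epsilon) / (1 + math.exp(epsilon))
    return p_truth if output == bit else 1 - p_truth


epsilon = 0.5
for r in (0, 1):
    ratio = output_probability(1, r, epsilon) / output_probability(0, r, epsilon)
    # The ratio is either e^eps or e^-eps, so it never exceeds e^eps.
    print(r, ratio, ratio <= math.exp(epsilon) + 1e-12)
```

Changing the one data point changes the probability of each output by a factor of at most $e^{\epsilon}$, which is the bounded ratio the definition asks for.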

(35)

Intuition

Differential privacy is a stability guarantee:
• Changing one data point doesn’t affect the outcome much

Stability implies generalization

(36)

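The connection

Theorem (informal): Let $A$ be a query answering mechanism such that:

1. $A$ is $\epsilon$-differentially private, and

2. For any set of $k$ adaptively chosen queries $q_1, \dots, q_k$: $\max_i |A(q_i) - \mathbb{E}_S[q_i]| \le \alpha$ with probability $1-\beta$

then, so long as $\epsilon$ is sufficiently small relative to $\alpha$: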

(37)

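The connection

1. $A$ is $\epsilon$-differentially private, and

2. For any set of $k$ adaptively chosen queries $q_1, \dots, q_k$: $\max_i |A(q_i) - \mathbb{E}_S[q_i]| \le \alpha$ with probability $1-\beta$

$A$ is $O(\alpha)$-accurate with respect to the distribution $\mathcal{D}$ for any set of $k$ adaptively chosen queries, with high probability.

Recent improvement by Nissim and Stemmer, arXiv 2015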

(38)

An easier theorem:

1. The output of $A$ can be described with $b$ bits, and
2. For any set of $k$ adaptively chosen queries $q_1, \dots, q_k$:

(39)

Proof of easy theorem

• Fix any data analyst. His interaction with $A$ can be described as a decision tree of depth $k$. Since the outcome of $A$ can always be described with only $b$ bits, there are at most $2^b$ trajectories through the tree that can be realized, for a total of $k \cdot 2^b$ queries that might possibly be asked.
• Chernoff bound. Union bound.

[Figure: decision tree of queries rooted at $q_1$, with nodes $q_2$, $q_4$, $q_5$, $q_{10}$, $q_{13}$, $q_{14}$]
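The two steps named in the proof can be written out as follows. This is a sketch assuming the Hoeffding form of the Chernoff bound. The symbols are notation chosen here: $b$ is the description length in bits, $k$ the number of queries, $\alpha$ the accuracy, $\beta$ the failure probability, $S$ the sample of $n$ points, and $\mathcal{D}$ the distribution.

```latex
% Chernoff (Hoeffding) bound for one fixed query q with values in [0,1]:
\Pr\Bigl[\,\bigl|\mathbb{E}_S[q] - \mathbb{E}_{\mathcal{D}}[q]\bigr| > \alpha\,\Bigr]
  \;\le\; 2e^{-2n\alpha^2}.

% Union bound over the at most k 2^b queries that might possibly be asked:
\Pr\Bigl[\,\exists\, q:\ \bigl|\mathbb{E}_S[q] - \mathbb{E}_{\mathcal{D}}[q]\bigr| > \alpha\,\Bigr]
  \;\le\; k\,2^{b}\cdot 2e^{-2n\alpha^2} \;\le\; \beta
\quad\Longleftrightarrow\quad
n \;\ge\; \frac{b\ln 2 + \ln(2k/\beta)}{2\alpha^2}.
```

So the sample size needs to grow only linearly in the description length $b$. Combined with the assumption that the mechanism's answers are close to the empirical averages, this gives accuracy with respect to the distribution.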

(40)

Can we answer lots of queries with short description length?

• Yes: here is one easy way
• Based on [Roth, Roughgarden 2010] but without the privacy.
• Important fact: For any set of $k$ queries, there is a dataset of size $O(\log k / \alpha^2)$ that answers all queries with $\alpha$-accuracy.

(41)

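How to answer queries.

1. Let $C_0$ be the set of all datasets of size $O(\log k / \alpha^2)$.
2. For each query $q_i$:
   1. If the median of $\mathbb{E}_{S'}[q_i]$ over $S' \in C_{i-1}$ is within $\alpha$ of $\mathbb{E}_S[q_i]$ (easy query):
      • Output the median
      • Set $C_i = C_{i-1}$
   2. Else (hard query):
      • Output $\mathbb{E}_S[q_i]$ (rounded to nearest multiple of $\alpha$)
      • Set $C_i$ to the datasets $S' \in C_{i-1}$ whose answer $\mathbb{E}_{S'}[q_i]$ agrees with the output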
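The procedure on this slide lost its formulas in extraction, so the following is only a sketch of one median-style mechanism in the spirit of the cited [Roth, Roughgarden 2010] approach, not necessarily the exact rule from the talk. It assumes a tiny finite universe so that every small candidate dataset can be enumerated. The names `universe`, `small_size`, and `candidates`, and the `alpha / 2` pruning rule, are assumptions.

```python
import itertools
import statistics
from typing import Callable, Sequence

Query = Callable[[int], float]  # maps a data point to a value in [0, 1]


def mean_answer(dataset: Sequence[int], query: Query) -> float:
    return sum(query(x) for x in dataset) / len(dataset)


def answer_queries(sample, queries, universe, small_size, alpha):
    """Answer queries in order; return (answers, hard_flags)."""
    # Candidate set: every multiset of `small_size` points from the universe.
    candidates = list(itertools.combinations_with_replacement(universe, small_size))
    answers, hard_flags = [], []
    for query in queries:
        empirical = mean_answer(sample, query)
        # Raises StatisticsError if pruning ever removes every candidate.
        median = statistics.median(mean_answer(c, query) for c in candidates)
        if abs(median - empirical) <= alpha:
            # Easy query: the candidates already predict the answer.
            answers.append(median)
            hard_flags.append(False)
        else:
            # Hard query: answer from the sample, rounded to a multiple of alpha,
            # and keep only candidates that agree with the rounded answer.
            rounded = round(empirical / alpha) * alpha
            candidates = [c for c in candidates
                          if abs(mean_answer(c, query) - rounded) <= alpha / 2]
            answers.append(rounded)
            hard_flags.append(True)
    return answers, hard_flags


queries = [lambda x: float(x), lambda x: 1.0 - x]
print(answer_queries([1] * 8, queries, universe=[0, 1], small_size=4, alpha=0.25))
# ([1.0, 0.0], [True, False])
```

In this sketch a hard query removes at least half of the candidates, because the median answer lies more than `alpha` from the empirical answer while every surviving candidate lies within `alpha` of it. If `small_size` is too small for the given queries, pruning can empty the candidate set.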

(42)

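Some observations

• There can be at most $O(\log k \cdot \log|\mathcal{X}| / \alpha^2)$ hard queries in any sequence of $k$ queries
  • Since each hard query cuts the set of remaining candidate datasets in half, and there are at most $|\mathcal{X}|^{O(\log k/\alpha^2)}$ candidate datasets to begin with.
• All answers are $\alpha$-accurate.
• The answers to any sequence of queries can be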

(43)

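Some observations

[Figure: the analyst’s decision tree of depth $k$, rooted at $q_1$, with nodes $q_2$, $q_4$, $q_5$, $q_{10}$, $q_{11}$, $q_{13}$, $q_{14}$, $q_{16}$, $q_{24}$, $q_{37}$, …]

To specify the trajectory down the tree, only need to specify the depth of each hard query, and its answer.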

(44)

Applying the easy theorem:

• Easy corollary: There is an algorithm that can answer any sequence of $k$ adaptively chosen queries, while guaranteeing $\alpha$-accuracy, and needing sample complexity only:

(45)

Applying the main theorem:

(Using improved parameters from [BSSU15] and [NS15] and applying state of the art private query release mechanism from [HR10])

• Corollary: There is an algorithm that can answer any sequence of $k$ adaptively chosen queries, while guaranteeing $\alpha$-accuracy, and needing sample complexity only:

(46)

A Practical, Efficient Technique for Reusing a Holdout Set

(47)

Standard holdout method

[Figure: the data is split into training data and a holdout. The data analyst has unrestricted access to the training data; the holdout is good for one validation.]

(48)

One corollary: a reusable holdout

[Figure: the data is split into training data and a reusable holdout. The data analyst has unrestricted access to the training data; the reusable holdout can be used many times adaptively.]

(49)

A reusable holdout: Thresholdout

Theorem: Thresholdout gives $\alpha$-accurate estimates for any sequence of adaptively chosen queries until overfitting*

* Function $q$ overfits if $|\mathbb{E}_S[q] - \mathbb{E}_{\mathcal{D}}[q]|$ exceeds a fixed threshold.
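A simplified sketch of the Thresholdout idea: answer from the training set unless it disagrees noticeably with the holdout, and only then reveal a noisy holdout estimate. The default `threshold` and `sigma` values, the helper names, and the noise scales are assumptions for illustration. The version in the holdout-reuse paper also adds noise to the threshold itself and tracks a budget of overfitting queries, both omitted here.

```python
import random
from typing import Callable, Sequence

Query = Callable[[float], float]  # maps a data point to a value in [0, 1]


def laplace(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    if scale == 0:
        return 0.0
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def mean(data: Sequence[float], query: Query) -> float:
    return sum(query(x) for x in data) / len(data)


def thresholdout(train, holdout, query, threshold=0.04, sigma=0.01, rng=None):
    """Return an estimate of the query's expectation, reusing the holdout."""
    rng = rng or random.Random(0)
    train_avg = mean(train, query)
    holdout_avg = mean(holdout, query)
    if abs(train_avg - holdout_avg) > threshold + laplace(4 * sigma, rng):
        # Training estimate disagrees with the holdout: return a noisy holdout value.
        return holdout_avg + laplace(sigma, rng)
    # Estimates agree: the training value reveals nothing new about the holdout.
    return train_avg


identity = lambda x: x
print(thresholdout([1.0] * 4, [1.0, 0.0, 0.0, 0.0], identity, sigma=0))  # 0.25
```

Most queries that do not overfit are answered from the training set alone, which is what lets the holdout be consulted many times.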

(50)

An illustrative experiment

• Data set with 2n = 20,000 rows and d = 10,000 variables. Class labels in {-1,1}
• Analyst performs stepwise variable selection:
  1. Split data into training/holdout of size n

(51)

No correlation between data and labels

• Data are random Gaussians
• Labels are drawn independently at random from {-1,1}
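A scaled-down sketch of this no-signal setting, to show how consulting the holdout during variable selection can inflate holdout accuracy. The selection rule (keep variables whose training and holdout correlations agree in sign, then take the largest), the sign-vote classifier, and all sizes are assumptions. The slide's full procedure was truncated, so this is not the experiment from the talk.

```python
import random


def sign(v: float) -> int:
    return 1 if v >= 0 else -1


def make_data(n, d, rng):
    """Gaussian features and independent uniform labels in {-1, 1}: no signal."""
    xs = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
    ys = [rng.choice((-1, 1)) for _ in range(n)]
    return xs, ys


def correlations(xs, ys):
    """Average of feature * label for each variable."""
    n, d = len(xs), len(xs[0])
    return [sum(xs[i][j] * ys[i] for i in range(n)) / n for j in range(d)]


def accuracy(xs, ys, selected, weights):
    hits = sum(sign(sum(weights[j] * x[j] for j in selected)) == y
               for x, y in zip(xs, ys))
    return hits / len(ys)


def run_experiment(n=100, d=300, num_selected=20, seed=0):
    rng = random.Random(seed)
    train_x, train_y = make_data(n, d, rng)
    hold_x, hold_y = make_data(n, d, rng)
    fresh_x, fresh_y = make_data(n, d, rng)

    train_corr = correlations(train_x, train_y)
    hold_corr = correlations(hold_x, hold_y)
    # Adaptive step: the holdout is consulted while choosing variables.
    agree = [j for j in range(d) if sign(train_corr[j]) == sign(hold_corr[j])]
    selected = sorted(agree, key=lambda j: -abs(train_corr[j]))[:num_selected]
    weights = {j: sign(train_corr[j]) for j in selected}

    return {
        "train": accuracy(train_x, train_y, selected, weights),
        "holdout": accuracy(hold_x, hold_y, selected, weights),
        "fresh": accuracy(fresh_x, fresh_y, selected, weights),
    }


print(run_experiment())
```

Since the labels are independent of the data, the true accuracy of any classifier is 50%. One would expect the fresh-data accuracy to sit near that value while the training and holdout figures look better, though the exact numbers depend on the seed and sizes.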

(52)

High correlation

• 20 attributes are highly correlated with target
• Remaining attributes are uncorrelated

(53)

Lots of Future Directions

Theory

• Is it possible to obtain improved bounds? (We’ve basically hit the limits via differential privacy, but there may be other approaches)
  • i.e. is differential privacy the best way to achieve generalization?
  • If not, is there an information theoretic measure that characterizes generalization for adaptive analyses? (See DFHPRR15 for a partial result)

Practice

• Design practical methods for real data analysis problems
  • Thresholdout is a step in this direction, but there is more to data analysis than holdout sets.

(54)

To read more

The main theorem:

• “Preserving Statistical Validity in Adaptive Data Analysis”, Dwork, Feldman, Hardt, Pitassi, Reingold, Roth. http://arxiv.org/abs/1411.2664 2014.

Improvements (Improved parameters and more general settings):

• “More General Queries and Less Generalization Error in Adaptive Data Analysis”, Bassily, Smith, Steinke, Ullman. http://arxiv.org/abs/1503.04843 2015.

• “On the Generalization Properties of Differential Privacy”, Nissim, Stemmer. http://arxiv.org/abs/1504.05800 2015.

Description length and more general measures, as well as Thresholdout and experiments:

• “Generalization in Adaptive Data Analysis and Holdout Reuse”, Dwork, Feldman, Hardt, Pitassi, Reingold, Roth. http://arxiv.org/abs/1506.02629 2015.

For more on Differential Privacy:

• “The Algorithmic Foundations of Differential Privacy”, Dwork, Roth. http://www.cis.upenn.edu/~aaroth/Papers/privacybook.pdf