
(1)

Preventing False Discovery in Adaptive Data Analysis

Aaron Roth

(2)

From: [email protected]

To: [email protected]

Date: 2/27/15

Subject: Gr8 investment tip!!!

(3)

From: [email protected]

To: [email protected]

Date: 2/28/15

Subject: Gr8 investment tip!!!

(4)

From: [email protected]

To: [email protected]

Date: 3/1/15

(5)

From: [email protected]

To: [email protected]

Date: 3/2/15

(6)

From: [email protected]

To: [email protected]

Date: 3/3/15

(7)

From: [email protected]

To: [email protected]

Date: 3/4/15

(8)

From: [email protected]

To: [email protected]

Date: 3/5/15

(9)

From: [email protected]

To: [email protected]

Date: 3/6/15

(10)

From: [email protected]

To: [email protected]

Date: 3/7/15

(11)

From: [email protected]

To: [email protected]

Date: 3/8/15

(12)

From: [email protected]

To: [email protected]

Date: 3/8/15

Subject: Gr8 investment opportunity!!!

Hi there. I’m tired of giving out this great advice for free. Let me manage your money, and I’ll continue giving you my stock prediction tips in exchange for a small cut!

(13)

Hmm…

The chance he was right 10 times in a row if he was just randomly guessing is only $2^{-10} \approx 0.001$.

I can reject the null hypothesis that these predictions were luck.

(14)

What happened

(17)

After 10 days…

There remain people who have received perfect

predictions.
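The arithmetic behind the scam can be sketched in a few lines. This is only an illustration: it assumes each daily prediction is a binary up/down call and that the sender splits the remaining recipients evenly each day; the mailing list size used below is hypothetical, since the slides do not state it.

```python
def surviving_recipients(initial_recipients: int, days: int) -> int:
    """Count recipients who have seen only correct predictions after `days` days.

    Assumption: each day the sender tells half of the remaining recipients
    "up" and the other half "down", so half of them see a correct call.
    """
    remaining = initial_recipients
    for _ in range(days):
        remaining //= 2  # only the half that received the right call remains
    return remaining


def chance_of_perfect_streak(days: int) -> float:
    """Probability that coin-flip guessing is right `days` times in a row."""
    return 0.5 ** days


if __name__ == "__main__":
    # Hypothetical starting list of one million addresses.
    print(surviving_recipients(1_000_000, 10))
    print(chance_of_perfect_streak(10))
```

Each surviving recipient sees a streak that looks significant in isolation, even though the sender never predicted anything.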

(18)

False discovery — a growing

concern

(19)

“The Multiple Hypothesis

Testing Problem”

A finding with p-value $p \le 0.05$ has only a 5% probability of being realized “by chance”

But we expect 5 such findings if we test 100 hypotheses, even if they are all wrong…

Easy enough to account for if we know how many
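A small sketch of the counting argument, assuming all tested hypotheses are truly null so that their p-values are uniform on [0, 1]. The function names are illustrative, and the Bonferroni correction shown is one standard way to account for a known number of tests; the slide does not name a specific correction.

```python
import random
from typing import List


def expected_false_discoveries(num_hypotheses: int, alpha: float) -> float:
    """Expected number of null hypotheses with p-value below alpha."""
    return num_hypotheses * alpha


def bonferroni_threshold(alpha: float, num_hypotheses: int) -> float:
    """Per-test threshold that keeps the chance of any false discovery below alpha."""
    return alpha / num_hypotheses


def simulate_null_pvalues(num_hypotheses: int, seed: int) -> List[float]:
    """Draw p-values for hypotheses that are all false (uniform under the null)."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(num_hypotheses)]


if __name__ == "__main__":
    pvals = simulate_null_pvalues(100, seed=0)
    naive = sum(p <= 0.05 for p in pvals)
    corrected = sum(p <= bonferroni_threshold(0.05, 100) for p in pvals)
    print(expected_false_discoveries(100, 0.05), naive, corrected)
```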

(20)

Preventing false

discovery

Decades-old subject in statistics

Theory focuses on non-adaptive data analysis

Powerful results such as the Benjamini-Hochberg work on controlling the False Discovery Rate

Lots of tools:
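As one example of such a tool, the Benjamini-Hochberg step-up procedure can be sketched as follows. This is a minimal version of the textbook procedure for a fixed, non-adaptively chosen list of p-values; the function name and interface are illustrative.

```python
from typing import List


def benjamini_hochberg(p_values: List[float], fdr: float) -> List[bool]:
    """Return, for each hypothesis, whether it is rejected at FDR level `fdr`.

    Step-up rule: find the largest rank i (1-based, in sorted order) with
    p_(i) <= (i / m) * fdr, then reject the i smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            cutoff_rank = rank  # keep the largest rank that passes
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected


if __name__ == "__main__":
    print(benjamini_hochberg([0.5, 0.6, 0.001], fdr=0.05))
```

The guarantee relies on the list of hypotheses being fixed in advance, which is exactly what adaptive analysis breaks.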

(21)

Non-adaptive data analysis

Specify exact

experimental setup

• e.g., hypotheses to test

Collect data

Run experiment

Observe outcome

Data

analyst

Can’t reuse data

(22)

Adaptive

data analysis

Data

analyst

Specify exact

experimental setup

• e.g., hypotheses to test

Collect data

Run experiment

Observe outcome

(23)

Why adaptivity is

troublesome

(25)

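An adaptive data analyst is a decision tree…

[Figure: a decision tree of depth $k$ rooted at query $q_1$; each answer determines which query ($q_2$, $q_4$, $q_5$, $q_{10}$, …) is asked next.]

Must account not just for the queries actually made, but also for the queries that might have been made.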

(26)

(One) Formalization of the

Problem: Statistical

Queries

A data universe $\mathcal{X}$ (e.g. $\mathcal{X} = \{0,1\}^d$)

A distribution $D$ over $\mathcal{X}$

(27)

(One) Formalization of the

Problem: Statistical

Queries

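A statistical query is defined by a predicate $q : \mathcal{X} \to [0,1]$.

The answer to a statistical query is $\mathbb{E}_D[q] = \mathbb{E}_{x \sim D}[q(x)]$.

A statistical query oracle is an algorithm for answering statistical queries: $\mathcal{O}(q) \in [0,1]$.

A statistical query oracle has accuracy $\alpha$ on a collection of SQs if for every query $q$ in the collection: $|\mathcal{O}(q) - \mathbb{E}_D[q]| \le \alpha$.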

(28)

(One) Formalization of the

Problem: Statistical

Queries

Statistical query oracles are built from data, i.e. $\mathcal{O} = \mathcal{O}_S$ for a dataset $S \sim D^n$ of $n$ points drawn from $D$.

Main quantity of interest: sample complexity.

(29)

Non-adaptively chosen

queries are easy.

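Consider the naïve mechanism, which answers each query with its empirical average on the dataset: $\mathcal{O}_S(q) = \mathbb{E}_S[q] = \frac{1}{n}\sum_{x \in S} q(x)$.

Theorem: The naïve mechanism is $\alpha$-accurate for any set of $k$ non-adaptively chosen queries given a data set of size at least:

$n \ge \frac{\ln(2k/\beta)}{2\alpha^2}$

with probability $1 - \beta$.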
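A minimal sketch of the non-adaptive case, assuming the naïve mechanism answers each query with its empirical mean and that queries take values in [0, 1]. The sample size comes from Hoeffding's inequality plus a union bound over the queries; the function names are illustrative.

```python
import math
from typing import Callable, Sequence


def empirical_answer(query: Callable[[object], float], dataset: Sequence[object]) -> float:
    """Answer a statistical query with its average over the dataset."""
    return sum(query(x) for x in dataset) / len(dataset)


def nonadaptive_sample_size(num_queries: int, alpha: float, beta: float) -> int:
    """Smallest n with 2 * k * exp(-2 * n * alpha^2) <= beta.

    Hoeffding bounds the failure probability of one fixed query by
    2 * exp(-2 * n * alpha^2); the union bound multiplies it by k.
    """
    return math.ceil(math.log(2 * num_queries / beta) / (2 * alpha ** 2))


if __name__ == "__main__":
    print(empirical_answer(lambda x: float(x > 0), [1, -1, 2, -3]))
    print(nonadaptive_sample_size(num_queries=1000, alpha=0.1, beta=0.05))
```

The dependence on the number of queries is only logarithmic, which is why the non-adaptive case is considered easy.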

(30)

Adaptive Queries are

Harder:

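Theorem: The naïve mechanism is $\alpha$-accurate for any set of $k$ adaptively chosen queries given a data set of size at least:

$n \ge O\left(\frac{k \log(1/\alpha) + \log(1/\beta)}{\alpha^2}\right)$

with probability $1 - \beta$.

Proof: Each query has at most $O(1/\alpha)$ possible answers (at precision $\alpha$), so the branching factor of the analyst’s decision tree is at most $O(1/\alpha)$. There are at most $(O(1/\alpha))^k$ queries he might ever ask. Chernoff bound. Union bound.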
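The proof sketch can be spelled out as a short calculation. It assumes answers are reported on a grid of width $\alpha$, so that each query has at most $b = \lceil 1/\alpha \rceil + 1$ possible answers, and it uses Hoeffding's inequality as the Chernoff-type bound; the exact constants are illustrative rather than those of the original slide.

```latex
% Number of queries in a decision tree of depth k with branching factor b >= 2:
N \;\le\; \sum_{i=0}^{k-1} b^{i} \;\le\; b^{k},
\qquad b = \lceil 1/\alpha \rceil + 1 .

% Hoeffding for one fixed query, then a union bound over all N queries in the tree:
\Pr\Big[\exists\, q \text{ in the tree}:\;
  \big|\mathbb{E}_S[q] - \mathbb{E}_D[q]\big| > \alpha \Big]
\;\le\; 2\, b^{k}\, e^{-2 n \alpha^{2}} .

% Requiring the right-hand side to be at most beta gives the sample size:
2\, b^{k}\, e^{-2 n \alpha^{2}} \le \beta
\iff
n \;\ge\; \frac{k \ln b + \ln(2/\beta)}{2 \alpha^{2}} .
```

The sample size now grows linearly in $k$ rather than logarithmically, which is the gap the rest of the talk works to close.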

(31)

Question:

(32)

Differential Privacy

[Dwork-McSherry-Nissim-Smith 06]

[Figure: an algorithm run on the dataset with and without Alice's data; the ratio of the probabilities Pr[r] of any output r is bounded.]

(33)

Differential Privacy

$\mathcal{X}$: The data universe.

$S \in \mathcal{X}^n$: The dataset (one element per person)

Definition: Two datasets $S, S' \in \mathcal{X}^n$ are neighbors if they differ in the data of a single individual.

(34)

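Differential Privacy

$\mathcal{X}$: The data universe.

$S \in \mathcal{X}^n$: The dataset (one element per person)

Definition: An algorithm $A$ is $\epsilon$-differentially private if for all pairs of neighboring datasets $S, S'$, and for all outputs $r$:

$\Pr[A(S) = r] \le e^{\epsilon} \cdot \Pr[A(S') = r]$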
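To make the definition concrete, here is a hedged sketch of the standard Laplace mechanism applied to a mean of values in [0, 1]. It assumes the pure $\epsilon$ form of the definition and neighboring datasets that differ by replacing one entry, so the mean moves by at most 1/n; the function names are illustrative.

```python
import math
import random
from typing import Sequence


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def private_mean(values: Sequence[float], epsilon: float, rng: random.Random) -> float:
    """Release the mean of values in [0, 1] with Laplace noise of scale 1/(n*epsilon)."""
    n = len(values)
    return sum(values) / n + laplace_noise(1 / (n * epsilon), rng)


def laplace_density(r: float, center: float, scale: float) -> float:
    """Density of the mechanism's output r when the true mean is `center`."""
    return math.exp(-abs(r - center) / scale) / (2 * scale)


if __name__ == "__main__":
    n, epsilon = 10, 0.5
    scale = 1 / (n * epsilon)
    # Two neighboring datasets whose means differ by exactly 1/n.
    ratio = laplace_density(0.35, 0.3, scale) / laplace_density(0.35, 0.4, scale)
    print(ratio, "<=", math.exp(epsilon))
    print(private_mean([0, 1, 1, 0, 1, 0, 0, 1, 1, 1], epsilon, random.Random(0)))
```

For every output r the density ratio between neighbors stays below $e^{\epsilon}$, which is the bounded-ratio property in the definition.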

(35)

Intuition

Differential privacy is a

stability

guarantee:

Changing one data point doesn’t affect the

outcome much

Stability implies generalization

(37)

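The connection

Theorem (informal): Let $A$ be a query answering mechanism such that:

1. $A$ is $\epsilon$-differentially private, and

2. For any set of $k$ adaptively chosen queries $q_1, \ldots, q_k$, with probability $1 - \beta$: $\max_i |A(q_i) - \mathbb{E}_S[q_i]| \le \alpha$

then so long as $\epsilon$ is sufficiently small relative to $\alpha$ and $n$ is sufficiently large:

$A$ is $O(\alpha)$-accurate with respect to $D$ for any set of adaptively chosen queries with probability $1 - O(\beta)$.

Recent improvement by Nissim and Stemmer, arXiv 2015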

(38)

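An easier theorem:

Theorem (informal): Let $A$ be a query answering mechanism such that:

1. The output of $A$ can be described with $b$ bits, and

2. For any set of $k$ adaptively chosen queries $q_1, \ldots, q_k$: $\max_i |A(q_i) - \mathbb{E}_S[q_i]| \le \alpha$

then so long as $n \ge O\left(\frac{b + \log(k/\beta)}{\alpha^2}\right)$:

$A$ is $O(\alpha)$-accurate with respect to $D$ with probability $1 - \beta$.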

(39)

Proof of easy theorem

Fix any data analyst. His interaction with $A$ can be described as a decision tree of depth $k$. Since the outcome of $A$ can always be described with only $b$ bits, there are at most $2^b$ trajectories through the tree that can be realized, for a total of $k \cdot 2^b$ queries that might possibly be asked.

Chernoff bound. Union bound.

[Figure: the analyst's decision tree over queries $q_1, q_2, q_4, \ldots$]
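The counting step in this proof can be turned into a small calculator. This is a sketch under stated assumptions: the mechanism's whole transcript fits in `description_bits` bits, queries take values in [0, 1], and Hoeffding's inequality is used for the Chernoff step; the names and constants are illustrative.

```python
import math


def max_queries_in_tree(description_bits: int, depth: int) -> int:
    """At most 2^b realizable trajectories, each containing `depth` queries."""
    return depth * 2 ** description_bits


def description_length_sample_size(
    description_bits: int, depth: int, alpha: float, beta: float
) -> int:
    """Smallest n with 2 * N * exp(-2 * n * alpha^2) <= beta, N = k * 2^b."""
    total_queries = max_queries_in_tree(description_bits, depth)
    return math.ceil(math.log(2 * total_queries / beta) / (2 * alpha ** 2))


if __name__ == "__main__":
    print(max_queries_in_tree(description_bits=10, depth=100))
    print(description_length_sample_size(10, 100, alpha=0.1, beta=0.05))
```

The required sample size grows linearly in the description length and only logarithmically in the number of queries.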

(40)

Can we answer lots of

queries with short

description length?

Yes – here is one easy way

Based on [Roth, Roughgarden 2010] but without

the privacy.

Important fact: For any set of $k$ queries, there is a dataset of size $O(\log k / \alpha^2)$ that answers all queries with $\alpha$-accuracy.

(41)

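How to answer queries.

1. Let $C_0$ be the set of all candidate datasets of size $O(\log k / \alpha^2)$

2. For each query $q_i$:

   1. If the median answer to $q_i$ over $C_{i-1}$ is within $\alpha$ of $\mathbb{E}_S[q_i]$ (Easy Query)

      • Output the median answer

      • Set $C_i = C_{i-1}$

   2. Else (Hard Query)

      • Output $\mathbb{E}_S[q_i]$ (Rounded to nearest multiple of $\alpha$)

      • Set $C_i$ to the candidates in $C_{i-1}$ whose answer to $q_i$ is consistent with the output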
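A toy sketch of this kind of median-based procedure is given below. It is a reconstruction under assumptions, not the slide's exact algorithm: the candidate set is every dataset of a fixed small size over a tiny universe, a query is easy when the candidates' median answer is within `alpha` of the empirical answer, and a hard query keeps only candidates within `alpha / 2` of the rounded output. All names and constants are illustrative.

```python
import itertools
import statistics
from typing import Callable, List, Sequence, Tuple

Query = Callable[[int], float]


def mean_answer(query: Query, dataset: Sequence[int]) -> float:
    """Average value of the query over a dataset."""
    return sum(query(x) for x in dataset) / len(dataset)


def answer_queries(
    sample: Sequence[int],
    queries: Sequence[Query],
    universe: Sequence[int],
    small_size: int,
    alpha: float,
) -> Tuple[List[float], int]:
    """Answer queries one at a time; return the answers and the hard-query count."""
    # Candidate set: every dataset of `small_size` points from the universe.
    candidates = list(itertools.product(universe, repeat=small_size))
    answers: List[float] = []
    hard_count = 0
    for query in queries:
        truth = mean_answer(query, sample)
        guess = None
        if candidates:
            guess = statistics.median_low([mean_answer(query, c) for c in candidates])
        if guess is not None and abs(guess - truth) <= alpha:
            # Easy query: the median already agrees with the data.
            answers.append(guess)
        else:
            # Hard query: reveal the rounded empirical answer and prune.
            rounded = round(truth / alpha) * alpha
            answers.append(rounded)
            hard_count += 1
            candidates = [
                c for c in candidates
                if abs(mean_answer(query, c) - rounded) <= alpha / 2
            ]
    return answers, hard_count


if __name__ == "__main__":
    q: Query = lambda x: float(x)
    print(answer_queries([1] * 8, [q, q], universe=[0, 1], small_size=2, alpha=0.25))
```

In this sketch a hard query removes the half of the candidates on the median's side, which is what limits the number of hard queries; the exhaustive candidate set is only feasible for toy universes.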

(42)

Some observations

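There can be at most $O(\log k \cdot \log|\mathcal{X}| / \alpha^2)$ hard queries in any sequence of queries, since each hard query cuts the set of remaining candidate datasets in half, and there are $|\mathcal{X}|^{O(\log k / \alpha^2)}$ candidates to begin with.

All answers are $\alpha$-accurate.

The answers to any sequence of queries can be described by the positions of the hard queries and their answers.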

(43)

Some observations

[Figure: the analyst's decision tree of depth $k$ over queries $q_1, q_2, q_4, \ldots$]

To specify the trajectory down the tree, only need to specify the depth of each hard query, and its answer.
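The resulting description length can be estimated with a short helper. This is a rough sketch assuming each hard query is recorded as its depth among the queries plus an answer rounded to a multiple of `alpha` in [0, 1]; the encoding and the function name are illustrative.

```python
import math


def trajectory_description_bits(num_hard: int, num_queries: int, alpha: float) -> int:
    """Bits needed to list the depth and rounded answer of every hard query."""
    depth_bits = math.ceil(math.log2(num_queries))    # which position was hard
    num_answers = math.floor(1 / alpha) + 1           # grid points 0, alpha, ..., 1
    answer_bits = math.ceil(math.log2(num_answers))   # which rounded answer was given
    return num_hard * (depth_bits + answer_bits)


if __name__ == "__main__":
    print(trajectory_description_bits(num_hard=3, num_queries=1024, alpha=0.25))
```

Because only the hard queries are encoded, the description length depends on their number rather than on the total number of queries asked.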

(44)

Applying the easy

theorem:

Easy corollary: There is an algorithm that can answer any sequence of $k$ adaptively chosen queries, while guaranteeing $\alpha$-accuracy, and needing sample complexity only:

(45)

Applying the main theorem:

(Using improved parameters from [BSSU15] and [NS15] and applying the state-of-the-art private query release mechanism from [HR10])

Corollary: There is an algorithm that can answer any sequence of $k$ adaptively chosen queries, while guaranteeing $\alpha$-accuracy, and needing sample complexity only:

(46)

A Practical, Efficient

Technique

for Reusing a Holdout Set

(47)

Standard holdout method

Data is split into training data and a holdout.

• Training data: unrestricted access for the data analyst

• Holdout: good for one validation

(48)

One corollary: a reusable holdout

Data is split into training data and a reusable holdout.

• Training data: unrestricted access for the data analyst

• Reusable holdout: can be used many times adaptively

(49)

A reusable holdout: Thresholdout

Theorem: Thresholdout gives $\alpha$-accurate estimates for any sequence of adaptively chosen queries until overfitting*

* Function $q$ overfits if $|\mathbb{E}_S[q] - \mathbb{E}_D[q]|$ exceeds a threshold.
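A simplified sketch of the Thresholdout idea follows. It keeps only the core comparison (answer from the training set unless it disagrees with the holdout by more than a noisy threshold) and omits the overfitting budget and the specific noise scales of the published algorithm; the parameter names are illustrative.

```python
import random
from typing import Sequence


def laplace(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale); a scale of 0 means no noise."""
    if scale == 0:
        return 0.0
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def thresholdout_answer(
    train_values: Sequence[float],
    holdout_values: Sequence[float],
    threshold: float,
    noise_scale: float,
    rng: random.Random,
) -> float:
    """Answer one query given its per-example values on both datasets."""
    train_mean = sum(train_values) / len(train_values)
    holdout_mean = sum(holdout_values) / len(holdout_values)
    if abs(train_mean - holdout_mean) > threshold + laplace(noise_scale, rng):
        # The training estimate looks overfit: answer from the holdout, noisily.
        return holdout_mean + laplace(noise_scale, rng)
    # The two estimates agree: reveal nothing new about the holdout.
    return train_mean


if __name__ == "__main__":
    rng = random.Random(0)
    print(thresholdout_answer([1, 1, 1, 1], [0, 1, 0, 1], 0.1, 0.01, rng))
```

The holdout is only consulted in a way the analyst can observe when the two estimates disagree, which is what limits how much can be learned about it.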

(50)

An illustrative experiment

Data set with 2n = 20,000 rows and d = 10,000 variables. Class labels in {-1,1}

Analyst performs stepwise variable selection:

1. Split data into training/holdout of size n

(51)

No correlation between data and labels

Data are random Gaussians

Labels are drawn independently at random from {-1,1}
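The no-signal setting can be simulated at small scale. The sketch below is an assumption-laden stand-in for the experiment: the selection rule (keep variables whose label correlation has the same sign on training and holdout, then take the strongest ones) and the default sizes are illustrative choices, not details taken from the slides.

```python
import random
from typing import Dict, List, Tuple


def make_data(n: int, d: int, rng: random.Random) -> Tuple[List[List[float]], List[int]]:
    """Gaussian features and independent random labels: no real signal."""
    rows = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    labels = [rng.choice((-1, 1)) for _ in range(n)]
    return rows, labels


def correlations(rows: List[List[float]], labels: List[int]) -> List[float]:
    """Average of feature value times label, per feature."""
    n, d = len(rows), len(rows[0])
    return [sum(rows[i][j] * labels[i] for i in range(n)) / n for j in range(d)]


def accuracy(rows: List[List[float]], labels: List[int], signs: Dict[int, int]) -> float:
    """Accuracy of the classifier sign(sum of signed selected features)."""
    correct = 0
    for row, label in zip(rows, labels):
        score = sum(sign * row[j] for j, sign in signs.items())
        correct += (1 if score >= 0 else -1) == label
    return correct / len(rows)


def run_experiment(n: int = 200, d: int = 300, num_selected: int = 20, seed: int = 0):
    """Return (training, holdout, fresh) accuracy after holdout-guided selection."""
    rng = random.Random(seed)
    train_x, train_y = make_data(n, d, rng)
    hold_x, hold_y = make_data(n, d, rng)
    fresh_x, fresh_y = make_data(n, d, rng)
    train_c = correlations(train_x, train_y)
    hold_c = correlations(hold_x, hold_y)
    # Adaptive step: the holdout is used to decide which variables to keep.
    agree = [j for j in range(d) if train_c[j] * hold_c[j] > 0]
    top = sorted(agree, key=lambda j: abs(train_c[j]), reverse=True)[:num_selected]
    signs = {j: (1 if train_c[j] > 0 else -1) for j in top}
    return (
        accuracy(train_x, train_y, signs),
        accuracy(hold_x, hold_y, signs),
        accuracy(fresh_x, fresh_y, signs),
    )


if __name__ == "__main__":
    print(run_experiment())
```

Since the labels carry no signal, any accuracy above chance on the training or holdout data is an artifact of the selection step, while accuracy on fresh data should stay near one half.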

(52)

High correlation

20 attributes are highly correlated with the target; the remaining attributes are uncorrelated

(53)

Lots of Future Directions

Theory

• Is it possible to obtain improved bounds? (We’ve basically hit the limits via differential privacy, but there may be other approaches), i.e. is differential privacy the best way to achieve generalization?

• If not, is there an information theoretic measure that characterizes generalization for adaptive analyses? (See DFHPRR15 for a partial result)

Practice

Design practical methods for real data analysis problems

• Thresholdout is a step in this direction, but there is more to data analysis than holdout sets.

(54)

To read more

The main theorem:

• “Preserving Statistical Validity in Adaptive Data Analysis”, Dwork, Feldman, Hardt, Pitassi, Reingold, Roth. http://arxiv.org/abs/1411.2664 2014.

Improvements (Improved parameters and more general settings):

• “More General Queries and Less Generalization Error in Adaptive Data Analysis”, Bassily, Smith, Steinke, Ullman. http://arxiv.org/abs/1503.04843 2015.

• “On the Generalization Properties of Differential Privacy”, Nissim, Stemmer. http://arxiv.org/abs/1504.05800 2015.

Description length and more general measures, as well as Thresholdout and experiments:

• “Generalization in Adaptive Data Analysis and Holdout Reuse”, Dwork, Feldman, Hardt, Pitassi, Reingold, Roth. http://arxiv.org/abs/1506.02629 2015.

For more on Differential Privacy:

• “The Algorithmic Foundations of Differential Privacy”, Dwork, Roth. http://www.cis.upenn.edu/~aaroth/Papers/privacybook.pdf
