
Speaker

First Plenary Session

THE USE OF "BIG DATA" - WHERE ARE WE

AND WHAT DOES THE FUTURE HOLD?

William H. Crown, PhD

Optum Labs

Cambridge, MA, USA

Statistical Methods and Machine Learning


Confidential property of Optum. Do not distribute or reproduce without express permission from Optum. 3

Overview

Explosion in Data Availability

Traditional Methods for Analyzing Observational Data

Machine Learning Methods

– Widely used outside of health care—especially in consumer retail

– Many methods

– Model development and testing approach

– Is more data better?

– Traditional focus on prediction versus estimation of treatment effects

How Can We Find the Best of Both?


Market Context

Velocity

Volume

Complexity

Variety

Gartner model, adapted

Tests and Treatments (Medical, Lab, Pharmacy Claims, Standardized Costs)

Health Risk Assessments

Socioeconomic (Race, Income, Education, Language, …)

Vital Signs

Medication Orders

Admissions, Discharges, Transfers

Patient Health Survey (PHQ-9)

Health Survey Measurement (SF-12, SF-36)

Care Coaching Engagements

Evidence Based Medicine (Recommended Care Pathways)

Mobile Applications / Social Networking

Medical Research

Genomic

Future

Examples of Data Partnerships

Payer/IT

Life Sciences/Payer

Life Sciences/Data and Analytics

Delivery System/Partners

Life Sciences/PBM

Multi-Stakeholder

Government

PCORI CDRN

PCORI PCORnet

FDA Sentinel


Traditional Health Services Research and

Epidemiological Methods


Statistical Analysis of Observational Data

Good methods exist for developing well-matched control groups (e.g., propensity scores), but there are no magic bullets.

These methods control only for observables.

They do not control for endogeneity or for confounding from unobserved variables.

Johnson, M., Crown, W., Martin, B., Dormuth, C., Siebert U. 2009. Good Research Practices for Comparative Effectiveness Research: Analytic Methods to Improve Causal Inference from Non-randomized Studies of Treatment Effects Using Secondary Data Sources. Report of the ISPOR Retrospective Database Analysis Task Force - Part III. Value in Health 12(8): 1062-1073.


Machine Learning Methods


Many methods:

– Classification Trees

– Neural Networks

– Random Forests

– Ridge and Lasso Regression

– Support Vector Machines

– And many others…

Hastie T., Tibshirani R., Friedman J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd Edition. New York: Springer.


Basic Approach

Use learning datasets to develop a highly accurate classification algorithm.

Apply algorithm to another dataset to predict classification.

Rules should be as simple as possible while maintaining accuracy.

– Should be able to classify data without human intervention

– Should be efficient with very large datasets

Rob Schapire. Machine Learning Algorithms for Classification. www.cs.princeton.edu/~schapire/talks/picasso-minicourse.pdf.


K-Fold Cross-Validation

Randomly divide the full dataset into learning/validation datasets

Randomly divide the learning/validation data into K equal subsamples (typically 5 or 10)

For each subsample k, fit the model using the other K-1 subsamples

Estimate the prediction error (e.g., sum of squared errors) for subsample k using the model estimated from the other K-1 subsamples

Pick the model specification that generates the lowest average cross-validation error

Estimate the final model using the full dataset
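The cross-validation steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure used in the talk: the simulated data, the two candidate specifications (`fit_mean`, `fit_line`), and the helper `k_fold_cv_error` are all hypothetical, and the per-fold error is a mean squared error (equivalent to the sum of squared errors when folds are the same size).

```python
import random

def fit_mean(xs, ys):
    """Candidate 1: predict the training mean for every observation."""
    m = sum(ys) / len(ys)
    return lambda x: m

def fit_line(xs, ys):
    """Candidate 2: simple least-squares line y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    return lambda x: intercept + slope * x

def k_fold_cv_error(fit, xs, ys, k, rng):
    """Average held-out mean squared error over k random folds."""
    idx = list(range(len(xs)))
    rng.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k near-equal subsamples
    fold_errors = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        # Fit on the other k-1 subsamples, score on the held-out one.
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        sse = sum((ys[i] - model(xs[i])) ** 2 for i in held_out)
        fold_errors.append(sse / len(held_out))
    return sum(fold_errors) / k

# Simulated learning data (an assumption for illustration only).
data_rng = random.Random(0)
xs = [data_rng.uniform(0, 10) for _ in range(200)]
ys = [2.0 + 0.5 * x + data_rng.gauss(0, 1) for x in xs]

candidates = {"mean_only": fit_mean, "linear": fit_line}
# Re-seeding gives every candidate the same fold assignment.
cv_errors = {name: k_fold_cv_error(fit, xs, ys, 5, random.Random(1))
             for name, fit in candidates.items()}
best = min(cv_errors, key=cv_errors.get)

# Last step on the slide: refit the chosen specification on all the data.
final_model = candidates[best](xs, ys)
print(cv_errors, best)
```

Using the same folds for every candidate keeps the comparison between specifications from depending on a lucky split.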

Classification and Regression Trees

Advantages

– Easily handle huge datasets

– Can include both qualitative and quantitative predictor variables

– Very good for missing or sparse data

– Small trees are easy to interpret

Disadvantages

– Large trees are difficult to interpret


Approach

Pick a rule to subset data

Using the rule, divide data into subsets

Keep repeating until remaining subsets are almost “pure” (e.g., measured by entropy or Gini index)

Usual approach is to build a very large tree and then prune it back
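A single splitting step of this approach can be sketched as follows. The purity measures are the standard entropy and Gini index; the helper `best_split`, the toy `ages` and `event` data, and the choice of midpoints as candidate thresholds are assumptions made for illustration. Growing a full tree would apply the same step recursively to each subset and then prune.

```python
import math
from collections import Counter

def gini(labels):
    """Gini index: 0 for a pure subset, 0.5 for a 50/50 binary subset."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy in bits: 0 for a pure subset."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split(values, labels, impurity=gini):
    """Return (threshold, weighted impurity) of the best rule 'value <= threshold'."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold separates identical values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        score = (len(left) * impurity(left) + len(right) * impurity(right)) / n
        if best is None or score < best[1]:
            best = (threshold, score)
    return best

# Hypothetical toy data: an event indicator that depends only on age.
ages = [25, 32, 41, 47, 52, 60, 68, 75]
event = [0, 0, 0, 0, 1, 1, 1, 1]

threshold, score = best_split(ages, event)
print(threshold, score)  # the rule age <= 49.5 yields two pure subsets
```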

Neural Networks

[Figure: network diagram with an input layer (X1, X2, X3), a hidden layer (Z1, Z2, Z3, Z4), and an outcome layer (Y1, Y2)]

$Z_i = f(B_k X_k)$

$Y_i = e^{z} / (1 + e^{z})$
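A forward pass through a network of this shape (three inputs, four hidden units, two outcomes) can be traced numerically. The sketch below makes several assumptions not stated on the slide: the activation f is taken to be the logistic function, each unit applies it to a weighted sum of the previous layer, there are no intercept terms, and all weights are arbitrary illustrative values rather than fitted ones.

```python
import math

def logistic(z):
    """The slide's outcome transform: e^z / (1 + e^z)."""
    return math.exp(z) / (1.0 + math.exp(z))

def layer(inputs, weights, activation):
    """Each unit applies the activation to a weighted sum of its inputs."""
    return [activation(sum(w * x for w, x in zip(row, inputs)))
            for row in weights]

# Input layer: three predictors (illustrative values).
x = [1.0, 0.5, -1.0]

# Hidden layer: four units, each with one weight per input (assumed values).
hidden_weights = [
    [0.2, -0.4, 0.1],
    [0.7, 0.3, -0.5],
    [-0.6, 0.8, 0.2],
    [0.1, 0.1, 0.9],
]

# Outcome layer: two units, each with one weight per hidden unit (assumed values).
outcome_weights = [
    [0.5, -0.3, 0.8, 0.1],
    [-0.2, 0.6, -0.7, 0.4],
]

z = layer(x, hidden_weights, logistic)   # hidden-unit values
y = layer(z, outcome_weights, logistic)  # predicted outcome probabilities
print(z, y)
```

Training would adjust the weights to reduce prediction error on the learning data; only the prediction step is shown here.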


Prediction Is Not the Same as Estimating Treatment Effects

Some machine learning methods (e.g., Ridge and Lasso regression) use regression methods with a penalty term to adjust for the danger of overfitting.

This enables application of the machine learning approach to the estimation of treatment effects.

But we know that results from observational studies can be sensitive to spurious correlations and methodological approach…
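The effect of a penalty term can be seen in the simplest possible case. The sketch below assumes a single centered predictor with no intercept, for which the ridge estimate has the closed form b = Σxy / (Σx² + λ); the data values and the function name `ridge_slope` are illustrative. A penalty of zero reproduces ordinary least squares, and larger penalties shrink the coefficient toward zero, which is worth keeping in mind when the coefficient is read as a treatment effect.

```python
def ridge_slope(xs, ys, penalty):
    """Minimize sum((y - b*x)^2) + penalty * b^2 for one centered predictor."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + penalty)

# Illustrative centered data whose least-squares slope is about 2.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [-4.1, -1.9, 0.2, 2.1, 3.9]

for penalty in (0.0, 1.0, 10.0, 100.0):
    print(penalty, round(ridge_slope(xs, ys, penalty), 3))
```

In practice the penalty would be chosen by cross-validation rather than fixed in advance.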


Things With Strong Trends Will Tend to Be Highly Correlated (2)
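A small simulation illustrates the point in this slide title. The two series below are generated independently of each other, so by construction there is no relationship between them other than that both trend upward; the trend slopes, noise levels, and seed are arbitrary assumptions. Their levels are nonetheless highly correlated, while their period-to-period changes are not.

```python
import random

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5

rng = random.Random(42)
n = 100

# Two unrelated series that each happen to have a strong upward trend.
series_a = [0.5 * t + rng.gauss(0, 2) for t in range(n)]
series_b = [3.0 * t + rng.gauss(0, 10) for t in range(n)]

# First differences remove the shared time trend.
diff_a = [series_a[t] - series_a[t - 1] for t in range(1, n)]
diff_b = [series_b[t] - series_b[t - 1] for t in range(1, n)]

corr_levels = pearson(series_a, series_b)
corr_diffs = pearson(diff_a, diff_b)
print(round(corr_levels, 3), round(corr_diffs, 3))
```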


But Even With Big Data You Have To Be Careful!

Seeger, John, Alexander Walker, Paige Williams, Gordon Saperia, Frank Sacks (2003) A Propensity Score-Matched Cohort Study of the Effect of Statins, Mainly Fluvastatin, on the Occurrence of Acute Myocardial Infarction. Am J. Cardiol 92:1447-1451.

[Figure: cumulative incidence of MI by months of follow-up, statin initiators vs. statin non-initiators]

MI Outcome (Unmatched): HR=2.11 (1.46-3.04), 111% (46%-204%) risk increase

MI Outcome (After Matching): HR=0.69 (0.52-0.93), 31% (7%-48%) risk reduction
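The percentages reported alongside each hazard ratio follow from simple arithmetic: a hazard ratio of HR corresponds to a (HR - 1) × 100 percent change in risk, so values above 1 are increases and values below 1 are reductions. A minimal check using the figures quoted from the slide (the function name is illustrative):

```python
def percent_change_in_risk(hazard_ratio):
    """Percent change in risk implied by a hazard ratio (negative = reduction)."""
    return round((hazard_ratio - 1.0) * 100)

# Point estimate, lower bound, upper bound as shown on the slide.
unmatched = [percent_change_in_risk(hr) for hr in (2.11, 1.46, 3.04)]
matched = [percent_change_in_risk(hr) for hr in (0.69, 0.52, 0.93)]

print(unmatched)  # increases of 111%, 46%, 204%
print(matched)    # reductions of 31%, 48%, 7%
```

The sign of the estimated effect reverses once initiators and non-initiators are matched on the propensity score.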


Bigger Samples Don’t Reduce Bias

[Figure: distributions of estimation error for the IV and OLS estimators at N=200 and N=10,000; panel labels as extracted: (,z)=0 and (z,e)=0]

Crown, W., Henk, H., VanNess D. Some Cautions on the Use of Instrumental Variables (IV) Estimators in Outcomes Research: How Bias in IV Estimators is Affected by Instrument Strength, Instrument Contamination, and Sample Size. Value in Health 14: 1078-1084, 2011.
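The claim that bigger samples do not reduce bias can be illustrated with a short simulation. This is not a reproduction of the simulations in the cited paper; it is a simplified sketch in which an unobserved confounder `u` drives both treatment intensity `x` and outcome `y`, and only the OLS estimator is examined. All parameter values are assumptions. With this setup the OLS slope converges to 1.5 rather than the true effect of 1.0, so increasing N tightens the estimates around the wrong value.

```python
import random
import statistics

def ols_slope(xs, ys):
    """Least-squares slope of y on x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

def simulate_slope(n, rng, true_effect=1.0):
    """One simulated study of size n with an omitted confounder."""
    xs, ys = [], []
    for _ in range(n):
        u = rng.gauss(0, 1)                         # unobserved confounder
        x = u + rng.gauss(0, 1)                     # treatment depends on u
        y = true_effect * x + u + rng.gauss(0, 1)   # outcome depends on u too
        xs.append(x)
        ys.append(y)
    return ols_slope(xs, ys)

rng = random.Random(7)
small = [simulate_slope(200, rng) for _ in range(20)]
large = [simulate_slope(10_000, rng) for _ in range(20)]

mean_small, mean_large = statistics.mean(small), statistics.mean(large)
spread_small, spread_large = statistics.stdev(small), statistics.stdev(large)
print(round(mean_small, 3), round(spread_small, 3))
print(round(mean_large, 3), round(spread_large, 3))
```

The spread across replications shrinks with N, but the center of the estimates stays away from the true effect.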


EHR/Claims Linkages Can Help Reduce Missing Variable Bias

Relevant information | Claims alone | EHR alone | Linked data

Clinical data and severity measures: +, +

Retail/specialty drugs across treatment settings: +, +

Leakage: +, +

Patient-reported outcomes

Selection biases due to payer type: +, +

Longitudinality of patient follow-up: +, ++

Self-pay data: +, +

Coding biases: +, +

Unstructured data: +, +

Timing of events: +, +

Continuous coverage: +, ++

De-identification

Direct identifiers (EMR / Clinical): Name | Address | Birthdate | SSN | Phone, etc.

Direct identifiers (Insurer example): Name | Address | Birthdate | Member ID | Phone, etc.

Shared Salt Code (same for all contributors). Data is then hashed by contributors at their site.

Primary hash, then secondary hash. Uses Confidential Salt.

Statistically de-identified views
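The salted-hash linkage idea can be sketched with the Python standard library. Everything specific here is an assumption for illustration: the use of SHA-256, the normalization rule, the way the salt is combined with the identifiers, the placeholder salt strings, and the point at which the confidential salt is applied. The slide does not specify any of these. The sketch shows only the property that makes linkage possible: two contributors hashing the same person with the same shared salt obtain the same token without exchanging direct identifiers.

```python
import hashlib

def normalize(name, address, birthdate):
    """Canonical form so that trivial formatting differences still match."""
    return "|".join(part.strip().lower() for part in (name, address, birthdate))

def salted_hash(value, salt):
    """Hex digest of the salt combined with the value (assumed scheme)."""
    return hashlib.sha256((salt + "|" + value).encode("utf-8")).hexdigest()

# Placeholder salts, not real values.
SHARED_SALT = "example-shared-salt"              # same for all contributors
CONFIDENTIAL_SALT = "example-confidential-salt"  # held by the linking party

# The same hypothetical person as recorded by two contributors.
emr_record = ("Jane Doe", "1 Main St", "1970-01-01")
insurer_record = (" jane doe", "1 MAIN ST ", "1970-01-01")
other_record = ("John Roe", "2 Elm St", "1980-02-02")

# Primary hash: computed by each contributor at its own site.
primary_emr = salted_hash(normalize(*emr_record), SHARED_SALT)
primary_insurer = salted_hash(normalize(*insurer_record), SHARED_SALT)
primary_other = salted_hash(normalize(*other_record), SHARED_SALT)

# Secondary hash: re-hashes the primary token with the confidential salt.
secondary_emr = salted_hash(primary_emr, CONFIDENTIAL_SALT)
secondary_insurer = salted_hash(primary_insurer, CONFIDENTIAL_SALT)

print(primary_emr == primary_insurer, secondary_emr == secondary_insurer)
```

A keyed construction such as the standard library's `hmac` module is generally preferable to plain concatenation for this kind of token.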


Summary

Rapid expansion in data (volume, velocity, and variety)

Machine learning approaches focus on prediction, but some can also be used to estimate treatment effects

Machine learning methods offer opportunities for speed to answer, but traditional challenges with observational data do not go away

More data doesn’t help with bias problems unless it helps with control variables through data linkage

For treatment effect estimation, we still need to think about possible sources of bias and their implications for methodology and data used for model building
