Speaker First Plenary Session THE USE OF "BIG DATA" - WHERE ARE WE AND WHAT DOES THE FUTURE HOLD? William H. Crown, PhD

(1)


Optum Labs

Cambridge, MA, USA

Statistical Methods and Machine Learning

(2)


Overview

• Explosion in Data Availability

• Traditional Methods for Analyzing Observational Data

• Machine Learning Methods

– Widely used outside of health care—especially in consumer retail

– Many methods

– Model development and testing approach

– Is more data better?

– Traditional focus on prediction versus estimation of treatment effects

• How Can We Find the Best of Both?

(3)

Market Context

• Velocity, Volume, Complexity, Variety (Gartner model, adapted)

• Tests and Treatments (Medical, Lab, Pharmacy Claims, Standardized Costs)

• Health Risk Assessments

• Socioeconomic (Race, Income, Education, Language, …)

• Vital Signs

• Medication Orders

• Admissions, Discharges, Transfers

• Patient Health Survey (PHQ-9)

• Health Survey Measurement (SF-12, SF-36)

• Care Coaching Engagements

• Evidence Based Medicine (Recommended Care Pathways)

• Mobile Applications / Social Networking

• Medical Research

• Genomic

Future

Examples of Data Partnerships

Payer/IT

Life Sciences/Payer

Life Sciences/Data and Analytics

Delivery System/Partners

Life Sciences/PBM

Multi-Stakeholder


Government

PCORI CDRN

PCORI PCORnet

FDA Sentinel

(4)

Traditional Health Services Research and Epidemiological Methods

Statistical Analysis of Observational Data

• Good methods exist for developing well-matched control groups (e.g., propensity score matching), but there are no magic bullets.

• These methods control only for observables.

• They do not control for endogeneity or unobserved confounding.

Johnson, M., Crown, W., Martin, B., Dormuth, C., Siebert, U. 2009. Good Research Practices for Comparative Effectiveness Research: Analytic Methods to Improve Causal Inference from Non-randomized Studies of Treatment Effects Using Secondary Data Sources. Report of the ISPOR Retrospective Database Analysis Task Force—Part III. Value in Health 12(8): 1062-1073.
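The bullets above can be illustrated with a minimal sketch. It assumes a single observed, discrete covariate (severity), in which case the propensity score is simply the share of patients treated at each covariate level, and adjusting on it amounts to stratifying on that covariate. All counts are hypothetical and were invented for illustration; the Mantel-Haenszel estimator is used here as one simple stratified adjustment, not as the method of the cited task force report.

```python
# Hypothetical counts, invented for illustration: (patients, events) by
# treatment arm within each level of one observed covariate (severity).
strata = {
    "high_severity": {"treated": (800, 80), "untreated": (200, 30)},
    "low_severity": {"treated": (200, 4), "untreated": (800, 24)},
}


def crude_risk_ratio(strata):
    """Risk ratio that ignores the covariate entirely."""
    n1 = sum(s["treated"][0] for s in strata.values())
    e1 = sum(s["treated"][1] for s in strata.values())
    n0 = sum(s["untreated"][0] for s in strata.values())
    e0 = sum(s["untreated"][1] for s in strata.values())
    return (e1 / n1) / (e0 / n0)


def propensity_by_stratum(strata):
    """P(treated | covariate level) for a single discrete covariate."""
    return {
        name: s["treated"][0] / (s["treated"][0] + s["untreated"][0])
        for name, s in strata.items()
    }


def mantel_haenszel_risk_ratio(strata):
    """Risk ratio pooled across strata of the observed covariate."""
    num = den = 0.0
    for s in strata.values():
        n1, e1 = s["treated"]
        n0, e0 = s["untreated"]
        total = n1 + n0
        num += e1 * n0 / total
        den += e0 * n1 / total
    return num / den


crude = crude_risk_ratio(strata)
adjusted = mantel_haenszel_risk_ratio(strata)
propensity = propensity_by_stratum(strata)
```

In this constructed example the crude comparison suggests harm while the stratified comparison suggests benefit, because sicker patients are more likely to be treated. The adjustment works only because severity is observed; an unobserved confounder would leave the bias in place.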

(5)

Machine Learning Methods

• Many methods:

– Classification Trees

– Neural Networks

– Random Forests

– Ridge and Lasso Regression

– Support Vector Machines

– And many others…

Hastie T., Tibshirani R., Friedman J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd Edition. New York: Springer.

(6)

Basic Approach

• Use learning datasets to develop a highly accurate classification algorithm.

• Apply the algorithm to another dataset to predict classification.

• Rules should be as simple as possible while maintaining accuracy.

– Should be able to classify data without human intervention

– Should be efficient with very large datasets

Rob Schapire. Machine Learning Algorithms for Classification. www.cs.princeton.edu/~schapire/talks/picasso-minicourse.pdf.
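As a minimal sketch of this learn-then-apply workflow, the example below learns a single threshold rule from a learning dataset and then applies it, without further human intervention, to a separate dataset. The data values and the function names `learn_threshold_rule` and `classify` are hypothetical and chosen only for illustration; real classifiers are far richer than a one-rule model.

```python
def learn_threshold_rule(values, labels):
    """Pick the threshold t that maximizes learning-set accuracy
    for the rule 'predict 1 if value > t, else 0'."""
    best_t, best_acc = None, -1.0
    for t in sorted(set(values)):
        correct = sum((v > t) == bool(y) for v, y in zip(values, labels))
        acc = correct / len(values)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t


def classify(values, threshold):
    """Apply the learned rule to new observations."""
    return [int(v > threshold) for v in values]


# Hypothetical learning dataset (one predictor, binary class label).
learn_x = [1.2, 2.5, 3.1, 4.8, 5.5, 6.9, 7.4, 8.0]
learn_y = [0, 0, 0, 0, 1, 1, 1, 1]
rule = learn_threshold_rule(learn_x, learn_y)

# Hypothetical second dataset that was not used to develop the rule.
new_x = [2.0, 5.0, 9.1]
new_y = [0, 1, 1]
predictions = classify(new_x, rule)
accuracy = sum(p == y for p, y in zip(predictions, new_y)) / len(new_y)
```

Accuracy measured on the second dataset, rather than on the learning data, is what indicates how well the rule generalizes.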

(7)

K-Fold Cross-Validation

• Randomly divide the full dataset into learning/validation datasets.

• Randomly divide the learning/validation data into K equal subsamples (typically 5 or 10).

• For each subsample k = 1, …, K, fit the model using the other K-1 subsamples.

• Estimate the prediction error (e.g., sum of squared errors) for subsample k using the model estimated from the other K-1 subsamples.

• Pick the model specification that generates the lowest average cross-validation error.

• Estimate the final model using the full dataset.
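The steps above can be sketched as follows. This is a minimal illustration, assuming the "model specification" being chosen is the penalty value of a one-predictor ridge regression with no intercept; the simulated data, the candidate penalties, and the function names are all hypothetical. The initial learning/validation split is omitted to keep the sketch short.

```python
import random


def fit_ridge_1d(xs, ys, lam):
    """Closed-form ridge slope for one predictor and no intercept."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


def k_fold_cv_error(xs, ys, lam, k=5, seed=0):
    """Average held-out mean squared error across k folds."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k near-equal subsamples
    total = 0.0
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        beta = fit_ridge_1d([xs[i] for i in train], [ys[i] for i in train], lam)
        sse = sum((ys[i] - beta * xs[i]) ** 2 for i in held_out)
        total += sse / len(held_out)
    return total / k


# Simulated data with a true slope of 2.0 (an assumption of this sketch).
rng = random.Random(42)
xs = [rng.gauss(0, 1) for _ in range(200)]
ys = [2.0 * x + rng.gauss(0, 1) for x in xs]

candidates = [0.0, 1.0, 10.0, 100.0, 1000.0]
cv_errors = {lam: k_fold_cv_error(xs, ys, lam) for lam in candidates}
best_lam = min(cv_errors, key=cv_errors.get)

# Final step: refit the chosen specification on the full dataset.
final_beta = fit_ridge_1d(xs, ys, best_lam)
```

The same fold assignment is reused for every candidate penalty so that the specifications are compared on identical held-out subsamples.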

Classification and Regression Trees

• Advantages

– Easily handle huge datasets

– Can include both qualitative and quantitative predictor variables

– Very good for missing or sparse data

– Small trees are easy to interpret

• Disadvantages

– Large trees are difficult to interpret

(8)

Rob Schapire. Machine Learning Algorithms for Classification. www.cs.princeton.edu/~schapire/talks/picasso-minicourse.pdf.


(9)

Approach

• Pick a rule to subset data

• Using the rule, divide data into subsets

• Keep repeating until the remaining subsets are almost "pure" (e.g., as measured by entropy or the Gini index)

• Usual approach is to build a very large tree and then prune it back
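A minimal sketch of one splitting step follows, assuming a single numeric predictor and rules of the form "value <= threshold". The `ages` and `event` values are hypothetical. A full tree would apply `best_split` recursively to each resulting subset and then prune, which is not shown here.

```python
import math
from collections import Counter


def gini(labels):
    """Gini index: 0 for a pure subset, larger when classes are mixed."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits: 0 for a pure subset."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_split(values, labels, impurity=gini):
    """Return (threshold, weighted impurity) of the best 'value <= t' rule."""
    n = len(values)
    best = (None, impurity(labels))  # baseline: no split at all
    for t in sorted(set(values))[:-1]:
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        score = (len(left) * impurity(left) + len(right) * impurity(right)) / n
        if score < best[1]:
            best = (t, score)
    return best


# Hypothetical data: one predictor and a binary outcome.
ages = [25, 32, 47, 51, 62, 68]
event = [0, 0, 0, 1, 1, 1]
threshold, score = best_split(ages, event)
```

Here the rule "age <= 47" yields two pure subsets, so the weighted impurity drops from 0.5 to 0 in a single step.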

Neural Networks

• Input Layer: $X_1, X_2, X_3$

• Hidden Layer: $Z_1, Z_2, Z_3, Z_4$

• Outcome Layer: $Y_1, Y_2$

$Z_i = f(B_k X_k)$

$Y_i = e^{z}/(1 + e^{z})$
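A forward pass through a network of this shape (three inputs, four hidden units, two outcomes) can be sketched as below. The sketch reads the compact notation as a weighted sum over the inputs of each unit, uses tanh as the hidden-layer function f, and applies the logistic function at the outcome layer. The tanh choice and all weight values are assumptions made for illustration, not values from the presentation.

```python
import math


def logistic(z):
    """Logistic function e^z / (1 + e^z), mapping any real z into (0, 1)."""
    return math.exp(z) / (1.0 + math.exp(z))


def layer(inputs, weights, activation):
    """Each unit applies the activation to a weighted sum of its inputs."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)))
        for row in weights
    ]


# Hypothetical weights: 4 hidden units x 3 inputs, 2 outcomes x 4 hidden units.
hidden_weights = [
    [0.2, -0.1, 0.4],
    [-0.3, 0.8, 0.1],
    [0.5, 0.5, -0.2],
    [0.0, -0.6, 0.7],
]
outcome_weights = [
    [0.3, -0.2, 0.6, 0.1],
    [-0.4, 0.9, 0.2, -0.5],
]

x = [1.0, 0.5, -1.0]                          # input layer X
z = layer(x, hidden_weights, math.tanh)       # hidden layer Z
y = layer(z, outcome_weights, logistic)       # outcome layer Y
```

Training, which would estimate the weights from data, is outside the scope of this sketch.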

(10)

Prediction Is Not the Same as Estimating Treatment Effects

• Some machine learning methods (e.g., Ridge and Lasso regression) add a penalty term to a regression model to guard against overfitting.

• Because they remain regression models, they allow the machine learning approach to be applied to the estimation of treatment effects.

• But we know that results from observational studies can be sensitive to spurious correlations and methodological approach…
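The effect of the penalty term can be seen in a minimal one-predictor sketch with no intercept. Ridge is taken here to minimize sum((y - b*x)^2) + lam*b^2 and Lasso to minimize 0.5*sum((y - b*x)^2) + lam*|b|; these scalings are one common convention and are an assumption of the sketch, as are the data values.

```python
def soft_threshold(value, lam):
    """Shrink value toward zero by lam, setting it to zero if within lam."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0


def ridge_slope(xs, ys, lam):
    """Minimizer of sum((y - b*x)^2) + lam * b^2."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


def lasso_slope(xs, ys, lam):
    """Minimizer of 0.5 * sum((y - b*x)^2) + lam * |b|."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return soft_threshold(sxy, lam) / sxx


# Hypothetical data with an unpenalized least-squares slope of about 2.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [-4.1, -1.9, 0.2, 2.1, 3.9]

ols = ridge_slope(xs, ys, 0.0)            # no penalty
ridge = ridge_slope(xs, ys, 10.0)         # shrunk toward zero
lasso_small = lasso_slope(xs, ys, 5.0)    # shrunk toward zero
lasso_large = lasso_slope(xs, ys, 25.0)   # set exactly to zero
```

Both penalties pull the coefficient toward zero, which reduces overfitting but also means a penalized coefficient on a treatment variable is a deliberately biased estimate of the treatment effect.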

(11)

Things With Strong Trends Will Tend to Be Highly Correlated (2)
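A small simulation makes the point concrete. The sketch below generates two series that share nothing except an upward time trend, then compares the correlation of their levels with the correlation of their period-to-period changes. The trend slopes, noise levels, and series length are arbitrary assumptions.

```python
import random


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


rng = random.Random(0)
periods = range(100)

# Two independent series; each is a linear trend plus its own noise.
series_a = [2.0 * t + rng.gauss(0, 3) for t in periods]
series_b = [100.0 + 0.7 * t + rng.gauss(0, 2) for t in periods]

r_levels = pearson(series_a, series_b)

# First differences remove the linear trend from each series.
diff_a = [later - earlier for earlier, later in zip(series_a, series_a[1:])]
diff_b = [later - earlier for earlier, later in zip(series_b, series_b[1:])]
r_changes = pearson(diff_a, diff_b)
```

The levels are almost perfectly correlated even though neither series influences the other, while the detrended changes show little association.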

(12)

But Even With Big Data You Have To Be Careful!

Seeger, John, Alexander Walker, Paige Williams, Gordon Saperia, Frank Sacks (2003) A Propensity Score-Matched Cohort Study of the Effect of Statins, Mainly Fluvastatin, on the Occurrence of Acute Myocardial Infarction. Am J. Cardiol 92:1447-1451.

[Figure: cumulative incidence of MI by months of follow-up, statin initiators vs. statin non-initiators, in two panels.]

• MI Outcome (Unmatched): HR=2.11 (1.46-3.04), a 111% (46%-204%) risk increase

• MI Outcome (After Matching): HR=0.69 (0.52-0.93), a 31% (7%-48%) risk reduction
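The percentages quoted with each hazard ratio follow directly from the ratio itself, as the short check below shows. The helper name `hr_to_percent_change` is hypothetical; it simply expresses a hazard ratio as a percent change in hazard relative to the comparison group.

```python
def hr_to_percent_change(hr):
    """Percent change in hazard implied by a hazard ratio.

    Positive values are risk increases; negative values are risk reductions.
    """
    return (hr - 1.0) * 100.0


# Unmatched analysis: point estimate and confidence limits.
unmatched = [round(hr_to_percent_change(hr)) for hr in (2.11, 1.46, 3.04)]

# Matched analysis: point estimate and confidence limits.
matched = [round(hr_to_percent_change(hr)) for hr in (0.69, 0.52, 0.93)]
```

For a reduction, the lower confidence limit of the hazard ratio (0.52) corresponds to the larger reduction (48%), which is why the percent interval is reported as 7% to 48%.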

Bigger Samples Don’t Reduce Bias

[Figure: distributions of estimation error for the IV and OLS estimators at N=200 and N=10,000. Panel annotations: "(,z)=0" and "(z,e)=0".]

Crown, W., Henk, H., VanNess D. Some Cautions on the Use of Instrumental Variables (IV) Estimators in Outcomes Research: How Bias in IV Estimators is Affected by Instrument Strength, Instrument Contamination, and Sample Size. Value in Health 14: 1078-1084, 2011.
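The headline claim can be illustrated with a small simulation. This is a sketch under stated assumptions, not the design used in the cited paper: the treatment variable x is driven by a valid, strong instrument z and by an unobserved confounder u that also affects the outcome y, and the true effect of x on y is 1.0. All variable names and parameter values are hypothetical.

```python
import random


def cov(a, b):
    """Sample covariance (population form) of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n


def simulate_errors(n, seed, true_beta=1.0):
    """Return (OLS error, IV error) for one simulated dataset of size n."""
    rng = random.Random(seed)
    z = [rng.gauss(0, 1) for _ in range(n)]   # instrument
    u = [rng.gauss(0, 1) for _ in range(n)]   # unobserved confounder
    x = [zi + ui + rng.gauss(0, 1) for zi, ui in zip(z, u)]
    y = [true_beta * xi + ui + rng.gauss(0, 1) for xi, ui in zip(x, u)]
    ols = cov(x, y) / cov(x, x)               # ignores the confounder
    iv = cov(z, y) / cov(z, x)                # uses only variation from z
    return ols - true_beta, iv - true_beta


results = {n: simulate_errors(n, seed=1) for n in (200, 10_000)}
ols_error_small, iv_error_small = results[200]
ols_error_large, iv_error_large = results[10_000]
```

Under these assumptions the OLS error settles near 1/3 rather than shrinking as N grows, because a larger sample reduces variance but not confounding bias. The IV estimate approaches the true value here only because the simulated instrument is valid and strong.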

(13)

EHR/Claims Linkages Can Help Reduce Missing Variable Bias

Relevant information | Claims alone | EHR alone | Linked data
Clinical data and severity measures | — | +
Retail/specialty drugs across treatment settings | + | — | +
Leakage | + | — | +
Patient-reported outcomes | —
Selection biases due to payer type | — | +
Longitudinality of patient follow-up | — | + | ++
Self-pay data | — | +
Coding biases | — | +
Unstructured data | — | +
Timing of events | — | +
Continuous coverage | — | + | ++

• Direct identifiers (EMR / Clinical): Name, Address, Birthdate, SSN, Phone, etc.

• Shared Salt Code (same for all contributors)

• Data is then hashed by contributors at their site. Uses Confidential Salt

• Statistically de-identified views
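A minimal sketch of salted hashing at a contributor site follows. The presentation does not name a hash function or a normalization scheme, so the use of HMAC-SHA-256, the lowercase-and-trim normalization, the salt value, and the function names are all assumptions made for illustration.

```python
import hashlib
import hmac


def normalize(name, birthdate):
    """Canonical form so trivial formatting differences hash identically."""
    return f"{name.strip().lower()}|{birthdate.strip()}"


def pseudonymize(name, birthdate, salt):
    """Keyed hash of the direct identifiers; the salt acts as the key."""
    message = normalize(name, birthdate).encode("utf-8")
    return hmac.new(salt, message, hashlib.sha256).hexdigest()


# Hypothetical shared salt, distributed confidentially to all contributors.
SHARED_SALT = b"hypothetical-shared-salt"

# Two contributors hash the same (fictional) patient at their own sites.
token_site_a = pseudonymize("Jane Doe ", "1970-01-31", SHARED_SALT)
token_site_b = pseudonymize("jane doe", "1970-01-31", SHARED_SALT)

# A party without the shared salt cannot reproduce the token.
token_other_salt = pseudonymize("Jane Doe", "1970-01-31", b"different-salt")
```

Because every contributor uses the same salt, records for the same person receive the same token and can be linked without the direct identifiers leaving the contributor's site.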