Speaker
First Plenary Session
THE USE OF "BIG DATA" - WHERE ARE WE
AND WHAT DOES THE FUTURE HOLD?
William H. Crown, PhD
Optum Labs
Cambridge, MA, USA
Statistical Methods and Machine Learning
Confidential property of Optum. Do not distribute or reproduce without express permission from Optum. 3
Overview
• Explosion in Data Availability
• Traditional Methods for Analyzing Observational Data
• Machine Learning Methods
– Widely used outside of health care, especially in consumer retail
– Many methods
– Model development and testing approach
– Is more data better?
– Traditional focus on prediction versus estimation of treatment effects
• How Can We Find the Best of Both?
Market Context
Velocity
Volume
Complexity
Variety
Gartner model, adapted
Tests and Treatments
(Medical, Lab, Pharmacy Claims,
Standardized Costs)
Health Risk Assessments
Socioeconomic
(Race, Income, Education, Language, …)
Vital Signs
Medication Orders
Admissions, Discharges, Transfers
Patient Health Survey (PHQ-9)
Health Survey Measurement (SF-12, SF-36)
Care Coaching Engagements
Evidence Based Medicine
(Recommended Care Pathways)
Mobile Applications / Social Networking
Medical Research
Genomic
Future
Examples of Data Partnerships
Payer/IT
Life Sciences/Payer
Life Sciences/Data and Analytics
Delivery System/Partners
Life Sciences/PBM
Multi-Stakeholder
Government
PCORI CDRN
PCORI PCORnet
FDA Sentinel
Traditional Health Services Research and Epidemiological Methods
Statistical Analysis of Observational Data
• There are good methods for developing well-matched control groups (e.g., propensity score matching), but no magic bullets.
• These methods control only for observables.
• They do not control for endogeneity or for confounding from unobserved variables.
Johnson, M., Crown, W., Martin, B., Dormuth, C., Siebert U. 2009. Good Research Practices for
Comparative Effectiveness Research: Analytic Methods to Improve Causal Inference from
Non-randomized Studies of Treatment Effects Using Secondary Data Sources. Report of the ISPOR
Retrospective Database Analysis Task Force—Part III. Value in Health 12(8): 1062-1073.
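To make the matching idea concrete, here is a minimal sketch of greedy 1:1 nearest-neighbor matching on the propensity score. It assumes the scores have already been estimated (for example, by a logistic regression of treatment on observed covariates). The function name, the caliper of 0.05, and the example scores are all hypothetical. The sketch also shows the limitation noted above: matching can balance only the observed covariates that went into the score.

```python
def match_on_propensity(treated, controls, caliper=0.05):
    """Greedy 1:1 nearest-neighbor matching without replacement.

    treated, controls: dicts mapping a unit id to its estimated propensity score.
    Returns a list of (treated_id, control_id) pairs.
    """
    available = dict(controls)  # controls that have not been used yet
    pairs = []
    # Process treated units in a fixed order so the result is reproducible.
    for t_id in sorted(treated):
        if not available:
            break
        score = treated[t_id]
        # Closest remaining control on the propensity score.
        c_id = min(available, key=lambda c: abs(available[c] - score))
        # Accept the match only if it falls within the caliper.
        if abs(available[c_id] - score) <= caliper:
            pairs.append((t_id, c_id))
            del available[c_id]
    return pairs


# Hypothetical scores for three treated and four control patients.
treated_scores = {"t1": 0.80, "t2": 0.55, "t3": 0.30}
control_scores = {"c1": 0.78, "c2": 0.52, "c3": 0.10, "c4": 0.57}

matched = match_on_propensity(treated_scores, control_scores)
print(matched)  # [('t1', 'c1'), ('t2', 'c4')]
```

Patient t3 is left unmatched because no remaining control lies within the caliper. Real studies trade off such sample loss against match quality.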
Machine Learning Methods
• Many methods:
– Classification Trees
– Neural Networks
– Random Forests
– Ridge and Lasso Regression
– Support Vector Machines
– And many others…
Hastie T., Tibshirani R., Friedman J. The Elements of Statistical Learning: Data Mining,
Inference, and Prediction. 2nd Edition. New York: Springer.
Basic Approach
• Use learning datasets to develop a highly accurate classification algorithm.
• Apply the algorithm to another dataset to predict classification.
• Rules should be as simple as possible while maintaining accuracy.
– Should be able to classify data without human intervention
– Should be efficient with very large datasets
Rob Schapire. Machine Learning Algorithms for Classification.
www.cs.princeton.edu/~schapire/talks/picasso-minicourse.pdf.
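The learn-then-apply pattern above can be sketched with a very simple rule. The example below uses a nearest-centroid classifier. That choice is an assumption made for brevity, not a method named on the slide. The variables (age, prior admissions) and the risk classes are hypothetical.

```python
def fit_centroids(rows, labels):
    """Learn one centroid (mean feature vector) per class from the learning data."""
    sums, counts = {}, {}
    for row, label in zip(rows, labels):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict(centroids, row):
    """Assign the class whose centroid is closest in squared Euclidean distance."""
    def dist(center):
        return sum((a - b) ** 2 for a, b in zip(row, center))
    return min(centroids, key=lambda label: dist(centroids[label]))


# Hypothetical learning data: (age, prior admissions) -> risk class.
learn_x = [[30, 0], [35, 1], [70, 3], [75, 4]]
learn_y = ["low", "low", "high", "high"]
model = fit_centroids(learn_x, learn_y)

# Apply the learned rule to a separate dataset, with no human intervention.
new_x = [[32, 0], [72, 5]]
predictions = [predict(model, row) for row in new_x]
print(predictions)  # ['low', 'high']
```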
K-Fold Cross-Validation
• Randomly divide the full dataset into learning/validation datasets
• Randomly divide the learning/validation data into K equal subsamples (typically 5 or 10)
• For each subsample k = 1, ..., K, fit the model using the other K-1 subsamples
• Estimate the prediction error (e.g., sum of squared errors) for subsample k using the model estimated from the other K-1 subsamples
• Pick the model specification that generates the lowest average cross-validation error
• Estimate the final model using the full dataset
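The steps above can be sketched in a few lines. This is a minimal illustration, assuming two hypothetical candidate specifications (a mean-only model and a simple least-squares line) and simulated data. It covers the K-fold loop, the choice of specification, and the final refit, and omits the initial learning/validation split.

```python
import random


def fit_mean(xs, ys):
    """Specification A: predict the training mean regardless of x."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_line(xs, ys):
    """Specification B: simple least-squares line y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    return lambda x: a + b * x


def k_fold_error(fit, xs, ys, k=5, seed=0):
    """Average held-out mean squared error over K folds."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)          # random assignment to folds
    folds = [idx[i::k] for i in range(k)]     # K (nearly) equal subsamples
    errors = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        mse = sum((ys[i] - model(xs[i])) ** 2 for i in held_out) / len(held_out)
        errors.append(mse)
    return sum(errors) / k


# Simulated data (an assumption for illustration): y depends linearly on x plus noise.
rng = random.Random(1)
xs = [i / 10 for i in range(50)]
ys = [2 + 3 * x + rng.gauss(0, 1) for x in xs]

specs = {"mean": fit_mean, "line": fit_line}
cv_error = {name: k_fold_error(fit, xs, ys) for name, fit in specs.items()}
best = min(cv_error, key=cv_error.get)        # lowest average cross-validation error
final_model = specs[best](xs, ys)             # refit the chosen specification on all data
print(best, cv_error)
```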
Classification and Regression Trees
• Advantages
– Easily handle huge datasets
– Can include both qualitative and quantitative predictor variables
– Very good for missing or sparse data
– Small trees are easy to interpret
• Disadvantages
– Large trees are difficult to interpret
Rob Schapire. Machine Learning Algorithms for Classification.
www.cs.princeton.edu/~schapire/talks/picasso-minicourse.pdf.
Approach
• Pick a rule to subset the data
• Using the rule, divide the data into subsets
• Keep repeating until the remaining subsets are almost “pure” (e.g., as measured by entropy or the Gini index)
• The usual approach is to build a very large tree and then prune it back
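A single step of this procedure, choosing the rule that makes the resulting subsets as pure as possible, can be sketched as follows. The example uses Gini impurity on one numeric predictor. The data and variable names are hypothetical, and the recursion and pruning steps are left out.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 0 for a pure subset, larger for more mixed subsets."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(values, labels):
    """Find the threshold on one numeric predictor that minimizes weighted Gini impurity."""
    n = len(values)
    best = (None, gini(labels))  # (threshold, impurity); None means "do not split"
    for threshold in sorted(set(values))[:-1]:
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / n
        if impurity < best[1]:
            best = (threshold, impurity)
    return best


# Hypothetical data: patient age and whether the patient was readmitted.
ages = [25, 32, 47, 51, 62, 70]
readmitted = ["no", "no", "no", "yes", "yes", "yes"]

threshold, impurity = best_split(ages, readmitted)
print(threshold, impurity)  # 47 0.0
```

A full tree would apply `best_split` recursively to each subset, across all predictors, until the subsets are nearly pure.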
Neural Networks
Input Layer: X_1, X_2, X_3
Hidden Layer: Z_1, Z_2, Z_3, Z_4
Outcome Layer: Y_1, Y_2
$Z_i = f(B_k X_k)$
$Y_i = e^{z} / (1 + e^{z})$
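A forward pass through a network of this shape (3 inputs, 4 hidden units, 2 outcomes) can be sketched as below. The slide does not specify the activation f, so using the logistic function in both layers is an assumption. The weights and inputs are made-up numbers, and no fitting (training) step is shown.

```python
import math


def logistic(z):
    """Logistic function e^z / (1 + e^z), written in an algebraically equivalent form."""
    return 1.0 / (1.0 + math.exp(-z))


def layer(inputs, weights):
    """One fully connected layer: weighted sum of the inputs, then the logistic function."""
    return [logistic(sum(w * x for w, x in zip(row, inputs))) for row in weights]


def forward(x, hidden_weights, outcome_weights):
    z = layer(x, hidden_weights)    # hidden layer: Z_1..Z_4
    y = layer(z, outcome_weights)   # outcome layer: Y_1, Y_2
    return z, y


# Hypothetical weights: 4 hidden units x 3 inputs, then 2 outcomes x 4 hidden units.
hidden_weights = [
    [0.2, -0.5, 0.1],
    [0.7, 0.3, -0.2],
    [-0.4, 0.6, 0.5],
    [0.1, 0.1, 0.1],
]
outcome_weights = [
    [0.5, -0.3, 0.8, 0.1],
    [-0.6, 0.4, 0.2, 0.9],
]

z, y = forward([1.0, 2.0, 3.0], hidden_weights, outcome_weights)
print(z, y)
```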
Prediction Is Not the Same as Estimating Treatment Effects
• Some machine learning methods (e.g., Ridge and Lasso regression) use regression methods with a penalty term to adjust for the danger of overfitting.
• This enables application of the machine learning approach to the estimation of treatment effects.
• But we know that results from observational studies can be sensitive to spurious correlations and methodological approach…
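To show what the penalty term does, the sketch below computes OLS, ridge, and lasso slopes for a single centered predictor, where closed forms exist. It assumes the objectives sum((y - b*x)^2) + lam*b^2 for ridge and sum((y - b*x)^2) + lam*|b| for lasso. The data and the penalty values are hypothetical.

```python
import math


def penalized_slopes(xs, ys, lam):
    """Return (ols, ridge, lasso) slopes for one predictor, after centering x and y."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    ols = sxy / sxx
    # Ridge: the squared penalty inflates the denominator, shrinking the slope toward zero.
    ridge = sxy / (sxx + lam)
    # Lasso: soft-thresholding, which sets the slope exactly to zero for a large penalty.
    lasso = math.copysign(max(abs(sxy) - lam / 2, 0.0), sxy) / sxx
    return ols, ridge, lasso


# Hypothetical data with a slope of roughly 2.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

ols, ridge, lasso = penalized_slopes(xs, ys, lam=10)
_, ridge_big, lasso_big = penalized_slopes(xs, ys, lam=100)
print(ols, ridge, lasso)
print(ridge_big, lasso_big)
```

With the larger penalty the ridge slope is small but nonzero, while the lasso slope is exactly zero. In practice the penalty would be chosen by cross-validation.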
Things With Strong Trends Will Tend to Be Highly Correlated
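A small simulation illustrates the point. The two series below are generated independently, and the only thing they share is an upward trend. The trend slopes, noise levels, and seed are arbitrary assumptions. Taking first differences is one common way to remove a shared trend before looking at the correlation.

```python
import random


def correlation(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


rng = random.Random(42)
periods = range(50)

# Two unrelated series that both happen to trend upward.
series_a = [1.0 * t + rng.gauss(0, 3) for t in periods]
series_b = [0.5 * t + rng.gauss(0, 2) for t in periods]
corr_levels = correlation(series_a, series_b)

# First differences remove the trend; what remains is independent noise.
diff_a = [series_a[t] - series_a[t - 1] for t in range(1, len(series_a))]
diff_b = [series_b[t] - series_b[t - 1] for t in range(1, len(series_b))]
corr_diffs = correlation(diff_a, diff_b)

print(round(corr_levels, 2), round(corr_diffs, 2))
```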
But Even With Big Data You Have To Be Careful!
Seeger, John, Alexander Walker, Paige Williams, Gordon Saperia, Frank Sacks (2003) A
Propensity Score-Matched Cohort Study of the Effect of Statins, Mainly Fluvastatin, on the
Occurrence of Acute Myocardial Infarction. Am J. Cardiol 92:1447-1451.
[Figure: Cumulative incidence of MI by months of follow-up, statin initiators vs. statin non-initiators, in two panels]
MI Outcome (Unmatched): HR=2.11 (1.46-3.04); 111% (46%-204%) risk increase
MI Outcome (After Matching): HR=0.69 (0.52-0.93); 31% (7%-48%) risk reduction
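The percentages reported with each hazard ratio follow directly from the ratio itself: the percent change in risk is (HR - 1) x 100. The short sketch below reproduces the slide's figures. The function name is hypothetical, and results are rounded to whole percentages.

```python
def percent_change_in_risk(hazard_ratio):
    """Percent change in the hazard relative to the comparison group, rounded."""
    return round((hazard_ratio - 1.0) * 100)


# Point estimate, lower bound, upper bound for each analysis.
unmatched = [percent_change_in_risk(hr) for hr in (2.11, 1.46, 3.04)]
matched = [percent_change_in_risk(hr) for hr in (0.69, 0.52, 0.93)]

print(unmatched)  # [111, 46, 204]  -> 111% (46%-204%) risk increase
print(matched)    # [-31, -48, -7]  -> 31% (7%-48%) risk reduction
```

The sign flips between the two analyses: before matching, statin initiators appear to be at higher risk, and after matching they appear to be at lower risk.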
Bigger Samples Don’t Reduce Bias
[Figure: Distributions of estimation error for the IV and OLS estimators, shown for N=200 and N=10,000, with (z,e)=0]
Crown, W., Henk, H., VanNess D. Some Cautions on the Use of Instrumental Variables (IV) Estimators
in Outcomes Research: How Bias in IV Estimators is Affected by Instrument Strength, Instrument
Contamination, and Sample Size. Value in Health 14: 1078-1084, 2011.
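The point that larger samples shrink variance but not bias can be illustrated with a small simulation. It is a sketch under stated assumptions, not a reproduction of the cited study. The treatment x depends on an instrument z and an unobserved confounder u, the outcome error also contains u, and z is uncorrelated with that error. Under this setup the OLS slope converges to the true effect plus 0.5, while the IV estimate converges to the true effect.

```python
import random


def simulate(n, seed, true_beta=1.0):
    """Return (OLS error, IV error) for one simulated sample of size n."""
    rng = random.Random(seed)
    z = [rng.gauss(0, 1) for _ in range(n)]   # instrument
    u = [rng.gauss(0, 1) for _ in range(n)]   # unobserved confounder
    x = [zi + ui for zi, ui in zip(z, u)]     # treatment depends on both
    # The outcome error (u + noise) is correlated with x through u.
    y = [true_beta * xi + ui + rng.gauss(0, 1) for xi, ui in zip(x, u)]

    def cov(a, b):
        ma, mb = sum(a) / n, sum(b) / n
        return sum((p - ma) * (q - mb) for p, q in zip(a, b)) / n

    ols = cov(x, y) / cov(x, x)   # biased: x is correlated with the error
    iv = cov(z, y) / cov(z, x)    # consistent here: z is uncorrelated with the error
    return ols - true_beta, iv - true_beta


for n in (200, 10_000):
    ols_error, iv_error = simulate(n, seed=0)
    print(n, round(ols_error, 3), round(iv_error, 3))
```

At both sample sizes the OLS error stays near 0.5. Only its sampling variability shrinks as N grows.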
EHR/Claims Linkages Can Help Reduce Missing Variable Bias
Relevant information                             | Claims alone | EHR alone | Linked data
Clinical data and severity measures              | -            | +         | +
Retail/specialty drugs across treatment settings | +            | -         | +
Leakage                                          | +            | -         | +
Patient-reported outcomes                        | -            | -         | -
Selection biases due to payer type               | -            | +         | +
Longitudinality of patient follow-up             | -            | +         | ++
Self-pay data                                    | -            | +         | +
Coding biases                                    | -            | +         | +
Unstructured data                                | -            | +         | +
Timing of events                                 | -            | +         | +
Continuous coverage                              | -            | +         | ++
De-identification
[Diagram: flow from direct identifiers to statistically de-identified views]
Direct identifiers (EMR / Clinical): Name | Address | Birthdate | SSN | Phone, etc.
Direct identifiers (Insurer example): Name | Address | Birthdate | Member ID | Phone, etc.
Primary hash: Shared Salt Code (same for all contributors). Data is then hashed by contributors at their site.
Secondary hash: Uses Confidential Salt.
Result: Statistically de-identified views.
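The two-stage salted hashing idea can be sketched as follows. Several details are assumptions, because the slide does not specify them: the hash function (SHA-256 here), the fields used, the normalization rules, the salt values, and the order in which the shared and confidential salts are applied. The key property is that the same person yields the same token at every contributor, so records can be linked without exchanging direct identifiers.

```python
import hashlib


def normalize(record):
    """Canonical string built from direct identifiers (field names are hypothetical)."""
    return "|".join(record[key].strip().lower() for key in ("name", "address", "birthdate"))


def salted_hash(value, salt):
    """SHA-256 of salt + value, as a hex string."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()


def deidentify(record, shared_salt, confidential_salt):
    # Primary hash: computed by each contributor at its own site with the shared salt.
    primary = salted_hash(normalize(record), shared_salt)
    # Secondary hash: applied afterwards with a salt the contributors do not know.
    return salted_hash(primary, confidential_salt)


# Placeholder salts for illustration only.
SHARED_SALT = "example-shared-salt"
CONFIDENTIAL_SALT = "example-confidential-salt"

# The same (fictional) person as recorded by a clinic and by an insurer.
emr_record = {"name": "Jane Doe ", "address": "1 Main St", "birthdate": "1970-01-01"}
claims_record = {"name": "jane doe", "address": "1 main st", "birthdate": "1970-01-01"}

token_emr = deidentify(emr_record, SHARED_SALT, CONFIDENTIAL_SALT)
token_claims = deidentify(claims_record, SHARED_SALT, CONFIDENTIAL_SALT)
print(token_emr == token_claims)  # True
```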