Dealing with Missing Data
Roch Giorgi
UMR 912 SESSTIM, Aix Marseille Université / INSERM / IRD, Marseille, France BioSTIC, APHM, Hôpital Timone, Marseille, France
email: [email protected]
Background (1)
● Importance of quality control is well known
● Covariate values may be missing for some subjects
Collected routinely: tumor size, lymph node status, metastasis (mainly)
Collected for specific studies: estrogen receptor, socioprofessional category,…
● Missing values may concern
Dependent variable: Time/Status in survival analysis
Independent variable(s): tumor size,…
Whatever the question (incidence, survival,…)
Background (2)
● Consequences of missing data
Loss of irrelevant/non informative information
No impact on estimates
Loss of relevant/informative information
Impact depends on the percentage of missing values
Possible bias in both point estimates and standard errors
Loss of statistical power
Univariate/Multivariate analysis?
Multivariate analysis: increase of the total percentage of missing values
● What can we do?
Discard all the data set?
Choose an appropriate method to perform analysis?
Objectives
● Present an overview of
The types of missing data
Some methods used to deal with missing data
● Provide outline guidelines
Missing Data Mechanism: Notations
● $Y = (y_{ij})$: (n x k) rectangular data set without missing values
● $M = (m_{ij})$: defines the missingness pattern
$m_{ij} = 1$ if $y_{ij}$ is missing
$m_{ij} = 0$ if $y_{ij}$ is present
[Figure: three missingness patterns for subjects 1, 2, …, n (rows) and variables Y1, Y2, …, Yk (columns), with "?" marking a missing value. Panels: Univariate, Monotone, Non-Monotone]
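The notation above can be illustrated with a minimal Python sketch. It builds the indicator matrix M from a small data set and labels its pattern. The function names, the use of `None` for a missing value, and the rule "univariate = missing values confined to one variable" are assumptions made for this illustration. Monotonicity is checked only in the given column order.

```python
def missingness_matrix(Y):
    """Return M with m_ij = 1 if y_ij is missing (None), 0 if present."""
    return [[1 if y is None else 0 for y in row] for row in Y]

def classify_pattern(M):
    """Label a missingness pattern (illustrative definitions, see lead-in)."""
    k = len(M[0])
    incomplete_columns = [j for j in range(k) if any(row[j] for row in M)]
    if not incomplete_columns:
        return "complete"
    if len(incomplete_columns) == 1:
        return "univariate"
    # Monotone in the given column order: once a variable is missing
    # for a subject, every later variable is missing too.
    if all(row[j] <= row[j + 1] for row in M for j in range(k - 1)):
        return "monotone"
    return "non-monotone"

Y_univariate = [[1, 2, None], [3, 4, 5]]
Y_monotone = [[1, 2, 3], [4, 5, None], [7, None, None]]
Y_non_monotone = [[1, None, 3], [None, 5, 6]]

patterns = [classify_pattern(missingness_matrix(Y))
            for Y in (Y_univariate, Y_monotone, Y_non_monotone)]
```

A pattern that is not monotone in the given order may become monotone after reordering the variables. This sketch does not attempt that reordering.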
Missing Data Mechanism: Classification
Characterized by the conditional distribution of M given Y
● Missing Completely At Random (MCAR)
Missingness mechanism independent of the values of the data Y, whether missing (Ymis) or observed (Yobs)
● Missing At Random (MAR)
Missingness mechanism depends only on Yobs, not on Ymis
● Missing Not At Random (MNAR)
Missingness mechanism depends on Ymis
→ Ignorable (MCAR, MAR) / Non-ignorable (MNAR) missing data
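The three mechanisms can be made concrete with a small simulation. This is only a sketch under assumed settings: a fully observed covariate x, an outcome y = x + noise subject to missingness, and arbitrary logistic missingness probabilities. None of these choices come from the slides.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def simulate(n=20000, seed=1):
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]   # always observed (part of Yobs)
    y = [xi + rng.gauss(0, 1) for xi in x]    # subject to missingness
    # Probability that y_i is missing under each (assumed) mechanism
    p_missing = {
        "MCAR": lambda xi, yi: 0.3,           # independent of the data
        "MAR": lambda xi, yi: sigmoid(xi),    # depends only on observed x
        "MNAR": lambda xi, yi: sigmoid(yi),   # depends on the missing y itself
    }
    observed_means = {}
    for name, p in p_missing.items():
        kept = [yi for xi, yi in zip(x, y) if rng.random() >= p(xi, yi)]
        observed_means[name] = sum(kept) / len(kept)
    return sum(y) / n, observed_means

full_mean, observed_means = simulate()
```

In this setup the mean of the observed y stays close to the full-data mean only under MCAR. Under MAR the crude mean shifts as well, but the distribution of y given x among observed cases is unchanged, which is what methods valid under MAR rely on.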
Missing Data Mechanism
● What do we learn with that?
MCAR, MAR: handling missing data in an “appropriate way” does not require modelling the missingness process
Statistical tests
H0: MCAR vs MAR? ⇐ Yes
H0: ignorable vs non-ignorable? ⇐ No
Classical methods used to handle missing data
Provide valid statistical inferences with ignorable missing data
Are not valid with non-ignorable missing data
Sensitivity analyses under various scenarios of nonresponse when the MNAR hypothesis is suspected
(e.g. self-reported characteristics such as psychological disorders, quality of life, income,…)
Classical Methods
● Complete cases
● Indicator variable
● Multiple imputation
and others…
Complete Cases Method
● Based only on the individuals having no missing values on the covariates included in the analysis
● The preferred method of many statistical software packages!
● Pos
Easy to perform! but not necessarily a good point
Unbiased results under MCAR hypothesis
● Neg
Reduction of sample size
Loss of statistical power
Bias in standard errors
Inappropriate variable selection (regression analysis)
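The reduction of sample size can be quantified with a short sketch. Assuming, purely for illustration, k covariates that are each missing independently with the same probability p, the expected share of complete cases is (1 - p)^k. The values n = 10000, k = 5 and p = 0.10 below are hypothetical.

```python
import random

def complete_cases(rows):
    """Keep only rows with no missing value (None) in any covariate."""
    return [row for row in rows if all(v is not None for v in row)]

def expected_complete_fraction(p_missing, k):
    """Expected share of complete cases with k covariates, each missing
    independently with probability p_missing (an assumption of this sketch)."""
    return (1.0 - p_missing) ** k

rng = random.Random(0)
n, k, p = 10000, 5, 0.10
rows = [[None if rng.random() < p else 1.0 for _ in range(k)]
        for _ in range(n)]

observed_fraction = len(complete_cases(rows)) / n
expected_fraction = expected_complete_fraction(p, k)  # 0.9 ** 5 = 0.59049
```

With only 10% missing per covariate, about 41% of the subjects are dropped once five covariates enter the same model. In real data the covariates are rarely missing independently, so the actual loss may be smaller or larger.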
Indicator Variable Method
● Creation of a missing data indicator variable
● Treat missing data as just another category
● Pos
Includes all the observations for the analysis
No loss of statistical power
May help to interpret results (similarity with another category)
● Neg
Biased estimates (usually)
May not help to interpret results (absence of similarity)
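As an illustration of the indicator variable method, the sketch below recodes missing values of a categorical covariate as an extra "Missing" category and builds the dummy variables that would enter a regression model. The variable (T-stage), the labels and the helper names are hypothetical.

```python
def add_missing_category(values, label="Missing"):
    """Treat a missing value (None) as just another category."""
    return [label if v is None else v for v in values]

def dummy_code(values, reference):
    """Return the non-reference levels and one 0/1 column per level."""
    levels = sorted(set(values) - {reference})
    design = [[1 if v == level else 0 for level in levels] for v in values]
    return levels, design

t_stage = ["T1", None, "T2", "T1", None]     # hypothetical data
recoded = add_missing_category(t_stage)
levels, design = dummy_code(recoded, reference="T1")
```

All five subjects stay in the analysis, but the coefficient of the "Missing" column mixes subjects from different true categories, which is one way to see why the estimates are usually biased.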
Multiple Imputation (MI): Principle
● Step 1 (imputation model, MAR assumption)
Imputation of the missing values (?) to create “M” completed data sets: 1, 2, …, M
● Step 2 (analysis model)
Analysis of each of these “completed” data sets
⇒ estimates and standard errors: e1 (se1), e2 (se2), …, eM (seM)
● Step 3
Combination to produce a single set of estimates with their standard errors: e (se)
MI: Imputation of the Missing Values
● Goal: to account for the relationships between Ymis and Yobs, while taking into account the uncertainty of the imputation
$Y^{*} \sim f(Y_{mis} \mid Y_{obs})$
● Imputation model (non exhaustive)
Continuous variable (e.g.: age at diagnosis): propensity methods, predictive mean matching
Binary data (e.g.: M-stage): logistic regression
Categorical data (e.g.: T-stage): polytomous logistic regression, proportional odds
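One of the imputation models listed above, predictive mean matching, can be sketched as follows. This is a simplified variant, not the implementation of any particular package: a single predictor, a bootstrap of the observed cases to reflect uncertainty in the regression parameters, and a random draw among the 5 closest donors. The simulated data and all names are hypothetical.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def pmm_impute(x, y, rng, n_donors=5):
    """Return one completed copy of y (None = missing)."""
    obs = [i for i, v in enumerate(y) if v is not None]
    # Bootstrap the observed cases so each imputation uses different
    # regression coefficients (uncertainty of the imputation model).
    boot = [rng.choice(obs) for _ in obs]
    a, b = fit_line([x[i] for i in boot], [y[i] for i in boot])
    completed = list(y)
    for i, v in enumerate(y):
        if v is None:
            pred = a + b * x[i]
            # Donors: observed cases with the closest predicted values
            donors = sorted(obs, key=lambda j: abs(a + b * x[j] - pred))
            completed[i] = y[rng.choice(donors[:n_donors])]
    return completed

rng = random.Random(42)
x = [i / 10 for i in range(50)]
y_full = [2 * xi + rng.gauss(0, 0.5) for xi in x]
y = [None if i % 5 == 0 else v for i, v in enumerate(y_full)]

M = 5
completed_sets = [pmm_impute(x, y, rng) for _ in range(M)]
```

Because imputed values are borrowed from observed subjects, they always fall within the range of the observed data. Binary or categorical variables would need a logistic-type model instead, as listed above.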
MI: Analyses of the Completed Data Sets
● Analysis model: classical methods used to estimate
Incidence
Survival
Effect of prognostic factors
…
● Independent analyses
● Each applied on the “new” completed data sets
MI: Combined Analysis
● Combination of the M estimates into an overall estimate and variance–covariance matrix using Rubin’s rules
● Take into account the uncertainty due to missing data
Statistics that can be combined: mean, proportion, regression coefficient,…
Statistics that may require transformation: odds ratio, hazard ratio, baseline hazard, survival probability,…
Statistics that cannot be combined: P-value, likelihood ratio test statistic,…
Adapted from: White IR, et al. Statistics in Medicine 2009
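Rubin's rules for a single parameter can be sketched in a few lines: the pooled estimate is the mean of the M estimates, and the total variance adds the average within-imputation variance to (1 + 1/M) times the between-imputation variance. The example pools hazard ratios on the log scale, in line with the "may require transformation" category. The five hazard ratios and standard errors are invented for illustration, and the interval uses a normal approximation rather than Rubin's t reference distribution.

```python
import math

def rubin_pool(estimates, std_errors):
    """Pool M estimates of one parameter with Rubin's rules."""
    m = len(estimates)
    q_bar = sum(estimates) / m                                   # pooled estimate
    within = sum(se ** 2 for se in std_errors) / m               # W
    between = sum((q - q_bar) ** 2 for q in estimates) / (m - 1) # B
    total = within + (1 + 1 / m) * between                       # T
    return q_bar, math.sqrt(total)

# Hypothetical results from M = 5 completed data sets
hazard_ratios = [1.50, 1.62, 1.41, 1.55, 1.48]
se_log_hr = [0.12, 0.13, 0.12, 0.11, 0.12]

# Combine on the log scale, then back-transform
log_hr = [math.log(hr) for hr in hazard_ratios]
pooled_log_hr, pooled_se = rubin_pool(log_hr, se_log_hr)
pooled_hr = math.exp(pooled_log_hr)
ci_95 = (math.exp(pooled_log_hr - 1.96 * pooled_se),
         math.exp(pooled_log_hr + 1.96 * pooled_se))
```

The pooled standard error is never smaller than the average within-imputation standard error, which is how the extra uncertainty due to missing data enters the result.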
MI: Issue and Guidance for Practice (1)
● How many missing at most?
Do not think in terms of the % of missing values per covariate, but in terms of the % reduction from the original data set when all variables used for the analyses are considered
Think about the missingness mechanism
● Which variables to include in the imputation model?
Covariates and outcome from the analysis model
In survival model: status, time (t, log(t)) or cumulative baseline hazard function
All predictors of the incomplete variable
The number of variables in the imputation model may be greater than in the analysis model
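For survival models, the cumulative baseline hazard mentioned above is often approximated by the Nelson-Aalen estimate evaluated at each subject's follow-up time. The sketch below assumes that approximation and computes the value that would be added, together with the status indicator, as a predictor in the imputation model. The function name and the toy data are hypothetical.

```python
def nelson_aalen_at_own_time(times, status):
    """Nelson-Aalen cumulative hazard H(t_i) for each subject.

    times: follow-up times; status: 1 = event, 0 = censored.
    """
    event_times = sorted({t for t, d in zip(times, status) if d == 1})
    cumulative, steps = 0.0, []
    for t in event_times:
        events = sum(1 for ti, di in zip(times, status) if ti == t and di == 1)
        at_risk = sum(1 for ti in times if ti >= t)
        cumulative += events / at_risk
        steps.append((t, cumulative))

    def H(t):
        value = 0.0
        for event_time, h in steps:
            if event_time > t:
                break
            value = h
        return value

    return [H(t) for t in times]

times = [1, 2, 3, 4]      # hypothetical follow-up times
status = [1, 0, 1, 1]     # subject 2 is censored
H_values = nelson_aalen_at_own_time(times, status)
```

The columns `status` and `H_values` (or alternatively t and log(t)) would then sit next to the covariates in the imputation model.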
MI: Issue and Guidance for Practice (2)
● Should we pay attention to the form of the imputation model?
Yes in theory, but hard to do in practice (linearity? interaction terms?…)
● How many imputations are necessary?
M=5-10 usually considered to be adequate
Other rules exist, based on the fraction of missing data
● Do we have to perform new imputations for each analysis?
The imputed data sets may be used for several analyses
This requires care when building the imputation model (so that it is more ‘congenial’ to the analyses)
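On the number of imputations, two commonly quoted rules can be written down as a sketch. The first is Rubin's relative efficiency of M imputations compared with infinitely many, 1 / (1 + γ/M), where γ is the fraction of missing information. The second is the rule of thumb from White et al. that M should be at least the percentage of incomplete cases. Whether these are the rules the slide has in mind is an assumption, and the function names are hypothetical.

```python
import math

def relative_efficiency(gamma, m):
    """Rubin's relative efficiency of m imputations vs. infinitely many.

    gamma: fraction of missing information (between 0 and 1).
    """
    return 1.0 / (1.0 + gamma / m)

def rule_of_thumb_m(percent_incomplete_cases):
    """Rule of thumb: M at least equal to the % of incomplete cases."""
    return math.ceil(percent_incomplete_cases)

efficiency_m5 = relative_efficiency(0.3, 5)     # 1 / 1.06, about 0.943
efficiency_m20 = relative_efficiency(0.3, 20)   # 1 / 1.015, about 0.985
suggested_m = rule_of_thumb_m(30)               # 30 imputations
```

With γ = 0.3, five imputations already reach about 94% efficiency for the point estimate, which is why M = 5-10 is often considered adequate. The rule of thumb asks for more because it also targets reproducible standard errors and p-values.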
MI: Issue and Guidance for Practice (3)
● Is there a particular model building strategy?
Variable selection can be performed on all imputed data sets, or on a single stacked data set (after merging the imputed data sets) with an appropriate weighting procedure
Model checking could be performed on each imputed data set
Prediction could be obtained using Rubin’s rules
● How can we be confident that the missingness mechanism is ignorable?
Think about your data
Perform sensitivity analysis
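One simple form of sensitivity analysis is a delta adjustment: values imputed under MAR are shifted by an amount δ to mimic an MNAR scenario, and the analysis is repeated over a range of δ. The sketch below is only an illustration of that idea, not the method of any cited reference. The completed data, the missingness flags and the δ grid are invented.

```python
def delta_adjust(completed, was_missing, delta):
    """Shift only the imputed values by delta (assumed MNAR scenario)."""
    return [v + delta if miss else v
            for v, miss in zip(completed, was_missing)]

# Hypothetical completed variable (after imputation under MAR)
completed = [10.0, 12.0, 11.0, 13.0, 9.0]
was_missing = [False, True, False, True, False]

sensitivity = {}
for delta in (-2.0, -1.0, 0.0, 1.0, 2.0):
    shifted = delta_adjust(completed, was_missing, delta)
    sensitivity[delta] = sum(shifted) / len(shifted)   # estimate of interest
```

If the conclusions are stable over a plausible range of δ, the MAR-based results are less likely to be driven by a non-ignorable mechanism. With multiple imputation, the shift would be applied to each of the M completed data sets before pooling.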
Thank you
Roch Giorgi
email: [email protected]
Challenges in the Estimation of Net SURvival working survival group French National Research Agency (ANR-12-BSV1-0028)
References
● Eisemann N, Waldmann A, Katalinic A. Imputation of missing values of tumour stage in population-based cancer registration. BMC Medical Research Methodology 2011;11:129.
● Giorgi R, Belot A, Gaudart J, Launoy G; French Network of Cancer Registries FRANCIM. The performance of multiple imputation for missing covariate data within the context of regression relative survival analysis. Statistics in Medicine 2008;27(30):6310-31.
● Howlader N, Noone AM, Yu M, Cronin KA. Use of imputed population-based cancer registry data as a method of accounting for missing information: application to estrogen receptor status for breast cancer. American Journal of Epidemiology 2012;176(4):347-56.
● Little RJA, Rubin DB. Statistical Analysis with Missing Data (2nd edn). Wiley: New York, 2002.
● Nur U, Shack LG, Rachet B, Carpenter JR, Coleman MP. Modelling relative survival in the presence of incomplete data: a tutorial. International Journal of Epidemiology 2010;39(1):118-28.
● Resseguier N, Giorgi R, Paoletti X. Sensitivity analysis when data are missing not-at-random. Epidemiology 2011;22(2):282.
● White IR, Royston P, Wood AM. Multiple imputation using chained equations: Issues and guidance for practice. Statistics in Medicine 2011;30(4):377-99.