Dealing with Missing Data

(1)

Roch Giorgi

UMR 912 SESSTIM, Aix Marseille Université / INSERM / IRD, Marseille, France

BioSTIC, APHM, Hôpital Timone, Marseille, France

email: [email protected]

(2)

Background (1)

● Importance of quality control is well known

● Covariate values may be missing for some subjects

Collected routinely: tumor size, lymph node status, metastasis (mainly)

Collected for specific studies: estrogen receptor, socioprofessional category,…

● Missing values may concern

Dependent variable: Time/Status in survival analysis

Independent variable(s): tumor size,…

Whatever the question (incidence, survival,…)

(3)

Background (2)

● Consequences of missing data

Loss of irrelevant/non informative information

No impact on estimates

Loss of relevant/informative information

Impact depends on the percentage of missing values

Possible bias in both point estimates and standard errors

Loss of statistical power

Univariate/Multivariate analysis?

Multivariate analysis: increase of the total percentage of missing values

● What can we do?

Discard all the data set?

Choose an appropriate method to perform analysis?

(4)

Objectives

● Present an overview of

The types of missing data

Some methods used to deal with missing data

● Provide outline guidelines

(5)

Missing Data Mechanism: Notations

● Y = (y_ij): (n x k) rectangular data set without missing values

● M = (m_ij): defines the missingness pattern

m_ij = 1 if y_ij is missing

m_ij = 0 if y_ij is present

[Figure: three missingness patterns in a data set with rows 1, 2, …, n and columns Y₁, Y₂, …, Y_k, where "?" marks a missing value. Panels: Univariate, Monotone, Non-Monotone.]
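The notation and the patterns above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: missing values are coded as NaN, and the helper names `missingness_indicator` and `is_monotone` are hypothetical, not taken from any package.

```python
import numpy as np

def missingness_indicator(Y):
    """Return M with m_ij = 1 if y_ij is missing (NaN) and 0 if it is present."""
    return np.isnan(Y).astype(int)

def is_monotone(M):
    """Return True if the columns can be ordered so that, in every row,
    once a value is missing all the following values are missing too."""
    # Order columns from the least to the most often missing.
    order = np.argsort(M.sum(axis=0), kind="stable")
    M_sorted = M[:, order]
    # Monotone pattern: each row looks like 0 ... 0 1 ... 1 (never 1 followed by 0).
    return bool(np.all(np.diff(M_sorted, axis=1) >= 0))

# Small (n x k) example with n = 3 subjects and k = 3 variables.
Y = np.array([
    [1.0, 2.0, 3.0],
    [4.0, 5.0, np.nan],
    [7.0, np.nan, np.nan],
])
M = missingness_indicator(Y)
print(M)
print("monotone:", is_monotone(M))

# A pattern where each subject misses a different variable is non-monotone.
M_non_monotone = np.array([[1, 0], [0, 1]])
print("monotone:", is_monotone(M_non_monotone))
```

A univariate pattern is the special case in which only one column of M contains 1s.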

(6)

Missing Data Mechanism: Classification

Characterized by the conditional distribution of M given Y

● Missing Completely At Random (MCAR)

Missingness mechanism is independent of the values of the data Y, whether missing (Y_mis) or observed (Y_obs)

● Missing At Random (MAR)

Missingness mechanism depends only on Y_obs, not on Y_mis

● Missing Not At Random (MNAR)

Missingness mechanism depends on Y_mis

→ Ignorable (MCAR, MAR) / Non-ignorable (MNAR) missing data
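A small simulation may help to see what the three mechanisms imply in practice. The sketch below is an assumption-laden illustration, not an analysis from this talk: the variables (age, tumor size), the coefficients, and the logistic missingness models are all invented. It compares the mean of the observed values with a mean adjusted on the fully observed covariate.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical data: age is always observed, tumor size will be made incomplete.
age = rng.normal(60, 10, n)
size = 20 + 0.5 * (age - 60) + rng.normal(0, 5, n)  # true mean of size is 20

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

# m_i = True means "size is missing for subject i".
missing = {
    "MCAR": rng.random(n) < 0.5,                     # independent of all data
    "MAR": rng.random(n) < expit((age - 60) / 5),    # depends on observed age only
    "MNAR": rng.random(n) < expit((size - 20) / 5),  # depends on the unobserved size
}

results = {}
for name, m in missing.items():
    observed_mean = size[~m].mean()
    # Regress size on age among observed cases, then average predictions over everyone.
    slope, intercept = np.polyfit(age[~m], size[~m], 1)
    adjusted_mean = (intercept + slope * age).mean()
    results[name] = (observed_mean, adjusted_mean)
    print(f"{name}: observed-case mean = {observed_mean:.2f}, "
          f"age-adjusted mean = {adjusted_mean:.2f}")
```

In this toy setting the observed-case mean is close to the truth only under MCAR. Adjusting on age recovers the truth under MAR, because missingness depends only on an observed variable, but not under MNAR.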

(7)

Missing Data Mechanism

● What do we learn with that?

MCAR, MAR: handling missing data in an “appropriate way” does not require modelling the missingness process

Statistical tests

H₀: MCAR vs MAR? ⇐ Yes

H₀: ignorable vs non-ignorable? ⇐ No

Classical methods used to handle missing data

Provide valid statistical inferences with ignorable missing data

Are not valid with non-ignorable missing data

Sensitivity analyses under various scenarios of nonresponse when the MNAR hypothesis is suspected (e.g. self-reported characteristics such as psychological disorders, quality of life, income,…)

(8)

Classical Methods

● Complete cases

● Indicator variable

● Multiple imputation

and others…

(9)

Complete Cases Method

● Based only on the individuals having no missing values on the covariates included in the analysis

● The default choice of many statistical software packages!

● Pos

Easy to perform! (though this is not necessarily an advantage)

Unbiased results under MCAR hypothesis

● Neg

Reduction of sample size

Loss of statistical power

Bias in standard errors

Inappropriate variable selection (regression analysis)
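The reduction of sample size can be made concrete with a short pandas sketch. The data are simulated and the setting is an assumption chosen for illustration: five covariates, each with 10% of values missing completely at random and independently of the others.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n, k = 10_000, 5

# Hypothetical covariates x1..x5, complete to begin with.
df = pd.DataFrame(rng.normal(size=(n, k)),
                  columns=[f"x{j}" for j in range(1, k + 1)])

# Set 10% of each covariate to missing (NaN), independently (MCAR).
mask = rng.random((n, k)) < 0.10
df = df.mask(mask)

per_variable = df.isna().mean()   # proportion missing for each covariate
complete = df.dropna()            # complete cases: no missing value in any column
retained = len(complete) / len(df)

print(per_variable.round(3))
print(f"complete cases retained: {retained:.1%}")
```

With independent missingness the expected proportion of complete cases is 0.9^5, about 59%, although no single covariate has more than about 10% of missing values.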

(10)

Indicator Variable Method

● Creation of a missing data indicator variable

● Treat missing data as just another category

● Pos

Includes all the observations for the analysis

No loss of statistical power

May help to interpret results (similarity with another category)

● Neg

Biased estimates (usually)

May not help to interpret results (absence of similarity)
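For a categorical covariate, the indicator variable method amounts to a simple recoding. The sketch below uses pandas with an invented `t_stage` column; the category labels and the choice of T1 as reference category are assumptions made for the example.

```python
import pandas as pd

# Hypothetical T-stage with two missing values.
df = pd.DataFrame({"t_stage": ["T1", "T2", None, "T3", None, "T1"]})

# Treat missing data as just another category.
filled = df["t_stage"].fillna("Missing")
df["t_stage_cat"] = pd.Categorical(
    filled, categories=["T1", "T2", "T3", "Missing"]
)

# Dummy coding with T1 as the reference category (first category dropped).
design = pd.get_dummies(df["t_stage_cat"], prefix="t_stage", drop_first=True)
print(design)
```

All six observations stay in the design matrix, but the coefficient of `t_stage_Missing` has no clear interpretation and the other estimates are usually biased, as noted above.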

(11)

Multiple Imputation (MI): Principle

● MAR assumption

● Step 1

Imputation of the missing values to produce “M” completed data sets (imputation model)

● Step 2

Analysis of each of these “completed” data sets (analysis model)

⇒ estimates and standard errors: e₁ (se₁), e₂ (se₂), …, e_M (se_M)

● Step 3

Combination to produce a single set of estimates with their standard errors: e (se)

(12)

MI: Imputation of the Missing Values

● Goal: to account for the relationships between Y_mis and Y_obs, while taking into account the uncertainty of the imputation

Y* ~ f(Y_mis | Y_obs)

● Imputation model (non exhaustive)

Continuous variable (e.g.: age at diagnosis): propensity methods, predictive mean matching

Binary data (e.g.: M-stage): logistic regression

Categorical data (e.g.: T-stage): polytomous logistic regression, proportional odds
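As an illustration of one of the imputation models listed above, here is a deliberately simplified sketch of predictive mean matching for a continuous variable with a single predictor. It is not the algorithm of any particular package: the function name `pmm_impute` and the simulated data are assumptions, and a proper implementation would also draw the regression coefficients from their posterior distribution before matching.

```python
import numpy as np

def pmm_impute(x, y, rng, k=5):
    """Impute missing values of y (NaN) by predictive mean matching on x."""
    obs = ~np.isnan(y)
    y_obs = y[obs]
    # Least-squares fit of y on x among the observed cases.
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X[obs], y_obs, rcond=None)
    pred = X @ beta
    y_imp = y.copy()
    for i in np.flatnonzero(~obs):
        # The k observed cases whose predicted value is closest to that of case i.
        donors = np.argsort(np.abs(pred[obs] - pred[i]))[:k]
        # Borrow the observed value of one randomly chosen donor.
        y_imp[i] = rng.choice(y_obs[donors])
    return y_imp

# Simulated example: about 30% of y is set to missing.
rng = np.random.default_rng(42)
x = rng.normal(size=200)
y = 2.0 + 1.5 * x + rng.normal(scale=0.5, size=200)
y_incomplete = y.copy()
y_incomplete[rng.random(200) < 0.3] = np.nan

# Five completed data sets; the random donor choice makes them differ.
completed = [pmm_impute(x, y_incomplete, rng) for _ in range(5)]
print(sum(np.isnan(c).sum() for c in completed))
```

Because imputed values are always borrowed from observed cases, they stay within the range of plausible values, and the random choice among donors carries part of the uncertainty of the imputation.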

(13)

MI: Analyses of the Completed Data Sets

● Analysis model: classical methods used to estimate

Incidence

Survival

Effect of prognostic factors

…

● Independent analyses

● Each applied to one of the “new” completed data sets

(14)

MI: Combined Analysis

● Combination of the M estimates into an overall estimate and variance–covariance matrix using Rubin’s rules

● Take into account the uncertainty due to missing data

Statistics that can be combined: mean, proportion, regression coefficient,…

Statistics that may require transformation: odds ratio, hazard ratio, baseline hazard, survival probability,…

Statistics that cannot be combined: P-value, likelihood ratio test statistic,…

Adapted from: White IR, et al. Statistics in Medicine 2009
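Rubin's rules for a single parameter can be sketched as follows. The pooled estimate is the mean of the M estimates, and the total variance is the mean within-imputation variance plus (1 + 1/M) times the between-imputation variance. The function name `rubin_pool` and the numbers are illustrative assumptions. The example pools log hazard ratios and back-transforms the result, since a hazard ratio is one of the statistics that may require transformation.

```python
import numpy as np

def rubin_pool(estimates, std_errors):
    """Pool M estimates of one parameter and their standard errors with Rubin's rules."""
    q = np.asarray(estimates, dtype=float)
    se = np.asarray(std_errors, dtype=float)
    m = len(q)
    q_bar = q.mean()              # pooled point estimate
    within = np.mean(se ** 2)     # average within-imputation variance
    between = q.var(ddof=1)       # between-imputation variance
    total = within + (1 + 1 / m) * between
    return q_bar, np.sqrt(total)

# Hypothetical log hazard ratios from M = 5 completed data sets.
log_hr = [0.40, 0.50, 0.45, 0.55, 0.60]
se_log_hr = [0.10, 0.10, 0.10, 0.10, 0.10]

pooled, pooled_se = rubin_pool(log_hr, se_log_hr)
hazard_ratio = np.exp(pooled)     # back-transform after pooling on the log scale
print(f"pooled log HR = {pooled:.3f} (se = {pooled_se:.3f}), HR = {hazard_ratio:.3f}")
```

The pooled standard error is larger than each individual standard error of 0.10: the between-imputation term is what accounts for the uncertainty due to missing data.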

(15)

MI: Issue and Guidance for Practice (1)

● How many missing at most?

Do not think in terms of the % missing per covariate, but in terms of the % reduction of the original data set once all variables used in the analyses are considered

Think about the missingness mechanism

● Which variables to include in the imputation model?

Covariates and outcome from the analysis model

In survival models: status, time (t, log(t)) or cumulative baseline hazard function

All predictors of the incomplete variable

The number of variables in the imputation model may be greater than in the analysis model

(16)

MI: Issue and Guidance for Practice (2)

● Should we pay attention to the form of the imputation model?

Yes in theory, but hard to do in practice (linearity? interaction terms?...)

● How many imputations are necessary?

M=5-10 usually considered to be adequate

Other rules exist, based on the fraction of missing data

● Do we have to perform new imputations for each analysis?

The imputed data sets may be used for several analyses

This requires care when building the imputation model (so that it is more ‘congenial’ with the analysis models)
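One classical argument behind small values of M is Rubin's relative efficiency formula, 1 / (1 + λ/M), where λ is the fraction of missing information. The sketch below only evaluates this formula; the value λ = 0.3 is an arbitrary assumption, and the formula concerns the efficiency of the point estimate only, not the reproducibility of standard errors or P-values.

```python
def relative_efficiency(fmi, m):
    """Efficiency of m imputations relative to an infinite number,
    for a fraction of missing information fmi (between 0 and 1)."""
    return 1.0 / (1.0 + fmi / m)

# Assumed fraction of missing information: 0.3.
for m in (3, 5, 10, 20):
    print(f"M = {m:2d}: relative efficiency = {relative_efficiency(0.3, m):.3f}")
```

With λ = 0.3, five imputations already give a relative efficiency above 94%, which is consistent with M=5-10 being usually considered adequate. Rules based on the fraction of missing data typically lead to larger M when more information is missing.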

(17)

MI: Issue and Guidance for Practice (3)

● Is there a particular model building strategy?

Variable selection can be performed on each imputed data set, or on a single stacked data set (after merging) with an appropriate weighting procedure

Model checking could be performed on each imputed data set

Prediction could be obtained using Rubin’s rules

● How can we be confident that the missingness mechanism is ignorable?

Think about your data

Perform sensitivity analysis

(18)

Thank you

Roch Giorgi

email: [email protected]

Challenges in the Estimation of Net SURvival working survival group

French National Research Agency (ANR-12-BSV1-0028)

(19)

References

● Eisemann N, Waldmann A, Katalinic A. Imputation of missing values of tumour stage in population-based cancer registration. BMC Medical Research Methodology 2011;11:129.

● Giorgi R, Belot A, Gaudart J, Launoy G; French Network of Cancer Registries FRANCIM. The performance of multiple imputation for missing covariate data within the context of regression relative survival analysis. Statistics in Medicine 2008;27(30):6310-31.

● Howlader N, Noone AM, Yu M, Cronin KA. Use of imputed population-based cancer registry data as a method of accounting for missing information: application to estrogen receptor status for breast cancer. American Journal of Epidemiology 2012;176(4):347-56.

● Little RJA, Rubin DB. Statistical Analysis with Missing Data (2nd edn). Wiley: New York, 2002.

● Nur U, Shack LG, Rachet B, Carpenter JR, Coleman MP. Modelling relative survival in the presence of incomplete data: a tutorial. International Journal of Epidemiology 2010;39(1):118-28.

● Resseguier N, Giorgi R, Paoletti X. Sensitivity analysis when data are missing not-at-random. Epidemiology 2011;22(2):282.

● White IR, Royston P, Wood AM. Multiple imputation using chained equations: Issues and guidance for practice. Statistics in Medicine 2011;30(4):377-99.