
(1)

Dealing with Missing Data

Roch Giorgi

UMR 912 SESSTIM, Aix Marseille Université / INSERM / IRD, Marseille, France BioSTIC, APHM, Hôpital Timone, Marseille, France

email: [email protected]

(2)

Background (1)

Importance of quality control is well known

Covariate values may be missing for some subjects

 Collected routinely: tumor size, lymph node status, metastasis (mainly)

 Collected for specific studies: estrogen receptor, socioprofessional category,…


Missing values may concern

 Dependent variable: Time/Status in survival analysis

 Independent variable(s): tumor size,…

 Whatever the question (incidence, survival,…)

(3)

Background (2)

Consequences of missing data

 Loss of irrelevant/non informative information

 No impact on estimates

 Loss of relevant/informative information

 Impact depends on the percentage of missing values

 Possible bias in both point estimates and standard errors


 Loss of statistical power

 Univariate/Multivariate analysis?

 Multivariate analysis: the total percentage of subjects with at least one missing value increases with the number of variables
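The effect of combining several incomplete covariates can be illustrated with simple arithmetic. The sketch below assumes, purely for illustration, that each covariate is missing independently with the same probability p, so that the expected proportion of complete cases is (1 - p)^k. The function name and the 10% figure are hypothetical, not taken from the slides.

```python
def complete_case_fraction(p_missing: float, n_covariates: int) -> float:
    """Expected share of subjects with no missing value, assuming each
    covariate is missing independently with probability p_missing."""
    return (1.0 - p_missing) ** n_covariates


# Illustrative assumption: 10% missing values on every covariate
for k in (1, 3, 5, 10):
    frac = complete_case_fraction(0.10, k)
    print(f"{k:2d} covariates: {frac:.1%} complete cases")
```

Under these assumptions, a modest percentage of missing values per covariate can remove a large part of the data set once several covariates enter the same model.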

What can we do?

 Discard all the data set?

 Choose an appropriate method to perform analysis?

(4)

Objectives

Present an overview of

 The types of missing data

 Some methods used to deal with missing data

Provide outline guidelines


(5)

Missing Data Mechanism: Notations

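$Y = (y_{ij})$: (n x k) rectangular data set without missing values

$M = (m_{ij})$: defines the missingness pattern

 $m_{ij} = 1$ if $y_{ij}$ is missing

 $m_{ij} = 0$ if $y_{ij}$ is present

[Figure: three example missingness patterns on a data matrix with rows 1, 2, …, n and columns Y1, Y2, …, Yk; "?" marks a missing value]

 Univariate: missing values occur in a single variable

 Monotone: the variables can be ordered so that, once a value is missing for a subject, all following variables are missing too

 Non-Monotone: missing values are scattered with no such ordering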
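The indicator matrix M and the three patterns can be made concrete with a short sketch. It assumes missing values are coded as NaN; the function names are hypothetical. A pattern is treated as monotone when the columns, sorted from least to most missing, never show an observed value after a missing one within a row.

```python
import numpy as np


def missingness_matrix(Y):
    """Return M with m_ij = 1 if y_ij is missing (NaN) and 0 otherwise."""
    return np.isnan(np.asarray(Y, dtype=float)).astype(int)


def classify_pattern(M):
    """Label a 0/1 missingness matrix as 'complete', 'univariate',
    'monotone' or 'non-monotone' (columns may be reordered)."""
    M = np.asarray(M).astype(int)
    missing_per_column = M.sum(axis=0)
    if missing_per_column.sum() == 0:
        return "complete"
    if (missing_per_column > 0).sum() == 1:
        return "univariate"
    # Sort columns from least to most missing, then check that within each
    # row a missing value is only ever followed by missing values.
    order = np.argsort(missing_per_column, kind="stable")
    if np.all(np.diff(M[:, order], axis=1) >= 0):
        return "monotone"
    return "non-monotone"


nan = np.nan
Y_monotone = [[1.0, 2.0, 3.0], [1.0, 2.0, nan], [1.0, nan, nan]]
Y_scattered = [[nan, 2.0, 3.0], [1.0, nan, 3.0], [1.0, 2.0, 3.0]]

print(missingness_matrix(Y_monotone))
print(classify_pattern(missingness_matrix(Y_monotone)))
print(classify_pattern(missingness_matrix(Y_scattered)))
```

Note that a univariate pattern is a special case of a monotone one; the sketch reports it separately only to mirror the three panels of the figure.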

(6)

Missing Data Mechanism: Classification

Characterized by the conditional distribution of M given Y

Missing Completely At Random (MCAR)

 Missingness mechanism independent of the values of the data Y, whether missing (Ymis) or observed (Yobs)

Missing At Random (MAR)

 Missingness mechanism depends only on Yobs, not on Ymis

Missing Not At Random (MNAR)

 Missingness mechanism depends on Ymis

→ Ignorable (MCAR, MAR) / Non-ignorable (MNAR) missing data
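A small simulation may help to tell the three mechanisms apart. Everything in it is an illustrative assumption: the variables (an always-observed age and a partly missing tumour size), the logistic form of the missingness probability, and all coefficients.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
age = rng.normal(60, 10, n)                         # always observed (part of Yobs)
size = 20 + 0.3 * (age - 60) + rng.normal(0, 5, n)  # tumour size, may be missing


def expit(x):
    return 1.0 / (1.0 + np.exp(-x))


# Probability that `size` is missing under each mechanism
mechanisms = {
    "MCAR": np.full(n, 0.3),                      # constant, unrelated to the data
    "MAR": expit(-1.0 + 0.08 * (age - 60)),       # depends on observed age only
    "MNAR": expit(-1.0 + 0.15 * (size - 20)),     # depends on the missing value itself
}

observed_means = {}
for name, p_missing in mechanisms.items():
    m = rng.random(n) < p_missing                 # m_i = True if size_i is missing
    observed_means[name] = size[~m].mean()
    print(f"{name}: {m.mean():.0%} missing, "
          f"mean of observed sizes = {observed_means[name]:.2f} "
          f"(full-data mean = {size.mean():.2f})")
```

In this sketch the observed mean stays close to the full-data mean only under MCAR. Under MAR the shift can be corrected by conditioning on age; under MNAR it cannot be corrected from the observed data alone.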

(7)

Missing Data Mechanism

What do we learn with that?

 MCAR, MAR: handling missing data in an “appropriate way” does not require modelling the missingness process

 Statistical tests

 H0: MCAR vs MAR? ⇐ Yes

 H0: ignorable vs non-ignorable? ⇐ No

 Classical methods used to handle missing data

 Provide valid statistical inferences with ignorable missing data

 Are not valid with non-ignorable missing data

 Sensitivity analyses under various scenarios of nonresponse when the MNAR hypothesis is suspected (e.g. self-reported characteristics such as psychological disorders, quality of life, income,…)

(8)

Classical Methods

Complete cases

Indicator variable

Multiple imputation

and others…

(9)

Complete Cases Method

Based only on the individuals having no missing values on the covariates included in the analysis

The default method in many statistical software packages!

Pos

 Easy to perform, although this is not necessarily an advantage

 Unbiased results under MCAR hypothesis

Neg

 Reduction of sample size

 Loss of statistical power

 Bias in standard errors

 Inappropriate variable selection (regression analysis)
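Listwise deletion, which is what the complete cases method amounts to, can be sketched as follows. The data are simulated under a hypothetical MCAR mechanism with 10% missing values per covariate; NaN is assumed to code a missing value.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 1000, 4
X = rng.normal(size=(n, k))

# Hypothetical MCAR missingness: each value is missing with probability 0.10
X[rng.random((n, k)) < 0.10] = np.nan

# Keep only the subjects with no missing value on any covariate
complete = ~np.isnan(X).any(axis=1)
X_cc = X[complete]
n_complete = int(complete.sum())

print(f"{n_complete} of {n} subjects retained ({complete.mean():.0%})")
# With a fixed variance, standard errors grow roughly as sqrt(n / n_complete)
print(f"approximate inflation of standard errors: {np.sqrt(n / n_complete):.2f}")
```

Even in this favourable MCAR setting, where the estimates stay unbiased, about a third of the subjects are discarded.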

(10)

Indicator Variable Method

Creation of a missing data indicator variable

Treat missing data as just another category

Pos

 Includes all the observations for the analysis

 No loss of statistical power


 May help to interpret results (similarity with another category)

Neg

 Biased estimates (usually)

 May not help to interpret results (absence of similarity)
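As a minimal sketch of the indicator variable method, the example below recodes missing values of a categorical covariate as an extra "Missing" category and builds the corresponding dummy variables. The T-stage values, the choice of "T1" as reference category, and the use of None to code a missing value are illustrative assumptions.

```python
import numpy as np

# Hypothetical T-stage values; None marks a missing value
t_stage = ["T1", "T2", None, "T3", None, "T1"]

# Treat missing data as just another category
recoded = [value if value is not None else "Missing" for value in t_stage]

# Dummy coding with "T1" as the reference category
levels = ["T2", "T3", "Missing"]
design = np.array([[int(value == level) for level in levels] for value in recoded])

print(recoded)
print(design)
```

All six subjects stay in the analysis, but the coefficient of the "Missing" column mixes subjects from every true stage, which is why the resulting estimates are usually biased.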

(11)

Multiple Imputation (MI): Principle

Step 1: Imputation model

 MAR assumption

 Imputation of the missing values to create “M” completed data sets (1, 2, …, M)

Step 2: Analysis model

 Analysis of each of these “completed” data sets, giving estimates and standard errors e1 (se1), e2 (se2), …, eM (seM)

Step 3

 Combination to produce a single set of estimates with their standard errors, e (se)

(12)

MI: Imputation of the Missing Values

Goal: to account for the relationships between Ymis and Yobs, while taking into account the uncertainty of the imputation

$Y^{*} \sim f(Y_{mis} \mid Y_{obs})$

Imputation model (non exhaustive)

 Continuous variable (e.g.: age at diagnosis): propensity methods, predictive mean matching

 Binary data (e.g.: M-stage): logistic regression

 Categorical data (e.g.: T-stage): polytomous logistic regression, proportional odds
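For the continuous case, predictive mean matching can be sketched as below. This is a simplified version under stated assumptions: the predictors X are fully observed, missing values are coded as NaN, and the regression coefficients are not redrawn for each imputation (a proper multiple imputation procedure would add that step to reflect the uncertainty of the imputation model). The function name and the number of donors k are hypothetical.

```python
import numpy as np


def pmm_impute(y, X, k=5, rng=None):
    """Simplified predictive mean matching for one continuous variable.

    y : values with NaN where missing; X : fully observed predictors.
    Each missing value is replaced by the observed value of a donor drawn
    at random among the k subjects with the closest predicted mean.
    """
    rng = np.random.default_rng() if rng is None else rng
    y = np.asarray(y, dtype=float)
    X1 = np.column_stack([np.ones(len(y)), X])      # add an intercept
    obs = ~np.isnan(y)

    # Linear regression fitted on the subjects with observed y
    beta, *_ = np.linalg.lstsq(X1[obs], y[obs], rcond=None)
    pred = X1 @ beta

    y_imp = y.copy()
    for i in np.flatnonzero(~obs):
        donors = np.argsort(np.abs(pred[obs] - pred[i]))[:k]
        y_imp[i] = rng.choice(y[obs][donors])
    return y_imp


# Hypothetical data: the first 40 values of y are set to missing
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 1.0 + X @ np.array([2.0, -1.0]) + rng.normal(size=200)
y_miss = y.copy()
y_miss[:40] = np.nan

y_imp = pmm_impute(y_miss, X, k=5, rng=rng)
print(np.round(y_imp[:5], 2))
```

Because imputed values are borrowed from observed subjects, they always stay within the range of plausible values, which is one reason the method is popular for skewed or bounded variables.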

(13)

MI: Analyses of the Completed Data Sets

Analysis model: classical methods used to estimate

 Incidence

 Survival

 Effect of prognostic factors

Independent analyses, each applied to one of the “new” completed data sets

(14)

MI: Combined Analysis

Combination of the M estimates into an overall estimate and variance–covariance matrix using Rubin’s rules

Take into account the uncertainty due to missing data
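For a scalar quantity, Rubin's rules can be written as follows. The notation is introduced here for illustration and is not taken from the slides: $\hat{\theta}_m$ is the estimate obtained from the m-th completed data set, $U_m$ its estimated variance, and $M$ the number of imputations (not the missingness matrix).

```latex
\bar{\theta} = \frac{1}{M} \sum_{m=1}^{M} \hat{\theta}_m
\qquad
W = \frac{1}{M} \sum_{m=1}^{M} U_m
\qquad
B = \frac{1}{M-1} \sum_{m=1}^{M} \left( \hat{\theta}_m - \bar{\theta} \right)^2
\qquad
T = W + \left( 1 + \frac{1}{M} \right) B
```

The within-imputation variance $W$ reflects ordinary sampling variability, while the between-imputation variance $B$ carries the extra uncertainty due to the missing data; the pooled standard error is $\sqrt{T}$.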

Statistics that can be combined: mean, proportion, regression coefficient,…

Statistics that may require transformation: odds ratio, hazard ratio, baseline hazard, survival probability,…

Statistics that cannot be combined: P-value, likelihood ratio test statistic,…

Adapted from: White IR, et al. Statistics in Medicine 2009
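As an example of a statistic that requires transformation, hazard ratios are usually pooled on the log scale and back-transformed afterwards. The sketch below applies Rubin's rules to log hazard ratios; the five hazard ratios and standard errors are made-up numbers, the function name is hypothetical, and the confidence interval uses a normal approximation rather than Rubin's t-based degrees of freedom.

```python
import math


def pool_rubin(estimates, variances):
    """Pool M scalar estimates and their variances with Rubin's rules.
    Returns the pooled estimate and its total variance."""
    m = len(estimates)
    q_bar = sum(estimates) / m
    within = sum(variances) / m
    between = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)
    total = within + (1 + 1 / m) * between
    return q_bar, total


# Hypothetical results from M = 5 completed data sets
hazard_ratios = [1.50, 1.62, 1.45, 1.58, 1.53]
se_log_hr = [0.10, 0.11, 0.10, 0.12, 0.11]

# Pool on the log scale, then back-transform
log_hr, total_var = pool_rubin(
    [math.log(hr) for hr in hazard_ratios],
    [se ** 2 for se in se_log_hr],
)
pooled_hr = math.exp(log_hr)
half_width = 1.96 * math.sqrt(total_var)   # normal approximation
ci = (math.exp(log_hr - half_width), math.exp(log_hr + half_width))

print(f"pooled HR = {pooled_hr:.2f}, approximate 95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
```

Pooling on the log scale gives the geometric mean of the hazard ratios, which differs slightly from averaging the hazard ratios directly.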

(15)

MI: Issue and Guidance for Practice (1)

How many missing values at most?

 Do not think in terms of the % of missing values per covariate, but in terms of the % reduction of the original data set once all the variables used in the analyses are considered

 Think about the missingness mechanism

Which variables to include in the imputation model?


 Covariates and outcome from the analysis model

 In a survival model: status, time (t, log(t)) or cumulative baseline hazard function

 All predictors of the incomplete variable

 The number of variables in the imputation model may be greater than in the analysis model
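When the outcome is a survival time, the cumulative hazard has to be computed before it can enter the imputation model. The sketch below uses the Nelson-Aalen estimator, evaluated at each subject's own follow-up time, as a stand-in for the cumulative baseline hazard; that substitution is an assumption made here, and the function name is hypothetical.

```python
import numpy as np


def nelson_aalen(time, status):
    """Nelson-Aalen estimate of the cumulative hazard, evaluated at each
    subject's follow-up time (status: 1 = event, 0 = censored)."""
    time = np.asarray(time, dtype=float)
    status = np.asarray(status, dtype=int)
    event_times = np.unique(time[status == 1])

    # Hazard increment at each event time: events / subjects still at risk
    increments = np.array([
        np.sum((time == t) & (status == 1)) / np.sum(time >= t)
        for t in event_times
    ])
    cum_hazard = np.concatenate([[0.0], np.cumsum(increments)])

    # H(t_i) = sum of the increments at event times <= t_i
    idx = np.searchsorted(event_times, time, side="right")
    return cum_hazard[idx]


time = [1, 2, 3, 4]
status = [1, 0, 1, 1]
H = nelson_aalen(time, status)
print(H)

# Predictors for the imputation model of an incomplete covariate:
# event status and cumulative hazard, alongside the other covariates
imputation_predictors = np.column_stack([status, H])
```

The resulting columns would be added to the other covariates when imputing an incomplete variable such as tumour stage.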

(16)

MI: Issue and Guidance for Practice (2)

Should we pay attention to the form of the imputation model?

 Yes in theory, but hard to do in practice (linearity? interaction terms?…)

How many imputations are necessary?

 M=5-10 usually considered to be adequate

 Other rules exist based on the fraction of missing data
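One way to reason about the number of imputations is Rubin's approximate relative efficiency, (1 + γ/M)^-1, where γ is the fraction of missing information. Whether this is among the rules the slide alludes to is an assumption; the function name is hypothetical.

```python
def relative_efficiency(fmi: float, m: int) -> float:
    """Approximate efficiency of an estimate based on m imputations,
    relative to an infinite number of imputations, for a given
    fraction of missing information (fmi)."""
    return 1.0 / (1.0 + fmi / m)


for fmi in (0.1, 0.3, 0.5):
    row = ", ".join(
        f"M={m}: {relative_efficiency(fmi, m):.3f}" for m in (3, 5, 10, 20)
    )
    print(f"fraction of missing information {fmi:.1f} -> {row}")
```

With a fraction of missing information of 0.3, five imputations already reach an efficiency of about 94%, which is the usual argument for M=5-10. A stricter rule of thumb sometimes cited is to use about as many imputations as the percentage of incomplete cases, mainly to stabilise standard errors and P-values.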

Do we have to perform new imputations for each analysis?

 The imputed data sets may be used for several analyses

 This requires care when building the imputation model (to keep it ‘congenial’ with the analyses)

(17)

MI: Issue and Guidance for Practice (3)

Is there a particular model building strategy?

 Variable selection can be performed on all imputed data sets, or on a single data set (after merging) with an appropriate weighting procedure

 Model checking could be performed on each imputed data set

 Predictions could be obtained using Rubin’s rules

How can we be confident that the missingness mechanism is ignorable?

 Think about your data

 Perform sensitivity analysis
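One simple form of sensitivity analysis is a delta adjustment: values imputed under MAR are shifted by a range of offsets representing departures towards MNAR, and the analysis is repeated for each offset. The sketch below is only an illustration under strong assumptions: the data are simulated, a single imputed data set is used (in practice the shift would be applied to each of the M data sets before pooling), and the offsets are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
y_observed = rng.normal(50, 10, 700)      # hypothetical observed scores
y_imputed_mar = rng.normal(50, 10, 300)   # hypothetical values imputed under MAR

results = {}
for delta in (0, -2, -5, -10):
    # MNAR scenario: non-respondents are assumed to score `delta` units
    # away from what the MAR imputation model predicts
    y_full = np.concatenate([y_observed, y_imputed_mar + delta])
    results[delta] = y_full.mean()
    print(f"delta = {delta:4d}: estimated mean = {results[delta]:.2f}")
```

If the conclusions remain stable over the range of offsets considered plausible, the assumption of an ignorable mechanism is less of a concern.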

(18)

Thank you

Roch Giorgi

email: [email protected]

Challenges in the Estimation of Net SURvival working survival group, French National Research Agency (ANR-12-BSV1-0028)

(19)

References

Eisemann N, Waldmann A, Katalinic A. Imputation of missing values of tumour stage in population-based cancer registration. BMC Medical Research Methodology 2011;11:129.

Giorgi R, Belot A, Gaudart J, Launoy G; French Network of Cancer Registries FRANCIM. The performance of multiple imputation for missing covariate data within the context of regression relative survival analysis. Statistics in Medicine 2008;27(30):6310-31.

Howlader N, Noone AM, Yu M, Cronin KA. Use of imputed population-based cancer registry data as a method of accounting for missing information: application to estrogen receptor status for breast cancer. American Journal of Epidemiology 2012;176(4):347-56.

Little RJA, Rubin DB. Statistical Analysis with Missing Data (2nd edn). Wiley: New York, 2002.

Nur U, Shack LG, Rachet B, Carpenter JR, Coleman MP. Modelling relative survival in the presence of incomplete data: a tutorial. International Journal of Epidemiology 2010;39(1):118-28.

Resseguier N, Giorgi R, Paoletti X. Sensitivity analysis when data are missing not-at-random. Epidemiology 2011;22(2):282.

White IR, Royston P, Wood AM. Multiple imputation using chained equations: Issues and guidance for practice. Statistics in Medicine 2011;30(4):377-99.
