Multiple imputations for missing data in lifecourse studies

(1)

Multiple imputation for missing data in life course studies

Motivating example

Types of missingness and common strategies

Multiple imputation

A suite of MI programs

Bianca De Stavola and Valerie McCormack

(London School of Hygiene and Tropical Medicine)

(2)


Motivating example

Breast cancer aetiology:

Several established risk factors (adult HT, menarche)

New focus on early life and childhood growth

[Diagram: Birth size, Growth, Menarche, Adult HT, Breast cancer]

(3)

MRC 1946 birth cohort (N=2187):

Childhood height measured at:      N    % missing
2 yrs                           1782    18.5
4 yrs                           1944    11.1
7 yrs                           1925    12.0
11 yrs                          1862    14.9
15 yrs                          1689    22.8
ALL                              904    41.3

(4)

Pattern and type of missingness

Data: Y = (y, X₁, …, X_p) = (Y_obs, Y_mis)

[Figure: data matrix with rows 1, 2, 3, 4, …, n and columns y, X₁, X₂, …, X_p]

MCAR: Pr(missing) = not f(Y_obs, Y_mis)

MAR: Pr(missing) = f(Y_obs)

NMAR: Pr(missing) = f(Y_obs, Y_mis)
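The difference between MCAR and MAR can be illustrated with a small simulation. This is a minimal sketch on simulated data (not the cohort data), written in Python rather than Stata; the variable names, the probabilities of missingness and the helper `simulate` are all assumptions made for illustration. It compares the full-data mean of a variable with its complete-case mean under each mechanism.

```python
import random
import statistics


def simulate(mechanism, n=20000, seed=1):
    """Return (full-data mean, complete-case mean) of x1 under a missingness mechanism."""
    rng = random.Random(seed)
    full, observed = [], []
    for _ in range(n):
        x2 = rng.gauss(0, 1)                # covariate that is always observed
        x1 = 2.0 * x2 + rng.gauss(0, 1)     # variable subject to missingness
        if mechanism == "MCAR":
            p_missing = 0.3                 # does not depend on any data
        else:                               # "MAR"
            p_missing = 0.6 if x2 > 0 else 0.1   # depends only on the observed x2
        full.append(x1)
        if rng.random() >= p_missing:
            observed.append(x1)
    return statistics.fmean(full), statistics.fmean(observed)


for mech in ("MCAR", "MAR"):
    full_mean, cc_mean = simulate(mech)
    print(f"{mech}: full-data mean {full_mean:.3f}, complete-case mean {cc_mean:.3f}")
```

Under MCAR the complete-case mean stays close to the full-data mean; under MAR it drifts away, because records with large x2 (and hence large x1) are more often missing.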

(5)

Strategies

1. Analyse only those with complete data

2. Available case analysis

3. Inclusion of a “missing value” category

4. Use methods not requiring complete data

5. Replacing missing values with imputed values

(6)

Strategies

1.

2.

3.

X

confounder    RR_E
Level 1       1.45
Level 2       2.03
NK            1.51
overall       1.75

Biased even when data are MCAR

(Greenland and

(7)

Strategies

1.

2.

3.

4. Use methods not requiring complete data

5. Replacing missing values with imputed values

X

(8)

5 - Imputations

If MAR:

Idea: replace missing values with a “guess”

Analysis: same as with complete data

Two types, many variants:

I. SINGLE IMPUTATION

II. MULTIPLE IMPUTATION

(9)

I - SINGLE IMPUTATION

a) from a regression model:

[Figure: Ht at age 15 (cm) and fitted values plotted against birth weight (g); imputed values marked X]

Replace missing values with predicted values

Not good: does not reproduce the true data variation

(10)

SINGLE IMPUTATION

b) predicted value + random term

[Figure: Ht at age 15 (cm) and fitted values plotted against birth weight (g); imputed values marked X]

Size of random term depends on residual variance of the model

UNSATISFACTORY:

Still pretending the data are observed!
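The contrast between variants a) and b) can be sketched numerically. The following is an illustrative Python sketch on simulated data (the coefficients, sample size and missingness rate are assumptions, not cohort values). It fits a least-squares line to the observed records, imputes the missing heights both ways, and compares the spread of the imputed values with the spread of the true values that were hidden.

```python
import random
import statistics

rng = random.Random(2)

# Simulated (not cohort) data: birth weight in kg, height at age 15 in cm.
bw = [rng.gauss(3.3, 0.5) for _ in range(2000)]
ht = [150.0 + 2.6 * b + rng.gauss(0, 6.0) for b in bw]
missing = [rng.random() < 0.25 for _ in bw]    # MCAR here, for simplicity

# Least-squares fit of height on birth weight among the observed records.
xo = [b for b, m in zip(bw, missing) if not m]
yo = [h for h, m in zip(ht, missing) if not m]
xbar, ybar = statistics.fmean(xo), statistics.fmean(yo)
sxx = sum((x - xbar) ** 2 for x in xo)
slope = sum((x - xbar) * (y - ybar) for x, y in zip(xo, yo)) / sxx
intercept = ybar - slope * xbar
rss = sum((y - intercept - slope * x) ** 2 for x, y in zip(xo, yo))
resid_sd = (rss / (len(xo) - 2)) ** 0.5

xm = [b for b, m in zip(bw, missing) if m]
imputed_a = [intercept + slope * x for x in xm]               # a) predicted value
imputed_b = [p + rng.gauss(0, resid_sd) for p in imputed_a]   # b) predicted + random term
hidden = [h for h, m in zip(ht, missing) if m]                # known only because this is a simulation

print(f"SD of hidden true heights: {statistics.stdev(hidden):.2f}")
print(f"SD of imputed values, a):  {statistics.stdev(imputed_a):.2f}")
print(f"SD of imputed values, b):  {statistics.stdev(imputed_b):.2f}")
```

Variant a) gives imputed values with far too little spread; variant b) restores the spread, but a single completed dataset analysed as if fully observed still ignores the uncertainty in the fitted line and in the imputed values themselves.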

(11)

SINGLE IMPUTATION

c) Hot-deck

Replaces a record that has any missing values with another, complete record selected at random

Not recommended if several records are incomplete
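As a rough illustration of the hot-deck idea described on this slide, the sketch below replaces every incomplete record with a complete record chosen at random. It is written in Python with made-up records; the function name `hot_deck` and the use of `None` to mark a missing value are assumptions for the example, and it is not the `hotdeck.ado` program.

```python
import random


def hot_deck(records, rng):
    """Replace each record containing a missing value (None) by a randomly chosen complete record."""
    donors = [r for r in records if None not in r]
    return [r if None not in r else rng.choice(donors) for r in records]


rng = random.Random(5)
# Made-up records: (birth weight in kg, height at age 15 in cm).
records = [(3.1, 158.0), (3.6, None), (2.9, 155.5), (None, 162.0), (3.4, 160.2)]
completed = hot_deck(records, rng)
print(completed)
```

With many incomplete records the same few donors are reused repeatedly, which is one way to see why the slide advises against the method in that situation.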

(12)

II - MULTIPLE IMPUTATION

Not one but several data sets are created

Each has a different set of random draws to replace the missing values

Separate analyses on each data set

Results summarised

PROBLEM:

generating a ‘proper’ predictive distribution

(13)

More technically…

• MI replacements are simulated draws from a predictive distribution of the missing data:

Y*_mis ~ P(Y_mis | Y_obs, θ*), where θ* ~ P(θ | Y_obs)

• Require a model for the complete data: P(Y | θ)

• Proper, i.e. reflect uncertainty about the missing data and the parameters

(Schafer, Multiple imputation: a primer. Stat Methods in Medical Research

(14)

MI: three steps

A. Imputation of plausible values:

• Missing values replaced by imputed values

• m times

B. Analysis of the imputed datasets

C. Combination

(15)

B. Analysis

Each dataset is analysed in the same way:

e.g.: logistic regression

Save:

• Point estimates of the statistics of interest: log(OR) = Q_(l)

• Their variance matrix: U_(l)

(16)

C. Combination

Take the m sample estimates Q_j and variances U_j

For one parameter:

• Overall estimator: Mean(Q_j)

• Its variance: Mean(U_j) + (1 + 1/m) Var(Q_j)

For k parameters:

• Overall estimators: Mean(Q_j)

• Variance matrix: (1 + r₁) Mean(U_j)
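The one-parameter combination rule can be written out in a few lines. This is a minimal Python sketch, not the `mi_summary.ado` program. The five log(OR) values and their variances are hypothetical numbers invented for the example, Var(Q_j) is taken to be the sample variance with divisor m-1, and the confidence interval uses a normal approximation for simplicity.

```python
import math
import statistics


def combine(estimates, variances):
    """Combine m estimates Q_j and their variances U_j for a single parameter."""
    m = len(estimates)
    q_bar = statistics.fmean(estimates)            # overall estimator: Mean(Q_j)
    u_bar = statistics.fmean(variances)            # within-imputation variance: Mean(U_j)
    between = statistics.variance(estimates)       # Var(Q_j), divisor m - 1
    total = u_bar + (1 + 1 / m) * between          # Mean(U_j) + (1 + 1/m) Var(Q_j)
    return q_bar, total


# Hypothetical log(OR) estimates and variances from m = 5 imputed datasets.
log_or = [0.040, 0.045, 0.043, 0.041, 0.046]
var_log_or = [0.0005, 0.0005, 0.0005, 0.0005, 0.0005]

q, t = combine(log_or, var_log_or)
se = math.sqrt(t)
print(f"combined log(OR) = {q:.4f}, variance = {t:.7f}")
print(f"OR = {math.exp(q):.3f}, approx. 95% CI "
      f"({math.exp(q - 1.96 * se):.3f}, {math.exp(q + 1.96 * se):.3f})")
```

The between-imputation term is what distinguishes this from analysing a single imputed dataset: it grows when the m analyses disagree, so the combined variance reflects the uncertainty about the missing values.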

(17)

A. Imputation

Most difficult part

Say x₁ has missing values. Approaches:

i. Use draws from available observations of x₁ (unconditional draws)

ii. Use draws from regression models of x₁ (conditional draws)

iii. [Hot-deck imputation]

(18)

i) unconditional draws

x_1i ~ N(µ, σ²), i = 1, …, N

only x_11, x_12, …, x_1a observed, for a < N: (x̄_1,obs, S²_obs)

For imputation run l = 1, …, m:

1. Draw σ²_(l) from (a-1) S²_obs / χ²_(a-1)

2. Draw µ_(l) from N(x̄_1,obs, σ²_(l) / a)

3. Draw missing values from N(µ_(l), σ²_(l))

4. New dataset l: observed x_11, x_12, …, x_1a plus imputed in step 3

Not good for MAR: should condition on:

• Factors affecting x₁

• Factors influencing missingness
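The four steps for unconditional draws can be sketched as follows. This is an illustrative Python version on simulated heights, not one of the Stata programs. The function name `impute_unconditional`, the use of `None` for a missing value and the simulated data are assumptions, and the chi-square draw uses the fact that a chi-square variable with k degrees of freedom is a gamma variable with shape k/2 and scale 2.

```python
import random
import statistics


def impute_unconditional(x, rng):
    """Return one completed copy of x, where None marks a missing value."""
    obs = [v for v in x if v is not None]
    a = len(obs)
    xbar = statistics.fmean(obs)
    s2 = statistics.variance(obs)
    # Step 1: sigma^2_(l) drawn from (a-1) S^2_obs / chi^2_(a-1).
    sigma2 = (a - 1) * s2 / rng.gammavariate((a - 1) / 2.0, 2.0)
    # Step 2: mu_(l) drawn from N(xbar_obs, sigma^2_(l) / a).
    mu = rng.gauss(xbar, (sigma2 / a) ** 0.5)
    # Steps 3-4: keep observed values, draw missing ones from N(mu_(l), sigma^2_(l)).
    sd = sigma2 ** 0.5
    return [v if v is not None else rng.gauss(mu, sd) for v in x]


rng = random.Random(3)
# Simulated heights (cm) with about 20% of values missing.
x1 = [rng.gauss(160, 6) if rng.random() > 0.2 else None for _ in range(500)]

m = 5
datasets = [impute_unconditional(x1, rng) for _ in range(m)]
print([round(statistics.fmean(d), 2) for d in datasets])
```

Because σ² and µ are redrawn on every run, the m completed datasets differ in more than the random noise added to each imputed value, which is what makes the imputations proper in the sense used earlier.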

(19)

ii) Simple conditional draws

Assume x_1i ~ N(β₀ + β₁ x_2i, σ²),

x_2i always observed, missingness mechanism depends on x₂, X = [1 x₂]

For imputation run l = 1, …, m:

1. Draw σ²_(l) from (a-2) S²_obs / χ²_(a-2)

2. Draw (β_0(l), β_1(l)) from N((β̂₀, β̂₁), σ²_(l) (X'X)⁻¹)

3. Draw missing values from N(β_0(l) + β_1(l) x_2i, σ²_(l))

4. New dataset: observed plus imputed in step 3
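The conditional-draw steps can be sketched in the same way. This is a hedged Python illustration on simulated data, not `mi_create_reg.ado`; the function name, the coefficients used to simulate the data and the missingness probabilities are assumptions. Step 2 avoids explicit matrix algebra: for a regression with one covariate, drawing the fitted mean at x̄₂ and the slope independently, then solving for the intercept, is equivalent to a draw from N((β̂₀, β̂₁), σ² (X'X)⁻¹), with X built from the records where x₁ is observed.

```python
import random
import statistics


def impute_conditional(x1, x2, rng):
    """Return one completed copy of x1 (None = missing), given a fully observed x2."""
    pairs = [(u, v) for u, v in zip(x1, x2) if u is not None]
    a = len(pairs)
    x1bar = statistics.fmean([u for u, _ in pairs])
    x2bar = statistics.fmean([v for _, v in pairs])
    sxx = sum((v - x2bar) ** 2 for _, v in pairs)
    b1_hat = sum((v - x2bar) * (u - x1bar) for u, v in pairs) / sxx
    b0_hat = x1bar - b1_hat * x2bar
    s2 = sum((u - b0_hat - b1_hat * v) ** 2 for u, v in pairs) / (a - 2)

    # Step 1: sigma^2_(l) drawn from (a-2) S^2_obs / chi^2_(a-2).
    sigma2 = (a - 2) * s2 / rng.gammavariate((a - 2) / 2.0, 2.0)
    # Step 2: draw the fitted mean at x2bar and the slope (independent normals),
    # then recover the intercept; this matches N(beta_hat, sigma^2 (X'X)^-1).
    centre = rng.gauss(x1bar, (sigma2 / a) ** 0.5)
    b1 = rng.gauss(b1_hat, (sigma2 / sxx) ** 0.5)
    b0 = centre - b1 * x2bar
    # Steps 3-4: keep observed values, draw missing ones around the drawn line.
    sd = sigma2 ** 0.5
    return [u if u is not None else rng.gauss(b0 + b1 * v, sd) for u, v in zip(x1, x2)]


rng = random.Random(4)
# Simulated data: birth weight (kg) always observed, height at 15 (cm) missing at random.
bw = [rng.gauss(3.3, 0.5) for _ in range(500)]
ht15 = [150.0 + 2.6 * b + rng.gauss(0, 6.0) for b in bw]
ht15_obs = [h if rng.random() > (0.4 if b < 3.3 else 0.1) else None
            for h, b in zip(ht15, bw)]

datasets = [impute_conditional(ht15_obs, bw, rng) for _ in range(5)]
print([round(statistics.fmean(d), 2) for d in datasets])
```

Here missingness depends only on birth weight, which is always observed and is included in the imputation model, so the setting matches the MAR assumption made on this slide.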

(20)

Example

MRC 1946 birth cohort: 2187 women

• x₁: HT at 15 and breast cancer by age 53: 1689

• MI procedure:

• HT15 = f(birth weight)

• Prob(missing) = f(birth weight, breast cancer)

[Figure: height at age 15 (cm)]

(21)

MI programs:

A. mi_create_reg.ado

draws missing HT15 using results from regression of observed HT15 on (BW, BRCA), m times, to create m imputed datasets

B. mi_logit.ado

runs logistic regression on each imputed dataset and saves the results

C. mi_summary.ado

(22)

Draws for ht at age 15 cond. on BW and BRCA:

(l)            1         2         3         4         5
β_0(l)    149.75    149.91    149.43    149.74    149.83
β_1(l)      2.60      2.60      2.59      2.60      2.61
β_2(l)      1.54      1.40      1.62      1.41      1.52
σ_(l)       6.20      6.34      6.26      6.20      6.03

Original estimates: σ=6.19, β₀=149.79, β₁=2.60, β₂=1.48 (N=1683)

               OBSERVED              MI
OR (95% CI)    1.045 (1.00,1.10)     1.044 (1.00,1.09)

(23)

iii) conditional draws from a random effects model

• Using all childhood growth data y_i (p×1 vector):

p observed values: y_i = Z_i η_i + ε_i

q latent factors: η_i = β X_i + u_i

Explanatory variables: X_i

Loading factors (fcn of observation times): Z_i

i.e. y_i ~ N(Z_i β X_i, Z_i Ψ Z_i' + Σ)

Imputation procedure in similar steps

(24)

For imputation run l = 1, …, m:

1. Draw Σ_(l) from inverse Wishart based on Σ

2. Draw Ψ_(l) from inverse Wishart based on Ψ

3. Draw η_(l) from N(η_pred, Z' Ψ_(l) Z)

4. Draw missing values from N(η_(l), Σ_(l))

5. New dataset: observed plus imputed in step 4

MI program:

(25)

Logistic regression with imputed growth variables

Use conditional draws, with several explanatory factors (including breast cancer)

                               Observed data         Observed and imputed data
                               (N=904, D=33)         (N=2187, D=59)
HEIGHT               Units     OR     95%CI          OR     95%CI
Intercept at 2yrs    cm        1.08   0.71,1.66      1.18   0.87,1.60
Velocity 2-4 yrs     cm/yr     1.02   0.67,1.56      1.14   0.88,1.51
Velocity 4-7 yrs     cm/yr     1.53   1.04,2.24      1.41   1.08,1.85
Velocity 7-11yrs     cm/yr     1.44   0.92,2.25      1.15   0.81,1.62
Velocity 11-15ys     cm/yr     1.23   0.78,1.93      1.30   0.99,1.70
Velocity 15-         cm/yr     1.05   0.70,1.58      0.94   0.71,1.24

(26)

Summary

MI requires great care in creating imputed values

A. mi_create_reg.ado & mi_create_growth.ado

B. mi_logit.ado & mi_ologit.ado

C. mi_summary.ado

Other Stata programs:

impute.ado

regmsng.ado

hotdeck.ado

implogit.ado

Gary King's programs: clarify
