
3.3 Dealing with missing data II: Imputation

3.3.1 Single Imputation Methods

Little & Rubin (2002) classify single imputation into two generic approaches: explicit modelling and implicit modelling.

Explicit modelling

The imputed values are generated from a formal statistical model, so the assumptions behind them are explicit. Here are some explicit modelling examples:

(1) Mean imputation

One of the simplest imputation methods is to replace all the missing values of a numerical variable with the mean of its observed values. There are two mean imputation methods: “unconditional mean imputation” and “conditional mean imputation”.

Unconditional mean imputation: This method simply replaces missing values with the mean of all the observed values for the variable. It is very easy to apply, but it distorts the distribution of the variable: every imputed value sits at the mean, so the variance is understated. Normally, this method is not recommended. Let $\bar{Y}_k$ denote the mean of the observed values of item $k$. For $n$ units, we have:

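$$\bar{Y}_k = \frac{1}{C_k} \sum_{j=1}^{n} R_{jk} Y_{jk} \tag{3.2}$$

where

$$C_k = \sum_{j=1}^{n} R_{jk}, \qquad R_{jk} = \begin{cases} 1 & \text{if item } k \text{ is not missing for unit } j \\ 0 & \text{otherwise} \end{cases}$$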
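As a minimal sketch of equation (3.2), the following Python snippet imputes a single variable with its observed mean. It assumes missing values are coded as NaN in a NumPy array; the function name `unconditional_mean_impute` and the toy data are illustrative only.

```python
import numpy as np

def unconditional_mean_impute(y):
    """Replace NaN entries of a 1-D array with the mean of the observed entries."""
    y = np.asarray(y, dtype=float)
    r = ~np.isnan(y)            # R_jk: True where the item is observed
    c = r.sum()                 # C_k: number of observed values
    if c == 0:
        raise ValueError("no observed values to average")
    y_bar = y[r].sum() / c      # observed mean, as in equation (3.2)
    return np.where(r, y, y_bar), y_bar

# Toy data (assumed for illustration): two of five values are missing.
y = np.array([2.0, np.nan, 4.0, 6.0, np.nan])
y_imputed, y_bar = unconditional_mean_impute(y)

# Sample variance of the observed values and of the completed data.
var_observed = np.var(y[~np.isnan(y)], ddof=1)
var_imputed = np.var(y_imputed, ddof=1)
```

In this toy example the sample variance drops from 4.0 for the observed values to 2.0 for the completed data, which illustrates how the method distorts the distribution.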

Conditional mean imputation: This is an improvement on unconditional mean imputation. It divides the data into several groups or strata based on fully observed variables or auxiliary data. The mean of each variable with missing data is then calculated within each stratum, and each missing value is replaced by the mean of its stratum. Compared to unconditional mean imputation, this method better preserves the distribution of the variable. Let $\bar{Y}_{gk}$ denote the mean of the observed values of item $k$ in group $g$. For $n$ units, we have:

$$\bar{Y}_{gk} = \frac{1}{C_{gk}} \sum_{j=1}^{n} R_{jk} I_{jg} Y_{jk} \tag{3.3}$$

where

$$C_{gk} = \sum_{j=1}^{n} R_{jk} I_{jg}, \qquad R_{jk} = \begin{cases} 1 & \text{if item } k \text{ is not missing for unit } j \\ 0 & \text{otherwise} \end{cases} \qquad I_{jg} = \begin{cases} 1 & \text{if unit } j \text{ is in group } g \\ 0 & \text{otherwise} \end{cases}$$

Then

$$\bar{Y}_k = \sum_{g} W_g \bar{Y}_{gk}, \qquad \text{where } W_g = \frac{n_g}{n}, \quad \sum_{g} W_g = 1$$

and $n_g$ is the number of units in group $g$.

(2) Regression imputation

A statistical model is fitted to the observed data and is then used to predict the missing values. Conditional mean imputation can be considered a special case of regression imputation. Suppose a random variable Y has density $f(Y \mid X; \theta)$ given a random variable X that is fully observed. We can estimate $\theta$ from the complete data $Y_{obs}$ and $X_{obs}$.

Then, each missing Y can be imputed independently as

$$Y_{mis,i} = E(Y_i \mid X_i; \hat{\theta}) \tag{3.4}$$

where $i = 1, \ldots, n$, $n$ is the sample size, and $\hat{\theta} = \hat{\theta}(Y_{obs}, X_{obs})$.
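The sketch below illustrates equation (3.4) under one specific assumption that the text does not impose: a linear model fitted by ordinary least squares, so that the conditional expectation is a linear prediction. The function name `regression_impute` and the toy data are hypothetical. The second example uses a group indicator as the only predictor, which reproduces conditional mean imputation and shows why it is a special case.

```python
import numpy as np

def regression_impute(y, X):
    """Impute NaN entries of y with OLS predictions fitted on the complete cases.

    Assumes X is fully observed and a linear model is adequate.
    """
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        X = X[:, None]
    observed = ~np.isnan(y)
    design = np.column_stack([np.ones(len(y)), X])   # intercept plus predictors
    # theta_hat is estimated from (Y_obs, X_obs) only
    theta_hat, *_ = np.linalg.lstsq(design[observed], y[observed], rcond=None)
    y_imputed = y.copy()
    y_imputed[~observed] = design[~observed] @ theta_hat   # E(Y_i | X_i; theta_hat)
    return y_imputed, theta_hat

# Continuous predictor (toy data, exactly linear: y = 2x)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, np.nan, 8.0, np.nan])
y_reg, theta_hat = regression_impute(y, x)

# Group indicator as the only predictor: predictions equal the group means
group = np.array([0, 0, 0, 1, 1, 1])
y_g = np.array([1.0, 3.0, np.nan, 10.0, np.nan, 14.0])
y_cond, _ = regression_impute(y_g, group)
```

With the group indicator, the missing values are filled with 2 and 12, the observed means of the two groups.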

(3) Stochastic regression imputation

This is like the regression imputation method. The only difference is that we introduce uncertainty into the predicted missing values. The missing Y ($Y_{mis}$) is randomly drawn from the distribution $f(Y \mid X; \hat{\theta})$:

$$Y_{mis,i} \sim f(Y_i \mid X_i; \hat{\theta}) \tag{3.5}$$
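One possible reading of equation (3.5) is sketched below. It assumes a normal linear model, which the text does not specify: the prediction from an ordinary least squares fit is perturbed by a normal draw whose standard deviation is the residual standard error of the complete cases. The function name `stochastic_regression_impute` and the simulated data are illustrative only.

```python
import numpy as np

def stochastic_regression_impute(y, X, rng):
    """Impute NaN entries of y with OLS predictions plus normal residual noise."""
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        X = X[:, None]
    observed = ~np.isnan(y)
    design = np.column_stack([np.ones(len(y)), X])
    theta_hat, *_ = np.linalg.lstsq(design[observed], y[observed], rcond=None)
    residuals = y[observed] - design[observed] @ theta_hat
    dof = observed.sum() - design.shape[1]
    if dof <= 0:
        raise ValueError("too few complete cases to estimate the residual variance")
    sigma_hat = np.sqrt(residuals @ residuals / dof)   # residual standard error
    n_missing = int((~observed).sum())
    y_imputed = y.copy()
    # Draw from the assumed N(prediction, sigma_hat^2) instead of using the mean
    y_imputed[~observed] = (design[~observed] @ theta_hat
                            + rng.normal(0.0, sigma_hat, size=n_missing))
    return y_imputed

# Simulated data (assumed for illustration): y = 1 + 2x + noise
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, size=200)
y_full = 1.0 + 2.0 * x + rng.normal(0.0, 1.0, size=200)
y = y_full.copy()
y[:50] = np.nan                      # make the first 50 values missing
y_imp = stochastic_regression_impute(y, x, rng)
```

Unlike plain regression imputation, the imputed values scatter around the fitted line, so the completed data retain roughly the residual variability of the observed data.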

Implicit modelling

In implicit modelling, the imputation is based on an algorithm. There are no formal statistical models. Here are some implicit modelling examples:

(1) Simple random imputation

This is the simplest imputation method. Each missing value of a variable Y is replaced by a value drawn at random from all the observed values of that variable.
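A minimal sketch of simple random imputation follows. It assumes missing values are coded as NaN and that draws are made with replacement, a detail the text leaves open; the function name `simple_random_impute` is hypothetical.

```python
import numpy as np

def simple_random_impute(y, rng):
    """Fill NaN entries with random draws (with replacement) from the observed values."""
    y = np.asarray(y, dtype=float)
    observed = ~np.isnan(y)
    if not observed.any():
        raise ValueError("no observed values to draw from")
    y_imputed = y.copy()
    n_missing = int((~observed).sum())
    y_imputed[~observed] = rng.choice(y[observed], size=n_missing, replace=True)
    return y_imputed

rng = np.random.default_rng(1)
y = np.array([3.0, np.nan, 5.0, 8.0, np.nan, 1.0])
y_imputed = simple_random_impute(y, rng)
```

Every imputed value is one of the observed values, so the completed data never contain values outside the observed range.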

(2) Hot deck imputation

The basic idea of this method is to replace each missing value with a value drawn from the observed values of “similar” responding units. In other words, for each unit with a missing Y, find a unit with similar values of X in the observed data and take its Y value. The hot deck method can be very complex. Here are some examples of hot deck imputation.

• Sequential hot deck imputation: The imputed value is taken from the unit in the same subgroup that was last read by the computer in a single scan of the data.

• Random hot deck imputation: A donor is randomly chosen from the respondents with information on all missing items. This is just like simple random imputation.

• Adjustment cell hot deck imputation: Adjustment cells are formed from the joint levels of categorical variables that are observed for the units with missing data. A donor is then randomly chosen from the respondents within the same adjustment cell, and its value replaces the missing value. The idea is similar to conditional mean imputation.

• Nearest-Neighbor hot deck Imputation: Define a distance measure between units, and impute the value of a respondent who is “closest” to the person with the missing item, where closeness is defined using the distance function, such as the Mahalanobis distance (Andridge & Little 2010):

$$d(i, j) = (x_i - x_j)^{T} \, \widehat{\mathrm{Var}}(x_i)^{-1} (x_i - x_j)$$
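Two of the hot deck variants above can be sketched in Python as follows. The function names, the NaN coding of missing values and the toy data are assumptions made for illustration. In the nearest-neighbour version, the covariance matrix in the Mahalanobis distance is estimated from all units and is assumed to be non-singular.

```python
import numpy as np

def adjustment_cell_hot_deck(y, cells, rng):
    """Fill each NaN in y with a random donor value from the same adjustment cell."""
    y = np.asarray(y, dtype=float)
    cells = np.asarray(cells)
    y_imputed = y.copy()
    for cell in np.unique(cells):
        in_cell = cells == cell
        donors = y[in_cell & ~np.isnan(y)]
        recipients = np.flatnonzero(in_cell & np.isnan(y))
        # Cells without any donor are left missing
        if len(recipients) and len(donors):
            y_imputed[recipients] = rng.choice(donors, size=len(recipients))
    return y_imputed

def nearest_neighbor_hot_deck(y, X):
    """Fill each NaN in y with the value of the closest donor in Mahalanobis distance."""
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        X = X[:, None]
    cov_inv = np.linalg.inv(np.atleast_2d(np.cov(X, rowvar=False)))
    donors = np.flatnonzero(~np.isnan(y))
    y_imputed = y.copy()
    for i in np.flatnonzero(np.isnan(y)):
        diff = X[donors] - X[i]
        # d(i, j) = (x_i - x_j)^T Var^{-1} (x_i - x_j) for every donor j
        d = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)
        y_imputed[i] = y[donors[np.argmin(d)]]
    return y_imputed

rng = np.random.default_rng(2)
cells = np.array(["a", "a", "b", "b", "b"])
y_cells = adjustment_cell_hot_deck(np.array([1.0, np.nan, 5.0, 7.0, np.nan]), cells, rng)

X = np.array([[1.0, 5.0], [2.0, 3.0], [8.0, 9.0], [9.0, 7.0], [1.2, 4.8]])
y_nn = nearest_neighbor_hot_deck(np.array([10.0, 12.0, 30.0, 32.0, np.nan]), X)
```

In the nearest-neighbour example the last unit is closest to the first unit, so it receives the value 10. Ties in distance are resolved here by taking the first donor found, which is an arbitrary choice.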

(3) Substitution

This method replaces non-responding units with alternative units that were not selected into the sample. It can only be applied during the data collection stage. The method is not recommended, because the substituted responding units may differ systematically from the non-respondents. Although we may end up with a complete dataset, the non-respondents’ information is actually missing (Little & Rubin 2002).

(4) Cold deck imputation

This method imputes missing values from historical records of a particular unit. For example, we might be able to find a value for a unit in the previous period of the same survey and use it to replace the missing value in the current survey.

(5) Imputation based on logical rules (or deductive imputation)

Sometimes we can impute using logical rules. For example, suppose a survey has questions such as “Whether drinking water is free of charge to students during the school day” and “Whether drinking water is available to students through drinking water fountains”. If the answer for free drinking water is “no” and the answer for availability of drinking water fountains is missing, it is reasonable for us to conclude that there are no drinking water fountains in the school. This is also called the deductive imputation method (Kalton 1983).
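The rule in this example could be coded along the following lines. This is only a sketch: the function name, the argument names and the use of `None` for a missing answer are hypothetical, and the rule is applied only when the fountain answer is missing.

```python
def impute_fountain_availability(free_water, fountain_available):
    """Apply the deductive rule from the text; None marks a missing answer."""
    if fountain_available is None and free_water == "no":
        # No free drinking water implies no drinking water fountains
        return "no"
    # Otherwise keep the reported answer (or leave it missing)
    return fountain_available

deduced = impute_fountain_availability("no", None)
still_missing = impute_fountain_availability("yes", None)
reported = impute_fountain_availability("yes", "yes")
```

When free drinking water is reported as “yes”, the rule says nothing about fountains, so the missing answer is left missing for another imputation method to handle.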

3.3.2 Likelihood Based Approaches

In document Imputation on the Food, Nutrition and Environment Surveys 2007 and 2009 data (Page 32-35)