5. Experimental Evaluation of the Method
5.2 Comparing the Predictive Power of Candidate Imputation Methods
5.2.3 Least Distortion Evaluation
In many cases the criteria used to evaluate the results of the imputation process will depend primarily on how the imputed dataset will be used in practice. For example, in some cases it will be of primary importance that the mean value for a particular variable is not badly “damaged” by the imputation process - i.e. the mean value measured prior to imputation should not be significantly different from the mean value measured after imputation. Avoiding “damaging” the data can be particularly important when data mining algorithms are to be executed against the data, as Pyle (1999) explains:
“Even replacing the values at all has its dangers unless it is carefully done so as to cause the least damage to the data. It is every bit as important to avoid adding bias and distortion to the data as it is to make the information that is present available to the mining tool”
Consequently, it was considered important that the proposed imputation evaluation method should measure the distortion caused by the imputation process. Both the distortion of the parameters that describe the relationships between variables (such as correlations and regression coefficients) and the distortion of the parameters that describe each variable (such as the mean and standard deviation) can be measured. This approach has been implemented as part of the software application developed by the author, using the process shown below:
Fig 5.7 - Implementation of the least distortion evaluation process
1. Load the dataset into the imputation software
2. Calculate the BEFORE imputation statistics (the mean and standard deviation for each variable)
3. Impute the missing values (using the imputation method being evaluated)
4. Calculate the AFTER imputation statistics (the mean and standard deviation for each variable)
5. Compare the BEFORE and AFTER statistics (measures the distortion caused by the imputation process; see section 5.1.2 for more details)
The distortions caused by imputation are calculated using the following general formula:
$$\text{Distortion} = \frac{\text{After Imputation Statistic} - \text{Before Imputation Statistic}}{\text{Before Imputation Statistic}}$$
For example, if the mean before imputation is 10 and the mean after imputation is 12, the post-imputation distortion in the mean might be calculated as follows:
$$\text{Distortion} = \frac{12 - 10}{10} = 0.2 = 20\%$$
However, it is important to note that it will usually be impossible to generate imputed values that minimise the distortion of more than one parameter. For example, consider the following:
“For instance, consider the numbers 1, 2, 3, x, 5, where ‘x’ represents a missing value. What number should be plugged in as an unbiased estimate of the missing value? Ideally, a value is needed that will at least do no harm to the existing data. And here is a critical point – what does ‘least harm’ mean exactly? If the mean is to be unbiased, the missing value needs to be 2.75. If the standard deviation is to be unbiased, the missing value needs to be about 4.659”
In such a situation the user of the imputation software would need to decide whether it was more important to minimise the distortion of the mean, or to minimise the distortion of the standard deviation - since it is impossible to do both (as Pyle shows).
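The distortion formula and Pyle's example can be checked with a short calculation. The following is a minimal sketch in Python, not the software application developed by the author. The function names `distortion` and `distortion_report` are hypothetical, and the sample (n - 1) standard deviation is assumed, since that convention appears to reproduce Pyle's figure of about 4.659.

```python
from statistics import mean, stdev


def distortion(before, after):
    """Relative change in a statistic: (after - before) / before."""
    return (after - before) / before


def distortion_report(observed, completed):
    """Distortion of the mean and of the sample standard deviation.

    `observed` holds the known values only (the BEFORE statistics);
    `completed` holds the known values plus the imputed ones (AFTER).
    """
    return {
        "mean": distortion(mean(observed), mean(completed)),
        "stdev": distortion(stdev(observed), stdev(completed)),
    }


# Worked example from the text: the mean moves from 10 to 12.
print(distortion(10, 12))

# Pyle's example: 1, 2, 3, x, 5, where x is the missing value.
observed = [1, 2, 3, 5]
for x in (2.75, 4.659):
    completed = [1, 2, 3, x, 5]
    print(x, distortion_report(observed, completed))
```

Under these assumptions, imputing 2.75 leaves the mean undistorted but reduces the standard deviation by roughly 13%, whereas imputing 4.659 leaves the standard deviation almost unchanged but raises the mean by roughly 14%. No single imputed value removes both distortions.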
Imputation methods usually reduce the variance within imputed variables
In practice, the most common data distortion problem caused by imputation is reduced variance within the imputed variables. For example, regression-based imputation methods almost always reduce the variance, because they use the patterns within the known values to generate regression equations, which are then used to estimate the missing values. Therefore, they must (by their very nature) strengthen the patterns that existed before imputation was performed - e.g. see the discussion of the dataset given in Fig. 5.5.
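The strengthening of existing patterns can be illustrated with a small example. The sketch below assumes a simple linear regression of one variable on one fully observed predictor, and the data values are hypothetical (they are not the dataset of Fig. 5.5). Because every imputed value lies exactly on the fitted line, the imputed points contribute no scatter about that line, and the correlation measured after imputation cannot be weaker than the correlation measured before it.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b * x; returns (a, b)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def correlation(xs, ys):
    """Pearson correlation coefficient."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


# Hypothetical data; None marks a missing y value.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2.1, 3.4, None, 9.0, 8.2, None, 15.1, 14.3]

# Fit the regression equation on the known values only.
known = [(x, y) for x, y in zip(xs, ys) if y is not None]
known_x = [x for x, _ in known]
known_y = [y for _, y in known]
a, b = fit_line(known_x, known_y)

# Replace each missing value with its regression prediction.
completed_y = [a + b * x if y is None else y for x, y in zip(xs, ys)]

corr_before = correlation(known_x, known_y)
corr_after = correlation(xs, completed_y)
print(f"correlation before: {corr_before:.4f}, after: {corr_after:.4f}")
```

Applying the distortion formula to `corr_before` and `corr_after` would give a positive distortion for this dataset, i.e. the relationship between the two variables appears stronger after imputation than the known values alone support.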
The multiple imputation (MI) method (described in section 4.3.3) was devised, at least in part, to solve the problem of variance distortion caused by imputation (Rubin, 1987). The idea of MI is to generate several different estimates for each missing value, using “repeated random draws from the predictive distribution of the missing values” (Little and Rubin, 2002) - i.e. stochastic (usually Bayesian) techniques are employed to generate the estimates. The set of estimates generated for each missing value is then “combined” to form a single estimate (e.g. the mean is usually taken). This process will increase the variance within the imputed values in most cases, and hence the distortion of the variance caused by the imputation process should be less severe.
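The idea of drawing several random estimates for each missing value can be sketched as follows. This is a deliberately simplified illustration and not the MI procedure of section 4.3.3: it assumes a simple linear regression with normally distributed residuals, it treats the fitted regression parameters as fixed (so the uncertainty in those parameters is ignored, which a fully Bayesian MI procedure would not do), and the data, the function names and the choice of m = 5 are all hypothetical.

```python
import random
from statistics import mean, stdev


def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b * x; returns (a, b)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def draw_imputations(xs, ys, m, rng):
    """Return m completed copies of ys.

    Each missing value (None) is replaced by the regression prediction
    plus a random normal residual, so the m copies differ from each other.
    """
    known = [(x, y) for x, y in zip(xs, ys) if y is not None]
    known_x = [x for x, _ in known]
    known_y = [y for _, y in known]
    a, b = fit_line(known_x, known_y)
    # Residual standard deviation with n - 2 degrees of freedom.
    rss = sum((y - (a + b * x)) ** 2 for x, y in known)
    sigma = (rss / (len(known) - 2)) ** 0.5
    return [
        [a + b * x + rng.gauss(0, sigma) if y is None else y
         for x, y in zip(xs, ys)]
        for _ in range(m)
    ]


# Hypothetical data; None marks a missing y value.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2.1, 3.4, None, 9.0, 8.2, None, 15.1, 14.3]

rng = random.Random(0)  # fixed seed so that the run is repeatable
datasets = draw_imputations(xs, ys, m=5, rng=rng)
for i, completed in enumerate(datasets, start=1):
    print(f"draw {i}: stdev = {stdev(completed):.3f}")
```

Because each draw adds residual noise to the regression prediction, the imputed values are scattered about the fitted line rather than lying exactly on it.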
The proponents of MI argue that the MI process allows “statistically valid” (Rubin, 1996a) inferences (such as estimates of the variance) to be drawn when analysing the variables in the imputed dataset. However, some doubt has been cast upon the general truth of this assertion. For example, Binder (1996) points out that this depends on the properties of the dataset and on the nature of the missing data problem. Binder makes particular reference to the related papers on this topic by Fay (1991) and Fay (1992), stating: “Fay (1991, 1992) has described what I consider to be a scientific gem. He presented a simple situation where multiple imputation (MI) is not proper, even though one might expect it to be”. In other words, Fay presents some examples of the MI process in which equation (4.10), given in section 4.3.3, does not yield an unbiased estimate of the variance - i.e. the MI process used is not “proper”, in the sense defined by Rubin (1987).