Chapter 5: Multiple Imputation, Maximum Likelihood and Predictive Mean Matching for
5.5 Missing Data Imputation Methods
5.6.1 Semi-continuous variable in univariate missing data
The purpose of the first study is to investigate the effects of omitting a semi-continuous variable on the standardized bias, RMSE, and coverage rate. Previous studies claim that PMM preserves the data distribution and imputes only non-negative values when the data consist of non-negative values. In contrast, the log-transformation procedure of MI-LOGIT may lead to imputing non-negative values that are far outside the range of observed values and change the distribution. Therefore, PMM is expected to outperform MI-LOGIT and other methods when the data are semi-continuous.
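The property that PMM imputes only plausible, non-negative values comes from its donor mechanism: every imputed value is copied from an observed case. The following is a minimal sketch of that idea, not the implementation used in this chapter. It assumes a single OLS fit on the observed cases and omits the posterior draw of regression coefficients that a full multiple-imputation routine would include. The function name `pmm_impute`, the donor count `k`, and the toy data are hypothetical.

```python
import numpy as np

def pmm_impute(x, y, k=5, rng=None):
    """Impute NaN entries of y by predictive mean matching on x.

    Simplified sketch: one OLS fit on the observed cases, then each
    missing case borrows the observed value of a randomly chosen donor
    among the k observed cases with the closest predicted mean.
    """
    rng = np.random.default_rng(rng)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    obs = ~np.isnan(y)

    # Design matrix with an intercept column; x may be 1-D or 2-D.
    X = np.column_stack([np.ones(len(y)), x])
    beta, *_ = np.linalg.lstsq(X[obs], y[obs], rcond=None)
    pred = X @ beta

    y_obs, pred_obs = y[obs], pred[obs]
    y_imp = y.copy()
    for i in np.flatnonzero(~obs):
        # Positions (within the observed cases) of the k nearest predictions.
        donors = np.argsort(np.abs(pred_obs - pred[i]))[:k]
        y_imp[i] = y_obs[rng.choice(donors)]
    return y_imp

# Toy semi-continuous variable: a large point mass at zero plus a skewed tail.
rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
y_full = np.where(rng.random(n) < 0.8, 0.0, np.exp(1.0 + 0.5 * x))
y = y_full.copy()
y[rng.random(n) < 0.2] = np.nan          # MCAR, roughly 20% missing
y_imp = pmm_impute(x, y, k=5, rng=1)
```

Because each imputed value is drawn from the observed values, the imputations cannot fall below zero or beyond the observed maximum. That same property means values in the right tail that happen to be missing cannot be recovered.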
The evaluations of the estimated coefficients of the variable with missing data can be found in Table 5.2. In general, most of the biases are high and exceed the significance criterion of 40 (shaded areas). All MDT perform worst under the assumption of MAR rather than MNAR. We also found that the biases of all techniques appear to be positively associated with the missing rate, although the relations are not strictly linear. Another interesting finding is that the sensitivity of the biases to the missing rate is affected by sample size: the biases seem less sensitive to changes in the missing rate when the sample size is smaller.
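For readers who want to reproduce the three evaluation criteria, the sketch below computes them from a set of simulation replicates. It assumes a definition of standardized bias that is common in the missing-data simulation literature, 100 × (mean estimate − true value) / empirical standard deviation of the estimates, under which an absolute value above 40 is usually treated as problematic. The exact formulas used for Table 5.2 are not shown in this section, so the function `evaluate`, its arguments, and the normal-theory 95% interval are assumptions for illustration.

```python
import numpy as np

def evaluate(estimates, std_errors, true_value, z=1.96):
    """Summarise simulation replicates of one coefficient.

    estimates  : point estimate from each replicate
    std_errors : estimated standard error from each replicate
    true_value : known population value of the coefficient
    z          : normal quantile for the confidence interval (assumed 95%)
    """
    estimates = np.asarray(estimates, dtype=float)
    std_errors = np.asarray(std_errors, dtype=float)

    # Standardized bias in percent of the empirical standard deviation.
    std_bias = 100.0 * (estimates.mean() - true_value) / estimates.std(ddof=1)

    # Root mean squared error of the estimates around the true value.
    rmse = np.sqrt(np.mean((estimates - true_value) ** 2))

    # Share of replicates whose interval contains the true value.
    lower = estimates - z * std_errors
    upper = estimates + z * std_errors
    coverage = np.mean((lower <= true_value) & (true_value <= upper))

    return {"std_bias": std_bias, "rmse": rmse, "coverage": coverage}

# Hypothetical replicates, for illustration only.
res = evaluate(
    estimates=[1.0, 1.2, 0.8, 1.1, 0.9],
    std_errors=[0.2, 0.2, 0.2, 0.2, 0.2],
    true_value=1.0,
)
```

Under this definition a condition would be shaded in a table like Table 5.2 when `abs(res["std_bias"]) > 40`.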
From a machine learning perspective, when the model complexity is low, the sample size is very small, and collecting a little more data may dramatically reduce the generalization error (GE, indicated by BIAS in this chapter), we are likely to overfit the data. GE is the sum of MSE and variance, which are associated with sample size and missing rate, respectively. The relationship between variance and GE is convex, and the sensitivity between them reaches its maximum at the lowest point of the GE curve. The sensitivity of the biases to changing missing rates noted above can be explained by the tradeoff between MSE and variance.
For most of the conditions with a significant bias, there is a correspondingly greater disruption in coverage. Hence, the coverage probability is also higher in the conditions where the sample size is large and the missing rate is low. In most of the imputations, the coverage rates are very low, below 90% (Table 5.2).
All these results indicate that data quantity is in great demand in a univariate imputation of semi-continuous data with a high proportion of zeros. When conducting imputation for semi-continuous data with fewer than 3,200 cases, it is recommended that the percentage of missing values be no greater than 10%. The demand for data quantity would be slightly lower under the assumption of MCAR.
In Table 5.2, we can see that PMM has lower biases and higher coverage rates than MI-LOGIT and ML when the sample size is large enough. Besides, the RMSEs of PMM are lower than those of MI-LOGIT and ML when the missing rates are lower than a certain percentage.
We also see that MI-LOGIT has lower biases and RMSEs than ML, while their coverage rates are approximately the same. In fact, Collins et al. (2001) have indicated that a likelihood-based analysis (ML) and a Bayesian analysis (MI) produce very similar results when the sample size is large enough. With 16 variables in our regression function, the estimation of an unstructured 16 × 16 covariance matrix should be relatively stable with more than 1,200 cases. This is why the performance differences between MI-LOGIT and ML shown here are so small.
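The stability argument can be made concrete with a short calculation. An unstructured covariance matrix of p variables has p(p + 1)/2 distinct elements, so the arithmetic below shows how many cases are available per free covariance parameter at the 1,200-case threshold mentioned above. The variable names are hypothetical, and the ratio is only a rough indication, not a rule stated in this chapter.

```python
p = 16            # number of variables in the regression function
n_cases = 1200    # sample size threshold mentioned in the text

# Distinct variances and covariances in an unstructured p x p matrix.
n_cov_params = p * (p + 1) // 2

# Rough number of cases available per free covariance parameter.
cases_per_param = n_cases / n_cov_params
```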
On the other hand, we also notice that CCA outperforms all other MDT in many cases, especially when the missing rates are high (≥40%). These findings remind us that modern missing data imputation techniques are not always the best. Sometimes simpler is better.
As illustrated above, the advantage of PMM is that it preserves the distributional shapes of the variables, even for the most extremely skewed semi-continuous ones. Its main drawback is the information lost in the right tail of the distributions due to sampling. In contrast, MI-LOGIT and ML preserve the continuous part of a semi-continuous variable, as the plots clearly show. For semi-continuous data, the point mass is the most influential factor affecting the distribution. The size of the point mass reported in prior studies on semi-continuous data imputation is mostly around 20% to 60% (e.g., Yu et al., 2007). In practice, the point mass of administrative loan book data provided by financial institutions is much higher. In this paper, the proportion of zeros in the raw data is very high (83.5%). Therefore, the information lost in the right tail under PMM may require a greater quantity of data to compensate. This might be the reason why CCA is better than PMM in most conditions here.
What is more, it is also found that the break-even points for performances between PMM and CCA shift downwards as we change the missing mechanism from MCAR to MAR and MNAR. For instance, by comparing the RMSEs of PMM and CCA with 10% missing data, we can see that PMM outperforms CCA when the sample size is 3,200 with MCAR pattern (Table 5.2 Panel 1), 2,700 with MAR pattern (Panel 2), or 1,700 with MNAR pattern (Panel 3).
One possible explanation of these results is that the point masses were slightly reduced during the simulation of the MAR and MNAR mechanisms. For instance, in this paper, MNAR on the variable Arrears is designed to simulate the common situation in which clients with delinquency have a higher probability of being missing from reporting. As a result, more information is preserved by PMM, which leads to better imputations. Nevertheless, we should note that there are other MAR and MNAR simulation methods that have no impact on the distribution of semi-continuous data. In these cases, the performance of PMM might be consistent across different missingness mechanisms.