Chapter 5: Multiple Imputation, Maximum Likelihood and Predictive Mean Matching for
5.4 Data and Missingness Simulation
To compare the performance of the imputation methods and demonstrate their validity on a real microfinance dataset, we use the 2010 administrative loan book data of Cooperativa de Ahorro y Credito Ceibeña Ltda. (COAC Ceibeña). It is a subset of the dataset used in Chapter 3. COAC Ceibeña was founded in 1974 on the initiative of Father Donaldo McMillan and a group of women who gathered at the local Catholic church in La Ceiba, Honduras. It is a credit union offering safe and transparent microfinance products and services to the local community. The raw 2010 COAC Ceibeña data has 24 variables and 8,063 cases. The 11 explanatory variables (Table 5.1) selected in this paper are based on previous studies and expert advice from the microlender's staff, as there is no universally accepted approach to selecting explanatory variables for credit scoring (Dinh & Kleimeier, 2007).
5.4.1 Modifying the population
As a population of 8,063 is relatively large for most microfinance institutions in developing countries, we scale down the population to generalize the administrative loan book data. We begin by handling outliers. Outliers are rare in our data, and there are no signs of correlated outliers. Therefore, the simple winsorizing and trimming of Wainer (1976) are adopted here. All observations of Outstanding Balance under $50 or over $10,000 are replaced by the limits. Arrears is trimmed at $2,000. Age is restricted to the range from 20 to 80. Next, we separate the raw data according to the level of the point mass and generate two populations. Both populations have size N = 3,200, but they differ in the size of the point mass at zero: 83.50% for the data with Arrears and 85.34% for the data with Credit Risk. These two populations are used as the benchmark datasets. Note that estimates such as the mean, median, and variance also change as the size of the point mass changes. The summary statistics of the 11 selected variables in the modified populations are presented in Table 5.1.
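The outlier rules above can be illustrated with a minimal sketch. It rests on two assumptions that the text does not settle: "trimmed at $2,000" is read as an upper cap on Arrears, and cases with Age outside 20 to 80 are dropped rather than clipped. The field names and the records are hypothetical.

```python
def clip(value, lower=None, upper=None):
    """Winsorize a single value: pull it back to the nearest limit."""
    if lower is not None and value < lower:
        return lower
    if upper is not None and value > upper:
        return upper
    return value


def clean_loans(rows):
    """Apply the outlier rules to a list of loan records (dicts)."""
    cleaned = []
    for row in rows:
        # Assumption: cases with Age outside 20..80 are dropped, not clipped.
        if not 20 <= row["age"] <= 80:
            continue
        cleaned.append({
            "balance": clip(row["balance"], lower=50, upper=10_000),
            # Assumption: "trimmed at 2,000" is read as an upper cap.
            "arrears": clip(row["arrears"], upper=2_000),
            "age": row["age"],
        })
    return cleaned


# Hypothetical records, for illustration only.
loans = [
    {"balance": 12.0, "arrears": 0.0, "age": 35},
    {"balance": 25_000.0, "arrears": 3_500.0, "age": 52},
    {"balance": 800.0, "arrears": 150.0, "age": 19},
]
print(clean_loans(loans))
```

If trimming is instead meant to remove cases above the limit, the cap on Arrears would be replaced by a filter like the one used for Age.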
5.4.2 Sampling benchmark datasets and skewness preservation
To investigate the performance of MDT under different sample sizes ($n$), we sample 1,000, 1,700, 2,200, 2,700, and 3,200 cases from the populations. As shown in Table 5.1, the variables of interest ($y$), Arrears, dichotomized Arrears, and Credit Risk, are heavily skewed in practice. In theory, this skewness may severely impair a method's imputation performance. To evaluate the imputation methods on the same distributions across different simulations, we generate 30 benchmark datasets at this stage (5 sample sizes × 3 variables of interest × 2 models with different numbers of missing variables). Each table of empirical results in this chapter has 5 sections, and each section corresponds to a specific benchmark.
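The count of 30 benchmark datasets follows from crossing the three design factors. The short sketch below enumerates that grid. The labels "univariate" and "multivariate" for the two models are an assumption made here for illustration.

```python
from itertools import product

sample_sizes = [1000, 1700, 2200, 2700, 3200]
targets = ["Arrears", "dichotomized Arrears", "Credit Risk"]
# Assumption: the two models are labelled by their missing-data pattern.
models = ["univariate", "multivariate"]

# One benchmark dataset per combination of the three factors.
benchmarks = list(product(sample_sizes, targets, models))
print(len(benchmarks))  # 30
```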
The samples for univariate missing data imputation in this paper are generated by the following process:
1. We start by separating the data into two sections, $S_0$ and $S_+$, based on the zeros and non-zeros of $y$. $S_y$ denotes the section in which $y \in \{0, +\}$.
2. Subsample $Sub_0$ is generated by randomly sampling a certain percentage ($Pct$) of the cases in $S_0$.
3. The generation of $Sub_+$ depends on the data type of $y$:
a. If $y$ is binary, $Sub_+$ is generated by randomly sampling $Pct$ of the cases in $S_+$.
b. If $y$ is ordinal categorical (3 levels), we divide $S_+$ into $S_{+1}$ and $S_{+2}$ based on the values of $y$, randomly sample $Pct$ of the cases in each of $S_{+1}$ and $S_{+2}$, and merge the two subsamples to generate $Sub_+$.
c. If $y$ is semi-continuous, we sort all non-zero cases by the values of $y$, divide $S_+$ into $k$ sections $S_{+1}, S_{+2}, \dots, S_{+k}$ with equal numbers of cases, randomly sample $Pct$ of the cases in each $S_{+i}$, and merge the subsamples to generate $Sub_+$.
4. Finally, we combine $Sub_0$ and $Sub_+$ into a sample ready for missingness simulation.
Sample generation for multivariate missing data is similar to that for univariate missing data. The only difference is that the samples must be divided into more subsamples, because we also need to preserve the skewness of a new continuous variable with missing data ($y'$), such as Loan Maturity. The details of the generation process are as follows:
1. We start by separating the data into two groups of sections, $S_{0,1}, S_{0,2}, \dots, S_{0,j}$ and $S_{+,1}, S_{+,2}, \dots, S_{+,j}$, based on the zeros and non-zeros of $y$ and on the values of $y'$. $S_{y,y'}$ denotes the section in which $y \in \{0, +\}$ and $y' \in \{1, 2, \dots, j\}$.
2. We randomly sample $Pct$ of the cases in each $S_{0,y'}$ and merge the subsamples to generate $Sub_0$.
3. The generation of $Sub_+$ depends on the data type of $y$:
a. If $y$ is binary, we divide $S_+$ into $S_{+,1}, S_{+,2}, \dots, S_{+,j}$ based on the values of $y'$, randomly sample $Pct$ of the cases in each $S_{+,y'}$, and merge the subsamples to generate $Sub_+$.
b. If $y$ is ordinal categorical (3 levels), we divide $S_+$ into $S_{+1,1}, S_{+1,2}, \dots, S_{+1,j}, S_{+2,1}, S_{+2,2}, \dots, S_{+2,j}$ based on the values of $y$ and $y'$, randomly sample $Pct$ of the cases in each $S_{+i,y'}$, and merge the subsamples to generate $Sub_+$.
c. If $y$ is semi-continuous, we divide $S_+$ into $S_{+,1}, S_{+,2}, \dots, S_{+,j}$ based on the values of $y'$, sort the non-zero cases in each $S_{+,y'}$ by the values of $y$, divide each $S_{+,y'}$ into $k$ sections $S_{+1,y'}, S_{+2,y'}, \dots, S_{+k,y'}$ with equal numbers of cases, randomly sample $Pct$ of the cases in each $S_{+i,y'}$, and merge the subsamples to generate $Sub_+$.
4. Finally, we combine $Sub_0$ and $Sub_+$ into a sample ready for missingness simulation.
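As an illustration of how the sampling preserves skewness, the sketch below implements the univariate procedure for a semi-continuous variable: the zeros are sampled as one section, and the sorted non-zero cases are sampled within equal-sized sections. The function names, the number of sections, and the data are hypothetical, and the sampling percentage is passed as a proportion.

```python
import random


def split_equal(cases, k):
    """Split an ordered list into k consecutive sections of near-equal size."""
    size, extra = divmod(len(cases), k)
    sections, start = [], 0
    for i in range(k):
        end = start + size + (1 if i < extra else 0)
        sections.append(cases[start:end])
        start = end
    return sections


def sample_pct(cases, pct, rng):
    """Randomly sample pct (a proportion in [0, 1]) of the cases."""
    return rng.sample(cases, round(pct * len(cases)))


def skew_preserving_sample(y_values, pct, k, seed=0):
    """Sample pct of the zeros and pct of each of k sorted non-zero sections."""
    rng = random.Random(seed)
    zeros = [y for y in y_values if y == 0]
    positives = sorted(y for y in y_values if y != 0)
    sub_zero = sample_pct(zeros, pct, rng)
    sub_plus = []
    for section in split_equal(positives, k):
        sub_plus.extend(sample_pct(section, pct, rng))
    return sub_zero + sub_plus


# Hypothetical semi-continuous variable: 80 zeros and 24 positive values.
y = [0] * 80 + [10 * (i + 1) for i in range(24)]
sample = skew_preserving_sample(y, pct=0.5, k=4)
print(len(sample), sum(1 for v in sample if v == 0))  # 52 40
```

Because every section contributes the same share of its cases, the proportion of zeros and the spread of the non-zero values in the sample stay close to those of the population. The multivariate case would apply the same sampling within each level of the second variable.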
5.4.3 Generating missingness
In terms of the missing data mechanisms, MCAR, MAR, and MNAR are used in our simulations. To investigate the performance of the methods under different missing rates ($R$), the details of the three mechanisms are adjusted to yield an overall rate of missingness at five levels (10%, 20%, 30%, 40%, and 50%). With regard to their function in missing data imputation, all variables can be classified into three types: $X$, which is always observed; $Y$, which is partly observed; and $Z$, which may be observed and is a potential cause of missingness for $Y$. $X$ and $Y$ represent variables that automatically appear in an imputation because they are of research interest. $Z$ represents variables that are not of direct interest but might be included in the model if the researchers consider them beneficial.
To model MCAR, missing values are randomly imposed on $Y$, independently of $X$, $Y$, and $Z$, at the missing rates stated above. This case is straightforward.
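A minimal sketch of the MCAR step, assuming missing values are represented by None and the missing rate is passed as a proportion. The function name and the data are hypothetical.

```python
import random


def impose_mcar(y_values, rate, seed=0):
    """Set a proportion `rate` of the entries to None, chosen uniformly at random."""
    rng = random.Random(seed)
    n_missing = round(rate * len(y_values))
    # Positions are drawn without reference to any variable's values.
    missing_idx = set(rng.sample(range(len(y_values)), n_missing))
    return [None if i in missing_idx else v for i, v in enumerate(y_values)]


# Hypothetical Arrears values, for illustration only.
arrears = [0, 0, 120, 0, 0, 0, 45, 0, 0, 300]
for rate in (0.1, 0.2, 0.3, 0.4, 0.5):
    masked = impose_mcar(arrears, rate)
    print(rate, masked.count(None))
```

Since the positions are drawn uniformly, the chance that a case is missing does not depend on any observed or unobserved value.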
To model MAR, the only requirement is that the missingness of $Y$ is associated with $Z$. The potential relations between $Y$ and $Z$ are countless, and it is impossible to model all of them. Common MAR conditions in the previous literature include: Linear, in which the missingness of $Y$ is linearly related to $Z$; Quadratic, in which the missingness of $Y$ at the extremes of $Z$ differs from that in the middle; and Sinister, in which the missingness of $Y$ is a function of the correlation between $X$ and $Z$. Collins et al. (2001) show that the choice of MAR condition has little effect on the bias of the estimated correlation between $X$ and $Y$. Therefore, we focus only on linear MAR in this paper for simplicity. In administrative loan books and surveys of MFIs, a typical MAR scenario would be that males ($Z$) have a higher probability of nonresponse to the question on Arrears ($Y$) than females. To simulate such a scenario, we impose a linear MAR missing mechanism on Gender ($Z$) following a semi-random sampling process designed as follows:
1. We start by separating the initial sample ($S_{complete}$) into four sections of sizes $C_{+,m}$, $C_{+,f}$, $C_{0,m}$, and $C_{0,f}$, based on the values of $y$ and $z$. $C_{y,z}$ denotes the sample size of the section in which $y \in \{0, +\}$ and $z \in \{female, male\}$.
To simulate MAR missingness, we impose different weights ($W_z \in [0, 1]$, $z \in \{female, male\}$) on the missing rates based on Gender. The gap between $W_m$ and $W_f$ indicates the strength of association between Gender and the missingness of Arrears. When $W_f = W_m$, the missing data are MCAR. $W_f \le W_m$ simulates the scenario in which women have a lower probability of being missing from the datasets. For simplicity, we set $W_m = 1$ and $W_f = 0.9$ in this study. In the next stages, we impose different missing rates on the four sections as follows:
2. Randomly drop $R_{+,m}$ percent of the cases in the section of size $C_{+,m}$: $R_{+,m} = R \cdot C / (4 \cdot C_{+,m}) \cdot W_m \cdot 100$
3. Randomly drop $R_{+,f}$ percent of the cases in the section of size $C_{+,f}$: $R_{+,f} = R \cdot C / (4 \cdot C_{+,f}) \cdot W_f \cdot 100$
4. For the section of size $C_{0,m}$, we sort the data by the values of $y$ and $z$ in ascending order, divide the section into 20 subsections with equal numbers of cases, randomly drop $R_{0,m}$ percent of the cases in each subsection, and then merge all subsections back into the section: $R_{0,m} = R \cdot C / (4 \cdot C_{0,m}) \cdot W_m \cdot 100$
5. For the section of size $C_{0,f}$, we sort the data by the values of $y$ and $z$ in ascending order, divide the section into 20 subsections with equal numbers of cases, randomly drop $R_{0,f}$ percent of the cases in each subsection, and then merge all subsections back into the section: $R_{0,f} = R \cdot C / (4 \cdot C_{0,f}) \cdot W_f \cdot 100$
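The drop rates in steps 2 to 5 share one formula, so they can be computed together. The sketch below does this under two assumptions: the overall rate is passed as a proportion, and the total sample size is taken as the sum of the four section sizes. The section sizes are hypothetical, and the function name is illustrative.

```python
def section_drop_rates(rate, sizes, weights):
    """Percentage of cases to drop in each (y, z) section.

    rate    -- overall target missing rate as a proportion, e.g. 0.2
    sizes   -- dict mapping (y, z) to the size of that section
    weights -- dict mapping z to its weight
    """
    # Assumption: the total sample size is the sum of the section sizes.
    total = sum(sizes.values())
    return {
        (y, z): rate * total / (4 * size) * weights[z] * 100
        for (y, z), size in sizes.items()
    }


# Hypothetical section sizes, for illustration only.
sizes = {("+", "m"): 100, ("+", "f"): 60, ("0", "m"): 500, ("0", "f"): 340}
weights = {"m": 1.0, "f": 0.9}  # the weights used in this study
rates = section_drop_rates(0.2, sizes, weights)
for section, pct in rates.items():
    print(section, round(pct, 2))
```

Each section is asked to give up the same base number of cases, scaled by the gender weight, so small sections lose a much larger share than large ones. If a section is very small, the formula can exceed 100 percent, and a practical implementation would need to guard against that case.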
After merging the processed sections back into a single sample ($S_{incomplete}$), we use $C'$ to denote the sample size of $S_{incomplete}$ and calculate it, with the rates $R_{y,z}$ expressed as proportions, as:
$C' = C_{+,m} \cdot (1 - R_{+,m}) + C_{+,f} \cdot (1 - R_{+,f}) + C_{0,m} \cdot (1 - R_{0,m}) + C_{0,f} \cdot (1 - R_{0,f})$.
When $W_m = 1$ and $W_f = 0.9$, we can infer that $C' < C \cdot (1 - R)$. In order to preserve the joint distributions embedded in $S_{incomplete}$ and increase its sample size to $C \cdot (1 - R)$, we refill the incomplete sample and generate the final sample with MAR missingness for further imputation evaluations as follows:
6. Randomly select $M_{+,m}$ cases from the abandoned data generated in step 2 and merge them back into $S_{incomplete}$. $M_{+,m}$ is calculated as: $M_{+,m} = (C - C') \cdot C_{+,m} \cdot (1 - R_{+,m}) / C'$
7. Randomly select $M_{+,f}$ cases from the abandoned data generated in step 3 and merge them back into $S_{incomplete}$. $M_{+,f}$ is calculated as: $M_{+,f} = (C - C') \cdot C_{+,f} \cdot (1 - R_{+,f}) / C'$
8. Randomly select $M_{0,m}$ cases from the abandoned data generated in step 4 and merge them back into $S_{incomplete}$. $M_{0,m}$ is calculated as: $M_{0,m} = (C - C') \cdot C_{0,m} \cdot (1 - R_{0,m}) / C'$
9. Randomly select $M_{0,f}$ cases from the abandoned data generated in step 5 and merge them back into $S_{incomplete}$. $M_{0,f}$ is calculated as: $M_{0,f} = (C - C') \cdot C_{0,f} \cdot (1 - R_{0,f}) / C'$
Note that the mechanism shown above is MAR only if $Z$ (e.g., Gender) appears in the procedure. If $Z$ is omitted, the mechanism is actually MNAR, and procedures based on an assumption of ignorability may be biased. Again, the potential unobservable variables associated with $Y$ are countless, and we cannot model all of them. Therefore, we consider only the most common form of MNAR in this paper. For microfinance loan books, one example of MNAR would be that clients with non-zero Arrears ($Y$) have a higher probability of nonresponse to a question on Arrears ($Y$) than clients with zero Arrears. To simulate such a scenario, we simply allow $Y$ (e.g., Arrears in this case) to take the place of $Z$ (e.g., Gender) in the MAR mechanism above, forcing it to be MNAR. The generation process for MNAR otherwise follows that of MAR.
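The idea of letting the variable drive its own missingness can be sketched as follows. This is a simplified illustration rather than the full semi-random procedure: it omits the 20 sorted subsections and the refill steps. It assumes that the zero and non-zero groups replace the four sections, so the base number of drops per group is half of the overall target instead of a quarter. The two weights are hypothetical values chosen to mirror the MAR setting.

```python
import random


def impose_mnar(y_values, rate, w_nonzero=1.0, w_zero=0.9, seed=0):
    """Set entries to None with a weight that depends on the value itself.

    Assumption: each group (non-zero, zero) is asked to give up
    rate * total / 2 * weight cases, by analogy with the MAR sections.
    """
    rng = random.Random(seed)
    total = len(y_values)
    groups = {
        True: [i for i, y in enumerate(y_values) if y != 0],
        False: [i for i, y in enumerate(y_values) if y == 0],
    }
    weights = {True: w_nonzero, False: w_zero}  # hypothetical weights
    missing = set()
    for nonzero, idx in groups.items():
        n_drop = min(len(idx), round(rate * total / 2 * weights[nonzero]))
        missing.update(rng.sample(idx, n_drop))
    return [None if i in missing else y for i, y in enumerate(y_values)]


# Hypothetical Arrears values: 80 zeros and 20 non-zero amounts.
arrears = [0] * 80 + [10 * (i + 1) for i in range(20)]
masked = impose_mnar(arrears, rate=0.2)
lost_nonzero = sum(1 for a, m in zip(arrears, masked) if m is None and a != 0)
lost_zero = sum(1 for a, m in zip(arrears, masked) if m is None and a == 0)
print(lost_nonzero, lost_zero)  # 10 9
```

In this example, half of the non-zero cases lose their value but only about 11% of the zero cases do, so the missingness depends on the unobserved value itself.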