Chapter 5: Multiple Imputation, Maximum Likelihood and Predictive Mean Matching for

5.4 Data and Missingness Simulation

To compare the performance of the imputation methods and demonstrate their validity on a real microfinance dataset, we use the 2010 administrative loan book data of Cooperativa de Ahorro y Credito Ceibeña Ltda. (COAC Ceibeña). It is a subset of the dataset used in Chapter 3. COAC Ceibeña was founded in 1974 on the initiative of Father Donaldo McMillan and a group of women who gathered at the local Catholic church in La Ceiba, Honduras. It is a credit union offering safe and transparent microfinance products and services to the local community. The raw 2010 COAC Ceibeña data has 24 variables and 8,063 cases. The 11 explanatory variables (Table 5.1) were selected on the basis of previous studies and expert advice from the microlender's staff, as there is no universally accepted approach to selecting explanatory variables for credit scoring (Dinh & Kleimeier, 2007).

5.4.1 Modifying the population

As a size of 8,063 is relatively large for most microfinance institutions in developing countries, we scale down the population to make the administrative loan book data more generalizable. We begin by handling outliers. Outliers are rare in our data, and there are no signs of correlated outliers. Therefore, the simple winsorizing and trimming of Wainer (1976) are adopted here. All observations of Outstanding Balance under $50 or over $10,000 are replaced by the limits. Arrears is trimmed at $2,000. Age is restricted to the range from 20 to 80. Next, we separate the raw data on the level of the point mass and generate two populations. Both populations have size N = 3,200, but they differ in the size of the point mass at zero: 83.50% for the data with Arrears and 85.34% for the data with Credit Risk. These two populations are used as the benchmark datasets. Note that estimates such as the mean, median, and variance also change as the size of the point mass changes. The summary statistics of the 11 selected variables in the modified populations are presented in Table 5.1.
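The outlier handling described above can be illustrated with a minimal Python sketch. The field names and toy records are hypothetical, not the actual loan book columns. The sketch also assumes that "trimmed at $2,000" means capping Arrears at that value and that the Age restriction is applied by clipping rather than by dropping cases; the text does not settle either point.

```python
def winsorize(values, lower=None, upper=None):
    """Replace values outside [lower, upper] with the nearest limit."""
    out = []
    for v in values:
        if lower is not None and v < lower:
            v = lower
        elif upper is not None and v > upper:
            v = upper
        out.append(v)
    return out


# Toy records; the keys are hypothetical stand-ins for the loan book variables.
loans = [
    {"balance": 12.0, "arrears": 0.0, "age": 19},
    {"balance": 430.0, "arrears": 2500.0, "age": 45},
    {"balance": 15000.0, "arrears": 0.0, "age": 83},
]

# Outstanding Balance: values under 50 or over 10,000 are replaced by the limits.
balance = winsorize([r["balance"] for r in loans], lower=50, upper=10_000)
# Arrears: assumed here to be capped at 2,000.
arrears = winsorize([r["arrears"] for r in loans], upper=2_000)
# Age: assumed here to be clipped to the range 20 to 80.
age = winsorize([r["age"] for r in loans], lower=20, upper=80)
```

If the restriction on Age is instead meant to exclude cases outside the range, the last call would be replaced by a filter on the records.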

5.4.2 Sampling benchmark datasets and skewness preservation

To investigate the performance of MDT under different sample sizes ($N$), we sample 1,000, 1,700, 2,200, 2,700, and 3,200 cases from the populations. As shown in Table 5.1, the variables of interest ($y$), Arrears, dichotomized Arrears, and Credit Risk, are heavily skewed in practice. In theory, this skewness may severely impair a method's imputation performance. To evaluate the imputation methods on the same distributions across different simulations, we generate 30 benchmark datasets at this stage (5 sample sizes × 3 variables of interest × 2 models with different numbers of variables with missing data). Each table of empirical results in this chapter has 5 sections, and each section corresponds to a specific benchmark.

The samples for univariate missing data imputation in this paper are generated by the following process:

1. We start by separating the data into two sections, $S_0$ and $S_+$, based on the zeros and non-zeros of $y$. $S_y$ represents a section in which $y \in \{0, +\}$.

2. Subsample $Sub_0$ is generated by randomly sampling a certain percentage ($Pct$) of $S_0$.

3. The generation of $Sub_+$ depends on the data type of $y$:

a. If $y$ is binary, then $Sub_+$ is generated by randomly sampling $Pct$ of $S_+$.

b. If $y$ is ordinal categorical (3 levels), we divide $S_+$ into $S_{+1}$ and $S_{+2}$ based on the values of $y$, then randomly sample $Pct$ of $S_{+1}$ and of $S_{+2}$, and finally merge the two samples to generate $Sub_+$.

c. If $y$ is semi-continuous, we sort all non-zero cases by the values of $y$, divide $S_+$ into $n$ sections $S_{+1}, S_{+2}, \ldots, S_{+n}$ with equal numbers of cases, randomly sample $Pct$ of each $S_{+k}$, and finally merge the samples from all $S_{+k}$ to generate $Sub_+$.

4. Finally, we combine $Sub_0$ and $Sub_+$ into a sample ready for missingness simulation.

Sample generation for multivariate missing data is similar to that for univariate missing data. The only difference is that the samples must be divided into more subsamples, because we also need to preserve the skewness of a second continuous variable with missing data ($y^*$), such as Loan Maturity. The generation process is as follows:

1. We start by separating the data into two groups of sections, $S_{0,1}, S_{0,2}, \ldots, S_{0,m}$ and $S_{+,1}, S_{+,2}, \ldots, S_{+,m}$, based on the zeros and non-zeros of $y$ and on the values of $y^*$. $S_{y,y^*}$ represents a section in which $y \in \{0, +\}$ and $y^* \in \{1, 2, \ldots, m\}$.

2. We randomly sample $Pct$ of each $S_{0,y^*}$, and then merge the samples from all $S_{0,y^*}$ to generate $Sub_0$.

3. The generation of $Sub_+$ depends on the data type of $y$:

a. If $y$ is binary, we divide $S_+$ into $S_{+,1}, S_{+,2}, \ldots, S_{+,m}$ based on the values of $y^*$, then randomly sample $Pct$ of each $S_{+,y^*}$, and finally merge the samples from all $S_{+,y^*}$ to generate $Sub_+$.

b. If $y$ is ordinal categorical (3 levels), we divide $S_+$ into $S_{+1,1}, S_{+1,2}, \ldots, S_{+1,m}, S_{+2,1}, S_{+2,2}, \ldots, S_{+2,m}$ based on the values of $y$ and $y^*$, then randomly sample $Pct$ of each $S_{+k,y^*}$, and finally merge the samples from all $S_{+k,y^*}$ to generate $Sub_+$.

c. If $y$ is semi-continuous, we divide $S_+$ into $S_{+,1}, S_{+,2}, \ldots, S_{+,m}$ based on the values of $y^*$, then sort all non-zero cases by the values of $y$ within each $S_{+,j}$. Next, we divide each $S_{+,y^*}$ into $n$ sections $S_{+1,y^*}, S_{+2,y^*}, \ldots, S_{+n,y^*}$ with equal numbers of cases, randomly sample $Pct$ of each $S_{+k,y^*}$, and finally merge the samples from all $S_{+k,y^*}$ to generate $Sub_+$.

4. Finally, we combine $Sub_0$ and $Sub_+$ into a sample ready for missingness simulation.
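The sampling procedures above, for the semi-continuous case, can be sketched in Python as follows. This is an illustration under stated assumptions rather than the code used in the study: the function names, the argument `n_sections`, and the toy data are hypothetical, the share sampled from each section is rounded to the nearest integer, and sections are cut so that their sizes differ by at most one case. Passing `y_star=None` gives the univariate procedure; passing the levels of a second variable gives the multivariate one.

```python
import random


def take(cases, pct, rng):
    """Randomly draw round(pct * len(cases)) cases without replacement."""
    return rng.sample(cases, round(pct * len(cases)))


def skew_preserving_sample(y, pct, n_sections, y_star=None, seed=0):
    """Return sorted indices of a subsample that keeps the point mass at
    zero and the shape of the non-zero part of y roughly intact."""
    rng = random.Random(seed)
    # Univariate case: treat all cases as a single level of y_star.
    levels = y_star if y_star is not None else [1] * len(y)
    sub = []
    for level in sorted(set(levels)):
        members = [i for i in range(len(y)) if levels[i] == level]
        zeros = [i for i in members if y[i] == 0]            # S_0 within this level
        nonzero = sorted((i for i in members if y[i] != 0),  # S_+ sorted by y
                         key=lambda i: y[i])
        sub.extend(take(zeros, pct, rng))                    # contributes to Sub_0
        for k in range(n_sections):                          # slices S_{+k}
            lo = k * len(nonzero) // n_sections
            hi = (k + 1) * len(nonzero) // n_sections
            sub.extend(take(nonzero[lo:hi], pct, rng))       # contributes to Sub_+
    return sorted(sub)


# Toy data: 800 zeros and 200 distinct non-zero values.
y = [0.0] * 800 + [float(v) for v in range(1, 201)]
sub_uni = skew_preserving_sample(y, pct=0.5, n_sections=10)

# Multivariate variant with a hypothetical two-level second variable.
y_star = [1, 2] * 500
sub_multi = skew_preserving_sample(y, pct=0.5, n_sections=10, y_star=y_star)
```

In this toy example both calls return 500 cases, 400 of which are zeros, so the 80% point mass of the original data carries over to the subsample.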

5.4.3 Generating missingness

For the missing data mechanisms, MCAR, MAR, and MNAR are used in our simulations. To investigate the performance of the methods under different missing rates ($R$), the details of the three mechanisms are adjusted to yield an overall rate of missingness at five different levels (10%, 20%, 30%, 40%, and 50%). With regard to their roles in missing data imputation, all variables can be classified into three types: $X$, which is always observed; $Y$, which is partly observed; and $Z$, which may be observed and is a potential cause of missingness in $Y$. $X$ and $Y$ represent variables that automatically appear in an imputation because they are of research interest. $Z$ represents variables that are not of direct interest but might be included in the model if the researchers consider them beneficial.

To model MCAR, missing values are imposed on $Y$ at random, independently of $X$, $Y$, and $Z$, at the missing rates stated above.
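As a minimal illustration of the MCAR step, the following Python sketch removes a fixed share of the values of a variable uniformly at random. The function name `impose_mcar` and the toy data are hypothetical, and representing a missing value by `None` is an assumption made for the example.

```python
import random


def impose_mcar(y, rate, seed=0):
    """Return a copy of y with round(rate * len(y)) entries set to None.

    The positions are drawn uniformly at random, without looking at the
    value of y or of any other variable.
    """
    rng = random.Random(seed)
    missing = set(rng.sample(range(len(y)), round(rate * len(y))))
    return [None if i in missing else v for i, v in enumerate(y)]


# Toy variable with a large point mass at zero.
y = [0.0] * 80 + [float(v) for v in range(1, 21)]
for rate in (0.1, 0.2, 0.3, 0.4, 0.5):
    y_mcar = impose_mcar(y, rate)
    print(rate, sum(v is None for v in y_mcar))
```

Because the positions are chosen without reference to any variable, the expected distribution of the observed values is the same as that of the complete data.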

To model MAR, the only requirement is that the missingness of $Y$ is associated with $Z$. The potential relations between $Y$ and $Z$ are countless, and it is impossible to model all of them. Common MAR conditions in the previous literature include: Linear, in which the missingness of $Y$ is linearly related to $Z$; Quadratic, in which the missingness of $Y$ at the extremes of $Z$ differs from that in the middle; and Sinister, in which the missingness of $Y$ is a function of the correlation between $X$ and $Z$. Collins et al. (2001) show that the choice of MAR condition has little effect on the bias of the estimated correlation between $X$ and $Y$. Therefore, we focus only on linear MAR in this paper for simplicity. In administrative loan books and surveys of MFIs, a typical MAR scenario would be that males ($Z$) have a higher probability of nonresponse on the question of Arrears ($Y$) than females. To simulate such a scenario, we impose a linear MAR missing mechanism on $Gender$ ($Z$) following a semi-random sampling process designed as follows:

1. We start by separating the initial sample ($S_{\text{complete}}$) into four sections of sizes $C_{+,f}$, $C_{+,m}$, $C_{0,f}$, and $C_{0,m}$, based on the values of $y$ and $z$. $C_{y,z}$ represents the sample size of a section in which $y \in \{0, +\}$ and $z \in \{\text{female}, \text{male}\}$.

To simulate MAR missingness, we impose different weights ($W_z \in [0, 1]$, $z \in \{\text{female}, \text{male}\}$) on the missing rates based on $Gender$. The gap between $W_f$ and $W_m$ indicates the strength of the association between $Gender$ and the missingness of $Arrears$. When $W_f = W_m$, the missing data is MCAR. $W_f \le W_m$ simulates the scenario in which women have a lower probability of being missing from the datasets. For simplicity, we set $W_f = 1$ and $W_m = 0.9$ in this study. In the next stages, we impose different missing rates on the four sections as follows:

2. Randomly drop $R_{+,f}$ percent of cases in the section with $C_{+,f}$: $R_{+,f} = \frac{R \cdot C}{4 \cdot C_{+,f}} \cdot W_f \cdot 100$

3. Randomly drop $R_{+,m}$ percent of cases in the section with $C_{+,m}$: $R_{+,m} = \frac{R \cdot C}{4 \cdot C_{+,m}} \cdot W_m \cdot 100$

4. For the section with $C_{0,f}$, we sort the data based on the values of $y$ and $z$ in ascending order, divide the section into 20 subsections with equal amounts of data, randomly drop $R_{0,f}$ percent of cases in each subsection, and then merge all subsections back into $C_{0,f}$: $R_{0,f} = \frac{R \cdot C}{4 \cdot C_{0,f}} \cdot W_f \cdot 100$

5. For the section with $C_{0,m}$, we sort the data based on the values of $y$ and $z$ in ascending order, divide the section into 20 subsections with equal amounts of data, randomly drop $R_{0,m}$ percent of cases in each subsection, and then merge all subsections back into $C_{0,m}$: $R_{0,m} = \frac{R \cdot C}{4 \cdot C_{0,m}} \cdot W_m \cdot 100$

After merging the processed $C_{+,f}$, $C_{+,m}$, $C_{0,f}$, and $C_{0,m}$ back into a single sample ($S_{\text{incomplete}}$), we use $C^*$ to denote the sample size of $S_{\text{incomplete}}$, and calculate $C^*$ as:

$C^* = C_{+,f} \cdot (1 - R_{+,f}) + C_{+,m} \cdot (1 - R_{+,m}) + C_{0,f} \cdot (1 - R_{0,f}) + C_{0,m} \cdot (1 - R_{0,m})$.

When $W_f = 1$ and $W_m = 0.9$, we can infer that $C^* < C \cdot (1 - R)$. In order to preserve the joint distributions embedded in $S_{\text{incomplete}}$ and increase its sample size to $C \cdot (1 - R)$, we refill the incomplete sample and generate the final sample with MAR missingness for further imputation evaluations as follows:

6. Randomly select $N_{+,f}$ cases from the abandoned data generated in step 2 and merge them back into $S_{\text{incomplete}}$. $N_{+,f}$ is calculated as: $N_{+,f} = (C - C^*) \cdot C_{+,f} \cdot (1 - R_{+,f}) / C^*$

7. Randomly select $N_{+,m}$ cases from the abandoned data generated in step 3 and merge them back into $S_{\text{incomplete}}$. $N_{+,m}$ is calculated as: $N_{+,m} = (C - C^*) \cdot C_{+,m} \cdot (1 - R_{+,m}) / C^*$

8. Randomly select $N_{0,f}$ cases from the abandoned data generated in step 4 and merge them back into $S_{\text{incomplete}}$. $N_{0,f}$ is calculated as: $N_{0,f} = (C - C^*) \cdot C_{0,f} \cdot (1 - R_{0,f}) / C^*$

9. Randomly select $N_{0,m}$ cases from the abandoned data generated in step 5 and merge them back into $S_{\text{incomplete}}$. $N_{0,m}$ is calculated as: $N_{0,m} = (C - C^*) \cdot C_{0,m} \cdot (1 - R_{0,m}) / C^*$
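Steps 1 to 5 of the MAR procedure can be sketched in Python as follows. This is an illustration under several assumptions, not the implementation used in the study: the function names, the record keys `gender` and `arrears`, and the toy data are hypothetical; the section rates are handled as fractions rather than percentages; "dropping" a case is modelled as setting its Arrears value to `None`; rates are capped at 1 for very small sections; and the subdivision of the zero sections into 20 subsections and the refill steps are omitted.

```python
import random


def section_drop_rates(sizes, overall_rate, weights):
    """Per-section rate R * C / (4 * C_{y,z}) * W_z, expressed as a fraction.

    sizes maps (y_group, z) to the section size C_{y,z}; weights maps z to W_z.
    Capping the rate at 1.0 is an added assumption for very small sections.
    """
    total = sum(sizes.values())  # C, the size of the complete sample
    return {
        key: min(1.0, overall_rate * total / (4 * size) * weights[key[1]])
        for key, size in sizes.items()
    }


def impose_mar(records, overall_rate, weights, seed=0):
    """Set 'arrears' to None in a gender-dependent share of each section."""
    rng = random.Random(seed)
    sections = {}
    for idx, rec in enumerate(records):
        # Section key: ("+" or "0" for non-zero or zero arrears, gender).
        key = ("+" if rec["arrears"] else "0", rec["gender"])
        sections.setdefault(key, []).append(idx)
    rates = section_drop_rates(
        {key: len(members) for key, members in sections.items()},
        overall_rate, weights,
    )
    out = [dict(rec) for rec in records]
    for key, members in sections.items():
        for idx in rng.sample(members, round(rates[key] * len(members))):
            out[idx]["arrears"] = None
    return out, rates


# Toy sample: (gender, arrears, number of cases) for the four sections.
spec = [("f", 0.0, 400), ("m", 0.0, 400), ("f", 120.0, 100), ("m", 120.0, 100)]
records = [{"gender": g, "arrears": a} for g, a, n in spec for _ in range(n)]
incomplete, rates = impose_mar(records, overall_rate=0.2, weights={"f": 1.0, "m": 0.9})
```

The returned `rates` dictionary makes it easy to inspect how the weights $W_f$ and $W_m$ translate into section-specific missing rates for a given sample.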

Note that, in practice, the mechanism shown above is MAR only if $Z$ (e.g., $Gender$) appears in the procedure. If $Z$ is omitted, then the mechanism is actually MNAR, and procedures based on an assumption of ignorability may be biased. Again, the potential unobservable variables associated with $Y$ are countless, and we cannot model all of them. Therefore, we only consider the most common form of MNAR in this paper. For microfinance loan books, one example of MNAR would be that clients with non-zero Arrears ($Y$) have a higher probability of nonresponse to a question about Arrears ($Y$) than clients with zero Arrears. To simulate such a scenario, we simply let $Y$ (e.g., Arrears in this case) take the place of $Z$ (e.g., $Gender$) in the MAR mechanism above, forcing it to be MNAR. The MNAR generation process otherwise follows the MAR process described above.