Chapter 5: Multiple Imputation, Maximum Likelihood and Predictive Mean Matching for

5.4 Data and Missingness Simulation

To compare the performance of the imputation methods and demonstrate their validity on a real microfinance dataset, we use the 2010 administrative loan book data of Cooperativa de Ahorro y Credito Ceibeña Ltda. (COAC Ceibeña), a subset of the dataset used in Chapter 3. COAC Ceibeña was founded in 1974 on the initiative of Father Donaldo McMillan and a group of women who gathered at the local Catholic church in La Ceiba, Honduras. It is a credit union offering safe and transparent microfinance products and services to the local community. The raw 2010 COAC Ceibeña data have 24 variables and 8,063 cases. The 11 explanatory variables (Table 5.1) were selected on the basis of previous studies and expert advice from the microlender's staff, as there is no universally accepted approach to selecting explanatory variables for credit scoring (Dinh & Kleimeier, 2007).

5.4.1 Modifying the population

As a size of 8,063 is relatively large for most microfinance institutions in developing countries, we scale down the population so that the administrative loan book data generalize better. We begin by handling outliers. Outliers are rare in our data, and there are no signs of correlated outliers. Therefore, the simple winsorizing and trimming of Wainer (1976) are adopted here. All observations of Outstanding Balance under $50 or over $10,000 are replaced by the limits. Arrears is trimmed at $2,000. Age is restricted to the range from 20 to 80. Next, we separate the raw data according to the size of the point mass and generate two populations. Both populations have size N = 3,200, but they differ in the size of the point mass at zero: 83.50% for the data with Arrears and 85.34% for the data with Credit Risk. These two populations are used as the benchmark datasets. Note that estimates such as the mean, median, and variance also change as the size of the point mass changes. The summary statistics of the 11 selected variables in the modified populations are presented in Table 5.1.
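The outlier rules above can be sketched in a few lines of Python. This is a minimal illustration rather than the procedure actually used: the record layout and the field names (`balance`, `arrears`, `age`) are hypothetical, and the sketch assumes that "trimmed" and "restricted" mean that offending cases are dropped, whereas winsorized values are replaced by the nearest limit.

```python
def winsorize(value, lower, upper):
    """Replace a value outside [lower, upper] by the nearest limit."""
    return min(max(value, lower), upper)


def clean(loans):
    """Apply the outlier rules to a list of loan records (dicts)."""
    kept = []
    for loan in loans:
        # Assumption: trimming Arrears at 2,000 drops cases above the limit.
        if loan["arrears"] > 2000:
            continue
        # Assumption: restricting Age to 20-80 drops cases outside the range.
        if not 20 <= loan["age"] <= 80:
            continue
        # Outstanding Balance is winsorized to the interval [50, 10000].
        kept.append({**loan, "balance": winsorize(loan["balance"], 50, 10000)})
    return kept


# Hypothetical records, for illustration only.
loans = [
    {"balance": 12.0, "arrears": 0.0, "age": 34},
    {"balance": 25000.0, "arrears": 150.0, "age": 51},
    {"balance": 800.0, "arrears": 3500.0, "age": 45},
    {"balance": 600.0, "arrears": 0.0, "age": 17},
]
cleaned = clean(loans)
print([loan["balance"] for loan in cleaned])
```

If capping rather than dropping was intended for Arrears or Age, the two `continue` branches would be replaced by further calls to `winsorize`.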

5.4.2 Sampling benchmark datasets and skewness preservation

To investigate the performance of MDT under different sample sizes (\(N\)), we sample 1,000, 1,700, 2,200, 2,700, and 3,200 cases from the populations. As shown in Table 5.1, the variables of interest (\(y\)), namely Arrears, dichotomized Arrears, and Credit Risk, are heavily skewed in practice. In theory, such skewness may severely impair a method's imputation performance. To evaluate the imputation methods on the same distributions across different simulations, we generate 30 benchmark datasets at this stage (5 sample sizes × 3 variables of interest × 2 models with different numbers of missing variables). Each table of empirical results in this chapter has 5 sections, and each section corresponds to a specific benchmark.
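The count of 30 benchmarks is simply the size of the full design grid, which can be enumerated as below. The labels `univariate` and `multivariate` for the two models are an assumption based on the sampling procedures described in this section.

```python
from itertools import product

sample_sizes = [1000, 1700, 2200, 2700, 3200]
targets = ["Arrears", "dichotomized Arrears", "Credit Risk"]
models = ["univariate", "multivariate"]  # assumed labels for the two models

# One benchmark dataset per combination of the three design factors.
benchmarks = list(product(sample_sizes, targets, models))
print(len(benchmarks))
```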

The samples for univariate missing data imputation are generated by the following process:

1. We start by separating the data into two sections, \(S_0\) and \(S_+\), based on the zeros and non-zeros of \(y\). \(S_y\) denotes the section in which \(y \in \{0, +\}\).

2. Subsample \(\mathit{Sub}_0\) is generated by randomly sampling a given percentage (\(\mathit{Pct}\)) of \(S_0\).

3. The generation of \(\mathit{Sub}_+\) depends on the data type of \(y\):

a. If \(y\) is binary, then \(\mathit{Sub}_+\) is generated by randomly sampling \(\mathit{Pct}\) of \(S_+\).

b. If \(y\) is ordinal categorical (3 levels), we divide \(S_+\) into \(S_{+1}\) and \(S_{+2}\) based on the values of \(y\), then randomly sample \(\mathit{Pct}\) of \(S_{+1}\) and of \(S_{+2}\), and finally merge the two subsamples to generate \(\mathit{Sub}_+\).

c. If \(y\) is semi-continuous, we sort all non-zero cases by the values of \(y\), next divide \(S_+\) into \(n\) sections \(S_{+1}, S_{+2}, \ldots, S_{+n}\) with equal numbers of cases, then randomly sample \(\mathit{Pct}\) of each \(S_{+k}\), and finally merge the subsamples to generate \(\mathit{Sub}_+\).

4. Finally, \(\mathit{Sub}_0\) and \(\mathit{Sub}_+\) are combined into a sample ready for missingness simulation.

Sample generation for multivariate missing data is similar to that for univariate missing data. The only difference is that the samples must be divided into more subsamples, as we also need to preserve the skewness of a second continuous variable with missing data (\(y^{*}\)), such as Loan Maturity. The details of the generation process are as follows:

1. We start by separating the data into two groups of sections, \(S_{0,1}, S_{0,2}, \ldots, S_{0,m}\) and \(S_{+,1}, S_{+,2}, \ldots, S_{+,m}\), based on the zeros and non-zeros of \(y\) and on the values of \(y^{*}\). \(S_{y,y^{*}}\) denotes the section in which \(y \in \{0, +\}\) and \(y^{*} \in \{1, 2, \ldots, m\}\).

2. Randomly sample \(\mathit{Pct}\) of each \(S_{0,y^{*}}\), and then merge the subsamples to generate \(\mathit{Sub}_0\).

3. The generation of \(\mathit{Sub}_+\) depends on the data type of \(y\):

a. If \(y\) is binary, we divide \(S_+\) into \(S_{+,1}, S_{+,2}, \ldots, S_{+,m}\) based on the values of \(y^{*}\), then randomly sample \(\mathit{Pct}\) of each \(S_{+,y^{*}}\), and finally merge the subsamples to generate \(\mathit{Sub}_+\).

b. If \(y\) is ordinal categorical (3 levels), we divide \(S_+\) into \(S_{+1,1}, S_{+1,2}, \ldots, S_{+1,m}, S_{+2,1}, S_{+2,2}, \ldots, S_{+2,m}\) based on the values of \(y\) and \(y^{*}\), then randomly sample \(\mathit{Pct}\) of each \(S_{+k,y^{*}}\), and finally merge the subsamples to generate \(\mathit{Sub}_+\).

c. If \(y\) is semi-continuous, we divide \(S_+\) into \(S_{+,1}, S_{+,2}, \ldots, S_{+,m}\) based on the values of \(y^{*}\), then sort all non-zero cases by the values of \(y\) within each \(S_{+,j}\), next divide each \(S_{+,y^{*}}\) into \(n\) sections \(S_{+1,y^{*}}, S_{+2,y^{*}}, \ldots, S_{+n,y^{*}}\) with equal numbers of cases, then randomly sample \(\mathit{Pct}\) of each \(S_{+k,y^{*}}\), and finally merge the subsamples to generate \(\mathit{Sub}_+\).

4. Finally, \(\mathit{Sub}_0\) and \(\mathit{Sub}_+\) are combined into a sample ready for missingness simulation.
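As an illustration of the skewness-preserving idea, the sketch below implements the univariate procedure for a semi-continuous variable: the zeros are sampled as one section, and the non-zero cases are sorted, cut into sections of (nearly) equal size, and sampled at the same percentage within each section. The function names, the rounding of section counts, and the example data are assumptions made for this sketch. The multivariate variant would apply the same steps within each level of the second variable.

```python
import random


def sample_fraction(indices, pct, rng):
    """Randomly pick a fraction pct of the indices (count rounded to an integer)."""
    return rng.sample(indices, round(len(indices) * pct))


def skewness_preserving_sample(y, pct, n_sections, seed=0):
    """Return sorted case indices sampled section by section."""
    rng = random.Random(seed)
    zero_idx = [i for i, v in enumerate(y) if v == 0]
    # Non-zero cases, ordered by their value of y.
    pos_idx = sorted((i for i, v in enumerate(y) if v != 0), key=lambda i: y[i])

    chosen = sample_fraction(zero_idx, pct, rng)  # subsample of the zeros
    size, extra = divmod(len(pos_idx), n_sections)
    start = 0
    for k in range(n_sections):
        # Contiguous sections of (nearly) equal size along the sorted values.
        end = start + size + (1 if k < extra else 0)
        chosen += sample_fraction(pos_idx[start:end], pct, rng)
        start = end
    return sorted(chosen)


# Hypothetical data: 160 zeros and 40 distinct positive amounts.
y = [0] * 160 + list(range(1, 41))
sample = skewness_preserving_sample(y, pct=0.5, n_sections=4)
zero_share = sum(1 for i in sample if y[i] == 0) / len(sample)
print(len(sample), zero_share)
```

Because every section contributes the same share of its cases, the proportion of zeros and the spread of the non-zero values in the sample stay close to those of the population, which is the purpose of the sectioning.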

5.4.3 Generating missingness

Three missing data mechanisms are used in our simulations: MCAR, MAR, and MNAR. To investigate the performance of the methods under different missing rates (\(R\)), the details of the three mechanisms are adjusted to yield an overall rate of missingness at five levels (10%, 20%, 30%, 40%, and 50%). With regard to their role in missing data imputation, all variables can be classified into three types: \(X\), which is always observed; \(Y\), which is partly observed; and \(Z\), which may be observed and is a potential cause of missingness in \(Y\). \(X\) and \(Y\) represent variables that automatically appear in an imputation because they are of research interest. \(Z\) represents variables that are not of direct interest but might be included in the model if the researchers consider them beneficial.

To model MCAR, missing values are imposed on \(Y\) at random, independently of \(X\), \(Y\), and \(Z\), at the missing rates stated above. This case is straightforward.
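A minimal sketch of MCAR generation is given below. It blanks a fixed share of the values of a single variable, chosen uniformly at random without looking at any data. The function name and the use of `None` as the missing-value marker are choices made for this illustration.

```python
import random


def impose_mcar(y, rate, seed=0):
    """Set a fraction `rate` of the entries of y to None, chosen uniformly at random."""
    rng = random.Random(seed)
    n_missing = round(len(y) * rate)
    # The positions are drawn without reference to any variable's values.
    missing = set(rng.sample(range(len(y)), n_missing))
    return [None if i in missing else v for i, v in enumerate(y)]


# Hypothetical variable with 100 cases and a 30% missing rate.
y = list(range(100))
y_mcar = impose_mcar(y, rate=0.3)
print(sum(v is None for v in y_mcar))
```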

To model MAR, the only requirement is that the missingness of \(Y\) is associated with \(Z\). The potential relations between \(Y\) and \(Z\) are countless, and it is impossible to model all of them. Common conditions for MAR in the previous literature include: Linear, in which the missingness of \(Y\) is linearly related to \(Z\); Quadratic, in which the missingness of \(Y\) at the extremes of \(Z\) differs from that in the middle; and Sinister, in which the missingness of \(Y\) is a function of the correlation between \(X\) and \(Z\). Collins et al. (2001) show that the choice of MAR condition has little effect on the bias of the estimated correlation between \(X\) and \(Y\). Therefore, we focus only on linear MAR for simplicity. For administrative loan books and surveys of MFIs, a typical MAR scenario would be that males (\(Z\)) have a higher probability of nonresponse on the question of Arrears (\(Y\)) than females. To simulate such a scenario, we impose a linear MAR missing mechanism on Gender (\(Z\)) following a semi-random sampling process designed as follows:

1. We start by separating the initial sample (\(S_{\mathit{complete}}\)) into four sections, \(C_{+,f}\), \(C_{+,m}\), \(C_{0,f}\), and \(C_{0,m}\), based on the values of \(y\) and \(z\). \(C_{y,z}\) represents the sample size of the section in which \(y \in \{0, +\}\) and \(z \in \{\mathit{female}, \mathit{male}\}\).

To simulate MAR missingness, we impose different weights (\(W_z \in [0, 1]\), \(z \in \{\mathit{female}, \mathit{male}\}\)) on the missing rates based on Gender. The gap between \(W_f\) and \(W_m\) indicates the strength of the association between Gender and the missingness of Arrears. When \(W_f = W_m\), the missing data are MCAR. \(W_f \le W_m\) simulates the scenario in which women have a lower probability of being missing from the datasets. For simplicity, we set \(W_f = 1\) and \(W_m = 0.9\) in this study. In the next stages, we impose different missing rates on the four sections as follows:

2. Randomly drop \(R_{+,f}\) percent of the cases in the section with \(C_{+,f}\):
\[ R_{+,f} = \frac{R \cdot C}{4 \cdot C_{+,f}} \cdot W_f \cdot 100 \]

3. Randomly drop \(R_{+,m}\) percent of the cases in the section with \(C_{+,m}\):
\[ R_{+,m} = \frac{R \cdot C}{4 \cdot C_{+,m}} \cdot W_m \cdot 100 \]

4. For the section with \(C_{0,f}\), we sort the data by the values of \(y\) and \(z\) in ascending order, divide the section into 20 subsections with equal amounts of data, randomly drop \(R_{0,f}\) percent of the cases in each subsection, and then merge all subsections back into \(C_{0,f}\):
\[ R_{0,f} = \frac{R \cdot C}{4 \cdot C_{0,f}} \cdot W_f \cdot 100 \]

5. For the section with \(C_{0,m}\), we sort the data by the values of \(y\) and \(z\) in ascending order, divide the section into 20 subsections with equal amounts of data, randomly drop \(R_{0,m}\) percent of the cases in each subsection, and then merge all subsections back into \(C_{0,m}\):
\[ R_{0,m} = \frac{R \cdot C}{4 \cdot C_{0,m}} \cdot W_m \cdot 100 \]

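After merging the processed \(C_{+,f}\), \(C_{+,m}\), \(C_{0,f}\), and \(C_{0,m}\) back into a single sample (\(S_{\mathit{incomplete}}\)), we use \(C^{*}\) to denote the sample size of \(S_{\mathit{incomplete}}\) and calculate \(C^{*}\) as:

\[ C^{*} = C_{+,f} \cdot (1 - R_{+,f}) + C_{+,m} \cdot (1 - R_{+,m}) + C_{0,f} \cdot (1 - R_{0,f}) + C_{0,m} \cdot (1 - R_{0,m}). \]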
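To make the arithmetic concrete, the sketch below evaluates the four section drop rates and the resulting size of the incomplete sample, taking the formulas at face value. Several points are assumptions of this sketch rather than statements from the text: \(C\) is taken to be the size of the complete sample, rates are handled as fractions rather than percentages, the section sizes are invented, and the check that a rate cannot exceed 1 is an added safeguard.

```python
def mar_drop_rates(section_sizes, overall_rate, weights):
    """Drop rate (as a fraction) for each (y group, gender) section."""
    total = sum(section_sizes.values())  # assumed meaning of C
    rates = {}
    for (y_group, gender), size in section_sizes.items():
        rate = overall_rate * total / (4 * size) * weights[gender]
        if rate > 1:
            # Added safeguard: a section cannot lose more cases than it holds.
            raise ValueError(f"section {(y_group, gender)} is too small")
        rates[(y_group, gender)] = rate
    return rates


def retained_size(section_sizes, rates):
    """Size of the incomplete sample: sum of C_s * (1 - R_s) over the sections."""
    return sum(size * (1 - rates[key]) for key, size in section_sizes.items())


# Hypothetical section sizes; the weights are those used in the text.
sizes = {("+", "f"): 300, ("+", "m"): 200, ("0", "f"): 900, ("0", "m"): 600}
weights = {"f": 1.0, "m": 0.9}
rates = mar_drop_rates(sizes, overall_rate=0.2, weights=weights)
c_star = retained_size(sizes, rates)
print(rates, c_star)
```

Note that each section loses the same base number of cases, \(R \cdot C / 4\), scaled by its weight, so small sections face much higher drop rates than large ones.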

When \(W_f = 1\) and \(W_m = 0.9\), we can infer that \(C^{*} < C \cdot (1 - R)\). To preserve the joint distributions embedded in \(S_{\mathit{incomplete}}\) and increase its sample size to \(C \cdot (1 - R)\), we refill the incomplete sample and generate the final sample with MAR missingness for further imputation evaluations as follows:

6. Randomly select \(N_{+,f}\) cases from the abandoned data generated in step 2 and merge them back into \(S_{\mathit{incomplete}}\). \(N_{+,f}\) is calculated as:
\[ N_{+,f} = \frac{(C - C^{*}) \cdot C_{+,f} \cdot (1 - R_{+,f})}{C^{*}} \]

7. Randomly select \(N_{+,m}\) cases from the abandoned data generated in step 3 and merge them back into \(S_{\mathit{incomplete}}\). \(N_{+,m}\) is calculated as:
\[ N_{+,m} = \frac{(C - C^{*}) \cdot C_{+,m} \cdot (1 - R_{+,m})}{C^{*}} \]

8. Randomly select \(N_{0,f}\) cases from the abandoned data generated in step 4 and merge them back into \(S_{\mathit{incomplete}}\). \(N_{0,f}\) is calculated as:
\[ N_{0,f} = \frac{(C - C^{*}) \cdot C_{0,f} \cdot (1 - R_{0,f})}{C^{*}} \]

9. Randomly select \(N_{0,m}\) cases from the abandoned data generated in step 5 and merge them back into \(S_{\mathit{incomplete}}\). \(N_{0,m}\) is calculated as:
\[ N_{0,m} = \frac{(C - C^{*}) \cdot C_{0,m} \cdot (1 - R_{0,m})}{C^{*}} \]

Note that the mechanism shown above is MAR only if \(Z\) (e.g., Gender) appears in the procedure. If \(Z\) is omitted, then the mechanism is actually MNAR, and procedures based on an assumption of ignorability may be biased. Again, the potential unobservable variables associated with \(Y\) are countless, and we cannot model all of them. Therefore, we consider only the most common form of MNAR. For microfinance loan books, one example of MNAR would be that clients with non-zero Arrears (\(Y\)) have a higher probability of nonresponse to a question on Arrears than clients with zero Arrears. To simulate such a scenario, we simply let \(Y\) (Arrears in this case) take the place of \(Z\) (e.g., Gender) in the MAR mechanism above, forcing it to be MNAR. The MNAR generation process otherwise follows the MAR process described above.
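The essential feature of MNAR, that the chance of a value being missing depends on the value itself, can be illustrated with the simplified sketch below. It is not the sectioned procedure described above: it blanks a fixed share of the zero cases and a larger share of the non-zero cases, and the two rates (`rate_zero`, `rate_nonzero`) are hypothetical parameters chosen only for the example.

```python
import random


def impose_mnar(y, rate_zero, rate_nonzero, seed=0):
    """Blank values of y at rates that depend on whether y itself is zero."""
    rng = random.Random(seed)
    zero_idx = [i for i, v in enumerate(y) if v == 0]
    nonzero_idx = [i for i, v in enumerate(y) if v != 0]
    # Missingness is driven by the (later unobserved) value of y.
    missing = set(rng.sample(zero_idx, round(len(zero_idx) * rate_zero)))
    missing |= set(rng.sample(nonzero_idx, round(len(nonzero_idx) * rate_nonzero)))
    return [None if i in missing else v for i, v in enumerate(y)]


# Hypothetical Arrears variable: 80 clients with zero arrears, 20 with arrears.
arrears = [0] * 80 + [100] * 20
arrears_mnar = impose_mnar(arrears, rate_zero=0.2, rate_nonzero=0.5)
print(sum(v is None for v in arrears_mnar))
```

Since the observed data alone cannot reveal that non-zero values were more likely to be removed, methods that assume ignorability have no information with which to correct for it.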

In document Essays on microfinance repayment behaviour: an evaluation in developing countries (Page 113-118)