9.3 Single imputation methods for categorical data
9.3.1 Mode imputation
Mode imputation is the categorical counterpart of mean imputation. The mean or median cannot be computed for a categorical variable, so instead of replacing missing values with the mean or median, we impute them with the category that contains the most observed individuals, that is, the mode (Ramirez et al. 2011). In fact, most papers in the literature treat mean, median and mode imputation as the same imputation method. Why? Consider what mean/median imputation actually does. The mean or median of a numerical variable is the best guess of the centre of its distribution. For a symmetric unimodal distribution, most data points are concentrated around the mean and median, so it makes sense to replace missing values with the mean or the median: the missing values are more likely to lie close to the centre than far away from it. The underlying idea is therefore to replace missing values with the values that are most likely to be drawn from the distribution. For categorical data, the category with the highest frequency is the one most likely to be drawn, so it makes sense to replace missing values with that category. Hence mode imputation and mean/median imputation share the same motivation: selecting the most likely value of a distribution. Categorical missing data obviously cannot use mean or median imputation, but mode imputation can also be applied to continuous numerical variables (Torgo 2003), by first transforming the variable into a categorical one, grouping the numerical values into ranges.
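The idea can be sketched in a few lines. This is a minimal illustration in Python, not the R code of Appendix E; the function names `mode_impute` and `bin_value`, the toy data, and the bin edges are all assumptions made for the example. Missing entries are represented by `None`.

```python
from collections import Counter

def mode_impute(values):
    """Replace None entries with the most frequent observed category.

    Ties are resolved by first appearance, which is an arbitrary choice.
    """
    observed = [v for v in values if v is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is None else v for v in values]

def bin_value(x, edges):
    """Map a number to a range label such as '[20, 40)'; None stays None."""
    if x is None:
        return None
    for lo, hi in zip(edges[:-1], edges[1:]):
        if lo <= x < hi:
            return f"[{lo}, {hi})"
    raise ValueError(f"{x} lies outside the bin edges")

# Categorical variable: impute with the most frequent category.
qualification = ["School", "Degree", None, "School", "Vocational", None]
imputed_qualification = mode_impute(qualification)

# Continuous variable: group into ranges first, then impute the modal range.
ages = [23.5, 37.0, None, 31.2, 58.9, 35.4]
age_groups = [bin_value(a, edges=[0, 20, 40, 60]) for a in ages]
imputed_age_groups = mode_impute(age_groups)
```

Note that the binned variant imputes a range, not a number; turning the range back into a numerical value would need a further rule that is not specified here.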
As with mean imputation, which we discussed in Chapter 3, Section 3.3.1, there is unconditional and conditional mode imputation. Suppose $Y$ is a categorical variable with $n$ observations, and $Y = (Y_{obs}, Y_{mis})$, where $Y_{obs}$ are the observed values and $Y_{mis}$ are the missing values. There are $r$ observed responses and $n - r$ missing observations. If $Y$ has $K$ categories, with category values $(C_1, C_2, \ldots, C_K)$, then unconditional mode imputation is:
$$
Y_{mis,i} = C_k \quad \text{if } k = \arg\max_{k} \sum_{j=1}^{r} q_{k,j} \tag{9.1}
$$

where $i = (1, \ldots, n)$, $j = (1, \ldots, r)$, and

$$
q_{k,j} = \begin{cases} 1 & \text{if } Y_{obs,j} = C_k \\ 0 & \text{otherwise} \end{cases}
$$

For conditional mode imputation, suppose we divide $Y$ into $g$ groups conditioning on some independent variables $X$. Then
$$
Y_{g,mis,i} = C_{k_g} \quad \text{if } k_g = \arg\max_{k_g} \sum_{j=1}^{r} q_{k,j} I_{i,g} \tag{9.2}
$$

where $i = (1, \ldots, n)$, $j = (1, \ldots, r)$,

$$
q_{k,j} = \begin{cases} 1 & \text{if } Y_{obs,j} = C_k \\ 0 & \text{otherwise} \end{cases}
\qquad \text{and} \qquad
I_{i,g} = \begin{cases} 1 & \text{if unit } i \text{ is in group } g \\ 0 & \text{otherwise} \end{cases}
$$

From Eq (9.1) and Eq (9.2), we see that, unlike mean/median imputation, where each variable has only one mean or median, a variable can have more than one mode, that is, more than one category with the highest frequency. How do we impute missing values if there is more than one mode? There are three methods. The simplest is to randomly select one of the categories tied for the highest frequency and use it as the replacement value for all the missing data. Obviously, this method alters the original distribution of the proportion of each category¹: a distribution with several categories tied for the highest frequency becomes one with a single highest-frequency category. If the proportion of missing data is large, this method makes the estimates very different from those of the unimputed data. A slightly better method is to divide the missing values equally among all the highest-frequency categories. For example, if two categories share the highest frequency, half of the missing values are imputed with one category and the other half with the other. This ensures that the imputed data have the same, or a very similar, distribution of proportions in each category as the unimputed data. But do we really want the imputed data to have exactly the same distribution of proportions as the original unimputed data, or do we want some variation from it? This leads to the last method, which randomly selects one of the tied highest-frequency categories separately for each missing value. It is still highly likely that the imputed data will have a different distribution of proportions in each category, e.g. only one category with the highest frequency instead of several tied categories. However, because the selection is random for each missing value and each tied category has the same chance of being selected each time, the imputed data are likely to have a distribution of proportions roughly similar to that of the original unimputed data (e.g. the category with the highest frequency is only slightly larger than the category with the next highest frequency), although it may be unimodal rather than multimodal. It is hard to say which of the last two methods is better, because in practice different researchers may have different needs for, or objections to, what the distribution of proportions in each category of the imputed data ought to look like.
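The three ways of handling tied modes can be compared side by side. The following is a minimal Python sketch, not the R code of Appendix E; the function names and the toy data (two categories, "A" and "B", tied for the highest frequency) are assumptions made for illustration, and missing values are represented by `None`.

```python
import random
from collections import Counter

def tied_modes(values):
    """Return all categories that share the highest observed frequency."""
    counts = Counter(v for v in values if v is not None)
    top = max(counts.values())
    return sorted(c for c, k in counts.items() if k == top)

def impute_single_random_mode(values, rng):
    """Method 1: pick one tied mode at random and use it for every gap."""
    choice = rng.choice(tied_modes(values))
    return [choice if v is None else v for v in values]

def impute_equal_split(values):
    """Method 2: cycle through the tied modes so each fills an equal share.

    If the number of gaps is not a multiple of the number of tied modes,
    the shares differ by at most one.
    """
    modes = tied_modes(values)
    out, k = [], 0
    for v in values:
        if v is None:
            out.append(modes[k % len(modes)])
            k += 1
        else:
            out.append(v)
    return out

def impute_random_per_value(values, rng):
    """Method 3: draw a tied mode independently for each missing value."""
    modes = tied_modes(values)
    return [rng.choice(modes) if v is None else v for v in values]

rng = random.Random(0)  # fixed seed only to make the sketch repeatable
data = ["A", "A", "B", "B", "C", None, None, None, None]
m1 = impute_single_random_mode(data, rng)
m2 = impute_equal_split(data)
m3 = impute_random_per_value(data, rng)
```

With this toy input, the first method always ends with a single dominant category, the second keeps "A" and "B" tied, and the third may or may not keep them tied depending on the random draws.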
As in Chapter 4, we applied the unconditional and conditional mode imputation methods to the “Qualification” variable of the SURF data. Qualification is a categorical variable with four levels: “None”, “School”, “Vocational”, and “Degree”. First, we applied the MCAR mechanism to the Qualification variable, creating 50 MCAR missing values out of 200 observations. Then the unconditional and conditional mode imputation methods were used to impute the missing qualification data. The whole process was repeated 1000 times. The following steps describe the exact process:
¹ The distribution of the proportion of each category is not the distribution of the data. After any imputation, the distribution of the data will be different, but for a categorical variable, the distribution of the proportion of each category can be the same.
Recipe: Unconditional and conditional mode imputation
Step 1: create 50 MCAR missing observations for the Qualification variable.
Step 2: apply unconditional and conditional mode imputation to impute the missing qualification data. For conditional mode imputation, the condition is on “Gender” and “Marital status”.
Step 3: repeat Steps 1 and 2 1000 times, which produces 1000 imputed qualification variables.
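The recipe can be sketched as follows. This is a Python illustration, not the R code of Appendix E, and it does not use the SURF data: the 200 records are generated by a hypothetical rule (`draw_qualification`) chosen only so that group modes can differ. All function names, the grouping labels, the random tie-breaking, the fallback for empty groups, and the reduced number of repetitions (200 instead of 1000) are assumptions of this sketch.

```python
import random
from collections import Counter

LEVELS = ["None", "School", "Vocational", "Degree"]

def mode_of(values, rng):
    """Most frequent category; ties are broken at random."""
    counts = Counter(values)
    top = max(counts.values())
    return rng.choice(sorted(c for c, k in counts.items() if k == top))

def impute_unconditional(qual, rng):
    """Fill every gap with the overall mode of the observed values."""
    mode = mode_of([q for q in qual if q is not None], rng)
    return [mode if q is None else q for q in qual]

def impute_conditional(qual, groups, rng):
    """Fill each gap with the mode of the observed values in its group."""
    by_group = {}
    for q, g in zip(qual, groups):
        if q is not None:
            by_group.setdefault(g, []).append(q)
    modes = {g: mode_of(v, rng) for g, v in by_group.items()}
    # Assumption: a group with no observed values uses the overall mode.
    fallback = mode_of([q for q in qual if q is not None], rng)
    return [modes.get(g, fallback) if q is None else q
            for q, g in zip(qual, groups)]

def proportions(qual):
    counts = Counter(qual)
    return {lvl: counts[lvl] / len(qual) for lvl in LEVELS}

def draw_qualification(group, rng):
    """Hypothetical data-generating rule, NOT the SURF data."""
    gender, marital = group
    weights = [3, 7, 7, 3]  # None, School, Vocational, Degree
    if marital == "never":
        weights[1] += 4
    if gender == "M":
        weights[2] += 3
    return rng.choices(LEVELS, weights=weights)[0]

rng = random.Random(2024)
n, n_missing, n_reps = 200, 50, 200  # the text uses 1000 repetitions
groups = [(rng.choice("FM"), rng.choice(["married", "never", "other"]))
          for _ in range(n)]
qual_true = [draw_qualification(g, rng) for g in groups]
true_props = proportions(qual_true)

results = {"unconditional": [], "conditional": []}
for _ in range(n_reps):
    # Step 1: MCAR, every unit is equally likely to be set missing.
    missing = set(rng.sample(range(n), n_missing))
    qual = [None if i in missing else q for i, q in enumerate(qual_true)]
    # Step 2: impute with both methods and record the level proportions.
    results["unconditional"].append(proportions(impute_unconditional(qual, rng)))
    results["conditional"].append(
        proportions(impute_conditional(qual, groups, rng)))

# Step 3 summary: average proportion per level over the repetitions.
for method, reps in results.items():
    means = {lvl: sum(r[lvl] for r in reps) / n_reps for lvl in LEVELS}
    print(method, {lvl: round(m, 3) for lvl, m in means.items()})
print("true", {lvl: round(p, 3) for lvl, p in true_props.items()})
```

Because the data are synthetic, the printed averages only illustrate the mechanics; they say nothing about the results reported for the SURF data.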
Figure 9.1 and Figure 9.2 show, for each of the four levels, the proportion of total observations across the 1000 imputed qualification variables. The red lines represent the true proportions of the four qualification levels in the complete (non-missing) qualification variable. As with unconditional and conditional mean imputation in Chapter 4, conditional mode imputation performs better, showing less bias relative to the true proportions than unconditional imputation, even though the missing data are MCAR.
The distributions in the four graphs of Figure 9.1 look somewhat strange. After imputation, the proportions of the categories None and Degree are lower than their true proportions, while the proportions of School and Vocational are either lower or higher than their true proportions. What causes these patterns? The rare categories², None and Degree, are highly likely to remain rare after the MCAR missing observations are created, so they are never used as the imputed value under unconditional imputation. This is why their proportions after imputation are lower than the true proportions. For the categories with large proportions of observations, School and Vocational, it is highly likely that one of them becomes the category with the most observed values after the MCAR missing observations are created. Unconditional imputation then imputes all the missing observations with that most populated category, so its proportion after imputation is larger than the true proportion. The category with the second most observations, on the other hand, suffers the same fate as the rare categories: it is not used for imputation. However, we simulated the creation of 50 MCAR missing observations 1000 times, and which of School and Vocational becomes the most populated category is random; hence both categories had the chance to be imputed.
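This pattern can be made explicit with a short counting argument. The symbols used here are introduced for illustration only and are not part of the original notation: $n_k$ is the true count of category $C_k$ among the 200 observations, $m_k$ is the number of those observations set to missing (so $\sum_k m_k = 50$), and $k^*$ is the index of the mode among the observed values. After unconditional mode imputation, the proportion $\hat{p}_k$ of each category is

$$
\hat{p}_k =
\begin{cases}
\dfrac{n_k - m_k}{200} \le \dfrac{n_k}{200} & \text{if } k \ne k^* \\[2ex]
\dfrac{n_{k^*} - m_{k^*} + 50}{200} \ge \dfrac{n_{k^*}}{200} & \text{if } k = k^*
\end{cases}
$$

so every category other than the observed mode can only lose proportion, while the observed mode can only gain, since $m_{k^*} \le 50$.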
The distributions in the four graphs of Figure 9.2 show that the conditional imputation method produced much better estimates than the unconditional method, although the estimates are still biased relative to the true proportions. The improvement is due to the imputation being conditioned on the “Gender” and “Marital” variables, which separates the data into several subgroups. There is then a chance that a rare category becomes the most populated category within a subgroup, in which case the missing observations in that subgroup are imputed with this “new” most populated category. As a result, the missing observations are not all imputed with a single category value, and the rare categories also have a chance to be imputed. This is why the estimated proportions move towards the true proportions. However, the proportions of the rare categories, None and Degree, are small, and under the MCAR mechanism the chance that they become the dominant category in a subgroup is slim. This is why the majority of the proportions of None and Degree after imputation are still lower than their true proportions.
Please refer to Appendix E, section E.1.1 for the R code.
[Figure 9.1 image: four histogram panels titled None, School, Vocational, and Degree; vertical axis: Frequency]
Figure 9.1: The proportions of the four qualification categories. The 1000 qualification variables were imputed by unconditional mode imputation.
[Figure 9.2 image: four histogram panels titled None, School, Vocational, and Degree; vertical axis: Frequency]
Figure 9.2: The proportions of the four qualification levels. The 1000 qualification variables were imputed by conditional mode imputation.