1527
Volume 62 159 Number 6, 2014
http://dx.doi.org/10.11118/actaun201462061527
MISSING CATEGORICAL DATA
IMPUTATION AND INDIVIDUAL
OBSERVATION LEVEL IMPUTATION
Pavel Zimmermann
1, Petr Mazouch
1, Klára Hulíková Tesárková
21 Department of Statistics and Probability, Faculty of Informatics and Statistics, University of Economics, nám. W. Churchilla 4, 130 67 Prague 3, Czech Republic
2 Department of Demography and Geodemography, Faculty of Science, Charles University in Prague, Albertov 6, 128 00Prague 2, Czech Republic
Abstract
ZIMMERMANN PAVEL, MAZOUCH PETR, HULÍKOVÁ TESÁRKOVÁ KLÁRA. 2014. Missing Categorical Data Imputation and Individual Observation Level Imputation. Acta Universitatis
Agriculturae et Silviculturae Mendelianae Brunensis, 62(6): 1527–1534.
Traditional missing data techniques of imputation schemes focus on prediction of the missing value based on other observed values. In the case of continuous missing data the imputation of missing values o en focuses on regression models. In the case of categorical data, usual techniques are then focused on classifi cation techniques which sets the missing value to the ‘most likely’ category. This however leads to overrepresentation of the categories which are in general observed more o en and hence can lead to biased results in many tasks especially in the case of presence of dominant categories. We present original methodology of imputation of missing values which results in the most likely structure (distribution) of the missing data conditional on the observed values. The methodology is based on the assumption that the categorical variable containing the missing values has multinomial distribution. Values of the parameters of this distribution are than estimated using the multinomial logistic regression. Illustrative example of missing value and its reconstruction of the highest education level of persons in some population is described.
Keywords: missing data, categorical data, multinomial regression
1 INTRODUCTION
Popular methods for a completion of (individual) observation as for example mean imputation, regression imputation or maximal likelihood imputation are usually focused on imputation of a continuous variable. Those methods mostly classify the missing values as “most likely” or “expected” values. Overview of those methods can be found for example in Schafer, Graham, 2002. List of methods for imputation of categorical variable is less extensive. In the case of categorical data, usual techniques are then focused on classifi cation techniques which sets the missing value to the ‘most likely’ category (see Sentas et al., 2004). This however leads to overrepresentation of the categories which are in general observed more o en and hence can lead to biased results in many tasks especially in the case of presence of dominant categories.
The aim of the paper is to introduce multinomial logistic regression as very eff ective tool for missing data imputation. Motives for using this technique could be described by the following three requirements:
• to impute data set in form which can be re-used for variety of diff erent analysis and applications; this means single imputation is required,
• to impute data in the most detailed level; optimally on individual observation level,
• to impute data in a way that will respect “expected” ratios of categories in general.
In the following text the methodology and its specifi c features will be described.
2 MISSING DATA TYPOLOGY
will be adopted. Rubin considered the missingness as a probabilistic phenomenon, i.e. a set of random indicator variables R indicating non-missingness of a particular observation was considered. Also the partition of the complete dataset Ycom into set
of observed values Yobs and set of missing values Ymis,
i.e.
Ycom= (Yobs, Ymis),
was considerd. Missing data are called missing at random (MAR) in the case where the distribution of the missingness does not depend on Ymis, i.e. when
P(R|Ymis)= P(R|Yobs).
This is the case where ‘MAR allows the probabilities of missingness to depend on observed data but not on missing data’. A special case of the MAR is then MCAR (missing completely at random), where the probabilities of missingness do not depend on the observed data either:
P(R|Ycom)= P(R).
If MAR is violated, data are missing not at random (MNAR).
The Task Solved Within the Paper In this article a methodology for a specifi c task is developed which can be however reused in many similar tasks.
We assume a univariate pattern of categorical data, i.e. data where several variables are completely observed (Xobs) and one variable contains missing
values. This was schematically expressed as in Schafer, Graham, 2002.
More precisely, we assume nt complete observations of p categorical variables observed over
T time periods denoted as Xi,t,j, i = 1, …, nt, t = 1, …, T,
j = 1, …, p. Observations of Xi,t,j are complete for all
years. Furthermore we observe another categorical variable Yi,t, i = 1, …, nt, t = 1, …, T. For the years
t = 1, …, c < Tobservations of Yi,t are complete
(or with just negligible amount of missing data). These years will be referred as ‘complete data years’. For the outstanding years t = c + 1, …, T the amount of missing data for Yi,t is rather large and MAR is not guaranteed. These years will be referred as ‘missing data years’.
We assume that c is ‘suffi ciently large’ to reasonably extrapolate trends for the missing data years and we assume that the trends observed during the complete data years are relevant for the predictions for the missing data years. Observations of Yi,t are assumed conditionally
independent conditioning on Xi,t,j.
The Time Structure of Data Set According to Missing Data
From the time point of view three types of missing data position could be distinguished. The fi rst is situation where we have complete information from some moment (year) but before this time missing data occur. In such a situation the aim is to reconstruct data before some point in time.
The second example is situation where data are complete, however, from some moment in time some (or all) data are missing. The aim in such a situation is to estimate the missing information for that period a er any concrete moment. The third type is a situation where we have missing data “in the middle” of the time period, i.e. for some (limited) period of time the information is partially or completely missing. The aim is to bridge this part, estimate the missing data respecting trends before and a er this missing period.
3 THE IMPUTATION ALGORITHM
In the following text, we will mean by determinants the original or rediscretized variables that have a signifi cant impact on the distribution of the variable containing missing data. (The signifi cance is measured over the years with complete data.) By profi le we then mean a group of data with the same combination of values of the determinants. The set of determinants will be denoted as X. The matrix containing observations of determinants up to the time t will be denoted as Xt.
The basic steps of our imputation algorithm: 1. Based on the data observed in complete data
years Xi,t,j, i = 1, …, nt, t = 1, …, c, j = 1, …, p and
Yi, i = 1,…, nt, t = 1, …, c, fi nd determinants X
of the missing data structure.
2. Defi ne profi les of observations with missing data based on the values of the observed determinants
Xt. The conditional distribution of Y is diff erent
conditioning on diff erent profi les.
3. Estimate probabilities of each category q, q = 1, …, k of the missing variable Y for each profi le, i.e. estimate P(Y = q|X).
4. Based on the probabilities, fi nd “appropriate” count of missing observations of each category in each profi le and distribute these counts to each individual in the profi le.
containing the missing values (Y) follows for a given profi le the multinomial distribution. This fact immediately suggests using the multinomial logistic regression on the complete data years (t ≤ c) as the methodology for fi nding the determinants (X) of the distribution (structure) of Y (as the response variable) and predicting the conditionally expected probabilities of each category of the response variable for each profi le of data at each time point (for both t ≤ c and t > c). This requires assessing the time variable as covariate and assuming some (possibly polynomial) trend. That is the probability distribution of the categories of Y for each profi le in each year P(Yt= q|X, t) is fi tted as the outcome
of the regression analysis (steps 1–3 of the above outlined imputation algorithm).
3.2 Multinomial Logistic Model
The multinomial regression method is a generalization of the logistic regression to multiclass problems. It is assumed that the response variable is a categorical variable with k possible outcomes. One of the k categories are selected as the ‘reference’ category. For every other category q, a regression equation is assumed in the model to describe the logarithmic odds of the category q
to the reference category, i.e. equations
,0 , ,
0 1 1 2 2
log q
q q q
p
x x
p
are assumed, where pq is the probability
of the outcome q of the response variable, p0 is
the probability of the reference category, q,j are
the parameters and xj are the regressors. Parameters
q,j are fi tted using the maximum likelihood method. (See Hosmer, Lemeshow, 2004 for details.) Probabilities pq may then be calculated using
the parameter estimates using the formulas
, ,
, ,
1 1 2 2
1 1 2 2
exp( …)
1 exp( …)
q q
q
q q
x x
p
x x
and
, ,
0
1 1 2 2 1
1 exp( q q …)
p
x x .
3.3 Partially Missing Data
Based on the above described analysis we obtain the predicted distribution of the variable containing the missing data (Y) also for the years containing missing data (t > c) for each profi le and each time point (conditioning on X and t will be leaved out in the notation of this section for simplicity). However, for these years we may have some amount of observed data (supposing partially missing data in the data set). Therefore we can estimate two distributions of missing values, fi rst based
on complete data years and second based on missing data years:
1. First prediction of the distribution P(Y = q) for each category q = 1, …, k and a given profi le and each time point as the prediction based on the complete data years, i.e. Xt, Yt t < c.
2. Second distribution P(Y = q|R = 0) fi tted based on the observed data in the missing data years
Xt, Yt, t > c, i.e. distribution conditional on the fact
that an observation is not missing.
Besides these distributions, we can also estimate the probability of missing values (P(R = 0)). The (marginal) distribution P(Y = q) equals
P(Y = q)= P(Y = q, R = 0) + P(Y = q, R = 1), where P(Y = q, R = 0) (or P(Y = q, R = 1)) is the (joint) probability that the observation is certain category and is missing (or is not missing respectively) which equals
P(Y = q, R = 0) = P(Y = q|R = 0) P(R = 0)
and
P(Y = q, R = 1) = P(Y = q|R = 1) P(R=1).
We can write for the distribution of the observations that are missing (i.e. for which we already know that R = 0) as:
P(Y = q|R = 0) =
= [P(Y = q) − P(Y = q|R = 1) P(R = 1)]/ P(R = 0). Furthermore the estimates of the diff erences between the distributions P(Y = i)and P(Y = i|R = 1) may suggest the (non)randomness in missingness ‘mechanism’.
3.4 Finding the Appropriate Count of Missing Observations of Each Category
Let us assume one particular profi le of the data in a given year. Based on the above described regression analysis we can get the estimated distribution of the categories of Y for the missing observations, denoted as P(Y = q|R = 0) = pq, q = 1, …, k
for this given profi le and year. Furthermore we know that in this profi le and year, there is certain amount of missing data n. The distribution of the missing observations is (under the assumption of conditional independence) multinomial with the given parameter vector p = (p1, p2, …, pk) and n.
Given the probability distribution of the categories of the response variable and the number of missing observations we still need to determine how many of the missing observations correspond with each category (step 4 of the above outlined imputation algorithm). This variable will be denoted as Uq,
and hence we need to determine counts (integers) of missing observations for each category.
Normally the expected value would be the fi rst choice for the predictions as it yields predictions with the lowest least square error. The expected value of the multinomial distribution in a particular category q is simply the count of missing observations (in the particular profi le in the particular year) times its probability, i.e.
E(Uq) = n pq, q = 1, …, k.
However, the expected values are generally real numbers (not necessarily integers) and hence do not allow for imputation on individual observation level. Therefore we suggest using the maximum likelihood criterion where the maximization is performed only on the discrete (integer) numbers. This means fi nding such uq, q = 1, …, k that the joint
distribution P(u1,u2,…,uk |p, n) is maximized, i.e. we
are looking for a vector u = (u1, u2, …, uk) for which
arg maxP(u; p, n)
u
where P denotes the probability function of the multinomial distribution. This in fact means we are looking for the mode of the multinomial distribution.
Mode of the Multinomial Distribution There is no closed form formula for the mode of the multinomial distribution. There are however several iterative algorithms developed for this task. See for example Lloyd et al., 1997, Finucan, 1964 or Le Gall, 2003. In our computations we selected the Finucan’s algorithm published in Finucan, 1964.
Distribution of Estimated Data on the Individual Level
Having found the mode of the multinomial distribution for a particular profi le we have a vector of counts (integers) of missing values of each category of the variable of the concern which has the highest probability. Within the profi le, these counts may be ‘assigned’ randomly to the individuals as all individuals of the given profi le have the same probability vector pi, i = 1, …, k of being in i-th
category.
4 APPLICABILITY OF THE ALGORITHM
The proposed method of estimation of missing data could be used in many spheres of application. In this paper we demonstrated the algorithm on (completely or partially unknown) education structure of a population. Education attainment could be taken as a typical example of categorical data. Moreover, when studying the population, this type of data is relatively o en incomplete. Other example could be e.g. the marital status, age profi le, etc.
The described algorithm is based on the assumption of continuous trend in the data within the missing data years. It corresponds with situation where data are missing because of some administrative changes etc. which does not aff ect the trend in the data. Application of the described method in situations where this condition is not fulfi lled (e.g. where the missingness of the data is at least partially related to some changes aff ecting also the long-term trend – wars, etc.) would mean some sort of extrapolation of “unaff ected” development – how the structure (partially or completely missing) would have developed if there had not been any interruption of the trend.
5 PRACTICAL APPLICATION
5.1 Dataset
The following variables are available within our dataset: Education, Sex, Marital status, Diagnoses of death, Age and Year of death. The dataset contains individual deaths in the Czech Republic 1995–2011. As a practical application we assumed educational attainment of a studied population as the variable containing missing data (Y).
The education is perceived as an important proxy for social status and behavioural habits and therefore it is a factor driving mortality. The education was collected obligatory until 2009 only and almost 40% of cases are missing in the year 2010 and 2011 and the other 60% are unreliable. Due to this fact it is necessary to impute the education conditioning on the other observed variables (i.e. using the information contained in the other variables X) and some regression model is necessary to forecast the probability of a death being in a given educational category given the other observed values in the year 2010 and 2011.
5.2 The Multinomial Model
Fitted Model
of combinations of the levels of the above listed determinants.
Further Interesting Relations Observed We present some of the results observed that are in particular interesting. In order to be able to extrapolate the trend into unobserved years 2010 and 2011, we need to use a parametric function for the eff ect of the year of death. In this case second order polynomial was particularly suitable especially with an extra eff ect of the male gender until 2002. The trend curves are displayed in the Fig. 2.
Main eff ect of the male gender reduces the log odds ratio of having basic education (relatively
to having low education) and increases the log odds ratio of having middle or high education. Results are in Fig. 3a.
Main eff ect of being single increases relatively the chances of having basic, middle or high education. Main eff ect of being a widower increases the chances of having basic education and decreases the chances of having middle or high education (Fig. 3b).
The interaction of the male gender with the single status was in particular signifi cant. We can interpret the result that single status increases the log odds ratio of having basic education, especially for males. Single status increases the log odds ratio (to low education) of females of having middle or high I:
Variable Nr. levels Levels
Sex 2 Male/Female
Marital status 4 Single/Divorced/Widower/Married
Age 4 0–16/16–39/40–59/80–110
Cause of death 3 Cancer/Circulatory/Other
Ͳ4
Ͳ3
Ͳ2
Ͳ1 0 1 2
log
odds
ratio
Low
Middle
High
LowͲmales
MiddleͲmales
HighͲmales
2: Trend curves of the effect of the year of death
Ͳ.800
Ͳ.600
Ͳ.400
Ͳ.200 .000 .200 .400 .600 .800 1.000 1.200 1.400
male
log
odds
ratio
1 3 4
Ͳ.800
Ͳ.600
Ͳ.400
Ͳ.200 .000 .200 .400 .600 .800 1.000
singl
e
widower divorced married
log
odds
ratio
1 3 4
Ͳ.200
Ͳ.100 .000 .100 .200 .300 .400
neoplasms circulatory
other
log
odds
ratio
1 3 4
3a 3b 3c
education and the interaction mitigates the single eff ect for males. Interactions are described in appendix (App. 1).
Neoplasms diagnose generally decreases the odds of having basic education and it strongly increases the odds of having middle or high education (Fig. 3c).
Circulation diseases strongly increase the odds of having basic education and decrease the odds of having high education. Diff erence between genders was not observed.
There are signifi cant diff erences between the impact of the cancer on education for diff erent genders (interaction gender and cause): The impact of cancer on the decrease of the odds of having basic education and the increase of the odds of having middle or high education is much stronger for females than for males. So the neoplasm cause of death is determining the education much more for females than for males (see App. 2).
5.3. Predicted Probabilities and Imputation Based on this model, it is possible to determine the conditional probability distribution of the educational levels for each combination of the values of the regressors (for each profi le).
These probabilities together with the number of missing observations for each profi le specify the multinomial distribution. The mode is searched for each profi le using the Finucan’s algorithm. These modes are then the number of imputed observations for each educational level in each profi le. The resulting imputation is for each education level displayed in comparison with the observed sample in Fig. 4.
6 CONCLUSION
Aim of this paper was to introduce multinomial logistic regression as very eff ective tool to missing data imputation. To the authors’ knowledge the combination of the multinomial regression and mode searching algorithm was used for the fi rst time for the missing data imputation task. The outcome of the proposed algorithm follows expected structure of the variable containing the missing values.
As a by-product the outcomes of the intermediate steps of the algorithm may be used for further analyses such as analyses of the dependencies (determinants) of the variable of our concern, or analysis of the missingnes mechanism.
0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7
log
odds
ratio
Basic
0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7
log
odds
ratio
Low
0.00 0.02 0.04 0.06 0.08 0.10 0.12 0.14 0.16
log
odds
ratio
Middle
0.00 0.01 0.01 0.02 0.02 0.03 0.03 0.04 0.04
log
odds
ratio
High
Future steps in the research will be to proof this method in some other practical situation. Demographic data (with incomplete information about the education attainment occurring in the latest years of the involved time period – as in the Fig. 3b) were used for the very fi rst verifi cation of the model and fi rst results seem to be acceptable.
Next part of the research will be to fi nd more datasets with missing data, both MAR and MNAR and with diff erent structure of missing data from the time point of view (length of missing, time of missing) and to prepare more detailed analysis of complemented data fi les.
SUMMARY
Paper presents original methodology of imputation of missing values which results in the most likely structure (distribution) of the missing data conditional on the observed values. The aim of the paper is to introduce multinomial logistic regression as very eff ective tool for missing data imputation. Motives for using this technique could be described by the following three requirements:
1. to impute data set in form which can be re-used for variety of diff erent analysis and applications; this means single imputation is required,
2. to impute data in the most detailed level; optimally on individual observation level, 3. to impute data in a way that will respect “expected” ratios of categories in general.
To the authors’ knowledge the combination of the multinomial regression and mode searching algorithm was used for the fi rst time for the missing data imputation task. The outcome of the proposed algorithm follows expected structure of the variable containing the missing values.
The methodology is based on the assumption that the categorical variable containing the missing values has multinomial distribution. Values of the parameters of this distribution are than estimated using the multinomial logistic regression. The multinomial regression method is a generalization of the logistic regression to multiclass problems.
As a by-product the outcomes of the intermediate steps of the algorithm may be used for further analyses such as analyses of the dependencies (determinants) of the variable of our concern, or analysis of the missingnes mechanism.
Demographic data (with incomplete information about the education attainment occurring in the latest years of the involved time period) were used for the very fi rst verifi cation of the model and fi rst results seem to be acceptable.
Acknowledgement
The article was written with the support provided by the Grant Agency of the Czech Republic to the project No. P404/12/0883 “Generační úmrtnostní tabulky České republiky: data, biometrické funkce
a trendy“. Earlier version of this paper with some preliminary results was presented at the Mathematical
Methods in Economics 2013 conference held in Jihlava. Authors would like to thank anonymous referees for their extremely valuable and extended suggestions on how to improve the paper. All errors and omissions remain the authors’ own.
REFERENCES
FINUCAN, H. M. 1964. The Mode of a Multinomial Distribution. Biometrika, 51(3–4): 513–517.
HOSMER JR, D. W., LEMESHOW, S. 2004. Applied
logistic regression. New York: John Wiley & Sons.
JOHNSON, L. J., KOTZ, S. and BALAKRISHNAN, N. 1997. Discrete multivariate distributions. Vol. 165. New York: Wiley.
LE GALL, F. 2003. Determination of the modes of a Multinomial distribution. Statistics & Probability
Letters, 62(4): 325–333.
RUBIN, D. B., 2002. Inference and missing data.
Biometrika, 63(3): 581–592.
SCHAFER, J. L., GRAHAM, J. W. 2002. Missing data: our view of the state of the art. Psychological methods, 7(2): 147–177.
PANAGIOTIS, S., LEFTERIS, A., STAMELOS, I. 2004. Multiple logistic regression as imputation method applied on so ware eff ort prediction.
In: Proceedings of the 10th International Symposium
on So ware Metrics, 2004. Chicago: IEEE Computer
Society.
ZIMMERMANN, P., MAZOUCH, P., HULÍKOVÁ TESÁRKOVÁ, K. 2013. Categorical data imputation under MAR missing scheme. In:
Proceedings of the 31st International Conference
Mathematical Methods in Economics, 2013. Jihlava:
Contact information Pavel Zimmermann: [email protected]
Petr Mazouch: [email protected]
Klára Hulíková Tesárková: [email protected]
Ͳ0,8
Ͳ0,6
Ͳ0,4
Ͳ0,2 0 0,2 0,4 0,6 0,8 1
1 3 4
female
male
Ͳ0,8
Ͳ0,6
Ͳ0,4
Ͳ0,2 0 0,2 0,4 0,6 0,8 1
single widower divorced married 1 3
4
Appendix 1: The interaction of the male gender with the single status
Ͳ0,2
Ͳ0,1 0 0,1 0,2 0,3 0,4
1 3 4
female male
Ͳ0,2
Ͳ0,1 0 0,1 0,2 0,3 0,4
neoplasms circulatory other
1 3 4