Workpackage 11
Imputation and Non-Response
Deliverable 11.2
List of contributors:
Seppo Laaksonen, Statistics Finland;
Ueli Oetliker, Swiss Federal Statistical Office;
Susanne Rässler, University of Erlangen-Nürnberg;
Jean-Pierre Renfer, Swiss Federal Statistical Office;
Chris Skinner, University of Southampton.
Main responsibility:
Seppo Laaksonen, Statistics Finland;
Susanne Rässler, University of Erlangen;
Chris Skinner, University of Southampton.
IST–2000–26057–DACSEIS
The DACSEIS research project is financially supported within the IST
programme of the European Commission. Research activities take place in
close collaboration with Eurostat.
http://europa.eu.int/comm/eurostat/research/
http://www.cordis.lu/ist/
Preface
The ultimate objective of this document is to provide pseudo code for the implementation of imputation methods in the DACSEIS simulation studies. This code will be given in Appendices B and C. Before defining the code, we shall outline the rationale for imputation in Chapter 1. This document distinguishes between two broad approaches, single imputation and multiple imputation, according to whether one or more imputed values are constructed for each missing value. These two approaches are described in Chapters 2 and 3 respectively. We shall consider some relatively simple applications, focussing on univariate rather than multivariate missing data and not considering hierarchical data structures. The variable with missing values to be imputed is categorical in four applications (the labour force surveys of Austria, Finland and the Netherlands, and the microcensus of Germany) and continuous in the other two applications (the Swiss Household Budget Survey and the German Survey of Income and Expenditure) (see deliverable D1.1). Some more detailed consideration of imputation in the Swiss Household Budget Survey is given in Appendix A.
Seppo Laaksonen Statistics Finland
Susanne Rässler University of Erlangen
Contents
List of figures
1 Introduction
1.1 Outline of Chapter
1.2 Non-Response in Surveys
1.3 The Treatment of Non-Response: Weighting and Imputation
1.4 Non-response Mechanisms
1.5 Non-response Mechanisms for Simulation Study
1.6 Reasons for Imputation
1.7 Single Imputation or Multiple Imputation
2 Single Imputation
2.1 Introduction
2.2 Outline of the Imputation Process
2.3 Univariate or Multivariate Missingness Patterns
2.4 Imputation Model
2.5 Distance Metrics
2.6 Imputation Methods
3 Multiple Imputation
3.1 Introduction
3.2 MI Methods for Multivariate Missing Data
3.2.1 Iterative Univariate Methods
3.2.2 Gibbs Sampling
3.3 MI Methods for Univariate Missing Data
3.3.1 Continuous Variables: HBS Type of Data
3.3.2 Binary Variables: LFS Type of Data
3.3.3 Semicontinuous Variables
3.4 Typical Problems that May Occur with the Chained Equation MI
3.4.1 Incorporating the Sampling Design
3.4.2 Collinearity
3.4.3 Rounding Off
3.4.4 Monotone Missingness Versus Incompatibility
3.5 Conclusions
A Imputation of Missing Income Data in the Swiss Household Budget Survey
B Core SAS Codes for Single Imputation Methods
C S-Plus / R codes for the Proposed Imputation Methods
C.1 Single Imputation for Continuous Data
C.1.1 Regression Imputation
C.1.2 Ratio Imputation for Swiss Data
C.2 Single Imputation for LFS Type of Data
C.2.1 Specification (i)
C.2.2 Specification (ii)
C.2.3 Specification (iii)
C.2.4 Specification (iv)
C.3 Multiple Imputation Codes
C.3.1 Pooling the Estimates
C.3.2 MI Algorithm for Continuous Variables: HBS Type of Data
C.3.3 MI Algorithm for Binary Variables: LFS Type of Data
List of Figures
2.1 Structure of imputations
3.1 Encoding semicontinuous variables
3.2 Monotone pattern of missingness
Chapter 1
Introduction
1.1 Outline of Chapter
In this chapter we describe the general non-response setting in which imputation might be considered and outline why imputation is used. We also explain the distinction between single imputation and multiple imputation which underlies the structure of this report. We make some remarks on the mechanisms for generating non-response in the DACSEIS universe. We do not, however, specify these mechanisms in detail here. The specifications are reported for each DACSEIS file; see the deliverables for WP1.
1.2 Non-Response in Surveys
Sample surveys are invariably subject to non-response.
Unit non-response arises when no survey data are collected for a unit, for example because no contact is established with the unit or because of the unit’s refusal to participate in the survey.
Item non-response arises when some data are collected for a unit but values of some items are missing, for example because the respondent refuses to provide the answer to a sensitive question or is unable to provide the answer to a question requiring complex information.
Some patterns of non-response might be viewed as either unit or item non-response, for example wave non-response or attrition in a longitudinal survey or non-response by one person in a household survey where other members of the household respond. See Groves et al. (2002) for further discussion of the sources and reasons for non-response in surveys.
All these cases of non-response may be considered as examples of missing data in surveys. Data may also be missing for other reasons. For example, no data will be available for all units which are not present in the sampling frame (the problem of under-coverage). In some surveys or censuses, a short questionnaire is completed by one group of respondents while a longer questionnaire is completed by the rest, leading to deliberately missing data in the first group on the variables only featuring in the longer questionnaire.
1.3 The Treatment of Non-Response: Weighting and Imputation
Ideally, non-response should be prevented from occurring, but in practice some amount of non-response inevitably arises. It is then necessary to consider how to treat the non-response at the estimation stage of the survey. Estimation typically involves the construction of a point estimator and an associated variance estimator.
Different approaches to point estimation may be adopted to account for non-response. The simplest kinds of methods just ignore the non-response. In the case of unit non-response this will typically involve treating the set of responding units as if it were the selected sample. In the case of item non-response this may involve deleting units which have missing values on any of the variables used in a particular analysis (available cases analysis) or may involve deleting units which have missing values on any of the survey variables (complete cases analysis). Such approaches may be subject to bias and, in general, do not make the most efficient use of the data. We shall not consider them explicitly, although they will typically feature as special cases of the approaches we do consider. The two principal methods used to correct for bias due to non-response and to make efficient use of data are weighting and imputation. Other model-based techniques, such as the EM algorithm, may also be used for some kinds of analyses.
Weighting is classically used to treat the problem of unit non-response, whereas imputation is classically used to treat problems of item non-response. Weighting is a ‘unit-level’ adjustment, providing a common form of adjustment for all analyses based on a common set of responding units and is thus natural for the treatment of unit non-response. It is less practical to use weighting to treat item non-response, since a different method of weighting would be required for estimates based upon different sets of variables. Weighting is particularly natural for complex survey data involving complex sampling designs and/or the presence of auxiliary population information. In such settings, weighted estimation is used even in the absence of non-response to adjust for unequal selection probabilities and to make efficient use of auxiliary population information. The additional use of weighting to treat non-response in such settings is sometimes called re-weighting, since it may involve the modification of an initial set of weights.
In contrast to weighting, imputation is a variable-specific adjustment and is thus natural to treat missing data in a given variable. Each missing value is replaced by an imputed (fabricated) value. Imputation tends to become more complicated and time consuming as the number of variables increases and so is more difficult to implement for the treatment of unit non-response in surveys with many variables.
1.4 Non-response Mechanisms
The process generating the non-response is called the non-response mechanism. This is a special case of a missing data mechanism, the process that leads to missing data. In the simulation studies, it will be necessary to specify non-response mechanisms and we discuss this in the next section. The nature of the non-response mechanism is important
for understanding the impact of non-response on estimation and therefore for determining the way that non-response will be treated in estimation. It is important to emphasize that the non-response mechanism is generally unknown to the user of the data. Weighting or imputation methods will typically be based upon some assumptions or model, explicit or implicit, about the non-response mechanism, but these assumptions or model will never be more than an approximation to the truth.
For the purpose of constructing weighting and imputation procedures and for evaluating bias and variance, it is useful to classify missing data mechanisms in a number of different ways. The following ‘definitions’ are heuristic. See Little and Rubin (2002) for more precise definitions.
Missing Completely At Random (MCAR): The values of the set of variables used to construct a point estimator are missing completely at random if missingness is independent of all these variables.
Under the MCAR condition, non-response generally does not introduce bias into the point estimator.
Missing at Random (MAR): The values of the set of variables used to construct a point estimator are missing at random given an additional set of measured variables if missingness is independent of the values of the variables which are missing, conditional on the observed values of both sets of variables.
More simply, we may say that missingness is at random if it does not depend upon the underlying values which are missing, conditional on information used for estimation. A mechanism which is not MAR is called Not Missing At Random (NMAR). In this case, missingness of a variable will generally depend in some way on the value of the variable which is missing, even conditional on observable information. For example, non-response on income may be more likely to occur if income is high.
Under the MAR condition, it is generally feasible, in principle, to construct an adjusted point estimator which is approximately unbiased in large samples, providing the relation between the variables is correctly modelled.
Ignorable missingness: The missing data mechanism is ignorable if the properties of a
given inference procedure (e.g. point estimation) are not affected by the nature of the mechanism.
In practice, the MAR condition often implies ignorable missingness for many kinds of inference procedures. The advantage of being able to assume ignorable missingness is that it is not necessary to specify a statistical model for the missing data mechanism when determining a weighting or imputation procedure.
Non-ignorable missingness: The missing data mechanism is non-ignorable if it is not
ignorable.
In practice, the NMAR condition often implies non-ignorable missingness. It is generally more difficult to construct weighting and imputation procedures to give approximately unbiased estimates when non-response is non-ignorable.
1.5 Non-response Mechanisms for Simulation Study
In this section, we comment on the non-response mechanisms which could be employed in the simulation studies.
Coverage: It would be realistic to include some undercoverage and overcoverage in the simulation. Sources of undercoverage and overcoverage tend to be quite different. For example, in a typical household survey, overcoverage is common for emigrants, whereas undercoverage is common for infants and immigrants. The creation of overcoverage and undercoverage in the simulation could be straightforward if such characteristics are available in the universe file. However, the treatment of coverage problems is not the primary purpose of imputation and it therefore seems sensible not to create such problems, at least in the first simulation studies.
Non-response: It is easiest to start with the simple assumption that non-response is MCAR for each stratum but with a different level of non-response in each. This assumption may then be made progressively more complex by, for example:
– creating missing values with varying probabilities (response propensities) for each unit of the universe using a logistic regression model, which may be estimated from a sample survey with a similar structure to the universe. These are thus general response propensities which do not depend on the survey variables, corresponding to the MAR assumption. It is possible to add some random noise to the modelled values.
– assigning an additional propensity coefficient at the item level (this is thus conditional on the fact that the unit responds): we suppose that this coefficient equals 1 for most variables, but is less than 1 for some more problematic variables. This may be done purely by random selection (MCAR, MAR) or, as in the unit-level case, modelled using logistic techniques.
– using different levels of missingness propensity (high, low) in order to test the effect of the level.
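The mechanisms listed above can be sketched in a few lines of code. The following is a minimal illustration in Python, not the DACSEIS implementation: the record layout (keys `stratum` and `age`), the coefficient values and the function names are all assumptions made for the example.

```python
import math
import random

def mcar_by_stratum(units, nonresponse_rates, rng):
    """Unit response indicators (True = responds), MCAR within each stratum.

    nonresponse_rates maps a stratum label to its non-response probability.
    """
    return [rng.random() >= nonresponse_rates[u["stratum"]] for u in units]

def logistic_propensity(x, intercept, slope):
    """Response probability from a logistic model in one auxiliary variable."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * x)))

def mar_by_propensity(units, intercept, slope, rng):
    """Unit response depending only on the fully observed auxiliary 'age' (MAR)."""
    return [rng.random() < logistic_propensity(u["age"], intercept, slope)
            for u in units]

def item_response(unit_responds, item_coef, rng):
    """Item response conditional on unit response.

    item_coef = 1 means the item is always observed for responding units;
    values below 1 represent more problematic variables.
    """
    return [r and rng.random() < item_coef for r in unit_responds]

# Hypothetical universe of 1000 units in two strata.
rng = random.Random(2003)
universe = [{"stratum": i % 2, "age": 20 + (i % 50)} for i in range(1000)]
resp_mcar = mcar_by_stratum(universe, {0: 0.1, 1: 0.3}, rng)
resp_mar = mar_by_propensity(universe, intercept=2.0, slope=-0.02, rng=rng)
income_observed = item_response(resp_mar, item_coef=0.8, rng=rng)
```

In a real simulation the logistic coefficients would be estimated from a survey with a similar structure, as described above, rather than fixed by hand.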
1.6 Reasons for Imputation
Imputation is used for a number of reasons in official statistics, including the following.
Complete Datasets: After missing values are replaced by imputed values, the original dataset with ‘holes’ becomes a complete dataset. This has various advantages. All analyses of this dataset will be mutually consistent. For example, if two tables are produced, cross classifying variable A by variable B and variable C by variable B, then the B margins of the two tables will be identical if both tables are based upon the same completed dataset but this might not be the case if the tables had been based upon data with different sets of missing values for the three variables. A completed dataset also avoids various problems of datasets with missing values, e.g. that different users might deal with these missing values in different ways (leading to inconsistent analyses) or may treat the missing data erroneously, e.g. treating a missing value code as real.
Edit and Imputation: A second reason for using imputation is to handle the result of
editing, where e.g. invalid responses are identified, and it is desirable to replace them by valid values.
Reduction of Item Nonresponse Bias: A third reason for imputation is to reduce bias
arising from item non-response. If the values of auxiliary variables x are available for cases with missing values of y and if x and y are correlated then it will often be possible to reduce the bias by imputing the missing values of y using the observed x values.
1.7 Single Imputation or Multiple Imputation
The traditional approach to imputation in official statistics is to produce just one imputed value for each missing item. This is called single imputation. This can achieve the various objectives described in the previous section. Imputation can, however, create a problem for variance estimation.
The standard approach to point estimation under imputation is to treat the imputed values in the completed dataset as if they were actual values. Imputation methods are generally designed so that this approach will lead to a less biased estimator than would arise if cases with missing values were simply deleted. There is a problem, however, if the same principle is applied to variance estimation, i.e. if standard errors are estimated from standard software using the completed dataset with the imputed values treated as real. In this case the estimated variances will generally be too small, since the variance estimation method will fail to allow for differences between the imputed values and the real values.
A number of alternative approaches to variance estimation in the presence of imputation are possible. Some approaches treat the single imputed values as given, on the assumption that imputation has been designed for the purposes above, such as minimising non-response bias. These approaches then construct valid variance estimators for the resulting point estimators. See e.g. Rao and Shao (1992), Shao (2002) and Lee et al. (2002). An alternative approach is to design the imputation method in such a way that a simple variance estimator can be constructed. One such approach is multiple imputation (Rubin, 1987).
The basic idea of multiple imputation is to create $m$ imputed values for each missing item. For any parameter $\theta$, analysis of the $m$ completed datasets (and treatment of the imputed values as genuine) will lead to $m$ point estimates $\hat{\theta}_1, \ldots, \hat{\theta}_m$ of $\theta$ as well as $m$ variance estimates $\hat{v}_1, \ldots, \hat{v}_m$. Rubin proposes to combine the point estimates by taking their mean and proposes a variance estimator for this point estimator which is a simple function of the $m$ point estimates and variance estimates. It is clear that this approach will only work for certain kinds of imputation methods; in particular, the imputation method must be stochastic, since otherwise identical datasets and estimates will be produced. However, it is quite easy to modify even non-stochastic single imputation methods such that suitable multiple imputations can be created, e.g. by using the approximate Bayesian bootstrap according to Rubin and Schenker (1986). The basic kind of imputation method required for multiple imputation is what Rubin calls a proper method, which basically means that inference from the multiply imputed data sets is randomisation-valid.
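The combination step can be illustrated with a short sketch. The formulas used are the standard combining rules from the multiple imputation literature (Rubin, 1987) and are stated here as background, not quoted from this section: the pooled estimate is the mean of the $m$ point estimates, and the total variance is $T = W + (1 + 1/m)B$, where $W$ is the mean of the $m$ variance estimates and $B$ is the sample variance of the $m$ point estimates. The function name `pool_mi` is an assumption for the example.

```python
from statistics import mean, variance

def pool_mi(estimates, variances):
    """Pool m point estimates and variance estimates from m completed datasets.

    Returns (pooled estimate, total variance) using
    T = W + (1 + 1/m) * B, with W the within-imputation variance and
    B the between-imputation variance (divisor m - 1).
    """
    m = len(estimates)
    if m < 2 or len(variances) != m:
        raise ValueError("need m >= 2 estimates and one variance per estimate")
    theta_bar = mean(estimates)
    within = mean(variances)
    between = variance(estimates)  # sample variance, divisor m - 1
    total = within + (1.0 + 1.0 / m) * between
    return theta_bar, total

# Three hypothetical completed-data analyses.
theta_bar, total_var = pool_mi([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

With a deterministic imputation method all $m$ estimates coincide, so $B = 0$ and the total variance collapses to the naive within-imputation variance, which is why the method must be stochastic.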
In this document, we shall present single imputation methods first in Chapter 2. These imputation methods cover ones traditionally used in official statistics, with the main objective being to reduce bias from non-response in the resulting point estimates. Multiple imputation methods are then considered in Chapter 3. These methods may be seen to be extensions of some of the basic regression imputation methods considered in Chapter 2. The relative advantages of single and multiple imputation are the subject of some debate. See, for example, Rubin (1996), Meng (1994) and Nielsen (2003).
Chapter 2
Single Imputation
2.1 Introduction
This chapter describes some single imputation methods, that is methods that produce a single imputed value for each missing item. We shall focus on the univariate case, where missing values for a single variable, y, are to be imputed. We begin by outlining the structure of the imputation process and then proceed to describe some imputation methods. This field is so wide that it is possible to outline only a number of standard techniques. We take the broad objective of the imputation process to be the reduction of non-response bias in the point estimator which results from treating the imputed values as real values. As noted in Chapter 1, we do not consider here the issue of variance estimation.
2.2 Outline of the Imputation Process
In this section we outline the series of steps involved in the process of imputation. We assume that editing has already taken place and that the missing values have been identified.
Step (i) Selection of training data set and specification of auxiliary variables
A training data set needs to be selected upon which the imputation method can be based. This will often be a subset of respondents from the same data set with missing values requiring imputation, but it may alternatively be an external analogous data set from the previous period, for example. This training data set should include not only values of y but also values of a vector of auxiliary variables, x. The x variables should be observed for cases where y is to be imputed (Laaksonen, 2002b). For most imputation methods, it is desirable that the auxiliary variables be chosen so that they predict y as well as possible and so that the MAR assumption is plausible, that is that missingness on y is approximately independent of y conditional on the values of x.
Step (ii) Construction of imputation model
Most imputation methods are motivated explicitly or implicitly by a model, referred to here as the imputation model. Two alternative target variables may be used for building an imputation model, either y, the variable being imputed, or the missingness indicator of this variable, denoted R. The model for each particular case may be of any type, i.e. parametric or non-parametric. Moreover, the model may be estimated from the training data, or ‘logically deduced’. The purpose of the modelling is to achieve high predictability. Just one model can be estimated for the data set that requires imputation, or alternatively, several models can be estimated, one for each sub-set. These sub-sets are often referred to as imputation cells or classes. They should be as homogeneous as possible with respect to missingness; in other words, the missingness mechanism should be presumed to be ignorable within such a cell.
Step (iii) Choice of two features of imputation method
There are two particular features of the imputation method which need to be chosen, depending on the final imputation method being applied. First, there is a need to decide on the prediction role of imputation, whether it is appropriate to use a deterministic imputation method which provides ‘best predictions’ of the missing true values or whether it is appropriate to use a stochastic method, where the distribution of the possible imputed values corresponds to the uncertainty about the missing values. Second, it is often desirable to decide on a metric for measuring nearness if the imputation method requires this kind of option. Typically, such nearness metrics are based on a Euclidean distance measure or other model-external solutions, often using auxiliary variables that are not used in building a model. Alternatively, the metrics can be taken from model outputs. An example of this approach is the so-called ‘regression based nearest neighbour’ (RBNN) technique (Laaksonen, 2000, and Laaksonen, 2002b) in which nearness is measured using predicted values of the imputation model.
Step (iv) Choice of imputation method itself.
Finally the imputation method needs to be selected, based upon the outcomes of steps (i)-(iii). We distinguish two broad kinds of imputation method. If the imputed values are derived from a model, either as predicted values or based upon an estimated distribution, we refer to the method as model-based. Alternatively, if the imputed value consists of the actual value of y for a responding unit, possibly selected using an imputation model and a distance metric, we refer to the method as a ‘donor’ method (NB. some authors use the term ‘hot deck’ to denote this general case; we use the term ‘hot deck’ for a special case of a donor method). Note that the donor technique may also be used to find a suitable observed residual (noise term) (see Laaksonen, 2002a).
It is desirable in the imputation process that the imputed values are flagged. If the flagging allows several alternative imputed values to be stored for one missing value, even in the case of single imputation, the user has the flexibility to use in an analysis whichever imputed values are considered best for the purpose concerned.
The general structure of imputations is given in Figure 2.1, broken down into stages 0, A, B and C. We shall refer to this figure when giving rules for implementing a computer program for imputations in the case of the DACSEIS simulation task.
0. Input File
   0a. Manual imputation cells: included via one or more variables in the input file
   0b. Automatically formed imputation cells: using e.g. Classification Tree, Regression Tree and SOM
A. Imputation Model
   - Standard linear (general) regression
   - Logistic regression
   - More complex models
B. Metrics and other specifications
C. Imputation Task
   - Nearest neighbor
   - K-nearest neighbor
   - Predicted directly
   - Predicted with random noise (e.g. normal distribution, or observed)
   - Probability (random)
   - Random draw without replacement
   - Random draw with replacement
New Round
Figure 2.1: Structure of imputations
The input file in stage 0 in the figure is a standard file with missing item values which are to be imputed. There should be a standardised symbol for a missing value, such as a point ‘.’. Alternatively, a user should be able to determine at the beginning of the process which values should be considered as missing values (e.g. ‘-9’) and consequently be imputed.
The specification of imputation cells in stages 0a or 0b of the figure and the specification of imputation model in stage A correspond to step (ii) of the imputation process above. The specification of imputation metrics in stage B corresponds to step (iii). Finally, the imputation tasks in C correspond to step (iv).
2.3 Univariate or Multivariate Missingness Patterns
The input file will in general include some variables with missing values and some variables without missing values. With $k$ variables with missing values, there are $2^k - 1$ possible missingness patterns. These patterns may be determined as follows:
1. create a binary variable for each of the k variables so that 1 = non-missing, and 0 = missing;
2. construct a multidimensional frequency table of these binary variables from the input file;
3. the output file of the previous table gives the missingness patterns;
4. each pattern may then be given a code.
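The determination of missingness patterns by cross-tabulating binary indicators can be sketched as follows. This is an illustrative Python fragment only: the record layout (a list of dictionaries), the missing-value symbol and the coding rule (patterns numbered from most to least complete) are assumptions made for the example.

```python
from collections import Counter

def missingness_patterns(records, variables, missing=None):
    """Tabulate missingness patterns over the given variables.

    Each pattern is a tuple of indicators (1 = non-missing, 0 = missing).
    Returns the frequency table and a code for each observed pattern.
    """
    indicators = [
        tuple(0 if rec[v] == missing else 1 for v in variables)
        for rec in records
    ]
    table = Counter(indicators)            # multidimensional frequency table
    ordered = sorted(table, reverse=True)  # most complete pattern first
    codes = {pattern: code for code, pattern in enumerate(ordered)}
    return table, codes

# Hypothetical input file with two variables subject to missingness.
records = [
    {"y1": 5, "y2": None},
    {"y1": None, "y2": 3},
    {"y1": 1, "y2": 2},
    {"y1": 7, "y2": None},
]
table, codes = missingness_patterns(records, ["y1", "y2"])
```

Here the complete pattern `(1, 1)` is tabulated as well, so that the table also gives the number of complete records available as potential donors.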
As noted earlier, we shall focus on the simplest case when there is just one variable, y, that may be missing (k = 1) so that there is just one missingness pattern, i.e. item non-respondents for which y is missing. With multivariate missingness (k > 1), there will in general be more than one missingness pattern. Imputation then becomes more complex. One option is to impute separately for each missingness pattern drawing on auxiliary information from complete records. Another option is to chain the imputation tasks, imputing for one variable at a time.
DACSEIS specification: Provide the missingness patterns and code these. It seems that
in most cases there will be just one pattern but there may be two for the Swiss data.
2.4 Imputation Model
As noted in Section 2.2, the target variable in the imputation model may either be y, the variable with missing values (assumed again to be univariate), or be R, the missing value indicator for the y variable, i.e. R = 1 if y is missing and R = 0 if it is observed.
Let x1, x2, . . . denote additional continuous variables in the input file which are completely observed for all sample units and z1, z2, . . . additional categorical variables which are also complete. Standard imputation models, as in Cases 1 and 2 below, consist of a regression model, representing the conditional distribution of y given x1, x2, . . . , z1, z2, . . ..
Many imputation methods depend upon a set of imputation cells, formed using the values of x1, x2, . . . , z1, z2, . . . . The simplest way to form imputation cells is by cross-classifying some of the z1, z2, . . . variables. The imputation model may be specified separately within imputation cells or may incorporate imputation cells via covariates.
Case 1: Linear regression model for continuous y variable
In this case, the variable y is continuous or handled as a continuous variable. The x and z variables may have been imputed at an earlier step. A standard model for y given x1, x2, . . . is the linear regression model
$$y = \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_p x_p + \epsilon$$
More generally the variables y and x1, x2, . . . may be transformed first, for example the logarithm transformation might be applied to variables such as income, wages and turnover. Dummy explanatory variables may also be introduced to represent the effect of the z1, z2, . . . variables. The estimation of such a linear model may follow standard techniques
available in standard software. See the SAS specifications in Appendix B. Estimation may take place using sampling weights. Thus, this should be included as an option in the computer codes.
Example of DACSEIS specifications:
The Swiss expenditure data (analogous to German data):
1. For expenditure:
• two options for y: (a) y = expenditure or (b) y = log(expenditure+11)
• two options for x, z: (a) one key variable chosen by the Swiss group, (b) all available auxiliary variables.
2. For income:
• two options for y: (a) y = income or (b) y = log(income+11)
• two options for x, z: (a) one key variable chosen by the Swiss group (e.g. completed income), (b) all available auxiliary variables plus completed income.
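Imputation under the Case 1 model can be sketched as follows, restricted to a single auxiliary variable to keep the example short. This is an illustrative Python fragment, not the code of Appendices B or C: the intercept term, the weighted least squares fit, the normal noise and all function names are assumptions made for the example.

```python
import random

def fit_weighted_line(x, y, w):
    """Weighted least squares fit of y = a + b * x; returns (a, b)."""
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    return ybar - slope * xbar, slope

def impute_regression(x, y, w, noise_sd=0.0, rng=None):
    """Replace missing y (None) by a prediction, optionally with random noise.

    The model is fitted on respondents only, using the sampling weights w.
    Returns the completed values and a flag list marking imputed values.
    """
    rng = rng or random.Random()
    obs = [i for i, yi in enumerate(y) if yi is not None]
    a, b = fit_weighted_line([x[i] for i in obs],
                             [y[i] for i in obs],
                             [w[i] for i in obs])
    completed, flags = [], []
    for xi, yi in zip(x, y):
        if yi is None:
            noise = rng.gauss(0.0, noise_sd) if noise_sd > 0 else 0.0
            completed.append(a + b * xi + noise)
            flags.append(True)
        else:
            completed.append(yi)
            flags.append(False)
    return completed, flags

# Hypothetical data lying exactly on y = 1 + 2x, with one missing value.
completed, flags = impute_regression(
    x=[1, 2, 3, 4], y=[3, 5, None, 9], w=[1, 2, 1, 1])
```

Setting `noise_sd=0` gives the deterministic variant (predicted directly), while a positive value gives a stochastic variant. For a log-transformed target, the fit would be made on the transformed values and the predictions transformed back.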
Case 2: Logistic regression model for binary y
In this case, y is binary and the model is the usual logistic regression (or probit regression) model with x1, x2, . . . , z1, z2, . . . defining the covariates.
Example of DACSEIS specifications:
LFS and similar data (Germany, Austria, Netherlands, Finland):
• one option for binary y so that y = 1 if unemployed and y = 0 otherwise
• two options for x, z: (a) one key variable chosen by each national group (e.g. region), (b) all available auxiliary variables without interactions.
Case 3: Logistic regression model for response indicator R
Here the variable R is the missing value indicator for the y variable.
Example of DACSEIS specifications:
• one option for R (= 0 if non-missing, = 1 if missing)
• two options for x, z: (a) one key variable chosen by each national group (e.g. region), (b) all available auxiliary variables without interactions.
2.5 Distance Metrics
Many imputation methods require the specification of a distance function, measuring how close two units are with respect to the auxiliary variables x1, x2, . . . , z1, z2, . . .. Distance metrics include:
1. Distance metrics not based on models:
• Euclidean distance based on a continuous variable;
• Euclidean distance based on several variables, each subject to user-specified scaling;
• a metric for a categorical variable, e.g. geographical area, defined as 0 if two units share the same value of the variable and 1 if not.
• a metric for categorical variables, defined as the number of variables taking different values for the two units.
2. Distance metrics based on models:
• Euclidean distance between the predicted values of the y variable based upon a regression model for y given x1, x2, . . . , z1, z2, . . ..
DACSEIS specifications:
Two alternatives which are easy to implement, both using LFS types of data (not for expenditures or incomes):
• based on the predicted values of the model
• all units within a certain area are regarded as equally close to each other.
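The metrics listed in this section can be sketched as small functions. This is an illustrative Python fragment: the function names are hypothetical, and dividing each difference by a user-specified scale is only one possible reading of "user-specified scaling".

```python
import math

def scaled_euclidean(a, b, scales):
    """Euclidean distance over several continuous variables,
    each difference divided by a user-specified scale (an assumption)."""
    return math.sqrt(sum(((ai - bi) / s) ** 2
                         for ai, bi, s in zip(a, b, scales)))

def mismatch_count(a, b):
    """Number of categorical variables taking different values for two units.
    With a single variable this is the 0/1 metric (same area or not)."""
    return sum(1 for ai, bi in zip(a, b) if ai != bi)

def nearest_donor(recipient_score, donor_scores):
    """Index of the donor whose model-predicted value is closest to the
    recipient's predicted value (model-based metric, as in RBNN)."""
    return min(range(len(donor_scores)),
               key=lambda j: abs(donor_scores[j] - recipient_score))

d_cont = scaled_euclidean((0.0, 0.0), (3.0, 4.0), (1.0, 1.0))
d_cat = mismatch_count(("north", "urban"), ("north", "rural"))
donor = nearest_donor(0.42, [0.10, 0.40, 0.90])
```

In the model-based case the scores would be the predicted values from the imputation model of Section 2.4, computed for respondents and non-respondents alike.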
2.6 Imputation Methods
An imputation method replaces a missing or deficient value with a fabricated one, the imputed value. We consider two broad types:
A. Donor methods identify, for each missing value, a donor which is another unit in the same database for which the value of this variable is present, and use this value as the imputed value. The unit for which the value is imputed is called the recipient.
B. Model-based methods fit the imputation model in some way using the training data and determine the imputed value using this fitted model. Note that some donor methods also involve fitting a model.
A. Donor Methods
A1. Random draw methods (often called random hot deck methods)
Two options for drawing:
A1a: Random draw with replacement: a donor is randomly chosen from a given set of units with non-missing values. Thus the same donor may be chosen many times.
A1b: Random draw without replacement: as above, but once a donor has been chosen it cannot be chosen again. Note: if the missingness rate is higher than 50%, not all missing values can be imputed.
Two options for the given set:
• 1u: Overall imputation, thus donors are drawn from all cases with non-missing values.
• 1s: Cell-based imputation, donors are drawn from the same imputation cell as the recipient. Selection from each cell is independent of selection from other cells. This may be interpreted as using a distance metric defined by the imputation cell.
Thus, we have the four possible methods:
1a + 1u = 1au 1a + 1s = 1as
1b + 1u = 1bu 1b + 1s = 1bs
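The four variants can be expressed with two switches. The sketch below is a minimal Python illustration, not the SAS specification of Appendix B; the function name and the use of `None` for a missing value are our assumptions.

```python
import random

def random_hot_deck(values, cells=None, replace=True, rng=None):
    """Impute None entries by random donor draws.
    cells=None   -> overall imputation (1u); otherwise cell-based (1s).
    replace=True -> A1a (with replacement); False -> A1b (without)."""
    rng = rng or random.Random()
    cells = cells or [0] * len(values)
    pools = {}
    for v, c in zip(values, cells):
        if v is not None:
            pools.setdefault(c, []).append(v)
    for pool in pools.values():
        rng.shuffle(pool)  # popping from a shuffled pool = draw without replacement
    out = list(values)
    for i, (v, c) in enumerate(zip(values, cells)):
        if v is None:
            pool = pools.get(c, [])
            if not pool:
                continue  # no donor left in this cell: the value stays missing
            out[i] = rng.choice(pool) if replace else pool.pop()
    return out
```

With `replace=False`, a cell containing more recipients than donors leaves some values missing, which is the situation noted under A1b.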
Method 1au can be recommended as a benchmark against which the results of the other methods are compared.
Note that the construction of cells in 1s may involve model-fitting. For example, consider a binary variable y: a logistic regression model may be used either to predict y directly or to create imputation cells. In the latter case, the model is estimated using the respondent data, and predicted values of the probability that y = 1 are determined for both nonrespondents and respondents. There are several ways in which these scores could be grouped to create imputation cells. The two main approaches are:
1. division into equal intervals, e.g. (0%, 10%), [10%, 20%), [20%, 30%), ..., [90%, 100%);
2. division into intervals with equal frequencies (e.g. deciles).
If the data set is small, only a few intervals can be used. Problems may arise if some intervals or imputation cells include too many missing values for y. Both approaches are relatively easy to program. See the SAS specifications in Appendix B.
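As an illustration of the two grouping rules, the following Python sketch assigns a cell index to each predicted probability. The function names and the default of 10 cells are assumptions made for this example only.

```python
def equal_width_cells(scores, n_cells=10):
    """Cell index 0..n_cells-1 from intervals of equal width on [0, 1]."""
    return [min(int(s * n_cells), n_cells - 1) for s in scores]

def equal_frequency_cells(scores, n_cells=10):
    """Cell index from the rank of the score, so that each cell
    contains roughly the same number of units (deciles for n_cells=10)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    cells = [0] * len(scores)
    for rank, i in enumerate(order):
        cells[i] = rank * n_cells // len(scores)
    return cells
```

The resulting cell indices can be passed to any cell-based donor method. Equal-width cells may be nearly empty when the scores are concentrated, which is one reason to prefer equal frequencies for small data sets.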
A2. Nearest Neighbor methods
Possible options for choosing a donor for a given distance metric are:
A2a: The donor is selected to be the unit nearest to the recipient with respect to the distance metric. In general this requires calculating the distances from each recipient to all potential donors. If the metric is based on a single variable, such as the predicted value, then the units should first be sorted by this variable; the nearest donors above and below the recipient are then compared (their distances calculated), and the nearer of the two is chosen.
In SAS, this can be done, for example, by constructing a reasonable number of lagged values in both directions (below and above). This solution works for both continuous and categorical variables, and thus for linear models and logistic models. See the SAS specification in Appendix B.
If there are several units at the same distance, two main options are available: (i) a random selection among them, or (ii) the average value of all of them (this is possible only for continuous variables).
A2b: One donor is chosen at random from the m nearest ones, where the user specifies the value of m. This reduces to option A2a for the case m = 1.
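Options A2a and A2b can be sketched for a metric based on a single score, such as a predicted value. The Python fragment below is an assumption-laden illustration (hypothetical function name, `None` marking a missing y). For brevity it computes all distances for each recipient instead of using the sort-and-lag approach described for SAS, and it resolves ties by random selection (option (i)).

```python
import random

def nearest_neighbour_impute(y, score, m=1, rng=None):
    """Donor imputation: pick one of the m nearest donors on `score`.
    m=1 gives A2a; m>1 gives A2b."""
    rng = rng or random.Random()
    donors = [i for i, v in enumerate(y) if v is not None]
    out = list(y)
    if not donors:
        return out  # nothing can be imputed without donors
    for i, v in enumerate(y):
        if v is None:
            # sort by distance; the random second key breaks ties at random
            ranked = sorted(donors,
                            key=lambda d: (abs(score[d] - score[i]), rng.random()))
            out[i] = y[rng.choice(ranked[:m])]
    return out
```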
B. Model-based Methods
B1. Simple Methods based upon Imputation Cells
a. Mean imputation: the mean of observed values of y within each cell is calculated and this value is assigned to all missing units within that cell. It is possible to use robust methods to estimate the mean, e.g. by trimming outliers first.
b. Median imputation: as in a., but using the median rather than the mean.
c. Mode imputation: for categorical variables, the most common value may be used as an imputed value.
d. Average within a specified percentile range: as in a., but calculated only from cases falling between two percentiles, e.g. p25-p75.
Note that all of these may be weighted or unweighted (i.e. weights = 1).
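Methods a. to c. differ only in the summary statistic applied within each cell. A minimal unweighted Python sketch follows; the function name is hypothetical, and weights and the percentile-range variant d. are deliberately omitted.

```python
import statistics

def cell_impute(values, cells, stat="mean"):
    """Replace None by the cell mean, median or mode of the observed values.
    Assumes every cell with a missing value has at least one observed value."""
    func = {"mean": statistics.mean,
            "median": statistics.median,
            "mode": statistics.mode}[stat]
    observed = {}
    for v, c in zip(values, cells):
        if v is not None:
            observed.setdefault(c, []).append(v)
    return [func(observed[c]) if v is None else v
            for v, c in zip(values, cells)]
```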
B2. Probability-based methods for Categorical Variables
The following method is for categorical or categorised variables and may, as before, be applied within cells.
This method calculates the proportions of observed values falling into each category. These are accumulated within the interval [0, 1] so that each category has its own subinterval whose length equals its probability. For example, categories a, b and c with relative frequencies 0.2, 0.7 and 0.1 may be assigned to the intervals (0, 0.2], (0.2, 0.9] and (0.9, 1.0) respectively. Each imputed value is determined by drawing a random number from the uniform distribution on (0, 1). If the random number lies within the interval (0, 0.2], then the imputed value is a, and correspondingly for the other categories.
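The interval construction of B2 can be written compactly. The following Python sketch is illustrative only; the function name is an assumption, and the method is shown without cells and without weights.

```python
import random
from collections import Counter
from itertools import accumulate

def probability_impute(values, rng=None):
    """Impute None by drawing a category with probability equal to its
    relative frequency among the observed values."""
    rng = rng or random.Random()
    observed = [v for v in values if v is not None]
    counts = Counter(observed)
    categories = sorted(counts)
    # upper bounds of the cumulative intervals, e.g. 0.2, 0.9, 1.0
    upper = list(accumulate(counts[c] / len(observed) for c in categories))
    upper[-1] = 1.0  # guard against floating-point rounding

    def draw():
        u = rng.random()
        return next(c for c, b in zip(categories, upper) if u <= b)

    return [draw() if v is None else v for v in values]
```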
B3. Linear Regression Imputation
The method of regression imputation generates imputed values by fitting a linear regression model
$$y = \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_p x_p + \epsilon$$
to the responding units and then setting imputed values as $\hat{y} = \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \ldots + \hat{\beta}_p x_p$, where $(\hat{\beta}_1, \hat{\beta}_2, \ldots, \hat{\beta}_p)$ is the vector of least squares estimates of the regression coefficients.
A special case is ratio imputation, in which there is a single variable $x$ and no intercept is included in the regression model, so that $\hat{y} = \hat{\beta} x$, where $\hat{\beta}$ is typically determined by the ratio of the mean of $y$ to the mean of $x$ amongst responding cases, perhaps within imputation cells.
B4. Linear Regression Imputation with added noise
This method is the same as B3 except that a noise term is added, so the imputed value is given by $\hat{y} = \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \ldots + \hat{\beta}_p x_p + \hat{\epsilon}$, where $\hat{\epsilon}$ may be determined in different ways.
It is, however, advisable to include an option that rules out very large positive or negative noise terms, because these are often unrealistic in practice (see the earlier discussion on bounds). There is no single agreed way of doing this; one standard approach is to use a truncated distribution (e.g. truncated normal), with the truncation level chosen by the user, e.g. as a number of standard errors (one standard error, or the limits of a 95% confidence interval).
It is also possible to determine $\hat{\epsilon}$ from the data, i.e. from the observed residuals for non-missing units. The simplest option is to choose each noise term at random from these residuals (similarly to methods A1). This is not a fully model-based method, since the last operation uses donor methodology. It shows that an imputation method may also be a mixture of donor-based and model-based methods. There may be many other situations where such a mixture is competitive. We give some examples next.
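Methods B3 and B4 with empirical residuals can be illustrated as follows. This Python sketch makes several simplifying assumptions that are ours and not part of the specification: a single auxiliary variable plus an intercept, unweighted least squares, hypothetical function names, and no bounds on the noise term.

```python
import random

def fit_line(x, y):
    """Least squares intercept and slope for y = b0 + b1 * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    return my - slope * mx, slope

def regression_impute(x, y, add_noise=False, rng=None):
    """B3 if add_noise is False; B4 with a randomly drawn observed
    residual as the noise term if add_noise is True."""
    rng = rng or random.Random()
    obs = [(a, b) for a, b in zip(x, y) if b is not None]
    b0, b1 = fit_line([a for a, _ in obs], [b for _, b in obs])
    residuals = [b - (b0 + b1 * a) for a, b in obs]
    out = []
    for a, b in zip(x, y):
        if b is None:
            b = b0 + b1 * a + (rng.choice(residuals) if add_noise else 0.0)
        out.append(b)
    return out
```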
B5. Regression Imputation for Categorical Variables
Linear regression is inappropriate if y is categorical. In this case, a categorical outcome regression model may be fitted to the respondent data, for example logistic regression when y is binary. The model will imply a certain set of probabilities, $P_k$, that y will fall into the different categories k. The estimated values $\hat{P}_k$ of the $P_k$ may then be determined for each case with a missing value of y, and these values used to determine imputed values, as in method B2 above.
B6. Two-step methods for Mixed Binary-Continuous Variables
An imputation method may involve several consecutive steps. A common case is the following two-step method, which may be implemented fairly easily. The variable y is mixed binary-continuous, for example earnings, where a proportion of the population have no earnings and the remainder of the population have positive earnings. The binary variable is denoted z (in our example, whether or not the person has a job), so that y = 0 if z = 0 and y > 0 if z = 1. The two steps are:
Step 1 Construct a logistic regression model for z with the same explanatory variables as before. Then for each unit with y missing, use the B5 method to impute z.
Step 2 Impute y using one of the methods B3 or B4 for cases with z = 1. Impute y = 0 if z = 0.
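The control flow of the two steps can be sketched as below. To keep the Python fragment short we assume that both models have already been fitted, so that the estimated probabilities of z = 1 and the regression predictions for positive y are supplied as inputs; the function and argument names are hypothetical.

```python
import random

def two_step_impute(y, prob_positive, predicted_positive, rng=None):
    """y: observed values or None. prob_positive: estimated P(z = 1) per unit
    (e.g. from a logistic model). predicted_positive: imputed value to use
    if z = 1 (e.g. from method B3 or B4)."""
    rng = rng or random.Random()
    out = []
    for yi, p, pred in zip(y, prob_positive, predicted_positive):
        if yi is None:
            z = 1 if rng.random() < p else 0   # step 1: impute z as in B5
            yi = pred if z == 1 else 0.0       # step 2: impute y given z
        out.append(yi)
    return out
```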
DACSEIS specifications:
Swiss data (and possibly the German EVS):
We recommend linear regression imputation (B3) for this case. Since there are two model specifications and two different variable selections, we will have the following imputation specifications if possible; however, for practical reasons we cannot perform the first three specifications for the Swiss data:
(i). linear regression imputation for y = expenditure with one auxiliary variable chosen by a country group,
(ii). linear regression imputation for y = expenditure with all available auxiliary variables,
(iii). linear regression imputation for y = log-transformed expenditure with all available auxiliary variables. Note that after imputation, the values must be transformed back with the exponential function,
(iv). linear regression imputation for y = income with all available auxiliary variables including (imputed) expenditure (log-transformed) based on specification (ii) (other specifications may be used). No intercept term. The model should be constructed both directly and including the robustness specifications as proposed by the Swiss group (see Appendix A).
(v). If a good imputation cell structure is available, we may also test ratio imputation within cells for income using the variable expenditure as auxiliary variable. This specification could be as follows (within each cell c):
• calculate the ratio q_c = sum(income)/sum(expenditure) using respondents' data
• for each missing income calculate income(imputed) = q_c * expenditure(known)
• Later, if time permits, we may specify a corresponding method with a noise term (bounds are needed if theoretical residuals are used; for empirical residuals we propose to exploit nearest neighbour methods).
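The ratio specification within cells amounts to the following. This is an unweighted Python sketch with hypothetical names; it assumes that expenditure is known for every unit and that each cell with a missing income contains at least one respondent.

```python
def ratio_impute(income, expenditure, cells):
    """Impute missing income (None) as q_c * expenditure, where q_c is the
    ratio of summed income to summed expenditure among respondents in cell c."""
    sums = {}
    for inc, exp, c in zip(income, expenditure, cells):
        if inc is not None:
            s = sums.setdefault(c, [0.0, 0.0])
            s[0] += inc
            s[1] += exp
    out = []
    for inc, exp, c in zip(income, expenditure, cells):
        if inc is None:
            q_c = sums[c][0] / sums[c][1]
            inc = q_c * exp
        out.append(inc)
    return out
```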
LFS type of data
We propose the following 4 methods:
(i). Use the variable region (or another subgroup with a reasonable number of categories) as the imputation cell, and draw a random donor without replacement (method A1b).
(ii). Build a logistic imputation model for the target variable, coded as y = 1 if unemployed and y = 0 otherwise (see the model section above), and include all available auxiliary variables in this model as explanatory variables. Then choose a donor using the predicted values of this model based on a nearest neighbour technique, as in method A2 above (incl. SAS codes).
(iii). Build a logistic imputation model for the missingness indicator, R, for this binary variable y and choose a donor using the predicted values of this model based on a nearest neighbour technique, as in method A2 above (incl. SAS codes).
(iv). Build a logistic imputation model for R, as in (iii), and choose a donor at random without replacement within 10 imputation cells, constructed as explained above in Case 2 of the imputation model section.
Note that the option of using weights (sampling weights) should be allowed in the modelling task.
Chapter 3
Multiple Imputation
3.1 Introduction
The theory and principles of multiple imputation (MI) are described extensively by Rubin (1987), although that book is a demanding read. An excellent and comprehensive treatment of data augmentation and multiple imputation is provided by Schafer (1997). Introductions to MI are given by Schafer (1999a), Little and Rubin (2002), and, for the DACSEIS project, by Rässler (2004).
The basic approach is to undertake imputation m times to create m imputed datasets. Standard complete-data analysis performed on each of the m imputed datasets leads to m estimates $\hat{\theta}_1, \ldots, \hat{\theta}_m$ of any given parameter $\theta$. The resulting point estimate of $\theta$ is taken to be the mean of these estimates. Suitable R code for pooling estimates is given in Appendix C. For the purpose of variance estimation it is necessary that the imputation method fulfils certain conditions, referred to as proper imputation by Rubin (1987), and that the complete-data estimates are asymptotically normal (as maximum likelihood estimates are) or t distributed. The basic idea of a proper imputation method is that it should reflect the uncertainty about the missing value correctly. From a frequentist perspective the requirement is that randomization-valid inferences can be drawn from the multiply imputed data sets. For a discussion of the proper property see Brand (1999) or Rässler (2004). Provided imputation fulfils these criteria, a simple variance estimator can be constructed following the "MI paradigm" (Rubin, 1987). A common choice of m is m = 5. Empirical evidence from the DACSEIS simulations suggests that it is better for variance estimation of Horvitz-Thompson type estimates to use m = 15 or, even better, m = 30. In principle, a higher number of imputations leads to better efficiency and coverage of the MI estimate, but usually smaller values of m are sufficient, see Rubin (1987), pp. 114-115.
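The pooling step can be illustrated with the standard combining rules of Rubin (1987): the point estimate is the mean of the m estimates, and the total variance is the mean within-imputation variance plus (1 + 1/m) times the between-imputation variance. The Python sketch below uses a hypothetical function name and is not the R code of Appendix C; degrees of freedom for the t reference distribution are omitted.

```python
import statistics

def pool_estimates(estimates, variances):
    """Combine m complete-data estimates and their variance estimates."""
    m = len(estimates)
    q_bar = statistics.mean(estimates)          # MI point estimate
    within = statistics.mean(variances)         # average within-imputation variance
    between = statistics.variance(estimates)    # between-imputation variance, divisor m - 1
    total = within + (1 + 1 / m) * between
    return q_bar, total
```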
This chapter describes some multiple imputation procedures proposed for DACSEIS which, in principle, are proper MI routines. The problem of multivariate missing data is considered first, including the use of iterative methods, such as regression switching, which permit multivariate methods to be constructed from univariate methods. Some of these univariate methods are then described with particular reference to the DACSEIS applications. These methods may be viewed as extensions of the regression imputation methods
considered in the previous chapter, which allow for additional uncertainty arising from the estimation of model parameters.
3.2 MI Methods for Multivariate Missing Data
3.2.1 Iterative Univariate Methods
One approach to handling general multivariate missing data patterns is based on the assumption that the variables follow a multivariate normal model; for example, Schafer (1999b) implements procedures under this assumption in the software NORM. This approach has become quite popular for multiple imputation in multivariate settings. However, assuming a multivariate normal distribution for categorical variables with missing values is often not regarded as a good choice. More recently, Rubin (2003) suggested iterative univariate multiple imputation procedures for large-scale data sets. They have been used successfully for multiple imputation in the U.S. National Medical Expenditure Survey (NMES), where the data set to be imputed consists of up to 240 variables of different scales and 22,000 observations. Such routines have been used quite efficiently in the context of "mass imputation", i.e. imputing a large amount of data that are typically missing by design. This is the situation in the so-called data fusion case and in split questionnaire survey designs, see, e.g., Rässler (2002). For the pseudo-universes and the simulation study to be performed within the DACSEIS project, we therefore suggest such multiple imputation routines as state-of-the-art. The advantages and disadvantages of this approach are described here, and the necessary pseudocode is provided in S-PLUS/R.
It is said that iterative univariate imputations were first implemented by Kennickell (1991) and Kennickell (1994); see Schafer and Olsen (1999).¹ The intuitively appealing idea behind the iterative univariate imputation procedure is to overcome the problem of suitably proposing and fitting a multivariate model for mixtures of categorical and continuous data by reducing the multivariate imputation task to conventional regression models that are completed iteratively. In many surveys it may be difficult to propose a sensible joint distribution for all variables of interest. On the other hand, a variety of procedures is available for regression modelling of continuous and categorical univariate response variables, such as ordered or unordered logit/probit models (see Greene, 2000). Thus any plausible regression model $Y \mid X = x, \Theta = \theta$ may be specified for predicting each univariate variable $Y_{mis}$ that has to be imputed given all the other variables. This approach is also known as regression switching, chained equations, or variable-by-variable Gibbs sampling; see Van Buuren and Oudshoorn (1999). In the variable-by-variable Gibbs sampling approach it is also possible to include only relevant predictor variables, thus reducing the number of parameters.
¹ Ready to use and available for free via the Internet is the software MICE, a recent implementation of some iterative univariate imputation methods in S-PLUS as well as R; see Van Buuren and Oudshoorn (2000). Moreover, there is the free SAS-callable application IVEware, which also provides iterative univariate imputation methods.
3.2.2 Gibbs Sampling
Gibbs sampling is a Monte Carlo technique for simulating draws from a multivariate density function by repeatedly drawing from its conditional density functions. It is especially useful when the joint distribution is not easily simulated but the conditional distributions are. Thus, even in a high-dimensional problem, all of the simulations may be univariate, which is usually an advantage. For an introduction to the way the Gibbs sampler works, the interested reader is referred to Casella and George (1992). We briefly describe the main principle here.
Suppose that a p-dimensional random variable $U$ (here $U$ may denote the data as well as the parameters) is partitioned into non-overlapping subvectors $(U_1, U_2, \ldots, U_k)$, $k \le p$, containing all components of $U$. Let $f_U$ denote the joint distribution of $U$, which is also the distribution of interest. Starting with an initial value $U^{(0)}$ of $U$, the Gibbs sampling algorithm generates a sequence of values $U^{(0)}, U^{(1)}, U^{(2)}, \ldots$ where in iteration $t$, $t \ge 1$, $U^{(t)}$ is generated from $U^{(t-1)}$ by iteratively drawing from the conditional distribution of each subvector given all the others. The value of $U^{(t)} = (U_1^{(t)}, U_2^{(t)}, \ldots, U_k^{(t)})$ is obtained by successively drawing from the distributions
$$
\begin{aligned}
U_1^{(t)} \mid u_2^{(t-1)}, u_3^{(t-1)}, \ldots, u_k^{(t-1)} &\sim f_{U_1 \mid U_2, U_3, \ldots, U_k}\bigl(u_1 \mid u_2^{(t-1)}, u_3^{(t-1)}, \ldots, u_k^{(t-1)}\bigr) \\
U_2^{(t)} \mid u_1^{(t)}, u_3^{(t-1)}, \ldots, u_k^{(t-1)} &\sim f_{U_2 \mid U_1, U_3, \ldots, U_k}\bigl(u_2 \mid u_1^{(t)}, u_3^{(t-1)}, \ldots, u_k^{(t-1)}\bigr) \\
&\;\;\vdots \\
U_k^{(t)} \mid u_1^{(t)}, u_2^{(t)}, \ldots, u_{k-1}^{(t)} &\sim f_{U_k \mid U_1, U_2, \ldots, U_{k-1}}\bigl(u_k \mid u_1^{(t)}, u_2^{(t)}, \ldots, u_{k-1}^{(t)}\bigr)
\end{aligned}
\tag{3.1}
$$
For each $U_j$ the value of $U_j^{(t)}$ is generated conditionally on the most recently drawn values of all other variables. According to Markov chain theory² the distribution of $U$ converges to the desired distribution $f_U$ under mild regularity conditions, i.e. the sequence $\{U^{(t)} : t = 0, 1, 2, \ldots\}$ has a stationary distribution equal to $f_U$.
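A small worked case may help to fix ideas. The Python sketch below applies the scheme (3.1) with k = 2 to a standard bivariate normal distribution with correlation rho, for which each full conditional is the univariate normal N(rho * u, 1 - rho^2). The target distribution, the function name and the burn-in length are our own choices for illustration and are not part of the DACSEIS specification.

```python
import math
import random

def gibbs_bivariate_normal(rho, n_iter=5000, burn_in=500, rng=None):
    """Gibbs sampler for (U1, U2) standard bivariate normal with correlation rho."""
    rng = rng or random.Random()
    sd = math.sqrt(1 - rho ** 2)       # conditional standard deviation
    u1, u2 = 0.0, 0.0                  # arbitrary starting value U(0)
    draws = []
    for t in range(n_iter):
        u1 = rng.gauss(rho * u2, sd)   # U1(t) | u2(t-1)
        u2 = rng.gauss(rho * u1, sd)   # U2(t) | u1(t)
        if t >= burn_in:
            draws.append((u1, u2))
    return draws
```

After the burn-in period, the empirical correlation of the retained draws should be close to rho, although successive draws are themselves correlated.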
3.2.3 Regression-switching
To illustrate the principle of regression-switching, let us assume the simple case of 3 variables A, B and C, each with missing data. Then Rubin (2003) proposes:
• Begin by arbitrarily filling in all missing B and C values.
• Then, fit a model of A|B, C using those units where A is observed, and impute the missing A values.
• Next, discard the imputed B values, fit a model of B|A, C using those units where B is observed, and impute the missing B values.
• Next, discard the imputed C values, fit a model of C|A, B using those units where C is observed, and impute the missing C values.
• Iterate.
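The loop structure of this proposal can be sketched as follows. To stay short, the Python fragment uses only two variables A and B instead of three, a simple linear regression with one predictor as each conditional model, and randomly drawn observed residuals as noise. All of these, like the function names, are assumptions made for illustration; in particular, the regression parameters are not redrawn, so the sketch shows the switching scheme only and is not a proper MI routine.

```python
import random

def fit_line(x, y):
    """Least squares intercept and slope for y = b0 + b1 * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    return my - slope * mx, slope

def draw_regression(target, predictor, observed, rng):
    """Fit target ~ predictor on units where target is observed and
    replace all other target values by prediction plus a drawn residual."""
    xs = [p for p, o in zip(predictor, observed) if o]
    ys = [t for t, o in zip(target, observed) if o]
    b0, b1 = fit_line(xs, ys)
    resid = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
    return [t if o else b0 + b1 * p + rng.choice(resid)
            for t, p, o in zip(target, predictor, observed)]

def regression_switching(a, b, n_iter=10, rng=None):
    rng = rng or random.Random()
    a_obs = [v is not None for v in a]
    b_obs = [v is not None for v in b]
    # arbitrary start: fill the missing B values with the observed mean
    b_mean = sum(v for v in b if v is not None) / sum(b_obs)
    b_cur = [v if o else b_mean for v, o in zip(b, b_obs)]
    a_cur = list(a)
    for _ in range(n_iter):
        a_cur = draw_regression(a_cur, b_cur, a_obs, rng)  # model A | B
        b_cur = draw_regression(b_cur, a_cur, b_obs, rng)  # model B | A
    return a_cur, b_cur
```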
2For an excellent and very extensive description of Monte Carlo methods in general and Markov chain
This procedure allows great flexibility because each conditional model can be specified separately, and each specification is simply a univariate regression. It has to be mentioned that there are some theoretical shortcomings, because it is possible to generate incompatible distributions via implicit contradictions in the conditional specifications. The practical implications of this phenomenon in iterative univariate imputation are still largely unknown, see Schafer and Olsen (1999). A "real" Gibbs sampler starts with an existing but intractable joint distribution for the variables of interest and iteratively generates random variables from full conditional distributions that are derived from this joint distribution and are easier to handle. In the context of iterative univariate imputations the conditional distributions are specified in the hope that they will define a suitable joint model. However, even if there is no such joint distribution for the data, the Markov chain Monte Carlo (MCMC) method can be implemented, and each conditional specification may be a good empirical fit to the data, see Rubin (2003), Van Buuren and Oudshoorn (2000), and Brand (1999).
3.3 MI Methods for Univariate Missing Data
In this section we consider MI methods for univariate missing data. These might be combined for multivariate missing data following the methods in the previous section.
3.3.1 Continuous Variables: HBS Type of Data
To impute missing data for a continuous variable y, such as income or expenditure in the Household Budget Surveys (HBS), we propose to extend the linear regression imputation approach of Section 2.6. In contrast to Chapter 2, we shall adopt a Bayesian framework, using prior distributions, which will be the usual uninformative or flat priors. The continuous variable (income or expenditure data, whichever has data missing) may be transformed by its logarithm before performing imputation. Notice that after imputation the values have to be transformed back. To ensure that only values within a certain range are imputed, upper and lower bounds can also be given. After performing the final imputation step for the missing y values, each row of the imputed data set is examined to see whether any of the imputed values is out of range. In such cases these values are redrawn until the constraints are satisfied. According to Schafer (1997), p. 204, this procedure leads to approximately proper multiple imputations under a truncated normal model.
The basic algorithm is as follows.
• As in Section 2.4, assume the underlying linear regression model $y = \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_p x_p + \epsilon = X\beta + \epsilon$, $\epsilon \sim N(0, \sigma^2)$.
• Assume that y has $n_{mis}$ missing data and that the variables X are fully observed or already imputed. $y_{obs}$ and $X_{obs}$ refer to the jointly observed part, $X_{mis}$ to the missing part $y_{mis}$.
• Let $\hat{\beta}$ and $\hat{\sigma}^2 = (y_{obs} - X_{obs}\hat{\beta})'(y_{obs} - X_{obs}\hat{\beta})/(n_{obs} - p)$ be the least squares estimates from the observed data.
• Multiple imputation procedure for j = 1, 2, ..., m:
1. Draw $(\sigma^2 \mid X) \sim (y_{obs} - X_{obs}\hat{\beta})'(y_{obs} - X_{obs}\hat{\beta})\,\chi^{-2}_{n_{obs}-p}$.
2. Draw a vector of p variables from $(\beta \mid \sigma^2, X) \sim N(\hat{\beta}, \sigma^2 (X_{obs}'X_{obs})^{-1})$.
3. Draw $(Y_{mis} \mid \beta, \sigma^2, X) \sim N(X_{mis}\beta, \sigma^2)$ independently for every missing value $i = 1, 2, \ldots, n_{mis}$.
For the independent variables X, all available auxiliary variables from the universes may be taken. If both income and expenditure have missing values, then regression switching can be applied.
The resulting approach is very similar to the linear regression imputation plus added noise in the previous chapter, except that the additional uncertainty about the parameters β and σ2 is also allowed for.
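The three draws can be illustrated for the simplest special case p = 1, i.e. a single regressor without intercept, where $X_{obs}'X_{obs}$ reduces to the sum of squares of x. The Python sketch below is written under that assumption, with hypothetical names, no transformation and no bounds; the general case requires matrix operations and is covered by the S-PLUS/R pseudocode. The chi-square draw uses the fact that a chi-square variable with k degrees of freedom is a gamma variable with shape k/2 and scale 2.

```python
import math
import random

def bayes_regression_impute(x, y, m=5, rng=None):
    """Return m imputed copies of y (None = missing) under y = beta * x + eps."""
    rng = rng or random.Random()
    obs = [(a, b) for a, b in zip(x, y) if b is not None]
    n_obs, p = len(obs), 1
    sxx = sum(a * a for a, _ in obs)                    # X'X for p = 1
    beta_hat = sum(a * b for a, b in obs) / sxx         # least squares estimate
    rss = sum((b - beta_hat * a) ** 2 for a, b in obs)  # residual sum of squares
    datasets = []
    for _ in range(m):
        # 1. sigma^2 = RSS / chi-square(n_obs - p)
        sigma2 = rss / rng.gammavariate((n_obs - p) / 2, 2.0)
        # 2. beta ~ N(beta_hat, sigma^2 / sum of x^2)
        beta = rng.gauss(beta_hat, math.sqrt(sigma2 / sxx))
        # 3. y_mis ~ N(x * beta, sigma^2), independently for each missing value
        sd = math.sqrt(sigma2)
        datasets.append([rng.gauss(beta * a, sd) if b is None else b
                         for a, b in zip(x, y)])
    return datasets
```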
3.3.2 Binary Variables: LFS Type of Data
For the target binary variable with missing values we base the imputations on a logistic regression model. For the Labour Force Surveys (LFS), the employment variable has to be either recoded to zero and one, e.g. to 1 if y = unemployed and to 0 otherwise, or split into dummy variables, e.g. employment (yes/no) and unemployment (yes/no), leaving the third category for those not in the labour force or simply the rest. Region should also be recoded to dummy variables. The basic algorithm we propose for DACSEIS is then as follows:
• Assume the underlying data model of a logistic regression $\ln\frac{\theta}{1-\theta} = \beta_0 + \beta_1 x_1 + \ldots + \beta_p x_p = X\beta$, $\theta = P(Y = 1 \mid X)$.
• Assume that y has $n_{mis}$ missing data and that the variables X are fully observed or already imputed. $y_{obs}$ and $X_{obs}$ refer to the jointly observed part, $X_{mis}$ to the missing part $y_{mis}$.
• Let $\hat{\beta}$ be the iterative least squares estimates from the observed data (or any approximate ML estimate) and $\hat{V}(\hat{\beta})$ its estimated covariance matrix (e.g. from the inverse Fisher information matrix $I(\hat{\beta})^{-1}$).
• Apply the large sample normal approximation for j = 1, 2, ..., m:
1. Draw a vector of p variables from $(\beta \mid X) \sim N(\hat{\beta}, \hat{V}(\hat{\beta}))$.
2. For every $i \in mis$ calculate $\theta_i = 1/(1 + \exp(-X_i'\beta))$.
3. Draw $n_{mis}$ independent uniform (0, 1) random numbers $u_i$ for $i = 1, 2, \ldots, n_{mis}$, and impute $y_i = 1$ if $u_i \le \theta_i$ and $y_i = 0$ otherwise.
For the independent variables X again all available information may be taken.
If a variable has more than two categories it is typically recoded into dummy variables. If more than one of these dummy variables has missing data, we may first impute the most populous category versus the rest. If zero is imputed, we then impute the next category versus the rest, and so on. If one is imputed, all remaining categories are set to zero.
This approach is again similar to the regression imputation methods for binary variables described in the previous chapter, except that allowance is made for uncertainty about β.
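For a single covariate plus intercept, the algorithm of this section can be written without matrix libraries. The following Python sketch rests on several assumptions of ours: hypothetical function names, a Newton-Raphson fit with a fixed number of iterations, data that are not perfectly separated, and the rule that a missing value is set to 1 when the uniform draw does not exceed the drawn probability. The covariance matrix is taken from the inverse Fisher information at the last iterate before the final update, which is practically identical once the fit has converged.

```python
import math
import random

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def fit_logistic(x, y, n_iter=25):
    """ML fit of logit P(y = 1) = b0 + b1 * x by Newton-Raphson.
    Returns (b0, b1) and the covariance entries (v00, v01, v11)."""
    b0 = b1 = 0.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = sigmoid(b0 + b1 * xi)
            w = p * (1 - p)
            g0 += yi - p              # score vector
            g1 += (yi - p) * xi
            h00 += w                  # Fisher information
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        v00, v01, v11 = h11 / det, -h01 / det, h00 / det
        b0 += v00 * g0 + v01 * g1
        b1 += v01 * g0 + v11 * g1
    return (b0, b1), (v00, v01, v11)

def logistic_mi(x, y, m=5, rng=None):
    """Return m imputed copies of the binary variable y (None = missing)."""
    rng = rng or random.Random()
    obs = [(a, b) for a, b in zip(x, y) if b is not None]
    (b0, b1), (v00, v01, v11) = fit_logistic([a for a, _ in obs],
                                             [b for _, b in obs])
    # Cholesky factor of the 2x2 covariance matrix
    l00 = math.sqrt(v00)
    l10 = v01 / l00
    l11 = math.sqrt(v11 - l10 * l10)
    datasets = []
    for _ in range(m):
        z0, z1 = rng.gauss(0, 1), rng.gauss(0, 1)
        d0 = b0 + l00 * z0                  # step 1: draw beta
        d1 = b1 + l10 * z0 + l11 * z1
        datasets.append(
            [(1 if rng.random() <= sigmoid(d0 + d1 * a) else 0)  # steps 2 and 3
             if b is None else b
             for a, b in zip(x, y)])
    return datasets
```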
3.3.3 Semicontinuous Variables
Semicontinuous variables were called mixed binary-continuous variables in the previous chapter. They take the value zero with positive probability and otherwise take a positive continuously distributed value. For example, such variables may occur in an LFS when respondents are asked the number of months they have been unemployed. These variables will have a large number of zeros (the employed) and a continuous part (unemployment time).
For imputation we first impute '0' versus '+' using the logistic regression; then, if a '+' is imputed, the linear regression is used for imputing the positive value. This follows an approach of Schafer (1997), p. 381, published in detail by Schafer and Olsen (1999): one may encode each semicontinuous variable U as a binary indicator W (with W = 1 if U ≠ 0 and W = 0 if U = 0) and a continuous variable V which is treated as missing whenever U = 0; for an illustration, see Figure 3.1.
Unit no.   1    2    3    4    5    ...  n-1  n
U          12   NA   0    0    NA   ...  3    0

Unit no.   1    2    3    4    5    ...  n-1  n
W          1    NA   0    0    NA   ...  1    0
V          12   NA   NA   NA   NA   ...  3    NA
Figure 3.1: Encoding semicontinuous variables
Notice that a relationship between W and V would have little meaning and could not be estimated from the observed data. However, we aim at generating plausible imputations for the original semicontinuous variable U and, thus, are only interested in the marginal distribution for W and the conditional distribution for V given W = 1. MCMC procedures have been shown to behave well in this context with respect to the parameters of interest, see Schafer and Olsen (1999).
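The encoding of Figure 3.1 and its inverse can be written as follows. This is a small Python sketch with hypothetical function names, in which `None` plays the role of NA.

```python
def encode_semicontinuous(u):
    """Split U into the indicator W and the continuous part V.
    V is treated as missing whenever U is zero or missing."""
    w = [None if v is None else (1 if v != 0 else 0) for v in u]
    v_cont = [v if (v is not None and v != 0) else None for v in u]
    return w, v_cont

def decode_semicontinuous(w, v_cont):
    """Rebuild U from completed W and V: zero if W = 0, otherwise V."""
    return [0 if wi == 0 else vi for wi, vi in zip(w, v_cont)]
```

After W has been imputed by the logistic model and V by the linear model for units with W = 1, `decode_semicontinuous` returns the completed variable U.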
3.4 Typical Problems that May Occur with the Chained Equation MI
3.4.1 Incorporating the Sampling Design
In the multiple imputation model, stratification can be incorporated by including strata indicators as covariates. Clustering may be incorporated by multilevel models that include random cluster effects, see Little and Rubin (2002), p. 90, or Schafer and Yucel (2002). Alternatively, these effects can be accounted for by design-based complete-data inference.
3.4.2 Collinearity
It should be noted that the problem of collinearity may occur when the coefficients of the linear regression model are to be estimated from the observed data. If some covariates in X show very little variability, it may happen that the part $X_{obs}$, which belongs to the observed part of Y, has one or more variables with constant values. In these cases the algorithm given here should be extended by a plausibility check that excludes such variables from the regression model.
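A plausibility check of this kind could look as follows. The Python sketch, with a hypothetical function name and covariates held in a dictionary of columns, only detects covariates that are constant among the units with observed y; it does not detect more general linear dependencies between covariates.

```python
def usable_covariates(columns, y):
    """columns: dict mapping covariate name to a list of values.
    Returns the names of covariates that take more than one value
    among the units for which y is observed (not None)."""
    keep = []
    for name, col in columns.items():
        observed_part = {c for c, yi in zip(col, y) if yi is not None}
        if len(observed_part) > 1:
            keep.append(name)
    return keep
```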
3.4.3 Rounding Off
For discrete valued data, e.g. annual income in multiples of 1,000 Euros, we propose to proceed with the linear regression model and then
1. either round off the imputed variable to equal one of the observable values in the data set or
2. use a "predictive mean matching" approach as discussed by Rubin (1986) and Little (1988).
The latter has the great advantage that only values which have actually been observed can be imputed, and the imputation is more robust against misspecification of the linear model. It has been used quite successfully for single imputation, see Rässler et al. (2002).
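Both options can be sketched in a few lines of Python. The function names are hypothetical, and the predicted means are assumed to come from a linear regression model that has already been fitted; for predictive mean matching the sketch simply takes the single closest donor.

```python
def round_to_observed(imputed, observed_values):
    """Option 1: replace an imputed value by the nearest observed value."""
    return min(observed_values, key=lambda v: abs(v - imputed))

def predictive_mean_match(pred_missing, pred_observed, y_observed):
    """Option 2: find the respondent whose predicted mean is closest to
    the recipient's predicted mean and impute that respondent's observed y."""
    j = min(range(len(pred_observed)),
            key=lambda k: abs(pred_observed[k] - pred_missing))
    return y_observed[j]
```

For instance, with observed incomes of 20,000, 23,000 and 24,000 Euros, a regression draw of 23,400 would be rounded to 23,000 under option 1.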
3.4.4 Monotone Missingness Versus Incompatibility
The regression-switching approach has the theoretical limitation of possibly generating incompatible distributions via implicit contradictions in the conditional specifications. This may be of importance if the rate of missingness is high and there are many different conditional specifications. If the missingness has a monotone pattern (for an illustration see Figure 3.2), then we may impute the variables from left to right, always regressing the variable to be imputed on all variables to its left. This procedure is continued until all missing values have been imputed.
[Figure: Variable 1 to Variable 5 shown as columns, each divided into an observed and a missing block]
Figure 3.2: Monotone pattern of missingness
Monotone patterns of missingness have the advantage that they can be handled noniteratively, because each imputation model being fitted conditions only on the variables to the left. The resulting univariate models are automatically distributionally compatible. When the missing data are not monotone but "approximately" monotone, Rubin (2003) proposes to first impute those values which disturb the monotone missingness pattern by regressing on all other variables (thus possibly specifying an incompatible Gibbs sampler) and then to impute the monotone missing values from left to right.
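Whether a data set has a monotone pattern for a given ordering of the variables can be checked mechanically: in every row, once a variable is missing, all variables to its right must be missing as well. The Python sketch below uses hypothetical function names and `None` for a missing value; the second function proposes the ordering by increasing number of missing values, which is the natural candidate for a monotone arrangement.

```python
def is_monotone(rows):
    """rows: list of records (lists), variables ordered from left to right."""
    for row in rows:
        seen_missing = False
        for v in row:
            if v is None:
                seen_missing = True
            elif seen_missing:
                return False  # observed value to the right of a missing one
    return True

def order_by_missingness(rows):
    """Column indices sorted by increasing number of missing values."""
    n_cols = len(rows[0])
    n_missing = [sum(1 for row in rows if row[j] is None) for j in range(n_cols)]
    return sorted(range(n_cols), key=lambda j: n_missing[j])
```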
3.5 Conclusions
In general, when a model is used as a device for imputation, the meaning or interpretation of its parameters is not essential. The utility of the model lies in its ability to predict and simulate missing observations. A sensible imputation method for complex surveys should preserve basic relationships among variables. We believe that our proposed models are capable of preserving these relationships. With multivariate longitudinal or clustered data, correlations among observations from the same subject or cluster should also be preserved. In that case more complex imputation models may be used, see Schafer and Yucel (2002). In practice, inference by multiple imputation is quite robust to departures from the imputation model because that model applies not to the entire data set but only to its missing parts. For example, simulations by Schafer (1997) have shown that rounding off imputed values to the nearest category may only lead to minor biases. This also holds for departures from the missing at random (MAR) assumption. Empirical evidence suggests that multiple imputation under MAR is often quite robust against violations of this assumption. Even an erroneous assumption of MAR may have only a minor impact on estimates and standard errors computed using multiple imputation strategies. Only when data not missing at random (NMAR) are a serious concern and the fraction of missing information is substantial does it seem necessary to model the data and the missingness jointly. Moreover, because the missing values cannot be observed, there is no direct evidence in the data to address the NMAR assumption. Therefore, it can be more helpful to consider several alternative models and to explore the sensitivity of the resulting inferences.
Hence, we conclude that the regression-switching approach seems to be quite promising for large data sets and also for high amounts of missing values. Even in the context of "mass imputation", i.e. split questionnaire survey designs and data fusion, we find good frequentist properties. In the U.S. the regression-switching multiple imputation approach is basically applied in the NHANES (a split project) and NMES. The basic routines are already implemented in MICE (S-PLUS and R versions) and IVEware, Raghunathan's SAS-callable application. We have provided some pseudocode especially for the DACSEIS universes.
Let us finish with a quote from the recent Little and Rubin (2002) book, p. 90: “The right way to assess the relative merits of the methods, from a frequentist perspective, is through comparisons of their repeated-sampling operating characteristics in real settings, not their theoretical etiologies.”
Appendix A
Imputation of Missing Income Data in the Swiss Household Budget Survey
The Swiss household budget survey 1998 (HBS98) was conducted in 12 monthly waves on a stratified simple random sample of private households in Switzerland, with 9,295 fully participating households. During one month, each participating household reported all its expenditures and the income of its members. Missing data were detected from supplementary information, such as the economic activity status given by the participating persons themselves (implying a certain type of income), or simply from a ‘don’t want to give’ declaration by the persons concerned. This led to a higher percentage of detected item non-response in the INCOME variable than in the EXPENDITURE variable. In general, item non-response is not very frequent in the HBS98 data, since households that did not deliver a complete report were considered non-participating and were therefore counted as unit non-response.
It was found that the INCOME and EXPENDITURE variables are quite strongly correlated. Therefore, an imputation model in which total household income is explained by total household expenditure was used to impute missing INCOME data. Because the observed distribution has a long tail towards higher incomes, a LOG10 transformation was applied before the regression. Additionally, a more robust L1 (or LAD) regression was used as a first step in order to identify outliers. After discarding those outliers, a classic L2 (or LS) regression was applied to the remaining points in order to obtain the coefficients for imputation. These regressions were done separately for each socio-economic group (STASOCIO). The INCOME value was imputed only if the calculated value was higher than the original value (this concerns households with partially missing income data). After imputation, the discarded outliers were joined back to the data for the estimation procedure.
The imputation procedure implemented for the HBS98 data consists of the following steps:
1. Sum up all income and all expenditures per household to the total per household. (These totals correspond to the INCOME and EXPENDITURE variables of the HBS98 Pseudo Universes.)
2. Transform the totals of income and expenditures with the LOG10 transformation.
3. Identify all households with missing (also partially missing) data in income. Separate them from the others. (All households: 9,295; households with missing and partially missing income: 505; households without missing income: 8,790.)
4. Calculate the residuals of an L1 regression (LOG_INC = beta1_L1_k * LOG_EXP; without intercept) for each socio-economic group (k stands for the 6 classes of the variable STASOCIO).
5. Divide the residuals by the STD_MAD (= 1.48*MAD); outliers are defined as observations with an absolute scaled residual strictly greater than 2.5.
6. Separate the outliers from the others.
7. Calculate the L2 regression coefficients (LOG_INC = beta1_L2_k * LOG_EXP; without intercept) on the remaining points for each socio-economic group.
8. Calculate the imputed total income from the total expenditures for each household with missing (or partially missing) income data using the obtained L2 regression coefficients: LOG_INC = beta1_L2_k * LOG_EXP.
9. Impute total income only if the imputed value is greater than the original value (this concerns households with partially missing income data). (Of the 505 households with missing income data, only 304 were imputed because of this condition.)
10. Recalculate the imputed INCOME on the linear scale (INCOME = 10**LOG_INC).
The beta1_L2_k coefficients that were actually used are the following:
• STASOCIO = -1: 0.9913
• STASOCIO = 1: 1.0185
• STASOCIO = 2: 0.9986
• STASOCIO = 3: 0.9941
• STASOCIO = 4: 0.9949
• STASOCIO = 5: 1.0119
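The steps listed above could be sketched in SAS roughly as follows. This is only a minimal sketch under stated assumptions, not the code actually used for HBS98: the data set name hbs and the variables income, expenditure and inc_missing are hypothetical, and PROC QUANTREG (median regression) is used here as one possible L1 routine; the source does not say which procedure was applied.

```sas
/* Hypothetical input: data set hbs, one row per household, with the
   variables stasocio, income, expenditure and inc_missing
   (1 = income missing or partially missing, 0 = complete).          */

/* Steps 2-3: LOG10 transformation and separation of the two groups */
data complete toimpute;
  set hbs;
  log_exp = log10(expenditure);
  if income > 0 then log_inc = log10(income);
  if inc_missing = 1 then output toimpute;
  else output complete;
run;

proc sort data=complete; by stasocio; run;
proc sort data=toimpute; by stasocio; run;

/* Step 4: L1 (median) regression without intercept, per group */
proc quantreg data=complete;
  by stasocio;
  model log_inc = log_exp / quantile=0.5 noint;
  output out=l1res residual=res_l1;
run;

/* Step 5: robust scale STD_MAD of the residuals, per group */
proc univariate data=l1res noprint;
  by stasocio;
  var res_l1;
  output out=scale std_mad=s_mad;
run;

/* Steps 5-6: keep only points with |scaled residual| <= 2.5 */
data kept;
  merge l1res scale;
  by stasocio;
  if abs(res_l1 / s_mad) <= 2.5;
run;

/* Step 7: L2 regression without intercept on the remaining points */
proc reg data=kept outest=l2est noprint;
  by stasocio;
  model log_inc = log_exp / noint;
run;
quit;

/* Steps 8-10: predict, back-transform, impute only if larger */
data hbs_imputed;
  merge toimpute(in=a)
        l2est(keep=stasocio log_exp rename=(log_exp=beta1_l2));
  by stasocio;
  if a;
  income_imp = 10**(beta1_l2 * log_exp);
  imp_flag = (missing(income) or income_imp > income);
  if imp_flag then income = income_imp;
run;
```

In this sketch the comparison in step 9 is made on the linear scale, which is equivalent to comparing on the LOG10 scale because the transformation is monotone. The outliers are only left out of the fitting data set kept; they remain in complete, so they are still available for the estimation procedure.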
Appendix B
Core SAS Codes for Single Imputation Methods
The code in this Appendix is organised according to the headings in Sections 2.4 and 2.6.
Case 1. Estimated linear regression model for continuous target variable
proc glm data=pattern1a;
  /* data= names the input file, i.e. the missingness pattern */
  class z1 z2;
  /* categorical variables used in the model; a variable not listed
     in the class statement is treated as continuous */
  model y = z1 z2 x1 x2 x3 / solution; /* model specification */
  output out=pattern1b p=ypred;
  /* name of the output file, plus a new variable in this file holding
     the predicted values; the residuals may be included as well */
run; /* runs the programme */
or, if the intercept is not used:
proc glm data=pattern1a;
  class z1 z2;
  model y = z1 z2 x1 x2 x3 / solution noint;
  /* this specification does not give the intercept for the model */
  output out=pattern1b p=ypred;
run;
Note: if robust methods are available for this model, the user should be given the option of using them.
The output of this program gives, for example, a value for ’ROOT MSE’. If ROOT MSE cannot be picked up automatically, it has to be calculated from the residuals (y - ypred), working with the output file pattern1b.
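Both routes to ROOT MSE might look as follows in SAS. This is a hedged sketch rather than a prescribed implementation: the ODS table name FitStatistics and its column RootMSE are stated from memory of the PROC GLM documentation and should be checked against the installed release, and the macro variables rootmse, rootmse_manual and nparm, as well as the residual variable yres, are hypothetical names introduced for illustration.

```sas
/* Route 1: capture the fit statistics table that PROC GLM prints.
   Assumption: the ODS table is called FitStatistics and holds a
   column RootMSE.                                                  */
ods output FitStatistics=fitstat;
proc glm data=pattern1a;
  class z1 z2;
  model y = z1 z2 x1 x2 x3 / solution;
  output out=pattern1b p=ypred r=yres;  /* r= adds the residuals */
run;
quit;

data _null_;
  set fitstat;
  call symputx('rootmse', RootMSE);     /* store as macro variable */
run;

/* Route 2: compute it from the residuals in the output file.
   nparm is the number of non-redundant model parameters and has to
   be supplied by the user; the value below is purely illustrative. */
%let nparm = 8;
proc sql noprint;
  select sqrt(sum(yres*yres) / (count(yres) - &nparm))
    into :rootmse_manual
    from pattern1b;
quit;

%put Root MSE from ODS: &rootmse, from residuals: &rootmse_manual;
```

Records with a missing y receive a predicted value but no residual, so count(yres) counts only the observations that entered the fit.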
Thus, we have the predicted values of y (ypred), and we may calculate the corresponding values that take the model uncertainty into account. This depends on the assumption made about the error term. In a standard case, this may simply be based on a normal distribution, that is, we obtain:
yprednoise = ypred + rannor(u)*ROOT MSE
in which rannor(u) is a random term with mean = 0 and standard deviation = 1, and u is the seed number.
Any available robustness checks could be included in this process. Also, if a certain re-scaling (e.g. log) has been used in modelling, the values should be transformed back to the original scale (in the case of log, this means the exp-transformation).
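The noise addition and the back-transformation described above could be coded along the following lines. This is a minimal sketch: the macro variables rootmse and seed, their values, and the output names pattern1c and yorig are hypothetical, and the exp step applies only if the model was fitted to log(y).

```sas
%let rootmse = 0.25;   /* hypothetical value; take it from the fitted model */
%let seed    = 12345;  /* hypothetical seed number u */

data pattern1c;
  set pattern1b;
  /* predicted value plus a normally distributed error term */
  yprednoise = ypred + rannor(&seed) * &rootmse;
  /* only if y was modelled on the log scale: back to the original scale */
  yorig = exp(yprednoise);
run;
```

With a positive seed, the random number stream is reproducible from run to run, which is convenient when simulation results need to be repeated.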
Moreover, a user should be able to specify certain bounds for this solution, so that the full range of rannor(u) is not used, for example as follows:
data pattern1b; set pattern1b; if rannor(u) > k then rannor(u) = k;