5.8 Real data application
Step 6: get imputed values
The imputed values $\hat{R}_o$ of the non-respondents fall between the respondents' ranks; renumbering all these ranks from 0 to $n = n_r + n_o$ defines the imputed ranks. For each non-respondent $t$, we can find two respondents, with incomes $y_{L;t}$ and $y_{U;t}$, whose ranks are respectively the closest below and the closest above $\hat{R}_{o;t}$. The predicted income is then estimated, for example, by simple interpolation:
\[
\hat{y}_{\mathrm{imp},t} = \frac{y_{L;t} + y_{U;t}}{2}, \tag{5.19}
\]
which is what we did to obtain the results discussed above. Note that we could also use the GB2 fitted in Step 1:
\[
\hat{y}^{\mathrm{GB2}}_{\mathrm{imp},t} = F_{\mathrm{GB2}}^{-1}(\hat{R}_{o;t}). \tag{5.20}
\]
This alternative still gives good, but slightly worse, results.
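The interpolation rule (5.19) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the toy ranks and the toy incomes are hypothetical, the ranks of respondents and non-respondents are assumed to be on a common scale, and the treatment of imputed ranks lying outside the respondents' range (reuse the nearest respondent) is a choice made here, not something the text specifies.

```python
from bisect import bisect_left


def impute_by_neighbour_mean(resp_ranks, resp_incomes, imputed_rank):
    """Mean of the incomes of the two respondents whose ranks bracket imputed_rank."""
    # Sort respondents by rank so that a binary search can locate the neighbours.
    pairs = sorted(zip(resp_ranks, resp_incomes))
    ranks = [r for r, _ in pairs]
    incomes = [y for _, y in pairs]
    # Index of the first respondent whose rank is >= imputed_rank (upper neighbour).
    upper = bisect_left(ranks, imputed_rank)
    lower = upper - 1
    # Assumption: outside the respondents' range, reuse the nearest respondent.
    upper = min(max(upper, 0), len(ranks) - 1)
    lower = min(max(lower, 0), len(ranks) - 1)
    return 0.5 * (incomes[lower] + incomes[upper])


# Toy data (hypothetical values, not taken from the survey).
resp_ranks = [0.10, 0.50, 0.90]
resp_incomes = [10_000.0, 40_000.0, 90_000.0]
imputed = [impute_by_neighbour_mean(resp_ranks, resp_incomes, r)
           for r in (0.30, 0.70, 0.95)]
```

With these toy values, an imputed rank of 0.30 falls between the first two respondents and receives the mean of their incomes.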
5.8.3 Results
Figure 5.2 compares the empirical income density of the imputed values with the (usually unknown) density of the real values of the non-respondents, and with the density of the respondents. Again we can see that the respondents and the non-respondents show different distributions. The distribution of the imputed values follows the targeted distribution of the non-respondents very closely, except for the very low (annualized) incomes below 20,000 Swiss Francs. This is also reflected in Table 5.4: the empirical estimates of the inequality indicators based on the partially imputed dataset are not significantly different from the true values, except for the QSR. The latter indicator is by definition more sensitive to the lower incomes, and thus to the precision of our imputation in that region of the income distribution.

These inequality measures, however, depend only on the shape of the distribution. What about the precision at the unit level? The four lower panels of Figure 5.2 show that for half of the imputed values the absolute error lies within [−9,896; 19,881] Swiss Francs, and for 90% of the imputed values it lies within [−38,954; 70,716] Swiss Francs. The two bottom panels of Figure 5.2 confirm, as the densities in the top panel suggest, that we are relatively less precise in the lower tail of the distribution, that is, for low incomes. This is because the low incomes were hard to model with the information at our disposal. For middle to high incomes, we obtain very good results.

To put our performance in perspective, we conducted other imputations using IVEware. The same auxiliary variables were provided as input, and we imputed the log-transformed $y_{\mathrm{CCO}}$ variable, as is currently done at the SFSO. Note that in the SFSO's imputation procedure, the imputation is split into imputation classes² and some bounds restricting the allowed prediction range are set. For example, the employment income of an individual cannot exceed the declared total income of his or her household, if available. We did not implement these options, so that similar effort was invested in both imputation strategies and their respective performances could be evaluated fairly.

² See Appendix F for more details on the creation of imputation classes. Note that the appendix is not part of the original publication.

Figure 5.3 presents the results of this attempt. The left panels summarise what is obtained with IVEware by taking as imputed value, for each observation, the mean over 50 imputations. The right panels illustrate what happens when one chooses, for each missing value, the first of the 50 imputations as imputed value. Taking the mean over 50 imputations completely destroys the original income distribution and amounts roughly to imputing the median income plus or minus some variation. The imputation errors are much bigger than with our system, and the inequality indicators based on data partially imputed in this manner are not acceptable (see Table 5.4). If one chooses the first of the 50 imputations as imputed value, the empirical density of the imputed values looks better, but it is still much worse than the one provided by our system. The imputation errors are still huge, and all the computed inequality indicators but one are still significantly different from the true values (see Table 5.4).
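Why averaging many imputations compresses the income distribution can be illustrated with a small simulation. The sketch below does not use IVEware and does not reproduce its model: it assumes, purely for illustration, that each imputation is a draw from a log-normal predictive distribution with hypothetical parameters, and it compares the spread of single draws with the spread of means over 50 draws.

```python
import math
import random
import statistics

random.seed(1)
n_units, n_draws = 2000, 50
sigma = 1.0  # hypothetical residual standard deviation on the log scale

single, averaged = [], []
for _ in range(n_units):
    mu = random.gauss(10.5, 0.5)  # hypothetical predicted log-income of one unit
    draws = [math.exp(random.gauss(mu, sigma)) for _ in range(n_draws)]
    single.append(draws[0])                   # first imputation only
    averaged.append(statistics.fmean(draws))  # mean over the 50 imputations


def spread(values):
    """Ratio of the 95% quantile to the 5% quantile of the imputed incomes."""
    cuts = statistics.quantiles(values, n=20)  # cut points at 5%, 10%, ..., 95%
    return cuts[-1] / cuts[0]


spread_single = spread(single)
spread_averaged = spread(averaged)
```

In this toy setting the averaged imputations are much more concentrated than the single draws, because averaging removes most of the residual variation and leaves little more than the variation of the predicted values.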
Finally, for our imputation system, Table 5.3 shows the order of magnitude of the estimated design variance and imputation variance, see (5.18). Our imputations are very stable, since the coefficient of determination of the model used to predict the normal quantiles (Step 4) is over 63%. Consequently, for the considered inequality measures, the imputation variance is very low compared to the design variance (at most 6.6% of the total variance, reached for the RMPG) and could therefore almost be neglected.
Table 5.3 – Imputation-variance estimation from our imputation strategy. One can see that the imputation variance is very low compared to the other component, reaching at the maximum 6.6% of the total variance for the RMPG. EMP R + IMP = empirical estimation on respondents + non-respondents imputed by our imputation procedure (original survey weights).

        EMP R      Imputation        Design         Total                           Relative bias
        + IMP      variance s²_imp   variance V_p   variance V_tot   s²_imp/V_tot   of point estimate
ARPT    34,099     418               156,162        156,580          0.3%           -3%
ARPR    32.96      0.00              0.16           0.17             2.6%           2%
RMPG    58.00      0.04              0.51           0.55             6.6%           0%
QSR     14.10      0.00              0.12           0.12             0.4%           -13%
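The last two columns of Table 5.3 can be checked by hand. The sketch below makes two assumptions that the text does not state explicitly: that the total variance is the plain sum of the imputation and design components (which matches the ARPT row, 418 + 156,162 = 156,580), and that the relative bias is measured against the EMP ALL estimates of Table 5.4. The function names are hypothetical.

```python
def imputation_share(s2_imp, v_design):
    """Share of the total variance that is due to imputation."""
    v_total = s2_imp + v_design  # assumption: total = imputation + design
    return s2_imp / v_total


def relative_bias(estimate, reference):
    """Relative deviation of an estimate from a reference value."""
    return (estimate - reference) / reference


# ARPT row of Table 5.3; reference values are the EMP ALL estimates of Table 5.4.
share_arpt = imputation_share(418, 156162)
bias_arpt = relative_bias(34099, 35081)
bias_qsr = relative_bias(14.10, 16.23)
```

Rounded, these give a share of 0.3% and relative biases of -3% and -13%, in line with the table. The other rows cannot be reproduced as precisely, because their variance components are printed to two decimals only.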
Table 5.4 – Inequality indicators. All the estimations are weighted. Estimated 95% confidence intervals. EMP ALL = empirical estimation on all the individuals (no nonresponse, original survey weights), EMP R = empirical estimation on respondents only (original survey weights, no nonresponse correction), EMP R Gen Calib = empirical estimation on respondents only using weights stemming from the generalized calibration, GB2 ALL = parametric estimation out of the GB2 fitted on all individuals (no nonresponse, original survey weights), GB2 R Gen Calib = parametric estimation out of the GB2 fitted on the respondents only using weights stemming from the generalized calibration, GB2 R = parametric estimation out of the GB2 fitted on the respondents using only the original survey weights (no nonresponse correction), EMP R + IMP = empirical estimation on respondents + non-respondents imputed by our imputation procedure (original survey weights). All variance estimations are done through linearization techniques (see Section 5.7) and take the sampling design, several nonresponse corrections, panel combinations and calibrations into account. The variance due to imputation is evaluated through multiple imputation.
        EMP ALL                    EMP R                      EMP R GEN CALIB
        Est.     ciL      ciU      Est.     ciL      ciU      Est.     ciL      ciU
ARPT    35,081   34,389   35,774   38,942   38,283   39,600   33,977   32,856   35,098
ARPR    32.31    31.51    33.12    27.68    26.78    28.59    34.20    33.17    35.22
RMPG    57.72    55.88    59.57    52.14    49.80    54.49    60.19    58.70    61.68
QSR     16.23    15.17    17.28    10.39    9.73     11.05    17.78    16.63    18.92
GINI    42.60    41.73    43.46    37.96    37.03    38.88    43.49    42.71    44.26

        GB2 ALL                    GB2 R                      GB2 R GEN CALIB
        Est.     ciL      ciU      Est.     ciL      ciU      Est.     ciL      ciU
ARPT    33,210   32,618   33,802   37,737   37,114   38,359   32,246   30,902   33,589
ARPR    31.23    30.81    31.65    27.43    26.92    27.94    31.92    31.14    32.70
RMPG    52.85    51.82    53.87    44.47    43.43    45.52    54.56    52.61    56.51
QSR     14.88    13.98    15.79    9.54     9.01     10.07    16.01    14.48    17.53
GINI    42.51    41.70    43.32    37.80    36.95    38.66    42.91    41.94    43.88

        Our imputation             IVEware imputations
        EMP R + IMP                mean of 50 imp.            first imputation
        Est.     ciL      ciU      Est.     ciL      ciU      Est.     ciL      ciU
ARPT    34,099   33,323   34,874   34,131   NA       NA       37,332   NA       NA
ARPR    32.96    32.16    33.76    19.10    18.16    19.95    29.48    28.38    30.48
RMPG    58.00    56.55    59.45    50.93    47.95    53.32    49.13    46.99    51.23
QSR     14.10    13.43    14.78    7.52     7.19     7.94     11.59    11.08    12.18
GINI    41.79    40.98    42.60    34.79    34.10    35.65    41.02    40.25    41.98
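One rough way to read Table 5.4 is to check, indicator by indicator, whether the confidence interval obtained after imputation overlaps the EMP ALL interval. The text does not say which test underlies "significantly different", so the overlap rule below is an assumption used only as a screening device; the ARPT is left out because no IVEware intervals are reported for it.

```python
def overlaps(a, b):
    """True if the closed intervals a = (low, high) and b = (low, high) intersect."""
    return a[0] <= b[1] and b[0] <= a[1]


# 95% confidence intervals copied from Table 5.4.
emp_all = {"ARPR": (31.51, 33.12), "RMPG": (55.88, 59.57),
           "QSR": (15.17, 17.28), "GINI": (41.73, 43.46)}
ours = {"ARPR": (32.16, 33.76), "RMPG": (56.55, 59.45),
        "QSR": (13.43, 14.78), "GINI": (40.98, 42.60)}
ive_mean = {"ARPR": (18.16, 19.95), "RMPG": (47.95, 53.32),
            "QSR": (7.19, 7.94), "GINI": (34.10, 35.65)}
ive_first = {"ARPR": (28.38, 30.48), "RMPG": (46.99, 51.23),
             "QSR": (11.08, 12.18), "GINI": (40.25, 41.98)}


def flagged(candidate):
    """Indicators whose interval does not overlap the EMP ALL interval."""
    return [k for k in emp_all if not overlaps(candidate[k], emp_all[k])]


ours_flagged = flagged(ours)
ive_mean_flagged = flagged(ive_mean)
ive_first_flagged = flagged(ive_first)
```

Under this rule only the QSR is flagged for our imputation, all four indicators are flagged for the mean of 50 IVEware imputations, and all but the Gini index are flagged for the first IVEware imputation.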
[Figure 5.2, bottom panel titles: "on original CCO-incomes < 20,000.-" (left); "on original CCO-incomes ≥ 20,000.-" (right)]

Figure 5.2 – Results. Top panel: empirical income densities. Middle left panel: quantiles of the absolute imputation error, i.e. the real value minus the imputed value, $y_{\mathrm{CCO},k} - \hat{y}_{\mathrm{imp},k}$, $k \in o$. Middle right and two bottom panels: quantiles of the relative imputation error, $(y_{\mathrm{CCO},k} - \hat{y}_{\mathrm{imp},k}) / y_{\mathrm{CCO},k}$, $k \in o$; bottom left, zoom on original incomes below 20,000 CHF; bottom right, zoom on original incomes over 20,000 CHF.
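The error summaries shown in these panels amount to quantiles of the signed differences between real and imputed values. The sketch below computes them on toy data; the function name and the values are hypothetical, and the choice of quantile levels (5%, 25%, 50%, 75%, 95%) is made here to match the "half" and "90%" ranges quoted in the text.

```python
import statistics


def error_quantiles(true_values, imputed_values, relative=False):
    """Quantiles of the imputation error y - y_hat, or (y - y_hat) / y if relative."""
    # The relative error requires every true value to be non-zero.
    errors = [(y - y_hat) / y if relative else (y - y_hat)
              for y, y_hat in zip(true_values, imputed_values)]
    cuts = statistics.quantiles(errors, n=20, method="inclusive")  # 5%, ..., 95%
    return {"5%": cuts[0], "25%": cuts[4], "50%": cuts[9],
            "75%": cuts[14], "95%": cuts[18]}


# Toy data (hypothetical values, not taken from the survey).
y_true = [10_000, 20_000, 30_000, 40_000, 50_000]
y_imputed = [12_000, 18_000, 30_000, 44_000, 45_000]

abs_q = error_quantiles(y_true, y_imputed)
rel_q = error_quantiles(y_true, y_imputed, relative=True)
```

The interval between the 25% and 75% entries covers half of the imputed values, and the interval between the 5% and 95% entries covers 90% of them.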
[Figure 5.3, panel titles: "IVEware imputations, mean over 50 imputed values for each missing observation" (left); "IVEware imputations, one single imputed value for each observation" (right)]

Figure 5.3 – Results from imputations using IVEware, to be compared with Figure 5.2. On the left side, the imputed value is the mean of 50 imputations; on the right side, only one imputed value has been generated for each missing observation. Top panels: empirical income densities. Four lower panels: quantiles of the absolute imputation error, $y_{\mathrm{CCO},k} - \hat{y}_{\mathrm{imp},k}$ (middle panels), and of the relative imputation error, $(y_{\mathrm{CCO},k} - \hat{y}_{\mathrm{imp},k}) / y_{\mathrm{CCO},k}$ (bottom panels), $k \in o$.