
5.8 Real data application

Step 6: get imputed values

The imputed values $\hat{R}_o$ of the non-respondents fall between the respondents' ranks; renumbering all these ranks from 0 to $n = n_r + n_o$ defines the imputed ranks. For each non-respondent $t$, we can find two respondents, with incomes $y_{L;t}$ and $y_{U;t}$, having respectively the closest rank smaller and the closest rank larger than $\hat{R}_{o;t}$. The predicted income is then estimated, for example, by simple interpolation:

$$\hat{y}_{\mathrm{imp},t} = \frac{y_{L;t} + y_{U;t}}{2}, \qquad (5.19)$$

which is what we did to obtain the results discussed above. Note that we could also use the fitted GB2 of Step 1:

$$\hat{y}^{\mathrm{GB2}}_{\mathrm{imp},t} = F_{\mathrm{GB2}}^{-1}(\hat{R}_{o;t}). \qquad (5.20)$$

Nevertheless, this leads to results that are still good but slightly worse.
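The interpolation of (5.19) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `impute_by_rank`, the toy ranks and incomes, and the choice to raise an error when an imputed rank is not bracketed by two respondents are all hypothetical and not taken from the original procedure.

```python
from bisect import bisect_left


def impute_by_rank(resp_ranks, resp_incomes, imputed_rank):
    """Midpoint of the incomes of the two respondents whose ranks
    bracket the imputed rank, as in equation (5.19).

    resp_ranks must be sorted in ascending order and aligned with
    resp_incomes (hypothetical input layout).
    """
    # Index of the first respondent whose rank is >= imputed_rank
    i = bisect_left(resp_ranks, imputed_rank)
    if i == 0 or i == len(resp_ranks):
        # Assumption: no extrapolation beyond the observed respondents
        raise ValueError("imputed rank is not bracketed by two respondents")
    y_lower = resp_incomes[i - 1]  # closest respondent with smaller rank
    y_upper = resp_incomes[i]      # closest respondent with larger rank
    return (y_lower + y_upper) / 2


# Toy data: ranks renumbered over respondents and non-respondents together;
# ranks 2 and 5 belong to non-respondents.
resp_ranks = [0, 1, 3, 4, 6]
resp_incomes = [12000, 25000, 40000, 52000, 90000]

y_imp_2 = impute_by_rank(resp_ranks, resp_incomes, 2)
y_imp_5 = impute_by_rank(resp_ranks, resp_incomes, 5)
print(y_imp_2, y_imp_5)
```

With this rule every imputed income lies between two observed incomes, so the imputed values cannot leave the range covered by the respondents.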

5.8.3 Results

Figure 5.2 compares the empirical income density of the imputed values with the (usually unknown) corresponding density of the real values, and with that of the respondents. Again, we can see that the respondents and the non-respondents show different distributions. The distribution of the imputed values follows very closely the one targeted for the non-respondents, except for very low (annualized) incomes below 20,000 Swiss francs. This is also reflected in Table 5.4: the empirical estimates of the inequality indicators based on the partially imputed dataset are not significantly different from the true values, except for the QSR. The latter indicator is by definition more sensitive to the lower incomes and thus to the precision of our imputation in that region of the income distribution.

But these inequality measures depend only on the shape of the distribution. What about the precision at the unit level? The four bottom panels of Figure 5.2 show that for half of the imputed values the absolute error lies within [−9,896; 19,881] Swiss francs, and for 90% of the imputed values it lies within [−38,954; 70,716] Swiss francs. The two lower panels of Figure 5.2 confirm, as the densities in the top panel already suggest, that we are relatively less precise in the lower tail of the distribution, that is, for low incomes. This stems from the fact that the low incomes were hard to model with the information at our disposal. For middle to high incomes, we obtain very good results.

Indeed, to benchmark our performance, we conducted other imputations using IVEware. The same auxiliary variables were provided as input, and we imputed the log-transformed $y_{CCO}$ variable, as is currently done at the SFSO. Note that in the SFSO's imputation procedure, the imputation is split into imputation classes² and some bounds restricting the allowed prediction range are set. For example, the employment income of an individual cannot exceed the declared total income of his or her household, if available. We did not implement these possibilities, so that similar effort was invested in both imputation strategies and their respective performances could be evaluated fairly.

² See Appendix F for more details on the creation of imputation classes. Note that the appendix is not part of the original publication.

Figure 5.3 presents the result of this attempt. The left panels summarise what is obtained with IVEware by taking as imputed value, for each observation, the mean over 50 imputations. The right panels illustrate what comes out when one chooses, for each missing value, the first of the 50 imputations as the imputed value. Taking the mean over 50 imputations completely destroys the original income distribution and ends up roughly imputing the median income plus or minus some variation. The imputation errors are much bigger than with our system, and the results on the inequality indicators based on data partially imputed in this manner are not acceptable (see Table 5.4). If one chooses the first of the 50 imputations as the imputed value, the empirical density of the imputed values looks better, but it is still much worse than the one provided by our system. The imputation errors are still huge, and the computed inequality indicators are still (all except one) significantly different from the true values (see Table 5.4).
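Why averaging 50 imputations flattens the income distribution can be illustrated with a small simulation. The sketch below is a toy under loud assumptions: incomes are drawn from a hypothetical lognormal distribution with arbitrary parameters, every missing value shares the same predictive distribution (no auxiliary variables), and nothing here reproduces IVEware itself.

```python
import random
import statistics

random.seed(0)  # fixed seed so the toy run is reproducible


def draw_income():
    # Hypothetical predictive distribution; parameters are illustrative only
    return random.lognormvariate(10.5, 0.8)


n_missing = 2000      # number of missing values in the toy example
n_imputations = 50    # imputations generated per missing value

single = []    # first imputation kept as the imputed value
averaged = []  # mean over the 50 imputations kept as the imputed value
for _ in range(n_missing):
    draws = [draw_income() for _ in range(n_imputations)]
    single.append(draws[0])
    averaged.append(statistics.fmean(draws))

sd_single = statistics.stdev(single)
sd_averaged = statistics.stdev(averaged)
print(f"sd of single imputations:   {sd_single:,.0f}")
print(f"sd of averaged imputations: {sd_averaged:,.0f}")
```

Because the mean of 50 independent draws has a much smaller variance than a single draw, the averaged values cluster around a central income, which is consistent with the collapse of the density described above.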

Finally, for our imputation system, Table 5.3 shows the order of magnitude of the estimated design variance and imputation variance, see (5.18). Our imputations are very stable, since the coefficient of determination of the model used to predict the normal quantiles (Step 4) is over 63%. Consequently, for the inequality measures considered, the imputation variance is very low compared to the design variance (at most 6.6% of the total variance across the indicators) and could therefore almost be neglected.

Table 5.3 – Imputation-variance estimation from our imputation strategy. The imputation variance is very low compared to the other component (the design variance), reaching at most 6.6% of the total variance, for the RMPG. EMP R + IMP = empirical estimation on respondents + non-respondents imputed by our imputation procedure (original survey weights).

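| Indicator | EMP R + IMP | Imputation variance $s^2_{imp}$ | Design variance $V_p$ | Total variance $V_{tot}$ | $s^2_{imp} / V_{tot}$ | Relative bias of point estimate |
|---|---|---|---|---|---|---|
| ARPT | 34,099 | 418 | 156,162 | 156,580 | 0.3% | -3% |
| ARPR | 32.96 | 0.00 | 0.16 | 0.17 | 2.6% | 2% |
| RMPG | 58.00 | 0.04 | 0.51 | 0.55 | 6.6% | 0% |
| QSR | 14.10 | 0.00 | 0.12 | 0.12 | 0.4% | -13% |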
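The last two columns of Table 5.3 can be recomputed from the published figures. The sketch below rests on two assumptions that are inferred from the numbers rather than stated here: that the total variance is the sum of the design and imputation components (the ARPT row is consistent with this), and that the relative bias is measured against the EMP ALL estimates of Table 5.4. Variable names are illustrative.

```python
# ARPT row of Table 5.3
s2_imp = 418      # imputation variance
v_p = 156162      # design variance
v_tot = 156580    # total variance

# Assumption: total variance = design variance + imputation variance
additive = (s2_imp + v_p == v_tot)

# Share of the total variance due to imputation
share_arpt = s2_imp / v_tot
print(f"ARPT imputation share: {share_arpt:.1%}")

# Point estimates: EMP ALL (Table 5.4) and EMP R + IMP (Table 5.3)
emp_all = {"ARPT": 35081, "ARPR": 32.31, "RMPG": 57.72, "QSR": 16.23}
emp_imp = {"ARPT": 34099, "ARPR": 32.96, "RMPG": 58.00, "QSR": 14.10}

# Assumed definition: (imputed estimate - full-sample estimate) / full-sample estimate
rel_bias_pct = {
    name: round(100 * (emp_imp[name] - emp_all[name]) / emp_all[name])
    for name in emp_all
}
print(rel_bias_pct)
```

Under these assumptions the rounded values match the table, which suggests that the strong negative bias for the QSR comes from the point estimate itself and not from the imputation variance.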


Table 5.4 – Inequality indicators. All estimates are weighted and are given with estimated 95% confidence intervals (ciL, ciU). EMP ALL = empirical estimation on all the individuals (no nonresponse, original survey weights); EMP R = empirical estimation on respondents only (original survey weights, no nonresponse correction); EMP R Gen Calib = empirical estimation on respondents only, using weights stemming from the generalized calibration; GB2 ALL = parametric estimation from the GB2 fitted on all individuals (no nonresponse, original survey weights); GB2 R Gen Calib = parametric estimation from the GB2 fitted on the respondents only, using weights stemming from the generalized calibration; GB2 R = parametric estimation from the GB2 fitted on the respondents using only the original survey weights (no nonresponse correction); EMP R + IMP = empirical estimation on respondents + non-respondents imputed by our imputation procedure (original survey weights). All variance estimations are done through linearization techniques (see Section 5.7) and take the sampling design, several nonresponse corrections, panel combinations and calibrations into account. The variance due to imputation is evaluated through multiple imputation.

| Indicator | EMP ALL: Est. | ciL | ciU | EMP R: Est. | ciL | ciU | EMP R GEN CALIB: Est. | ciL | ciU |
|---|---|---|---|---|---|---|---|---|---|
| ARPT | 35,081 | 34,389 | 35,774 | 38,942 | 38,283 | 39,600 | 33,977 | 32,856 | 35,098 |
| ARPR | 32.31 | 31.51 | 33.12 | 27.68 | 26.78 | 28.59 | 34.20 | 33.17 | 35.22 |
| RMPG | 57.72 | 55.88 | 59.57 | 52.14 | 49.80 | 54.49 | 60.19 | 58.70 | 61.68 |
| QSR | 16.23 | 15.17 | 17.28 | 10.39 | 9.73 | 11.05 | 17.78 | 16.63 | 18.92 |
| GINI | 42.60 | 41.73 | 43.46 | 37.96 | 37.03 | 38.88 | 43.49 | 42.71 | 44.26 |

| Indicator | GB2 ALL: Est. | ciL | ciU | GB2 R: Est. | ciL | ciU | GB2 R GEN CALIB: Est. | ciL | ciU |
|---|---|---|---|---|---|---|---|---|---|
| ARPT | 33,210 | 32,618 | 33,802 | 37,737 | 37,114 | 38,359 | 32,246 | 30,902 | 33,589 |
| ARPR | 31.23 | 30.81 | 31.65 | 27.43 | 26.92 | 27.94 | 31.92 | 31.14 | 32.7 |
| RMPG | 52.85 | 51.82 | 53.87 | 44.47 | 43.43 | 45.52 | 54.56 | 52.61 | 56.51 |
| QSR | 14.88 | 13.98 | 15.79 | 9.54 | 9.01 | 10.07 | 16.01 | 14.48 | 17.53 |
| GINI | 42.51 | 41.70 | 43.32 | 37.80 | 36.95 | 38.66 | 42.91 | 41.94 | 43.88 |

| Indicator | Our imputation (EMP R + IMP): Est. | ciL | ciU | IVEware, mean of 50 imp.: IMP_iveT | ciL | ciU | IVEware, first imputation: IMP_ive2T | ciL | ciU |
|---|---|---|---|---|---|---|---|---|---|
| ARPT | 34,099 | 33,323 | 34,874 | 34,131 | NA | NA | 37,332 | NA | NA |
| ARPR | 32.96 | 32.16 | 33.76 | 19.10 | 18.16 | 19.95 | 29.48 | 28.38 | 30.48 |
| RMPG | 58.00 | 56.55 | 59.45 | 50.93 | 47.95 | 53.32 | 49.13 | 46.99 | 51.23 |
| QSR | 14.10 | 13.43 | 14.78 | 7.52 | 7.19 | 7.94 | 11.59 | 11.08 | 12.18 |
| GINI | 41.79 | 40.98 | 42.60 | 34.79 | 34.10 | 35.65 | 41.02 | 40.25 | 41.98 |
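One way to read the "significantly different" statements against Table 5.4 is to check whether the 95% confidence intervals of an imputation-based estimate and of the EMP ALL estimate overlap. This criterion is an assumption made for illustration; the test actually used is not stated in this section. The sketch below applies it to the published intervals (ARPT is omitted for IVEware because its intervals are NA).

```python
def overlaps(ci_a, ci_b):
    """True if two (lower, upper) intervals share at least one point."""
    return ci_a[0] <= ci_b[1] and ci_b[0] <= ci_a[1]


# 95% confidence intervals (ciL, ciU) copied from Table 5.4
emp_all = {"ARPT": (34389, 35774), "ARPR": (31.51, 33.12),
           "RMPG": (55.88, 59.57), "QSR": (15.17, 17.28),
           "GINI": (41.73, 43.46)}
ours = {"ARPT": (33323, 34874), "ARPR": (32.16, 33.76),
        "RMPG": (56.55, 59.45), "QSR": (13.43, 14.78),
        "GINI": (40.98, 42.60)}
ive_mean = {"ARPR": (18.16, 19.95), "RMPG": (47.95, 53.32),
            "QSR": (7.19, 7.94), "GINI": (34.10, 35.65)}
ive_first = {"ARPR": (28.38, 30.48), "RMPG": (46.99, 51.23),
             "QSR": (11.08, 12.18), "GINI": (40.25, 41.98)}


def non_overlapping(estimates):
    # Indicators whose interval does not overlap the EMP ALL interval
    return [k for k in estimates if not overlaps(estimates[k], emp_all[k])]


differ_ours = non_overlapping(ours)
differ_ive_mean = non_overlapping(ive_mean)
differ_ive_first = non_overlapping(ive_first)
print(differ_ours, differ_ive_mean, differ_ive_first)
```

Under this overlap rule, only the QSR separates from EMP ALL for our imputation, all four available indicators separate for the mean of 50 IVEware imputations, and all but the GINI separate for the first IVEware imputation, in line with the statements in the text.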

Figure 5.2 – Results. Top panel: empirical income densities. Middle left panel: quantiles of the absolute imputation error, defined as the real value minus the imputed value, $y_{CCO,k} - \hat{y}_{imp,k}$, $k \in o$. Middle right and two bottom panels: quantiles of the relative imputation error, $(y_{CCO,k} - \hat{y}_{imp,k}) / y_{CCO,k}$, $k \in o$; bottom left, zoom on original CCO incomes below 20,000 CHF; bottom right, zoom on original CCO incomes of at least 20,000 CHF.
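The error measures plotted in Figure 5.2 can be computed as follows. This is a minimal sketch with made-up toy values; the function names, the percentile method ("inclusive") and the helper `central_interval` are assumptions for illustration and do not come from the original analysis.

```python
import statistics


def imputation_errors(true_values, imputed_values):
    """Absolute error (real minus imputed) and relative error
    (absolute error divided by the real value), unit by unit."""
    abs_err = [y - y_hat for y, y_hat in zip(true_values, imputed_values)]
    rel_err = [(y - y_hat) / y for y, y_hat in zip(true_values, imputed_values)]
    return abs_err, rel_err


def central_interval(errors, coverage):
    """Bounds containing the central `coverage` share of the errors,
    e.g. coverage=0.5 gives the 25th and 75th percentiles."""
    # With n=100, statistics.quantiles returns the 99 percentile cut points
    cuts = statistics.quantiles(errors, n=100, method="inclusive")
    lo_pct = round((1 - coverage) / 2 * 100)
    hi_pct = 100 - lo_pct
    return cuts[lo_pct - 1], cuts[hi_pct - 1]


# Toy example (values are invented, in CHF)
true_incomes = [10000, 20000, 40000, 80000]
imputed_incomes = [15000, 18000, 44000, 60000]
abs_err, rel_err = imputation_errors(true_incomes, imputed_incomes)
print(abs_err)
print(rel_err)
print(central_interval(abs_err, 0.5))
```

A negative error means the imputed income is larger than the real one; dividing by the real income makes the relative error large for low incomes, which is why the lower tail is shown separately.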


Figure 5.3 – Results from imputations using IVEware, to be compared with Figure 5.2. Left side (panel title: "IVEware imputations, mean over 50 imputed values for each missing observation"): the imputed value corresponds to the mean of 50 imputations. Right side (panel title: "IVEware imputations, one single imputed value for each observation"): only one imputed value has been generated for each missing observation. Top panels: empirical income densities. Four lower panels: quantiles of the absolute imputation error, $y_{CCO,k} - \hat{y}_{imp,k}$ (middle panels), and of the relative imputation error, $(y_{CCO,k} - \hat{y}_{imp,k}) / y_{CCO,k}$ (bottom panels), $k \in o$.

In document Imputation of income variables in a survey context and estimation of variance for indicators of poverty and social exclusion (Page 139-144)