7.2 Weight Adjustments for Panel Non-response

7.2.2 Longitudinal Sampling Weights for the PME Survey

7.2.2.2 Logistic Regression Model for the Predicted Prob-

The method chosen to adjust the base weights (w_ij(1)*) to compensate for drop-out is one of the methods mentioned in Lepkowski (1989). It involves fitting logistic regressions to a response indicator, calculating the predicted probabilities, and adjusting the base weights by the inverse of these predicted probabilities. Because no intermittent non-response is considered, the probability of dropping out from the panel at occasion t is what is being predicted.

For each head of household on each occasion, a response indicator takes the value 1 if the unit responded up to that wave and the value 0 if it dropped out at that wave. The set of completers consists of those whose response indicators equal one on all eight occasions. Logistic regression models are fitted with each response indicator as the outcome variable, starting from the second occasion because all the units are observed at the first occasion. The covariates considered in each model are taken from the previous occasion. Thus, using the data from the first occasion, a logistic model for response at the second occasion is fitted in order to adjust the base weights and calculate the occasion 2 weights. The same procedure is then applied using the data from occasion 2 and the response indicator for occasion 3 to adjust the already adjusted occasion 2 weights. This is repeated for each subsequent pair of occasions, so that the weights of each occasion are adjusted conditionally on the adjustment of the previous occasion.
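The sequential adjustment described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used for the PME weights: the function name `adjust_weights`, the dictionary layout and the toy numbers are hypothetical, and the predicted probabilities are taken as given rather than estimated from a logistic model.

```python
def adjust_weights(base_weights, response_probs):
    """Sequentially divide weights by predicted response probabilities.

    base_weights: dict mapping unit id -> occasion 1 (base) weight.
    response_probs: list with one dict per occasion t = 2, 3, ...; each dict
        maps the units that responded at occasion t to their predicted
        probability of responding at t (hypothetical layout).
    Returns a list of dicts: the weights at occasions 1, 2, 3, ...
    """
    weights = [dict(base_weights)]
    for probs in response_probs:
        previous = weights[-1]
        current = {}
        for unit, p in probs.items():
            if unit not in previous:
                # No intermittent non-response: a unit that has dropped
                # out cannot re-enter the panel at a later occasion.
                continue
            if not 0.0 < p <= 1.0:
                raise ValueError(f"invalid probability for unit {unit!r}: {p}")
            # Inverse-probability adjustment of the already adjusted weight.
            current[unit] = previous[unit] / p
        weights.append(current)
    return weights


# Toy example: unit "c" drops out at occasion 2, unit "b" at occasion 3.
toy = adjust_weights(
    {"a": 100.0, "b": 120.0, "c": 80.0},
    [{"a": 0.5, "b": 0.75}, {"a": 0.5}],
)
print(toy[1])  # {'a': 200.0, 'b': 160.0}
print(toy[2])  # {'a': 400.0}
```

Because each occasion's weight is built on the previous occasion's adjusted weight, the final weight of a completer is the base weight divided by the product of the predicted probabilities over all pairs of occasions.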

The covariates for initial inclusion in each of the logistic models were chosen with the objective of predicting the probability of response given the characteristics of the heads of the households. The auxiliary variables initially considered were the same set of variables as those in Chapter 4, except that all the variables entered the model as categorical variables. Continuous covariates were therefore categorized, following the suggestions presented in Chapter 2.
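As an illustration of categorizing a continuous covariate, the sketch below bins years of work into the bands that appear in Table 7.2 and builds the age indicator. The function names are hypothetical, and the "0 to 4" reference band is an assumption, since the table does not list the reference category.

```python
from bisect import bisect_right

# Lower bounds of the years-of-work bands listed in Table 7.2.
WORK_BOUNDS = [5, 10, 15, 20, 25]
# "0 to 4" is an assumed reference band; the others follow the table rows.
WORK_LABELS = ["0 to 4", "5 to 9", "10 to 14", "15 to 19", "20 to 24", "25 or more"]


def years_of_work_band(years):
    """Return the band label for a non-negative number of years of work."""
    if years < 0:
        raise ValueError("years of work cannot be negative")
    return WORK_LABELS[bisect_right(WORK_BOUNDS, years)]


def age_40_and_over(age):
    """Binary indicator matching the 'Age 40 and over' term in Table 7.2."""
    return age >= 40


print(years_of_work_band(12))  # 10 to 14
print(age_40_and_over(39))     # False
```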

Initial model selection showed that some of the variables were not statistically significant, so they were not included subsequently. It was also observed that the categories of some variables could be collapsed, and this was done for both the education and the age variables. Further model selection was performed separately for each of the seven models, one for each consecutive pair of waves. Statistically significant main effects were initially tested in one-level logistic regression models through forward selection. Using these models as a base, multivariate Wald tests for each categorical variable were performed, and terms statistically significant at the 5% level were retained in the model. Table 7.2 presents the final main-effects models. The table shows that a different final main-effects model was selected for each pair of occasions. No suggestion was found in the reviewed literature on the best approach to selecting these models. The belief that the probabilities of response might change over time motivated the selection of different models for each pair of occasions.
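For reference, the multivariate Wald statistic for a categorical variable takes the standard form below. The symbols are introduced here for illustration and are not the source's notation: \hat{\beta}_c is the vector of the q dummy-variable coefficients of the categorical variable, and \hat{V}_c is the corresponding block of the estimated covariance matrix.

```latex
W \;=\; \hat{\boldsymbol{\beta}}_c^{\top}\, \hat{\mathbf{V}}_c^{-1}\, \hat{\boldsymbol{\beta}}_c
\;\overset{a}{\sim}\; \chi^2_q
\qquad \text{under } H_0:\ \boldsymbol{\beta}_c = \mathbf{0}
```

Under this form, a variable is retained at the 5% level when W exceeds the 0.95 quantile of the chi-square distribution with q degrees of freedom.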

Note from Table 7.2 that the metropolitan region variable was kept in all the models, even where it did not reach statistical significance, because it is an important variable in the design of the survey. Another design variable considered and tested was the panel to which the units were selected. This is the first set of variables shown in Table 7.2. The panel variable was significant only in model (3,4). Although not shown in Table 7.2, the outcome variable income was also tested in the models. However, income did not meet the significance criteria in any of the final main-effects models. This suggests that the response probabilities are not related to income for the set of heads of household considered here, which is an indication that the panel non-response is missing at random.

After selecting the one-level models in Table 7.2, two-level random intercept models were evaluated. There are different ways to estimate discrete random intercept models. The models presented in this section were all estimated in Stata using the gllamm command, which estimates random intercept models via adaptive quadrature, as mentioned in Chapter 2. The two-level logistic models define the PSUs (clusters) as the second-level units and include a random intercept for each cluster. However, after the selection of the data under analysis, the number of observations per cluster is quite small and some of the clusters have no variation in the response indicator. Interaction effects between the significant main effects were then investigated at this stage for both one-level and two-level models. The two-level models that included the interaction terms did not converge, and convergence problems also arose for some of the one-level models. This was taken as grounds for not including the interaction effects. A similar problem with interaction terms was mentioned in Rizzo et al. (1996), where the final model for the predicted probabilities included only the significant main effects.
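A two-level random intercept logistic model of the kind described here is usually written as below. The notation is an assumption made for illustration: R_{ij} is the response indicator of head of household i in PSU j, x_{ij} the vector of categorical covariates from the previous occasion, and u_j the random intercept of PSU j, whose variance \sigma^2_u is the between-cluster variance.

```latex
\operatorname{logit} \Pr\!\left(R_{ij} = 1 \mid \mathbf{x}_{ij}, u_j\right)
  \;=\; \mathbf{x}_{ij}^{\top} \boldsymbol{\beta} + u_j ,
\qquad u_j \sim N\!\left(0, \sigma^2_u\right)
```

Setting \sigma^2_u = 0 removes the random intercept and gives the one-level model.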

Table 7.2: Logistic Regression Model for the Response Propensity

Models are labelled by occasion data and response indicator: 1,2  2,3  3,4  4,5  5,6  6,7  7,8. Each entry is a coefficient followed by its z statistic.

Terms retained in a subset of the models (entries in left-to-right model order among the models that retained the term):

Panel 4.02                1.086 (2.05)*
Panel 4.03                1.078 (2.03)*
Panel 4.04                0.93 (1.86)
Panel 4.05                0.589 (1.29)
Panel 4.06                2.153 (2.82)**
Panel 4.07                0.569 (1.33)
Panel 4.08                0.763 (1.74)
Panel 4.09                0.157 (0.41)
Age 40 and over           0.565 (2.69)**   0.649 (2.68)**   0.431 (5.82)**   0.468 (2.29)*    0.433 (2.85)**
+12 years of Education    -0.789 (2.92)**  -0.862 (3.27)**  -0.396 (4.35)**  -0.539 (2.28)*   -0.409 (2.22)*
5 to 9 years of work      -0.063 (0.23)    0.003 (0.01)
10 to 14 years of work    0.549 (1.33)     0.961 (2.02)*
15 to 19 years of work    1.344 (1.85)     2.039 (2.01)*
20 to 24 years of work    1.187 (1.63)     1.115 (1.53)
+25 years of work         -                -0.183 (0.44)
Proxy Respondent          0.643 (2.79)**   0.64 (2.57)*     0.162 (2.05)*
2 members in the HH       0.602 (1.99)*    0.921 (2.38)*    0.336 (2.68)**   0.545 (1.73)     0.509 (1.87)
3 members in the HH       0.995 (3.24)**   1.101 (2.99)**   0.543 (4.39)**   0.767 (2.53)*    0.602 (2.35)*
4 members in the HH       1.124 (3.47)**   1.201 (3.20)**   0.895 (6.84)**   1.496 (4.18)**   0.641 (2.51)*
5 members in the HH       1.524 (3.24)**   0.425 (1.12)     0.705 (4.67)**   0.941 (2.43)*    0.389 (1.35)
+6 members in the HH      1.362 (2.46)*    1.816 (2.39)*    0.856 (4.59)**   1.407 (2.52)*    1.064 (2.57)*

Terms and statistics present in all seven models:

Model                     1,2              2,3              3,4              4,5              5,6              6,7              7,8
Salvador                  1.528 (2.84)**   -0.005 (0.01)    1.214 (2.40)*    -0.206 (1.37)    -0.606 (0.89)    0.603 (1.37)     0.637 (1.69)
Belo Horizonte            -0.126 (0.36)    -0.538 (1.07)    0.269 (0.72)     0.136 (0.91)     -0.999 (1.59)    0.018 (0.05)     -0.424 (1.46)
Rio de Janeiro            1.946 (3.64)**   0.605 (1.06)     1.855 (3.46)**   0.483 (3.23)**   0.381 (0.54)     1.074 (2.55)*    0.882 (2.60)**
São Paulo                 0.469 (1.33)     0.294 (0.56)     1.379 (3.25)**   0.285 (2.00)*    -0.881 (1.43)    0.625 (1.67)     0.497 (1.62)
Porto Alegre              0.133 (0.35)     -0.476 (0.91)    -0.048 (0.13)    0.16 (1.01)      -1.211 (1.92)    -0.025 (0.07)    -0.61 (2.07)*
Constant                  2.876 (8.00)**   4.744 (9.83)**   2.669 (5.66)**   1.567 (9.90)**   5.435 (9.21)**   3.273 (8.27)**   3.219 (9.64)**
Observations              11,598           10,808           11,430           11,357           10,537           10,464           10,365

Note: Absolute value of z statistics in parentheses. *significant at 5%, **significant at 1%

With the decision to test main-effects models only, Table 7.3 presents a summary of the predicted probabilities under the different model formulations. Table 7.4 presents the estimates of the between-cluster variance in the two-level random intercept logistic models, together with the goodness-of-fit test comparing the two model formulations. Note from Table 7.3 that the predicted probabilities for each of the models differ little between the one-level and two-level formulations. A small difference is observed when, for the two-level models, the random effects are taken into account in calculating the probabilities. However, Table 7.4 shows that for three of the seven models (2,3; 3,4; and 5,6) the two-level model is not significantly different from the one-level model. This raises the issue of whether or not to account for the random intercepts in any of the models. The aim of this analysis is to calculate the marginal predicted probabilities of panel drop-out given the characteristics of the heads of household, rather than to provide inference on individual effects. Hence, the choice would be for the one-level model. Random effects models would be of interest if the probabilities of panel drop-out were thought to vary between clusters, and also to control for the effects of the data hierarchy on these probabilities. By retaining the metropolitan region variable in the models, the regional effects are being controlled for. In addition, given the similarity of the predicted probabilities, and in order to maintain simplicity and consistency between the different models, the probabilities from the one-level model formulation are chosen at this stage to be the weight adjustments.

Table 7.3: Summary of the Predicted Probabilities

Model  Formulation                   Min     Mean    Median  Max
1,2    Two-level, fixed only         0.9689  0.9963  0.9975  0.9998
       Two-level, fixed and random   0.7052  0.9958  0.9977  0.9998
       One-level                     0.9399  0.9916  0.9940  0.9995
2,3    Two-level, fixed only         0.9732  0.9949  0.9962  0.9995
       Two-level, fixed and random   0.9625  0.9949  0.9961  0.9995
       One-level                     0.9680  0.9938  0.9954  0.9993
3,4    Two-level, fixed only         0.8530  0.9936  0.9966  0.9999
       Two-level, fixed and random   0.8530  0.9936  0.9966  0.9999
       One-level                     0.8530  0.9936  0.9966  0.9999
4,5    Two-level, fixed only         0.7523  0.9393  0.9480  0.9770
       Two-level, fixed and random   0.5272  0.9373  0.9483  0.9841
       One-level                     0.7241  0.9278  0.9370  0.9717
5,6    Two-level, fixed only         0.9885  0.9954  0.9966  0.9998
       Two-level, fixed and random   0.9641  0.9953  0.9965  0.9998
       One-level                     0.9827  0.9931  0.9948  0.9996
6,7    Two-level, fixed only         0.9727  0.9969  0.9977  0.9995
       Two-level, fixed and random   0.7007  0.9959  0.9979  0.9996
       One-level                     0.9375  0.9905  0.9923  0.9982
7,8    Two-level, fixed only         0.9318  0.9891  0.9919  0.9980
       Two-level, fixed and random   0.6812  0.9882  0.9924  0.9982
       One-level                     0.9003  0.9824  0.9869  0.9963
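The difference between the "fixed only" and "fixed and random" probabilities of a two-level model can be illustrated with a short sketch. It assumes that the second version simply adds a predicted cluster effect to the linear predictor before applying the inverse logit; how the cluster effects were actually predicted is not stated here, and the function names and numbers are hypothetical.

```python
import math


def inv_logit(eta):
    """Numerically stable inverse logit."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    z = math.exp(eta)
    return z / (1.0 + z)


def predicted_probabilities(linear_predictor, cluster_effect):
    """Return (fixed only, fixed and random) response probabilities.

    linear_predictor: fixed part x'beta for one head of household.
    cluster_effect: predicted random intercept of the unit's PSU
        (hypothetical input, for example an empirical Bayes prediction).
    """
    fixed_only = inv_logit(linear_predictor)
    fixed_and_random = inv_logit(linear_predictor + cluster_effect)
    return fixed_only, fixed_and_random


# A unit in a PSU with a strongly negative predicted intercept.
fixed, both = predicted_probabilities(3.0, -2.0)
print(round(fixed, 4), round(both, 4))  # 0.9526 0.7311
```

A large negative cluster effect pulls an individual probability well below its fixed-only value, which is consistent with the lower minima in the "fixed and random" rows while the means and medians barely move.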

Table 7.4: Between Cluster Variance and Goodness-of-fit Test

                                        Goodness-of-fit
Model  Formulation  σ̂²_u   SE(σ̂²_u)   -2×Log-Likelihood  LRT    Half p-value
1,2    Two-level    1.879  0.753      1029.04            8.51   0.002
       One-level                      1037.55
2,3    Two-level    0.385  0.769      827.31             0.24   0.311
       One-level                      827.55
3,4    Two-level    0.000  0.000      796.69             0.00   0.500
       One-level                      796.69
4,5    Two-level    0.490  0.109      5675.18            32.26  0.000
       One-level                      5707.44
5,6    Two-level    0.875  0.844      836.57             1.06   0.151
       One-level                      837.63
6,7    Two-level    2.526  0.706      1045.26            25.29  0.000
       One-level                      1070.56
7,8    Two-level    1.077  0.341      1737.48            14.89  0.000
       One-level                      1752.38
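The LRT and half p-value columns can be reproduced approximately from the two -2×log-likelihood values. The sketch below assumes that the half p-value is half the upper-tail probability of a chi-square distribution with one degree of freedom, the usual convention when the variance being tested lies on the boundary of its parameter space. The function name is hypothetical, and small discrepancies from the table are to be expected because the inputs are rounded.

```python
import math


def half_p_value(neg2ll_one_level, neg2ll_two_level):
    """Likelihood ratio test of sigma^2_u = 0 with a halved p-value.

    Returns (LRT statistic, half p-value).
    """
    # Difference in -2 log-likelihood; floored at zero against rounding.
    lrt = max(neg2ll_one_level - neg2ll_two_level, 0.0)
    # Upper tail of chi-square(1): P(X > x) = erfc(sqrt(x / 2)).
    p_chi2_1 = math.erfc(math.sqrt(lrt / 2.0))
    return lrt, 0.5 * p_chi2_1


# Model 1,2 from Table 7.4.
lrt, p = half_p_value(1037.55, 1029.04)
print(round(lrt, 2), round(p, 3))  # 8.51 0.002
```

For model 3,4 the two likelihoods coincide, so the statistic is zero and the half p-value is 0.5, as in the table.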

In document Methods for analysing complex panel data using multilevel models with an application to the Brazilian labour force survey (Page 168-173)