Models Based on Genuine Panel Data Estimators

Chapter 4 Measurement Error and Linear Static Fixed Effect Model

4.4 Empirical Results from Static Car Ownership Model

4.4.2 Models Based on Genuine Panel Data Estimators

The Weighted Least Square estimator reported in the previous section is equivalent to the Instrumental Variable estimator applied to individual household data. This interpretation poses two problems in the current study, one for analysis and one for forecasting. For analysis, it treats the number of cars owned or used by each household as linear, i.e. the cause of the increase from zero to one car is taken to be the same as that of any other unit increase. This is problematic because car ownership is in essence a discrete choice made by the household, with the decision to own the first car differing significantly from decisions on subsequent cars. For forecasting, a model based on individual household data adds complications. The explanatory variables can only be predicted at the aggregate cohort level, so any non-linear transformation (log, square, etc.) can only be applied to the average prediction (e.g. the log of average household income). When the model is estimated on the average of non-linearly transformed data (e.g. the average of log household income), aggregation bias arises.
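The source of the aggregation bias can be seen in a small numeric sketch. The incomes below are hypothetical and chosen only for illustration; the point is that the log of the cohort average (which a forecaster can compute) is not the same quantity as the cohort average of the logs (which estimation on individually transformed data uses).

```python
import math

# Hypothetical household incomes within one cohort (illustrative, not from the survey).
incomes = [150.0, 220.0, 310.0, 480.0, 900.0]

mean_income = sum(incomes) / len(incomes)

# Transformation applied to the cohort average: available at the forecasting stage.
log_of_mean = math.log(mean_income)

# Average of the individually transformed data: what the individual-level model uses.
mean_of_log = sum(math.log(v) for v in incomes) / len(incomes)

# By Jensen's inequality the gap is positive unless all incomes are identical.
gap = log_of_mean - mean_of_log

print(f"log of mean = {log_of_mean:.4f}")
print(f"mean of log = {mean_of_log:.4f}")
print(f"gap         = {gap:.4f}")
```

The wider the income spread within a cohort, the larger the gap, so a model fitted on one quantity cannot simply be fed the other.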

Given these problems, it is important to interpret the pseudo panel model from a different perspective. Recall that we only include cohorts with a sufficiently large number of observations, and we believe the measurement error can be ignored in this case. This implies that the sample averages of each cohort are unbiased estimates of the true cohort means and, as a result, the pseudo panel can be estimated as a genuine panel with each cohort as the unit of observation. In this case, any non-linear transformation of the data is done at the cohort level, and the panel data fixed effect and random effect models can be estimated using standard techniques.

A systematic specification search, similar to that carried out for the WLS estimator models, has been conducted. Three functional forms have been tested for the Fixed Effect Model: linear, semi-log linear (in various explanatory variables) and double-log linear. The double-log form implies constant income and price elasticities (note that it cannot be estimated using the WLS estimator because some households own no car). For each functional form, the explanatory variables include average real household disposable income (or its log); household structure variables (number of children, number of working persons and household size; alternatively, proportions of household types); household location variables (proportions of households living in four areas; alternatively, proportions of households living in metropolitan areas and in the least populated rural areas); indices of real car purchase price and real car running costs; and finally, a second-order polynomial in the average age of the household head.

Figure 4-2 Fixed effects in the linear model

[Figure: bar chart titled "Fixed Cohort Effect in Linear Model". Horizontal axis: cohorts C1 to C16; vertical axis: fixed effect, from -1.4 to 0.0.]

One feature is prevalent across most of the fixed effect models examined here: a linear trend in the fixed cohort effects. Figure 4.2 shows the fixed effects in the linear model in which the household structure variables are the proportions of seven household types and the location variables are “Met” and “Rural”. The fixed effects follow a linear trend for all cohorts except the youngest one. It should be noted that the fixed effects are highly correlated with one another and that some of them are not statistically significant.

Consequently, restricted Fixed Effect Models have been estimated, with the sixteen cohort dummy variables replaced by a cohort number variable (ID) and a dummy for the youngest cohort (cohort 16). This implies that the fixed effects are linear across cohorts except for the youngest cohort, which is similar to the “Generation Model” of Dargay and Vythoulkas (1999). For each of the unrestricted and restricted FE models, a standard set of tests has been conducted: the RESET misspecification test, the Durbin-Watson autocorrelation test and the White heteroscedasticity test. The RESET test does not indicate misspecification for most of the unrestricted models. While the same test suggests misspecification for all the restricted semi-log and double-log models, it does not reject the null hypothesis of no misspecification for the linear models at any conventional significance level.

Table 4-3 Linear Model: Unrestricted and restricted Fixed Effect Model

| Variable | Linear, Unrestricted: Coeff | t-ratio | Linear, Restricted: Coeff | t-ratio | Semi-Log Linear, Unrestricted: Coeff | t-ratio | Semi-Log Linear, Restricted: Coeff | t-ratio |
|---|---|---|---|---|---|---|---|---|
| Constant | | | -1.0872 | -4.12 | | | -1.7978 | -2.41 |
| INC | 0.0006 | 3.80 | 0.0009 | 6.64 | | | | |
| LNINC | | | | | 0.2163 | 4.82 | 0.2732 | 6.26 |
| HH2 | -0.1553 | -0.95 | -0.4209 | -2.92 | -0.1478 | -0.91 | -0.4569 | -3.13 |
| HH3 | -0.7297 | -3.47 | -0.5625 | -2.98 | -0.6830 | -3.28 | -0.5814 | -3.04 |
| HH4 | 0.4722 | 3.54 | 0.1230 | 0.97 | 0.4321 | 3.29 | 0.0595 | 0.47 |
| HH5 | 0.5961 | 3.76 | 0.2067 | 1.42 | 0.4827 | 3.00 | 0.0867 | 0.58 |
| HH6 | 0.7194 | 5.20 | 0.3540 | 2.83 | 0.6187 | 4.37 | 0.2976 | 2.29 |
| HH7 | 1.0214 | 5.59 | 0.5644 | 3.27 | 0.9863 | 5.44 | 0.5747 | 3.29 |
| HH8 | 0.9223 | 5.30 | 0.4725 | 2.82 | 0.8430 | 4.86 | 0.4872 | 2.87 |
| MET | -0.2434 | -2.37 | -0.2537 | -2.72 | -0.2116 | -2.08 | -0.2496 | -2.63 |
| RURAL | 0.1494 | 1.34 | 0.1548 | 1.53 | 0.1201 | 1.10 | 0.1699 | 1.66 |
| PRICE | -0.0020 | -2.39 | -0.0008 | -0.91 | | | | |
| RUNCST | -0.0023 | -4.34 | -0.0012 | -1.99 | | | | |
| LNPRICE | | | | | -0.1256 | -1.53 | -0.0417 | -0.46 |
| LNRUNCST | | | | | -0.1910 | -3.46 | -0.1202 | -1.89 |
| AGE | 0.0455 | 13.43 | 0.0381 | 12.28 | 0.0439 | 12.91 | 0.0375 | 11.74 |
| AGSQ | -0.0003 | -9.06 | -0.0002 | -8.31 | -0.0003 | -8.66 | -0.0002 | -7.64 |
| COHORT | | | 0.0706 | 7.53 | | | 0.0736 | 8.01 |
| C1 | -1.2248 | -5.06 | | | -1.1537 | -1.67 | | |
| C2 | -1.1449 | -4.87 | | | -1.0729 | -1.57 | | |
| C3 | -1.1175 | -4.85 | | | -1.0443 | -1.53 | | |
| C4 | -1.0498 | -4.65 | | | -0.9810 | -1.44 | | |
| C5 | -0.9947 | -4.49 | | | -0.9290 | -1.37 | | |
| C6 | -0.9192 | -4.25 | | | -0.8563 | -1.27 | | |
| C7 | -0.8102 | -3.83 | | | -0.7479 | -1.11 | | |
| C8 | -0.7164 | -3.47 | | | -0.6525 | -0.97 | | |
| C9 | -0.5999 | -3.00 | | | -0.5312 | -0.79 | | |
| C10 | -0.5089 | -2.61 | | | -0.4390 | -0.65 | | |
| C11 | -0.4315 | -2.28 | | | -0.3654 | -0.54 | | |
| C12 | -0.3352 | -1.83 | | | -0.2733 | -0.41 | | |
| C13 | -0.2375 | -1.34 | | | -0.1813 | -0.27 | | |
| C14 | -0.1305 | -0.76 | | | -0.0806 | -0.12 | | |
| C15 | -0.0688 | -0.41 | | | -0.0183 | -0.03 | | |
| C16 | -0.0748 | -0.46 | -0.1341 | -3.90 | -0.0258 | -0.04 | -0.1354 | -3.94 |
| RHO | | | 0.3511 | 5.97 | | | 0.3307 | 5.57 |
| Adjusted R2 | 0.986 | | 0.983 | | 0.986 | | 0.985 | |
| SSE | 0.31 | | 0.340 | | 0.30 | | 0.26 | |
| Log Likelihood | 491.70 | | 479.59 | | 495.27 | | 476.42 | |
| RESET (t-Stat) | -0.066 | | 0.74 | | -0.15 | | 2.69 | |

Note: y = A_ct in both models
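The near-linear pattern in the unrestricted cohort effects can be checked informally by fitting a straight line through the C1 to C15 estimates of the linear model in Table 4-3. The Python sketch below does this with a simple least-squares fit. It only illustrates the trend and is not the restricted Fixed Effect Model itself, which re-estimates all coefficients jointly; the helper name `ols_line` is an assumption of this sketch.

```python
# Unrestricted cohort effects C1..C15 from the linear model in Table 4-3.
# C16 (the youngest cohort) is left out because it departs from the trend.
fixed_effects = [
    -1.2248, -1.1449, -1.1175, -1.0498, -0.9947,
    -0.9192, -0.8102, -0.7164, -0.5999, -0.5089,
    -0.4315, -0.3352, -0.2375, -0.1305, -0.0688,
]
cohort_ids = list(range(1, len(fixed_effects) + 1))


def ols_line(x, y):
    """Least-squares fit of y = intercept + slope * x; returns (intercept, slope, R^2)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    ss_res = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return intercept, slope, 1.0 - ss_res / ss_tot


intercept, slope, r_squared = ols_line(cohort_ids, fixed_effects)
print(f"slope per cohort = {slope:.4f}, R^2 = {r_squared:.4f}")
```

On these inputs the fitted slope is roughly 0.086 per cohort, the same order of magnitude as the COHORT coefficients of the restricted models, and the line explains nearly all of the variation in the fifteen effects.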

For the unrestricted FE models, the Durbin-Watson test and the White test do not clearly reject the hypotheses of no autocorrelation and homoscedasticity. For the restricted models, these hypotheses are clearly rejected by the same tests, so these models have been re-estimated using Feasible Generalized Least Squares with AR(1) errors. Table 4.3 compares the results of the best-fitting unrestricted and restricted fixed effect models, in both linear and semi-log linear functional form.

The regression coefficients are quite similar across all models (except for those of the log-transformed variables). The adjusted R Square varies between 0.983 and 0.986, which indicates a good fit. In terms of model selection, the first step is to compare the unrestricted and restricted models. The likelihood ratio test based on the linear models produces a Chi Square statistic of 24.2; with 14 degrees of freedom, the hypothesis of no loss of fit is rejected at the 5% level. The same test based on the semi-log models produces a Chi Square statistic of 37.7, which rejects the hypothesis at the 1% level. In both cases, the unrestricted fixed effect models are preferred.
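The likelihood ratio statistics quoted above can be reproduced from the log likelihood values in Table 4-3 as twice the difference between the unrestricted and restricted log likelihoods. The following Python sketch does so and computes the corresponding p-values using the closed-form chi-square tail probability that holds for even degrees of freedom. The function names are assumptions of this sketch; the 14 degrees of freedom are taken from the text.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variable (exact closed form, even df only)."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form only holds for positive even df")
    half = x / 2.0
    term = 1.0
    total = 1.0
    # Sum of half**i / i! for i = 0 .. df/2 - 1.
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


def lr_test(loglik_unrestricted, loglik_restricted, df):
    """Return the likelihood ratio statistic and its chi-square p-value."""
    stat = 2.0 * (loglik_unrestricted - loglik_restricted)
    return stat, chi2_sf_even_df(stat, df)


# Log likelihood values from Table 4-3.
lr_linear, p_linear = lr_test(491.70, 479.59, 14)
lr_semilog, p_semilog = lr_test(495.27, 476.42, 14)

print(f"linear:   LR = {lr_linear:.1f}, p = {p_linear:.4f}")
print(f"semi-log: LR = {lr_semilog:.1f}, p = {p_semilog:.4f}")
```

The p-value for the linear models falls between 1% and 5%, and that for the semi-log models falls below 1%, in line with the rejection levels reported in the text.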

The differences between the linear and semi-log models are more subtle. For the linear model, the coefficients of “Price” and “RunCst” are larger (in absolute terms) than the coefficient of “Inc”. Evaluated at the median income and car ownership level, the implied income elasticity is 0.20, the implied price elasticity is -0.21 and the implied running cost elasticity is -0.25. As previous research suggests that the income elasticity is higher than the price and running cost elasticities, these results are troublesome. Consequently, the (unrestricted) semi-log model is chosen as the preferred model; its residual plot in Figure 4.3 does not show any apparent misspecification.

Based on the preferred model, the log income coefficient suggests that a 1% increase in real household disposable income increases car ownership by 0.0021 cars per household. On the other hand, a 1% decrease in car purchase price increases the average car ownership level by 0.0013 cars; a similar decrease in running costs leads to a rise in the car ownership level of 0.0019 cars. Table 4.4 shows the income and price elasticities at the low (10th percentile), median and high (90th percentile) car ownership levels.

Figure 4-3 Residual Plot of unrestricted Fixed Effect Model

[Figure: residual plot. Horizontal axis: observation number, 0 to 255; vertical axis: residual, from -0.125 to 0.100. Bars mark the mean residual and +/- 2 s(e).]

Note: X-axis is ordered by year for each cohort

Table 4-4 Income and Price Elasticity (based on Semi log, unrestricted FE model)

| Car Ownership Level | Income Elasticity | Purchase Price Elasticity | Running Cost Elasticity |
|---|---|---|---|
| Low (0.42 cars/HH) | 0.52 | -0.30 | -0.45 |
| Median (0.92 cars/HH) | 0.24 | -0.14 | -0.21 |
| High (1.25 cars/HH) | 0.17 | -0.10 | -0.15 |
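The figures in Table 4-4 are consistent with the usual elasticity formula for a semi-log model, in which the dependent variable enters in levels and the regressor in logs, so that the elasticity equals the coefficient divided by the car ownership level. The Python sketch below is a minimal check under that assumption, using the unrestricted semi-log coefficients from Table 4-3 and the three ownership levels from Table 4-4; the variable and function names are illustrative.

```python
# Unrestricted semi-log coefficients from Table 4-3 (LNINC, LNPRICE, LNRUNCST).
BETA = {"income": 0.2163, "purchase_price": -0.1256, "running_cost": -0.1910}

# Car ownership levels (cars per household) from Table 4-4.
LEVELS = {"low": 0.42, "median": 0.92, "high": 1.25}


def semilog_elasticity(beta, cars_per_household):
    """Elasticity in a model A = ... + beta * ln(x): (dA/A) / (dx/x) = beta / A."""
    return beta / cars_per_household


elasticities = {
    level: {name: semilog_elasticity(b, cars) for name, b in BETA.items()}
    for level, cars in LEVELS.items()
}

for level, row in elasticities.items():
    cells = ", ".join(f"{name} = {value:+.3f}" for name, value in row.items())
    print(f"{level:>6}: {cells}")
```

Because the coefficient is fixed while the ownership level varies, the elasticities decline in absolute value as car ownership rises, which is the pattern shown in the table.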

Regarding household structure, the higher the proportion of “large” households (households with two or three adults) in a cohort, the higher the car ownership level. The impact of the number of working persons and children is best illustrated by the two-adult households (types 4, 5 and 6). When the proportion of household type 4 (two adults, neither in work) increases by 1%, the average number of cars increases by 0.0043 per household in the cohort; when the proportion of household type 5 (two adults with a working person but no children) increases by 1%, the average number of cars increases by 0.0048; when the proportion of household type 6 (two adults with a working person and children) increases by 1%, the average number of cars increases by 0.0062. Regarding the household location variables, only “Met” is significant; its coefficient suggests that a 1% increase in the proportion of metropolitan households in a cohort results in a decrease of 0.0021 cars per household in that cohort.

Besides the Fixed Effect model, a set of Random Effect Models has been estimated. Assuming that the cohort effects are strictly uncorrelated with the explanatory variables, Random Effect models treat the cohort-specific constant terms as randomly distributed across cohorts ($\alpha + v_i$ rather than $\alpha_i$). With fewer parameters to estimate, this specification increases the degrees of freedom. However, in the context of a pseudo panel, the assumption of no correlation between the cohort effects and the regressors is likely to be violated, and the RESET test reveals misspecification in most of the random effect models tested. The presence of such correlation appears to be confirmed by Hausman’s Test, as the Chi Square statistics are large and statistically significant for all the models tested. For example, the Random Effect model corresponding to the linear model reported in Table 4.3 has a Hausman’s Test statistic of 70.9, which strongly favours the Fixed Effect model.

Finally, we investigated the issue of heterogeneity by estimating cohort-specific models. The oldest and the two youngest cohorts have to be dropped due to their small number of observations; the resulting dataset contains two cohorts with 14 observations (C2 and C14) and eleven cohorts with 19 observations (C3 to C13). To save degrees of freedom, average household characteristic variables (number of children, number of working persons and household size) are used rather than proportions of household types. The initial specification contains eleven explanatory variables, including the constant term, in each model, and each model is estimated separately by OLS. Three variables that are not significant for almost all cohorts are subsequently dropped, which leaves 6 error degrees of freedom for the models with 14 observations and 11 for those with 19 observations. Different groups of models with linear and semi-log linear functional forms have been estimated, and the results are very similar (in terms of the coefficients of the common variables and R Square). Table 4.5 reports the results of the linear models for each of the 13 cohorts.
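The degrees-of-freedom arithmetic behind these cohort-specific regressions is simple but easy to lose track of, so a short Python sketch is given here. It only restates the counts from the text (eleven initial regressors including the constant, three dropped, and the observation counts per cohort); the variable names are illustrative.

```python
# Counts taken from the text.
initial_regressors = 11  # including the constant term
dropped_regressors = 3
parameters = initial_regressors - dropped_regressors  # parameters left per cohort model

# Observations per retained cohort: C2 and C14 have 14, C3 to C13 have 19.
observations = {"C2": 14, "C14": 14}
observations.update({f"C{i}": 19 for i in range(3, 14)})

# Error degrees of freedom = observations minus estimated parameters.
error_df = {cohort: n - parameters for cohort, n in observations.items()}

print(f"cohorts retained: {len(observations)}")
print(f"error df with 14 observations: {error_df['C2']}")
print(f"error df with 19 observations: {error_df['C3']}")
```

With so few error degrees of freedom per model, individual coefficients are estimated imprecisely, which is relevant when reading the significance pattern in the cohort-specific results.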

For ease of comprehension, only the coefficients that are significant at the 10% level are reported; however, the descriptive statistics are calculated using all 13 coefficients. The last two columns of Table 4.5 refer to the standard error of regression and R Square respectively. The R Square varies from 0.871 to 0.998 among the 13 models, which indicates that the same model specification has different explanatory power for different cohorts. For the two cohort models with an R Square lower than 0.9 (Cohort 2 and Cohort 8), none of the explanatory variables is significant at the 10% level.

Table 4-5 Cohort Specific Regression Results (Linear form; 13 cohorts)

| Cohort | Inc | Child | HHSize | Met | Price | Age | AgSq | Const. | SSE | R2 |
|---|---|---|---|---|---|---|---|---|---|---|
| C2 | n.s. | n.s. | n.s. | n.s. | n.s. | n.s. | n.s. | n.s. | 0.049 | 0.876 |
| C3 | n.s. | n.s. | n.s. | -0.7390 | -0.0063 | 0.1961 | -0.0014 | -5.9712 | 0.032 | 0.961 |
| C4 | 0.0010 | n.s. | 0.3331 | -0.4776 | n.s. | 0.1561 | -0.0011 | -5.1904 | 0.027 | 0.967 |
| C5 | n.s. | -1.4233 | 0.9822 | n.s. | n.s. | 0.1657 | -0.0011 | -7.0443 | 0.020 | 0.984 |
| C6 | 0.0019 | n.s. | n.s. | n.s. | n.s. | n.s. | n.s. | n.s. | 0.037 | 0.929 |
| C7 | n.s. | n.s. | n.s. | n.s. | -0.0084 | n.s. | n.s. | 6.8813 | 0.022 | 0.974 |
| C8 | n.s. | n.s. | n.s. | n.s. | n.s. | n.s. | n.s. | n.s. | 0.038 | 0.871 |
| C9 | n.s. | -0.6406 | 0.8139 | n.s. | n.s. | n.s. | n.s. | n.s. | 0.046 | 0.920 |
| C10 | n.s. | -0.4308 | 0.4989 | 1.1212 | n.s. | n.s. | n.s. | -1.5895 | 0.031 | 0.975 |
| C11 | 0.0012 | n.s. | n.s. | n.s. | n.s. | n.s. | n.s. | n.s. | 0.023 | 0.987 |
| C12 | n.s. | n.s. | n.s. | n.s. | n.s. | 0.1397 | -0.0018 | -3.1804 | 0.029 | 0.981 |
| C13 | 0.0014 | -0.6733 | 0.6033 | n.s. | -0.0028 | 0.1353 | -0.0019 | -2.4892 | 0.014 | 0.998 |
| C14 | n.s. | n.s. | n.s. | n.s. | -0.0065 | 0.2919 | -0.0051 | -2.7124 | 0.024 | 0.995 |
| Mean | 0.0006 | -0.3446 | 0.3864 | -0.1598 | -0.0029 | 0.1368 | -0.0013 | -3.6185 | 0.030 | 0.955 |
| St Dev | 0.0006 | 0.5673 | 0.3443 | 0.4816 | 0.0030 | 0.1271 | 0.0014 | 4.9290 | 0.010 | 0.043 |
| Max | 0.0019 | 0.5705 | 0.9822 | 1.1212 | 0.0006 | 0.3937 | 0.0005 | 6.8813 | 0.049 | 0.998 |
| Min | 0.0000 | -1.4233 | -0.2079 | -0.7390 | -0.0084 | -0.0968 | -0.0051 | -14.8316 | 0.014 | 0.871 |

Note: n.s. = not significant at the 10% level.

The pairwise tests on the coefficients of the “Income” variable show no statistically significant difference between the 4 coefficients reported in Table 4.5. However, for other explanatory variables such tests show that some coefficient pairs differ at the 5% level, which is an indication of parameter heterogeneity. The hypothesis of homogeneity is also rejected by the Likelihood Ratio test of parameter homogeneity, which is based on a Random Parameter Model that we subsequently estimated.

The above discussion indicates that the assumption of parameter homogeneity in the earlier models might not be appropriate. However, it also highlights the practical constraints that the heterogeneous model faces in terms of lost degrees of freedom. Table 4.5 shows that a majority of the cohort-specific regression coefficients are not significant at the 10% level. Furthermore, several cohorts have too few observations to be included in the cohort-specific regressions in the first place. These constraints limit the practical usefulness of the heterogeneous model.

4.5 Conclusion

This chapter investigates the theoretical and empirical aspects of the linear static pseudo panel model. As the pseudo panel is a synthetic panel constructed from individual data, the first issue we address is the relationship between the estimator based on the pseudo panel and that based on individual data. It has been shown that the Weighted Least Square estimator using cohort means is equivalent to the Instrumental Variable estimator using individual data, with cohort dummy variables as instruments. However, this relationship rests on the assumption that the dependent variable is linearly related to the explanatory variables for the individual household. For the car ownership model, this assumption is hard to defend theoretically, and its appropriateness has to be tested with empirical models. If the Weighted Least Square estimator is to be used, any non-linear transformation of the data has to be performed on individuals before aggregation, and each observation in the pseudo panel has to be weighted by the square root of the number of sample observations in the cohort.
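The equivalence between the two estimators can be illustrated numerically in a deliberately simplified setting: one regressor, no intercept and no cohort effects. The data below are hypothetical. With cohort dummies as instruments, the first-stage fitted value of x for each household is its cohort mean, and the resulting IV estimate coincides with least squares on cohort means weighted by the square root of cohort size. This Python sketch is an illustration under these assumptions, not the full model of this chapter.

```python
import math

# Hypothetical individual data: (cohort label, x, y). Not taken from the survey.
data = [
    ("c1", 1.0, 2.1), ("c1", 2.0, 3.9), ("c1", 3.0, 6.2),
    ("c2", 4.0, 8.1), ("c2", 6.0, 11.8),
    ("c3", 7.0, 14.5), ("c3", 8.0, 15.9), ("c3", 9.0, 18.2), ("c3", 10.0, 19.7),
]

# Cohort size and cohort means of x and y.
groups = {}
for cohort, x, y in data:
    groups.setdefault(cohort, []).append((x, y))
stats = {
    c: (len(rows), sum(x for x, _ in rows) / len(rows), sum(y for _, y in rows) / len(rows))
    for c, rows in groups.items()
}

# WLS on cohort means: scale each cohort-mean observation by sqrt(n_c), then
# run least squares through the origin on the scaled data.
wx = [math.sqrt(n) * x_bar for n, x_bar, _ in stats.values()]
wy = [math.sqrt(n) * y_bar for n, _, y_bar in stats.values()]
beta_wls = sum(a * b for a, b in zip(wx, wy)) / sum(a * a for a in wx)

# IV on individual data with cohort dummies as instruments: the first-stage
# fitted value of x is the cohort mean, so beta = sum(x_hat * y) / sum(x_hat * x).
x_hat = [stats[cohort][1] for cohort, _, _ in data]
beta_iv = (
    sum(xh * y for xh, (_, _, y) in zip(x_hat, data))
    / sum(xh * x for xh, (_, x, _) in zip(x_hat, data))
)

print(f"WLS on cohort means: {beta_wls:.6f}")
print(f"IV on individuals:   {beta_iv:.6f}")
```

The two estimates agree up to floating-point error because both reduce to the same ratio of cohort-size-weighted sums of cohort means.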

A number of papers have addressed the problem of measurement error in the pseudo panel model. Section 2 of this chapter reviews the various Error-in-Variable Estimators (EVE) proposed in the literature, including that of Deaton (1985), which is unbiased when T → ∞; that of Verbeek and Nijman (1993), which is unbiased when C → ∞; and that of Devereux (2003), which is approximately unbiased to the order 1/(CT). The following section addresses the conditions required for the measurement error problem to be ignored. It has been shown that the way cohorts are constructed has a direct implication for the bias of the within estimator if the pseudo panel is to be estimated as a genuine panel. The cohorts should be defined so that the population cohort means of the variables concerned vary as much as possible over time. Furthermore, the number of sample observations in each cohort has to be sufficiently large to minimize sampling errors and measurement errors. These conditions appear to be met by our pseudo panel dataset constructed from the Family Expenditure Survey, which justifies ignoring the problem of measurement error in the subsequent empirical work.

We first estimate the car ownership model using the Weighted Least Square Estimator on the pseudo panel. Different formulations of the household structure variables and household location variables have been systematically tested, and the results show that they are best represented by the split into 8 household types and by aggregated location types respectively. Both the Likelihood Ratio Test and the RESET test favour the Fixed Effect model over the Pooled OLS model. Judged by the adjusted R Square, the semi-log models (whose explanatory variables include the cohort averages of log-transformed income and price variables) appear to have the best goodness of fit. However, all semi-log models have the wrong trend in the fixed effects across cohorts, implying that older cohorts have a higher tendency towards car ownership, other things being equal. This unsatisfactory result could be taken as the first hint that the empirical data do not support the use of the Weighted Least Square Estimator here.

From a theoretical point of view, the Weighted Least Square Estimator is based on the assumption of a linear economic relationship at the individual household level, which would be inappropriate for car ownership due to its discrete nature. Furthermore, the use of WLSE requires any non-linear transformation of variables to be done at the individual level, which cannot be replicated for the forecast values. These concerns prompt us to consider models based on alternative estimators. If we believe that the sample averages of each cohort are unbiased estimates of the true cohort means, the pseudo panel can be treated as a genuine panel. On this basis, various panel data estimators, including fixed effect, random effect and heterogeneous models, have been investigated.

Hausman’s Test and the RESET misspecification test both reject the Random Effect model as the preferred model. Regarding the Fixed Effect model, there is a strong linear trend in the cohort-specific constants; however, the likelihood ratio test rejects the restricted FE model as the preferred model due to a significant loss of fit. Based on the semi-log unrestricted FE model, the implied income elasticity is 0.24, the purchase price elasticity is -0.14 and the running cost elasticity is -0.21.

Despite a rather comprehensive investigation of the pseudo panel car ownership model from both theoretical and empirical perspectives, one important aspect is not covered in this chapter: the role of dynamics. The dynamic models will be considered in the following chapter.

In document The Use of Pseudo Panel Data for Forecasting Car Ownership (Page 62-71)