Choice of Function: Box-Cox Tests - TRANSFORMATIONS OF VARIABLES

5.5 Choice of Function: Box-Cox Tests

The possibility of fitting nonlinear models, either by means of a linearizing transformation or by the use of a nonlinear regression algorithm, greatly increases the flexibility of regression analysis, but it also makes your task as a researcher more complex. You have to ask yourself whether you should start off with a linear relationship or a nonlinear one, and if the latter, what kind.

A graphical inspection, using the technique described in Section 4.2 in the case of multiple regression analysis, might help you decide. In the illustration in Section 5.1, it was obvious that the relationship was nonlinear, and it should not have taken much effort to discover that an equation of the form (5.2) would give a good fit. Usually, however, the issue is not so clear-cut. It often happens that several different nonlinear forms might approximately fit the observations if they lie on a curve.

When considering alternative models with the same specification of the dependent variable, the selection procedure is straightforward. The most sensible thing to do is to run regressions based on alternative plausible functions and choose the function that explains the greatest proportion of the variance of the dependent variable. If two or more functions are more or less equally good, you should present the results of each. Looking again at the illustration in Section 5.1, you can see that the linear function explained 69 percent of the variance of Y, whereas the hyperbolic function (5.2) explained 97 percent. In this instance we have no hesitation in choosing the latter.
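The comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data are hypothetical values lying close to a hyperbola of the form Y = β1 + β2(1/X), not the data of Section 5.1, and `r_squared` is a helper written for this sketch. Because both regressions have the same dependent variable, their R² values can be compared directly.

```python
def r_squared(x, y):
    """R-squared from a simple OLS regression of y on x (with intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    tss = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - rss / tss

# Hypothetical observations scattered around Y = 12 - 10/X (an assumption).
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
noise = [0.1, -0.1, 0.05, -0.05, 0.1, -0.1, 0.05, -0.05, 0.1, -0.1]
y = [12 - 10 / xi + e for xi, e in zip(x, noise)]

# Same dependent variable in both fits, so R-squared is directly comparable.
r2_linear = r_squared(x, y)                          # Y regressed on X
r2_hyperbolic = r_squared([1 / xi for xi in x], y)   # Y regressed on Z = 1/X

best = "hyperbolic" if r2_hyperbolic > r2_linear else "linear"
print(f"linear R2 = {r2_linear:.3f}, hyperbolic R2 = {r2_hyperbolic:.3f}")
print(f"preferred function: {best}")
```

If the two R² values were close, the advice in the text would be to report both fits rather than to pick one mechanically.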

However, when alternative models employ different functional forms for the dependent variable, the problem of model selection becomes more complicated because you cannot make direct comparisons of R² or the sum of the squares of the residuals. In particular – and this is the most common example of the problem – you cannot compare these statistics for linear and logarithmic dependent variable specifications.

For example, in Section 2.6, the linear regression of earnings on highest grade completed has an R² of 0.104, and RSS was 34,420. For the semi-logarithmic version in Section 5.2, the corresponding figures are 0.141 and 132. RSS is much smaller for the logarithmic version, but this means nothing at all. The values of LGEARN are much smaller than those of EARNINGS, so it is hardly surprising that the residuals are also much smaller. Admittedly R² is unit-free, but it is referring to different concepts in the two equations. In one equation it is measuring the proportion of the variance of earnings explained by the regression, and in the other it is measuring the proportion of the variance of the logarithm of earnings explained. If R² is much greater for one model than for the other, you would probably be justified in selecting it without further fuss. But if R² is similar for the two models, simple eyeballing will not do.
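A small numerical sketch may help to show why the two residual sums of squares cannot be compared. The schooling and earnings figures below are hypothetical (they are not the EAEF data), and `ols_rss` is a helper written for this illustration. Measuring earnings in cents instead of dollars multiplies RSS in the linear model by 100², whereas in the logarithmic model it merely shifts the intercept and leaves RSS unchanged, so the relative size of the two RSS values depends on the arbitrary choice of units.

```python
import math

def ols_rss(x, y):
    """Residual sum of squares from a simple OLS regression of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))

# Hypothetical years of schooling and hourly earnings in dollars.
schooling = [8, 10, 12, 12, 13, 14, 16, 16, 18, 20]
earnings = [6.5, 9.0, 8.0, 14.0, 11.0, 19.0, 15.0, 28.0, 22.0, 40.0]

rss_linear = ols_rss(schooling, earnings)
rss_log = ols_rss(schooling, [math.log(e) for e in earnings])

# Re-express earnings in cents: the linear RSS is multiplied by 100**2,
# the logarithmic RSS does not change at all.
rss_linear_cents = ols_rss(schooling, [100 * e for e in earnings])
rss_log_cents = ols_rss(schooling, [math.log(100 * e) for e in earnings])

print(f"dollars: RSS linear = {rss_linear:.2f}, RSS log = {rss_log:.4f}")
print(f"cents:   RSS linear = {rss_linear_cents:.2f}, RSS log = {rss_log_cents:.4f}")
```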

The standard procedure under these circumstances is to perform what is known as a Box-Cox test (Box and Cox, 1964). If you are interested only in comparing models using Y and log Y as the dependent variable, you can use a version developed by Zarembka (1968). It involves scaling the observations on Y so that the residual sums of squares in the linear and logarithmic models are rendered directly comparable. The procedure has the following steps:

1. You calculate the geometric mean of the values of Y in the sample. This is equal to the exponential of the mean of log Y, so it is easy to calculate:

$$e^{\frac{1}{n}\sum_{i=1}^{n}\log Y_i} = e^{\frac{1}{n}\log(Y_1 \times \dots \times Y_n)} = e^{\log (Y_1 \times \dots \times Y_n)^{1/n}} = (Y_1 \times \dots \times Y_n)^{1/n} \tag{5.35}$$

2. You scale the observations on Y by dividing by this figure. So

Y_i^* = Y_i / geometric mean of Y, (5.36)

where Y_i^* is the scaled value in observation i.

3. You then fit the linear model using Y* instead of Y as the dependent variable, and the logarithmic model using log Y* instead of log Y, leaving the models otherwise unchanged.

The residual sums of squares of the two regressions are now comparable, and the model with the lower sum is providing the better fit.

4. To see if one model is providing a significantly better fit, you calculate (n/2) log Z where Z is the ratio of the residual sums of squares in the scaled regressions and n is the number of observations, and take the absolute value (that is, ignore a minus sign if present). Under the null hypothesis that there is no difference, this statistic is distributed as a χ² (chi-squared) statistic with 1 degree of freedom. If it exceeds the critical level of χ² at the chosen significance level, you conclude that there is a significant difference in the fit.
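The four steps can be sketched as follows. This is a minimal illustration, not a definitive implementation: it assumes a single explanatory variable, the data are hypothetical, and the function names `ols_rss`, `geometric_mean`, and `zarembka_test` are introduced here for the purpose. The critical value 3.84 is that of χ² with 1 degree of freedom at the 5 percent level.

```python
import math

def ols_rss(x, y):
    """Residual sum of squares from a simple OLS regression of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))

def geometric_mean(y):
    """Step 1: the exponential of the mean of log Y."""
    return math.exp(sum(math.log(yi) for yi in y) / len(y))

def zarembka_test(x, y):
    """Return (RSS linear, RSS logarithmic, test statistic) for scaled Y."""
    gm = geometric_mean(y)
    y_star = [yi / gm for yi in y]                        # step 2: scale Y
    rss_linear = ols_rss(x, y_star)                       # step 3: linear model
    rss_log = ols_rss(x, [math.log(v) for v in y_star])   # step 3: log model
    n = len(y)
    statistic = abs((n / 2) * math.log(rss_linear / rss_log))  # step 4
    return rss_linear, rss_log, statistic

# Hypothetical years of schooling and hourly earnings (illustrative only).
schooling = [8, 10, 12, 12, 13, 14, 16, 16, 18, 20]
earnings = [6.5, 9.0, 8.0, 14.0, 11.0, 19.0, 15.0, 28.0, 22.0, 40.0]

rss_lin, rss_log, stat = zarembka_test(schooling, earnings)
CRITICAL_5PCT = 3.84  # chi-squared, 1 degree of freedom, 5 percent level
better = "logarithmic" if rss_log < rss_lin else "linear"
print(f"RSS linear = {rss_lin:.4f}, RSS log = {rss_log:.4f}")
print(f"statistic = {stat:.2f}, better fit: {better}, "
      f"significant at 5 percent: {stat > CRITICAL_5PCT}")
```

With more than one explanatory variable, the same steps apply; only the regression routine would need to be replaced by a multiple regression.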

Example

The test will be performed for the alternative specifications of the earnings function. The mean value of LGEARN is 2.430133. The scaling factor is therefore exp(2.430133) = 11.3604. The residual sum of squares in a regression of the Zarembka-scaled earnings on S is 266.7; the residual sum of squares in a regression of the logarithm of Zarembka-scaled earnings is 132.1. Hence the test statistic is

$$\frac{570}{2}\log_e \frac{266.7}{132.1} = 200.2 \tag{5.37}$$

The critical value of χ² with 1 degree of freedom at the 0.1 percent level is 10.8. Hence there is no doubt, according to this test, that the semi-logarithmic specification provides a better fit.
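The arithmetic of the example can be checked with a short script. The figures are those quoted above, with n = 570 taken from (5.37). The p-value line is an addition made here for illustration: it uses the fact that, for a χ² variable with 1 degree of freedom, the upper tail probability at x equals erfc(√(x/2)).

```python
import math

n = 570                  # number of observations, from equation (5.37)
mean_lgearn = 2.430133   # mean of LGEARN quoted in the text
rss_linear = 266.7       # RSS, Zarembka-scaled linear regression
rss_log = 132.1          # RSS, Zarembka-scaled logarithmic regression

# Scaling factor: geometric mean of EARNINGS = exp(mean of LGEARN).
scaling_factor = math.exp(mean_lgearn)

# Test statistic: absolute value of (n/2) log Z, Z being the ratio of the RSS.
statistic = abs((n / 2) * math.log(rss_linear / rss_log))

# Upper tail probability of chi-squared with 1 degree of freedom.
p_value = math.erfc(math.sqrt(statistic / 2))

print(f"scaling factor = {scaling_factor:.4f}")
print(f"test statistic = {statistic:.1f}")
print(f"exceeds the 0.1 percent critical value of 10.8: {statistic > 10.8}")
```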

Note: the Zarembka-scaled regressions are solely for deciding which model you prefer. You should not pay any attention to their coefficients, only to their residual sums of squares. You obtain the coefficients by fitting the unscaled version of the preferred model.

Exercises

5.6 Perform a Box-Cox test parallel to that described in this section using your EAEF data set.

5.7 Linear and logarithmic Zarembka-scaled regressions of expenditure on food at home on total household expenditure were fitted using the CES data set in Section 5.2. The residual sums of squares were 225.1 and 184.6, respectively. The number of observations was 868, the household reporting no expenditure on food at home being dropped. Perform a Box-Cox test and state your conclusion.

5.8 Perform a Box-Cox test for your commodity in the CES data set, dropping households reporting no expenditure on your commodity.

Appendix 5.1

A More General Box-Cox Test

(Note: This section contains relatively advanced material that can safely be omitted at a first reading.)

The original Box-Cox procedure is more general than the version described in Section 5.5. Box and Cox noted that Y – 1 and log Y are special cases of the function (Y^λ – 1)/λ, Y – 1 being the function when λ is equal to 1, log Y being the (limiting form of the) function as λ tends to 0. There is no reason to suppose that either of these values of λ is optimal, and hence it makes sense to try a range of values and see which yields the minimum value of RSS (after performing the Zarembka scaling). This exercise is known as a grid search. There is no purpose-designed facility for it in the typical regression application, but nevertheless it is not hard to execute. If you are going to try 10 values of λ, you generate within the regression application 10 new dependent variables using the functional form and the different values of λ, after first performing the Zarembka scaling. You then regress each of these separately on the explanatory variables. Table 5.4 gives the results for food expenditure at home, using the CES data set, for various values of λ. The regressions were run with disposable personal income being transformed in the same way as Y, except for the Zarembka scaling. This is not necessary; you can keep the right-side variable or variables in linear form if you wish, if you think this appropriate, or you could execute a simultaneous, separate grid search for a different value of λ for them.

TABLE 5.4

λ      RSS        λ      RSS
1.0    225.1      0.4    176.4
0.9    211.2      0.3    175.5
0.8    199.8      0.2    176.4
0.7    190.9      0.1    179.4
0.6    184.1      0.0    184.6
0.5    179.3

The results indicate that the optimal value of λ is about 0.3. In addition to obtaining a point estimate for λ, one may also obtain a confidence interval, but the procedure is beyond the level of this text. (Those interested should consult Spitzer, 1982.)
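A grid search of this kind might be sketched as follows. Several assumptions are made for the illustration: there is a single explanatory variable, it is kept in linear form rather than being transformed, and the data are hypothetical, constructed so that the square root of Y is very nearly linear in X (so the search should settle on λ = 0.5). The function names `box_cox` and `grid_search` are introduced here and do not refer to any particular regression application.

```python
import math

def ols_rss(x, y):
    """Residual sum of squares from a simple OLS regression of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))

def box_cox(y, lam):
    """(y**lam - 1) / lam, with log y as the limiting form at lam = 0."""
    if lam == 0:
        return math.log(y)
    return (y ** lam - 1) / lam

def grid_search(x, y, lambdas):
    """Return {lambda: RSS}, transforming Y after Zarembka scaling."""
    gm = math.exp(sum(math.log(yi) for yi in y) / len(y))  # geometric mean
    y_star = [yi / gm for yi in y]                         # Zarembka scaling
    return {lam: ols_rss(x, [box_cox(v, lam) for v in y_star])
            for lam in lambdas}

# Hypothetical data: sqrt(Y) = 2 + 0.3 X plus a small alternating disturbance.
x = list(range(1, 21))
wobble = [0.005 if i % 2 == 0 else -0.005 for i in range(20)]
y = [(2 + 0.3 * xi + w) ** 2 for xi, w in zip(x, wobble)]

lambdas = [i / 10 for i in range(11)]        # 0.0, 0.1, ..., 1.0
rss_by_lambda = grid_search(x, y, lambdas)
best_lambda = min(rss_by_lambda, key=rss_by_lambda.get)

for lam in lambdas:
    print(f"lambda = {lam:.1f}  RSS = {rss_by_lambda[lam]:.6f}")
print(f"lambda with minimum RSS: {best_lambda}")
```

A finer grid around the best value would sharpen the point estimate; as noted above, a confidence interval requires a separate procedure.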

In document Dougherty Introduction to Econometrics (Page 154-158)