• No results found

The Chow Test

In document Dougherty Introduction to Econometrics (Page 177-181)

DUMMY VARIABLES

6.4 The Chow Test

It sometimes happens that your sample of observations consists of two or more subsamples, and you are uncertain about whether you should run one combined regression or separate regressions for each subsample. Actually, in practice the choice is not usually as stark as this, because there may be some scope for combining the subsamples, using appropriate dummy and slope dummy variables to relax the assumption that all the coefficients must be the same for each subsample. This is a point to which we shall return.

Suppose that we have a sample consisting of two subsamples and that you are wondering whether to combine them in a pooled regression, P¸or to run separate regressions, A and B. We will denote the residual sums of squares for the subsample regressions RSSA and RSSB. We will denote RSS andAP

P

RSS the sum of the squares of the residuals in the pooled regression for the observations belongingB

to the two subsamples. Since the subsample regressions minimize RSS for their observations, they must fit them as least as well as, and generally better than, the pooled regression. Thus RSSARSSAP andRSSBRSSBP, and so (RSSA + RSSB)≤ RSSP, where RSSP, the total sum of the squares of the residuals in the pooled regression, is equal to the sum of RSS and AP RSS .BP

Equality between RSSP and (RSSA + RSSB) will occur only when the regression coefficients for the pooled and subsample regressions coincide. In general, there will be an improvement (RSSP – RSSARSSB) when the sample is split up. There is a price to pay, in that k extra degrees of freedom are used up, since instead of k parameters for one combined regression we now have to estimate 2k in all. After breaking up the sample, we are still left with (RSSA + RSSB) (unexplained) sum of squares of the residuals, and we have n – 2k degrees of freedom remaining.

We are now in a position to see whether the improvement in fit when we split the sample is significant. We use the F statistic

)

which is distributed with k and n – 2k degrees of freedom under the null hypothesis of no significant improvement in fit.

We will illustrate the test with reference to the school cost function data, making a simple distinction between regular and occupational schools. We need to run three regressions. In the first we regress COST on N using the whole sample of 74 schools. We have already done this in Section 6.1. This is the pooled regression. We make a note of RSS for it, 8.9160×1011. In the second and

Figure 6.6. Pooled and subsample regression lines

. reg COST N if OCC==0

Source | SS df MS Number of obs = 40 ---+--- F( 1, 38) = 13.53 Model | 4.3273e+10 1 4.3273e+10 Prob > F = 0.0007 Residual | 1.2150e+11 38 3.1973e+09 R-squared = 0.2626 ---+--- Adj R-squared = 0.2432 Total | 1.6477e+11 39 4.2249e+09 Root MSE = 56545 COST | Coef. Std. Err. t P>|t| [95% Conf. Interval]

N | 152.2982 41.39782 3.679 0.001 68.49275 236.1037 _cons | 51475.25 21599.14 2.383 0.022 7750.064 95200.43 ---. reg COST N if OCC==1

Source | SS df MS Number of obs = 34 ---+--- F( 1, 32) = 55.52 Model | 6.0538e+11 1 6.0538e+11 Prob > F = 0.0000 Residual | 3.4895e+11 32 1.0905e+10 R-squared = 0.6344 ---+--- Adj R-squared = 0.6229 Total | 9.5433e+11 33 2.8919e+10 Root MSE = 1.0e+05 COST | Coef. Std. Err. t P>|t| [95% Conf. Interval]

N | 436.7769 58.62085 7.451 0.000 317.3701 556.1836 _cons | 47974.07 33879.03 1.416 0.166 -21035.26 116983.4

---third we run the same regression for the two subsamples of regular and occupational schools separately and again make a note of RSS. The regression lines are shown in Figure 6.6.

RSS is 1.2150×1011 for the regular schools and 3.4895×1011 for the occupational schools. The total RSS from the subsample regressions is therefore 4.7045×1011. It is lower than RSS for the pooled regression because the subsample regressions fit their subsamples better than the pooled

COST

N Occupational

schools regression

Regular schools regression

Pooled regression

0 100000 200000 300000 400000 500000 600000

0 200 400 600 800 1000 1200

Occupational schools Regular schools

regression. The question is whether the difference in the fit is significant, and we test this with an F test. The numerator is the improvement in fit on splitting the sample, divided by the cost (having to estimate two sets of parameters instead of only one). In this case it is (8.9160 – 4.7045)×1011 divided by 2 (we have had to estimate two intercepts and two slope coefficients, instead of only one of each).

The denominator is the joint RSS remaining after splitting the sample, divided by the joint number of degrees of freedom remaining. In this case it is 4.7045×1011 divided by 70 (74 observations, less four degrees of freedom because two parameters were estimated in each equation). When we calculate the F statistic the 1011 factors cancel out and we have

1 . 70 31 10 7045 . 4

2 10 2115 . ) 4 70 , 2

( 11

11 =

×

= ×

F (6.38)

The critical value of F(2,70) at the 0.1 percent significance level is a little below 7.77, the critical value for F(2,60), so we come to the conclusion that there is a significant improvement in the fit on splitting the sample and that we should not use the pooled regression.

Relationship between the Chow Test and the F Test of the Explanatory Power of a Set of Dummy Variables

In this chapter we have used both dummy variables and a Chow test to investigate whether there are significant differences in a regression model for different categories of a qualitative characteristic.

Could the two approaches have led to different conclusions? The answer is no, provided that a full set of dummy variables for the qualitative characteristic has been included in the regression model, a full set being defined as an intercept dummy, assuming that there is an intercept in the model, and a slope dummy for each of the other variables. The Chow test is then equivalent to an F test of the explanatory power of the dummy variables as a group.

To simplify the discussion, we will suppose that there are only two categories of the qualitative characteristic, as in the example of the cost functions for regular and occupational schools. Suppose that you start with the basic specification with no dummy variables. The regression equation will be that of the pooled regression in the Chow test, with every coefficient a compromise for the two categories of the qualitative variable. If you then add a full set of dummy variables, the intercept and the slope coefficients can be different for the two categories. The basic coefficients will be chosen so as to minimize the sum of the squares of the residuals relating to the reference category, and the intercept dummy and slope dummy coefficients will be chosen so as to minimize the sum of the squares of the residuals for the other category. Effectively, the outcome of the estimation of the coefficients is the same as if you had run separate regressions for the two categories.

In the school cost function example, the implicit cost functions for regular and operational schools with a full set of dummy variables (in this case just an intercept dummy and a slope dummy for N), shown in Figure 6.5, are identical to the cost functions for the subsample regressions in the Chow test, shown in Figure 6.6. It follows that the improvement in the fit, as measured by the reduction in the residual sum of squares, when one adds the dummy variables to the basic specification is identical to the improvement in fit on splitting the sample and running subsample regressions. The cost, in terms of degrees of freedom, is also the same. In the dummy variable approach you have to add an intercept dummy and a slope dummy for each variable, so the cost is k if there are k – 1 variables in the model. In the Chow test, the cost is also k because you have to estimate 2k parameters

instead of k when you split the sample. Thus the numerator of the F statistic is the same for both tests.

The denominator is also the same because in both cases it is the residual sum of squares for the subsample regressions divided by n – 2k. In the case of the Chow test, 2k degrees of freedom are used up when fitting the separate regressions. In the case of the dummy variable group test, k degrees of freedom are used up when estimating the original intercept and slope coefficients, and a further k degrees of freedom are used up estimating the intercept dummy and the slope dummy coefficients.

What are the advantages and disadvantages of the two approaches? The Chow test is quick. You just run the three regressions and compute the test statistic. But it does not tell you how the functions differ, if they do. The dummy variable approach involves more preparation because you have to define a dummy variable for the intercept and for each slope coefficient. However, it is more informative because you can perform t tests on the individual dummy coefficients and they may indicate where the functions differ, if they do.

Exercises

6.16* Are educational attainment functions are different for males and females? Using your EAEF data set, regress S on ASVABC, ETHBLACK, ETHHISP, SM, and SF (do not include MALE).

Repeat the regression using only the male respondents. Repeat it again using only the female respondents. Perform a Chow test.

6.17 Are earnings functions different for males and females? Using your EAEF data set, regress LGEARN on S, ASVABC, ETHBLACK, and ETHHISP (do not include MALE). Repeat the regression using only the male respondents. Repeat it again using only the female respondents.

Perform a Chow test.

6.18* Are there differences in male and female educational attainment functions? This question has been answered by Exercise 6.16 but nevertheless it is instructive to investigate the issue using the dummy variable approach. Using your EAEF data set, define the following slope dummies combining MALE with the parental education variables:

MALESM = MALE*SM MALESF = MALE*SF

and regress S on ETHBLACK, ETHHISP, ASVABC, SM, SF, MALE, MALEBLAC, MALEHISP (defined in Exercise 6.15), MALEASVC (defined in Exercise 6.12), MALESM, and MALESF.

Next regress S on ETHBLACK, ETHHISP, ASVABC, SM, and SF only. Perform an F test of the joint explanatory power of MALE and the slope dummy variables as a group (verify that the F statistic is the same as in Exercise 6.16) and perform t tests on the coefficients of the slope dummy variables in the first regression.

6.19 Where are the differences in male and female earnings functions? Using your EAEF data set, regress LGEARN on S, ASVABC, ETHBLACK, ETHHISP, MALE, MALES, MALEASVC, MALEBLAC, and MALEHISP. Next regress LGEARN on S, ASVABC, ETHBLACK, and ETHHISP only. Calculate the correlation matrix for MALE and the slope dummies. Perform an F test of the joint explanatory power of MALE and the slope dummies (verify that the F statistic is the same as in Exercise 6.17) and perform t tests on the coefficients of the dummy variables.

 C. Dougherty 2001. All rights reserved. Copies may be made for personal use. Version of 01.05.01.

7

SPECIFICATION OF

In document Dougherty Introduction to Econometrics (Page 177-181)