
4 Instrumental Variables

Unit 6.2: Instrumental Variables – Estimation Michael Malcolm

January 26, 2011

The previous section provided intuition and guidance on when instrumental variables are appropriate and strategies for locating good instruments. This section provides details on instrumental variables estimation.

1 Two Stage Least Squares – Simple Case

Let us begin with a simple single-variable case to develop intuition. The regression of interest is:

$$y_i = \beta_0 + \beta_1 x_i + u_i$$

However, it cannot be estimated directly by OLS because X is endogenous, i.e. $E(u_i \mid x_i) \neq 0$. Suppose that we have an instrument Z that satisfies our two conditions for instrumental variables: Z must be relevant in the sense that it is correlated with X, and Z must be exogenous. Mathematically, the relevance condition is $\operatorname{Corr}(X, Z) \neq 0$ and the exogeneity condition is $E(u_i \mid z_i) = 0$.

The estimation technique for instrumental variables models is two-stage least squares, abbreviated as TSLS. The first stage is to regress X on Z to obtain:

$$\hat{x}_i = \hat{\gamma}_0 + \hat{\gamma}_1 z_i$$

The second stage is to regress Y on the $\hat{x}_i$ obtained in the first-stage regression.

$$y_i = \hat{\beta}_0 + \hat{\beta}_1 \hat{x}_i$$

The basic intuition here is that the OLS regression of Y on X is invalid because of the endogeneity. However, $\hat{x}$ was reconstructed from x based on Z, which is exogenous. Basically, we are using only the piece of X that moves exogenously, via its correlation with Z, to construct exogenous variation in X. We can then use this exogenous variation to uncover the association between X and Y, which is our object of interest.

Mathematically, while the regression of Y on X cannot be estimated by OLS because $E(u_i \mid x_i) \neq 0$, using our instrument Z allows us to construct $\hat{x}$, which is exogenous: $E(u_i \mid \hat{x}_i) = 0$. Recall that the OLS formula for the slope in the single-variable case is $\hat{\beta}_1 = s_{xy} / s_x^2$. For comparison, the TSLS formula for the slope in the single-variable case is $\hat{\beta}_1 = s_{zy} / s_{xz}$. It can be demonstrated under proper assumptions that the TSLS estimates are consistent and asymptotically normal estimates of the true population parameters. They are generally not unbiased in finite samples, which is one reason TSLS is best used with reasonably large samples.
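The two stages and the ratio formula can be checked against each other on simulated data. The sketch below is only an illustration: the data-generating process, the variable name `ability` and the true slope of 0.5 are assumptions made up for the example, not anything taken from the mroz file.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Simulated data (an assumption for illustration only).
# 'ability' is unobserved, drives both x and y, and so makes x endogenous.
z = rng.normal(size=n)                       # instrument: relevant and exogenous
ability = rng.normal(size=n)                 # omitted variable
x = 1.0 + 0.8 * z + ability + rng.normal(size=n)
y = 2.0 + 0.5 * x + ability + rng.normal(size=n)   # true slope is 0.5


def ols(dep, reg):
    """Return (intercept, slope) from regressing dep on a constant and reg."""
    A = np.column_stack([np.ones_like(reg), reg])
    coef, *_ = np.linalg.lstsq(A, dep, rcond=None)
    return coef


# OLS slope s_xy / s_x^2, contaminated by the omitted variable
b1_ols = np.cov(x, y)[0, 1] / np.var(x, ddof=1)

# Stage 1: regress x on z and form the fitted values
g0, g1 = ols(x, z)
x_hat = g0 + g1 * z

# Stage 2: regress y on the fitted values
b0_tsls, b1_tsls = ols(y, x_hat)

# Ratio formula s_zy / s_xz for the TSLS slope
b1_ratio = np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]

print("OLS slope:           ", b1_ols)
print("TSLS slope (2 stage):", b1_tsls)
print("TSLS slope (ratio):  ", b1_ratio)
```

In this setup the two TSLS calculations agree up to rounding error, while the OLS slope sits well above the true value because `ability` pushes x and y in the same direction.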

2 Example: Single-Variable TSLS

Figure 1: Input syntax for a TSLS regression

A brief caution: Doing the two-stage regressions by hand will generate the correct TSLS estimates of the coefficients. However, the standard errors from the second stage are not the proper standard errors for the estimation procedure. In other words, you should use a routine designed for TSLS in order to recover the correct coefficients and the correct standard errors.
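To see why the by-hand standard errors are off, the following sketch compares them with the usual TSLS standard errors on simulated data. It assumes homoskedastic errors for simplicity (not the White standard errors used with EViews in these notes), and the simulated variables are made up for illustration. The only difference between the two calculations is which residuals feed the error variance: the second-stage residuals built from the fitted values, or the residuals built from the actual x.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000

# Simulated data (illustrative assumption): x is endogenous through 'ability'
z = rng.normal(size=n)
ability = rng.normal(size=n)
x = 0.8 * z + ability + rng.normal(size=n)
y = 1.0 + 0.5 * x + ability + rng.normal(size=n)

const = np.ones(n)
Z = np.column_stack([const, z])     # first-stage regressors
X = np.column_stack([const, x])     # regressors in the equation of interest

# Stage 1: fitted values of x
gamma, *_ = np.linalg.lstsq(Z, x, rcond=None)
X_hat = np.column_stack([const, Z @ gamma])

# Stage 2: coefficients are the TSLS estimates
beta, *_ = np.linalg.lstsq(X_hat, y, rcond=None)

XtX_inv = np.linalg.inv(X_hat.T @ X_hat)
dof = n - X.shape[1]

# By-hand second stage: residuals use the fitted x_hat (not appropriate)
resid_naive = y - X_hat @ beta
se_naive = np.sqrt(np.diag(XtX_inv) * (resid_naive @ resid_naive) / dof)

# TSLS: residuals use the actual x (homoskedasticity-only formula)
resid_tsls = y - X @ beta
se_tsls = np.sqrt(np.diag(XtX_inv) * (resid_tsls @ resid_tsls) / dof)

print("slope estimate:    ", beta[1])
print("by-hand slope s.e.:", se_naive[1])
print("TSLS slope s.e.:   ", se_tsls[1])
```

The coefficients are identical in both cases; only the standard errors differ, which is the reason to rely on a dedicated TSLS routine.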

Luckily, EViews contains a built-in routine for TSLS estimation and we need not worry about doing the two stages by hand. Throughout, we will work with the file mroz.wf1. We are interested in the regression of Y = ln(wage) on education. Beginning with simple OLS, the estimated regression is:

$$\hat{y}_i = -0.1852 + 0.1086 \cdot educ_i$$

As discussed earlier, this regression has serious endogeneity problems – education is self-chosen and is correlated with intelligence, so it’s not clear whether the higher wage is a result of education or whether it’s just because smarter people earn higher wages and happen to go to school for longer. A well-known research project in economics uses the father’s education level as an instrument – your father’s education level is correlated with your own education level but is not endogenous to your own wage.

In the EViews Equation Estimation window, choose TSLS as the Method. In the Equation specification box, enter the equation of interest in the usual way. Here, we are interested in the association between education and wage:

$$\ln(wage_i) = \beta_0 + \beta_1 \cdot educ_i + u_i$$

So we enter log(wage) c educ in the Equation specification box. We then enter our instrument fatheduc in the Instrument list. Figure 1 shows the correct input syntax.

As usual, use White standard errors since there is generally no reason to assume that regressions are homoskedastic. The results are shown in figure 2.

Using TSLS estimation, the estimated slope coefficient on educ is 0.0592.

Figure 2: Output from a TSLS regression

As expected, the coefficient estimate in the simple OLS regression is biased upwards. In the simple model, each additional year of education is associated with approximately a 10.86% increase in wage (remember to interpret it in percentages since the wage is logged). As discussed, that is too high since education is endogenous and correlated with ability. After instrumenting with father's education, the association appears weaker: each additional year of education, in itself, is associated with only about a 5.92% increase in wage.
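Reading a coefficient on a logged dependent variable as a percentage is an approximation that works well when the coefficient is small. As a side calculation (not part of the EViews output), the exact percentage changes implied by the two slope estimates are:

$$100\,(e^{0.1086} - 1) \approx 11.47\%, \qquad 100\,(e^{0.0592} - 1) \approx 6.10\%$$

The approximation slightly understates both figures, but the comparison between OLS and TSLS is unaffected.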

Note that TSLS standard errors are usually higher than OLS standard errors. The intuitive reason is that we are using only part of the variation in X to determine an association with Y (the exogenous part backed out through Z) rather than using all of the variation in X. We are losing information, but hopefully preserving the right information by using our instruments.

3 Two Stage Least Squares – General Version

A regression could presumably include some regressors that are endogenous but some that are exogenous. Consider a general specification with k endogenous regressors $\{X_1, X_2, \dots, X_k\}$ and with r exogenous regressors $\{W_1, W_2, \dots, W_r\}$. The equation of interest then has k + r regressors:

$$Y = \beta_0 + \beta_1 X_1 + \dots + \beta_k X_k + \beta_{k+1} W_1 + \dots + \beta_{k+r} W_r + u$$

We have m instruments $\{Z_1, Z_2, \dots, Z_m\}$.

• The model is overidentified if m > k

• The model is exactly identified or just identified if m = k

• The model is underidentified if m < k

TSLS requires overidentification or exact identification. You cannot use instrumental variables estimation (or any other kind of identification) for underidentified models – it amounts to trying to solve a system with more variables than equations.
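The counting rule above is easy to encode. The helper below is a hypothetical convenience function written for these notes (the name `identification_status` is not from EViews or any library); it only compares the number of instruments m with the number of endogenous regressors k and says nothing about whether the instruments are actually relevant or exogenous.

```python
def identification_status(m: int, k: int) -> str:
    """Classify a model with m instruments and k endogenous regressors."""
    if m < 0 or k < 0:
        raise ValueError("m and k must be non-negative counts")
    if m > k:
        return "overidentified"
    if m == k:
        return "exactly identified"
    return "underidentified"


# One endogenous regressor (educ) with one or two instruments
print(identification_status(m=1, k=1))   # fatheduc only
print(identification_status(m=2, k=1))   # motheduc and fatheduc
print(identification_status(m=1, k=2))   # not enough instruments for TSLS
```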

The procedure is similar to the single-variable case.

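The first stage is to estimate a regression of each endogenous variable X on all of the instruments Z and on all of the exogenous variables W:

$$\hat{X}_1 = \hat{\gamma}_0 + \hat{\gamma}_1 Z_1 + \dots + \hat{\gamma}_m Z_m + \hat{\gamma}_{m+1} W_1 + \dots + \hat{\gamma}_{m+r} W_r$$

$$\vdots$$

$$\hat{X}_k = \hat{\gamma}_0 + \hat{\gamma}_1 Z_1 + \dots + \hat{\gamma}_m Z_m + \hat{\gamma}_{m+1} W_1 + \dots + \hat{\gamma}_{m+r} W_r$$

Each of the k first-stage regressions has its own estimated coefficients; the same symbols are reused here only to keep the notation light. Observe that all the endogenous variables are regressed on all the instruments and exogenous variables in the first stage.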

In the second-stage regression, we substitute back the $\hat{x}$ for each observation to the equation of interest. Again, the intuition is that changes in the $\hat{x}$ are based on exogenous variation in the instruments and other exogenous variables.

$$Y = \hat{\beta}_0 + \hat{\beta}_1 \hat{X}_1 + \dots + \hat{\beta}_k \hat{X}_k + \hat{\beta}_{k+1} W_1 + \dots + \hat{\beta}_{k+r} W_r$$

Notice that the intercept (which, as discussed earlier, is just a variable set equal to 1) is trivially considered to be an exogenous regressor like the W terms. In this sense, it appears in the equation of interest and in all the first-stage regressions.
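The general procedure can be sketched in a few lines of matrix code. This is a minimal illustration under stated assumptions rather than a replacement for a proper TSLS routine: the function name `tsls` is hypothetical, it returns coefficients only (no standard errors), and the data are simulated with one endogenous regressor, one exogenous regressor and two instruments.

```python
import numpy as np


def _as_columns(a):
    """Turn a 1-D series into a single-column matrix; leave 2-D input alone."""
    a = np.asarray(a, dtype=float)
    return a.reshape(-1, 1) if a.ndim == 1 else a


def tsls(y, X_endog, W_exog, Z_instr):
    """TSLS coefficients, ordered as [intercept, endogenous X's, exogenous W's]."""
    X, W, Z = _as_columns(X_endog), _as_columns(W_exog), _as_columns(Z_instr)
    if Z.shape[1] < X.shape[1]:
        raise ValueError("underidentified: fewer instruments than endogenous regressors")
    const = np.ones((len(y), 1))

    # First stage: every endogenous X on the constant, all Z and all W
    first_stage = np.hstack([const, Z, W])
    gamma, *_ = np.linalg.lstsq(first_stage, X, rcond=None)
    X_hat = first_stage @ gamma

    # Second stage: y on the constant, the fitted X's and the W's
    second_stage = np.hstack([const, X_hat, W])
    beta, *_ = np.linalg.lstsq(second_stage, y, rcond=None)
    return beta


# Simulated data (illustrative assumption): true coefficients are 1.0, 0.5, -0.2
rng = np.random.default_rng(2)
n = 5000
z1, z2, w, ability = (rng.normal(size=n) for _ in range(4))
x = 0.6 * z1 + 0.6 * z2 + 0.3 * w + ability + rng.normal(size=n)
y = 1.0 + 0.5 * x - 0.2 * w + ability + rng.normal(size=n)

beta = tsls(y, x, w, np.column_stack([z1, z2]))
print("intercept, coefficient on x, coefficient on w:", beta)
```

Note that the constant is added inside the function to both stages, mirroring the remark that the intercept behaves like an exogenous regressor.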

4 Example: Multiple-Variable TSLS

Using the mroz.wf1 file from earlier, we will now consider the following model:

$$\ln(wage_i) = \beta_0 + \beta_1 \cdot educ_i + \beta_2 \cdot age_i + u_i$$

We will treat education as an endogenous regressor X and age as an exogenous regressor W. Let us use both mother’s education and father’s education as instruments Z.

To implement this in EViews, enter the equation of interest in the Equation specification box with TSLS selected as the Method. Thus, we enter log(wage) c educ age in this box. In the Instrument list box, enter the instruments and the exogenous regressors. Thus, we enter age motheduc fatheduc in this box.

Put a different way, endogenous regressors X should appear only in the Equation specification box. Exogenous regressors W should appear in both the Equation specification box and in the Instrument list box. Instruments Z should appear only in the Instrument list box. Figure 3 shows the correct input syntax. Note that the constant intercept is automatically included as an exogenous regressor even if not listed in the Instrument list.

Figure 3: Input syntax for a TSLS regression with exogenous regressors and multiple instruments

Running this estimation, the coefficient implies that one more year of education is associated with a 5.85% increase in wage.

5 Conditions for Valid Instruments in the General Case

The relevance conditions are a bit more subtle than in the single-variable, single-instrument case. The idea is similar, though – basically the instruments Z have to contain "enough" information about X. Here are the relevance conditions precisely stated:

• One endogenous X and multiple instruments Z

At least one instrument Z must be useful to predict X given the exogenous variables W. That is, the instrument Z must add something beyond the information already contained in the exogenous variables W.

• Multiple endogenous variables X

In this case, the set $\{\hat{X}_1, \dots, \hat{X}_k, W_1, \dots, W_r, 1\}$ must not contain any perfect multicollinearity (note the inclusion of the intercept 1). Basically, what this says is that the instruments must provide enough information about each endogenous X to separately sort out the effect of each one on Y.

The exogeneity condition is the same as in the single-variable case. The instruments Z and the exogenous variables W must satisfy $E(u \mid Z_1, \dots, Z_m, W_1, \dots, W_r) = 0$.

6 Checking for Instrument Relevance

The relevance condition amounts to a requirement that the instruments Z have useful predictive power over X, so that we have useful information in the second stage regression when we recover the association between X and Y. We say that our instruments are weak instruments when the correlation between X and Z is weak. Here are two examples from actual research projects.

• The relationship between Y = quantity of cigarettes demanded and X = price of cigarettes is endogenous because of the usual supply/demand issue discussed in the previous section. We need an instrument that affects the price but without any endogeneity to demand. Z = distance from store to distributor is certainly exogenous in the sense that it has no direct influence on demand. However, the correlation between Z and the price of cigarettes X is probably very weak since cigarettes are light and transportation is a very small part of the supply cost.

• The relationship between Y = wage and X = education features endogeneity, as discussed earlier. In a famous paper, a researcher used Z = birthday as an instrument. Birthday is obviously exogenous, but what’s the relevance? There is a law in the US that everyone must attend school until age 16, so a student who turns 16 in the middle of 10th grade might drop out before completing it, but a student who turns 16 after 10th grade already finished would be required to complete it. The thought is that this is an exogenous reason why different people might have variation in their education levels. Nevertheless, the correlation here is very weak. This difference explains very little of people’s differences in education levels. In a famous response to this paper, a researcher set Z equal to random noise and got essentially the same results as in the original paper! This is a good indication that the instrument didn’t contain much information about X.

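How weak is weak? This is an active topic in econometric research, but a rule of thumb for the case with one endogenous regressor is that the first-stage F-statistic for the joint significance of the instruments Z (that is, the F-test that the coefficients on all of the instruments are zero, with the exogenous variables W kept in the regression) should be greater than 10. When there are no exogenous regressors W, this is the same as the F-statistic for the overall significance of the first-stage regression. If F < 10, then this is cause for concern about weak instruments. Notice that simply calculating the correlation between X and Z is not enough since there could be multiple instruments and other exogenous variables.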
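The rule of thumb can be illustrated with a restricted / unrestricted comparison in the first stage. The sketch below is an assumption-laden illustration: the helper name `first_stage_F` is hypothetical, the F-statistic is the homoskedasticity-only version, and the data are simulated with one instrument that is either strong or nearly irrelevant.

```python
import numpy as np


def _as_columns(a):
    a = np.asarray(a, dtype=float)
    return a.reshape(-1, 1) if a.ndim == 1 else a


def _ssr(dep, A):
    """Sum of squared residuals from regressing dep on the columns of A."""
    coef, *_ = np.linalg.lstsq(A, dep, rcond=None)
    resid = dep - A @ coef
    return float(resid @ resid)


def first_stage_F(x, Z, W):
    """F-statistic for H0: the coefficients on all instruments Z are zero."""
    Z, W = _as_columns(Z), _as_columns(W)
    n = len(x)
    const = np.ones((n, 1))
    unrestricted = np.hstack([const, Z, W])   # constant, instruments, exogenous W
    restricted = np.hstack([const, W])        # instruments dropped
    q = Z.shape[1]                            # number of restrictions
    ssr_u = _ssr(x, unrestricted)
    ssr_r = _ssr(x, restricted)
    return ((ssr_r - ssr_u) / q) / (ssr_u / (n - unrestricted.shape[1]))


# Simulated first stages (illustrative assumption)
rng = np.random.default_rng(4)
n = 1000
z, w, ability = (rng.normal(size=n) for _ in range(3))
noise = rng.normal(size=n)
x_strong = 0.60 * z + 0.3 * w + ability + noise   # z matters a lot for x
x_weak = 0.01 * z + 0.3 * w + ability + noise     # z barely matters for x

f_strong = first_stage_F(x_strong, z, w)
f_weak = first_stage_F(x_weak, z, w)
print("first-stage F, strong instrument:", f_strong)
print("first-stage F, weak instrument:  ", f_weak)
```

Comparing each value with the threshold of 10 gives the diagnosis described above.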

Why do we care about weak instruments? The basic problem is that, as the instruments become weaker, the normal distribution provides a poorer and poorer approximation of the sampling distribution of the TSLS coefficient estimate, even in large samples. This makes inference on the coefficients difficult since hypothesis tests and confidence intervals are all based on the coefficient estimate being approximately normally distributed.

What should we do if we have weak instruments? If you have many instruments, you are better off discarding weak instruments and using only the most relevant ones. The standard errors rise, but the interpretation of the standard errors is suspect anyway if the instruments are too weak. However, if you have exact identification or overidentification but not enough "strong" instruments, then there really is no solution other than finding better instruments. There is no statistical procedure that can solve a lack of good information.

7 Checking for Instrument Exogeneity

If the instruments Z are not exogenous with respect to the original regression, then TSLS estimates are going to suffer from the same problem as the original OLS estimates.

Can we test whether a variable is exogenous? Not really. There is no statistical evidence that can tell you whether a relationship is exogenous or endogenous. The regression of wage on education is perfectly valid in a statistical sense. You need some structural understanding of the problem to realize that the relationship is endogenous. This kind of thing is at the heart of the difference between statistics and econometrics. No matter how good your statistical skills are, these things ultimately rely on understanding the economic theory underlying your models.

8 Testing for Overidentification

In the exactly identified case, there is virtually nothing that we can do to test for exogeneity. In the overidentified case, we can test the hypothesis that "extra" instruments are exogenous under the assumption that there are enough valid instruments in the first place.

The intuition behind these tests is that, with two valid instruments, we hope that our answers would be "close" in a statistical sense. With just identification, there’s nothing to test against. But again, this test relies on the assumption that there are at least enough valid instruments for just-identification. All we are doing is testing whether the model is overidentified.

The test procedure is described below. The null hypothesis is that there is overidentification – that all of the instruments are legitimately exogenous.

• Regress the TSLS residuals $\hat{u}_i$ (computed from the TSLS coefficients and the actual values of X) on the instruments Z and the exogenous regressors W as shown below:

$$\hat{u}_i = \hat{\delta}_0 + \hat{\delta}_1 Z_1 + \dots + \hat{\delta}_m Z_m + \hat{\delta}_{m+1} W_1 + \dots + \hat{\delta}_{m+r} W_r$$

• Compute the test statistic F for the restricted / unrestricted regression test given in unit 4.2, using the restriction that $\delta_1 = \delta_2 = \dots = \delta_m = 0$. The intuition is that, if all the instruments Z are truly exogenous, then they should be uncorrelated with the regression residual.

Observe that if there are no exogenous regressors W, then the relevant F-statistic is just the F-statistic for overall significance of the entire regression.

• Based on the test statistic F from above, construct the test statistic J = mF, where m is the number of instruments.

• The rejection region is $J > \chi^2_{m-k}$, where the critical value is taken from the $\chi^2$ distribution with m − k degrees of freedom (i.e. the degree of overidentification).

• If J falls in the rejection region, then we can reject the null hypothesis that the model is legitimately overidentified – and thus conclude that at least one of the instruments fails the exogeneity requirement. Note that the usefulness of this test is somewhat limited – if we reject the overidentification hypothesis, all we know is that there is some problem with exogeneity in the instruments. We do not know which instrument is causing the problem. Further, the whole test is predicated on having at least enough instruments for identification. The test is meaningless if there are so many endogenous regressors as to result in underidentification.
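The steps of the test can be sketched as follows. This is an illustration under assumptions, not the EViews implementation: the function name `overid_j_test` is hypothetical, the F-statistic is the homoskedasticity-only version, and the data are simulated so that one version of the outcome has two valid instruments while the other lets the second instrument affect the outcome directly.

```python
import numpy as np


def _cols(a):
    a = np.asarray(a, dtype=float)
    return a.reshape(-1, 1) if a.ndim == 1 else a


def _ols(dep, A):
    coef, *_ = np.linalg.lstsq(A, dep, rcond=None)
    return coef


def _ssr(dep, A):
    resid = dep - A @ _ols(dep, A)
    return float(resid @ resid)


def overid_j_test(y, X, W, Z):
    """Return (J, degrees of freedom) for the overidentification test."""
    X, W, Z = _cols(X), _cols(W), _cols(Z)
    n, k = X.shape
    m, r = Z.shape[1], W.shape[1]
    if m <= k:
        raise ValueError("the test needs m > k (an overidentified model)")
    const = np.ones((n, 1))
    exog = np.hstack([const, Z, W])          # constant, instruments, exogenous W

    # TSLS: fitted X from the first stage, then second-stage coefficients
    X_hat = exog @ _ols(X, exog)
    beta = _ols(y, np.hstack([const, X_hat, W]))

    # TSLS residuals use the actual X, not the fitted X_hat
    u_hat = y - np.hstack([const, X, W]) @ beta

    # Unrestricted: u_hat on constant, Z and W. Restricted: all delta on Z are zero.
    ssr_u = _ssr(u_hat, exog)
    ssr_r = _ssr(u_hat, np.hstack([const, W]))
    F = ((ssr_r - ssr_u) / m) / (ssr_u / (n - m - r - 1))
    return m * F, m - k


# Simulated data (illustrative assumption)
rng = np.random.default_rng(3)
n = 5000
z1, z2, w, ability = (rng.normal(size=n) for _ in range(4))
x = 0.6 * z1 + 0.6 * z2 + 0.3 * w + ability + rng.normal(size=n)
u = ability + rng.normal(size=n)
Z = np.column_stack([z1, z2])

y_valid = 1.0 + 0.5 * x - 0.2 * w + u     # both instruments exogenous
y_invalid = y_valid + 1.0 * z2            # z2 now affects y directly

J_valid, df = overid_j_test(y_valid, x, w, Z)
J_invalid, _ = overid_j_test(y_invalid, x, w, Z)
print("J with valid instruments:  ", J_valid, "df:", df)
print("J with one invalid instrument:", J_invalid, "df:", df)
```

Each J value would then be compared with the chi-square critical value for the reported degrees of freedom (3.84 at the 5% level when there is one degree of overidentification).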

9 Example: Testing for Overidentification

Consider the example from earlier in this section. The equation of interest is:

ln(wage) =β0+β1∗educ+β2∗age+ui

We treated education as an endogenous regressor X and age as an exogenous regressor W. Because we took both mother’s education and father’s education as instruments Z, the model is overidentified since there are m = 2 instruments but k = 1 endogenous variable.

EViews automatically saves the regression residuals in a series named resid. After running the TSLS regression as explained above, type series residual=resid in the command window and hit enter. This preserves the regression residuals for use in later estimation. You can now treat residual just like any other variable.

Proceeding through the rest of the steps:

• Regress the residuals $\hat{u}_i$ on the instruments motheduc and fatheduc and the exogenous regressor age using simple OLS. These results are given in figure 4.

Figure 4: Regressing TSLS residuals on instruments and exogenous regressors

• For the restricted / unrestricted regression test, using the restriction that the coefficients on motheduc and fatheduc are equal to zero, we need to also run the restricted regression that omits these variables. Using simple OLS and regressing the residuals only on the exogenous regressor age gives the results shown in figure 5.

Recall from section 4.2 that the F-statistic for the restricted versus unrestricted regression test is as follows. Note that there are q = 2 restrictions, k = 3 independent variables (here k counts the regressors in the unrestricted regression, not the endogenous variables), and that the sample size is n = 428.

$$F = \frac{(SSR_{restricted} - SSR_{unrestricted})/q}{SSR_{unrestricted}/(n-k-1)} = \frac{(201.6813 - 201.5998)/2}{201.5998/(428-3-1)} = 0.0857$$

• As there are m = 2 instruments, the J-statistic is J = 2(0.0857) = 0.1714.

• Here, we have 2 − 1 = 1 degree of overidentification. Using the chi-square table, for a 5% significance level, the rejection region is $J > \chi^2_1 \Rightarrow J > 3.84$.

• Since our J-statistic does not fall in the rejection region, we cannot reject the hypothesis that the model is legitimately overidentified.

Thus, we have no evidence that the overidentification is dubious in the sense that some of the instruments fail the exogeneity condition. Again note the limitations of this result – if both instruments fail the exogeneity condition, then this test doesn’t tell us anything since the assumption of the test is that the model is at least just-identified. Furthermore, even if the test had rejected the null of overidentification, we would have no idea which instrument is causing a problem (or possibly both, which would make the test itself invalid). Ultimately, this test can be an informative diagnostic, but you can really only make a determination about endogeneity or exogeneity using theory.
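The arithmetic in this example can be reproduced directly from the two sums of squared residuals reported above. The short script below uses only the numbers given in the text; the variable names are chosen for illustration, and the critical value 3.84 is the chi-square table value quoted in the example rather than something computed here.

```python
# Numbers taken from the example in the text
ssr_restricted = 201.6813      # residuals regressed on age only
ssr_unrestricted = 201.5998    # residuals regressed on motheduc, fatheduc, age
q = 2                          # restrictions: coefficients on both instruments are zero
n_regressors = 3               # regressors in the unrestricted regression
n = 428                        # sample size
m = 2                          # number of instruments
k_endog = 1                    # number of endogenous regressors

F = ((ssr_restricted - ssr_unrestricted) / q) / (
    ssr_unrestricted / (n - n_regressors - 1)
)
J = m * F
degrees_of_freedom = m - k_endog
critical_value = 3.84          # 5% chi-square critical value with 1 df, from the text
reject = J > critical_value

print("F =", round(F, 4))
print("J =", round(J, 4), "with", degrees_of_freedom, "degree of freedom")
print("reject the null of valid overidentification:", reject)
```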
