
(1)

Techniques of Statistical Analysis I

Lecture 13: Endogeneity and Instrumental Variables (+ robust standard errors)

Bruno Arpino

(2)

Outline

Assumptions underlying linear regression models

Omitted Variable and Simultaneity Bias

Instrumental Variables Regression

---

(3)

Assumptions

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k + \varepsilon$$

1) Linearity: The expected value of Y is a linear function of the independent variables.

2) The error terms are random variables with mean 0 and a constant variance, σ²:

$$E[\varepsilon_i] = 0 \;\; \forall i, \qquad \mathrm{Var}[\varepsilon_i] = \sigma^2 \;\; \forall i$$

The constant variance assumption is known as homoskedasticity.

3) Exogeneity: The regressors and the error term are not correlated.

(4)

Assumptions (cont’d)

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k + \varepsilon$$

4) The random error terms, ε_i, are not correlated with one another, so that:

$$\mathrm{Cov}(\varepsilon_i, \varepsilon_j) = 0 \quad \forall\, i \neq j$$

5) Normality: The random error terms are normally distributed.

Assumption 5 is not required for obtaining the OLS estimates but it is required to make inference. Hypothesis tests (t-tests) and confidence intervals are based on this assumption.

(5)

Assumptions (cont’d)

For assumptions 1, 2, 4 and 5 there are statistical tests to check their validity. Statistical solutions exist to deal with violations of these assumptions.

We focus on assumption 3 (exogeneity), which cannot be tested. However, its plausibility can be judged by the

(6)

Omitted variable bias

Omitted Variable Bias (OVB) is the bias of the estimators of regression coefficients caused by omitting relevant variables.

Omitted relevant variables are variables correlated both with the dependent variable and with the independent variables included in the regression (i.e., confounders).

OVB causes the violation of assumption 3:

Let’s see why…

(7)

Omitted variable bias (cont’d)

Let us assume the true model is:

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon$$

but we omit X₂ and estimate the model:

$$Y = \tilde{\beta}_0 + \tilde{\beta}_1 X_1 + \tilde{\varepsilon}$$

Now the error term includes X₂:

$$\tilde{\varepsilon} = \beta_2 X_2 + \varepsilon$$

The consequence is that if X₁ and X₂ are correlated, in the estimated model X₁ is correlated with the error term and this violates assumption 3.
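A small simulation can make the size of the problem concrete. The sketch below is illustrative only: the coefficient values, the sample size and the helper names (`mean`, `cov`, `simulate_ovb`) are assumptions, not part of the lecture. It draws data from the true model, estimates the short regression of Y on X₁ alone, and compares the slope with β₁ + β₂·cov(X₁, X₂)/var(X₁), the value the short-regression slope converges to when X₂ is omitted.

```python
import random


def mean(v):
    return sum(v) / len(v)


def cov(a, b):
    """Sample covariance of two equally long lists."""
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (len(a) - 1)


def simulate_ovb(n=20000, beta1=1.0, beta2=2.0, rho=0.5, seed=0):
    """Hypothetical data: X2 is correlated with X1 and also affects Y."""
    rng = random.Random(seed)
    x1, x2, y = [], [], []
    for _ in range(n):
        a = rng.gauss(0, 1)
        b = rho * a + rng.gauss(0, 1)  # X2 depends on X1
        x1.append(a)
        x2.append(b)
        y.append(0.5 + beta1 * a + beta2 * b + rng.gauss(0, 1))
    # Short regression: Y on X1 only, so X2 ends up in the error term.
    short_slope = cov(x1, y) / cov(x1, x1)
    # Omitted-variable-bias formula evaluated on the same sample.
    expected = beta1 + beta2 * cov(x1, x2) / cov(x1, x1)
    return short_slope, expected


short_slope, expected = simulate_ovb()
print(f"true beta1 = 1.0, short-regression slope = {short_slope:.3f}, "
      f"OVB formula = {expected:.3f}")
```

With these assumed values the short-regression slope should land near 2 rather than near the true β₁ = 1; setting `rho=0` removes the correlation between X₁ and X₂ and, with it, the bias.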

(8)

The existence of a bidirectional relationship between X₁ and Y also violates assumption 3.

In this case too we say that X₁ is endogenous and OLS gives a biased estimate of its slope: simultaneity bias!

(9)

An example of endogeneity: the effect of smoking during pregnancy

Consider the dataset “smoking.dta” that includes information on 1388 children born in 1988 in the US. The file includes the following variables:

bwght      birth weight, ounces
cigs       cigarettes smoked per day by the mother while pregnant
male       =1 if male child
parity     birth order of the child
motheduc   mother's years of education
faminc     family income (1988)
cigprice   (average) cigarette price in the home state (1988)

smoking.dta is a subset of the data analyzed by: J. Mullahy (1997),

(10)

Research question and policy implications

Does smoking during pregnancy have a negative impact on child health?

If so, can the policy maker protect children’s health by increasing cigarette taxes?

Can we expect that an increase in cigarette taxes will be effective in reducing cigarette smoking? (Is cigarette demand elastic with respect to cigarette price?)

SMOKING → Birth weight (measure of child health)

(11)

The “Science”: the literature review

Preliminary step: based on theory and past research, is it reasonable to establish a causal pathway from smoking to health?

Holland (1986): “statistics can only estimate the effects of causes”. It is the “Science” that helps in building the underlying theoretical framework and deciding the role each variable is expected to play (outcome, regressor, confounder…)

Start with a literature review. Examples of two papers on the topic are:

J. van Reek, R. Knibbe, T. van Iwaarde (1991), Policy relevance of a survey on smoking and drinking behaviour among Dutch school children, Health Policy, 18 (3), pp 261-268.

(12)

Previous research and the policy relevance of the smoking-health relationship

It’s known that smoking is a cause of a number of diseases including lung cancer and chronic obstructive pulmonary disease.

It is strongly suspected to be one of the causes of many others including heart disease, stroke, most cancers, and congenital anomalies.

It exacerbates some diseases including diabetes, HIV, depression, respiratory infection, chronic liver disease, arthritis, nephritis, and ulcers.

In countries where there is a public health system, the society

(13)

The “Science”: confounders, assumptions

Based on theory, past research and common sense, make a list of the possible confounders:

Observables: age, education, income, ethnicity…

Unobservables: ???

Leigh and Schembri (2004) note that:

“Risk aversion might lead people to never smoke and to maintain good health. Any correlation between smoking and health that did not remove risk aversion would overestimate the effect of smoking on health.”

(14)

The “Science”: confounders, assumptions

Leigh and Schembri (2004) also note that another problem can arise:

“… physical functional decline could scare someone and result in the person quitting smoking. The smoking that lead to the disability may have stopped months or years before, yet the person would likely still have the disability. This is a problem of BIDIRECTIONAL CAUSALITY and could lead to underestimate the effect of smoking.”

In fact, think for example in this way:

non-smoker (at time t): good health (at time t) → non-smoker (at time t+1)
non-smoker (at time t): bad health (at time t) → non-smoker (at time t+1)
smoker (at time t): good health (at time t) → smoker (at time t+1)

(15)

Potential direction of biases

Apart from the previous considerations, the direction of the bias can also be guessed by using an influence diagram. As noted before:

Omitting risk aversion will cause a negative bias

Omitting previous health status will cause a positive bias

A priori we do not know which effect will prevail. This depends on the strength of the unobserved variables!

Bias = true parameter − expected value of estimator

[Influence diagrams: in the first, U (risk aversion) influences both T (smoking) and Y (child birth weight); in the second, U (previous health status) influences both T (smoking) and Y. The arrows carry + and − signs.]

(16)

And now statistics: which method is appropriate?

Our data are not sufficient to use OLS. In fact, this method requires that all the relevant confounders are observed and controlled for.

Risk aversion is an important unobserved confounder. Not controlling for it causes OVB!

The fact that causality is bidirectional also violates the exogeneity assumption (causing simultaneity bias). (Here we can again think of this problem as an omitted variable bias: omission of the previous health status of the mother).

(17)

A graphical representation of the Instrumental Variables model

[Diagram with nodes Z, T (smoking), Y, X (income, education, child parity and gender…) and U (risk aversion; previous health status): Z affects T, T affects Y, and X and U affect both T and Y.]

Z is related to T (relevance) and not (directly) to Y (validity).

Z is a variable “outside” the model in the sense that we consider it only to create random variation (variation not related to U) in T.

That’s why it’s known as an instrumental variable.

(18)

The characteristics of a good instrument

A suitable instrument, Z, is a variable with 2 characteristics:

RELEVANCE: Z is correlated with the endogenous variable, X₁.

VALIDITY (or exclusion restriction): Z is uncorrelated with the outcome, conditional on the endogenous variable. In other words, it does not have direct effects on Y. Z affects the outcome only through its effect on the endogenous variable.

Z acts as a randomization device:

(19)

A Venn-diagram representation of IV

[Venn diagram with circles for Y, U, T and Z. The part of T that overlaps U is labelled “Bad variation in T”; the part of T that does not overlap U is labelled “Good variation in T”.]

A circle represents variation in a variable. Overlapping parts represent covariance. IV methods use the association between Z and T to extract from the whole variation of T a sub-part of variation that is not contaminated by U. Note that this is only a sub-part of the good

(20)

The Two Stage Least Squares Estimator (2SLS)

The most popular estimator for an IV model is known as 2SLS. The method consists of two steps:

In the first stage, the endogenous variable, T, is regressed on all the X variables and the instruments, Z. This model is used to obtain the prediction of T.

In the second stage the outcome variable is regressed on the X variables and the predictions of T from the first stage.

The first stage helps in separating the “bad” variation in T (related to U) from the “good” variation (not related to U). In a sense, we “purify” T.

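$$\text{(1st stage)} \quad T_i = \gamma_0 + \sum_k \gamma_{1k} X_{ki} + \gamma_3 Z_i + \delta_i$$

$$\text{(2nd stage)} \quad Y_i = \beta_0 + \beta_1 \hat{T}_i + \sum_k \beta_{2k} X_{ki} + \varepsilon_i$$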
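The two stages can be sketched by hand on simulated data. Everything below is an assumption made for illustration (the data-generating coefficients, the single covariate X, the helper names `solve`, `ols` and `simulate`); it is not the lecture's Stata analysis. The unobserved U confounds T and Y, while Z shifts T only, so naive OLS is biased and the manual two-stage procedure should land close to the assumed true effect of -0.5.

```python
import random


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def ols(rows, y):
    """OLS coefficients; each design-matrix row starts with the constant 1."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


def simulate(n=10000, true_effect=-0.5, seed=1):
    """Hypothetical data: U confounds T and Y, Z affects T only."""
    rng = random.Random(seed)
    x, z, t, y = [], [], [], []
    for _ in range(n):
        xi, zi, ui = rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)
        ti = 1.0 + 0.5 * xi + 1.0 * zi + 1.0 * ui + rng.gauss(0, 1)
        yi = 2.0 + true_effect * ti + 0.3 * xi - 2.0 * ui + rng.gauss(0, 1)
        x.append(xi)
        z.append(zi)
        t.append(ti)
        y.append(yi)
    return x, z, t, y


x, z, t, y = simulate()

# Naive OLS of Y on T and X (U is unobserved, so it cannot be included).
b_ols = ols([[1.0, ti, xi] for ti, xi in zip(t, x)], y)[1]

# First stage: regress T on X and Z, then compute the predictions of T.
g = ols([[1.0, xi, zi] for xi, zi in zip(x, z)], t)
t_hat = [g[0] + g[1] * xi + g[2] * zi for xi, zi in zip(x, z)]

# Second stage: regress Y on the predicted T and X.
b_2sls = ols([[1.0, th, xi] for th, xi in zip(t_hat, x)], y)[1]

print(f"assumed true effect: -0.5  OLS: {b_ols:.3f}  2SLS: {b_2sls:.3f}")
```

Running the second stage by hand reproduces the 2SLS point estimate, but the standard errors that a plain OLS routine would report for that second stage are not valid, because its residuals are computed with the predicted T instead of the observed T. Dedicated IV routines correct for this.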

(21)

The cure can be worse than the disease!!!

If the two requirements of relevance and validity are violated (the instruments are not good) then the IV estimator can be even more biased than simple OLS!!!

We talk about an invalid instrument if Z has a direct effect on Y (e.g., education is not a valid instrument because it has a direct effect on health, not only an indirect effect through smoking behavior). Validity is primarily assessed on the basis of theoretical/common-sense arguments.

We talk about a weak instrument if Z is not strongly correlated with T. This assumption can be easily tested using the first stage results.
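The first-stage check can be illustrated with a minimal sketch. It assumes the simplest setting (one instrument, no covariates), in which the first-stage F statistic is the squared t statistic of the slope of T on Z. The function names and the simulated "strong" and "weak" instruments are hypothetical, chosen only to show how the statistic responds to instrument strength.

```python
import random


def first_stage_f(t, z):
    """F statistic for H0: slope = 0 in the simple regression of T on Z."""
    n = len(t)
    mt, mz = sum(t) / n, sum(z) / n
    szz = sum((zi - mz) ** 2 for zi in z)
    szt = sum((zi - mz) * (ti - mt) for zi, ti in zip(z, t))
    slope = szt / szz
    intercept = mt - slope * mz
    rss = sum((ti - intercept - slope * zi) ** 2 for ti, zi in zip(t, z))
    # With a single regressor, F = t^2 = slope^2 / (residual variance / Szz).
    return slope ** 2 / (rss / (n - 2) / szz)


def simulate_first_stage(strength, n=2000, seed=3):
    """Hypothetical first stage: T = strength * Z + noise."""
    rng = random.Random(seed)
    z = [rng.gauss(0, 1) for _ in range(n)]
    t = [strength * zi + rng.gauss(0, 1) for zi in z]
    return first_stage_f(t, z)


f_strong = simulate_first_stage(strength=0.5)
f_weak = simulate_first_stage(strength=0.02)
print(f"F with a strong instrument: {f_strong:.1f}")
print(f"F with a weak instrument:   {f_weak:.1f}")
```

With covariates in the first stage, the relevant statistic is the F test on the instruments alone, with the covariates kept in the model; the sketch does not cover that case.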

(22)

The effect of smoking during pregnancy: OLS estimates

We first estimate a standard linear regression model where the child’s birth weight is a function of family income, mother’s education, child parity, child gender and cigarettes smoked per day by the mother during pregnancy:

The estimated effect of cigarettes is negative (as expected) and highly significant. However, given the previous considerations about endogeneity we cannot rely on this estimate.

(23)

Do we have any good IV candidate?

Cigarette price is a good candidate for an IV. In fact, it is not directly related to children’s health (valid) while it is potentially correlated with the cigarettes smoked by mothers (relevant). Let’s verify whether this correlation exists.

We use a simple demand model for cigarettes where cigarettes smoked per day depend on the price, family income, mother’s education, etc.:

As expected, cigarette price is a relevant instrument because it is significantly related to the endogenous variable. Policy makers can consider reducing smoking through an increase in taxes.

(24)

The effect of smoking during pregnancy: 2SLS estimates

The F-test shows that the instrument is relevant (as a rule of thumb a value of F > 10 is considered acceptable).

The estimated effect of cigarettes is now higher in absolute value. For each additional cigarette the model predicts a reduction in the child’s weight equal to 0.97 ounces (27.5 grams). OLS underestimated the effect of smoking. This can

(25)

What is the IV estimating?

As shown in the Venn diagram, the instrument does not capture the whole “good” variation in the treatment. In other words, the covariance between Z and T is a sub-part of the variation of T that does not overlap with U (“good variation”).

Angrist, Imbens and Rubin (1996) gave a clear interpretation of this point.

They showed that an IV analysis can only estimate the causal effect of the endogenous variable on a specific sub-population of units,

(26)

What is the IV estimating? (cont’d)

Compliers are those who “react” to the instrument variation. In our example, compliers are those who change their smoking behavior because of a change in cigarette price. They determine the IV relevance.

Non-compliers are those who do not react to a variation in cigarette price (their cigarette demand is inelastic).

A policy implication of the analysis is that a policy aimed at increasing cigarette prices through an increase of cigarette taxes will succeed in improving children’s health through a reduction in mothers’ smoking, but only for a sub-group of mothers (those with elastic

(27)

Cross-sectional vs longitudinal data


(28)

Cross-sectional vs longitudinal data

Cross-sectional data refer to a specific point in time (e.g., a survey conducted in 2008).

Longitudinal data refer to information collected at different points in time (e.g., ESS data are available for different years). If the same units are followed we have panel data.

If we estimate a linear regression model on panel data with the standard OLS method, we violate the 4th assumption (no correlation among the error terms):

$$\mathrm{Cov}(\varepsilon_i, \varepsilon_j) = 0 \quad \forall\, i \neq j$$

The reason is that different observations on the same person in

(29)

Robust standard errors

A similar problem happens with multilevel data. E.g., students nested into classes: observations on students in the same class are not independent (for example, because they share the same teacher).

The consequence of the violation of assumption 4 is that the standard errors (but not the regression slopes) are biased. (The same happens when homoskedasticity is violated.)

Stata (as well as other software) offers an alternative way of calculating the standard errors that allows for possible violations of assumption 4 (and homoskedasticity): robust standard errors.
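To see what such an alternative calculation does, here is a minimal sketch for a simple regression with clustered data. It is not Stata's implementation: it uses the basic cluster-robust "sandwich" form without any small-sample correction (software packages typically add one), and the data-generating values and function names are assumptions made for illustration. Both the regressor and the error share a group-level component, so assumption 4 fails.

```python
import random


def simulate_clustered(groups=500, size=10, slope=1.0, seed=4):
    """Hypothetical two-level data: x and the error both share a group component."""
    rng = random.Random(seed)
    x, y, g = [], [], []
    for group in range(groups):
        x_shared, e_shared = rng.gauss(0, 1), rng.gauss(0, 1)
        for _ in range(size):
            xi = x_shared + rng.gauss(0, 1)
            ei = e_shared + rng.gauss(0, 1)
            x.append(xi)
            y.append(2.0 + slope * xi + ei)
            g.append(group)
    return x, y, g


def slope_standard_errors(x, y, g):
    """Return (slope, classical SE, cluster-robust SE) for y = a + b x + e."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    dx = [xi - mx for xi in x]
    sxx = sum(d * d for d in dx)
    b = sum(d * (yi - my) for d, yi in zip(dx, y)) / sxx
    a = my - b * mx
    resid = [yi - a - b * xi for xi, yi in zip(x, y)]
    # Classical formula: assumes uncorrelated errors with constant variance.
    classical = (sum(e * e for e in resid) / (n - 2) / sxx) ** 0.5
    # Cluster-robust sandwich: sum the scores within each group before squaring.
    scores = {}
    for d, e, group in zip(dx, resid, g):
        scores[group] = scores.get(group, 0.0) + d * e
    clustered = sum(s * s for s in scores.values()) ** 0.5 / sxx
    return b, classical, clustered


x, y, g = simulate_clustered()
b, classical_se, cluster_se = slope_standard_errors(x, y, g)
print(f"slope = {b:.3f}, classical SE = {classical_se:.4f}, "
      f"cluster-robust SE = {cluster_se:.4f}")
```

The slope estimate is the same under both calculations; only the standard error changes. In this simulated setting the classical formula understates the uncertainty, because it treats the observations within a group as independent pieces of information.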

(30)

When to use robust standard errors instead of multilevel models?

You want to use multilevel models when the multilevel structure is of interest for your research, i.e., you have multilevel questions such as:

How much variance exists in fertility intentions across cities/regions compared to individual variability?

Can the unemployment rate help explain the regional variability in political identity?

(31)

Appendix for the brave: a formal representation of the IV model and assumptions

(32)

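Representing the model by equations

(Let’s omit for simplicity the X variables from the model)

The picture in slide 17 translates into two equations (regression models):

$$(1) \quad T_i = \gamma_0 + \gamma_2 U_i + \gamma_3 Z_i + \delta_i$$

$$(2) \quad Y_i = \beta_0 + \beta_1 T_i + \beta_2 U_i + \varepsilon_i$$

The variable Z is an instrument because it does not enter model (2).

In formulas, the 2 requirements on the instrument can be written as:

Relevance: $\mathrm{cov}(T, Z) \neq 0$

Validity: $\mathrm{cov}(\tilde{\varepsilon}, Z) = 0$, where $\tilde{\varepsilon}_i = \beta_2 U_i + \varepsilon_i$ is the error term of model (2) once the unobserved U is omitted.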

(33)

Omitted variable bias and the IV method

$$(1) \quad T_i = \gamma_0 + \gamma_2 U_i + \gamma_3 Z_i + \delta_i$$

$$(2) \quad Y_i = \beta_0 + \beta_1 T_i + \beta_2 U_i + \varepsilon_i$$

The problem is that we cannot estimate model (2) because U is unobserved.

We will omit U and it will be included into the error term:

$$\tilde{\varepsilon}_i = \beta_2 U_i + \varepsilon_i$$

This will cause bias. How does an IV help?

In this simple case (without covariates, X) the IV estimator is the ratio of two covariances:

$$\hat{\beta}_1^{IV} = \frac{\mathrm{cov}(Y, Z)}{\mathrm{cov}(T, Z)}$$

It’s quite simple to show that this estimator is consistent:

$$\mathrm{plim}\, \hat{\beta}_1^{IV} = \frac{\mathrm{cov}(\beta_0 + \beta_1 T + \tilde{\varepsilon}, Z)}{\mathrm{cov}(T, Z)} = \beta_1 + \frac{\mathrm{cov}(\tilde{\varepsilon}, Z)}{\mathrm{cov}(T, Z)} = \beta_1$$

These ideas can be extended to the presence of covariates.
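The ratio-of-covariances estimator is easy to try on simulated data. The sketch below generates data from equations (1) and (2) with assumed coefficient values (γ₂ = -1, γ₃ = 0.8, β₁ = -1, β₂ = 2, all hypothetical), drops U as if it were unobserved, and compares OLS with the IV ratio. The helper names are illustrative.

```python
import random


def cov(a, b):
    """Sample covariance of two equally long lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (len(a) - 1)


def simulate_iv(n=20000, beta1=-1.0, seed=2):
    """Hypothetical draws from equations (1) and (2); U is then discarded."""
    rng = random.Random(seed)
    t, y, z = [], [], []
    for _ in range(n):
        u = rng.gauss(0, 1)   # unobserved confounder
        zi = rng.gauss(0, 1)  # instrument, independent of U and of the errors
        ti = 1.0 - 1.0 * u + 0.8 * zi + rng.gauss(0, 1)      # equation (1)
        yi = 3.0 + beta1 * ti + 2.0 * u + rng.gauss(0, 1)    # equation (2)
        t.append(ti)
        y.append(yi)
        z.append(zi)
    return t, y, z


t, y, z = simulate_iv()

# OLS slope of Y on T: biased because T is correlated with the omitted U.
beta_ols = cov(y, t) / cov(t, t)

# IV estimator: ratio of the two covariances with the instrument.
beta_iv = cov(y, z) / cov(t, z)

print(f"assumed true beta1 = -1.0, OLS = {beta_ols:.3f}, IV = {beta_iv:.3f}")
```

Because Z is generated independently of U and of ε, validity holds by construction here, which is exactly what cannot be checked with real data.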

(34)

The last equation shows that if validity is violated, i.e. cov(ε̃, Z) is not zero, then the IV estimator does not converge in probability to β₁ (it is inconsistent).
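A short worked comparison may help to see why a weak instrument makes matters worse. The formulas below are the standard asymptotic biases of the IV and OLS slopes in the model without covariates, with ε̃ denoting the error term that contains the omitted U; the numbers in the comments are hypothetical and chosen only for illustration.

```latex
\[
\mathrm{plim}\,\hat{\beta}_1^{IV} - \beta_1
  = \frac{\mathrm{cov}(\tilde{\varepsilon}, Z)}{\mathrm{cov}(T, Z)},
\qquad
\mathrm{plim}\,\hat{\beta}_1^{OLS} - \beta_1
  = \frac{\mathrm{cov}(\tilde{\varepsilon}, T)}{\mathrm{var}(T)}
\]
% Hypothetical numbers, for illustration only:
% cov(eps~, Z) = 0.05 with cov(T, Z) = 0.50 gives an asymptotic bias of 0.05 / 0.50 = 0.1;
% the same cov(eps~, Z) = 0.05 with a weaker instrument, cov(T, Z) = 0.10,
% gives 0.05 / 0.10 = 0.5.
```

A small violation of validity is therefore magnified when cov(T, Z) is small, which is one way the IV estimate can end up further from β₁ than the OLS estimate.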