Techniques of Statistical Analysis I
Lect_13: Endogeneity and Instrumental Variables (+robust standard errors)
Bruno Arpino
Outline
Assumptions underlying linear regression models
Omitted Variable and Simultaneity Bias
Instrumental Variables Regression
Assumptions
$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k + \varepsilon$
1) Linearity: The expected value of Y is a linear function of the independent variables.
2) The error terms are random variables with mean 0 and a constant variance, $\sigma^2$:
$E[\varepsilon_i] = 0 \ \ \forall i, \qquad \mathrm{Var}[\varepsilon_i] = \sigma^2 \ \ \forall i$
The constant variance assumption is known as homoskedasticity.
3) Exogeneity: The regressors and the error term are not correlated.
Assumptions (cont’d)
$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k + \varepsilon$
4) The random error terms, $\varepsilon_i$, are not correlated with one another, so that:
$\mathrm{Cov}(\varepsilon_i, \varepsilon_j) = 0 \ \ \forall i \neq j$
5) Normality: The random error terms are normally distributed.
Assumption 5 is not required for obtaining the OLS estimates but it is required to make inference. Tests of hypotheses (t-tests) and confidence intervals are based on this assumption.
Assumptions (cont’d)
For assumptions 1, 2, 4 and 5 there are statistical tests to check their validity. Statistical solutions exist to deal with violations of these assumptions.
We focus on assumption 3 (exogeneity), which cannot be tested. However, its plausibility can be judged by the
Omitted variable bias
Omitted Variable Bias (OVB) is the bias of the estimators of regression coefficients caused by omitting relevant variables.
Omitted relevant variables are variables correlated both with the dependent variable and with the independent variables included in the regression (i.e., confounders).
OVB causes the violation of assumption 3.
Let’s see why…
Omitted variable bias (cont’d)
Let us assume the true model is:
$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon$
but we omit $X_2$ and estimate the model:
$Y = \tilde{\beta}_0 + \tilde{\beta}_1 X_1 + \tilde{\varepsilon}$
Now the error term includes $X_2$:
$\tilde{\varepsilon} = \beta_2 X_2 + \varepsilon$
The consequence is that if $X_1$ and $X_2$ are correlated, in the estimated model $X_1$ is correlated with the error term, and this violates assumption 3.
The existence of a bidirectional relationship between $X_1$ and Y also violates assumption 3.
In this case too we say that $X_1$ is endogenous and OLS gives a biased estimate of its slope: simultaneity bias!
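A minimal simulation sketch of the omitted variable argument above. All numbers (coefficients, sample size, the 0.5 link between X1 and X2) are hypothetical choices for illustration, not estimates from any dataset. The sketch compares the slope on X1 when X2 is included and when it is omitted, and checks it against the textbook limit $\beta_1 + \beta_2\,\mathrm{cov}(X_1, X_2)/\mathrm{var}(X_1)$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical true model: Y = 1 + 2*X1 + 3*X2 + eps, with X1 and X2 correlated.
x2 = rng.normal(size=n)
x1 = 0.5 * x2 + rng.normal(size=n)
y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(size=n)


def ols(y, *regressors):
    """Return OLS coefficients, intercept first."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


full = ols(y, x1, x2)   # correctly specified model
short = ols(y, x1)      # X2 omitted: it moves into the error term

# Limit of the short-regression slope implied by the OVB formula.
expected_short = 2.0 + 3.0 * np.cov(x1, x2)[0, 1] / np.var(x1, ddof=1)

print("slope on X1, X2 included:", round(full[1], 3))
print("slope on X1, X2 omitted: ", round(short[1], 3))
print("OVB formula prediction:  ", round(expected_short, 3))
```

With these choices the slope from the short regression settles well above the true value of 2, because X1 picks up part of the effect of the omitted X2.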
An example of endogeneity: the effect of smoking during pregnancy
Consider the dataset “smoking.dta”, which includes information on 1388 children born in 1988 in the US. The file includes the following variables:
bwght      birth weight, ounces
cigs       cigarettes smoked per day by the mother while pregnant
male       =1 if male child
parity     birth order of the child
motheduc   mother's years of education
faminc     family income (1988)
cigprice   (average) cigarette price in the home state (1988)
smoking.dta is a subset of the data analyzed by: J. Mullahy (1997),
Research question and policy implications
Does smoking during pregnancy have a negative impact on child health?
If so, can the policy maker protect children's health by increasing cigarette taxes?
Can we expect that an increase in cigarette taxes will be effective in reducing cigarette smoking? (Is cigarette demand elastic to cigarette price?)
SMOKING → Birth weight (measure of child health)
The “Science”: the literature review
Preliminary step: based on the theory and past research, is it reasonable to establish a causal pathway from smoking to health?
Holland (1986): “statistics can only estimate the effects of causes”.
It is the “Science” that helps in building the underlying theoretical framework and deciding the role each variable is expected to play (outcome, regressor, confounder…)
Start with a literature review. Examples of two papers on the topic are:
J. van Reek, R. Knibbe, T. van Iwaarde (1991), Policy relevance of a survey on smoking and drinking behaviour among Dutch school children, Health Policy, 18 (3), pp 261-268.
Previous research and the policy relevance of the smoking-health relationship
It is known that smoking is a cause of a number of diseases, including lung cancer and chronic obstructive pulmonary disease.
It is strongly suspected to be one of the causes of many others, including heart disease, stroke, most cancers, and congenital anomalies.
It exacerbates some diseases, including diabetes, HIV, depression, respiratory infection, chronic liver disease, arthritis, nephritis, and ulcers.
In countries where there is a public health system, the society
The “Science”: confounders, assumptions
Leigh and Schembri (2004) note that:
“Risk aversion might lead people to never smoke and to maintain good health. Any correlation between smoking and health that did not remove risk aversion would overestimate the effect of smoking on health.”
Leigh and Schembri (2004) also note that another problem can arise:
“…physical functional decline could scare someone and result in the person quitting smoking. The smoking that lead to the disability may have stopped months or years before, yet the person would likely still have the disability. This is a problem of BIDIRECTIONAL CAUSALITY and could lead to underestimate the effect of smoking.”
The “Science”: confounders, assumptions
In fact, think for example in this way:
non-smoker (at time t): good health (at time t) → non-smoker (at time t+1)
non-smoker (at time t): bad health (at time t) → non-smoker (at time t+1)
smoker (at time t): good health (at time t) → smoker (at time t+1)
Potential direction of biases
Bias = true parameter - expected value of estimator
[Figure: two causal diagrams. In each, an unobserved U affects both T (smoking) and Y (child birth weight). U is risk aversion in the first diagram and previous health status in the second. The arrows carry + and - signs.]
And now statistics: which method is appropriate?
Our data are not sufficient to use OLS. In fact, this method requires that all the relevant confounders are observed and controlled for.
Risk aversion is an important unobserved confounder. Not controlling for it causes OVB!
The fact that causality is bidirectional also violates the exogeneity assumption (causing simultaneity bias). (Here we can again think of this problem as an omitted variable bias: omission of the previous health status of the mother.)
A graphical representation of the Instrumental Variables model (slide 17)
[Figure: causal diagram with nodes T (smoking), Y (child birth weight), X (income, education, child parity and gender…), U (risk aversion; previous health status) and the instrument Z.]
The characteristics of a good instrument
A suitable instrument, Z, is a variable with 2 characteristics:
RELEVANCE: Z is correlated with the endogenous variable (X1 in the earlier notation, T in the diagram).
VALIDITY (or Exclusion restriction): Z is uncorrelated with the outcome, conditional on the endogenous variable.
In other words, it does not have direct effects on Y. Z affects the outcome only through its effect on the endogenous variable.
Z acts as a randomization device.
A Venn-diagram representation of IV
[Figure: Venn diagram with circles for Y, T, U and Z. The overlap of T with U is labelled “Bad variation in T”; the rest of T is labelled “Good variation in T”.]
A circle represents variation in a variable. Overlapping parts represent covariance. IV methods use the association between Z and T to extract from the whole variation of T a sub-part of variation that is not contaminated by U. Note that this is only a sub-part of the good variation.
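The Two Stage Least Squares Estimator (2SLS)
In the first stage, the endogenous variable T is regressed on the Xs variables and the instrument Z:
(1st stage) $T_i = \gamma_0 + \sum_k \gamma^X_k X_{ki} + \gamma_3 Z_i + \delta_i$
In the second stage, the outcome variable is regressed on the Xs variables and the predictions of T from the first stage, $\hat{T}_i$:
(2nd stage) $Y_i = \beta_0 + \beta_1 \hat{T}_i + \sum_k \beta^X_k X_{ki} + \varepsilon_i$
The first stage helps in separating the “bad” variation in T (related to U) from the “good” variation (not related to U). In a sense, we “purify” T.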
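The two stages can be sketched by hand on simulated data. This is only an illustration under assumed values: the data-generating process, the coefficients and the true effect of -1 are hypothetical, and the variables merely play the roles of X (observed covariate), U (unobserved confounder), Z (instrument), T and Y from the slides.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

x = rng.normal(size=n)   # observed covariate X
u = rng.normal(size=n)   # unobserved confounder U
z = rng.normal(size=n)   # instrument Z, generated independently of U
t = 0.5 * x + 1.0 * u + 0.8 * z + rng.normal(size=n)
y = 1.0 - 1.0 * t + 0.5 * x + 2.0 * u + rng.normal(size=n)  # assumed true effect of T: -1


def ols(y, X):
    """Return OLS coefficients for design matrix X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


const = np.ones(n)

# Naive OLS of Y on T and X: U is omitted, so the slope on T is biased.
b_ols = ols(y, np.column_stack([const, t, x]))

# First stage: regress T on X and Z, keep the fitted values.
W = np.column_stack([const, x, z])
t_hat = W @ ols(t, W)

# Second stage: regress Y on the fitted T and X.
b_2sls = ols(y, np.column_stack([const, t_hat, x]))

print("OLS slope on T: ", round(b_ols[1], 3))
print("2SLS slope on T:", round(b_2sls[1], 3))
```

The point estimate from the manual second stage is the 2SLS estimate, but the standard errors that a plain second-stage regression would report are not valid, which is one reason to use a dedicated IV routine in practice.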
The cure can be worse than the disease!!!
If the two requirements of relevance and validity are violated (the instruments are not good), then the IV estimator can be more biased than the simple OLS!!!
We talk about an invalid instrument if Z has a direct effect on Y (e.g., education is not an instrument because it has a direct effect on health, not only an indirect effect through smoking behavior). Validity is primarily assessed on the basis of theoretical/common sense arguments.
We talk about a weak instrument if Z is not strongly correlated with T. This assumption can be easily tested using the first stage results.
The effect of smoking during pregnancy: OLS estimates
We first estimate a standard linear regression model where the child's birth weight is a function of family income, mother's education, child parity, child gender and cigarettes smoked per day by the mother during pregnancy.
The estimated effect of cigarettes is negative (as expected) and highly significant. However, given the previous considerations about endogeneity, we cannot rely on this estimate.
Do we have any good IV candidate?
Cigarette price is a good candidate for IV. In fact, it is not directly related to children's health (valid) while it is potentially correlated with the cigarettes smoked by mothers (relevant). Let’s verify whether this correlation exists.
We use a simple demand model for cigarettes where cigarettes smoked per day depend on the price, family income, mother's education, etc.
As expected, cigarette price is a relevant instrument because it is significantly related to the endogenous variable. Policy makers can consider reducing smoking through an increase in taxes.
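A relevance check of this kind can be sketched as a first-stage F test for the instrument, which is commonly compared with the rule-of-thumb value of 10. The data below are simulated and every coefficient is a hypothetical choice; `x` and `z` merely stand in for a covariate and an instrument such as cigarette price, and nothing here reproduces the smoking.dta results.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1_000

x = rng.normal(size=n)                         # observed covariate
z = rng.normal(size=n)                         # candidate instrument
t = 0.5 * x + 0.3 * z + rng.normal(size=n)     # endogenous variable (first stage)


def rss(y, X):
    """Residual sum of squares from an OLS fit of y on X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return float(resid @ resid)


const = np.ones(n)
X_restricted = np.column_stack([const, x])        # first stage without Z
X_unrestricted = np.column_stack([const, x, z])   # first stage with Z

q = 1                          # number of excluded instruments being tested
k = X_unrestricted.shape[1]    # parameters in the unrestricted model
rss_r = rss(t, X_restricted)
rss_u = rss(t, X_unrestricted)

# Classical F statistic for H0: the coefficient on Z is zero.
f_stat = ((rss_r - rss_u) / q) / (rss_u / (n - k))
print("first-stage F for the instrument:", round(f_stat, 1))
```

With a single instrument this F statistic equals the square of the t statistic on Z in the first stage.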
The effect of smoking during pregnancy: 2SLS estimates
The F-test shows that the instrument is relevant (as a rule of thumb, a value of F > 10 is considered acceptable).
The estimated effect of cigarettes is now higher in absolute value. For each additional cigarette the model predicts a reduction in the child's weight equal to 0.97 ounces (27.5 grams). The OLS underestimated the effect of smoking. This can
What is the IV estimating?
As shown in the Venn diagram, the instrument does not capture the whole “good” variation in the treatment. In other words, the covariance between Z and T is a sub-part of the variation of T that does not overlap with U (“good variation”).
Angrist, Imbens and Rubin (1996) gave a clear interpretation of this point.
They showed that an IV analysis can only estimate the causal effect of the endogenous variable on a specific sub-population of units, the compliers.
What is the IV estimating? (cont’d)
Compliers are those who “react” to the instrument variation. In our example, compliers are those who change their smoking behavior because of a change in cigarette price. They determine the IV relevance.
Non-compliers are those who do not react to a variation in cigarette price (their cigarette demand is inelastic).
A policy implication of the analysis is that a policy aimed at increasing cigarette prices through an increase in cigarette taxes will succeed in improving children's health through a reduction in mothers' smoking, but only for a sub-group of mothers (those with elastic cigarette demand).
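The complier interpretation can be illustrated with a deliberately simplified sketch: a binary instrument and a binary treatment, instead of the continuous price and cigarette counts of the example. The population shares and the effect sizes (-1 for compliers, -5 for everyone else) are hypothetical. The IV (Wald) ratio recovers the effect among compliers, not the average effect in the whole population.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000

z = rng.integers(0, 2, size=n)   # binary instrument (e.g., high vs low price)
kind = rng.choice(["complier", "always", "never"], size=n, p=[0.4, 0.3, 0.3])

# Compliers follow the instrument; the others ignore it.
t = np.where(kind == "complier", z, np.where(kind == "always", 1, 0))

# Heterogeneous treatment effects (assumed values).
effect = np.where(kind == "complier", -1.0, -5.0)
y = 10.0 + effect * t + rng.normal(size=n)

# Wald / IV estimator: ratio of the two differences in means across Z.
wald = (y[z == 1].mean() - y[z == 0].mean()) / (t[z == 1].mean() - t[z == 0].mean())

print("IV (Wald) estimate:             ", round(wald, 3))
print("effect among compliers:         ", -1.0)
print("average effect in the population:", round(effect.mean(), 3))
```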
Cross-sectional vs longitudinal data
Cross-sectional data refer to a specific point in time (e.g., a survey conducted in 2008).
Longitudinal data refer to information collected at different points in time (e.g., ESS data are available for different years). If the same units are followed we have panel data.
If we estimate a linear regression model on panel data with the standard OLS method, we violate the 4th assumption (no correlation among the error terms):
$\mathrm{Cov}(\varepsilon_i, \varepsilon_j) = 0 \ \ \forall i \neq j$
The reason is that different observations on the same person in
Robust standard errors
A similar problem arises with multilevel data. E.g., students nested into classes: observations on students in the same class are not independent (for example, because they share the same teacher).
The consequence of the violation of assumption 4 is that the standard errors (but not the regression slopes) are biased. (The same happens when homoskedasticity is violated.)
Stata (as well as other software) offers an alternative way of calculating the standard errors that allows for possible violations of assumption 4 (and homoskedasticity): robust standard errors.
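To make the idea concrete, here is a sketch of a cluster-robust (sandwich) variance computed by hand on simulated "students in classes" data. The design (200 classes of 20 students, a class-level regressor and a shared class effect) is hypothetical, and the sketch omits the finite-sample corrections that statistical software typically applies, so its numbers would not match software output exactly.

```python
import numpy as np

rng = np.random.default_rng(4)
n_groups, per_group = 200, 20
g = np.repeat(np.arange(n_groups), per_group)    # class identifier per student

x = rng.normal(size=n_groups)[g]                 # regressor shared within a class
class_effect = rng.normal(size=n_groups)[g]      # shared unobserved class shock
y = 1.0 + 0.5 * x + class_effect + rng.normal(size=g.size)

X = np.column_stack([np.ones(g.size), x])
n, k = X.shape
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
resid = y - X @ beta

# Classical standard errors: assume uncorrelated, homoskedastic errors.
sigma2 = resid @ resid / (n - k)
se_classic = np.sqrt(np.diag(sigma2 * XtX_inv))

# Cluster-robust sandwich: sum the score X_g' u_g within each class first.
meat = np.zeros((k, k))
for c in range(n_groups):
    idx = g == c
    score = X[idx].T @ resid[idx]
    meat += np.outer(score, score)
se_cluster = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))

print("slope estimate:       ", round(beta[1], 3))
print("classical SE of slope:", round(se_classic[1], 4))
print("cluster-robust SE:    ", round(se_cluster[1], 4))
```

The slope estimate is the same in both cases; only the standard error changes, and here the classical one is clearly too small because it ignores the correlation within classes.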
When to use robust standard errors instead of multilevel models?
You want to use multilevel models when the multilevel structure is of interest for your research, i.e., when you have multilevel questions such as:
How much variance exists in fertility intentions across cities/regions compared to individual variability?
Can the unemployment rate help explain the regional variability in political identity?
Appendix for the brave: a formal representation of the IV model and assumptions
Representing the model by equations
(Let’s omit for simplicity the X variables from the model)
The picture in slide 17 translates into two equations (regression models):
(1) $T_i = \gamma_0 + \gamma_2 U_i + \gamma_3 Z_i + \delta_i$
(2) $Y_i = \beta_0 + \beta_1 T_i + \beta_2 U_i + \varepsilon_i$
The variable Z is an instrument because it does not enter model 2.
In formulas, the 2 requirements on the instrument can be written as:
Relevance: $\mathrm{cov}(T, Z) \neq 0$
Validity: $\mathrm{cov}(Z, \beta_2 U + \varepsilon) = 0$
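Omitted variable bias and the IV method
(1) $T_i = \gamma_0 + \gamma_2 U_i + \gamma_3 Z_i + \delta_i$
(2) $Y_i = \beta_0 + \beta_1 T_i + \beta_2 U_i + \varepsilon_i$
Since U is not observed, it ends up in the error term of model (2):
$\tilde{\varepsilon}_i = \beta_2 U_i + \varepsilon_i$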
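This will cause bias. How does an IV help? In this simple case (without covariates, X) the IV estimator is the ratio of two covariances:
$\hat{\beta}_1^{IV} = \dfrac{\mathrm{cov}(Y, Z)}{\mathrm{cov}(T, Z)}$
It’s quite simple to show that this estimator is consistent:
$\mathrm{plim}\, \hat{\beta}_1^{IV} = \dfrac{\mathrm{cov}(\beta_0 + \beta_1 T + \tilde{\varepsilon}, Z)}{\mathrm{cov}(T, Z)} = \dfrac{\beta_1 \mathrm{cov}(T, Z) + \mathrm{cov}(\tilde{\varepsilon}, Z)}{\mathrm{cov}(T, Z)} = \beta_1 + \dfrac{\mathrm{cov}(\tilde{\varepsilon}, Z)}{\mathrm{cov}(T, Z)} = \beta_1$
The last equality uses validity, $\mathrm{cov}(\tilde{\varepsilon}, Z) = 0$, with $\tilde{\varepsilon} = \beta_2 U + \varepsilon$, and relevance, $\mathrm{cov}(T, Z) \neq 0$.
These ideas can be extended to the presence of covariates.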
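The covariance-ratio estimator is easy to check numerically. The sketch below simulates equations (1) and (2) with hypothetical coefficient values (true $\beta_1 = -1.5$) and a valid, relevant instrument, then compares the OLS slope with the IV ratio.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200_000

u = rng.normal(size=n)   # unobserved confounder U
z = rng.normal(size=n)   # instrument Z, independent of U and of the errors
t = 0.5 + 1.0 * u + 0.7 * z + rng.normal(size=n)   # equation (1)
y = 2.0 - 1.5 * t + 2.0 * u + rng.normal(size=n)   # equation (2), beta_1 = -1.5


def cov(a, b):
    """Sample covariance between two 1-D arrays."""
    return np.cov(a, b)[0, 1]


beta_ols = cov(y, t) / np.var(t, ddof=1)   # slope of Y on T with U omitted
beta_iv = cov(y, z) / cov(t, z)            # ratio of the two covariances

print("OLS slope:", round(beta_ols, 3))
print("IV ratio: ", round(beta_iv, 3))
```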
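The consequence of invalid and weak instruments
$\mathrm{plim}\, \hat{\beta}_1^{IV} = \dfrac{\mathrm{cov}(\beta_0 + \beta_1 T + \tilde{\varepsilon}, Z)}{\mathrm{cov}(T, Z)} = \dfrac{\beta_1 \mathrm{cov}(T, Z) + \mathrm{cov}(\tilde{\varepsilon}, Z)}{\mathrm{cov}(T, Z)} = \beta_1 + \dfrac{\mathrm{cov}(\tilde{\varepsilon}, Z)}{\mathrm{cov}(T, Z)}$
If the instrument is invalid, i.e. $\mathrm{cov}(\tilde{\varepsilon}, Z) \neq 0$, then the IV estimator does not converge in probability to β1 (inconsistent). If the instrument is also weak, $\mathrm{cov}(T, Z)$ is close to 0 and even a small $\mathrm{cov}(\tilde{\varepsilon}, Z)$ produces a large inconsistency.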
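A small simulation sketch of this point, with hypothetical parameter values throughout. The function `simulate` and its arguments `gamma3` (strength of the instrument in the first stage) and `direct` (a direct effect of Z on Y that breaks validity) are illustrative names, not part of the slides. The true effect of T is set to -1.5.

```python
import numpy as np


def cov(a, b):
    """Sample covariance between two 1-D arrays."""
    return np.cov(a, b)[0, 1]


def simulate(gamma3, direct, n=200_000, seed=6):
    """Return (OLS slope, IV ratio) for a given instrument strength and direct effect."""
    rng = np.random.default_rng(seed)
    u = rng.normal(size=n)
    z = rng.normal(size=n)
    t = 1.0 * u + gamma3 * z + rng.normal(size=n)
    # 'direct' lets Z affect Y other than through T (invalid instrument if nonzero).
    y = -1.5 * t + 0.5 * u + direct * z + rng.normal(size=n)
    return cov(y, t) / np.var(t, ddof=1), cov(y, z) / cov(t, z)


TRUE_EFFECT = -1.5
ols_good, iv_good = simulate(gamma3=0.7, direct=0.0)   # strong and valid instrument
ols_bad, iv_bad = simulate(gamma3=0.1, direct=0.2)     # weak and invalid instrument

print("valid, strong Z: OLS error", round(ols_good - TRUE_EFFECT, 3),
      "| IV error", round(iv_good - TRUE_EFFECT, 3))
print("invalid, weak Z: OLS error", round(ols_bad - TRUE_EFFECT, 3),
      "| IV error", round(iv_bad - TRUE_EFFECT, 3))
```

In the second scenario the small direct effect is divided by a small first-stage covariance, so the IV error ends up far larger than the OLS error.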
If something is not clear
(or you find mistakes in the slides)