Causal inference Multiple regression
Quantitative Methods I:
Multiple linear regression
Johan A. Elkink
University College Dublin
Outline
1 Causal inference
2 Multiple regression
Inference
In regression analysis we look at the relationship between (a set of) independent variable(s) and a dependent variable.
Statistical inference is concerned with the question of how likely it is to observe this relationship given the null hypothesis of no relationship (frequentist), or how much we should update our beliefs concerning this relationship given our new evidence (Bayesian).
A different question is whether or not we can deduce that the independent variable is a cause of the dependent one.
Causation
Slightly simplified, for X to be a cause of Y, we generally require:
1 X to precede Y
2 X to correlate with Y (either positively or negatively)
3 no other factor to explain the correlation between X and Y
Causation: terminology
If X causes Y:
Y is called the dependent variable, or outcome variable, or response, or ...;
X is called the independent variable, or explanatory variable, or factor, or ....
In political science, most common (unfortunately) is the usage of the terms independent and dependent variables.
Association
association ≠ causation
Given that, say, X and Y are correlated (associated), there are still many possible causal patterns at play.
Generally, to make causal inferences from your analysis, further assumptions are needed on top of the ones already made for associational or predictive inference.
Fundamental problem
Imagine there are two kinds of people: one group, T = 1, that has a college degree, and another group, T = 0, that does not. We want to measure whether a college degree leads to a higher salary, Y.
What we would like to know, for any individual i, is the difference in salary between having a college degree and not having one: $Y_i^{T_i=1} - Y_i^{T_i=0}$. However, for every individual i, we observe either $Y_i^{T_i=1}$ or $Y_i^{T_i=0}$: they either have the degree or they don't.
We wish ... we have ...
Respondent  Degree  $Y_i^{T_i=0}$  $Y_i^{T_i=1}$  effect
1           Yes     121            133            +12
2           Yes     100            109            +9
3           No      90             92             +2
4           No      87             88             +1
5           Yes     143            146            +3
6           Yes     111            124            +13
7           No      92             92             0
8           Yes     95             109            +14
Potential outcomes
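$$\text{Potential outcome} = \begin{cases} Y_{1i} & \text{if } T_i = 1 \\ Y_{0i} & \text{if } T_i = 0 \end{cases}$$
E.g., $Y_{1i}$ is the salary individual i would have with a college degree, irrespective of whether (s)he actually has one.
$$Y_i = Y_{0i} + (Y_{1i} - Y_{0i}) T_i = Y_{0i} + \delta T_i,$$
where $\delta = Y_{1i} - Y_{0i}$ is the causal effect.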
Average treatment effect
Because it is impossible to observe individual treatment effects, we usually turn to the average treatment effect (ATE):
$$E[\delta] = E[Y_{1i} - Y_{0i}] = E[Y_{1i}] - E[Y_{0i}],$$
which we could naively estimate with
$$\hat{\delta} = E[Y_{1i} \mid T_i = 1] - E[Y_{0i} \mid T_i = 0].$$
This assumes that $E[Y_{1i} \mid T_i = 1]$ reflects $E[Y_{1i}]$, the salary with a college degree irrespective of whether people actually got one, and that $E[Y_{0i} \mid T_i = 0]$ reflects $E[Y_{0i}]$, the salary without a college degree, irrespective of whether they got one or not.
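As a small worked illustration, the eight respondents from the "We wish ... we have ..." table can be used to compare the true ATE with the naive estimate. This is only a sketch: the variable names are made up for the example, and in real data the unobserved potential outcome would of course be unavailable.

```python
from statistics import mean

# Potential outcomes from the "We wish ... we have ..." table:
# (respondent, has_degree, y0, y1)
rows = [
    (1, True, 121, 133),
    (2, True, 100, 109),
    (3, False, 90, 92),
    (4, False, 87, 88),
    (5, True, 143, 146),
    (6, True, 111, 124),
    (7, False, 92, 92),
    (8, True, 95, 109),
]

# True ATE: needs both potential outcomes for everyone ("we wish").
true_ate = mean(y1 - y0 for _, _, y0, y1 in rows)

# Naive estimate: uses only the outcome each respondent reveals ("we have").
treated_mean = mean(y1 for _, d, _, y1 in rows if d)
control_mean = mean(y0 for _, d, y0, _ in rows if not d)
naive = treated_mean - control_mean

print(f"true ATE       = {true_ate:.2f}")   # 6.75
print(f"naive estimate = {naive:.2f}")      # 34.53
```

In this toy table the naive comparison overstates the effect by a wide margin, because the respondents with a degree would have earned more even without it.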
Counterfactual causality
By making such assumptions (by looking at the ATE) we are making a counterfactual argument. We are making assumptions about what $Y_{1i}$ would have been, had i had a college degree.
To understand when the ATE assumptions are reasonable, we need to look at the effect of covariates: other variables that ...
Bias in causal inference
Using shorthand $E_{01} = E[Y_{0i} \mid T_i = 1]$, etc., and taking $\pi$ as the population proportion that received the treatment,
$$E[\delta] = \pi E[\delta \mid T_i = 1] + (1 - \pi) E[\delta \mid T_i = 0] = \pi (E_{11} - E_{01}) + (1 - \pi)(E_{10} - E_{00})$$
can be decomposed into
$$(E_{11} - E_{00}) = E[\delta] + (E_{01} - E_{00}) + (1 - \pi)\{(E_{11} - E_{01}) - (E_{10} - E_{00})\}.$$
$(E_{11} - E_{00})$   observed difference in effect
$E[\delta]$           average treatment effect
$(E_{01} - E_{00})$   selection bias
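The decomposition can be checked numerically on the eight respondents from the earlier table. The sketch below is illustrative only: the variable names are invented, and the label "heterogeneity" for the third term (how the effect differs between treated and untreated) is an interpretation, not a label taken from the slide.

```python
from statistics import mean

# (has_degree, y0, y1) for the eight respondents in the earlier table
rows = [
    (True, 121, 133), (True, 100, 109), (False, 90, 92), (False, 87, 88),
    (True, 143, 146), (True, 111, 124), (False, 92, 92), (True, 95, 109),
]

treated = [(y0, y1) for d, y0, y1 in rows if d]
untreated = [(y0, y1) for d, y0, y1 in rows if not d]
pi = len(treated) / len(rows)           # proportion treated

# Shorthand E_ab = E[Y_a | T = b]
E11 = mean(y1 for _, y1 in treated)     # observed
E01 = mean(y0 for y0, _ in treated)     # counterfactual
E10 = mean(y1 for _, y1 in untreated)   # counterfactual
E00 = mean(y0 for y0, _ in untreated)   # observed

ate = pi * (E11 - E01) + (1 - pi) * (E10 - E00)
selection_bias = E01 - E00
heterogeneity = (1 - pi) * ((E11 - E01) - (E10 - E00))  # assumed label
observed = E11 - E00

print(f"observed difference = {observed:.2f}")
print(f"ATE + selection bias + third term = "
      f"{ate + selection_bias + heterogeneity:.2f}")
```

Both printed numbers agree, and most of the gap between the observed difference and the ATE in this table comes from the selection bias term.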
Confounding
When studying the effect of, say, T on Y by examining the statistical association between the two variables, we need to ascertain that the observed association is not caused by a third variable, say, X.
“We can say that T and Y are confounded when there is a third variable X that influences both T and Y; such a variable is then called a confounder of T and Y.”
Confounding
Another way of saying this is that if
$E(Y \mid T, X) \neq E(Y \mid T)$ and
$E(T \mid X) \neq E(T)$,
X is a confounder of the effect of T on Y.
Confounding
If healthier patients take a drug and sicker patients do not, we can find an association between drug and recovery even when the drug does not work.
If sicker patients take a drug and healthier patients do not, we might not find an association between drug and recovery even when the drug works.
association ≠ causation
Confounding
Note that confounding is a causal concept, not an associational one!
X has to have a causal effect on T and X has to have a causal effect on Y for there to be an issue.
When to control?
X affects both T and Y ⟹ control
T affects Y, which in turn affects X ⟹ do not control
T affects X, which in turn affects Y ⟹ do not control ...
... unless you explicitly want only the direct effect
X affects Y, but not T, nor the effect of T on Y
X affects Y, not T, but it does affect the effect of T on Y (interaction)
Do control
This is the typical case of a confounding factor, and hence should be eliminated through controlling.
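A simulation can make the confounding case concrete. The sketch below assumes a hypothetical data-generating process (every coefficient is invented for illustration): X raises both the chance of treatment and the outcome, and the true effect of T on Y is set to 2.0. The `ols` helper is a thin wrapper around NumPy's least-squares solver, not a method from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Hypothetical data-generating process (all numbers are assumptions):
# X (say, ability) raises both the chance of treatment T and the outcome Y.
x = rng.normal(size=n)
t = (x + rng.normal(size=n) > 0).astype(float)
y = 2.0 * t + 3.0 * x + rng.normal(size=n)   # true effect of T is 2.0

def ols(y, *regressors):
    """Return OLS coefficients (intercept first) via least squares."""
    design = np.column_stack([np.ones(len(y)), *regressors])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

b_naive = ols(y, t)[1]           # Y on T only: absorbs the effect of X
b_controlled = ols(y, t, x)[1]   # Y on T and X: confounder held constant

print(f"without control: {b_naive:.2f}")
print(f"with control:    {b_controlled:.2f}")
```

Under these assumptions the uncontrolled coefficient lands far above 2.0, while the controlled one should sit close to it.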
Don’t control
In this case, X is an effect of Y. By controlling for X, you can severely underestimate the effect of T on Y.
Imagine that a college degree leads to a better income, which leads to a nicer car. Controlling for the price of the car in estimating the effect of having a college degree on income might cancel the effect.
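The car example can be sketched as a simulation. Everything here is an assumed toy setup: the degree is assigned at random so that no confounding is in play, the true effect of the degree on income is set to 2.0, and the car price simply tracks income plus noise.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 20_000

# Hypothetical chain: degree -> income -> car price (numbers are assumptions).
degree = rng.integers(0, 2, size=n).astype(float)
income = 2.0 * degree + rng.normal(size=n)        # true effect is 2.0
car_price = income + 0.5 * rng.normal(size=n)     # consequence of income

def ols(y, *regressors):
    """Return OLS coefficients (intercept first) via least squares."""
    design = np.column_stack([np.ones(len(y)), *regressors])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

b_total = ols(income, degree)[1]                  # correct specification
b_with_car = ols(income, degree, car_price)[1]    # controls for an effect of Y

print(f"degree effect, no control:          {b_total:.2f}")
print(f"degree effect, controlling for car: {b_with_car:.2f}")
```

In this setup the coefficient on the degree shrinks sharply once the car price enters the model, even though the degree really does raise income.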
Don’t control
To get the overall effect of T on Y, you want to include the effect through X.
E.g. if you want to know the effect of changing the policy regarding smoking in pubs on the amount of smoking in general, you do not care through what mechanism this happened (through peer pressure, laziness, etc.), but only about the overall effect.
Maybe control
Remember the following equation:
$$\beta^* = \beta + \phi\gamma$$
Sometimes you are interested in $\beta$ (so control), sometimes in $\beta^*$ (so don't control).
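The equation can be verified on simulated data. The slide does not restate what the symbols mean, so the sketch assumes the usual reading: $\beta$ is the coefficient on T when X is controlled for, $\gamma$ is the coefficient on X in that same regression, $\phi$ is the slope from regressing X on T, and $\beta^*$ is the coefficient on T when X is left out. The data-generating numbers are invented.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5_000

# Hypothetical mediation: T -> X -> Y, plus a direct path T -> Y.
t = rng.integers(0, 2, size=n).astype(float)
x = 0.8 * t + rng.normal(size=n)              # mediator
y = 1.0 * t + 1.5 * x + rng.normal(size=n)

def ols(y, *regressors):
    """Return OLS coefficients (intercept first) via least squares."""
    design = np.column_stack([np.ones(len(y)), *regressors])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

_, beta, gamma = ols(y, t, x)    # long regression: direct effect and X's effect
_, phi = ols(x, t)               # auxiliary regression of X on T
_, beta_star = ols(y, t)         # short regression: overall effect

print(f"beta* (short regression) = {beta_star:.4f}")
print(f"beta + phi * gamma       = {beta + phi * gamma:.4f}")
```

The two printed values coincide up to floating-point error, since the identity holds exactly for OLS estimates in any sample, not only on average.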
Maybe control
Example: A city provides access to theatres, cinemas, etc., which may in turn lead to more happiness. To estimate whether moving to a city makes one happier, there is the mediating factor of theatre availability.
If you control for theatre availability, you underestimate the effect of city-living on happiness.
Maybe control
Example: A scholarship for poorer students might help them to get a college degree, which in turn might help them to earn more money later in life. Having a scholarship on your CV, however, might also further your career, independent of the effect of having a college degree.
To see the overall effect of the scholarship, do not control for having a college degree.
To see the effect of having a scholarship, independent of the effect of getting a college degree, do control for college degree.
Maybe control
When X affects Y, but not T, there is no confounding issue and the estimates for the effect of T on Y should not be affected by inclusion of X. However, including X in the model can still help for efficiency, i.e., more precise estimates.
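The efficiency point can be illustrated by repeating a small simulated study many times. This is a sketch under assumed values: T is randomized (so unrelated to X), the true effect of T is 1.0, and X has a strong effect on Y.

```python
import numpy as np

rng = np.random.default_rng(2)

def coef_on_t(y, t, *controls):
    """OLS coefficient on t, with an intercept and optional controls."""
    design = np.column_stack([np.ones(len(y)), t, *controls])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[1]

without_x, with_x = [], []
for _ in range(500):                  # 500 hypothetical replications
    n = 200
    t = rng.integers(0, 2, size=n).astype(float)   # randomized treatment
    x = rng.normal(size=n)                          # affects Y only
    y = 1.0 * t + 3.0 * x + rng.normal(size=n)      # true effect of T: 1.0
    without_x.append(coef_on_t(y, t))
    with_x.append(coef_on_t(y, t, x))

mean_without, sd_without = np.mean(without_x), np.std(without_x)
mean_with, sd_with = np.mean(with_x), np.std(with_x)

print(f"without X: mean {mean_without:.2f}, sd {sd_without:.2f}")
print(f"with X:    mean {mean_with:.2f}, sd {sd_with:.2f}")
```

Both specifications centre on the true effect, but the estimates that include X vary much less from sample to sample.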
Maybe control
Here including the interaction in your model can highlight how the effect is different for different groups.
Note that it affects the interpretation, but that the estimation of the overall ATE is not affected by controlling for X.
Kitchen sink
A typical approach in the social sciences is to collect a number of different theories / hypotheses, add them all as variables to a regression, and see “who wins”. This is the kitchen sink approach (or garbage can approach).
If anything, the above discussion should have made clear that to draw causal inferences, a clear distinction of treatment from covariates is crucial. In other words: focus your research!
(Note that the “garbage can” phrase has also been used to argue against ignoring nonlinearities (Achen, 2005), as opposed to careless specification of the causal effect.)
Kitchen sink
Another way of putting the issue is that the above is all about trying to study the effect of a cause (treatment), rather than the cause of an effect. The latter is ill-defined and runs into the philosophical issue that every cause has a cause, the “infinite regress of causation”.
(See Gerring (2001, 2012) for an extensive discussion of Y-centered and X-centered research.)
The ideal experiment
To avoid any effect of covariates, the ideal is to randomly select participants for your research from the overall population (enables inference to the population) and to randomly assign the treatment to these participants (enables causal inference).
How to control?
Experiment
Field experiment
Natural experiment
Blocking
Matching
Multiple regression
etc.
Multiple regression
The regression model can be generalized to multiple regression, which involves regressing Y on several independent variables X1, X2, etc.
Regression allows us to isolate the linear contribution of each unit of Xk on Y, “holding everything else constant”.
This is the most common and most powerful basic technique in social science statistics, and most more advanced techniques are extensions of it.
The additional X variables are typically thought of as control variables.
$R^2$
Defined in terms of sums of squares:
$$R^2 = \frac{SSE}{SST} = 1 - \frac{SSR}{SST} = 1 - \frac{\sum (y_i - \hat{y}_i)^2}{\sum (y_i - \bar{y})^2}$$
Interpretation: the proportion of the variation in y that is explained linearly by the independent variables.
A much over-used statistic: it may not be what we are interested in at all.
$R^2$
When a model has no intercept, it is possible for $R^2$ to lie outside the interval (0, 1).
$R^2$ never decreases with the addition of more explanatory variables. For this reason we often report the adjusted $R^2$:
$$1 - (1 - R^2)\frac{n - 1}{n - k}.$$
Adjusted $R^2$
One of the problems with looking at $R^2$ is that the more independent variables, the higher $R^2$, which discourages parsimony. One solution for this is the adjusted $R^2$:
$$\text{adj}R^2 = 1 - \frac{n - 1}{n - k}(1 - R^2)$$
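Both statistics can be computed by hand from a fitted regression. The sketch below uses simulated data with invented coefficients and assumes that k in the formula counts all estimated parameters including the intercept; the slides do not define k, so treat that as an assumption. SSR denotes the residual sum of squares, as in the definition of $R^2$.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100

# Simulated data (all values are assumptions for illustration).
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
irrelevant = rng.normal(size=n)            # regressor unrelated to y
y = 1.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(size=n)

def r_squared(y, *regressors):
    """Fit OLS with an intercept; return (R2, adjusted R2)."""
    design = np.column_stack([np.ones(len(y)), *regressors])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    fitted = design @ coef
    ssr = np.sum((y - fitted) ** 2)        # residual sum of squares
    sst = np.sum((y - y.mean()) ** 2)      # total sum of squares
    r2 = 1 - ssr / sst
    k = design.shape[1]                    # assumption: k includes intercept
    adj = 1 - (1 - r2) * (len(y) - 1) / (len(y) - k)
    return r2, adj

r2_small, adj_small = r_squared(y, x1, x2)
r2_big, adj_big = r_squared(y, x1, x2, irrelevant)

print(f"two regressors:        R2 = {r2_small:.3f}, adjusted = {adj_small:.3f}")
print(f"plus irrelevant one:   R2 = {r2_big:.3f}, adjusted = {adj_big:.3f}")
```

Adding the irrelevant regressor cannot lower $R^2$, whereas the adjusted version pays a penalty for the extra parameter and may go down.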
Exercise
We will study the relationship between having a degree and future earnings (education.dta).
1 Regress earnings on degree.
2 Repeat, but control for ability.
3 Repeat, but control also for schooling.
Exercise
Let’s take the silly example of movie description lengths again (films.dta).
1 Regress desclength on year and length. What do you
conclude about the relation between the duration of a movie and the number of lines used in the review?
2 Repeat, controlling for castsize. Does this revise your
Table presentation
Year           0.05 *     (0.015)
Duration       0.02       (0.016)
Cast size      0.52 *     (0.125)
Intercept      -91.04 *   (29.49)
Observations   100
Adjusted R2    0.31
F              15.87 *
Regression coefficients explaining the number of lines devoted to a movie review in Leonard Maltin’s Movie and Video Guide, 1996. Standard errors in parentheses.
Exercise: US wages
Open the uswages.dta data set.
1 Regress wage on educ, exper and race.
2 What proportion of the variance in wage is explained by
Achen, Christopher H. 2005. “Let’s put garbage-can regressions and garbage-can probits where they belong.” Conflict Management and Peace Science 22:327–339.
Angrist, Joshua D. and Jörn-Steffen Pischke. 2009. Mostly harmless econometrics: An empiricist’s companion. Princeton: Princeton University Press.
Benoit, Kenneth. 2009. “PO 7001: Quantitative Methods I.” Lecture slides, Trinity College Dublin.
Gelman, Andrew and Jennifer Hill. 2007. Data analysis using regression and multilevel/hierarchical models. Analytical Methods for Social Research. Cambridge: Cambridge University Press.
Gerring, John. 2001. Social science methodology: A critical framework. Cambridge: Cambridge University Press.
Gerring, John. 2012. Social science methodology: A unified framework. Cambridge: Cambridge University Press.
Lee, Myoung Jae. 2005. Micro-econometrics for policy, program, and treatment effects. Oxford: Oxford University Press.