Quantitative Methods I: Multiple linear regression

Academic year: 2021

Johan A. Elkink

University College Dublin

Outline

1 Causal inference

2 Multiple regression

Inference

In regression analysis we look at the relationship between (a set of) independent variable(s) and a dependent variable.

Statistical inference is concerned with the question of how likely it is to observe a relationship at least this strong given the null hypothesis of no relationship (frequentist), or of how much we should update our beliefs concerning this relationship given our new evidence (Bayesian).

A different question is whether or not we can deduce that the independent variable is a cause of the dependent one.

Causation

Slightly simplified, for X to be a cause of Y, we generally require:

1 X to precede Y

2 X to correlate with Y (either positively or negatively)

3 no other factor to explain the correlation between X and Y

Causation: terminology

If X causes Y,

Y is called the dependent variable, or outcome variable, or response, or ...;

X is called the independent variable, or explanatory variable, or factor, or ....

In political science, the terms independent and dependent variable are (unfortunately) the most common.

Association

association ≠ causation

Given that, say, X and Y are correlated (associated), there are still many possible causal patterns at play.

Generally, to make causal inferences from your analysis, further assumptions are needed beyond those already made for associational or predictive inference.

Fundamental problem

Imagine there are two kinds of people: one group, T = 1, that has a college degree, and another group, T = 0, that does not. We want to measure whether a college degree leads to a higher salary, Y.

What we would like to know, for any individual i, is the difference in salary between having a college degree and not having one: $Y_i^{T_i=1} - Y_i^{T_i=0}$. However, for every individual i, we observe either $Y_i^{T_i=1}$ or $Y_i^{T_i=0}$: they either have the degree or they don’t.

We wish ... we have ...

Respondent   Degree   $Y_i^{T_i=0}$   $Y_i^{T_i=1}$   Effect
1            Yes      121             133             +12
2            Yes      100             109             +9
3            No       90              92              +2
4            No       87              88              +1
5            Yes      143             146             +3
6            Yes      111             124             +13
7            No       92              92              0
8            Yes      95              109             +14
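The gap between what we wish for and what we have can be made concrete with the eight respondents in the table. The following is a minimal sketch in plain Python; the variable names are illustrative and not part of the slides. It compares the average of the individual effects (which needs both potential outcomes) with the naive comparison of group means (which uses only the outcome that would actually be observed).

```python
# Potential outcomes from the table: (degree, Y without degree, Y with degree).
respondents = [
    ("Yes", 121, 133), ("Yes", 100, 109), ("No", 90, 92), ("No", 87, 88),
    ("Yes", 143, 146), ("Yes", 111, 124), ("No", 92, 92), ("Yes", 95, 109),
]

def mean(values):
    values = list(values)
    return sum(values) / len(values)

# "We wish": with both potential outcomes known, average the individual effects.
true_ate = mean(y1 - y0 for _, y0, y1 in respondents)

# "We have": only the outcome matching the actual degree status is observed.
observed_treated = [y1 for degree, _, y1 in respondents if degree == "Yes"]
observed_control = [y0 for degree, y0, _ in respondents if degree == "No"]
naive_difference = mean(observed_treated) - mean(observed_control)

print(f"true ATE:         {true_ate:.2f}")          # 6.75
print(f"naive difference: {naive_difference:.2f}")  # 34.53
```

In this small example the naive comparison overstates the average effect considerably, because the respondents with a degree would have earned more even without one.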

Potential outcomes

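$$\text{Potential outcome} = \begin{cases} Y_{1i} & \text{if } T_i = 1 \\ Y_{0i} & \text{if } T_i = 0 \end{cases}$$

E.g., $Y_{1i}$ is the salary of individual i had (s)he a college degree, irrespective of whether (s)he actually does.

$$Y_i = Y_{0i} + (Y_{1i} - Y_{0i}) T_i = Y_{0i} + \delta T_i,$$

where $\delta = Y_{1i} - Y_{0i}$ is the causal effect.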

Average treatment effect

Because it is impossible to observe individual treatment effects, we usually turn to the average treatment effect:

$$E[\delta] = E[Y_{1i} - Y_{0i}] = E[Y_{1i}] - E[Y_{0i}],$$

which we could naively estimate with

$$\hat{\delta} = E[Y_{1i} \mid T_i = 1] - E[Y_{0i} \mid T_i = 0].$$

This assumes that $E[Y_{1i} \mid T_i = 1]$ reflects $E[Y_{1i}]$, the salary for people with a college degree, irrespective of whether they got one or not, and that $E[Y_{0i} \mid T_i = 0]$ reflects $E[Y_{0i}]$, the salary without a college degree, irrespective of whether they got one or not.

Counterfactual causality

By making such assumptions (by looking at the ATE), we are making a counterfactual argument. We are making assumptions of what $Y_{1i}$ would have been, had i had a college degree.

To understand when the ATE assumptions are reasonable, we need to look at the effect of covariates: other variables that

Bias in causal inference

Using shorthand $E_{01} = E[Y_{0i} \mid T_i = 1]$, etc., and taking $\pi$ as the population proportion that received the treatment,

$$E[\delta] = \pi E[\delta \mid T_i = 1] + (1 - \pi) E[\delta \mid T_i = 0] = \pi (E_{11} - E_{01}) + (1 - \pi)(E_{10} - E_{00})$$

can be decomposed into

$$(E_{11} - E_{00}) = E[\delta] + (E_{01} - E_{00}) + (1 - \pi)\{(E_{11} - E_{01}) - (E_{10} - E_{00})\}.$$

$(E_{11} - E_{00})$: observed difference in effect

$E[\delta]$: average treatment effect

$(E_{01} - E_{00})$: selection bias
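The decomposition can be checked numerically with the eight respondents from the potential-outcomes table earlier in these slides. This is only an illustrative sketch: the variable names are invented here, and the four conditional means are computable only because the table lists both potential outcomes for everyone, which real data never do.

```python
# (treated?, Y0, Y1) for the eight respondents in the potential-outcomes table.
data = [
    (1, 121, 133), (1, 100, 109), (0, 90, 92), (0, 87, 88),
    (1, 143, 146), (1, 111, 124), (0, 92, 92), (1, 95, 109),
]

def mean(values):
    values = list(values)
    return sum(values) / len(values)

pi = mean(t for t, _, _ in data)                 # share treated
E11 = mean(y1 for t, _, y1 in data if t == 1)    # E[Y1 | T = 1]
E01 = mean(y0 for t, y0, _ in data if t == 1)    # E[Y0 | T = 1]
E10 = mean(y1 for t, _, y1 in data if t == 0)    # E[Y1 | T = 0]
E00 = mean(y0 for t, y0, _ in data if t == 0)    # E[Y0 | T = 0]

ate = pi * (E11 - E01) + (1 - pi) * (E10 - E00)
selection_bias = E01 - E00
third_term = (1 - pi) * ((E11 - E01) - (E10 - E00))

observed = E11 - E00
print(f"observed difference: {observed:.2f}")
print(f"ATE {ate:.2f} + selection bias {selection_bias:.2f} "
      f"+ third term {third_term:.2f} = {ate + selection_bias + third_term:.2f}")
```

Here most of the observed difference comes from the selection bias term rather than from the average treatment effect.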

Confounding

When studying the effect of, say, T on Y by examining the statistical association between the two variables, we need to ascertain that the observed effect is not caused by a third variable, say, X.

“We can say that T and Y are confounded when there is a third variable X that influences both T and Y; such a variable is then called a confounder of T and Y.”

Another way of saying this is that if

$$E(Y \mid T, X) \neq E(Y \mid T)$$

and

$$E(T \mid X) \neq E(T),$$

X is a confounder of the effect of T on Y.

If healthier patients take a drug and sicker patients do not, we can find an association between drug and recovery even when the drug does not work.

If sicker patients take a drug and healthier patients do not, we might not find an association between drug and recovery even when the drug works.

association ≠ causation
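The first scenario can be illustrated with a small simulation. This is a sketch under invented assumptions: the probabilities, sample size and variable names are hypothetical, and the drug is given no effect at all by construction.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical setup: recovery depends on health only, never on the drug.
healthy = rng.random(n) < 0.5                              # confounder X
takes_drug = rng.random(n) < np.where(healthy, 0.8, 0.2)   # healthier take it more
recovers = rng.random(n) < np.where(healthy, 0.9, 0.3)     # health drives recovery

# Naive comparison of recovery rates between takers and non-takers.
naive = recovers[takes_drug].mean() - recovers[~takes_drug].mean()

# Comparing like with like: the same difference within each health group.
within = [
    recovers[takes_drug & (healthy == h)].mean()
    - recovers[~takes_drug & (healthy == h)].mean()
    for h in (True, False)
]

print(f"naive difference in recovery rate: {naive:.3f}")
print("within healthy / within sick:", [round(float(w), 3) for w in within])
```

The naive difference is large even though the drug does nothing, while the within-group differences are close to zero. Reversing who takes the drug (sicker patients more often) in a setup where the drug does work would illustrate the second scenario.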

Note that confounding is a causal concept, not an associational one!

X has to have a causal effect on T and X has to have a causal effect on Y for there to be an issue.

When to control?

X affects both T and Y ⇒ control

T affects Y, which in turn affects X ⇒ do not control

T affects X, which in turn affects Y ⇒ do not control ...

... unless you explicitly want only the direct effect

X affects Y, but not T, nor the effect of T on Y

X affects Y, not T, but it does affect the effect of T on Y (interaction)

Do control

When X affects both T and Y, X is the typical case of a confounding factor, and its influence should hence be eliminated through controlling.
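A simulated example may help to see what controlling does in the confounding case. The sketch below assumes a made-up data-generating process (all coefficients are hypothetical) in which X drives both T and Y, and uses a small least-squares helper written for this illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000

# Hypothetical data-generating process: X affects both T and Y.
x = rng.normal(size=n)
t = (x + rng.normal(size=n) > 0).astype(float)   # treatment depends on X
y = 2.0 * t + 3.0 * x + rng.normal(size=n)       # true effect of T is 2.0

def ols(y, *columns):
    """Least-squares coefficients, intercept first."""
    design = np.column_stack([np.ones(len(y)), *columns])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

without_control = ols(y, t)[1]     # regress Y on T only
with_control = ols(y, t, x)[1]     # regress Y on T and X

print(f"effect of T without controlling for X: {without_control:.2f}")
print(f"effect of T controlling for X:         {with_control:.2f}")
```

Without the control the coefficient on T absorbs part of the effect of X; with the control it is close to the value of 2.0 built into the simulation.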

Don’t control

When T affects Y, which in turn affects X, X is an effect of Y. By controlling for X, you can severely underestimate the effect of T on Y.

Imagine that a college degree leads to a better income, which leads to a nicer car. Controlling for the price of the car when estimating the effect of having a college degree on income might cancel the effect.
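The car example can be simulated. All numbers below are invented for illustration: the degree is assigned at random so that there is no confounding, income depends on the degree, and the car price depends only on income.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000

# Hypothetical chain: degree -> income -> car price.
degree = (rng.random(n) < 0.5).astype(float)           # T
income = 10.0 * degree + rng.normal(0, 5, size=n)      # Y, true effect is 10
car_price = 0.8 * income + rng.normal(0, 2, size=n)    # X, caused by Y

def ols(y, *columns):
    """Least-squares coefficients, intercept first."""
    design = np.column_stack([np.ones(len(y)), *columns])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

effect_alone = ols(income, degree)[1]
effect_with_car = ols(income, degree, car_price)[1]

print(f"effect of degree on income:                 {effect_alone:.2f}")
print(f"same, after controlling for the car price:  {effect_with_car:.2f}")
```

Because the car price is largely a noisy copy of income, controlling for it removes most of the effect of the degree from the estimate.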

Don’t control

When T affects X, which in turn affects Y, you want to include the effect through X to get the overall effect of T on Y.

E.g., if you want to know the effect of changing the policy regarding smoking in pubs on the amount of smoking in general, you do not care through what mechanism this happened (through peer pressure, laziness, etc.), but only about the overall effect.

Maybe control

Remember the following equation:

$$\beta^* = \beta + \phi\gamma$$

Sometimes you are interested in $\beta$ (so control), sometimes in $\beta^*$ (so don’t control).
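The equation can be verified on simulated data. The symbols are not defined on this slide, so the sketch assumes the usual reading: $\beta^*$ is the coefficient on T when X is left out, $\beta$ the coefficient on T when X is included, $\gamma$ the coefficient on X in that same regression, and $\phi$ the slope from regressing X on T. The data-generating values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10_000

# Hypothetical mediation: T affects Y directly and through X.
t = rng.normal(size=n)
x = 0.5 * t + rng.normal(size=n)
y = 1.0 * t + 2.0 * x + rng.normal(size=n)

def ols(y, *columns):
    """Least-squares coefficients, intercept first."""
    design = np.column_stack([np.ones(len(y)), *columns])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

beta_star = ols(y, t)[1]           # Y on T only: overall effect
_, beta, gamma = ols(y, t, x)      # Y on T and X: direct effect and effect of X
phi = ols(x, t)[1]                 # X on T

print(f"beta* = {beta_star:.3f}")
print(f"beta + phi * gamma = {beta + phi * gamma:.3f}")
```

Under these assumptions the two printed numbers agree up to rounding, since for least-squares estimates the relationship holds exactly in the sample, not only on average.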

Example: A city provides access to theatres, cinemas, etc., which may in turn lead to more happiness. To estimate whether moving to a city makes one happier, there is the mediating factor of theatre availability.

If you control for theatre availability, you underestimate the effect of city-living on happiness.

Example: A scholarship for poorer students might help them to get a college degree, which in turn might help them to earn more money later in life. Having a scholarship on your CV, however, might also further your career, independent of the effect of having a college degree.

To see the overall effect of the scholarship, don’t control for having a college degree.

To see the effect of having a scholarship, independent of the effect of getting a college degree, do control for college degree.

Maybe control

When X affects Y, but not T, there is no confounding issue and the estimates for the effect of T on Y should not be affected by inclusion of X. However, including X in the model can still help with efficiency.
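The efficiency gain can be seen by comparing standard errors with and without X. The sketch below uses invented coefficients and computes classical (homoskedastic) standard errors by hand; the helper function is written for this illustration only.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 2_000

# Hypothetical setup: X affects Y but is unrelated to T.
t = (rng.random(n) < 0.5).astype(float)
x = rng.normal(size=n)
y = 1.0 * t + 3.0 * x + rng.normal(size=n)

def ols_with_se(y, *columns):
    """Coefficients and classical standard errors, intercept first."""
    design = np.column_stack([np.ones(len(y)), *columns])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    sigma2 = resid @ resid / (len(y) - design.shape[1])
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(design.T @ design)))
    return coef, se

coef_short, se_short = ols_with_se(y, t)
coef_long, se_long = ols_with_se(y, t, x)

print(f"without X: effect {coef_short[1]:.2f}, s.e. {se_short[1]:.3f}")
print(f"with X:    effect {coef_long[1]:.2f}, s.e. {se_long[1]:.3f}")
```

Both regressions target the same effect of T, but including X removes a large part of the residual variance and so shrinks the standard error.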

Maybe control

When X affects the effect of T on Y (interaction), including the interaction in your model can highlight how the effect is different for different groups.

Note that it affects the interpretation, but that the estimation of the overall ATE is not affected by controlling for X.
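As a rough illustration, the sketch below simulates a treatment whose effect differs between two groups. The group indicator, the coefficients and the helper function are all hypothetical, and T is unrelated to the group by construction.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 20_000

# Hypothetical setup: the effect of T is 1.0 in group 0 and 3.0 in group 1.
t = (rng.random(n) < 0.5).astype(float)
g = (rng.random(n) < 0.5).astype(float)     # X: group indicator, unrelated to T
y = 1.0 * t + 0.5 * g + 2.0 * t * g + rng.normal(size=n)

def ols(y, *columns):
    """Least-squares coefficients, intercept first."""
    design = np.column_stack([np.ones(len(y)), *columns])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

_, b_t, b_g, b_tg = ols(y, t, g, t * g)     # model with interaction term
effect_group0 = b_t
effect_group1 = b_t + b_tg
overall = ols(y, t)[1]                      # model without X at all

print(f"effect in group 0: {effect_group0:.2f}")
print(f"effect in group 1: {effect_group1:.2f}")
print(f"overall effect without X: {overall:.2f}")
```

The interaction model separates the two group-specific effects, while the simple regression without X still recovers their average.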

Kitchen sink

A typical approach in the social sciences is to collect a number of different theories / hypotheses, add them all as variables to a regression, and see “who wins”. This is the kitchen sink approach (or garbage can approach).

If anything, the above discussion should have made clear that to draw causal inferences, a clear distinction of treatment from covariates is crucial. In other words: focus your research!

(Note that the “garbage can” phrase has also been used to argue against ignoring nonlinearities (Achen, 2005), as opposed to careless specification of the causal effect.)

Another way of putting the issue is that the above is all about trying to study the effect of a cause (treatment), rather than the cause of an effect. The latter is ill-defined and runs into the philosophical issue that every cause has a cause, the “infinite regress of causation”.

(See Gerring (2001, 2012) for an extensive discussion of Y-centered and X-centered research.)

The ideal experiment

To avoid any effect of covariates, the ideal is to randomly select participants for your research from the overall population (enables inference to the population) and to randomly assign the treatment to these participants (enables causal inference).
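Why random assignment enables causal inference can be sketched with simulated potential outcomes. Everything below is hypothetical: an "ability" covariate raises salaries and, in the self-selected scenario, also makes a degree more likely.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 100_000

# Hypothetical potential outcomes that depend on a covariate.
ability = rng.normal(size=n)
y0 = 100 + 10 * ability + rng.normal(size=n)    # salary without a degree
y1 = y0 + 5 + 2 * ability                       # salary with a degree
true_ate = (y1 - y0).mean()

def difference_in_means(treated):
    """Naive estimate using only the outcome observed for each person."""
    observed = np.where(treated, y1, y0)
    return observed[treated].mean() - observed[~treated].mean()

self_selected = ability + rng.normal(size=n) > 0   # abler people opt in
randomized = rng.random(n) < 0.5                   # coin flip

biased = difference_in_means(self_selected)
unbiased = difference_in_means(randomized)

print(f"true ATE:                  {true_ate:.2f}")
print(f"self-selected treatment:   {biased:.2f}")
print(f"randomly assigned:         {unbiased:.2f}")
```

With random assignment the treated and untreated groups have the same distribution of ability, so the simple difference in means lands close to the true average effect.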

How to control?

Experiment

Field experiment

Natural experiment

Blocking

Matching

Multiple regression

etc.

Multiple regression

The regression model can be generalized to multiple regression, which involves regressing Y on several independent variables $X_1$, $X_2$, etc.

Regression allows us to isolate the linear contribution of each unit of $X_k$ on Y, “holding everything else constant”.

This is the most common and most powerful basic technique in social science statistics, and most more advanced techniques are extensions of it.

The additional X variables are typically thought of as control variables.

R²

Defined in terms of sums of squares (SSE explained, SSR residual, SST total):

$$R^2 = \frac{SSE}{SST} = 1 - \frac{SSR}{SST} = 1 - \frac{\sum (y_i - \hat{y}_i)^2}{\sum (y_i - \bar{y})^2}$$

Interpretation: the proportion of the variation in y that is explained linearly by the independent variables.

A much over-used statistic: it may not be what we are interested in at all.
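The definition translates directly into a few lines of code. The sketch below uses five made-up data points and a function name chosen for this illustration; it fits a line with an intercept and evaluates the last form of the formula.

```python
import numpy as np

def r_squared(y, y_hat):
    """1 - SSR/SST for observed values y and fitted values y_hat."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    ss_res = np.sum((y - y_hat) ** 2)       # residual sum of squares
    ss_tot = np.sum((y - y.mean()) ** 2)    # total sum of squares
    return 1.0 - ss_res / ss_tot

# Made-up data for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

design = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
r2 = r_squared(y, design @ coef)

print(f"intercept {coef[0]:.1f}, slope {coef[1]:.1f}, R2 {r2:.2f}")
# intercept 2.2, slope 0.6, R2 0.60
```

For these numbers the residual sum of squares is 2.4 and the total sum of squares is 6, giving 1 - 2.4/6 = 0.6.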

When a model has no intercept, it is possible for $R^2$ to lie outside the interval [0, 1].

$R^2$ rises with the addition of more explanatory variables. For this reason we often report the adjusted $R^2$:

$$1 - (1 - R^2)\frac{n - 1}{n - k}.$$

Adjusted R²

One of the problems with looking at $R^2$ is that the more independent variables, the higher $R^2$, which discourages parsimony. One solution for this is the adjusted $R^2$:

$$\text{adj}R^2 = 1 - \frac{n - 1}{n - k}(1 - R^2)$$
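A minimal sketch of the adjustment as a function follows. The slides do not define k here; the code assumes k counts all estimated coefficients including the intercept, and the function name and example values are illustrative.

```python
def adjusted_r_squared(r2, n, k):
    """1 - (n - 1)/(n - k) * (1 - R^2).

    n is the number of observations; k is assumed to be the number of
    estimated coefficients, intercept included.
    """
    if n <= k:
        raise ValueError("need more observations than coefficients")
    return 1.0 - (n - 1) / (n - k) * (1.0 - r2)

# Same R^2, same sample size, more coefficients: a larger penalty.
small_model = adjusted_r_squared(0.6, n=5, k=2)
large_model = adjusted_r_squared(0.6, n=5, k=4)

print(f"k = 2: {small_model:.3f}")   # 0.467
print(f"k = 4: {large_model:.3f}")   # -0.600
```

The second call shows that, unlike $R^2$ itself in a model with an intercept, the adjusted version can turn negative when many coefficients are estimated from few observations.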

Exercise

We will study the relationship between having a degree and future earnings (education.dta).

1 Regress earnings on degree.

2 Repeat, but control for ability.

3 Repeat, but control also for schooling.

Exercise

Let’s take the silly example of movie description lengths again (films.dta).

1 Regress desclength on year and length. What do you conclude about the relation between the duration of a movie and the number of lines used in the review?

2 Repeat, controlling for castsize. Does this revise your

Table presentation

Year            0.05 *    (0.015)
Duration        0.02      (0.016)
Cast size       0.52 *    (0.125)
Intercept     -91.04 *    (29.49)
Observations  100
Adjusted R²     0.31
F              15.87 *

Regression coefficients explaining the number of lines devoted to a movie review in Leonard Maltin’s Movie and Video Guide, 1996. Standard errors in parentheses.

Exercise: US wages

Open the uswages.dta data set.

1 Regress wage on educ, exper and race.

2 What proportion of the variance in wage is explained by

References

Achen, Christopher H. 2005. “Let’s put garbage-can regressions and garbage-can probits where they belong.” Conflict Management and Peace Science 22:327–339.

Angrist, Joshua D. and Jörn-Steffen Pischke. 2009. Mostly harmless econometrics: An empiricist’s companion. Princeton: Princeton University Press.

Benoit, Kenneth. 2009. “PO 7001: Quantitative Methods I.” Lecture slides, Trinity College Dublin.

Gelman, Andrew and Jennifer Hill. 2007. Data analysis using regression and multilevel/hierarchical models. Analytical Methods for Social Research. Cambridge: Cambridge University Press.

Gerring, John. 2001. Social science methodology: A critical framework. Cambridge: Cambridge University Press.

Gerring, John. 2012. Social science methodology: A unified framework. Cambridge: Cambridge University Press.

Lee, Myoung Jae. 2005. Micro-econometrics for policy, program, and treatment effects. Oxford: Oxford University Press.
