Techniques of Statistical Analysis I
Lect_10: Causality
Bruno Arpino
Outline
Association is not causation!!!
“Alternative explanations”
Spurious correlation and controlling for third variables
Simpson’s paradox
Causality
Association means that two variables move together. Causation means that one of the variables is causing the movements in the other.
Remember: the direction of causality in a regression model is set a priori by the researcher.
Evidence of association is necessary but not sufficient evidence of causation.
Three necessary conditions for causation
There is empirical evidence that X has a causal effect on Y if:
1. X and Y are associated (relationship condition)
2. X temporally/logically precedes Y (antecedent condition)
3. The observed association between X and Y cannot be explained by factors not controlled for in the study (lack of alternative explanations condition)
Association does not imply causation
You might find an association but this cannot be interpreted as causal if:
A. You mixed up causes and effects (reverse causality)
B. X and Y are mutually related (bidirectional causality)
C. X and Y are associated only because some third factor(s) is related to both variables (spurious association)
Examples of reverse causality
Check out this Wikipedia entry:
The more firemen fighting a fire (X), the bigger the fire is observed to be (Y). Therefore we conclude firemen cause fire size.
A. You mixed up causes and effects (reverse causality): the bigger the fire, the more firemen will be sent to the scene.
You find a strong correlation between X and Y. But it’s too early to draw causal conclusions!!!
Ask yourself: is X the cause of Y or is it the other way around? (Stata cannot answer!!!)
Examples of reverse causality (cont’d)
It’s a common practice in primary and secondary schools to divide students into working groups according to their abilities.
It’s easy to find data showing an association between working groups (X) and student performance (Y).
E.g., the graph shows results from an ANOVA: students in group 1 had lower average performance than students in the other two groups.
But… is it because students were in group 1 that they had lower performance?
Or… is it that students who needed more help (“““less smart”””) were placed in group 1?
Examples of reverse causality (cont’d)
Does education (X) have a causal effect on fertility (Y)?
Probably yes but… if you are a pregnant teen maybe you drop out of school! (Think hard about sample selection, timing of events and possible anticipation effects)
This paper, for example, addresses the possible reverse effect in the education-fertility relationship:
http://www.econ.uconn.edu/Seminar%20Series/AminBehrmanSchoolingFertility.pdf
Related to fertility: big families are more likely to have a van than small families.
Is it that having a van makes you more fertile?
http://www.youtube.com/watch?v=dk8B_c991kA
http://www.youtube.com/watch?v=xDZSxFLcMVg&feature=player_embedded
Examples of bidirectional relations
The long contested issue: which came first, the chicken or the egg?
The more eggs (X) we have the more chickens (Y) we have … or the other way around???
B. X and Y are mutually related (bidirectional causality)
… N. of children → Quality of marital relation → N. of children …
… N. of children → Economic wellbeing → N. of children …
… Childcare choices → Mothers’ labor force participation → Childcare choices …
… Grandparenting → Cognitive Skills → Grandparenting
Examples of spurious correlation
Another long contested issue: do storks bring babies?
Therefore we conclude that storks bring babies and cause the increase of the fertility rate.
How to reduce fertility rates? Simple! Kill storks!
Problem considered, for example, by Kronmal (1993), JRSSA.
Examples of spurious correlation (cont’d)
Ask yourself: is there an alternative explanation to the observed relationship other than a causal effect? Are there common factors?
Think, for example, that in more rural areas both fertility and the presence of storks are higher. What happens if we split the sample of counties by rurality?
Examples of spurious correlation (cont’d)
Regression lines in:
Urban areas
Rural areas
If you continue to ignore rurality
Note that in the scatterplot each value of the stork presence is associated with many values of birth rate! We will see that multiple regression is a way to control for confounding effects. Another way is splitting the sample at each value of the confounder (as done here).
Confounders
[Diagram: STORKS (X), FERTILITY (Y), RURALITY (C)]
Rurality is a confounder. Confounders can reduce, reverse, or cancel relationships.
The effect of storks represented by the black line was (entirely) confused with the effect of rurality!
In the storks example, X and Y are marginally dependent but conditionally (on C) independent!!!
Examples of spurious correlation (cont’d)
Therefore, sleeping with one's shoes on causes headache.
But… how much did you drink yesterday? Common cause: going to bed drunk
“As ice cream sales increase, the rate of drowning deaths increases sharply”
Therefore, eating ice cream causes drowning.
But… when is it more likely that people buy ice cream and swim? Common cause: common (seasonal) trend (season of the year)
The common trend problem is very common! It happens every time we have time series data on variables that “naturally” move together.
[Figure: ice cream sales and drownings]
(serious) Example of spurious correlation
In their paper “Driving Status and Risk of Entry Into Long-Term Care in Older Adults”, which appeared in 2006 in the American Journal of Public Health, Ellen E. Freeman and colleagues study whether not driving increases the probability of entering long-term care (LTC) institutions.
The authors consider that the association between driving status and being in LTC might be spurious because of the health status of the elderly. They use a multivariate model where health status is controlled for. The authors find that even after controlling for health (and other factors) not driving increases the likelihood of entry into LTC.
Simpson’s paradox
(From Wikipedia) One of the best known real life examples of Simpson's paradox occurred when the University of California, Berkeley was sued for bias against women who had applied for admission to graduate schools there. The admission figures for the fall of 1973 showed that men…
Simpson’s paradox (cont’d)
Bickel et al. (“Sex Bias in Graduate Admissions: Data From Berkeley”; Science 187 (4175): 398–404) concluded that women tended to apply to competitive departments with low rates of admission even among…
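How an aggregate gap can reverse within departments can be shown with a toy table. The counts below are hypothetical and are NOT the Berkeley figures: women apply mostly to the department with the lower admission rate.

```python
# Hypothetical (applied, admitted) counts for two departments.
applications = {
    "A": {"men": (800, 480), "women": (100, 65)},   # easier department
    "B": {"men": (200, 40), "women": (900, 225)},   # more competitive department
}

# Admission rate by sex, ignoring the department.
overall = {}
for sex in ("men", "women"):
    applied = sum(applications[dept][sex][0] for dept in applications)
    admitted = sum(applications[dept][sex][1] for dept in applications)
    overall[sex] = admitted / applied

# Admission rate by sex within each department.
by_dept = {
    dept: {sex: admitted / applied for sex, (applied, admitted) in groups.items()}
    for dept, groups in applications.items()
}

print(overall)
print(by_dept)
```

Overall, men are admitted at 52% and women at 29%, yet within each department the women's rate is higher (65% vs 60% in A, 25% vs 20% in B). The department acts as the third variable.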
Causality and Coincidences
… in global warming over the same period.
Therefore, global warming is caused by a lack of pirates.
(This example is used satirically by the parody religion…
The counterfactual approach to causal inference
Example:
Does eating donuts cause Homer to be overweight?
Should the Government then increase taxes on donuts?
Fact: Homer eats donuts (X = 1); Y1 = 38 (Y = Body Mass Index)
Counterfactual: what would Homer’s BMI have been if he had not been eating donuts? Y0 = ???
The causal effect of eating donuts for Homer is defined as a comparison of the two potential…
The fundamental problem of causal inference
… unit (the fundamental problem of causal inference; Holland, 1986).
The observed BMI for Homer is Y1 = 38, but Y0 is unobserved.
How to solve the fundamental problem and estimate causal effects?
How to solve the fundamental problem
[Figure: Donuts VS No donuts]
Finding a credible counterfactual substitute is the crux of all sound causal inference. Is Flanders a good match for Homer? Maybe not… think about third variables (Homer and Flanders are very different in many respects… NOT ONLY wrt eating donuts or not!!!).
… comparison is not fair for donuts.
How to solve the fundam. problem (cont’d)
[Influence diagram: Healthy Lifestyle (C), Donuts (X), BMI (Y)]
The influence diagram is an easy and intuitive way to assess the direction of the bias caused by not accounting for confounders. In this case the bias would be positive (“-” * “-” = “+”) because a healthy lifestyle reduces both the quantity of donuts consumed and BMI. Therefore the simple comparison Homer’s BMI (= 38) – Flanders’s BMI (= 25) = 13 would overestimate the effect due ONLY to donuts.
I.e., “true donuts effect” < 13
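The sign rule (“-” * “-” = “+”) can be checked on simulated people. The effect sizes are hypothetical assumptions for illustration, not estimates: a healthy lifestyle lowers both the chance of eating donuts and BMI, and donuts add a fixed 3 BMI points.

```python
import random
import statistics

random.seed(3)

TRUE_EFFECT = 3.0  # hypothetical BMI points added by eating donuts

people = []
for _ in range(5000):
    healthy = random.random() < 0.5
    # Healthy lifestyle makes donut eating less likely ("-") ...
    donuts = random.random() < (0.2 if healthy else 0.8)
    # ... and lowers BMI directly ("-").
    bmi = 27 + TRUE_EFFECT * donuts - 5 * healthy + random.gauss(0, 2)
    people.append((healthy, donuts, bmi))

def mean_difference(rows):
    """Average BMI of donut eaters minus average BMI of non-eaters."""
    eaters = [bmi for _, donuts, bmi in rows if donuts]
    others = [bmi for _, donuts, bmi in rows if not donuts]
    return statistics.mean(eaters) - statistics.mean(others)

# Simple comparison, ignoring lifestyle.
naive = mean_difference(people)

# Comparison within groups that share the same lifestyle, then averaged.
within = [mean_difference([p for p in people if p[0] == h]) for h in (True, False)]
adjusted = statistics.mean(within)

print(round(naive, 2), round(adjusted, 2))
```

The simple comparison overestimates the donut effect (positive bias), while comparing people with the same lifestyle brings the estimate back close to the assumed value of 3.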
How to solve the fundam. problem (cont’d)
… Simpson vs Ned Flanders).
We need to collect (or find) data on a sample of people with different levels of X (eating and not eating donuts) who are comparable with respect to third variables (similar overall lifestyle).
Ideally we would compare people eating donuts (X=1) and others not eating donuts (X=0) with exactly the same lifestyle (comparison group). Simply, it’s all about finding a good comparison group!
Randomized experiments vs Observational Studies
Random assignment guarantees that the treatment groups differ only by the values of education and not by other factors (we remove confounding effects by design). That’s why randomized experiments are considered the gold standard for causal inference.
Randomized experiments are very rare in social sciences because of ethical and practical reasons. (For examples see next slide)
Examples of randomized experiments in Social Sciences
Kling, J. (2001), “A Synthesis of MTO Research on Self Sufficiency, Safety and Health, and Behavior and Delinquency,” Poverty Research News, 5, 3– 6.
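Why random assignment removes confounding by design: a minimal simulated sketch. The setup is hypothetical (a generic binary treatment with an assumed effect of 3 and a healthy-lifestyle confounder) and is unrelated to the MTO study cited above.

```python
import random
import statistics

random.seed(4)

TRUE_EFFECT = 3.0  # hypothetical effect of the treatment on the outcome

units = []
for _ in range(5000):
    healthy = random.random() < 0.5
    treated = random.random() < 0.5  # coin flip, independent of lifestyle
    outcome = 27 + TRUE_EFFECT * treated - 5 * healthy + random.gauss(0, 2)
    units.append((healthy, treated, outcome))

treated_group = [u for u in units if u[1]]
control_group = [u for u in units if not u[1]]

# Balance check: the confounder is (almost) equally common in both groups.
share_healthy_treated = sum(u[0] for u in treated_group) / len(treated_group)
share_healthy_control = sum(u[0] for u in control_group) / len(control_group)

# The simple difference in means now estimates the causal effect.
estimate = (statistics.mean(u[2] for u in treated_group)
            - statistics.mean(u[2] for u in control_group))

print(round(share_healthy_treated, 2), round(share_healthy_control, 2))
print(round(estimate, 2))
```

Because the coin flip ignores lifestyle, the two groups are comparable on the confounder and the simple difference in means lands near the assumed effect without any adjustment.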
No causation without manipulation
E.g., taxation, family allowances, quality of public transportation, election law are causal variables. In this case it is the Government (or the Parliament) that decides.
For gender, race and other immutable characteristics it does not make sense to talk about causal effects, unless we refer to the consequences of how they are perceived (Greiner and Rubin, 2011; The Review of Economics and Statistics). One can hypothesize interventions that might change the decider’s perceptions about race or gender but not…
If something is not clear (or you find mistakes in the slides)…