Techniques of Statistical
Analysis I
Lect_12: ANOVA vs Regression +
intro to Multilevel Models
Bruno Arpino
Regression vs ANOVA
An ANOVA model is equivalent to a particular form of regression where the dependent variable is the same as in the ANOVA and the independent variables are dummy variables representing the different groups.
Consider again the data set “nlsw88.dta” used in slide 16 of
Lect_7: Does wage vary significantly by race? (Do different
ethnic groups have significantly different average wage?)
Three race groups: white, black, others
$H_0: \mu_1 = \mu_2 = \mu_3$ (average wage is the same for each group)
$H_1$: Not all $\mu_i$ are the same (average wage is different for some group)
Regression vs ANOVA (cont’d)
The ANOVA model compares the three means (it tests whether at least two of them are significantly different).
We can also specify a regression model for this purpose:
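$Y_i = \beta_0 + \beta_1 \text{White}_i + \beta_2 \text{Others}_i + \varepsilon_i$
where $\text{White}_i$ and $\text{Others}_i$ are dummy variables equal to 1 if individual i belongs to the corresponding group and 0 otherwise.
Note: we omit the black group in this regression. Why? See next slide.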
The “dummies trap”
Remember: one requirement on the independent variables is that none of them can be written as a linear combination of the other X variables (no perfect linear relationship, i.e., no perfect multicollinearity).
What would happen if we also included “Black”? Consider a sample of 5 units:
White  Others  Black
1      0       0
1      0       0
0      1       0
We can see that the variable Black can be written as a linear combination of the others: Black = 1 - White - Others.
In general, a categorical regressor with k categories can be entered in a regression model as a set of k-1 dummy variables. What happens to the excluded one? Let’s see…
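The linear dependence behind the dummies trap can be checked numerically. The following is a minimal sketch in pure Python, assuming a hypothetical 5-unit sample (the race labels are made up for illustration) and a small helper, `matrix_rank`, that is not part of the slides. It compares the rank of the design matrix with and without the Black dummy.

```python
from fractions import Fraction


def matrix_rank(rows):
    """Return the rank of a matrix using Gaussian elimination on exact fractions."""
    m = [[Fraction(x) for x in row] for row in rows]
    n_rows, n_cols = len(m), len(m[0])
    rank = 0
    for col in range(n_cols):
        # Find a row at or below the current rank with a non-zero entry in this column.
        pivot = next((r for r in range(rank, n_rows) if m[r][col] != 0), None)
        if pivot is None:
            continue  # no pivot: this column depends on the pivot columns to its left
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, n_rows):
            factor = m[r][col] / m[rank][col]
            m[r] = [x - factor * y for x, y in zip(m[r], m[rank])]
        rank += 1
    return rank


# Hypothetical sample of 5 units (assumed values, for illustration only).
races = ["white", "white", "others", "black", "black"]
white = [int(r == "white") for r in races]
others = [int(r == "others") for r in races]
black = [int(r == "black") for r in races]

# Black is an exact linear combination of the constant and the other two dummies.
assert all(b == 1 - w - o for w, o, b in zip(white, others, black))

# Design matrices: constant + dummies, with and without the Black column.
with_black = [[1, w, o, b] for w, o, b in zip(white, others, black)]
without_black = [[1, w, o] for w, o in zip(white, others)]

rank_with_black = matrix_rank(with_black)        # 4 columns, but rank is only 3
rank_without_black = matrix_rank(without_black)  # 3 columns, rank 3 (full column rank)

print(rank_with_black, rank_without_black)
```

With all three dummies plus the constant, the design matrix has four columns but rank three, so the OLS coefficients are not uniquely determined. Dropping one dummy restores full column rank.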
F-tests in Regression vs ANOVA
$Y_i = \beta_0 + \beta_1 \text{White}_i + \beta_2 \text{Others}_i + \varepsilon_i$
$\beta_0$ is the expected value of wage when White and Others are 0 (that is, for black people)
$\beta_1$ is the expected difference between the wage of White and Black
$\beta_2$ is the expected difference between the wage of Others and Black
The omitted group, Black, becomes the reference group
The hypotheses of the F-test in this regression are equivalent
to those of the ANOVA:
Regression:
$H_0: \beta_1 = \beta_2 = 0$
$H_1$: at least one $\beta_i$ is different from zero
ANOVA:
$H_0: \mu_1 = \mu_2 = \mu_3$
$H_1$: not all $\mu_i$ are the same
F-tests in Regression vs ANOVA (cont’d)
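The equivalence of the two F-tests can be checked numerically. The sketch below is only an illustration: the wages are made-up numbers (not the nlsw88 data), and the helper `solve_linear_system` is a hypothetical name for a small pure-Python solver of the OLS normal equations. It computes the one-way ANOVA F statistic and the overall F statistic of the regression on the White and Others dummies.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


# Hypothetical hourly wages by group (made-up numbers, not nlsw88 data).
wages = {
    "black": [5.0, 6.0, 7.0],
    "white": [8.0, 9.0, 10.0, 9.0],
    "others": [7.0, 11.0, 9.0],
}
y = [w for group in wages.values() for w in group]
n, k = len(y), len(wages)
grand_mean = sum(y) / n

# One-way ANOVA: between-group and within-group sums of squares.
ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in wages.values())
ss_within = sum((w - sum(g) / len(g)) ** 2 for g in wages.values() for w in g)
f_anova = (ss_between / (k - 1)) / (ss_within / (n - k))

# Regression on a constant plus White and Others dummies (Black is the reference).
x = [[1.0, float(name == "white"), float(name == "others")]
     for name, g in wages.items() for _ in g]
xtx = [[sum(row[i] * row[j] for row in x) for j in range(3)] for i in range(3)]
xty = [sum(row[i] * yi for row, yi in zip(x, y)) for i in range(3)]
b = solve_linear_system(xtx, xty)  # OLS estimates from the normal equations

# Overall F-test of the regression: H0 is beta_1 = beta_2 = 0.
fitted = [sum(bi * xi for bi, xi in zip(b, row)) for row in x]
ss_model = sum((f - grand_mean) ** 2 for f in fitted)
ss_resid = sum((yi - f) ** 2 for yi, f in zip(y, fitted))
f_regression = (ss_model / (k - 1)) / (ss_resid / (n - k))

print(round(f_anova, 4), round(f_regression, 4))
```

The two statistics coincide because the fitted values of the dummy regression are the group means, so the model and residual sums of squares equal the between-group and within-group sums of squares of the ANOVA.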
MLRM output:
Which means are different?
With the MLRM we directly test specific comparisons:
Differences: with the MLRM we do not test all the …
Interpretation of the MLRM output
$\widehat{\text{wage}} = b_0 + b_1 \text{White} + b_2 \text{Others}$
$b_0$ is the expected salary for Black
$b_1$ is the expected difference between the salary of White and Black (i.e., the expected salary of White is $b_0 + b_1$)
$b_2$ is the expected difference between the salary of Others and Black (i.e., the expected salary of Others is $b_0 + b_2$)
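The interpretation of the coefficients as a reference-group mean plus differences in means can be illustrated with a small check. This is a sketch under assumptions: the wages are made-up numbers, not the estimates in the slide's output. It builds the coefficients from the group means and verifies that they satisfy the OLS first-order conditions (residuals orthogonal to each regressor), which identifies them as the OLS estimates when the regressors are not perfectly collinear.

```python
from statistics import mean

# Hypothetical wages by group (made-up numbers, not the nlsw88 estimates).
wages = {
    "black": [5.0, 6.0, 7.0],
    "white": [8.0, 9.0, 10.0, 9.0],
    "others": [7.0, 11.0, 9.0],
}

# Candidate coefficients built from group means, following the interpretation above.
b0 = mean(wages["black"])        # mean wage of the reference group
b1 = mean(wages["white"]) - b0   # White minus Black
b2 = mean(wages["others"]) - b0  # Others minus Black

# One record per person: (wage, White dummy, Others dummy).
records = [(w, int(name == "white"), int(name == "others"))
           for name, group in wages.items() for w in group]
residuals = [w - (b0 + b1 * white + b2 * others) for w, white, others in records]

# OLS first-order conditions: residuals are orthogonal to every regressor.
foc_constant = sum(residuals)
foc_white = sum(e * white for e, (_, white, _) in zip(residuals, records))
foc_others = sum(e * others for e, (_, _, others) in zip(residuals, records))

print(b0, b1, b2)
print(foc_constant, foc_white, foc_others)
```

In this toy sample the expected wage of White is b0 + b1 and that of Others is b0 + b2, which reproduce the respective group means.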
Extending the MLRM
Interpretation:
-2.21 is the expected salary for black people with 0 education (note that there is only one unit in the sample with this characteristic!).
0.61 is the expected difference between the salary of White and Black with the same level of education. Controlling for …
Extending the MLRM (cont’d)
Interpretation:
1.12 is the expected difference between the salary of Others and Black with the same level of education.
0.73 is the effect of one additional completed grade on wage, keeping race constant. I.e., the effect of grade is the same for the three race groups.
What happens if we have many groups?
With 100 groups, for example, we would have to use an MLRM with 99 dummy variables (+ the other regressors).
The output would be difficult to read.
Multilevel Linear Regression Model
$Y_{ij} = \beta_0 + \varepsilon_{ij} + \eta_j$
$\varepsilon_{ij} \sim N(0, \sigma_\varepsilon^2)$
$\eta_j \sim N(0, \sigma_\eta^2)$
In this model the error term is decomposed into two components:
$\varepsilon_{ij}$ represents an individual error (e.g., student)
$\eta_j$ is a group-level error (e.g., school)
The Intra-Class Correlation coefficient (ICC) compares the group-level variance to the total variance and is an index of the homogeneity of units within groups (or of the importance of the “group” effect):
$ICC = \frac{\sigma_\eta^2}{\sigma_\eta^2 + \sigma_\varepsilon^2}$
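The ICC can be illustrated by simulating data from the model above. The sketch below is only an illustration under stated assumptions: the parameter values, the seed and the function names (`icc`, `simulate_null_model`, `anova_variance_components`) are hypothetical, and the variance components are estimated with a simple ANOVA (method-of-moments) estimator for balanced groups rather than the maximum likelihood methods that multilevel software typically uses.

```python
import random


def icc(var_group, var_indiv):
    """Intra-class correlation: group-level variance over total variance."""
    return var_group / (var_group + var_indiv)


def simulate_null_model(n_groups, group_size, beta0, sd_group, sd_indiv, seed=0):
    """Simulate Y_ij = beta0 + eta_j + eps_ij for balanced groups."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_groups):
        eta = rng.gauss(0.0, sd_group)  # group-level error, shared within the group
        data.append([beta0 + eta + rng.gauss(0.0, sd_indiv) for _ in range(group_size)])
    return data


def anova_variance_components(data):
    """Method-of-moments estimates (group variance, individual variance), balanced case."""
    n_groups, size = len(data), len(data[0])
    group_means = [sum(g) / size for g in data]
    grand_mean = sum(group_means) / n_groups
    ms_between = size * sum((m - grand_mean) ** 2 for m in group_means) / (n_groups - 1)
    ms_within = sum((y - m) ** 2 for g, m in zip(data, group_means) for y in g)
    ms_within /= n_groups * (size - 1)
    # E[ms_between] = sigma_eps^2 + size * sigma_eta^2 and E[ms_within] = sigma_eps^2.
    return max((ms_between - ms_within) / size, 0.0), ms_within


# Assumed values: 200 groups of 20 units, sigma_eta = 1, sigma_eps = 2.
data = simulate_null_model(n_groups=200, group_size=20, beta0=10.0,
                           sd_group=1.0, sd_indiv=2.0, seed=12345)
var_group_hat, var_indiv_hat = anova_variance_components(data)

icc_true = icc(1.0 ** 2, 2.0 ** 2)  # 1 / (1 + 4) = 0.2
icc_hat = icc(var_group_hat, var_indiv_hat)
print(round(icc_true, 3), round(icc_hat, 3))
```

An ICC near 0 means that units in the same group are no more alike than units in different groups, while an ICC near 1 means that almost all the variability lies between groups.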
Multilevel Linear Regression Model (cont’d)
$Y_{ij} = \beta_0 + \beta_1 X_{1ij} + \beta_2 X_{2ij} + \dots + \beta_k X_{kij} + \varepsilon_{ij} + \eta_j$
In this model $\sigma_\eta^2$ is interpreted as the residual variance across groups, i.e., the variability that exists across groups after differences among groups in terms of the covariates X have been controlled for.
By comparing the residual group-level variance with the variance estimated by the null model we can assess how much the X …
Multilevel questions: an example
Subramanian et al (2001), Does the state you live in make a difference? Multilevel analysis of self-rated health in the US, Social Science & Medicine, 53, 9–19.
How do age, gender, race and income affect self-rated health?
Having taken account of these individual (compositional characteristics), are there significant variations in self-rated health between US states?
How do state-level characteristics, such as per-capita income, income distribution and social capital, affect self-rated health?
Are there differential effects of these contextual characteristics (state-level characteristics) across different income groups?
Multilevel structures: examples
patients, doctors, hospitals
workers, firms
soccer players, teams
animals, factories
Multilevel structures: not only hierarchical
Nested: each unit at the lowest level is nested in a group at the second level (and this is possibly nested into another at the third level and so on). Other structures are possible:
Multiple memberships: a unit belongs to more than one group (e.g., people can change place of residence over time).
Cross-classifications: groups are not nested. E.g., Vitali and Arpino (2010) consider the effect of country of origin and province of residence in Spain on the probability of living with parents for second-generation young adult immigrants. Country of origin and province of residence are not nested. E.g., immigrants from Colombia can reside in Barcelona or in other Spanish provinces. At the same time, immigrants from many different countries reside in Barcelona.
Multilevel research: examples
These papers focus on different contextual effects:
School/Teacher (Aitkin et al, 1981)
Family (Curtis et al, 1993)
Space/time (Arzheimer, 2009)
Country/Region (Billari et al, 2008)
Neighbourhood (Cerdà et al, 2009)
Work environment (Jolivet et al, 2010)
Social network (De Miguel Luken and Tranmer, 2010)
If something is not clear
(or you find mistakes in the slides)