Panel data modelling Bruno Arpino
RECSM summer school 2017
Introduction to
Panel Data Analysis
Bruno Arpino
Universitat Pompeu Fabra
Outline
• What are panel data?
• Why (collect and) use them?
• How to analyse them?
➢ Problems with OLS
➢ Fixed vs random effects models
➢ Hybrid models
➢ Two-way fixed (and random) effects models
• Intuitions on key methodological aspects
• Implementation in Stata
Longitudinal data
• Repeated observations over time
➢ Panel data (repeated measurements): same units at different points in time (e.g., same sample of individuals interviewed several times)
➢ Repeated cross-sectional data: different samples at different points in time (e.g., European Social Survey (ESS), World Values Surveys (WVS), Eurobarometer)
• Our focus will be on panel data (for repeated cross-sectional data, see “Analyzing comparative longitudinal survey data using multilevel models” by Malcolm Fairbrother)
Panel data as multilevel data
• Panel data can be seen as special cases of multilevel data. E.g.:
Panel data as multilevel data
• Panel data and “real” multilevel data can also be combined, e.g.:
level 3: regions       Region 1
level 2: individuals   Ind 1, Ind 2, Ind 3, Ind 4
level 1: time          t1 t2 t3 | t1 t2 t3 t4 | t1 t2 t3 | t1 t2
• We will mainly focus on two-level structures with repeated observations (t) nested within units (i)
Key characteristic
• Panel data combine features of both cross-section and time-series data:
– As for a cross-section, issues of sample design and sample
selection may affect representativeness of the underlying population
– As for a time-series, the data are naturally ordered and usually
display some degree of regularity or persistence over time
Short vs long panels
• Short panels (“Large n – small T”): many units, few time
periods (n >> T)
– The cross-sectional dimension prevails
• Long panels (“Large T – small n”): many time periods, few
units (n may be even (much) smaller than T)
– The time-series dimension prevails
• Andreß et al. (2013) use the labels “micro” and “macro”, but these may be misleading because “micro” data (short panels) can have geographical entities as units of observation
• We focus on short panels
Units of observation (entities)
• Persons, firms, parties, countries, etc.
• Examples with individuals:
– Panel Study of Income Dynamics (PSID)
– British Household Panel Survey (BHPS)
– German Socio-Economic Panel (GSOEP)
– Health and Retirement Study and its “sisters” (e.g., SHARE)
• Examples with geographical units:
– Eurostat Regio data base
– Generations and Gender Programme - Contextual Database
• Example with parties:
– Manifesto Project Database
Micro vs Macro units
• Macro units (e.g., countries) may show more persistent characteristics over time (time invariant)
• Micro units are more easily thought of as a sample randomly drawn from a population
• Autocorrelation may be more important in “macro-level” panels
• Outliers may also be a more serious issue when units are countries
• We will discuss some of these points
Example:
Panel Study of Income Dynamics (PSID)
• The PSID started in 1968 with a sample of 18,000 individuals in 4,802 families living in the USA, with data collected annually from the head of the household
• It is one of the oldest ongoing panel surveys
• It is an individual panel survey, while the BHPS and SOEP are household panel surveys (all individuals in the household are interviewed)
• The PSID intends to follow members of the original family unit and their offspring
• The PSID is available free of charge to registered users: http://simba.isr.umich.edu/data/data.aspx
• See Longhi and Nandi (2015) for more details
Organizing panel data
• Panel data can be seen as a cube with 3 dimensions:
– units: i = 1, . . . , n
– time points: t = 1, . . . , T
– variables: v = 1, . . . , V
• Wide format: each unit occupies one row of the data
matrix. All measurements over time are included in each row. The matrix has n rows and T x V columns
• Long format: each single measurement occupies one row. Thus, we have T rows per unit and N = n x T rows in total. The number of columns equals the number of
variables V
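The two layouts can be illustrated with a small sketch. This is only an illustration in Python/NumPy (the course itself uses Stata), and the toy values, the column order of the wide matrix and all variable names are assumptions made for the example.

```python
import numpy as np

# Hypothetical toy panel: n = 3 units, T = 2 time points, V = 2 variables (y, x).
n, T, V = 3, 2, 2

# Wide format: one row per unit, columns ordered y_t1, y_t2, x_t1, x_t2.
wide = np.array([
    [10.0, 11.0, 1.0, 2.0],   # unit 1
    [20.0, 21.0, 3.0, 4.0],   # unit 2
    [30.0, 31.0, 5.0, 6.0],   # unit 3
])
assert wide.shape == (n, T * V)

# View the data as a cube (unit, variable, time), then put time before variable.
cube = wide.reshape(n, V, T)
values = cube.transpose(0, 2, 1).reshape(n * T, V)

# Long format: one row per unit-time pair, preceded by the identifiers i and t.
unit_id = np.repeat(np.arange(1, n + 1), T)
time_id = np.tile(np.arange(1, T + 1), n)
long = np.column_stack([unit_id, time_id, values])

print(long)   # n * T rows; columns: i, t, y, x
```

The long matrix has N = n x T rows and, apart from the two identifier columns, V columns.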
Organizing panel data: example
From: Andreß, Hans-Jürgen, Golsch, Katrin, and Alexander W. Schmidt. 2013. Applied Panel Data Analysis for Economic and Social Surveys. Springer.
As the number of variables increases, the long format
becomes more efficient
Organizing panel data
• To implement the regression models we study we need data in long format
• In Stata, the reshape command converts the data from one format to the other
• A subsample extracted from the PSID provided by Cameron and Trivedi, 2010
• 595 individuals, each observed for 7 years from 1976 to 1982 (4,165 person-year observations)
• i: individuals’ identifiers
• t: time periods
• n = 595; T = 7
• it indicates an individual i observed at time t
i = 1,2, …, 595; t = 1, 2, …, 7
A PSID subsample
Variables:
• fem = 1 if female (time invariant)
• exp = number of years of full-time experience (time variant)
• wks = number of weeks worked (time variant)
• ed = number of years of education (time invariant or time variant?)
Time-invariant vs time-variant variables
• Time-invariant vs time-variant factors
➢ Time-invariant variables do not vary across time but only between units (within the observation period!)
Balanced vs unbalanced panel
• With balanced panel data, each unit is observed at each time point. In unbalanced panels, the number of
observations per unit can differ
• Unbalanced panels occur when some respondents:
– do not participate in each wave (temporary unit non-response)
– drop out of the panel (attrition)
– enter the panel at a later point in time (new entries)
[Figure annotation: 2003 is missing for person 1 (attrition)]
Missing data
• Missing Completely at Random (MCAR): non-response is independent of the variable of interest
• Missing at Random (MAR): non-response is independent of the variable of interest once we condition on observed covariates (weighting / regression adjustment)
– Methods studied in this course rely on this assumption!
• Missing Not at Random (MNAR): non-response is non-ignorable as it depends on unobservables
– In individual panels, attrition is typically associated with important transitions in a person’s life: going to college, migration, finding a new job, marriage or divorce, retirement, death. If these events are the main object of study, then attrition cannot be ignored (see Wooldridge (2010), “Econometric Analysis of Cross Section and Panel Data”)
An example of a country panel
dataset: The “Homicides” data
• Neumayer E. (2003) Good Policy Can Lower Violent Crime: Evidence from a Cross-National Panel of Homicide Rates, 1980–97, Journal of Peace Research, 40(6), 619–640
• Neumayer analyses several factors influencing homicide rates (n. homicides per 100,000 persons)
• He merged data from different sources (e.g., gross domestic product (GDP) per capita in purchasing power parity from World Bank; homicides from the International Criminal Police
Organization (Interpol) and the World Health Organization (WHO))
• Replication data and dofile:
https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/0IFFYZ
An example of a country panel
dataset: The “Homicides” data
• We will focus on a slightly modified version of Neumayer’s dataset. We will use the following variables:
➢ gdp: gross domestic product per capita in thousands of 1997 dollars
➢ freedom: index of freedom ranging from 2 (least democratic) to 14 (most democratic)
➢ big: 1 = population bigger than 65 million; 0 = otherwise (this variable was not included in Neumayer’s analyses)
The Homicides data
• Data adapted from Neumayer (2003)
• n = 207 countries, T = 6 periods
(potentially 1,242 country-year obs.)
it indicates a country i observed at time t;
i = 1, 2, …, 207; t = 1, 2, …, 6.
The Homicides data
• Missing values: unbalanced data
• homiciderate, gdp, freedom are time variant
• Is “big” time-invariant?
Between vs Within variations
• Between variation: variability of a variable between units (cross-sectional variation)
• Within variation: variability of a variable over time (within the units) (longitudinal variation)
Between vs Within variations:
an example on the Homicides data
• 4 countries extracted from the homicides data: Colombia, Philippines, Spain, United States
• What is higher, the between or the within variability?
[Figure, two panels: freedom (axis from 6 to 14) and homicide rate (axis from 0 to 80) plotted against period (1980 to 1995), one line per country]
Variance decomposition
• Every variable can be written as the sum of two independent components:
$X_{it} = \bar{X}_{i.} + (X_{it} - \bar{X}_{i.})$
• And so:
$\mathrm{var}(X_{it}) = \mathrm{var}(\bar{X}_{i.}) + \mathrm{var}(X_{it} - \bar{X}_{i.})$
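The decomposition can be checked numerically. The sketch below is a Python/NumPy illustration on simulated data (a balanced panel and population variances are assumed; the names are hypothetical).

```python
import numpy as np

rng = np.random.default_rng(0)
n, T = 5, 4

# Simulated balanced panel: rows are units, columns are time points.
x = rng.normal(size=(n, 1)) + rng.normal(size=(n, T))

unit_mean = x.mean(axis=1, keepdims=True)       # unit means (between component)
between = np.broadcast_to(unit_mean, x.shape)   # unit mean repeated for each t
within = x - unit_mean                          # deviations from the unit mean

# Population variances (ddof = 0) computed over all n * T observations.
total_var = x.var()
between_var = between.var()
within_var = within.var()

print(total_var, between_var + within_var)      # the two numbers coincide
```

The equality is exact here because the within deviations sum to zero inside each unit, so the cross-product between the two components vanishes.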
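Between vs Within variations:
an example on the Homicides data
[Figure not recovered: the homicides data annotated with the overall variance $\mathrm{var}(X_{it})$, the between variance $\mathrm{var}(\bar{X}_{i.})$ and the within variance $\mathrm{var}(X_{it} - \bar{X}_{i.})$]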
Why panel data?
• Analysis of changes (pre-post a given event, trends)
• Better approximation of causality (e.g., adjustment for time invariant factors (including time invariant
measurement error), measuring covariates before outcomes)
• Dynamic analysis (e.g., how past values of outcome and/or covariates influence current outcome)
Example: Margalit (2013)
• Margalit, Y. (2013) Explaining Social Policy Preferences: Evidence from the Great Recession, American Political Science Review, Vol. 107, No. 1, 80-103.
• One of Margalit’s research questions is:
➢ How do individuals’ preferences on welfare policy
shift in response to changes in their personal economic circumstances?
Example: Margalit (2013)
• To analyse changes over time
Example: Margalit (2013)
• Better approximation of causality
Margalit (2013):
“(past) evidence is based almost exclusively on analysis of cross-sectional survey data in which scholars find correlations between measures of survey respondents’ economic standing and their views on social policy.
But with this type of evidence a causal link between the two measures remains unclear [...] it is also plausible that an unobservable characteristic—such as people’s upbringing, or the influence of their parents—explains their
preferences on welfare provision and their standing in the labor market.”
Motivation
• Consider the model:
$\text{homicides}_{it} = \beta_0 + \beta_1 \text{freedom}_{it} + \beta_2 \text{big}_i + \varepsilon_{it}$
• With panel data the error term ε includes both time-invariant and time-variant unobserved factors.
• The Ordinary Least Squares (OLS) estimator assumes that the independent variables are not correlated with the error term. Otherwise, we get biased estimates of β1 and β2.
Two key assumptions of OLS
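• Given the linear model:
$Y_{it} = \beta_0 + \beta_1 X_{1it} + \beta_2 X_{2it} + \dots + \gamma_1 Z_{1i} + \gamma_2 Z_{2i} + \dots + \varepsilon_{it}$
• OLS assumes:
– Strict exogeneity: $E(\varepsilon_{it} \mid X_{i1}, X_{i2}, \dots, X_{iT}, Z_i) = 0, \quad t = 1, \dots, T$
– Independent observations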
Time variant and time invariant confounders
(the homicides example)
[Diagram: Freedom → Homicides, with Culture (time invariant) and GDP (time variant) as confounders of this relationship]
Suppose we are interested in the “effect” of freedom on homicides.
Exogeneity is violated if we do not adjust for culture and GDP.
Why panel data?
• Better approximation of causality
Neumayer (2003):
“Few studies employ a fixed-effects estimator on panel data in order to control for unobserved country heterogeneity bias in the
estimated coefficients. Such bias can arise if country characteristics are either impossible to quantify or unobservable and if the
explanatory variables are correlated with these characteristics.”
The fixed effects model
• Let us decompose the error term into its time-invariant and time-variant components:
$\text{homicides}_{it} = \beta_0 + \beta_1 \text{freedom}_{it} + \beta_2 \text{big}_i + u_i + e_{it}$
• Where ui represents unobserved time-invariant variables (considered as fixed features of the units) and eit is the time-variant (idiosyncratic) error term.
• If u is correlated with the independent variables, we get biased estimates of β1 and β2.
How can we control for u?
• Consider two time periods:
• What happens if we subtract (1) from (2)?
Note that Δ indicates the difference operator.
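The equations of this slide were not recovered. A possible reconstruction, written as a sketch under the assumption that the model of the previous slide holds in both periods with the same intercept $\beta_0$ and the same coefficients, is:

```latex
\begin{align}
\text{homicides}_{i1} &= \beta_0 + \beta_1 \text{freedom}_{i1} + \beta_2 \text{big}_i + u_i + e_{i1} \tag{1}\\
\text{homicides}_{i2} &= \beta_0 + \beta_1 \text{freedom}_{i2} + \beta_2 \text{big}_i + u_i + e_{i2} \tag{2}\\
\Delta \text{homicides}_i &= \beta_1 \Delta \text{freedom}_i + \Delta e_i \tag{2} - \text{(1)}
\end{align}
```

Under this assumption $\beta_0$, $\beta_2 \text{big}_i$ and $u_i$ appear identically in (1) and (2), so they cancel in the difference.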
How can we control for u?
• Taking the (first) difference of both the outcome and the independent variables and estimating the model by OLS on the transformed data eliminates ALL TIME-INVARIANT FACTORS.
• Note that “big” also disappears even though it is observed!
• A constant in the differenced model captures a time trend
First difference (FD) estimator
• The model in which all variables are differences between two time points is known as the first difference estimator:
$\Delta \text{homicides}_i = \beta_1 \Delta \text{freedom}_i + \Delta e_i$
• In this model β1 can be interpreted as the effect of a one-unit increase in freedom on the change in the homicide rate (only within variation is used)
• There is no constant; including one would amount to assuming a time trend (a constant in the differenced model corresponds to a linear trend in levels)
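A small simulation can illustrate why differencing helps. This is a hedged Python/NumPy sketch, not the course's Stata code: the data are simulated, the true β1 is set to 0.5 by assumption, and the unobserved u is built to be correlated with the covariate.

```python
import numpy as np

rng = np.random.default_rng(1)
n, beta1 = 500, 0.5

u = rng.normal(size=n)                        # unobserved time-invariant factor
x = u[:, None] + rng.normal(size=(n, 2))      # covariate correlated with u, T = 2
e = 0.5 * rng.normal(size=(n, 2))             # idiosyncratic error
y = 2.0 + beta1 * x + u[:, None] + e

def ols_slope(x, y):
    """Slope of a simple regression of y on x with a constant."""
    x_c, y_c = x - x.mean(), y - y.mean()
    return float((x_c * y_c).sum() / (x_c ** 2).sum())

# Pooled OLS ignores u and is biased because u is correlated with x.
beta_pooled = ols_slope(x.ravel(), y.ravel())

# First differences: regress delta y on delta x without a constant.
dx, dy = x[:, 1] - x[:, 0], y[:, 1] - y[:, 0]
beta_fd = float((dx * dy).sum() / (dx ** 2).sum())

print(beta_pooled, beta_fd)
```

In this design the pooled estimate is pushed well above 0.5, while the first-difference estimate stays close to it, because u drops out of the differenced equation.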
First difference (FD) estimator
• Pros: we do not have to worry about unobserved time-invariant independent variables because they are eliminated. The exogeneity assumption can be “relaxed”: we can allow partial endogeneity, i.e. the covariates may be correlated with ui as long as they are uncorrelated with eit:
$\Delta \text{homicides}_i = \beta_1 \Delta \text{freedom}_i + \Delta e_i$
• Cons: observed time-invariant independent variables are also eliminated and cannot be used in the model (FD only uses within-unit variation).
• The FD estimator cannot eliminate time-variant unobserved variables (like GDP). All time-variant confounders should be controlled for
Extending the FD estimator to more
periods
• Consider 3 time periods:
• How can we eliminate u and have a single model?
• Again, taking the first differences!
The within estimator
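• Let’s subtract from each variable its mean over time. Averaging the model over time for each unit gives
$\overline{\text{homicides}}_{i.} = \beta_0 + \beta_1 \overline{\text{freedom}}_{i.} + \beta_2 \text{big}_i + u_i + \bar{e}_{i.}$
and subtracting this from the original model:
$\text{homicides}_{it} - \overline{\text{homicides}}_{i.} = \beta_1 (\text{freedom}_{it} - \overline{\text{freedom}}_{i.}) + (e_{it} - \bar{e}_{i.})$
• This transformation of the data (taking the deviations from the mean; de-meaning) also eliminates ALL TIME-INVARIANT FACTORS
• The within estimator is also commonly called the fixed effects estimator (FE)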
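The de-meaning step can be sketched as follows. This is an illustrative Python/NumPy example on simulated data with one covariate and an assumed true coefficient of 0.5; the function name is hypothetical.

```python
import numpy as np

def within_slope(x, y):
    """Within (FE) slope for one covariate; x and y have shape (n, T)."""
    x_dm = x - x.mean(axis=1, keepdims=True)   # deviations from unit means
    y_dm = y - y.mean(axis=1, keepdims=True)
    return float((x_dm * y_dm).sum() / (x_dm ** 2).sum())

rng = np.random.default_rng(2)
n, T, beta1 = 300, 5, 0.5
u = rng.normal(size=(n, 1))                    # time-invariant, correlated with x
x = u + rng.normal(size=(n, T))
y = 2.0 + beta1 * x + u + 0.5 * rng.normal(size=(n, T))

beta_fe = within_slope(x, y)

# Adding any further time-invariant term to y leaves the estimate unchanged.
beta_fe_shifted = within_slope(x, y + 10.0 * u)

print(beta_fe, beta_fe_shifted)
```

The second call shows the point of the slide: whatever is constant within a unit is removed by the transformation, observed or not.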
FE vs FD estimator
• Pros and cons are similar, and with two time points (T = 2) FE and FD are equivalent
• FE and FD are both consistent under strict exogeneity, but FE is more efficient if errors are serially uncorrelated, while FD is more efficient when the errors follow a random walk or an AR process with strong autocorrelation
• The FD estimator appears to use one period less for each individual. This is misleading because the FE estimator loses the same number of degrees of freedom to estimate the fixed effects
• FD is more affected than FE by unbalanced panels because FD uses consecutive observations to calculate first differences
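The equivalence of FE and FD with T = 2 can be verified numerically. The following Python/NumPy sketch uses simulated data with a single covariate (all values are illustrative assumptions).

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50

# Simulated balanced panel with exactly two periods.
x = rng.normal(size=(n, 2))
y = 1.0 + 0.5 * x + rng.normal(size=(n, 1)) + rng.normal(size=(n, 2))

# Within (FE) estimator: deviations from unit means.
x_dm = x - x.mean(axis=1, keepdims=True)
y_dm = y - y.mean(axis=1, keepdims=True)
beta_fe = float((x_dm * y_dm).sum() / (x_dm ** 2).sum())

# First-difference (FD) estimator without a constant.
dx, dy = x[:, 1] - x[:, 0], y[:, 1] - y[:, 0]
beta_fd = float((dx * dy).sum() / (dx ** 2).sum())

print(beta_fe, beta_fd)   # identical up to rounding error
```

With two periods each deviation from the unit mean is plus or minus half the first difference, which is why the two ratios coincide.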
The least-squares
dummy-variables estimator (LSDV)
• In linear models, an alternative (and equivalent) way to the within estimator of adjusting for the unobservable time-invariant u is to include a dummy variable for each unit:
$\text{homicides}_{it} = \beta_1 \text{freedom}_{it} + \alpha_1 D_{1i} + \alpha_2 D_{2i} + \dots + e_{it}$
Note that one of the n unit dummies (when the model includes a constant) and the observed time-invariant variables (“big”) have to be excluded because of perfect collinearity
(The areg command with the absorb option in Stata can be conveniently used, but in short panels standard errors from xtreg, fe should be used; Cameron and Trivedi, 2010)
• It is computationally more demanding than FE when there are many units (large n), and it does not bring any advantage in most applications
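The equivalence between LSDV and the within estimator can be sketched as below. This is a Python/NumPy illustration on simulated long-format data; it uses one dummy per unit and no separate constant, which is one of several equivalent ways to avoid collinearity and is an assumption of the example.

```python
import numpy as np

rng = np.random.default_rng(4)
n, T = 6, 4
unit = np.repeat(np.arange(n), T)             # unit index of each row (long format)
u = rng.normal(size=n)
x = u[unit] + rng.normal(size=n * T)
y = 0.5 * x + u[unit] + rng.normal(size=n * T)

# LSDV: covariate plus one dummy per unit, no separate constant.
dummies = (unit[:, None] == np.arange(n)).astype(float)
design = np.column_stack([x, dummies])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
beta_lsdv = float(coef[0])

# Within estimator on the same data (balanced panel, so sums / T are means).
x_bar = np.bincount(unit, weights=x) / T
y_bar = np.bincount(unit, weights=y) / T
x_dm, y_dm = x - x_bar[unit], y - y_bar[unit]
beta_within = float((x_dm * y_dm).sum() / (x_dm ** 2).sum())

print(beta_lsdv, beta_within)   # same slope
```

The design matrix grows by one column per unit, which is the computational cost mentioned above.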
OLS vs FE models
OLS vs FE - homicides data
• Given the model:
$\text{homicides}_{it} = \beta_0 + \beta_1 \text{gdp}_{it} + \beta_2 \text{freedom}_{it} + \beta_3 \text{big}_i + u_i + e_{it}$
• A simple OLS model (reg in Stata) assumes that both ui and eit are not correlated with the independent variables
• The FE model eliminates time-invariant factors (xtreg, fe in Stata). In other words, it allows ui to be correlated with the independent variables, but it still assumes that they are not correlated with eit
OLS vs FE - homicides data
Variables   OLS         FE
gdp         -0.475***   -0.045
            (0.082)     (0.095)
freedom     0.361*      0.487***
            (0.144)     (0.095)
big         1.105       0.000
            (1.663)     (.)
_cons       7.677***    3.036*
            (1.117)     (1.196)
• OLS and FE models give quite different results!
• OLS uses both within and between variation (more efficient) but it does not adjust for time-invariant factors (other than “big”), so it is biased when u is correlated with the covariates.
• FE eliminates the influences of all time-invariant factors. It uses only the within variation.
Independence assumption and standard
errors
• OLS assumes that observations are independent.
• But with panel data this assumption is untenable: repeated observations on the same unit at different points in time are likely to be correlated (serial correlation).
• Serial correlation can be due to:
– time invariant independent variables
– serial correlation in time variant independent variables
– true state dependence in the outcome (causal effect of lagged values of outcome)
Independence assumption and standard
errors
• FE eliminates serial correlation due to time invariant
variables, but serial correlation in the errors may still be present
• For both OLS and FE we need to adjust the standard errors using a robust estimator (cluster option in Stata).
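The idea behind cluster-robust standard errors can be sketched with a sandwich estimator that sums the scores within each unit. This Python/NumPy example is an assumption-laden illustration: the data are simulated, the function name is hypothetical, and no finite-sample correction is applied, so the numbers would differ slightly from those of a statistical package.

```python
import numpy as np

def ols_cluster_se(X, y, cluster):
    """OLS estimates with conventional and cluster-robust standard errors."""
    n_obs, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ (X.T @ y)
    resid = y - X @ beta

    # Conventional standard errors assume independent, homoskedastic errors.
    sigma2 = (resid @ resid) / (n_obs - k)
    se_conventional = np.sqrt(sigma2 * np.diag(XtX_inv))

    # Cluster-robust "meat": outer products of the scores summed by unit.
    meat = np.zeros((k, k))
    for g in np.unique(cluster):
        rows = cluster == g
        score = X[rows].T @ resid[rows]
        meat += np.outer(score, score)
    se_cluster = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))
    return beta, se_conventional, se_cluster

# Simulated panel: covariate and error both contain a unit-specific component,
# so observations on the same unit are correlated.
rng = np.random.default_rng(5)
n, T = 200, 5
unit = np.repeat(np.arange(n), T)
x = 2.0 * rng.normal(size=n)[unit] + rng.normal(size=n * T)
error = 2.0 * rng.normal(size=n)[unit] + rng.normal(size=n * T)
y = 1.0 + 0.5 * x + error

X = np.column_stack([np.ones(n * T), x])
beta, se_conventional, se_cluster = ols_cluster_se(X, y, unit)
print(beta, se_conventional, se_cluster)
```

The point estimates are the same in both cases; only the estimated precision changes.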
OLS vs FE; standard vs robust s.e.
Variables   OLS         OLS_rob     FE          FE_rob
gdp         -0.475***   -0.475***   -0.045      -0.045
            (0.082)     (0.106)     (0.095)     (0.066)
freedom     0.361*      0.361       0.487***    0.487***
            (0.144)     (0.187)     (0.095)     (0.129)
big         1.105       1.105
            (1.663)     (2.511)
_cons       7.677***    7.677***    3.036*      3.036**
            (1.117)     (1.891)     (1.196)     (0.956)
Standard errors in parentheses
• Point estimates (coefficients) are not affected by robust standard errors.
• Robust standard errors are usually larger.
• The OLS coefficient of freedom is no longer statistically significant when robust standard errors are used.
The between estimator
• It only uses the between variation (cross-sectional)
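• It is obtained by taking the average of each variable over time. From the model
$\text{homicides}_{it} = \beta_0 + \beta_1 \text{freedom}_{it} + \beta_2 \text{big}_i + u_i + e_{it}$
we get:
$\overline{\text{homicides}}_{i.} = \beta_0 + \beta_1 \overline{\text{freedom}}_{i.} + \beta_2 \text{big}_i + u_i + \bar{e}_{i.}$
• Effects of factors that do not vary across units (e.g., time dummies) cannot be estimated
• It is seldom used because:
– Unlike FE, it requires ui to be uncorrelated with the covariates
– It is never the most efficient estimator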
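A minimal sketch of the between estimator follows, in Python/NumPy on simulated data (one covariate, assumed true coefficient 0.5, and u deliberately correlated with the covariate to show the resulting bias).

```python
import numpy as np

rng = np.random.default_rng(6)
n, T, beta1 = 500, 4, 0.5

u = rng.normal(size=(n, 1))                    # time-invariant, correlated with x
x = u + rng.normal(size=(n, T))
y = 2.0 + beta1 * x + u + 0.5 * rng.normal(size=(n, T))

# Between estimator: OLS on the unit means (one observation per unit).
x_bar, y_bar = x.mean(axis=1), y.mean(axis=1)
x_c, y_c = x_bar - x_bar.mean(), y_bar - y_bar.mean()
beta_between = float((x_c * y_c).sum() / (x_c ** 2).sum())

print(beta_between)   # far from 0.5 in this design
```

Since u survives the averaging, the between estimate absorbs its correlation with the covariate, which is the first reason listed above for rarely using it.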
Motivation of the RE model
• Given the model:
$\text{homicides}_{it} = \beta_0 + \beta_1 \text{gdp}_{it} + \beta_2 \text{freedom}_{it} + \beta_3 \text{big}_i + u_i + e_{it}$
• We have seen that the FE solution to the presence of time-invariant factors is to eliminate all time-invariant variation
• But this also eliminates time-invariant observed factors such as “big”
• The RE model allows us to estimate the effect of both time-variant and time-invariant observed independent variables
The RE model
• Given the model:
$\text{homicides}_{it} = \beta_0 + \beta_1 \text{gdp}_{it} + \beta_2 \text{freedom}_{it} + \beta_3 \text{big}_i + u_i + e_{it}$
• In the RE model, time-invariant factors are not treated as nuisance factors but they are modelled:
➢ Including in the model relevant observed time-invariant variables (e.g., “big”)
➢ Modelling residual time-invariant variability, u, with a parametric distribution: $u \sim N(0, \sigma_u^2)$
The RE model
• Pros:
➢ Like OLS and unlike FE, it allows estimating the effect of time-invariant observed variables
➢ It allows using both within and between variability (more efficient)
• Cons:
➢ Like OLS and unlike FE, it assumes that u is not correlated with the independent variables
Random Effects estimators in xtreg
• , re: Feasible Generalized Least Squares (FGLS). A weighted average of the between and within estimators
• , mle: (Full) Maximum Likelihood. It gives biased estimates of variance components, especially in small samples, but it is more efficient
• , pa: Population Averaged Model (Generalized Estimating Equations, GEE). It offers options to get more efficient estimates by exploiting some structure (e.g., ar) in the error components
• The three are asymptotically equivalent; mle and pa give the same point estimates in balanced panels
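The weighting behind the GLS random effects estimator can be sketched through quasi-demeaning. The Python/NumPy example below treats the variance components as known (a simplifying assumption; feasible GLS estimates them from the data), uses a balanced simulated panel with one covariate, and relies on hypothetical function names.

```python
import numpy as np

def theta_re(sigma2_u, sigma2_e, T):
    """Quasi-demeaning weight for a balanced panel with T periods."""
    return 1.0 - np.sqrt(sigma2_e / (sigma2_e + T * sigma2_u))

def quasi_demeaned_slope(x, y, theta):
    """Slope of OLS on data from which theta times the unit mean is removed."""
    x_t = x - theta * x.mean(axis=1, keepdims=True)
    y_t = y - theta * y.mean(axis=1, keepdims=True)
    x_c, y_c = x_t - x_t.mean(), y_t - y_t.mean()   # centring absorbs the constant
    return float((x_c * y_c).sum() / (x_c ** 2).sum())

rng = np.random.default_rng(7)
n, T = 100, 5
u = rng.normal(size=(n, 1))                          # independent of x (RE assumption)
x = rng.normal(size=(n, 1)) + rng.normal(size=(n, T))
y = 1.0 + 0.5 * x + u + rng.normal(size=(n, T))

theta = theta_re(sigma2_u=1.0, sigma2_e=1.0, T=T)
beta_re = quasi_demeaned_slope(x, y, theta)
beta_pooled = quasi_demeaned_slope(x, y, 0.0)        # theta = 0: pooled OLS
beta_within = quasi_demeaned_slope(x, y, 1.0)        # theta = 1: within (FE)

print(theta, beta_pooled, beta_re, beta_within)
```

The weight moves towards 1 (the within estimator) as T or the variance of u grows, and towards 0 (pooled OLS) when u has little variance.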
OLS, FE and RE - homicides data
Variables OLS_rob FE_rob RE_rob
gdp -0.475*** -0.045 -0.134*
(0.106) (0.066) (0.063)
freedom 0.361 0.487*** 0.464***
(0.187) (0.129) (0.121)
big 1.105 -1.266
(2.511) (2.279)
_cons 7.677*** 3.036** 4.764**
(1.891) (0.956) (1.452)
Standard errors in parentheses
• (Not unexpectedly) OLS, FE and RE all give different results
• RE (FGLS) estimates are a weighted average of between and within estimates
• In this example, FE and RE estimates are similar.
Fixed or random effects?
• Nature of data: if units are randomly sampled from a population then ui are random (a ‘draw’ from the population distribution)
– When units are countries it seems more natural to think about ui as “fixed” effects
• If ui are not correlated with the independent variables, both FE and RE are consistent estimators but RE is more efficient
• If ui are correlated with the independent variables, only FE is consistent because it eliminates u
The Hausman test
• The Hausman test compares estimates obtained with FE and RE models:
➢ H0: Differences in coefficients from the two models are not systematic (RE is preferable because more efficient)
➢ H1: Differences are systematic (due to time-invariant unobserved factors) (FE is preferable because it is consistent)
The Hausman test
• It can be implemented in Stata if FE is estimated without cluster-robust standard errors and if RE is estimated with the re option and without cluster-robust standard errors
• It is restrictive because it assumes that RE is efficient under the null, which does not hold, for example, when errors are not independent
• As with all statistical tests:
– if we reject the null, we should consider whether the differences are also substantively important
– if we cannot reject the null, we should consider whether the parameters are estimated with sufficient precision
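The statistic compared with a chi-squared distribution can be sketched as a quadratic form in the coefficient differences. The Python/NumPy function below is an illustration with a hypothetical name, and the input numbers are made up for the example (they are not estimates from the homicides data). Only coefficients estimated by both models, i.e. those of time-variant covariates, enter the comparison.

```python
import numpy as np

def hausman_statistic(b_fe, b_re, V_fe, V_re):
    """Hausman statistic (b_fe - b_re)' [V_fe - V_re]^(-1) (b_fe - b_re) and its df."""
    diff = np.atleast_1d(np.asarray(b_fe, dtype=float) - np.asarray(b_re, dtype=float))
    V_diff = np.atleast_2d(np.asarray(V_fe, dtype=float) - np.asarray(V_re, dtype=float))
    stat = float(diff @ np.linalg.solve(V_diff, diff))
    return stat, diff.size

# Made-up inputs: two time-variant covariates, conventional covariance matrices.
b_fe = [0.50, -0.10]
b_re = [0.40, -0.05]
V_fe = np.diag([0.020, 0.010])
V_re = np.diag([0.010, 0.005])

stat, df = hausman_statistic(b_fe, b_re, V_fe, V_re)
print(stat, df)   # compare stat with a chi-squared distribution with df degrees of freedom
```

With these inputs the statistic is well below 5.99, the 5% critical value of a chi-squared distribution with 2 degrees of freedom, so the null would not be rejected. The difference of the covariance matrices is only guaranteed to be positive definite when RE is efficient, which is the restriction noted above.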
Fixed or random effects?
• Although the Hausman test (and other tests) can help, the choice between FE and RE should be based mainly on theoretical / substantive arguments
• E.g., in Arpino, Esping-Andersen and Pessin (2015) we test the Esping-Andersen and Billari (2015) theory that, as countries move from a gender-inegalitarian equilibrium toward an egalitarian one, fertility first declines and then, when gender-egalitarian values are sufficiently widespread, it increases
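Better together?
• Remember that each variable can be written as the sum of its within (time-variant) and between (time-invariant) components (see the “Variance decomposition” slide)
• Therefore we can write the RE model as:
$\text{homicides}_{it} = \beta_0 + \beta_1 (\text{gdp}_{it} - \overline{\text{gdp}}_{i.}) + \beta_2 \overline{\text{gdp}}_{i.} + \beta_3 (\text{freedom}_{it} - \overline{\text{freedom}}_{i.}) + \beta_4 \overline{\text{freedom}}_{i.} + \beta_5 \text{big}_i + u_i + e_{it}$
where $u_i \sim N(0, \sigma_u^2)$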
• The hybrid model with both within and between components is also known as the Cronbach or Mundlak model
• Pros:
➢ It allows distinguishing the within and between effects of covariates
➢ It allows estimating the effect of both time-variant and time-invariant variables
➢ Like FE, it adjusts for correlation between u and time-variant observed covariates
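The last point can be checked on simulated data: in a regression on the deviations, the unit means and a time-invariant covariate, the coefficient on the deviations equals the FE estimate. The Python/NumPy sketch below uses pooled OLS on the hybrid specification (an assumption made to keep the example short), and all names and values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(8)
n, T = 40, 5
unit = np.repeat(np.arange(n), T)               # unit index of each row (long format)
u = rng.normal(size=n)
z = (rng.random(n) < 0.3).astype(float)         # time-invariant dummy, in the spirit of "big"
x = u[unit] + rng.normal(size=n * T)            # covariate correlated with u
y = 1.0 + 0.5 * x + 2.0 * z[unit] + u[unit] + rng.normal(size=n * T)

# Split the covariate into its between (unit mean) and within (deviation) parts.
x_mean = (np.bincount(unit, weights=x) / T)[unit]
x_dev = x - x_mean

# Hybrid specification: constant, within part, between part, time-invariant z.
design = np.column_stack([np.ones(n * T), x_dev, x_mean, z[unit]])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
beta_within_hybrid, beta_between_hybrid = float(coef[1]), float(coef[2])

# FE (within) estimate for comparison.
y_dev = y - (np.bincount(unit, weights=y) / T)[unit]
beta_fe = float((x_dev * y_dev).sum() / (x_dev ** 2).sum())

print(beta_within_hybrid, beta_fe, beta_between_hybrid)
```

The within coefficient matches FE exactly because the deviations are orthogonal to every time-invariant column of the design, while the coefficient on the unit means is free to differ.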
OLS_hyb, FE, RE_hyb models
Variables   OLS         OLS_hyb     FE_rob      RE          RE_hyb
gdp         -0.475***               -0.045      -0.134
            (0.106)                 (0.066)     (0.083)
freedom     0.361                   0.487***    0.464***
            (0.187)                 (0.129)     (0.091)
big         1.105       1.135       0.000       -1.266      -1.076
            (2.511)     (2.528)     (.)         (4.184)     (4.056)
gdp_dev                 -0.045                              -0.045
                        (0.066)                             (0.095)
gdp_m                   -0.503***                           -0.541*
                        (0.133)                             (0.226)
free_dev                0.487***                            0.487***
                        (0.130)                             (0.095)
free_m                  0.378                               0.742
                        (0.261)                             (0.400)
_cons       7.677***    7.764***    3.036**     4.764**     5.470
            (1.891)     (2.103)     (0.956)     (1.574)     (2.800)
• FE gives the same within estimates as the hybrid models
• The hybrid models are preferable to their “non-hybrid” versions because they adjust for correlation between u and time-variant covariates
• In models with many covariates it is common to
Two-way models
• They include both unit and time fixed (or random) effects:
$Y_{it} = \beta_0 + \beta X_{it} + u_i + v_t + e_{it}$
• We should control for time effects:
➢ when special events happening at specific time points may affect the outcome variable (and may also be correlated with independent variables)
➢ to adjust for time trends in the outcome
Two-way fixed effects models
• To control for time effects (v) these models include a dummy variable for each period (year) except one (used as reference):
$Y_{it} = \beta_0 + \beta X_{it} + \delta_1 \text{year(1)}_t + \delta_2 \text{year(2)}_t + \dots + \delta_{T-1} \text{year(T-1)}_t + u_i + e_{it}$
• year(1) is a dummy variable for the first year and so on.
• δ1 is the coefficient of year(1)
• u represents, as before, a fixed effect for units (e.g., countries)
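A two-way fixed effects estimate can be sketched in two equivalent ways for a balanced panel: removing unit and period means, or including unit and period dummies. The Python/NumPy example below uses simulated data; the choice of the first period as reference and all names are assumptions of the illustration.

```python
import numpy as np

rng = np.random.default_rng(9)
n, T = 30, 6
u = rng.normal(size=(n, 1))                    # unit effects
v = rng.normal(size=(1, T))                    # period effects
x = u + v + rng.normal(size=(n, T))            # covariate correlated with both
y = 0.5 * x + u + v + rng.normal(size=(n, T))

# (a) Two-way demeaning (this shortcut is valid for balanced panels).
def two_way_demean(a):
    return a - a.mean(axis=1, keepdims=True) - a.mean(axis=0, keepdims=True) + a.mean()

x_dd, y_dd = two_way_demean(x), two_way_demean(y)
beta_demeaned = float((x_dd * y_dd).sum() / (x_dd ** 2).sum())

# (b) Dummy-variable version: unit dummies plus T - 1 period dummies.
unit = np.repeat(np.arange(n), T)
period = np.tile(np.arange(T), n)
unit_dummies = (unit[:, None] == np.arange(n)).astype(float)
time_dummies = (period[:, None] == np.arange(1, T)).astype(float)   # first period = reference
design = np.column_stack([x.ravel(), unit_dummies, time_dummies])
coef, *_ = np.linalg.lstsq(design, y.ravel(), rcond=None)
beta_dummies = float(coef[0])

print(beta_demeaned, beta_dummies)   # same slope
```

With unbalanced panels the simple double demeaning in (a) no longer reproduces the dummy-variable estimate, so (b) is the safer template.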
One vs Two-way FE models - homicides data
• FE controls for country-level time-invariant unobserved factors (e.g., cultural traits of the country)
• FE_two also controls for period-specific effects
• Results indicate significantly higher homicide rates in 1995 as compared to 1980
Variables            FE_rob      FE_two
gdp                  -0.045      -0.244
                     (0.066)     (0.136)
freedom              0.487***    0.257
                     (0.129)     (0.152)
Period (ref=1980):               0.000
1983                             -0.322
                                 (0.415)
1986                             0.171
                                 (0.520)
1989                             0.890
                                 (0.870)
1992                             1.920
                                 (0.975)
1995                             2.099*
                                 (0.877)
_cons                3.036**     5.863***
Interactions with time in FE models
• They allow us to test whether the effects of covariates change over time
• Note: Neumayer E. (2003) did not consider such interactions. We should have good theoretical arguments to introduce and interpret them
Variables            FE_two2
gdp                  -0.295
                     (0.168)
freedom              0.120
                     (0.171)
Period (ref=1980):   0.000
1983                 -1.179
                     (0.667)
1986                 -1.660*
                     (0.729)
1989                 -0.374
                     (1.470)
1992                 0.705
                     (1.500)
1995                 0.988
1983#freedom         0.098
                     (0.053)
1986#freedom         0.229**
                     (0.070)
1989#freedom         0.157
                     (0.129)
1992#freedom         0.150
                     (0.135)
1995#freedom         0.142
                     (0.129)
_cons                7.407***
                     (1.997)
Interactions: graphical interpretation
The effect of freedom on homicide rates is always positive. It is strongest in 1986.
[Figure, first panel: “Predictive Margins of period”, linear prediction against freedom (2 to 14), one line per period (1980, 1983, 1986, 1989, 1992, 1995)]
[Figure, second panel: effects on linear prediction (axis from .1 to .35) by period (1980 to 1995)]
Two-way random effects models
• Two-way (or cross-classified) RE models assume that both unit-specific (u) and time-specific (v) effects follow a normal distribution:
$Y_{it} = \beta_0 + \beta X_{it} + u_i + v_t + e_{it}$
$u \sim N(0, \sigma_u^2); \quad v \sim N(0, \sigma_v^2)$
u and v (and e) are not correlated
Bibliography
• Longhi, Simonetta and Alita Nandi. 2015. A Practical Guide to Using Panel Data. Sage.
• Andreß, Hans-Jürgen, Golsch, Katrin, and Alexander W.
Schmidt. 2013. Applied Panel Data Analysis for Economic and Social Surveys. Springer.
– The book’s web site (eswf.uni-koeln.de/panel) provides all necessary data sets and Stata syntax files to replicate the examples
• Both of them offer a relatively easy introduction to panel data analysis for social scientists
Bibliography
• Cameron, A., & Trivedi, P. (2010). Microeconometrics using Stata. College Station: Stata Press.
– General applied econometrics book including more advanced topics (categorical data, dynamic models, instrumental variables...)
– Datasets can be downloaded directly from Stata: http://www.stata-press.com/data/musr.html
• Rabe-Hesketh S., and Skrondal A. (2012) Multilevel and longitudinal modeling using Stata (3rd ed). Stata Press Publication, College Station, TX.
– Longitudinal data viewed as special case of multilevel data. Focussed on random effects models (including growth curve models)
– Datasets can be downloaded directly from Stata: http://www.stata-press.com/data/mlmus3.html