Applications of Models for Longitudinal and Multilevel Data in R and Stan
ICPSR Summer Program
Georges Monette, York University
John Fox, McMaster University
JULY 17: Introduction to the R Statistical Computing Environment
JULY 18-21: Applications of Models for Longitudinal and Multilevel Data
SECTION 1
Introduction to the R Statistical Computing Environment: One-Day Workshop
John Fox, McMaster University
ICPSR Training Program, Summer 2017
The R statistical programming language and computing environment is the de facto standard among statisticians for writing statistical software and has become very popular in other fields, including the social sciences: It is now possibly the most widely used statistical software in the world. R is a free, open-source implementation and extension of the S language, and is available for Windows, Mac OS X, and Unix/Linux systems. The substantial capabilities of the basic R software are augmented by nearly 10,000 contributed R “packages” for various statistical methods, freely available on the Comprehensive R Archive Network (CRAN) <https://cran.r-project.org/>.
This one-day workshop provides a basic introduction to R and RStudio, which is a sophisticated, and free, editor (“integrated development environment” or IDE) customized for R. The goal of the workshop is to prepare participants who are unfamiliar with R for the subsequent four-day workshop on longitudinal and multilevel modeling.
Most, but not all, of the material for the R workshop is drawn from Fox and Weisberg, An R Companion to Applied Regression, Second Edition, and from the third edition of this book, which is in preparation. Topics to be covered in the workshop include:
1. Getting started with R and RStudio
2. Workflow in R and RStudio: Enabling reproducible research
3. Reading and manipulating data in R
4. Basic statistical graphics in R
5. Fitting and working with linear and generalized linear models in R
Lecture Series Web Site
Materials for the workshop will be deposited at <http://socserv.socsci.mcmaster.ca/jfox/Courses/R/York-R-course/> (abbreviated <tinyurl.com/York-R-course>), which also has links to a variety of resources.
Acquiring R and RStudio
R, RStudio, and Stan are all free software, available on the internet. Stan implements state-of-the-art methods for Bayesian inference, and may be accessed through R via the rstan package. Instructions for installing R, RStudio, and Stan are on the workshop web site at <http://socserv.socsci.mcmaster.ca/jfox/Courses/R/York-R-course/R-install-instructions.html>, with a link on the workshop home page.
Selected Bibliography
Publishers of statistical texts have been producing a steady stream of books on R. Of particular note are Springer’s Use R! series <http://www.springer.com/series/6991> and Chapman and Hall/CRC’s The R Series <http://www.crcpress.com/browse/series/crctherser>.
For a more extensive bibliography, see the syllabus for my R lectures at the ICPSR Summer Program in Ann Arbor.
Basic Text
A principal reference for this workshop is J. Fox and S. Weisberg, An R Companion to Applied Regression, Second Edition, Sage (2011), but you should be able to follow the workshop without reading the book. Additional materials are available on the web site for the book <http://socserv.mcmaster.ca/jfox/Books/Companion/index.html>, including several appendices (on multivariate linear models, structural-equation models, mixed models, survival analysis, and more). As mentioned, a third edition of this book is in preparation. The book is associated with the car and effects packages.
Manuals
R is distributed with a set of manuals, which are also available at the CRAN web site <https://cran.r-project.org/manuals.html>.
A great deal of information about using the RStudio integrated development environment is available on the RStudio website at <https://support.rstudio.com/hc/en-us> (see under “Documentation”).
Mixed-Effects Models in R
Also see the package listing on CRAN <https://cran.r-project.org/web/packages/index.html> and the “Bayesian Inference” and “Statistics for the Social Sciences” CRAN task views <https://cran.r-project.org/web/views/index.html>.
A. Gelman, J. B. Carlin, H. S. Stern, D. B. Dunson, A. Vehtari, and D. B. Rubin, Bayesian Data Analysis, Third Edition. Boca Raton: CRC/Chapman & Hall, 2013. More demanding than McElreath’s text, described below, this is a tour de force exposition of Bayesian methods, including their application to mixed-effects models. An appendix to the text explains how to use R and Stan for Bayesian inference. Andrew Gelman and Aki Vehtari are among the developers of Stan.
A. Gelman and J. Hill, Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge: Cambridge University Press, 2007. A wide-ranging and accessible yet deep treatment of hierarchical models and various related topics, predominantly but not exclusively from a Bayesian perspective, using both R and BUGS software.
R. McElreath, Statistical Rethinking: A Bayesian Course with Examples in R and Stan. Boca Raton: CRC/Chapman & Hall, 2016. The title is reasonably descriptive of this very readable introduction to modern Bayesian methods. The use of R and Stan in the book is somewhat idiosyncratic, employing the author’s rethinking package, which is freely available but not from CRAN.
J. C. Pinheiro and D. M. Bates, Mixed-Effects Models in S and S-PLUS. New York: Springer, 2000. An extensive treatment of linear and nonlinear mixed-effects models in S, focused on the authors’ nlme package. Does not cover Bates, Maechler, and Bolker’s newer lme4 package.
W. N. Venables and B. D. Ripley. Modern Applied Statistics with S, Fourth Edition. New York: Springer, 2002. An influential and wide-ranging treatment of data analysis using S and R, including a chapter on mixed-effects models. Many of the facilities described in the book are programmed in the associated (and very useful) MASS, nnet, and spatial packages, which are included in the standard R distribution. This text is more advanced and has a broader focus than the R Companion. I once considered the MASS book the best moderately advanced reference on statistical data analysis in S and R. The book is still very useful, but it is showing its age.
SECTION 2
Applications of Models for Longitudinal and Multilevel Data: Four-Day Workshop
Georges Monette, York University
ICPSR Training Program, Summer 2017
In the past 25 years, there have been enormous advances in statistical methods for the analysis of complex data. As tools become more powerful, researchers tackle data structures and ask research questions that would have been impossible to broach not long ago.
For example, with multilevel and longitudinal data it is possible to perform analyses that come as close to valid causal inference as one can without a randomized experiment (Arjas and Parner 2004; Morgan and Winship 2014).
Most of the leading methods for longitudinal data analysis have limitations: they may be appropriate only for normally distributed response variables, or they may allow categorical response variables but provide limited possibilities for random effects and dependencies over time. The most general methods use Markov chain Monte Carlo (MCMC), but until recently the implementation of these methods was forbidding for most researchers.
A recent approach to MCMC, known as Hamiltonian Monte Carlo, has been used to create a new modeling environment, the Stan modelling language (Carpenter et al. 2017), that is computationally efficient and whose use is accessible to researchers who are not statistical specialists.
This course does not assume any prior experience with hierarchical or longitudinal models, nor with MCMC or other Bayesian methods.
The first two days are devoted to an in-depth study of the classical methods used for normally distributed responses. After a review of statistical concepts using graphs and geometry as tools for statistical reasoning (Friendly, Monette, and Fox 2013), we follow the presentation of models for longitudinal data in well-established textbooks (Raudenbush and Bryk 2002; Singer and Willett 2003; Snijders and Bosker 2012). We will learn to implement these methods with the ‘nlme’ package in R (Pinheiro et al. 2017).
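As a rough illustration of the kind of model fitted with ‘nlme’, the following minimal sketch fits a linear growth model with a random intercept and a random slope for each subject. The choice of the Orthodont data set (distributed with nlme) and of this particular model is an assumption made for illustration; it is not taken from the course materials.

```r
library(nlme)  # recommended package, included with standard R distributions

# Orthodont (shipped with nlme): a dental distance measured on 27 children
# at ages 8, 10, 12, and 14. Chosen here only as a convenient example.
data(Orthodont)

# Linear growth in age, with intercept and slope varying randomly by child
fit <- lme(distance ~ age, random = ~ age | Subject, data = Orthodont)

fixef(fit)    # estimated population-level intercept and age slope
VarCorr(fit)  # between-child and residual variance components
```

Calling summary(fit) gives the usual table of fixed effects with standard errors, and intervals(fit) gives approximate confidence intervals for the fixed effects and variance components.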
We will also consider interesting questions related to the causal interpretation of longitudinal models (Morgan and Winship 2014; Raudenbush 2001).
In the last two days we will learn to use the Stan modelling language (Carpenter et al. 2017) to extend the methods learned in the first two days. We begin with an overview of Bayesian approaches to statistical inference and then apply Stan to a broad range of situations:
1. binary, multinomial, and count responses, e.g., overdispersed Poisson,
2. multivariate responses, including hurdle models, zero-inflated models, and models with mixed continuous and categorical responses,
3. models with measurement error in predictors,
4. models with responses at one point in time that are treated as covariates for the response at the next point in time.
Throughout the course, conceptual and theoretical discussions will be interwoven with practical examples and exercises in which you use the methods on your own computer, thus ensuring that you will be well equipped with the tools you need to use the methods explored in the course.
References:
Arjas, Elja, and Jan Parner. 2004. “Causal Reasoning from Longitudinal Data.” Scandinavian Journal of Statistics 31 (2): 171–87. doi:10.1111/j.1467-9469.2004.02-134.x.
Carpenter, Bob, Andrew Gelman, Matthew D. Hoffman, Daniel Lee, Ben Goodrich, Michael Betancourt, Marcus Brubaker, Jiqiang Guo, Peter Li, and Allen Riddell. 2017. “Stan: A Probabilistic Programming Language.” Journal of Statistical Software 76 (1): 1–32. doi:10.18637/jss.v076.i01.
Friendly, Michael, Georges Monette, and John Fox. 2013. “Elliptical Insights: Understanding Statistical Methods through Elliptical Geometry.” Statistical Science 28 (1): 1–39. doi:10.1214/12-STS402.
Morgan, Stephen L., and Christopher Winship. 2014. Counterfactuals and Causal Inference. Cambridge University Press.
Pinheiro, Jose, Douglas Bates, Saikat DebRoy, Deepayan Sarkar, and R Core Team. 2017. nlme: Linear and Nonlinear Mixed Effects Models. https://CRAN.R-project.org/package=nlme.
Raudenbush, Stephen W., and Anthony S. Bryk. 2002. Hierarchical Linear Models: Applications and Data Analysis Methods. Sage.
Raudenbush, Stephen W. 2001. “Comparing Personal Trajectories and Drawing Causal Inferences from Longitudinal Data.” Annual Review of Psychology 52 (1): 501–25. doi:10.1146/annurev.psych.52.1.501.
Singer, Judith D., and John B. Willett. 2003. Applied Longitudinal Data Analysis: Modeling Change and Event Occurrence. Oxford University Press.
Snijders, Tom A. B., and Roel J. Bosker. 2012. Multilevel Analysis. Springer.