Data Analysis Using Regression and Multilevel/Hierarchical Models
Data Analysis Using Regression and Multilevel/Hierarchical Models is a comprehensive manual for the applied researcher who wants to perform data analysis using linear and nonlinear regression and multilevel models. The book introduces and demonstrates a wide variety of models, at the same time instructing the reader in how to fit these models using freely available software packages. The book illustrates the concepts by working through scores of real data examples that have arisen in the authors’ own applied research, with programming code provided for each one. Topics covered include causal inference, including regression, poststratification, matching, regression discontinuity, and instrumental variables, as well as multilevel logistic regression and missing-data imputation. Practical tips regarding building, fitting, and understanding are provided throughout.
Andrew Gelman is Professor of Statistics and Professor of Political Science at Columbia University. He has published more than 150 articles in statistical theory, methods, and computation and in applications areas including decision analysis, survey sampling, political science, public health, and policy. His other books are Bayesian Data Analysis (1995, second edition 2003) and Teaching Statistics: A Bag of Tricks (2002).
Jennifer Hill is Assistant Professor of Public Affairs in the Department of International and Public Affairs at Columbia University. She has coauthored articles that have appeared in the Journal of the American Statistical Association, American Political Science Review, American Journal of Public Health, Developmental Psychology, the Economic Journal, and the Journal of Policy Analysis and Management, among others.
Analytical Methods for Social Research
Analytical Methods for Social Research presents texts on empirical and formal methods for the social sciences. Volumes in the series address both the theoretical underpinnings of analytical techniques and their application in social research. Some series volumes are broad in scope, cutting across a number of disciplines. Others focus mainly on methodological applications within specific fields such as political science, sociology, demography, and public health. The series serves a mix of students and researchers in the social sciences and statistics.
Series Editors:
R. Michael Alvarez, California Institute of Technology
Nathaniel L. Beck, New York University
Lawrence L. Wu, New York University
Other Titles in the Series:
Event History Modeling: A Guide for Social Scientists, by Janet M. Box-Steffensmeier and Bradford S. Jones
Ecological Inference: New Methodological Strategies, edited by Gary King, Ori Rosen, and Martin A. Tanner
Spatial Models of Parliamentary Voting, by Keith T. Poole
Essential Mathematics for Political and Social Research, by Jeff Gill
Data Analysis Using Regression and
Multilevel/Hierarchical Models
ANDREW GELMAN
Columbia University
JENNIFER HILL
Columbia University
CAMBRIDGE UNIVERSITY PRESS
Cambridge, New York, Melbourne, Madrid, Cape Town, Singapore, São Paulo
Cambridge University Press
The Edinburgh Building, Cambridge CB2 8RU, UK
Published in the United States of America by Cambridge University Press, New York
www.cambridge.org
Information on this title: www.cambridge.org/9780521867061
© Andrew Gelman and Jennifer Hill 2007
This publication is in copyright. Subject to statutory exception and to the provision of relevant collective licensing agreements, no reproduction of any part may take place without the written permission of Cambridge University Press.
First published in print format 2006
ISBN-13 978-0-511-26878-6 eBook (EBL)
ISBN-10 0-511-26878-5 eBook (EBL)
ISBN-13 978-0-521-86706-1 hardback
ISBN-10 0-521-86706-1 hardback
ISBN-13 978-0-521-68689-1 paperback
ISBN-10 0-521-68689-X paperback
Cambridge University Press has no responsibility for the persistence or accuracy of urls for external or third-party internet websites referred to in this publication, and does not guarantee that any content on such websites is, or will remain, accurate or appropriate.
Data Analysis Using Regression and
Multilevel/Hierarchical Models
(Corrected final version: 9 Aug 2006)
Please do not reproduce in any form
without permission
Andrew Gelman
Department of Statistics and Department of Political Science
Columbia University, New York
Jennifer Hill
School of International and Public Affairs
Columbia University, New York
© 2002, 2003, 2004, 2005, 2006 by Andrew Gelman and Jennifer Hill
To be published in October, 2006 by Cambridge University Press
Contents
List of examples
Preface
1 Why?
1.1 What is multilevel regression modeling?
1.2 Some examples from our own research
1.3 Motivations for multilevel modeling
1.4 Distinctive features of this book
1.5 Computing
2 Concepts and methods from basic probability and statistics
2.1 Probability distributions
2.2 Statistical inference
2.3 Classical confidence intervals
2.4 Classical hypothesis testing
2.5 Problems with statistical significance
2.6 55,000 residents desperately need your help!
2.7 Bibliographic note
2.8 Exercises
Part 1A: Single-level regression
3 Linear regression: the basics
3.1 One predictor
3.2 Multiple predictors
3.3 Interactions
3.4 Statistical inference
3.5 Graphical displays of data and fitted model
3.6 Assumptions and diagnostics
3.7 Prediction and validation
3.8 Bibliographic note
3.9 Exercises
4 Linear regression: before and after fitting the model
4.1 Linear transformations
4.2 Centering and standardizing, especially for models with interactions
4.3 Correlation and “regression to the mean”
4.4 Logarithmic transformations
4.5 Other transformations
4.6 Building regression models for prediction
4.7 Fitting a series of regressions
4.8 Bibliographic note
4.9 Exercises
5 Logistic regression
5.1 Logistic regression with a single predictor
5.2 Interpreting the logistic regression coefficients
5.3 Latent-data formulation
5.4 Building a logistic regression model: wells in Bangladesh
5.5 Logistic regression with interactions
5.6 Evaluating, checking, and comparing fitted logistic regressions
5.7 Average predictive comparisons on the probability scale
5.8 Identifiability and separation
5.9 Bibliographic note
5.10 Exercises
6 Generalized linear models
6.1 Introduction
6.2 Poisson regression, exposure, and overdispersion
6.3 Logistic-binomial model
6.4 Probit regression: normally distributed latent data
6.5 Multinomial regression
6.6 Robust regression using the t model
6.7 Building more complex generalized linear models
6.8 Constructive choice models
6.9 Bibliographic note
6.10 Exercises
Part 1B: Working with regression inferences
7 Simulation of probability models and statistical inferences
7.1 Simulation of probability models
7.2 Summarizing linear regressions using simulation: an informal Bayesian approach
7.3 Simulation for nonlinear predictions: congressional elections
7.4 Predictive simulation for generalized linear models
7.5 Bibliographic note
7.6 Exercises
8 Simulation for checking statistical procedures and model fits
8.1 Fake-data simulation
8.2 Example: using fake-data simulation to understand residual plots
8.3 Simulating from the fitted model and comparing to actual data
8.4 Using predictive simulation to check the fit of a time-series model
8.5 Bibliographic note
8.6 Exercises
9 Causal inference using regression on the treatment variable
9.1 Causal inference and predictive comparisons
9.2 The fundamental problem of causal inference
9.3 Randomized experiments
9.5 Observational studies
9.6 Understanding causal inference in observational studies
9.7 Do not control for post-treatment variables
9.8 Intermediate outcomes and causal paths
9.9 Bibliographic note
9.10 Exercises
10 Causal inference using more advanced models
10.1 Imbalance and lack of complete overlap
10.2 Subclassification: effects and estimates for different subpopulations
10.3 Matching: subsetting the data to get overlapping and balanced treatment and control groups
10.4 Lack of overlap when the assignment mechanism is known: regression discontinuity
10.5 Estimating causal effects indirectly using instrumental variables
10.6 Instrumental variables in a regression framework
10.7 Identification strategies that make use of variation within or between groups
10.8 Bibliographic note
10.9 Exercises
Part 2A: Multilevel regression
11 Multilevel structures
11.1 Varying-intercept and varying-slope models
11.2 Clustered data: child support enforcement in cities
11.3 Repeated measurements, time-series cross sections, and other non-nested structures
11.4 Indicator variables and fixed or random effects
11.5 Costs and benefits of multilevel modeling
11.6 Bibliographic note
11.7 Exercises
12 Multilevel linear models: the basics
12.1 Notation
12.2 Partial pooling with no predictors
12.3 Partial pooling with predictors
12.4 Quickly fitting multilevel models in R
12.5 Five ways to write the same model
12.6 Group-level predictors
12.7 Model building and statistical significance
12.8 Predictions for new observations and new groups
12.9 How many groups and how many observations per group are needed to fit a multilevel model?
12.10 Bibliographic note
12.11 Exercises
13 Multilevel linear models: varying slopes, non-nested models, and other complexities
13.1 Varying intercepts and slopes
13.3 Modeling multiple varying coefficients using the scaled inverse-Wishart distribution
13.4 Understanding correlations between group-level intercepts and slopes
13.5 Non-nested models
13.6 Selecting, transforming, and combining regression inputs
13.7 More complex multilevel models
13.8 Bibliographic note
13.9 Exercises
14 Multilevel logistic regression
14.1 State-level opinions from national polls
14.2 Red states and blue states: what’s the matter with Connecticut?
14.3 Item-response and ideal-point models
14.4 Non-nested overdispersed model for death sentence reversals
14.5 Bibliographic note
14.6 Exercises
15 Multilevel generalized linear models
15.1 Overdispersed Poisson regression: police stops and ethnicity
15.2 Ordered categorical regression: storable votes
15.3 Non-nested negative-binomial model of structure in social networks
15.4 Bibliographic note
15.5 Exercises
Part 2B: Fitting multilevel models
16 Multilevel modeling in Bugs and R: the basics
16.1 Why you should learn Bugs
16.2 Bayesian inference and prior distributions
16.3 Fitting and understanding a varying-intercept multilevel model using R and Bugs
16.4 Step by step through a Bugs model, as called from R
16.5 Adding individual- and group-level predictors
16.6 Predictions for new observations and new groups
16.7 Fake-data simulation
16.8 The principles of modeling in Bugs
16.9 Practical issues of implementation
16.10 Open-ended modeling in Bugs
16.11 Bibliographic note
16.12 Exercises
17 Fitting multilevel linear and generalized linear models in Bugs and R
17.1 Varying-intercept, varying-slope models
17.2 Varying intercepts and slopes with group-level predictors
17.3 Non-nested models
17.4 Multilevel logistic regression
17.5 Multilevel Poisson regression
17.6 Multilevel ordered categorical regression
17.7 Latent-data parameterizations of generalized linear models
17.8 Bibliographic note
17.9 Exercises
18 Likelihood and Bayesian inference and computation
18.1 Least squares and maximum likelihood estimation
18.2 Uncertainty estimates using the likelihood surface
18.3 Bayesian inference for classical and multilevel regression
18.4 Gibbs sampler for multilevel linear models
18.5 Likelihood inference, Bayesian inference, and the Gibbs sampler: the case of censored data
18.6 Metropolis algorithm for more general Bayesian computation
18.7 Specifying a log posterior density, Gibbs sampler, and Metropolis algorithm in R
18.8 Bibliographic note
18.9 Exercises
19 Debugging and speeding convergence
19.1 Debugging and confidence building
19.2 General methods for reducing computational requirements
19.3 Simple linear transformations
19.4 Redundant parameters and intentionally nonidentifiable models
19.5 Parameter expansion: multiplicative redundant parameters
19.6 Using redundant parameters to create an informative prior distribution for multilevel variance parameters
19.7 Bibliographic note
19.8 Exercises
Part 3: From data collection to model understanding to model checking
20 Sample size and power calculations
20.1 Choices in the design of data collection
20.2 Classical power calculations: general principles, as illustrated by estimates of proportions
20.3 Classical power calculations for continuous outcomes
20.4 Multilevel power calculation for cluster sampling
20.5 Multilevel power calculation using fake-data simulation
20.6 Bibliographic note
20.7 Exercises
21 Understanding and summarizing the fitted models
21.1 Uncertainty and variability
21.2 Superpopulation and finite-population variances
21.3 Contrasts and comparisons of multilevel coefficients
21.4 Average predictive comparisons
21.5 R² and explained variance
21.6 Summarizing the amount of partial pooling
21.7 Adding a predictor can increase the residual variance!
21.8 Multiple comparisons and statistical significance
21.9 Bibliographic note
22 Analysis of variance
22.1 Classical analysis of variance
22.2 ANOVA and multilevel linear and generalized linear models
22.3 Summarizing multilevel models using ANOVA
22.4 Doing ANOVA using multilevel models
22.5 Adding predictors: analysis of covariance and contrast analysis
22.6 Modeling the variance parameters: a split-plot latin square
22.7 Bibliographic note
22.8 Exercises
23 Causal inference using multilevel models
23.1 Multilevel aspects of data collection
23.2 Estimating treatment effects in a multilevel observational study
23.3 Treatments applied at different levels
23.4 Instrumental variables and multilevel modeling
23.5 Bibliographic note
23.6 Exercises
24 Model checking and comparison
24.1 Principles of predictive checking
24.2 Example: a behavioral learning experiment
24.3 Model comparison and deviance
24.4 Bibliographic note
24.5 Exercises
25 Missing-data imputation
25.1 Missing-data mechanisms
25.2 Missing-data methods that discard data
25.3 Simple missing-data approaches that retain all the data
25.4 Random imputation of a single variable
25.5 Imputation of several missing variables
25.6 Model-based imputation
25.7 Combining inferences from multiple imputations
25.8 Bibliographic note
25.9 Exercises
Appendixes
A Six quick tips to improve your regression modeling
A.1 Fit many models
A.2 Do a little work to make your computations faster and more reliable
A.3 Graphing the relevant and not the irrelevant
A.4 Transformations
A.5 Consider all coefficients as potentially varying
A.6 Estimate causal inferences in a targeted way, not as a byproduct of a large regression
B Statistical graphics for research and presentation
B.1 Reformulating a graph by focusing on comparisons
B.2 Scatterplots
B.4 Bibliographic note
B.5 Exercises
C Software
C.1 Getting started with R, Bugs, and a text editor
C.2 Fitting classical and multilevel regressions in R
C.3 Fitting models in Bugs and R
C.4 Fitting multilevel models using R, Stata, SAS, and other software
C.5 Bibliographic note
References
Author index
List of examples
Home radon
Forecasting elections
State-level opinions from national polls
Police stops by ethnic group
Public opinion on the death penalty
Testing for election fraud
Sex ratio of births
Mothers’ education and children’s test scores
Height and weight
Beauty and teaching evaluations
Height and earnings
Handedness
Yields of mesquite bushes
Political party identification over time
Income and voting
Arsenic in drinking water
Death-sentencing appeals process
Ordered logistic model for storable votes
Cockroaches in apartments
Behavior of couples at risk for HIV
Academy Award voting
Incremental cost-effectiveness ratio
Unemployment time series
The Electric Company TV show
Hypothetical study of parenting quality as an intermediate outcome
Sesame Street TV show
Messy randomized experiment of cow feed
Incumbency and congressional elections
Value of a statistical life
Evaluating the Infant Health and Development Program
Ideology of congressmembers
Hypothetical randomized-encouragement study
Child support enforcement
Adolescent smoking
Rodents in apartments
Olympic judging
Time series of children’s CD4 counts
Flight simulator experiment
Latin square agricultural experiment
Income and voting by state
Item-response models
Ideal-point modeling for the Supreme Court
Speed dating
Social networks
Regression with censored data
Educational testing experiments
Zinc for HIV-positive children
Cluster sampling of New York City residents
Value added of school teachers
Advanced Placement scores and college grades
Prison sentences
Magnetic fields and brain functioning
Analysis of variance for web connect times
Split-plot latin square
Educational-subsidy program in Mexican villages
Checking models of behavioral learning in dogs
Missing data in the Social Indicators Survey
Preface
Aim of this book
This book originated as lecture notes for a course in regression and multilevel modeling, offered by the statistics department at Columbia University and attended by graduate students and postdoctoral researchers in social sciences (political science, economics, psychology, education, business, social work, and public health) and statistics. The prerequisite is statistics up to and including an introduction to multiple regression.
Advanced mathematics is not assumed—it is important to understand the linear model in regression, but it is not necessary to follow the matrix algebra in the derivation of least squares computations. It is useful to be familiar with exponents and logarithms, especially when working with generalized linear models.
After completing Part 1 of this book, you should be able to fit classical linear and generalized linear regression models—and do more with these models than simply look at their coefficients and their statistical significance. Applied goals include causal inference, prediction, comparison, and data description. After completing Part 2, you should be able to fit regression models for multilevel data. Part 3 takes you from data collection, through model understanding (looking at a table of estimated coefficients is usually not enough), to model checking and missing data. The appendixes include some reference materials on key tips, statistical graphics, and software for model fitting.
What you should be able to do after reading this book and working through the examples
This text is structured through models and examples, with the intention that after each chapter you should have certain skills in fitting, understanding, and displaying models:
• Part 1A: Fit, understand, and graph classical regressions and generalized linear models.
– Chapter 3: Fit linear regressions and be able to interpret and display estimated coefficients.
– Chapter 4: Build linear regression models by transforming and combining variables.
– Chapter 5: Fit, understand, and display logistic regression models for binary data.
– Chapter 6: Fit, understand, and display generalized linear models, including Poisson regression with overdispersion and ordered logit and probit models.
• Part 1B: Use regression to learn about quantities of substantive interest (not just regression coefficients).
– Chapter 7: Simulate probability models and uncertainty about inferences and predictions.
– Chapter 8: Check model fits using fake-data simulation and predictive simulation.
– Chapter 9: Understand assumptions underlying causal inference. Set up regressions for causal inference and understand the challenges that arise.
– Chapter 10: Understand the assumptions underlying propensity score matching, instrumental variables, and other techniques to perform causal inference when simple regression is not enough. Be able to use these when appropriate.
• Part 2A: Understand and graph multilevel models.
– Chapter 11: Understand multilevel data structures and models as generalizations of classical regression.
– Chapter 12: Understand and graph simple varying-intercept regressions and interpret as partial-pooling estimates.
– Chapter 13: Understand and graph multilevel linear models with varying intercepts and slopes, non-nested structures, and other complications.
– Chapter 14: Understand and graph multilevel logistic models.
– Chapter 15: Understand and graph multilevel overdispersed Poisson, ordered logit and probit, and other generalized linear models.
• Part 2B: Fit multilevel models using the software packages R and Bugs.
– Chapter 16: Fit varying-intercept regressions and understand the basics of Bugs. Check your programming using fake-data simulation.
– Chapter 17: Use Bugs to fit various models from Part 2A.
– Chapter 18: Understand Bayesian inference as a generalization of least squares and maximum likelihood. Use the Gibbs sampler to fit multilevel models.
– Chapter 19: Use redundant parameterizations to speed the convergence of the Gibbs sampler.
• Part 3:
– Chapter 20: Perform sample size and power calculations for classical and hierarchical models: standard-error formulas for basic calculations and fake-data simulation for harder problems.
– Chapter 21: Calculate and understand contrasts, explained variance, partial pooling coefficients, and other summaries of fitted multilevel models.
– Chapter 22: Use the ideas of analysis of variance to summarize fitted multilevel models; use multilevel models to perform analysis of variance.
– Chapter 23: Use multilevel models in causal inference.
– Chapter 24: Check the fit of models using predictive simulation.
– Chapter 25: Use regression to impute missing data in multivariate datasets.
In summary, you should be able to fit, graph, and understand classical and multilevel linear and generalized linear models and to use these model fits to make predictions and inferences about quantities of interest, including causal treatment effects.
Data for the examples and homework assignments and other resources for teaching and learning
The website www.stat.columbia.edu/~gelman/arm/ contains datasets used in the examples and homework problems of the book, as well as sample computer code. The website also includes some tips for teaching regression and multilevel modeling through class participation rather than lecturing. We plan to update these tips based on feedback from instructors and students; please send your comments and suggestions to gelman@stat.columbia.edu.
Outline of a course
When teaching a course based on this book, we recommend starting with a self-contained review of linear regression, logistic regression, and generalized linear models, focusing not on the mathematics but on understanding these methods and implementing them in a reasonable way. This is also a convenient way to introduce the statistical language R, which we use throughout for modeling, computation, and graphics. One thing that will probably be new to the reader is the use of random simulations to summarize inferences and predictions.
We then introduce multilevel models in the simplest case of nested linear models, fitting in the Bayesian modeling language Bugs and examining the results in R. Key concepts covered at this point are partial pooling, variance components, prior distributions, identifiability, and the interpretation of regression coefficients at different levels of the hierarchy. We follow with non-nested models, multilevel logistic regression, and other multilevel generalized linear models.
Next we detail the steps of fitting models in Bugs and give practical tips for reparameterizing a model to make it converge faster and additional tips on debugging. We also present a brief review of Bayesian inference and computation. Once the student is able to fit multilevel models, we move in the final weeks of the class to the final part of the book, which covers more advanced issues in data collection, model understanding, and model checking.
As we show throughout, multilevel modeling fits into a view of statistics that unifies substantive modeling with accurate data fitting, and graphical methods are crucial both for seeing unanticipated features in the data and for understanding the implications of fitted models.
Acknowledgments
We thank the many students and colleagues who have helped us understand and implement these ideas. Most important have been Jouni Kerman, David Park, and Joe Bafumi for years of suggestions throughout this project, and for many insights into how to present this material to students.
In addition, we thank Hal Stern and Gary King for discussions on the structure of this book; Chuanhai Liu, Xiao-Li Meng, Zaiying Huang, John Boscardin, Jouni Kerman, and Alan Zaslavsky for discussions about statistical computation; Iven Van Mechelen and Hans Berkhof for discussions about model checking; Iain Pardoe for discussions of average predictive effects and other summaries of regression models; Matt Salganik and Wendy McKelvey for suggestions on the presentation of sample size calculations; T. E. Raghunathan, Donald Rubin, Rajeev Dehejia, Michael Sobel, Guido Imbens, Samantha Cook, Ben Hansen, Dylan Small, and Ed Vytlacil for concepts of missing-data modeling and causal inference; Eric Loken for help in understanding identifiability in item-response models; Niall Bolger, Agustin Calatroni, John Carlin, Rafael Guerrero-Preston, Reid Landes, Eduardo Leoni, and Dan Rabinowitz for code in Stata, SAS, and SPSS; Hans Skaug for code in AD Model Builder; Uwe Ligges, Sibylle Sturtz, Douglas Bates, Peter Dalgaard, Martyn Plummer, and Ravi Varadhan for help with multilevel modeling and general advice on R; and the students in Statistics / Political Science 4330 at Columbia for their invaluable feedback throughout.
Collaborators on specific examples mentioned in this book include Phillip Price on the home radon study; Tom Little, David Park, Joe Bafumi, and Noah Kaplan on the models of opinion polls and political ideal points; Jane Waldfogel, Jeanne Brooks-Gunn, and Wen Han for the mothers and children’s intelligence data; Lex van Geen and Alex Pfaff on the arsenic in Bangladesh; Gary King on election forecasting; Jeffrey Fagan and Alex Kiss on the study of police stops; Tian Zheng and Matt Salganik on the social network analysis; John Carlin for the data on mesquite bushes and the adolescent-smoking study; Alessandra Casella and Tom Palfrey for the storable-votes study; Rahul Dodhia for the flight simulator example; Boris Shor, Joe Bafumi, and David Park on the voting and income study; Alan Edelman for the internet connections data; Donald Rubin for the Electric Company and educational-testing examples; Jeanne Brooks-Gunn and Jane Waldfogel for the mother and child IQ scores example and Infant Health and Development Program data; Nabila El-Bassel for the risky behavior data; Lenna Nepomnyaschy for the child support example; Howard Wainer with the Advanced Placement study; Iain Pardoe for the prison-sentencing example; James Liebman, Jeffrey Fagan, Valerie West, and Yves Chretien for the death-penalty study; Marcia Meyers, Julien Teitler, Irv Garfinkel, Marilyn Sinkowicz, and Sandra Garcia with the Social Indicators Study; Wendy McKelvey for the cockroach and rodent examples; Stephen Arpadi for the zinc and HIV study; Eric Verhoogen and Jan von der Goltz for the Progresa data; and Iven van Mechelen, Yuri Goegebeur, and Francis Tuerlincx on the stochastic learning models.
These applied projects motivated many of the methodological ideas presented here, for example the display and interpretation of varying-intercept, varying-slope models from the analysis of income and voting (see Section 14.2), the constraints in the model of senators’ ideal points (see Section 14.3), and the difficulties with two-level interactions as revealed by the radon study (see Section 21.7). Much of the work in Section 5.7 and Chapter 21 on summarizing regression models was done in collaboration with Iain Pardoe.
Many errors were found and improvements suggested by Brad Carlin, John Carlin, Samantha Cook, Caroline Rosenthal Gelman, Kosuke Imai, Jonathan Katz, Uwe Ligges, Wendy McKelvey, Jong-Hee Park, Martyn Plummer, Phillip Price, Song Qian, Dylan Small, Elizabeth Stuart, Sibylle Sturtz, and Alex Tabarrok. Brian MacDonald’s copyediting has saved us from much embarrassment, and we also thank Yu-Sung Su for typesetting help, Sarah Ryu for assistance with indexing, and Ed Parsons and his colleagues at Cambridge University Press for their help in putting this book together. We especially thank Bob O’Hara and Gregor Gorjanc for incredibly detailed and useful comments on the nearly completed manuscript.
We also thank the developers of free software, especially R (for statistical computation and graphics) and Bugs (for Bayesian modeling), and also Emacs and LaTeX (used in the writing of this book). We thank Columbia University for its collaborative environment for research and teaching, and the U.S. National Science Foundation for financial support. Above all, we thank our families for their love and support during the writing of this book.
CHAPTER 1
Why?
1.1 What is multilevel regression modeling?
Consider an educational study with data from students in many schools, predicting in each school the students’ grades y on a standardized test given their scores on a pre-test x and other information. A separate regression model can be fit within each school, and the parameters from these schools can themselves be modeled as depending on school characteristics (such as the socioeconomic status of the school’s neighborhood, whether the school is public or private, and so on). The student-level regression and the school-level regression here are the two levels of a multilevel model.
In this example, a multilevel model can be expressed in (at least) three equivalent ways as a student-level regression:
• A model in which the coefficients vary by school (thus, instead of a model such as $y = \alpha + \beta x + \text{error}$, we have $y = \alpha_j + \beta_j x + \text{error}$, where the subscripts $j$ index schools),
• A model with more than one variance component (student-level and school-level variation),
• A regression with many predictors, including an indicator variable for each school in the data.
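The link between the first and third formulations can be checked numerically for the simpler case in which only the intercept varies by school. The Python sketch below is illustrative only: the function names and numbers are hypothetical and do not come from the book, which works in R and Bugs. It computes a prediction once by indexing the school's intercept directly and once as a regression on one indicator variable per school, and confirms that the two agree.

```python
def indicator_row(school, n_schools):
    """One 0/1 indicator variable per school, for a single student."""
    return [1.0 if j == school else 0.0 for j in range(n_schools)]


def predict_indexed(alpha, beta, school, x):
    """Varying-intercept form: y = alpha_j + beta * x, with j the student's school."""
    return alpha[school] + beta * x


def predict_with_indicators(alpha, beta, school, x):
    """Many-predictor form: alpha_j is the coefficient on school j's indicator."""
    row = indicator_row(school, len(alpha))
    return sum(a_j * d_j for a_j, d_j in zip(alpha, row)) + beta * x


# Hypothetical intercepts for three schools and a common slope.
alpha = [48.0, 52.5, 50.0]
beta = 0.8
for school, x in [(0, 10.0), (1, 12.0), (2, 9.5)]:
    indexed = predict_indexed(alpha, beta, school, x)
    expanded = predict_with_indicators(alpha, beta, school, x)
    assert abs(indexed - expanded) < 1e-12
```

The sketch shows only that the two ways of writing the prediction coincide; it says nothing about how the intercepts are estimated.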
More generally, we consider a multilevel model to be a regression (a linear or generalized linear model) in which the parameters (the regression coefficients) are given a probability model. This second-level model has parameters of its own, the hyperparameters of the model, which are also estimated from data.
The two key parts of a multilevel model are varying coefficients, and a model for those varying coefficients (which can itself include group-level predictors). Classical regression can sometimes accommodate varying coefficients by using indicator variables. The feature that distinguishes multilevel models from classical regression is in the modeling of the variation between groups.
Models for regression coefficients
To give a preview of our notation, we write the regression equations for two multilevel models. To keep notation simple, we assume just one student-level predictor x (for example, a pre-test score) and one school-level predictor u (for example, average parents’ incomes).
Varying-intercept model. First we write the model in which the regressions have the same slope in each of the schools, and only the intercepts vary. We use the notation $i$ for individual students and $j[i]$ for the school $j$ containing student $i$:¹
$$
\begin{aligned}
y_i &= \alpha_{j[i]} + \beta x_i + \epsilon_i, && \text{for students } i = 1, \ldots, n \\
\alpha_j &= a + b u_j + \eta_j, && \text{for schools } j = 1, \ldots, J.
\end{aligned}
\tag{1.1}
$$
Here, $x_i$ and $u_j$ represent predictors at the student and school levels, respectively, and $\epsilon_i$ and $\eta_j$ are independent error terms at each of the two levels. The model can be written in several other equivalent ways, as we discuss in Section 12.5. The number of “data points” $J$ (here, schools) in the higher-level regression is typically much less than $n$, the sample size of the lower-level model (for students in this example).
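The index notation $j[i]$ in model (1.1) can be made concrete by simulating data from the model. The following is a minimal Python sketch, not code from the book (which uses R and Bugs); the function name, the standard normal distributions chosen for $x$ and $u$, and every numerical value are illustrative assumptions.

```python
import random

def simulate_varying_intercept(n_students, n_schools, a, b, beta,
                               sigma_y, sigma_alpha, seed=0):
    """Simulate data from model (1.1); all inputs are illustrative."""
    rng = random.Random(seed)
    # School level: predictor u_j and intercept alpha_j = a + b*u_j + eta_j
    u = [rng.gauss(0, 1) for _ in range(n_schools)]
    alpha = [a + b * u[j] + rng.gauss(0, sigma_alpha) for j in range(n_schools)]
    # Index variable j[i]: the school containing student i (0-based here)
    school = [rng.randrange(n_schools) for _ in range(n_students)]
    # Student level: predictor x_i and outcome y_i = alpha_{j[i]} + beta*x_i + eps_i
    x = [rng.gauss(0, 1) for _ in range(n_students)]
    y = [alpha[school[i]] + beta * x[i] + rng.gauss(0, sigma_y)
         for i in range(n_students)]
    return {"u": u, "alpha": alpha, "school": school, "x": x, "y": y}

# Hypothetical values: 1000 students in 20 schools
data = simulate_varying_intercept(n_students=1000, n_schools=20, a=50.0, b=5.0,
                                  beta=2.0, sigma_y=10.0, sigma_alpha=3.0)
```

The student-level data are stored as a single sequence indexed by $i$, with the grouping carried by the list `school`, which mirrors the preference for $j[i]$ over double subscripts.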
Varying-intercept, varying-slope model. More complicated is the model where intercepts and slopes both can vary by school:
$$
\begin{aligned}
y_i &= \alpha_{j[i]} + \beta_{j[i]} x_i + \epsilon_i, && \text{for students } i = 1, \ldots, n \\
\alpha_j &= a_0 + b_0 u_j + \eta_{j1}, && \text{for schools } j = 1, \ldots, J \\
\beta_j &= a_1 + b_1 u_j + \eta_{j2}, && \text{for schools } j = 1, \ldots, J.
\end{aligned}
$$
Compared to model (1.1), this has twice as many vectors of varying coefficients ($\alpha$, $\beta$), twice as many vectors of second-level coefficients ($a$, $b$), and potentially correlated second-level errors $\eta_1$, $\eta_2$. We will be able to handle these complications.
Labels
“Multilevel” or “hierarchical.” Multilevel models are also called hierarchical, for two different reasons: first, from the structure of the data (for example, students clustered within schools); and second, from the model itself, which has its own hierarchy, with the parameters of the within-school regressions at the bottom, controlled by the hyperparameters of the upper-level model.
Later we shall consider non-nested models—for example, individual observations that are nested within states and years. Neither “state” nor “year” is above the other in a hierarchical sense. In this sort of example, we can consider individuals, states, and years to be three different levels without the requirement of a full ordering or hierarchy. More complex structures, such as three-level nesting (for example, students within schools within school districts) are also easy to handle within the general multilevel framework.
Why we avoid the term “random effects.” Multilevel models are often known as random-effects or mixed-effects models. The regression coefficients that are being modeled are called random effects, in the sense that they are considered random outcomes of a process identified with the model that is predicting them. In contrast, fixed effects correspond either to parameters that do not vary (for example, fitting the same regression line for each of the schools) or to parameters that vary but are not modeled themselves (for example, fitting a least squares regression model with various predictors, including indicators for the schools). A mixed-effects model includes both fixed and random effects; for example, in model (1.1), the varying intercepts $\alpha_j$ have a group-level model, but $\beta$ is fixed and does not vary by group.
¹ The model can also be written as $y_{ij} = \alpha_j + \beta x_{ij} + \epsilon_{ij}$, where $y_{ij}$ is the measurement from student $i$ in school $j$. We prefer using the single sequence $i$ to index all students (and $j[i]$ to label schools) because this fits in better with our multilevel modeling framework with data and models at the individual and group levels. The data are $y_i$ because they can exist without reference to the groupings, and we prefer to include information about the groupings as numerical data (that is, the index variable $j[i]$) rather than through reordering the data through subscripting. We discuss the structure of the data and models further in Chapter 11.
Fixed effects can be viewed as special cases of random effects, in which the higher-level variance (in model (1.1), this would be $\sigma_\alpha^2$) is set to 0 or $\infty$. Hence, in our framework, all regression parameters are “random,” and the term “multilevel” is all-encompassing. As we discuss on page 245, we find the terms “fixed,” “random,” and “mixed” effects to be confusing and often misleading, and so we avoid their use.
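One way to see why the limits 0 and $\infty$ give the two classical cases is to look at the simplest version of model (1.1), with no predictors at either level and the variance parameters treated as known. In that setting the multilevel estimate of a school's intercept is approximately a precision-weighted average of the school's own mean and the overall mean. The symbols $n_j$ (number of students in school $j$), $\bar{y}_j$ (that school's sample mean), $\bar{y}_{\rm all}$ (the overall mean), and $\sigma_y$ (the student-level standard deviation) are introduced here only for illustration and are not defined in the passage above.

```latex
\hat{\alpha}_j \approx
\frac{\dfrac{n_j}{\sigma_y^2}\,\bar{y}_j + \dfrac{1}{\sigma_\alpha^2}\,\bar{y}_{\rm all}}
     {\dfrac{n_j}{\sigma_y^2} + \dfrac{1}{\sigma_\alpha^2}},
\qquad
\lim_{\sigma_\alpha \to 0} \hat{\alpha}_j = \bar{y}_{\rm all},
\qquad
\lim_{\sigma_\alpha \to \infty} \hat{\alpha}_j = \bar{y}_j .
```

With $\sigma_\alpha = 0$ the intercept is the same for every school (a parameter that does not vary), and with $\sigma_\alpha = \infty$ each school's intercept is estimated from its own data alone (a parameter that varies but is not modeled), matching the two senses of “fixed effects” described above.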
1.2 Some examples from our own research
Multilevel modeling can be applied to just about any problem. Just to give a feel of the ways it can be used, we give here a few examples from our applied work.
Combining information for local decisions: home radon measurement and remediation
Radon is a carcinogen (a naturally occurring radioactive gas whose decay products are also radioactive) known to cause lung cancer in high concentrations and estimated to cause several thousand lung cancer deaths per year in the United States. The distribution of radon levels in U.S. homes varies greatly, with some houses having dangerously high concentrations. In order to identify the areas with high radon exposures, the Environmental Protection Agency coordinated radon measurements in a random sample of more than 80,000 houses throughout the country.
To simplify the problem somewhat, our goal in analyzing these data was to estimate the distribution of radon levels in each of the approximately 3000 counties in the United States, so that homeowners could make decisions about measuring or remediating the radon in their houses based on the best available knowledge of local conditions. For the purpose of this analysis, the data were structured hierarchically: houses within counties. If we were to analyze multiple measurements within houses, there would be a three-level hierarchy of measurements, houses, and counties.
In performing the analysis, we had an important predictor: the floor on which the measurement was taken, either basement or first floor; radon comes from underground and can enter more easily when a house is built into the ground. We also had an important county-level predictor: a measurement of soil uranium that was available at the county level. We fit a model of the form (1.1), where $y_i$ is the logarithm of the radon measurement in house $i$, $x$ is the floor of the measurement (that is, 0 for basement and 1 for first floor), and $u$ is the uranium measurement at the county level. The errors $\epsilon_i$ in the first line of (1.1) represent “within-county variation,” which in this case includes measurement error, natural variation in radon levels within a house over time, and variation between houses (beyond what is explained by the floor of measurement). The errors $\eta_j$ in the second line represent variation between counties, beyond what is explained by the county-level uranium predictor.
The hierarchical model allows us to fit a regression model to the individual measurements while accounting for systematic unexplained variation among the 3000 counties. We return to this example in Chapter 12.
Modeling correlations: forecasting presidential elections
It is of practical interest to politicians and theoretical interest to political scientists that the outcomes of elections can be forecast with reasonable accuracy given information available months ahead of time. To understand this better, we set up a model to forecast presidential elections. Our predicted outcomes were the Democratic Party’s share of the two-party vote in each state in each of the 11 elections from 1948 through 1988, yielding 511 data points (the analysis excluded states that were won by third parties), and we had various predictors, including the performance of the Democrats in the previous election, measures of state-level and national economic trends, and national opinion polls up to two months before the election.
We set up our forecasting model two months before the 1992 presidential election and used it to make predictions for the 50 states. Predictions obtained using classical regression are reasonable, but when the model is evaluated historically (fitting to all but one election and then using the model to predict that election, then repeating this for the different past elections), the associated predictive intervals turn out to be too narrow: that is, the predictions are not as accurate as claimed by the model. Fewer than 50% of the predictions fall in the 50% predictive intervals, and fewer than 95% are inside the 95% intervals. The problem is that the 511 original data points are structured, and the state-level errors are correlated. It is overly optimistic to say that we have 511 independent data points.
Instead, we model
$$
y_i = \beta_0 + X_{i1}\beta_1 + X_{i2}\beta_2 + \cdots + X_{ik}\beta_k + \eta_{t[i]} + \delta_{r[i],t[i]} + \epsilon_i, \quad \text{for } i = 1, \ldots, n,
\tag{1.2}
$$
where $t[i]$ is an indicator for time (election year), $r[i]$ is an indicator for the region of the country (Northeast, Midwest, South, or West), and $n = 511$ is the number of state-years used to fit the model. For each election year, $\eta_t$ is a nationwide error and the $\delta_{r,t}$’s are four independent regional errors.
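The error terms must then be given distributions. As usual, the default is the normal distribution, which for this model we express as
$$
\begin{aligned}
\eta_t &\sim \mathrm{N}(0, \sigma_\eta^2), && \text{for } t = 1, \ldots, 11 \\
\delta_{r,t} &\sim \mathrm{N}(0, \sigma_\delta^2), && \text{for } r = 1, \ldots, 4;\ t = 1, \ldots, 11 \\
\epsilon_i &\sim \mathrm{N}(0, \sigma_\epsilon^2), && \text{for } i = 1, \ldots, 511.
\end{aligned}
\tag{1.3}
$$
In the multilevel model, all the parameters $\beta$, $\sigma_\eta$, $\sigma_\delta$, $\sigma_\epsilon$ are estimated from the data.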
We can then make a prediction by simulating the election outcome in the 50 states in the next election year, $t = 12$:
$$
y_i = \beta_0 + X_{i1}\beta_1 + X_{i2}\beta_2 + \cdots + X_{ik}\beta_k + \eta_{12} + \delta_{r[i],12} + \epsilon_i, \quad \text{for } i = n+1, \ldots, n+50.
$$
To define the predictive distribution of these 50 outcomes, we need the point predictors $X_i\beta = \beta_0 + X_{i1}\beta_1 + X_{i2}\beta_2 + \cdots + X_{ik}\beta_k$ and the state-level errors $\epsilon_i$ as before, but we also need a new national error $\eta_{12}$ and four new regional errors $\delta_{r,12}$, which we simulate from the distributions (1.3). The variation from these gives a more realistic statement of prediction uncertainties.
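The simulation step described above can be sketched in a few lines. This is a hedged Python illustration rather than the authors' code (the book works in R and Bugs): the point predictions, the assignment of states to regions, and the three standard deviations are hypothetical, and the sketch treats all parameters as known instead of propagating their estimation uncertainty.

```python
import random
import statistics

def simulate_elections(point_pred, region, sigma_eta, sigma_delta, sigma_eps,
                       n_sims=2000, seed=1):
    """Draw outcomes y_i = X_i*beta + eta_12 + delta_{r[i],12} + eps_i."""
    rng = random.Random(seed)
    n_regions = max(region) + 1
    sims = []
    for _ in range(n_sims):
        eta = rng.gauss(0, sigma_eta)            # one new national error
        delta = [rng.gauss(0, sigma_delta)       # one new error per region
                 for _ in range(n_regions)]
        sims.append([point_pred[i] + eta + delta[region[i]]
                     + rng.gauss(0, sigma_eps)   # state-level error
                     for i in range(len(point_pred))])
    return sims

# Hypothetical inputs: 50 states in 4 regions, every point prediction 0.5
s_eta, s_delta, s_eps = 0.02, 0.02, 0.03
point_pred = [0.5] * 50
region = [i % 4 for i in range(50)]
sims = simulate_elections(point_pred, region, s_eta, s_delta, s_eps)

# Spread of the simulated average across states, compared with what one
# would claim if all 50 errors were treated as independent
sd_avg = statistics.stdev(statistics.mean(s) for s in sims)
sd_if_independent = ((s_eta**2 + s_delta**2 + s_eps**2) / 50) ** 0.5
```

Because the national and regional errors are shared across states, `sd_avg` comes out several times larger than `sd_if_independent`, which is the sense in which treating the data points as independent is overly optimistic.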
Small-area estimation: state-level opinions from national polls
In a micro-level version of election forecasting, it is possible to predict the political opinions of individual voters given demographic information and where they live. Here the data sources are opinion polls rather than elections.
For example, we analyzed the data from seven CBS News polls from the 10 days immediately preceding the 1988 U.S. presidential election. For each survey respondent $i$, we label $y_i = 1$ if he or she preferred George Bush (the Republican candidate), 0 if he or she preferred Michael Dukakis (the Democrat). We excluded respondents who preferred others or had no opinion, leaving a sample size $n$ of about 6000. We then fit the model,
$$
\Pr(y_i = 1) = \mathrm{logit}^{-1}(X_i\beta),
$$
where $X$ included 85 predictors:
• A constant term
• An indicator for “female”
• An indicator for “black”
• An indicator for “female and black”
• 4 indicators for age categories (18–29, 30–44, 45–64, and 65+)
• 4 indicators for education categories (less than high school, high school, some college, college graduate)
• 16 indicators for age × education
• 51 indicators for states (including the District of Columbia)
• 5 indicators for regions (Northeast, Midwest, South, West, and D.C.)
• The Republican share of the vote for president in the state in the previous election.
In classical regression, it would be unwise to fit this many predictors because the estimates will be unreliable, especially for small states. In addition, it would be necessary to leave predictors out of each batch of indicators (the 4 age categories, the 4 education categories, the 16 age × education interactions, the 51 states, and the 5 regions) to avoid collinearity.
With a multilevel model, the coefficients for each batch of indicators are fit to a probability distribution, and it is possible to include all the predictors in the model. We return to this example in Section 14.1.
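For readers who want to check the bookkeeping, the short Python sketch below tallies the columns of $X$ listed above and evaluates the inverse logit function that maps a linear predictor $X_i\beta$ to a probability. The dictionary keys and the three example values of the linear predictor are illustrative assumptions, not quantities from the fitted model.

```python
import math

def inv_logit(z):
    """Inverse logistic function: maps a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

# Number of columns contributed by each batch of predictors in the list
predictor_counts = {
    "constant": 1, "female": 1, "black": 1, "female and black": 1,
    "age": 4, "education": 4, "age x education": 16,
    "state": 51, "region": 5, "previous Republican vote share": 1,
}
n_predictors = sum(predictor_counts.values())

# Hypothetical linear predictors X_i * beta for three respondents
probs = [inv_logit(z) for z in (-1.0, 0.0, 1.0)]
```

The tally gives 85 columns, and a linear predictor of 0 corresponds to a probability of 0.5 of preferring the Republican candidate.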
Social science modeling: police stops by ethnic group with variation across precincts
There have been complaints in New York City and elsewhere that the police harass members of ethnic minority groups. In 1999 the New York State Attorney General’s Office instigated a study of the New York City police department’s “stop and frisk” policy: the lawful practice of “temporarily detaining, questioning, and, at times, searching civilians on the street.” The police have a policy of keeping records on every stop and frisk, and this information was collated for all stops (about 175,000 in total) over a 15-month period in 1998–1999. We analyzed these data to see to what extent different ethnic groups were stopped by the police. We focused on blacks (African Americans), hispanics (Latinos), and whites (European Americans). We excluded others (about 4% of the stops) because of sensitivity to ambiguities in classifications. The ethnic categories were as recorded by the police making the stops.
It was found that blacks and hispanics represented 50% and 33% of the stops, respectively, despite constituting only 26% and 24%, respectively, of the population of the city. An arguably more relevant baseline comparison, however, is to the number of crimes committed by members of each ethnic group. Data on actual crimes are not available, of course, so as a proxy we used the number of arrests within New York City in 1997 as recorded by the Division of Criminal Justice Services (DCJS) of New York State. We used these numbers to represent the frequency of crimes that the police might suspect were committed by members of each group. When compared in that way, the ratio of stops to previous DCJS arrests was 1.24 for whites, 1.53 for blacks, and 1.72 for hispanics; the minority groups still appeared to be stopped disproportionately often.
These ratios are suspect too, however, because they average over the whole city. Suppose the police make more stops in high-crime areas but treat the different ethnic groups equally within any locality. Then the citywide ratios could show strong differences between ethnic groups even if stops are entirely determined by location rather than ethnicity. In order to separate these two kinds of predictors, we performed a multilevel analysis using the city’s 75 precincts. For each ethnic group $e = 1, 2, 3$ and precinct $p = 1, \ldots, 75$, we model the number of stops $y_{ep}$ using an overdispersed Poisson regression. The exponentiated coefficients from this model represent relative rates of stops compared to arrests for the different ethnic groups, after controlling for precinct. We return to this example in Section 15.1.
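The aggregation problem described above is easy to reproduce with made-up numbers. In the Python sketch below, every count is hypothetical and the group and precinct labels are placeholders: stops per previous arrest depend only on the precinct, yet the citywide ratios differ between the two groups because the groups are distributed differently across precincts.

```python
# Hypothetical counts of previous arrests by precinct and group
arrests = {
    "precinct_A": {"group_1": 100, "group_2": 900},
    "precinct_B": {"group_1": 900, "group_2": 100},
}
# Hypothetical stops per previous arrest: varies by precinct only, not by group
stops_per_arrest = {"precinct_A": 2.0, "precinct_B": 1.0}

stops = {p: {g: stops_per_arrest[p] * n for g, n in by_group.items()}
         for p, by_group in arrests.items()}

def citywide_ratio(group):
    """Ratio of total stops to total previous arrests for one group."""
    total_stops = sum(stops[p][group] for p in stops)
    total_arrests = sum(arrests[p][group] for p in arrests)
    return total_stops / total_arrests

ratios = {g: citywide_ratio(g) for g in ("group_1", "group_2")}
```

Here the citywide ratios are 1.1 and 1.9 even though both groups are treated identically within each precinct, which is why the comparison needs to control for precinct.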
1.3 Motivations for multilevel modeling
Multilevel models can be used for a variety of inferential goals including causal inference, prediction, and descriptive modeling.
Learning about treatment effects that vary
One of the basic goals of regression analysis is estimating treatment effects: how does y change when some x is varied, with all other inputs held constant? In many applications, it is not an overall effect of x that is of interest, but how this effect varies in the population. In classical statistics we can study this variation using interactions: for example, a particular educational innovation may be more effective for girls than for boys, or more effective for students who expressed more interest in school in a pre-test measurement.
Multilevel models also allow us to study effects that vary by group, for example an intervention that is more effective in some schools than others (perhaps because of unmeasured school-level factors such as teacher morale). In classical regression, estimates of varying effects can be noisy, especially when there are few observations per group; multilevel modeling allows us to estimate these interactions to the extent supported by the data.
Using all the data to perform inferences for groups with small sample size
A related problem arises when we are trying to estimate some group-level quantity, perhaps a local treatment effect or maybe simply a group-level average (as in the small-area estimation example on page 4). Classical estimation just using the local information can be essentially useless if the sample size is small in the group. At the other extreme, a classical regression ignoring group indicators can be misleading in ignoring group-level variation. Multilevel modeling allows the estimation of group averages and group-level effects, compromising between the overly noisy within-group estimate and the oversimplified regression estimate that ignores group indicators.
Prediction
Regression models are commonly used for predicting outcomes for new cases. But what if the data vary by group? Then we can make predictions for new units in existing groups or in new groups. The latter is difficult to do in classical regression: if a model ignores group effects, it will tend to understate the error in predictions for new groups. But a classical regression that includes group effects does not have any automatic way of getting predictions for a new group.
A natural attack on the problem is a two-stage regression, first including group indicators and then fitting a regression of estimated group effects on group-level predictors. One can then forecast for a new group, with the group effect predicted from the group-level model, and then the observations predicted from the unit-level model. However, if sample sizes are small in some groups, it can be difficult or even impossible to fit such a two-stage model classically, and fully accounting for the uncertainty at both levels leads directly to a multilevel model.
Analysis of structured data
Some datasets are collected with an inherent multilevel structure, for example, students within schools, patients within hospitals, or data from cluster sampling. Statistical theory (whether sampling-theory or Bayesian) says that inference should include the factors used in the design of data collection. As we shall see, multilevel modeling is a direct way to include indicators for clusters at all levels of a design, without being overwhelmed with the problems of overfitting that arise from applying least squares or maximum likelihood to problems with large numbers of parameters.
More efficient inference for regression parameters
Data often arrive with multilevel structure (students within schools and grades, laboratory assays on plates, elections in districts within states, and so forth). Even simple cross-sectional data (for example, a random sample survey of 1000 Americans) can typically be placed within a larger multilevel context (for example, an annual series of such surveys). The traditional alternatives to multilevel modeling are complete pooling, in which differences between groups are ignored, and no pooling, in which data from different sources are analyzed separately. As we shall discuss in detail throughout the book, both these approaches have problems: no pooling ignores information and can give unacceptably variable inferences, and complete pooling suppresses variation that can be important or even the main goal of a study. The extreme alternatives can in fact be useful as preliminary estimates, but ultimately we prefer the partial pooling that comes out of a multilevel analysis.
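The three options can be compared numerically in the simplest setting of estimating group means with no predictors. The Python sketch below is an illustration under stated assumptions, not the book's procedure: the data are invented, the within-group and between-group standard deviations (`sigma_y` and `sigma_alpha`) are treated as known rather than estimated, and partial pooling is computed as the usual precision-weighted average of each group's mean and the overall mean.

```python
import statistics

def pooling_estimates(groups, sigma_y, sigma_alpha):
    """Return complete-pooling, no-pooling, and partial-pooling estimates.

    groups: dict mapping a group label to its list of observations.
    sigma_y, sigma_alpha: within- and between-group sds, assumed known.
    """
    all_obs = [y for ys in groups.values() for y in ys]
    complete = statistics.mean(all_obs)          # ignores group differences
    no_pool = {g: statistics.mean(ys) for g, ys in groups.items()}
    partial = {}
    for g, ys in groups.items():
        w_data = len(ys) / sigma_y**2            # precision of the group mean
        w_group = 1.0 / sigma_alpha**2           # precision of group-level model
        partial[g] = ((w_data * no_pool[g] + w_group * complete)
                      / (w_data + w_group))
    return complete, no_pool, partial

# Hypothetical data: one group with 2 observations, one with 50
groups = {"small": [2.0, 4.0], "large": [0.9, 1.1] * 25}
complete, no_pool, partial = pooling_estimates(groups, sigma_y=2.0,
                                               sigma_alpha=1.0)
```

The estimate for the small group is pulled most of the way toward the overall mean, while the estimate for the large group stays close to its own average, which is the compromise that partial pooling provides.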
Including predictors at two different levels
In the radon example described in Section 1.2, we have outcome measurements at the individual level and predictors at the individual and county levels. How can this information be put together? One possibility is simply to run a classical regression with predictors at both levels. But this does not correct for differences between counties beyond what is included in the predictors. Another approach would be to augment this model with indicators (dummy variables) for the counties. But in a classical regression it is not possible to include county-level indicators along with county-level predictors, because the predictors would become collinear (see the end of Section 4.5 for a discussion of collinearity and nonidentifiability in this context).
Another approach is to fit the model with county indicators but without the county-level predictors, and then to fit a second model. This is possible but limited because it relies on the classical regression estimates of the coefficients for those county-level indicators, and if the data are sparse within counties, these estimates won’t be very good. Another possibility in the classical framework would be to fit separate models in each group, but this is not possible unless the sample size is large in each group. The multilevel model provides a coherent model that simultaneously incorporates both individual- and group-level models.
Getting the right standard error: accurately accounting for uncertainty in prediction and estimation
Another motivation for multilevel modeling is for predictions, for example, when forecasting state-by-state outcomes of U.S. presidential elections, as described in Section 1.2. To get an accurate measure of predictive uncertainty, one must account for correlation of the outcome between states in a given election year. Multilevel modeling is a convenient way to do this.
For certain kinds of predictions, multilevel models are essential. For example, consider a model of test scores for students within schools. In classical regression, school-level variability might be modeled by including an indicator variable for each school. In this framework though, it is impossible to make a prediction for a new student in a new school, because there would not be an indicator for this new school in the model. This prediction problem is handled seamlessly using multilevel models.
1.4 Distinctive features of this book
The topics and methods covered in this book overlap with many other textbooks on regression, multilevel modeling, and applied statistics. We differ from most other books in these areas in the following ways:
• We present methods and software that allow the reader to fit complicated, linear or nonlinear, nested or non-nested models. We emphasize the use of the statistical software packages R and Bugs and provide code for many examples as well as methods such as redundant parameterization that speed computation and lead to new modeling ideas.
• We include a wide range of examples, almost all from our own applied research. The statistical methods are thus motivated in the best way, as successful practical tools.
• Most books define regression in terms of matrix operations. We avoid much of this matrix algebra for the simple reason that it is now done automatically by computers. We are more interested in understanding the “forward,” or predictive, matrix multiplication $X\beta$ than the more complicated inferential formula $(X^tX)^{-1}X^ty$. The latter computation and its generalizations are important but can be done out of sight of the user. For details of the underlying matrix algebra, we refer readers to the regression textbooks listed in Section 3.8.
• We try as much as possible to display regression results graphically rather than through tables. Here we apply ideas such as those presented in the books by Ramsey and Schafer (2001) for classical regression and Kreft and De Leeuw (1998) for multilevel models. We consider graphical display of model estimates to be not just a useful teaching method but also a necessary tool in applied research.
Statistical texts commonly recommend graphical displays for model diagnostics. These can be very useful, and we refer readers to texts such as Cook and Weisberg (1999) for more on this topic, but here we are emphasizing graphical displays of the fitted models themselves. It is our experience that, even when a model fits data well, we have difficulty understanding it if all we do is look at tables of regression coefficients.
• We consider multilevel modeling as generally applicable to structured data, not limited to clustered data, panel data, or nested designs. For example, in a random-digit-dialed survey of the United States, one can, and should, use multilevel models if one is interested in estimating differences among states or demographic subgroups, even if no multilevel structure is in the survey design.
Ultimately, you have to learn these methods by doing it yourself, and this chapter is intended to make things easier by recounting stories about how we learned this by doing it ourselves. But we warn you ahead of time that we include more of our successes than our failures.
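The distinction drawn in the list above between the forward multiplication $X\beta$ and the inferential formula $(X^tX)^{-1}X^ty$ can be illustrated for the smallest case, a regression with an intercept and one slope. This Python sketch is only an illustration with invented numbers; in practice the inferential step is left to the software, which is the point being made.

```python
def predict(X, beta):
    """Forward (predictive) computation: the matrix product X * beta."""
    return [sum(x_ik * b_k for x_ik, b_k in zip(row, beta)) for row in X]

def least_squares_simple(x, y):
    """Inferential computation (X^t X)^{-1} X^t y for X = [1, x]."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(xi * xi for xi in x)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    det = n * sxx - sx * sx              # determinant of the 2x2 matrix X^t X
    intercept = (sxx * sy - sx * sxy) / det
    slope = (n * sxy - sx * sy) / det
    return [intercept, slope]

# Hypothetical predictor values and coefficients
x = [0.0, 1.0, 2.0, 3.0]
X = [[1.0, xi] for xi in x]
y = predict(X, [1.0, 2.0])               # noise-free outcomes from known beta
beta_hat = least_squares_simple(x, y)    # recovers intercept 1 and slope 2
```

The forward step is a one-line sum of products, while the inferential step already needs a matrix inverse even with two coefficients.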
Costs and benefits of our approach
Doing statistics as described in this book is not easy. The difficulties are not mathematical but rather conceptual and computational. For classical regressions and generalized linear models, the actual fitting is easy (as illustrated in Part 1), but programming effort is still required to graph the results relevantly and to simulate predictions and replicated data. When we move to multilevel modeling, the fitting itself gets much more complicated (see Part 2B), and displaying and checking the models require correspondingly more work. Our emphasis on R and Bugs means that an initial effort is required simply to learn and use the software. Also, compared to usual treatments of multilevel models, we describe a wider variety of modeling options for the researcher so that more decisions will need to be made.
A simpler alternative is to use classical regression and generalized linear modeling where possible—this can be done in R or, essentially equivalently, in Stata, SAS, SPSS, and various other software—and then, when multilevel modeling is really needed, to use functions that adapt classical regression to handle simple multilevel models. Such functions, which can be run with only a little more effort than simple regression fitting, exist in many standard statistical packages.
Compared to these easier-to-use programs, our approach has several advantages:
• We can fit a greater variety of models. The modular structure of Bugs allows us to add complexity where needed to fit data and study patterns of interest.
• By working with simulations (rather than simply point estimates of parameters), we can directly capture inferential uncertainty and propagate it into predictions (as discussed in Chapter 7 and applied throughout the book). We can directly obtain inference for quantities other than regression coefficients and variance parameters.
• R lets us display inferences and data flexibly.
We recognize, however, that other software and approaches may be useful too, either as starting points or to check results. Section C.4 describes briefly how to fit multilevel models in several other popular statistical software packages.
1.5 Computing
We perform computer analyses using the freely available software R and Bugs. Appendix C gives instructions on obtaining and using these programs. Here we outline how these programs fit into our overall strategy for data analysis.
Our general approach to statistical computing
In any statistical analysis, we like to be able to directly manipulate the data, model, and inferences. We just about never know the right thing to do ahead of time, so we have to spend much of our effort examining and cleaning the data, fitting many different models, summarizing the inferences from the models in different ways, and then going back and figuring how to expand the model to allow new data to be included in the analysis.
It is important, then, to be able to select subsets of the data, to graph whatever aspect of the data might be of interest, and to be able to compute numerical summaries and fit simple models easily. All this can be done within R; you will have to put some initial effort into learning the language, but it will pay off later.
You will almost always need to try many different models for any problem: not just different subsets of predictor variables as in linear regression, and not just minor changes such as fitting a logit or probit model, but entirely different formulations of the model—different ways of relating observed inputs to outcomes. This is especially true when using new and unfamiliar tools such as multilevel models. In Bugs, we can easily alter the internal structure of the models we are fitting, in a way that cannot easily be done with other statistical software.
Finally, our analyses are almost never simply summarized by a set of parameter estimates and standard errors. As we illustrate throughout, we need to look carefully at our inferences to see if they make sense and to understand the operation of the model, and we usually need to postprocess the parameter estimates to get predictions or generalizations to new settings. These inference manipulations are similar to data manipulations, and we do them in R to have maximum flexibility.
Model fitting in Part 1
Part 1 of this book uses the R software for three general tasks: (1) fitting classical linear and generalized linear models, (2) graphing data and estimated models, and (3) using simulation to propagate uncertainty in inferences and predictions (see Sections 7.1–7.2 for more on this).
Model fitting in Parts 2 and 3
When we move to multilevel modeling, we begin by fitting directly in R; however, for more complicated models we move to Bugs, which has a general language for writing statistical models. We call Bugs from R and continue to use R for preprocessing of data, graphical display of data and inferences, and simulation-based prediction and model checking.
R and S
Our favorite all-around statistics software is R, which is a free open-source version of S, a program developed in the 1970s and 1980s at Bell Laboratories. S is also available commercially as S-Plus. We shall refer to R throughout, but other versions of S generally do the same things.
R is excellent for graphics, classical statistical modeling (most relevant here are the lm() and glm() functions for linear and generalized linear models), and various nonparametric methods. As we discuss in Part 2, the lmer() function provides quick fits in R for many multilevel models. Other packages such as MCMCpack exist to fit specific classes of models in R, and other such programs are in development.
Beyond the specific models that can be fit by these packages, R is fully programmable and can thus fit any model, if enough programming is done. It is possible to link R to Fortran or C to write faster programs. R also can choke on large datasets (which is one reason we automatically “thin” large Bugs outputs before reading into R; see Section 16.9).
Bugs
Bugs (an acronym for Bayesian Inference using Gibbs Sampling) is a program developed by statisticians at the Medical Research Council in Cambridge, England. As of this writing, the most powerful versions available are WinBugs 1.4 and OpenBugs. In this book, when we say “Bugs,” we are referring to WinBugs 1.4; however, the code should also work (perhaps with some modification) under OpenBugs or future implementations.
The Bugs modeling language has a modular form that allows the user to put together all sorts of Bayesian models, including most of the multilevel models currently fit in social science applications. The two volumes of online examples in Bugs give some indication of the possibilities; in fact, it is common practice to write a Bugs script by starting with an example with similar features and then altering it step by step to fit the particular problem at hand.
The key advantage of Bugs is its generality in setting up models; its main disadvantage is that it is slow and can get stuck with large datasets. These problems can be somewhat reduced in practice by randomly sampling from the full data to create a smaller dataset for preliminary modeling and debugging, saving the full data until you are clear on what model you want to fit. (This is simply a computational trick and should not be confused with cross-validation, a statistical method in which a procedure is applied to a subset of the data and then checked using the rest of the data.) Bugs does not always use the most efficient simulation algorithms, and currently its most powerful version runs only in Windows, which in practice reduces the ability to implement long computations in time-share with other processes.
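The subsampling trick described above can be sketched as follows. This is a minimal illustration in Python (standard library only), not code from the book; the helper `subsample_rows`, the toy records, and the subsample size of 2,000 are all hypothetical.

```python
import random

def subsample_rows(rows, n_keep, seed=None):
    """Simple random sample of n_keep rows, drawn without replacement."""
    rng = random.Random(seed)
    if n_keep >= len(rows):
        return list(rows)          # nothing to cut: return a copy of everything
    return rng.sample(rows, n_keep)

# Toy stand-in for a large dataset: one record per observation
full_data = [{"id": i, "y": i % 7} for i in range(100_000)]

# Small dataset for preliminary modeling and debugging
debug_data = subsample_rows(full_data, n_keep=2_000, seed=1)
print(len(debug_data))
```

Fixing the seed makes the subsample reproducible while a model is being debugged. Once the model is settled, it would be refit to `full_data`; the held-back rows are not used to check the fit, which is what distinguishes this from cross-validation.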
When fitting complicated models, we set up the data in R, fit models in Bugs, then go back to R for further statistical analysis using the fitted models.
Some models cannot be fit in Bugs. For these we illustrate in Section 15.3 a new R package under development called Umacs (universal Markov chain sampler). Umacs is less automatic than Bugs and requires more knowledge of the algebra of Bayesian inference.
Other software
Some statistical software has been designed specifically for fitting multilevel models, notably MLWin and HLM. It is also possible to fit some multilevel models in R, Stata, SAS, and other general-purpose statistical software, but without the flexibility of modeling in Bugs. The models allowed by these programs are less general than those available in Bugs; however, the programs are generally faster and can handle larger datasets. We discuss these packages further in Section C.4.
Data and code for examples
Data and computer code for the examples and exercises in the book can be downloaded at the website www.stat.columbia.edu/~gelman/arm/, which also includes other supporting materials for this book.
CHAPTER 2
Concepts and methods from basic probability and statistics
Simple methods from introductory statistics have three important roles in regression and multilevel modeling. First, simple probability distributions are the building blocks for elaborate models. Second, multilevel models are generalizations of classical complete-pooling and no-pooling estimates, and so it is important to understand where these classical estimates come from. Third, it is often useful in practice to construct quick confidence intervals and hypothesis tests for small parts of a problem, before fitting an elaborate model or in understanding the output from such a model.
This chapter provides a quick review of some of these methods.
2.1 Probability distributions
A probability distribution corresponds to an urn with a potentially infinite number of balls inside. When a ball is drawn at random, the “random variable” is what is written on this ball.
Areas of application of probability distributions include:
• Distributions of data (for example, heights of men, heights of women, heights of adults), for which we use the notation $y_i$, $i = 1, \ldots, n$.
• Distributions of parameter values, for which we use the notation $\theta_j$, $j = 1, \ldots, J$, or other Greek letters such as $\alpha$, $\beta$, $\gamma$. We shall see many of these with the multilevel models in Part 2 of the book. For now, consider a regression model (for example, predicting students’ grades from pre-test scores) fit separately in each of several schools. The coefficients of the separate regressions can be modeled as following a distribution, which can be estimated from data.
• Distributions of error terms, which we write as $\epsilon_i$, $i = 1, \ldots, n$, or, for group-level errors, $\eta_j$, $j = 1, \ldots, J$.
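To make the second item concrete, the following sketch simulates grades and pre-test scores in several schools, fits a separate least-squares line in each school, and then summarizes the estimated slopes as a distribution. It is an illustration in Python (standard library only), not an analysis from the book; the number of schools, the sample sizes, and every numerical value in the simulation are assumptions chosen for the example, and `fit_line` is a hypothetical helper.

```python
import random
import statistics

def fit_line(x, y):
    """Least-squares intercept and slope for the line y = a + b*x."""
    xbar, ybar = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b = sxy / sxx
    return ybar - b * xbar, b

random.seed(42)
J, n = 20, 50                      # assumed: 20 schools, 50 students each
slopes = []
for j in range(J):
    beta_j = random.gauss(0.8, 0.1)                     # assumed true slope for school j
    pre = [random.gauss(50, 10) for _ in range(n)]      # simulated pre-test scores
    grade = [10 + beta_j * p + random.gauss(0, 5) for p in pre]
    a_hat, b_hat = fit_line(pre, grade)                 # separate regression for this school
    slopes.append(b_hat)

# The J estimated slopes, summarized as a distribution
print(statistics.fmean(slopes), statistics.stdev(slopes))
```

The spread of the separately estimated slopes mixes genuine school-to-school variation with estimation noise from each school's regression, so it tends to overstate the variation in the underlying slopes.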
A “distribution” is how we describe a set of objects that are not identified, or when the identification gives no information. For example, the heights of a set of unnamed persons have a distribution, as contrasted with the heights of a particular set of your friends.
The basic way that distributions are used in statistical modeling is to start by fitting a distribution to data $y$, then get predictors $X$ and model $y$ given $X$ with errors $\epsilon$. Further information in $X$ can change the distribution of the $\epsilon$’s (typically, by reducing their variance). Distributions are often thought of as data summaries, but in the regression context they are more commonly applied to $\epsilon$’s.
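A small simulation can illustrate how adding a predictor reduces the variance of the errors. The sketch below, in Python with only the standard library, first summarizes $y$ on its own and then fits a least-squares line on a single predictor $x$; the data-generating values (intercept 2.0, slope 1.5, unit error variance) are assumptions made for this example only.

```python
import random
import statistics

random.seed(7)
n = 1000
x = [random.gauss(0, 1) for _ in range(n)]
y = [2.0 + 1.5 * xi + random.gauss(0, 1) for xi in x]   # assumed model for the simulation

# Step 1: describe y by itself; all of its variation is unexplained
var_y = statistics.variance(y)

# Step 2: model y given x by least squares and look at what is left over
xbar, ybar = statistics.fmean(x), statistics.fmean(y)
b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sum((xi - xbar) ** 2 for xi in x)
a = ybar - b * xbar
resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
var_resid = statistics.variance(resid)   # sample variance of the residuals (divisor n - 1)

print(var_y, var_resid)
```

With these settings the variance of $y$ should come out near $1.5^2 + 1 = 3.25$ and the residual variance near 1; the exact values depend on the random draws.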
Normal distribution; means and variances
The Central Limit Theorem of probability states that the sum of many small independent random variables will be a random variable with an approximate normal