
Data Analysis Using Regression and Multilevel/Hierarchical Models

Data Analysis Using Regression and Multilevel/Hierarchical Models is a comprehensive manual for the applied researcher who wants to perform data analysis using linear and nonlinear regression and multilevel models. The book introduces and demonstrates a wide variety of models, at the same time instructing the reader in how to fit these models using freely available software packages. The book illustrates the concepts by working through scores of real data examples that have arisen in the authors’ own applied research, with programming code provided for each one. Topics covered include causal inference, including regression, poststratification, matching, regression discontinuity, and instrumental variables, as well as multilevel logistic regression and missing-data imputation. Practical tips regarding building, fitting, and understanding are provided throughout.

Andrew Gelman is Professor of Statistics and Professor of Political Science at Columbia University. He has published more than 150 articles in statistical theory, methods, and computation and in applications areas including decision analysis, survey sampling, political science, public health, and policy. His other books are Bayesian Data Analysis (1995, second edition 2003) and Teaching Statistics: A Bag of Tricks (2002).

Jennifer Hill is Assistant Professor of Public Affairs in the Department of International and Public Affairs at Columbia University. She has coauthored articles that have appeared in the Journal of the American Statistical Association, American Political Science Review, American Journal of Public Health, Developmental Psychology, the Economic Journal, and the Journal of Policy Analysis and Management, among others.

Analytical Methods for Social Research

Analytical Methods for Social Research presents texts on empirical and formal methods for the social sciences. Volumes in the series address both the theoretical underpinnings of analytical techniques and their application in social research. Some series volumes are broad in scope, cutting across a number of disciplines. Others focus mainly on methodological applications within specific fields such as political science, sociology, demography, and public health. The series serves a mix of students and researchers in the social sciences and statistics.

Series Editors:

R. Michael Alvarez, California Institute of Technology

Nathaniel L. Beck, New York University

Lawrence L. Wu, New York University

Other Titles in the Series:

Event History Modeling: A Guide for Social Scientists, by Janet M. Box-Steffensmeier and Bradford S. Jones

Ecological Inference: New Methodological Strategies, edited by Gary King, Ori Rosen, and Martin A. Tanner

Spatial Models of Parliamentary Voting, by Keith T. Poole

Essential Mathematics for Political and Social Research, by Jeff Gill

Data Analysis Using Regression and

Multilevel/Hierarchical Models

ANDREW GELMAN

Columbia University

JENNIFER HILL

Columbia University

CAMBRIDGE UNIVERSITY PRESS

Cambridge, New York, Melbourne, Madrid, Cape Town, Singapore, São Paulo

Cambridge University Press
The Edinburgh Building, Cambridge CB2 8RU, UK

Published in the United States of America by Cambridge University Press, New York

Information on this title: www.cambridge.org/9780521867061

© Andrew Gelman and Jennifer Hill 2007

This publication is in copyright. Subject to statutory exception and to the provision of relevant collective licensing agreements, no reproduction of any part may take place without the written permission of Cambridge University Press.

First published in print format 2006

ISBN-13 978-0-511-26878-6 eBook (EBL)
ISBN-10 0-511-26878-5 eBook (EBL)
ISBN-13 978-0-521-86706-1 hardback
ISBN-10 0-521-86706-1 hardback
ISBN-13 978-0-521-68689-1 paperback
ISBN-10 0-521-68689-X paperback

Cambridge University Press has no responsibility for the persistence or accuracy of URLs for external or third-party internet websites referred to in this publication, and does not guarantee that any content on such websites is, or will remain, accurate or appropriate.

www.cambridge.org

Data Analysis Using Regression and

Multilevel/Hierarchical Models

(Corrected final version: 9 Aug 2006)

Please do not reproduce in any form

without permission

Andrew Gelman

Department of Statistics and Department of Political Science

Columbia University, New York

Jennifer Hill

School of International and Public Affairs

Columbia University, New York

© 2002, 2003, 2004, 2005, 2006 by Andrew Gelman and Jennifer Hill

To be published in October, 2006 by Cambridge University Press

Contents

List of examples page xvii

Preface xix

1 Why? 1

1.1 What is multilevel regression modeling? 1

1.2 Some examples from our own research 3

1.3 Motivations for multilevel modeling 6

1.4 Distinctive features of this book 8

1.5 Computing 9

2 Concepts and methods from basic probability and statistics 13

2.1 Probability distributions 13

2.2 Statistical inference 16

2.3 Classical confidence intervals 18

2.4 Classical hypothesis testing 20

2.5 Problems with statistical significance 22

2.6 55,000 residents desperately need your help! 23

2.7 Bibliographic note 26

2.8 Exercises 26

Part 1A: Single-level regression 29

3 Linear regression: the basics 31

3.1 One predictor 31

3.2 Multiple predictors 32

3.3 Interactions 34

3.4 Statistical inference 37

3.5 Graphical displays of data and fitted model 42

3.6 Assumptions and diagnostics 45

3.7 Prediction and validation 47

3.8 Bibliographic note 49

3.9 Exercises 49

4 Linear regression: before and after fitting the model 53

4.1 Linear transformations 53

4.2 Centering and standardizing, especially for models with interactions 55

4.3 Correlation and “regression to the mean” 57

4.4 Logarithmic transformations 59

4.5 Other transformations 65

4.6 Building regression models for prediction 68

4.7 Fitting a series of regressions 73

4.8 Bibliographic note 74

4.9 Exercises 74

5 Logistic regression 79

5.1 Logistic regression with a single predictor 79

5.2 Interpreting the logistic regression coefficients 81

5.3 Latent-data formulation 85

5.4 Building a logistic regression model: wells in Bangladesh 86

5.5 Logistic regression with interactions 92

5.6 Evaluating, checking, and comparing fitted logistic regressions 97

5.7 Average predictive comparisons on the probability scale 101

5.8 Identifiability and separation 104

5.9 Bibliographic note 105

5.10 Exercises 105

6 Generalized linear models 109

6.1 Introduction 109

6.2 Poisson regression, exposure, and overdispersion 110

6.3 Logistic-binomial model 116

6.4 Probit regression: normally distributed latent data 118

6.5 Multinomial regression 119

6.6 Robust regression using the t model 124

6.7 Building more complex generalized linear models 125

6.8 Constructive choice models 127

6.9 Bibliographic note 131

6.10 Exercises 132

Part 1B: Working with regression inferences 135

7 Simulation of probability models and statistical inferences 137

7.1 Simulation of probability models 137

7.2 Summarizing linear regressions using simulation: an informal Bayesian approach 140

7.3 Simulation for nonlinear predictions: congressional elections 144

7.4 Predictive simulation for generalized linear models 148

7.5 Bibliographic note 151

7.6 Exercises 152

8 Simulation for checking statistical procedures and model fits 155

8.1 Fake-data simulation 155

8.2 Example: using fake-data simulation to understand residual plots 157

8.3 Simulating from the fitted model and comparing to actual data 158

8.4 Using predictive simulation to check the fit of a time-series model 163

8.5 Bibliographic note 165

8.6 Exercises 165

9 Causal inference using regression on the treatment variable 167

9.1 Causal inference and predictive comparisons 167

9.2 The fundamental problem of causal inference 170

9.3 Randomized experiments 172

9.5 Observational studies 181

9.6 Understanding causal inference in observational studies 186

9.7 Do not control for post-treatment variables 188

9.8 Intermediate outcomes and causal paths 190

9.9 Bibliographic note 194

9.10 Exercises 194

10 Causal inference using more advanced models 199

10.1 Imbalance and lack of complete overlap 199

10.2 Subclassification: effects and estimates for different subpopulations 204

10.3 Matching: subsetting the data to get overlapping and balanced treatment and control groups 206

10.4 Lack of overlap when the assignment mechanism is known: regression discontinuity 212

10.5 Estimating causal effects indirectly using instrumental variables 215

10.6 Instrumental variables in a regression framework 220

10.7 Identification strategies that make use of variation within or between groups 226

10.8 Bibliographic note 229

10.9 Exercises 231

Part 2A: Multilevel regression 235

11 Multilevel structures 237

11.1 Varying-intercept and varying-slope models 237

11.2 Clustered data: child support enforcement in cities 237

11.3 Repeated measurements, time-series cross sections, and other non-nested structures 241

11.4 Indicator variables and fixed or random effects 244

11.5 Costs and benefits of multilevel modeling 246

11.6 Bibliographic note 247

11.7 Exercises 248

12 Multilevel linear models: the basics 251

12.1 Notation 251

12.2 Partial pooling with no predictors 252

12.3 Partial pooling with predictors 254

12.4 Quickly fitting multilevel models in R 259

12.5 Five ways to write the same model 262

12.6 Group-level predictors 265

12.7 Model building and statistical significance 270

12.8 Predictions for new observations and new groups 272

12.9 How many groups and how many observations per group are needed to fit a multilevel model? 275

12.10 Bibliographic note 276

12.11 Exercises 277

13 Multilevel linear models: varying slopes, non-nested models, and other complexities 279

13.1 Varying intercepts and slopes 279

13.3 Modeling multiple varying coefficients using the scaled inverse-Wishart distribution 284

13.4 Understanding correlations between group-level intercepts and slopes 287

13.5 Non-nested models 289

13.6 Selecting, transforming, and combining regression inputs 293

13.7 More complex multilevel models 297

13.8 Bibliographic note 297

13.9 Exercises 298

14 Multilevel logistic regression 301

14.1 State-level opinions from national polls 301

14.2 Red states and blue states: what’s the matter with Connecticut? 310

14.3 Item-response and ideal-point models 314

14.4 Non-nested overdispersed model for death sentence reversals 320

14.5 Bibliographic note 321

14.6 Exercises 322

15 Multilevel generalized linear models 325

15.1 Overdispersed Poisson regression: police stops and ethnicity 325

15.2 Ordered categorical regression: storable votes 331

15.3 Non-nested negative-binomial model of structure in social networks 332

15.4 Bibliographic note 342

15.5 Exercises 342

Part 2B: Fitting multilevel models 343

16 Multilevel modeling in Bugs and R: the basics 345

16.1 Why you should learn Bugs 345

16.2 Bayesian inference and prior distributions 345

16.3 Fitting and understanding a varying-intercept multilevel model using R and Bugs 348

16.4 Step by step through a Bugs model, as called from R 353

16.5 Adding individual- and group-level predictors 359

16.6 Predictions for new observations and new groups 361

16.7 Fake-data simulation 363

16.8 The principles of modeling in Bugs 366

16.9 Practical issues of implementation 369

16.10 Open-ended modeling in Bugs 370

16.11 Bibliographic note 373

16.12 Exercises 373

17 Fitting multilevel linear and generalized linear models in Bugs and R 375

17.1 Varying-intercept, varying-slope models 375

17.2 Varying intercepts and slopes with group-level predictors 379

17.3 Non-nested models 380

17.4 Multilevel logistic regression 381

17.5 Multilevel Poisson regression 382

17.6 Multilevel ordered categorical regression 383

17.7 Latent-data parameterizations of generalized linear models 384

17.8 Bibliographic note 385

17.9 Exercises 385

18 Likelihood and Bayesian inference and computation 387

18.1 Least squares and maximum likelihood estimation 387

18.2 Uncertainty estimates using the likelihood surface 390

18.3 Bayesian inference for classical and multilevel regression 392

18.4 Gibbs sampler for multilevel linear models 397

18.5 Likelihood inference, Bayesian inference, and the Gibbs sampler: the case of censored data 402

18.6 Metropolis algorithm for more general Bayesian computation 408

18.7 Specifying a log posterior density, Gibbs sampler, and Metropolis algorithm in R 409

18.8 Bibliographic note 413

18.9 Exercises 413

19 Debugging and speeding convergence 415

19.1 Debugging and confidence building 415

19.2 General methods for reducing computational requirements 418

19.3 Simple linear transformations 419

19.4 Redundant parameters and intentionally nonidentifiable models 419

19.5 Parameter expansion: multiplicative redundant parameters 424

19.6 Using redundant parameters to create an informative prior distribution for multilevel variance parameters 427

19.7 Bibliographic note 434

19.8 Exercises 434

Part 3: From data collection to model understanding to model checking 435

20 Sample size and power calculations 437

20.1 Choices in the design of data collection 437

20.2 Classical power calculations: general principles, as illustrated by estimates of proportions 439

20.3 Classical power calculations for continuous outcomes 443

20.4 Multilevel power calculation for cluster sampling 447

20.5 Multilevel power calculation using fake-data simulation 449

20.6 Bibliographic note 454

20.7 Exercises 454

21 Understanding and summarizing the fitted models 457

21.1 Uncertainty and variability 457

21.2 Superpopulation and finite-population variances 459

21.3 Contrasts and comparisons of multilevel coefficients 462

21.4 Average predictive comparisons 466

21.5 R² and explained variance 473

21.6 Summarizing the amount of partial pooling 477

21.7 Adding a predictor can increase the residual variance! 480

21.8 Multiple comparisons and statistical significance 481

21.9 Bibliographic note 484

22 Analysis of variance 487

22.1 Classical analysis of variance 487

22.2 ANOVA and multilevel linear and generalized linear models 490

22.3 Summarizing multilevel models using ANOVA 492

22.4 Doing ANOVA using multilevel models 494

22.5 Adding predictors: analysis of covariance and contrast analysis 496

22.6 Modeling the variance parameters: a split-plot latin square 498

22.7 Bibliographic note 501

22.8 Exercises 501

23 Causal inference using multilevel models 503

23.1 Multilevel aspects of data collection 503

23.2 Estimating treatment effects in a multilevel observational study 506

23.3 Treatments applied at different levels 507

23.4 Instrumental variables and multilevel modeling 509

23.5 Bibliographic note 512

23.6 Exercises 512

24 Model checking and comparison 513

24.1 Principles of predictive checking 513

24.2 Example: a behavioral learning experiment 515

24.3 Model comparison and deviance 524

24.4 Bibliographic note 526

24.5 Exercises 527

25 Missing-data imputation 529

25.1 Missing-data mechanisms 530

25.2 Missing-data methods that discard data 531

25.3 Simple missing-data approaches that retain all the data 532

25.4 Random imputation of a single variable 533

25.5 Imputation of several missing variables 539

25.6 Model-based imputation 540

25.7 Combining inferences from multiple imputations 542

25.8 Bibliographic note 542

25.9 Exercises 543

Appendixes 545

A Six quick tips to improve your regression modeling 547

A.1 Fit many models 547

A.2 Do a little work to make your computations faster and more reliable 547

A.3 Graphing the relevant and not the irrelevant 548

A.4 Transformations 548

A.5 Consider all coefficients as potentially varying 549

A.6 Estimate causal inferences in a targeted way, not as a byproduct of a large regression 549

B Statistical graphics for research and presentation 551

B.1 Reformulating a graph by focusing on comparisons 552

B.2 Scatterplots 553

B.4 Bibliographic note 562

B.5 Exercises 563

C Software 565

C.1 Getting started with R, Bugs, and a text editor 565

C.2 Fitting classical and multilevel regressions in R 565

C.3 Fitting models in Bugs and R 567

C.4 Fitting multilevel models using R, Stata, SAS, and other software 568

C.5 Bibliographic note 573

References 575

Author index 601

List of examples

Home radon 3, 36, 252, 279, 479

Forecasting elections 3, 144

State-level opinions from national polls 4, 301, 493

Police stops by ethnic group 5, 21, 112, 325

Public opinion on the death penalty 19

Testing for election fraud 23

Sex ratio of births 27, 137

Mothers’ education and children’s test scores 31, 55

Height and weight 41, 75

Beauty and teaching evaluations 51, 277

Height and earnings 53, 59, 140, 288

Handedness 66

Yields of mesquite bushes 70

Political party identification over time 73

Income and voting 79, 107

Arsenic in drinking water 86, 128, 193

Death-sentencing appeals process 116, 320, 540

Ordered logistic model for storable votes 120, 331

Cockroaches in apartments 126, 161

Behavior of couples at risk for HIV 132, 166

Academy Award voting 133

Incremental cost-effectiveness ratio 152

Unemployment time series 163

The Electric Company TV show 174, 503

Hypothetical study of parenting quality as an intermediate outcome 188

Sesame Street TV show 196

Messy randomized experiment of cow feed 196

Incumbency and congressional elections 197

Value of a statistical life 197

Evaluating the Infant Health and Development Program 201, 506

Ideology of congressmembers 213

Hypothetical randomized-encouragement study 216

Child support enforcement 237

Adolescent smoking 241

Rodents in apartments 248

Olympic judging 248

Time series of children’s CD4 counts 249, 277, 449

Flight simulator experiment 289, 464, 488

Latin square agricultural experiment 292, 497

Income and voting by state 310

Item-response models 314

Ideal-point modeling for the Supreme Court 317

Speed dating 322

Social networks 332

Regression with censored data 402

Educational testing experiments 430

Zinc for HIV-positive children 439

Cluster sampling of New York City residents 448

Value added of school teachers 458

Advanced Placement scores and college grades 463

Prison sentences 470

Magnetic fields and brain functioning 481

Analysis of variance for web connect times 492

Split-plot latin square 498

Educational-subsidy program in Mexican villages 508

Checking models of behavioral learning in dogs 515

Missing data in the Social Indicators Survey 529

Preface

Aim of this book

This book originated as lecture notes for a course in regression and multilevel modeling, offered by the statistics department at Columbia University and attended by graduate students and postdoctoral researchers in social sciences (political science, economics, psychology, education, business, social work, and public health) and statistics. The prerequisite is statistics up to and including an introduction to multiple regression.

Advanced mathematics is not assumed—it is important to understand the linear model in regression, but it is not necessary to follow the matrix algebra in the derivation of least squares computations. It is useful to be familiar with exponents and logarithms, especially when working with generalized linear models.

After completing Part 1 of this book, you should be able to fit classical linear and generalized linear regression models—and do more with these models than simply look at their coefficients and their statistical significance. Applied goals include causal inference, prediction, comparison, and data description. After completing Part 2, you should be able to fit regression models for multilevel data. Part 3 takes you from data collection, through model understanding (looking at a table of estimated coefficients is usually not enough), to model checking and missing data. The appendixes include some reference materials on key tips, statistical graphics, and software for model fitting.

What you should be able to do after reading this book and working through the examples

This text is structured through models and examples, with the intention that after each chapter you should have certain skills in fitting, understanding, and displaying models:

• Part 1A: Fit, understand, and graph classical regressions and generalized linear models.

Chapter 3: Fit linear regressions and be able to interpret and display estimated coefficients.

Chapter 4: Build linear regression models by transforming and combining variables.

Chapter 5: Fit, understand, and display logistic regression models for binary data.

Chapter 6: Fit, understand, and display generalized linear models, including Poisson regression with overdispersion and ordered logit and probit models.

• Part 1B: Use regression to learn about quantities of substantive interest (not just regression coefficients).

Chapter 7: Simulate probability models and uncertainty about inferences and predictions.

Chapter 8: Check model fits using fake-data simulation and predictive simulation.

Chapter 9: Understand assumptions underlying causal inference. Set up regressions for causal inference and understand the challenges that arise.

Chapter 10: Understand the assumptions underlying propensity score matching, instrumental variables, and other techniques to perform causal inference when simple regression is not enough. Be able to use these when appropriate.

• Part 2A: Understand and graph multilevel models.

Chapter 11: Understand multilevel data structures and models as generalizations of classical regression.

Chapter 12: Understand and graph simple varying-intercept regressions and interpret as partial-pooling estimates.

Chapter 13: Understand and graph multilevel linear models with varying intercepts and slopes, non-nested structures, and other complications.

Chapter 14: Understand and graph multilevel logistic models.

Chapter 15: Understand and graph multilevel overdispersed Poisson, ordered logit and probit, and other generalized linear models.

• Part 2B: Fit multilevel models using the software packages R and Bugs.

Chapter 16: Fit varying-intercept regressions and understand the basics of Bugs. Check your programming using fake-data simulation.

Chapter 17: Use Bugs to fit various models from Part 2A.

Chapter 18: Understand Bayesian inference as a generalization of least squares and maximum likelihood. Use the Gibbs sampler to fit multilevel models.

Chapter 19: Use redundant parameterizations to speed the convergence of the Gibbs sampler.

• Part 3:

Chapter 20: Perform sample size and power calculations for classical and hierarchical models: standard-error formulas for basic calculations and fake-data simulation for harder problems.

Chapter 21: Calculate and understand contrasts, explained variance, partial pooling coefficients, and other summaries of fitted multilevel models.

Chapter 22: Use the ideas of analysis of variance to summarize fitted multilevel models; use multilevel models to perform analysis of variance.

Chapter 23: Use multilevel models in causal inference.

Chapter 24: Check the fit of models using predictive simulation.

Chapter 25: Use regression to impute missing data in multivariate datasets.

In summary, you should be able to fit, graph, and understand classical and multilevel linear and generalized linear models and to use these model fits to make predictions and inferences about quantities of interest, including causal treatment effects.

Data for the examples and homework assignments and other resources for teaching and learning

The website www.stat.columbia.edu/∼gelman/arm/ contains datasets used in the examples and homework problems of the book, as well as sample computer code. The website also includes some tips for teaching regression and multilevel modeling through class participation rather than lecturing. We plan to update these tips based on feedback from instructors and students; please send your comments and suggestions to gelman@stat.columbia.edu.

Outline of a course

When teaching a course based on this book, we recommend starting with a self-contained review of linear regression, logistic regression, and generalized linear models, focusing not on the mathematics but on understanding these methods and implementing them in a reasonable way. This is also a convenient way to introduce the statistical language R, which we use throughout for modeling, computation, and graphics. One thing that will probably be new to the reader is the use of random simulations to summarize inferences and predictions.

We then introduce multilevel models in the simplest case of nested linear models, fitting in the Bayesian modeling language Bugs and examining the results in R. Key concepts covered at this point are partial pooling, variance components, prior distributions, identifiability, and the interpretation of regression coefficients at different levels of the hierarchy. We follow with non-nested models, multilevel logistic regression, and other multilevel generalized linear models.

Next we detail the steps of fitting models in Bugs and give practical tips for reparameterizing a model to make it converge faster and additional tips on debugging. We also present a brief review of Bayesian inference and computation. Once the student is able to fit multilevel models, we move in the final weeks of the class to the final part of the book, which covers more advanced issues in data collection, model understanding, and model checking.

As we show throughout, multilevel modeling fits into a view of statistics that unifies substantive modeling with accurate data fitting, and graphical methods are crucial both for seeing unanticipated features in the data and for understanding the implications of fitted models.

Acknowledgments

We thank the many students and colleagues who have helped us understand and implement these ideas. Most important have been Jouni Kerman, David Park, and Joe Bafumi for years of suggestions throughout this project, and for many insights into how to present this material to students.

In addition, we thank Hal Stern and Gary King for discussions on the structure of this book; Chuanhai Liu, Xiao-Li Meng, Zaiying Huang, John Boscardin, Jouni Kerman, and Alan Zaslavsky for discussions about statistical computation; Iven Van Mechelen and Hans Berkhof for discussions about model checking; Iain Pardoe for discussions of average predictive effects and other summaries of regression models; Matt Salganik and Wendy McKelvey for suggestions on the presentation of sample size calculations; T. E. Raghunathan, Donald Rubin, Rajeev Dehejia, Michael Sobel, Guido Imbens, Samantha Cook, Ben Hansen, Dylan Small, and Ed Vytlacil for concepts of missing-data modeling and causal inference; Eric Loken for help in understanding identifiability in item-response models; Niall Bolger, Agustin Calatroni, John Carlin, Rafael Guerrero-Preston, Reid Landes, Eduardo Leoni, and Dan Rabinowitz for code in Stata, SAS, and SPSS; Hans Skaug for code in AD Model Builder; Uwe Ligges, Sibylle Sturtz, Douglas Bates, Peter Dalgaard, Martyn Plummer, and Ravi Varadhan for help with multilevel modeling and general advice on R; and the students in Statistics / Political Science 4330 at Columbia for their invaluable feedback throughout.

Collaborators on specific examples mentioned in this book include Phillip Price on the home radon study; Tom Little, David Park, Joe Bafumi, and Noah Kaplan on the models of opinion polls and political ideal points; Jane Waldfogel, Jeanne Brooks-Gunn, and Wen Han for the mothers and children’s intelligence data; Lex van Geen and Alex Pfaff on the arsenic in Bangladesh; Gary King on election forecasting; Jeffrey Fagan and Alex Kiss on the study of police stops; Tian Zheng and Matt Salganik on the social network analysis; John Carlin for the data on mesquite bushes and the adolescent-smoking study; Alessandra Casella and Tom Palfrey for the storable-votes study; Rahul Dodhia for the flight simulator example; Boris Shor, Joe Bafumi, and David Park on the voting and income study; Alan Edelman for the internet connections data; Donald Rubin for the Electric Company and educational-testing examples; Jeanne Brooks-Gunn and Jane Waldfogel for the mother and child IQ scores example and Infant Health and Development Program data; Nabila El-Bassel for the risky behavior data; Lenna Nepomnyaschy for the child support example; Howard Wainer with the Advanced Placement study; Iain Pardoe for the prison-sentencing example; James Liebman, Jeffrey Fagan, Valerie West, and Yves Chretien for the death-penalty study; Marcia Meyers, Julien Teitler, Irv Garfinkel, Marilyn Sinkowicz, and Sandra Garcia with the Social Indicators Study; Wendy McKelvey for the cockroach and rodent examples; Stephen Arpadi for the zinc and HIV study; Eric Verhoogen and Jan von der Goltz for the Progresa data; and Iven van Mechelen, Yuri Goegebeur, and Francis Tuerlincx on the stochastic learning models. These applied projects motivated many of the methodological ideas presented here, for example the display and interpretation of varying-intercept, varying-slope models from the analysis of income and voting (see Section 14.2), the constraints in the model of senators’ ideal points (see Section 14.3), and the difficulties with two-level interactions as revealed by the radon study (see Section 21.7). Much of the work in Section 5.7 and Chapter 21 on summarizing regression models was done in collaboration with Iain Pardoe.

Many errors were found and improvements suggested by Brad Carlin, John Carlin, Samantha Cook, Caroline Rosenthal Gelman, Kosuke Imai, Jonathan Katz, Uwe Ligges, Wendy McKelvey, Jong-Hee Park, Martyn Plummer, Phillip Price, Song Qian, Dylan Small, Elizabeth Stuart, Sibylle Sturtz, and Alex Tabarrok. Brian MacDonald’s copyediting has saved us from much embarrassment, and we also thank Yu-Sung Su for typesetting help, Sarah Ryu for assistance with indexing, and Ed Parsons and his colleagues at Cambridge University Press for their help in putting this book together. We especially thank Bob O’Hara and Gregor Gorjanc for incredibly detailed and useful comments on the nearly completed manuscript.

We also thank the developers of free software, especially R (for statistical computation and graphics) and Bugs (for Bayesian modeling), and also Emacs and LaTeX (used in the writing of this book). We thank Columbia University for its collaborative environment for research and teaching, and the U.S. National Science Foundation for financial support. Above all, we thank our families for their love and support during the writing of this book.

CHAPTER 1

Why?

1.1 What is multilevel regression modeling?

Consider an educational study with data from students in many schools, predicting in each school the students’ grades y on a standardized test given their scores on a pre-test x and other information. A separate regression model can be fit within each school, and the parameters from these schools can themselves be modeled as depending on school characteristics (such as the socioeconomic status of the school’s neighborhood, whether the school is public or private, and so on). The student-level regression and the school-level regression here are the two levels of a multilevel model.

In this example, a multilevel model can be expressed in (at least) three equivalent ways as a student-level regression:

• A model in which the coefficients vary by school (thus, instead of a model such as $y = \alpha + \beta x + \mathrm{error}$, we have $y = \alpha_j + \beta_j x + \mathrm{error}$, where the subscripts j index schools),

• A model with more than one variance component (student-level and school-level variation),

• A regression with many predictors, including an indicator variable for each school in the data.

More generally, we consider a multilevel model to be a regression (a linear or generalized linear model) in which the parameters—the regression coefficients—are given a probability model. This second-level model has parameters of its own—the hyperparameters of the model—which are also estimated from data.

The two key parts of a multilevel model are varying coefficients, and a model for those varying coefficients (which can itself include group-level predictors). Classical regression can sometimes accommodate varying coefficients by using indicator variables. The feature that distinguishes multilevel models from classical regression is in the modeling of the variation between groups.

Models for regression coefficients

To give a preview of our notation, we write the regression equations for two multilevel models. To keep notation simple, we assume just one student-level predictor x (for example, a pre-test score) and one school-level predictor u (for example, average parents’ incomes).

Varying-intercept model. First we write the model in which the regressions have the same slope in each of the schools, and only the intercepts vary. We use the notation i for individual students and j[i] for the school j containing student i:¹

$y_i = \alpha_{j[i]} + \beta x_i + \epsilon_i$, for students $i = 1, \dots, n$
$\alpha_j = a + b u_j + \eta_j$, for schools $j = 1, \dots, J$.   (1.1)

Here, $x_i$ and $u_j$ represent predictors at the student and school levels, respectively, and $\epsilon_i$ and $\eta_j$ are independent error terms at each of the two levels. The model can be written in several other equivalent ways, as we discuss in Section 12.5. The number of “data points” J (here, schools) in the higher-level regression is typically much less than n, the sample size of the lower-level model (for students in this example).

Varying-intercept, varying-slope model. More complicated is the model where intercepts and slopes both can vary by school:

$y_i = \alpha_{j[i]} + \beta_{j[i]} x_i + \epsilon_i$, for students $i = 1, \dots, n$
$\alpha_j = a_0 + b_0 u_j + \eta_{j1}$, for schools $j = 1, \dots, J$
$\beta_j = a_1 + b_1 u_j + \eta_{j2}$, for schools $j = 1, \dots, J$.

Compared to model (1.1), this has twice as many vectors of varying coefficients ($\alpha$, $\beta$), twice as many vectors of second-level coefficients (a, b), and potentially correlated second-level errors $\eta_1$, $\eta_2$. We will be able to handle these complications.

Labels

“Multilevel” or “hierarchical.” Multilevel models are also called hierarchical, for two different reasons: first, from the structure of the data (for example, students clustered within schools); and second, from the model itself, which has its own hierarchy, with the parameters of the within-school regressions at the bottom, controlled by the hyperparameters of the upper-level model.

Later we shall consider non-nested models—for example, individual observations that are nested within states and years. Neither “state” nor “year” is above the other in a hierarchical sense. In this sort of example, we can consider individuals, states, and years to be three different levels without the requirement of a full ordering or hierarchy. More complex structures, such as three-level nesting (for example, students within schools within school districts) are also easy to handle within the general multilevel framework.

Why we avoid the term “random effects.” Multilevel models are often known as random-effects or mixed-effects models. The regression coefficients that are being modeled are called random effects, in the sense that they are considered random outcomes of a process identified with the model that is predicting them. In contrast, fixed effects correspond either to parameters that do not vary (for example, fitting the same regression line for each of the schools) or to parameters that vary but are not modeled themselves (for example, fitting a least squares regression model with various predictors, including indicators for the schools). A mixed-effects model includes both fixed and random effects; for example, in model (1.1), the varying intercepts $\alpha_j$ have a group-level model, but $\beta$ is fixed and does not vary by group.

¹ The model can also be written as $y_{ij} = \alpha_j + \beta x_{ij} + \epsilon_{ij}$, where $y_{ij}$ is the measurement from student i in school j. We prefer using the single sequence i to index all students (and j[i] to label schools) because this fits in better with our multilevel modeling framework with data and models at the individual and group levels. The data are $y_i$ because they can exist without reference to the groupings, and we prefer to include information about the groupings as numerical data—that is, the index variable j[i]—rather than through reordering the data through subscripting. We discuss the structure of the data and models further in Chapter 11.

Fixed effects can be viewed as special cases of random effects, in which the higher-level variance (in model (1.1), this would be $\sigma^2_\alpha$) is set to 0 or ∞. Hence, in our framework, all regression parameters are “random,” and the term “multilevel” is all-encompassing. As we discuss on page 245, we find the terms “fixed,” “random,” and “mixed” effects to be confusing and often misleading, and so we avoid their use.

1.2 Some examples from our own research

Multilevel modeling can be applied to just about any problem. Just to give a feel of the ways it can be used, we give here a few examples from our applied work.

Combining information for local decisions: home radon measurement and remediation

Radon is a carcinogen—a naturally occurring radioactive gas whose decay products are also radioactive—known to cause lung cancer in high concentrations and estimated to cause several thousand lung cancer deaths per year in the United States. The distribution of radon levels in U.S. homes varies greatly, with some houses having dangerously high concentrations. In order to identify the areas with high radon exposures, the Environmental Protection Agency coordinated radon measurements in a random sample of more than 80,000 houses throughout the country.

To simplify the problem somewhat, our goal in analyzing these data was to estimate the distribution of radon levels in each of the approximately 3000 counties in the United States, so that homeowners could make decisions about measuring or remediating the radon in their houses based on the best available knowledge of local conditions. For the purpose of this analysis, the data were structured hierarchically: houses within counties. If we were to analyze multiple measurements within houses, there would be a three-level hierarchy of measurements, houses, and counties.

In performing the analysis, we had an important predictor—the floor on which the measurement was taken, either basement or first floor; radon comes from underground and can enter more easily when a house is built into the ground. We also had an important county-level predictor—a measurement of soil uranium that was available at the county level. We fit a model of the form (1.1), where $y_i$ is the logarithm of the radon measurement in house i, x is the floor of the measurement (that is, 0 for basement and 1 for first floor), and u is the uranium measurement at the county level. The errors $\epsilon_i$ in the first line of (1.1) represent “within-county variation,” which in this case includes measurement error, natural variation in radon levels within a house over time, and variation between houses (beyond what is explained by the floor of measurement). The errors $\eta_j$ in the second line represent variation between counties, beyond what is explained by the county-level uranium predictor.

The hierarchical model allows us to fit a regression model to the individual measurements while accounting for systematic unexplained variation among the 3000 counties. We return to this example in Chapter 12.
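As a rough preview (Chapter 12 works through this example in detail), the radon model could be sketched in lmer() notation as follows. The variable names and the fake data here are placeholders, not the actual radon dataset.

library(lme4)
J <- 100                                  # counties (the actual study has about 3000)
n <- 2000                                 # houses
county <- sample(1:J, n, replace=TRUE)    # county j[i] of house i
u <- rnorm(J)                             # county-level soil uranium measurement
floor <- rbinom(n, 1, 0.5)                # 0 = basement, 1 = first floor
a <- 1.5 + 0.7*u + rnorm(J, 0, 0.3)       # county-level intercepts
log.radon <- a[county] - 0.6*floor + rnorm(n, 0, 0.8)
M.radon <- lmer(log.radon ~ floor + u[county] + (1 | county))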

Modeling correlations: forecasting presidential elections

It is of practical interest to politicians and theoretical interest to political scientists that the outcomes of elections can be forecast with reasonable accuracy given information available months ahead of time. To understand this better, we set up a model to forecast presidential elections. Our predicted outcomes were the Democratic Party’s share of the two-party vote in each state in each of the 11 elections from 1948 through 1988, yielding 511 data points (the analysis excluded states that were won by third parties), and we had various predictors, including the performance of the Democrats in the previous election, measures of state-level and national economic trends, and national opinion polls up to two months before the election.

We set up our forecasting model two months before the 1992 presidential election and used it to make predictions for the 50 states. Predictions obtained using classical regression are reasonable, but when the model is evaluated historically (fitting to all but one election and then using the model to predict that election, then repeating this for the different past elections), the associated predictive intervals turn out to be too narrow: that is, the predictions are not as accurate as claimed by the model. Fewer than 50% of the predictions fall in the 50% predictive intervals, and fewer than 95% are inside the 95% intervals. The problem is that the 511 original data points are structured, and the state-level errors are correlated. It is overly optimistic to say that we have 511 independent data points.

Instead, we model

$y_i = \beta_0 + X_{i1}\beta_1 + X_{i2}\beta_2 + \cdots + X_{ik}\beta_k + \eta_{t[i]} + \delta_{r[i],t[i]} + \epsilon_i$, for $i = 1, \dots, n$,   (1.2)

where t[i] is an indicator for time (election year), r[i] is an indicator for the region of the country (Northeast, Midwest, South, or West), and n = 511 is the number of state-years used to fit the model. For each election year, $\eta_t$ is a nationwide error and the $\delta_{r,t}$’s are four independent regional errors.

The error terms must then be given distributions. As usual, the default is the normal distribution, which for this model we express as

$\eta_t \sim \mathrm{N}(0, \sigma_\eta^2)$, for $t = 1, \dots, 11$
$\delta_{r,t} \sim \mathrm{N}(0, \sigma_\delta^2)$, for $r = 1, \dots, 4$; $t = 1, \dots, 11$
$\epsilon_i \sim \mathrm{N}(0, \sigma_\epsilon^2)$, for $i = 1, \dots, 511$.   (1.3)

In the multilevel model, all the parameters $\beta$, $\sigma_\eta$, $\sigma_\delta$, $\sigma_\epsilon$ are estimated from the data.

We can then make a prediction by simulating the election outcome in the 50 states in the next election year, t = 12:

$y_i = \beta_0 + X_{i1}\beta_1 + X_{i2}\beta_2 + \cdots + X_{ik}\beta_k + \eta_{12} + \delta_{r[i],12} + \epsilon_i$, for $i = n+1, \dots, n+50$.

To define the predictive distribution of these 50 outcomes, we need the point predictors $X_i\beta = \beta_0 + X_{i1}\beta_1 + X_{i2}\beta_2 + \cdots + X_{ik}\beta_k$ and the state-level errors $\epsilon_i$ as before, but we also need a new national error $\eta_{12}$ and four new regional errors $\delta_{r,12}$, which we simulate from the distributions (1.3). The variation from these gives a more realistic statement of prediction uncertainties.
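In R, the predictive simulation just described amounts to a short loop. The following sketch assumes the point predictors and the estimated variance parameters are already in hand; every number below is a placeholder rather than an estimate from the actual election model.

n.sims <- 1000
n.states <- 50
region <- sample(1:4, n.states, replace=TRUE)  # placeholder region indicators r[i]
X.beta <- rnorm(n.states, 0.5, 0.05)           # placeholder point predictors
sigma.eta <- 0.02; sigma.delta <- 0.03; sigma.eps <- 0.04
y.pred <- matrix(NA, n.sims, n.states)
for (s in 1:n.sims) {
  eta.new <- rnorm(1, 0, sigma.eta)            # new national error eta_12
  delta.new <- rnorm(4, 0, sigma.delta)        # four new regional errors delta_{r,12}
  y.pred[s,] <- X.beta + eta.new + delta.new[region] + rnorm(n.states, 0, sigma.eps)
}
apply(y.pred, 2, quantile, c(0.25, 0.75))      # 50% predictive intervals by state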

Small-area estimation: state-level opinions from national polls

In a micro-level version of election forecasting, it is possible to predict the political opinions of individual voters given demographic information and where they live. Here the data sources are opinion polls rather than elections.

For example, we analyzed the data from seven CBS News polls from the 10 days immediately preceding the 1988 U.S. presidential election. For each survey respondent i, we label $y_i = 1$ if he or she preferred George Bush (the Republican candidate), 0 if he or she preferred Michael Dukakis (the Democrat). We excluded respondents who preferred others or had no opinion, leaving a sample size n of about 6000. We then fit the model,

$\Pr(y_i = 1) = \mathrm{logit}^{-1}(X_i\beta)$,

where X included 85 predictors:

• A constant term

• An indicator for “female”

• An indicator for “black”

• An indicator for “female and black”

• 4 indicators for age categories (18–29, 30–44, 45–64, and 65+)

• 4 indicators for education categories (less than high school, high school, some college, college graduate)

• 16 indicators for age × education

• 51 indicators for states (including the District of Columbia)

• 5 indicators for regions (Northeast, Midwest, South, West, and D.C.)

• The Republican share of the vote for president in the state in the previous election.

In classical regression, it would be unwise to fit this many predictors because the estimates will be unreliable, especially for small states. In addition, it would be necessary to leave predictors out of each batch of indicators (the 4 age categories, the 4 education categories, the 16 age × education interactions, the 51 states, and the 5 regions) to avoid collinearity.

With a multilevel model, the coefficients for each batch of indicators are fit to a probability distribution, and it is possible to include all the predictors in the model. We return to this example in Section 14.1.
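A scaled-down sketch of this kind of multilevel logistic regression, on fake data and with only a few of the 85 predictors, might look as follows; the full model is fit in Bugs in Section 14.1, and glmer() is lme4's generalized-linear analogue of lmer(). All names and coefficient values here are invented.

library(lme4)
n <- 2000
n.states <- 51
state <- sample(1:n.states, n, replace=TRUE)
female <- rbinom(n, 1, 0.5)
black <- rbinom(n, 1, 0.12)
a.state <- rnorm(n.states, 0, 0.4)        # state-varying intercepts
p <- plogis(0.2 - 0.1*female - 1.6*black + a.state[state])  # plogis() is logit^{-1}
bush <- rbinom(n, 1, p)
M.poll <- glmer(bush ~ female + black + (1 | state), family=binomial)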

Social science modeling: police stops by ethnic group with variation across precincts

There have been complaints in New York City and elsewhere that the police harass members of ethnic minority groups. In 1999 the New York State Attorney General’s Office instigated a study of the New York City police department’s “stop and frisk” policy: the lawful practice of “temporarily detaining, questioning, and, at times, searching civilians on the street.” The police have a policy of keeping records on every stop and frisk, and this information was collated for all stops (about 175,000 in total) over a 15-month period in 1998–1999. We analyzed these data to see to what extent different ethnic groups were stopped by the police. We focused on blacks (African Americans), hispanics (Latinos), and whites (European Americans). We excluded others (about 4% of the stops) because of sensitivity to ambiguities in classifications. The ethnic categories were as recorded by the police making the stops.

It was found that blacks and hispanics represented 50% and 33% of the stops, respectively, despite constituting only 26% and 24%, respectively, of the population of the city. An arguably more relevant baseline comparison, however, is to the number of crimes committed by members of each ethnic group. Data on actual crimes are not available, of course, so as a proxy we used the number of arrests within New York City in 1997 as recorded by the Division of Criminal Justice Services (DCJS) of New York State. We used these numbers to represent the frequency of crimes that the police might suspect were committed by members of each group. When compared in that way, the ratio of stops to previous DCJS arrests was 1.24 for whites, 1.53 for blacks, and 1.72 for hispanics—the minority groups still appeared to be stopped disproportionately often.

These ratios are suspect too, however, because they average over the whole city. Suppose the police make more stops in high-crime areas but treat the different ethnic groups equally within any locality. Then the citywide ratios could show strong differences between ethnic groups even if stops are entirely determined by location rather than ethnicity. In order to separate these two kinds of predictors, we performed a multilevel analysis using the city’s 75 precincts. For each ethnic group e = 1, 2, 3 and precinct p = 1, . . . , 75, we model the number of stops $y_{ep}$ using an overdispersed Poisson regression. The exponentiated coefficients from this model represent relative rates of stops compared to arrests for the different ethnic groups, after controlling for precinct. We return to this example in Section 15.1.
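One way to sketch such a model in R is with glmer(), treating log arrests as an offset and using an observation-level varying intercept as a device for overdispersion; this is only an approximation to the Bugs model of Section 15.1, and the data below are simulated, not the study's.

library(lme4)
d <- expand.grid(eth=factor(1:3), precinct=factor(1:75))
d$arrests <- rpois(nrow(d), 200) + 1              # arrests baseline (avoid zeros)
d$obs <- factor(1:nrow(d))                        # one level per cell, for overdispersion
eth.eff <- c(0.2, 0.5, 0.6)[d$eth]                # invented ethnic-group effects
prec.eff <- rnorm(75, 0, 0.4)[d$precinct]         # invented precinct effects
d$stops <- rpois(nrow(d), exp(log(d$arrests) + eth.eff + prec.eff))
M.stops <- glmer(stops ~ eth + (1 | precinct) + (1 | obs),
                 family=poisson, offset=log(arrests), data=d)
exp(fixef(M.stops))                               # relative stop rates, controlling for precinct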

1.3 Motivations for multilevel modeling

Multilevel models can be used for a variety of inferential goals including causal inference, prediction, and descriptive modeling.

Learning about treatment effects that vary

One of the basic goals of regression analysis is estimating treatment effects—how does y change when some x is varied, with all other inputs held constant? In many applications, it is not an overall effect of x that is of interest, but how this effect varies in the population. In classical statistics we can study this variation using interactions: for example, a particular educational innovation may be more effective for girls than for boys, or more effective for students who expressed more interest in school in a pre-test measurement.

Multilevel models also allow us to study effects that vary by group, for example an intervention that is more effective in some schools than others (perhaps because of unmeasured school-level factors such as teacher morale). In classical regression, estimates of varying effects can be noisy, especially when there are few observations per group; multilevel modeling allows us to estimate these interactions to the extent supported by the data.

Using all the data to perform inferences for groups with small sample size

A related problem arises when we are trying to estimate some group-level quantity, perhaps a local treatment effect or maybe simply a group-level average (as in the small-area estimation example on page 4). Classical estimation just using the local information can be essentially useless if the sample size is small in the group. At the other extreme, a classical regression ignoring group indicators can be misleading in ignoring group-level variation. Multilevel modeling allows the estimation of group averages and group-level effects, compromising between the overly noisy within-group estimate and the oversimplified regression estimate that ignores group indicators.

Prediction

Regression models are commonly used for predicting outcomes for new cases. But what if the data vary by group? Then we can make predictions for new units in existing groups or in new groups. The latter is difficult to do in classical regression: if a model ignores group effects, it will tend to understate the error in predictions for new groups. But a classical regression that includes group effects does not have any automatic way of getting predictions for a new group.

A natural attack on the problem is a two-stage regression, first including group indicators and then fitting a regression of estimated group effects on group-level predictors. One can then forecast for a new group, with the group effect predicted from the group-level model, and then the observations predicted from the unit-level model. However, if sample sizes are small in some groups, it can be difficult or even impossible to fit such a two-stage model classically, and fully accounting for the uncertainty at both levels leads directly to a multilevel model.

Analysis of structured data

Some datasets are collected with an inherent multilevel structure, for example, students within schools, patients within hospitals, or data from cluster sampling. Statistical theory—whether sampling-theory or Bayesian—says that inference should include the factors used in the design of data collection. As we shall see, multilevel modeling is a direct way to include indicators for clusters at all levels of a design, without being overwhelmed with the problems of overfitting that arise from applying least squares or maximum likelihood to problems with large numbers of parameters.

More efficient inference for regression parameters

Data often arrive with multilevel structure (students within schools and grades, laboratory assays on plates, elections in districts within states, and so forth). Even simple cross-sectional data (for example, a random sample survey of 1000 Americans) can typically be placed within a larger multilevel context (for example, an annual series of such surveys). The traditional alternatives to multilevel modeling are complete pooling, in which differences between groups are ignored, and no pooling, in which data from different sources are analyzed separately. As we shall discuss in detail throughout the book, both these approaches have problems: no pooling ignores information and can give unacceptably variable inferences, and complete pooling suppresses variation that can be important or even the main goal of a study. The extreme alternatives can in fact be useful as preliminary estimates, but ultimately we prefer the partial pooling that comes out of a multilevel analysis.
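The contrast between the three approaches can be seen in a few lines of R; the grouped data here are simulated purely for illustration.

library(lme4)
J <- 10
n <- 100
g <- sample(1:J, n, replace=TRUE)            # group membership
y <- rnorm(J, 5, 2)[g] + rnorm(n, 0, 3)      # group means plus individual noise
M.pooled <- lm(y ~ 1)                        # complete pooling: one overall mean
M.nopool <- lm(y ~ factor(g) - 1)            # no pooling: a separate mean per group
M.partial <- lmer(y ~ 1 + (1 | g))           # partial pooling
coef(M.partial)$g                            # group estimates, shrunk toward the overall mean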

Including predictors at two different levels

In the radon example described in Section 1.2, we have outcome measurements at the individual level and predictors at the individual and county levels. How can this information be put together? One possibility is simply to run a classical regression with predictors at both levels. But this does not correct for differences between counties beyond what is included in the predictors. Another approach would be to augment this model with indicators (dummy variables) for the counties. But in a classical regression it is not possible to include county-level indicators along with county-level predictors—the predictors would become collinear (see the end of Section 4.5 for a discussion of collinearity and nonidentifiability in this context).

Another approach is to fit the model with county indicators but without the county-level predictors, and then to fit a second model. This is possible but limited because it relies on the classical regression estimates of the coefficients for those county-level indicators—and if the data are sparse within counties, these estimates won’t be very good. Another possibility in the classical framework would be to fit separate models in each group, but this is not possible unless the sample size is large in each group. The multilevel model provides a coherent model that simultaneously incorporates both individual- and group-level models.

Getting the right standard error: accurately accounting for uncertainty in prediction and estimation

Another motivation for multilevel modeling is for predictions, for example, when forecasting state-by-state outcomes of U.S. presidential elections, as described in Section 1.2. To get an accurate measure of predictive uncertainty, one must account for correlation of the outcome between states in a given election year. Multilevel modeling is a convenient way to do this.

For certain kinds of predictions, multilevel models are essential. For example, consider a model of test scores for students within schools. In classical regression, school-level variability might be modeled by including an indicator variable for each school. In this framework though, it is impossible to make a prediction for a new student in a new school, because there would not be an indicator for this new school in the model. This prediction problem is handled seamlessly using multilevel models.
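To make the last point concrete, here is one way such a prediction can be simulated in R after fitting a varying-intercept model: draw intercepts for the new school from their estimated distribution, then draw student outcomes. Everything below (data, names, the choice of lmer()) is an illustrative sketch, not the book's example.

library(lme4)
J <- 15
n <- 300
school <- sample(1:J, n, replace=TRUE)
x <- rnorm(n)                                         # pre-test score
y <- rnorm(J, 50, 5)[school] + 3*x + rnorm(n, 0, 8)   # fake test scores
M <- lmer(y ~ x + (1 | school))
mu.a <- fixef(M)["(Intercept)"]                       # mean of school intercepts
sigma.a <- attr(VarCorr(M)$school, "stddev")          # s.d. of school intercepts
sigma.y <- sigma(M)                                   # residual s.d.
a.new <- rnorm(1000, mu.a, sigma.a)                   # simulate a new school's intercept
y.new <- rnorm(1000, a.new + fixef(M)["x"] * 1, sigma.y)  # a new student with x = 1
quantile(y.new, c(0.25, 0.75))                        # 50% predictive interval

This simple sketch ignores uncertainty in the estimated parameters themselves; Chapter 7 and Part 2 show how to propagate that uncertainty as well.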

1.4 Distinctive features of this book

The topics and methods covered in this book overlap with many other textbooks on regression, multilevel modeling, and applied statistics. We differ from most other books in these areas in the following ways:

• We present methods and software that allow the reader to fit complicated, linear or nonlinear, nested or non-nested models. We emphasize the use of the statistical software packages R and Bugs and provide code for many examples as well as methods such as redundant parameterization that speed computation and lead to new modeling ideas.

• We include a wide range of examples, almost all from our own applied research. The statistical methods are thus motivated in the best way, as successful practical tools.

• Most books define regression in terms of matrix operations. We avoid much of this matrix algebra for the simple reason that it is now done automatically by computers. We are more interested in understanding the “forward,” or predictive, matrix multiplication $X\beta$ than the more complicated inferential formula $(X^t X)^{-1} X^t y$. The latter computation and its generalizations are important but can be done out of sight of the user. For details of the underlying matrix algebra, we refer readers to the regression textbooks listed in Section 3.8.

• We try as much as possible to display regression results graphically rather than through tables. Here we apply ideas such as those presented in the books by Ramsey and Schafer (2001) for classical regression and Kreft and De Leeuw (1998) for multilevel models. We consider graphical display of model estimates to be not just a useful teaching method but also a necessary tool in applied research.

Statistical texts commonly recommend graphical displays for model diagnostics. These can be very useful, and we refer readers to texts such as Cook and Weisberg (1999) for more on this topic—but here we are emphasizing graphical displays of the fitted models themselves. It is our experience that, even when a model fits data well, we have difficulty understanding it if all we do is look at tables of regression coefficients.

• We consider multilevel modeling as generally applicable to structured data, not limited to clustered data, panel data, or nested designs. For example, in a random-digit-dialed survey of the United States, one can, and should, use multilevel models if one is interested in estimating differences among states or demographic subgroups—even if no multilevel structure is in the survey design.

Ultimately, you have to learn these methods by doing it yourself, and this chapter is intended to make things easier by recounting stories about how we learned this by doing it ourselves. But we warn you ahead of time that we include more of our successes than our failures.

Costs and benefits of our approach

Doing statistics as described in this book is not easy. The difficulties are not mathematical but rather conceptual and computational. For classical regressions and generalized linear models, the actual fitting is easy (as illustrated in Part 1), but programming effort is still required to graph the results relevantly and to simulate predictions and replicated data. When we move to multilevel modeling, the fitting itself gets much more complicated (see Part 2B), and displaying and checking the models require correspondingly more work. Our emphasis on R and Bugs means that an initial effort is required simply to learn and use the software. Also, compared to usual treatments of multilevel models, we describe a wider variety of modeling options for the researcher so that more decisions will need to be made.

A simpler alternative is to use classical regression and generalized linear modeling where possible—this can be done in R or, essentially equivalently, in Stata, SAS, SPSS, and various other software—and then, when multilevel modeling is really needed, to use functions that adapt classical regression to handle simple multilevel models. Such functions, which can be run with only a little more effort than simple regression fitting, exist in many standard statistical packages.

Compared to these easier-to-use programs, our approach has several advantages:

• We can fit a greater variety of models. The modular structure of Bugs allows us to add complexity where needed to fit data and study patterns of interest.

• By working with simulations (rather than simply point estimates of parameters), we can directly capture inferential uncertainty and propagate it into predictions (as discussed in Chapter 7 and applied throughout the book). We can directly obtain inference for quantities other than regression coefficients and variance parameters.

• R gives us the flexibility to display inferences and data in whatever form best suits the problem.

We recognize, however, that other software and approaches may be useful too, either as starting points or to check results. Section C.4 describes briefly how to fit multilevel models in several other popular statistical software packages.

1.5 Computing

We perform computer analyses using the freely available software R and Bugs. Appendix C gives instructions on obtaining and using these programs. Here we outline how these programs fit into our overall strategy for data analysis.


Our general approach to statistical computing

In any statistical analysis, we like to be able to directly manipulate the data, model, and inferences. We just about never know the right thing to do ahead of time, so we have to spend much of our effort examining and cleaning the data, fitting many different models, summarizing the inferences from the models in different ways, and then going back and figuring out how to expand the model to allow new data to be included in the analysis.

It is important, then, to be able to select subsets of the data, to graph whatever aspect of the data might be of interest, and to be able to compute numerical summaries and fit simple models easily. All this can be done within R—you will have to put some initial effort into learning the language, but it will pay off later.

You will almost always need to try many different models for any problem: not just different subsets of predictor variables as in linear regression, and not just minor changes such as fitting a logit or probit model, but entirely different formulations of the model—different ways of relating observed inputs to outcomes. This is especially true when using new and unfamiliar tools such as multilevel models. In Bugs, we can easily alter the internal structure of the models we are fitting, in a way that cannot easily be done with other statistical software.

Finally, our analyses are almost never simply summarized by a set of parameter estimates and standard errors. As we illustrate throughout, we need to look carefully at our inferences to see if they make sense and to understand the operation of the model, and we usually need to postprocess the parameter estimates to get predictions or generalizations to new settings. These inference manipulations are similar to data manipulations, and we do them in R to have maximum flexibility.

Model fitting in Part 1

Part 1 of this book uses the R software for three general tasks: (1) fitting classical linear and generalized linear models, (2) graphing data and estimated models, and (3) using simulation to propagate uncertainty in inferences and predictions (see Sections 7.1–7.2 for more on this).
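
As a small taste of these three tasks together, consider the following sketch (invented simulated data; the simulation step uses mvrnorm from the MASS package rather than any function specific to this book, and a fuller treatment that also propagates uncertainty in the residual standard deviation appears in Chapter 7):

  # (0) simulated data for illustration
  n <- 100
  x <- rnorm(n)
  y <- 1 + 2*x + rnorm(n)

  # (1) fit a classical linear model
  fit <- lm(y ~ x)

  # (2) graph the data and the estimated model
  plot(x, y)
  abline(fit)

  # (3) propagate inferential uncertainty by simulation: draw coefficient
  # vectors from their approximate sampling distribution and push them
  # through to the predicted mean at x = 2
  library(MASS)                                  # for mvrnorm()
  beta.sims <- mvrnorm(1000, coef(fit), vcov(fit))
  pred.sims <- beta.sims %*% c(1, 2)
  quantile(pred.sims, c(.025, .975))             # simulation-based 95% interval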

Model fitting in Parts 2 and 3

When we move to multilevel modeling, we begin by fitting directly in R; however, for more complicated models we move to Bugs, which has a general language for writing statistical models. We call Bugs from R and continue to use R for preprocessing of data, graphical display of data and inferences, and simulation-based prediction and model checking.

R and S

Our favorite all-around statistics software is R, which is a free open-source version of S, a program developed in the 1970s and 1980s at Bell Laboratories. S is also available commercially as S-Plus. We shall refer to R throughout, but other versions of S generally do the same things.

R is excellent for graphics, classical statistical modeling (most relevant here are the lm() and glm() functions for linear and generalized linear models), and various nonparametric methods. As we discuss in Part 2, the lmer() function provides quick fits in R for many multilevel models. Other packages such as MCMCpack exist to fit specific classes of models in R, and other such programs are in development.
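
For concreteness, the basic calls look like this (a sketch on invented data; the variable names are hypothetical, and lmer() is provided by the lme4 package):

  # invented data: 500 measurements in 20 groups ("counties")
  d <- data.frame(x = rnorm(500),
                  county = factor(sample(1:20, 500, replace = TRUE)))
  d$y   <- 1 + 2*d$x + rnorm(20)[as.numeric(d$county)] + rnorm(500)
  d$y01 <- as.numeric(d$y > 1)

  fit.1 <- lm(y ~ x, data = d)              # classical linear regression
  fit.2 <- glm(y01 ~ x, family = binomial(link = "logit"),
               data = d)                    # logistic regression
  library(lme4)                             # provides lmer()
  fit.3 <- lmer(y ~ x + (1 | county),
                data = d)                   # varying-intercept multilevel model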

Beyond the specific models that can be fit by these packages, R is fully programmable and can thus fit any model, if enough programming is done. It is possible to link R to Fortran or C to write faster programs. R can also choke on large datasets (which is one reason we automatically “thin” large Bugs outputs before reading them into R; see Section 16.9).

Bugs

Bugs (an acronym for Bayesian Inference using Gibbs Sampling) is a program developed by statisticians at the Medical Research Council in Cambridge, England. As of this writing, the most powerful versions available are WinBugs 1.4 and OpenBugs. In this book, when we say “Bugs,” we are referring to WinBugs 1.4; however, the code should also work (perhaps with some modification) under OpenBugs or future implementations.

The Bugs modeling language has a modular form that allows the user to put together all sorts of Bayesian models, including most of the multilevel models currently fit in social science applications. The two volumes of online examples in Bugs give some indication of the possibilities—in fact, it is common practice to write a Bugs script by starting with an example with similar features and then altering it step by step to fit the particular problem at hand.

The key advantage of Bugs is its generality in setting up models; its main disadvantage is that it is slow and can get stuck with large datasets. These problems can be somewhat reduced in practice by randomly sampling from the full data to create a smaller dataset for preliminary modeling and debugging, saving the full data until you are clear on what model you want to fit. (This is simply a computational trick and should not be confused with cross-validation, a statistical method in which a procedure is applied to a subset of the data and then checked using the rest of the data.) Bugs does not always use the most efficient simulation algorithms, and currently its most powerful version runs only in Windows, which in practice reduces the ability to implement long computations in time-share with other processes.
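
For example, this preliminary-modeling trick is a one-line operation in R (d is a hypothetical data frame holding the full dataset):

  # debug the model on a random subset, then refit on the full data
  d.small <- d[sample(nrow(d), 1000), ]   # assumes d has at least 1000 rows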

When fitting complicated models, we set up the data in R, fit models in Bugs, then go back to R for further statistical analysis using the fitted models.
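
A sketch of this three-step workflow, using the bugs() function in the R2WinBUGS package (the file name radon.bug and all variable names are hypothetical; the Bugs model itself lives in that separate text file):

  library(R2WinBUGS)   # provides bugs(), which calls WinBugs from R

  # step 1: set up the data in R (hypothetical radon-style setup; assumes
  # n, J, y, x, and county already exist in the R workspace)
  radon.data <- list("n", "J", "y", "x", "county")
  radon.inits <- function() {
    list(a = rnorm(J), b = rnorm(1), mu.a = rnorm(1),
         sigma.y = runif(1), sigma.a = runif(1))
  }
  radon.parameters <- c("a", "b", "mu.a", "sigma.y", "sigma.a")

  # step 2: fit the model in Bugs ("radon.bug" is a hypothetical file
  # containing the Bugs model code)
  radon.fit <- bugs(radon.data, radon.inits, radon.parameters,
                    "radon.bug", n.chains = 3, n.iter = 500)

  # step 3: back to R for display and further analysis of the simulations
  plot(radon.fit)          # summary of inferences and convergence
  attach.bugs(radon.fit)   # makes the simulations available as R objects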

Some models cannot be fit in Bugs. For these we illustrate in Section 15.3 a new R package under development called Umacs (universal Markov chain sampler). Umacs is less automatic than Bugs and requires more knowledge of the algebra of Bayesian inference.

Other software

Some statistical software has been designed specifically for fitting multilevel models, notably MLWin and HLM. It is also possible to fit some multilevel models in R, Stata, SAS, and other general-purpose statistical software, but without the flexibility of modeling in Bugs. The models allowed by these programs are less general than those available in Bugs; however, they are generally faster and can handle larger datasets. We discuss these packages further in Section C.4.

Data and code for examples

Data and computer code for the examples and exercises in the book can be downloaded at the website www.stat.columbia.edu/∼gelman/arm/, which also includes other supporting materials for this book.


CHAPTER 2

Concepts and methods from basic probability and statistics

Simple methods from introductory statistics have three important roles in regression and multilevel modeling. First, simple probability distributions are the building blocks for elaborate models. Second, multilevel models are generalizations of classical complete-pooling and no-pooling estimates, and so it is important to understand where these classical estimates come from. Third, it is often useful in practice to construct quick confidence intervals and hypothesis tests for small parts of a problem—before fitting an elaborate model, or in understanding the output from such a model.

This chapter provides a quick review of some of these methods.

2.1 Probability distributions

A probability distribution corresponds to an urn with a potentially infinite number of balls inside. When a ball is drawn at random, the “random variable” is what is written on this ball.

Areas of application of probability distributions include:

• Distributions of data (for example, heights of men, heights of women, heights of adults), for which we use the notation yi, i = 1, . . . , n.

• Distributions of parameter values, for which we use the notation θj, j = 1, . . . , J, or other Greek letters such as α, β, γ. We shall see many of these with the multilevel models in Part 2 of the book. For now, consider a regression model (for example, predicting students’ grades from pre-test scores) fit separately in each of several schools. The coefficients of the separate regressions can be modeled as following a distribution, which can be estimated from data (see the sketch following this list).

• Distributions of error terms, which we write as εi, i = 1, . . . , n—or, for group-level errors, ηj, j = 1, . . . , J.
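
A small R sketch of the second bullet’s example, using invented data: fit a separate regression in each school and look at the distribution of the estimated slopes.

  # invented data: grades and pre-test scores in J = 30 schools
  J <- 30; n.per <- 25
  school  <- rep(1:J, each = n.per)
  slope   <- rnorm(J, 0.8, 0.2)        # true slopes vary by school
  pretest <- rnorm(J*n.per)
  grade   <- 50 + slope[school]*pretest + rnorm(J*n.per, 0, 5)

  # separate regression within each school; collect the estimated slopes
  est.slopes <- sapply(1:J, function(j)
    coef(lm(grade[school == j] ~ pretest[school == j]))[2])

  hist(est.slopes)   # the coefficients themselves follow a distribution
  mean(est.slopes); sd(est.slopes)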

A “distribution” is how we describe a set of objects that are not identified, or when the identification gives no information. For example, the heights of a set of unnamed persons have a distribution, as contrasted with the heights of a particular set of your friends.

The basic way that distributions are used in statistical modeling is to start by fitting a distribution to data y, then get predictors X and model y given X with errors ε. Further information in X can change the distribution of the ε’s (typically, by reducing their variance). Distributions are often thought of as data summaries, but in the regression context they are more commonly applied to the ε’s.
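
In symbols, a minimal version of this progression (our own paraphrase in LaTeX notation, not a formula from the text):

  % start by fitting a distribution to the data y:
  y_i \sim \mathrm{N}(\mu, \sigma^2), \qquad i = 1, \ldots, n,
  % then model y given predictors X, with errors \epsilon:
  y_i = X_i\beta + \epsilon_i, \qquad \epsilon_i \sim \mathrm{N}(0, \sigma_\epsilon^2),
  % where typically \sigma_\epsilon < \sigma: the predictors explain
  % part of the variation in y.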

Normal distribution; means and variances

The Central Limit Theorem of probability states that the sum of many small independent random variables will be a random variable with an approximate normal distribution.
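
In symbols (a standard statement of the approximation being invoked, not a quotation from the text):

  % if z_1, ..., z_n are independent, with no single term dominating:
  \sum_{i=1}^{n} z_i \;\approx\; \mathrm{N}\!\Big(\sum_{i=1}^{n} \mu_i,\ \sum_{i=1}^{n} \sigma_i^2\Big),
  % where \mu_i = E(z_i) and \sigma_i^2 = var(z_i).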
