Regression III: Advanced Methods

(1)

Lecture 16: Generalized Additive Models

http://polisci.msu.edu/jacoby/icpsr/regress3

Regression III:

Advanced Methods

Bill Jacoby

Michigan State University

(2)

Goals of the Lecture

• Introduce Additive Models

– Explain how they extend from simple nonparametric regression (i.e., local polynomial regression)

– Discuss estimation using backfitting – Explain how to interpret their results

• Conclude with some examples of Additive Models

applied to real social science data

(3)

Limitations of the Multiple Nonparametric Models

• As we see here, the multiple nonparametric model allows all possible interactions between the independent variables in their effects on Y—we specify a jointly conditional

functional form

• This model is ideal under the following circumstances:

1. There are no more than two predictors

2. The pattern of nonlinearity is complicated and thus

cannot be easily modelled with a simple transformation or polynomial regression

3. The sample size is sufficiently large

• Recall that the general nonparametric model (both the

lowess smooth and the smoothing spline) takes the

following form:

(4)

Limitations of the Multiple Nonparametric Models (2)

• The general nonparametric model becomes impossible to interpret and unstable as we add more explanatory

variables, however

1. For example, in the lowess case, as the number of variables increases, the window span must become wider in order to ensure that each local regression has enough cases This process can create significant bias (the curve becomes too smooth)

2. It is impossible to interpret general nonparametric regression when there are more than two variables—

there are no coefficients, and we cannot graph effects more than three dimensions

• These limitations lead us to the Additive Models

(5)

Additive Regression Models

• Additive regression models essentially apply local

regression to low dimensional projections of the data

• The nonparametric additive regression model is

The f

_i

are arbitrary functions estimated from the data;

the errors ε are assumed to have constant variance and a mean of 0

• Additive models create an estimate of the regression surface by a combination of a collection of one- dimensional functions

• The estimated functions f

_i

are the analogues of the

coefficients in linear regression

(6)

• The assumption that the contribution of each covariate is additive is analogous to the assumption in linear

regression that each component is estimated separately

• Recall that the linear regression model is

Additive Regression Models (2)

where the B

_j

represent linear effects

• For the additive model we model Y as an additive combination of arbitrary functions of the Xs

• The f

_j

represent arbitrary functions that can be estimated

by lowess or smoothing splines

(7)

• Now comes the question: How do we find these arbitrary functions?

• If the X’s were completely independent—which will not be the case—we could simply estimate each functional form using a nonparametric regression of Y on each of the X’s separately

– Similarly in linear regression when the X’s are

completely uncorrelated the partial regression slopes are identical to the marginal regression slopes

• Since the X’s are related, however, we need to proceed in another way, in effect removing the effects of other

predictors—which are unknown before we begin

• We use a procedure called backfitting to find each curve, controlling for the effects of the others

Additive Regression Models (3)

(8)

Estimation and Backfitting

• Suppose that we had a two predictor additive model:

• If we unrealistically knew the partial regression function f

₂

but not f

₁

we could rearrange the equation in order to solve for f

₁

• In other words, smoothing Y

_i

-f

₂

(x

_i2

) against x

_i1

produces an estimate of α+f

₁

(x

_i1

).

• Simply put, knowing one function allows us to find the

other—in the real world, however we don’t know either so

we must proceed initially with estimates

(9)

Estimation and Backfitting (2)

1. We start by expressing the variables in mean deviation form so that the partial regressions sum to zero, thus eliminating the individual intercepts

2. We then take preliminary estimates of each function from a least-squares regression of Y on the X’s

4. We then find the partial residuals for X

₁

, which removes Y from its linear relationship to X

₂

but retains the

relationship between Y and X

₁

3. These estimates are then used as step (0) in an iterative

estimation process

(10)

Estimation and Backfitting (3)

The partial residuals for X

₁

are then

where S is the (n × n) smoother transformation matrix for X

_j

that depends only on the configuration of X

_ij

for the 5. The same procedure in step 4 is done for X

₂

6. Next we smooth these partial residuals against their

respective X’s, providing a new estimate of f

(11)

Estimation and Backfitting (4)

• This process of finding new estimates of the functions by smoothing the partial residuals is reiterated until the partial functions converge

– That is, when the estimates of the smooth functions stabilize from one iteration to the next we stop

• When this process is done, we obtain estimates of s

_j

(X

_ij

) for every value of X

_j

• More importantly, we will have reduced a multiple regression to a series of two-dimensional partial regression problems, making interpretation easy:

– Since each partial regression is only two-dimensional, the functional forms can be plotted on two-dimensional plots showing the partial effects of each X

_j

on Y

– In other words, perspective plots are no longer

necessary unless we include an interaction between two

smoother terms

(12)

Interpreting the Effects

• A plot of of X

_j

versus s

_j

(X

_j

) shows the relationship between X

_j

and Y holding constant the other variables in the model

• Since Y is expressed in mean deviation form, the smooth term s

_j

(X

_j

) is also centered and thus each plot represents how Y changes relative to its mean with changes in X

• Interpreting the scale of the graphs then becomes easy:

– The value of 0 on the Y-axis is the mean of Y

– As the line moves away from 0 in a negative direction we subtract the distance from the mean when

determining the fitted value. For example, if the mean is 45, and for a particular X-value (say x=15) the curve is at s

_j

(X

_j

)=4, this means the fitted value of Y controlling for all other explanatory variables is 45+4=49.

– If there are several nonparametric relationships, we can add together the effects on the two graphs for any

particular observation to find its fitted value of Y

(13)

Additive Regression Models in R:

Example: Canadian prestige data

• Here we use the Canadian Prestige data to fit an additive model to prestige regressed on income and occupation

• In R we use the gam function (for generalized additive models) that is found in mgcv package

– The gam function in mgcv fits only smoothing splines (local polynomial regression can be done in S-PLUS) – The formula takes the same form as the glm function

except now we have the option of having parametric terms and smoothed estimates

– Smooths will be fit to any variable specified with the s(variable) argument

• The simple R-script is as follows:

(14)

Additive Regression Models in R:

Example: Canadian prestige data (2)

• The summary function returns tests for each smooth, the

degrees of freedom for each smooth, and an adjusted R-

square for the model. The deviance can be obtained from

the deviance(model) command

(15)

Additive Regression Models in R:

Example: Canadian prestige data (3)

• Again, as with other nonparametric models, we have no slope parameters to investigate (we do have an

intercept, however)

• A plot of the regression surface is necessary

(16)

Additive Regression Models in R:

Example: Canadian prestige data (4)

Additive Model:

• We can see the

nonlinear relationship for both education and Income with Prestige but there is no

interaction between them—i.e., the slope for income is the same at every value of

education

• We can compare this model to the general nonparametric

regression model

Income 5000

10000 15000

20000 25000

Education 8

10 12

14 Pre

stige 20 40 60 80

(17)

Additive Regression Models in R:

Example: Canadian prestige data (5)

General Nonparametric Model:

• This model is quite similar to the additive model, but there are some nuances—

particularly in the mid- range of income—that are not picked up by the additive model because the X’s do not interact

Income 5000

10000 15000

20000

25000

Education 8

10 12

14 Pre

stige 20 40 60 80

(18)

Additive Regression Models in R:

Example: Canadian prestige data (6)

• Perspective plots can also be made automatically

using the persp.gam

function. These graphs include a 95% confidence region

income

5000

10000 15000

20000

25000

education

8 10 12

14 20

40 60 80

income

5000

10000 15000

20000

25000

education

8 10 12

14 20

40 60 80

income

5000

10000 15000

20000

25000

education

8 10 12

14 20

40 60 80

income

5000

10000 15000

20000

25000

education

8 10 12

14 20

40

60

80

(19)

Additive Regression Models in R:

Example: Canadian prestige data (7)

• Since the slices of the additive regression in the direction of one predictor (holding the other constant) are parallel, we can graph each partial- regression function separately

• This is the benefit of the additive model—we can graph as many plots as there are variables, and allowing us to

easily visualize the relationships

• In other words, a multidimensional regression has been reduced to a series of two-dimensional partial-regression plots

• To get these in R:

(20)

Additive Regression Models in R:

Example: Canadian prestige data (8)

0 5000 10000 15000 20000 25000

-20010

income

s(income,3.12)

6 8 10 12 14 16

-20010

s(education,3.18)

(21)

Additive Regression Models in R:

Example: Canadian prestige data (9)

0 5000 15000 25000

-20-1001020

income

s(income,3.12)

6 8 10 12 14 16

-20-1001020

education

s(education,3.18)

(22)

R-script for previous slide

(23)

Residual Sum of Squares

• As was the case for smoothing splines and lowess

smooths, statistical inference and hypothesis testing is based on the residual sum of squares (or deviance in the case of generalized additive models) and the degrees of freedom

• The RSS for an additive model is easily defined in the usual manner:

• The approximate degrees of freedom, however, need to be

adjusted from the regular nonparametric case, however,

because we are no longer specifying a jointly-conditional

functional form

(24)

Degrees of Freedom

• Recall that for nonparametric regression, the approximate degrees of freedom are equal to the trace of the smoother matrix (the matrix that projects Y onto Y-hat)

• We extend this to the additive model:

1 is subtracted from each df reflecting the constraint that each partial regression function sums to zero (the individual intercept have been removed)

• Parametric terms entered in the model each occupy a single degree of freedom as in the linear regression case

• The individual degrees of freedom are then combined for a single measure:

1 is added to the final degrees of freedom to account

(25)

Testing for Linearity

• I can compare the linear model of prestige regressed on income and education with the additive model by

carrying out an analysis of deviance

• I begin by fitting the linear model using the gam function

• Next I want the residual degrees of freedom from the

additive model

(26)

Testing for Linearity (2)

• Now I simply calculate the difference in the deviance between the two model relative to the difference in degrees of freedom (difference in df=7.3-2=5)

• This gives a Chi-square test for linearity

• The difference between the models is highly statistically

significant—the additive model describe the relationship

between prestige and education and income much better

(27)