Week 5: Multiple Linear Regression

(1)

BUS41100 Applied Regression Analysis

Week 5: Multiple Linear Regression

Parameter estimation and inference, forecasting, diagnostics, dummy variables

Robert B. Gramacy

The University of Chicago Booth School of Business

(2)

Beyond SLR

Many problems involve more than one independent variable or factor which affects the dependent or response variable.

I Multi-factor asset pricing models (beyond CAPM).

I Demand for a product given prices of competing brands, advertising, household attributes, etc.

I More than size to predict house price!

In SLR, the conditional mean of Y depends on X . The multiple linear regression (MLR) model extends this idea to include more than one independent variable.

(3)

Categorical effects/dummy variables

Week 1 already introduced one type of multiple regression:

I Regression for Y onto a categorical X :

E[Y | group = r ] = β0+ βr, for r = 1, . . . , R − 1.

I.e., ANOVA for grouped data.

To represent these qualitative factors in multiple regression, we usedummy,binary, orindicator variables.

(4)

Dummy variables allow the mean (intercept) to shift by taking on the value 0 or 1. Examples:

I temporal effects (1 if Holiday season, 0 if not)

I spatial (1 if in Midwest, 0 if not)

If a factor X takes R possible levels, we can represent X through R − 1 dummy variables

E[Y |X ] = β0+ β11[X =2]+ β21[X =3]+ · · · + βR−11[X =R]

(1[X =r ] = 1 if X = r , 0 if X 6= r .)

(5)

Recall the pick-up truck data example:

> pickup <- read.csv("pickup.csv")

> t1 <- lm(log(price) ~ make, data=pickup)

> summary(t1) ## abbreviated output

Call:

lm(formula = log(price) ~ make, data = pickup) Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 8.71499 0.23109 37.713 <2e-16 ***

makeFord 0.17725 0.31289 0.566 0.574

makeGMC -0.05124 0.27505 -0.186 0.853

(6)

What if you also want to include mileage?

I No problem.

> t2 <- lm(log(price) ~ make + log(miles), data=pickup)

> summary(t2) ## abbreviated output

Call:

lm(formula = log(price) ~ make + log(miles), data=pickup) Coefficients:

(Intercept) 13.83117 1.08608 12.735 5.17e-16 ***

makeFord 0.32419 0.25658 1.264 0.213

makeGMC 0.13265 0.22720 0.584 0.562

(7)

The MLR Model

The MLR model is same as always, but with more covariates.

Y |X

1

, . . . , X

d ind

∼ N(β

0

+ β

1

X

1

+ · · · + β

d

X

d

, σ

2

)

Recall the key assumptions of our linear regression model:

(i) The conditional mean of Y is linear in the Xj variables.

(ii) The additive errors (deviations from line)

I are normally distributed I independent from each other

(8)

Our interpretation of regression coefficients can be extended from the simple single covariate regression case:

β

j

=

∂E[Y |X

1

, . . . , X

d

]

∂X

j

I Holding all other variables constant, βj is the average change in Y per unit change in Xj.

—

(9)

If d = 2, we can plot the regression surface in 3D. Consider sales of a product as predicted by price of this product (P1) and the price of a competing product (P2).

I Everything measured on log-scale, of course.

(10)

The data and least squares

The data in multiple regression is a set of points with values for output Y and for each input variable.

Data: Yi and xi = [X1i, X2i, . . . , Xdi], for i = 1, . . . , n.

Or, as a data array (i.e.,data.frame),

Data =    Y1 X11 X21 . . . Xd 1 .. . Yn X1n X2n . . . Xdn   

(11)

So the model is

Yi = β0+ β1Xi 1+ · · · + βdXid + εi, εi iid

∼ N (0, σ2₎

. I How do we estimate the MLR model parameters?

The principle of least squares is unchanged; define:

I fitted valuesYˆi = b0+ b1X1i + b2X2i + · · · + bdXdi I residualsei = Yi − ˆYi I standard error s = qPn i =1ei2 n−p , where p = d + 1.

Then find the best fitting plane, i.e., coefs b0, b1, b2, . . . , bd, by minimizing the sum of squared residuals, s2_.

(12)

What are the LS coefficient values? Say that Y =    Y1 .. . Yn    ˆ X =    1 X11 X12 . . . X1d .. . 1 Xn1 Xn2 . . . Xnd   .

Then the estimates are [b0· · · bd]0 = b =( ˆX0X)ˆ −1Xˆ0Y.

Same intuition as for SLR:

I b captures the covariance between X_j and Y (Xˆ0Y), I normalized by input sum of squares (Xˆ0Xˆ).

(13)

Obtaining these estimates in R is very easy: > salesdata <- read.csv("sales.csv") > attach(salesdata) > salesMLR <- lm(Sales ~ P1 + P2) > salesMLR Call: lm(formula = Sales ~ P1 + P2) Coefficients: (Intercept) P1 P2 1.003 -1.006 1.098

(14)

Multiple vs simple linear regression

Basic concepts and techniques translate directly from SLR.

I Individual parameter inference and estimation is the same,

conditional on the rest of the model.

I ANOVA is exactly the same, and the F -test still applies. I Our diagnostics and transformations apply directly.

I We still use lm, summary, rstudent, predict, etc.

I We have been plotting HD data since Week 1.

The hardest part would be moving to matrix algebra to translate all of our equations. Luckily, R does all that for you.

(15)

Residual standard error

First off, the calculation for s2 = var(e) is exactly the same:

s

2

=

P

n i =1

(Y

i

− ˆ

Y

i

)

2

n − p

.

I Yˆi = b0+P bjXji and p = d + 1.

(16)

Residuals in MLR

As in the SLR model, the residuals in multiple regression are purged of any relationship to the independent variables. We decompose Y into the part predicted by X and the part due to idiosyncratic error.

Y = ˆ

Y + e

(17)

Residual diagnostics for MLR

Consider the residuals from the sales data:

● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0.5 1.0 1.5 2.0 −0.03 0.00 0.02 fitted residuals ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● 0.2 0.4 0.6 0.8 −0.03 0.00 0.02 P1 residuals ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0.2 0.6 1.0 −0.03 0.00 0.02 P2 residuals

We use the same residual diagnostics (scatterplots, QQ, etc).

I Plot residuals (raw or student) against ˆY to see overall fit.

(18)

Another great plot for MLR problems is to look atY (true values) againstYˆ (fitted values).

> plot(salesMLR$fitted, Sales,

+ main= "Fitted vs True Response for Sales data",

+ pch=20, col=4, ylab="Y", xlab="Y.hat")

> abline(0, 1, lty=2, col=8)

● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● 0.5 1.0 1.5 2.0 0.5 1.0 1.5 2.0

Fitted vs True Response for Sales data

Y

I If things are working,

these values should form a nice straight line.

(19)

Transforming variables in MLR

The transformation techniques are also the same as in SLR.

I For Y nonlinear in Xj, consider adding Xj2 and other polynomial terms.

I For nonconstant variance, use log(Y ), √Y , or another transformation such as Y /Xj that moves the data to a linear scale.

I Use the log-log model for price/sales data and other multiplicative relationships.

Again, diagnosing the problem and finding a solution involves looking at lots of residual plots (against different Xj’s).

(20)

For example, the sales, P1, and P2 variables were pre-transformed from raw values to a log scale. On the original scale, things don’t look so good:

> expsalesMLR <- lm(exp(Sales) ~ exp(P1) + exp(P2))

●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● 1 2 3 4 5 6 7 −0.2 0.2 0.6 fitted residuals ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 1.2 1.6 2.0 2.4 −0.2 0.2 0.6 exp(P1) residuals ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 1.5 2.0 2.5 3.0 3.5 −0.2 0.2 0.6 exp(P2) residuals

(21)

In particular, the studentized residuals are heavily right skewed. (“studentizing” is the same, but leverage is now distance in d -dim.) > hist(rstudent(expsalesMLR), col=7,

+ xlab="Studentized Residuals", main="")

Studentized Residuals Frequency −2 −1 0 1 2 3 4 0 10 20 30 40 50 I Our log-log transform fixes this problem.

(22)

Inference for coefficients

As before in SLR, the LS linear coefficients are random (different for each sample) and correlated with each other. The LS estimators are unbiased:

E[bj] = βj for j = 0, . . . , d .

In particular, the sampling distribution for b is a multivariate normal, with mean β = [β0· · · βd]0 and covariance matrix Sb.

(23)

Coefficient covariance matrix

var(b) :the p × p covariance matrix for random vector b is

Sb=        var(b0) cov(b0, b1) cov(b1, b0) var(b1) . .. var(bd −1) cov(bd −1, bd) cov(bd, bd −1) var(bd)       

(24)

Calculating the covariance is “easy”: Sb = s2 ˆX0Xˆ −1 > X <- cbind(1, P1, P2) > cov.b <- summary(salesMLR)$sigma^2*solve(t(X)%*%X) > print(cov.b) P1 P2

5.542812e-05 -5.393036e-05 -3.451517e-05

P1 -5.393036e-05 8.807396e-05 9.718699e-06

P2 -3.451517e-05 9.718699e-06 4.127672e-05

> se.b <- sqrt(diag(cov.b)) > se.b

P1 P2

0.007445007 0.009384773 0.006424696

(25)

Standard errors

Conveniently, R’s summarygives you all the standard errors.

> summary(salesMLR) ## abbreviated output Coefficients:

(Intercept) 1.002688 0.007445 134.7 <2e-16 ***

P1 -1.005900 0.009385 -107.2 <2e-16 ***

P2 1.097872 0.006425 170.9 <2e-16 ***

---Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1

Residual standard error: 0.01453 on 97 degrees of freedom

Multiple R-squared: 0.998, Adjusted R-squared: 0.9979

(26)

Inference for individual coefficients

Intervals and t-statistics areexactly the same as in SLR.

I A (1 − α)100% C.I. for β_j isb_j ± t_α/2,n−ps_b_j.

I zbj = (bj − β

0

j)/sbj ∼ tn−p(0, 1) is number of standard

errors between the LS estimate and the null value. Intervals/testing via bj & sbj are one-at-a-time procedures:

I You are evaluating the jth coefficient conditional on the other X ’s being in the model, but regardless of the values you’ve estimated for the other b’s.

(27)

R

2

for multiple regression

Recall that we already view R2 _{as a measure of “fit”:}

R

2

=

SSR

SST

=

P

n i =1

( ˆ

Y

i

− ¯

Y )

2

P

n i =1

(Y

i

− ¯

Y )

2

.

And the correlation interpretation is very similar:

R

2

= corr

2

( ˆ

Y , Y ) = r

_{y y}_ˆ2

.

(28)

Consider our marketing model:

> summary(salesMLR)$r.square [1] 0.9979764

I P1 and P2 explain most of the variability in log volume. Consider the pickup regressions (1: make, 2: make + miles).

> summary(trucklm1)$r.square [1] 0.01806738

> summary(trucklm2)$r.square [1] 0.3643183

(29)

Forecasting in MLR

Prediction follows exactly the same methodology as in SLR. For new data xf = [X1,f · · · Xd ,f]0,

I _E[Y_f|x_f] = ˆYf = b0+ b1X1f + · · · + bdXdf

I var[Yf|xf] = var( ˆYf) + var(ef) = sfit2 + s 2 _{= s}2

pred.

With ˆX our design matrix (slide 9) and ˆxf = [1, X1,f · · · Xd ,f]0

s

_fit2

= s

2

· ˆ

x

0_f

( ˆ

X

0

X)

ˆ

−1

ˆ

x

f

(30)

The syntax in R is also exactly the same as before:

> predict(salesMLR, data.frame(P1=1, P2=1),

+ interval="prediction", level=0.95)

fit lwr upr

1 1.094661 1.064015 1.125306

And we can get sfit using our equation, or from R.

> xf <- matrix(c(1,1,1), ncol=1) > X <- cbind(1, P1, P2) > s <- summary(salesMLR)$sigma > sqrt(drop(t(xf)%*%solve(t(X)%*%X)%*%xf))*s [1] 0.005227347 > predict(salesMLR, data.frame(P1=1, P2=1), + se.fit=TRUE)$se.fit [1] 0.005227347

(31)

Glossary and equations

MLR updates to the LS equations:

I b = ( ˆX0X)ˆ −1Xˆ0Y I var(b) = S_b= s2 ˆX0Xˆ −1 I s_fit2 = s2· ˆx_f0( ˆX0X)ˆ −1ˆxf I R2 _{= SSR/SST = cor}2_{( ˆ}_{Y , Y ) = r}2 ˆ y y