ChVIIIAdvancedMRTopics.pdf

(1)

VIII. MR Topics: Multicolinearity, Leverage, Nonlinearity

a.  MR Standard Errors

b.  Multicolinearity

c.  Standardized Residuals and Leverage

d.  Outliers

e.  Influence

f.  Nonlinearity

g.  Dummy Variables

h.  Heteroskedasticity

(2)

a. MR Standard Errors

In Chapter V, we developed an intuition regarding the computation of standard errors in multiple regression.

We decided that the SLR formula could be modified by inserting a measure of the independent variation of the X variable.

Now let’s look at this formula for the general case:

(3)

We estimate this quantity to form the standard error of b_j as follows:

Where:

Is from:

se b

( )

_j

=

s

(4)

Back to Country Returns data:

Let’s do this for the standard error of the coefficient on canada.

The regression equation is

usa = 0.0064 +.478canada + .0007 australia -.0181 germany + .213 france - . 0187 japan

S = 0.02378

Now we need to regress Canada on Australia, Germany, France, Japan

se b

( )

_j = s

SSE =

0.02378

(5)

b. Multicolinearity

In certain situations, the dependence among the X variables can be so strong that it may difficult to estimate the

regression coefficients.

In this situation, we say that the dataset exhibits

multicolinearity.

Key Concept:

(6)

What is true about multicolinearity:

i.  colinear variables can have coefficients with large

standard errors

ii.  colinear variables can have insignificant t’s but very

significant F’s

iii.  getting a larger sample doesn’t necessarily help much

What is not true about multicolinearity :

i.  multicolinearity is not a “disease”

ii.  it is not a violation of the model assumptions

iii.  Least squares and the least squares standard errors are

(7)

Example with simulated data:

Overall F is very

-2 -1 0 1

-2

-1

0

1

x1

(8)

Y is clearly related to each X when considered alone.

-2 -1 0 1

-6 -4 -2 0 2 4 x1 y

-2 -1 0 1

(9)

How Should Multicolinearity Be Detected?:

–  Inter-correlations of X’s

–  Regress X_j on other X’s and get very high R2

What Can Be Done?

–  delete some X’s (give up on those partial effects)

–  combine the highly correlated indep variables into indices

(10)

What Should Not Be Done:

- delete X variables just because they are highly inter-correlated!

(11)

c. Standardized Residuals and Leverage

As in the SLR, the basic diagnostic tools we use plots involving the least squares residuals.

Recall that the basic model assumptions are stated in terms of the true regression lines errors (ε) and we use the least squares residuals as proxies for these true errors.

To what extent are e and ε are similar and different?

(12)

Let’s look at some plots to help develop some intuition for this. (blue line is the “true” line)

Y

X

Y

X

(13)

Now for some plots where ε and e are not at all alike. (blue line is the “true” line)

Y Y

(14)

When the X value is unusually far away from the center of the X data, the least squares line will go out of its way to “chase the point”.

(15)

What about more than one regressor? Let’s look at a dangerous situations with 2 variables

X2 unusual X

value

We need a measure of

remoteness relative to the bulk of the X data.

(16)

The basic idea behind the measure used most commonly is to measure the distance from point of means.

The idea is that we measure distance from the point of means and adjust for the

extent of variability along the principle axes of variation in the X data.

X2

X1

(17)

The distance measure motivated by the above diagram can be computed for each observation.

We can’t write down any easy formula for the leverage except in the case of SLR:

This formula makes good sense since it is larger the larger

For SLR : h

_i

=

1

N

+

X

_i

−

X

(

)

2

X

_j

−

X

(

)

2

j₌1

N

(18)

We can relate the variance of the least squares residuals to the h_i.

Var e

( )

_i

=

σ

2

( )

1

−

h

_i

Observations with large leverage (h_i) will have smaller

residuals which is exactly the intuition we developed using the plots on the earlier slides.

We use the leverage measures to standardize the least squares residuals:

r

_i

=

e

i

s 1

( )

−

h

_i

≈

ε

_i

(19)

What is a large hi value?

It turns out that and

Thus, the average h_i value is (k+1)/N

We can use the following useful rule:

i.e. flag observations with h values > 3 x average h c. Standardized Residuals and Leverage

(20)

What is true:

–  observations with large h_i values are remote in X space

–  observations with large h_i values are the potential to

influence the fitted line

What is not true:

–  observations with large h_i values have actually exerted

influence

(21)

d. Outlying Observations

I prefer the term “unusual” observations. How can we find unusual observations?

–  Large h_i values

–  Large r_i values

Observations with large r_i values are often called “outliers”

We should look for an “assignable cause” for outliers

–  error in data entry

(22)

Sometimes outliers are the most interesting observations.

If y is a performance measure, then observations with large positive r_i are super-performers.

Residuals from the Market Model can be used for “event” studies: –  changes in accounting standards

–  regulatory changes

–  opening of markets

–  effect of tender offers on a firm

Don’t routinely delete observations without cause –  this will bias the estimate of σ downward

–  make you overly confident in your predictions and

(23)

(24)

In fact, there is only a couple of really large residuals. Here is the result of residualPlots.

Standardized Residuals F re qu en cy

-4 -2 0 2 4

0

20

40

60

-0.10 -0.05 0.00 0.05 0.10 0.15

-0 .0 6 -0 .0 2 0.02 0.06 Fitted R esi du al s

-2 -1 0 1 2

-0 .0 6 -0 .0 2 0.02 0.06 R esi du al s

Quantiles Under Normality

A-D p-value= 0

0.01 0.02 0.03 0.04 0.05 0.06

(25)

Standardized Residuals F re qu en cy

-4 -2 0 2 4

0

10

30

50

70

-0.10 -0.05 0.00 0.05 0.10 0.15

-0 .0 6 -0 .0 2 0.02 0.06 Fitted R esi du al s

-2 -1 0 1 2

-0 .0 6 -0 .0 2 0.02 0.06 R esi du al s

Quantiles Under Normality

A-D p-value= 0

0.01 0.03 0.05

-4 -3 -2 -1 0 1 2 3 Leverage St an da rd ize d R esi du al s

Zoom in on the bottom right:

Outside the yellow: large std res

Right of green: large leverage One large std res (around -4)

(26)

Histogram of h_i h_i F re qu en cy

0.01 0.02 0.03 0.04 0.05 0.06 0.07

0 20 40 60 80 100

There a number of observations flagged for “influence.” We know that means leverage. Let’s plot the h_i values...

(27)

e. Influence Measures

An observation is influential if deleting it substantially changes the fitted line (least squares coefficients).

A natural measure of influence would be to: i.  delete each observation one by one

(28)

When will observations cause large changes in the fitted line when deleted.

Obviously, large r, or large h in and of themselves are not sufficient as shown in the plots below:

Y

X

Large hi

Y

X

Large r_i

(29)

What about the situation in which both h_i and r_i are large?

Y

X

(30)

Cook’s distance combines both r and h as follows:

Most use a rule that if D > 1, then we should examine further.

BUT It is best to:

i.  plot Cook’s distance values for every observation in the

dataset

ii.  look for “outlying” values of D.

(31)

Example: Rats Liver Data – dataset(rat)

Goal of Study: predict amount of drug absorbed by Rat liver given observable (without the scalpel) variables:

Y = percent dose in liver

(32)

5 10 15 0.0 0.2 0.4 0.6 0.8 Index co oks. di st an ce (lm(Y ~ x1 + x2 + x3 , d at a = ra t))

Now let’s look at the Cook’s distance numbers.

(33)

(34)

f. Nonlinearity

We have emphasized that the standard regression model is a linear conditional mean model.

We can check this assumption by:

a. plot residuals vs. fitted (overall non-linearity)

b. plot residuals vs. specific X vars (is non-linearity related to specific variables?)

Why should I care?

(35)

f. Nonlinearity

In many situations in practice, it is desirable to have some flexibility to specify non-linear regression functions.

The standard linear regression model can be "tricked" into displaying non-linearity by two techniques:

i.  Adding additional independent variables

(36)

f. Nonlinearity

i.  Adding additional independent variables

Trick 1:

Here we add the additional variable X2_{to the regression.}

This allows the regression function to bend in a parabolic or quadratic curve

Remember that:

–  If β₂ is negative, it "sheds water";

(37)

f. Nonlinearity

i.  Adding additional independent variables cont.

With a multiple regression, we might consider adding the squares of two or more variables.

(38)

f. Nonlinearity

Here we add the interaction term which is a new variable that consists of the product of X₁ and X₂.

The effect of the interaction term is to make the derivative of the regression with respect to X(1 or 2) depend on the level of the other variable.

Note here: the first term is what you get from the standard linear model.

Trick 2:

(39)

f. Nonlinearity

What does this look like?

Effect of X1 higher at high X2

(40)

f. Nonlinearity

We can put both ideas together and include squares and interaction terms for all variables.

What should you be careful about?

too many coefficients (over-fitting)

nonlinear terms are dangerous in out-of-sample forecasting.

possible colinearity problems

Y

=

β

₀

+

β

₁

X

₁

+

β

₂

X

₁2

+

β

₃

X

₂

+

β

₄

X

₂2

(41)

f. Nonlinearity

ii.  Take transformations of the dependent variable

The log transformation is the most common single transformation of the dependent variable.

Statisticians perform this by:

i.  taking the log of the dependent variable

ii.  then entering the log of the dependent variable in the

(42)

f. Nonlinearity

To see the effect of the log transform of Y. Let’s first start with a model that is “compatible” with the log transform:

E Y | X⎡⎣ ⎤⎦ = AXβ1

Basic multiplicative model given by:

0.5 1.0 1.5 2.0 2.5

1

2

3

0.5 1.0 1.5 2.0 2.5

1 2 3 4 5 6 β

1= 2

β₁= 1

β₁= 0.5

Negative β1 Positive β1

β₁ = -2

β₁ = -1

(43)

f. Nonlinearity

If we take the logs of both sides of the multiplicative model, we get a model that is linear in the logs.

log E Y| X

(

⎡⎣

⎤⎦

)

=

log(AX

β1

)

=

log(A)

+

β

₁

log(X)

This suggests a simple linear regression model which is linear in the logs:

Trick 3:

log(Y)

=

β

(44)

f. Nonlinearity

One important thing to note is that β₁ now has the interpretation of an elasticity.

The elasticity is defined as the percentage change in Y for a given percentage change in X.

Elasticity:

Some also consider a semi-log model in which only Y is transformed by the log.

(45)

f. Nonlinearity

Example: Mammals data

The dataset “Mammals” contains data on the body and brain weight of 62 different species of mammals.

Is there a relationship between body weight and brain weight?

(46)

f. Nonlinearity

Let’s log both Y and X and “break-up” clustering:

This looks much better!

Pretty Linear

If we model log(brainwgt) as linear in log(bodywgt), we are assuming a multiplicative model of the form:

-4 -2 0 2 4 6 8

(47)

f. Nonlinearity

(48)

g. Dummy Variables

Many variables or factors in the world are fundamentally discrete. Either you are male or female, have MBA or not,…

To accommodate these sorts of qualitative factors in regression models:

–  use dummy, binary or indicator variables to measure

these factors

A dummy variable is a variable that takes on the values of only 0 or 1. Examples:

–  Qualitative effects: (1 if on promotion, 0 if not)

–  Temporal effects : (1 if Holiday season, 0 if not)

(1 if Monday, 0 if not) –  Spatial (1 if in West Coast, 0 if not)

(49)

What does a dummy variable do if introduced into regression equation?

Dummy Variables allow the mean to shift. Let’s look at example from the detergent data. “promoflag” is a dummy variable which flags those weeks where there was a promotion on the item.

(50)

(51)

How do we interpret the coefficient on the promoflag variable?

It’s the expected change in the log of quantity for a one unit

change in promoflag (as always), holding all other indep variables constant!

What is change in log – it’s percentage change.

(52)

More Than Two Levels

We can use dummy variables in situations in which there are

more than two categories. Dummy variables are needed for each category (except a designated “base” category).

Example: scores in different sections of the same class by instructor Y_i = test score

“Instructor” factor takes on three levels: 1  = prof A

2  = prof B

3  = prof C

We introduce two dummy variables: X1 = 1 if B; 0 otherwise

(53)

The data would look like this

Y X1 X2

John 98 1 0

Nancy 87 1 0

Lester 81 0 1

Tom 92 0 1

Lisa 76 0 0

Sue 98 0 0

What sort of regression would we run?

(54)

How do we interpret the β’s?

E Y | X1⎡⎣ = 1⎤⎦ = β₀ + β₁

E Y | X2

⎡⎣

=

1

⎤⎦

=

β

₀

+

β

₂

Where is A? In the intercept:

E Y | X1⎡⎣ = X2= 0⎤⎦ =β₀

Thus, we should interpret the betas as measure differences against a base case which is in the intercept.

β₁ = A-B

(55)

We can generalize this to q different groups simply by putting in q-1 dummy variables for each of q-1 of the groups and “nominating” one group to be the intercept.

This always gives the regression coefficients an interpretation as the difference in means.

Let’s look at an example of this.

Example: Stock Returns and the Weekend Effect

K. R. French (Journal of Financial Economics [1980])

(56)

Overall F is 25.4 compare with F_4,6019 distribution – reject.

(57)

g. Categorical Variables and R

How does R handle variables that are purely categorical. The values of the variable indicate which category the

observation is from and have no ordinal meaning.

Example: store variable in detergent. This is just an

(58)

g. Categorical Variables and R

Put in dummy variables for each store. Do you have to sit there and create dummies? No, R knows what a factor is and creates them automatically.

(59)

Dummy Variables and Handling Nonlinearity

Consider the example of modeling the effect of years of education on Salaries. What’s wrong with just putting in Years of Ed in a regression?

(60)

The true regression or mean function might look like this:

Education Salary

12 16 18

High School

College

MBA

We could approximate this regression function fairly well by

(61)

h. Heteroskedasticity

As we have seen heteroskedasticity is a situation in which the error terms do not all have the same variance.

This is usually detected with the residuals vs. fitted plot.

1. Effects of Heteroskedasticity

On Standard Errors:

Standard error formulas are derived under the assumption of

(62)

0 50 100 150 200 250

10000

20000

30000

40000

time_trend

SAL

ES

Large Var(ε) implies Prediction Interval should

(63)

2.

Correcting for Heteroskedasticity

One popular approach is known as variance stabilizing

transformations.

Taking logs of the dependent variable is one type of

transformation often used with business oriented data to remove heteroskedasticity.

–  use this when the variance increases with the level of

the dependent variable

(64)

2.

Let’s take the log and re-plot

0 50 100 150 200 250

8.0

8.5

9.0

9.5

10.0

10.5

time_trend

lo

g(SAL

(65)

2.

Another popular approach is to correct the standard errors formulae to take into account heteroskedasticity. This will note adjust the prediction intervals for heteroskedasticity.

This is implemented in many econometric regression packages (the most popular being Stata). I have

(66)

(67)

i. More on Log-Transforms

1.  The Many Roles for the Log Transform

We have seen that there are three basic reasons for applying the log transform to the dependent variable:

i.  To accommodate nonlinearity in the regression

relationship

ii.  To reduce right skewness in the error distribution

iii.  To eliminate heteroskedasticity of the form in which the

(68)

2.  Prediction from log-transformed models

One drawback of using a log transform on Y is that the fitted and predicted values will be in the units of logs. We fit and predict log(Y) not Y!

(69)

Solution to prediction problem:

Point Predictor:

Prediction Interval:

Here P_l , P_u are the lower and upper limits from the log transformed model.

ˆY

_f

=

exp(b

₀

+

b

₁

X

_f

)

exp P

( )

_l

, exp P

( )

_u

(70)

ˆY

_f

=

exp(b

₀

+

b

₁

X

_f

)

Note: here we are

(71)

3.  Log Transforms of Independent Variables

Do you have to take the log of the X variables as well as the dependent variable?

NO! In particular, it doesn't ever make sense to take the log of dummy variables no matter what happens to Y! It is perfectly acceptable to enter the log of some X variables and leave others in raw form.

(72)

4.  Interpretation of coefficients in semi-log models

Consider the semi-log model:

How do we interpret the slope coefficient on X? This is the expected change in log Y for a one unit change in X. This means that the coefficient is:

(73)

j. Putting It All Together

A Strategy for Building Regression Models:

–  Select Universe of Possible X Variables

–  Put Them All In!

–  Perform Residual Diagnostics (norm plot, resids vs Yhat, resids

vs specific Xs)

–  Take Remedies As Necessary (for hetero/nonlinearity)

–  Reduce Set of X variables via partial F-testing

(74)

k.“Newfood” Example of Experimental Design

A new instant breakfast product was the subject of a 6 month market test study.

The goal of this test was to determine an optimal

introductory marketing plan for use when the product is introduced on a national scale and to obtain

improved sales forecasts.

This dataset is the outcome of a controlled

experiment in which the values of the independent

variables which affect sales are chosen by the

(75)

k. Design of the Experiment

A controlled store test was designed involving stores in four different cities.

Variables to be considered:

1.  Price

2.  Advertising

3.  Shelf-location

(76)

k. Design of the Experiment- Levels of the Variables

price levels: 24, 29, and 34 cents per oz (why three?)

Advertising levels: high and low (why two?)

All advertising was in the form of television commercials. The high level was used in two of the four cities and was meant to simulate the effect of a $6 million nationwide advertising campaign. The low level was used in the other two cities and was designed to simulate the effect of a $3 million nationwide advertising campaign.

(77)

k. Design of the Experiment- Design Matrix

The next problem confronting the designer is to choose particular combinations of prices,

advertising level, and shelf location to use in the study.

To simplify this discussion, let's consider a design involving only two prices and two levels of

advertising.

Adv\Price LOW HIGH

LOW X

HIGH X

Adv\Price LOW HIGH

LOW X X

HIGH X

Adv\Price LOW HIGH

LOW X X

(78)

k. Design of the Experiment- Solution

The solution adopted by the designers of this study is to field all possible combinations of price, adv, and location = 3 x 2 x 2 = 12 experimental cells. Such a design is called an orthogonal design since

the correlations between each variable are zero. To see this use dummy variable codings for both

price and adv in above designs.

(79)

k. Analysis of Data

The dataset is “newfood” in the class datasets area.

(80)

k. First Cut Regression

Is the price coef large?

What happened to the Adv

Effect?

Do we care about s in

(81)

k. Residual Diagnostics

Why are there only 3 fitted values? Is there

Right-skewed Errors 0 0 50 100 150

140 160 180 200 220 240 260

(82)

k. Why is the Advertising Effect so Small?

The stores in the cities with the “high” advertising treatment where lower total volume (All Commodity Volume) and with lower income in store trading area! Advertising pure effect would be to increase sales but this is confounded with an income or store volume

effect which depresses sales!

Correlations:

sales price adv locat income price -0.658

adv 0.001 0.000

locat -0.001 0.000 -0.000

income 0.163 -0.131 -0.746 -0.000

(83)

(84)

k. Log-Linear Regression

(85)

k. Log-Linear Regression- Residual Diagnostics

(86)

Fini

(87)

Glossary of Symbols

(88)

Important Equations

se b

( )

_j

=

s

SSE

_{j | others}

r

_i

=

e

i

s 1

( )

−

h

_i

≈

ε

_i

σ

~ N 0,1

( )

For SLR : h

_i

=

1

N

+

X

_i

−

X

(

)

2

X

_j

−

X

(

)

2

j=1 N

∑

std error of multiple regression coef

leverage value for Simple LR Model

(89)

Important Equations

Cook’s Distance (D for

“Distance”)

(90)

Glossary of R Commands

•  hatvalues(model): Return the h_i values of a model •  cooks.distances(model): Returns Cooks Distance

for model.

•  Name[-2,]: the second row is deleted from a matrix or

data frame. !

•  lmSumm(lmfit,HAC=TRUE): compute

ChVIIIAdvancedMRTopics.pdf

Now we need to regress Canada on Australia, Germany, France, Japan

What is not true about multicolinearity :

How Should Multicolinearity Be Detected?:

value

For SLR : h

What is a large hi value?

What is not true:

Large hi

i. Adding additional independent variables

Effect of X1 higher at high X2

ii. Take transformations of the dependent variable

Negative β1 Positive β1

log E Y| X

Elasticity:

More Than Two Levels

1. Effects of Heteroskedasticity

Point Predictor:

A Strategy for Building Regression Models:

Adv\Price LOW HIGH

Correlations:

For SLR : h

i.  Adding additional independent variables

ii.  Take transformations of the dependent variable

1. Effects of Heteroskedasticity