VIII. MR Topics: Multicolinearity, Leverage, Nonlinearity
a. MR Standard Errors
b. Multicolinearity
c. Standardized Residuals and Leverage
d. Outliers
e. Influence
f. Nonlinearity
g. Dummy Variables
h. Heteroskedasticity
a. MR Standard Errors
In Chapter V, we developed an intuition regarding the computation of standard errors in multiple regression.
We decided that the SLR formula could be modified by inserting a measure of the independent variation of the X variable.
Now let’s look at this formula for the general case:
a. MR Standard Errors
We estimate this quantity to form the standard error of bj as follows:
Where:
Is from:
se b
( )
j=
s
a. MR Standard Errors
Back to Country Returns data:
Let’s do this for the standard error of the coefficient on canada.
The regression equation is
usa = 0.0064 +.478canada + .0007 australia -.0181 germany + .213 france - . 0187 japan
S = 0.02378
Now we need to regress Canada on Australia, Germany, France, Japan
se b
( )
j = sSSE =
0.02378
b. Multicolinearity
In certain situations, the dependence among the X variables can be so strong that it may difficult to estimate the
regression coefficients.
In this situation, we say that the dataset exhibits
multicolinearity.
Key Concept:
b. Multicolinearity
What is true about multicolinearity:
i. colinear variables can have coefficients with large
standard errors
ii. colinear variables can have insignificant t’s but very
significant F’s
iii. getting a larger sample doesn’t necessarily help much
What is not true about multicolinearity :
i. multicolinearity is not a “disease”
ii. it is not a violation of the model assumptions
iii. Least squares and the least squares standard errors are
b. Multicolinearity
Example with simulated data:
Overall F is very
-2 -1 0 1
-2
-1
0
1
x1
b. Multicolinearity
Y is clearly related to each X when considered alone.
-2 -1 0 1
-6 -4 -2 0 2 4 x1 y
-2 -1 0 1
b. Multicolinearity
How Should Multicolinearity Be Detected?:
– Inter-correlations of X’s
– Regress Xj on other X’s and get very high R2
What Can Be Done?
– delete some X’s (give up on those partial effects)
– combine the highly correlated indep variables into indices
b. Multicolinearity
What Should Not Be Done:
- delete X variables just because they are highly inter-correlated!
c. Standardized Residuals and Leverage
As in the SLR, the basic diagnostic tools we use plots involving the least squares residuals.
Recall that the basic model assumptions are stated in terms of the true regression lines errors (ε) and we use the least squares residuals as proxies for these true errors.
To what extent are e and ε are similar and different?
c. Standardized Residuals and Leverage
Let’s look at some plots to help develop some intuition for this. (blue line is the “true” line)
Y
X
Y
X
c. Standardized Residuals and Leverage
Now for some plots where ε and e are not at all alike. (blue line is the “true” line)
Y Y
c. Standardized Residuals and Leverage
When the X value is unusually far away from the center of the X data, the least squares line will go out of its way to “chase the point”.
c. Standardized Residuals and Leverage
What about more than one regressor? Let’s look at a dangerous situations with 2 variables
X2 unusual X
value
We need a measure of
remoteness relative to the bulk of the X data.
c. Standardized Residuals and Leverage
The basic idea behind the measure used most commonly is to measure the distance from point of means.
The idea is that we measure distance from the point of means and adjust for the
extent of variability along the principle axes of variation in the X data.
X2
X1
c. Standardized Residuals and Leverage
The distance measure motivated by the above diagram can be computed for each observation.
We can’t write down any easy formula for the leverage except in the case of SLR:
This formula makes good sense since it is larger the larger
For SLR : h
i=
1
N
+
X
i−
X
(
)
2X
j−
X
(
)
2j=1
N
c. Standardized Residuals and Leverage
We can relate the variance of the least squares residuals to the hi.
Var e
( )
i=
σ
2( )
1
−
h
iObservations with large leverage (hi) will have smaller
residuals which is exactly the intuition we developed using the plots on the earlier slides.
We use the leverage measures to standardize the least squares residuals:
r
i=
e
is 1
( )
−
h
i≈
ε
iWhat is a large hi value?
It turns out that and
Thus, the average hi value is (k+1)/N
We can use the following useful rule:
i.e. flag observations with h values > 3 x average h c. Standardized Residuals and Leverage
c. Standardized Residuals and Leverage
What is true:
– observations with large hi values are remote in X space
– observations with large hi values are the potential to
influence the fitted line
What is not true:
– observations with large hi values have actually exerted
influence
d. Outlying Observations
I prefer the term “unusual” observations. How can we find unusual observations?
– Large hi values
– Large ri values
Observations with large ri values are often called “outliers”
We should look for an “assignable cause” for outliers
– error in data entry
d. Outlying Observations
Sometimes outliers are the most interesting observations.
If y is a performance measure, then observations with large positive ri are super-performers.
Residuals from the Market Model can be used for “event” studies: – changes in accounting standards
– regulatory changes
– opening of markets
– effect of tender offers on a firm
Don’t routinely delete observations without cause – this will bias the estimate of σ downward
– make you overly confident in your predictions and
d. Outlying Observations
d. Outlying Observations
In fact, there is only a couple of really large residuals. Here is the result of residualPlots.
Standardized Residuals F re qu en cy
-4 -2 0 2 4
0
20
40
60
-0.10 -0.05 0.00 0.05 0.10 0.15
-0 .0 6 -0 .0 2 0.02 0.06 Fitted R esi du al s
-2 -1 0 1 2
-0 .0 6 -0 .0 2 0.02 0.06 R esi du al s
Quantiles Under Normality
A-D p-value= 0
0.01 0.02 0.03 0.04 0.05 0.06
Standardized Residuals F re qu en cy
-4 -2 0 2 4
0
10
30
50
70
-0.10 -0.05 0.00 0.05 0.10 0.15
-0 .0 6 -0 .0 2 0.02 0.06 Fitted R esi du al s
-2 -1 0 1 2
-0 .0 6 -0 .0 2 0.02 0.06 R esi du al s
Quantiles Under Normality
A-D p-value= 0
0.01 0.03 0.05
-4 -3 -2 -1 0 1 2 3 Leverage St an da rd ize d R esi du al s
d. Outlying Observations
Zoom in on the bottom right:
Outside the yellow: large std res
Right of green: large leverage One large std res (around -4)
Histogram of h_i h_i F re qu en cy
0.01 0.02 0.03 0.04 0.05 0.06 0.07
0 20 40 60 80 100
d. Outlying Observations
There a number of observations flagged for “influence.” We know that means leverage. Let’s plot the hi values...
e. Influence Measures
An observation is influential if deleting it substantially changes the fitted line (least squares coefficients).
A natural measure of influence would be to: i. delete each observation one by one
e. Influence Measures
When will observations cause large changes in the fitted line when deleted.
Obviously, large r, or large h in and of themselves are not sufficient as shown in the plots below:
Y
X
Large hi
Y
X
Large ri
e. Influence Measures
What about the situation in which both hi and ri are large?
Y
X
e. Influence Measures
Cook’s distance combines both r and h as follows:
Most use a rule that if D > 1, then we should examine further.
BUT It is best to:
i. plot Cook’s distance values for every observation in the
dataset
ii. look for “outlying” values of D.
e. Influence Measures
Example: Rats Liver Data – dataset(rat)
Goal of Study: predict amount of drug absorbed by Rat liver given observable (without the scalpel) variables:
Y = percent dose in liver
5 10 15 0.0 0.2 0.4 0.6 0.8 Index co oks. di st an ce (lm(Y ~ x1 + x2 + x3 , d at a = ra t))
e. Influence Measures
Now let’s look at the Cook’s distance numbers.
e. Influence Measures
f. Nonlinearity
We have emphasized that the standard regression model is a linear conditional mean model.
We can check this assumption by:
a. plot residuals vs. fitted (overall non-linearity)
b. plot residuals vs. specific X vars (is non-linearity related to specific variables?)
Why should I care?
f. Nonlinearity
In many situations in practice, it is desirable to have some flexibility to specify non-linear regression functions.
The standard linear regression model can be "tricked" into displaying non-linearity by two techniques:
i. Adding additional independent variables
f. Nonlinearity
i. Adding additional independent variables
Trick 1:
Here we add the additional variable X2 to the regression.
This allows the regression function to bend in a parabolic or quadratic curve
Remember that:
– If β2 is negative, it "sheds water";
f. Nonlinearity
i. Adding additional independent variables cont.
With a multiple regression, we might consider adding the squares of two or more variables.
f. Nonlinearity
Here we add the interaction term which is a new variable that consists of the product of X1 and X2.
The effect of the interaction term is to make the derivative of the regression with respect to X(1 or 2) depend on the level of the other variable.
Note here: the first term is what you get from the standard linear model.
Trick 2:
f. Nonlinearity
What does this look like?
Effect of X1 higher at high X2
f. Nonlinearity
We can put both ideas together and include squares and interaction terms for all variables.
What should you be careful about?
too many coefficients (over-fitting)
nonlinear terms are dangerous in out-of-sample forecasting.
possible colinearity problems
Y
=
β
0+
β
1X
1+
β
2X
12+
β
3X
2+
β
4X
22f. Nonlinearity
ii. Take transformations of the dependent variable
The log transformation is the most common single transformation of the dependent variable.
Statisticians perform this by:
i. taking the log of the dependent variable
ii. then entering the log of the dependent variable in the
f. Nonlinearity
To see the effect of the log transform of Y. Let’s first start with a model that is “compatible” with the log transform:
E Y | X⎡⎣ ⎤⎦ = AXβ1
Basic multiplicative model given by:
0.5 1.0 1.5 2.0 2.5
1
2
3
0.5 1.0 1.5 2.0 2.5
1 2 3 4 5 6 β
1= 2
β1= 1
β1= 0.5
Negative β1 Positive β1
β1 = -2
β1 = -1
f. Nonlinearity
If we take the logs of both sides of the multiplicative model, we get a model that is linear in the logs.
log E Y| X
(
⎡⎣
⎤⎦
)
=
log(AX
β1)
=
log(A)
+
β
1log(X)
This suggests a simple linear regression model which is linear in the logs:
Trick 3:
log(Y)
=
β
f. Nonlinearity
One important thing to note is that β1 now has the interpretation of an elasticity.
The elasticity is defined as the percentage change in Y for a given percentage change in X.
Elasticity:
Some also consider a semi-log model in which only Y is transformed by the log.
f. Nonlinearity
Example: Mammals data
The dataset “Mammals” contains data on the body and brain weight of 62 different species of mammals.
Is there a relationship between body weight and brain weight?
f. Nonlinearity
Let’s log both Y and X and “break-up” clustering:
This looks much better!
Pretty Linear
If we model log(brainwgt) as linear in log(bodywgt), we are assuming a multiplicative model of the form:
-4 -2 0 2 4 6 8
f. Nonlinearity
g. Dummy Variables
Many variables or factors in the world are fundamentally discrete. Either you are male or female, have MBA or not,…
To accommodate these sorts of qualitative factors in regression models:
– use dummy, binary or indicator variables to measure
these factors
A dummy variable is a variable that takes on the values of only 0 or 1. Examples:
– Qualitative effects: (1 if on promotion, 0 if not)
– Temporal effects : (1 if Holiday season, 0 if not)
(1 if Monday, 0 if not) – Spatial (1 if in West Coast, 0 if not)
g. Dummy Variables
What does a dummy variable do if introduced into regression equation?
Dummy Variables allow the mean to shift. Let’s look at example from the detergent data. “promoflag” is a dummy variable which flags those weeks where there was a promotion on the item.
g. Dummy Variables
How do we interpret the coefficient on the promoflag variable?
It’s the expected change in the log of quantity for a one unit
change in promoflag (as always), holding all other indep variables constant!
What is change in log – it’s percentage change.
g. Dummy Variables
More Than Two Levels
We can use dummy variables in situations in which there are
more than two categories. Dummy variables are needed for each category (except a designated “base” category).
Example: scores in different sections of the same class by instructor Yi = test score
“Instructor” factor takes on three levels: 1 = prof A
2 = prof B
3 = prof C
We introduce two dummy variables: X1 = 1 if B; 0 otherwise
g. Dummy Variables
The data would look like this
Y X1 X2
John 98 1 0
Nancy 87 1 0
Lester 81 0 1
Tom 92 0 1
Lisa 76 0 0
Sue 98 0 0
What sort of regression would we run?
g. Dummy Variables
How do we interpret the β’s?
E Y | X1⎡⎣ = 1⎤⎦ = β0 + β1
E Y | X2
⎡⎣
=
1
⎤⎦
=
β
0+
β
2Where is A? In the intercept:
E Y | X1⎡⎣ = X2= 0⎤⎦ =β0
Thus, we should interpret the betas as measure differences against a base case which is in the intercept.
β1 = A-B
g. Dummy Variables
We can generalize this to q different groups simply by putting in q-1 dummy variables for each of q-1 of the groups and “nominating” one group to be the intercept.
This always gives the regression coefficients an interpretation as the difference in means.
Let’s look at an example of this.
Example: Stock Returns and the Weekend Effect
K. R. French (Journal of Financial Economics [1980])
g. Dummy Variables
Overall F is 25.4 compare with F4,6019 distribution – reject.
g. Categorical Variables and R
How does R handle variables that are purely categorical. The values of the variable indicate which category the
observation is from and have no ordinal meaning.
Example: store variable in detergent. This is just an
g. Categorical Variables and R
Put in dummy variables for each store. Do you have to sit there and create dummies? No, R knows what a factor is and creates them automatically.
g. Dummy Variables
Dummy Variables and Handling Nonlinearity
Consider the example of modeling the effect of years of education on Salaries. What’s wrong with just putting in Years of Ed in a regression?
g. Dummy Variables
The true regression or mean function might look like this:
Education Salary
12 16 18
High School
College
MBA
We could approximate this regression function fairly well by
h. Heteroskedasticity
As we have seen heteroskedasticity is a situation in which the error terms do not all have the same variance.
This is usually detected with the residuals vs. fitted plot.
1. Effects of Heteroskedasticity
On Standard Errors:
Standard error formulas are derived under the assumption of
0 50 100 150 200 250
10000
20000
30000
40000
time_trend
SAL
ES
h. Heteroskedasticity
Large Var(ε) implies Prediction Interval should
h. Heteroskedasticity
2.
Correcting for Heteroskedasticity
One popular approach is known as variance stabilizing
transformations.
Taking logs of the dependent variable is one type of
transformation often used with business oriented data to remove heteroskedasticity.
– use this when the variance increases with the level of
the dependent variable
h. Heteroskedasticity
2.
Correcting for Heteroskedasticity
Let’s take the log and re-plot
0 50 100 150 200 250
8.0
8.5
9.0
9.5
10.0
10.5
time_trend
lo
g(SAL
h. Heteroskedasticity
2.
Correcting for Heteroskedasticity
Another popular approach is to correct the standard errors formulae to take into account heteroskedasticity. This will note adjust the prediction intervals for heteroskedasticity.
This is implemented in many econometric regression packages (the most popular being Stata). I have
h. Heteroskedasticity
i. More on Log-Transforms
1. The Many Roles for the Log Transform
We have seen that there are three basic reasons for applying the log transform to the dependent variable:
i. To accommodate nonlinearity in the regression
relationship
ii. To reduce right skewness in the error distribution
iii. To eliminate heteroskedasticity of the form in which the
i. More on Log-Transforms
2. Prediction from log-transformed models
One drawback of using a log transform on Y is that the fitted and predicted values will be in the units of logs. We fit and predict log(Y) not Y!
i. More on Log-Transforms
Solution to prediction problem:
Point Predictor:
Prediction Interval:
Here Pl , Pu are the lower and upper limits from the log transformed model.
ˆY
f=
exp(b
0+
b
1X
f)
exp P
( )
l, exp P
( )
ui. More on Log-Transforms
ˆY
f=
exp(b
0+
b
1X
f)
Note: here we are
i. More on Log-Transforms
3. Log Transforms of Independent Variables
Do you have to take the log of the X variables as well as the dependent variable?
NO! In particular, it doesn't ever make sense to take the log of dummy variables no matter what happens to Y! It is perfectly acceptable to enter the log of some X variables and leave others in raw form.
i. More on Log-Transforms
4. Interpretation of coefficients in semi-log models
Consider the semi-log model:
How do we interpret the slope coefficient on X? This is the expected change in log Y for a one unit change in X. This means that the coefficient is:
j. Putting It All Together
A Strategy for Building Regression Models:
– Select Universe of Possible X Variables
– Put Them All In!
– Perform Residual Diagnostics (norm plot, resids vs Yhat, resids
vs specific Xs)
– Take Remedies As Necessary (for hetero/nonlinearity)
– Reduce Set of X variables via partial F-testing
k.“Newfood” Example of Experimental Design
A new instant breakfast product was the subject of a 6 month market test study.
The goal of this test was to determine an optimal
introductory marketing plan for use when the product is introduced on a national scale and to obtain
improved sales forecasts.
This dataset is the outcome of a controlled
experiment in which the values of the independent
variables which affect sales are chosen by the
k. Design of the Experiment
A controlled store test was designed involving stores in four different cities.
Variables to be considered:
1. Price
2. Advertising
3. Shelf-location
k. Design of the Experiment- Levels of the Variables
price levels: 24, 29, and 34 cents per oz (why three?)
Advertising levels: high and low (why two?)
All advertising was in the form of television commercials. The high level was used in two of the four cities and was meant to simulate the effect of a $6 million nationwide advertising campaign. The low level was used in the other two cities and was designed to simulate the effect of a $3 million nationwide advertising campaign.
k. Design of the Experiment- Design Matrix
The next problem confronting the designer is to choose particular combinations of prices,
advertising level, and shelf location to use in the study.
To simplify this discussion, let's consider a design involving only two prices and two levels of
advertising.
Adv\Price LOW HIGH
LOW X
HIGH X
Adv\Price LOW HIGH
LOW X X
HIGH X
Adv\Price LOW HIGH
LOW X X
k. Design of the Experiment- Solution
The solution adopted by the designers of this study is to field all possible combinations of price, adv, and location = 3 x 2 x 2 = 12 experimental cells. Such a design is called an orthogonal design since
the correlations between each variable are zero. To see this use dummy variable codings for both
price and adv in above designs.
k. Analysis of Data
The dataset is “newfood” in the class datasets area.
k. First Cut Regression
Is the price coef large?
What happened to the Adv
Effect?
Do we care about s in
k. Residual Diagnostics
Why are there only 3 fitted values? Is there
Right-skewed Errors 0 0 50 100 150
140 160 180 200 220 240 260
k. Why is the Advertising Effect so Small?
The stores in the cities with the “high” advertising treatment where lower total volume (All Commodity Volume) and with lower income in store trading area! Advertising pure effect would be to increase sales but this is confounded with an income or store volume
effect which depresses sales!
Correlations:
sales price adv locat income price -0.658
adv 0.001 0.000
locat -0.001 0.000 -0.000
income 0.163 -0.131 -0.746 -0.000
k. Log-Linear Regression
k. Log-Linear Regression- Residual Diagnostics
Fini
Glossary of Symbols
Important Equations
se b
( )
j=
s
SSE
j | othersr
i=
e
is 1
( )
−
h
i≈
ε
iσ
~ N 0,1
( )
For SLR : h
i=
1
N
+
X
i−
X
(
)
2X
j−
X
(
)
2j=1 N
∑
std error of multiple regression coef
leverage value for Simple LR Model
Important Equations
Cook’s Distance (D for
“Distance”)
Glossary of R Commands
• hatvalues(model): Return the hi values of a model • cooks.distances(model): Returns Cooks Distance
for model.
• Name[-2,]: the second row is deleted from a matrix or
data frame. !
• lmSumm(lmfit,HAC=TRUE): compute