IV. Prediction and Diagnostics
a. Prediction
b. Why Regression Diagnostics?
c. Residual Plots
a. Prediction
Model:
$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i, \quad i = 1, \ldots, N$$
The conditional forecasting problem can be succinctly stated as:
– Predict a “future” observation, $Y_f$
– Given $X_f$ and the sample data $\{X_i, Y_i\}$, $i = 1, \ldots, N$
The only practical solution to the prediction problem is to use estimated parameters:
$$\hat{Y}_f = b_0 + b_1 X_f$$
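If we use this predictor, we will make a prediction error:
$$e_f = Y_f - \hat{Y}_f = Y_f - b_0 - b_1 X_f$$
Let’s draw this:
[Figure: Y against X, showing the true conditional mean line $E[Y_f \mid X_f] = \beta_0 + \beta_1 X$ and the fitted line $b_0 + b_1 X$, with $Y_f$ marked. The gap between the two lines is the sampling error.]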
Let’s write our prediction error in a way that shows the influence of two factors:
i. the model error term, or the inherent randomness
ii. estimation error in the model parameters
$$e_f = Y_f - \hat{Y}_f = \underbrace{\left(Y_f - E[Y_f \mid X_f]\right)}_{\text{inherent randomness}} - \underbrace{\left(\hat{Y}_f - E[Y_f \mid X_f]\right)}_{\text{estimation error}}$$
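The decomposition can be checked by simulation. The following is a minimal sketch, not part of the original notes: the parameter values, the sample size, and the choice of $X_f$ are all assumptions made for illustration. It repeatedly draws a sample, estimates the line, draws a future observation, and records the two components of the prediction error.

```r
# Monte Carlo sketch of the prediction-error decomposition.
# All numeric values below are illustrative assumptions.
set.seed(1)
beta0 <- 2; beta1 <- 0.5; sigma <- 1    # assumed "true" parameters
N <- 25; Xf <- 8; reps <- 2000
X <- runif(N, 0, 10)                    # regressors held fixed across replications

sim <- replicate(reps, {
  Y      <- beta0 + beta1 * X + rnorm(N, sd = sigma)    # sample data
  b      <- coef(lm(Y ~ X))                             # estimated parameters
  mean_f <- beta0 + beta1 * Xf                          # E[Yf | Xf]
  Yf     <- mean_f + rnorm(1, sd = sigma)               # the "future" observation
  Yhat_f <- b[[1]] + b[[2]] * Xf                        # the prediction
  c(e_f        = Yf - Yhat_f,
    randomness = Yf - mean_f,
    estimation = Yhat_f - mean_f)
})

# The identity e_f = randomness - estimation holds in every replication.
identity_gap <- max(abs(sim["e_f", ] - (sim["randomness", ] - sim["estimation", ])))

# The two components are independent, so their variances add.
theory <- sigma^2 * (1 + 1 / N + (Xf - mean(X))^2 / ((N - 1) * var(X)))
print(c(simulated = var(sim["e_f", ]), theory = theory))
```

The simulated variance of the prediction error should be close to the sum of the variances of the two components, which is what makes a closed-form predictive standard error possible.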
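Now let’s compute a prediction interval for $Y_f$. The variance of the prediction error is
$$\mathrm{Var}(e_f) = \mathrm{Var}(Y_f - \hat{Y}_f) = \mathrm{Var}(\varepsilon_f) + \mathrm{Var}(\hat{Y}_f) = \sigma^2 + \sigma^2\left(\frac{1}{N} + \frac{(X_f - \bar{X})^2}{(N-1)\,s_X^2}\right) = \sigma^2\left(1 + \frac{1}{N} + \frac{(X_f - \bar{X})^2}{(N-1)\,s_X^2}\right)$$
The predictive standard error, denoted $s_{pred}$, is then
$$s_{pred} = s\left(1 + \frac{1}{N} + \frac{(X_f - \bar{X})^2}{(N-1)\,s_X^2}\right)^{.5}$$
where $s$ is the standard error of the regression.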
Let’s return to the printout and fill in the formula for the prediction interval.
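As a rough illustration of how the formula is filled in, the sketch below computes $s_{pred}$ by hand and compares the resulting interval with R's built-in `predict()`. The data are simulated, and the 95% level, the value of `Xf`, and the use of a t critical value with N − 2 degrees of freedom are assumptions for this example rather than values taken from the printout.

```r
# Sketch: prediction interval by hand versus predict().
# The data and the 95% level are illustrative assumptions.
set.seed(42)
N <- 30
X <- runif(N, 0, 10)
Y <- 1 + 2 * X + rnorm(N, sd = 1.5)     # simulated sample
fit <- lm(Y ~ X)
Xf  <- 7                                # assumed future value of X

s      <- summary(fit)$sigma            # standard error of the regression
s_pred <- s * sqrt(1 + 1 / N + (Xf - mean(X))^2 / ((N - 1) * var(X)))
Yhat_f <- sum(coef(fit) * c(1, Xf))     # b0 + b1 * Xf
t_crit <- qt(0.975, df = N - 2)

manual  <- c(fit = Yhat_f,
             lwr = Yhat_f - t_crit * s_pred,
             upr = Yhat_f + t_crit * s_pred)
builtin <- predict(fit, newdata = data.frame(X = Xf),
                   interval = "prediction", level = 0.95)

print(manual)
print(builtin)
```

The two intervals should agree, which confirms that `predict(..., interval = "prediction")` uses the same predictive standard error.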
b. Why Regression Diagnostics?
Up to now, we have assumed that the data are generated by a linear regression model.
What are the basic assumptions of the model?
1. linear conditional mean
2. constant variance (homoskedasticity)
3. normal errors
So we should see:
– a pattern of constant variation around a line
– very few points more than 2 standard deviations away
Why Should We Care?
If the model assumptions are violated:
– Prediction can be systematically biased
– Standard errors and t-tests are wrong
– Someone may be able to beat you with a different and better model
How can we detect violations of the model?
– We must use graphical methods
To drive this point home, let’s look at the “famous” Anscombe data.
[Figures: four scatter plots, one for each of Data Set 1, Data Set 2, Data Set 3, and Data Set 4, with the x variable on the horizontal axis.]
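The point of the four data sets can be reproduced with the `anscombe` data frame that ships with base R (columns `x1` to `x4` and `y1` to `y4`). This is a sketch added for illustration, not the code used to make the original slides: it fits the same simple regression to each data set and then plots them.

```r
# Sketch: the four Anscombe data sets give nearly identical regression output.
fits <- lapply(1:4, function(k) {
  lm(as.formula(paste0("y", k, " ~ x", k)), data = anscombe)
})

coefs <- t(sapply(fits, coef))          # one row per data set
colnames(coefs) <- c("b0", "b1")
r2 <- sapply(fits, function(f) summary(f)$r.squared)
print(round(cbind(coefs, R2 = r2), 3))

# Only the plots reveal how different the data sets are.
op <- par(mfrow = c(2, 2))
for (k in 1:4) {
  plot(anscombe[[paste0("x", k)]], anscombe[[paste0("y", k)]],
       xlab = paste0("x", k), ylab = paste0("y", k),
       main = paste("Data Set", k))
  abline(fits[[k]])
}
par(op)
```

The numerical summaries are almost the same across the four fits, so only a graphical check separates the well-behaved data set from the curved one and from the two that are driven by a single unusual point.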
c. Residual Diagnostic Plots
Two basic plots are very useful:
i. Plot of Residuals vs. Fitted Values
ii. A Normal Probability Plot
When Model Assumptions Hold
A First Cut: plot Y against X (works only when you have one X)
These data look great! Linear association with constant variance.
Normal?
i. Plot of Residuals vs. Fitted Values
What should this look like?
1. Residuals should be evenly distributed around the mean
2. No relationship between the mean of the residuals and the fitted values
A key assumption is that the regression model is a linear function.
This is not always true.
This will show up even more prominently in the residuals vs. fitted plot…
There should be no relationship between the average value of the residuals and the fitted values (or X).
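A small simulation makes the contrast concrete. In this sketch (simulated data, with coefficients chosen only for illustration) one response has a linear mean and the other a curved mean; both are fit with a straight line. Because least squares residuals are always uncorrelated with the fitted values, the check below looks at the average residual within three bins of fitted values rather than at a correlation.

```r
# Sketch: residuals vs. fitted when linearity holds and when it fails.
# The data-generating coefficients are illustrative assumptions.
set.seed(7)
x      <- runif(100, 0, 10)
y_lin  <- 3 + 2 * x + rnorm(100)                # linear conditional mean
y_quad <- 3 + 2 * x + 0.5 * x^2 + rnorm(100)    # curved conditional mean

fit_lin  <- lm(y_lin ~ x)
fit_quad <- lm(y_quad ~ x)

op <- par(mfrow = c(1, 2))
plot(fitted(fit_lin), resid(fit_lin), xlab = "Fitted", ylab = "Residuals",
     main = "Linear mean")
abline(h = 0, lty = 2)
plot(fitted(fit_quad), resid(fit_quad), xlab = "Fitted", ylab = "Residuals",
     main = "Curved mean")
abline(h = 0, lty = 2)
par(op)

# Average residual in low / middle / high bins of the fitted values.
bin_mean <- function(fit) {
  as.numeric(tapply(resid(fit), cut(fitted(fit), 3), mean))
}
print(rbind(linear = bin_mean(fit_lin), curved = bin_mean(fit_quad)))
```

For the curved case the bin averages follow a U shape (positive, negative, positive), which is the pattern to look for in the plot.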
A constant elasticity relationship implies a curved regression function.
Let’s look at some data generated from a log-log model of demand. This is a non-linear model with a constant elasticity.
[Figure: scatter plot of Unit Sales against price.]
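To see why a constant elasticity relationship is curved in levels but linear in logs, one can simulate demand data along these lines. This is a sketch under assumed values: the elasticity of −2, the intercept, the noise level, and the price range are not taken from the data in the notes.

```r
# Sketch: constant-elasticity (log-log) demand, fit in levels and in logs.
# Elasticity, intercept, and noise level are illustrative assumptions.
set.seed(3)
n          <- 200
elasticity <- -2
price      <- runif(n, 3, 7)
sales      <- exp(8 + elasticity * log(price) + rnorm(n, sd = 0.1))

fit_level <- lm(sales ~ price)              # straight line in levels
fit_log   <- lm(log(sales) ~ log(price))    # straight line in logs

op <- par(mfrow = c(1, 2))
plot(fitted(fit_level), resid(fit_level), xlab = "Fitted", ylab = "Residuals",
     main = "Levels")
abline(h = 0, lty = 2)
plot(fitted(fit_log), resid(fit_log), xlab = "Fitted", ylab = "Residuals",
     main = "Logs")
abline(h = 0, lty = 2)
par(op)

print(coef(fit_log))    # the slope estimates the elasticity
```

The residuals from the levels fit show curvature, while the log-log fit recovers a slope close to the assumed elasticity.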
ii. Normal Probability Plots
Use these to test the normality of the residuals. Non-normal residuals cause the following sorts of headaches:
– "t-tests" and other associated statistics may no longer be t distributed
– Least squares estimates are extremely sensitive to large $\varepsilon_i$, and it may be possible to improve on least squares
– The linear functional form may be incorrect and various
Remember that the salient characteristics of the normal distribution are thin tails and symmetry.
How can we detect departures from normality?
The most basic analysis would be to graph the histogram of the standardized residuals.
Neither of these plots looks particularly symmetric.
Let’s compute a normal probability plot using the normPlot() function.
[Figures: normal probability plots for samples of size N=30 (A-D p-value = 0.13) and N=100, with Quantiles Under Normality on the horizontal axis.]
The normal probability plot is a plot of the sample CDF on a coordinate system in which the normal CDF appears as a straight line. The sample CDF will appear as a scatter of points around the normal CDF straight line.
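Since `normPlot()` is a customized function, the sketch below builds an equivalent plot from base R only. It is an approximation of the idea, not the course function: the data are simulated, and the Shapiro-Wilk test is swapped in for the Anderson-Darling (A-D) p-value shown on the plots, because it is available without extra packages.

```r
# Sketch: a normal probability plot built by hand from base R.
# Simulated data; shapiro.test() stands in for the A-D test.
set.seed(11)
x   <- runif(100, 0, 10)
y   <- 1 + 2 * x + rnorm(100)
fit <- lm(y ~ x)

z    <- rstandard(fit)            # standardized residuals
n    <- length(z)
theo <- qnorm(ppoints(n))         # quantiles under normality

plot(theo, sort(z),
     xlab = "Quantiles Under Normality",
     ylab = "Sorted standardized residuals",
     main = paste0("N=", n))
abline(0, 1)                      # reference line if residuals were exactly N(0, 1)

# qqnorm(z); qqline(z) gives a similar picture in one step.
print(shapiro.test(z)$p.value)
```

Points that bend away from the line in the tails suggest fat tails or skewness, while a roughly straight scatter is consistent with normal errors.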
d. Putting It All Together: The Shock Absorber Example
Suppliers for very large manufacturing firms are facing increasing pressure to assure their customers that the parts they produce meet high quality standards.
This supplier is supplying gas-filled shock absorbers.
The data are measurements on the rebound force of the shock absorber. Measurements can be taken both before and after the shock absorber is fully assembled. It is cheaper to take measurements of the shock absorber performance before, rather than after, assembly.
See dataset shock.
Basic Model
We must formulate a statistical model to predict rebound force after assembly using the before assembly measurement.
Descriptive Statistics
These are statistics formed from the marginal distributions of
Marginal Distribution of Y
Doesn’t look normal! Three clumps!
Can we run a regression even though Y is not normal? YES!
[Figures: histogram of rebounda (Frequency on the vertical axis) and normal probability plot of rebounda against Quantiles Under Normality.]
Joint or Bivariate Distribution
Let’s do a scatter plot. Which variable should be on the Y axis?
Note three clusters of points. This implies room for improvement in quality!
[Figure: scatter plot of rebound force after assembly (vertical axis) against reboundb (horizontal axis).]
Regression Analysis
Residual Diagnostics
Residuals are much more normal than the marginal distribution of Y.
Residuals vs. Fitted looks pretty good.
[Figures: normal probability plot of the residuals (A-D p-value = 0.35) and plot of residuals vs. fitted values.]
T-tests
Suppose before measurements were “perfect” predictors. What would this mean?
One School of Thought:
All you need is very accurate predictions
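T-tests continued
Let’s test a slight modification: $H_0: \beta_1 = 1$ against the two-sided alternative $\beta_1 \neq 1$.
Since N is relatively small, let’s test at the 10 percent significance level.
Step 1: Compute t critical value
> qt(.05,df=33)
[1] -1.692360
Step 2: Compute t statistic
> t=(.94946-1)/.0438
> t
[1] -1.153881
Step 3: Compute p-value
> pt(-1.153881,df=33)*2
Since |t| = 1.15 is smaller than the critical value of 1.69, we cannot reject the hypothesis that the slope equals 1 at the 10 percent level.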
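The three steps can be wrapped into one small helper. The function name `slope_t_test` and its arguments are hypothetical, introduced here only for illustration; the inputs are the estimate, standard error, and degrees of freedom that appear in the commands above.

```r
# Sketch: two-sided t-test of a slope against an arbitrary null value.
# slope_t_test() is a hypothetical helper, not a function from the course code.
slope_t_test <- function(b1, se, df, null = 0, alpha = 0.10) {
  t_stat <- (b1 - null) / se                  # t statistic
  t_crit <- qt(1 - alpha / 2, df = df)        # two-sided critical value
  p_val  <- 2 * pt(-abs(t_stat), df = df)     # two-sided p-value
  list(t = t_stat, critical = t_crit, p_value = p_val,
       reject = abs(t_stat) > t_crit)
}

res <- slope_t_test(b1 = 0.94946, se = 0.0438, df = 33, null = 1)
print(res)
```

Writing the null value as an argument makes it clear that the default regression printout tests a slope of 0, whereas the question here is whether the slope differs from 1.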
Prediction
Glossary of Symbols
$X_f$ - future value of X for forecasting
$Y_f$ - value of Y to be forecasted
Important Equations
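definition of prediction error:
$$e_f = Y_f - \hat{Y}_f$$
decomposition of prediction error:
$$e_f = \left(Y_f - E[Y_f \mid X_f]\right) - \left(\hat{Y}_f - E[Y_f \mid X_f]\right)$$
predictive standard error:
$$s_{pred} = s\left(1 + \frac{1}{N} + \frac{(X_f - \bar{X})^2}{(N-1)\,s_X^2}\right)^{.5}$$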
Glossary of R Commands
• normPlot(variable name): This is a customized