(1)

IV. Prediction and Diagnostics

a. Prediction

b. Why Regression Diagnostics?

c. Residual Plots

(2)

a. Prediction

Model:

$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i, \qquad i = 1, \ldots, N$$

The conditional forecasting problem can be succinctly stated as:

– Predict a “future” observation, $Y_f$

– Given $X_f$ and the sample data $\{X_i, Y_i\}$, $i = 1, \ldots, N$

The only practical solution to the prediction problem is to use estimated parameters:

$$\hat{Y}_f = b_0 + b_1 X_f$$

(3)

a. Prediction

If we use this predictor, we will make a prediction error:

$$e_f = Y_f - \hat{Y}_f = Y_f - b_0 - b_1 X_f$$

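Let’s draw this:

[Figure: the true regression line $E[Y_f \mid X_f] = \beta_0 + \beta_1 X$ and the fitted line $b_0 + b_1 X$ plotted against X, with $Y_f$ marked; the gap between the two lines is labeled "Sampling error"]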

(4)

a. Prediction

Let’s write our prediction error in such a way so that we can see the influence of two factors:

i.  the model error term or the inherent randomness

ii.  estimation error in the model parameters

$$Y_f - \hat{Y}_f = e_f = \underbrace{Y_f - E[Y_f \mid X_f]}_{\text{inherent randomness}} - \underbrace{\left(\hat{Y}_f - E[Y_f \mid X_f]\right)}_{\text{estimation error}}$$
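To see why the two pieces carry these names, it may help to substitute the model into each one. The lines below are a short worked step, assuming the future observation follows the same linear model as the sample, $Y_f = \beta_0 + \beta_1 X_f + \varepsilon_f$:

```latex
\begin{aligned}
Y_f - E[Y_f \mid X_f] &= \varepsilon_f \\
\hat{Y}_f - E[Y_f \mid X_f] &= (b_0 - \beta_0) + (b_1 - \beta_1) X_f \\
e_f &= \varepsilon_f - (b_0 - \beta_0) - (b_1 - \beta_1) X_f
\end{aligned}
```

The first term would remain even if the true parameters were known. The second term depends only on how far the estimates fall from the true parameters.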

(5)

a. Prediction

Now let’s compute a prediction interval for $Y_f$.

$$\operatorname{Var}(e_f) = \operatorname{Var}(Y_f - \hat{Y}_f) = \operatorname{Var}(\varepsilon_f) + \operatorname{Var}(\hat{Y}_f) = \sigma^2 + \sigma^2\left(\frac{1}{N} + \frac{(X_f - \bar{X})^2}{(N-1)\,s_X^2}\right) = \sigma^2\left(1 + \frac{1}{N} + \frac{(X_f - \bar{X})^2}{(N-1)\,s_X^2}\right)$$

The predictive standard error, denoted $s_{pred}$, is then

$$s_{pred} = s\left(1 + \frac{1}{N} + \frac{(X_f - \bar{X})^2}{(N-1)\,s_X^2}\right)^{0.5}$$

where $s$ is the standard error of the regression.
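As a rough illustration, the predictive standard error can be computed by hand in R and checked against the built-in `predict()` function. This is a minimal sketch on simulated data: the data, the seed, and the future value `x_f` are all made up for illustration, and the interval used is the standard one, $\hat{Y}_f \pm t_{N-2} \cdot s_{pred}$.

```r
# Simulated data (hypothetical; not the course data set)
set.seed(1)
N <- 30
x <- runif(N, 0, 10)
y <- 2 + 0.5 * x + rnorm(N, sd = 1)

fit <- lm(y ~ x)
s   <- summary(fit)$sigma   # standard error of the regression
x_f <- 7                    # hypothetical future value of X

# Predictive standard error; var(x) uses the N - 1 denominator, so it is s_X^2
s_pred <- s * sqrt(1 + 1 / N + (x_f - mean(x))^2 / ((N - 1) * var(x)))

# 95% prediction interval: point forecast +/- t critical value * s_pred
y_hat   <- sum(coef(fit) * c(1, x_f))
t_crit  <- qt(0.975, df = N - 2)
by_hand <- c(fit = y_hat,
             lwr = y_hat - t_crit * s_pred,
             upr = y_hat + t_crit * s_pred)

# Built-in equivalent for comparison
built_in <- predict(fit, newdata = data.frame(x = x_f),
                    interval = "prediction", level = 0.95)

print(by_hand)
print(built_in)
```

The two printed intervals should agree, and the interval widens as `x_f` moves away from `mean(x)`.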

(6)

a. Prediction

Let’s return to the printout and fill in the formula for the prediction interval.

(7)

b. Why Regression Diagnostics?

Up to now, we have assumed that the data are generated by a linear regression model.

What are the basic assumptions of the model?

1. linear conditional mean

2. constant variance (homoskedasticity)

3. normal errors

So we should see:

– a pattern of constant variation around a line

– very few points more than 2 standard deviations away from the line

(8)

Why Should We Care?

If the model assumptions are violated:

– Predictions can be systematically biased

– Standard errors and t-tests are wrong

– Someone may be able to beat you with a different and better model

How can we detect violations of the model?

– We must use graphical methods

To drive this point home, let’s look at the “famous” Anscombe data.
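A quick way to try this yourself: base R ships Anscombe's four data sets as the data frame `anscombe` (columns `x1` to `x4` and `y1` to `y4`). The sketch below assumes that built-in data frame matches the data on the following slides. It fits the same simple regression to each set and then plots them. The four fits have nearly the same intercept (about 3.0), slope (about 0.5) and R-squared (about 0.67), so only the plots reveal how different the data sets are.

```r
# anscombe is part of base R (datasets package)
data(anscombe)

# Fit y_k ~ x_k for each of the four data sets
fits <- lapply(1:4, function(k) {
  lm(as.formula(paste0("y", k, " ~ x", k)), data = anscombe)
})

# Collect intercept, slope and R-squared into one table
summary_table <- t(sapply(fits, function(f) {
  c(unname(coef(f)), summary(f)$r.squared)
}))
colnames(summary_table) <- c("intercept", "slope", "r.squared")
rownames(summary_table) <- paste("Data Set", 1:4)
print(round(summary_table, 2))

# The scatter plots are what distinguish the four data sets
op <- par(mfrow = c(2, 2))
for (k in 1:4) {
  plot(anscombe[[paste0("x", k)]], anscombe[[paste0("y", k)]],
       xlab = paste0("x", k), ylab = paste0("y", k),
       main = paste("Data Set", k))
  abline(fits[[k]])
}
par(op)
```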

(9)

b. Why Regression Diagnostics?

[Figure: scatter plot of Data Set 1 (horizontal axis x1)]

(10)

[Figure: scatter plot of Data Set 2 (horizontal axis x2)]

(11)

[Figure: scatter plot of Data Set 3 (horizontal axis x3)]

(12)

[Figure: scatter plot of Data Set 4]

(13)

c. Residual Diagnostic Plots

Two basic plots are very useful:

i. Plot of Residuals vs. Fitted Values

ii. A Normal Probability Plot

When Model Assumptions Hold

A First Cut: plot Y against X (works only when you have one X)

This data looks great!

Linear association with constant variance.

Normal?

(14)

i. Plot of Residuals vs. Fitted Values

What should this look like?

1. Residuals should be evenly distributed around the mean

2. No relationship between the mean of the residual and the fitted values

(15)

A key assumption is that the regression model is a linear function.

This is not always true.

This will show up even more prominently in the residuals vs. fitted plot…

(16)

There should be no relationship between the average value of the residuals and the fitted values (or X).
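The contrast between a well-behaved and a misspecified residual plot can be sketched with simulated data. Everything here is hypothetical: the data, the seed, and the helper `curvature_r2()`, which is only a rough numerical companion to the plot (it regresses the residuals on a quadratic in the fitted values) and is not a substitute for looking at the picture.

```r
# Simulated data (hypothetical): one linear and one curved conditional mean
set.seed(2)
n <- 100
x <- runif(n, 1, 10)
y_lin  <- 1 + 2 * x + rnorm(n)
y_curv <- 1 + 0.5 * x^2 + rnorm(n)

fit_lin  <- lm(y_lin ~ x)
fit_curv <- lm(y_curv ~ x)   # straight line fitted to curved data

# Residuals vs. fitted values for both fits
op <- par(mfrow = c(1, 2))
plot(fitted(fit_lin), resid(fit_lin),
     xlab = "Fitted", ylab = "Residuals", main = "Assumptions hold")
abline(h = 0, lty = 2)
plot(fitted(fit_curv), resid(fit_curv),
     xlab = "Fitted", ylab = "Residuals", main = "Curved mean")
abline(h = 0, lty = 2)
par(op)

# Hypothetical helper: share of residual variation explained by a
# quadratic in the fitted values (near 0 when the mean is linear)
curvature_r2 <- function(fit) {
  summary(lm(resid(fit) ~ poly(fitted(fit), 2)))$r.squared
}
print(c(linear = curvature_r2(fit_lin), curved = curvature_r2(fit_curv)))
```

In the left panel the residuals scatter evenly around zero. In the right panel their average traces a U shape across the fitted values.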

(17)

c. Residual Diagnostic Plots

A constant elasticity relationship implies a curved regression function.

Let’s look at some data generated from a log-log model of demand. This is a non-linear model with a constant elasticity.

[Figure: scatter plot of unit sales against price]
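Data of this kind are easy to simulate. The sketch below generates demand from a log-log model with an elasticity of -2 and then fits the regression in logs. The elasticity, intercept, noise level, and price range are all hypothetical values chosen for illustration, not the ones behind the slide.

```r
# Simulated constant-elasticity demand (all parameter values hypothetical)
set.seed(3)
n <- 200
price      <- runif(n, 3, 7)
elasticity <- -2
sales <- exp(8 + elasticity * log(price) + rnorm(n, sd = 0.2))

# In levels the relationship is curved; in logs it is linear
fit_levels <- lm(sales ~ price)
fit_loglog <- lm(log(sales) ~ log(price))

# The slope of the log-log regression estimates the elasticity
est_elasticity <- unname(coef(fit_loglog)["log(price)"])
print(est_elasticity)

op <- par(mfrow = c(1, 2))
plot(price, sales, main = "Levels")
plot(log(price), log(sales), main = "Logs")
abline(fit_loglog)
par(op)
```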

(18)

ii. Normal Probability Plots

Use these to test normality of the residuals. Non-normal residuals cause the following sorts of headaches:

– "t-tests" and other associated statistics may no longer be t distributed

– Least squares estimates are extremely sensitive to large $\varepsilon_i$, and it may be possible to improve on least squares

– The linear functional form may be incorrect and various …

(19)

Remember that the salient characteristics of the normal distribution are thin tails and symmetry.

How can we detect departures from normality?

The most basic analysis would be to graph the histogram of the standardized residuals.

Neither of these plots looks particularly symmetric.

(20)

Let’s compute a normal probability plot using the normPlot() function.

[Figure: normal probability plots for samples of N=30 and N=100, with "Quantiles Under Normality" on the horizontal axis; one panel is annotated "A-D p-value= 0.13"]

(21)

The normal probability plot is a plot of the sample CDF on a coordinate system in which the normal CDF appears as a straight line. The sample CDF will appear as a scatter of points around the normal CDF straight line.
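For readers without the course's normPlot() function, base R's `qqnorm()` and `qqline()` draw a comparable plot. This is a stand-in, not the same function, and it does not report the A-D p-value. The samples below are simulated, and `qq_cor()` is a hypothetical helper that summarizes how close the points lie to a straight line.

```r
# Simulated samples (hypothetical): one normal, one right-skewed
set.seed(4)
z_norm <- rnorm(100)
z_skew <- rexp(100) - 1   # exponential shifted to have mean zero

# Base R normal probability plots; qqline() adds the reference line
op <- par(mfrow = c(1, 2))
qqnorm(z_norm, main = "Normal sample")
qqline(z_norm)
qqnorm(z_skew, main = "Skewed sample")
qqline(z_skew)
par(op)

# Hypothetical helper: correlation between sample and normal quantiles,
# close to 1 when the points fall near a straight line
qq_cor <- function(z) {
  q <- qqnorm(z, plot.it = FALSE)
  cor(q$x, q$y)
}
print(c(normal = qq_cor(z_norm), skewed = qq_cor(z_skew)))
```

The normal sample hugs the line, while the skewed sample bends away from it in the tails.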

(22)

d. Putting It All Together: The Shock Absorber Example

Suppliers for very large manufacturing firms are facing increasing pressure to assure their customers that the parts they produce meet high quality standards.

This supplier is supplying gas-filled shock absorbers.

The data are measurements of the rebound force of the shock absorber. Measurements can be taken both before and after the shock absorber is fully assembled. It is cheaper to take measurements of shock absorber performance before, rather than after, assembly.

See dataset shock.

(23)

Basic Model

We must formulate a statistical model to predict rebound force after assembly using the before assembly measurement.

(24)

Descriptive Statistics

These are statistics formed from the marginal distributions of

(25)

Marginal Distribution of Y

Doesn’t look normal! Three clumps!

Can we run a regression even though Y is not normal? YES! The normality assumption applies to the errors, not to the marginal distribution of Y.

[Figure: "Histogram of rebounda" (horizontal axis rebounda, vertical axis Frequency), alongside a normal probability plot]

(26)

Joint or Bivariate Distribution

Let’s do a scatter plot. Which variable should be on the Y axis?

Note three clusters of points. This implies room for improvement in quality!

[Figure: scatter plot of the two rebound measurements, with reboundb on the horizontal axis]

(27)

Regression Analysis

(28)

Residual Diagnostics

Residuals are much more normal than the marginal distribution of Y.

Residuals vs. Fitted looks pretty good.

[Figure: normal probability plot of the residuals (horizontal axis "Quantiles Under Normality", A-D p-value= 0.35) and a plot of residuals vs. fitted values]

(29)

T-tests

Suppose before measurements were “perfect” predictors. What would this mean?

One School of Thought:

All you need is very accurate predictions

(30)

T-tests continued

Let’s test a slight modification:

Since N is relatively small, let’s test at the 10 percent significance level.

Step 1: Compute t critical value

> qt(.05,df=33)
[1] -1.692360

Step 2: Compute t statistic

> t=(.94946-1)/.0438
> t
[1] -1.153881

Step 3: Compute p-value

> pt(-1.153881,df=33)*2
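The three steps can also be collected into one short script that ends with an explicit decision. The numbers are the ones used in the console commands on this slide. The null value of 1 for the slope is inferred from the arithmetic of the t statistic rather than stated on the slide, and the variable names are made up for illustration.

```r
# Values taken from the console commands on the slide
b1    <- 0.94946   # estimated slope
se_b1 <- 0.0438    # standard error of the slope
df    <- 33
alpha <- 0.10      # 10 percent significance level, two-sided

# Null value of 1 is inferred from the slide's t statistic, (b1 - 1) / se
t_stat <- (b1 - 1) / se_b1
t_crit <- qt(1 - alpha / 2, df = df)     # positive critical value
p_val  <- 2 * pt(-abs(t_stat), df = df)  # two-sided p-value

# Reject only if the statistic falls beyond the critical value
reject <- abs(t_stat) > t_crit
print(c(t_stat = t_stat, t_crit = t_crit, p_val = p_val))
print(reject)
```

Because the t statistic is smaller in absolute value than the critical value, the null is not rejected at the 10 percent level, and the p-value is correspondingly larger than 0.10.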

(31)

Prediction

(32)

Glossary of Symbols

$X_f$ - future value of X for forecasting

$Y_f$ - value of Y to be forecasted

(33)

Important Equations

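definition of prediction error

$$e_f = Y_f - \hat{Y}_f = Y_f - b_0 - b_1 X_f$$

decomposition of prediction error

$$e_f = \left(Y_f - E[Y_f \mid X_f]\right) - \left(\hat{Y}_f - E[Y_f \mid X_f]\right)$$

predictive standard error

$$s_{pred} = s\left(1 + \frac{1}{N} + \frac{(X_f - \bar{X})^2}{(N-1)\,s_X^2}\right)^{0.5}$$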

(34)

Glossary of R Commands

• normPlot(variable name): This is a customized …