

4.6 Computing residuals and predicted values

4.6.1 Computing interval predictions

Besides predicted values and least-squares residuals, we may want to use predict to compute other observation-specific quantities from the fitted model.40 I will discuss some of the more specialized quantities available after regress in the next section. First, I discuss how predict may provide interval predictions to complement the point predictions.

The interval prediction is simply a confidence interval for the prediction.

There are two commonly used definitions of “prediction”: the predicted value and the forecast. The predicted value estimates the average value of the dependent variable for given values of the regressors. The forecast estimates the value of the dependent variable for a single observation with a given set of regressor values. The mechanics of OLS imply that the two point estimates are the same, but the variances of the predicted value and the forecast are different. As is intuitive, the variance of the forecast is higher than the variance of the predicted value.

Given regressor values $x_0$, the predicted value is $\hat{E}[y \mid x_0] = \hat{y}_0 = x_0\hat{\beta}$

A consistent estimator of the variance of the predicted value is

$\hat{V}_p = s^2 x_0 (X'X)^{-1} x_0'$

predict computes the standard error of prediction, the square root of $\hat{V}_p$, for each observation when the stdp option is specified.

Given the regressor values $x_0$, the forecast error for a particular $y_0$ is

$\hat{e}_0 = y_0 - \hat{y}_0 = x_0\beta + u_0 - \hat{y}_0$

The zero covariance between $u_0$ and $\hat{\beta}$ implies41 that $\mathrm{Var}[\hat{e}_0] = \mathrm{Var}[\hat{y}_0] + \mathrm{Var}[u_0]$, for which

$\hat{V}_f = s^2 x_0 (X'X)^{-1} x_0' + s^2$

is a consistent estimator. predict computes the standard error of forecast, the square root of $\hat{V}_f$, for each observation when the stdf option is specified. As one would expect, the variance of the forecast is higher than the variance of the predicted value.
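The two variance estimators above can be checked numerically against what predict returns. The following is a minimal sketch, not the author's code: it assumes the default (nonrobust) VCE, so that e(V) equals s² times the inverse of X'X, and it assumes the dataset URL used elsewhere in this section is reachable. The variable names sep, sef, vp, vf, dif_p, and dif_f are hypothetical.

```stata
* Sketch: compare the variance formulas with predict's stdp and stdf
use http://www.stata-press.com/data/imeus/hprice2a, clear
quietly regress lprice lnox if _n<=100

predict double sep if e(sample), stdp   // standard error of prediction
predict double sef if e(sample), stdf   // standard error of forecast

* Assumption: default VCE, so e(V) = s^2 (X'X)^-1, ordered lnox, _cons
matrix V = e(V)
* x0 V x0' written out for x0 = (lnox, 1)
generate double vp = V[1,1]*lnox^2 + 2*V[1,2]*lnox + V[2,2] if e(sample)

* e(rmse)^2 is s^2, the estimated error variance
generate double vf = vp + e(rmse)^2 if e(sample)

* Relative differences; both should be zero up to rounding error
generate double dif_p = reldif(sep^2, vp) if e(sample)
generate double dif_f = reldif(sef^2, vf) if e(sample)
summarize dif_p dif_f
```

Because s² is added to every observation's variance, sef exceeds sep everywhere in the estimation sample, which is the ordering the text describes.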

40 Documentation for the capabilities of predict after regress is presented in [R] regress postestimation.

41 See Wooldridge (2006) for a discussion of this point.

An interval prediction consists of a lower and an upper bound that contain the true value with a given probability in repeated samples.42 Here I present a method for finding the bounds for the forecast. Given that the standardized prediction error has an approximate Student t distribution, the interval prediction begins by choosing bounds that enclose it with probability 1 − α:

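$\Pr\left( -t_{1-\alpha/2} < \frac{y_0 - \hat{y}_0}{\sqrt{\mathrm{Var}[\hat{e}_0]}} < t_{1-\alpha/2} \right) = 1 - \alpha$

where α is the significance level43 and $t_{1-\alpha/2}$ is the inverse of the Student t distribution function evaluated at 1 − α/2. Standard manipulations of this condition yield

$\Pr\left( \hat{y}_0 - t_{1-\alpha/2}\sqrt{\mathrm{Var}[\hat{e}_0]} < y_0 < \hat{y}_0 + t_{1-\alpha/2}\sqrt{\mathrm{Var}[\hat{e}_0]} \right) = 1 - \alpha$

Plugging in our consistent estimators yields the bounds

$\hat{y}_0 \pm t_{1-\alpha/2}\sqrt{\hat{V}_f}$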

Substituting $E[y \mid x_0]$ for $y_0$ and using the variance of the predicted value presented in the text yields a prediction interval for the predicted value $\hat{y}_0$.

The variance of the predicted value increases as we consider an x value farther from the mean of the estimation sample. The interval predictions for the predicted value lie on a pair of hyperbolic curves, with the narrowest interval at $\bar{x}$ and widening as we move away from the sample point of means. To compute this confidence interval, we use predict’s stdp option (see [R] regress postestimation). An appropriate confidence interval may be constructed as $\hat{y} \pm t \times$ stdp, where t would be 1.96 for a 95% confidence interval for a sample with a large N. You may then construct two more variables to hold the lower-limit and upper-limit values and graph the point and interval predictions.
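As a small illustration of the remark about 1.96, the sketch below compares t critical values for a 95% interval at two degrees-of-freedom settings with the standard normal value. The degrees of freedom are illustrative assumptions: 98 corresponds to 100 observations and two estimated coefficients, and 1000 stands in for a large sample.

```stata
* Sketch: t critical values for a 95% interval versus the normal value
display invttail(98, 0.025)     // 100 observations, 2 estimated coefficients
display invttail(1000, 0.025)   // a larger sample
display invnormal(0.975)        // large-sample limit, about 1.96
```

With a sample of modest size the t value is noticeably above 1.96, so using the normal value would make the interval slightly too narrow.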

We consider a bivariate regression of log median housing price (lprice) on lnox. For illustration, we fit the model to only 100 of the 506 communities in the dataset. The two predict commands generate the predicted values of lprice as xb and the standard error of prediction as stdpred, respectively:

. use http://www.stata-press.com/data/imeus/hprice2a, clear
(Housing price data for Boston-area communities)

. quietly regress lprice lnox if _n<=100

42 See Wooldridge (2006, section 6.4) for more about forming and interpreting interval predictions.

43 Loosely speaking, the significance level is the error rate that we are willing to tolerate in repeated samples. The often-chosen significance level of 5% yields a 95% confidence interval.

. predict double xb if e(sample)
(option xb assumed; fitted values)
(406 missing values generated)

. predict double stdpred if e(sample), stdp
(406 missing values generated)

To calculate the prediction interval, we use the invttail() function to generate the correct t-value for the sample size and a 95% prediction interval, storing it as a scalar. The variables uplim and lowlim can then be computed:

. scalar tval = invttail(e(df_r), 0.025)

. generate double uplim = xb + tval * stdpred
(406 missing values generated)

. generate double lowlim = xb - tval * stdpred
(406 missing values generated)

Finally, we want to highlight the mean value of lnox (calculated by the summarize command, storing that value as local macro lnoxbar) and label the variables appropriately for the graph:

. summarize lnox if e(sample), meanonly

. local lnoxbar = r(mean)

. label var xb "Pred"

. label var uplim "95% prediction interval"

. label var lowlim "95% prediction interval"

We may now generate the figure by using three graph twoway types: scatter for the scatterplot, connected for the predicted values, and rline for the prediction interval limits:44,45

. twoway (scatter lprice lnox if e(sample),
> sort ms(Oh) xline(`lnoxbar'))
> (connected xb lnox if e(sample), sort msize(small))
> (rline uplim lowlim lnox if e(sample), sort),
> ytitle(Actual and predicted log price) legend(cols(3))

Figure 4.3 plots the actual values of the response variable together with their point and interval predictions. The prediction interval is narrowest at the mean value of the regressor. The vertical line marks the sample mean of the lnox observations used in the regression, which was stored in the local macro lnoxbar.

44 For an introduction to Stata graphics, please see [G] graph intro and help graph intro. For an in-depth presentation of Stata’s graphics capabilities, please see A Visual Guide to Stata Graphics (Mitchell 2004).

45 See [G] graph twoway lfitci or help graph twoway lfitci for another way to plot regression lines and prediction intervals.

[Figure: scatterplot of log(price) against log(nox) with fitted values and the uplim/lowlim interval limits; y axis: Actual and predicted log price; x axis: log(nox)]

Figure 4.3: Point and interval predictions from bivariate regression
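A similar picture can be drawn in one step with twoway lfitci, which fits the bivariate regression and draws the confidence band itself. This is a sketch of that alternative rather than a reproduction of Figure 4.3: it restricts the plots with in 1/100, which selects the same first 100 observations as _n<=100 provided the data have not been re-sorted, and it does not draw the vertical line at the mean of lnox.

```stata
* Sketch: fitted line with a confidence band from the standard error of prediction
use http://www.stata-press.com/data/imeus/hprice2a, clear
twoway (lfitci lprice lnox in 1/100, stdp)      ///
       (scatter lprice lnox in 1/100, ms(Oh)),  ///
       ytitle(Actual and predicted log price)
```

Replacing the stdp option with stdf would draw the wider band based on the standard error of forecast.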

We can compute residuals and predicted values of the dependent variable from the data and the regression point estimates. The residuals and in-sample predictions are used to assess how well the model explains the dependent variable. The goal of some studies is to obtain out-of-sample predictions for the dependent variable, using either actual or hypothetical values for the regressors; in other cases, out-of-sample predictions are used to evaluate a model’s usefulness. For example, we may apply the estimated coefficients to a separate sample (e.g., the Springfield-area communities rather than the Boston-area communities) to evaluate the model’s out-of-sample applicability. If a regression model is well specified, it should generate reasonable predictions for any sample from the population. If out-of-sample predictions are poor, the model’s specification may be too specific to the original sample.
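One simple way to act on this idea is to compare prediction errors inside and outside the estimation sample. The sketch below is an illustration under stated assumptions: since no Springfield-area data are available here, the 406 Boston-area communities excluded from estimation stand in for the separate sample, and the variable names yhat and err2 are hypothetical.

```stata
* Sketch: in-sample versus out-of-sample prediction error
use http://www.stata-press.com/data/imeus/hprice2a, clear
quietly regress lprice lnox if _n<=100

predict double yhat                  // fitted values for every observation
generate double err2 = (lprice - yhat)^2

* Root mean squared prediction error (dividing by the number of observations)
quietly summarize err2 if e(sample), meanonly
display "in-sample RMSE:     " sqrt(r(mean))
quietly summarize err2 if !e(sample), meanonly
display "out-of-sample RMSE: " sqrt(r(mean))
```

The in-sample figure divides by the number of observations rather than the residual degrees of freedom, so it is slightly smaller than e(rmse). An out-of-sample value much larger than the in-sample value would suggest that the specification is too closely tailored to the estimation sample.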

A prediction interval for the forecast may be computed with predict’s stdf (standard error of forecast) option (see [R] regress postestimation).

Unlike stdp, which yields an interval around the expected value of y for a given set of X values (in or out of sample), stdf accounts for the additional uncertainty associated with the prediction of one y value (i.e., $\sigma_u^2$). We can use a confidence interval formed with stdf to evaluate an out-of-sample data point, $y_0$, and formally test whether it could have been generated by the process generating the fitted model. The null hypothesis for that test implies that the data point should lie within the interval $\hat{y}_0 \pm t \times$ stdf.
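The check described above can be sketched as follows. This is a hedged illustration, not the author's code: it treats the 406 communities left out of the regression as the out-of-sample data points, and the names yhat, sef, flow, fhigh, and inside are hypothetical.

```stata
* Sketch: do held-out observations fall inside their 95% forecast intervals?
use http://www.stata-press.com/data/imeus/hprice2a, clear
quietly regress lprice lnox if _n<=100

* Without an if qualifier, predict fills in all observations
predict double yhat
predict double sef, stdf

scalar tval = invttail(e(df_r), 0.025)
generate double flow  = yhat - tval*sef
generate double fhigh = yhat + tval*sef

* Flag held-out observations lying within the interval
generate byte inside = inrange(lprice, flow, fhigh) if !e(sample)
tabulate inside
```

If the held-out communities were generated by the same process as the estimation sample, roughly 95% of them would be expected to fall inside their intervals; a much lower share would cast doubt on that hypothesis.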