
SST = SSR, so that r² would equal SSR/SST = 1. Thus, the maximum value for r² is 1, which occurs when the regression is a perfect fit.

What is the minimum value that r² can take? Suppose that the regression showed no improvement at all; that is, suppose that the regression explained none of the variability in y. This would result in SSR equaling zero, and consequently, r² would equal zero as well. Thus, r² is bounded between zero and 1, inclusive. How are we to interpret the value that r² takes? Essentially, the higher the value of r², the better the fit of the regression to the data set. Values of r² near 1 denote an extremely good fit of the regression to the data; values near zero denote an extremely poor fit.
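To make these bounds concrete, here is a minimal sketch (using hypothetical data of our own, not the orienteering example) that fits a least-squares line and computes r² both as SSR/SST and, equivalently, as 1 − SSE/SST; the two forms agree, and the value always lies between 0 and 1.

```python
# A minimal sketch (hypothetical data, not the orienteering example) showing that
# r^2 = SSR/SST = 1 - SSE/SST for a least-squares fit, and that it lies in [0, 1].
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

# Least-squares slope and intercept.
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum((xi - x_bar) ** 2 for xi in x)
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * xi for xi in x]

sst = sum((yi - y_bar) ** 2 for yi in y)               # total variability in y
ssr = sum((fi - y_bar) ** 2 for fi in y_hat)           # variability explained by the line
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # variability left unexplained

print(round(ssr / sst, 4))      # 0.6
print(round(1 - sse / sst, 4))  # 0.6  (same value; bounded between 0 and 1)
```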

A very rough rule of thumb for the interpretation of r² might be to imagine it as a grade received on a very difficult exam. One might be quite pleased to get higher than 90%, happy with higher than 80%, somewhat satisfied with better than 70%, but disappointed with less than 50%. This heuristic must be applied carefully, however, since the interpretation of r² varies from field to field. In the physical sciences, for example, one encounters relationships that elicit very high values of r², whereas in the social sciences one may need to be content with lower values of r², because of person-to-person variability. As usual, the analyst's judgment should be tempered with the domain expert's experience.

STANDARD ERROR OF THE ESTIMATE

We have seen how the r² statistic measures the goodness of fit of the regression to the data set. Next, the s statistic, known as the standard error of the estimate, is a measure of the accuracy of the estimates produced by the regression. Clearly, s is one of the most important statistics to consider when performing a regression analysis. To find the value of s, we first find the mean squared error:

$$\mathrm{MSE} = \frac{\mathrm{SSE}}{n - m - 1}$$

where m indicates the number of predictor variables, which is 1 for the simple linear regression case and greater than 1 for the multiple regression case. Like SSE, MSE represents a measure of the variability in the response variable left unexplained by the regression.

Then the standard error of the estimate is given by

$$s = \sqrt{\mathrm{MSE}} = \sqrt{\frac{\mathrm{SSE}}{n - m - 1}}$$

The value of s provides an estimate of the "typical" residual, much as the value of the standard deviation in univariate analysis provides an estimate of the typical deviation. In other words, s is a measure of the typical error in estimation, the typical difference between the predicted response value and the actual response value. In this way, the standard error of the estimate s represents the precision of the predictions generated by the estimated regression equation. Smaller values of s are better.
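As a small illustration (a sketch of our own, not from the text), the two formulas above translate directly into a short routine that computes s from a list of residuals:

```python
import math

# Sketch: standard error of the estimate s = sqrt(SSE / (n - m - 1)),
# where each residual is (actual - predicted) and m is the number of
# predictor variables (1 for simple linear regression).
def standard_error_of_estimate(residuals, m=1):
    n = len(residuals)
    sse = sum(e ** 2 for e in residuals)   # sum of squared errors
    mse = sse / (n - m - 1)                # mean squared error
    return math.sqrt(mse)                  # s = sqrt(MSE)
```

For the orienteering residuals discussed next (SSE = 12, n = 10, m = 1), this gives √(12/8) ≈ 1.22, in agreement with the value of 1.2 reported below.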


For the orienteering example, we have

$$s = \sqrt{\mathrm{MSE}} = \sqrt{\frac{12}{10 - 1 - 1}} \approx 1.2$$

Thus, the typical estimation error when using the regression model to predict distance is 1.2 kilometers. That is, if we are told how long a hiker has been traveling, our estimate of the distance covered will typically differ from the actual distance by about 1.2 kilometers. Note from Table 2.3 that all of the residuals lie between zero and 2 in absolute value, so that 1.2 may be considered a reasonable estimate of the typical residual. (Other measures, such as the mean absolute deviation of the residuals, may also be considered but are not widely reported in commercial software packages.)

We may compare s = 1.2 kilometers against the typical estimation error obtained by ignoring the predictor data, namely the standard deviation of the response,

$$\mathrm{SD}_y = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}} = 5.0$$

The typical prediction error when ignoring the time data is 5 kilometers. Using the regression has reduced the typical prediction error from 5 kilometers to 1.2 kilometers.

In the absence of software, one may use the following computational formulas for calculating the values of SST and SSR. The formula for SSR is exactly the same as for the slope b₁ except that the numerator is squared.

$$\mathrm{SST} = \sum y^2 - \frac{\left(\sum y\right)^2}{n} \qquad\qquad \mathrm{SSR} = \frac{\left[\sum xy - \left(\sum x\right)\left(\sum y\right)/n\right]^2}{\sum x^2 - \left(\sum x\right)^2/n}$$

Let us use these formulas to find the values of SST and SSR for the orienteering example. You should verify that we have Σx = 50, Σy = 160, Σxy = 908, Σx² = 304, and Σy² = 2788. Then

$$\mathrm{SST} = \sum y^2 - \frac{\left(\sum y\right)^2}{n} = 2788 - \frac{(160)^2}{10} = 2788 - 2560 = 228$$

and

$$\mathrm{SSR} = \frac{\left[\sum xy - \left(\sum x\right)\left(\sum y\right)/n\right]^2}{\sum x^2 - \left(\sum x\right)^2/n} = \frac{\left[908 - (50)(160)/10\right]^2}{304 - (50)^2/10} = \frac{108^2}{54} = 216$$

Of course, these are the same values that we found earlier using the more onerous tabular method. Finally, we calculate the value of the coefficient of determination r² to be

$$r^2 = \frac{\mathrm{SSR}}{\mathrm{SST}} = \frac{216}{228} = 0.9474$$

In other words, the linear relationship between time and distance accounts for 94.74% of the variability in the distances traveled. The regression model fits the data very nicely.
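As a quick check of the arithmetic (again a sketch of our own, using only the summary totals quoted above), the computational formulas can be evaluated in a few lines:

```python
import math

# Orienteering summary totals quoted above (n = 10 hikers).
n, sum_x, sum_y = 10, 50, 160
sum_xy, sum_x2, sum_y2 = 908, 304, 2788

sst = sum_y2 - sum_y ** 2 / n                                         # 228.0
ssr = (sum_xy - sum_x * sum_y / n) ** 2 / (sum_x2 - sum_x ** 2 / n)   # 216.0
sse = sst - ssr                                                       # 12.0

r_squared = ssr / sst               # SSR/SST
s = math.sqrt(sse / (n - 1 - 1))    # standard error of the estimate
sd_y = math.sqrt(sst / (n - 1))     # standard deviation of the response

print(round(sst), round(ssr), round(r_squared, 4), round(s, 2), round(sd_y, 1))
# 228 216 0.9474 1.22 5.0
```

The same five summary quantities reproduce every figure in this section: SST = 228, SSR = 216, r² = 0.9474, s ≈ 1.2, and SDᵧ ≈ 5.0.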