VERIFYING THE REGRESSION ASSUMPTIONS

Conﬁdence Interval for the Slope of the Regression Line


[Figure 2.13: four panels, (a) through (d), each plotting the residual (vertical axis) against the fitted value ŷ (horizontal axis).]

Figure 2.13 Four possible patterns in the plot of residuals versus fits.

The regression line from Figure 2.2 is now the horizontal zero line in Figure 2.12. Points that were either above/below/on the regression line in Figure 2.2 now lie either above/below/on the horizontal zero line in Figure 2.12.

We evaluate the validity of the regression assumptions by checking whether the plot of the residuals versus fits shows certain patterns, in which case one of the assumptions has been violated, or shows no discernible pattern, in which case the assumptions remain intact. The 10 data points in Figure 2.12 are really too few for us to determine whether any patterns exist. In data mining applications, of course, paucity of data is rarely the issue.

Let us see what types of patterns we should watch out for. Figure 2.13 shows four pattern “archetypes” that may be observed in residual-fit plots. Plot (a) shows a “healthy” plot, where no noticeable patterns are observed and the points display an essentially rectangular shape from left to right. Plot (b) exhibits curvature, which violates the independence assumption. Plot (c) displays a “funnel” pattern, which violates the constant-variance assumption. Finally, plot (d) exhibits a pattern that increases from left to right, which violates the zero-mean assumption.

Why does plot (b) violate the independence assumption? Because the errors are assumed to be independent, the residuals (which estimate the errors) should exhibit independent behavior as well. However, if the residuals form a curved pattern, then, for a given residual, we may predict where its neighbors to the left and right will fall, within a certain margin of error. If the residuals were truly independent, such a prediction would not be possible.

Why does plot (c) violate the constant-variance assumption? Note from plot (a) that the variability in the residuals, as shown by the vertical distance, is fairly constant regardless of the value of x. On the other hand, in plot (c), the variability of the residuals is smaller for smaller values of x and larger for larger values of x. Therefore, the variability is nonconstant, which violates the constant-variance assumption.

Why does plot (d) violate the zero-mean assumption? The zero-mean assumption states that the mean of the error term is zero regardless of the value of x. However, plot (d) shows that for small values of x, the mean of the residuals is less than zero, whereas for large values of x, the mean of the residuals is greater than zero. This is a violation of the zero-mean assumption as well as a violation of the independence assumption.
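The reasoning in the last few paragraphs can be checked numerically as well as by eye. The following is a minimal sketch, not a procedure taken from the text: it simulates residuals directly (rather than fitting a model) so that they resemble plots (a), (c), and (d), and then compares the mean and standard deviation of the residuals in the lower and upper halves of the fitted values. The helper name half_summaries, the split at the median fit, and all simulation settings are assumptions made for illustration.

```python
import random
import statistics

def half_summaries(fits, residuals):
    """Split the points at the median fitted value and return
    (mean, stdev) of the residuals in the lower and upper halves."""
    pairs = sorted(zip(fits, residuals))
    mid = len(pairs) // 2
    low = [r for _, r in pairs[:mid]]
    high = [r for _, r in pairs[mid:]]
    return ((statistics.mean(low), statistics.stdev(low)),
            (statistics.mean(high), statistics.stdev(high)))

rng = random.Random(0)                      # fixed seed for repeatability
fits = [i / 10 for i in range(1, 201)]      # hypothetical fitted values, 0.1 to 20.0

# Simulated residuals resembling three of the archetypes in Figure 2.13.
healthy = [rng.gauss(0, 1) for _ in fits]                   # plot (a): constant spread, mean zero
funnel = [rng.gauss(0, 0.2 * f) for f in fits]              # plot (c): spread grows with the fit
trend = [0.3 * (f - 10) + rng.gauss(0, 1) for f in fits]    # plot (d): mean rises left to right

for name, res in [("healthy", healthy), ("funnel", funnel), ("trend", trend)]:
    (m_lo, s_lo), (m_hi, s_hi) = half_summaries(fits, res)
    print(f"{name:8s} low: mean={m_lo:+.2f} sd={s_lo:.2f}   "
          f"high: mean={m_hi:+.2f} sd={s_hi:.2f}")
```

For the healthy residuals, both halves should show a mean near zero and a similar spread. For the funnel pattern, the spread in the upper half is clearly larger, and for the increasing pattern, the mean is negative in the lower half and positive in the upper half. These are the same comparisons we make visually in plots (a), (c), and (d).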

Apart from these graphical methods, there are several diagnostic hypothesis tests that may be carried out to assess the validity of the regression assumptions. As mentioned above, the Anderson–Darling test may be used to indicate the fit of residuals to a normal distribution. For assessing whether the constant-variance assumption has been violated, either Bartlett’s or Levene’s test may be used. For determining whether the independence assumption has been violated, either the Durbin–Watson or runs test may be used. Information about all these diagnostic tests may be found in Draper and Smith [3].
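To give a feel for one of these tests, here is a minimal sketch of the Durbin–Watson statistic, which is the sum of squared differences between successive residuals divided by the sum of squared residuals. Values near 2 suggest no first-order correlation between neighboring residuals, values near 0 suggest positive correlation, and values near 4 suggest negative correlation. The function name and the example residuals are assumptions made for illustration. The residuals must be supplied in a meaningful order (for example, sorted by fitted value), and a formal test would also require comparing the statistic with tabulated critical values, which this sketch omits.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: sum of squared successive differences
    divided by the sum of squared residuals."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2
              for t in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den

# Hypothetical residuals ordered by fitted value, following a curved
# pattern like plot (b): a parabola centered so that its mean is zero.
raw = [(i - 10) ** 2 for i in range(21)]
mean_raw = sum(raw) / len(raw)
curved = [v - mean_raw for v in raw]

print(durbin_watson(curved))         # well below 2: neighbors resemble each other
print(durbin_watson([1, -1, 1, -1])) # above 2: neighbors alternate in sign
```

The curved residuals give a statistic far below 2, which matches the argument above: when the residuals trace out a curve, each residual tells us roughly where its neighbors will fall.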

If the normal probability plot shows no systematic deviations from linearity, and the residuals–fits plot shows no discernible patterns, we may conclude that there is no graphical evidence for the violation of the regression assumptions, and we may then proceed with the regression analysis. However, what do we do if these graphs indicate violations of the assumptions? For example, suppose that our plot of the residuals versus fits looked something like plot (c) in Figure 2.13, indicating nonconstant variance. Then we may apply a transformation to the response variable y, such as the ln (natural log, log to the base e) transformation.
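The effect of the ln transformation can be sketched on simulated data. In the example below, which is an illustration under stated assumptions rather than an analysis from the text, the response is generated with a multiplicative error, so that its spread grows with its level. We fit a least-squares line to y and to ln y, and in each case compare the residual standard deviation in the upper half of the fits with that in the lower half. The function names, coefficients, and noise level are all hypothetical. Because the simulated relationship is exponential, the untransformed fit also shows curvature, which the transformation removes along with the funnel.

```python
import math
import random
import statistics

def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1)."""
    x_bar, y_bar = statistics.mean(x), statistics.mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

def spread_ratio(x, y):
    """Residual stdev in the upper half of the fitted values divided by
    the residual stdev in the lower half (about 1 for constant variance)."""
    b0, b1 = fit_line(x, y)
    pairs = sorted((b0 + b1 * xi, yi - (b0 + b1 * xi)) for xi, yi in zip(x, y))
    mid = len(pairs) // 2
    low = [r for _, r in pairs[:mid]]
    high = [r for _, r in pairs[mid:]]
    return statistics.stdev(high) / statistics.stdev(low)

rng = random.Random(1)                     # fixed seed for repeatability
x = [i / 20 for i in range(1, 201)]        # hypothetical predictor, 0.05 to 10.0

# Assumed data-generating process: multiplicative error on the response.
y = [math.exp(0.5 + 0.3 * xi + rng.gauss(0, 0.3)) for xi in x]
ln_y = [math.log(yi) for yi in y]

ratio_raw = spread_ratio(x, y)             # funnel: noticeably greater than 1
ratio_log = spread_ratio(x, ln_y)          # after ln: close to 1
print(f"spread ratio, y:    {ratio_raw:.2f}")
print(f"spread ratio, ln y: {ratio_log:.2f}")
```

A ratio well above 1 for y corresponds to the funnel in plot (c), while a ratio near 1 for ln y corresponds to the rectangular band in plot (a).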

EXAMPLE: BASEBALL DATA SET

To illustrate the use of transformations, we turn to the baseball data set, a collection of batting statistics for 331 baseball players who played in the American League in 2002. In this case we are interested in whether there is a relationship between batting average and the number of home runs that a player hits. Some fans might argue, for example, that those who hit lots of home runs also tend to make a lot of strikeouts, so that their batting average is lower. Let’s check it out using a regression of the number of home runs against the player’s batting average (hits divided by at bats).

Because baseball batting averages tend to be highly variable for low numbers of at bats, we restrict our data set to those players who had at least 100 at bats for the 2002 season. This leaves us with 209 players. A scatter plot of home runs versus batting average is shown in Figure 2.14. The scatter plot indicates that there may be a positive


In document Data Mining Methods And Models Larose DT (2006) pdf (Page 85-87)