Box 5.6 Model comparisons in simple linear regression We can use the model relating CWD basal area to riparian tree density to illustrate
5.3.9 Regression diagnostics
So far we have emphasized the underlying assumptions for estimation and hypothesis testing with the linear regression model and pro- vided some guidelines on how to check whether these assumptions are met for a given bivariate data set. A proper interpretation of a linear regres- sion analysis should also include checks of how well the model fits the observed data. We will focus on two aspects in this section. First, is a straight-line model appropriate or should we investigate curvilinear models? Second, are there any unusual observations that might be outliers and could have undue influence on the parameter estimates and the fitted regression model? Influence can come from at least two sources – think of a regression line as a see-saw, balanced on the mean of X. An observation can influence, or tip, the regression line more easily if it is further from the mean (i.e. at the ends of the range of
X-values) or if it is far from the fitted regression line (i.e. has a large residual, analogous to a heavy person on the see-saw). We emphasized in Chapter 4 that it is really important to identify if the conclusions from any statistical analysis are influ- enced greatly by one or a few extreme observa- tions. A variety of “diagnostic measures” can be calculated as part of the analysis that identify extreme or influential points and detect nonline- arity. These diagnostics also provide additional ways of checking the underlying assumptions of normality, homogeneity of variance and indepen-
dence. We will illustrate some of the more common regression diagnostics that are standard outputs from most statistical software but others are available. Belsley et al. (1980) and Cook & Weisberg (1982) are the standard references, and other good discussions and illustrations include Bollen & Jackman (1990), Chatterjee & Price (1991) and Neter et al. (1996).
Leverage
Leverage is a measure of how extreme an observa- tion is for the X-variable, so an observation with high leverage is an outlier in the X-space (Figure 5.8). Leverage basically measures how much each
xiinfluences yˆi(Neter et al. 1996). X-values further from x¯influence the predicted Y-values more than those close to x¯. Leverage is often given the symbol
hibecause the values for each observation come from a matrix termed the hat matrix (H) that relates the yito the yˆi(see Box 6.1). The hat matrix is determined solely from the X-variable(s) so Y doesn’t enter into the calculation of leverage at all. Leverage values normally range between 1/n and 1 and a useful criterion is that any observa- tion with a leverage value greater than 2( p/n) (where p is the number of parameters in the model including the intercept; p⫽2 for simple linear regression) should be checked (Hoaglin & Welsch 1978). Statistical software may use other criteria for warning about observations with high leverage. The main use of leverage values is when they are incorporated in Cook’s Di statistic, a measure of influence described below.
Residuals
We indicated in Section 5.3.8 that patterns in residuals are an important way of checking regression assumptions and we will expand on this in Section 5.3.10. One problem with sample residuals is that their variance may not be con- stant for different xi, in contrast to the model error terms that we assume do have constant var- iance. If we could modify the residuals so they had constant variance, we could more validly compare residuals to one another and check if any seemed unusually large, suggesting an outlying observa- tion from the fitted model. There are a number of modifications that try to make residuals more useful for detecting outliers (Table 5.4).
Standardized residuals use the 兹MSResidual as
an approximate standard error for the residuals. These are also called semistudentized residuals by Neter et al. (1996). Unfortunately, this standard error doesn’t solve the problem of the variances of the residuals not being constant so a more sophis- ticated modification is needed. Studentized resid- uals incorporate leverage (hi) as defined earlier. These studentized residuals do have constant var- iance so different studentized residuals can be validly compared. Large (studentized) residuals for a particular observation indicate that it is an outlier from the fitted model compared to the other observations. Studentized residuals also follow a t distribution with (n⫺1) df if the regres- sion assumptions hold. We can determine the probability of getting a specific studentized resid- ual, or one more extreme, by comparing the stu- dentized residual to a t distribution. Note that we would usually test all residuals in this way, which will result in very high family-wise Type I error rates (the multiple testing problem; see Chapter 3) so some type of P value adjustment might be required, e.g. sequential Bonferroni.
The deleted residual for observation i, also called the PRESS residual, is defined as the differ- ence between the observed Y-values and those pre- dicted by the regression model fitted to all the observations except i. These deleted residuals are usually calculated for studentized residuals. These studentized deleted residuals can detect outliers that might be missed by usual residuals (Neter et al. 1996). They can also be compared to a
tdistribution as we described above for the usual studentized residual.
Influence
A measure of the influence each observation has on the fitted regression line and the estimates of the regression parameters is Cook’s distance statistic, denoted Di. It takes into account both the size of leverage and the residual for each observation and basically measures the influence of each observa- tion on the estimate of the regression slope (Figure 5.8). A large Diindicates that removal of that obser- vation would change the estimates of the regres- sion parameters considerably. Cook’s Dican be used in two ways. First, informally by scanning the Dis of all observations and noting if any values are much larger than the rest. Second, by comparing Dito an
F1,ndistribution; an approximate guideline is that LINEAR REGRESSION ANALYSIS 95
an observation with a Digreater than one is partic- ularly influential (Bollen & Jackman 1990). An alternative measure of influence that also incorpo- rates both the size of leverage and the residual for each observation is DFITSi, which measures the influence of each observation (i) on its predicted value ( yˆi).
We illustrate leverage and influence in Figure 5.8. Note that observations one and three have large leverage and observations two and three have large residuals. However, only observation three is very influential, because omitting obser- vations one or two would not change the fitted regression line much.
Transformations of Y that overcome problems of non-normality or heterogeneity of variance might also reduce the influence of outliers from the fitted model. If not, then the strategies for dealing with outliers discussed in Chapter 4 should be considered.