5.3.9 Regression diagnostics

So far we have emphasized the underlying assumptions for estimation and hypothesis testing with the linear regression model and provided some guidelines on how to check whether these assumptions are met for a given bivariate data set. A proper interpretation of a linear regression analysis should also include checks of how well the model fits the observed data. We will focus on two aspects in this section. First, is a straight-line model appropriate or should we investigate curvilinear models? Second, are there any unusual observations that might be outliers and could have undue influence on the parameter estimates and the fitted regression model? Influence can come from at least two sources – think of a regression line as a see-saw, balanced on the mean of X. An observation can influence, or tip, the regression line more easily if it is further from the mean (i.e. at the ends of the range of X-values) or if it is far from the fitted regression line (i.e. has a large residual, analogous to a heavy person on the see-saw).

We emphasized in Chapter 4 that it is really important to identify if the conclusions from any statistical analysis are influenced greatly by one or a few extreme observations. A variety of “diagnostic measures” can be calculated as part of the analysis that identify extreme or influential points and detect nonlinearity. These diagnostics also provide additional ways of checking the underlying assumptions of normality, homogeneity of variance and independence. We will illustrate some of the more common regression diagnostics that are standard outputs from most statistical software but others are available. Belsley et al. (1980) and Cook & Weisberg (1982) are the standard references, and other good discussions and illustrations include Bollen & Jackman (1990), Chatterjee & Price (1991) and Neter et al. (1996).

Leverage

Leverage is a measure of how extreme an observation is for the X-variable, so an observation with high leverage is an outlier in the X-space (Figure 5.8). Leverage basically measures how much each x_i influences ŷ_i (Neter et al. 1996). X-values further from x̄ influence the predicted Y-values more than those close to x̄. Leverage is often given the symbol h_i because the values for each observation come from a matrix termed the hat matrix (H) that relates the y_i to the ŷ_i (see Box 6.1). The hat matrix is determined solely from the X-variable(s) so Y doesn’t enter into the calculation of leverage at all. Leverage values normally range between 1/n and 1 and a useful criterion is that any observation with a leverage value greater than 2(p/n) (where p is the number of parameters in the model including the intercept; p = 2 for simple linear regression) should be checked (Hoaglin & Welsch 1978). Statistical software may use other criteria for warning about observations with high leverage. The main use of leverage values is when they are incorporated in Cook’s D_i statistic, a measure of influence described below.
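As a minimal sketch of the leverage calculation, the following Python code uses the well-known closed form of the hat-matrix diagonal for a single predictor plus an intercept, h_i = 1/n + (x_i − x̄)² / Σ(x_j − x̄)², and applies the 2(p/n) criterion. The function name `leverage` and the X-values are hypothetical, chosen only for illustration.

```python
def leverage(x):
    """Leverage h_i for each observation in a simple linear regression.

    Closed form of the hat-matrix diagonal for one predictor plus an
    intercept. Only the X-values are used; Y never appears.
    """
    n = len(x)
    x_bar = sum(x) / n
    ss_x = sum((xi - x_bar) ** 2 for xi in x)
    return [1 / n + (xi - x_bar) ** 2 / ss_x for xi in x]


# Hypothetical X-values: five clustered observations and one far from the mean.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 15.0]
h = leverage(x)

p = 2                    # parameters in the model: intercept and slope
cutoff = 2 * p / len(x)  # the 2(p/n) criterion
flagged = [i for i, hi in enumerate(h) if hi > cutoff]

for i, hi in enumerate(h):
    note = "  <- check" if i in flagged else ""
    print(f"obs {i + 1}: x = {x[i]:5.1f}, h = {hi:.3f}{note}")
```

In this made-up data set only the observation at x = 15 exceeds the cutoff. The leverage values sum to p, which is a convenient sanity check on any implementation.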

Residuals

We indicated in Section 5.3.8 that patterns in residuals are an important way of checking regression assumptions and we will expand on this in Section 5.3.10. One problem with sample residuals is that their variance may not be constant for different x_i, in contrast to the model error terms that we assume do have constant variance. If we could modify the residuals so they had constant variance, we could more validly compare residuals to one another and check if any seemed unusually large, suggesting an outlying observation from the fitted model. There are a number of modifications that try to make residuals more useful for detecting outliers (Table 5.4).

Standardized residuals use the √MS_Residual as an approximate standard error for the residuals. These are also called semistudentized residuals by Neter et al. (1996). Unfortunately, this standard error doesn’t solve the problem of the variances of the residuals not being constant so a more sophisticated modification is needed. Studentized residuals incorporate leverage (h_i) as defined earlier. These studentized residuals do have constant variance so different studentized residuals can be validly compared. Large (studentized) residuals for a particular observation indicate that it is an outlier from the fitted model compared to the other observations. Studentized residuals also follow a t distribution with (n − 1) df if the regression assumptions hold. We can determine the probability of getting a specific studentized residual, or one more extreme, by comparing the studentized residual to a t distribution. Note that we would usually test all residuals in this way, which will result in very high family-wise Type I error rates (the multiple testing problem; see Chapter 3) so some type of P value adjustment might be required, e.g. sequential Bonferroni.
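The two scalings can be sketched as follows, assuming the usual textbook definitions: standardized residual e_i / √MS_Residual and studentized residual e_i / √(MS_Residual (1 − h_i)). The notation may differ from Table 5.4, which is not reproduced here. The function names and the data are hypothetical.

```python
import math


def fit_line(x, y):
    """Least-squares intercept and slope for y = b0 + b1 * x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    ss_x = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / ss_x
    return y_bar - b1 * x_bar, b1


def scaled_residuals(x, y):
    """Return raw, standardized and studentized residuals, plus leverage."""
    n, p = len(x), 2
    b0, b1 = fit_line(x, y)
    e = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    ms_res = sum(ei ** 2 for ei in e) / (n - p)
    x_bar = sum(x) / n
    ss_x = sum((xi - x_bar) ** 2 for xi in x)
    h = [1 / n + (xi - x_bar) ** 2 / ss_x for xi in x]
    # Same divisor for every residual: ignores the unequal variances.
    standardized = [ei / math.sqrt(ms_res) for ei in e]
    # Divisor shrinks as leverage grows, equalizing the variances.
    studentized = [ei / math.sqrt(ms_res * (1 - hi)) for ei, hi in zip(e, h)]
    return e, standardized, studentized, h


# Hypothetical data; the last observation sits well above the trend.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 21.0]
e, standardized, studentized, h = scaled_residuals(x, y)
for i in range(len(x)):
    print(f"obs {i + 1}: e = {e[i]:6.2f}, standardized = {standardized[i]:6.2f}, "
          f"studentized = {studentized[i]:6.2f}")
```

Because 1 − h_i is less than one, a studentized residual is always at least as large in absolute value as the corresponding standardized residual, and the gap is widest at the ends of the X-range where leverage is highest.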

The deleted residual for observation i, also called the PRESS residual, is defined as the difference between the observed Y-values and those predicted by the regression model fitted to all the observations except i. These deleted residuals are usually calculated for studentized residuals. These studentized deleted residuals can detect outliers that might be missed by usual residuals (Neter et al. 1996). They can also be compared to a t distribution as we described above for the usual studentized residual.
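A deleted residual can be computed literally, by refitting the line without observation i, as in the sketch below. The studentized version uses a standard shortcut formula (as given, for example, by Neter et al. 1996) that avoids the n refits: e_i √[(n − p − 1) / (SS_Residual (1 − h_i) − e_i²)]. Function names and data are hypothetical.

```python
import math


def fit_line(x, y):
    """Least-squares intercept and slope for y = b0 + b1 * x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    ss_x = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / ss_x
    return y_bar - b1 * x_bar, b1


def deleted_residuals(x, y):
    """Deleted (PRESS) residuals, refitting without each observation in turn."""
    deleted = []
    for i in range(len(x)):
        b0, b1 = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        deleted.append(y[i] - (b0 + b1 * x[i]))
    return deleted


def studentized_deleted_residuals(x, y):
    """Studentized deleted residuals via the shortcut formula (one fit only)."""
    n, p = len(x), 2
    b0, b1 = fit_line(x, y)
    e = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    ss_res = sum(ei ** 2 for ei in e)
    x_bar = sum(x) / n
    ss_x = sum((xi - x_bar) ** 2 for xi in x)
    h = [1 / n + (xi - x_bar) ** 2 / ss_x for xi in x]
    return [ei * math.sqrt((n - p - 1) / (ss_res * (1 - hi) - ei ** 2))
            for ei, hi in zip(e, h)]


# Hypothetical data; the last observation sits well above the trend.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 21.0]
d = deleted_residuals(x, y)
t_deleted = studentized_deleted_residuals(x, y)
for i in range(len(x)):
    print(f"obs {i + 1}: deleted = {d[i]:6.2f}, studentized deleted = {t_deleted[i]:6.2f}")
```

The refit version is written for clarity, not speed. Each deleted residual also equals e_i / (1 − h_i), so an outlying observation with high leverage stands out more clearly once it is no longer allowed to pull the fitted line towards itself.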

Influence

A measure of the influence each observation has on the fitted regression line and the estimates of the regression parameters is Cook’s distance statistic, denoted D_i. It takes into account both the size of leverage and the residual for each observation and basically measures the influence of each observation on the estimate of the regression slope (Figure 5.8). A large D_i indicates that removal of that observation would change the estimates of the regression parameters considerably. Cook’s D_i can be used in two ways. First, informally by scanning the D_i values of all observations and noting if any values are much larger than the rest. Second, by comparing D_i to an F_1,n distribution; an approximate guideline is that an observation with a D_i greater than one is particularly influential (Bollen & Jackman 1990). An alternative measure of influence that also incorporates both the size of leverage and the residual for each observation is DFITS_i, which measures the influence of each observation (i) on its predicted value (ŷ_i).
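Both influence measures can be sketched from the residuals and leverage values of a single fit, assuming the standard formulas D_i = [e_i² / (p MS_Residual)] × [h_i / (1 − h_i)²] and DFITS_i = t_i √[h_i / (1 − h_i)], where t_i is the studentized deleted residual. The function names and data below are hypothetical.

```python
import math


def fit_line(x, y):
    """Least-squares intercept and slope for y = b0 + b1 * x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    ss_x = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / ss_x
    return y_bar - b1 * x_bar, b1


def influence_measures(x, y):
    """Return Cook's D_i and DFITS_i for a simple linear regression."""
    n, p = len(x), 2
    b0, b1 = fit_line(x, y)
    e = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    ss_res = sum(ei ** 2 for ei in e)
    ms_res = ss_res / (n - p)
    x_bar = sum(x) / n
    ss_x = sum((xi - x_bar) ** 2 for xi in x)
    h = [1 / n + (xi - x_bar) ** 2 / ss_x for xi in x]
    # Cook's D combines the squared residual with a leverage term.
    cooks_d = [ei ** 2 / (p * ms_res) * hi / (1 - hi) ** 2
               for ei, hi in zip(e, h)]
    # Studentized deleted residuals, needed for DFITS.
    t_del = [ei * math.sqrt((n - p - 1) / (ss_res * (1 - hi) - ei ** 2))
             for ei, hi in zip(e, h)]
    dfits = [ti * math.sqrt(hi / (1 - hi)) for ti, hi in zip(t_del, h)]
    return cooks_d, dfits


# Hypothetical data; the last observation has high leverage and lies off the trend.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 21.0]
cooks_d, dfits = influence_measures(x, y)
for i in range(len(x)):
    flag = "  <- D_i > 1" if cooks_d[i] > 1 else ""
    print(f"obs {i + 1}: D = {cooks_d[i]:5.2f}, DFITS = {dfits[i]:6.2f}{flag}")
```

In this made-up example only the last observation exceeds the D_i > 1 guideline. It combines the largest residual with the highest leverage, whereas the first observation has the same leverage but a small residual and a correspondingly small D_i.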

We illustrate leverage and influence in Figure 5.8. Note that observations one and three have large leverage and observations two and three have large residuals. However, only observation three is very influential, because omitting observations one or two would not change the fitted regression line much.
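The distinction between leverage, residual size and influence can be checked numerically. The sketch below does not reproduce the data behind Figure 5.8. It adds one extra point to a hypothetical set of ten observations lying close to y = 2x, in three configurations that loosely mirror the three cases described, and reports how much the slope changes when that point is deleted.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for y = b0 + b1 * x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    ss_x = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / ss_x
    return y_bar - b1 * x_bar, b1


def slope_change_when_deleted(x, y, i):
    """Slope of the full fit minus the slope refitted without observation i."""
    _, b1_full = fit_line(x, y)
    _, b1_without = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
    return b1_full - b1_without


# Ten hypothetical background observations lying close to y = 2x.
base_x = [float(v) for v in range(1, 11)]
base_y = [2 * v + (0.1 if k % 2 == 0 else -0.1) for k, v in enumerate(base_x)]

# One extra observation per scenario (all values invented for illustration).
extra_points = {
    "far from mean of X, on the trend": (20.0, 40.0),
    "at mean of X, far from the trend": (5.5, 21.0),
    "far from mean of X, far from the trend": (20.0, 20.0),
}

changes = {}
for label, (x_new, y_new) in extra_points.items():
    x = base_x + [x_new]
    y = base_y + [y_new]
    changes[label] = slope_change_when_deleted(x, y, len(x) - 1)
    print(f"{label}: slope changes by {changes[label]:+.3f}")
```

Only the third configuration changes the slope appreciably. A point at the mean of X shifts the intercept but cannot tip the line, and a point far from the mean that already lies on the trend has nothing to pull against.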

Transformations of Y that overcome problems of non-normality or heterogeneity of variance might also reduce the influence of outliers from the fitted model. If not, then the strategies for dealing with outliers discussed in Chapter 4 should be considered.

In document Experimental Design and Data Analysis for Biologists - Quinn & Keough - Cambridge 2002 (Page 114-116)