5.3.10 Diagnostic graphics

We cannot over-emphasize the importance of preliminary inspection of your data. The diagnostics and checks of assumptions we have just described are best used in graphical explorations of your data before you do any formal analyses. We will describe the two most useful graphs for linear regression analysis, the scatterplot and the residual plot.

Scatterplots

A scatterplot of Y against X, just as we used in simple correlation analysis, should always be the first step in any regression analysis. Scatterplots can indicate unequal variances, nonlinearity and outlying observations, as well as being used in conjunction with smoothing functions (Section 5.5) to explore the relationship between Y and X without being constrained by a specific linear model. For example, the scatterplot of number of species of invertebrates against area of mussel clump from Peake & Quinn (1993) clearly indicates nonlinearity (Figure 5.17(a)), while the plot of number of individuals against area of mussel clump indicates increasing variance in number of individuals with increasing clump area (Figure 5.19(a)). While we could write numerous paragraphs on the value of scatterplots as a preliminary check of the data before a linear regression analysis, the wonderful and oft-used example data from Anscombe (1973) emphasize how easily linear regression models can be fitted to inappropriate data and why preliminary scatterplots are so important (Figure 5.9).
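The point made with the Anscombe (1973) data can be checked numerically. The following is a minimal sketch in plain Python; the helper `ols_fit` and the variable names are our own illustrative choices, not part of the original text. It fits a simple OLS line to each of the four data sets and prints the estimates, which come out almost identical even though scatterplots of the four data sets look very different.

```python
# Anscombe (1973) quartet. Data sets 1-3 share the same X-values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = {
    "set 1": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "set 2": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "set 3": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "set 4": (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def ols_fit(x, y):
    """Return (intercept, slope, r_squared) for a simple OLS fit of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    ss_xx = sum((xi - mean_x) ** 2 for xi in x)
    ss_yy = sum((yi - mean_y) ** 2 for yi in y)
    ss_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = ss_xy / ss_xx
    intercept = mean_y - slope * mean_x
    r_squared = ss_xy ** 2 / (ss_xx * ss_yy)
    return intercept, slope, r_squared


fits = {name: ols_fit(x, y) for name, (x, y) in quartet.items()}
for name, (b0, b1, r2) in fits.items():
    print(f"{name}: y = {b0:.2f} + {b1:.2f}x, r^2 = {r2:.2f}")
```

The summary statistics alone give no hint that only one of the four data sets is reasonably described by a straight line; only the scatterplots reveal that.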

Residual plots

The most informative way of examining residuals (raw or studentized) is to plot them against x_i or, equivalently in terms of the observed pattern, ŷ_i (Figure 5.10). These plots can tell us whether the assumptions of the model are met and whether there are unusual observations that do not match the model very well.

Figure 5.8. Residuals, leverage, and influence. The solid regression line is fitted through the observations with open symbols. Observation 1 is an outlier for both Y and X (large leverage) but not from the fitted model and is not influential. Observation 2 is not an outlier for either Y or X but is an outlier from the fitted model (large residual). Regression line 2 includes this observation and its slope is only slightly less than the original regression line so observation 2 is not particularly influential (small Cook’s D_i). Observation 3 is not an outlier for Y but it does have large leverage and it is an outlier from the fitted model (large residual). Regression line 3 includes this observation and its slope is markedly different from the original regression line so observation 3 is very influential (large Cook’s D_i, combining leverage and residual).

If the distribution of Y-values for each x_i is positively skewed (e.g. lognormal, Poisson), we would expect larger ŷ_i (an estimate of the population mean of y_i) to be associated with larger residuals. A wedge-shaped pattern of residuals, with a larger spread of residuals for larger x_i or ŷ_i as shown for the model relating number of individuals of macroinvertebrates to mussel clump area in our worked example (Box 5.4 and Figure 5.19(b)), indicates increasing variance in ε_i and y_i with increasing x_i associated with non-normality in Y-values and a violation of the assumption of homogeneity of variance. Transformation of Y (Section 5.3.11) will usually help. The ideal pattern in the residual plot is a scatter of points with no obvious pattern of increasing or decreasing variance in the residuals. Nonlinearity can be detected by a curved pattern in the residuals (Figure 5.17(b)) and outliers also stand out as having large residuals. These outliers might be different from the outliers identified in simple boxplots of Y, with no regard for X (Chapter 4). The latter are Y-values very different from the rest of the sample, whereas the former are observations with Y-values very different from that predicted by the fitted model.
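The wedge-shaped residual pattern can be illustrated with simulated data. The sketch below is only an illustration under stated assumptions: the data are hypothetical (a mean of Y proportional to X with multiplicative lognormal error, not the Peake & Quinn data), and the helper names `ols_fit` and `sd` are ours. It fits an OLS line and compares the spread of the raw residuals for the smaller and the larger X-values.

```python
import math
import random

random.seed(1)  # arbitrary seed so the sketch is repeatable

# Hypothetical data: 200 evenly spaced X-values from 1 to 10 (already sorted),
# with Y positively skewed around a mean that increases with X.
x = [1 + 9 * i / 199 for i in range(200)]
y = [2.0 * xi * math.exp(random.gauss(0, 0.5)) for xi in x]


def ols_fit(x, y):
    """Return (intercept, slope) for a simple OLS fit of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    ss_xx = sum((xi - mean_x) ** 2 for xi in x)
    ss_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = ss_xy / ss_xx
    return mean_y - slope * mean_x, slope


def sd(values):
    """Sample standard deviation."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (len(values) - 1))


b0, b1 = ols_fit(x, y)
fitted = [b0 + b1 * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

# x is in increasing order, so the first half holds the smaller X-values.
half = len(x) // 2
spread_low = sd(residuals[:half])
spread_high = sd(residuals[half:])
print(f"SD of residuals, smaller x: {spread_low:.2f}")
print(f"SD of residuals, larger x:  {spread_high:.2f}")
```

A clearly larger spread for the larger X-values is the numerical counterpart of the wedge in a residual plot; plotting `residuals` against `fitted` would show the same thing graphically.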

Figure 5.9. Scatterplots of four data sets provided in Anscombe (1973). Note that despite the marked differences in the nature of the relationships between Y and X, the OLS regression line, the r^2 and the test of the H_0 that β_1 equals zero are identical in all four cases: y_i = 3.0 + 0.5x_i, n = 11, r^2 = 0.68, H_0: β_1 = 0, t = 4.24, P = 0.002.

Figure 5.10. Diagrammatic representation of residual plots from linear regression: (a) regression showing even spread around line, (b) associated residual plot, (c) regression showing increasing spread around line, and (d) associated residual plot showing characteristic wedge-shape typical of skewed distribution.

Figure 5.11. Example of parallel lines in a residual plot. Data from Peake & Quinn (1993), where the abundance of the limpets (Cellana tramoserica) was the response variable, area of mussel clump was the predictor variable and there were n = 25 clumps.

Searle (1988) pointed out a commonly observed pattern in residual plots where points fall along parallel lines each with a slope of minus one (Figure 5.11). This results from a number of observations having similar values for one of the variables (e.g. a number of zeros). These parallel lines are not a problem; they just look a little unusual. If the response variable is binary (dichotomous), then the points in the residual plot will fall along two such parallel lines, although OLS regression is probably an inappropriate technique for these data and a generalized linear model with a binomial error term (e.g. logistic regression) should be used (Chapter 13). The example in Figure 5.11 is from Peake & Quinn (1993), where the response variable (number of limpets per mussel clump) only takes three values: zero, one or two.
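The reason for the slope of minus one follows directly from the definition of a residual when residuals are plotted against fitted values: e_i = y_i - ŷ_i, so all observations sharing the same observed value y_i = c satisfy e_i = c - ŷ_i. A minimal sketch with hypothetical counts (not the limpet data; the names `ols_fit` and `group_slopes` are ours) shows one such line for each distinct value of Y.

```python
# Hypothetical response that only takes the values 0, 1 and 2.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [0, 0, 1, 0, 1, 1, 2, 1, 2, 2]


def ols_fit(x, y):
    """Return (intercept, slope) for a simple OLS fit of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    ss_xx = sum((xi - mean_x) ** 2 for xi in x)
    ss_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = ss_xy / ss_xx
    return mean_y - slope * mean_x, slope


b0, b1 = ols_fit(x, y)
fitted = [b0 + b1 * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]


def group_slopes(value):
    """Slopes between consecutive (fitted, residual) points sharing one Y-value."""
    points = sorted((f, e) for f, e, yi in zip(fitted, residuals, y) if yi == value)
    return [(e2 - e1) / (f2 - f1) for (f1, e1), (f2, e2) in zip(points, points[1:])]


for value in sorted(set(y)):
    slopes = group_slopes(value)
    print(f"y = {value}: slopes between neighbouring points =",
          [round(s, 6) for s in slopes])
```

Each distinct Y-value produces its own line, so a response with three possible values gives three parallel lines and a binary response gives two.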

5.3.11 Transformations

When continuous variables have particular skewed distributions, such as lognormal or Poisson, transformations of those variables to a different scale will often render their distributions closer to normal (Chapter 4). When fitting linear regression models, the assumptions underlying OLS interval estimation and hypothesis testing of model parameters refer to the error terms from the model and, therefore, the response variable (Y). Transformations of Y can often be effective if the distribution of Y is non-normal and the variance of y_i differs for each x_i, especially when variance clearly increases as x_i increases. For example, variance heterogeneity for the linear model relating number of individuals of macroinvertebrates to mussel clump area was greatly reduced after transformation of Y (and also X – see below and compare Figure 5.19 and Figure 5.20). Our comments in Chapter 4 about the choice of transformations and the interpretation of analyses based on transformed data are then relevant to the response variable.

The assumption that the x_i are fixed values chosen by the investigator suggests that transformations of the predictor variable would not be warranted. However, regression analyses in biology are nearly always based on both Y and X being random variables, so either our conclusions are conditional on the x_i observed in our sample or we use a Model II analysis (Section 5.3.14). Additionally, our discussion of regression diagnostics shows us that unusual X-values determine leverage and can cause an observation to have undue influence on the estimated regression coefficient. Transformations of X should also be considered to improve the fit of the model, and transforming both Y and X is sometimes more effective than just transforming Y.
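The effect of transformation on variance heterogeneity can be sketched with simulated data. This is an illustration only: the data are hypothetical (a mean of Y proportional to X with multiplicative lognormal error), and the function `residual_spread_ratio` is our own crude summary, not a formal test. It compares the spread of OLS residuals for the larger and smaller X-values before and after log transformation of both variables.

```python
import math
import random

random.seed(2)  # arbitrary seed so the sketch is repeatable

# Hypothetical data: X evenly spaced from 1 to 10 (sorted), Y positively skewed
# with spread that grows with X.
x = [1 + 9 * i / 199 for i in range(200)]
y = [2.0 * xi * math.exp(random.gauss(0, 0.5)) for xi in x]


def sd(values):
    """Sample standard deviation."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (len(values) - 1))


def residual_spread_ratio(x, y):
    """Fit OLS of y on x (x sorted ascending) and return
    SD(residuals for larger x) / SD(residuals for smaller x)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    ss_xx = sum((xi - mean_x) ** 2 for xi in x)
    ss_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = ss_xy / ss_xx
    intercept = mean_y - slope * mean_x
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    half = n // 2
    return sd(residuals[half:]) / sd(residuals[:half])


ratio_raw = residual_spread_ratio(x, y)
# The log transformation preserves the ordering of x, so the halves still match.
ratio_log = residual_spread_ratio([math.log(v) for v in x],
                                  [math.log(v) for v in y])
print(f"spread ratio, raw scale:     {ratio_raw:.2f}")
print(f"spread ratio, log-log scale: {ratio_log:.2f}")
```

A ratio near one on the transformed scale suggests roughly equal residual spread across the range of X, whereas a ratio well above one on the raw scale corresponds to the wedge in the residual plot.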

The other use of transformations in linear regression analysis is to linearize a nonlinear relationship between Y and X (Chapter 4). When we have a clear nonlinear relationship, we can use nonlinear regression models or we can approximate the nonlinearity by including polynomial terms in a linear model (Chapter 6). An alternative approach that works for some nonlinear relationships is to transform one or both variables to make a simple linear model an appropriate fit to the data. Nonlinear relationships that can be made linear by simple transformations of the variables are sometimes termed “intrinsically linear” (Rawlings et al. 1998); for example, the relationship between the number of species and area of an island can be modeled with a nonlinear power function or a simple linear model after log transformation of both variables (Figure 5.17 and Figure 5.18). If there is no evidence of variance heterogeneity, then it is best just to transform X to try and linearize the relationship (Neter et al. 1996). Transforming Y in this case might actually upset error terms that are already normally distributed with similar variances. The relationship between number of species and area of mussel clump from Peake & Quinn (1993) illustrates this point, as a log transformation of just clump area (X) results in a linear model that best fits the data, although both variables were transformed in the analysis (Box 5.4). However, nonlinearity is often associated with non-normality of the response variable, and transformations of Y, or of both Y and X, might be required. Remember that the interpretation of our regression model based on transformed variables, and any predictions from it, must be in terms of transformed Y and/or X, e.g. predicting log number of species from log clump area, although predictions can be back-transformed to the original scale of measurement if required.
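As a worked illustration of an intrinsically linear relationship, a power function S = cA^z becomes log S = log c + z log A after log transformation of both variables, which is a simple linear model with intercept log c and slope z. The sketch below uses hypothetical, error-free values (c = 5 and z = 0.25 are arbitrary choices for illustration, not estimates from any data set) to show that OLS on the log scale recovers the parameters and that a prediction made on the log scale can be back-transformed.

```python
import math

# Hypothetical power function: species = c * area ** z (values chosen arbitrarily).
c_true, z_true = 5.0, 0.25
area = [1, 2, 5, 10, 20, 50, 100, 200, 500, 1000]
species = [c_true * a ** z_true for a in area]

# Log-transform both variables (base 10 here; any base works if used consistently).
log_area = [math.log10(a) for a in area]
log_species = [math.log10(s) for s in species]


def ols_fit(x, y):
    """Return (intercept, slope) for a simple OLS fit of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    ss_xx = sum((xi - mean_x) ** 2 for xi in x)
    ss_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = ss_xy / ss_xx
    return mean_y - slope * mean_x, slope


b0, b1 = ols_fit(log_area, log_species)
z_hat = b1          # slope on the log-log scale estimates the exponent z
c_hat = 10 ** b0    # back-transformed intercept estimates the constant c

# Predict on the log scale for a new area, then back-transform.
new_area = 300
log_prediction = b0 + b1 * math.log10(new_area)
prediction = 10 ** log_prediction

print(f"z = {z_hat:.3f}, c = {c_hat:.3f}")
print(f"predicted log10(species) at area {new_area}: {log_prediction:.3f}")
print(f"back-transformed prediction: {prediction:.2f}")
```

With real data the points would scatter around the line, and the model and its predictions are strictly statements about log number of species; the back-transformed value is a convenience for reporting on the original scale.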

In document Experimental Design and Data Analysis for Biologists - Quinn & Keough - Cambridge 2002 (Page 116-118)