Box 5.6 Model comparisons in simple linear regression We can use the model relating CWD basal area to riparian tree density to illustrate
5.3.10 Diagnostic graphics
We cannot over-emphasize the importance of preliminary inspection of your data. The diagnostics and checks of assumptions we have just described are best used in graphical explorations of your data before you do any formal analyses. We will describe the two most useful graphs for linear regression analysis, the scatterplot and the residual plot.
Scatterplots
A scatterplot of Y against X, just as we used in simple correlation analysis, should always be the first step in any regression analysis. Scatterplots can indicate unequal variances, nonlinearity and outlying observations, as well as being used in conjunction with smoothing functions (Section 5.5) to explore the relationship between Y and X without being constrained by a specific linear model. For example, the scatterplot of number of species of invertebrates against area of mussel clump from Peake & Quinn (1993) clearly indicates nonlinearity (Figure 5.17(a)), while the plot of number of individuals against area of mussel clump indicates increasing variance in number of individuals with increasing clump area (Figure 5.19(a)). While we could write numerous paragraphs on the value of scatterplots as a preliminary check of the data before a linear regression analysis, the wonderful and oft-used example data from Anscombe (1973) emphasize how easily linear regression models can be fitted to inappropriate data and why preliminary scatterplots are so important (Figure 5.9).
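The point made by the Anscombe (1973) data can be checked numerically. The following is a minimal sketch in Python using NumPy (our choice of tool, not part of the original analysis); the helper name `ols_summary` is hypothetical. It fits an OLS line to each of the four data sets and reports the intercept, slope and r2, which come out nearly identical even though scatterplots of the four data sets look very different.

```python
import numpy as np

# Anscombe's (1973) four data sets; sets 1-3 share the same X-values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
anscombe = {
    1: (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    2: (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    3: (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    4: ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
        [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def ols_summary(x, y):
    """Return (intercept, slope, r2) for the OLS regression of y on x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)   # highest-degree coefficient first
    r2 = np.corrcoef(x, y)[0, 1] ** 2        # squared Pearson correlation
    return float(intercept), float(slope), float(r2)


summaries = {k: ols_summary(x, y) for k, (x, y) in anscombe.items()}
for k, (b0, b1, r2) in summaries.items():
    print(f"data set {k}: y = {b0:.2f} + {b1:.2f}x, r2 = {r2:.2f}")
```

The numerical summaries alone give no hint of the curvature, outlier or single high-leverage point in data sets 2 to 4, which is why the scatterplot should come first.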
Residual plots
The most informative way of examining residuals (raw or studentized) is to plot them against x_i or, equivalently in terms of the observed pattern, ŷ_i (Figure 5.10). These plots can tell us whether the assumptions of the model are met and whether there are unusual observations that do not match the model very well.
If the distribution of Y-values for each x_i is positively skewed (e.g. lognormal, Poisson), we would expect larger ŷ_i (an estimate of the population mean of y_i) to be associated with larger residuals. A wedge-shaped pattern of residuals, with a larger spread of residuals for larger x_i or ŷ_i, as shown for the model relating number of individuals of macroinvertebrates to mussel clump area in our worked example (Box 5.4 and Figure 5.19(b)), indicates increasing variance in ε_i and y_i with increasing x_i, associated with non-normality in Y-values and a violation of the assumption of homogeneity of variance. Transformation of Y (Section 5.3.11) will usually help. The ideal pattern in the residual plot is a scatter of points with no obvious pattern of increasing or decreasing variance in the residuals. Nonlinearity can be detected by a curved pattern in the residuals (Figure 5.17(b)) and outliers also stand out as having large residuals. These outliers might be different from the outliers identified in simple boxplots of Y, with no regard for X (Chapter 4). The latter are Y-values very different from the rest of the sample, whereas the former are observations with Y-values very different from those predicted by the fitted model.
Figure 5.8. Residuals, leverage, and influence. The solid regression line is fitted through the observations with open symbols. Observation 1 is an outlier for both Y and X (large leverage) but not from the fitted model and is not influential. Observation 2 is not an outlier for either Y or X but is an outlier from the fitted model (large residual). Regression line 2 includes this observation and its slope is only slightly less than the original regression line so observation 2 is not particularly influential (small Cook’s D_i). Observation 3 is not an outlier for Y but it does have large leverage and it is an outlier from the fitted model (large residual). Regression line 3 includes this observation and its slope is markedly different from the original regression line so observation 3 is very influential (large Cook’s D_i, combining leverage and residual).
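The wedge-shaped residual pattern can be illustrated with simulated data. This is only a sketch under stated assumptions: the predictor values, the Poisson mean of 2x and the random seed are all hypothetical, and comparing the residual standard deviation below and above the median fitted value is a crude numerical stand-in for inspecting the residual plot itself.

```python
import numpy as np

rng = np.random.default_rng(1)          # arbitrary seed so the sketch is repeatable
x = np.linspace(1, 20, 200)             # hypothetical predictor values
y = rng.poisson(lam=2.0 * x)            # Poisson response: variance rises with the mean

slope, intercept = np.polyfit(x, y, 1)  # OLS fit of y on x
fitted = intercept + slope * x
resid = y - fitted                      # raw residuals

# Compare the spread of residuals below and above the median fitted value.
upper = fitted > np.median(fitted)
sd_low = resid[~upper].std(ddof=1)
sd_high = resid[upper].std(ddof=1)
print(f"residual SD for small fitted values: {sd_low:.2f}")
print(f"residual SD for large fitted values: {sd_high:.2f}")
```

A clearly larger spread for the larger fitted values corresponds to the wedge shape in the residual plot; plotting `resid` against `fitted` remains the more informative check.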
Searle (1988) pointed out a commonly observed pattern in residual plots where points fall along parallel lines, each with a slope of minus one (Figure 5.11). This results from a number of observations having similar values for one of the variables (e.g. a number of zeros). These parallel lines are not a problem; they just look a little unusual. If the response variable is binary (dichotomous), then the points in the residual plot will fall along two such parallel lines, although OLS regression is probably an inappropriate technique for these data and a generalized linear model with a binomial error term (e.g. logistic regression) should be used (Chapter 13). The example in Figure 5.11 is from Peake & Quinn (1993), where the response variable (number of limpets per mussel clump) only takes three values: zero, one or two.
Figure 5.9. Scatterplots of four data sets provided in Anscombe (1973). Note that despite the marked differences in the nature of the relationships between Y and X, the OLS regression line, the r² and the test of the H0 that β1 equals zero are identical in all four cases: y_i = 3.0 + 0.5x_i, n = 11, r² = 0.67, H0: β1 = 0, t = 4.24, P = 0.002.
Figure 5.10. Diagrammatic representation of residual plots from linear regression: (a) regression showing even spread around line, (b) associated residual plot, (c) regression showing increasing spread around line, and (d) associated residual plot showing characteristic wedge-shape typical of skewed distribution.
Figure 5.11. Example of parallel lines in a residual plot. Data from Peake & Quinn (1993), where the abundance of the limpets (Cellana tramoserica) was the response variable, area of mussel clump was the predictor variable and there were n = 25 clumps.
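The slope of minus one follows directly from the definition of a residual: for all observations sharing the same observed value y_i = c, the residual is c minus ŷ_i, so plotting residuals against ŷ_i gives a line with slope minus one for each distinct value of c. A minimal sketch with simulated data makes this concrete; the clump areas, the Poisson mean and the seed are hypothetical and are not the Peake & Quinn (1993) data.

```python
import numpy as np

rng = np.random.default_rng(2)                               # arbitrary seed
x = rng.uniform(0, 100, size=25)                             # hypothetical clump areas
y = np.minimum(rng.poisson(0.02 * x), 2).astype(float)       # response is 0, 1 or 2 only

slope, intercept = np.polyfit(x, y, 1)                       # OLS fit of y on x
fitted = intercept + slope * x
resid = y - fitted

# Within each distinct Y-value c, resid = c - fitted: a line of slope -1.
line_slopes = {}
for c in np.unique(y):
    same = y == c
    if same.sum() >= 2:                                      # need two points for a line
        line_slopes[float(c)] = float(np.polyfit(fitted[same], resid[same], 1)[0])
print(line_slopes)
```

Each distinct value of the response contributes one line, so a binary response gives two lines and the three-valued limpet counts give three.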
5.3.11 Transformations
When continuous variables have particular skewed distributions, such as lognormal or Poisson, transformations of those variables to a different scale will often render their distributions closer to normal (Chapter 4). When fitting linear regression models, the assumptions underlying OLS interval estimation and hypothesis testing of model parameters refer to the error terms from the model and, therefore, the response variable (Y). Transformations of Y can often be effective if the distribution of Y is non-normal and the variance of y_i differs for each x_i, especially when variance clearly increases as x_i increases. For example, variance heterogeneity for the linear model relating number of individuals of macroinvertebrates to mussel clump area was greatly reduced after transformation of Y (and also X; see below and compare Figure 5.19 and Figure 5.20). Our comments in Chapter 4 about the choice of transformations and the interpretation of analyses based on transformed data are then relevant to the response variable.
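The effect of transforming Y on variance heterogeneity can be sketched with simulated lognormal data. Everything here is an assumption made for illustration: the coefficients, error standard deviation and seed are hypothetical, and the helper `spread_ratio` (residual standard deviation for large fitted values divided by that for small fitted values) is just a rough numerical summary of what a residual plot would show.

```python
import numpy as np


def spread_ratio(x, y):
    """Fit OLS of y on x; return residual SD above / below the median fitted value."""
    slope, intercept = np.polyfit(x, y, 1)
    fitted = intercept + slope * x
    resid = y - fitted
    upper = fitted > np.median(fitted)
    return float(resid[upper].std(ddof=1) / resid[~upper].std(ddof=1))


rng = np.random.default_rng(3)                    # arbitrary seed
x = np.linspace(1, 10, 200)                       # hypothetical predictor values
# Lognormal Y: log(y) is linear in x with constant error variance.
y = np.exp(0.5 + 0.3 * x + rng.normal(0, 0.4, size=x.size))

ratio_raw = spread_ratio(x, y)                    # untransformed response
ratio_log = spread_ratio(x, np.log(y))            # log-transformed response
print(f"spread ratio, raw Y: {ratio_raw:.2f}")
print(f"spread ratio, log Y: {ratio_log:.2f}")
```

A ratio well above one for the raw response and close to one after the log transformation is what we would expect when the transformation has removed most of the variance heterogeneity.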
The assumption that the x_i are fixed values chosen by the investigator suggests that transformations of the predictor variable would not be warranted. However, regression analyses in biology are nearly always based on both Y and X being random variables, with our conclusions conditional on the x_i observed in our sample, or else we use a Model II analysis (Section 5.3.14). Additionally, our discussion of regression diagnostics shows us that unusual X-values determine leverage and can cause an observation to have undue influence on the estimated regression coefficient. Transformations of X should also be considered to improve the fit of the model and transforming both Y and X is sometimes more effective than just transforming Y.
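Because leverage in simple linear regression depends only on the X-values, the effect of transforming X can be examined without fitting any model. The sketch below uses the standard leverage formula for a single predictor, h_i = 1/n + (x_i - x̄)² / Σ(x_j - x̄)², and a small set of hypothetical X-values with one extreme observation; the function name `leverage` is our own.

```python
import numpy as np


def leverage(x):
    """Leverage h_i for simple linear regression with a single predictor."""
    x = np.asarray(x, dtype=float)
    dev2 = (x - x.mean()) ** 2          # squared deviations from the mean of X
    return 1.0 / x.size + dev2 / dev2.sum()


area = np.array([1, 2, 3, 4, 5, 6, 8, 10, 100])   # hypothetical, one extreme X-value
h_raw = leverage(area)                            # leverage on the original scale
h_log = leverage(np.log10(area))                  # leverage after log transformation

print(f"largest leverage, raw X: {h_raw.max():.3f}")
print(f"largest leverage, log X: {h_log.max():.3f}")
```

The extreme observation still has the largest leverage after transformation, but the log scale pulls it closer to the other X-values, so it has less potential to dominate the estimated slope.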
The other use of transformations in linear regression analysis is to linearize a nonlinear relationship between Y and X (Chapter 4). When we have a clear nonlinear relationship, we can use nonlinear regression models or we can approximate the nonlinearity by including polynomial terms in a linear model (Chapter 6). An alternative approach that works for some nonlinear relationships is to transform one or both variables to make a simple linear model an appropriate fit to the data. Nonlinear relationships that can be made linear by simple transformations of the variables are sometimes termed “intrinsically linear” (Rawlings et al. 1998); for example, the relationship between the number of species and area of an island can be modeled with a nonlinear power function or a simple linear model after log transformation of both variables (Figure 5.17 and Figure 5.18). If there is no evidence of variance heterogeneity, then it is best just to transform X to try and linearize the relationship (Neter et al. 1996). Transforming Y in this case might actually upset error terms that are already normally distributed with similar variances. The relationship between number of species and area of mussel clump from Peake & Quinn (1993) illustrates this point, as a log transformation of just clump area (X) results in a linear model that best fits the data, although both variables were transformed in the analysis (Box 5.4). However, nonlinearity is often associated with non-normality of the response variable, and transformations of Y, or of both Y and X, might be required. Remember that the interpretation of our regression model based on transformed variables, and any predictions from it, must be in terms of transformed Y and/or X, e.g. predicting log number of species from log clump area, although predictions can be back-transformed to the original scale of measurement if required.
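The species-area example shows how an intrinsically linear model works: a power function S = cA^z becomes log S = log c + z log A after log transformation of both variables, which can be fitted by OLS. The sketch below uses simulated data, not the Peake & Quinn (1993) data; the values of c and z, the multiplicative error and the seed are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)                     # arbitrary seed
c_true, z_true = 5.0, 0.25                         # hypothetical power-function parameters
area = np.logspace(0, 4, 30)                       # hypothetical areas from 1 to 10 000
# Power function with multiplicative (lognormal) error.
species = c_true * area ** z_true * np.exp(rng.normal(0, 0.05, size=area.size))

# Linear model on the log scale: log10(S) = log10(c) + z * log10(A)
z_hat, log_c_hat = np.polyfit(np.log10(area), np.log10(species), 1)
c_hat = 10 ** log_c_hat

# Predict on the log scale, then back-transform to the original scale.
new_area = 250.0
log_pred = log_c_hat + z_hat * np.log10(new_area)
pred = 10 ** log_pred

print(f"estimated z = {z_hat:.3f}, estimated c = {c_hat:.2f}")
print(f"predicted log10 species = {log_pred:.3f}, back-transformed = {pred:.1f}")
```

The fitted slope estimates the exponent z and the fitted intercept estimates log c. The prediction is made on the log scale and only then back-transformed, in keeping with the point that the model and its predictions are in terms of the transformed variables.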