Assumptions of regression analysis - Box 5.6 Model comparisons in simple linear regression We c

Box 5.6 Model comparisons in simple linear regression We can use the model relating CWD basal area to riparian tree density to illustrate

5.3.8 Assumptions of regression analysis

The assumptions of the linear regression model strictly concern the error terms (␧i) in the model, as described in Section 5.3.2. Since these error terms are the only random ones in the model, then the assumptions also apply to observations of the response variable y_i. Note that these assumptions are not required for the OLS estima-

SSResidual(Full) SSResidual(Reduced) SSResidual SSTotal SSRegression SSTotal

tion of model parameters but are necessary for reliable conﬁdence intervals and hypothesis tests based on t distributions or F distributions.

The residuals from the fitted model (Table 5.4) are important for checking whether the assumptions of linear regression analysis are met. Residuals indicate how far each observation is from the fitted OLS regression line, in Y-variable space (i.e. vertically). Observations with larger residuals are further from the fitted line that those with smaller residuals. Patterns of residuals represent patterns in the error terms from the linear model and can be used to check assumptions and also the influence each observation has on the fitted model.

Normality

This assumption is that the populations of

Y-values and the error terms (␧i) are normally dis- tributed for each level of the predictor variable x_i. Conﬁdence intervals and hypothesis tests based on OLS estimates of regression parameters are robust to this assumption unless the lack of normality results in violations of other assumptions. In particular, skewed distributions of y_ican cause problems with homogeneity of variance and line- arity, as discussed below.

Without replicate Y-values for each x_i, this assumption is difﬁcult to verify. However, reason- able checks can be based on the residuals from the ﬁtted model (Bowerman & O’Connell 1990). The methods we described in Chapter 4 for checking normality, including formal tests or graphical methods such as boxplots and probability plots, can be applied to these residuals. If the assumption is not met, then there are at least two options. First, a transformation of Y (Chapter 4

92 CORRELATION AND REGRESSION

Table 5.4 Types of residual for linear regression models, where hiis the leverage for observation i

Residual e_i⫽y_i⫺yˆ_i

Standardized residual Studentized residual Studentized deleted residual ei

兹

n⫺1 SSResidual(1⫺hi)⫺ei2 ei 兹MSResidual(1⫺hi) ei 兹MSResidual

and Section 5.3.11) may be appropriate if the distribution is positively skewed. Second, we can ﬁt a linear model using techniques that allow other distributions of error terms other than normal. These generalized linear models (GLMs) will be described in Chapter 13. Note that non-normality of Y is very commonly associated with heterogene- ity of variance and/or nonlinearity.

Homogeneity of variance

This assumption is that the populations of Y- values, and the error terms (␧i), have the same var- iance for each x_i:

␴1 2⫽␴

2⫽. . .⫽␴i2⫽. . .⫽␴

␧2for i⫽1 to n (5.17)

The homogeneity of variance assumption is important, its violation having a bigger effect on the reliability of interval estimates of, and tests of hypotheses about, regression parameters (and parameters of other linear models) than non- normality. Heterogeneous variances are often a result of our observations coming from popula- tions with skewed distributions of Y-values at each

x_i and can also be due to a small number of extreme observations or outliers (Section 5.3.9).

Although without replicate Y-values for each

x_i, the homogeneity of variance assumption cannot be strictly tested, the general pattern of the residuals for the different x_ican be very infor- mative. The most useful check is a plot of residu- als against x_i or yˆ_i (Section 5.3.10). There are a couple of options for dealing with heterogeneous variances. If the unequal variances are due to skewed distributions of Y-values at each x_i, then appropriate transformations will always help (Chapter 4 and Section 5.3.11) and generalized linear models (GLMs) are always an option (Chapter 13). Alternatively, weighted least squares (Section 5.3.13) can be applied if there is a consis- tent pattern of unequal variance, e.g. increasing variance in Y with increasing X.

Independence

There is also the assumption that the Y-values and the ␧i are independent of each other, i.e. the

Y-value for any x_idoes not inﬂuence the Y-values for any other x_i. The most common situation in which this assumption might not be met is when the observations represent repeated measurements on sampling or experimental units. Such

data are often termed longitudinal, and arise from longitudinal studies (Diggle et al. 1994, Ware & Liang 1996). A related situation is when we have a longer time series from one or a few units and we wish to ﬁt a model where the predictor variable is related to a temporal sequence, i.e. a time series study (Diggle 1990). Error terms and

Y-values that are non-independent through time are described as autocorrelated. A common occur- rence in biology is positive first-order autocorrelation, where there is a positive relationship between error terms from adjacent observations through time, i.e. a positive error term at one time follows from a positive error term at the previous time and the same for negative error terms. The degree of autocorrelation is measured by the autocorrelation parameter, which is the correlation coefficient between successive error terms. More formal descriptions of autocorrelation structures can be found in many textbooks on linear regression models (e.g Bowerman & O’Connell 1990, Neter et al. 1996). Positive autocorrelation can result in underestimation of the true residual variance and seriously inflated Type I error rates for hypothesis tests on regression parameters. Note that autocorrelation can also be spatial rather than temporal, where observations closer together in space are more similar than those further apart (Diggle 1996).

If our Y-values come from populations in which the error terms are autocorrelated between adjacent x_i, then we would expect the residuals from the fitted regression line also to be corre- lated. An estimate of the autocorrelation parameter is the correlation coefficient between adjacent residuals, although some statistical software cal- culates this as the correlation coefficient between adjacent Y-values. Autocorrelation can therefore be detected in plots of residuals against x_iby an obvious positive, negative or cyclical trend in the residuals. Some statistical software also provides the Durbin–Watson test of the H₀that the autocorrelation parameter equals zero. Because we might expect positive autocorrelation, this test is often one-tailed against the alternative hypothesis that the autocorrelation parameter is greater than zero. Note that the Durbin–Watson test is specifically designed for first-order autocorrela- tions and may not detect other patterns of non- independence (Neter et al. 1996).

There are a number of approaches to modeling a repeated series of observations on sampling or experimental units. These approaches can be used with both continuous (this chapter and Chapter 6) and categorical (Chapters 8–12) predictor variables and some are applicable even when the response variable is not continuous. Commonly, repeated measurements on individual units occur in studies that also incorporate a treatment struc- ture across units, i.e. sampling or experimental units are allocated to a number of treatments (representing one or more categorical predictor variables or factors) and each unit is also recorded repeatedly through time or is subject to different treatments through time. Such “repeated meas- ures” data are usually modeled with analysis of variance type models (partly nested models incor- porating a random term representing units; see Chapter 11). Alternative approaches, including uniﬁed mixed linear models (Laird & Ware 1982, see also Diggle et al. 1994, Ware & Liang 1996) and generalized estimating equations (GEEs; see Liang & Zeger 1986, Ware & Liang 1996), based on the generalized linear model, will be described brieﬂy in Chapter 13.

When the data represent a time series, usually on one or a small number of sampling units, one approach is to adjust the usual OLS regression analysis depending on the level of autocorrelation. Bence (1995) discussed options for this adjustment, pointing out that the usual estimates of the autocorrelation parameter are biased and recommending bias-correction estimates. Usually, however, data forming a long time series require more sophisticated modeling procedures, such as formal time-series analyses. These can be linear, as described by Neter et al. (1996) but more commonly nonlinear as discussed in Chatﬁeld (1989) and Diggle (1990), the latter with a biologi- cal emphasis.

Fixed X

Linear regression analysis assumes that the x_iare known constants, i.e. they are fixed values con- trolled or set by the investigator with no variance associated with them. A linear model in which the predictor variables are fixed is known as Model I or a fixed effects model. This will often be the case in designed experiments where the levels of X are treatments chosen specifically. In these circum-

stances, we would commonly have replicate Y- values for each x_iand X may well be a qualitative variable, so analyses that compare mean values of treatment groups might be more appropriate (Chapters 8–12). The fixed X assumption is prob- ably not met for most regression analyses in biology because X and Y are usually both random variables recorded from a bivariate distribution. For example, Peake & Quinn (1993) did not choose mussel clumps of fixed areas but took a haphazard sample of clumps from the shore; any repeat of this study would use clumps with different areas. We will discuss the case of X being random (Model II or random effects model) in Section 5.3.14 but it turns out that prediction and hypothesis tests from the Model I regression are still applicable even when X is not fixed.

In document Experimental Design and Data Analysis for Biologists - Quinn & Keough - Cambridge 2002 (Page 112-114)