• No results found

Box 5.6 Model comparisons in simple linear regression We can use the model relating CWD basal area to riparian tree density to illustrate

5.3.8 Assumptions of regression analysis

The assumptions of the linear regression model strictly concern the error terms (␧i) in the model, as described in Section 5.3.2. Since these error terms are the only random ones in the model, then the assumptions also apply to observations of the response variable yi. Note that these assumptions are not required for the OLS estima-

SSResidual(Full) SSResidual(Reduced) SSResidual SSTotal SSRegression SSTotal

tion of model parameters but are necessary for reliable confidence intervals and hypothesis tests based on t distributions or F distributions.

The residuals from the fitted model (Table 5.4) are important for checking whether the assump- tions of linear regression analysis are met. Residuals indicate how far each observation is from the fitted OLS regression line, in Y-variable space (i.e. vertically). Observations with larger residuals are further from the fitted line that those with smaller residuals. Patterns of residuals represent patterns in the error terms from the linear model and can be used to check assumptions and also the influence each observation has on the fitted model.

Normality

This assumption is that the populations of

Y-values and the error terms (␧i) are normally dis- tributed for each level of the predictor variable xi. Confidence intervals and hypothesis tests based on OLS estimates of regression parameters are robust to this assumption unless the lack of nor- mality results in violations of other assumptions. In particular, skewed distributions of yican cause problems with homogeneity of variance and line- arity, as discussed below.

Without replicate Y-values for each xi, this assumption is difficult to verify. However, reason- able checks can be based on the residuals from the fitted model (Bowerman & O’Connell 1990). The methods we described in Chapter 4 for check- ing normality, including formal tests or graphical methods such as boxplots and probability plots, can be applied to these residuals. If the assump- tion is not met, then there are at least two options. First, a transformation of Y (Chapter 4

92 CORRELATION AND REGRESSION

Table 5.4 Types of residual for linear regression models, where hiis the leverage for observation i

Residual ei⫽yi⫺yˆi

Standardized residual Studentized residual Studentized deleted residual ei

n⫺1 SSResidual(1⫺hi)⫺ei2 ei 兹MSResidual(1⫺hi) ei 兹MSResidual

and Section 5.3.11) may be appropriate if the dis- tribution is positively skewed. Second, we can fit a linear model using techniques that allow other distributions of error terms other than normal. These generalized linear models (GLMs) will be described in Chapter 13. Note that non-normality of Y is very commonly associated with heterogene- ity of variance and/or nonlinearity.

Homogeneity of variance

This assumption is that the populations of Y- values, and the error terms (␧i), have the same var- iance for each xi:

␴1 2⫽␴

2

2⫽. . .⫽␴i2⫽. . .⫽␴

␧2for i⫽1 to n (5.17)

The homogeneity of variance assumption is important, its violation having a bigger effect on the reliability of interval estimates of, and tests of hypotheses about, regression parameters (and parameters of other linear models) than non- normality. Heterogeneous variances are often a result of our observations coming from popula- tions with skewed distributions of Y-values at each

xi and can also be due to a small number of extreme observations or outliers (Section 5.3.9).

Although without replicate Y-values for each

xi, the homogeneity of variance assumption cannot be strictly tested, the general pattern of the residuals for the different xican be very infor- mative. The most useful check is a plot of residu- als against xi or yˆi (Section 5.3.10). There are a couple of options for dealing with heterogeneous variances. If the unequal variances are due to skewed distributions of Y-values at each xi, then appropriate transformations will always help (Chapter 4 and Section 5.3.11) and generalized linear models (GLMs) are always an option (Chapter 13). Alternatively, weighted least squares (Section 5.3.13) can be applied if there is a consis- tent pattern of unequal variance, e.g. increasing variance in Y with increasing X.

Independence

There is also the assumption that the Y-values and the ␧i are independent of each other, i.e. the

Y-value for any xidoes not influence the Y-values for any other xi. The most common situation in which this assumption might not be met is when the observations represent repeated measure- ments on sampling or experimental units. Such

data are often termed longitudinal, and arise from longitudinal studies (Diggle et al. 1994, Ware & Liang 1996). A related situation is when we have a longer time series from one or a few units and we wish to fit a model where the predictor vari- able is related to a temporal sequence, i.e. a time series study (Diggle 1990). Error terms and

Y-values that are non-independent through time are described as autocorrelated. A common occur- rence in biology is positive first-order autocorrela- tion, where there is a positive relationship between error terms from adjacent observations through time, i.e. a positive error term at one time follows from a positive error term at the previous time and the same for negative error terms. The degree of autocorrelation is measured by the auto- correlation parameter, which is the correlation coefficient between successive error terms. More formal descriptions of autocorrelation structures can be found in many textbooks on linear regres- sion models (e.g Bowerman & O’Connell 1990, Neter et al. 1996). Positive autocorrelation can result in underestimation of the true residual var- iance and seriously inflated Type I error rates for hypothesis tests on regression parameters. Note that autocorrelation can also be spatial rather than temporal, where observations closer together in space are more similar than those further apart (Diggle 1996).

If our Y-values come from populations in which the error terms are autocorrelated between adjacent xi, then we would expect the residuals from the fitted regression line also to be corre- lated. An estimate of the autocorrelation parame- ter is the correlation coefficient between adjacent residuals, although some statistical software cal- culates this as the correlation coefficient between adjacent Y-values. Autocorrelation can therefore be detected in plots of residuals against xiby an obvious positive, negative or cyclical trend in the residuals. Some statistical software also provides the Durbin–Watson test of the H0that the auto- correlation parameter equals zero. Because we might expect positive autocorrelation, this test is often one-tailed against the alternative hypothe- sis that the autocorrelation parameter is greater than zero. Note that the Durbin–Watson test is specifically designed for first-order autocorrela- tions and may not detect other patterns of non- independence (Neter et al. 1996).

There are a number of approaches to modeling a repeated series of observations on sampling or experimental units. These approaches can be used with both continuous (this chapter and Chapter 6) and categorical (Chapters 8–12) predictor vari- ables and some are applicable even when the response variable is not continuous. Commonly, repeated measurements on individual units occur in studies that also incorporate a treatment struc- ture across units, i.e. sampling or experimental units are allocated to a number of treatments (rep- resenting one or more categorical predictor vari- ables or factors) and each unit is also recorded repeatedly through time or is subject to different treatments through time. Such “repeated meas- ures” data are usually modeled with analysis of variance type models (partly nested models incor- porating a random term representing units; see Chapter 11). Alternative approaches, including unified mixed linear models (Laird & Ware 1982, see also Diggle et al. 1994, Ware & Liang 1996) and generalized estimating equations (GEEs; see Liang & Zeger 1986, Ware & Liang 1996), based on the generalized linear model, will be described briefly in Chapter 13.

When the data represent a time series, usually on one or a small number of sampling units, one approach is to adjust the usual OLS regression analysis depending on the level of autocorrela- tion. Bence (1995) discussed options for this adjustment, pointing out that the usual estimates of the autocorrelation parameter are biased and recommending bias-correction estimates. Usually, however, data forming a long time series require more sophisticated modeling procedures, such as formal time-series analyses. These can be linear, as described by Neter et al. (1996) but more commonly nonlinear as discussed in Chatfield (1989) and Diggle (1990), the latter with a biologi- cal emphasis.

Fixed X

Linear regression analysis assumes that the xiare known constants, i.e. they are fixed values con- trolled or set by the investigator with no variance associated with them. A linear model in which the predictor variables are fixed is known as Model I or a fixed effects model. This will often be the case in designed experiments where the levels of X are treatments chosen specifically. In these circum-

stances, we would commonly have replicate Y- values for each xiand X may well be a qualitative variable, so analyses that compare mean values of treatment groups might be more appropriate (Chapters 8–12). The fixed X assumption is prob- ably not met for most regression analyses in biology because X and Y are usually both random variables recorded from a bivariate distribution. For example, Peake & Quinn (1993) did not choose mussel clumps of fixed areas but took a haphazard sample of clumps from the shore; any repeat of this study would use clumps with different areas. We will discuss the case of X being random (Model II or random effects model) in Section 5.3.14 but it turns out that prediction and hypothesis tests from the Model I regression are still applicable even when X is not fixed.

Outline

Related documents