
Box 5.7 Model II regression.

5.3.15 Robust regression

One of the limitations of OLS is that the estimates of model parameters, and therefore subsequent hypothesis tests, can be sensitive to distributional assumptions and affected by outlying observations, i.e. ones with large residuals. Even generalized linear model analyses (GLMs; see Chapter 13) that allow other distributions for error terms besides normal, and are based on ML estimation, are sensitive to extreme observations. Robust regression techniques are procedures for fitting linear regression models that are less sensitive to deviations of the underlying distribution of error terms from that specified, and also less sensitive to extreme observations (Birkes & Dodge 1993).

Figure 5.17. (a) Scatterplot (with Loess smoother, smoothing parameter = 0.5) of number of species against clump area (dm²). (b) Scatterplot (with Loess smoother, smoothing parameter = 0.5) of residuals against predicted number of species from linear regression of number of species against clump area.

Figure 5.18. (a) Scatterplot (with linear regression line fitted) of log10 number of species against log10 clump area. (b) Scatterplot (with Loess smoother, smoothing parameter = 0.5) of residuals against predicted number of species from linear regression of log10 number of species against log10 clump area.

Least absolute deviations (LAD)

LAD, sometimes termed least absolute residuals (LAR; see Berk 1990), is where the estimates of $\beta_0$ and $\beta_1$ are those that minimize the sum of absolute values of the residuals:

$$\sum_{i=1}^{n} |e_i| = \sum_{i=1}^{n} |y_i - \hat{y}_i| \tag{5.24}$$

rather than the sum of squared residuals ($\sum_{i=1}^{n} e_i^2$) as in OLS. By not squaring the residuals, extreme observations have less influence on the fitted model. The difficulty is that the computations of the LAD estimates for $\beta_0$ and $\beta_1$ are more complex than OLS estimates, although algorithms are available (Birkes & Dodge 1993) and robust regression techniques are now common in statistical software (often as part of nonlinear modeling routines).
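As a rough illustration of the LAD criterion (and not the algorithm of Birkes & Dodge 1993), the sketch below uses a known property of LAD fits with a single predictor: at least one minimizing line passes through two of the data points, so searching every pair of points finds a minimizer. The function names `lad_fit` and `ols_fit` and the data are invented for this example, and the exhaustive search is only practical for small samples.

```python
from itertools import combinations


def ols_fit(x, y):
    """Ordinary least squares estimates (b0, b1) for y = b0 + b1 * x."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1


def lad_fit(x, y):
    """LAD estimates (b0, b1): minimize the sum of |y_i - b0 - b1 * x_i|.

    Exhaustive search over lines through every pair of observations
    (an illustrative shortcut, not a production algorithm).
    """
    best = None
    for i, j in combinations(range(len(x)), 2):
        if x[i] == x[j]:
            continue  # slope undefined for tied x values
        b1 = (y[j] - y[i]) / (x[j] - x[i])
        b0 = y[i] - b1 * x[i]
        sad = sum(abs(yk - (b0 + b1 * xk)) for xk, yk in zip(x, y))
        if best is None or sad < best[0]:
            best = (sad, b0, b1)
    if best is None:
        raise ValueError("need at least two distinct x values")
    return best[1], best[2]


# Hypothetical data: five points on y = 2x plus one extreme observation.
x = [1, 2, 3, 4, 5, 6]
y = [2, 4, 6, 8, 10, 30]

print("OLS (b0, b1):", ols_fit(x, y))
print("LAD (b0, b1):", lad_fit(x, y))
```

With these made-up data the LAD fit returns the line through the five collinear points (slope 2), whereas the single extreme observation pulls the OLS slope up to about 4.6.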

Figure 5.19. (a) Scatterplot (with Loess smoother, smoothing parameter = 0.5) of number of individuals against clump area (dm²). (b) Scatterplot of residuals against predicted number of individuals from linear regression of number of individuals against clump area.

Figure 5.20. (a) Scatterplot (with linear regression line and 95% confidence band fitted) of log10 number of individuals against log10 clump area. (b) Scatterplot of residuals against predicted number of individuals from linear regression of log10 number of individuals against log10 clump area.

M-estimators

These were introduced in Chapter 2 for estimating the mean of a population. In a regression context, M-estimators involve minimizing the sum of some function of $e_i$, with OLS (minimizing $\sum_{i=1}^{n} e_i^2$) and LAD (minimizing $\sum_{i=1}^{n} |e_i|$) simply being special cases (Birkes & Dodge 1993). Huber M-estimators, described in Chapter 2, weight the observations differently depending on how far they are from the center of the distribution. In robust regression analyses, Huber M-estimators weight the residuals ($e_i$) differently depending on how far they are from zero (Berk 1990) and use these new residuals to calculate adjusted Y-values. The estimates for $\beta_0$ and $\beta_1$ are those that minimize both $\sum_{i=1}^{n} e_i^2$ (i.e. OLS) when the residuals are near zero and $\sum |e_i|$ (i.e. LAD) when the residuals are far from zero. We need to choose the size of the residual at which the method switches from OLS to LAD; this decision is somewhat subjective, although recommendations are available (Huber 1981, Wilcox 1997). You should ensure that the default value used by your statistical software for robust regression seems reasonable. Wilcox (1997) described more sophisticated robust regression procedures, including an M-estimator based on iteratively reweighting the residuals.

One problem with M-estimators is that the sampling distributions of the estimated coefficients are unlikely to be normal, unless sample sizes are large, and the usual calculations for standard errors, confidence intervals and hypothesis testing may not be valid (Berk 1990). Resampling methods such as bootstrap (Chapter 2) are probably the most reliable approach (Wilcox 1997).
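One common way to compute a Huber-type fit is iteratively reweighted least squares (IRLS), sketched below. This is a swapped-in formulation rather than the adjusted-Y-values procedure described above, and several choices are assumptions made for the example: the tuning constant k = 1.345, the residual scale estimated as the median absolute deviation divided by 0.6745, the function names `weighted_ls` and `huber_fit`, and the data. Residuals within k scale units of zero keep full weight (the OLS-like region); larger residuals are down-weighted in proportion to their size (the LAD-like region).

```python
from statistics import median


def weighted_ls(x, y, w):
    """Weighted least squares estimates (b0, b1) for y = b0 + b1 * x."""
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1


def huber_fit(x, y, k=1.345, max_iter=50, tol=1e-8):
    """Huber-type estimates (b0, b1) by IRLS, starting from the OLS fit.

    k is the switch point in units of a robust residual scale
    (an assumed conventional default; check your software's value).
    """
    b0, b1 = weighted_ls(x, y, [1.0] * len(x))  # equal weights = OLS
    for _ in range(max_iter):
        resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
        med = median(resid)
        scale = median([abs(r - med) for r in resid]) / 0.6745
        if scale == 0:
            break  # at least half the residuals are identical; stop here
        cut = k * scale
        # Huber weights: 1 near zero, cut / |e_i| for large residuals.
        w = [1.0 if abs(r) <= cut else cut / abs(r) for r in resid]
        new_b0, new_b1 = weighted_ls(x, y, w)
        converged = abs(new_b0 - b0) < tol and abs(new_b1 - b1) < tol
        b0, b1 = new_b0, new_b1
        if converged:
            break
    return b0, b1


# Hypothetical data: roughly y = 2x, with the last observation extreme.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9, 14.2, 30.0]

print("OLS   (b0, b1):", weighted_ls(x, y, [1.0] * len(x)))
print("Huber (b0, b1):", huber_fit(x, y))
```

Changing `k` moves the point at which the fit switches from OLS-like to LAD-like behavior, which is the subjective choice discussed above. Standard errors from such a fit would still need a resampling approach such as the bootstrap.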

Rank-based (“non-parametric”) regression

This approach does not assume any specific distribution of the error terms but still fits the usual linear regression model. It might be particularly useful if either of the two variables is not normally distributed and nonlinearity is evident but transformations are either ineffective or misrepresent the underlying biological process. The simplest non-parametric regression analysis is based on the $n(n-1)/2$ OLS slopes of the regression lines for each pair of X values (the slope for $(x_1, y_1)$ and $(x_2, y_2)$, the slope for $(x_2, y_2)$ and $(x_3, y_3)$, the slope for $(x_1, y_1)$ and $(x_3, y_3)$, etc.). The non-parametric estimator of $\beta_1$ ($b_1$) is the median of these slopes and the non-parametric estimator of $\beta_0$ ($b_0$) is the median of all the $y_i - b_1 x_i$ differences (Birkes & Dodge 1993, Sokal & Rohlf 1995, Sprent 1993). A t test for $\beta_1$ based on the ranks of the Y-values is described in Birkes & Dodge (1993); an alternative is to simply use Kendall’s rank correlation coefficient (Sokal & Rohlf 1995).
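The median-of-pairwise-slopes estimator described above (often called the Theil-Sen estimator) can be sketched in a few lines. The function name `pairwise_median_fit` and the data are invented for illustration, and skipping pairs with tied X values is an assumption of this sketch, since the slope for such pairs is undefined.

```python
from itertools import combinations
from statistics import median


def pairwise_median_fit(x, y):
    """Rank-based estimates (b0, b1) from the median of pairwise slopes."""
    slopes = [
        (y[j] - y[i]) / (x[j] - x[i])
        for i, j in combinations(range(len(x)), 2)
        if x[i] != x[j]  # assumption: pairs with tied x values are skipped
    ]
    if not slopes:
        raise ValueError("need at least two distinct x values")
    b1 = median(slopes)
    # Intercept: median of the differences y_i - b1 * x_i.
    b0 = median([yi - b1 * xi for xi, yi in zip(x, y)])
    return b0, b1


# Hypothetical data: four points on y = 2x plus one extreme observation.
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 30]

print("Rank-based (b0, b1):", pairwise_median_fit(x, y))
```

Here six of the ten pairwise slopes equal 2, so the median slope is 2 and the median of the $y_i - b_1 x_i$ differences is 0; the extreme observation affects only the four slopes that involve it.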

Randomization test

A randomization test of the $H_0$ that $\beta_1$ equals zero can also be constructed by comparing the observed value of $b_1$ to the distribution of $b_1$ found by pairing the $y_i$ and $x_i$ values at random a large number of times and calculating $b_1$ each time (Manly 1997). The P value is then the proportion of values of $b_1$ from this distribution that are equal to or larger than the observed value of $b_1$.
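A minimal sketch of this randomization test follows. The function names, the number of randomizations (9999), the fixed seed and the data are all assumptions chosen for illustration. Counting the observed pairing as one member of the randomization distribution is also a design choice of this sketch, so the smallest attainable P value is 1/(number of randomizations + 1).

```python
import random


def ols_slope(x, y):
    """OLS slope b1 for the regression of y on x."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return sxy / sxx


def randomization_test_slope(x, y, n_rand=9999, seed=1):
    """Return (observed b1, one-tailed P value) for H0: beta1 = 0.

    P is the proportion of slopes, from random re-pairings of y with x,
    that are equal to or larger than the observed slope.
    """
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    observed = ols_slope(x, y)
    y_shuffled = list(y)
    count = 1  # the observed pairing counts as one arrangement
    for _ in range(n_rand):
        rng.shuffle(y_shuffled)
        if ols_slope(x, y_shuffled) >= observed:
            count += 1
    return observed, count / (n_rand + 1)


# Hypothetical data with a clear positive relationship.
x = list(range(1, 11))
y = [1.2, 1.9, 3.4, 3.8, 5.1, 6.3, 6.6, 8.2, 9.1, 9.7]

b1_obs, p_value = randomization_test_slope(x, y)
print("observed b1:", b1_obs)
print("one-tailed P:", p_value)
```

A two-tailed version would compare the absolute values of the randomized and observed slopes instead.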

5.4 Relationship between regression and correlation
