Robust regression - Box 5.7 Model II regression.


5.3.15 Robust regression

One of the limitations of OLS is that the estimates of model parameters, and therefore subsequent hypothesis tests, can be sensitive to distributional assumptions and affected by outlying observations, i.e. ones with large residuals. Even generalized linear model analyses (GLMs; see Chapter 13) that allow other distributions for error terms besides normal, and are based on ML estimation, are sensitive to extreme observations. Robust regression techniques are procedures for fitting linear regression models that are less sensitive to deviations of the underlying distribution of error terms from that specified, and also less sensitive to extreme observations (Birkes & Dodge 1993).

Figure 5.17. (a) Scatterplot (with Loess smoother, smoothing parameter = 0.5) of number of species against clump area (dm²). (b) Scatterplot (with Loess smoother, smoothing parameter = 0.5) of residuals against predicted number of species from linear regression of number of species against clump area.

Figure 5.18. (a) Scatterplot (with linear regression line fitted) of log₁₀ number of species against log₁₀ clump area. (b) Scatterplot (with Loess smoother, smoothing parameter = 0.5) of residuals against predicted number of species from linear regression of log₁₀ number of species against log₁₀ clump area.

Least absolute deviations (LAD)

LAD, sometimes termed least absolute residuals (LAR; see Berk 1990), is where the estimates of β₀ and β₁ are those that minimize the sum of absolute values of the residuals:

$$\sum_{i=1}^{n} |e_i| = \sum_{i=1}^{n} |y_i - \hat{y}_i| \qquad (5.24)$$

rather than the sum of squared residuals ($\sum_{i=1}^{n} e_i^2$) as in OLS. By not squaring the residuals, extreme observations have less influence on the fitted model. The difficulty is that the computations of the LAD estimates for β₀ and β₁ are more complex than OLS estimates, although algorithms are available (Birkes & Dodge 1993) and robust regression techniques are now common in statistical software (often as part of nonlinear modeling routines).
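To make the contrast between the two criteria concrete, the following is a minimal sketch of an LAD fit for a single predictor, not the algorithm of Birkes & Dodge (1993). It assumes a small data set and relies on the property that at least one LAD line passes through two of the data points, so it simply tries every pair. The function names `lad_fit` and `ols_fit` and the data are hypothetical.

```python
from itertools import combinations


def ols_fit(x, y):
    """Ordinary least squares estimates (b0, b1) for one predictor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sxy / sxx
    return mean_y - b1 * mean_x, b1


def lad_fit(x, y):
    """LAD estimates (b0, b1) by exhaustive search over pairs of points.

    Illustrative only: the cost grows as n^3, so this suits small samples.
    """
    best = None
    for i, j in combinations(range(len(x)), 2):
        if x[i] == x[j]:
            continue  # a vertical line has no defined slope
        b1 = (y[j] - y[i]) / (x[j] - x[i])
        b0 = y[i] - b1 * x[i]
        # Criterion of equation 5.24: sum of absolute residuals
        sad = sum(abs(yk - (b0 + b1 * xk)) for xk, yk in zip(x, y))
        if best is None or sad < best[0]:
            best = (sad, b0, b1)
    if best is None:
        raise ValueError("need at least two distinct x values")
    return best[1], best[2]


# Hypothetical data: five points on y = 2x plus one extreme observation
x = [1, 2, 3, 4, 5, 6]
y = [2, 4, 6, 8, 10, 30]

ols_b0, ols_b1 = ols_fit(x, y)
lad_b0, lad_b1 = lad_fit(x, y)
print("OLS:", ols_b0, ols_b1)
print("LAD:", lad_b0, lad_b1)
```

With these made-up values the OLS slope is pulled strongly toward the extreme observation, whereas the LAD line stays with the bulk of the data.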

Figure 5.19. (a) Scatterplot (with Loess smoother, smoothing parameter = 0.5) of number of individuals against clump area (dm²). (b) Scatterplot of residuals against predicted number of individuals from linear regression of number of individuals against clump area.

Figure 5.20. (a) Scatterplot (with linear regression line and 95% confidence band fitted) of log₁₀ number of individuals against log₁₀ clump area. (b) Scatterplot of residuals against predicted number of individuals from linear regression of log₁₀ number of individuals against log₁₀ clump area.

M-estimators

These were introduced in Chapter 2 for estimating the mean of a population. In a regression context, M-estimators involve minimizing the sum of some function of e_i, with OLS (minimizing $\sum_{i=1}^{n} e_i^2$) and LAD (minimizing $\sum_{i=1}^{n} |e_i|$) simply being special cases (Birkes & Dodge 1993). Huber M-estimators, described in Chapter 2, weight the observations differently depending on how far they are from the center of the distribution. In robust regression analyses, Huber M-estimators weight the residuals (e_i) differently depending on how far they are from zero (Berk 1990) and use these new residuals to calculate adjusted Y-values. The estimates for β₀ and β₁ are those that minimize both $\sum_{i=1}^{n} e_i^2$ (i.e. OLS) when the residuals are near zero and $\sum_{i=1}^{n} |e_i|$ (i.e. LAD) when the residuals are far from zero. We need to choose the size of the residual at which the method switches from OLS to LAD; this decision is somewhat subjective, although recommendations are available (Huber 1981, Wilcox 1997). You should ensure that the default value used by your statistical software for robust regression seems reasonable. Wilcox (1997) described more sophisticated robust regression procedures, including an M-estimator based on iteratively reweighting the residuals. One problem with M-estimators is that the sampling distributions of the estimated coefficients are unlikely to be normal, unless sample sizes are large, and the usual calculations for standard errors, confidence intervals and hypothesis testing may not be valid (Berk 1990). Resampling methods such as bootstrap (Chapter 2) are probably the most reliable approach (Wilcox 1997).
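The switch between squared and absolute residuals can be sketched as follows. This is an illustration under stated assumptions, not the procedure of any cited author: it computes a Huber-type fit by iteratively reweighted least squares (one common computational route, used here instead of the adjusted Y-values described above), rescales residuals by the median absolute deviation, and uses a switching constant of 1.345, which is a widely used default but remains a choice. The names `weighted_fit` and `huber_fit` and the data are hypothetical.

```python
from statistics import median


def weighted_fit(x, y, w):
    """Weighted least squares estimates (b0, b1) for one predictor."""
    sw = sum(w)
    mean_x = sum(wi * xi for wi, xi in zip(w, x)) / sw
    mean_y = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - mean_x) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mean_x) * (yi - mean_y) for wi, xi, yi in zip(w, x, y))
    b1 = sxy / sxx
    return mean_y - b1 * mean_x, b1


def huber_fit(x, y, c=1.345, n_iter=50, tol=1e-10):
    """Huber-type estimates (b0, b1) by iteratively reweighted least squares.

    c is the assumed switching constant, in units of a robust residual scale.
    """
    b0, b1 = weighted_fit(x, y, [1.0] * len(x))  # start from the OLS fit
    for _ in range(n_iter):
        resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
        centre = median(resid)
        # Robust scale: median absolute deviation, rescaled for normal errors
        scale = median([abs(r - centre) for r in resid]) / 0.6745
        if scale == 0:
            break  # residuals are (almost) all identical; nothing to reweight
        k = c * scale
        # Weight 1 near zero (OLS-like); weight k/|e| far from zero (LAD-like)
        w = [1.0 if abs(r) <= k else k / abs(r) for r in resid]
        new_b0, new_b1 = weighted_fit(x, y, w)
        converged = abs(new_b0 - b0) + abs(new_b1 - b1) < tol
        b0, b1 = new_b0, new_b1
        if converged:
            break
    return b0, b1


# Hypothetical data: roughly y = 2x, with one extreme final observation
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9, 14.2, 15.8, 18.1, 40.0]

ols_b0, ols_b1 = weighted_fit(x, y, [1.0] * len(x))
hub_b0, hub_b1 = huber_fit(x, y)
print("OLS slope:", ols_b1)
print("Huber slope:", hub_b1)
```

Changing `c` moves the point at which observations start to be down-weighted, which is the subjective choice discussed above; it is worth checking what value your own software uses.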

Rank-based (“non-parametric”) regression

This approach does not assume any specific distribution of the error terms but still fits the usual linear regression model. This approach might be particularly useful if either of the two variables is not normally distributed and nonlinearity is evident but transformations are either ineffective or misrepresent the underlying biological process. The simplest non-parametric regression analysis is based on the n(n − 1)/2 OLS slopes of the regression lines for each pair of observations (the slope between (x₁, y₁) and (x₂, y₂), the slope between (x₂, y₂) and (x₃, y₃), the slope between (x₁, y₁) and (x₃, y₃), etc.). The non-parametric estimator of β₁ (b₁) is the median of these slopes and the non-parametric estimator of β₀ (b₀) is the median of all the y_i − b₁x_i differences (Birkes & Dodge 1993, Sokal & Rohlf 1995, Sprent 1993). A t test for β₁ based on the ranks of the Y-values is described in Birkes & Dodge (1993); an alternative is to simply use Kendall's rank correlation coefficient (Sokal & Rohlf 1995).
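The median-of-slopes estimator described above (often called the Theil-Sen estimator in the wider literature) can be sketched in a few lines. The function name `median_slope_fit` and the data are hypothetical, and pairs of observations with identical X values are skipped here as an assumption, because their slope is undefined.

```python
from itertools import combinations
from statistics import median


def median_slope_fit(x, y):
    """Non-parametric estimates (b0, b1) from the median of pairwise slopes."""
    slopes = [
        (y[j] - y[i]) / (x[j] - x[i])
        for i, j in combinations(range(len(x)), 2)
        if x[i] != x[j]  # assumption: ignore pairs with tied X values
    ]
    b1 = median(slopes)
    # Intercept: median of the differences y_i - b1 * x_i
    b0 = median([yi - b1 * xi for xi, yi in zip(x, y)])
    return b0, b1


# Hypothetical data: four points on y = 1 + 2x and one extreme observation
x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 40]

b0, b1 = median_slope_fit(x, y)
print("b0 =", b0, "b1 =", b1)
```

Six of the ten pairwise slopes in this example equal 2, so the median ignores the four slopes that involve the extreme observation.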

Randomization test

A randomization test of the H₀ that β₁ equals zero can also be constructed by comparing the observed value of b₁ to the distribution of b₁ found by pairing the y_i and x_i values at random a large number of times and calculating b₁ each time (Manly 1997). The P value is then the percentage of values of b₁ from this distribution that are equal to or larger than the observed value of b₁.
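A minimal sketch of this randomization test follows, assuming a one-tailed comparison as worded above and 9999 random pairings. Counting the observed arrangement as one member of the randomization distribution is a common convention adopted here as an assumption. The names `ols_slope` and `randomization_test`, the seed, and the data are hypothetical.

```python
import random


def ols_slope(x, y):
    """OLS slope b1 for one predictor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx


def randomization_test(x, y, n_perm=9999, seed=1):
    """Return (observed b1, one-tailed P value as a proportion)."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = ols_slope(x, y)
    shuffled = list(y)
    count = 1  # assumption: the observed pairing counts as one arrangement
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # pair the y values with the x values at random
        if ols_slope(x, shuffled) >= observed:
            count += 1
    return observed, count / (n_perm + 1)


# Hypothetical data with a strong positive relationship
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1.2, 1.9, 3.2, 3.8, 5.1, 6.3, 6.8, 8.1]

b1_obs, p_value = randomization_test(x, y)
print("observed b1:", b1_obs)
print("P value:", p_value)
```

Multiply the returned proportion by 100 to express it as a percentage. For a two-tailed test, one option is to compare absolute values of b₁ instead.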

5.4 Relationship between regression and correlation

In document Experimental Design and Data Analysis for Biologists - Quinn & Keough - Cambridge 2002 (Page 124-126)