Box 6.7 Worked example of indicator (dummy) variables

We will consider a subset of the data from Loyn (1987) where abundance of forest birds is the response variable and grazing intensity (1 to 5 from least to greatest) and log₁₀ patch area are the predictor variables. First, we treat grazing as a continuous variable and fit model 6.28.

Coefficient   Estimate   Standard error   t        P
Intercept     21.603     3.092            6.987    <0.001
Grazing       −2.854     0.713            −4.005   <0.001
Log₁₀ area    6.890      1.290            5.341    <0.001

Note that the effects of both grazing and log₁₀ area are significant and the partial regression slope for grazing is negative, indicating that, holding patch area constant, there are fewer birds in patches with more intense grazing.
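As a quick check on how these coefficients are read, the fitted model can be evaluated directly. The following is a minimal sketch in Python (a language the source does not use); the function name `predict_abundance` is hypothetical, and the coefficients are simply copied from the table above.

```python
# Coefficients copied from the fitted model with grazing treated as continuous.
INTERCEPT = 21.603
B_GRAZING = -2.854
B_LOG_AREA = 6.890


def predict_abundance(grazing, log10_area):
    """Predicted bird abundance for a grazing score (1-5) and log10 patch area."""
    return INTERCEPT + B_GRAZING * grazing + B_LOG_AREA * log10_area


# Holding log10 area fixed, moving from grazing level 1 to level 5
# changes the prediction by 4 * B_GRAZING, whatever the patch area.
difference = predict_abundance(5, 1.0) - predict_abundance(1, 1.0)
print(round(difference, 3))
```

Treating grazing as continuous therefore imposes the same change in predicted abundance (the partial slope) for every one-step increase in grazing level.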

Now we will convert grazing into four dummy variables with no grazing (level 1) as the reference category (Table 6.3) and fit model 6.29.

Coefficient   Estimate   Standard error   t        P
Intercept     15.716     2.767            5.679    <0.001
Grazing₁      0.383      2.912            0.131    0.896
Grazing₂      −0.189     2.549            −0.074   0.941
Grazing₃      −1.592     2.976            −0.535   0.595
Grazing₄      −11.894    2.931            −4.058   <0.001
Log₁₀ area    7.247      1.255            5.774    <0.001

The partial regression slopes for these dummy variables measure the difference in bird abundance between the grazing category represented by the dummy variable and the reference category for any specific level of log₁₀ area. Note that only the effect of intense grazing (category: 5; dummy variable: grazing₄) is significantly different from the no grazing category.
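The coding scheme can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, and the mapping of dummy variables grazing₁ to grazing₄ onto grazing levels 2 to 5 is an assumption inferred from the text (level 1 is the reference and level 5 corresponds to grazing₄), since Table 6.3 is not reproduced here. The coefficients are copied from the fitted model above.

```python
# Coefficients copied from the fitted dummy-variable model above.
INTERCEPT = 15.716
B_DUMMIES = [0.383, -0.189, -1.592, -11.894]  # grazing1..grazing4
B_LOG_AREA = 7.247


def dummy_code(grazing_level):
    """Return four 0/1 indicators for grazing levels 2-5; level 1 is the reference."""
    if grazing_level not in (1, 2, 3, 4, 5):
        raise ValueError("grazing level must be an integer from 1 to 5")
    return [1 if grazing_level == level else 0 for level in (2, 3, 4, 5)]


def predict_abundance(grazing_level, log10_area):
    """Predicted abundance from the dummy-variable model."""
    dummies = dummy_code(grazing_level)
    effect = sum(b * d for b, d in zip(B_DUMMIES, dummies))
    return INTERCEPT + effect + B_LOG_AREA * log10_area


print(dummy_code(1))  # reference category: all indicators are zero
print(dummy_code(5))  # most intense grazing: only grazing4 is one
```

With this coding, the predicted difference between any grazing level and the reference level at the same log₁₀ area is just that level's dummy coefficient, for example −11.894 for level 5.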

the reference category (zero grazing) for any specific value of log₁₀ area. Using analysis of covariance terminology (Chapter 12), each regression slope measures the difference in the adjusted mean of Y between that category and the reference category (Box 6.7). Interaction terms between the dummy variables and the continuous variable could also be included. These interactions measure how much the slopes of the regressions between Y and log₁₀ area differ between the levels of grazing. Most statistical software now automates the coding of categorical variables in regression analyses, although you should check what form of coding your software uses. Models that incorporate continuous and categorical predictors will also be considered as part of analysis of covariance in Chapter 12.
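To make the interaction terms concrete, the sketch below builds one row of a design matrix containing an intercept, four grazing dummy variables, log₁₀ area, and the four dummy-by-area products. It is written in Python for illustration; the function name `design_row` and the column order are assumptions, as is the reference coding with level 1 as the baseline.

```python
def design_row(grazing_level, log10_area):
    """One design-matrix row: intercept, four dummies, log10 area, four interactions."""
    # Indicators for grazing levels 2-5; level 1 is the reference category.
    dummies = [1 if grazing_level == level else 0 for level in (2, 3, 4, 5)]
    # Each interaction column is a dummy multiplied by the continuous predictor.
    interactions = [d * log10_area for d in dummies]
    return [1] + dummies + [log10_area] + interactions


row = design_row(5, 2.0)
print(row)
```

The coefficient fitted to each interaction column is then the difference between the log₁₀ area slope at that grazing level and the slope in the reference category.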

6.1.15 Finding the “best” regression model

In many uses of multiple regression, biologists want to find the smallest subset of predictors that provides the “best fit” to the observed data. There are two apparent reasons for this (Mac Nally 2000), related to the two main purposes of regression analysis – explanation and prediction. First, the “best” subset of predictors should include those that are most important in explaining the variation in the response variable. Second, other things being equal, the precision of predictions from our fitted model will be greater with fewer predictor variables in the model. Note that, as we said in the introduction to Chapter 5, biologists, especially ecologists, seem to rarely use their regression models for prediction and we agree with Mac Nally (2000) that biologists are usually searching for the “best” regression model to explain the response variable.

It is important to remember that there will rarely be, for any real data set, a single “best” subset of predictors, particularly if there are many predictors and they are in any way correlated with each other. There will usually be a few models, with different numbers of predictors, which provide similar fits to the observed data. The choice between these competing models will still need to be based on how well the models meet the assumptions, diagnostic considerations of outliers and other influential observations, and biological knowledge of the variables retained.

Criteria for “best” model

Irrespective of which method is used for selecting which variables are included in the model (see below), some criterion must be used for deciding which is the “best” model. One characteristic of such a criterion is that it must protect against “overfitting”, where the addition of extra predictor variables may suggest a better fit even when these variables actually add very little to the explanatory power. For example, r² cannot decrease as more predictor variables are added to the model even if those predictors contribute nothing to the ability of the model to predict or explain the response variable (Box 6.8). So r² is not suitable for comparing models with different numbers of predictors.
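The behaviour of r² can be checked numerically. The sketch below, in Python, fits ordinary least squares by solving the normal equations on simulated data (not the Loyn data) and compares r² before and after adding a predictor that is pure random noise. All function names, the seed, and the simulated values are assumptions made for illustration.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def r_squared(predictors, y):
    """r^2 from an ordinary least squares fit that always includes an intercept."""
    rows = [[1.0] + list(p) for p in predictors]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    mean_y = sum(y) / len(y)
    ss_residual = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_total = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_residual / ss_total


rng = random.Random(1)  # fixed seed so the simulation is repeatable
x = [rng.uniform(0, 10) for _ in range(30)]
noise = [rng.gauss(0, 1) for _ in range(30)]  # unrelated to y by construction
y = [2.0 + 0.5 * xi + rng.gauss(0, 1) for xi in x]

r2_one = r_squared([[xi] for xi in x], y)
r2_two = r_squared([[xi, ni] for xi, ni in zip(x, noise)], y)
print(r2_one <= r2_two + 1e-9)  # small tolerance for floating-point error
```

Because the larger model can always reproduce the smaller one by setting the extra coefficient to zero, its residual sum of squares cannot be larger, so its r² cannot be smaller.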

We are usually dealing with a range of models, with different numbers of predictors, but all are subsets of the full model with all predictors. We will use P to indicate the number of all possible predictors, p the number of predictors included in a specific model and n the number of observations, and we will assume that an intercept is always fitted. If the models are all additive, i.e. no interactions, the number of parameters is p + 1 (the number of predictors plus the intercept). When interactions are included, then p in the equations below should be the number of parameters (except the intercept) in the model, including both predictors and their interactions. We will describe four criteria for determining the fit of a model to the data (Table 6.4).

The first is the adjusted r², which takes into account the number of predictors in the model and, in contrast to the usual r², basically uses mean squares instead of sums of squares and can increase or decrease as new variables are added to the model. A larger value indicates a better fit. Using the MS_Residual from the fit of the model is equivalent, except that a lower value indicates a better fit.
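The mean-square form of the adjusted r² can be written as 1 − MS_Residual / MS_Total, with MS_Residual = SS_Residual / (n − p − 1) and MS_Total = SS_Total / (n − 1). This is the standard definition and is assumed here to match Table 6.4, which is not reproduced in this section. The Python sketch below uses hypothetical sums of squares, not values from the Loyn data, and the function name is illustrative.

```python
def adjusted_r2(ss_residual, ss_total, n, p):
    """Adjusted r^2 = 1 - MS_Residual / MS_Total, for p predictors plus an intercept."""
    ms_residual = ss_residual / (n - p - 1)
    ms_total = ss_total / (n - 1)
    return 1.0 - ms_residual / ms_total


# Hypothetical sums of squares: a third predictor that barely reduces
# SS_Residual raises r^2 slightly (0.600 to 0.605) but lowers adjusted r^2.
n, ss_total = 20, 100.0
print(adjusted_r2(40.0, ss_total, n, p=2))
print(adjusted_r2(39.5, ss_total, n, p=3))
```

Since MS_Total is the same for every model fitted to the same response, ranking models by adjusted r² and ranking them by MS_Residual give the same order, only reversed.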

The second is Mallows’ C_p, which works by comparing a specific reduced model to the full model with all P predictors included. For the full model with all P predictors, C_p will equal P + 1 (the number of parameters including the intercept). The choice of the best model using C_p has two components: C_p should be as small as possible and as close to p as possible.
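A commonly used form of this statistic is C_p = SS_Residual(reduced) / MS_Residual(full) − [n − 2(p + 1)]; it is assumed here to correspond to the definition in Table 6.4, which is not reproduced in this section. The Python sketch below uses hypothetical numbers and an illustrative function name to confirm the statement that the full model gives C_p = P + 1.

```python
def mallows_cp(ss_residual_reduced, ms_residual_full, n, p):
    """C_p for a reduced model with p predictors plus an intercept."""
    return ss_residual_reduced / ms_residual_full - (n - 2 * (p + 1))


# Hypothetical full model: n = 20 observations, P = 3 predictors.
n, P = 20, 3
ms_residual_full = 2.0
ss_residual_full = ms_residual_full * (n - P - 1)  # SS = MS * residual df

# Treating the full model as the "reduced" model returns P + 1.
print(mallows_cp(ss_residual_full, ms_residual_full, n, P))
```

For the full model the first term equals the residual degrees of freedom, n − P − 1, so the expression reduces to P + 1 regardless of the data.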


Box 6.8 Hierarchical partitioning and model selection.

In document Experimental Design and Data Analysis for Biologists - Quinn & Keough - Cambridge 2002 (Page 156-158)