Box 6.3 The multiple linear regression model and its parameters

Consider a set of i = 1 to n observations where each observation was selected because of its specific X-values, i.e. the values of the p (j = 1 to p) predictor variables X₁, X₂, ..., X_j, ..., X_p were fixed by the investigator, whereas the Y-value for each observation was sampled from a population of possible Y-values. The multiple linear regression model that we usually fit to the data is:

$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_j x_{ij} + \dots + \beta_p x_{ip} + \varepsilon_i \qquad (6.1)$$

In model 6.1 we have the following.

y_i is the value of Y for the ith observation when the predictor variable X₁ equals x_i1, X₂ equals x_i2, X_j equals x_ij, etc.

β₀ is the population intercept, the true mean value of Y when X₁ equals zero, X₂ equals zero, X_j equals zero, etc.

β₁ is the partial population regression slope for Y on X₁ holding X₂, X₃, etc., constant. It measures the change in Y per unit change in X₁ holding the value of all other X-variables constant.

β₂ is the partial population regression slope for Y on X₂ holding X₁, X₃, etc., constant. It measures the change in Y per unit change in X₂ holding the value of all other X-variables constant.

β_j is the partial population regression slope for Y on X_j holding X₁, X₂, etc., constant; it measures the change in Y per unit change in X_j holding the value of the other p − 1 X-variables constant.

ε_i is the random or unexplained error associated with the ith observation. Each ε_i measures the difference between each observed y_i and the mean of y_i; the latter is the value of y_i predicted by the population regression model, which we never know. We assume that when the predictor variable X₁ equals x_i1, X₂ equals x_i2, X_j equals x_ij, etc., these error terms are normally distributed, their mean is zero (E(ε_i) equals zero) and their variance is the same and is designated σ_ε². This is the assumption of homogeneity of variances. We also assume that these ε_i terms are independent of, and therefore uncorrelated with, each other. These assumptions (normality, homogeneity of variances and independence) also apply to the response variable Y when the predictor variable X₁ equals x_i1, X₂ equals x_i2, X_j equals x_ij, etc.
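The assumptions behind model 6.1 can be made concrete with a small simulation. The sketch below is only illustrative: the function name `simulate_response`, the parameter values in `beta` and the predictor values in `X` are invented for the example and do not come from any data set discussed here. It treats the X-values as fixed and draws each y_i as its population mean plus an independent normal error with mean zero and a common standard deviation.

```python
import random


def simulate_response(X, beta, sigma, seed=0):
    """Draw one y_i per row of fixed predictor values under model 6.1.

    X     : list of rows, each holding the p fixed predictor values
    beta  : [beta_0, beta_1, ..., beta_p], hypothetical population values
    sigma : common standard deviation of the error terms
    """
    rng = random.Random(seed)
    y = []
    for row in X:
        # Population mean of Y at this combination of X-values.
        mean_y = beta[0] + sum(b * x for b, x in zip(beta[1:], row))
        # Independent normal error with mean zero and constant variance.
        y.append(mean_y + rng.gauss(0.0, sigma))
    return y


# Hypothetical fixed X-values (two predictors) and population parameters.
X = [[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0]]
beta = [0.5, 2.0, -1.0]
y = simulate_response(X, beta, sigma=0.3)
print(y)
```

Setting `sigma=0.0` returns the population means themselves, which is a convenient way to see which part of each y_i is due to the model and which part is error.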

centesimal degree change in latitude, holding longitude constant.

β₂ is the population slope for Y on X₂ holding X₁, X₃, etc., constant. It measures the change in relative abundance of C₃ grasses for a one centesimal degree change in longitude, holding latitude constant.

β_j is the population slope for Y on X_j holding X₁, X₂, etc., constant; it measures the change in Y per unit change in X_j holding the value of the other p − 1 X-variables constant.

ε_i is the random or unexplained error associated with the ith observation of relative abundance of C₃ grasses not explained by the model.

The slope parameters (β₁, β₂, ..., β_j, ..., β_p) are termed partial regression slopes (coefficients) because they measure the change in Y per unit

Fitting the multiple regression model to our data and obtaining estimates of the model parameters is an extension of the methods used for simple linear regression, although the computations are complex. We need to estimate the parameters (β₀, β₁, β₂, ..., β_p and σ_ε²) of the multiple linear regression model based on our random sample of n (x_i1, x_i2, ..., x_ip, y_i) observations. Once we have estimates of the parameters, we can determine the sample regression line:

$$\hat{y}_i = b_0 + b_1 x_{i1} + b_2 x_{i2} + \dots + b_j x_{ij} + \dots + b_p x_{ip}$$

where:

ŷ_i is the value of y_i for x_i1, x_i2, ..., x_ij, ..., x_ip predicted by the fitted regression line,

b₀ is the sample estimate of β₀, the Y-intercept,

b₁, b₂, ..., b_j, ..., b_p are the sample estimates of β₁, β₂, ..., β_j, ..., β_p, the partial regression slopes.

We can estimate these parameters using either (ordinary) least squares (OLS) or maximum likelihood (ML). If we assume normality, the OLS estimates of β₀, β₁, etc., are the same as the ML estimates. As with simple regression, we will focus on OLS estimation. The actual calculations for the OLS estimates of the model parameters involve solving a set of simultaneous normal equations, one for each parameter in the model, and are best represented with matrix algebra (Box 6.4).

The OLS estimates of β₀, β₁, β₂, etc., are the values that produce a sample regression line ($\hat{y}_i = b_0 + b_1 x_{i1} + b_2 x_{i2} + \dots + b_j x_{ij} + \dots + b_p x_{ip}$) that minimizes $\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$. This is the sum of the squared deviations (SS) between each observed y_i and the value of y_i predicted by the sample regression line for each x_ij. Each (y_i − ŷ_i) is a residual from the fitted regression plane and represents the vertical distance between the regression plane and the Y-value for each observation (Figure 6.1). The OLS estimate of σ_ε² (the variance of the model error terms) is the sample variance of these residuals and is the Residual (or Error) Mean Square from the analysis of variance (Section 6.1.3).
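The OLS procedure described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the matrix formulation of Box 6.4 and not a substitute for statistical software: the helper names `solve_linear_system` and `fit_ols` and the small data set are invented for the example. The code builds the normal equations (one per parameter), solves them by Gaussian elimination, and computes the Residual Mean Square with divisor n − (p + 1).

```python
def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        # Swap in the row with the largest pivot for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    # Back substitution.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def fit_ols(X, y):
    """Return (b, residuals, ms_residual) for y regressed on the columns of X."""
    n, p = len(X), len(X[0])
    k = p + 1  # number of parameters: intercept plus p slopes
    D = [[1.0] + list(row) for row in X]  # add a column of ones for b0
    # Normal equations: (D'D) b = D'y, one equation per parameter.
    DtD = [[sum(D[i][a] * D[i][c] for i in range(n)) for c in range(k)]
           for a in range(k)]
    Dty = [sum(D[i][a] * y[i] for i in range(n)) for a in range(k)]
    b = solve_linear_system(DtD, Dty)
    fitted = [sum(bj * dj for bj, dj in zip(b, row)) for row in D]
    residuals = [yi - fi for yi, fi in zip(y, fitted)]
    ss_residual = sum(e * e for e in residuals)
    ms_residual = ss_residual / (n - k)  # estimate of the error variance
    return b, residuals, ms_residual


# Hypothetical data: six observations on two predictors.
X = [[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0], [5.0, 4.0], [6.0, 8.0]]
y = [0.7, 3.4, 1.4, 5.8, 6.3, 4.6]
b, residuals, ms_residual = fit_ols(X, y)
print("estimates:", b)
print("MS residual:", ms_residual)
```

Because the model includes an intercept, the residuals from this fit sum to zero apart from rounding error, which is a quick check that the normal equations were solved correctly.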

Figure 6.1. Scatterplot of the log-transformed relative abundance of C₃ plants against longitude and latitude for 73 sites from Paruelo & Lauenroth (1996), showing the OLS fitted multiple regression linear response surface. Axes: longitude (°W), latitude (°N) and log₁₀ C₃ grass abundance.

change in a particular X holding the other p − 1 X-variables constant. It is important to distinguish these partial regression slopes in multiple linear regression from the regression slope in simple linear regression. If we fit a simple regression model between Y and just one of the X-variables, then that slope is the change in Y per unit change in X, ignoring the other p − 1 predictor variables we might have recorded plus any predictor variables we didn't measure. Again using the data from Paruelo & Lauenroth (1996), the partial regression slope of the relative abundance of C₃ grasses against longitude measures the change in relative abundance for a one unit (one centesimal degree) change in longitude, holding latitude constant. If we fitted a simple linear regression model for relative abundance of C₃ grasses against longitude, we would completely ignore latitude, and any other predictors we didn't record, in the interpretation of the slope. Multiple regression models enable us to assess the relationship between the response variable and each of the predictors, adjusting for the remaining predictors.
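The difference between a simple and a partial regression slope is easy to see with two correlated predictors. The sketch below uses made-up numbers, not the Paruelo & Lauenroth (1996) data, and the function names are invented for the example. For the two-predictor case it obtains the partial slope by first removing from the focal predictor the part that is linearly related to the other predictor, and then regressing Y on what is left; this gives the same slope as fitting the full two-predictor model by OLS.

```python
def mean(values):
    return sum(values) / len(values)


def simple_slope(x, y):
    """OLS slope of y on a single predictor x, ignoring everything else."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def partial_slope(x_focal, x_other, y):
    """Slope of y on x_focal holding x_other constant (two predictors only)."""
    # Remove the part of x_focal that is linearly predictable from x_other.
    b = simple_slope(x_other, x_focal)
    m_f, m_o = mean(x_focal), mean(x_other)
    adjusted = [(f - m_f) - b * (o - m_o) for f, o in zip(x_focal, x_other)]
    # Regress y on the adjusted predictor.
    return simple_slope(adjusted, y)


# Hypothetical, correlated predictors and an error-free response
# built as y = 1 + 2*x1 + 3*x2 so the true partial slopes are known.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0]
y = [1.0 + 2.0 * a + 3.0 * b for a, b in zip(x1, x2)]

print("simple slope of y on x1: ", round(simple_slope(x1, y), 6))
print("partial slope of y on x1:", round(partial_slope(x1, x2, y), 6))
```

With these numbers the simple slope is about 5 while the partial slope is about 2. The simple slope absorbs part of the effect of x2 because x1 and x2 increase together, whereas the partial slope recovers the coefficient used to build y.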

6.1.2 Estimating model parameters

We estimate the parameters (β₀, β₁, β₂, ..., β_p and σ_ε²) of the multiple linear regression model, based on our random sample of n (x_i1, x_i2, ..., x_ij, ..., x_ip, y_i) observations, using OLS methods (Box 6.3). The fitted regression line is:

$$\hat{y}_i = b_0 + b_1 x_{i1} + b_2 x_{i2} + \dots + b_j x_{ij} + \dots + b_p x_{ip} \qquad (6.4)$$

where:

ŷ_i is the value of relative abundance of C₃ grasses for x_i1, x_i2, ..., x_ij, ..., x_ip (e.g. a given combination of latitude and longitude) predicted by the fitted regression model,

b₀ is the sample estimate of β₀, the Y-intercept,

b₁, b₂, ..., b_j, ..., b_p are the sample estimates of β₁, β₂, ..., β_j, ..., β_p, the partial regression slopes.

We can also determine standardized partial regression slopes that are independent of the units in which the variables are measured (Section 6.1.6).

The OLS estimates of these parameters are the values that minimize the sum of squared deviations (SS) between each observed value of relative abundance of C₃ grasses and the relative abundance of C₃ grasses predicted by the fitted regression model. This difference between each observed y_i and each predicted ŷ_i is called a residual (e_i). We will use the residuals for checking the fit of the model to our data in Section 6.1.8.

The actual calculations for the OLS estimates of the model parameters involve solving a set of simultaneous normal equations (see Section 5.2.3), one for each parameter in the model, and are best represented with matrix algebra (Box 6.4). The computations are tedious but the estimates, and their standard errors, should be standard output from multiple linear regression routines in your statistical software. Confidence intervals for the parameters can also be calculated using the t distribution with n − (p + 1) df. New Y-values can be predicted from new values of any or all of the p X-variables by substituting the new X-values into the regression equation and calculating the predicted Y-value. As with simple regression, be careful about predicting from values of any of the X-variables outside the range of your data. Standard errors and prediction intervals for new Y-values can be determined (see Neter et al. 1996). Note that the confidence intervals for model parameters (slopes and intercept) and prediction intervals for new Y-values from new X-values depend on the number of observations and the number of predictors. This is because the divisor for the MS_Residual, and the df for the t distribution used for confidence intervals, is n − (p + 1). Therefore, for a given standard error, our confidence in predicted Y-values from our fitted model is reduced when we include more predictors.
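Prediction from a fitted model, the caution about extrapolation and the residual df can be sketched as follows. The coefficient values, the observed X-values and the function names are hypothetical and chosen only for illustration. Note that the range check looks at each predictor separately, so a new combination of X-values can pass it and still lie outside the region jointly covered by the data.

```python
def predict(b, x_new):
    """Substitute new X-values into the fitted equation b0 + b1*x1 + ... + bp*xp."""
    return b[0] + sum(bj * xj for bj, xj in zip(b[1:], x_new))


def outside_observed_range(X, x_new):
    """Return the indices of predictors whose new value lies outside the data."""
    flagged = []
    for j, value in enumerate(x_new):
        column = [row[j] for row in X]
        if value < min(column) or value > max(column):
            flagged.append(j)
    return flagged


def residual_df(n, p):
    """Degrees of freedom for MS_Residual and the t distribution: n - (p + 1)."""
    return n - (p + 1)


# Hypothetical fitted coefficients [b0, b1, b2] and observed X-values.
b = [0.5, 2.0, -1.0]
X = [[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0]]

x_new = [3.0, 4.0]
print(predict(b, x_new), outside_observed_range(X, x_new))

x_far = [10.0, 4.0]  # first predictor is well beyond the observed maximum
print(predict(b, x_far), outside_observed_range(X, x_far))

# With 73 observations, each added predictor costs one residual df.
print(residual_df(73, 2), residual_df(73, 6))
```

The last line shows the point made in the text: for a fixed n, adding predictors lowers the df available for MS_Residual and for the t distribution used in confidence and prediction intervals.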

6.1.3 Analysis of variance

Similar to simple linear regression models described in Chapter 5, we can partition the total variation in Y (SS_Total) into two additive components (Table 6.1). The first is the variation in Y explained by its linear relationship with X₁, X₂, ..., X_p, termed SS_Regression. The second is the variation in Y not explained by the linear relationship with X₁, X₂, ..., X_p, termed SS_Residual, which is measured as the difference between each observed y_i and the Y-value predicted by the regression model (ŷ_i). These SS in Table 6.1 are identical to those in Table 5.1 for simple regression models. In fact, the

120 MULTIPLE AND COMPLEX REGRESSION

In document Experimental Design and Data Analysis for Biologists - Quinn & Keough - Cambridge 2002 (Page 137-140)