
Box 6.3 The multiple linear regression model and its parameters

Consider a set of $i = 1$ to $n$ observations where each observation was selected because of its specific X-values, i.e. the values of the $p$ ($j = 1$ to $p$) predictor variables $X_1, X_2, \ldots, X_j, \ldots, X_p$ were fixed by the investigator, whereas the Y-value for each observation was sampled from a population of possible Y-values. The multiple linear regression model that we usually fit to the data is:

$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \ldots + \beta_j x_{ij} + \ldots + \beta_p x_{ip} + \varepsilon_i \tag{6.1}$$

In model 6.1 we have the following.

$y_i$ is the value of Y for the $i$th observation when the predictor variable $X_1$ equals $x_{i1}$, $X_2$ equals $x_{i2}$, $X_j$ equals $x_{ij}$, etc.

$\beta_0$ is the population intercept, the true mean value of Y when $X_1$ equals zero, $X_2$ equals zero, $X_j$ equals zero, etc.

$\beta_1$ is the partial population regression slope for Y on $X_1$ holding $X_2$, $X_3$, etc., constant. It measures the change in Y per unit change in $X_1$ holding the value of all other X-variables constant.

$\beta_2$ is the partial population regression slope for Y on $X_2$ holding $X_1$, $X_3$, etc., constant. It measures the change in Y per unit change in $X_2$ holding the value of all other X-variables constant.

$\beta_j$ is the partial population regression slope for Y on $X_j$ holding $X_1$, $X_2$, etc., constant; it measures the change in Y per unit change in $X_j$ holding the value of the other $p-1$ X-variables constant.

$\varepsilon_i$ is random or unexplained error associated with the $i$th observation. Each $\varepsilon_i$ measures the difference between each observed $y_i$ and the mean of $y_i$; the latter is the value of $y_i$ predicted by the population regression model, which we never know. We assume that when the predictor variable $X_1$ equals $x_{i1}$, $X_2$ equals $x_{i2}$, $X_j$ equals $x_{ij}$, etc., these error terms are normally distributed, their mean is zero ($E(\varepsilon_i)$ equals zero) and their variance is the same and is designated $\sigma_\varepsilon^2$. This is the assumption of homogeneity of variances. We also assume that these $\varepsilon_i$ terms are independent of, and therefore uncorrelated with, each other. These assumptions (normality, homogeneity of variances and independence) also apply to the response variable Y when the predictor variable $X_1$ equals $x_{i1}$, $X_2$ equals $x_{i2}$, $X_j$ equals $x_{ij}$, etc.
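The assumptions in this box can be made concrete with a small simulation. The following is a minimal sketch, not part of the original text: the function name `simulate_responses`, the parameter values and the predictor values are all hypothetical, chosen only for illustration. It treats the X-values as fixed and draws each Y-value as the model mean plus an independent normal error with mean zero and a common standard deviation.

```python
import random


def simulate_responses(x_rows, beta, sigma, seed=0):
    """Draw one y value per row of fixed predictor values under model 6.1.

    x_rows: list of n rows, each holding the p fixed predictor values.
    beta:   [beta0, beta1, ..., betap] (hypothetical population parameters).
    sigma:  common standard deviation of the error terms.
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    ys = []
    for row in x_rows:
        if len(row) != len(beta) - 1:
            raise ValueError("each row needs one value per slope")
        # Mean of Y at these X-values according to the population model
        mean_y = beta[0] + sum(b * x for b, x in zip(beta[1:], row))
        # Independent error, normal with mean 0 and the same variance for every row
        ys.append(mean_y + rng.gauss(0.0, sigma))
    return ys


# Hypothetical example: two predictors, five fixed combinations of X-values
x_fixed = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 6]]
y_sample = simulate_responses(x_fixed, beta=[1.0, 2.0, -0.5], sigma=0.3)
print(y_sample)
```

Setting `sigma` to zero returns the population means themselves, which is a quick way to check that the deterministic part of the model is coded as intended.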

centesimal degree change in latitude, holding longitude constant.

$\beta_2$ is the population slope for Y on $X_2$ holding $X_1$, $X_3$, etc., constant. It measures the change in relative abundance of C3 grasses for a one centesimal degree change in longitude, holding latitude constant.

$\beta_j$ is the population slope for Y on $X_j$ holding $X_1$, $X_2$, etc., constant; it measures the change in Y per unit change in $X_j$ holding the value of the other $p-1$ X-variables constant.

$\varepsilon_i$ is random or unexplained error associated with the $i$th observation of relative abundance of C3 grasses not explained by the model.

The slope parameters ($\beta_1, \beta_2, \ldots, \beta_j, \ldots, \beta_p$) are termed partial regression slopes (coefficients) because they measure the change in Y per unit

Fitting the multiple regression model to our data and obtaining estimates of the model parameters is an extension of the methods used for simple linear regression, although the computations are complex. We need to estimate the parameters ($\beta_0, \beta_1, \beta_2, \ldots, \beta_p$ and $\sigma_\varepsilon^2$) of the multiple linear regression model based on our random sample of $n$ ($x_{i1}, x_{i2}, \ldots, x_{ip}, y_i$) observations. Once we have estimates of the parameters, we can determine the sample regression line:

$$\hat{y}_i = b_0 + b_1 x_{i1} + b_2 x_{i2} + \ldots + b_j x_{ij} + \ldots + b_p x_{ip}$$

where:

$\hat{y}_i$ is the value of $y_i$ for $x_{i1}, x_{i2}, \ldots, x_{ij}, \ldots, x_{ip}$ predicted by the fitted regression line,

$b_0$ is the sample estimate of $\beta_0$, the Y-intercept,

$b_1, b_2, \ldots, b_j, \ldots, b_p$ are the sample estimates of $\beta_1, \beta_2, \ldots, \beta_j, \ldots, \beta_p$, the partial regression slopes.

We can estimate these parameters using either (ordinary) least squares (OLS) or maximum likelihood (ML). If we assume normality, the OLS estimates of $\beta_0$, $\beta_1$, etc., are the same as the ML estimates. As with simple regression, we will focus on OLS estimation. The actual calculations for the OLS estimates of the model parameters involve solving a set of simultaneous normal equations, one for each parameter in the model, and are best represented with matrix algebra (Box 6.4).

The OLS estimates of $\beta_0$, $\beta_1$, $\beta_2$, etc., are the values that produce a sample regression line ($\hat{y}_i = b_0 + b_1 x_{i1} + b_2 x_{i2} + \ldots + b_j x_{ij} + \ldots + b_p x_{ip}$) that minimizes $\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$. This is the sum of the squared deviations (SS) between each observed $y_i$ and the value of $y_i$ predicted by the sample regression line for each $x_{ij}$. Each ($y_i - \hat{y}_i$) is a residual from the fitted regression plane and represents the vertical distance between the regression plane and the Y-value for each observation (Figure 6.1). The OLS estimate of $\sigma_\varepsilon^2$ (the variance of the model error terms) is the sample variance of these residuals and is the Residual (or Error) Mean Square from the analysis of variance (Section 6.1.3).
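The normal-equations approach described in this box can be sketched in a few lines of plain Python. This is an illustrative sketch only, not the procedure of Box 6.4: the helper names `solve_linear_system` and `fit_ols` and the example data are hypothetical, and in practice a statistical package would be used instead. The sketch builds one normal equation per parameter, solves them by Gaussian elimination, and computes the Residual Mean Square with divisor $n-(p+1)$.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are not modified
    m = [row[:] + [b[k]] for k, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: predictors may be collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_ols(x_rows, y):
    """Return (coefficients, residuals, residual mean square).

    x_rows: list of n rows, each holding the p predictor values.
    Coefficients are ordered [b0, b1, ..., bp].
    """
    n, p = len(y), len(x_rows[0])
    k = p + 1  # number of parameters, including the intercept
    if n <= k:
        raise ValueError("need more observations than parameters")
    design = [[1.0] + list(row) for row in x_rows]  # leading 1 for the intercept
    # Normal equations: (X'X) b = X'y, one equation per parameter
    xtx = [[sum(d[r] * d[c] for d in design) for c in range(k)] for r in range(k)]
    xty = [sum(d[r] * yi for d, yi in zip(design, y)) for r in range(k)]
    b = solve_linear_system(xtx, xty)
    fitted = [sum(bj * dj for bj, dj in zip(b, d)) for d in design]
    residuals = [yi - fi for yi, fi in zip(y, fitted)]
    ms_residual = sum(e * e for e in residuals) / (n - k)
    return b, residuals, ms_residual


# Hypothetical data: two predictors, six observations
x_rows = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 6], [6, 5]]
y = [2.1, 4.4, 5.1, 7.6, 7.9, 10.4]
coefficients, residuals, ms_residual = fit_ols(x_rows, y)
print(coefficients, ms_residual)
```

Solving the normal equations directly is adequate for a small, well-conditioned example like this one; numerical libraries usually prefer more stable decompositions for the same least squares problem.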

Figure 6.1. Scatterplot of the log-transformed relative abundance of C3 plants against longitude and latitude for 73 sites from Paruelo & Lauenroth (1996) showing OLS fitted multiple regression linear response surface. Axes: longitude (°W), latitude (°N) and log10 C3 grass abundance.

change in a particular X holding the other $p-1$ X-variables constant. It is important to distinguish these partial regression slopes in multiple linear regression from the regression slope in simple linear regression. If we fit a simple regression model between Y and just one of the X-variables, then that slope is the change in Y per unit change in X, ignoring the other $p-1$ predictor variables we might have recorded plus any predictor variables we didn't measure. Again using the data from Paruelo & Lauenroth (1996), the partial regression slope of the relative abundance of C3 grasses against longitude measures the change in relative abundance for a one unit (one centesimal degree) change in longitude, holding latitude constant. If we fitted a simple linear regression model for relative abundance of C3 grasses against longitude, we would completely ignore latitude and any other predictors we didn't record in the interpretation of the slope. Multiple regression models enable us to assess the relationship between the response variable and each of the predictors, adjusting for the remaining predictors.
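The distinction between a partial slope and a simple slope shows up numerically whenever the predictors are correlated. The following minimal sketch uses made-up numbers, not the Paruelo & Lauenroth (1996) data, and hypothetical function names. For the two-predictor case it uses the closed-form OLS solution written in terms of centered sums of squares and cross-products.

```python
def centered_sum_products(u, v):
    """Sum of cross-products of u and v about their means."""
    mean_u, mean_v = sum(u) / len(u), sum(v) / len(v)
    return sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))


def simple_slope(x, y):
    """OLS slope of y on a single predictor x, ignoring any other predictor."""
    return centered_sum_products(x, y) / centered_sum_products(x, x)


def partial_slopes_two_predictors(x1, x2, y):
    """OLS partial slopes (b1, b2) for a model with exactly two predictors."""
    s11 = centered_sum_products(x1, x1)
    s22 = centered_sum_products(x2, x2)
    s12 = centered_sum_products(x1, x2)
    s1y = centered_sum_products(x1, y)
    s2y = centered_sum_products(x2, y)
    det = s11 * s22 - s12 ** 2
    if abs(det) < 1e-12:
        raise ValueError("predictors are perfectly collinear")
    b1 = (s1y * s22 - s2y * s12) / det
    b2 = (s2y * s11 - s1y * s12) / det
    return b1, b2


# Made-up data in which x1 and x2 are positively correlated and
# y = 1 + 2*x1 - 0.5*x2 with no error, to keep the arithmetic transparent
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 6]
y = [1 + 2 * a - 0.5 * b for a, b in zip(x1, x2)]

print(simple_slope(x1, y))                     # about 1.5: x2 is ignored
print(partial_slopes_two_predictors(x1, x2, y))  # about (2.0, -0.5): each adjusted for the other
```

In this example the simple slope of Y on $X_1$ is smaller than the partial slope because it absorbs part of the negative effect of the correlated predictor $X_2$.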

6.1.2 Estimating model parameters

We estimate the parameters ($\beta_0, \beta_1, \beta_2, \ldots, \beta_p$ and $\sigma_\varepsilon^2$) of the multiple linear regression model, based on our random sample of $n$ ($x_{i1}, x_{i2}, \ldots, x_{ij}, \ldots, x_{ip}, y_i$) observations, using OLS methods (Box 6.3). The fitted regression line is:

$$\hat{y}_i = b_0 + b_1 x_{i1} + b_2 x_{i2} + \ldots + b_j x_{ij} + \ldots + b_p x_{ip} \tag{6.4}$$

where:

$\hat{y}_i$ is the value of relative abundance of C3 grasses for $x_{i1}, x_{i2}, \ldots, x_{ij}, \ldots, x_{ip}$ (e.g. a given combination of latitude and longitude) predicted by the fitted regression model,

$b_0$ is the sample estimate of $\beta_0$, the Y-intercept,

$b_1, b_2, \ldots, b_j, \ldots, b_p$ are the sample estimates of $\beta_1, \beta_2, \ldots, \beta_j, \ldots, \beta_p$, the partial regression slopes.

We can also determine standardized partial regression slopes that are independent of the units in which the variables are measured (Section 6.1.6).

The OLS estimates of these parameters are the values that minimize the sum of squared deviations (SS) between each observed value of relative abundance of C3 grasses and the relative abundance of C3 grasses predicted by the fitted regression model. This difference between each observed $y_i$ and each predicted $\hat{y}_i$ is called a residual ($e_i$). We will use the residuals for checking the fit of the model to our data in Section 6.1.8.

The actual calculations for the OLS estimates of the model parameters involve solving a set of simultaneous normal equations (see Section 5.2.3), one for each parameter in the model, and are best represented with matrix algebra (Box 6.4). The computations are tedious but the estimates, and their standard errors, should be standard output from multiple linear regression routines in your statistical software. Confidence intervals for the parameters can also be calculated using the t distribution with $n-(p+1)$ df. New Y-values can be predicted from new values of any or all of the $p$ X-variables by substituting the new X-values into the regression equation and calculating the predicted Y-value. As with simple regression, be careful about predicting from values of any of the X-variables outside the range of your data. Standard errors and prediction intervals for new Y-values can be determined (see Neter et al. 1996). Note that the confidence intervals for model parameters (slopes and intercept) and prediction intervals for new Y-values from new X-values depend on the number of observations and the number of predictors. This is because the divisor for the $MS_{\text{Residual}}$, and the df for the t distribution used for confidence intervals, is $n-(p+1)$. Therefore, for a given standard error, our confidence in predicted Y-values from our fitted model is reduced when we include more predictors.
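The quantities referred to in this paragraph can be written out as follows. This is a sketch in standard notation rather than a quotation from the text: the symbols $s_{b_j}$ (the standard error of $b_j$) and $t_{\alpha/2,\,n-(p+1)}$ (the two-tailed critical value of the t distribution) are notation assumed here.

$$MS_{\text{Residual}} = \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n-(p+1)}, \qquad b_j \pm t_{\alpha/2,\,n-(p+1)}\, s_{b_j}$$

With $n$ fixed, each additional predictor removes one residual df, and the critical t value increases as the df decrease, so an interval built from the same standard error becomes wider.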

6.1.3 Analysis of variance

Similar to simple linear regression models described in Chapter 5, we can partition the total variation in Y ($SS_{\text{Total}}$) into two additive components (Table 6.1). The first is the variation in Y explained by its linear relationship with $X_1, X_2, \ldots, X_p$, termed $SS_{\text{Regression}}$. The second is the variation in Y not explained by the linear relationship with $X_1, X_2, \ldots, X_p$, termed $SS_{\text{Residual}}$, which is measured as the difference between each observed $y_i$ and the Y-value predicted by the regression model ($\hat{y}_i$). These SS in Table 6.1 are identical to those in Table 5.1 for simple regression models. In fact, the

120 MULTIPLE AND COMPLEX REGRESSION
