Part II
Chapter 7
Multiple Regression
A multiple linear regression model is a linear model that describes how a y-variable relates to two or more x-variables (or transformations of x-variables).
For example, suppose that a researcher is studying factors that might affect systolic blood pressures for women aged 45 to 65 years old. The response variable is systolic blood pressure ($Y$). Suppose that two predictor variables of interest are age ($X_1$) and body mass index ($X_2$). The general structure of a multiple linear regression model for this situation would be
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \epsilon.$$
• The expression $\beta_0 + \beta_1 X_1 + \beta_2 X_2$ describes the mean value of blood pressure for specific values of age and BMI.
• The error term ($\epsilon$) describes the characteristics of the differences between individual values of blood pressure and their expected values of blood pressure.
One note concerning terminology: a linear model is one that is linear in the beta coefficients, meaning that each beta coefficient simply multiplies an x-variable or a transformation of an x-variable. For instance, $y = \beta_0 + \beta_1 x + \beta_2 x^2 + \epsilon$ is called a multiple linear regression model even though it describes a quadratic, curved relationship between y and a single x-variable.
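To make the terminology concrete, the following is a minimal sketch (not part of the original examples) that fits the quadratic model with ordinary linear least squares. The data are hypothetical and noise-free, generated from $y = 1 + 2x + 3x^2$, and NumPy is assumed to be available.

```python
import numpy as np

# Hypothetical, noise-free data generated from y = 1 + 2x + 3x^2 (illustration only).
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x + 3.0 * x**2

# Design matrix: a column of ones, x, and the transformation x^2.
# The model is curved in x but linear in the coefficients.
X = np.column_stack([np.ones_like(x), x, x**2])

# Ordinary least squares fit of y on the columns of X.
b, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(b, 6))
```

Because $x^2$ enters only as another column of the design matrix, the same linear machinery recovers the three coefficients.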
7.1 About the Model
Notation for the Population Model
• A population model for a multiple regression model that relates a y-variable to p−1 predictor variables is written as
$$y_i = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \ldots + \beta_{p-1} x_{i,p-1} + \epsilon_i. \tag{7.1}$$
• We assume that the $\epsilon_i$ have a normal distribution with mean 0 and constant variance $\sigma^2$. These are the same assumptions that we used in simple regression with one x-variable.
• The subscript $i$ refers to the $i$th individual or unit in the population. In the notation for the x-variables, the subscript following $i$ simply denotes which x-variable it is.
Estimates of the Model Parameters
• The estimates of the β coefficients are the values that minimize the sum of squared errors for the sample. The exact formula for this will be given in the next chapter when we introduce matrix notation.
• The letter b is used to represent a sample estimate of a β coefficient. Thus $b_0$ is the sample estimate of $\beta_0$, $b_1$ is the sample estimate of $\beta_1$, and so on.
• $\text{MSE} = \frac{\text{SSE}}{n-p}$ estimates $\sigma^2$, the variance of the errors. In the formula, $n$ = sample size, $p$ = number of β coefficients in the model (including the intercept), and SSE = sum of squared errors. Notice that for simple linear regression $p = 2$. Thus, we get the formula for MSE that we introduced in the context of one predictor.
• In the case of two predictors, the estimated regression equation yields a plane (as opposed to a line in the simple linear regression setting). For more than two predictors, the estimated regression equation yields a hyperplane.
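The estimation steps in this list can be sketched numerically. The data below are hypothetical (chosen only for illustration), and `np.linalg.lstsq` is used as one convenient way to obtain the least squares estimates; it is not claimed to be what any particular statistical package does internally.

```python
import numpy as np

# Hypothetical sample: n = 6 observations, two predictors (illustration only).
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y = np.array([3.1, 3.9, 7.2, 7.8, 11.1, 11.9])

# Design matrix with a leading column of ones for the intercept b0.
X = np.column_stack([np.ones(len(y)), x1, x2])
n, p = X.shape                      # p counts the intercept as well

# b minimizes the sum of squared errors for the sample.
b, *_ = np.linalg.lstsq(X, y, rcond=None)

residuals = y - X @ b
sse = float(residuals @ residuals)  # sum of squared errors
mse = sse / (n - p)                 # estimate of sigma^2

print("b =", np.round(b, 4))
print("SSE =", round(sse, 4), " MSE =", round(mse, 4))
```

With two predictors, `b` has three entries ($b_0$, $b_1$, $b_2$), so the divisor for MSE is $n - 3$.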
Predicted Values and Residuals
• A predicted value is calculated as $\hat{y}_i = b_0 + b_1 x_{i,1} + b_2 x_{i,2} + \ldots + b_{p-1} x_{i,p-1}$, where the b values come from statistical software and the x-values are specified by us.
• A residual (error) term is calculated as $e_i = y_i - \hat{y}_i$, the difference between an actual and a predicted value of y.
• A plot of residuals versus predicted values ideally should resemble a horizontal random band. Departures from this form indicate difficulties with the model and/or data.
• Other residual analyses can be done exactly as we did in simple regression. For instance, we might wish to examine a normal probability plot (NPP) of the residuals. Additional plots to consider are plots of residuals versus each x-variable separately. This might help us identify sources of curvature or nonconstant variance.
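As a rough sketch of the quantities described above, the following computes predicted values and residuals for a hypothetical data set and lists the pairs one would plot. The helper name `fitted_and_residuals` and the data are assumptions made for this illustration.

```python
import numpy as np

def fitted_and_residuals(X, y):
    """Return (b, y_hat, e) for a least squares fit of y on the columns of X."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    y_hat = X @ b      # predicted values
    e = y - y_hat      # residuals: actual minus predicted
    return b, y_hat, e

# Hypothetical data (illustration only).
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
x2 = np.array([5.0, 3.0, 6.0, 2.0, 7.0, 1.0, 4.0])
y = np.array([4.2, 5.1, 8.3, 7.9, 12.2, 10.8, 14.1])
X = np.column_stack([np.ones(len(y)), x1, x2])

b, y_hat, e = fitted_and_residuals(X, y)

# Pairs one would plot: residuals against fitted values and against each x-variable.
for name, values in [("fitted", y_hat), ("x1", x1), ("x2", x2)]:
    corr = np.corrcoef(values, e)[0, 1]
    print(f"corr(residuals, {name}) = {corr:.2e}")
```

For a least squares fit with an intercept, the residuals sum to zero and have zero sample correlation with the fitted values and with each included x-variable, so the printed correlations are zero up to rounding. The plots are therefore examined for patterns such as curvature or a funnel shape rather than for a linear trend.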
Interaction Terms
• An interaction term represents a coupling or combined effect of two or more independent variables.
• Suppose we have a response variable ($Y$) and two predictors ($X_1$ and $X_2$). Then, the regression model with an interaction term is written as
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 * X_2 + \epsilon.$$
Suppose you also have a third predictor ($X_3$). Then, the regression model with all interaction terms is written as
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_1 * X_2 + \beta_5 X_1 * X_3 + \beta_6 X_2 * X_3 + \beta_7 X_1 * X_2 * X_3 + \epsilon.$$
In a model with more predictors, you can imagine how much the model grows by adding interactions. Just make sure that you have enough observations to cover the degrees of freedom used in estimating the corresponding regression coefficients!
• For each observation, the value of the interaction is found by multiplying the recorded values of the predictor variables in the interaction.
• In models with interaction terms, the significance of the interaction term should always be assessed first before proceeding with significance testing of the main variables.
• If one of the main variables is removed from the model, then the model should not include any interaction terms involving that variable.
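The construction of interaction columns described in this list can be sketched as follows. The function name `add_interactions` and the small data matrix are hypothetical; the column order is chosen to match the three-predictor model above ($X_1 X_2$, $X_1 X_3$, $X_2 X_3$, then $X_1 X_2 X_3$).

```python
import numpy as np
from itertools import combinations

def add_interactions(X_main):
    """Append every two-way and higher-order product of the columns of X_main."""
    n, k = X_main.shape
    cols = [X_main[:, j] for j in range(k)]
    for order in range(2, k + 1):
        for idx in combinations(range(k), order):
            # The interaction value is the product of the recorded predictor values.
            cols.append(np.prod(X_main[:, list(idx)], axis=1))
    return np.column_stack(cols)

# Hypothetical recorded values of X1, X2, X3 for three observations.
X_main = np.array([[1.0, 2.0, 3.0],
                   [2.0, 0.0, 1.0],
                   [4.0, 1.0, 2.0]])

X_full = add_interactions(X_main)
print(X_full.shape)   # 3 main effects + 3 two-way + 1 three-way = 7 columns
print(X_full[0])
```

With the intercept, the full three-predictor model has 8 coefficients, so this tiny three-row matrix could not be used to estimate it. The sketch only shows how the columns are built.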
7.2 Significance Testing of Each Variable
Within a multiple regression model, we may want to know whether a particular x-variable is making a useful contribution to the model. That is, given the presence of the other x-variables in the model, does a particular x-variable help us predict or explain the y-variable? For instance, suppose that we have three x-variables in the model. The general structure of the model could be
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \epsilon. \tag{7.2}$$
As an example, to determine whether variable $X_1$ is a useful predictor variable in this model, we could test
$$H_0: \beta_1 = 0 \qquad H_A: \beta_1 \neq 0.$$
If the null hypothesis above were the case, then a change in the value of $X_1$ would not change the mean of $Y$ when $X_2$ and $X_3$ are held fixed, so $Y$ and $X_1$ are not related once $X_2$ and $X_3$ are accounted for. Also, we would still be left with variables $X_2$ and $X_3$ being present in the model. When we cannot reject the null hypothesis above, we should say that we do not need variable $X_1$ in the model given that variables $X_2$ and $X_3$ will remain in the model. In general, the interpretation of a slope in multiple regression can be tricky. Correlations among the predictors can change the slope values dramatically from what they would be in separate simple regressions.
To carry out the test, statistical software will report p-values for all coefficients in the model. Each p-value will be based on a t-statistic calculated as
$$t^* = \frac{\text{sample coefficient} - \text{hypothesized value}}{\text{standard error of coefficient}}.$$
For our example above, the t-statistic is:
$$t^* = \frac{b_1 - 0}{\text{s.e.}(b_1)} = \frac{b_1}{\text{s.e.}(b_1)}.$$
Note that the hypothesized value is usually just 0, so this portion of the formula is often omitted.
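The calculation is simple enough to sketch directly. The function name `t_statistic` and the numbers passed to it are hypothetical; the p-value would come from a t distribution with $n - p$ degrees of freedom, which is left to statistical software here.

```python
def t_statistic(estimate, std_error, hypothesized=0.0):
    """t* = (sample coefficient - hypothesized value) / standard error."""
    return (estimate - hypothesized) / std_error

# Hypothetical coefficient and standard error (illustration only).
b1, se_b1 = 2.5, 1.25

t_default = t_statistic(b1, se_b1)                    # tests H0: beta1 = 0
t_other = t_statistic(b1, se_b1, hypothesized=1.0)    # tests H0: beta1 = 1

print(t_default, t_other)
```

Leaving `hypothesized` at its default reproduces the shortened form $b_1 / \text{s.e.}(b_1)$.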
7.3 Examples
Example 1: Heat Flux Data Set
The data are from n = 29 homes used to test solar thermal energy. The variables of interest for our model are y = total heat flux, and $x_1$, $x_2$, and $x_3$, which are the focal points for the east, north, and south directions, respectively. There are two other measurements in this data set: insolation and the time of day. We will not utilize these predictors at this time. Table 7.1 gives the data used for this analysis.
The regression model of interest is
$$y_i = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \beta_3 x_{i,3} + \epsilon_i.$$
Figure 7.1(a) gives a histogram of the residuals. While the shape is not completely bell-shaped, it again is not suggestive of any severe departures from normality. Figure 7.1(b) gives a plot of the residuals versus the fitted values. Again, the values appear to be randomly scattered about 0, suggesting constant variance.
The following provides the t-tests for the individual regression coefficients:
##########
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  389.1659    66.0937   5.888 3.83e-06 ***
east           2.1247     1.2145   1.750   0.0925 .
north        -24.1324     1.8685 -12.915 1.46e-12 ***
south          5.3185     0.9629   5.523 9.69e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.598 on 25 degrees of freedom
Multiple R-Squared: 0.8741, Adjusted R-squared: 0.859
F-statistic: 57.87 on 3 and 25 DF, p-value: 2.167e-11
##########
Figure 7.1: (a) Histogram of the residuals for the heat flux data set. (b) Plot of the residuals versus the fitted values.
At the α = 0.05 significance level, both north and south appear to be statistically significant predictors of heat flux. However, east is not (with a p-value of 0.0925). While we could claim this is a marginally significant predictor, we will rerun the analysis after dropping the east predictor.
The following provides the t-tests for the individual regression coefficients for the newly suggested model:
##########
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  483.6703    39.5671  12.224 2.78e-12 ***
north        -24.2150     1.9405 -12.479 1.75e-12 ***
south          4.7963     0.9511   5.043 3.00e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.932 on 26 degrees of freedom
Multiple R-Squared: 0.8587, Adjusted R-squared: 0.8478
F-statistic: 79.01 on 2 and 26 DF, p-value: 8.938e-12
##########
The residual plots still appear okay (they are not included here) and we obtain new estimates for our model (shown above). Some things to note from this final analysis are:
• The final sample multiple regression equation is
$$\hat{y}_i = 483.67 - 24.22 x_{i,2} + 4.80 x_{i,3}.$$
To use this equation for prediction, we substitute specified values for the two directions (i.e., north and south).
• We can interpret the “slopes” in the same way that we do for a straight-line model, but we have to add the constraint that values of other variables remain constant.
– When the south position is held constant, the average flux temperature for a home decreases by 24.22 degrees for each 1 unit increase in the north position.
– When the north position is held constant, the average flux temperature for a home increases by 4.80 degrees for each 1 unit increase in the south position.
• The value of $R^2 = 0.8587$ means that the model (the two x-variables) explains 85.87% of the observed variation in a home’s flux temperature.
• The value $\sqrt{\text{MSE}} = 8.9321$ is the estimated standard deviation of the residuals. Roughly, it is the average absolute size of a residual.
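As a small worked illustration of using the final equation for prediction, the sketch below plugs in the north and south values of the first home in Table 7.1 (north = 16.66, south = 40.55, observed flux = 271.8). The function name `predict_flux` is hypothetical, and the coefficients are the rounded values from the equation above, so results differ slightly from what unrounded software output would give.

```python
# Rounded coefficients from the final fitted equation.
B0, B_NORTH, B_SOUTH = 483.67, -24.22, 4.80

def predict_flux(north, south):
    """Predicted heat flux for given north and south focal point values."""
    return B0 + B_NORTH * north + B_SOUTH * south

# First home in Table 7.1.
y_hat = predict_flux(north=16.66, south=40.55)
residual = 271.8 - y_hat            # observed minus predicted

# Holding south fixed, a one-unit increase in north changes the prediction by b_north.
slope_check = predict_flux(17.66, 40.55) - predict_flux(16.66, 40.55)

print(round(y_hat, 4), round(residual, 4), round(slope_check, 2))
```

The predicted value is about 274.80, giving a residual of about −3.00, which is well within one estimated standard deviation ($\sqrt{\text{MSE}} \approx 8.93$) of zero.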
Example 2: Kola Project Data Set
The Kola Project ran from 1993-1998 and involved extensive geological surveys of Finland, Norway, and Russia. The entire published data set consists of over 600 observations measured on 111 variables. Table 7.2 provides merely a subset of this data for three variables. The data is subsetted on the LITO variable for counts of “1”. The sample size of this subset is n = 131.
The investigators are interested in modeling the geological composition variable Cr_INAA as a function of Cr and Co. A scatterplot of this data with the least squares plane is provided in Figure 7.2. In this 3D plot, observations above the plane (i.e., observations with positive residuals) are given by green points and observations below the plane (i.e., observations with negative residuals) are given by red points. The output for fitting a multiple linear regression model to this data is below:
Residuals:
Min 1Q Median 3Q Max
-149.95 -34.42 -14.74 11.58 560.38
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 53.3483 11.6908 4.563 1.17e-05 ***
Cr 1.8577 0.2324 7.994 6.66e-13 ***
Co 2.1808 1.7530 1.244 0.216
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 74.76 on 128 degrees of freedom
Multiple R-squared: 0.544, Adjusted R-squared: 0.5369
F-statistic: 76.36 on 2 and 128 DF, p-value: < 2.2e-16
Note that Co is found to be not statistically significant. However, the scatterplot in Figure 7.2 clearly shows that the data is skewed to the right for each of the variables (i.e., the bulk of the data is clustered near the lower end of values for each variable while there are fewer values as you increase along a given axis). In fact, a plot of the standardized residuals against the fitted values (Figure 7.3) indicates that a transformation is needed.
Since the data appears skewed to the right for each of the variables, a log transformation on Cr_INAA, Cr, and Co will be taken. The scatterplot in Figure 7.4 shows the results from these transformations along with the new least squares plane. Clearly, the transformation has done a better job linearizing the relationship. The output for fitting a multiple linear regression model to this transformed data is below:
Residuals:
Min 1Q Median 3Q Max
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.65109    0.17630  15.037  < 2e-16 ***
ln_Cr        0.57873    0.08415   6.877 2.42e-10 ***
ln_Co        0.08587    0.09639   0.891    0.375
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3784 on 128 degrees of freedom
Multiple R-squared: 0.5732, Adjusted R-squared: 0.5665
F-statistic: 85.94 on 2 and 128 DF, p-value: < 2.2e-16
Figure 7.2: 3D scatterplot of the Kola data set with the least squares plane.
Figure 7.3: The standardized residuals versus the fitted values for the raw Kola data set.
Figure 7.4: 3D scatterplot of the Kola data set where the logarithm of each variable has been taken.
Figure 7.5: The standardized residuals versus the fitted values for the log-transformed Kola data set.
There is also a noted improvement in the plot of the standardized residuals versus the fitted values (Figure 7.5). Notice that the log transformation of Co is not statistically significant as it has a high p-value (0.375).
After omitting the log transformation of Co from our analysis, a simple linear regression model is fit to the data. Figure 7.6 provides a scatterplot of the data and a plot of the standardized residuals against the fitted values. These plots, combined with the following simple linear regression output, indicate a highly statistically significant relationship between the log transformation of Cr_INAA and the log transformation of Cr.
Residuals:
     Min       1Q   Median       3Q      Max
-0.85999 -0.24113 -0.05484  0.17339  1.38702
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.60459    0.16826   15.48   <2e-16 ***
ln_Cr        0.63974    0.04887   13.09   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3781 on 129 degrees of freedom
Multiple R-squared: 0.5705, Adjusted R-squared: 0.5672
F-statistic: 171.4 on 1 and 129 DF, p-value: < 2.2e-16
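To show how the fitted log-log equation might be used, here is a brief sketch. It assumes the logarithms are natural logs (the variable name ln_Cr suggests this), and the function names are hypothetical. The back-transformation by exponentiating is the naive one: under the usual normal-error assumption on the log scale it estimates a typical (median) value of Cr_INAA rather than its mean.

```python
import math

# Estimates from the simple linear regression output above.
B0, B1 = 2.60459, 0.63974

def predict_ln_cr_inaa(cr):
    """Predicted ln(Cr_INAA) for a given Cr value (natural log assumed)."""
    return B0 + B1 * math.log(cr)

def predict_cr_inaa(cr):
    """Naive back-transformation of the prediction to the original scale."""
    return math.exp(predict_ln_cr_inaa(cr))

# On the original scale the fitted relationship is multiplicative:
# doubling Cr multiplies the predicted Cr_INAA by 2**B1.
ratio = predict_cr_inaa(80.0) / predict_cr_inaa(40.0)
print(round(ratio, 4), round(2 ** B1, 4))
```

The two printed numbers agree, which is one way to read the slope on the log-log scale: a doubling of Cr corresponds to roughly a 56% increase in the predicted Cr_INAA.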
Figure 7.6: (a) Scatterplot of the Kola data set where the logarithm of Cr_INAA has been regressed on the logarithm of Cr. (b) Plot of the standardized residuals for this simple linear regression fit.
 i   Flux  Insolation   East  South  North   Time
 1  271.8      783.35  33.53  40.55  16.66  13.20
 2  264.0      748.45  36.50  36.19  16.46  14.11
 3  238.8      684.45  34.66  37.31  17.66  15.68
 4  230.7      827.80  33.13  32.52  17.50  10.53
 5  251.6      860.45  35.75  33.71  16.40  11.00
 6  257.9      875.15  34.46  34.14  16.28  11.31
 7  263.9      909.45  34.60  34.85  16.06  11.96
 8  266.5      905.55  35.38  35.89  15.93  12.58
 9  229.1      756.00  35.85  33.53  16.60  10.66
10  239.3      769.35  35.68  33.79  16.41  10.85
11  258.0      793.50  35.35  34.72  16.17  11.41
12  257.6      801.65  35.04  35.22  15.92  11.91
13  267.3      819.65  34.07  36.50  16.04  12.85
14  267.0      808.55  32.20  37.60  16.19  13.58
15  259.6      774.95  34.32  37.89  16.62  14.21
16  240.4      711.85  31.08  37.71  17.37  15.56
17  227.2      694.85  35.73  37.00  18.12  15.83
18  196.0      638.10  34.11  36.76  18.53  16.41
19  278.7      774.55  34.79  34.62  15.54  13.10
20  272.3      757.90  35.77  35.40  15.70  13.63
21  267.4      753.35  36.44  35.96  16.45  14.51
22  254.5      704.70  37.82  36.26  17.62  15.38
23  224.7      666.80  35.07  36.34  18.12  16.10
24  181.5      568.55  35.26  35.90  19.05  16.73
25  227.5      653.10  35.56  31.84  16.51  10.58
26  253.6      704.05  35.73  33.16  16.02  11.28
27  263.0      709.60  36.46  33.83  15.89  11.91
28  265.8      726.90  36.26  34.89  15.83  12.65
29  263.8      697.15  37.20  36.27  16.71  14.06
Table 7.1: The heat flux for homes data.
   X1    X2    Y |    X1    X2    Y |    X1    X2    Y |    X1    X2    Y
 40.9   6.2  300 |  71.4  11.8  200 |  52.1   7.7  140 |  21.3   4.2  110
 60.7  10.5  270 |  66.3     9  230 |    18   6.1   75 |  73.5   8.5  210
 29.6  10.1  140 |  99.1  16.1  220 |  23.7   9.3   54 |  80.1  18.8  170
 40.9   8.7  120 |  18.6   3.6   93 |  37.7   6.1  110 |  75.4  16.6  790
 27.8   5.2  240 |  30.5   8.6  140 |  16.1   2.7   68 |  32.3   8.7  100
 23.5   3.8  110 |  28.9   5.2  130 |    40   5.5  100 |  19.4   3.9   62
 16.6  15.2   64 |    23   5.3  120 |  38.4  14.4  100 |    20   5.5  300
 29.2   5.1   92 |  44.2   9.1  120 |  23.7   4.9   90 |  48.3   8.2  110
  6.9     2   37 |  27.7     8  100 |  16.4   5.4   82 |    40   6.7  120
 57.1   7.8  250 |  18.1   4.7  100 |  13.4   2.5  100 |  22.5   2.4   95
   50   9.7  190 |    10   3.3   87 |    24   4.1   93 |  31.8  14.7  180
  129  30.7  210 |  16.3   3.5   83 |  28.8  15.2  110 |  17.1   6.2  180
  106  13.7  220 |  31.6   8.7  130 |  18.4   3.8   63 |  10.9     2   50
 36.5   7.3   81 |  19.5   6.8   90 |   9.4   3.3   47 |  30.6   7.3  110
 66.6  10.5  170 |  25.2   5.5  110 |   5.9   1.6   86 |  52.6  11.4  130
 37.2   9.6  120 |  13.6   2.9  170 |  83.8  16.2  160 |  53.9  11.5  210
   42   9.2  120 |  23.2   6.4   88 |   280  25.2  640 |  88.6  15.5  320
 17.5   3.9   64 |  41.6  10.1  150 |  21.9     5   62 |  25.6   6.8   69
 67.1   8.6  120 |    18     4  120 |  18.5   3.5   92 |  18.9   3.1  110
 10.6   2.5   69 |  37.4   8.7   97 |  26.7   5.1  170 |  16.7   4.5   84
 11.3     2   27 |  32.3   7.2   97 |  50.2     7  340 |  19.6   5.6   86
   44  14.1  130 |   217    10  400 |  30.9    10  120 |  15.1   4.2  110
 34.1  12.1  240 |  16.7     5   49 |  25.5   5.5   61 |  25.1     7  150
 29.3   5.2   87 |  29.8   7.8  160 |  21.4   3.9  140 |   8.4     2   34
 49.2    14  180 |  15.5   2.5   70 |  32.3   6.3  220 |  25.4   6.9  140
  118  18.3  330 |    14     3   68 |  31.9   3.7  110 |  18.7   4.4   72
 10.8   3.4   78 |    30   6.9  120 |  28.7   6.7  120 |  21.6   7.6  110
 59.3   9.6  300 |  21.9     9  390 |  36.2   5.3   94 |    24   5.2  110
   20   3.8  110 |    33   5.6   71 |  45.2   3.8   99 |  19.3   4.3  130
 24.4   5.4  120 |  30.8   6.5  110 |  16.3   5.4   59 |   243  24.1  590
 28.6   6.6   76 |  55.5  11.5  130 |    50   7.5  130 |   9.6   2.4   47
 37.9  30.3  130 |  25.9   6.9   96 |  15.8   4.8   79 |  36.4     6  110
 10.2   4.7   46 |  16.5   4.7   63 |  19.3   3.5   54 |
Table 7.2: The subset of the Kola data. Here X1, X2, and Y are the variables Cr, Co, and Cr_INAA, respectively.
Chapter 8
Matrix Notation in Regression
There are two main reasons for using matrices in regression. First, the notation simplifies the writing of the model. Second, and most importantly, matrix formulas provide the means by which statistical software calculates the estimated coefficients and their standard errors, as well as the set of predicted values for the observed sample. If necessary, a review of matrices and some of their basic properties can be found in Appendix B.
8.1 Matrices and Regression
In matrix notation, the theoretical regression model for the population is written as
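$$\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}.$$
The four different items in the equation are:
1. $\mathbf{Y}$ is an n-dimensional column vector that vertically lists the y values:
$$\mathbf{Y} = \begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{pmatrix}.$$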
2. The $\mathbf{X}$ matrix is a matrix in which each row gives the x-variable data for a different observation. The first column equals 1 for all observations (unless doing a regression through the origin), and each column after the first gives the data for a different variable. There is a column for each variable, including any added interactions, transformations, indicators, and so on. The abstract formulation is:
$$\mathbf{X} = \begin{pmatrix} 1 & X_{1,1} & \ldots & X_{1,p-1} \\ 1 & X_{2,1} & \ldots & X_{2,p-1} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & X_{n,1} & \ldots & X_{n,p-1} \end{pmatrix}.$$
In the subscripting, the first value is the observation number and the second number is the variable number. The first column is always a column of 1’s. The $\mathbf{X}$ matrix has n rows and p columns.
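3. $\boldsymbol{\beta}$ is a p-dimensional column vector listing the coefficients:
$$\boldsymbol{\beta} = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_{p-1} \end{pmatrix}.$$
Notice the subscript for the numbering of the β’s. As an example, for simple linear regression, $\boldsymbol{\beta} = (\beta_0 \; \beta_1)^T$. The $\boldsymbol{\beta}$ vector will contain symbols, not numbers, as it gives the population parameters.
4. $\boldsymbol{\epsilon}$ is an n-dimensional column vector listing the errors:
$$\boldsymbol{\epsilon} = \begin{pmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{pmatrix}.$$
Again, we will not have numerical values for the $\boldsymbol{\epsilon}$ vector.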
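As an example, suppose that data for a y-variable and two x-variables are as given in Table 8.1. For the model
$$y_i = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \beta_3 x_{i,1} * x_{i,2} + \epsilon_i,$$
we have
$$\mathbf{Y} = \begin{pmatrix} 6 \\ 5 \\ 10 \\ 12 \\ 14 \\ 18 \end{pmatrix}, \quad \mathbf{X} = \begin{pmatrix} 1 & 1 & 1 & 1 \\ 1 & 1 & 2 & 2 \\ 1 & 3 & 1 & 3 \\ 1 & 5 & 1 & 5 \\ 1 & 3 & 2 & 6 \\ 1 & 5 & 2 & 10 \end{pmatrix}, \quad \boldsymbol{\beta} = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \beta_3 \end{pmatrix}, \quad \boldsymbol{\epsilon} = \begin{pmatrix} \epsilon_1 \\ \epsilon_2 \\ \epsilon_3 \\ \epsilon_4 \\ \epsilon_5 \\ \epsilon_6 \end{pmatrix}.$$
y_i     6   5  10  12  14  18
x_i,1   1   1   3   5   3   5
x_i,2   1   2   1   1   2   2
Table 8.1: A sample data set.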
1. Notice that the first column of the $\mathbf{X}$ matrix equals 1 for all rows (observations), the second column gives the values of $x_{i,1}$, the third column lists the values of $x_{i,2}$, and the fourth column gives the interaction values $x_{i,1} * x_{i,2}$.
2. For the theoretical model, we do not know the values of the beta coefficients or the errors. In those two matrices (column vectors) we can only list the symbols for these items.
3. There is a slight abuse of notation that occurs here which often happens when writing regression models in matrix notation. I stated earlier how capital letters are reserved for random variables and lower case letters are reserved for realizations. In this example, capital letters have been used for the realizations. There should be no misunderstanding as it will usually be clear if we are in the context of discussing random variables or their realizations.
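The construction of the $\mathbf{X}$ matrix in this example can be sketched in a few lines. NumPy is assumed here for convenience; the variable names are ours.

```python
import numpy as np

# Data from Table 8.1.
y = np.array([6, 5, 10, 12, 14, 18], dtype=float)
x1 = np.array([1, 1, 3, 5, 3, 5], dtype=float)
x2 = np.array([1, 2, 1, 1, 2, 2], dtype=float)

# Columns: all ones (intercept), x1, x2, and the interaction x1 * x2.
X = np.column_stack([np.ones_like(x1), x1, x2, x1 * x2])
n, p = X.shape

print(X)
print("n =", n, "p =", p)
```

The printed matrix has n = 6 rows and p = 4 columns and agrees with the $\mathbf{X}$ matrix displayed above.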
Finally, using calculus rules for matrices, it can be derived that the ordinary least squares estimates of the β coefficients are calculated using the matrix formula
$$\mathbf{b} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y},$$
which minimizes the sum of squared errors
$$\|\mathbf{e}\|^2 = \mathbf{e}^T\mathbf{e} = (\mathbf{Y} - \hat{\mathbf{Y}})^T(\mathbf{Y} - \hat{\mathbf{Y}}) = (\mathbf{Y} - \mathbf{X}\mathbf{b})^T(\mathbf{Y} - \mathbf{X}\mathbf{b}),$$
where $\mathbf{b} = (b_0 \; b_1 \; \ldots \; b_{p-1})^T$. As in the simple linear regression case, these regression coefficient estimators are unbiased (i.e., $E(\mathbf{b}) = \boldsymbol{\beta}$). The formula above is used by statistical software to calculate values of the sample coefficients.
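A minimal numerical sketch of this formula, using the Table 8.1 data and the interaction model from the example above, is given below. Solving the normal equations with `np.linalg.solve` is our choice for the illustration; it is algebraically the same as applying $(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$, and the result is cross-checked against NumPy's general least squares routine.

```python
import numpy as np

# Table 8.1 data and the design matrix for the interaction model.
y = np.array([6, 5, 10, 12, 14, 18], dtype=float)
x1 = np.array([1, 1, 3, 5, 3, 5], dtype=float)
x2 = np.array([1, 2, 1, 1, 2, 2], dtype=float)
X = np.column_stack([np.ones_like(x1), x1, x2, x1 * x2])

# Normal equations (X^T X) b = X^T y, solved without forming the inverse explicitly.
b = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against NumPy's least squares routine.
b_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

e = y - X @ b          # residual vector
sse = float(e @ e)     # e^T e, the minimized sum of squared errors

print("b =", np.round(b, 4))
print("SSE =", round(sse, 4))
```

At the minimizing $\mathbf{b}$ the residual vector is orthogonal to every column of $\mathbf{X}$, i.e. $\mathbf{X}^T\mathbf{e} = \mathbf{0}$, which is just the normal equations rearranged.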
An important theorem in regression analysis (and Statistics in general) is the Gauss-Markov Theorem, which we alluded to earlier. Since we have the proper matrix notation in place, we will now formalize this very important result.
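Theorem 1 (Gauss-Markov Theorem) Suppose that we have the linear regression model
$$\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon},$$
where $E(\epsilon_i | \mathbf{X}) = 0$ and $\text{Var}(\epsilon_i | \mathbf{X}) = \sigma^2$ for all $i = 1, \ldots, n$, and the errors are uncorrelated with each other. Then
$$\hat{\boldsymbol{\beta}} = \mathbf{b} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}$$
is an unbiased estimator of $\boldsymbol{\beta}$ and has the smallest variance among all linear unbiased estimators of $\boldsymbol{\beta}$.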
Any linear estimator which is unbiased and has variance no larger than that of any other linear unbiased estimator is called a best linear unbiased estimator or BLUE.
An important note regarding the matrix expressions introduced above is that
$$\hat{\mathbf{Y}} = \mathbf{X}\mathbf{b} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y} = \mathbf{H}\mathbf{Y}$$
and
$$\mathbf{e} = \mathbf{Y} - \hat{\mathbf{Y}} = \mathbf{Y} - \mathbf{H}\mathbf{Y} = (\mathbf{I}_{n \times n} - \mathbf{H})\mathbf{Y},$$
where $\mathbf{H} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T$ is the n × n hat matrix. $\mathbf{H}$ is important for several reasons as it appears often in regression formulas. One important implication of $\mathbf{H}$ is that it is a projection matrix, meaning that it projects the response vector, $\mathbf{Y}$, onto the space spanned by the columns of the $\mathbf{X}$ matrix in order to obtain the vector of fitted values, $\hat{\mathbf{Y}}$. Also, the diagonal of this matrix contains the $h_{j,j}$ values we introduced earlier in the context of Studentized residuals, which is important when discussing leverage.
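The hat matrix can be formed explicitly for a small data set. The sketch below again uses the Table 8.1 data with the interaction model; forming the inverse directly is done only to mirror the formula and is fine at this size.

```python
import numpy as np

# Table 8.1 data and the interaction-model design matrix.
y = np.array([6, 5, 10, 12, 14, 18], dtype=float)
x1 = np.array([1, 1, 3, 5, 3, 5], dtype=float)
x2 = np.array([1, 2, 1, 1, 2, 2], dtype=float)
X = np.column_stack([np.ones_like(x1), x1, x2, x1 * x2])
n, p = X.shape

# Hat matrix H = X (X^T X)^{-1} X^T.
H = X @ np.linalg.inv(X.T @ X) @ X.T

y_hat = H @ y                  # fitted values
e = (np.eye(n) - H) @ y        # residuals
leverages = np.diag(H)         # the h_{j,j} values

print("trace(H) =", round(float(np.trace(H)), 6))
print("leverages =", np.round(leverages, 4))
```

Standard properties of a projection matrix can be checked here: $\mathbf{H}$ is symmetric, $\mathbf{H}\mathbf{H} = \mathbf{H}$, and its trace equals p, the number of columns of $\mathbf{X}$ (4 in this model).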
8.2 Variance-Covariance Matrix of b
Two important characteristics of the sample multiple regression coefficients are their standard errors and their correlations with each other. The variance-covariance matrix of the sample coefficients $\mathbf{b}$ is a symmetric p × p square matrix. Remember that p is the number of beta coefficients in the model (including the intercept).
The rows and the columns of the variance-covariance matrix are in coefficient order (first row is information about $b_0$, second is about $b_1$, and so on).
• The diagonal values (from top left to bottom right) are the variances of the sample coefficients (written as $\text{Var}(b_i)$). The standard error of a coefficient is the square root of its variance.
• An off-diagonal value is the covariance between two coefficient estimates (written as $\text{Cov}(b_i, b_j)$).
• The correlation between two coefficient estimates (written as $\text{Corr}(b_i, b_j)$) is the covariance divided by the product of the standard errors: $\text{Corr}(b_i, b_j) = \text{Cov}(b_i, b_j) / \sqrt{\text{Var}(b_i)\text{Var}(b_j)}$.
In regression, the theoretical variance-covariance matrix of the sample coefficients is
$$V(\mathbf{b}) = \sigma^2 (\mathbf{X}^T\mathbf{X})^{-1}.$$
Recall, the MSE estimates $\sigma^2$, so the estimated variance-covariance matrix of the sample beta coefficients is calculated as
$$\hat{V}(\mathbf{b}) = \text{MSE}\,(\mathbf{X}^T\mathbf{X})^{-1}.$$
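The estimated variance-covariance matrix can be sketched as follows. The function name `coef_covariance` is hypothetical, and the data are those of Table 8.1 fit with an additive model (intercept, $x_1$, $x_2$) chosen for this illustration so that $n - p = 3$ degrees of freedom remain.

```python
import numpy as np

def coef_covariance(X, y):
    """Return (b, V_hat, se), where V_hat = MSE * (X^T X)^{-1}."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    e = y - X @ b
    mse = float(e @ e) / (n - p)      # estimate of sigma^2
    V_hat = mse * XtX_inv
    se = np.sqrt(np.diag(V_hat))      # standard errors: roots of the diagonal
    return b, V_hat, se

# Table 8.1 data, additive model (an assumption made for this sketch).
y = np.array([6, 5, 10, 12, 14, 18], dtype=float)
x1 = np.array([1, 1, 3, 5, 3, 5], dtype=float)
x2 = np.array([1, 2, 1, 1, 2, 2], dtype=float)
X = np.column_stack([np.ones_like(x1), x1, x2])

b, V_hat, se = coef_covariance(X, y)
print(np.round(V_hat, 4))
print("standard errors:", np.round(se, 4))
```

The rows and columns of `V_hat` follow coefficient order, so `se[0]` belongs to $b_0$, `se[1]` to $b_1$, and `se[2]` to $b_2$.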
100×(1−α)% confidence intervals are also readily available for each $\beta_j$:
$$b_j \pm t^*_{n-p;1-\alpha/2} \sqrt{\hat{V}(\mathbf{b})_{j,j}},$$
where $\hat{V}(\mathbf{b})_{j,j}$ is the diagonal element of the estimated variance-covariance matrix of the sample beta coefficients corresponding to $b_j$ (so that its square root is the estimated standard error of $b_j$). Furthermore, the Bonferroni joint 100×(1−α)% confidence intervals are:
$$b_j \pm t^*_{n-p;1-\alpha/(2p)} \sqrt{\hat{V}(\mathbf{b})_{j,j}},$$
for $j = 0, 1, 2, \ldots, (p-1)$.
8.3 Statistical Intervals
The statistical intervals for estimating the mean or predicting new observations in the simple linear regression case can easily extend to the multiple regression case. Here, it is only necessary to present the formulas.
First, let us define the vector of given predictors as
$$\mathbf{X}_h = \begin{pmatrix} 1 \\ X_{h,1} \\ X_{h,2} \\ \vdots \\ X_{h,p-1} \end{pmatrix}.$$
We are interested in either intervals for $E(Y | \mathbf{X} = \mathbf{X}_h)$ or intervals for the value of a new response y given that the observation has the particular value $\mathbf{X}_h$. First we define the standard error of the fit at $\mathbf{X}_h$, given by:
$$\text{s.e.}(\hat{Y}_h) = \sqrt{\text{MSE}\,(\mathbf{X}_h^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}_h)}.$$
Now, we can give the formulas for the various intervals:
• 100×(1−α)% Confidence Interval:
$$\hat{y}_h \pm t^*_{n-p;1-\alpha/2}\,\text{s.e.}(\hat{y}_h).$$
• Bonferroni Joint 100×(1−α)% Confidence Intervals:
$$\hat{y}_{h_i} \pm t^*_{n-p;1-\alpha/(2q)}\,\text{s.e.}(\hat{y}_{h_i}),$$
for $i = 1, 2, \ldots, q$.
• 100×(1−α)% Working-Hotelling Confidence Band:
$$\hat{y}_h \pm \sqrt{p F^*_{p,n-p;1-\alpha}}\,\text{s.e.}(\hat{y}_h).$$
• 100×(1−α)% Prediction Interval:
$$\hat{y}_h \pm t^*_{n-p;1-\alpha/2} \sqrt{\text{MSE}/m + [\text{s.e.}(\hat{y}_h)]^2},$$
where m = 1 corresponds to a prediction interval for a new observation at a given $\mathbf{X}_h$ and m > 1 corresponds to the mean of m new observations calculated at the same $\mathbf{X}_h$.
• Bonferroni Joint 100×(1−α)% Prediction Intervals:
$$\hat{y}_{h_i} \pm t^*_{n-p;1-\alpha/(2q)} \sqrt{\text{MSE} + [\text{s.e.}(\hat{y}_{h_i})]^2},$$
for $i = 1, 2, \ldots, q$.
• Scheffé Joint 100×(1−α)% Prediction Intervals:
$$\hat{y}_{h_i} \pm \sqrt{q F^*_{q,n-p;1-\alpha}\,(\text{MSE} + [\text{s.e.}(\hat{y}_{h_i})]^2)},$$
for $i = 1, 2, \ldots, q$.
• [100×(1−α)%]/[100×P%] Tolerance Intervals:
– One-Sided Intervals:
$$(-\infty,\; \hat{y}_h + K_{\alpha,P}\sqrt{\text{MSE}}) \quad \text{and} \quad (\hat{y}_h - K_{\alpha,P}\sqrt{\text{MSE}},\; \infty)$$
are the upper and lower one-sided tolerance intervals, respectively, where $K_{\alpha,P}$ is found similarly as in the simple linear regression setting, but with $n^* = (\mathbf{X}_h^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}_h)^{-1}$.
– Two-Sided Interval:
$$\hat{y}_h \pm K_{\alpha/2,P/2}\sqrt{\text{MSE}},$$
where $K_{\alpha/2,P/2}$ is found similarly as in the simple linear regression setting, but with $n^*$ as given above and $f = n - p$, where p is the dimension of $\mathbf{X}_h$.
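To illustrate the first and fourth formulas in this list, the sketch below computes the standard error of the fit, a 95% confidence interval for the mean response, and a 95% prediction interval. The data are those of Table 8.1 with an additive model, the point `x_h` is hypothetical, and the critical value 3.182 is the tabled $t$ quantile for 3 degrees of freedom at 0.975, entered by hand rather than computed.

```python
import numpy as np

# Table 8.1 data, additive model (intercept, x1, x2): an assumption for this sketch.
y = np.array([6, 5, 10, 12, 14, 18], dtype=float)
x1 = np.array([1, 1, 3, 5, 3, 5], dtype=float)
x2 = np.array([1, 2, 1, 1, 2, 2], dtype=float)
X = np.column_stack([np.ones_like(x1), x1, x2])
n, p = X.shape

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
mse = float(np.sum((y - X @ b) ** 2)) / (n - p)

x_h = np.array([1.0, 3.0, 2.0])   # hypothetical point: x1 = 3, x2 = 2
y_hat_h = float(x_h @ b)
se_fit = float(np.sqrt(mse * (x_h @ XtX_inv @ x_h)))

t_crit = 3.182                    # t quantile, n - p = 3 df, 1 - alpha/2 = 0.975

def confidence_interval():
    half = t_crit * se_fit
    return (y_hat_h - half, y_hat_h + half)

def prediction_interval(m=1):
    # m = 1: a single new observation; m > 1: the mean of m new observations.
    half = t_crit * np.sqrt(mse / m + se_fit ** 2)
    return (y_hat_h - half, y_hat_h + half)

ci = confidence_interval()
pi = prediction_interval(m=1)
print("CI:", np.round(ci, 3))
print("PI:", np.round(pi, 3))
```

The prediction interval is always wider than the confidence interval at the same $\mathbf{X}_h$ because of the extra MSE/m term, and it shrinks toward the confidence interval as m grows.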
8.4 Example
Example: Heat Flux Data Set (continued)
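Refer back to the heat flux data set where only north and south were used as predictors of heat flux. The MSE for this model is equal to 79.7819. However, if we are interested in the full variance-covariance matrix and correlation matrix, then these must be calculated by hand by first finding $(\mathbf{X}^T\mathbf{X})^{-1}$. Then,
$$\hat{V}(\mathbf{b}) = 79.7819 \begin{pmatrix} 19.6229 & -0.5521 & -0.2918 \\ -0.5521 & 0.0472 & -0.0066 \\ -0.2918 & -0.0066 & 0.0113 \end{pmatrix} = \begin{pmatrix} 1565.5532 & -44.0479 & -23.2797 \\ -44.0479 & 3.7657 & -0.5305 \\ -23.2797 & -0.5305 & 0.9046 \end{pmatrix}.$$
Taking the square roots of the diagonal terms of this matrix gives you the values of s.e.($b_0$), s.e.($b_1$), and s.e.($b_2$).
We can also calculate the correlation matrix of b (denoted by $r_b$) for this data set:
$$r_b = \begin{pmatrix} \frac{\text{Var}(b_0)}{\sqrt{\text{Var}(b_0)\text{Var}(b_0)}} & \frac{\text{Cov}(b_0,b_1)}{\sqrt{\text{Var}(b_0)\text{Var}(b_1)}} & \frac{\text{Cov}(b_0,b_2)}{\sqrt{\text{Var}(b_0)\text{Var}(b_2)}} \\ \frac{\text{Cov}(b_1,b_0)}{\sqrt{\text{Var}(b_1)\text{Var}(b_0)}} & \frac{\text{Var}(b_1)}{\sqrt{\text{Var}(b_1)\text{Var}(b_1)}} & \frac{\text{Cov}(b_1,b_2)}{\sqrt{\text{Var}(b_1)\text{Var}(b_2)}} \\ \frac{\text{Cov}(b_2,b_0)}{\sqrt{\text{Var}(b_2)\text{Var}(b_0)}} & \frac{\text{Cov}(b_2,b_1)}{\sqrt{\text{Var}(b_2)\text{Var}(b_1)}} & \frac{\text{Var}(b_2)}{\sqrt{\text{Var}(b_2)\text{Var}(b_2)}} \end{pmatrix}$$
$$= \begin{pmatrix} \frac{1565.5532}{\sqrt{(1565.5532)(1565.5532)}} & \frac{-44.0479}{\sqrt{(1565.5532)(3.7657)}} & \frac{-23.2797}{\sqrt{(1565.5532)(0.9046)}} \\ \frac{-44.0479}{\sqrt{(3.7657)(1565.5532)}} & \frac{3.7657}{\sqrt{(3.7657)(3.7657)}} & \frac{-0.5305}{\sqrt{(3.7657)(0.9046)}} \\ \frac{-23.2797}{\sqrt{(0.9046)(1565.5532)}} & \frac{-0.5305}{\sqrt{(0.9046)(3.7657)}} & \frac{0.9046}{\sqrt{(0.9046)(0.9046)}} \end{pmatrix}$$
$$= \begin{pmatrix} 1 & -0.5737 & -0.6186 \\ -0.5737 & 1 & -0.2874 \\ -0.6186 & -0.2874 & 1 \end{pmatrix}.$$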
$r_b$ is an estimate of the population correlation matrix $\rho_b$. For example, $\text{Corr}(b_1, b_2) = -0.2874$, which implies there is a fairly low, negative correlation between the average change in flux for each unit increase in the south position and each unit increase in the north position. Therefore, the presence of the north position only slightly affects the estimate of the south’s beta coefficient. The consequence is that it is fairly easy to separate the individual effects of these two variables. Note that we usually do not care about correlations concerning the intercept, $b_0$, since we usually wish to provide an interpretation concerning the x-variables.
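The conversion from the estimated variance-covariance matrix to the correlation matrix of the coefficients can be checked numerically. The sketch below starts from the rounded entries of $\hat{V}(\mathbf{b})$ given in this example, so the results agree with the text only up to rounding.

```python
import numpy as np

# Estimated variance-covariance matrix of (b0, b1, b2) for the heat flux model,
# entered with the rounded values given in this example.
V_hat = np.array([[1565.5532, -44.0479, -23.2797],
                  [ -44.0479,   3.7657,  -0.5305],
                  [ -23.2797,  -0.5305,   0.9046]])

# Standard errors are the square roots of the diagonal entries.
se = np.sqrt(np.diag(V_hat))

# Corr(b_i, b_j) = Cov(b_i, b_j) / (s.e.(b_i) * s.e.(b_j)).
r_b = V_hat / np.outer(se, se)

print("standard errors:", np.round(se, 4))
print(np.round(r_b, 4))
```

The standard errors reproduce the values 39.5671, 1.9405, and 0.9511 from the regression output for this model, and the off-diagonal correlations are all negative.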
If all x-variables are uncorrelated with each other, then all covariances between pairs of sample coefficients that multiply x-variables will equal 0. This means that the estimate of one beta is not affected by the presence of the other x-variables. Many experiments are designed to achieve this property, but achieving it with real data is often a different story.
The correlation matrix presented above should NOT be confused with the correlation matrix, r, constructed for each pairwise combination of the variables $Y, X_1, X_2, \ldots, X_{p-1}$; namely:
$$r = \begin{pmatrix} 1 & \text{Corr}(Y, X_1) & \ldots & \text{Corr}(Y, X_{p-1}) \\ \text{Corr}(X_1, Y) & 1 & \ldots & \text{Corr}(X_1, X_{p-1}) \\ \vdots & \vdots & \ddots & \vdots \\ \text{Corr}(X_{p-1}, Y) & \text{Corr}(X_{p-1}, X_1) & \ldots & 1 \end{pmatrix}.$$
Note that all of the diagonal entries are 1 because the correlation between a variable and itself is a perfect (positive) association. This correlation matrix is what most statistical software reports and it does not always report $r_b$. The interpretation of each entry in r is identical to the Pearson correlation coefficient interpretation presented earlier. Specifically, it provides the strength and direction of the association between the variables corresponding to the row and column of the respective entry. For this example, the correlation matrix is:
$$r = \begin{pmatrix} 1 & -0.8488 & -0.1121 \\ -0.8488 & 1 & 0.2874 \\ -0.1121 & 0.2874 & 1 \end{pmatrix}.$$
We can also calculate the 95% confidence intervals for the regression coefficients. First note that $t_{26,0.975} = 2.0555$. The 95% confidence interval for $\beta_1$ is calculated using $-24.2150 \pm 2.0555\sqrt{3.7657}$ and for $\beta_2$ it is calculated using $4.7963 \pm 2.0555\sqrt{0.9046}$. Thus, we are 95% confident that the true population regression coefficients for the north and south focal points lie in the intervals (−28.2039, −20.2262) and (2.8413, 6.7513), respectively.
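The arithmetic behind these two intervals can be reproduced as follows. The function name `coef_ci` is hypothetical, and the critical value 2.0555 is taken from the text rather than computed. Because the inputs are rounded, the endpoints may differ from the reported ones in the fourth decimal place.

```python
import math

T_CRIT = 2.0555   # t quantile with 26 df at 0.975, as given in the text

def coef_ci(estimate, variance, t_crit=T_CRIT):
    """Confidence interval: estimate +/- t_crit * sqrt(estimated variance)."""
    half_width = t_crit * math.sqrt(variance)
    return (estimate - half_width, estimate + half_width)

ci_north = coef_ci(-24.2150, 3.7657)   # beta_1 (north)
ci_south = coef_ci(4.7963, 0.9046)     # beta_2 (south)

print([round(v, 4) for v in ci_north])
print([round(v, 4) for v in ci_south])
```

Neither interval contains 0, which is consistent with the small p-values reported for north and south in the earlier output.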
Chapter 9
Indicator Variables
We next discuss how to include categorical predictor variables in a regression model. A categorical variable is a variable for which the possible outcomes are nameable characteristics, groups or treatments. Some examples are gender (male or female), highest educational degree attained (secondary school, college undergraduate, college graduate), blood pressure medication used (drug 1, drug 2, drug 3), etc.
We use indicator variables to incorporate a categorical x-variable into a regression model. An indicator variable equals 1 when an observation is in a particular group and equals 0 when an observation is not in that group. An interaction between an indicator variable and a quantitative variable exists if the slope between the response and the quantitative variable depends upon the specific value present for the indicator variable.
9.1
The “Leave One Out” Method
When a categorical predictor variable has k categories, it is possible to define k indicator variables. However, as explained later, we should only use k−1 of them as predictor variables in the regression model.
Let us consider an example where we are analyzing data for a clinical trial done to compare the effectiveness of three different medications used to treat high blood pressure. A total of n = 90 participants are randomly divided into three groups of 30 patients and each group is assigned a different medication. The response variable is the reduction in diastolic blood pressure in a 3 month period. In addition to the treatment variables, two other predictor variables will be \(X_1\) = age and \(X_2\) = body mass index.
We are examining three different treatments so we can define the following three indicator variables for the treatment:
\[
\begin{aligned}
X_3 &= 1 \text{ if patient used treatment 1, } 0 \text{ otherwise} \\
X_4 &= 1 \text{ if patient used treatment 2, } 0 \text{ otherwise} \\
X_5 &= 1 \text{ if patient used treatment 3, } 0 \text{ otherwise.}
\end{aligned}
\]
On the surface, it seems that our model should be the following “over-parameterized” model, a model that requires us to make a modification in order to estimate coefficients:
\[
y_i = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \beta_3 x_{i,3} + \beta_4 x_{i,4} + \beta_5 x_{i,5} + \epsilon_i. \tag{9.1}
\]
The difficulty with this model is that the \(\mathbf{X}\) matrix has a linear dependency, so we can’t estimate the individual coefficients (technically, this is because there will be an infinite number of solutions for the betas). The dependency stems from the fact that \(X_{i,3} + X_{i,4} + X_{i,5} = 1\) for all observations because each patient uses one (and only one) of the treatments. In the \(\mathbf{X}\) matrix, the linear dependency is that the sum of the last three columns will equal the first column (all 1’s). This scenario leads to what is called collinearity and we investigate this in the next chapter.
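The linear dependency can be seen numerically. The following is a minimal sketch in Python with NumPy, assuming a tiny made-up design of six hypothetical patients (two per treatment); the age and BMI values are invented purely for illustration.

```python
import numpy as np

# Six hypothetical patients, two per treatment; age and BMI values are made up.
age = np.array([50.0, 61.0, 47.0, 58.0, 64.0, 53.0])
bmi = np.array([24.1, 30.5, 27.3, 22.8, 29.0, 26.4])
treatment = np.array([1, 1, 2, 2, 3, 3])

# One indicator column per treatment (X3, X4, X5 in the text).
indicators = np.column_stack([(treatment == k).astype(float) for k in (1, 2, 3)])
X_over = np.column_stack([np.ones(6), age, bmi, indicators])

# The three indicator columns sum to the intercept column of 1's.
columns_sum_to_one = np.allclose(indicators.sum(axis=1), X_over[:, 0])

# The over-parameterized matrix has 6 columns but is not of full column rank.
rank_over = np.linalg.matrix_rank(X_over)

# Dropping one indicator (here X5) restores full column rank.
X_loo = X_over[:, :5]
rank_loo = np.linalg.matrix_rank(X_loo)

print(columns_sum_to_one, X_over.shape[1], rank_over, X_loo.shape[1], rank_loo)
```

Because the rank of the six-column matrix is less than six, the matrix formed from it for least squares has no inverse, which is the estimation problem described above.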
One solution (there are others) for avoiding this difficulty is the “leave one out” method. The “leave one out” method has the general rule that whenever a categorical predictor variable has k categories, it is possible to define k indicator variables, but we should only use k−1 of them to describe the differences among the k categories. For the overall fit of the model, it does not matter which set of k−1 indicators we use. The choice of which k−1 indicator variables we use, however, does affect the interpretation of the coefficients that multiply the specific indicators in the model.
In our example with three treatments (and three possible indicator variables), we might leave out the third indicator giving us this model:
\[
y_i = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \beta_3 x_{i,3} + \beta_4 x_{i,4} + \epsilon_i. \tag{9.2}
\]
For the overall fit of the model, it would work equally well to leave out the first indicator and include the other two or to leave out the second and include the first and third.
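The claim that the overall fit does not depend on which indicator is left out can be checked with a small sketch. The nine patients and all of their values below are made up for illustration, and the function name `fit` is an assumption of this example.

```python
import numpy as np

# Made-up values for nine hypothetical patients (three per treatment).
age = np.array([50.0, 61.0, 47.0, 58.0, 64.0, 53.0, 45.0, 59.0, 62.0])
bmi = np.array([24.1, 30.5, 27.3, 22.8, 29.0, 26.4, 31.2, 25.5, 28.0])
treatment = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3])
y = np.array([12.0, 9.5, 11.0, 7.5, 6.0, 8.0, 3.5, 4.0, 2.5])  # made-up responses

def fit(left_out):
    """Least squares fit using the two indicators other than `left_out`."""
    kept = [k for k in (1, 2, 3) if k != left_out]
    columns = [np.ones(len(y)), age, bmi]
    columns += [(treatment == k).astype(float) for k in kept]
    X = np.column_stack(columns)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef, X @ coef  # coefficients and fitted values

fits = {k: fit(k) for k in (1, 2, 3)}

# The fitted values agree no matter which indicator is left out ...
same_fit = all(np.allclose(fits[1][1], fits[k][1]) for k in (2, 3))
print(same_fit)

# ... but the coefficients themselves differ, because their meaning differs.
for k in (1, 2, 3):
    print("left out", k, np.round(fits[k][0], 3))
```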
9.2
Coefficient Interpretations
The interpretation of the coefficients that multiply indicator variables is tricky. The interpretation for the individual betas with the “leave one out” method is that a coefficient multiplying an indicator in the model measures the difference between the group defined by the indicator in the model and the group defined by the indicator that was left out. Usually, a control or placebo group is the one that is “left out”.
Let us consider our example again. We are predicting decreases in blood pressure in response to \(X_1\) = age, \(X_2\) = body mass, and which of three different treatments a person used. The variables \(X_3\) and \(X_4\) are indicators of the treatment, as defined above. The model we will examine is
\[
y_i = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \beta_3 x_{i,3} + \beta_4 x_{i,4} + \epsilon_i.
\]
To see what is going on, look at each treatment separately by substituting the appropriately defined values of the two indicators into the equation.
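• For treatment 1, by definition \(X_3 = 1\) and \(X_4 = 0\), leading to
\[
\begin{aligned}
y_i &= \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \beta_3(1) + \beta_4(0) + \epsilon_i \\
    &= \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \beta_3 + \epsilon_i.
\end{aligned}
\]
• For treatment 2, by definition \(X_3 = 0\) and \(X_4 = 1\), leading to
\[
\begin{aligned}
y_i &= \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \beta_3(0) + \beta_4(1) + \epsilon_i \\
    &= \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \beta_4 + \epsilon_i.
\end{aligned}
\]
• For treatment 3, by definition \(X_3 = 0\) and \(X_4 = 0\), leading to
\[
\begin{aligned}
y_i &= \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \beta_3(0) + \beta_4(0) + \epsilon_i \\
    &= \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \epsilon_i.
\end{aligned}
\]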
Now compare the three equations to each other. The only difference between the equations for treatments 1 and 3 is the coefficient β3. The only difference between the equations for treatments 2 and 3 is the coefficient β4. This leads to the following meanings for the coefficients:
• β3 = difference in mean response for treatment 1 versus treatment 3, assuming the same age and body mass.
• β4 = difference in mean response for treatment 2 versus treatment 3, assuming the same age and body mass.
Here the coefficients are measuring differences from the third treatment. With the “leave one out” method, a coefficient multiplying an indicator in the model measures the difference between the group defined by the indicator in the model and the group defined by the indicator that was left out.
IMPORTANT CAUTIONS: Notice that the coefficient that multiplies an indicator variable in the model does not retain the meaning implied by the definition of the indicator. It is common for students to wrongly state that a coefficient measures the difference between that group and the other groups. That is WRONG! It is also incorrect to say only that a coefficient multiplying an indicator “measures the effect of being in that group”. An effect has to involve a comparison - with the “leave one out” method it is a comparison to the group associated with the indicator left out.
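The comparison-to-the-left-out-group interpretation can be illustrated with a noise-free sketch. Everything below is assumed for illustration: the nine hypothetical patients, the "true" coefficients, and the group shifts of 8, 5, and 1 for treatments 1, 2, and 3. Because there is no error term, least squares recovers the differences between groups essentially exactly.

```python
import numpy as np

# Made-up covariates for nine hypothetical patients (three per treatment).
age = np.array([50.0, 61.0, 47.0, 58.0, 64.0, 53.0, 45.0, 59.0, 62.0])
bmi = np.array([24.1, 30.5, 27.3, 22.8, 29.0, 26.4, 31.2, 25.5, 28.0])
treatment = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3])

# Assumed group shifts; the response is built without noise on purpose.
shift = {1: 8.0, 2: 5.0, 3: 1.0}
y = 2.0 + 0.1 * age - 0.3 * bmi + np.array([shift[int(t)] for t in treatment])

def indicator_coefs(left_out):
    """Fit with one indicator left out; return {group: indicator coefficient}."""
    kept = [k for k in (1, 2, 3) if k != left_out]
    columns = [np.ones(len(y)), age, bmi]
    columns += [(treatment == k).astype(float) for k in kept]
    coef, *_ = np.linalg.lstsq(np.column_stack(columns), y, rcond=None)
    return dict(zip(kept, coef[3:]))

vs_3 = indicator_coefs(left_out=3)  # differences from treatment 3
vs_1 = indicator_coefs(left_out=1)  # differences from treatment 1

print({k: round(float(v), 6) for k, v in vs_3.items()})
print({k: round(float(v), 6) for k, v in vs_1.items()})
```

With treatment 3 left out, the indicator coefficients are approximately 8 − 1 = 7 and 5 − 1 = 4; with treatment 1 left out, they are approximately 5 − 8 = −3 and 1 − 8 = −7. In neither case does a coefficient compare a group with "all other groups".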
One application where many indicator variables (or binary predictors) are used is conjoint analysis, a marketing tool that attempts to capture a respondent’s preference given the presence or absence of various attribute levels. The X matrix is called a “dummy” matrix as it consists of only 1’s and 0’s. The response is then regressed on the indicators using ordinary least squares, and researchers use the results to identify different market segments, predict profitability, or predict the impact of a new competitor.
One additional note is that, in theory, with a linear dependence there are an infinite number of suitable solutions for the betas (as will be seen with multicollinearity). With the “leave one out” method, we are picking one with a particular meaning and then the resulting coefficients measure differences from the specified group. A method, often used in courses focused strictly on ANOVA or Design of Experiments, offers a different meaning for what we estimate. There it will be more common to parameterize in a way so that a coefficient measures how a group differs from an overall average.
9.3
Testing Overall Group Differences
To test the overall significance of a categorical predictor variable, we use a general linear F-test procedure (which is developed in detail later). We form the reduced model by dropping the indicator variables from the model. More technically, the null hypothesis is that the coefficients multiplying the indicators all equal 0.
For our example with three treatments of high blood pressure and additional x-variables age and body mass, the details for doing an overall test of treatment differences are:
• Full model is: \(y_i = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \beta_3 x_{i,3} + \beta_4 x_{i,4} + \epsilon_i\).
• Null hypothesis is: \(H_0 : \beta_3 = \beta_4 = 0\).
• Reduced model is: \(y_i = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \epsilon_i\).
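The full and reduced models above can be compared numerically. The sketch below uses simulated data (not real trial data) with the same layout as the example, and computes the usual general linear F statistic, F = [(SSE_reduced − SSE_full)/(df_reduced − df_full)] / [SSE_full/df_full]. The simulated coefficients, the seed, and the helper name `sse` are assumptions of this illustration; the p-value step is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)          # fixed seed so the sketch is repeatable
n = 90
treatment = np.repeat([1, 2, 3], 30)    # three groups of 30, as in the text
age = rng.uniform(45, 65, n)            # simulated, not real trial data
bmi = rng.uniform(20, 35, n)
t1 = (treatment == 1).astype(float)     # indicator X3
t2 = (treatment == 2).astype(float)     # indicator X4
y = 5 + 0.05 * age + 0.1 * bmi + 3.0 * t1 + 1.5 * t2 + rng.normal(0, 2, n)

def sse(X, y):
    """Sum of squared errors from the least squares fit of y on X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return float(resid @ resid)

X_reduced = np.column_stack([np.ones(n), age, bmi])   # indicators dropped
X_full = np.column_stack([X_reduced, t1, t2])         # indicators included

sse_reduced, sse_full = sse(X_reduced, y), sse(X_full, y)
df_reduced = n - X_reduced.shape[1]   # error df for the reduced model
df_full = n - X_full.shape[1]         # error df for the full model

f_stat = ((sse_reduced - sse_full) / (df_reduced - df_full)) / (sse_full / df_full)
print(df_reduced - df_full, df_full, round(f_stat, 2))
```

The statistic would be compared with an F distribution on 2 and 85 degrees of freedom, since two coefficients are set to 0 under the null hypothesis and the full model has five parameters.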
9.4
Interactions
To examine a possible interaction between a categorical predictor and a quantitative predictor, include product variables between each indicator and the quantitative variable.
As an example, suppose we thought there could be an interaction between the body mass variable (\(X_2\)) and the treatment variable. This would mean that we thought that treatment differences in blood pressure reduction depend on the specific value of body mass. The model we would use is:
\[
y_i = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \beta_3 x_{i,3} + \beta_4 x_{i,4} + \beta_5 x_{i,2} x_{i,3} + \beta_6 x_{i,2} x_{i,4} + \epsilon_i.
\]
To test whether there is an interaction, the null hypothesis is \(H_0 : \beta_5 = \beta_6 = 0\). We would use the general linear F test procedure to carry out the test. The full model is the interaction model given above. The reduced model is now:
\[
y_i = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \beta_3 x_{i,3} + \beta_4 x_{i,4} + \epsilon_i.
\]
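To see what the product terms contribute, one can substitute the indicator values into the interaction model, in the same way as was done for the model without interactions. The following worked equations are a sketch of that substitution, with \(\epsilon_i\) denoting the error term:

```latex
\begin{align*}
\text{Treatment 1 } (x_{i,3}=1,\ x_{i,4}=0):\quad
  y_i &= (\beta_0+\beta_3) + \beta_1 x_{i,1} + (\beta_2+\beta_5)\,x_{i,2} + \epsilon_i \\
\text{Treatment 2 } (x_{i,3}=0,\ x_{i,4}=1):\quad
  y_i &= (\beta_0+\beta_4) + \beta_1 x_{i,1} + (\beta_2+\beta_6)\,x_{i,2} + \epsilon_i \\
\text{Treatment 3 } (x_{i,3}=0,\ x_{i,4}=0):\quad
  y_i &= \beta_0 + \beta_1 x_{i,1} + \beta_2\,x_{i,2} + \epsilon_i
\end{align*}
```

Read this way, \(\beta_5\) and \(\beta_6\) are the differences in the body mass slope for treatments 1 and 2 relative to treatment 3, so \(H_0 : \beta_5 = \beta_6 = 0\) states that all three treatments share the same body mass slope.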
A visual way to assess if there is an interaction is by using an interaction plot. An interaction plot is created by plotting the response versus the quantitative predictor and connecting the successive values according to the grouping of the observations. Recall that an interaction between factors occurs when the change in response from lower levels to higher levels of one factor depends on the level of another factor. Interaction plots allow us to compare the relative strength of the effects across factors. What results is one of three possible trends:
• The lines could be (nearly) parallel, which indicates no interaction. This means that the change in the response from lower levels to higher levels for each factor is roughly the same.
• The lines intersect within the scope of the study, which indicates an interaction. This means that the change in the response from lower levels to higher levels of one factor is noticeably different than the change in another factor. This type of interaction is called a disordinal interaction.
• The lines do not intersect within the scope of the study, but the trends indicate that if we were to extend the levels of our factors, then we may see an interaction. This type of interaction is called an ordinal interaction.
Figure 9.1 illustrates each type of interaction plot using a mock data set pertaining to the mean tensile strength measured at three different speeds of 3 different processes. The upper left plot illustrates the case where no interaction is present because the change in mean tensile strength is similar for each process as you increase the speed (i.e., the lines are parallel). The upper right plot illustrates an interaction because as the speeds increase, the change in mean tensile strength is noticeably different depending on which process is being used (i.e., the lines cross). The bottom right plot illustrates an ordinal interaction where no interaction is present within the scope of the range of speeds studied, but if these trends continued for higher speeds, then we may see an interaction (i.e., the lines may cross).
It should also be noted that just because lines cross, it does not necessarily imply the interaction is statistically significant. Lines which appear nearly parallel, yet cross at some point, may not yield a statistically significant interaction term. If two lines cross, the more different the slopes appear and the more data that is available, then the more likely the interaction term will be significant.
9.5
Relationship to ANCOVA
[Figure 9.1, three panels, each plotting Response versus Predictor for Treatment 1, Treatment 2, and Treatment 3: (a) No Interaction, (b) Disordinal Interaction, (c) Ordinal Interaction.]
Figure 9.1: (a) A plot of no interactions amongst the groups (notice how the lines are nearly parallel). (b) A plot of a disordinal interaction amongst the groups (notice how the lines intersect). (c) A plot of an ordinal interaction amongst the groups (notice how the lines don’t intersect, but if we were to extrapolate beyond the predictor limits, then the lines would likely cross).
When dealing with categorical predictors in regression analysis, we often say that we are performing a regression with indicator variables or a regression with interactions (if we are interested in testing for interactions with indicator variables and other variables). However, in the design and analysis of experiments literature, this model is also used, but with a slightly different motivation. Various experimental layouts using ANOVA tables are commonly used in the design and analysis of experiments. These ANOVA tables are constructed to compare the means of several levels of one or more treatments. For example, a one-way ANOVA can be used to compare six different dosages of blood pressure pills and the mean blood pressure of individuals who are taking one of those six dosages. In this case, there is one factor with six different levels. Suppose further that there are four different races represented in this study. Then a two-way ANOVA can be used since we have two factors - the dosage of the pill and the race of the individual taking the pill. Furthermore, an interaction term can be included if we suspect that the dosage a person is taking and the race of the individual have a combined effect on the response. As you can see, you can extend to the more general n-way ANOVA (with or without interactions) for the setting with n treatments. However, dealing with n > 2 can often lead to difficulty in interpreting the results.
One other important thing to point out with ANOVA models is that, while they use least squares for estimation, they differ from how categorical variables are handled in a regression model. In an ANOVA model, there is a parameter estimated for the factor level means and these are used for the linear model of the ANOVA. This differs slightly from a regression model which estimates a regression coefficient for, say, n−1 indicator variables (assuming there are n levels of the categorical variable and we are using the “leave one out” method). Also, ANOVA models utilize ANOVA tables, which are broken down by each factor (i.e., you would look at the sums of squares for each factor present). ANOVA tables for regression models simply test if the regression model has at least one variable which is a significant predictor of the response. More details on these differences are better left to a course on design of experiments.
When there is also a continuous variable measured with each response, then the n-way ANOVA model needs to reflect the continuous variable. This model is then referred to as an Analysis of Covariance (or ANCOVA) model. The continuous variable in an ANCOVA model is usually called the covariate or sometimes the concomitant variable. One difference in how an ANCOVA model is approached is that an interaction between the covariate and each factor is always tested first. The reason is that an ANCOVA is conducted to investigate the overall relationship between the response and the covariate while assuming this relationship is the same for all groups (i.e., for all treatment levels). If, however, this relationship does differ across the groups, then the overall regression model is inaccurate. This assumption is called the assumption of homogeneity of slopes. It is assessed by testing for parallel slopes, which involves testing the interaction term between the covariate and each factor in the ANCOVA table. If the interaction is not statistically significant, then you can claim parallel slopes and proceed to build the ANCOVA model. If the interaction is statistically significant, then the regression model used is not appropriate and an ANCOVA model should not be used.
As an example of how to write ANCOVA models, first consider the one-way ANCOVA setting. Suppose we have \(i = 1, \ldots, I\) treatments and each treatment has \(j = 1, \ldots, J_i\) pairs of continuous variables measured (i.e., \((x_{i,1}, y_{i,1}), \ldots, (x_{i,J_i}, y_{i,J_i})\)). Then the one-way ANCOVA model is written as
\[
y_{i,j} = \alpha_i + \beta x_{i,j} + \epsilon_{i,j},
\]
where \(\alpha_i\) is the mean of the \(i\)th treatment level, \(\beta\) is the common regression slope, and the \(\epsilon_{i,j}\) are iid normal with mean 0 and variance \(\sigma^2\). So note that the test of parallel slopes concerns testing whether the slope is the same for all treatment levels versus not the same for all treatment levels. A high p-value indicates no evidence against parallel slopes (or homogeneity of slopes), in which case we can use an ANCOVA model.
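One common way to make the parallel slopes test concrete is to write down the larger model that allows a separate slope in each treatment level. The notation \(\beta_i\) for the treatment-specific slopes is an assumption of this sketch and does not appear elsewhere in the text:

```latex
\begin{align*}
\text{Separate slopes (full model):}\quad
  y_{i,j} &= \alpha_i + \beta_i x_{i,j} + \epsilon_{i,j} \\
\text{Parallel slopes (ANCOVA model):}\quad
  y_{i,j} &= \alpha_i + \beta x_{i,j} + \epsilon_{i,j} \\
\text{Hypothesis tested:}\quad
  H_0 &: \beta_1 = \beta_2 = \cdots = \beta_I = \beta
\end{align*}
```

Under this formulation the ANCOVA model is the reduced model, and the comparison of the two fits is the covariate-by-factor interaction test described above.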
9.6
Coded Variables
In the early days when computing power was limited, variables were coded to simplify the linear algebra, which allowed least squares solutions to be computed by hand. Many methods exist for coding data, such as:
• Converting variables to two values (e.g., {-1, 1} or {0, 1}).
• Converting variables to three values (e.g., {-1, 0, 1}).
• Coding continuous variables to reflect only important digits (e.g., if the costs of various nuclear programs range from $100,000 to $150,000, coding can be done by dividing through by $100,000, resulting in the range being from 1 to 1.5).
The purpose of coding is to simplify the calculation of \((\mathbf{X}^{T}\mathbf{X})^{-1}\) in the various regression equations, which was especially important when this had to be done by hand. It is important to note that the above methods are just a few possibilities and that there are no specific guidelines or rules of thumb for when to code data.
Today when \((\mathbf{X}^{T}\mathbf{X})^{-1}\) is calculated with computers, there may be a significant rounding error in the linear algebra manipulations if the difference in the magnitude of the predictors is large. Good statistical programs assess the probability of such errors, which would warrant using coded variables. When coding variables, one should be aware that the magnitudes of the parameter estimates will differ from those for the original data. The intercept term can change dramatically, but we are concerned with any drastic changes in the slope estimates. In order to protect against additional errors due to the varying magnitudes of the regression parameters, you can compare plots of the actual data and the coded data and see if they appear similar.
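The sensitivity to rounding error can be gauged with the condition number of the matrix being inverted. The sketch below is only an illustration: the cost values follow the $100,000 to $150,000 range mentioned above, while the second predictor and the helper name `cond_xtx` are made up.

```python
import numpy as np

# Costs in dollars (within the range used in the text) and a made-up second predictor.
cost = np.array([100000.0, 110000.0, 120000.0, 130000.0, 140000.0, 150000.0])
other = np.array([0.2, 0.9, 0.4, 0.7, 0.1, 0.6])

def cond_xtx(x1, x2):
    """Condition number of X^T X for a design with an intercept, x1, and x2."""
    X = np.column_stack([np.ones(len(x1)), x1, x2])
    return np.linalg.cond(X.T @ X)

cond_raw = cond_xtx(cost, other)               # predictors on very different scales
cond_coded = cond_xtx(cost / 100000.0, other)  # cost coded to range from 1 to 1.5

print(f"raw: {cond_raw:.3e}   coded: {cond_coded:.3e}")
```

A larger condition number means the inverse is more sensitive to rounding; coding the cost variable brings the columns to comparable magnitudes and reduces it substantially.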
9.7
Examples
Example 1: Software Development Data Set
Suppose that data from n = 20 institutions is collected on similar software development projects. The data set includes \(Y\) = number of man-years required for each project, \(X_1\) = number of application subprograms developed for the project, and \(X_2\) = 1 if an academic institution developed the program or 0 if a private firm developed the program. The data is given in Table 9.1.
Suppose we wish to estimate the number of man-years necessary for developing this type of software for the purpose of contract bidding. We also suspect a possible interaction between the number of application subprograms developed and the type of institution. Thus, we consider the multiple regression model
\[
y_i = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \beta_3 x_{i,1} x_{i,2} + \epsilon_i.
\]
So first, we fit the above model and assess the significance of the interaction term.
##########
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  -23.8112    20.4315  -1.165    0.261
subprograms    0.8541     0.1066   8.012 5.44e-07 ***
institution   35.3686    26.7086   1.324    0.204
sub.inst      -0.2019     0.1556  -1.297    0.213
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 38.42 on 16 degrees of freedom
Multiple R-Squared: 0.8616,     Adjusted R-squared: 0.8356
F-statistic: 33.2 on 3 and 16 DF,  p-value: 4.210e-07
##########
The above gives the t-tests for these predictors. Notice that only the predictor of application subprograms (i.e., \(X_1\)) is statistically significant, so we should consider dropping the interaction term for starters.
[Figure 9.2: plot of Number of Man-Years (vertical axis) versus Number of Subprograms (horizontal axis), with points grouped as Academic Institution or Private Firm.]
Figure 9.2: An interaction plot where the grouping is by institution.
An interaction plot can also be used to justify use of an interaction term. Figure 9.2 provides the interaction plot for this data set. This plot seems to indicate a possible (disordinal) interaction. A test of this interaction term yields p = 0.213 (see the earlier output). Even though the interaction plot indicates a possible interaction, the actual interaction term is deemed not statistically significant and thus we can drop it from the model.
 i   Subprograms   Institution   Man-Years
 1       135            0            52
 2       128            1            58
 3       221            0           207
 4        82            1            95
 5       401            0           346
 6       360            1           244
 7       241            0           215
 8       130            0           112
 9       252            1           195
10       220            0            54
11       112            0            48
12        29            1            39
13        57            0            31
14        28            1            57
15        41            1            20
16        27            1            33
17        33            1            19
18         7            0             6
19        17            0             7
20        94            1            56

Table 9.1: The software development data set.
We next provide the analysis without the interaction term. Though the results are not shown here, a test of each predictor shows that the subprograms predictor is statistically significant (p < 0.001), while the institution predictor is not statistically significant (p = 0.612). This then tells us that there is no statistically significant difference in man-years for this type of software development between academic institutions and private firms. However, the number of subprograms is still a statistically significant predictor. So the final model should be a simple linear regression model with subprograms as the predictor and man-years as the response. The final estimated regression equation is
\[
\hat{y}_i = -3.47742 + 0.75088 x_{i,1},
\]
which can be found from the following output:
##########
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.47742   13.12068  -0.265    0.794
subprograms  0.75088    0.07591   9.892 1.06e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 38.38 on 18 degrees of freedom
Multiple R-Squared: 0.8446,     Adjusted R-squared: 0.836
F-statistic: 97.85 on 1 and 18 DF,  p-value: 1.055e-08
##########
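As a check on the two sets of coefficient estimates in this example, the least squares fits can be recomputed directly from Table 9.1. This sketch uses NumPy rather than the software that produced the output; the variable and function names are illustrative assumptions.

```python
import numpy as np

# Data transcribed from Table 9.1.
subprograms = np.array([135, 128, 221, 82, 401, 360, 241, 130, 252, 220,
                        112, 29, 57, 28, 41, 27, 33, 7, 17, 94], dtype=float)
institution = np.array([0, 1, 0, 1, 0, 1, 0, 0, 1, 0,
                        0, 1, 0, 1, 1, 1, 1, 0, 0, 1], dtype=float)
man_years = np.array([52, 58, 207, 95, 346, 244, 215, 112, 195, 54,
                      48, 39, 31, 57, 20, 33, 19, 6, 7, 56], dtype=float)

def ols(X, y):
    """Ordinary least squares coefficients for design matrix X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

ones = np.ones(len(man_years))

# Model with the interaction: intercept, X1, X2, and the product X1 * X2.
X_full = np.column_stack([ones, subprograms, institution,
                          subprograms * institution])
coef_full = ols(X_full, man_years)

# Final model: simple linear regression on the number of subprograms.
X_simple = np.column_stack([ones, subprograms])
coef_simple = ols(X_simple, man_years)

print(np.round(coef_full, 4))
print(np.round(coef_simple, 5))
```

The two printed vectors should match the Estimate columns of the two output listings, up to rounding.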
Example 2: Steam Output Data (continued)
Consider coding the steam output data by rounding the temperature to the nearest integer value ending in either 0 or 5. For example, a temperature of 57.5 degrees would be rounded up to 60 degrees while a temperature of 76.8 degrees would be rounded down to 75 degrees. While you would probably not utilize coding on such an easy data set where magnitude is not an issue, it is utilized here just for illustrative purposes.
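The rounding rule can be written as a one-line function. This is a minimal sketch; the function name `code_temperature` is an assumption, and halves are rounded up to match the 57.5 degree case in the text (Python's built-in `round` would instead round some halves down).

```python
import math

def code_temperature(temp_f):
    """Round a temperature to the nearest multiple of 5, rounding halves up."""
    return 5 * math.floor(temp_f / 5 + 0.5)

print(code_temperature(57.5))  # 60, as in the text
print(code_temperature(76.8))  # 75, as in the text
```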
Figure 9.3 compares the scatterplots of this data set with the original temperature value and the coded temperature value. The plots look comparable, suggesting that coding could be used here. Recall that the estimated regression equation for the original data was \(\hat{y}_i = 13.6230 - 0.0798x_i\). The estimated regression equation for the coded data is \(\hat{y}_i = 13.7765 - 0.0824x_i\), which is also comparable.
[Figure 9.3, two panels titled "Steam Data", each plotting Steam Usage (Monthly) on the vertical axis: (a) versus Uncoded Temperature (Fahrenheit); (b) versus Coded Temperature (Fahrenheit).]
Figure 9.3: Comparing scatterplots of the steam output data with the original temperature (a) and with the temperature coded (b). A line of best fit for each is also shown.
Chapter 10
Multicollinearity
Recall that the columns of a matrix are linearly dependent if one column can be expressed as a linear combination of the other columns. A matrix theorem is that if there is a linear dependence among the columns of \(\mathbf{X}\), then \((\mathbf{X}^{T}\mathbf{X})^{-1}\) does not exist. This means that we can’t determine estimates of the beta coefficients since the formula for determining the estimates involves \((\mathbf{X}^{T}\mathbf{X})^{-1}\).
In multiple regression, the term multicollinearity refers to the linear relationships among the x-variables. Often, the use of this term implies that the x-variables are correlated with each other, so when the x-variables are not correlated with each other, we might say that there is no multicollinearity.
10.1
Sources and Effects of Multicollinearity
There are various sources for multicollinearity. For example, in the data collection phase an investigator may have drawn the data from such a narrow subspace of the independent variables that collinearity appears. Physical constraints, such as design limits, may also impact the range of some of these independent variables. Model specification (such as defining more variables than observations or specifying too many higher-ordered terms/interactions) and outliers can both lead to collinearity.
When there is no multicollinearity among x-variables, the effects of the individual x-variables can be estimated independently of each other (although we will still want to do a multiple regression). When multicollinearity is present, the estimated coefficients are correlated (confounded) with each other. This creates difficulty when we attempt to interpret how individual x-variables affect y.
Along with this correlation, multicollinearity has a multitude of other ramifications on our analysis, including:
• inaccurate regression coefficient estimates,
• inflated standard errors of the regression coefficient estimates,
• deflated t-tests for significance testing of the regression coefficients,
• false nonsignificance determined by the p-values, and
• degradation of model predictability.
In designed experiments with multiple x-variables, researchers usually choose the value of the x-variables so that there is no multicollinearity. In observational studies (sample surveys), it is nearly always the case that the x-variables will be correlated.
10.2
Detecting and Correcting Multicollinearity
We introduce three primary ways for detecting multicollinearity - two of which are fairly straightforward to implement, while the third method is actually a variety of measures based on the eigenvalues and eigenvectors of the standardized design matrix.
Method 1: Pairwise Scatterplots
For the first method, we can visually inspect the data by doing pairwise scatterplots of the independent variables. So if you have p−1 independent variables, then you should inspect all \(\binom{p-1}{2}\) pairwise scatterplots. You will be looking for any plots that seem to indicate a linear relationship between pairs of independent variables.
Method 2: VIF
Second, we can use a measure of multicollinearity called the variance inflation factor (VIF). This is defined as
\[
VIF_j = \frac{1}{1 - R_j^2},
\]
where \(R_j^2\) is the coefficient of determination obtained by regressing \(X_j\) on the remaining independent variables. A common rule of thumb is that if \(VIF_j = 1\), then there is no collinearity, if \(1 < VIF_j < 5\), then there is possibly some moderate collinearity, and if \(VIF_j \geq 5\), then there is a strong indication of a collinearity problem. Most of the time, we will shoot for values as close to 1 as possible and that usually will be sufficient. The bottom line is that the higher the VIF, the more likely multicollinearity is an issue.
Sometimes, the tolerance is also reported. The tolerance is simply the inverse of the VIF (i.e., \(Tol_j = VIF_j^{-1}\)). In this case, the lower the Tol, the more likely multicollinearity is an issue.
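The definition translates directly into code: regress each predictor on the others, compute \(R_j^2\), and invert \(1 - R_j^2\). The following is a minimal sketch; the function name `vif` and the two-predictor demonstration data are assumptions chosen so that the correlation between the predictors is exactly 0.8.

```python
import numpy as np

def vif(X):
    """VIF for each column of X (n x k matrix of predictors, no column of 1's)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        target = X[:, j]
        # Regress X_j on an intercept plus the remaining predictors.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        sst = np.sum((target - target.mean()) ** 2)
        r2 = 1.0 - (resid @ resid) / sst
        out[j] = 1.0 / (1.0 - r2)
    return out

# Made-up predictors whose sample correlation is 0.8.
X_demo = np.column_stack([[1, 2, 3, 4, 5], [2, 1, 4, 3, 5]])
vif_demo = vif(X_demo)
tol_demo = 1.0 / vif_demo   # tolerance is the inverse of the VIF

print(np.round(vif_demo, 4), np.round(tol_demo, 4))
```

With only two predictors, \(R_j^2\) is the squared correlation 0.64, so both VIFs equal 1/(1 − 0.64) ≈ 2.78 and both tolerances equal 0.36.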
If multicollinearity is suspected after doing the above, then a couple of things can be done. First, reassess the choice of model and determine if there are any unnecessary terms and remove them. You may wish to start by removing the one you most suspect first, because this will then drive down the VIFs of the remaining variables.
Next, check for outliers and see what effects some of the observations with higher residuals have on the analysis. Remove some (or all) of the suspected outliers and see how that affects the pairwise scatterplots and VIF values.
You can also standardize the variables, which involves simply subtracting from each variable its mean and dividing by its standard deviation. Thus, the standardized X matrix is given as:
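\[
\mathbf{X}^{*} = \frac{1}{\sqrt{n-1}}
\begin{pmatrix}
\frac{X_{1,1}-\bar{X}_1}{s_{X_1}} & \frac{X_{1,2}-\bar{X}_2}{s_{X_2}} & \cdots & \frac{X_{1,p-1}-\bar{X}_{p-1}}{s_{X_{p-1}}} \\
\frac{X_{2,1}-\bar{X}_1}{s_{X_1}} & \frac{X_{2,2}-\bar{X}_2}{s_{X_2}} & \cdots & \frac{X_{2,p-1}-\bar{X}_{p-1}}{s_{X_{p-1}}} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{X_{n,1}-\bar{X}_1}{s_{X_1}} & \frac{X_{n,2}-\bar{X}_2}{s_{X_2}} & \cdots & \frac{X_{n,p-1}-\bar{X}_{p-1}}{s_{X_{p-1}}}
\end{pmatrix},
\]
which is an \(n \times (p-1)\) matrix, and the standardized Y vector is given as:
\[
\mathbf{Y}^{*} = \frac{1}{\sqrt{n-1}}
\begin{pmatrix}
\frac{Y_1-\bar{Y}}{s_Y} \\
\frac{Y_2-\bar{Y}}{s_Y} \\
\vdots \\
\frac{Y_n-\bar{Y}}{s_Y}
\end{pmatrix},
\]
which is still an n-dimensional vector. Here,
\[
s_{X_j} = \sqrt{\frac{\sum_{i=1}^{n}(X_{i,j}-\bar{X}_j)^2}{n-1}}
\]
for \(j = 1, 2, \ldots, (p-1)\) and
\[
s_Y = \sqrt{\frac{\sum_{i=1}^{n}(Y_i-\bar{Y})^2}{n-1}}.
\]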
Notice that we have removed the column of 1’s in forming X∗, effectively reducing the column dimension of the original X matrix by 1. Because of this, we no longer can estimate an intercept term (b0), which may be an important part of the analysis. Thus, proceed with this method only if you believe the intercept term adds little value to explaining the science behind your regression model!
When using the standardized variables, the regression model of interest becomes:
\[
\mathbf{Y}^{*} = \mathbf{X}^{*}\boldsymbol{\beta}^{*} + \boldsymbol{\epsilon}^{*},
\]
where \(\boldsymbol{\beta}^{*}\) is now a (p−1)-dimensional vector of standardized regression coefficients and \(\boldsymbol{\epsilon}^{*}\) is an n-dimensional vector of errors pertaining to this standardized model. Thus, the ordinary least squares estimates are
\[
\mathbf{b}^{*} = (\mathbf{X}^{*T}\mathbf{X}^{*})^{-1}\mathbf{X}^{*T}\mathbf{Y}^{*} = \mathbf{r}_{XX}^{-1}\mathbf{r}_{XY},
\]
where \(\mathbf{r}_{XX}\) is the \((p-1)\times(p-1)\) correlation matrix of the predictors and \(\mathbf{r}_{XY}\) is the (p−1)-dimensional vector of correlation coefficients between the predictors and the response. Because \(\mathbf{b}^{*}\) is a function of correlations, this method is called a correlation transformation. Sometimes, it may be enough to just simply center the variables by their respective means in order to decrease the VIFs. Note the relationship between the quantities introduced above and the correlation matrix \(\mathbf{r}\) from earlier:
\[
\mathbf{r} =
\begin{pmatrix}
1 & \mathbf{r}_{XY}^{T} \\
\mathbf{r}_{XY} & \mathbf{r}_{XX}
\end{pmatrix}.
\]
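The identity between the standardized cross-products and the correlations can be verified numerically. The sketch below uses a small made-up data set; the function name `correlation_transform` and the demonstration values are assumptions.

```python
import numpy as np

def correlation_transform(X, y):
    """Center each variable, divide by its standard deviation and by sqrt(n - 1)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = X.shape[0]
    Xs = (X - X.mean(axis=0)) / (X.std(axis=0, ddof=1) * np.sqrt(n - 1))
    ys = (y - y.mean()) / (y.std(ddof=1) * np.sqrt(n - 1))
    return Xs, ys

# Made-up data: two predictors and a response, five observations.
X_demo = np.column_stack([[1.0, 2.0, 3.0, 4.0, 5.0], [2.0, 1.0, 4.0, 3.0, 5.0]])
y_demo = np.array([1.0, 3.0, 2.0, 5.0, 4.0])

Xs, ys = correlation_transform(X_demo, y_demo)
r_xx = Xs.T @ Xs                       # correlation matrix of the predictors
r_xy = Xs.T @ ys                       # correlations of predictors with response
b_star = np.linalg.solve(r_xx, r_xy)   # standardized coefficients

print(np.round(r_xx, 4))
print(np.round(r_xy, 4), np.round(b_star, 4))
```

For these values the predictor correlation is 0.8 and the predictor-response correlations are 0.8 and 0.3, so the standardized cross-product matrices reproduce the correlations exactly as the formula for \(\mathbf{b}^{*}\) requires.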
Method 3: Eigenvalue Methods
Finally, the third method for identifying potential multicollinearity concerns a variety of measures utilizing eigenvalues and eigenvectors. First, note that the eigenvalues \(\lambda_j\) and the corresponding (p−1)-dimensional orthonormal eigenvectors \(\boldsymbol{\xi}_j\) are solutions to the system of equations:
\[
\mathbf{X}^{*T}\mathbf{X}^{*}\boldsymbol{\xi}_j = \lambda_j\boldsymbol{\xi}_j,
\]
for \(j = 1, \ldots, (p-1)\). Since the \(\boldsymbol{\xi}_j\)’s are normalized, it follows that
\[
\boldsymbol{\xi}_j^{T}\mathbf{X}^{*T}\mathbf{X}^{*}\boldsymbol{\xi}_j = \lambda_j.
\]
Therefore, if \(\lambda_j \approx 0\), then \(\mathbf{X}^{*}\boldsymbol{\xi}_j \approx \mathbf{0}\); i.e., the columns of \(\mathbf{X}^{*}\) are approximately linearly dependent. Thus, since the sum of the eigenvalues must equal the number of predictors (i.e., (p−1)), very small \(\lambda_j\)’s (say, near 0.05) are indicative of collinearity. Another criterion commonly used is to declare multicollinearity is present when \(\sum_{j=1}^{p-1}\lambda_j^{-1} > 5(p-1)\). Moreover, the entries of the corresponding \(\boldsymbol{\xi}_j\)’s indicate the nature of the linear dependencies; i.e., large elements of the eigenvectors identify the predictor variables that comprise the collinearity.
A measure of the overall multicollinearity of the variables can be obtained by computing what is called the condition number of the correlation matrix of the predictors (i.e., \(\mathbf{r}_{XX}\)), which is defined as \(\sqrt{\lambda_{(p-1)}/\lambda_{(1)}}\), such that \(\lambda_{(1)}\) and \(\lambda_{(p-1)}\) are the minimum and maximum eigenvalues, respectively. This quantity is always at least 1, so a large number is indicative of collinearity. Empirical evidence suggests that a value less than 15 typically means weak collinearity, values between 15 and 30 are evidence of moderate collinearity, while anything over 30 is evidence of strong collinearity.
Condition numbers for the individual predictors can also be calculated. This is accomplished by taking \(c_j = \sqrt{\lambda_{(p-1)}/\lambda_j}\) for each \(j = 1, \ldots, (p-1)\). When data is centered and scaled, then \(c_j \leq 100\) indicates no collinearity, \(100 < c_j < 1000\) indicates moderate collinearity, while \(c_j \geq 1000\) indicates strong collinearity for predictor \(X_j\). When the data is only scaled (i.e., for regression through the origin models), then collinearity will always be worse. Thus, more relaxed limits are usually used. For example, a common rule of thumb is to use 5 times the limits mentioned above; namely, \(c_j \leq 500\) indicates no collinearity, \(500 < c_j < 5000\) indicates moderate collinearity, while \(c_j \geq 5000\) indicates strong collinearity for predictor \(X_j\).
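These eigenvalue-based quantities are short to compute once the correlation matrix of the predictors is available. The sketch below assumes a made-up two-predictor correlation matrix with correlation 0.8; the variable names are illustrative.

```python
import numpy as np

# Assumed correlation matrix of two predictors (correlation 0.8).
r_xx = np.array([[1.0, 0.8],
                 [0.8, 1.0]])

# eigh is meant for symmetric matrices and returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(r_xx)
k = len(eigvals)                                     # number of predictors, p - 1

cond_number = np.sqrt(eigvals[-1] / eigvals[0])      # sqrt(max / min)
cond_by_eigenvalue = np.sqrt(eigvals[-1] / eigvals)  # c_j for each eigenvalue
inverse_sum_flag = np.sum(1.0 / eigvals) > 5 * k     # criterion on sum of 1/lambda_j

print(np.round(eigvals, 4), round(float(eigvals.sum()), 4))
print(round(float(cond_number), 4), np.round(cond_by_eigenvalue, 4), bool(inverse_sum_flag))
```

Here the eigenvalues are 0.2 and 1.8, which sum to the number of predictors (2), and the condition number is \(\sqrt{1.8/0.2} = 3\), well under the thresholds quoted above. The columns of `eigvecs` hold the eigenvectors whose large entries would point to the predictors involved in a near dependency.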
It should be noted that there are many heuristic ways other than those described above to assess multicollinearity with eigenvalues and eigenvectors.¹ Moreover, it should be noted that some observations can have an undue influence on these various measures of collinearity. These observations are called collinearity-influential observations and care should be taken with how these observations are handled. You can typically use some of the residual diagnostic measures (e.g., DFFITS, Cook’s \(D_i\), DFBETAS, etc.) for identifying potential collinearity-influential observations since there is no established or agreed-upon method for classifying such observations.

¹ For example, one such technique involves taking the square eigenvector relative to the square eigenvalue and then seeing what percentage each quantity in this (p−1)-dimensional vector explains of the total variation for the corresponding regression coefficient.
Finally, there are also some more advanced regression procedures that can be performed in the presence of multicollinearity. Such methods include principal components regression and ridge regression. These methods are discussed later.
10.3
Examples
Example 1: Muscle Mass Data Set
Suppose that data from n = 6 individuals is collected on their muscle mass. The data set includes Y = muscle mass, X1 = age, and two possible (yet redundant) indicator variables for gender:
1. X2 = 1 if the person is male (M), and 0 if female (F).
2. X3 = 1 if the person is female (F), and 0 if male (M).
The data are given in Table 10.1.
mass   60   50   70   42   50   45
age    40   45   43   60   60   65
sex    M    F    M    F    M    F
Table 10.1: The muscle mass data set.
Suppose that we (mistakenly) attempt to use the model
$y_i = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \beta_3 x_{i,3} + \epsilon_i$.
Notice that we used both indicator variables in the model. For these data and this model,
$X = \begin{pmatrix} 1 & 40 & 1 & 0 \\ 1 & 45 & 0 & 1 \\ 1 & 43 & 1 & 0 \\ 1 & 60 & 0 & 1 \\ 1 & 60 & 1 & 0 \\ 1 & 65 & 0 & 1 \end{pmatrix}$.
The sum of the last two columns equals the first column for every row in the X matrix. This is a linear dependence, so parameter estimates cannot be calculated because $(X^{T}X)^{-1}$ does not exist. In practice, the usual solution is to drop one of the indicator variables from the model. Another solution is to drop the intercept (thus dropping the first column of X above), but that is not usually done.
For this example, we can't proceed with a multiple regression analysis because there is perfect collinearity involving X2 and X3. Sometimes a generalized inverse can be used, although that requires a discussion beyond the scope of this course. Alternatively, if you attempt an analysis on such a data set, the software you are using may zero out one of the variables that is contributing to the collinearity and then proceed with the analysis. However, this can lead to errors in the final analysis.
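The linear dependence in this example can be checked numerically. The sketch below uses only the Python standard library and the values from Table 10.1; the helper names gram and det are hypothetical. It computes the determinant of the cross-product matrix exactly, once with both indicators and once with X3 dropped. A zero determinant means the inverse does not exist.

```python
from fractions import Fraction

age = [40, 45, 43, 60, 60, 65]
sex = ["M", "F", "M", "F", "M", "F"]

# Design matrix with intercept, age, and both indicator variables.
X_full = [[1, a, int(s == "M"), int(s == "F")] for a, s in zip(age, sex)]
# Same design matrix with the redundant indicator X3 dropped.
X_reduced = [row[:3] for row in X_full]

def gram(X):
    """Cross-product matrix X^T X, built from the columns of X."""
    cols = list(zip(*X))
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]

def det(M):
    """Exact determinant by Gaussian elimination over fractions."""
    A = [[Fraction(v) for v in row] for row in M]
    n = len(A)
    d = Fraction(1)
    for k in range(n):
        pivot = next((i for i in range(k, n) if A[i][k] != 0), None)
        if pivot is None:
            return Fraction(0)          # no nonzero pivot: the matrix is singular
        if pivot != k:
            A[k], A[pivot] = A[pivot], A[k]
            d = -d                      # a row swap flips the sign
        d *= A[k][k]
        for i in range(k + 1, n):
            f = A[i][k] / A[k][k]
            for j in range(k, n):
                A[i][j] -= f * A[k][j]
    return d

det_full = det(gram(X_full))
det_reduced = det(gram(X_reduced))
print("det with both indicators:", det_full)
print("det with X3 dropped:", det_reduced)
```

Exact fractions are used here instead of floating point so that the singular case comes out as exactly zero and is not confused with a tiny rounding error.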
Example 2: Heat Flux Data Set (continued)
Let us return to the heat flux data set. Let our model include the east, south, and north focal points, but also incorporate time and insolation as predictors. First, let us run a multiple regression analysis which includes these predictors.
##########
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 325.43612   96.12721   3.385  0.00255 **
east          2.55198    1.24824   2.044  0.05252 .
north       -22.94947    2.70360  -8.488 1.53e-08 ***
south         3.80019    1.46114   2.601  0.01598 *
time          2.41748    1.80829   1.337  0.19433
insolation    0.06753    0.02899   2.329  0.02900 *
---
Signif. c