MULTICOLLINEARITY
Figure 3.10 Matrix plot of the predictor variables (sugars, fiber, and potassium) shows correlation between fiber and potassium.
robust with respect to the inclusion of correlated variables in the model, as we shall verify in the exercises. So in the presence of correlated predictors, we would look to $c_i$ to help explain large changes in $s_{b_i}$.
We may express $c_i$ as

$$c_i = \frac{1}{(n-1)s_i^2} \cdot \frac{1}{1-R_i^2}$$

where $s_i^2$ represents the sample variance of the observed values of the $i$th predictor, $x_i$, and $R_i^2$ represents the $R^2$ value obtained by regressing $x_i$ on the other predictor variables. Note that $R_i^2$ will be large when $x_i$ is highly correlated with the other predictors. Of the two factors in $c_i$, the first, $1/((n-1)s_i^2)$, measures only the intrinsic variability within the $i$th predictor, $x_i$. It is the second factor, $1/(1-R_i^2)$, that measures the correlation between the $i$th predictor $x_i$ and the remaining predictor variables. For this reason, this second factor is called the variance inflation factor (VIF) for $x_i$:

$$\mathrm{VIF}_i = \frac{1}{1-R_i^2}$$
Can we describe the behavior of the VIF? Suppose that $x_i$ is completely uncorrelated with the remaining predictors, so that $R_i^2 = 0$. Then we will have $\mathrm{VIF}_i = 1/(1-0) = 1$. That is, the minimum value for VIF is 1, which is reached when $x_i$ is completely uncorrelated with the remaining predictors. However, as the degree of correlation between $x_i$ and the other predictors increases, $R_i^2$ will also increase. In that case, $\mathrm{VIF}_i = 1/(1-R_i^2)$ will increase without bound as $R_i^2$ approaches 1. Thus, there is no upper limit to the value that $\mathrm{VIF}_i$ can take.
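The definition above can be sketched in a few lines of Python: regress each predictor on the others, record $R_i^2$, and return $1/(1-R_i^2)$. This is a minimal illustration using only the standard library, not the procedure used to produce the tables in this chapter. The function names (`_solve`, `r_squared`, `vif`) and the toy data are assumptions introduced here; the toy values are made up and are not the cereals data.

```python
def _solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def r_squared(y, columns):
    """R^2 from a least-squares regression of y on the given columns plus an intercept."""
    n = len(y)
    design = [[1.0] + [col[k] for col in columns] for k in range(n)]
    p = len(design[0])
    # Normal equations: (X'X) beta = X'y
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * yk for row, yk in zip(design, y)) for i in range(p)]
    beta = _solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, row)) for row in design]
    mean_y = sum(y) / n
    sse = sum((yk - fk) ** 2 for yk, fk in zip(y, fitted))
    sst = sum((yk - mean_y) ** 2 for yk in y)
    return 1.0 - sse / sst


def vif(columns):
    """VIF_i = 1 / (1 - R_i^2), where R_i^2 comes from regressing column i on the rest."""
    result = []
    for i, col in enumerate(columns):
        others = columns[:i] + columns[i + 1:]
        result.append(1.0 / (1.0 - r_squared(col, others)))
    return result


# Made-up toy predictors: x2 tracks x1 closely, x3 is only weakly related to either.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
x2 = [1.1, 1.9, 3.2, 3.8, 5.1, 6.2, 6.8, 8.1]
x3 = [2.0, -1.0, 0.5, 1.5, -2.0, 0.0, 1.0, -1.5]
for name, value in zip(["x1", "x2", "x3"], vif([x1, x2, x3])):
    print(f"VIF({name}) = {value:.2f}")
```

In this toy setup the two nearly collinear predictors should receive large VIFs while the third stays close to the minimum of 1. Solving the normal equations directly is adequate for a small illustration but can be numerically fragile when predictors are almost perfectly collinear.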
SPH SPH
JWDD006-03 JWDD006-Larose November 25, 2005 17:26 Char Count= 0
120 CHAPTER 3 MULTIPLE REGRESSION AND MODEL BUILDING
What effect do these changes in $\mathrm{VIF}_i$ have on $s_{b_i}$, the variability of the $i$th coefficient? We have

$$s_{b_i} = s\sqrt{c_i} = s\sqrt{\frac{1}{(n-1)s_i^2} \cdot \frac{1}{1-R_i^2}} = s\sqrt{\frac{\mathrm{VIF}_i}{(n-1)s_i^2}}$$
If $x_i$ is uncorrelated with the other predictors, $\mathrm{VIF}_i = 1$, and the standard error of the coefficient, $s_{b_i}$, will not be inflated. However, if $x_i$ is correlated with the other predictors, the large $\mathrm{VIF}_i$ value will inflate the standard error $s_{b_i}$. As you know, inflating the variance estimates degrades the precision of the estimation.
A rough rule of thumb for interpreting the value of the VIF is to consider $\mathrm{VIF}_i \geq 5$ to be an indicator of moderate multicollinearity and to consider $\mathrm{VIF}_i \geq 10$ to be an indicator of severe multicollinearity. A variance inflation factor of 5 corresponds to $R_i^2 = 0.80$, and $\mathrm{VIF}_i = 10$ corresponds to $R_i^2 = 0.90$.
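As a small sketch of this rule of thumb, the snippet below converts a VIF back to the $R_i^2$ it implies, reports the factor $\sqrt{\mathrm{VIF}_i}$ by which $s_{b_i}$ grows relative to the uncorrelated case (this follows from the expression for $s_{b_i}$ above), and applies the cutoffs of 5 and 10. The function names and the label strings are hypothetical choices made for this illustration. The sample VIFs are the ones reported in Table 3.12.

```python
import math


def implied_r2(vif_value):
    """Invert VIF = 1 / (1 - R^2) to recover R^2."""
    return 1.0 - 1.0 / vif_value


def se_inflation(vif_value):
    """Factor by which the coefficient's standard error grows relative to VIF = 1."""
    return math.sqrt(vif_value)


def classify(vif_value):
    """Apply the rough rule of thumb: >= 10 severe, >= 5 moderate, otherwise low."""
    if vif_value >= 10:
        return "severe"
    if vif_value >= 5:
        return "moderate"
    return "low"


# VIFs as reported in Table 3.12.
reported = {"Sugars": 1.4, "Fiber": 6.5, "shelf 2": 1.4, "Potass": 6.7}
for name, v in reported.items():
    print(f"{name:8s} VIF={v:4.1f}  implied R^2={implied_r2(v):.3f}  "
          f"SE factor={se_inflation(v):.2f}  {classify(v)}")
```

The cutoffs are conventions rather than sharp boundaries, so a VIF slightly below 5 should not be read as proof that correlated predictors are harmless.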
Getting back to our example, suppose that we went ahead with the regression of nutritional rating on sugars, fiber, the shelf 2 indicator, and the new variable, potassium, which is correlated with fiber. The results, including the observed variance inflation factors, are shown in Table 3.12. The estimated regression equation for this model is
$$\hat{y} = 52.184 - 2.1953(\text{sugars}) + 4.1449(\text{fiber}) + 2.588(\text{shelf 2}) - 0.04208(\text{potassium})$$
The p-value for potassium is not very small (0.099), so at first glance the variable may or may not be included in the model. Also, the p-value for the shelf 2 indicator variable (0.156) has increased to such an extent that we should perhaps not include it in the model. However, we should probably not put too much credence in any of these results, since the observed VIFs seem to indicate the presence of a multicollinearity problem. We need to resolve the evident multicollinearity before moving forward with this model.
The VIF for fiber is 6.5 and the VIF for potassium is 6.7, with both values indicating moderate-to-strong multicollinearity. At least the problem is localized to these two variables, as the other VIFs are reported at acceptably low values. How shall we deal with this problem? Some texts suggest choosing one of the variables and eliminating it from the model. However, this should be viewed only as a last resort, since the variable omitted may have something to teach us. As we saw in Chapter 1, principal components can be a powerful method for using the correlation structure in a large group of predictors to produce a smaller set of uncorrelated components. However, the multicollinearity problem in this example is strictly localized to two variables, so the application of principal components analysis in this instance might be considered overkill. Instead, we may prefer to construct a user-defined composite, as discussed in Chapter 1. Here, our user-defined composite will be as simple as possible: the mean of $\text{fiber}_z$ and $\text{potassium}_z$, where the $z$-subscript notation indicates that the variables have been standardized. Thus, our composite $W$ is defined as $W = (\text{fiber}_z + \text{potassium}_z)/2$.
TABLE 3.12 Regression Results, with Variance Inflation Factors Indicating a Multicollinearity Problem

The regression equation is
Rating = 52.2 - 2.20 Sugars + 4.14 Fiber + 2.59 shelf 2 - 0.0421 Potass

Predictor       Coef   SE Coef        T       P   VIF
Constant      52.184     1.632    31.97   0.000
Sugars       -2.1953    0.1854   -11.84   0.000   1.4
Fiber         4.1449    0.7433     5.58   0.000   6.5
shelf 2        2.588     1.805     1.43   0.156   1.4
Potass      -0.04208   0.02520    -1.67   0.099   6.7

S = 6.06446   R-Sq = 82.3%   R-Sq(adj) = 81.4%

Analysis of Variance
Source            DF        SS       MS       F       P
Regression         4   12348.8   3087.2   83.94   0.000
Residual Error    72    2648.0     36.8
Total             76   14996.8

Source     DF   Seq SS
Sugars      1   8701.7
Fiber       1   3416.1
shelf 2     1    128.5
Potass      1    102.5
Note that we need to standardize the variables involved in the composite, to avoid the possibility that the greater variability of one of the variables will overwhelm that of the other variable. For example, the standard deviation of fiber among all cereals is 2.38 grams, and the standard deviation of potassium is 71.29 milligrams. (The grams/milligrams scale difference is not at issue here. What is relevant is the difference in variability, even on their respective scales.) Figure 3.11 illustrates the difference in variability.
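A minimal sketch of the composite construction follows, assuming the sample standard deviation is used for standardization (the text does not say which divisor was used). The helper names `standardize` and `composite` and the six data points are hypothetical; the values are made up for illustration and are not taken from the cereals data set.

```python
from statistics import mean, stdev


def standardize(values):
    """Convert values to z-scores using the sample mean and sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def composite(a, b):
    """Equal-weight composite W = (a_z + b_z) / 2 of two standardized variables."""
    a_z, b_z = standardize(a), standardize(b)
    return [(x + y) / 2 for x, y in zip(a_z, b_z)]


# Made-up illustrative values, not the cereals data.
fiber = [10.0, 9.0, 1.0, 2.0, 5.0, 0.0]                 # grams
potassium = [280.0, 330.0, 35.0, 95.0, 160.0, 25.0]     # milligrams

w = composite(fiber, potassium)
raw_mean = [(f + k) / 2 for f, k in zip(fiber, potassium)]  # unstandardized, for contrast

for f, k, wi, r in zip(fiber, potassium, w, raw_mean):
    print(f"fiber={f:5.1f}  potassium={k:6.1f}  W={wi:6.3f}  raw mean={r:7.2f}")
```

The raw mean column is included only for contrast: without standardization, the variable with the larger spread dominates the average, which is the problem the paragraph above describes.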
We therefore proceed to perform the regression of nutritional rating on the variables $\text{sugars}_z$ and shelf 2, and $W = (\text{fiber}_z + \text{potassium}_z)/2$. The results are provided in Table 3.13.
Note first that the multicollinearity problem seems to have been resolved, with the VIF values all near 1. Note also, however, that the regression results are rather disappointing, with the values of $R^2$, $R^2_{\text{adj}}$, and $s$ all underperforming the model results found in Table 3.7, from the model $y = \beta_0 + \beta_1(\text{sugars}) + \beta_2(\text{fiber}) + \beta_4(\text{shelf 2}) + \varepsilon$, which did not even include the potassium variable.
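For readers who want to check how the three summary statistics relate to the analysis-of-variance output, the sketch below recomputes $R^2$, $R^2_{\text{adj}}$, and $s$ from the residual and total sums of squares printed in Tables 3.12 and 3.13. The sample size $n = 77$ is inferred from the total degrees of freedom (76), and the function name `fit_summary` is hypothetical. Because the printed sums of squares are rounded, the results should agree with the tables only to about the precision shown there.

```python
import math


def fit_summary(sse, sst, n, p):
    """Return (R^2, adjusted R^2, s) for a model with p predictors fit to n records."""
    r2 = 1.0 - sse / sst
    mse = sse / (n - p - 1)                 # residual mean square
    r2_adj = 1.0 - mse / (sst / (n - 1))    # penalizes the number of predictors
    s = math.sqrt(mse)                      # standard error of the estimate
    return r2, r2_adj, s


# (SSE, SST, n, p) copied from the ANOVA sections of Tables 3.12 and 3.13.
models = {
    "Table 3.12 (sugars, fiber, shelf 2, potassium)": (2648.0, 14996.8, 77, 4),
    "Table 3.13 (sugars_z, shelf 2, composite W)": (3149.9, 14996.8, 77, 3),
}
for label, args in models.items():
    r2, r2_adj, s = fit_summary(*args)
    print(f"{label}: R-Sq = {100 * r2:.1f}%, "
          f"R-Sq(adj) = {100 * r2_adj:.1f}%, S = {s:.3f}")
```

Comparing adjusted values rather than raw $R^2$ is the fairer check here, since the two models use different numbers of predictors.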
Figure 3.11 Fiber and potassium have different variabilities, requiring standardization prior to construction of a user-defined composite.

What is going on here? The problem stems from the fact that the fiber variable is a very good predictor of nutritional rating, especially when coupled with sugar content, as we shall see later when we perform best subsets regression. Therefore, using the fiber variable to form a composite with a variable that has weaker correlation with rating dilutes the strength of fiber’s strong association with rating and so degrades the efficacy of the model.
Thus, reluctantly, we put aside the model $y = \beta_0 + \beta_1(\text{sugars}_z) + \beta_4(\text{shelf 2}) + \beta_5(W) + \varepsilon$. One possible alternative is to change the weights in the composite, to increase the weight of fiber with respect to potassium. For example,
TABLE 3.13 Results of Regression of Rating on Sugars, Shelf 2, and the Fiber/Potassium Composite

The regression equation is
Rating = 41.7 - 10.9 sugars z + 3.67 shelf 2 + 6.97 fiber potass

Predictor          Coef   SE Coef        T       P   VIF
Constant        41.6642    0.9149    45.54   0.000
sugars z       -10.9149    0.8149   -13.39   0.000   1.2
shelf 2           3.672     1.929     1.90   0.061   1.3
fiber potass     6.9722    0.8230     8.47   0.000   1.1

S = 6.56878   R-Sq = 79.0%   R-Sq(adj) = 78.1%

Analysis of Variance
Source            DF        SS       MS       F       P
Regression         3   11846.9   3949.0   91.52   0.000
Residual Error    73    3149.9     43.1
Total             76   14996.8