Collinearity - Outlier Detection and Multicollinearity in Sequential Variable Selection: A Lea

While the main loop is progressing, laron keeps track of two measures of collinearity: the condition number of the design matrix and VIFs for individual variables. These are recorded along the LARS path so that it is possible to identify potential causes of sudden spikes in collinearity. Such a spike may arise through the entry of a collinear variable into the model or through the deletion of cases, as portrayed in Figure 3.1.
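For readers who want to see the two measures concretely, the following is a minimal base R sketch on simulated data. It is not the laron implementation: the helper names `vif` and `cond_number` are hypothetical, and the condition number convention used here (ratio of the largest to the smallest singular value of the centered, unit-scaled design matrix) is an assumption, since the exact scaling used along the LARS path is not stated in this section.

```r
# Illustrative helpers only; these are NOT the laron internals.
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- x1 + x2 + rnorm(n, sd = 0.1)  # nearly a linear combination of x1 and x2
X  <- cbind(x1 = x1, x2 = x2, x3 = x3)

# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on
# the others. It equals the j-th diagonal entry of the inverse correlation matrix.
vif <- function(X) {
  v <- diag(solve(cor(X)))
  names(v) <- colnames(X)
  v
}

# Condition number (assumed convention): largest over smallest singular value
# of the centered and unit-scaled design matrix.
cond_number <- function(X) {
  d <- svd(scale(X))$d
  max(d) / min(d)
}

vifs <- vif(X)
kap  <- cond_number(X)
print(round(vifs, 1))
print(round(kap, 1))
```

Recomputing both quantities after each step of a path, or after each deletion of cases, gives the kind of trace in which a sudden spike can be located.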

In data sets with no significant outliers and particularly in the presence of collinear predictors, laron tends to increase the number of cases that are removed in order to form a branch. When variables are correlated, the Lasso tends to select one of the group and ignore the others; changes in the data set may induce selection of a different variable from the group with relatively little impact on either the prediction or interpretability of the model.

To examine this phenomenon, let us use the diabetes data from the lars package:

> fit3 <- laron(diabetes$x, log(diabetes$y))
> fit3

Call:
laron.default(x = diabetes$x, y = log(diabetes$y))

9 cases nominated for investigation:
Showing the first 5 cases.

Case                    93   388   290    59   381
Avg Cook’s Distance  2.056 1.738   1.3  1.17 0.911
----
Maximum condition number 23.71293
Occurs in branch 4 step 10
----
5 variables showed evidence of collinearity:

Variable      tc    ldl    hdl    ltg   tch
Max VIF   69.678 45.305 18.163 10.832 8.916

Here 9 cases are nominated as outlying, yet their average standardized Cook’s distances are fairly small. Using the outliers function, I can see that most of these points have a selection ratio of 0.75. Generally, I would not expect these observations to affect the set of predictors chosen by LARS. When variables tc and tch are removed (thus removing any indication of collinearity), the selection ratio of these observations drops to 0.5 or below. The second reason that so many observations are selected is that the acceptable tail probability levels are set as high as 0.1 for outlier nomination. They are set this high intentionally, as a conservative measure for when the variance estimate is unnaturally inflated. Setting probs = c(0.0001, 0.001, 0.01) removes all but one case (93) from consideration.

Additional information on collinearity can be obtained through the collinearity function. This extracts information concerning high condition numbers (including the branch and step where each occurs) and VIFs. It also performs a check for groups of predictors with pairwise correlations above a certain threshold. This threshold, maxcor, is the only option available, and defaults to 0.8. Using the diabetes data as an example again,

> collinearity(fit3, maxcor = 0.7)

Largest Condition Numbers:
   Branch   Step Condition.Number
1:  4.000 10.000           23.713 .
2:  4.000  9.000           23.225 .
3:  2.000 10.000           22.465 .
4:  3.000 10.000           21.701 .
5:  1.000 12.000           21.682 .
6:  2.000  9.000           19.342 .
7:  1.000  9.000           18.826 .
8:  2.000  8.000           18.786 .
Showing 8 of 12 nominated.
---
Condition Numbers: Inf *** 100 ** 50 * 30 . 15 0

Most Collinear Variables:
   Variable Max.VIF
1: tc        69.678 **
2: ldl       45.305 *
3: hdl       18.163 *
4: ltg       10.832 *
5: tch        8.916 .
---
Variance Inflation Factors: Inf ** 50 * 10 . 5 0

Correlated Groups:
Group 1: tc ldl

This example also provides an excellent reminder that collinearity and correlation are not synonymous in multiple regression. The hdl, ltg, and tch predictors exhibit high collinearity, though they are not correlated (with r > 0.7) with any other predictor.
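The distinction can be reproduced on simulated data. The sketch below is an illustration in base R under assumed settings, not the check performed by the collinearity function: the variable names are hypothetical, and only the threshold name `maxcor` and the value 0.7 are taken from the example above. A predictor that depends jointly on three others has a large VIF even though none of its pairwise correlations reaches the threshold.

```r
# Simulated illustration; not the collinearity() implementation.
set.seed(2)
n  <- 500
z1 <- rnorm(n)
z2 <- rnorm(n)
z3 <- rnorm(n)
z4 <- z1 + z2 + z3 + rnorm(n, sd = 0.2)  # depends on three predictors jointly
X  <- cbind(z1 = z1, z2 = z2, z3 = z3, z4 = z4)

R      <- cor(X)
maxcor <- 0.7  # same threshold value as in the example above

# Number of predictor pairs whose absolute correlation exceeds the threshold
n_pairs <- sum(abs(R) > maxcor & upper.tri(R))

# Largest absolute pairwise correlation and the VIFs of all four predictors
max_r <- max(abs(R[upper.tri(R)]))
vifs  <- diag(solve(R))

print(n_pairs)
print(round(max_r, 2))
print(round(vifs, 1))
```

In this construction no pair would be reported as a correlated group, yet every VIF is well above 10, which mirrors the behavior of hdl, ltg, and tch in the diabetes output.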

4.3.1 Interpreting Collinear Graphs

There are four graphs that concern collinearity within the laron fit: a VIF bar chart, the condition number path, a correlation heat map, and a plot of the change in condition number against the influence of an omitted observation. When plotting a laron object, they are obtained with which.graphs = c(4, 5, 7, 9). The first three may also be obtained by plotting the output from the collinearity function, with graph indices 1, 2, and 3, respectively. Examples of these plots for the diabetes data are shown in Figure 4.2.

Figure 4.2: Collinearity plots for the diabetes data.

The VIF bar chart shows the relative amount of collinearity exhibited by individual predictors. The condition number path allows for easy identification of the step in the LARS process where collinearity first becomes an issue for each branch. Pairwise correlated variables are visible in the correlation heat map. If there are any correlated groups, the heat map shows these across rows with dots indicating inclusion in the group. If there are no correlated groups, a standard correlation heat map is shown. The final plot indicates if there are any observations whose removal tends to uncover collinearity. Here, an observation’s average influence is plotted against the absolute change in the condition number of the design matrix between the last step when it was included and the first step when the observation was removed. Cases with a large change (e.g. greater than 10) combined with a large influence, near the top right of the graph, may be suspicious. Points near the top right are labeled with their branch number and subject identification number for easy exploration.
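The idea behind the final plot can be sketched with a plain case-deletion calculation. This is a simplified stand-in, assuming a fixed set of two simulated predictors and a hypothetical `cond_number` helper. It compares the condition number with and without each observation, whereas laron compares steps along the LARS path.

```r
# Simplified case-deletion sketch; laron itself compares steps along the path.
set.seed(3)
n  <- 50
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)  # x2 is almost identical to x1 ...
x1[1] <- 10
x2[1] <- -10                    # ... except for one case that masks this
X  <- cbind(x1 = x1, x2 = x2)

# Assumed convention: ratio of extreme singular values of the scaled matrix
cond_number <- function(X) {
  d <- svd(scale(X))$d
  max(d) / min(d)
}

kappa_full <- cond_number(X)

# Condition number after deleting each observation in turn
kappa_drop <- vapply(
  seq_len(n),
  function(i) cond_number(X[-i, , drop = FALSE]),
  numeric(1)
)

# Absolute change attributable to each case
delta <- abs(kappa_drop - kappa_full)
print(which.max(delta))
print(round(max(delta), 1))
```

Here the first observation hides the near-exact relationship between the two predictors, so deleting it produces by far the largest change in the condition number. Plotting `delta` against a measure of influence would place such a case toward the top right.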

In the case of the diabetes data shown in Figure 4.2, the VIF bar chart shows that predictors tc and ldl show the strongest evidence of collinearity. From the correlation heat map, it is clear that the two variables are highly correlated with each other (r = 0.897). When tc is removed, only tch maintains its high VIF, although the condition number stays below 15 for the entire path. In the condition number path shown, all four branches eventually yield a high condition number, although only branch four has a steep spike immediately after it breaks off in step 7. This could indicate that the removal of these observations is the cause of the sudden spike; however, in this case it is unlikely, due to the low Cook’s distances of the observations.

In document Outlier Detection and Multicollinearity in Sequential Variable Selection: A Least Angle Regression-Based Approach (Page 103-107)