Reduction of small coefficients to zero - The Instability of Cross-Validated Lasso

One of the benefits of Lasso is that it reduces the number of covariates with non-zero coefficients, whereas other regularisation methods, such as ridge, only shrink the coefficients towards zero. However, even when Lasso selects a covariate, i.e. gives it a non-zero coefficient, that coefficient may be so small that in effect it does not contribute to the prediction. When the focus is on mapping the relevant genes, i.e. not on finding the exact coefficients but only on finding the genes with non-zero coefficients, genes with such small coefficients may be of no interest. In fact, it would be useful to rule them out and instead find the most relevant genes.

To strengthen the effects of the truly relevant genes, the λ chosen by optL1 was decreased by a constant eps = 10. This forces λ to move towards the right in Figure 2.5, which leads to more covariates in the model: those that were already in the model will mostly get larger coefficients, and the new covariates will have small coefficients.
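The effect of a smaller λ can be illustrated with a minimal sketch. It assumes an orthonormal design, in which case the Lasso estimate of each coefficient is the soft-thresholded unpenalised estimate. This is a textbook special case and not what optL1 computes on the simulated data; the unpenalised estimates and the value `lam_cv` below are hypothetical, and only eps = 10 is taken from the text.

```python
def soft_threshold(z, lam):
    """Lasso estimate of one coefficient under an orthonormal design (assumption)."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


# Hypothetical unpenalised estimates for five covariates.
ols = [40.0, -25.0, 12.0, -8.0, 3.0]

lam_cv = 15.0              # hypothetical stand-in for the lambda chosen by CV
eps = 10.0                 # constant by which lambda is decreased
lam_small = lam_cv - eps

beta_cv = [soft_threshold(z, lam_cv) for z in ols]
beta_small = [soft_threshold(z, lam_small) for z in ols]

print(beta_cv)      # two covariates in the model
print(beta_small)   # four covariates; the two old ones have grown
```

In this simplified setting the covariates already in the model always get larger coefficients when λ decreases, and the newly entered covariates have the smallest ones. With correlated covariates, as in the simulations, this holds only mostly.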

Figure 4.11: Speed of the gene effect reduction, sizes. The simulation parameters were set to n = 300 and ρ = 0.3. The heat maps show that the signature sizes stay approximately the same even though the speed of the gene effect reduction decreases. The top image shows 10-fold CV, and the bottom image shows LOOCV.

Figure 4.12: Speed of the gene effect reduction, indices. n = 300 and ρ = 0.3. LOOCV. This heat map shows the genes with non-zero coefficients after running optL1. Each row is a new dataset and a new run with optL1.

Then the coefficients smaller than a cutoff limit were set to zero, excluding these genes from the model. The new set of coefficients of all genes forms the basis of the following calculations:

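\[
\text{sensitivity} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}} = \frac{TP}{TP + FN}, \tag{4.1}
\]

\[
\text{specificity} = \frac{\text{True Negatives}}{\text{False Positives} + \text{True Negatives}} = \frac{TN}{FP + TN}. \tag{4.2}
\]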

TP is the number of real relevant genes with a non-zero coefficient, FN is the number of real relevant genes with coefficient zero, FP is the number of irrelevant genes with a non-zero coefficient, and TN is the number of irrelevant genes with coefficient zero. The terms are summarised in Figure 4.13. Sensitivity is also called the True Positive Rate (TPR), and specificity equals 1 − False Positive Rate (FPR). The relation between these numbers is visualised in an ROC (Receiver Operating Characteristic) curve [13] in Figure 4.14.

Figure 4.13: ROC grid.
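The cutoff step and the two rates in (4.1) and (4.2) can be sketched as follows. The function names and the toy coefficients are hypothetical, and the sketch assumes that the cutoff is applied to the absolute value of each coefficient.

```python
def apply_cutoff(coefs, cutoff):
    """Set coefficients with absolute value below `cutoff` to zero (assumed rule)."""
    return [c if abs(c) >= cutoff else 0.0 for c in coefs]


def confusion_counts(coefs, relevant):
    """Return (TP, FN, FP, TN) given coefficients and the set of truly relevant indices."""
    tp = fn = fp = tn = 0
    for i, c in enumerate(coefs):
        selected = c != 0
        if i in relevant:
            if selected:
                tp += 1
            else:
                fn += 1
        else:
            if selected:
                fp += 1
            else:
                tn += 1
    return tp, fn, fp, tn


def sensitivity(tp, fn):
    return tp / (tp + fn)      # equation (4.1)


def specificity(fp, tn):
    return tn / (fp + tn)      # equation (4.2)


# Hypothetical toy example: six genes, of which genes 0-2 are truly relevant.
coefs = [0.8, 0.02, 0.0, 0.3, 0.01, 0.0]
relevant = {0, 1, 2}

thresholded = apply_cutoff(coefs, 0.05)
tp, fn, fp, tn = confusion_counts(thresholded, relevant)
print(tp, fn, fp, tn, sensitivity(tp, fn), specificity(fp, tn))
```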

The ROC curve is based on simulated data with simulation parameters n = 300, ρ = 0.3, p = 1000 (genes) and k = 10. The relevant genes were therefore defined to be the genes with indices 1-20 and 101-120. optL1 with 10-fold CV was run on the data set, with λ decreased by 10, which resulted in a prediction model. To see more easily how well optL1 picks out the real relevant genes among the first 100, the coefficients of the genes with indices greater than 100 were ignored. This set of 100 coefficients formed the basis of the calculations in (4.1) and (4.2). The cutoff limits were set to 0, 0.025, 0.05, 0.075, 0.1, 0.125, 0.15 and 0.175, each giving a new predictive model, and sensitivity and specificity were calculated for every such model. The 10-fold CV and the cutoffs were repeated several times, and the sensitivities and specificities were averaged. This procedure was in turn applied to many data sets, and the averages over the data sets were calculated. These values are presented in the ROC curve.
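The procedure described above can be sketched as a sweep over the cutoff limits with averaging over runs. This is a simplified illustration: the two levels of averaging (repeated CV runs, then data sets) are collapsed into one flat list of runs, the cutoff is assumed to apply to absolute values, and the coefficient vectors are hypothetical toy data rather than optL1 output.

```python
CUTOFFS = [0, 0.025, 0.05, 0.075, 0.1, 0.125, 0.15, 0.175]


def roc_point(coefs, relevant, cutoff):
    """Return (FPR, TPR) after zeroing coefficients with |coef| < cutoff."""
    tp = fn = fp = tn = 0
    for i, c in enumerate(coefs):
        selected = c != 0 and abs(c) >= cutoff
        if i in relevant:
            tp += selected
            fn += not selected
        else:
            fp += selected
            tn += not selected
    return fp / (fp + tn), tp / (tp + fn)


def average_roc(runs, relevant, cutoffs=CUTOFFS):
    """Average (FPR, TPR) over all runs, separately for every cutoff."""
    points = []
    for cutoff in cutoffs:
        pts = [roc_point(coefs, relevant, cutoff) for coefs in runs]
        fpr = sum(p[0] for p in pts) / len(pts)
        tpr = sum(p[1] for p in pts) / len(pts)
        points.append((cutoff, fpr, tpr))
    return points


# Hypothetical toy data: two runs over five genes, genes 0 and 1 truly relevant.
runs = [
    [0.2, 0.03, 0.0, 0.06, 0.0],
    [0.12, 0.0, 0.04, 0.0, 0.0],
]
relevant = {0, 1}

points = average_roc(runs, relevant)
for cutoff, fpr, tpr in points:
    print(cutoff, round(fpr, 3), round(tpr, 3))
```

Each tuple corresponds to one circle in an ROC plot, with the averaged FPR as x-coordinate and the averaged TPR as y-coordinate.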

Figure 4.14: ROC of cutoff effect when λ has been decreased by 10. The x-axis shows 1 − specificity, which is equivalent to FPR, and the y-axis shows the sensitivity, which is equivalent to TPR. If a model is well fitted, TPR should be large and FPR small. The models are marked by circles in the plot, labelled with their cutoff values. Points above the diagonal line are well fitted.

The perfect model would contain only the real relevant genes and no irrelevant genes, i.e. the genes would be perfectly classified, which is represented by the point (0,1) in the ROC. If genes were picked at random, the corresponding points would lie on the diagonal line y = x. Thus points above the diagonal represent good predictions, and points below it represent bad predictions, even worse than random. The distance to the diagonal reflects the quality of the predictive model: the point (0,1) lies furthest above it and is perfectly classified. A model with all coefficients non-zero gives FN = 0 and TN = 0, and hence sensitivity 1 and specificity 0, i.e. the point (1,1). A model with only zero coefficients results in the point (0,0).
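The three reference points follow directly from the definitions of sensitivity and specificity as TP/(TP + FN) and TN/(FP + TN). A short derivation, assuming there is at least one relevant and one irrelevant gene so that no denominator is zero:

```latex
\begin{align*}
\text{All coefficients non-zero } (FN = 0,\ TN = 0):\quad
  & \text{sensitivity} = \frac{TP}{TP + 0} = 1, \quad
    \text{specificity} = \frac{0}{FP + 0} = 0
  && \Rightarrow\ (1 - 0,\ 1) = (1, 1), \\
\text{All coefficients zero } (TP = 0,\ FP = 0):\quad
  & \text{sensitivity} = \frac{0}{0 + FN} = 0, \quad
    \text{specificity} = \frac{TN}{0 + TN} = 1
  && \Rightarrow\ (1 - 1,\ 0) = (0, 0), \\
\text{Perfect classification } (FN = 0,\ FP = 0):\quad
  & \text{sensitivity} = \frac{TP}{TP + 0} = 1, \quad
    \text{specificity} = \frac{TN}{0 + TN} = 1
  && \Rightarrow\ (1 - 1,\ 1) = (0, 1).
\end{align*}
```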

The ROC shows that the models lie close to a random guess. Note that the sizes of the coefficients are not taken into account, so an irrelevant gene with a non-zero coefficient is weighted the same as a relevant gene with a non-zero coefficient. The hope was that removing genes with very small coefficients from the model (setting their coefficients to zero) would lead to better predictive models, but this does not seem to be the case.

Notice that even after reducing the number of genes in focus to 100, the sparsity of Lasso means that TN will always be large. This implies specificity ≈ 1, i.e. the points will have an x-coordinate close to 0, which is good. It also seems that FN will be large compared to TP, so the sensitivity, and hence the y-coordinate, will be close to zero. In light of this, it is not so strange that the points in Figure 4.14 lie close to (0,0).
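A small numerical illustration of this argument, using purely hypothetical counts that are not taken from the simulations: among the 100 genes in focus, 20 are relevant and 80 are irrelevant. Suppose a sparse model selects 2 relevant and 3 irrelevant genes.

```latex
\begin{align*}
TP &= 2, \quad FN = 18, \quad FP = 3, \quad TN = 77, \\
\text{sensitivity} &= \frac{2}{2 + 18} = 0.1, \qquad
\text{specificity} = \frac{77}{3 + 77} = 0.9625, \\
(1 - \text{specificity},\ \text{sensitivity}) &= (0.0375,\ 0.1).
\end{align*}
```

Under these assumed counts the point lies near (0,0) and only slightly above the diagonal, which matches the qualitative picture described above.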
