3.3 The Lasso Method
3.3.7 The Final Algorithm
In this subsection we present the final bolasso algorithm. The data matrix X, the response vector Y, the objective function and, in general, all the functions discussed previously are assumed to have been derived already, so that they can be used in the algorithm.
The algorithm is initialized by drawing B bootstrap samples from the data matrix X, that is, by sampling rows of the data matrix (together with the corresponding entries of Y) with replacement. The statistic that we want to estimate is the vector of coefficients of the objective function. Hence, for each bootstrap sample we need to compute the bootstrap replicate of that statistic, that is, the estimated coefficient vector for that sample.
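The resampling step can be sketched as follows. This is a minimal illustration, assuming NumPy arrays and that each row of X is resampled together with its entry of Y; the function name bootstrap_samples and the toy data are hypothetical and not part of the original algorithm.

```python
import numpy as np

def bootstrap_samples(X, y, B, seed=0):
    """Yield B bootstrap samples (X_b, y_b) by drawing rows with replacement."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    for _ in range(B):
        # n row indices drawn uniformly from 0..n-1; repeats are allowed
        idx = rng.integers(0, n, size=n)
        yield X[idx], y[idx]

# Toy data (made up for illustration): row i of X is [2i, 2i+1] and y[i] = i
X = np.arange(12, dtype=float).reshape(6, 2)
y = np.arange(6, dtype=float)
samples = list(bootstrap_samples(X, y, B=3))
```

Each sample has the same number of rows as X, and every resampled row keeps its own response value.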
To do that, we first need to find the optimal λ by which the likelihood will be penalized. We choose the λ that gives the minimum cross validation error among a sequence of λ values (the generation of the λ sequence will be discussed in the next section). For each λ in that sequence, we define the training and testing sets of the K folds. We use the training set with that λ in the cyclic coordinate algorithm to estimate the parameters of the objective function. Then we apply the model with the estimated parameters to the testing set. The prediction error for that fold is the norm of the difference between the fitted values and the corresponding entries of the response vector Y. We repeat the procedure on the other folds, with the same λ, and take the average of the errors. This is done for each λ in the sequence, and the one that gives the lowest average error is chosen as optimal.
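The selection of λ by K-fold cross validation can be sketched as below. The sketch makes several assumptions that are not in the text: an identity link (so the fitted values are Xβ), a fitter passed in as a function, and a closed-form ridge fit used as a stand-in for the cyclic coordinate algorithm purely to keep the example short. The names cv_select_lambda and ridge_fit, the λ values and the data are all hypothetical.

```python
import numpy as np

def cv_select_lambda(X, y, lambdas, fit, K=5, seed=0):
    """Return (best_lambda, mean_errors) from K-fold cross validation."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    folds = np.array_split(rng.permutation(n), K)  # K disjoint testing sets
    mean_errors = []
    for lam in lambdas:
        fold_errors = []
        for test_idx in folds:
            train = np.ones(n, dtype=bool)
            train[test_idx] = False
            beta = fit(X[train], y[train], lam)      # estimate on the training set
            fitted = X[test_idx] @ beta              # fitted values on the testing set
            fold_errors.append(np.linalg.norm(y[test_idx] - fitted))
        mean_errors.append(np.mean(fold_errors))     # average error over the folds
    mean_errors = np.array(mean_errors)
    return lambdas[int(np.argmin(mean_errors))], mean_errors

def ridge_fit(X, y, lam):
    # Stand-in fitter (an assumption, NOT the lasso): closed-form ridge solution
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Synthetic data, made up for illustration
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 4))
y = X @ np.array([2.0, 0.0, -1.0, 0.0]) + 0.1 * rng.normal(size=60)
lambdas = np.array([0.01, 1.0, 100.0, 10000.0])
best_lam, errs = cv_select_lambda(X, y, lambdas, ridge_fit)
```

Any function with the signature fit(X, y, lam) that returns a coefficient vector can replace the stand-in, including a cyclic coordinate fitter.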
Once we have the optimal λ, we estimate the coefficients via the cyclic coordinate algorithm. Note that this is also done in each cross validation step for each λ, but there we use the training sets and not the whole data matrix. Therefore, after finding the optimal λ, we have to re-estimate the coefficients using the whole data matrix X. First we initialize the coordinate algorithm by giving initial values for the coefficients; usually a vector of zeros is used. Then, for each coefficient βj, we estimate its value using the weights, the working response, the soft threshold and the optimal λ. This is done in cycles, each starting from β1 and finishing at βp. Each time a cycle is completed, we check whether the norm of the difference between the newly estimated vector and the one from the previous cycle (or the initial one) is smaller than a tolerance. If it is, we are done with the cyclic coordinate algorithm and with that bootstrap sample. If it is not, we run more cycles until the tolerance is reached.
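A minimal sketch of one run of the cyclic coordinate step is given below. It assumes the penalized weighted least squares objective (1/(2n)) Σ wi (zi − xiᵀβ)² + λ‖β‖₁ with no intercept, and it keeps the working response z and the weights w fixed, which corresponds to the Gaussian case (z = Y, w = 1). For a general link, z and w would be recomputed from the current β between cycles. The function names, the value of λ and the synthetic data are hypothetical.

```python
import numpy as np

def soft_threshold(a, lam):
    """Soft-threshold operator: sign(a) * max(|a| - lam, 0)."""
    return np.sign(a) * max(abs(a) - lam, 0.0)

def coordinate_descent(X, z, w, lam, tol=1e-6, max_cycles=1000):
    """Cyclic coordinate descent for a weighted lasso problem (z, w held fixed)."""
    n, p = X.shape
    beta = np.zeros(p)                      # usual initial value: a vector of zeros
    for _ in range(max_cycles):
        beta_old = beta.copy()
        for j in range(p):                  # one cycle: beta_1, ..., beta_p
            # Partial residual: working response minus the fit without coordinate j
            r_j = z - X @ beta + X[:, j] * beta[j]
            rho = np.sum(w * X[:, j] * r_j) / n
            denom = np.sum(w * X[:, j] ** 2) / n
            beta[j] = soft_threshold(rho, lam) / denom
        # Stop when the coefficient vector barely changed over a full cycle
        if np.linalg.norm(beta - beta_old) < tol:
            break
    return beta

# Synthetic Gaussian example, made up for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, 0.0, -2.0, 0.0, 0.0]) + 0.1 * rng.normal(size=100)
w = np.ones(100)
lam = 0.5
beta = coordinate_descent(X, y, w, lam)
```

In this example the soft threshold sets the coefficients of the three irrelevant columns exactly to zero, while the two active coefficients are shrunk towards zero.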
After all the bootstrap replicates have been computed, we use the AIC to find the optimal threshold. The output of the bootstrap should be a matrix of dimensions p × B. Each column of the matrix corresponds to one bootstrap replicate, that is, the estimated vector of coefficients from that bootstrap sample. Each row corresponds to the values that a specific coefficient took across the bootstrap samples. For each row, we count the number of zeros, and from those counts we create a frequency vector of length p. Its entries are the candidate frequency thresholds, among which the AIC chooses the optimal one. For each value in the frequency vector we create a model by checking which coefficients were set to zero fewer times than that threshold. For those coefficients, the mean over the bootstrap samples is taken as the estimate, while the others are simply set to zero. Then the AIC is computed by equation (3.9), where k is the number of non-zero estimates in the newly computed β vector. This is done for all thresholds, and the one that gives the lowest AIC is chosen as optimal. For that optimal threshold we again check which coefficients were set to zero fewer times than the threshold and take their means (for each coefficient separately) over the bootstrap values. The others are set to zero. This vector is the bolasso estimate of the coefficients of the objective function. The complete form of the algorithm is given in Algorithm 5.
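The threshold selection can be sketched as follows. Since equation (3.9) is not reproduced here, the sketch assumes the Gaussian form AIC = n log(RSS/n) + 2k with an identity link; this is an assumption and may differ from the form used in the text. The function name bolasso_aic_select, the p × B matrix of bootstrap coefficients and the data are made up for illustration.

```python
import numpy as np

def bolasso_aic_select(coefs, X, y):
    """Pick the zero-count threshold with the lowest AIC.

    coefs is the (p, B) matrix whose columns are the bootstrap replicates.
    Returns (aic, threshold, beta) for the best threshold.
    """
    n = X.shape[0]
    zero_counts = np.sum(coefs == 0, axis=1)   # zeros per coefficient (length p)
    row_means = coefs.mean(axis=1)             # mean of each coefficient over bootstraps
    best = None
    for t in np.unique(zero_counts):           # candidate thresholds
        keep = zero_counts < t                 # zero fewer times than the threshold
        beta = np.where(keep, row_means, 0.0)
        k = int(np.count_nonzero(beta))        # number of non-zero estimates
        rss = float(np.sum((y - X @ beta) ** 2))
        aic = n * np.log(rss / n) + 2 * k      # assumed Gaussian AIC, not equation (3.9)
        if best is None or aic < best[0]:
            best = (aic, int(t), beta)
    return best

# Hypothetical bootstrap output with p = 4 coefficients and B = 10 samples
coefs = np.zeros((4, 10))
coefs[0, :] = 2.0        # never zero
coefs[1, :2] = 0.5       # zero in 8 of 10 samples
coefs[2, :9] = -1.0      # zero in 1 of 10 samples
                         # row 3 is zero in all samples
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X @ np.array([2.0, 0.0, -1.0, 0.0]) + 0.1 * rng.normal(size=200)
aic, threshold, beta = bolasso_aic_select(coefs, X, y)
```

Because the comparison is strict, as in the description above, a coefficient whose zero count equals the threshold is excluded from that model.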
Data: Data matrix X, response vector Y, link function g, formulas for the working response zi and the iterative weights wi, the number K of folds for the cross validation, the number B of bootstrap samples.
Result: Estimated generalized linear regression coefficients using the bolasso method.
Initialization: Draw B bootstrap samples from the data matrix, with replacement.
for each bootstrap sample b do
    Compute λmax and set a value for λmin ≥ 0, or compute the λ sequence;
    Compute the optimal λoptimal via cross validation (using the proper cyclic coordinate algorithm on each fold, with or without warm starts);
    Cyclic coordinate step (also used in each cross validation):
    Initialization: Give initial values for β and compute the initial working responses z and weights w;
    while ∥βnew − βold∥ > 1e−06 (tol.) do
        for j ← 1 to length(β) do
            Update βj to the value that gives the greatest reduction/growth of the objective function along that coordinate;
            Update the vector of parameters β;
        end
        Update z and w with the new β;
    end
end
AIC step: Select the optimal model threshold using the AIC criterion, based on all the bootstraps;
Find the "allowed" number of times a coefficient could be zero among the bootstraps. Here, k in the AIC criterion is the number of non-zero coefficients for that model.