
2.9 Parameter tuning

The GGGL defined in (2.31) with penalties (2.32), (2.33) and (2.34) has three regularisation parameters: λ1, λ2 and µ. The first two control the number of selected groups and predictors, and µ controls the weight of the network structure imposed on the model. These parameters are traditionally tuned by cross-validation, in which the model fit is assessed by its prediction error. Cross-validation procedures can avoid overfitting by evaluating the out-of-sample prediction performance of the model, so the chosen parameters generally generalise well to new data. However, parameters tuned by this minimal prediction error criterion do not necessarily yield a model with the true sparsity pattern. There are two main reasons. First, the model tends to recruit noise variables that are moderately correlated with the true variables, reducing prediction error via model reparameterisation. Second, a fixed combination of model parameters often results in a different sparsity pattern in each fold of the cross-validation, and these patterns may also differ from the one obtained when the same parameters are applied to the full data. In fact, it has been observed that sparsity parameters optimised in this way often result in a model larger than the true one, in which many noise variables are selected [Leng et al., 2006, Birgé and Massart, 2007].

In this section we first review a variable ranking procedure called stability selection, which can rank the groups and individual predictors in a GGGL model according to their importance in predicting the response for a given µ. We then propose a new algorithm, inspired by stability selection, which can identify an optimal µ from a set of candidate values. The complete procedure for variable selection using GGGL therefore consists of two steps: first, searching for an optimal µ; second, employing stability selection with the optimal µ to obtain a group ranking and a predictor ranking from which important groups and predictors can be selected.

Algorithm 3 PCDM Step 5

Input: the group to be updated $R_I^{(k)}$, denoted by $R_I$ for short; data $X$, $y$ (superscript * removed for simplicity); parameters λ1, λ2; estimated coefficients before step k: $\beta^{(k)}$, denoted by $\beta$ for convenience; convergence tolerance $\epsilon$.

Output: column vector $\hat{\beta}^{(k+1)}_{R_I^{(k)}}$, denoted by $\hat{\beta}_I$ for convenience.

 1: if (2.62) holds then
 2:     $\hat{\beta}_I = (0, 0, \ldots, 0)^T$
 3: else
 4:     for $i \in R_I$ do
 5:         if (2.63) holds then
 6:             $\hat{\beta}_i = 0$
 7:         else
 8:             $\hat{\beta}_i = \hat{\gamma}_i$ as in (2.64).
 9:         end if
10:     end for
11:     if $\|\hat{\beta}_I - \beta_I\|_2 \le \epsilon$ then
12:         stop.
13:     else
14:         $\beta_I = \hat{\beta}_I$ and go back to 4.
15:     end if
16: end if
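The control flow of Algorithm 3 (PCDM Step 5) can be sketched in Python as follows. This is only an illustration of the loop structure: the callables `group_is_zero`, `coord_is_zero` and `coord_update` are hypothetical placeholders for condition (2.62), condition (2.63) and the update (2.64), which are not reproduced here, and the `max_sweeps` cap is a safeguard added for the sketch rather than part of the algorithm.

```python
import math
from typing import Callable, Dict, Sequence

Coefs = Dict[int, float]


def pcdm_group_update(
    beta: Coefs,
    group: Sequence[int],
    group_is_zero: Callable[[Coefs, Sequence[int]], bool],
    coord_is_zero: Callable[[int, Coefs, Coefs], bool],
    coord_update: Callable[[int, Coefs, Coefs], float],
    eps: float = 1e-8,
    max_sweeps: int = 10_000,
) -> Coefs:
    """Return updated coefficients for one group, mirroring Algorithm 3.

    The three callables are hypothetical stand-ins for (2.62), (2.63), (2.64).
    Each coordinate callable receives the index, the current estimate `beta`
    and the partially filled update `beta_hat`.
    """
    beta = dict(beta)  # work on a copy so the caller's estimate is untouched
    if group_is_zero(beta, group):  # line 1: condition (2.62)
        return {i: 0.0 for i in group}  # line 2: whole group set to zero
    for _ in range(max_sweeps):  # safety cap, an assumption of this sketch
        beta_hat: Coefs = {}
        for i in group:  # line 4
            if coord_is_zero(i, beta, beta_hat):  # line 5: condition (2.63)
                beta_hat[i] = 0.0
            else:
                beta_hat[i] = coord_update(i, beta, beta_hat)  # line 8: (2.64)
        # line 11: Euclidean norm of the change within the group
        change = math.sqrt(sum((beta_hat[i] - beta[i]) ** 2 for i in group))
        if change <= eps:
            return beta_hat  # line 12: converged
        beta.update(beta_hat)  # line 14: accept the sweep, repeat from line 4
    return {i: beta[i] for i in group}
```

Passing both `beta` and `beta_hat` to the coordinate callables leaves open whether (2.63) and (2.64) use the previous sweep or the coordinates already updated in the current one, since the pseudocode alone does not settle this.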

2.9.1 Stability selection

For a fixed µ in GGGL, a data resampling procedure called stability selection [Meinshausen and Bühlmann, 2010] can be adopted to rank the groups and predictors according to their importance in predicting the response. Stability selection consists of fitting the sparse (regression) model to a large number of subsamples of the data using pre-determined sparsity parameters, where each subsample comprises half of the subjects, drawn independently without replacement from the pool of all subjects. Variable selection results across all subsamples are combined to compute an empirical selection probability for each variable (and, in GGGL, for each group). The sets of important variables (and groups) are then obtained by thresholding the selection probabilities in the ranking. As shown by Meinshausen and Bühlmann [2010], the top-ranking variables are insensitive to the particular choice of the sparsity parameters. Given µ, the stability selection procedure for GGGL is as follows:

1. Randomly extract half of the subjects from the pool of all subjects, and denote the predictor matrix and response vector of the extracted subjects by $X_s$ and $y_s$, respectively.

2. Fit GGGL on the predictor matrix $X_s$ and response vector $y_s$ for a fixed µ, where λ1 and λ2 are either prescribed or chosen such that a prescribed number of groups and variables is selected.

3. Record the groups and variables which are selected in step 2.

4. Repeat steps 1 to 3 N times, where N is typically at least 1000.


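5. Compute the empirical selection probability of each group and each variable from the N recorded selections, and rank the groups and variables in the corresponding lists according to the selection probabilities.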
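The subsampling procedure above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the author's implementation: `fit_and_select` is a hypothetical callable standing in for a GGGL fit with fixed µ, λ1 and λ2 on the given subject indices, returning the selected groups and variables, and the selection probability is taken to be the fraction of subsamples in which an item is selected.

```python
import random
from collections import Counter
from typing import Callable, Hashable, Iterable, List, Sequence, Tuple

Ranking = List[Tuple[Hashable, float]]
Selector = Callable[[Sequence[int]], Tuple[Iterable[Hashable], Iterable[Hashable]]]


def stability_selection(
    n_subjects: int,
    fit_and_select: Selector,
    n_subsamples: int = 1000,
    seed: int = 0,
) -> Tuple[Ranking, Ranking]:
    """Rank groups and variables by empirical selection probability.

    `fit_and_select(subject_indices)` is a hypothetical stand-in for fitting
    the sparse model on a subsample; it returns (groups, variables).
    Items that are never selected are omitted (their probability is zero).
    """
    rng = random.Random(seed)
    group_counts: Counter = Counter()
    var_counts: Counter = Counter()
    half = n_subjects // 2
    for _ in range(n_subsamples):  # step 4: repeat N times
        # Step 1: half of the subjects, sampled without replacement.
        subsample = rng.sample(range(n_subjects), half)
        # Steps 2-3: fit on the subsample and record what was selected.
        groups, variables = fit_and_select(subsample)
        group_counts.update(set(groups))
        var_counts.update(set(variables))

    def rank(counts: Counter) -> Ranking:
        # Step 5: probability = fraction of subsamples selecting the item.
        probs = [(item, c / n_subsamples) for item, c in counts.items()]
        return sorted(probs, key=lambda kv: (-kv[1], str(kv[0])))

    return rank(group_counts), rank(var_counts)
```

Important groups and variables can then be read off by thresholding the returned probabilities. The threshold itself is left to the caller, as the text does not fix a value.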

2.9.2 Hybrid cross-validation with subsampling

The stability selection procedure described in the previous subsection assumes a given graph-regularisation parameter µ in GGGL. In practice, this value is unknown and must be chosen carefully to avoid either over-relying on the prior knowledge or paying insufficient attention to it. We define µ as optimal if the set of important variables selected with this particular µ gives rise to the lowest out-of-sample prediction error, where prediction is made using the important predictors only. Assuming a set of candidate values of µ is available, which we denote by Θ, we propose a hybrid cross-validation algorithm to identify the optimal µ. The core idea is to fit a sparse model for each µ in Θ using the training data, then learn the non-sparse model coefficients using only the selected predictors from the training data, and finally compare the prediction errors corresponding to the various µ on the testing data. To increase the stability of the selected variables and the robustness of prediction performance, we embed the subsampling procedure introduced in the previous subsection, such that the “variable selection - model fitting - prediction” procedure is carried out M times for each µ in each fold of the cross-validation algorithm.
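The optimality criterion can also be written compactly for a single train/test split. The notation here is introduced only for illustration and is not taken from the source: $\hat{S}(\mu)$ is the set of important predictors selected with $\mu$, $\tilde{\beta}(\mu)$ denotes the coefficients refitted on $\hat{S}(\mu)$ without the sparsity penalties, and $\mathrm{PE}_{\text{test}}$ is the prediction error on the held-out data.

```latex
\mu^{\ast} = \operatorname*{arg\,min}_{\mu \in \Theta} \;
  \mathrm{PE}_{\text{test}}\bigl(\tilde{\beta}(\mu)\bigr),
\qquad
\tilde{\beta}_j(\mu) = 0 \quad \text{for } j \notin \hat{S}(\mu).
```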

Specifically, we perform a 10-fold cross-validation. For each fold, we denote the training dataset by $D_{\text{train}}$, which comprises 90% of the subjects, and the testing dataset by $D_{\text{test}}$, which comprises the remaining 10%, and proceed as follows:

1. Randomly extract half of the subjects from $D_{\text{train}}$ and fit the sparse model for each µ ∈ Θ, where the sparsity parameters are either prescribed or chosen such that a prescribed number of variables is selected. Retain the set of selected variables/groups for each µ.

2. For each µ ∈ Θ, fit the same model on the training set using only the variables selected in step 1, but without the sparsity-inducing penalties. Note that the graph penalty weighted by µ is still included in the objective function. Retain the estimated coefficients for each µ.

3. For each µ ∈ Θ, use the coefficients estimated in step 2 to compute the prediction error on $D_{\text{test}}$. Retain the prediction error for each µ and choose the value of µ that results in the lowest prediction error.

4. Repeat steps 1 to 3 M times and record the number of times each µ ∈ Θ is chosen as the optimal value.

This procedure is repeated for all ten folds, keeping track of the number of times each µ is deemed optimal across the 10 × M subsamples. The candidate with the highest count is taken as the optimal value in Θ.
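The voting scheme described above can be sketched in Python as follows. The sketch makes several assumptions that should be read as such: `select`, `refit` and `test_error` are hypothetical callables standing in for the sparse GGGL fit, the refit without sparsity penalties (graph penalty retained), and the prediction error on the test fold; the refit in step 2 is assumed to use all of the training fold; and ties are broken in favour of the candidate listed first.

```python
import random
from collections import Counter
from typing import Any, Callable, Hashable, List, Sequence, Tuple


def hybrid_cv(
    n_subjects: int,
    candidates: Sequence[Hashable],
    select: Callable[[Hashable, List[int]], Any],
    refit: Callable[[Hashable, Any, List[int]], Any],
    test_error: Callable[[Any, List[int]], float],
    n_folds: int = 10,
    n_repeats: int = 100,
    seed: int = 0,
) -> Tuple[Hashable, Counter]:
    """Return the most frequently optimal mu and the vote counts.

    All three callables are hypothetical placeholders for model fitting
    and evaluation; only the resampling and voting logic is shown.
    """
    rng = random.Random(seed)
    subjects = list(range(n_subjects))
    rng.shuffle(subjects)
    folds = [subjects[i::n_folds] for i in range(n_folds)]
    votes: Counter = Counter({mu: 0 for mu in candidates})
    for k, test in enumerate(folds):
        train = [s for j, fold in enumerate(folds) if j != k for s in fold]
        for _ in range(n_repeats):  # step 4: repeat M times per fold
            # Step 1: half of the training subjects, without replacement.
            half = rng.sample(train, len(train) // 2)
            errors = {}
            for mu in candidates:
                selected = select(mu, half)  # step 1: sparse fit
                # Step 2: refit on the training fold (assumed: all of it).
                model = refit(mu, selected, train)
                errors[mu] = test_error(model, test)  # step 3
            # Step 3: the candidate with the lowest test error gets a vote.
            votes[min(errors, key=errors.get)] += 1
    best = votes.most_common(1)[0][0]
    return best, votes
```

With ten folds and M repeats, the vote counts sum to 10 × M, matching the count described in the text.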