Cross Validation - Testing and model selection for prediction in large sets of variables

7.3 Cross Validation

Note again that in the case of correlation between hypotheses we propose the unknown variance assumption standardizing the weights in the score by respective variance esti- mates. If a large number of variables are included in the score and a small sample size is applied, ˆΣk can not always be inverted. Choosing the simple additive score with stan-

dardized weights leads to only slightly poorer performances in terms of the AUC than considering the whole estimated covariance matrix (if possible). This is a reason why the simple additive score has attracted a lot of attention in applications.

In the situation of correlated variables, generally the same tendencies for the prognostic scores determined by the cross validation procedure can be found as in the situation of independence between hypotheses. Because of the at optimum of the interpolated functions for large positive ρ (see Figures 7.2, 7.3) again there may be large dierences between ˆαopt determined from the cross validation procedure and αopt. However, again the

determined ˆγopt and corresponding ˆαopt values are leading to a mean future performance

in terms of AUC(ˆγopt)which is close to AUC(αopt)(see Table 7.2 for the cross validation

results of the investigated examples).

Despite the underlying correlation structure the cross validation procedure (only considering the estimated variances in the score) seem to work well ending in larger cross validation based AU C(ˆd γ_opt) values and small ˆα_opt values if the alternative holds and in

small cross validation basedAU C(ˆd γ_opt)values and large ˆα_opt values if the global null hy-

pothesis is true (see histograms of AU C(ˆd γ_opt) in Figure 7.5 and of ˆα_opt in Figure 7.6).

Assuming ρ = 0.6 and 0.9 the eect sizes have to be very large to achieve the benchmark AU C∗. Thus the cross validation procedure ends in AU C(ˆd γ_opt) values larger than 0.9

(Figure 7.5 rst and third row). For ρ = −0.6 the eect size ∆ is very small in order to achieve AUC∗. Fortunately, values of AU C(ˆd γ_opt) are also small under the alternative

indicating that no good prediction score can be determined with the given eect and sample sizes (Figure 7.5 second row).

7 Situation of correlation between variables

large eect sizes. For example, for ρ = 0.9, ˆαopt on average is 0.003 assuming me = 10

prognostic variables. For ρ = −0.6 and assuming me = 10 an average ˆαopt = 0.588 is

indicating that the resulting score includes more than 50% non-prognostic variables. For me = 60more than 80% non-prognostic variables can be expected in the prediction score.

Table 7.2 summarizes the results under the alternative.

Table 7.3 shows the results of the cross validation procedure under the global null hypoth- esis. Under the global null, AU C(ˆd γ_opt) is generally smaller than 0.7 and ˆα_opt is generally

larger than 0.9 (see medians in table Table 7.3). This indicates that we can expect that more than 90% variables without prognostic ability are included in the score and that the performance of the selected score is very poor. In this situation one may conclude that no prediction score should be selected from the given data.

The cross validation procedure seems to work also in the situation of correlated hypotheses ending atAU C(ˆd γ_opt)and ˆα_opt values giving a good evaluation of the underlying prediction

score. Again the conclusion is that both criteria, AU C(ˆd γ_opt) and ˆα_opt should be consid-

ered to decide whether a score should be constructed from a given data set or not. Figure 7.7 shows scatterplots of AU C(ˆd γ_opt) vs. ˆα_opt for all investigated examples. However, one

contradiction remains, the dierences between the estimated ˆαopt and the true FDR are

slightly larger under the correlated case than under the independent case (see Tables 7.2 and 7.3).

7.3 Cross Validation

Table 7.2: Cross Validation in the correlated case: The true best choice of the FDR, αopt, and the corresponding AUC(αopt)as well as results determined from the cross validation

procedure: the selection boundary ˆγoptand the corresponding ˆαopt, the true FDR and ˆαopt,∞,

the true AUC(ˆγopt), the cross validatedM D(ˆd γopt)and the FWER for a varying number of

prognostic variables me. The number of tested variables is set to m = 1000. The per-group

sample size is xed to n = 50. ρ = 0.6, −0.6 and 0.9. AUC∗= 0.965.

me 10 60 ρ 0.6 -0.6 0.9 0.6 -0.6 0.9 ∆ 1.422 0.421 2.111 0.646 0.166 1.265 αopt 0.024 0.793 0.001 0.225 0.908 0.025 AU C(αopt) 0.961 0.621 0.960 0.950 0.588 0.958 ˆ γopt 0.007(0.01) 0.013(0.01) 0.00004(0.0001) 0.030(0.03) 0.292(0.27) 0.0096(0.01) 0.002 0.007 0.0000005 0.019 0.167 0.0018 ˆ αopt 0.220(0.26) 0.588(0.25) 0.003(0.009) 0.281(0.17) 0.888(0.07) 0.103(0.14) 0.128 0.634 0.0000505 0.254 0.891 0.024 FDR 0.243(0.26) 0.579(0.31) 0.021(0.07) 0.278(0.19) 0.857(0.06) 0.110(0.16) 0.167 0.692 0.000 0.250 0.861 0.016 ˆ αopt,∞ 0.226(0.26) 0.607(0.21) 0.004(0.01) 0.283(0.17) 0.858(0.05) 0.105(0.14) 0.131 0.658 0.0000495 0.255 0.854 0.027 AU C(ˆγopt) 0.957(0.01) 0.618(0.11) 0.960(0.001) 0.947(0.01) 0.584(0.01) 0.957(0.002) 0.959 0.624 0.960 0.948 0.586 0.957 d AU C(ˆγopt) 0.964(0.02) 0.744(0.07) 0.960(0.01) 0.957(0.02) 0.661(0.07) 0.960(0.02) 0.966 0.751 0.960 0.959 0.667 0.960 FWER 0.648 0.826 0.110 0.986 0.998 0.542

Table 7.3: Cross Validation in the correlated case under the global null: Results determined from the cross validation procedure: the selection boundary ˆγoptand the corre-

sponding ˆαopt, the true FDR using the determined selection threshold, ˆαopt,∞(always 1) the

true AUC(ˆγopt) (always 0.5), the cross validatedAU C(ˆd γopt) and the FWER for a varying

number of parameters for the autoregressive correlation ρ. The number of tested variables is set to m = 1000. The per-group sample size is xed to n = 50. AUC∗= 0.965.

ρ 0.6 -0.6 0.6 ˆ γopt 0.034(0.06) 0.044(0.08) 0.040(0.08) 0.006 0.009 0.007 ˆ αopt 0.812(0.22) 0.841(0.20) 0.770(0.27) 0.916 0.935 0.904 FDR 0.972(0.17) 0.964(0.19) 0.886(0.32) 1.000 1.000 1.000 ˆ αopt,∞ 1.000 1.000 1.000 AU C(ˆγopt) 0.500 0.500 0.500 d AU C(ˆγopt) 0.650(0.07) 0.662(0.07) 0.604(0.07) 0.659 0.670 0.607 FWER 0.972 0.964 0.886

7 Situation of correlation between variables

Figure 7.5:Correlated hypotheses: Distribution of the cross validation basedAU C(ˆd γopt)(500 sim-

ulation runs) assuming correlation between variables: me = 10(rst column), 60 (second

column) or 0 (third column) are assumed among m = 1000 hypotheses. ρ = 0.6 (rst row) −0.6(second row) and 0.9 (third row) is assumed. The sample size is set to n = 50.

7.3 Cross Validation

Figure 7.6:Correlated hypotheses: Distribution of the cross validation based ˆαopt (500 simulation

runs) assuming correlation between variables: me= 10(rst column), 60 (second column)

or 0 (third column) are assumed among m = 1000 hypotheses. ρ = 0.6 (rst row) −0.6 (second row) and 0.9 (third row) is assumed. The sample size is set to n = 50.

7 Situation of correlation between variables

Figure 7.7:Correlated hypotheses: Scatterplots of AU C(ˆd γopt) vs. ˆαopt (500 simulation runs) as-

suming correlation between variables: me = 10 (rst column), 60 (second column) or 0

(third column) are assumed among m = 1000 hypotheses. ρ = 0.6 (rst row) −0.6 (second row) and 0.9 (third row) is assumed. The sample size is set to n = 50.

8 Real data applications

To evaluate the cross validation procedure we used three data sets summarized and pre- processed by Pavlidis et al. (2003) and a data set investigated in Tian et al. (2003). Note that because of the large number of investigated genes in the four investigated data sets (> 6000) it was not possible to apply the forward logistic regression in SAS 9.1. due to lack of memory space.

In document Testing and model selection for prediction in large sets of variables (Page 108-114)