7.3 Cross Validation
Note again that in the case of correlation between hypotheses we propose the unknown variance assumption standardizing the weights in the score by respective variance esti- mates. If a large number of variables are included in the score and a small sample size is applied, ˆΣk can not always be inverted. Choosing the simple additive score with stan-
dardized weights leads to only slightly poorer performances in terms of the AUC than considering the whole estimated covariance matrix (if possible). This is a reason why the simple additive score has attracted a lot of attention in applications.
In the situation of correlated variables, generally the same tendencies for the prognos- tic scores determined by the cross validation procedure can be found as in the situation of independence between hypotheses. Because of the at optimum of the interpolated functions for large positive ρ (see Figures 7.2, 7.3) again there may be large dierences between ˆαopt determined from the cross validation procedure and αopt. However, again the
determined ˆγopt and corresponding ˆαopt values are leading to a mean future performance
in terms of AUC(ˆγopt)which is close to AUC(αopt)(see Table 7.2 for the cross validation
results of the investigated examples).
Despite the underlying correlation structure the cross validation procedure (only con- sidering the estimated variances in the score) seem to work well ending in larger cross validation based AU C(ˆd γopt) values and small ˆαopt values if the alternative holds and in
small cross validation basedAU C(ˆd γopt)values and large ˆαopt values if the global null hy-
pothesis is true (see histograms of AU C(ˆd γopt) in Figure 7.5 and of ˆαopt in Figure 7.6).
Assuming ρ = 0.6 and 0.9 the eect sizes have to be very large to achieve the benchmark AU C∗. Thus the cross validation procedure ends in AU C(ˆd γopt) values larger than 0.9
(Figure 7.5 rst and third row). For ρ = −0.6 the eect size ∆ is very small in order to achieve AUC∗. Fortunately, values of AU C(ˆd γopt) are also small under the alternative
indicating that no good prediction score can be determined with the given eect and sample sizes (Figure 7.5 second row).
ˆ
7 Situation of correlation between variables
large eect sizes. For example, for ρ = 0.9, ˆαopt on average is 0.003 assuming me = 10
prognostic variables. For ρ = −0.6 and assuming me = 10 an average ˆαopt = 0.588 is
indicating that the resulting score includes more than 50% non-prognostic variables. For me = 60more than 80% non-prognostic variables can be expected in the prediction score.
Table 7.2 summarizes the results under the alternative.
Table 7.3 shows the results of the cross validation procedure under the global null hypoth- esis. Under the global null, AU C(ˆd γopt) is generally smaller than 0.7 and ˆαopt is generally
larger than 0.9 (see medians in table Table 7.3). This indicates that we can expect that more than 90% variables without prognostic ability are included in the score and that the performance of the selected score is very poor. In this situation one may conclude that no prediction score should be selected from the given data.
The cross validation procedure seems to work also in the situation of correlated hypotheses ending atAU C(ˆd γopt)and ˆαopt values giving a good evaluation of the underlying prediction
score. Again the conclusion is that both criteria, AU C(ˆd γopt) and ˆαopt should be consid-
ered to decide whether a score should be constructed from a given data set or not. Figure 7.7 shows scatterplots of AU C(ˆd γopt) vs. ˆαopt for all investigated examples. However, one
contradiction remains, the dierences between the estimated ˆαopt and the true FDR are
slightly larger under the correlated case than under the independent case (see Tables 7.2 and 7.3).
7.3 Cross Validation
Table 7.2: Cross Validation in the correlated case: The true best choice of the FDR, αopt, and the corresponding AUC(αopt)as well as results determined from the cross validation
procedure: the selection boundary ˆγoptand the corresponding ˆαopt, the true FDR and ˆαopt,∞,
the true AUC(ˆγopt), the cross validatedM D(ˆd γopt)and the FWER for a varying number of
prognostic variables me. The number of tested variables is set to m = 1000. The per-group
sample size is xed to n = 50. ρ = 0.6, −0.6 and 0.9. AUC∗= 0.965.
me 10 60 ρ 0.6 -0.6 0.9 0.6 -0.6 0.9 ∆ 1.422 0.421 2.111 0.646 0.166 1.265 αopt 0.024 0.793 0.001 0.225 0.908 0.025 AU C(αopt) 0.961 0.621 0.960 0.950 0.588 0.958 ˆ γopt 0.007(0.01) 0.013(0.01) 0.00004(0.0001) 0.030(0.03) 0.292(0.27) 0.0096(0.01) 0.002 0.007 0.0000005 0.019 0.167 0.0018 ˆ αopt 0.220(0.26) 0.588(0.25) 0.003(0.009) 0.281(0.17) 0.888(0.07) 0.103(0.14) 0.128 0.634 0.0000505 0.254 0.891 0.024 FDR 0.243(0.26) 0.579(0.31) 0.021(0.07) 0.278(0.19) 0.857(0.06) 0.110(0.16) 0.167 0.692 0.000 0.250 0.861 0.016 ˆ αopt,∞ 0.226(0.26) 0.607(0.21) 0.004(0.01) 0.283(0.17) 0.858(0.05) 0.105(0.14) 0.131 0.658 0.0000495 0.255 0.854 0.027 AU C(ˆγopt) 0.957(0.01) 0.618(0.11) 0.960(0.001) 0.947(0.01) 0.584(0.01) 0.957(0.002) 0.959 0.624 0.960 0.948 0.586 0.957 d AU C(ˆγopt) 0.964(0.02) 0.744(0.07) 0.960(0.01) 0.957(0.02) 0.661(0.07) 0.960(0.02) 0.966 0.751 0.960 0.959 0.667 0.960 FWER 0.648 0.826 0.110 0.986 0.998 0.542
Table 7.3: Cross Validation in the correlated case under the global null: Results determined from the cross validation procedure: the selection boundary ˆγoptand the corre-
sponding ˆαopt, the true FDR using the determined selection threshold, ˆαopt,∞(always 1) the
true AUC(ˆγopt) (always 0.5), the cross validatedAU C(ˆd γopt) and the FWER for a varying
number of parameters for the autoregressive correlation ρ. The number of tested variables is set to m = 1000. The per-group sample size is xed to n = 50. AUC∗= 0.965.
ρ 0.6 -0.6 0.6 ˆ γopt 0.034(0.06) 0.044(0.08) 0.040(0.08) 0.006 0.009 0.007 ˆ αopt 0.812(0.22) 0.841(0.20) 0.770(0.27) 0.916 0.935 0.904 FDR 0.972(0.17) 0.964(0.19) 0.886(0.32) 1.000 1.000 1.000 ˆ αopt,∞ 1.000 1.000 1.000 AU C(ˆγopt) 0.500 0.500 0.500 d AU C(ˆγopt) 0.650(0.07) 0.662(0.07) 0.604(0.07) 0.659 0.670 0.607 FWER 0.972 0.964 0.886
7 Situation of correlation between variables
Figure 7.5:Correlated hypotheses: Distribution of the cross validation basedAU C(ˆd γopt)(500 sim-
ulation runs) assuming correlation between variables: me = 10(rst column), 60 (second
column) or 0 (third column) are assumed among m = 1000 hypotheses. ρ = 0.6 (rst row) −0.6(second row) and 0.9 (third row) is assumed. The sample size is set to n = 50.
7.3 Cross Validation
Figure 7.6:Correlated hypotheses: Distribution of the cross validation based ˆαopt (500 simulation
runs) assuming correlation between variables: me= 10(rst column), 60 (second column)
or 0 (third column) are assumed among m = 1000 hypotheses. ρ = 0.6 (rst row) −0.6 (second row) and 0.9 (third row) is assumed. The sample size is set to n = 50.
7 Situation of correlation between variables
Figure 7.7:Correlated hypotheses: Scatterplots of AU C(ˆd γopt) vs. ˆαopt (500 simulation runs) as-
suming correlation between variables: me = 10 (rst column), 60 (second column) or 0
(third column) are assumed among m = 1000 hypotheses. ρ = 0.6 (rst row) −0.6 (second row) and 0.9 (third row) is assumed. The sample size is set to n = 50.
8 Real data applications
To evaluate the cross validation procedure we used three data sets summarized and pre- processed by Pavlidis et al. (2003) and a data set investigated in Tian et al. (2003). Note that because of the large number of investigated genes in the four investigated data sets (> 6000) it was not possible to apply the forward logistic regression in SAS 9.1. due to lack of memory space.