Comparison with MC Logic Regression - Statistical analysis of genotype and gene expression data

FIGURE 8.5. Left panel: Application of RFE-SVM to the reduced HapMap data set, where in each step of RFE-SVM the 10% least important variables are shaved off. Right panel: The ten most important SNPs found in an application of Random Forests to the HapMap data set using 5,000 trees and 12 randomly selected variables at each node.

to very consistent results. Only the first two SNPs shown in the right panel of Figure 8.5 are consistently detected among the top 15 SNPs (but only seldom as the two most important variables).

8.5 Comparison with MC Logic Regression

To compare logicFS with MC logic regression, the latter is applied to the simulated data sets considered in Section 8.4.1, using the same parameter settings of logic regression as in Section 8.4.1 and 500,000 iterations of the MCMC algorithm. The last 400,000 models are kept in memory to compute the importance measures. For each variable and for each pair and triplet of variables, the output of MC logic regression provides the number of models containing this variable or set of variables as a measure of importance. This measure ignores both whether the variable itself or its complement is in the model and whether the variables are combined by ∧ or by ∨. Since specific conjunctions are not considered by this importance measure, we additionally compute the value of VIMAdhoc for each of the prime implicants obtained by converting


[...] Algorithm 8.1. (We do not, however, calculate VIMSingle and VIMMultiple: on the one hand, all models are built during the same run of MC logic regression on the whole data set, and not in different applications to different subsets of this data set; on the other hand, these measures would be determined on the same test data in each of the iterations.)

A drawback of the measure used in MC logic regression is that interactions of different orders have to be considered separately: each subset of the variables contained in a set of interest occurs in at least as many models as the set itself, so each subset appears at least as important as the set of interest. By contrast, both VIMSingle and VIMMultiple enable the comparison of interactions of different orders.
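The counting measure and its drawback can be illustrated with a minimal sketch. It assumes that each visited model has already been reduced to the set of variables it contains, ignoring complements and the operators ∧ and ∨. The function name `inclusion_proportions`, the variable names `X1` to `X4`, and the toy models are hypothetical; none of them are taken from the MC logic regression software or from the simulations.

```python
from collections import Counter
from itertools import combinations


def inclusion_proportions(models, max_order=3):
    """Fraction of models containing each variable set of size 1..max_order.

    Each model is given as a collection of variable names; complements and
    logic operators are assumed to have been discarded beforehand.
    """
    counts = Counter()
    for model in models:
        variables = sorted(set(model))
        for order in range(1, max_order + 1):
            # Every subset of the model's variables is "contained" in the model.
            for subset in combinations(variables, order):
                counts[subset] += 1
    n_models = len(models)
    return {subset: count / n_models for subset, count in counts.items()}


# Hypothetical toy run: four visited models, reduced to their variable sets.
visited = [
    {"X1", "X2", "X3"},
    {"X1", "X2"},
    {"X1", "X4"},
    {"X1", "X2", "X3"},
]
props = inclusion_proportions(visited)

# A subset can never be rarer than the set it is part of.
for subset, proportion in props.items():
    for size in range(1, len(subset)):
        for smaller in combinations(subset, size):
            assert props[smaller] >= proportion
```

In this toy run the single variable `X1` occurs in every model, the pair `X1, X2` in three of four, and the triplet `X1, X2, X3` in two of four. The proportion therefore always ranks lower-order sets at least as high as the higher-order sets containing them, which is why interactions of different orders cannot be compared on this scale.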

FIGURE 8.6. Fraction of models (marked by solid dots) containing particular sets of variables, and VIMAdhoc (marked by crosses) for specific interactions computed from the models visited during ten applications of both the single (left panel) and the multiple (right panel) tree approach of MC logic regression to the data set of Simulation 1.

In Figure 8.6, the results of ten applications of MC logic regression to the data set of Simulation 1 are displayed. This figure reveals two problems of this procedure. If the set of interacting SNPs is detected in the single tree case, then it is typically contained in virtually all of the models and, in almost every case, as the intended interaction. However, even the single variable S12 is not found in any of the applications, and the triplet S41, S51, S61 is identified in only 50% of the analyses. In the multiple tree case, the sets of interacting SNPs are found in almost all of the applications, but the SNP interactions explaining the cases are rarely detected. For example, S12 is in virtually all of the models, but mostly in interaction with another variable. Moreover, not only the pairs S21, S32 and S72, S82 but also, e.g., S32, S72 or S21, S82 frequently appear jointly in more than 99% of the models and would therefore be considered similarly important if the proportion were used as importance measure. By contrast, none of the two-way interactions composed of either the pair S32, S72 or the pair S21, S82 exhibits a large value of VIMSingle or VIMMultiple (cf. Section 8.4.1, in particular Figure 8.3).
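A short bound shows why cross pairs inherit the high proportions of the true pairs. The notation is introduced here for illustration only: $p(V)$ denotes the fraction of visited models containing all variables in the set $V$, and $A$ and $B$ stand for two sets of interacting variables. The value 0.995 is a hypothetical number, not a result from the simulations.

```latex
% Models containing both A and B (inclusion-exclusion bound):
p(A \cup B) \;\ge\; p(A) + p(B) - 1 .

% Any cross pair {a, b} with a in A and b in B is contained in every
% model that contains A \cup B, hence
p(\{a, b\}) \;\ge\; p(A \cup B) \;\ge\; p(A) + p(B) - 1 .

% Illustrative numbers: p(A) = p(B) = 0.995 gives
p(\{a, b\}) \;\ge\; 0.995 + 0.995 - 1 = 0.99 .
```

Under these assumptions, whenever two genuine pairs each occur in almost all models, every pair formed by one variable from each of them does so as well, although no such cross interaction has to be present in the data.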

The applications of MC logic regression to the 50 data sets of Simulation 2 lead to similar results: S61 ∧ S71^C is always identified by the single tree approach, but in only about 40% of the applications of the multiple tree approach, whereas S31^C ∧ S91^C ∧ S10,1^C is found in 90% of the single tree applications and in about 60% of the analyses with multiple trees. By contrast, logicFS always identifies both S61 ∧ S71^C and S31^C ∧ S91^C ∧ S10,1^C. (These results differ slightly from the results presented in Schwender and Ickstadt, 2007. A possible reason is that in Schwender and Ickstadt, 2007, we split each of the 50 data sets into 63.2% training and 36.8% test data such that the distribution of the explained cases and controls remains unchanged in the split data sets. Because of this disagreement, we have repeated the above analysis a few times, and each of these analyses has led to similar results.)
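A split of this kind can be sketched as follows. This is only an illustration of a stratified 63.2%/36.8% split, not the code used in the cited study: the function `stratified_split`, the group labels, and the group sizes are all assumptions made for the example.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction=0.632, seed=0):
    """Split indices into training and test sets separately within each group,
    so that the group proportions are (up to rounding) the same in both parts."""
    rng = random.Random(seed)
    by_group = defaultdict(list)
    for index, label in enumerate(labels):
        by_group[label].append(index)

    train, test = [], []
    for label in sorted(by_group):
        indices = by_group[label]
        rng.shuffle(indices)
        n_train = round(train_fraction * len(indices))
        train.extend(indices[:n_train])
        test.extend(indices[n_train:])
    return sorted(train), sorted(test)


# Hypothetical group sizes; the real data sets may be composed differently.
labels = ["explained_case"] * 250 + ["other_case"] * 250 + ["control"] * 500
train_idx, test_idx = stratified_split(labels)
```

Splitting within each group, rather than sampling from the pooled observations, is what keeps the share of each group constant across the training and test parts.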

These two simulations show the advantage of logicFS over MC logic regression: In MC logic regression, SNP interactions are identified by applying a search algorithm once to the whole data set. If an interaction explanatory for the case-control status is detected once, i.e., is in one of the models visited during this search, then it is very likely to be identified as important. However, if the variables composing this interaction do not jointly occur in any of these
