4.5 Empirical Evaluation
4.5.3 Procedure
We ensure comparability of the application of several learning systems and of their results by a number of measures.
1. We executed all experiments on the same machine so that the same conditions applied, e. g. with respect to main memory, processor speed, and operating system specifics. More specifically, we used a workstation of type Sun-Blade-1000 with 1 GB of main memory and an UltraSparc-III processor running at 750 MHz.
For aspects of the software used, e. g. the version of the operating system, cf. Appendix A.
2. The same point of departure was used for all experiments, viz. our preparations of the data as MySQL databases. In a number of cases, this meant that all learners used a reduced variant of the original data, in which aspects without relevance for the learning examples or the learning task were left out. This was especially important for ILP systems that were not able to use MySQL directly but had to load all given data into main memory.
For more information about the reductions, cf. Appendix B.
3. The input formats for the individual learners were produced following the conventions for the systems as stated in their documentation or as used in earlier research, preferably by the authors of the systems themselves. Still, we took great care to use largely equivalent representations of the data across learning systems. This also applies to the definitions of declarative bias.
4. We applied all learning systems in their respective default schemes. That means that default settings of the parameters were used, if their application was reasonable. This also concerns the declarative biases used.
5. We also tried other preparations of the data and other settings for learning in order to gain a more complete picture of the opportunities of the learning systems.
In summary, all experiments started from (reduced) MySQL databases. If necessary, the data were exported into the corresponding input formats for the learning systems. Bias definitions were also derived from the databases. These steps were supported by tools that we developed for those purposes, cf. Appendix A.
After that, systems were applied in conventional ways with their default settings. For Relaggs, we were able to use the same implementation that is also used in the following chapter. For the applicability of this implementation, we had to make the exploitation of foreign links and functional dependencies explicit by precomputing a number of joins of the original or reduced tables. Details are given in Appendix B. The times taken for these transformations are recorded in the experimental results section.
We used the setting branching factor = 0 for computing these joins. Furthermore, we set maximum cardinality = 100 for nominal attributes to be considered for propositionalization. An exception was made for the ECML-1998 data, where we used maximum cardinality = 10. This exception was made mainly because MySQL tables have a restricted width (number of columns), which would have been exceeded otherwise.
The aggregate functions we applied were the following: average, maximum, minimum, and sum for numeric attributes, count of possible values for nominal attributes, and count of related records. In order to unify experiments we did not use the MySQL functions but implementations within Relaggs. These also had to be used in other experiments with non-standard aggregate functions that are not offered by MySQL.
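The aggregation step described above can be illustrated with a minimal Python sketch. It is not the Relaggs implementation: the function name `aggregate`, the dictionary-based row format, and the feature naming are assumptions made for illustration, and "count of possible values" is read here as one occurrence count per possible value of a nominal attribute.

```python
from collections import defaultdict
from statistics import mean


def aggregate(target_ids, related_rows, key, numeric_attrs, nominal_attrs):
    """Summarize the related rows of each target record in one feature dict.

    Hypothetical helper: `related_rows` is a list of dicts, `key` names the
    foreign key attribute that points to the target record.
    """
    groups = defaultdict(list)
    for row in related_rows:
        groups[row[key]].append(row)

    # Collect the possible values per nominal attribute so that every
    # target record receives the same set of feature columns.
    values = {a: sorted({r[a] for r in related_rows}) for a in nominal_attrs}

    table = {}
    for tid in target_ids:
        rows = groups.get(tid, [])
        features = {"count": len(rows)}  # count of related records
        for a in numeric_attrs:
            xs = [r[a] for r in rows]
            # Assumption: aggregates over an empty group stay undefined.
            features[f"avg_{a}"] = mean(xs) if xs else None
            features[f"max_{a}"] = max(xs) if xs else None
            features[f"min_{a}"] = min(xs) if xs else None
            features[f"sum_{a}"] = sum(xs) if xs else None
        for a in nominal_attrs:
            for v in values[a]:
                features[f"count_{a}={v}"] = sum(1 for r in rows if r[a] == v)
        table[tid] = features
    return table
```

Each target record thus ends up as a single row of fixed width, which is what a propositional learner expects. A cardinality limit such as the one mentioned above would simply skip nominal attributes whose value list in `values` is too long.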
In the following, we present more information about special settings of the learning systems in our experiments. Progol and Rsd were also used with non-default parameter settings and on other preparations of the data. These experiments are reported separately.
For Tilde, advanced features of the system such as sampling or chunking were not used, nor were discretization or ≥ tests for numeric variables. The latter was remedied in extra experiments that are reported separately. Furthermore, we consistently used the test accuracy after pruning as provided by Tilde in its cross-validation result files, although a second accuracy was given there, called "after safe pruning", which occasionally differed.
RollUp was simulated with Relaggs, parameterized in the same way as Relaggs in the unified experiments, and with joins computed directly with the help of MySQL. Because main memory limits made it costly to handle the many Java double variables needed for Household.class prediction, we split the target table into four parts for propositionalization and combined the results before propositional learning took place.
After propositionalization by Dinus, Rsd, RollUp, or Relaggs, we applied WEKA learners, especially J48 and SMO, both with default parameterizations again.
In order to uniformly arrive at interpretable results, we used stratified 10-fold cross-validation for all experiments. To this end, we developed tools for partitioning the different kinds of input files for the learning systems in a way such that the same partitions included the same sets of examples across learning systems.
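A partitioning of this kind can be sketched as follows. This is a minimal illustration, not the tool developed for the experiments: the function name `stratified_folds`, the mapping from example identifiers to class labels, and the round-robin assignment are assumptions.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=10, seed=0):
    """Map each example id to a fold index in 0..k-1, stratified by class.

    `labels` maps example ids to class labels (hypothetical input format).
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for example_id, label in labels.items():
        by_class[label].append(example_id)

    assignment = {}
    offset = 0
    for label in sorted(by_class):
        ids = sorted(by_class[label])  # fixed order before shuffling
        rng.shuffle(ids)
        # Deal the examples of one class round-robin over the folds, so
        # that per-class counts differ by at most one between folds.
        for i, example_id in enumerate(ids):
            assignment[example_id] = (offset + i) % k
        offset += len(ids)
    return assignment
```

Because the assignment depends only on the example identifiers, the labels, and the seed, it can be computed once and then used to split the input files of every learning system, which keeps the folds identical across systems.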
Using our own partitionings of the data enabled us to do paired t-tests. Furthermore, we noticed advantages with respect to memory usage for larger datasets, where e. g. WEKA had difficulty executing its default cross-validation.
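Since every system is evaluated on the same ten folds, the per-fold error rates of two systems form natural pairs. A minimal sketch of the paired t statistic, with the hypothetical function name `paired_t_statistic` and per-fold error rates as input, could look like this:

```python
import math
from statistics import mean, stdev


def paired_t_statistic(errors_a, errors_b):
    """Return the paired t statistic for per-fold error rates of two systems."""
    if len(errors_a) != len(errors_b):
        raise ValueError("both systems must be evaluated on the same folds")
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    n = len(diffs)
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    return mean(diffs) / (sd / math.sqrt(n))
```

With 10 folds there are 9 degrees of freedom, so the statistic would be compared against the two-sided 5 % critical value of about 2.262. The significance level actually used in the experiments is not stated in this section and is assumed here only for illustration.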
We did not execute multiple cross-validations, although our tools allow for it: the user can specify a seed for the randomizer used during partitioning. Apart from the time that e. g. 10 times 10-fold cross-validation would have required, we consider the standard deviations sufficiently informative here. Especially for larger data sets, these are small enough to indicate stability of the results.
We measured classification accuracy or, equivalently, error, including the significance of differences between learning systems, as well as running times, complexities of models, and further properties of features across the experimental conditions. For the determination of accuracies or, equivalently, error rates, we performed stratified 10-fold cross-validation, as stated above.
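The following sketch shows how an error rate average and standard deviation can be obtained from per-fold results. The function name `summarize_errors` and the input format are hypothetical, and the use of the sample standard deviation (division by n - 1) is an assumption, since the text does not state which variant was used.

```python
from statistics import mean, stdev


def summarize_errors(fold_results):
    """Return (mean, standard deviation) of per-fold error rates in percent.

    `fold_results` is a list of (misclassified, total) pairs, one per fold.
    """
    rates = [100.0 * wrong / total for wrong, total in fold_results]
    return mean(rates), stdev(rates)


# Illustrative values only, not taken from the experiments: ten folds with
# one test example each, one of which is misclassified.
avg, sd = summarize_errors([(1, 1)] + [(0, 1)] * 9)
print(f"{avg:.1f} ± {sd:.1f}")
```

Very small test folds make the per-fold error rates jump between extreme values, which explains why standard deviations can exceed the average on tiny data sets.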
For running times and complexities of models, we measured training using all available labeled examples. This is an interesting case, because in practice, those models will usually be applied as predictors, based on the assumption that cross-validation results carry over to those models and that learning from more examples leads to higher quality models in general.
Table 4.3: Error rate averages and standard deviations (in percent; n. a. = not applicable for reasons of (1) database schema or (2) running time; best results in bold, second best in italics)

Target            Foil          Progol        Tilde         Dinus         Rsd           Relaggs
Trains.bound      40.0 ± 39.4   30.0 ± 35.0   30.0 ± 25.8   n. a. (1)     40.0 ± 31.6   10.0 ± 31.6
KRK.illegal        2.8 ±  1.1   n. a. (2)     24.9 ±  1.2   n. a. (1)     23.8 ±  1.5   27.7 ±  1.1
Muta042.active    22.7 ± 21.7   23.3 ± 14.0   21.3 ± 17.4   18.8 ± 14.3   16.3 ± 15.3   14.3 ± 16.0
Muta188.active    10.2 ±  4.9   18.4 ± 11.1   22.3 ±  8.2   20.6 ± 11.6   22.3 ±  8.2   13.2 ±  9.1
Partner.class     n. a. (2)     n. a. (2)     n. a. (2)     19.1 ±  0.2   n. a. (2)      2.5 ±  0.5
Household.class   n. a. (2)     n. a. (2)     n. a. (2)     42.9 ±  2.0   n. a. (2)      7.1 ±  0.8
Loan.status       12.7 ±  3.2   n. a. (2)     n. a. (2)     11.1 ±  0.6   n. a. (2)      7.2 ±  3.4
Card.type         14.6 ±  2.8   n. a. (2)     n. a. (2)     11.8 ±  0.5   n. a. (2)     11.8 ±  2.4
Gene.growth       10.6 ±  2.7   21.0 ±  3.3   19.3 ±  3.4   31.9 ±  0.3   19.6 ±  4.2   17.9 ±  4.0
Gene.nucleus      12.8 ±  3.0   19.4 ±  4.7   11.6 ±  2.2   37.8 ±  5.0   12.6 ±  2.6   15.0 ±  2.5
Note that we do not include times for loading data into main memory, as is usual for ILP learners, or for producing their input formats in the first place. These times are roughly constant across the experimental conditions and of a lower order of magnitude than the running times of the learners themselves.