In this section we give an indication of the overall performance of the Fantom service. We apply Fantom to one of the microarray data sets used in [GST+99]. We discuss the data set used, the transformations we applied to obtain a ranking of genes, and the parameters we supplied to Fantom. We also present some performance statistics that address both speed and pruning.
Dataset and Inputs
In this use case we used a well-known, publicly available data set that compares gene expression profiles of AML and ALL [ASS+02]. In this data set, gene expression profiles were taken from 47 patients suffering from ALL and 25 patients suffering from AML. We first normalized the raw data using quantile normalization [BIAS03]. After that, we mapped the probes to ENTREZ genes using the Hu6800 annotations supplied by Affymetrix. To reduce uncertainty, we discarded any entries that could not be mapped successfully to a single identifier. Finally, we calculated a t-value for each gene with Student’s t-test between the two groups, so that the resulting ranking can be used to investigate what the over-expressed genes have in common. If a gene had multiple t-values, the average of those values was taken.
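The last three steps above (discarding ambiguous probes, computing t-values, and averaging per gene) can be sketched as follows. This is a minimal illustration and not the code used in the experiments: the function and parameter names are hypothetical, the pooled-variance form of Student's t statistic is assumed, and the sign convention (positive means higher expression in the first group) is also an assumption. The input is expected to be normalized already.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean, variance


def student_t(group_a, group_b):
    """Two-sample Student's t statistic with pooled (equal) variance."""
    n_a, n_b = len(group_a), len(group_b)
    pooled = ((n_a - 1) * variance(group_a)
              + (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled * (1.0 / n_a + 1.0 / n_b))


def rank_genes(expression, probe_to_genes, first_group, second_group):
    """Return (gene, mean t-value) pairs, highest t-value first.

    expression:     probe id -> {sample id -> normalized expression value}
    probe_to_genes: probe id -> list of gene identifiers (hypothetical mapping)
    first_group,
    second_group:   lists of sample ids, e.g. the ALL and AML patients
    """
    t_by_gene = defaultdict(list)
    for probe, values in expression.items():
        genes = probe_to_genes.get(probe, [])
        if len(genes) != 1:
            # Discard probes that do not map to exactly one identifier.
            continue
        a = [values[s] for s in first_group]
        b = [values[s] for s in second_group]
        t_by_gene[genes[0]].append(student_t(a, b))
    # A gene measured by several probes receives the average of its t-values.
    ranking = [(gene, mean(ts)) for gene, ts in t_by_gene.items()]
    return sorted(ranking, key=lambda pair: pair[1], reverse=True)
```

The sorted list of identifiers and scores is the kind of ranked input that Fantom takes; how ties and zero-variance probes should be handled is left open here.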
As ontology inputs we used the GO and KEGG ontologies, combined with the ES score metric. The context inputs were set to Homo sapiens, and the identifier type was kept at its default, ENTREZ.
Implementation
The generation of the ontologies, the mappings, and the interaction files was implemented in Python and run on Python 2.5.2, chosen for its efficient and easy string handling. The web service implementations were done in Microsoft C++ .Net 2008 and Microsoft C# 2008, both using the .Net Framework, versions 2.0 and 3.5. We embedded this web service in a workflow created in the Taverna [MyG08] workbench.
Performance
In Figure 5.3, performance measurements are shown for different minimum support sizes 𝑆 and different minimum score thresholds 𝐶. In Figure 5.3(a), the horizontal axis indicates the minimum support in items, while the vertical axis indicates the running time of an experiment. In Figure 5.3(b), the horizontal axis indicates the minimum score setting in an experiment. The vertical axis is once again the running time of an experiment.
[Figure 5.3: (a) Support-Based Performance Measurement, running time in seconds against minimum support for C = 0.60, 0.65, 0.70 and 0.75; (b) Score-Based Performance Measurement, running time in seconds against minimum score for S = 10, 15, 20 and 30.]
As can be seen in Figure 5.3(a), increasing the minimum number of participants (the minimum support) has a profound effect on the performance of the algorithm. The same effect can be seen in Figure 5.3(b), where the minimum score was increased, though to a lesser extent for bigger subgroups. If we extrapolate the lines in Figure 5.3(b), it is nonetheless clear that pruning based on the ES measurement improves performance greatly (one can simulate the lack of ES pruning by taking a minimum score of 0).
Another interesting question is how these two thresholds affect pruning. Intuitively, smaller subgroups and lower minimum scores result in more rules being generated, and therefore in more rules being pruned. If we examine the percentage of rules pruned, however, we see that for both thresholds it is fairly stable around 99.8%. These results are shown in Figure 5.4. In Figure 5.4(a), the horizontal axis indicates the minimum support in items, while the vertical axis indicates the percentage of rules pruned. In Figure 5.4(b), the horizontal axis indicates the minimum score setting in an experiment. The vertical axis is once again the percentage of rules pruned.
As can be seen, the percentage of rules pruned behaves slightly more erratically under the support threshold in Figure 5.4(a) than under the score threshold in Figure 5.4(b), but overall both are monotonically increasing.
5.5 Conclusions and Future Work
In this chapter we discussed a subgroup discovery service called Fantom that finds subgroups given a set of weighted elements. We explained the technologies behind the algorithm, its data sources, and its way of combining that data to generate comprehensive patterns that are tailored to the expert knowledge of the researcher.
In our experiments, we have shown several statistics on the Golub et al. data set, from which we extracted the participating genes and their scores after normalization. We have shown that pruning can be done not only with a monotonic constraint such as support, but also by adapting a non-monotonic constraint such as the ES score measurement, thereby making use of the minimum support threshold. This resulted in fewer rules being generated and in increased pruning, which discarded at least 99.85% of all generated rules as useless, greatly diminishing redundant information.
For future work, the statistics reported for each rule should be extended, not only with ES scores but also with p-values for confidence. Furthermore, score functions other than ES should be evaluated and supported; a wide overview is presented in [AS09]. Of course, a qualitative re-assessment of the rules with different score measures will have to be made, as well as research into the tradeoff between performance and quality.
[Figure 5.4: (a) Support-Based Pruning Influence, percentage of rules pruned against minimum support for C = 0.60, 0.65, 0.70 and 0.75; (b) Score-Based Pruning Influence, percentage of rules pruned against minimum score for S = 10, 15, 20 and 30.]
Chapter 6
The Fantom Service: Exact Testing
In this chapter we describe how we combine the Fantom service with the statistical principle of permutation testing. We demonstrate that by performing several iterations of the Fantom service on permutations of an identifier list, we can prune the rule set for the original list even further. We also demonstrate that by combining Fantom and permutation testing, we can determine an optimal support threshold for classes in a multi-class experiment with respect to interestingness of rules.
6.1 Introduction
When generating rules from a ranked list of identifiers, the rules usually reflect the ranking, and Fantom is no different in this respect. It is therefore good practice, where possible, to make sure that the ranking is correct, and that the rules generated and reported are specific to that ranking rather than a product of chance. By generating permutations of a ranked list, one can check whether these permutations generate similar rules. If so, the rules become less important and interesting, for they are not specific to the original ranking.
The method previously described is a variation on Fisher’s Exact Test [Fis22, Fis67]. Fisher’s Exact Test is a statistical significance test that uses permutation generation to determine the deviation from a null hypothesis. It is called an exact test because it does not rely on heuristics and approximations of the deviation, but can calculate it exactly through generation of all possible permutations. Usually, a p-value is calculated, which is the probability of obtaining a measurement that has at least (or at most) the same value as the value actually observed, assuming that the null hypothesis is true. The lower the p-value, the less likely it is that the result was obtained by chance, assuming the null hypothesis is true, which makes the result more significant.
Consider a simple example: suppose we have a ranking of identifiers, and we create 99 permutations of those identifiers. After rule generation, we check for each rule whether it also appears in the permutation experiments. Suppose a rule in the original experiment appeared in 𝑛 out of the remaining 99 experiments; then a simplified p-value could be (𝑛 + 1)/100. Usually an observation (or rule) is considered interesting if its p-value is below 0.05, or even 0.01. The rule is then said to be statistically significant.
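The simplified p-value from this example can be sketched in a few lines. The sketch below is illustrative only: `generate_rules` is a hypothetical stand-in for a call to the Fantom service, rules are assumed to be hashable and to "appear" in a permutation experiment when an equal rule is produced there, and the permutations are drawn at random rather than enumerated exhaustively.

```python
import random


def permutation_p_values(ranking, generate_rules, n_permutations=99, seed=0):
    """Estimate a p-value per rule as (n + 1) / (n_permutations + 1).

    ranking:        list of identifiers in their original order
    generate_rules: hypothetical callable that maps a ranking to an
                    iterable of hashable rules
    """
    rng = random.Random(seed)
    # n per rule: the number of permutation experiments that reproduce it.
    hits = {rule: 0 for rule in generate_rules(ranking)}
    for _ in range(n_permutations):
        shuffled = list(ranking)
        rng.shuffle(shuffled)
        # Use a set so that a rule counts at most once per experiment.
        for rule in set(generate_rules(shuffled)):
            if rule in hits:
                hits[rule] += 1
    return {rule: (n + 1) / (n_permutations + 1) for rule, n in hits.items()}


def significant_rules(p_values, alpha=0.05):
    """Keep only the rules whose p-value lies below the chosen level."""
    return [rule for rule, p in p_values.items() if p < alpha]
```

With 99 permutations the smallest attainable p-value is 1/100, so a rule passes the 0.05 level only if it reappears in at most three of the permutation experiments.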
In this chapter we apply this concept to the rules generated by Fantom. We generated permutations of the input, and then generated rules from each of those permutations using multiple dedicated instances of the Fantom service. Furthermore, we used the same principle to generate an automated score threshold for multi-class problems in Fantom by using maximized rule count differentials.
This chapter is organized as follows. In Section 2, we discuss exact testing with Fantom on a single-class problem, generating a rule list with rules unique to the original ranking. We also discuss exact testing with Fantom for multi-class problems, whereby different groups of identifiers are compared to each other. We explain the difference with the normal experimental setup, and the algorithm behind automatic thresholding of the different groups participating in the rules. In Section 3, we present experimental results on both variations, again using the AML versus ALL data set. Finally, in Section 4 we present some conclusions and future work.