As established in Chapter 4, one can use closed and approximate frequent itemset mining to identify subsets of data that exhibit an association between response and explanatory variables. Namely, the p-value associated with the phi coefficient can be used to determine which discovered subsets of data are statistically significant (important), while the False Discovery Rate (FDR) correction accounts for the increase in type I error that is associated with multiple testing. An issue arises when the phi coefficient fails to adequately quantify the association between response and explanatory variables. For example, figure 5.1 shows six subsets of data that all have the same consistency, where consistency = (a + d)/(a + b + c + d), but vastly different associated p-values. Comparing the top row (black text) to the two below it (blue text), notice that the associated p-value increases in significance (becomes smaller in value) as there is less balance in the inconsistent cells b and c, where balance means equality in value. Comparing the top row (black text) to the fourth and fifth rows (purple text), notice how the associated p-value increases in significance as there is more balance in the consistent cells a and d. If one deems consistency a more important measure of the association between response and explanatory variables than the phi coefficient, figure 5.1 demonstrates that the methods described in Chapter 4, which rely on the phi coefficient, will not adequately discover all subsets of data that are important with regard to consistency.

Figure 5.1: Depicts the issue with using statistics based upon a p-value. Shows how the same consistency can give vastly different phi coefficient p-values.
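To make the contrast between consistency and the phi coefficient concrete, the following minimal sketch computes both for three 2x2 tables. The cell counts are hypothetical and are not the values shown in figure 5.1. The p-value is assumed here to come from a chi-square test with one degree of freedom (chi-square = n times phi squared, without continuity correction); the test actually used in Chapter 4 may differ.

```python
import math

def consistency(a, b, c, d):
    """Fraction of observations falling in the consistent cells a and d."""
    return (a + d) / (a + b + c + d)

def phi_coefficient(a, b, c, d):
    """Phi coefficient of the 2x2 table [[a, b], [c, d]]."""
    denom = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return (a * d - b * c) / denom if denom else 0.0

def phi_p_value(a, b, c, d):
    """Assumed test: chi-square with 1 degree of freedom, chi2 = n * phi^2."""
    n = a + b + c + d
    chi2 = n * phi_coefficient(a, b, c, d) ** 2
    # Survival function of a chi-square variable with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2.0))

# Hypothetical tables (a, b, c, d), each with n = 100 and a + d = 80.
tables = {
    "balanced": (40, 10, 10, 40),
    "less balanced b, c": (40, 20, 0, 40),
    "less balanced a, d": (70, 10, 10, 10),
}

for name, cells in tables.items():
    print(f"{name:20s} consistency={consistency(*cells):.2f} "
          f"phi={phi_coefficient(*cells):.3f} p={phi_p_value(*cells):.2e}")
```

All three tables share a consistency of 0.80, yet their p-values differ by several orders of magnitude, in the directions described above: less balance in b and c lowers the p-value, and less balance in a and d raises it.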
In the preceding chapter we propose a bootstrap methodology, adapted so that one no longer has to rely on metrics associated with p-values to determine which subsets of data discovered through closed frequent itemset mining are statistically significant. The primary benefit of this bootstrap methodology is that it can be used with any statistic or property of the dataset, without dependence upon a p-value. Moreover, the method provides ways to incorporate multiple metrics, such that the final significance can be associated with multiple properties of the data. The advantage of this is that if one can provide a statistic that integrates three or more datasets, one can effectively extend the method to consider associations between three or more datasets, as depicted in figure 5.2. One naive way to associate three or more datasets would be to create metrics for the pairwise associations between all pairs of the datasets and use these multiple metrics with the bootstrap methodology.
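The statistic-agnostic nature of the resampling step can be illustrated with a minimal sketch. This is a generic bootstrap loop, not the specific procedure proposed in this work: the function names, the representation of rows as (explanatory, response) binary pairs, and the toy data are all assumptions made for illustration.

```python
import random

def bootstrap_statistic(rows, statistic, n_boot=1000, seed=0):
    """Resample rows with replacement and evaluate `statistic` on each sample.

    `statistic` is any callable taking a list of rows; no p-value is required.
    """
    rng = random.Random(seed)
    n = len(rows)
    results = []
    for _ in range(n_boot):
        sample = [rows[rng.randrange(n)] for _ in range(n)]
        results.append(statistic(sample))
    return results

def consistency(rows):
    """Hypothetical metric: share of rows where explanatory equals response."""
    return sum(1 for x, y in rows if x == y) / len(rows)

# Toy data: 100 (explanatory, response) pairs with consistency 0.80.
rows = [(1, 1)] * 40 + [(1, 0)] * 10 + [(0, 1)] * 10 + [(0, 0)] * 40
dist = bootstrap_statistic(rows, consistency, n_boot=200)
print(min(dist), sum(dist) / len(dist), max(dist))
```

Because `statistic` is an arbitrary callable, it could equally return a tuple of several metrics, for instance one per pair of datasets, which is one way the naive pairwise approach described above might be plugged in.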
Figure 5.2: Association for Multiple Data Sources. [Diagram labels: items E1, E2, A1, A7, A9, C30, C74 (3 different 'sets' of items); transactions; item set mining.]

Another benefit is that the bootstrap methodology allows one to appropriately account for type I error associated with multiple testing. Similar to the selection of an α value in hypothesis testing, one selects δ, the probability that the significance threshold selected will exceed the number of false positives selected, as the threshold criterion. This gives one more control over the probability of observing false positives within the results deemed significant than what is observed with standard corrective measures like the FDR correction. Additionally, the analysis can be focused solely on multivariate associations, where multivariate is defined as including more than one response variable, or on other particular associations of interest, through the use of the seed nodes in the closed frequent itemset mining algorithm. This enables one to focus the analysis on the associations of greatest interest. As demonstrated in Chapter 4 with the identification of approximate frequent itemsets, the bootstrap methodology can also be used in such a way as to incorporate fuzziness into the results, providing a method that is robust to noisy, inconsistent data. The bootstrap methodology can be adapted to work with approximate itemsets as long as the incorporation of approximation does not impair the rationale behind one's metric of interest.
The primary weakness of this method is common to all methods that use bootstrapping: the resampling, the measurement of the metric of interest on the bootstrap samples, and the summarization of the bootstrap results are computationally expensive in both time and space. The most computationally expensive process is the closed and approximate itemset mining. As discussed in Chapter 4, the time required by both depends upon the size of the original binary dataset, the density of ones in the dataset, and the support threshold used. Running the FDR correction on the phi coefficient p-values is more efficient than using the bootstrap method. Computing closed or approximate itemsets on the bootstrap samples can be much more computationally expensive than was observed with the original dataset. This is because generating the bootstrap samples by random sampling with replacement from the original dataset can produce a much denser (more ones) bootstrap dataset.
Thus, for datasets that produce on average more than 25,000 rules, the method can be computationally infeasible. For example, it took approximately 27 minutes to mine and output results for 22,881 closed itemsets for the ToxCast data, so running 1000 such bootstrap samples would take approximately 19 days of runtime if the bootstrap samples were run sequentially. Furthermore, depending upon the density and size of the original dataset, some of the bootstrap samples can produce 4 or more times the number of closed itemsets observed in the original dataset. An increase from 25,000 to 125,000 closed itemsets would result in taking a day as opposed to an hour just to mine all itemsets, let alone to summarize the results on a bootstrap sample of that size. Therefore, for the bootstrap methodology to be computationally feasible, one must find ways to limit the number of itemsets to be quantified to no more than 15,000 to 20,000. The seed nodes used in our closed frequent itemset mining algorithm provide a natural way to limit one's results so that the bootstrap methodology can be used.
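The runtime estimate above follows from simple arithmetic, sketched below. The `workers` parameter is a hypothetical addition for comparison: it assumes every bootstrap sample takes the same time as the original dataset and that samples can be mined fully in parallel, neither of which is guaranteed given the density effect described earlier.

```python
import math

def bootstrap_runtime_days(minutes_per_sample, n_samples, workers=1):
    """Estimated wall-clock days, assuming equal-cost, independent samples."""
    batches = math.ceil(n_samples / workers)  # rounds of parallel work
    return batches * minutes_per_sample / (60 * 24)

# 27 minutes per sample, 1000 samples, run one after another.
sequential = bootstrap_runtime_days(27, 1000)
# Same workload spread over 50 hypothetical parallel workers.
parallel = bootstrap_runtime_days(27, 1000, workers=50)

print(f"sequential: {sequential:.2f} days")  # 18.75 days, i.e. about 19
print(f"50 workers: {parallel:.3f} days")
```

Since some bootstrap samples may yield several times as many closed itemsets as the original data, these figures are better read as lower bounds than as predictions.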
The final weakness is that using too large a support threshold can cause the bootstrap method to be too stringent with regard to the number of significant results it provides. We demonstrate how greater minimum support thresholds tend to produce results that are more stringent than what is seen with the FDR correction. The method seems to work best for support thresholds of around 0.10 or less. This is somewhat dependent upon the density of ones in the original dataset; thus, we provide a methodology for determining the criteria that should be used with regard to δ and the support threshold, as compared to the results that would have been provided by using the FDR correction.