7 Experimental Results
7.1 Compression Factor
Let R be the set of all rules that satisfy both the minsup and maxgap constraints, and let CRC and CCRS be the sets of general rules and compact rules, respectively, that satisfy the same constraints. To measure the compression factor achieved by our compact representations, we compare their size with the size of the complete rule set. The compression factor (CF%) for the two representations is $\left(1 - \frac{|CRC|}{|R|}\right)\%$ and $\left(1 - \frac{|CCRS|}{|R|}\right)\%$, respectively.
For the CRC representation, a high compression factor indicates that rules whose antecedent is a generator sequence are a small fraction of R. For the CCRS representation, instead, a high compression factor indicates that rules whose antecedent is a closed sequence are a small fraction of R. In both cases, a small subset of R encodes all useful information to model classes.
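As a minimal sketch of the definition above, the following Python function computes the compression factor from the sizes of the complete and compact rule sets. The function name and the sample counts are illustrative assumptions, not values from the experiments.

```python
def compression_factor(compact_size: int, complete_size: int) -> float:
    """Return CF% = (1 - |compact| / |R|) * 100."""
    if complete_size <= 0:
        raise ValueError("the complete rule set R must be non-empty")
    if not 0 <= compact_size <= complete_size:
        raise ValueError("a compact set is a subset of R, so 0 <= size <= |R|")
    return (1 - compact_size / complete_size) * 100


# Hypothetical sizes, chosen only to illustrate the formula.
r_size = 1000
crc_size = 200    # rules whose antecedent is a generator sequence
ccrs_size = 400   # rules whose antecedent is a closed sequence

cf_crc = compression_factor(crc_size, r_size)
cf_ccrs = compression_factor(ccrs_size, r_size)
print(f"CF(CRC) = {cf_crc:.2f}%, CF(CCRS) = {cf_ccrs:.2f}%")
```

With these made-up sizes the smaller compact set yields the higher compression factor, which is the sense in which the two representations are compared in the rest of this section.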
Different data distributions yield different behaviors when varying the minsup and maxgap values. In the following we summarize some common behaviors. Then, we analyze each dataset separately and discuss it in detail.
For moderately high minsup values, the two representations have very similar sizes (or even exactly the same size). In this case, the subsets of rules in R having a closed sequence or a generator sequence as antecedent are almost the same.
When lowering the support threshold or increasing the maxgap value, the number of rules in set R and in sets CCRS and CRC increases significantly. In this case, the CRC representation often achieves a higher compression than the CCRS representation. This effect occurs for maxgap > 1 and low minsup values. In this case, the set of rules with a generator sequence as antecedent is smaller than the set of rules with a closed sequence as antecedent. This occurs because, when increasing maxgap or decreasing minsup, the mined sequences become longer. Hence, the number of closed sequences, which are the sequences with the longest antecedent, increases significantly. The increase in the number of generator sequences, which are shorter, is instead less marked. Few generator sequences (in most cases only one) are associated with each closed sequence. In addition, as stated by Property 3, each generator sequence can be common to different closed sequences (see footnote 2).
In some cases, the CRC representation achieves a slightly lower compression than the CCRS representation. This occurs for maxgap = 1 and low minsup values. Unlike in the case above, for these minsup and maxgap values there are a few more generator sequences than closed sequences. On average, more than one generator sequence is associated with each closed sequence (about 2 in the DNA dataset, and 1.2 in the Reuters and Newsgroup datasets). Generator sequences can still be common to several closed sequences, as stated in Property 3.
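The two cases above can be illustrated with a toy sketch. The mappings below are invented for illustration and are not taken from any of the datasets; the labels are opaque identifiers rather than real sequences. Each closed sequence is associated with its generator sequences, and a generator may be shared by several closed sequences, as Property 3 allows. Counting distinct generators against closed sequences shows how sharing can make the generator-based antecedents fewer, while several unshared generators per closed sequence has the opposite effect.

```python
def summarize(closed_to_generators: dict[str, set[str]]) -> tuple[int, int, float]:
    """Return (#closed sequences, #distinct generators, avg generators per closed)."""
    n_closed = len(closed_to_generators)
    if n_closed == 0:
        raise ValueError("at least one closed sequence is required")
    # A generator shared by several closed sequences is counted once.
    distinct = set().union(*closed_to_generators.values())
    avg = sum(len(gens) for gens in closed_to_generators.values()) / n_closed
    return n_closed, len(distinct), avg


# Hypothetical case 1: one generator per closed sequence, heavily shared.
shared = {"c1": {"g1"}, "c2": {"g1"}, "c3": {"g1"}, "c4": {"g2"}}

# Hypothetical case 2: more than one generator per closed sequence, no sharing.
unshared = {"c1": {"g1", "g2"}, "c2": {"g3", "g4"}, "c3": {"g5"}}

for name, mapping in (("shared", shared), ("unshared", unshared)):
    n_closed, n_gen, avg = summarize(mapping)
    print(f"{name}: {n_closed} closed, {n_gen} distinct generators, "
          f"{avg:.2f} generators per closed sequence")
```

In the first mapping there are fewer distinct generators than closed sequences; in the second there are more, even though both are consistent with every closed sequence having at least one generator.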
Reuters Dataset
Figure 5 reports the total number of rules in set R for different minsup and maxgap values. Results show that the rule set becomes very large for minsup = 0.1% and maxgap ≥ 3 (e.g., 1,306,929 rules for maxgap = 5).
Figures 6a and 6b show the compression achieved by the two compact representations. For both of them, for a given maxgap value, the compression factor increases when minsup decreases. Furthermore, for a given minsup value, the compression factor increases when the maxgap value increases. For both representations, the compression factor is significant when set R includes many rules. When minsup = 0.1% and 3 ≤ maxgap ≤ 5, R includes from 184,715 to 1,291,696 rules. Compression ranges from 52.57 to 58.61% for the CCRS representation and from 60.18 to 80.54% for the CRC representation. A lower compression (less than 10%) is obtained when maxgap = 1. However, in this case the complete rule set is rather small, since it only includes about 12,000 rules when minsup = 0.1% and fewer than 2,000 rules for higher support thresholds.
Fig. 5. Number of rules for Reuters dataset
Footnote 2. Recall that this behavior is peculiar to the sequential pattern domain. In the context of itemset mining, the number of generator itemsets is always greater than or equal to the number of closed itemsets. Furthermore, the sets of generator itemsets associated with different closed itemsets are disjoint.
Fig. 6. Compression factor for Reuters dataset: (a) CRC set, (b) CCRS set
Fig. 7. Rule length for CRC and CCRS sets for Reuters dataset (maxgap = 2): (a) CRC set, (b) CCRS set
For low support thresholds and high maxgap values, the CRC representation always achieves a higher compression. In particular, when minsup = 0.1% and 3 ≤ maxgap ≤ 5, the compression factor is more than 10% higher than in the CCRS representation (about 20% when maxgap = 5). The two representations provide a comparable compression for higher minsup and lower maxgap values. To analyze this behavior, Fig. 7 plots the number of general and compact rules for different rule lengths, for maxgap = 2 and different minsup values. As discussed above, when decreasing minsup, the number of compact rules increases more significantly. Figure 7 shows that this is due to an increase in the number of longer compact rules.
As shown in Fig. 6a, b, for a given minsup value compression increases for increasing maxgap values. Figure 8 focuses on this issue and plots the compression factor for both compact forms for a larger set of maxgap values and for thresholds minsup = 0.5% and minsup = 1%. For both forms the compression factor increases until maxgap = 5 and then decreases again. The compression factors are very close until maxgap = 5, after which the difference between the two representations becomes more significant. This difference is more pronounced when minsup = 0.5%. The CRC form always achieves higher compression. An analogous behavior has been obtained for other minsup values.
Fig. 8. Compression factor when varying maxgap for Reuters dataset
Fig. 9. Newsgroup dataset: (a) number of rules, (b) compression factor for CRC set
Newsgroup Dataset
Figure 9a reports the total number of rules in set R for different minsup and maxgap values. The compression factor shows a similar behavior for the two compact forms. In the following we discuss the compression factor for the CRC set, taken as a representative example (see Fig. 9b). When maxgap > 1, the compression factor is only slightly sensitive to the variation of the support threshold. Hence, the fraction of rules with a closed or a generator sequence as antecedent does not vary significantly when varying the support. Similarly to the case of the Reuters dataset, the CRC representation always achieves a higher compression than the CCRS representation, with an improvement of about 20%.
The case maxgap = 1 yields a different behavior. For both representations, the compression factor increases for increasing support thresholds. As Fig. 9a shows, the cardinality of the complete rule set is rather stable for growing support values, whereas the numbers of both closed and generator sequences decrease. This effect yields growing compression when increasing the support threshold.
When varying maxgap, both compact forms show a compression factor behavior similar to the Reuters dataset. For a given minsup value, the compression factor first increases when increasing maxgap. After a given maxgap value, it decreases again. This behavior is less evident than in the Reuters dataset. Furthermore, the maxgap value where the maximum compression is achieved varies with the support threshold.
Fig. 10. DNA dataset: (a) number of rules, (b) compression factor
DNA Dataset
For the DNA dataset, we only consider the case maxgap = 1. This constraint is particularly interesting in the biological application domain since sequences of adjacent items in the DNA input sequences are mined. Figure 10a reports the number of rules in sets R, CCRS, and CRC for different minsup values. Even though the alphabet only includes four symbols, a large number of rules is generated when decreasing the support threshold.
Figure 10b shows the compression factor for the two compact representations. Both compact forms yield significant benefits for low support thresholds. In this case R contains a large number of rules (2,672,408 rules when minsup = 0.05%), while both compact forms have a significantly smaller size (CF = 95.85% for the CRC representation and CF = 93.74% for the CCRS representation). The CRC representation provides a slightly lower compression than the CCRS representation for low support thresholds. Instead, the compression factor is comparable for high minsup values.
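Since CF% = (1 − |compact|/|R|)%, the size of a compact set can be recovered approximately as |R| · (1 − CF/100). The sketch below applies this to the figures quoted above for minsup = 0.05%, taken at face value. The resulting sizes are derived estimates, limited by the two-decimal rounding of CF, and are not numbers reported in the experiments; the function name is an assumption.

```python
def implied_compact_size(complete_size: int, cf_percent: float) -> int:
    """Estimate |compact| = |R| * (1 - CF/100), rounded to the nearest rule."""
    if complete_size < 0:
        raise ValueError("the size of R cannot be negative")
    if not 0.0 <= cf_percent <= 100.0:
        raise ValueError("CF% must lie in [0, 100]")
    return round(complete_size * (1 - cf_percent / 100))


R_SIZE = 2_672_408  # |R| quoted in the text for minsup = 0.05%

# CF values as quoted in the text; the outputs are estimates, not reported data.
crc_estimate = implied_compact_size(R_SIZE, 95.85)
ccrs_estimate = implied_compact_size(R_SIZE, 93.74)
print(f"CRC: about {crc_estimate:,} rules; CCRS: about {ccrs_estimate:,} rules")
```

Both estimates are on the order of 10^5 rules, against roughly 2.7 million rules in R.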