
7 Experimental Results

7.1 Compression Factor

Let R be the set of all rules which satisfy both the minsup and maxgap constraints, and let CRC and CCRS be the sets of general rules and compact rules, respectively, satisfying the same constraints. To measure the compression factor achieved by our compact representations, we compare their size with the size of the complete rule set. The compression factor (CF%) for the two representations is (1 − |CRC|/|R|)% and (1 − |CCRS|/|R|)%, respectively.
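As an illustration of the definition above, the following minimal sketch computes CF% from the cardinalities of a compact set and of the complete rule set. The function name `compression_factor` and the sizes used are hypothetical, chosen only to show the arithmetic; they are not taken from the experiments.

```python
def compression_factor(compact_size: int, full_size: int) -> float:
    """Return CF% = (1 - compact_size / full_size), expressed as a percentage."""
    if full_size <= 0:
        raise ValueError("the complete rule set R must be non-empty")
    if not 0 <= compact_size <= full_size:
        # CRC and CCRS are subsets of R, so they cannot be larger than R.
        raise ValueError("a compact representation cannot exceed |R|")
    return 100.0 * (full_size - compact_size) / full_size


# Hypothetical cardinalities, for illustration only.
size_r = 1000     # |R|
size_crc = 200    # |CRC|
size_ccrs = 400   # |CCRS|

cf_crc = compression_factor(size_crc, size_r)
cf_ccrs = compression_factor(size_ccrs, size_r)
print(f"CF(CRC)  = {cf_crc:.2f}%")
print(f"CF(CCRS) = {cf_ccrs:.2f}%")
```

With these toy sizes the smaller set (CRC) yields the higher compression factor, which is the sense in which the two representations are compared in the rest of this section.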

For the CRC representation, a high compression factor indicates that rules whose antecedent is a generator sequence are a small fraction of R. Instead, for the CCRS representation, a high compression factor indicates that rules whose antecedent is a closed sequence are a small fraction of R. In both cases, a small subset of R encodes all useful information to model classes.

Different data distributions yield different behavior when varying the minsup and maxgap values. In the following we summarize some common behaviors. Then, we analyze each dataset separately and discuss it in detail.

For moderately high minsup values, the two representations have very similar sizes (or even exactly the same size). In this case, the subsets of rules in R having as antecedent a closed sequence or a generator sequence are almost the same.

When lowering the support threshold or increasing the maxgap value, the number of rules in set R and in sets CCRS and CRC increases significantly. In this case, the CRC representation often achieves a higher compression than the CCRS representation. This effect occurs for maxgap > 1 and low minsup values. In this case, the set of rules with a generator sequence as antecedent is smaller than the set of rules with a closed sequence as antecedent. This occurs because, when increasing maxgap or decreasing minsup, mined sequences are characterized by increasing length. Hence, the number of closed sequences, which are the sequences with the longest antecedent, increases significantly. Instead, the increase in the number of generator sequences, which have shorter length, is less remarkable. Few generator sequences (in most cases only one) are associated with each closed sequence. In addition, as stated by Property 3, each generator sequence can be common to different closed sequences (see footnote 2).

In some cases, the CRC representation achieves a slightly lower compression than the CCRS representation. This occurs for maxgap = 1 and low minsup values. With respect to the case above, for these minsup and maxgap values there are a few more generator sequences than closed sequences. On average, more than one generator sequence is associated with each closed sequence (about 2 in the DNA dataset, and 1.2 in the Reuters and Newsgroup datasets). Generator sequences can still be common to several closed sequences, as stated in Property 3.
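The interplay between the number of generators per closed sequence and the sharing of generators among closed sequences can be sketched as follows. This is a toy illustration only: the mapping from closed sequences to generator sequences is made up, the tuples are placeholder labels rather than sequences mined from any dataset, and the helper `count_representatives` is a hypothetical name.

```python
from typing import Dict, Set, Tuple

Seq = Tuple[str, ...]  # a sequence is modelled as a tuple of items


def count_representatives(
    closed_to_generators: Dict[Seq, Set[Seq]],
) -> Tuple[int, int, float]:
    """Return (#closed, #distinct generators, avg generators per closed)."""
    closed_count = len(closed_to_generators)
    distinct_generators: Set[Seq] = set()
    for generators in closed_to_generators.values():
        # A generator shared by several closed sequences is counted once.
        distinct_generators.update(generators)
    total_links = sum(len(g) for g in closed_to_generators.values())
    return closed_count, len(distinct_generators), total_links / closed_count


# Toy case 1: one generator per closed sequence, with one generator shared
# by two closed sequences. Fewer generators than closed sequences.
shared = {
    ("a", "b", "c"): {("a",)},
    ("a", "b", "d"): {("a",)},
    ("b", "c", "d"): {("b", "c")},
}

# Toy case 2: two generators per closed sequence and no sharing.
# More generators than closed sequences.
unshared = {
    ("a", "c"): {("a",), ("c",)},
    ("g", "t"): {("g",), ("t",)},
}

print(count_representatives(shared))
print(count_representatives(unshared))
```

In the first toy case the distinct generators are fewer than the closed sequences, the situation in which a generator-based set would be the smaller one; in the second they are more numerous, the situation described for maxgap = 1 and low minsup.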

Reuters Dataset

Figure 5 reports the total number of rules in set R for different minsup and maxgap values. Results show that the rule set becomes very large for minsup = 0.1% and maxgap ≥ 3 (e.g., 1,306,929 rules for maxgap = 5). Figure 6a, b show the compression achieved by the two compact representations. For both of them, for a given maxgap value, the compression factor increases when minsup decreases. Furthermore, for a given minsup value, the compression factor increases when the maxgap value increases. For both representations, the compression factor is significant when set R includes many rules. When minsup = 0.1% and 3 ≤ maxgap ≤ 5, R includes from 184,715 to 1,291,696 rules. Compression ranges from 52.57 to 58.61% for the CCRS representation and from 60.18 to 80.54% for the CRC representation. A lower compression (less than 10%) is obtained when maxgap = 1. However, in this case the complete rule set is rather small, since it only includes about 12,000 rules when minsup = 0.1% and fewer than 2,000 rules for higher support thresholds.

Fig. 5. Number of rules for Reuters dataset

2 Recall that this behavior is peculiar to the sequential pattern domain. In the context of itemset mining, the number of generator itemsets is always greater than or equal to the number of closed itemsets. Furthermore, the sets of generator itemsets associated with different closed itemsets are disjoint.

Fig. 6. Compression factor for Reuters dataset: (a) CRC set, (b) CCRS set

Fig. 7. Rule length for CRC and CCRS sets for Reuters dataset (maxgap = 2): (a) CRC set, (b) CCRS set

For low support thresholds and high maxgap values, the CRC representation always achieves a higher compression. In particular, when minsup = 0.1% and 3 ≤ maxgap ≤ 5, the compression factor is more than 10% higher than in the CCRS representation (about 20% when maxgap = 5). The two representations provide a comparable compression for higher minsup and lower maxgap values. To analyze this behavior, Fig. 7 plots the number of general and compact rules for different rule lengths, for maxgap = 2 and different minsup values. As discussed above, when decreasing minsup, the number of compact rules increases more significantly. Figure 7 shows that this is due to an increase in the number of longer compact rules.

As shown in Fig. 6a, b, for a given minsup value compression increases for increasing maxgap values. Figure 8 focuses on this issue and plots the compression factor for both compact forms for a large set of maxgap values and for thresholds minsup = 0.5% and minsup = 1%. For both forms the compression factor increases until maxgap = 5 and then decreases again. The compression factors are very close until maxgap = 5, and then the difference between the two representations becomes more significant. This difference is more pronounced when minsup = 0.5%. The CRC form always achieves higher compression. An analogous behavior has been obtained for other minsup values.

Fig. 8. Compression factor when varying maxgap for Reuters dataset

Fig. 9. Newsgroup dataset: (a) Number of rules, (b) Compression factor for CRC set

Newsgroup Dataset

Figure 9a reports the total number of rules in set R for different minsup and maxgap values. The compression factor shows a similar behavior for the two compact forms. In the following we discuss the compression factor for the CRC set, taken as a representative example (see Fig. 9b). When maxgap > 1, the compression factor is only slightly sensitive to the variation of the support threshold. Hence, the fraction of rules with a closed or a generator sequence as antecedent does not vary significantly when varying support. Similarly to the case of the Reuters dataset, the CRC representation always achieves a higher compression than the CCRS representation, with an improvement of about 20%.

The case maxgap = 1 yields a different behavior. For both representations, the compression factor increases for increasing support thresholds. From Fig. 9a, the cardinality of the complete rule set is rather stable for growing support values. Instead, the numbers of both closed and generator sequences decrease. This effect yields growing compression when increasing the support threshold.

When varying maxgap, both compact forms show a compression factor behavior similar to the Reuters dataset. For a given minsup value, the compression factor first increases when increasing maxgap. After a given maxgap value, it decreases again. This behavior is less evident than in the Reuters dataset. Furthermore, the maxgap value where the maximum compression is achieved varies with the support threshold.

Fig. 10. DNA dataset: (a) Number of rules, (b) Compression factor

DNA Dataset

For the DNA dataset, we only consider the case maxgap = 1. This constraint is particularly interesting in the biological application domain since sequences of adjacent items in the DNA input sequences are mined. Figure 10a reports the number of rules in sets R, CCRS, and CRC for different minsup values. Even if the alphabet only includes four symbols, a large number of rules is generated when decreasing the support threshold.

Figure 10b shows the compression factor for the two compact representations. Both compact forms yield significant benefits for low support thresholds. In this case R contains a large number of rules (2,672,408 rules when minsup = 0.05%), while both compact forms have a significantly smaller size (CF = 95.85% for the CRC representation and CF = 93.74% for the CCRS representation). The CRC representation provides a slightly lower compression than the CCRS representation for low support thresholds. Instead, the compression factor is comparable for high minsup values.

In document Data Mining Foundations And Practice Tsau Young Lin (2008) pdf (Page 36-40)