The Depth-First Algorithm - Knowledge Discovery and Monotonicity

The Depth-First algorithm was introduced in [61]. As the name suggests, it generates candidates with a depth-first strategy. Another important feature is that the itemsets are represented in a trie, which can be described as follows.

A trie consists of nodes and links. Each node contains a number of cells, which are also called buckets. Each bucket is labelled by an item. The links connect a bucket (as a parent) to a node (as a child). Each path in the trie starting from the root is an itemset, and therefore each bucket corresponds to a unique itemset. An example trie is given in figure 5.5. The root node of that trie contains five cells/buckets labelled with the items a, b, c, d and e. Each path starting from one of those cells represents an itemset. For example, the path e → c → b → a represents the itemset {e, c, b, a}, which means that its prefixes {e, c, b}, {e, c} and {e} are itemsets in the trie as well.

In the following, the terms “cell”, “path” and “itemset” will be used interchangeably in the context of an itemset trie. That is because each cell determines exactly one path that leads to it starting from the root. This path corresponds to an itemset containing the items present in the path.
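The structure described above can be sketched in Python as follows. This is only a minimal illustration and not the implementation from [61]: the names `Cell`, `insert` and `itemsets` are hypothetical, and a node is modelled simply as a dictionary from item labels to cells.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, Tuple


@dataclass
class Cell:
    """One bucket: an item label, a support counter and a link to a child node."""
    item: str
    support: int = 0
    # Assumption: the child node is modelled as a dict from item label to Cell.
    children: Dict[str, "Cell"] = field(default_factory=dict)


def insert(root: Dict[str, Cell], path: Tuple[str, ...]) -> Cell:
    """Create the cells along a non-empty path (if missing) and return the last one."""
    node = root
    for item in path:
        cell = node.setdefault(item, Cell(item))
        node = cell.children
    return cell


def itemsets(node: Dict[str, Cell], prefix: Tuple[str, ...] = ()) -> Iterator[Tuple[str, ...]]:
    """Yield the path (that is, the itemset) of every cell, depth first."""
    for cell in node.values():
        path = prefix + (cell.item,)
        yield path
        yield from itemsets(cell.children, path)


root: Dict[str, Cell] = {}
insert(root, ("e", "c", "b", "a"))
found = list(itemsets(root))
print(found)
```

Inserting the single path e → c → b → a creates four cells, one for each prefix of the path, which is the sense in which every cell corresponds to exactly one itemset.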

Figure 5.5: An example of a trie

A trie is suited to storing the frequent itemsets because of the property that the set of all frequent itemsets is a meet semi-lattice and every subset of a frequent itemset is also frequent. In this case each bucket also contains a number which gives the support of the itemset represented by the path from the root to that bucket.

Note that the trie is very different from the FP-tree: in the trie, every cell uniquely determines an itemset. This is not the case in the FP-tree, where the same itemset may be present in different branches of the tree, and extracting only one of those branches does not necessarily give the full support of the itemset.

The Depth-First algorithm is given in figure 5.6. As a preprocessing step, the frequent 1-itemsets are extracted by one scan of the database; the infrequent ones are not considered further. The frequent 1-itemsets, denoted by $i_1, i_2, \ldots, i_n$, form the root of the trie. For each item in the root, starting from $i_{n-1}$ and moving towards $i_1$, the subtrie that has been built to the right of it is copied under its bucket. This new subtrie contains all the current candidates. Their support is counted by a database scan and the infrequent ones are pruned (removed from the trie). The algorithm then proceeds with the next root item to the left.

The count procedure is performed by extracting the transactions one by one and “pushing” them through the trie. If the current transaction does not contain the root item, we ignore it and proceed with the next transaction. If it does, it is checked for the item of the first child cell. If that item is present, we go deeper recursively; if not, we continue with the next child cell.
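The count procedure can be sketched as a short recursive function. This is an illustration under assumptions, not the code of [61]: cells are plain dictionaries with hypothetical keys `"support"` and `"children"`, and the three transactions are made up for the example (they are not the database of table 5.1).

```python
def make_cell():
    """Hypothetical cell layout: a support counter plus a child node."""
    return {"support": 0, "children": {}}


def count(node, transaction):
    """Push one transaction through the trie.

    Every cell whose item occurs in the transaction gets its support
    incremented, and the search continues recursively below that cell.
    Cells whose item is absent are skipped together with their subtrie.
    """
    for item, cell in node.items():
        if item in transaction:
            cell["support"] += 1
            count(cell["children"], transaction)


# Subtrie rooted in c, holding the candidates (c, b), (c, b, a) and (c, a).
subtrie = {"c": make_cell()}
below_c = subtrie["c"]["children"]
below_c["b"] = make_cell()
below_c["b"]["children"]["a"] = make_cell()
below_c["a"] = make_cell()

# Made-up transactions, for illustration only.
transactions = [{"a", "b", "c"}, {"a", "c"}, {"b", "d"}]
for t in transactions:
    count(subtrie, t)

print(subtrie["c"]["support"])                                   # support of (c)
print(below_c["b"]["support"])                                   # support of (c, b)
print(below_c["b"]["children"]["a"]["support"])                  # support of (c, b, a)
print(below_c["a"]["support"])                                   # support of (c, a)
```

The third transaction does not contain the root item c, so it is dropped at the first test and never reaches the deeper cells.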

A refinement of the algorithm is to sort the items in the root in increasing order of support, so that the most frequent one is in the rightmost cell of the root. Since only the transactions that contain the current root item have to be pushed through the trie, this means that, moving to the left in the structure, fewer transactions need to be inspected in order to count the support of the candidates. This improvement will be assumed in the following.

Depth-First():
    T = the trie including only bucket i_n;
    for m = n − 1 downto 1
        T' = T;
        T = T' with i_m added to the left and a copy of T' appended to i_m;
        C = T \ T' (the subtrie rooted in i_m);
        count(C);
        delete the infrequent itemsets from T;

Figure 5.6: The Depth-First algorithm
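The whole procedure, including the ordering refinement, can be sketched in Python as below. This is a hedged illustration rather than the implementation from [61]: the function names, the representation of a cell as a two-element list `[support, children]`, and the small database are all assumptions made for the example (the database is not the one from table 5.1).

```python
from collections import Counter


def fresh_copy(node):
    """Copy the structure of a (sub)trie, resetting every support to zero."""
    return {item: [0, fresh_copy(children)] for item, (_, children) in node.items()}


def count(node, transaction):
    """Push one transaction through the trie and update the supports."""
    for item, cell in node.items():
        if item in transaction:
            cell[0] += 1
            count(cell[1], transaction)


def prune(node, min_support):
    """Delete infrequent cells together with everything below them."""
    for item in list(node):
        if node[item][0] < min_support:
            del node[item]
        else:
            prune(node[item][1], min_support)


def depth_first(transactions, min_support):
    transactions = [frozenset(t) for t in transactions]
    # Preprocessing scan: supports of the single items.
    item_support = Counter(item for t in transactions for item in t)
    frequent = [i for i, s in item_support.items() if s >= min_support]
    # Most frequent item first: it is processed first and ends up rightmost.
    order = sorted(frequent, key=lambda i: (-item_support[i], i))
    trie = {}
    # The first iteration creates the bucket for i_n (with no candidates).
    for item in order:
        candidates = fresh_copy(trie)      # copy of the trie built so far
        for t in transactions:
            if item in t:                  # other transactions are ignored
                count(candidates, t)
        prune(candidates, min_support)
        # The new item is added to the left of the existing root items.
        trie = {item: [item_support[item], candidates], **trie}
    return trie


def frequent_itemsets(node, prefix=()):
    """Map every path in the trie to its support."""
    result = {}
    for item, (support, children) in node.items():
        path = prefix + (item,)
        result[path] = support
        result.update(frequent_itemsets(children, path))
    return result


# Made-up database, for illustration only.
db = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "d"}]
result = frequent_itemsets(depth_first(db, min_support=2))
print(result)
```

In this made-up database the item d is infrequent and is dropped in the preprocessing scan, and the candidate (c, b, a) is counted once and then pruned, so six frequent itemsets remain.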

The algorithm is illustrated with the example in figures 5.7 to 5.10, which uses the database from table 5.1 and a support threshold of 2. For simplicity, the supports of the cells in the trie are not shown in the pictures. We start with the frequent 1-itemsets, which here are a, b, c, d and e. They form the root node. First we consider a and create a root cell for it. We then consider item b and add it to the root. The subtrie to the right of it contains only a, which is copied under b. The resulting trie is shown in figure 5.7, where the newly copied subtrie is drawn with dashed lines. The database is scanned for the support of the candidate itemset (b, a), which turns out to be frequent.

The algorithm proceeds with the root-item c. The subtrie to the right of it is copied under it, see figure 5.8. Again the database is scanned for the support of the new part. The candidates that are actually being counted are (c, b), (c, b, a) and (c, a). They turn out to be frequent and are kept in the trie. We then proceed with the item d and the situation from figure 5.9. The database is scanned again for the support of the following candidates: (d, c), (d, c, b), (d, c, b, a), (d, c, a), (d, b), (d, b, a) and (d, a). They all prove to be frequent and are therefore kept in the trie.
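The candidate lists in the walkthrough follow mechanically from the copy step. The sketch below reproduces them under the assumption, true in the example up to item d, that no candidate is pruned; supports are left out, and the names `clone`, `paths` and `candidates_per_step` are hypothetical.

```python
def clone(node):
    """Copy a trie given as nested dicts (item -> child node)."""
    return {item: clone(children) for item, children in node.items()}


def paths(node, prefix=()):
    """List all paths of a trie in depth-first, left-to-right order."""
    for item, children in node.items():
        yield prefix + (item,)
        yield from paths(children, prefix + (item,))


def candidates_per_step(items):
    """For each new root item, list the candidates created by the copy step.

    Assumption: every candidate turns out to be frequent, so nothing is pruned.
    """
    trie = {}
    steps = {}
    for item in items:
        copied = clone(trie)                       # subtrie to the right
        steps[item] = [(item,) + p for p in paths(copied)]
        trie = {item: copied, **trie}              # new item goes to the left
    return steps


steps = candidates_per_step(["a", "b", "c", "d"])
for item, candidates in steps.items():
    print(item, candidates)
```

With k items already in the root and nothing pruned, the copied subtrie has 2^k − 1 cells, which gives the 1, 3 and 7 candidates counted for b, c and d respectively.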

We finally consider the item e. The updated trie is shown in figure 5.10. The database is scanned for the new candidates. This time, however, some of them turn out to be infrequent: (e, d, c, b), (e, d, b) and (e, d, a). Therefore their supersets (e, d, c, b, a), (e, d, c, a) and (e, d, b, a) cannot be frequent either. The corresponding parts of the trie are pruned and the resulting structure is the one from figure 5.5.
