development is, but there would be no indisputable advantages.
Now, given any tree t, its support is the same as that of $\Gamma_D(\{t\})$; knowing the closed sets of trees and their supports gives us the supports of all the subtrees. As indicated, this includes all the closed trees, but carries more information regarding their joint membership in closed sets of trees. We can compute the support of arbitrary frequent trees in the following manner, which was suggested to us by an anonymous reviewer of this paper: assume that we have the supports of all closed frequent trees, and that we are given a tree t; if it is frequent and closed, we know its support; otherwise we find the smallest closed frequent supertrees of t. Here we depart from the itemset case, because there is no uniqueness: there may be several noncomparable minimal frequent closed supertrees, but the support of t is the largest support appearing among these supertrees, due to the antimonotonicity of support.
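The lookup just described can be sketched as follows. This is only an illustrative sketch: the function name `support_from_closed`, the dictionary `closed_supports`, and the predicate `is_subtree` are assumptions introduced here, not part of the original development. It takes the maximum over all closed frequent supertrees rather than only the minimal ones; by antimonotonicity both maxima coincide.

```python
def support_from_closed(t, closed_supports, is_subtree):
    """Support of tree t, given the supports of all closed frequent trees.

    closed_supports: dict mapping each closed frequent tree to its support
                     (hypothetical container; trees must be hashable).
    is_subtree:      predicate is_subtree(s, u) -> True if s is a subtree of u
                     (supplied by the caller; any subtree notion may be used).
    Returns None when t has no closed frequent supertree, i.e. t is not frequent.
    """
    if t in closed_supports:
        # t is itself closed and frequent: its support is stored directly.
        return closed_supports[t]
    # Every closed supertree lies above some minimal closed supertree whose
    # support is at least as large, so the maximum over all closed supertrees
    # equals the maximum over the minimal ones.
    candidates = [s for c, s in closed_supports.items() if is_subtree(t, c)]
    return max(candidates, default=None)
```

Scanning every closed tree is the simplest choice; an implementation that stores the closure lattice explicitly could restrict the search to the minimal supertrees.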
For further illustration, we exhibit here, additionally, a toy example of the closure lattice for a simple dataset consisting of six trees, thus providing additional hints on our notion of intersection; these trees were not made up for the example, but were instead obtained through six different (rather arbitrary) random seeds of the synthetic tree generator of Zaki [Zak02].
The figure depicts the closed sets obtained. It is interesting to note that all the intersections came up to a single tree, a fact that suggests that the exponential blow-up of the intersection sets, which is possible as explained in Section 7.2.1, appears infrequently enough; see Section 7.8.2 for empirical validation.
Of course, the common intersection of the whole dataset is (at least) a “pole” whose length is the minimal height of the data trees.
7.4 Level Representations
The development so far is independent of the way in which the trees are represented. The reduction of a tree representation to a (frequently augmented) sequential representation has always been a source of ideas, already discussed in depth in Knuth [Knu97, Knu05]. We use here a specific data structure [NU03, BH80, AAUN03, NK03] to implement trees that leads to a particularly streamlined implementation of the closure-based mining algorithms.
Figure 7.6: Lattice of closed trees for the six input trees in the top row
We will represent each tree as a sequence over a countably infinite alphabet, namely, the set of natural numbers; we will concentrate on a specific language, whose strings exhibit a very constrained growth pattern. Some simple operations on strings of natural numbers are:
• $|x|$: the length of x.
• $x \cdot y$: the sequence obtained as concatenation of x and y.
• $x + i$: the sequence obtained by adding i to each component of x; we denote by $x^+$ the sequence $x + 1$.
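As a minimal sketch, these three operations map directly onto Python tuples. The helper name `shift` is an assumption made for illustration; length and concatenation are the built-in `len` and `+`.

```python
def shift(x, i=1):
    """Return x + i: the sequence obtained by adding i to each component of x.

    With the default i = 1 this is the sequence written x^+ in the text.
    """
    return tuple(v + i for v in x)


x = (0, 1, 2)
y = (0, 1)

length = len(x)                   # |x|
concat = x + y                    # x . y (tuple concatenation)
composed = (0,) + shift(x) + shift(y)   # (0) . x^+ . y^+
```

Tuples are used rather than lists so that sequences are immutable and hashable, and so that Python's built-in tuple comparison gives the lexicographical order.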
We will apply to our sequences the common terminology for strings: the term subsequence will be used in the same sense as substring; in the same way, we will also refer to prefixes and suffixes. Also, we can apply lexicographical comparisons to our sequences.
The language we are interested in is formed by sequences which never “jump up”: each value either decreases with respect to the previous one, or stays equal, or increases by only one unit. Sequences of this kind will be used to describe trees.
Definition 5. A level sequence or depth sequence is a sequence $(x_1, \ldots, x_n)$ of natural numbers such that $x_1 = 0$ and each subsequent number $x_{i+1}$ belongs to the range $1 \le x_{i+1} \le x_i + 1$.
For example, $x = (0, 1, 2, 3, 1, 2)$ is a level sequence that satisfies $|x| = 6$ and $x = (0) \cdot (0, 1, 2)^+ \cdot (0, 1)^+$. Now we are ready to represent trees by means of level sequences.
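The condition on level sequences can be checked mechanically. The sketch below assumes that the range in Definition 5 is 1 ≤ x_{i+1} ≤ x_i + 1, which matches the "never jump up" description and the fact that only the first position sits at level 0; the function name `is_level_sequence` is introduced here for illustration.

```python
def is_level_sequence(x):
    """Check whether the tuple x is a level (depth) sequence.

    Assumed condition: x[0] == 0 and every later value b following a value a
    satisfies 1 <= b <= a + 1 (it may drop, stay equal, or rise by one, but
    never returns to level 0).
    """
    if len(x) == 0 or x[0] != 0:
        return False
    return all(1 <= b <= a + 1 for a, b in zip(x, x[1:]))
```

For instance, the example sequence (0, 1, 2, 3, 1, 2) passes the check, while (0, 2) fails because it jumps up by two.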
Definition 6. We define a function $\langle \cdot \rangle$ from the set of ordered trees to the set of level sequences as follows. Let t be an ordered tree. If t is a single node, then $\langle t \rangle = (0)$. Otherwise, if t is composed of the trees $t_1, \ldots, t_k$ joined to a common root r (where the ordering $t_1, \ldots, t_k$ is the same as that of the children of r), then
$$\langle t \rangle = (0) \cdot \langle t_1 \rangle^+ \cdot \langle t_2 \rangle^+ \cdots \langle t_k \rangle^+$$
Here we will say that $\langle t \rangle$ is the level representation of t.
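The recursive definition translates almost literally into code. In the sketch below an ordered tree is modelled as a nested Python list of its children (so `[]` is a single node); this representation and the names `encode` and `decode` are assumptions made for illustration. The inverse `decode` is included only to show that a valid level sequence determines the tree.

```python
def encode(tree):
    """Level representation of an ordered tree given as a nested list of children."""
    seq = (0,)                                   # the root sits at level 0
    for child in tree:
        # Append <child>^+ : the child's level sequence with every entry raised by one.
        seq += tuple(v + 1 for v in encode(child))
    return seq


def decode(x):
    """Rebuild the nested-list tree from a valid level sequence x."""
    root = []
    path = [root]            # path[d] is the most recent node seen at level d
    for level in x[1:]:
        node = []
        del path[level:]     # keep only the ancestors, at levels 0 .. level-1
        path[-1].append(node)
        path.append(node)
    return root


example = [[[], [[]]], []]   # root with two children; the first has a leaf and a one-child node
sequence = encode(example)
```

`decode` assumes its input is a valid level sequence and does not validate it.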
Note the role of the previous definition:
Proposition 2. Level sequences are exactly the sequences of the form $\langle t \rangle$ for ordered, unranked trees t.
That is, our encoding is a bijection between the ordered trees and the level sequences. This encoding $\langle t \rangle$ basically corresponds to a preorder traversal of t, where each number of the sequence represents the level of the current node in the traversal. As an example, the level representation of the tree
is the level sequence (0, 1, 2, 2, 3, 1). Note that, for example, the subsequence (1, 2, 2, 3) corresponds to the bottom-up subtree rooted at the left son of the root (recall that our subsequences are adjacent). We can state this fact in general.
Proposition 3. Let $x = \langle t \rangle$, where t is an ordered tree. Then, t has a bottom-up subtree r at level $d > 0$ if and only if $\langle r \rangle + d$ is a subsequence of x.
Proof. We prove it by induction on d. If $d = 1$, then $\langle r \rangle + d = \langle r \rangle^+$ and the property holds by the recursive definition of level representation.
For the induction step, let $d > 1$. To show one direction, suppose that r is a bottom-up subtree of t at level d. Then, r must be a bottom-up subtree of one of the bottom-up subtrees corresponding to the children of the root of t. Let $t'$ be the bottom-up subtree at level 1 that contains r. Since r is at level $d - 1$ in $t'$, the induction hypothesis states that $\langle r \rangle + (d - 1)$ is a subsequence of $\langle t' \rangle$. But $\langle t' \rangle^+$ is also, by definition, a subsequence of x.
Combining both facts, we get that $\langle r \rangle + d$ is a subsequence of x, as desired. The argument also works in the contrary direction, and we get the equivalence.
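The correspondence between bottom-up subtrees and adjacent subsequences can be illustrated with a small sketch. It assumes that a bottom-up subtree consists of a node together with all of its descendants, so that in the level sequence it occupies the run of positions following the node whose levels are strictly greater than the node's own level. The function name `bottom_up_subtree` and the 0-based position argument are hypothetical.

```python
def bottom_up_subtree(x, i):
    """Level sequence of the bottom-up subtree rooted at 0-based position i of x.

    The subtree occupies positions i .. j-1, where j is the first position
    after i whose level is not greater than x[i]. Subtracting x[i] (the level
    d of the subtree's root) turns the slice <r> + d back into <r>.
    """
    base = x[i]
    j = i + 1
    while j < len(x) and x[j] > base:
        j += 1
    return tuple(v - base for v in x[i:j])
```

On the sequence (0, 1, 2, 2, 3, 1), position 1 is the left son of the root; the slice found is (1, 2, 2, 3), and shifting it down by one gives the level representation of that subtree.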
7.4.1 Subtree Testing in Ordered Trees
Top-down subtree testing of two ordered trees can be obtained by performing a simultaneous preorder traversal of the two trees [Val02]. This algorithm is shown in Figure 7.7. There, $pos_t$ traverses sequentially the level representation of tree t and $pos_{st}$ similarly traverses the purported subtree st. The natural number found in the level representation of t at position $pos_t$ is exactly level(t, $pos_t$).
Suppose we are given the trees st and t, and we would like to know if st is a subtree of t. Our method begins visiting the first node in tree t and the first node in tree st. While we are not visiting the end of any tree,
• If the level of the tree t node is greater than the level of the tree st node, then we visit the next node in tree t.
• If the level of the tree st node is greater than the level of the tree t node, then we backtrack to the last node in tree st that has the same level as the tree t node.
• If the levels of the two nodes are equal, then we visit the next node in tree t and the next node in tree st.
If we reach the end of tree st, then st is a subtree of tree t.
ORDERED SUBTREE(st, t)
Input: A tree st, a tree t.
Output: true if st is a subtree of t.
    pos_st = 1
    pos_t = 1
    while pos_st ≤ SIZE(st) and pos_t ≤ SIZE(t)
        do if level(st, pos_st) > level(t, pos_t)
               then while level(st, pos_st) ≠ level(t, pos_t)
                        do pos_st = pos_st − 1
           if level(st, pos_st) = level(t, pos_t)
               then pos_st = pos_st + 1
           pos_t = pos_t + 1
    return pos_st > SIZE(st)
Figure 7.7: The Ordered Subtree test algorithm
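The pseudocode of Figure 7.7 can be transcribed into Python as a minimal sketch, assuming both trees are given as tuples holding valid level sequences. Positions are 0-based here, whereas the figure counts from 1, so the loop bounds and the final comparison are adjusted accordingly; the function name `ordered_subtree` is chosen for illustration.

```python
def ordered_subtree(st, t):
    """Return True if st is a top-down ordered subtree of t.

    st and t are level sequences (tuples starting with 0).
    """
    pos_st = 0
    pos_t = 0
    while pos_st < len(st) and pos_t < len(t):
        if st[pos_st] > t[pos_t]:
            # Backtrack in st to the last node at the level of the current t node.
            while st[pos_st] != t[pos_t]:
                pos_st -= 1
        if st[pos_st] == t[pos_t]:
            # Levels agree: this node of st is matched, move on in st.
            pos_st += 1
        # In every case the traversal of t advances by one node.
        pos_t += 1
    # st is a subtree exactly when all of its nodes have been matched.
    return pos_st >= len(st)
```

For example, with st = (0, 1, 2) and t = (0, 1, 1, 2), the traversal backtracks once in st when it meets the second child of the root of t, then matches the remaining node.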
The running time of the algorithm is clearly quadratic, since for each node of tree t it may visit all nodes in tree st. An incremental version of this algorithm follows easily, as explained in the next section.