2. Preliminaries
solved in incremental polynomial time, and not to show that polynomial delay pattern mining is not possible. Furthermore, even for very simple graph classes, polynomial delay exact frequent subtree mining is most likely very difficult to achieve.
In the remainder of the thesis we hence focus on a relaxation of the frequent subtree mining problem that allows us to easily guarantee polynomial delay by giving up the demand on completeness of the output. As a result of one of our algorithms for this relaxed problem, however, we will discuss an exact algorithm for a novel graph class in Section 5.3. The open problem for cactus graph transactions formulated above shows the significance of this result. We discuss this connection in Section 5.4.
2.3. Embedding Computation
Once we have found the set of frequent subgraphs or subtrees of a given graph database, we are usually interested in doing something with them. First, we could use the patterns directly, for example by manually inspecting them to gain knowledge about the dataset. Another useful application is to use the patterns as a representation language for graphs in the input dataset and, more generally, graphs drawn from the same or a similar distribution. A common way of defining the similarity between two graphs is to compute some similarity measure of their images in the Hamming cube $\{0, 1\}^{|F|}$ spanned by the elements of the set of frequent patterns $F$. The binary feature vectors can then be regarded as the incidence vectors of subsets of $F$. Given a fixed set $F$ of frequent patterns, we define a feature space as the power set $2^F$ of $F$ and a feature map as $f_F \colon G \mapsto \{H \in F : H \preceq G\}$. The image of $G$ under $f_F$ is called the embedding of $G$. For the experimental evaluation of the frequent subtree miners we develop in this thesis, we will often use the corresponding embeddings of graphs, equipped with a suitable metric or kernel function for metric learning. Formally, the task of embedding computation in the context of frequent subtree mining is defined as follows:
Tree Embedding Computation (TEC) Problem: Given a graph G and a finite set F of trees, list all trees P ∈ F that are subgraph isomorphic to G.
Note that the TEC problem is a special case of the FCSM problem for finite P = F, D = {G}, and t = 1 if F is closed downwards with respect to subgraph isomorphism. Hence, we can apply a variant of Algorithm 2.1 to solve the problem, as it is possible to decide whether a tree H is contained in a finite set of trees in linear time in the size of H (cf. Section 2.1). However, we discuss more efficient alternatives in Chapter 6.
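To make the feature map concrete, the following is a minimal sketch of a brute-force solution to the TEC problem. It is not the variant of Algorithm 2.1 mentioned above, nor one of the more efficient methods of Chapter 6. It assumes vertex-labeled, undirected graphs without edge labels, and the names `Graph`, `adjacency`, `is_subgraph_isomorphic`, and `embedding` are hypothetical helpers introduced only for this illustration. The backtracking search takes exponential time in the worst case and is meant for tiny examples only.

```python
from typing import Dict, List, Set, Tuple

# Assumed representation: (vertex -> label, set of undirected edges).
Graph = Tuple[Dict[int, str], Set[Tuple[int, int]]]


def adjacency(graph: Graph) -> Dict[int, Set[int]]:
    """Build neighbor sets from the undirected edge set."""
    labels, edges = graph
    adj: Dict[int, Set[int]] = {v: set() for v in labels}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def is_subgraph_isomorphic(pattern: Graph, graph: Graph) -> bool:
    """Search for an injective, label-preserving map from the pattern's
    vertices to the graph's vertices that maps every pattern edge to an edge."""
    p_labels, g_labels = pattern[0], graph[0]
    p_adj, g_adj = adjacency(pattern), adjacency(graph)
    order = list(p_labels)

    def extend(i: int, mapping: Dict[int, int]) -> bool:
        if i == len(order):
            return True
        u = order[i]
        for x in g_labels:
            if x in mapping.values() or g_labels[x] != p_labels[u]:
                continue
            # Every already-mapped pattern neighbor of u must be adjacent to x.
            if all(mapping[w] in g_adj[x] for w in p_adj[u] if w in mapping):
                mapping[u] = x
                if extend(i + 1, mapping):
                    return True
                del mapping[u]
        return False

    return extend(0, {})


def embedding(graph: Graph, patterns: List[Graph]) -> List[int]:
    """Incidence vector of {H in F : H is subgraph isomorphic to G}."""
    return [int(is_subgraph_isomorphic(p, graph)) for p in patterns]


# Toy example: G is the labeled path C - C - O.
G: Graph = ({0: "C", 1: "C", 2: "O"}, {(0, 1), (1, 2)})
F: List[Graph] = [
    ({0: "C"}, set()),                               # single C vertex
    ({0: "C", 1: "O"}, {(0, 1)}),                    # edge C - O
    ({0: "O", 1: "O"}, {(0, 1)}),                    # edge O - O
    ({0: "C", 1: "C", 2: "C"}, {(0, 1), (1, 2)}),    # path C - C - C
]
print(embedding(G, F))  # [1, 1, 0, 0]
```

The resulting list is the binary feature vector of G in the Hamming cube spanned by F; the positions with value 1 identify the trees listed by the TEC problem.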
Jaccard Similarity
One similarity function that we investigate in this thesis on the embedding vectors discussed above is the Jaccard similarity. Given two binary feature vectors $\vec{f}_1$ and $\vec{f}_2$ representing the sets $S_1$ and $S_2$, respectively, their Jaccard similarity is defined by
\[ \mathrm{Sim}_{\mathrm{Jaccard}}(\vec{f}_1, \vec{f}_2) := \mathrm{Sim}_{\mathrm{Jaccard}}(S_1, S_2) = \frac{|S_1 \cap S_2|}{|S_1 \cup S_2|} \]
with $\mathrm{Sim}_{\mathrm{Jaccard}}(\emptyset, \emptyset) := 0$ for the degenerate case. As long as the feature vectors are low dimensional (i.e., $|F|$ is small), the Jaccard similarity can be calculated quickly. If, however, they are high dimensional, it can be approximated by the following fast probabilistic technique based on min-hashing (Broder, 1997): For a permutation $\pi$ of $F$ and a feature vector $\vec{f}$, define $h_\pi(\vec{f})$ to be the index of the first entry with value 1 in the permuted order of $\vec{f}$. One can show that the following correspondence holds for the feature vectors $\vec{f}_1$ and $\vec{f}_2$ above (see Broder, 1997, for the details):
\[ \mathrm{Sim}_{\mathrm{Jaccard}}(S_1, S_2) = P\left[ h_\pi(\vec{f}_1) = h_\pi(\vec{f}_2) \right], \]
where the probability $P$ is taken by selecting $\pi$ uniformly at random from the set of all permutations of $F$. This allows for the following approximation of the Jaccard similarity between $\vec{f}_1$ and $\vec{f}_2$: Generate a set $\pi_1, \ldots, \pi_K$ of permutations of the feature set uniformly at random and return $K'/K$, where $K'$ is the number of permutations $\pi_i$ with $h_{\pi_i}(\vec{f}_1) = h_{\pi_i}(\vec{f}_2)$. The approximation of the Jaccard similarity with min-hashing results in a fast algorithm if the embedding into the feature space can be computed quickly.
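The exact Jaccard similarity and its min-hash estimate $K'/K$ can be sketched as follows. This is an illustrative sketch, not the implementation used in the thesis. It assumes that features are identified by the integers 0, ..., |F| - 1 and that an embedding is given as the set of indices with value 1. The function names and the `seed` parameter are hypothetical. The sketch draws explicit permutations, which costs time linear in |F| per permutation and is therefore only meant to mirror the definition.

```python
import random
from typing import Set


def jaccard(s1: Set[int], s2: Set[int]) -> float:
    """Exact Jaccard similarity, with Sim(empty, empty) := 0 as in the text."""
    union = s1 | s2
    if not union:
        return 0.0
    return len(s1 & s2) / len(union)


def minhash_jaccard(s1: Set[int], s2: Set[int], num_features: int,
                    k: int, seed: int = 0) -> float:
    """Estimate the Jaccard similarity as K'/K over k random permutations."""
    if not s1 or not s2:
        # h_pi is undefined for an all-zero vector; the similarity is 0 here
        # (by the convention above if both sets are empty).
        return 0.0
    rng = random.Random(seed)
    matches = 0
    for _ in range(k):
        # rank[i] is the position of feature i in the permuted order.
        rank = list(range(num_features))
        rng.shuffle(rank)
        # h_pi(f): the feature that comes first in the permuted order.
        h1 = min(s1, key=rank.__getitem__)
        h2 = min(s2, key=rank.__getitem__)
        if h1 == h2:
            matches += 1
    return matches / k


s1, s2 = {0, 1, 2, 3}, {2, 3, 4, 5}
print(jaccard(s1, s2))                                   # exact value 2/6
print(minhash_jaccard(s1, s2, num_features=6, k=2000))   # estimate near 1/3
```

Two sets share their min-hash value exactly when the first element of $S_1 \cup S_2$ in the permuted order lies in $S_1 \cap S_2$, which is why the fraction of matching permutations estimates the ratio above. The standard deviation of the estimate is $\sqrt{p(1-p)/K}$ for true similarity $p$, so larger $K$ trades running time for accuracy.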
2.4. Datasets
Any general frequent subgraph mining algorithm is expected to process a broad spectrum of graph databases. Most empirical evaluations, however, concentrate on some particular type of graph data, mostly representing small molecules. These graphs share certain properties, e.g., sparsity, small vertex degree, near planarity, and, in particular, a natural set of frequent patterns corresponding to functional groups. While all these properties (especially the last one) motivate frequent subgraph mining in the first place, it is also important to observe the behavior of a mining technique on data that may or may not have such properties. We therefore conducted experiments on molecular, social, and artificial datasets. Table 2.1 gives an overview of key statistics of these datasets. Below we briefly describe their semantics and how we obtained them.
MUTAG (Debnath et al., 1991) is a dataset of 188 connected compounds labeled according to their mutagenic effect on Salmonella typhimurium. On average, each graph has 20 vertices and 22 edges.
PTC contains 344 connected molecular graphs, labeled according to the carcinogenicity in mice and rats. The graphs have 26 vertices and edges on average. The dataset was released as part of the Predictive Toxicology Challenge (see https://www.predictive-toxicology.org/ptc) held in 2000 and 2001.
NCI1, NCI109 (Wale et al., 2008) consist of 4 110 (resp. 4 127) compounds of which 3 530 (resp. 3 519) are connected. Both are balanced sets of chemical molecules labeled according to their activity against non-small cell lung cancer (resp. ovarian cancer) cell lines. The average number of vertices is 30, the average number of edges is 32 in both datasets.