• No results found

5.3 Computing a maximum common subgraph of two outerplanar graphs

5.3.5 Time complexity

In this section, we outline the proof for the following theorem:

Theorem 5.1 The proposed algorithm computing an MCS under BBP subgraph isomorphism of two given outerplanar graphs in time O(|V (G)|5/2.

|V (H)|5/2).

Proof sketch: The result follows from counting the number of relevant sub- graphs of the outerplanar graphs to consider and using known bounds on the running times of the algorithms used in the dynamic programming step.

Consider the algorithm as explained in Sect. 5.3.3. We start with an optimisa- tion note. If ComputeWeight is called multiple times with the same arguments, we can avoid doing the computation again by remembering the result.

Let us first consider the time spent in MatchBlockPreservingSubgraph (Alg. 5.2). A call to ComputeWeight can be performed in constant time ex- cept if x and y refer to blocks. Line ?? is executed at most once for every or- dered pair (r, r0) and (s, s0), so for function ComputeWeight we can account

O(|V (G)|.|V (H)|) time.

There are at most O(|V (G)|.∆G) elements in BP S(G) where ∆G is the maxi-

mal degree of a vertex of G. MatchBlockPreservingSubgraph is thus called O(|V (G)|.∆G.|V (H)|.∆H) times. Every call to MatchBlockPreservingSub-

graph costs (except for the time spent in ComputeWeight discussed above) at most the time to solve a weighted maximal matching problem, which can happen in cubic time [Munkres, 1957]. Therefore, a rough bound on the time spent in MatchBlockPreservingSubgraph is O(|V (G)|5/2.∆

5.4 Related work 123 In function MatchBlockSplittingSubgraph (Alg. 5.3), the for loop is executed at most four times for every value of vG

b, vGm, veG, vbH, vHm and vHe .

Therefore, the time spent in MatchBlockSplittingSubgraph is bounded by O(|V (G)|3.|V (H)|3).

The above arguments show that there is an algorithm running in time bounded by a polynomial of degree 5.

5.4

Related work

There exist a number of approaches that compute MCSs of graphs and there are a lot of differences in the way they handle the NP-completeness of the problem. For example, certain algorithms solve the MCS problem optimally, while others do it approximately. Moreover, there exist slightly different notions of an MCS e.g., whether it is connected or whether it is subgraph-induced. We will first discuss these different notions.

First, the computed MCS can be connected or unconnected. When computing an unconnected MCS, a trade-off has to be made between the number of connected fragments in the MCS and its usefulness as a pattern. If too many connected components are allowed, the pattern becomes too general. If the number is too small, the pattern may be too specific. For example, when two molecules share two benzene rings but differ with respect to the connection between the rings, only one of these rings can appear in a connected MCS. Despite this potential drawback, we focus on finding connected MCSs here.

A second distinction is made between the maximum common induced subgraph (MCIS) and the maximum common edge subgraph (MCES). The MCIS requires the common subgraph to be induced, in the sense that if two vertices in the first graph are linked by an edge, they are mapped to two vertices in the second graph that are linked by an edge as well.3 The MCES, which does not have

this requirement, simply tries to maximise the number of edges in the common subgraph. Chemists have argued that the MCES more adequately expresses the notion of chemical similarity than does the MCIS, since the bonded interactions between atoms are most responsible for the perceived molecular activity [Raymond and Willett, 2002b]. An MCS that is computed between outerplanar graphs under the BBP subgraph isomorphism corresponds to neither of these notions, but tries to maximise the edges given that only bridges are mapped to bridges and block edges to block edges.

Here we also introduce a third distinction, which is the kind of subgraph iso- morphism used in the matching. Usually, the general subgraph isomorphism is used to compute the MCS, while we propose to use the BBP subgraph isomor- phism instead.

3The definition of an induced subgraph G of a graph H requires that for every u, v ∈ V (G): {u, v} ∈ E(G) iff {ϕ(u), ϕ(v)} ∈ E(H).

An overview of MCS algorithms has been given by Raymond and Willett [2002b] and Conte et al. [2004]. McGregor [1982] proposes a simple backtracking strategy that uses various heuristics in order to reduce the number of backtrack- ings. Akutsu [1993] proposes a polynomial algorithm for computing the connected MCES of “almost trees of bounded degree” under the general subgraph isomor- phism. This is another class of graphs which is suited for the representation of molecules. Koch [2001] introduces an algorithm for enumerating all connected MCESs in two graphs, based on a transformation of the two graphs into an as- sociation graph. The MCS problem is then transformed into a clique detection problem. Raymond et al. [2002] use the same problem transformation and pro- pose an exact multi-step algorithm which defines a similarity based on computing the unconnected MCES. The problem still remains theoretically NP-complete, but their algorithm makes use of advanced heuristics to reduce the number of match- ings required.

Cao et al. [2008] propose an improved backtracking algorithm that is spe- cialized for molecules and works directly on the chemical graph. Because of branch-and-bound heuristics, they can drastically reduce the search space and obtain an efficient algorithm for unconnected MCIS computation. They show that their MCS-based similarity measure outperforms traditional descriptor-based and topology-based similarity measures. In the next section, we will perform a runtime comparison with this algorithm.

Instead of using heuristics, an alternative approach to reduce the complexity could be to consider common substructures which can be computed more eas- ily, such as multisets of common vertex labels [Karunaratne and Bostr¨om, 2006]. However, an important drawback is that the more complex shared substructures are not taken into account.

Our algorithm is different from the other approaches in the sense that it com- putes a connected MCS that fulfils the requirements of the BBP subgraph isomor- phism in an exact way. The algorithm works directly on the chemical graph and does not require an advanced graph-theoretical problem transformation.

5.5

Experiments

In this section, we compare our polynomial MCS algorithm against the algorithm of Cao et al. [2008] in terms of efficiency. This algorithm computes an uncon- nected MCS under the general subgraph isomorphism (and is not bounded by the restriction to outerplanar graphs). Cao et al. [2008] make use of new heuristics in their algorithm to drastically reduce the search space compared to other MCS al- gorithms, so this algorithm should be one of the fastest available implementations of an algorithm that computes MCSs under the general subgraph isomorphism. To the best of our knowledge, other implementations of MCS algorithms are not freely available in the public domain.

5.5 Experiments 125

5.5.1

Dataset

The NCI dataset has been made publicly available by the National Cancer Institute and provides screening results for the ability of more than 70,000 compounds to suppress or inhibit the growth of a panel of 60 human tumour cell lines. Each of the compounds is described by its 2D representation, that is, the structure of their atoms and bonds. We use graphs to represent the molecules, in which the vertices are labeled with general atom types (e.g., carbon, nitrogen) and the edges are labeled single, double, triple, amide or aromatic. Hydrogen atoms are dropped. The NCI dataset can be downloaded from http://cactus.nci.nih.gov.

We have selected a subset of 50 molecules of different sizes from the NCI dataset, with the number of vertices ranging from 5 to 99 (average: 45.7) and the number of edges ranging for 4 to 107 (average: 49.4).

5.5.2

Method

For the set of 50 graphs, there are 1225 possible combinations of two graphs. For each of them, we will compute a connected MCSvwith the algorithm we proposed

in Sect. 5.3.3 and an unconnected MCS with the algorithm of Cao et al. [2008]

and measure runtimes. From now on, we will refer to these algorithms as MCSv

and MCS.

MCSv can be downloaded at http://www.cs.kuleuven.be/~dtai/PMCSFG,

while MCS, part of a package called ChemMiner, can be downloaded at http:

//biowb.ucr.edu/ChemMineV2/help/mcs.html. All experiments were run on an Intel Core2Quad 2.33 GHz processor. A time-out of 24 hours per MCS computa- tion was enforced.

5.5.3

Results

A comparison of the runtimes can be found in Fig. 5.9. Notice that a logarithmic scale is used. The figure clearly shows the exponential behaviour of MCS. For 18 of the 1225 MCS computations, MCS timed out at 24 hours; we do not show the runtimes of MCSv on these 18 comparisons either. We also report the order statistics of the computation time for both algorithms in Table 5.1. These show that MCSvcomputes an MCS in less than 0.713 seconds in 95% of the cases, while MCS needs 1643 seconds for this. However, in 204 cases, MCSwas faster than MCSv, which indicates that both algorithms are competitive when computing MCSs for smaller molecules. The graph in Fig. 5.9 confirms this.

While comparing these runtimes, we need to keep in mind that, next to the difference in subgraph isomorphism, there is also another difference between the MCS algorithms. MCScomputes an unconnected MCS, while the MCS of MCSv is connected. This factor also contributes to the improved runtimes of MCSv and

Time

(s)

|V(G)| ∗ |V(H)|

MCS

MCSv

Figure 5.9: Runtimes of MCSv and MCS (in seconds).

it would be interesting to be able to measure the impact of these two restrictions separately.

Conclusion The results of the experimental comparison between MCS and

MCSv show that MCSv outperforms MCS in terms of efficiency. This can be explained by four restrictions that are imposed by MCSv: it works with outerpla- nar graphs, makes use of the BBP subgraph isomorphism, computes the connected MCS and selects a single random one in the case there are multiple MCSs.

5.6

Conclusions

In this chapter, we have introduced an algorithm that computes an MCS of two outerplanar graphs under the BBP subgraph isomorphism. We have proved its correctness and shown that it runs in polynomial time, both theoretically and empirically.

5.6 Conclusions 127 Table 5.1: Distribution of runtimes of the two MCS algorithms for 1207 compar- isons (in seconds).

MCS MCSv Average 805.668 ± 5177.040 0.174 ± 0.282 Minimum 0.002 0.002 O.05 0.003 0.003 O.25 0.025 0.012 O.5 0.161 0.049 O.75 3.590 0.238 O.95 1643.850 0.713 Maximum 79625.900 2.951

The polynomial time complexity is realised by four restrictions. First, the algorithm only works on outerplanar graphs. Most molecular graphs, which are of most interest to us, are outerplanar. Second, the algorithm uses the BBP subgraph isomorphism as matching operator. Again, in the chemical context, this seems to be a sensible choice. Third, we only consider connected MCSs. Fourth, in the case there are multiple possible MCSs, we only select one randomly.

Finally, we have shown that our algorithm outperforms a state-of-the-art MCS algorithm that computes an unconnected MCS under the general subgraph iso- morphism.

Chapter 6

An Efficiently Computable

Metric for Outerplanar

Graphs

6.1

Introduction

In this chapter, we will present a metric that can be used on outerplanar graphs. A metric is a function which defines a distance between elements of a set. Metrics are important components of several machine learning methods. For example, they are used in classification techniques such as the k-nearest-neighbours algorithm, where examples are classified based on the class values of their nearest neighbours in the training set. They can also be used to perform clustering, where the task is to group similar examples together and where some kind of distance measure is used to define similarity. Recently, there has been an increased interest in metrics that express a similarity between structured objects. This is relevant for multiple application domains, including drug discovery [Ceroni et al., 2007], image recognition and computer vision [Shearer et al., 2001], and geographical databases [Li et al., 2005].

As mentioned in the outline of part II, the application on which we focus is the learning of structure-activity relationships. Since it is widely known that molecules with a similar structure tend to have the same function [Johnson and Maggiora, 1990], structural similarity search among small molecules is an important aspect of in silico drug development. The task then comes down to finding an appropriate similarity measure between molecules. Such a structural similarity measure should ideally fulfil two requirements: (1) it should be efficiently computable, which is im- portant when analysing large molecular databases, and (2) the notion of similarity

should discriminate well between molecules w.r.t. the activity at interest. Find- ing such similarity measures is one of the current challenges in chemoinformatics [Deshpande et al., 2005; Ceroni et al., 2007].

However, similarity measures between graphs that aim at using all available structural information often involve the matching of subgraphs or other combi- natorial operations trying to align graphs optimally. For this reason, typical ap- proaches have resorted to a transformation of molecules into vectors and compute a similarity between those vectors. In such transformations, either information is lost or the size of the resulting representation can grow exponentially [De Raedt, 1998].

Instead of resorting to a vector transformation to avoid the computational complexity, we will make use of an efficient algorithm for graphs. More specifi- cally, we will use the notion of a maximum common subgraph (MCS) to define a similarity measure between molecules. The idea behind this is that an MCS of two molecular graphs may reflect shared properties that relate to the molecular activity. In Chapter 5, we have introduced a polynomial algorithm that computes an MCS under the BBP subgraph isomorphism. In this chapter, we will show that the metric based on this algorithm has two important properties: (1) it is computable in polynomial time, and (2) as it reflects the size of the MCS, it has an intuitive meaning making results of methods such as instance-based learning interpretable. Whether the BBP subgraph isomorphism provides the right level of abstraction in order to discriminate between molecules with a different activity and how this influences the predictive performance of classification methods using the BBP subgraph isomorphism as matching operator, is an important issue that will be investigated in this chapter.

The contributions of this chapter are the following:

• We introduce an efficiently computable metric on outerplanar graphs based on the maximum common subgraph computed under the BBP subgraph isomorphism.

• We compare the metric against several other metrics, one of which is based on the MCS computed under the general subgraph isomorphism, in terms of predictive performance.

• We investigate the usefulness of the BBP subgraph isomorphism in general by performing additional experiments in the context of frequent subgraph mining.

The contents of this chapter are the result of joint work with Jan Ramon, Mau- rice Bruynooghe and Hendrik Blockeel, and was partially published in [Schietgat et al., 2008].

The chapter is organised as follows. We start by discussing related work in Sect. 6.2. In Sect. 6.3, we present the metric for outerplanar graphs. In Sect. 6.4, we answer three experimental questions and finally, in Sect. 6.5, we conclude.

6.2 Related work 131

6.2

Related work

Determining quantitative structure-activity relationships (QSAR) is the process where chemical structure is quantitatively correlated with a biological or chemical activity. Usually, a function is learned that takes as input a number of physic- ochemical and structural properties of a molecule and then returns its predicted activity. A number of learning approaches have been proposed for QSAR. Depend- ing on the representation they use, we will organise them in several categories.

Although it is known that better performance can be obtained using additional 3D information such as force field calculations [Ceroni et al., 2007], we will con- centrate on 2D-QSAR methods, that is, methods that only take into account the structural arrangement of atoms and bonds.