4.1.2 Molecular graphs

This research was conducted at ChemAxon, a company developing software solutions and services for computational chemistry and biology. Therefore, the MCES problem was studied in this domain, applied to molecular graphs. Here we discuss the basic concepts and notations related to the representation and analysis of molecular graphs, as well as the applications of the MCES problem in this context.

In cheminformatics, the typical representation of molecules is by means of molecular graphs [235, 246]. A molecular graph is defined to be a simple, undirected, labeled graph in which the nodes represent the atoms, the edges represent the chemical bonds, and the labels of the nodes and edges represent the atom types (C, N, O, etc.) and bond orders (single, double, triple, aromatic), respectively. Therefore, we usually refer to the nodes, edges, and (short) cycles of molecular graphs as atoms, bonds, and rings, respectively.
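
As a minimal sketch of this definition, a molecular graph can be stored as a list of atom types plus a dictionary from unordered atom pairs to bond orders. The class name `MolGraph` and its methods are hypothetical and chosen only for illustration; they do not correspond to any ChemAxon API.

```python
from dataclasses import dataclass, field

BOND_ORDERS = {"single", "double", "triple", "aromatic"}


@dataclass
class MolGraph:
    """Simple, undirected, labeled graph (hypothetical illustration)."""
    atoms: list = field(default_factory=list)  # atoms[i] is the atom type, e.g. "C"
    bonds: dict = field(default_factory=dict)  # frozenset({i, j}) -> bond order

    def add_atom(self, atom_type):
        """Append an atom and return its index."""
        self.atoms.append(atom_type)
        return len(self.atoms) - 1

    def add_bond(self, i, j, order="single"):
        """Add an undirected bond; reject self-loops and parallel bonds."""
        n = len(self.atoms)
        if not (0 <= i < n and 0 <= j < n):
            raise IndexError("unknown atom index")
        if i == j:
            raise ValueError("simple graph: no self-loops")
        if order not in BOND_ORDERS:
            raise ValueError("unknown bond order: %r" % order)
        key = frozenset((i, j))
        if key in self.bonds:
            raise ValueError("simple graph: no parallel bonds")
        self.bonds[key] = order

    def neighbours(self, i):
        """Sorted indices of the atoms bonded to atom i."""
        return sorted(j for key in self.bonds if i in key for j in key if j != i)


# Heavy-atom skeleton of acetic acid: C-C(=O)-O
acetic = MolGraph()
c0, c1, o2, o3 = (acetic.add_atom(t) for t in ("C", "C", "O", "O"))
acetic.add_bond(c0, c1, "single")
acetic.add_bond(c1, o2, "double")
acetic.add_bond(c1, o3, "single")
```

Keying the bonds by `frozenset` makes the graph undirected by construction: `{i, j}` and `{j, i}` are the same key.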

With respect to structural comparison and graph matching, hydrogen atoms are typically not included in molecular graphs. Only the other atoms and the bonds between them are represented explicitly, while hydrogens are assumed to fill the unused valences of the other atoms, so they are determined implicitly. This representation is not merely a simplification: it is required for the subgraph concept to be useful in practice, as demonstrated in Figure 4.1. In the following, we always assume that hydrogen atoms are excluded from molecular graphs.

[Figure 4.1, panels: (a) Molecular graphs without hydrogen atoms; (b) Molecular graphs with hydrogen atoms]

Figure 4.1: Examples of different representations of molecules. The molecular graph of ethane (a carbon chain with two carbon atoms) is a subgraph of the molecular graph of propane (a carbon chain with three carbon atoms) only if the hydrogen atoms are excluded.
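
The ethane/propane observation can be checked with a small brute-force sketch. The helper names `alkane` and `is_subgraph` are hypothetical, and the backtracking test below is a plain (non-induced) labeled subgraph check meant only for tiny inputs, not one of the algorithms discussed later.

```python
def alkane(n_carbons, with_hydrogens):
    """Return (atoms, bonds) for an unbranched alkane with n_carbons carbons."""
    atoms = ["C"] * n_carbons
    bonds = {frozenset((i, i + 1)): "single" for i in range(n_carbons - 1)}
    if with_hydrogens:
        for c in range(n_carbons):
            # Carbon has valence 4; hydrogens fill whatever the chain leaves free.
            carbon_neighbours = (c > 0) + (c < n_carbons - 1)
            for _ in range(4 - carbon_neighbours):
                atoms.append("H")
                bonds[frozenset((c, len(atoms) - 1))] = "single"
    return atoms, bonds


def is_subgraph(small, large):
    """True if `small` maps injectively into `large`, preserving atom types
    and every bond of `small` together with its bond order."""
    s_atoms, s_bonds = small
    l_atoms, l_bonds = large

    def extend(mapping):
        i = len(mapping)  # next atom of `small` to place
        if i == len(s_atoms):
            return True
        for cand in range(len(l_atoms)):
            if cand in mapping or l_atoms[cand] != s_atoms[i]:
                continue
            # Every bond from atom i to an already placed atom must exist
            # in `large` with the same order.
            ok = all(
                l_bonds.get(frozenset((mapping[j], cand))) == s_bonds[frozenset((j, i))]
                for j in range(i)
                if frozenset((j, i)) in s_bonds
            )
            if ok and extend(mapping + [cand]):
                return True
        return False

    return extend([])


print(is_subgraph(alkane(2, False), alkane(3, False)))  # True
print(is_subgraph(alkane(2, True), alkane(3, True)))    # False
```

With explicit hydrogens the check fails because each carbon of ethane carries three hydrogens, whereas every carbon-carbon bond of propane involves the middle carbon, which carries only two.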

Figure 4.2 shows a typical example of depicting molecular graphs. According to the conventions of chemical drawing, labels of carbon atoms (C) are not displayed in the following. Single, double, and triple chemical bonds are depicted as single, double, and triple lines, respectively. Aromatic rings are indicated with a circle inside the ring (which usually contains 5 or 6 bonds and is depicted as a pentagon or hexagon, respectively). For our study, however, aromatic bonds are simply edges whose label is distinct from the other bond types (single, double, and triple).

[Figure 4.2, annotations: Hydrogen atoms are not represented; Aromatic rings are depicted this way; Labels of carbon atoms are not displayed]

Figure 4.2: Example of displaying molecular graphs.

Finding maximum common subgraphs, by means of either the MCIS or the MCES problem, plays an important role in a wide range of applications in the field of computational chemistry and, increasingly, biology [90]. A typical application is similarity search, which is essential in the early stages of the drug discovery process [76, 245, 246, 247].

Intuitive measures of the similarity of two molecular graphs can be defined on the basis of the size of their maximum common subgraph [49, 145, 204, 205]. Other applications include clustering [38, 110, 224, 225], NMR spectral studies [56], design of chemical space networks [252], molecular alignment [152, 174], and reaction mapping [93, 176, 234].
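
The text does not say which similarity measure is meant. As one possible illustration, the sketch below assumes a Tanimoto-style ratio over bond counts, that is, the size of the common subgraph divided by the size of the "union" of the two graphs. The function name `mces_tanimoto` is hypothetical.

```python
def mces_tanimoto(bonds_g1, bonds_g2, bonds_common):
    """Tanimoto-style similarity from bond counts (assumed measure).

    bonds_g1, bonds_g2: number of bonds in the two molecular graphs
    bonds_common:       number of bonds in their maximum common subgraph
    """
    if min(bonds_g1, bonds_g2, bonds_common) < 0:
        raise ValueError("bond counts must be non-negative")
    if bonds_common > min(bonds_g1, bonds_g2):
        raise ValueError("common subgraph cannot exceed either input graph")
    union = bonds_g1 + bonds_g2 - bonds_common
    if union == 0:
        raise ValueError("similarity is undefined for two bond-free graphs")
    return bonds_common / union


# Ethane (1 bond) vs. propane (2 bonds), hydrogens excluded; common subgraph: 1 bond.
print(mces_tanimoto(1, 2, 1))  # 0.5
```

The value is 1 exactly when the common subgraph covers all bonds of both graphs and 0 when the graphs share no bond.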

In the cheminformatics literature, this problem is usually referred to as maximum common substructure (MCS) search or the maximum overlapping set (MOS) problem, and it is defined to be either the MCIS or the MCES problem. Although several algorithms solve the MCIS problem (e.g., [53, 151]), the MCES concept is considered to be more relevant in the case of molecular graphs since it is closer to the intuitive notion of chemical similarity [83, 176, 206]. Figure 4.3 illustrates the difference.

In some applications, connected common subgraphs are to be found. Figure 4.4 illustrates the difference between the connected and disconnected MCES of two molecular graphs. As this example shows, the connected MCES can be larger than the largest connected component of the disconnected MCES. Intermediate approaches have also been developed that apply constraints on the number of components or on their topological relationships [151]. Another interesting approach is to find connected common subgraphs tolerating a few mismatches in atom and bond types [241].
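
Whether a given common subgraph is connected, or how many components it has, can be determined directly from its bond set. The following is a minimal union-find sketch; the function name `count_components` and the representation of bonds as pairs of atom indices are assumptions made for illustration.

```python
def count_components(bonds):
    """Number of connected components of the subgraph formed by `bonds`.

    bonds: iterable of (atom_index, atom_index) pairs.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in bonds:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    return len({find(x) for x in list(parent)})


common = [(0, 1), (1, 2), (4, 5)]    # two separate fragments
print(count_components(common))      # 2
print(count_components(common[:2]))  # 1
```

A connected MCES search would accept only bond sets with exactly one component, while an intermediate approach could bound the returned count instead.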

[Figure 4.3, panels: (a) MCIS; (b) MCES]

Figure 4.3: Comparison of the MCIS and MCES of two molecular graphs.

[Figure 4.4, panels: (a) Connected MCES; (b) Disconnected MCES]

Figure 4.4: Comparison of the connected and the disconnected MCES of two molecular graphs.

A common subgraph of the input graphs G1 and G2 is usually represented as a partial mapping from the bonds of G1 to the bonds of G2. That is, a bond of G2 is assigned to each bond of a subgraph of G1 so that associated bonds represent the same bond in the common subgraph. Not only is this mapping practical for the algorithms, but it is also required in numerous applications, as it defines a correspondence between equivalent (isomorphic) parts of the two molecular graphs. (In the case of the MCIS problem, the mapping is composed of atom pairs instead of bond pairs.) Note that multiple mappings may define isomorphic common subgraphs. Later, we elaborate upon the importance of the mapping considering other features apart from the common subgraph it defines.
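
To make the two kinds of mapping concrete, the sketch below derives a bond mapping from an injective atom mapping and separately tests whether that atom mapping would also be acceptable for the induced (MCIS) variant. The names `bond_mapping` and `is_induced`, and the `(atoms, bonds)` tuple representation, are hypothetical and used only for illustration.

```python
def bond_mapping(g1, g2, atom_map):
    """Bonds of g1 whose image under atom_map is a bond of g2 with the same order.

    g1, g2:   (atoms, bonds) with bonds as {frozenset({i, j}): order}
    atom_map: dict from atom indices of g1 to atom indices of g2
    """
    atoms1, bonds1 = g1
    atoms2, bonds2 = g2
    if len(set(atom_map.values())) != len(atom_map):
        raise ValueError("atom mapping must be injective")
    for a1, a2 in atom_map.items():
        if atoms1[a1] != atoms2[a2]:
            raise ValueError("mapped atoms must have the same type")
    mapping = {}
    for bond1, order1 in bonds1.items():
        u, v = tuple(bond1)
        if u in atom_map and v in atom_map:
            bond2 = frozenset((atom_map[u], atom_map[v]))
            if bonds2.get(bond2) == order1:
                mapping[bond1] = bond2
    return mapping


def is_induced(g1, g2, atom_map):
    """True if every pair of mapped atoms is bonded identically in g1 and g2."""
    _, bonds1 = g1
    _, bonds2 = g2
    mapped = list(atom_map)
    for idx, u in enumerate(mapped):
        for v in mapped[idx + 1:]:
            image = frozenset((atom_map[u], atom_map[v]))
            if bonds1.get(frozenset((u, v))) != bonds2.get(image):
                return False
    return True


def bond(i, j):
    return frozenset((i, j))


# Three-membered carbon ring vs. three-carbon chain (hydrogens excluded).
ring = (["C", "C", "C"], {bond(0, 1): "single", bond(1, 2): "single", bond(0, 2): "single"})
chain = (["C", "C", "C"], {bond(0, 1): "single", bond(1, 2): "single"})
atom_map = {0: 0, 1: 1, 2: 2}

print(len(bond_mapping(ring, chain, atom_map)))  # 2
print(is_induced(ring, chain, atom_map))         # False
```

In this example the atom mapping yields a common edge subgraph with two bonds, but it is not a valid induced mapping because atoms 0 and 2 are bonded in the ring and not in the chain.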

Finally, we remark that finding the maximum common subgraph of more than two graphs, often called the multi-MCS problem, is also required in certain applications. Although it can be approximated using consecutive MCES or MCIS computations between pairs of graphs, more efficient methods are often desirable [71, 134]. This problem, however, is beyond the scope of this work.

4.2 Algorithms

There are many published algorithms for finding maximum common subgraphs. They can be categorized according to multiple aspects, for example, whether the algorithm is exact or heuristic, whether it solves the MCIS or the MCES variant of the problem, or whether it finds connected or disconnected common subgraphs. The applied approaches also show a wide variety, including backtracking [53, 56, 160, 175], reducing the problem to finding a maximum clique of a derived graph [86, 158, 203, 204], genetic algorithms [43, 240], and greedy heuristics [151]. Some methods also utilize the concept of reduced graphs [20, 110, 202, 227]. Notable surveys and experimental studies of various algorithms can be found in [65, 84, 85, 90, 206].

Applications of MCIS/MCES search impose strict and often conflicting requirements on the solution methods. Although the problem is NP-hard, algorithms are required to be fast, which is why heuristic methods are often applied. On the other hand, the results should be equal to, or at least very close to, the actual maximum. Moreover, in cheminformatics applications, features that make sense to a chemist but are hard to grasp formally are sometimes desired, such as taking topological features into account when parts of the input molecules are mapped to each other.

In order to satisfy these requirements, we developed two efficient heuristic algorithms for MCES search: one based on the maximum clique approach and another one based on a greedy approach called the build-up method [151]. We also devised several novel heuristics to improve both their running time and the approximation of the optimal results.

In this section, we briefly introduce the two algorithms we implemented according to the available literature, while the next section presents the improvements and extensions, which are the main contribution of this work.