Multi-Objective De Novo Drug Design with Conditional Graph Generative Model


Yibo Li, Liangren Zhang*, Zhenming Liu*

State Key Laboratory of Natural and Biomimetic Drugs, School of Pharmaceutical Sciences, Peking University, Beijing, 100191, China

Emails:

[email protected], [email protected], [email protected]

Abstract: Recently, deep generative models have emerged as a promising way of performing de novo molecule design. However, previous research has largely focused on generating SMILES strings instead of molecular graphs. Although graph generative models are available, they are often too general and computationally expensive, which restricts their application to small molecules. In this work, a new de novo molecular design framework is proposed based on a sequential graph generator. Compared with previous graph generative models, the proposed method is better tuned for molecule generation and has been scaled up to cover significantly larger molecules in the ChEMBL database. It is shown that the graph-based model produces a higher fraction of valid output structures than SMILES-based methods. For drug design tasks, a conditional graph generative model is employed. This method offers higher flexibility than previous fine-tuning based approaches and is suitable for generation based on multiple objectives. The approach is applied to several drug design problems, including generation of compounds containing a given scaffold, generation of compounds with specific drug-likeness and synthetic accessibility requirements, and generation of dual inhibitors against JNK3 and GSK3β. Results show high enrichment rates for outputs satisfying the given requirements.

Keywords: Deep Learning; De Novo Drug Design; Graph Generative Model

Introduction

The ultimate goal of drug design is the discovery of new chemical entities with desirable pharmacological properties. Achieving this goal requires medicinal chemists to search and optimize within the space of new molecules. This task has proved to be extremely difficult, mainly due to the size and complexity of the search space. It is estimated that the number of synthetically accessible molecules is around $10^{60}$ to $10^{100}$ [1]. Meanwhile, the space of chemical compounds exhibits a discontinuous structure, making searching difficult to perform.

Automated de novo molecular design aims at assisting this process with computer-based methods. Early works developed various algorithms to produce new molecular structures, such as atom-based elongation or fragment-based combination [2-3]. Those methods are often coupled with global optimization techniques such as ant colony optimization [4], genetic algorithms [5-6] or particle swarm optimization [7] for the generation of molecules with desired properties. Recent developments in deep learning [8] have shed new light on the area of de novo molecule generation. Works have shown that deep generative models are very effective at modeling the SMILES representation of molecules using recurrent neural networks (RNN), an architecture that has been extensively applied to tasks related to sequential data [9], and those models have been successfully applied to various drug design problems. One popular generation framework is the language model (LM). Olivecrona et al. used a GRU-based language model trained on the ChEMBL dataset to generate SMILES strings. The model is then fine-tuned using reinforcement learning for the generation of molecules with specific requirements. The method was applied to various molecule design problems, such as generating analogs of an existing molecule and generating compounds active against DRD2 [10]. Popova et al. proposed integrating the generative and predictive networks in the generation phase [11]. Segler et al. applied a SMILES LM to the task of generating focused molecule libraries by fine-tuning the trained network with a smaller set of molecules with desirable properties [12]. Another method is to use a variational autoencoder (VAE) [13] combined with an RNN decoder. Gómez-Bombarelli et al. implemented this method to generate drug-like compounds from the ZINC database [14]. This work aims at obtaining a bi-directional mapping between molecule space and a continuous latent space so that molecular operations can be achieved by manipulating the latent representation. The model was applied to problems such as cLogP and QED optimization. Blaschke et al. compared different VAE architectures and applied them to the task of designing active compounds against DRD2 [15].

The works described above demonstrated the effectiveness of SMILES-based models for molecule generation. However, producing valid SMILES strings requires the model to learn rules that are irrelevant to molecular structures, such as the SMILES grammar and the atom ordering used when decoding the SMILES, which adds unnecessary burden to the training process. This makes the SMILES string a less preferable representation compared to molecular graphs. Research in deep learning has recently enabled the direct generation of molecular graphs. Johnson et al. proposed a sequential generation approach for graphs [16]. Though their implementation is mainly for reasoning tasks, the framework provided is potentially applicable to molecule generation. In contrast to this approach, a more recent method generates the entire graph all at once. This model has been successfully applied to the generation of small molecular graphs [17]. The implementation most similar to ours is a recent work [18] that uses a sequential decoding scheme similar to that of Johnson et al. Decoding invariance is introduced by sampling different atom orderings from a predefined distribution. This method has been applied to the generation of molecules with fewer than 20 heavy atoms from the ChEMBL dataset. Though inspiring, the methods discussed above share a few problems. First, the generators proposed are relatively general. This allows those methods to be applied to various scenarios, but for molecule generation they are unnecessarily complex and require further optimization. Second, many of those methods suffer from scalability issues, which restricts their application to small molecules.

In this work, we propose a graph architecture that is better suited to molecule generation and is scaled to generate larger molecules in the ChEMBL dataset. A conditional version of the model is employed to solve various drug design tasks with multiple objectives, and the results demonstrate promising performance.

Methods

Graph Generative Model

The generative model used in this work follows a step-wise generation scheme that builds molecules by iteratively refining intermediate molecular graphs. Consider a molecule represented by the graph $G = (V, E)$. The building process starts from the empty graph $G_0 = (\emptyset, \emptyset)$. At step $i$, a graph transition $t_i$ is selected from the set of all available transition actions $T(G_i)$ based on the intermediate structure $G_i = (V_i, E_i)$, and $t_i$ is performed to obtain the graph structure for the next step, $G_{i+1} = t_i(G_i)$. The selection of $t_i$ is done by sampling from a probability distribution $t_i \sim p_{\boldsymbol{\theta}}(t_i|G_i)$, which is usually modeled by neural networks. In the last step, the termination operation $t^*$ is performed to end the graph generation.

The entire process is illustrated in Figure 1a. We call the mapping $T$, which determines all available graph transitions for each step, a decoding scheme. The sequence $r = ((G_0, t_0), (G_1, t_1), \ldots, (G_n, t_n))$, where $G_0 = (\emptyset, \emptyset)$, $G_n = G$, $t_n = t^*$ and $G_{i+1} = t_i(G_i)$ for $i = 0, \ldots, n-1$, is called a decoding route of $G$. The distribution $p_{\boldsymbol{\theta}}(t_i|G_i)$ is called a decoding policy.

Previous step-wise graph generative models are usually too general and are designed to be applicable to various scenarios. This means that those models are less optimized for the generation of molecular graphs. Here we offer the following optimizations to the generator:

(1) A much simpler decoding scheme $T$ is used to decrease the number of steps required for generation. The resulting scheme requires exactly $|E| + 2$ steps for the molecular graph $G = (V, E)$, which is usually much shorter than the corresponding SMILES string.

(2) No recurrent unit is used in the decoding policy $p_{\boldsymbol{\theta}}(t_i|G_i)$. The decision at each step depends only on the structure of the intermediate molecule $G_i$. This means that the atom representation $\mathbf{h}$ is not propagated to the next generation step. This helps to decrease the computational cost and increase the scalability of the model. Moreover, optimization at different steps can be easily parallelized, allowing for faster computation.

(3) Different from previous implementations [18], where either a fully randomized or a fully deterministic decoding route was used to calculate the log-likelihood loss, we use a distribution $q_{\alpha}(r|G)$ for the selection of $r$. This distribution is parametrized by $\alpha$, which controls the degree of randomness in the sampling of $r$. This implementation offers higher flexibility and is shown to improve model performance.

The following sections provide detailed discussions of the optimizations above.

Figure 1: a) A schematic representation of the step-wise graph generation procedure. Starting with the empty graph $G_0$, initialization is performed to add the first atom. At each step, a graph transition is sampled and performed on the intermediate molecule structure. Finally, the termination operation is performed to end the generation. b) Details of the operations performed at each step. For each intermediate graph $G_i$, the set of all available transition actions $T(G_i)$ is determined, which can be categorized into three types: append, connect, and terminate. The probability of performing each transition is given by $p_{\boldsymbol{\theta}}$. The final transition is sampled from $p_{\boldsymbol{\theta}}$ and performed on $G_i$ to get the next intermediate graph $G_{i+1}$. c) Architecture of the graph convolutional layer. At each layer, the output representation for atom $i$ is given by: (1) the input representation of $i$ from the previous layer, (2) representations of direct neighbors and (3) representations of distant neighbors.

Decoding Scheme

Let $A$ be the set of all allowed atom types, and $B$ be the set of all allowed bond types. The set $T(G_i)$ for intermediate graph $G_i$ is restricted to the following four types:

(1) Initialization: For the empty graph $G_0$, the only allowed action is to add the first atom of a certain type $a \in A$ to the graph.

(2) Append: This action adds a new atom of type $a \in A$ to $G_i$ and connects it to an existing atom $v \in V_i$ with a bond of type $b \in B$.

(3) Connect: This action connects two existing atoms $v_1, v_2 \in V_i$ with a bond of type $b \in B$. For simplicity, we only allow connections to start from the latest appended atom $v^*$, which means $v_1 = v^*$. The position of $v^*$ is provided to the neural network during generation.

(4) Terminate: End the generation process.

The generation process is illustrated in Figure 1b. It is easy to show that the number of steps required for the generation of graph $G$ equals exactly $|E| + 2$: one initialization step, one append or connect step for each bond, and one termination step. This value is generally much smaller than the length of the corresponding SMILES string.
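The step count can be checked with a short sketch. The code below is illustrative only and is not the authors' implementation: it assumes a toy representation (a list of atom types plus a dictionary of bonds, both hypothetical), builds one decoding route by depth-first traversal using the four action types, and closes rings with connect actions issued from the latest appended atom. Under those assumptions the route length comes out as $|E| + 2$.

```python
def decoding_route(atoms, bonds):
    """Build one decoding route for a connected molecular graph.

    atoms: list of atom-type strings, e.g. ["C", "C", "O"] (toy representation)
    bonds: dict mapping frozenset({u, v}) -> bond-type string
    Returns a list of actions: initialize, append, connect, terminate.
    """
    neighbors = {v: [] for v in range(len(atoms))}
    for pair, bond_type in bonds.items():
        u, v = sorted(pair)
        neighbors[u].append((v, bond_type))
        neighbors[v].append((u, bond_type))

    route = [("initialize", atoms[0])]
    placed = {0}   # atoms already present in the intermediate graph
    used = set()   # bonds already present in the intermediate graph

    def visit(v):
        # Connect: v is the latest appended atom, so ring-closing bonds
        # to atoms that are already placed are added from v.
        for u, bond_type in sorted(neighbors[v]):
            key = frozenset((u, v))
            if u in placed and key not in used:
                used.add(key)
                route.append(("connect", v, u, bond_type))
        # Append: grow the graph towards atoms that are not placed yet.
        for u, bond_type in sorted(neighbors[v]):
            if u not in placed:
                placed.add(u)
                used.add(frozenset((u, v)))
                route.append(("append", atoms[u], v, bond_type))
                visit(u)

    visit(0)
    route.append(("terminate",))
    return route


# Toy six-membered ring: 6 atoms and 6 bonds give
# 1 initialize + 5 append + 1 connect + 1 terminate actions.
ring_bonds = {frozenset((i, (i + 1) % 6)): "aromatic" for i in range(6)}
ring_route = decoding_route(["C"] * 6, ring_bonds)
print(len(ring_route), len(ring_bonds) + 2)
```

Every bond is added by exactly one append or connect action, which is why the two printed numbers agree for any connected input graph.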

Decoding Policy

For each step, the transition $t_i$ is sampled from the distribution $p_{\boldsymbol{\theta}}(t_i|G_i)$, which is parameterized by a graph convolutional neural network. The implementation of the network is described as follows:

(1) An atom embedding $\mathbf{h}_{\mathrm{e}}$ is first generated based on the atom type and the location of the last appended atom:

$$\mathbf{h}_{\mathrm{e}} = \mathrm{Embedding}(\mathbf{h}_{\mathrm{a}}, \mathbf{h}_{\mathrm{m}}) \tag{1}$$

where $\mathbf{h}_{\mathrm{a}}$ is a vector of length $|V_i|$ with integers indicating the atom type, and $\mathbf{h}_{\mathrm{m}}$ is a binary vector of the same size that marks the last appended atom.

(2) $\mathbf{h}_{\mathrm{e}}$ is subsequently fed to the graph convolutional network to get the final embedding $\mathbf{h}$:

$$\mathbf{h} = \mathrm{GraphConv}(G_i, \mathbf{h}_{\mathrm{e}}) \tag{2}$$

$\mathbf{h}$ is a matrix of size $|V_i| \times F_h$, where $F_h$ is the dimension of the embedding.

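(3) The embedding $\mathbf{h}$ is then used to calculate the probability of the different transitions in $T(G_i)$. The activation value for each transition is first calculated as follows:

$$\mathbf{s}^{\mathrm{append}} = \mathrm{MLP}^{\mathrm{append}}(\mathbf{h}) \tag{3}$$

$$\mathbf{s}^{\mathrm{connect}} = \mathrm{MLP}^{\mathrm{connect}}(\mathbf{h}) \tag{4}$$

$$s^{\mathrm{terminate}} = \mathrm{MLP}^{\mathrm{terminate}}(\mathrm{AvgPool}(\mathbf{h})) \tag{5}$$

$\mathrm{MLP}^{\mathrm{append}}$, $\mathrm{MLP}^{\mathrm{connect}}$ and $\mathrm{MLP}^{\mathrm{terminate}}$ are networks with fully connected layers. $\mathbf{s}^{\mathrm{append}}$ and $\mathbf{s}^{\mathrm{connect}}$ are tensors with shapes $|V_i| \times |A| \times |B|$ and $|V_i| \times |B|$, while $s^{\mathrm{terminate}}$ is a scalar.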

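Normalizing the activation values gives the probability of each type of action:

$$\mathbf{p}^{\mathrm{append}} = \exp(\mathbf{s}^{\mathrm{append}}) / s \tag{6}$$

$$\mathbf{p}^{\mathrm{connect}} = \exp(\mathbf{s}^{\mathrm{connect}}) / s \tag{7}$$

$$p^{\mathrm{terminate}} = \exp(s^{\mathrm{terminate}}) / s \tag{8}$$

where $s = \sum_{i,j,k} \exp(s^{\mathrm{append}}_{ijk}) + \sum_{i,j} \exp(s^{\mathrm{connect}}_{ij}) + \exp(s^{\mathrm{terminate}})$. $\mathbf{p}^{\mathrm{append}}$, $\mathbf{p}^{\mathrm{connect}}$ and $p^{\mathrm{terminate}}$ have the same shapes as the activations $\mathbf{s}^{\mathrm{append}}$, $\mathbf{s}^{\mathrm{connect}}$ and $s^{\mathrm{terminate}}$, and contain the probabilities of the corresponding transitions. Note that many entries in $\mathbf{p}^{\mathrm{append}}$ and $\mathbf{p}^{\mathrm{connect}}$ can be masked to zero according to rules such as valence restrictions for each atom type. To test the ability of the model to learn those rules, no constraints are enforced during model training. No recurrent unit is used in this network. In this way, the network is entirely convolutional, which is easier to parallelize and less expensive than an RNN.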
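The joint normalization over all three transition types can be sketched as follows. This is a minimal illustration using nested Python lists rather than tensors; the function name and the max-subtraction trick for numerical stability are assumptions added here, not details taken from the paper.

```python
import math


def transition_probabilities(s_append, s_connect, s_terminate):
    """Jointly normalize activations over append, connect and terminate.

    s_append:    nested list indexed [atom][new_atom_type][bond_type]
    s_connect:   nested list indexed [atom][bond_type]
    s_terminate: float
    Returns probabilities with the same shapes as the inputs.
    """
    flat = [x for per_atom in s_append for per_type in per_atom for x in per_type]
    flat += [x for per_atom in s_connect for x in per_atom]
    flat.append(s_terminate)

    shift = max(flat)  # subtracting the maximum leaves the ratios unchanged
    s = sum(math.exp(x - shift) for x in flat)  # shared normalizer

    p_append = [[[math.exp(x - shift) / s for x in per_type] for per_type in per_atom]
                for per_atom in s_append]
    p_connect = [[math.exp(x - shift) / s for x in per_atom] for per_atom in s_connect]
    p_terminate = math.exp(s_terminate - shift) / s
    return p_append, p_connect, p_terminate


# Two atoms, one new-atom type, two bond types for append; one bond type for connect.
p_a, p_c, p_t = transition_probabilities(
    [[[0.0, 1.0]], [[2.0, 0.0]]], [[0.0], [1.0]], -1.0)
total = sum(x for a in p_a for t in a for x in t) + sum(x for a in p_c for x in a) + p_t
print(total)
```

Because a single normalizer is shared, the append, connect and terminate probabilities sum to one together rather than separately.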

Graph Convolution Layers

Various implementations of graph convolution are available [19]. In this work, the architecture of the graph convolutional layers is specified as follows (also illustrated in Figure 1c):

$$\mathbf{h}_i^{l} = \sigma\left(\mathrm{bn}\left(\mathbf{W}^{l}\mathbf{h}_i^{l-1} + \sum_{b \in B}\boldsymbol{\Theta}_b^{l}\sum_{j \in N_b^{\mathrm{bond}}(i)}\mathbf{h}_j^{l-1} + \sum_{1 < d \le D}\boldsymbol{\Phi}_d^{l}\sum_{j \in N_d^{\mathrm{path}}(i)}\mathbf{h}_j^{l-1} + \mathbf{b}^{l}\right)\right) \tag{9}$$

where $\mathbf{h}_i^{l}$ is the embedding of atom $i$ at layer $l$, and $\mathbf{W}^{l}$, $\boldsymbol{\Theta}_b^{l}$, $\boldsymbol{\Phi}_d^{l}$ and $\mathbf{b}^{l}$ are the parameters of layer $l$. $N_b^{\mathrm{bond}}(i)$ is the set of all atoms connected to atom $i$ with a bond of type $b$, and $N_d^{\mathrm{path}}(i)$ is the set of all atoms whose distance to atom $i$ equals $d$. $\sigma$ is the activation function, and bn denotes batch normalization.

This implementation is similar to the edge-conditioned convolution by Simonovsky et al. [20]. Briefly, at each layer, the new representation of atom $i$ is calculated from the following information: (1) the representation of atom $i$ from the previous layer, (2) the representations of the direct neighbors of atom $i$ and (3) the representations of the distant neighbors of atom $i$. Distant information helps to reach a larger receptive field with fewer layers, thus decreasing the computational cost. This is illustrated in Figure 1c.
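A single layer of this kind can be sketched in plain Python. The sketch is a simplification under stated assumptions: batch normalization is omitted, ReLU stands in for the unspecified activation $\sigma$, path distance is taken to be the shortest-path length in bonds, and all function and parameter names are hypothetical.

```python
from collections import deque


def path_distances(adj, source):
    """Shortest-path distances (in bonds) from source, computed by BFS."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for u in adj[v]:
            if u not in dist:
                dist[u] = dist[v] + 1
                queue.append(u)
    return dist


def matvec(matrix, x):
    return [sum(m * xi for m, xi in zip(row, x)) for row in matrix]


def add(x, y):
    return [a + b for a, b in zip(x, y)]


def graph_conv_layer(h, bonds, W, Theta, Phi, bias, max_dist):
    """h: per-atom feature vectors; bonds: dict frozenset({u, v}) -> bond type.

    W: matrix; Theta: dict bond type -> matrix;
    Phi: dict distance d -> matrix for 1 < d <= max_dist; bias: vector.
    """
    adj = {v: [] for v in range(len(h))}
    for pair in bonds:
        u, v = tuple(pair)
        adj[u].append(v)
        adj[v].append(u)

    out = []
    for i in range(len(h)):
        z = add(matvec(W, h[i]), bias)                # self term
        for pair, bond_type in bonds.items():         # direct neighbors, per bond type
            if i in pair:
                (j,) = pair - {i}
                z = add(z, matvec(Theta[bond_type], h[j]))
        for j, d in path_distances(adj, i).items():   # distant neighbors, per distance
            if 1 < d <= max_dist:
                z = add(z, matvec(Phi[d], h[j]))
        out.append([max(0.0, x) for x in z])          # ReLU (assumed); bn omitted
    return out


# Three-atom chain 0-1-2 with scalar features, chosen so each term is easy to read off.
chain = {frozenset((0, 1)): "single", frozenset((1, 2)): "single"}
out = graph_conv_layer([[1.0], [2.0], [3.0]], chain, W=[[1.0]],
                       Theta={"single": [[10.0]]}, Phi={2: [[100.0]]},
                       bias=[0.0], max_dist=2)
print(out)
```

In the toy chain, the end atoms receive a contribution from the atom two bonds away through `Phi[2]`, while the middle atom only sees its two direct neighbors.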

Likelihood Function

To train the generative model, we need to maximize the log-likelihood $\log p_{\boldsymbol{\theta}}(G)$ for the training samples. However, for the step-wise generative model discussed above, the likelihood is only tractable for a given decoding route $r = ((G_0, t_0), (G_1, t_1), \ldots, (G_n, t_n))$:

$$\log p_{\boldsymbol{\theta}}(G, r) = \sum_{i=0}^{n} \log p_{\boldsymbol{\theta}}(t_i|G_i) \tag{10}$$

while the marginal likelihood is computed as:

$$\log p_{\boldsymbol{\theta}}(G) = \log \sum_{r \in R(G)} p_{\boldsymbol{\theta}}(G, r) \tag{11}$$

where $R(G)$ is the set of all possible decoding routes for $G$. The marginal likelihood function is intractable for most molecules encountered in drug design. One way to resolve this problem is to consider only the "preferred" way of generation for each molecule, such as the depth-first traversal based on the canonical ordering of atoms:

$$\log p_{\boldsymbol{\theta}}(G) \ge \log p_{\boldsymbol{\theta}}(G, r^*(G)) \tag{12}$$

where $r^*(G)$ denotes the preferred generation route. This provides a lower bound for the true likelihood function. Another way is to approximate the value using Monte Carlo sampling:

$$\log p_{\boldsymbol{\theta}}(G) = \log\left(\frac{1}{|R(G)|}\sum_{r \in R(G)} p_{\boldsymbol{\theta}}(G, r) \times |R(G)|\right) = \log E_{r \sim q(r|G)}\left[\frac{p_{\boldsymbol{\theta}}(G, r)}{q(r|G)}\right] \tag{13}$$

where $q(r|G)$ is the distribution on $R(G)$ such that $q(r|G) = \frac{1}{|R(G)|}$ for all $r \in R(G)$. A lower bound, which holds in expectation by Jensen's inequality, can then be obtained by sampling $k$ routes $r_i$ from $q(r|G)$:

$$\log p_{\boldsymbol{\theta}}(G) = \log E_{r \sim q(r|G)}\left[\frac{p_{\boldsymbol{\theta}}(G, r)}{q(r|G)}\right] \ge \log \frac{1}{k}\sum_{i=1}^{k}\frac{p_{\boldsymbol{\theta}}(G, r_i)}{q(r_i|G)} \tag{14}$$

Both the deterministic approach (eq. 12) and the fully randomized approach (eq. 13) were explored in the previous work [18]. However, a more desirable solution would lie somewhere between deterministic decoding and fully randomized decoding.

In this work, instead of sampling from the distribution $q(r|G)$, we sample $r$ from a distribution $q_{\alpha}(r|G)$ that is parameterized by $0 \le \alpha \le 1$. $q_{\alpha}(r|G)$ is designed such that the decoding largely follows depth-first decoding with canonical ordering, but at each step there is a small probability $1 - \alpha$ of making a mistake. In this way, the parameter $\alpha$ can be used to control the randomness of the distribution $q_{\alpha}$.

The lower bound can then be obtained using importance sampling:

$$\log p_{\boldsymbol{\theta}}(G) = \log E_{r \sim q_{\alpha}(r|G)}\left[\frac{p_{\boldsymbol{\theta}}(G, r)}{q(r|G)} \times \frac{q(r|G)}{q_{\alpha}(r|G)}\right] = \log E_{r \sim q_{\alpha}(r|G)}\left[\frac{p_{\boldsymbol{\theta}}(G, r)}{q_{\alpha}(r|G)}\right] \ge \log \frac{1}{k}\sum_{i=1}^{k}\frac{p_{\boldsymbol{\theta}}(G, r_i)}{q_{\alpha}(r_i|G)} \tag{15}$$

For $\alpha = 1$, the distribution falls back to deterministic decoding (eq. 12). The parameter $\alpha$ is treated as a hyperparameter and is optimized for model performance. We tried $\alpha = 1.0, 0.8, 0.6$ and found that the best performance is achieved at $\alpha = 0.8$.
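The importance-sampling estimate can be illustrated on a toy problem. Everything below is hypothetical: a "molecule" with only three decoding routes whose joint probabilities are given directly, and a stand-in for $q_{\alpha}$ that puts mass $\alpha$ on the preferred route and splits the rest evenly. The real $q_{\alpha}$ acts per decoding step; this sketch only shows how the estimator is computed in log space.

```python
import math
import random


def log_mean_exp(values):
    """Numerically stable log of the mean of exp(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))


def importance_sampling_bound(log_p_joint, log_q, sample_route, k, rng):
    """Estimate log (1/k) sum_i p(G, r_i) / q(r_i | G) with r_i drawn from q."""
    routes = [sample_route(rng) for _ in range(k)]
    return log_mean_exp([log_p_joint(r) - log_q(r) for r in routes])


# Toy joint probabilities p(G, r); route 0 plays the role of r*, and p(G) = 0.15.
P_JOINT = {0: 0.10, 1: 0.02, 2: 0.03}


def make_q(alpha):
    """Toy proposal: mass alpha on route 0, the remainder split evenly."""
    q = {0: alpha, 1: (1 - alpha) / 2, 2: (1 - alpha) / 2}
    support = [r for r in q if q[r] > 0]

    def sample(rng):
        return rng.choices(support, weights=[q[r] for r in support])[0]

    return (lambda r: math.log(q[r])), sample


def log_p(r):
    return math.log(P_JOINT[r])


log_q, sample = make_q(0.8)
estimate = importance_sampling_bound(log_p, log_q, sample, 20000, random.Random(0))

log_q_det, sample_det = make_q(1.0)
deterministic = importance_sampling_bound(log_p, log_q_det, sample_det, 5, random.Random(0))
print(estimate, deterministic, math.log(0.15))
```

With $\alpha = 1$ the estimate reduces to $\log p(G, r^*)$, the deterministic bound of eq. 12, while with $\alpha = 0.8$ and many samples it approaches the marginal $\log p(G)$.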

Conditional Generative Model

Most molecule design tasks require producing compounds that satisfy certain criteria, such as being synthetically accessible or having a high affinity for a certain target. For models containing no latent variables, such as SMILES-based LMs, a popular solution is to fine-tune the existing model so that it is suited for a specific task [10-12]. However, modeling multiple objectives is challenging for this type of model. Herein, we propose to apply conditional generative models to generation tasks with specified requirements. The conditional generative model has the following advantages compared with previous retraining-based methods. First, the model can be used to control multiple conditions, such as the activity profile against multiple targets. Second, both continuous and discrete variables can be used, which offers higher flexibility. Third, conditional generative models are much easier to train than methods such as reinforcement learning, which is much less stable during training and usually requires a well-designed reward function for good performance.

The graph generation model is modified to be conditioned on the code $\mathbf{c}$, which describes the requirements for the output molecules. The code can contain various information depending on the generation task, such as QED [21] for the generation of drug-like molecules, and IC50 for the generation of compounds active against a certain target. The model then becomes:

$$\log p_{\boldsymbol{\theta}}(G, r|\mathbf{c}) = \sum_{i=0}^{n} \log p_{\boldsymbol{\theta}}(t_i|G_i, \mathbf{c}) \tag{16}$$

Here, the decision of the transition action $p_{\boldsymbol{\theta}}(t_i|G_i, \mathbf{c})$ depends not only on the intermediate molecular graph $G_i$, but also on the conditional code $\mathbf{c}$.

Conditional models have already been used in previous work [17] for molecule generation, but were restricted to small molecules and used only simple properties, such as the number of heavy atoms, as conditional codes. Here, the model is applied to tasks that are much more closely related to drug design, including scaffold-based generation, property-based generation and the design of dual inhibitors of JNK3 and GSK3β.

Scaffold-Based Generation

The concept of the molecular scaffold has long been of significant importance in medicinal chemistry [22]. Though various definitions are available, the most widely accepted one is given by Bemis and Murcko [23], who proposed deriving the scaffold of a given molecule by removing all side chain atoms. Studies have found various scaffolds that have privileged characteristics in terms of activity against certain targets [24-26]. Once such a privileged structure is found, a related task is to produce compound libraries containing such scaffolds for subsequent screening.

Here, a conditional graph generative model is applied to generate compounds containing a scaffold $s$, which is drawn from the pre-defined scaffold set $S = \{s_i\}_{i=1}^{N_S}$. The set $S$ is extracted from the list of approved drugs in DrugBank [27]. Two types of structures are extracted from the molecules to construct $S$: (1) the Bemis-Murcko scaffolds, and (2) ring assemblies. Ring assemblies are included in $S$ since we found that including extra structural information besides Bemis-Murcko scaffolds helps to improve the conditional generation performance. The extraction process is performed using RDKit [28]. Scaffolds with a molecular weight larger than 300 are removed. Fragments that are tautomers of each other are merged into a single entity, as they are indistinguishable during substructure matching (since the matching algorithm in RDKit is designed to ignore hydrogens). The resulting $S$ contains a total of 1129 scaffold structures.

For each molecule $G$, the conditional code $\mathbf{c}$ is set to be the binary vector such that $c_i = 1$ if $G$ contains $s_i$ as a substructure, and $c_i = 0$ otherwise. In this way, $\mathbf{c}$ can be viewed as a substructure fingerprint based on the scaffold set $S$. During generation, $\mathbf{c}$ is first constructed as a structural query. To generate molecules containing substructure $s \in S$, the fingerprint $\mathbf{c}_s$ of $s$ is used as the conditional code. The output should contain two types of molecules:

(1) Molecules containing $s$ as their Bemis-Murcko scaffold.

(2) Molecules whose Bemis-Murcko scaffold contains $s$ but does not reside inside $S$.

The procedure is demonstrated in Figure 2. Using this method, detailed control over the output molecules can be achieved.

Generation Based on Synthetic Accessibility and Drug-likeness

Drug-likeness and synthetic accessibility are two properties of significant importance in the development of new drug candidates. Drug-likeness measures the consistency of a given compound with currently known drugs in terms of structural or physical properties, and is frequently used to filter out obviously non-drug-like compounds in the early phase of screening [29-30]. Synthetic accessibility is also important for de novo drug design, since subsequent experimental validation requires synthesis of the given compound [31]. In this task, the model is required to generate molecules according to a given level of drug-likeness and synthetic accessibility. Drug-likeness is measured using the Quantitative Estimate of Drug-likeness (QED) [21], and synthetic accessibility is evaluated using the SA score [31]. The conditional code $\mathbf{c}$ is defined as $\mathbf{c} = (\mathrm{QED}, \mathrm{SA})$, where the QED and SA scores are both calculated using RDKit.

In practice, instead of specifying a single value of QED and SA, we often use intervals to express the requirements for the desired output molecules. This means that we are required to sample molecules from the distribution $p_{\boldsymbol{\theta}}(G|\mathbf{c} \in \mathcal{C}) = E_{\mathbf{c} \sim p(\mathbf{c}|\mathbf{c} \in \mathcal{C})}[p_{\boldsymbol{\theta}}(G|\mathbf{c})]$, where the generation requirement is described as a set $\mathcal{C}$ instead of a single point $\mathbf{c}$. The sampling is a two-step process: first drawing $\mathbf{c}$ from $p(\mathbf{c}|\mathbf{c} \in \mathcal{C})$, and then drawing $G$ from $p_{\boldsymbol{\theta}}(G|\mathbf{c})$. Sampling from $p(\mathbf{c}|\mathbf{c} \in \mathcal{C})$ can be achieved by first sampling $\mathbf{c}$ from $p(\mathbf{c})$ using molecules from the validation set, then filtering $\mathbf{c}$ according to the requirement $\mathbf{c} \in \mathcal{C}$.
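The first step of this two-step procedure can be sketched as below. The sketch assumes that the conditional codes of the validation molecules have already been computed; the (QED, SA) values and the interval used here are made up for illustration, and the function name is hypothetical. Each returned code would then be passed to the conditional generator to draw a molecule.

```python
import random


def sample_codes_in_region(validation_codes, in_region, n, rng):
    """Draw n codes from p(c | c in C) by filtering validation-set codes.

    validation_codes: codes computed from validation molecules (samples of p(c))
    in_region:        predicate implementing the requirement c in C
    """
    accepted = [c for c in validation_codes if in_region(c)]
    if not accepted:
        raise ValueError("no validation code satisfies the requirement")
    return [rng.choice(accepted) for _ in range(n)]


# Made-up (QED, SA) pairs standing in for codes of validation molecules.
validation_codes = [(0.91, 2.1), (0.45, 3.8), (0.83, 2.9), (0.30, 5.2), (0.88, 4.4)]


def region(c):
    """Illustrative requirement: QED of at least 0.8 and SA of at most 3.0."""
    return c[0] >= 0.8 and c[1] <= 3.0


codes = sample_codes_in_region(validation_codes, region, 4, random.Random(0))
print(codes)
```

Filtering first and then choosing uniformly among the accepted codes is equivalent to repeatedly drawing from the empirical $p(\mathbf{c})$ and rejecting draws outside $\mathcal{C}$.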

Figure 2: Workflow for scaffold-based molecule generation. Scaffold set $S$ is first extracted from compounds in DrugBank. The conditional code $\mathbf{c}$ is set to be the substructure fingerprint based on $S$. Training is performed with the training samples labeled with $\mathbf{c}_G$. After training, scaffold-based generation is performed using the fingerprint $\mathbf{c}_s$ of the query scaffold $s$.

Designing Dual Inhibitors Against JNK3 and GSK3β

With the ability to model multiple requirements at once, conditional generative models can be used to design compounds with specific activity profiles for multiple targets. Here, we consider the task of designing dual inhibitors against both c-Jun N-terminal kinase 3 (JNK3) and glycogen synthase kinase-3 beta (GSK3β). Both targets are serine/threonine (S/T) kinases and have been shown to be related to the pathogenesis of various types of diseases [32-33]. Notably, both JNK3 and GSK3β have been shown to be potential targets in the treatment of Alzheimer's disease (AD). Jointly inhibiting JNK3 and GSK3β may provide potential benefit for the treatment of AD.

The conditional code is set to be $\mathbf{c} = (c_{\mathrm{JNK3}}, c_{\mathrm{GSK3}\beta})$, where $c_{\mathrm{JNK3}}$ and $c_{\mathrm{GSK3}\beta}$ are binary values indicating whether the compound is active against JNK3 and GSK3β, respectively. For compounds in the ChEMBL dataset, $c_{\mathrm{JNK3}}$ and $c_{\mathrm{GSK3}\beta}$ are labeled using a separately trained predictor. Note that a better approach would be to train the predictor and generator jointly. This topic is left for future work. The predictive model is trained using activity data from ExCAPE-DB [34]. ExCAPE-DB is an integrated database with activity values from ChEMBL [35] and PubChem [36]. We use the activity flag provided by the database to separate the active and inactive compounds against the specified target. The resulting set contains 3334 active compounds and 300186 inactive compounds for GSK3β, as well as 923 active compounds and 59412 inactive compounds for JNK3. A random forest (RF) classifier, which has been demonstrated to provide good performance for kinase activity prediction [37], is used as the predictor for GSK3β and JNK3 activity, with ECFP6 (extended connectivity fingerprint [38] with a diameter of 6) as the descriptor. The RF models for each target are implemented using Scikit-learn [39] with the number of estimators (decision trees) set to 100. RDKit is used to calculate the ECFP6. As the dataset is highly imbalanced, random under-sampling of the inactive set is performed to improve the result.

Only 1.2% of the molecules in ChEMBL are predicted to be active against JNK3 or GSK3β. This imbalance results in a low enrichment rate during conditional generation. For better results, those molecules are up-sampled 10 times, and the number of training epochs is halved in order to reduce overfitting.
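The two resampling steps mentioned above (under-sampling inactives for the predictor, up-sampling predicted actives for the generator) can be sketched with the standard library. The function names and the `ratio` parameter are assumptions introduced for illustration; the text does not state which inactive-to-active ratio was used for under-sampling.

```python
import random


def undersample_inactives(actives, inactives, rng, ratio=1.0):
    """Keep all actives and a random subset of inactives.

    ratio is the number of inactives kept per active (an assumed parameter).
    """
    n_keep = min(len(inactives), int(round(ratio * len(actives))))
    return list(actives), rng.sample(list(inactives), n_keep)


def upsample(molecules, is_rare, factor=10):
    """Repeat rare molecules (e.g. predicted actives) `factor` times."""
    out = []
    for m in molecules:
        out.extend([m] * (factor if is_rare(m) else 1))
    return out


# Toy identifiers standing in for compounds.
toy_actives = list(range(5))
toy_inactives = list(range(100, 200))
kept_actives, kept_inactives = undersample_inactives(toy_actives, toy_inactives, random.Random(0))
training_list = upsample(["a", "b", "c"], lambda m: m == "b")
print(len(kept_inactives), len(training_list))
```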

Implementation Details

The model architectures of both the unconditional and conditional graph generators are shown in Figure 3. For the unconditional generator, the model architecture is as follows. The output size of the embedding layer $\mathrm{Embedding}(\mathbf{h}_{\mathrm{a}}, \mathbf{h}_{\mathrm{m}})$ is set to 16. The graph convolutional network $\mathrm{GraphConv}(G_i, \mathbf{h}_{\mathrm{e}})$ contains 6 layers with 32, 64, 128, 128, 256 and 256 units, respectively. The outputs from all layers are concatenated and fed to two subsequent fully connected layers with 256 and 512 hidden units. $\mathrm{MLP}^{\mathrm{append}}$, $\mathrm{MLP}^{\mathrm{connect}}$ and $\mathrm{MLP}^{\mathrm{terminate}}$ are each implemented as a fully connected network with one hidden layer of 256 units.

For the conditional model, the architecture is modified slightly, as illustrated in Figure 3. The conditional code is first projected to a higher-dimensional embedding by a fully connected network $\mathrm{MLP}^{\mathrm{cond}}$ with one hidden layer. The size of the hidden layer depends on the specific code $\mathbf{c}$. For scaffold-based modeling, the hidden layer size is set to 1024, and for other tasks it is set to 10. For each graph convolution layer, the input atom representation is concatenated with the corresponding conditional embedding and then fed to the graph convolution layer.

Figure 3: Architectures for unconditional and conditional graph generators implemented in this work.

Models are trained using structures extracted from ChEMBL [35], a large-scale database of structural and bioactivity information for chemical compounds. The data processing workflow is largely similar to the previous work [10]. Molecule structures from ChEMBL are first standardized using RDKit. This process involves salt removal, molecule neutralization, isotope removal, and conversion to canonical SMILES strings. We only keep molecules containing fewer than 50 heavy atoms and whose elements belong to the list {H, B, C, N, O, F, P, S, Cl, Br, I}. This results in a dataset containing 1.5 million molecules. During training, 80% of the molecules are randomly selected as the training set, and the rest are treated as the validation set.

The network is implemented using TensorFlow [40]. The Adam optimizer [41] is used to minimize the training loss, with an initial learning rate of 0.00025 that decays every 500 iterations with rate 0.01. We use a batch size of 50, and for each mini-batch, $k = 5$ samples are drawn from $q_{\alpha}(r|G)$ to calculate the training loss. Gradients are clipped to $[-3, 3]$ during training. The training lasts for 10 epochs and is performed asynchronously on 4 Nvidia GeForce GTX 1080 GPUs.

SMILES-Based Methods

The proposed graph-based model is compared with several SMILES-based models in terms of model performance and sample quality. Two types of methods, the variational autoencoder (VAE) and the language model (LM), are considered in this comparison. The implementation of the SMILES VAE follows the previous work [14]. The encoder contains three 1D convolutional layers, with 9, 9, 10 filters and kernel sizes of 9, 9, 10, respectively, and a fully connected layer with 435 hidden units. The model uses 292 latent variables and a decoder with three GRU layers with 501 hidden units. VAEs for sequential data are known to be difficult to optimize [42-43]. While the original implementation uses KL-annealing to tackle this problem, we follow the method provided by Kingma et al. [44] and control the level of free bits. This offers higher flexibility and stability compared with KL-annealing. We restrict the minimal level of free bits to 0.025 for each latent variable. For the LM, two types of architecture are adopted. The first (SMILES LM1) adopts the same structure as the decoder of the VAE discussed above. The second (SMILES LM2) follows the previous work [10], which used a wider architecture with three GRU layers of 1024 hidden units each. Training is performed on a single Nvidia GeForce GTX 1080 GPU for 10 epochs.

Evaluation Metrics

The model performance is evaluated by estimating the negative log-likelihood (NLL) on the validation set $\{G_i\}_{i=1}^{N}$. For graph generative models, the estimate is obtained using importance sampling:

$$\mathrm{NLL}_{\alpha'} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{1}{k}\sum_{j=1}^{k}\frac{p_{\boldsymbol{\theta}}(G_i, r_{ij})}{q_{\alpha'}(r_{ij}|G_i)} \tag{17}$$

The number of samples ($k$) is set to 1,000 during validation. $\alpha'$ may use a different value from $\alpha$. Evaluation is performed under different levels of $\alpha'$ ($\alpha' = 1.0, 0.8, 0.6$) for a more rigorous result. For comparison with SMILES-based methods, $\alpha'$ is set to 1.0, which means that the decoding route is exactly the depth-first decoding route according to the canonical ordering:

$$\mathrm{NLL}_{0} = -\frac{1}{N}\sum_{i=1}^{N}\log p_{\boldsymbol{\theta}}(G_i, r_i^*) \tag{18}$$

The NLL values for SMILES-based methods are evaluated using the canonical SMILES $\{\mathbf{x}_i\}_{i=1}^{N}$ of the validation set. For the VAE, importance sampling is performed to approximate the NLL as follows:

$$\mathrm{NLL} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{1}{k}\sum_{j=1}^{k}\frac{p_{\boldsymbol{\theta}}(\mathbf{x}_i, \mathbf{z}_{ij})}{q_{\boldsymbol{\phi}}(\mathbf{z}_{ij}|\mathbf{x}_i)} \tag{19}$$

where $\mathbf{z}_{ij}$ is sampled from the approximate posterior distribution $q_{\boldsymbol{\phi}}(\mathbf{z}_{ij}|\mathbf{x}_i)$, and $k$ is also set to 1,000.

The sample quality is evaluated by calculating the percentage of valid outputs as well as the percentage of outputs that are both valid and novel. 10,000 structures are generated for each model, and the structures are evaluated using RDKit. Those structures are subsequently converted to canonical SMILES and compared with the training set to find duplicates.
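The two sample-quality percentages can be computed as in the sketch below. It assumes a `canonicalize` helper that returns a canonical SMILES string, or `None` for an invalid structure; in practice such a helper would wrap a cheminformatics toolkit, but here a toy stand-in that only checks parenthesis balance is used so that the example is self-contained. The helper, the toy strings and the choice to count repeated outputs individually are all assumptions.

```python
def sample_quality(generated, training_smiles, canonicalize):
    """Return (fraction valid, fraction valid and novel) for generated outputs.

    canonicalize(s) must return a canonical SMILES string, or None if s is invalid.
    """
    canonical = [canonicalize(s) for s in generated]
    valid = [c for c in canonical if c is not None]
    novel = [c for c in valid if c not in training_smiles]
    return len(valid) / len(generated), len(novel) / len(generated)


def toy_canonicalize(smiles):
    """Stand-in validity check (balanced parentheses only); not real chemistry."""
    return smiles if smiles.count("(") == smiles.count(")") else None


generated = ["CCO", "C(C", "CCN", "CCO"]
training = {"CCO"}
valid_rate, novel_rate = sample_quality(generated, training, toy_canonicalize)
print(valid_rate, novel_rate)
```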

Several metrics are used to assess the performance of conditional graph generative models. For discrete conditional codes $\mathbf{c}$, let $M_{\mathbf{c}}$ be the set of molecules sampled from the distribution $p_{\boldsymbol{\theta}}(G|\mathbf{c})$. We set the number of generated samples $|M_{\mathbf{c}}|$ to 1,000. Let $N_{\mathbf{c}\mathbf{c}'}$ be the set of molecules in $M_{\mathbf{c}}$ that satisfy the condition $\mathbf{c}'$ ($\mathbf{c}'$ may be different from $\mathbf{c}$). The ratio $K_{\mathbf{c}\mathbf{c}'}$ is defined as:

$$K_{\mathbf{c}\mathbf{c}'} = \frac{|N_{\mathbf{c}\mathbf{c}'}|}{|M_{\mathbf{c}}|} \tag{20}$$

The matrix $K_{\mathbf{c}\mathbf{c}'}$ can be used to evaluate the ability of the model to control the output based on the conditional code $\mathbf{c}$. When $\mathbf{c} = \mathbf{c}'$, this value gives the rate of correctly generated outputs, denoted by $R_{\mathbf{c}}$. High-quality conditional models should have a high value of $R_{\mathbf{c}}$ and low values of $K_{\mathbf{c}\mathbf{c}'}$ for $\mathbf{c} \neq \mathbf{c}'$. Let $R_{\mathbf{c}}^{0}$ be the rate of molecules in the training data that satisfy condition $\mathbf{c}$. The enrichment over random $\mathrm{EOR}_{\mathbf{c}}$ is defined as:

$$\mathrm{EOR}_{\mathbf{c}} = \frac{R_{\mathbf{c}}}{R_{\mathbf{c}}^{0}} \qquad (21)$$

The definition is similar to that used in previous work [12], except that in their implementation $R_{\mathbf{c}}^{0}$ is calculated using the generated samples from the unconditioned model $p_{\boldsymbol{\theta}}(G)$.

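For continuous codes, a subset $\complement$ of the conditional code space is used to describe the generation requirements. $M_{\complement}$ is sampled from $p_{\boldsymbol{\theta}}(G \mid \mathbf{c} \in \complement)$, and values for $K_{\complement\complement'}$, $R_{\complement}$ and $\mathrm{EOR}_{\complement}$ can be calculated in a similar manner as eq. 20 and eq. 21.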
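The metrics in eq. 20 and eq. 21 can be computed with a few lines of code. The sketch below assumes a user-supplied predicate `satisfies(molecule, code)` that decides whether a molecule meets a condition; this predicate and both function names are hypothetical.

```python
def rate_matrix(samples_by_code, satisfies):
    """K[c][c2]: fraction of molecules generated under code c that satisfy c2."""
    codes = list(samples_by_code)
    K = {}
    for c in codes:
        generated = samples_by_code[c]
        K[c] = {
            c2: sum(1 for m in generated if satisfies(m, c2)) / len(generated)
            for c2 in codes
        }
    return K

def enrichment_over_random(K, training_set, satisfies):
    """EOR for each code: diagonal rate divided by the training-set rate."""
    eor = {}
    for c in K:
        r_c = K[c][c]
        r_c0 = sum(1 for m in training_set if satisfies(m, c)) / len(training_set)
        # A condition that never occurs in the training data gives an unbounded ratio.
        eor[c] = r_c / r_c0 if r_c0 > 0 else float("inf")
    return eor
```

The diagonal of the returned matrix gives the rate of correctly generated outputs, and the off-diagonal entries give the crossover rates.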

Results

Model Performance and Sample Quality

Several randomly generated samples from the graph-based model are shown in Figure 4a. The comparison between the SMILES-based models and the graph-based model is summarized in Table 1. In terms of NLL, SMILES LM2 gives the best result, while the graph model ranks second. This can be explained by the difference in model capacity, as the wider version of the SMILES LM has 10 times more parameters than the graph-based architecture. More parameters generally lead to better performance on large datasets. Another reason may be the lack of recurrent units in the graph-based architecture. This decreases the depth of the graph generator and hurts the expressiveness of the model, which is the price paid for a lower computational cost. Nonetheless, the graph generator still yields better performance than SMILES VAE and SMILES LM1. It should also be noted that the NLL values used in this comparison are only relatively loose bounds, as they are evaluated using only the deterministic decoding route; a much tighter bound can be obtained for the graph generator using importance sampling ($\mathrm{NLL}_{0.8} = 23.6$).

Figure 4: a) Output samples by the graph based generator, grouped by molecular weight. b) and c) Common mistakes
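The remark that the deterministic estimate is only a loose bound can be made explicit. The following is a sketch, under the assumption (suggested by the importance-sampling estimator in eq. 17) that the likelihood of a graph is the sum of the joint probabilities over all of its decoding routes $r$:

```latex
p_{\boldsymbol{\theta}}(G_i) = \sum_{r} p_{\boldsymbol{\theta}}(G_i, r)
  \;\geq\; p_{\boldsymbol{\theta}}(G_i, r_i^{*})
\quad\Longrightarrow\quad
-\log p_{\boldsymbol{\theta}}(G_i) \;\leq\; -\log p_{\boldsymbol{\theta}}(G_i, r_i^{*})
```

Averaging over the validation set, the value computed from the canonical route alone is therefore an upper bound on the true NLL. Summing over more routes can only tighten it, which is consistent with the lower estimate obtained by importance sampling.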

Table 1: Comparison between SMILES-based and graph-based generators in terms of NLL and sample quality

Model                          Estimated NLL   %valid   %valid and novel
VAE [14]                       30.2            82.1     81.2
LM2 [10]                       25.5            93.7     91.6
LM1                            28.4            89.0     87.4
Graph Model ($\alpha = 0.8$)   27.7            96.8     94.7

In terms of the rate of valid outputs and the rate of valid and novel outputs, the graph generative model outperforms all SMILES-based methods. The high validity of the output structures is not surprising, as the generation of SMILES imposes much stricter rules on the output than the generation of molecular graphs. Figure 4b and Figure 4c summarize, respectively, the common mistakes made by the SMILES-based and graph-based models during generation. Results in Figure 4b show that the most common cause of invalid output for SMILES-based models is grammar mistakes, such as unclosed parentheses or unpaired ring numberings. For the graph-based model, the majority of invalid outputs are caused by broken aromaticity, as demonstrated in Figure 4c. This is likely a result of the stepwise decoding pattern of graph-based models: the decoder can only see part of the aromatic structure during generation, while the determination of aromaticity requires information about the entire ring. It is also observed that mistakes related to atom valence are relatively minor, meaning that those rules are easy to learn using graph convolution. Graph-based methods also have the advantage of giving highly interpretable outputs compared with SMILES. This means that a large portion of invalid outputs can be easily corrected if necessary. For example, broken aromaticity can be restored by iteratively refining the number of explicit hydrogens of aromatic atoms, and unclosed aromatic rings can be corrected simply by connecting the two ends using a new aromatic bond. Though possible, those corrections may introduce additional bias to the output samples depending on the implementation, and are thus not adopted in the subsequent evaluations.

Table 2: The effect of parameter $\alpha$ on model performance

                   Estimated NLL
                   $\alpha' = 1.0$   $\alpha' = 0.8$   $\alpha' = 0.6$   %valid   %valid and novel
$\alpha = 1.0$     28.7              25.2              25.3              95.6     94.1
$\alpha = 0.8$     27.7              23.6              24.3              96.8     94.7
$\alpha = 0.6$     31.4              24.8              24.9              97.2     95.3

The effect of the parameter $\alpha$ on model performance is also examined, and the results are summarized in Table 2. The NLL values are evaluated under different levels of $\alpha'$. The model trained with $\alpha = 0.8$ consistently gives the best NLL compared with those trained under other conditions. More interestingly, the model trained with $\alpha = 0.8$ achieves a better (lower) NLL for deterministic decoding ($\alpha' = 1.0$) than the model trained with $\alpha = 1.0$ (27.7 versus 28.7). Further decreasing $\alpha$ to 0.6 gives worse performance. This might be a result of increased sample variance during training. In terms of the rate of valid outputs, decreasing $\alpha$ helps to improve the validity of the output samples. Nonetheless, $\alpha = 0.8$ already gives a high rate of valid samples, and is chosen for the final model according to the NLL results.

Scaffold-Based Generation

In the first task, the conditional graph generative model is trained to produce molecules according to a given scaffold. To illustrate the result, scaffold 1, extracted from the antihypertensive drug Candesartan, is used as an example, along with several related scaffolds (scaffolds 2-4) derived from scaffold 1 (Figure 5). A conditional code $\mathbf{c}$ is constructed for each scaffold, and output structures are produced according to the corresponding code. Results for $R_{\mathbf{c}}$, $R_{\mathbf{c}}^{0}$ and $\mathrm{EOR}_{\mathbf{c}}$ are given in Table 3. Although the model is unable to achieve 100% correctness in terms of the scaffold of the generated compounds, $R_{\mathbf{c}}$ is above 50% for all selected scaffolds, which is a significant enrichment compared with $R_{\mathbf{c}}^{0}$. The enrichment over random $\mathrm{EOR}_{\mathbf{c}}$ is over 1,000 for scaffolds 1-3 and 122.5 for scaffold 4, showing a promising ability of the model to produce enriched output according to the given scaffold query. Table 4 reports the values of $K_{\mathbf{c}\mathbf{c}'}$, which show that even though the four selected scaffolds share high structural similarity, there are only a few crossovers in the generated results, demonstrating that the model is capable of providing fine-grained control over the output scaffold.

Several generated samples are given for each scaffold in Figure 5. Recall that the outputs given scaffold $s$ should contain two types of molecules: (1) molecules with $s$ as their Bemis-Murcko scaffold and (2) molecules whose Bemis-Murcko scaffold contains $s$ but does not reside inside $S$. Both types are observed for scaffolds 1-4, as shown in Figure 5. Further investigation of the generated samples suggests that the model has learned the side-chain characteristics of each scaffold. For example, samples generated from scaffolds 1-3 usually have their substitutions at restricted positions and frequently contain a long aliphatic side chain. Interestingly, this reflects the structure-activity relationship (SAR) for angiotensin II (Ang II) receptor antagonists [45]. In fact, scaffolds 1-3 have long been treated as privileged structures against Ang II receptors [46], and as a result, molecules with scaffolds 1-3 are largely biased toward those that match the SAR rules for the target. When trained with the biased dataset, the model can memorize the underlying structure-activity relationship as a byproduct of scaffold-based learning. This characteristic is beneficial for the generation of libraries containing specified privileged structures.

Figure 5: Results of scaffold based molecule generation using scaffold 1-4 as conditions

Table 3: Resulting $R_{\mathbf{c}}$, $R_{\mathbf{c}}^{0}$ and $\mathrm{EOR}_{\mathbf{c}}$ for scaffold-based generation tasks

Conditions ($\mathbf{c}$)   $R_{\mathbf{c}}$   $R_{\mathbf{c}}^{0}$   $\mathrm{EOR}_{\mathbf{c}}$
Scaffold 1                  0.595              <0.0001                >1,000
Scaffold 2                  0.566              <0.0001                >1,000
Scaffold 3                  0.507              <0.0001                >1,000
Scaffold 4                  0.712              0.005                  122.5

Table 4: Result of matrix $K_{\mathbf{c}\mathbf{c}'}$ for the scaffold generation tasks. The diagonal elements are equal to $R_{\mathbf{c}}$, and the off-diagonal elements measure the rate of crossovers in the generated outputs.

                            Output ($\mathbf{c}'$)
Conditions ($\mathbf{c}$)   Scaffold 1   Scaffold 2   Scaffold 3   Scaffold 4
Scaffold 1                  0.595        0.001        0.027        0.002
Scaffold 2                  0.004        0.566        0.036        0.001
Scaffold 3                  0.0          0.0          0.507        0.072
Scaffold 4                  0.0          0.0          0.0          0.712

Generation Based on Drug-likeness and Synthetic Accessibility

In this task, the generative model is used to produce molecules according to requirements on drug-likeness and synthetic accessibility. The conditional code is specified as $\mathbf{c} = (\mathrm{QED}, \mathrm{SA})$. In the first experiment, the molecule generation is conditioned on a single point $\mathbf{c}$ in the condition space. Here, we use four different conditions, specified as follows:

$$\mathbf{c}_1 = (0.84, 1.9) \qquad (22)$$

$$\mathbf{c}_2 = (0.27, 2.5) \qquad (23)$$

$$\mathbf{c}_3 = (0.84, 3.8) \qquad (24)$$

$$\mathbf{c}_4 = (0.27, 4.8) \qquad (25)$$

The values are determined from the distributions of QED and SA using the 90% and 10% quantiles. The four selected conditions are illustrated in Figure 6b. The distributions of QED and SA for the output molecules are shown in Figure 6c. Results show that although the condition is a single point in the conditional code space, the distributions of QED and SA scores for the output samples are relatively dispersed. This may be due to the fact that QED and SA scores are relatively abstract descriptions of the structural features of molecules compared with the scaffold fingerprint. This means that a small modification of the molecular structure may lead to large changes in QED and SA, and thus to the diffuse distribution observed in Figure 6c. Nonetheless, it can be observed that the generated samples are enriched around the corresponding code $\mathbf{c}$. It is also observed that the distribution of SA is more concentrated than that of QED. This is probably because SA is a direct measurement of molecular graph complexity, which may be easier to model for the graph-based generator. In contrast, QED is a more abstract descriptor related to various molecular properties.

In the following experiments, the model is required to generate molecules based on the following conditions. The four conditions are expressed as subsets of the conditional code space:

$$\complement_1 = (0.84, 1) \times (0, 1.9) \qquad (26)$$

$$\complement_2 = (0, 0.27) \times (0, 2.5) \qquad (27)$$

$$\complement_3 = (0.84, 1) \times (3.4, +\infty) \qquad (28)$$

$$\complement_4 = (0, 0.27) \times (4.8, +\infty) \qquad (29)$$

The four sets are illustrated in Figure 6d. They represent four classes of molecules, and the first class $\complement_1$, which contains molecules with high drug-likeness and high synthetic accessibility, defines the set of molecules that are most important for drug design. The corresponding distributions of QED and SA for each class are shown in Figure 6e. The results for the property distributions are similar to those in Figure 6c, with a concentrated distribution of SA and a relatively diffuse distribution of QED. For a clearer demonstration, the distributions of QED for the generated samples from the four classes are plotted in Figure 6f. The result shows that, although dispersed, the QED distributions of molecules decoded under different QED conditions differ significantly. Furthermore, random samples are chosen for each class and are visualized in Figure 6g-j. The structural features of the output samples are largely consistent with the predefined conditions, with small and simple molecules for $\complement_1$ and highly complex molecules for $\complement_4$.
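Membership in the four condition sets can be checked directly from the (QED, SA) pair of each molecule. The following sketch uses the interval bounds from eqs. 26-29 as open rectangles; the set labels "C1" to "C4" and the function names are hypothetical, and the QED and SA values are assumed to be computed elsewhere.

```python
import math

# Open rectangles (QED interval) x (SA interval), copied from eqs. 26-29.
CONDITION_SETS = {
    "C1": ((0.84, 1.0), (0.0, 1.9)),
    "C2": ((0.0, 0.27), (0.0, 2.5)),
    "C3": ((0.84, 1.0), (3.4, math.inf)),
    "C4": ((0.0, 0.27), (4.8, math.inf)),
}

def in_condition_set(qed, sa, name):
    """True if the (QED, SA) pair lies strictly inside the named rectangle."""
    (q_lo, q_hi), (s_lo, s_hi) = CONDITION_SETS[name]
    return q_lo < qed < q_hi and s_lo < sa < s_hi

def fraction_in_set(scores, name):
    """Fraction of (QED, SA) pairs that fall into the named condition set."""
    return sum(1 for qed, sa in scores if in_condition_set(qed, sa, name)) / len(scores)
```

Applying `fraction_in_set` to samples generated under one condition set and evaluating it for every set gives one row of the crossover matrix defined in eq. 20.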

Quantitative evaluation results for $R_{\complement}$, $R_{\complement}^{0}$ and $\mathrm{EOR}_{\complement}$ are given in Table 5. The results indicate lower $R_{\complement}$ values compared with the scaffold-based task, but nonetheless show enrichment for all conditions over the distribution from ChEMBL. Table 6 reports the values of $K_{\complement\complement'}$. Few crossovers exist between the four condition sets, largely because the four sets are well separated in the conditional code space. Overall, the model produces enriched output based on the given conditions, and is suitable for the generation of compound libraries with specific requirements on molecular properties, such as high drug-likeness and high synthetic accessibility.

Figure 6: a) QED and SA score distribution of molecules from the ChEMBL dataset. b) Location of the conditional codes $\mathbf{c}_1$, $\mathbf{c}_2$, $\mathbf{c}_3$ and $\mathbf{c}_4$ used in the first experiment. c) QED and SA score distributions of output molecules decoded using codes $\mathbf{c}_1$, $\mathbf{c}_2$, $\mathbf{c}_3$ and $\mathbf{c}_4$. The distribution is relatively dispersed, but still enriched around the given conditional code compared with the ChEMBL dataset. d) Location of sets $\complement_1$, $\complement_2$, $\complement_3$ and $\complement_4$ used in the second experiment. e) QED and SA score distribution and f) QED distribution of output molecules conditioned on $\complement_1$, $\complement_2$, $\complement_3$ and $\complement_4$. Similar to the result of the first experiment, the distribution is dispersed but still provides significant concentration. g), h), i) and j) Randomly

Table 5: Resulting $R_{\complement}$, $R_{\complement}^{0}$ and $\mathrm{EOR}_{\complement}$ for the QED and SA based generation task

Conditions       $R_{\complement}$   $R_{\complement}^{0}$   $\mathrm{EOR}_{\complement}$
$\complement_1$  0.187               0.009                   19.1
$\complement_2$  0.191               0.012                   16.6
$\complement_3$  0.167               0.011                   15.6
$\complement_4$  0.603               0.008                   71.6

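Table 6: Result of matrix $K_{\complement\complement'}$ for the QED and SA based generation task

                            Output ($\complement'$)
Conditions ($\complement$)  $\complement_1$   $\complement_2$   $\complement_3$   $\complement_4$
$\complement_1$             0.187             0.001             0.0               0.0
$\complement_2$             0.001             0.191             0.0               0.0
$\complement_3$             0.0               0.0               0.167             0.001
$\complement_4$             0.0               0.0               0.002             0.603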

Generating Dual Inhibitors for JNK3 and GSK3β

In this task, the model is used to generate dual inhibitors for JNK3 and GSK3β. A predictive model is first used to label the conditional code for the ChEMBL dataset, and the conditional graph generator is trained on the labeled training set. The two predictors yield good results in general, with AUC=0.983 for JNK3 and AUC=0.984 for GSK3β. Several examples of output molecules are given in Figure 7e. To better demonstrate the structural distribution of the generated samples, visualization based on t-SNE [47] is performed using the ECFP6 fingerprint. The samples generated under different selectivity specifications and the molecules in the validation set for each target are projected into two-dimensional embeddings and shown in Figure 7a-d. The result illustrates a good match between the structural distribution of the generated molecules and that of the test set. Further investigation of Figure 7a-d shows that the conditional generator tends to produce molecules near the test set samples, which is consistent with observations based on other methods [12].
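The labeling step described above can be sketched as follows. The sketch assumes that each predictor returns a probability of activity and that a fixed cutoff of 0.5 converts it into a binary label; the cutoff and the function names are illustrative assumptions and are not taken from this work.

```python
def selectivity_code(p_jnk3, p_gsk3b, threshold=0.5):
    """Map two predicted activity probabilities to a binary code (JNK3, GSK3beta)."""
    return (int(p_jnk3 >= threshold), int(p_gsk3b >= threshold))

def label_dataset(predictions, threshold=0.5):
    """Assign a conditional code to every molecule.

    `predictions` maps a molecule identifier to a pair
    (P(active on JNK3), P(active on GSK3beta)).
    """
    return {
        mol_id: selectivity_code(p_jnk3, p_gsk3b, threshold)
        for mol_id, (p_jnk3, p_gsk3b) in predictions.items()
    }
```

Under this encoding, the code (1, 1) corresponds to the dual-inhibitor condition, while (1, 0) and (0, 1) correspond to the two selective conditions.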

The results for $R_{\mathbf{c}}$, $R_{\mathbf{c}}^{0}$ and $\mathrm{EOR}_{\mathbf{c}}$ are shown in Table 7, and the values of the $K_{\mathbf{c}\mathbf{c}'}$ matrix are reported in Table 8. When generating compounds that are active against both JNK3 and GSK3β, a significant fraction of the outputs (0.247 in Table 8) falls into the category of GSK3β positive and JNK3 negative. Nonetheless, in terms of the enrichment over random $\mathrm{EOR}_{\mathbf{c}}$, the model achieves high performance for all selectivity combinations. Selective inhibitors of GSK3β are relatively enriched in the ChEMBL database, according to the predictor. In comparison, selective inhibitors of JNK3 and dual inhibitors of both JNK3 and GSK3β are much rarer. However, the model still achieves significant enrichment for these two types of selectivity. The result shows potential application to target combinations that have a low data enrichment rate.

Table 7: Resulting $R_{\mathbf{c}}$, $R_{\mathbf{c}}^{0}$ and $\mathrm{EOR}_{\mathbf{c}}$ for various selectivity combinations of JNK3 and GSK3β

Conditions                 $R_{\mathbf{c}}$   $R_{\mathbf{c}}^{0}$   $\mathrm{EOR}_{\mathbf{c}}$
JNK3 (+) and GSK3β (-)     0.276              0.0008                 341.5
JNK3 (-) and GSK3β (+)     0.548              0.01                   52.9
JNK3 (+) and GSK3β (+)     0.255              0.0008                 300.3

Table 8: Result of matrix $K_{\mathbf{c}\mathbf{c}'}$ for the target-based generation task

                            Output ($\mathbf{c}'$)
Conditions ($\mathbf{c}$)   JNK3 (+) and GSK3β (-)   JNK3 (-) and GSK3β (+)   JNK3 (+) and GSK3β (+)
JNK3 (+) and GSK3β (-)      0.276                    0.057                    0.072
JNK3 (-) and GSK3β (+)      0.001                    0.548                    0.027
JNK3 (+) and GSK3β (+)      0.072                    0.247                    0.255


Conclusion

In this work, a new framework for de novo molecular design is proposed based on a graph generative model and is applied to solve different drug design problems. The graph generator is designed to be better suited to the task of molecule generation by using a simple decoding scheme and a graph convolutional architecture that is less computationally expensive. The method is trained using molecules in the ChEMBL dataset and has been demonstrated to provide a higher fraction of valid samples compared with previous SMILES-based methods. Furthermore, a more flexible way of introducing decoding invariance is also suggested. This method helps to improve the density estimation result as well as the output validity of the graph generator.

Figure 7: a), b), c) and d) t-SNE visualization of generated molecules under different selectivity conditions and known active molecules for GSK3β and JNK3 in the validation set. e) Generated samples specified under different selectivity

To generate molecules with specific requirements, we propose to use a conditional generative model, which provides higher flexibility and is much easier to train compared with previous fine-tuning based methods. The model is applied to solve problems that are highly relevant to drug design, such as generating molecules based on a given scaffold, generating molecules with good drug-likeness and synthetic accessibility, and generating molecules with a specific profile against multiple targets. The high enrichment rates presented in the results show that the conditional generative model provides a promising solution for many real-life drug design tasks.

This work can be extended in several aspects. First, the models used in this work completely ignore the stereochemistry of molecules. In fact, stereochemistry is extremely important in the process of drug development, and introducing this information would improve the applicability of existing models. Secondly, for target-based generation, it would be much more helpful to jointly train the generator and the decoder, utilizing strategies such as semi-supervised learning [48-49]. Finally, besides the three tasks explored in this work, the conditional graph generator can be used in many other tasks, such as generation based on docking scores, as well as tasks with different combinations of conditional codes, such as generating target-specific compounds that are also highly drug-like and synthetically accessible. To summarize, the graph generative architecture proposed in this work gives promising results in various drug design tasks, and it is worthwhile to explore other potential applications of this method.

REFERENCES

1. Schneider, G.; Fechner, U., Computer-based de novo design of drug-like molecules. Nat.

Rev. Drug Discov. 2005, 4 (8), 649-663.

2. Mauser, H.; Stahl, M., Chemical fragment spaces for de novo design. J. Chem. Inf. Model.

2007, 47 (2), 318-324.

3. Böhm, H.-J., The computer program LUDI: a new method for the de novo design of

enzyme inhibitors. J. Comput. Aided Mol. Des. 1992, 6 (1), 61-78.

4. Reutlinger, M.; Rodrigues, T.; Schneider, P.; Schneider, G., Multi‐Objective Molecular De

Novo Design by Adaptive Fragment Prioritization. Angew. Chem. Int. Ed. 2014, 53 (16),

4244-4248.

5. Dey, F.; Caflisch, A., Fragment-based de novo ligand design by multiobjective

evolutionary optimization. J. Chem. Inf. Model. 2008, 48 (3), 679-690.

6. Yuan, Y.; Pei, J.; Lai, L., LigBuilder 2: a practical de novo drug design approach. J. Chem.

Inf. Model. 2011, 51 (5), 1083-1091.

7. Hartenfeller, M.; Proschak, E.; Schüller, A.; Schneider, G., Concept of Combinatorial De

Novo Design of Drug‐like Molecules by Particle Swarm Optimization. Chem. Biol. Drug Des. 2008,

72 (1), 16-26.

8. Goodfellow, I.; Bengio, Y.; Courville, A., Deep learning. MIT press: 2016.

9. Lipton, Z. C.; Berkowitz, J.; Elkan, C., A critical review of recurrent neural networks for

sequence learning. arXiv preprint arXiv:1506.00019 2015.

10. Olivecrona, M.; Blaschke, T.; Engkvist, O.; Chen, H., Molecular de-novo design through

deep reinforcement learning. J. Cheminform. 2017, 9 (1), 48.

11. Popova, M.; Isayev, O.; Tropsha, A., Deep Reinforcement Learning for De-Novo Drug


12. Segler, M. H.; Kogej, T.; Tyrchan, C.; Waller, M. P., Generating focussed molecule

libraries for drug discovery with recurrent neural networks. arXiv preprint arXiv:1701.01329 2017.

13. Kingma, D. P.; Welling, M., Auto-encoding variational bayes. arXiv preprint

arXiv:1312.6114 2013.

14. Gómez-Bombarelli, R.; Duvenaud, D.; Hernández-Lobato, J. M.; Aguilera-Iparraguirre, J.;

Hirzel, T. D.; Adams, R. P.; Aspuru-Guzik, A., Automatic chemical design using a data-driven

continuous representation of molecules. arXiv preprint arXiv:1610.02415 2016.

15. Blaschke, T.; Olivecrona, M.; Engkvist, O.; Bajorath, J.; Chen, H., Application of

generative autoencoder in de novo molecular design. Mol. Inform. 2017.

16. Johnson, D. D., Learning Graphical State Transitions. In International Conference on

Learning Representations, 2017.

17. Anonymous, GraphVAE: Towards Generation of Small Graphs Using Variational

Autoencoders. 2017.

18. Anonymous, Learning Deep Generative Models of Graphs. 2017.

19. Wu, Z.; Ramsundar, B.; Feinberg, E. N.; Gomes, J.; Geniesse, C.; Pappu, A. S.; Leswing,

K.; Pande, V., MoleculeNet: a benchmark for molecular machine learning. Chem. Sci. 2018.

20. Simonovsky, M.; Komodakis, N., Dynamic Edge-Conditioned Filters in Convolutional

Neural Networks on Graphs. arXiv preprint arXiv:1704.02901 2017.

21. Bickerton, G. R.; Paolini, G. V.; Besnard, J.; Muresan, S.; Hopkins, A. L., Quantifying the

chemical beauty of drugs. Nat. Chem. 2012, 4 (2), 90-98.

22. Barreiro, E. J., Privileged Scaffolds in Medicinal Chemistry: An Introduction. 2015.

23. Bemis, G. W.; Murcko, M. A., The properties of known drugs. 1. Molecular frameworks.

J. Med. Chem. 1996, 39 (15), 2887-2893.

24. Reis, J.; Gaspar, A.; Milhazes, N.; Borges, F. M., Chromone as a privileged scaffold in

drug discovery–recent advances. J. Med. Chem. 2017.

25. Schuffenhauer, A.; Ertl, P.; Roggo, S.; Wetzel, S.; Koch, M. A.; Waldmann, H., The

scaffold tree−visualization of the scaffold universe by hierarchical scaffold classification. J. Chem.

Inf. Model. 2007, 47 (1), 47-58.

26. Varin, T.; Schuffenhauer, A.; Ertl, P.; Renner, S., Mining for bioactive scaffolds with

scaffold networks: improved compound set enrichment from primary screening data. J. Chem. Inf.

Model. 2011, 51 (7), 1528-1538.

27. Wishart, D. S.; Knox, C.; Guo, A. C.; Shrivastava, S.; Hassanali, M.; Stothard, P.; Chang,

Z.; Woolsey, J., DrugBank: a comprehensive resource for in silico drug discovery and exploration.

Nucleic Acids Res. 2006, 34 (suppl_1), D668-D672.

28. RDKit: open source cheminformatics. http://www.rdkit.org/.

29. Kadam, R.; Roy, N., Recent trends in drug-likeness prediction: A comprehensive review

of In silico methods. Indian J. Pharm. Sci. 2007, 69 (5), 609.

30. Tian, S.; Wang, J.; Li, Y.; Li, D.; Xu, L.; Hou, T., The application of in silico drug-likeness

predictions in pharmaceutical research. Adv. Drug Deliv. Rev. 2015, 86, 2-10.

31. Ertl, P.; Schuffenhauer, A., Estimation of synthetic accessibility score of drug-like

molecules based on molecular complexity and fragment contributions. J. Cheminform. 2009, 1 (1),

8.

32. Koch, P.; Gehringer, M.; Laufer, S. A., Inhibitors of c-Jun N-terminal kinases: an update.

J. Med. Chem. 2014, 58 (1), 72-95.

33. McCubrey, J. A.; Davis, N. M.; Abrams, S. L.; Montalto, G.; Cervello, M.; Basecke, J.;

Libra, M.; Nicoletti, F.; Cocco, L.; Martelli, A. M., Diverse roles of GSK-3: tumor promoter-tumor suppressor, target in cancer therapy. Adv. Biol. Regul. 2014, 54, 176.

34. Sun, J.; Jeliazkova, N.; Chupakhin, V.; Golib-Dzib, J.-F.; Engkvist, O.; Carlsson, L.;

Wegner, J.; Ceulemans, H.; Georgiev, I.; Jeliazkov, V., ExCAPE-DB: an integrated large scale


35. Gaulton, A.; Bellis, L. J.; Bento, A. P.; Chambers, J.; Davies, M.; Hersey, A.; Light, Y.;

McGlinchey, S.; Michalovich, D.; Al-Lazikani, B., ChEMBL: a large-scale bioactivity database for

drug discovery. Nucleic Acids Res. 2011, 40 (D1), D1100-D1107.

36. Bolton, E. E.; Wang, Y.; Thiessen, P. A.; Bryant, S. H., PubChem: integrated platform of

small molecules and biological activities. Annual reports in computational chemistry 2008, 4,

217-241.

37. Merget, B.; Turk, S.; Eid, S.; Rippmann, F.; Fulle, S., Profiling prediction of kinase

inhibitors: toward the virtual assay. J. Med. Chem. 2016, 60 (1), 474-485.

38. Rogers, D.; Hahn, M., Extended-connectivity fingerprints. J. Chem. Inf. Model. 2010, 50

(5), 742-754.

39. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel,

M.; Prettenhofer, P.; Weiss, R.; Dubourg, V., Scikit-learn: Machine learning in Python. J. Mach.

Learn. Res. 2011, 12 (Oct), 2825-2830.

40. Abadi, M.; Agarwal, A.; Barham, P.; Brevdo, E.; Chen, Z.; Citro, C.; Corrado, G. S.; Davis,

A.; Dean, J.; Devin, M., Tensorflow: Large-scale machine learning on heterogeneous distributed

systems. arXiv preprint arXiv:1603.04467 2016.

41. Kingma, D.; Ba, J., Adam: A method for stochastic optimization. arXiv preprint

arXiv:1412.6980 2014.

42. Bowman, S. R.; Vilnis, L.; Vinyals, O.; Dai, A. M.; Jozefowicz, R.; Bengio, S., Generating

sentences from a continuous space. arXiv preprint arXiv:1511.06349 2015.

43. Chen, X.; Kingma, D. P.; Salimans, T.; Duan, Y.; Dhariwal, P.; Schulman, J.; Sutskever,

I.; Abbeel, P., Variational lossy autoencoder. arXiv preprint arXiv:1611.02731 2016.

44. Kingma, D. P.; Salimans, T.; Jozefowicz, R.; Chen, X.; Sutskever, I.; Welling, M. In

Improved variational inference with inverse autoregressive flow, Advances in Neural Information

Processing Systems, 2016; pp 4743-4751.

45. Almansa, C.; Gómez, L. A.; Cavalcanti, F. L.; de Arriba, A. F.; García-Rafanell, J.; Forn,

J., Synthesis and structure−activity relationship of a new series of potent AT1 selective angiotensin

II receptor antagonists: 5-(Biphenyl-4-ylmethyl) pyrazoles. J. Med. Chem. 1997, 40 (4), 547-558.

46. Bräse, S., Privileged scaffolds in medicinal chemistry: design, synthesis, evaluation. Royal

Society of Chemistry: 2015.

47. Maaten, L. v. d.; Hinton, G., Visualizing data using t-SNE. J. Mach. Learn. Res. 2008, 9

(Nov), 2579-2605.

48. Kingma, D. P.; Mohamed, S.; Rezende, D. J.; Welling, M. In Semi-supervised learning

with deep generative models, Advances in Neural Information Processing Systems, 2014; pp

3581-3589.

49. Siddharth, N.; Paige, B.; de Meent, V.; Desmaison, A.; Wood, F.; Goodman, N. D.; Kohli,

P.; Torr, P. H., Learning Disentangled Representations with Semi-Supervised Deep Generative