Chapter 3 Distributed Learning of Molecular Feature Representations
3.2 Related Work
One commonly used method for representing molecular structures is to encode them in the SMILES (simplified molecular-input line-entry system) format. A SMILES string encodes the atoms of a molecule, the bonds between them, and the types of those bonds. A given SMILES string identifies a single molecule, but a given molecule may have several valid SMILES strings, since the same structure can be written starting from different atoms or traversed in a different order. Because SMILES is a string format, a vector representation must be computed from a molecule's SMILES string before it can be used as input to a machine learning system.
Drug molecules can also be understood as graph-structured data. Each node in the molecular graph $G$ represents an atom, and each edge $e_{uv} \in G$ can be thought of as a bond between two atoms $u$ and $v$, with the edge $e_{uv}$ labeled by the type of bond between the atoms $u$ and $v$. Leveraging this graph representation $G_m$ of a molecule $m$, one can extract a vector representation describing $G_m$ in some context.
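To make the graph view concrete, the following minimal sketch stores a molecule as a set of labeled nodes and bond-typed edges. The molecule (the heavy-atom skeleton of ethanol), the helper names `build_adjacency` and `atom_count_vector`, and the element-count vector are illustrative assumptions, not a representation used elsewhere in this chapter.

```python
from collections import defaultdict

# Hypothetical toy example: heavy-atom graph of ethanol (SMILES "CCO").
# Nodes are atom indices; each edge carries a bond-type label.
atoms = {0: "C", 1: "C", 2: "O"}
bonds = [(0, 1, "single"), (1, 2, "single")]


def build_adjacency(bonds):
    """Return {u: {v: bond_type}} for an undirected molecular graph."""
    adjacency = defaultdict(dict)
    for u, v, bond_type in bonds:
        adjacency[u][v] = bond_type
        adjacency[v][u] = bond_type
    return dict(adjacency)


def atom_count_vector(atoms, vocabulary=("C", "N", "O")):
    """A deliberately crude vector representation: one count per element."""
    return [
        sum(1 for symbol in atoms.values() if symbol == element)
        for element in vocabulary
    ]


adjacency = build_adjacency(bonds)
vector = atom_count_vector(atoms)
```

The count vector discards all bond information, which illustrates why richer, structure-aware representations of $G_m$ are of interest.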
We briefly review some applications in which these representations have been used in machine learning, highlighting some of their limitations and thereby motivating the need for data-driven modeling in practice.
Molecular Descriptors
Molecular descriptors are the output of some computational process “which transforms chemical information encoded within a symbolic representation of a molecule into a useful number...” [37]. Several solutions exist for calculating molecular descriptors that can be used as input to machine learning models that predict molecular properties and activities [6, 26, 36]. A popular option is the Dragon software suite, which is able to calculate 5,270 unique molecular descriptors. These include features such as the simplest atom types, functional group and fragment counts, topological and geometrical descriptors, three-dimensional descriptors, and estimates of various properties relevant to computational drug discovery, such as lipophilicity (logP) and drug-like indices. Determining which of these features are appropriate for a task requires domain expertise, trial and error, or, most simply, starting from all available features and applying a feature selection process that filters out those that are less informative according to some threshold. While a feature selection process helps to remove the burden of domain expertise, a disadvantage of this approach is that unnecessary dimensionality may be introduced, which, as a consequence of the curse of dimensionality, restricts the approaches one may use to learn a task in a timely manner.
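As an illustration of threshold-based feature selection, the sketch below drops descriptor columns whose variance across molecules falls below a cutoff. The choice of variance as the measure of informativeness, the function name `select_by_variance`, and the toy descriptor matrix are all assumptions made for this example; other criteria could equally be used.

```python
from statistics import pvariance


def select_by_variance(rows, threshold):
    """Return indices of columns whose population variance exceeds threshold."""
    columns = list(zip(*rows))
    return [j for j, column in enumerate(columns) if pvariance(column) > threshold]


# Hypothetical descriptor matrix: 4 molecules x 3 descriptors.
descriptors = [
    [1.0, 0.5, 3.0],
    [1.0, 0.7, 1.0],
    [1.0, 0.6, 4.0],
    [1.0, 0.4, 2.0],
]

# Keep only the columns that vary enough across molecules.
kept = select_by_variance(descriptors, threshold=0.1)
reduced = [[row[j] for j in kept] for row in descriptors]
```

Note that the full descriptor matrix must still be computed and held before any column can be discarded, which is where the dimensionality cost arises.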
Numerous studies of drug-protein binding interactions have made use of molecular descriptors as feature representations of molecules [7, 14, 18, 19, 22, 30]. In addition to traditional machine learning methods, deep neural networks (DNNs) trained on molecular descriptors as feature representations have been successfully applied to problems in drug discovery. In the work by [7], multi-task DNNs were applied to predict the targets of multiple PubChem assays using 3,764 molecular descriptors gathered from the Dragon software suite. The authors compared their method to several baseline methods, including a single-task DNN, and showed that in the majority of cases it exceeds the baseline performance in terms of the squared Pearson correlation coefficient (i.e. $R^2$). In further work on applying DNNs to drug discovery, Merck & Co. launched a competition on the Kaggle data science platform to generate further interest in applying modern machine learning techniques to the prediction of molecular properties relevant to drug discovery. The data provided in the Merck & Co. molecular activity challenge consisted of molecular descriptors along with a set of activities as labels for each distinct molecule. The winning team subsequently published their results with the assistance of Merck & Co., detailing a crucial component behind their successful ensemble learning method, a DNN [22]. They show that the DNN in most cases outperforms a random forest (RF) across a number of hyperparameter settings on each of 15 selected datasets. While the results of these early applications of DNNs were impressive for their time, the use of molecular descriptors imposes a prior belief that all relevant or task-specific information is contained within these sets of descriptors, an obvious limitation that should be addressed in future methodologies.
Molecular Fingerprinting
Rather than explicitly computing features for a molecule, such as a drug-like property or index, it is possible to instead compute a vector representation that identifies the molecule in a vector space with some intrinsic meaning. Algorithms that perform this calculation are known as molecular fingerprinting methods. Examples of these include Morgan fingerprints [25] and extended connectivity circular fingerprints (ECFP) [33]. These algorithms compute a unique binary-valued vector that identifies a given molecule based upon the atoms that it contains and their features. The state-of-the-art fingerprinting algorithm, ECFP, computes a representation that can be used to assess properties such as the similarity between molecules, which can be helpful in predicting specific properties of a possibly unknown molecule and in identifying molecules that actively bind to a target protein. However, as a representation for a machine learning task, these methods have the limitation that they are computed independently of the task of interest, potentially limiting the performance of machine learning models trained on them.
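As one example of how binary fingerprints support similarity comparisons, the sketch below computes the Tanimoto (Jaccard) coefficient between two bit vectors. The choice of Tanimoto as the similarity measure and the 8-bit fingerprints are assumptions for illustration; the fingerprints are invented and were not produced by ECFP or any other fingerprinting algorithm.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity between two equal-length binary fingerprints."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    both = sum(1 for a, b in zip(fp_a, fp_b) if a and b)      # bits set in both
    either = sum(1 for a, b in zip(fp_a, fp_b) if a or b)     # bits set in either
    # Two all-zero fingerprints are treated as identical by convention here.
    return both / either if either else 1.0


# Hypothetical 8-bit fingerprints; real fingerprints are typically much longer.
fp_1 = [1, 0, 1, 1, 0, 0, 1, 0]
fp_2 = [1, 0, 0, 1, 0, 1, 1, 0]
similarity = tanimoto(fp_1, fp_2)
```

Because the bit vectors are fixed once computed, the similarity between two molecules is the same regardless of the downstream prediction task, which is the limitation noted above.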
Neural Fingerprinting
Recently, a deep learning “neural” fingerprinting method was proposed in the work by [9]. The authors alter the ECFP algorithm by introducing a differentiable function to compute molecular fingerprints, allowing the method to be optimized for specific tasks, an advantage over previous methods. Subsequently, a number of variations of this type of network using graph convolutions have been introduced by various groups. The work by [13] summarizes each of these techniques, and provides a unified definition of these methods named Message-Passing Neural Networks, or MPNNs, which can be roughly defined in two steps:
• Message-Passing Phase
$$m_v^{t+1} = \sum_{w \in N(v)} M_t(h_v^t, h_w^t, e_{vw}) \tag{3.1}$$
$$h_v^{t+1} = U_t(h_v^t, m_v^{t+1}) \tag{3.2}$$
where $M_t$ is a specified message passing function, $U_t$ is a specified update function, $h_v^t$ is the hidden state and $m_v^t$ is the message received by node $v$ at time $t$. This step is computed for all nodes $v \in G$ and is repeated for each node a specified number of iterations given as $T$. The result is the flow of local information from each node $v \in G$ across the molecular graph.
• Readout Phase
$$\hat{y} = R(\{h_v^T \mid v \in G\}) \tag{3.3}$$
where $R$ is a specified “readout” function of the hidden node states at the end of message passing at time $T$ and $\hat{y}$ is the predicted target value of the network.
MPNNs are powerful in that they are able to learn on graphs that vary in topol-
While the hidden states $h_v^t$ can be updated simultaneously for all nodes in a graph $G_m$ at time $t$, the message passing phases are sequential operations in that $h_v^{t+1}$ is not able to be computed until the previous value $h_v^t$ is known, so one must iterate $T$ times to complete the message passing. When learning on large amounts of training data, this may become a troublesome property, and so we explore a distributed optimization algorithm, HOGWILD! [31], as a possible method to address this concern in a practical application.
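The two phases in Equations 3.1 to 3.3, and the sequential dependence between time steps, can be sketched as follows. This is a minimal illustration under stated assumptions, not the architecture of [9] or [13]: the message function (neighbor state scaled by a scalar edge feature), the update function (tanh of state plus message), the readout (sum over all node states), and the toy three-atom graph are all hypothetical choices with no learned parameters.

```python
import math


def message_passing(hidden, edges, num_steps):
    """Run T rounds of Eqs. 3.1-3.2 on an undirected graph.

    hidden: {node: [float, ...]} initial hidden states h_v^0
    edges:  {(v, w): float} edge features e_vw (here a scalar bond weight)
    """
    neighbors = {v: [] for v in hidden}
    for (v, w), e_vw in edges.items():
        neighbors[v].append((w, e_vw))
        neighbors[w].append((v, e_vw))

    for _ in range(num_steps):  # steps are sequential: h^{t+1} needs h^t
        new_hidden = {}
        for v, h_v in hidden.items():  # nodes within one step are independent
            # Eq. 3.1 with an assumed message function M_t = e_vw * h_w
            m_v = [0.0] * len(h_v)
            for w, e_vw in neighbors[v]:
                m_v = [m + e_vw * h for m, h in zip(m_v, hidden[w])]
            # Eq. 3.2 with an assumed update function U_t = tanh(h_v + m_v)
            new_hidden[v] = [math.tanh(h + m) for h, m in zip(h_v, m_v)]
        hidden = new_hidden
    return hidden


def readout(hidden):
    """Eq. 3.3 with an assumed readout R: sum over all node states."""
    return sum(sum(h_v) for h_v in hidden.values())


# Hypothetical three-atom chain (e.g. C-C-O) with one-dimensional states.
h0 = {0: [1.0], 1: [0.5], 2: [-1.0]}
e = {(0, 1): 1.0, (1, 2): 1.0}
h_T = message_passing(h0, e, num_steps=2)
y_hat = readout(h_T)
```

The inner loop over nodes reads only the states from the previous step, so it could be parallelized, whereas the outer loop over the $T$ steps cannot be reordered.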