Quantitative Structure-Property/Activity Relationship Methods

2.1 Quantitative Structure-Property/Activity Relationship Methods

studies by Hammett (1935), Taft Jr (1952) and Hansch & Fujita (1964). Since then, the methodology has been increasingly used in drug discovery (e.g. Ekins et al. (2007); Kubinyi (1997a); Lill (2007); Lipinski et al. (2012); Martin (1998); Perkins et al. (2003)) and chemical technology (e.g. Du Xihua & Keying (2009); Katritzky & Fara (2005); Katritzky et al. (2000)). These studies received a major boost from the introduction of computers and the subsequent growth in computing power.

This has led not only to newer and more complex molecular representations, but also to the application of prediction techniques that were previously infeasible or too time-consuming. Today the methodology is interdisciplinary, with an extensive set of tools for generating and harvesting information about chemical structure and for linking this structural information to experimental measurements of properties or activities, using machine learning algorithms to extract new knowledge.

The three major difficulties in the development of QSPR/QSAR models are quantifying the inherently abstract molecular structure, determining which structural features most influence the given property (representation problem) and establishing the functional relationship that best describes the link between these structure descriptors and the property data (mapping problem) (Kubinyi, 1997b). The first difficulty is usually overcome by calculating molecular descriptors, which are designed to quantify various aspects of molecular structure. This solution is in fact the cause of the second difficulty since, as described in the following sections, hundreds of molecular descriptors exist, covering a wide range of constitutional, topological, geometrical, electronic and quantum-mechanical features. These descriptors may be highly redundant, since many are related to each other or to the same underlying property. Furthermore, some descriptors may be completely irrelevant to the property of interest, and others may have been calculated with methods that produce noisy values. The problem lies in identifying an appropriate set of descriptors that allows the desired property of the compound to be adequately predicted. To accomplish this, and to find the optimum relationship between the structure descriptors and the property data, several statistical and machine learning methods are used for dimensionality reduction or feature selection and for regression or classification.

Models can be grouped into two main categories depending on the nature of the property to be predicted. Models predicting quantitative properties, such as the degree of binding to a target, are known as regression models, whereas classification models predict qualitative properties. The modelling phase has its own difficulties: the property values used to build models often originate from complicated and uncertain measurements, resulting in noisy y-values. The values may also have been collected from different public sources of varying reliability, or obtained under different experimental conditions.
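As a minimal sketch of the descriptor-reduction step described above, the following Python fragment discards constant descriptors and then greedily drops any descriptor whose absolute Pearson correlation with an already retained descriptor reaches a threshold. The descriptor names, the toy values and the 0.95 threshold are illustrative assumptions, not data or settings taken from this work, and a correlation filter is only one of many possible feature selection strategies.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # correlation is undefined for a constant column
    return sxy / sqrt(sxx * syy)


def filter_descriptors(table, threshold=0.95):
    """Return the names of descriptors kept after removing constant and
    highly inter-correlated columns (greedy, in insertion order)."""
    kept = []
    for name, values in table.items():
        if len(set(values)) <= 1:
            continue  # a constant descriptor carries no information
        if all(abs(pearson(values, table[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical descriptor table: one list of values per descriptor,
# one position per compound. All numbers are made up for illustration.
descriptors = {
    "mol_weight": [46.07, 60.10, 74.12, 88.15, 102.17],
    "n_carbons": [2, 3, 4, 5, 6],             # almost collinear with mol_weight
    "n_hydroxyl": [1, 1, 1, 1, 1],            # constant
    "polarity_index": [0.9, 0.1, 0.7, 0.2, 0.6],
}

selected = filter_descriptors(descriptors, threshold=0.95)
print(selected)
```

Because the filter is greedy, the result depends on the order in which descriptors are visited; in practice the order could be chosen so that the more interpretable descriptor of a correlated pair is retained.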

Another common problem is the unbalanced nature of the available data: the majority of the compounds in a database are inactive and only a few are active, or vice versa. Even when all descriptors and outputs are measured or calculated as accurately as possible, building good models remains difficult. Compounds in a chiral pair are particularly problematic. The two isomers have identical attributes but very often completely different activities; one isomer might be toxic and the other not. This phenomenon of molecules having essentially different properties despite very similar structures is generally known as an activity cliff (Stumpfe & Bajorath, 2012).
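The class imbalance mentioned above can be quantified, and partly compensated for, by weighting each class inversely to its frequency. The sketch below shows one common heuristic, weight = n_samples / (n_classes × class count); the labels are made-up toy data, and the weighting scheme is an assumption chosen for illustration rather than the approach used in this work.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count), so that
    rare classes contribute as much to a weighted loss as frequent ones."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Toy data set: 8 inactive compounds and 2 active ones.
labels = ["inactive"] * 8 + ["active"] * 2
weights = inverse_frequency_weights(labels)
print(weights)
```

With these weights the total weight of each class is the same, so a classifier trained with a weighted loss cannot reach a low error simply by always predicting the majority class.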

In order to build the model, the pool of molecules with known activity is usually split into a training set and a test set. The training set is used to learn the model. The learning problem consists in constructing a model that is able to predict the properties of the molecules in the training set without overfitting them. The choice of a model among the profusion of existing ones depends on the final goal of the study: complex models can, for instance, have great predictive ability, but this often comes at the cost of interpretability and with an increased risk of overfitting.

Overfitting can be controlled, for instance, by using different validation techniques (described below) that quantify the ability of the model to predict a subset of the training set that was left out during the learning phase. The test set is used to evaluate the generalization of the learned model, that is, its ability to make correct predictions on a set of unseen molecules (Tropsha, 2010). This is a key step in QSPR/QSAR modelling: as pointed out by Truchon & Bayly (2007), the major reason why these models fail is the vast number of equivalent models combined with deficient external validation. In other words, the model overfits the training data without capturing the true structure-activity relationship. Furthermore, QSPR/QSAR models have a limited scope of application (Jaworska et al., 2005), and greater uncertainty and variance are to be expected for predictions made beyond that scope.
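The split into training and test sets and the internal validation step described above can be sketched as follows. This is a minimal illustration under stated assumptions: the data are synthetic (a single hypothetical descriptor with a linear response and a small fixed perturbation), the model is a one-descriptor least-squares line, the 80/20 split and leave-one-out scheme are arbitrary choices, and q² is computed with one of several definitions found in the literature.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(observed, predicted):
    """1 - SS_res / SS_tot, with SS_tot taken around the mean of `observed`."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def leave_one_out_q2(xs, ys):
    """Internal validation: each training compound is predicted by a model
    fitted on all the other training compounds."""
    predictions = []
    for i in range(len(xs)):
        slope, intercept = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        predictions.append(slope * xs[i] + intercept)
    return r_squared(ys, predictions)


# Synthetic data: activity = 2 * descriptor + 1, perturbed by +/- 0.3.
descriptor = [float(i) for i in range(20)]
activity = [2.0 * x + 1.0 + (0.3 if i % 2 == 0 else -0.3)
            for i, x in enumerate(descriptor)]

# Reproducible random 80/20 split into training and test sets.
indices = list(range(len(descriptor)))
random.Random(0).shuffle(indices)
n_test = len(indices) // 5
test_idx, train_idx = indices[:n_test], indices[n_test:]

train_x = [descriptor[i] for i in train_idx]
train_y = [activity[i] for i in train_idx]
test_x = [descriptor[i] for i in test_idx]
test_y = [activity[i] for i in test_idx]

# Internal validation on the training set only.
q2 = leave_one_out_q2(train_x, train_y)

# External validation: fit once on the training set, predict the test set.
slope, intercept = fit_line(train_x, train_y)
r2_test = r_squared(test_y, [slope * x + intercept for x in test_x])

print(f"LOO q2 = {q2:.3f}, external R2 = {r2_test:.3f}")
```

The test compounds play no part in fitting or in the leave-one-out loop; they are used only once, at the end, which is what makes the external estimate informative about generalization.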

The general QSPR/QSAR process is summarized in Figure 2.1. The flowchart shows that a QSPR/QSAR model is an alternative route to the prediction of molecular properties, since their direct calculation is generally not feasible.

Figure 2.1: Outline of the steps involved in predicting molecular properties from molecular structure in a QSPR/QSAR problem.

Several statistical and data-mining techniques have been employed, and software packages that wrap the whole workflow for developing QSPR/QSAR models as a black box have been created, the vast majority of them available on a commercial basis only (Baumann et al., 2008; Gasteiger, 2003). ADAPT (Automated Data Analysis and Pattern Recognition Toolkit; http://research.chem.psu.edu/pcjgroup/adapt.html) is a commercial software system for the UNIX operating system, distributed by Jurs, for the development of QSPR/QSAR models. It implements an inductive approach in which the QSPRs or QSARs are developed from a set of known values for the compounds in a training set. ADAPT has a large selection of molecular structure descriptor generation routines (Stuper & Jurs, 1976). The commercial computer program PASS (Prediction of Activity Spectra for Substances; http://www.pharmaexpert.ru/PASSOnline/), developed by the Academy of Medical Sciences, Moscow, predicts the biological activity of a compound on the basis of its structural formula, using Multilevel Neighbourhoods of Atoms (MNA) and Quantitative Neighbourhoods of Atoms (QNA) descriptors (Lagunin et al., 2000). The commercial software CODESSA combines a large variety of classical non-empirical molecular descriptors with more novel quantum chemical and combined descriptors, all derived solely from the molecular structure, and applies both standard and advanced statistical data treatment techniques to develop QSPR/QSAR correlations in very large descriptor spaces. OpenMolGRID is free software implementing data mining techniques for the development of predictive models that estimate various chemical properties and biological activities. The OpenMolGRID system provides a flexible infrastructure for automating this kind of scientific workflow, and has Grid adapters for several existing software packages that are required to carry out tasks in QSPR/QSAR model development workflows (Darvas et al., 2004).

Chembench is a free web-based tool for QSPR/QSAR modelling and prediction. It provides tools for data visualization and embeds a workflow for creating and validating predictive QSPR/QSAR models (Walker et al., 2010). In addition to software dedicated to QSPR/QSAR correlations, several general-purpose statistics or data-mining packages can be used for the same purpose, such as SAS, SPSS, STATISTICA, MATLAB, R and Weka.
