2.4 Approaches to establish structure-property relationships using multivariate methods
structures with crystal imperfections. This SA process for optimization can be formulated as the problem of finding a solution with minimal cost among a very large number of possible states. The physical annealing process can be modelled by computer simulation methods based on Monte Carlo techniques. The slower the cooling schedule, or rate of decrease, the more likely the algorithm is to find an optimal or near-optimal solution. However, annealing with a slow cooling schedule is computationally expensive, and the method cannot determine whether the solution it has found is optimal (Dudek et al., 2006; Fodor, 2002; Laarhoven & Aarts, 1987). Some examples of studies applying SA for feature selection in QSPR/QSAR problems include Ghosh & Bagchi (2009); Sharma et al. (2012).
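As an illustration of the procedure described above, the following is a minimal sketch of SA applied to descriptor selection. It is not taken from the cited studies: the geometric cooling schedule, the single-flip neighbourhood, the parameter names and the toy cost function are all assumptions made for the example. In practice the cost would be something like the cross-validated error of a model built on the selected descriptors.

```python
import math
import random

def simulated_annealing(cost, n_features, t_start=1.0, t_end=1e-3,
                        alpha=0.95, steps_per_temp=50, seed=0):
    """Minimise cost(mask) over boolean descriptor masks (illustrative sketch)."""
    rng = random.Random(seed)
    current = [rng.random() < 0.5 for _ in range(n_features)]
    current_cost = cost(current)
    best, best_cost = current[:], current_cost
    t = t_start
    while t > t_end:
        for _ in range(steps_per_temp):
            # Neighbouring state: switch one randomly chosen descriptor in or out.
            candidate = current[:]
            j = rng.randrange(n_features)
            candidate[j] = not candidate[j]
            cand_cost = cost(candidate)
            delta = cand_cost - current_cost
            # Metropolis criterion: always accept improvements; accept a worse
            # state with probability exp(-delta / t).
            if delta <= 0 or rng.random() < math.exp(-delta / t):
                current, current_cost = candidate, cand_cost
                if current_cost < best_cost:
                    best, best_cost = current[:], current_cost
        t *= alpha  # assumed geometric cooling; alpha closer to 1 cools more slowly
    return best, best_cost

# Hypothetical toy cost: number of mismatches against a known "ideal" subset.
TARGET = [True, False, True, True, False, False, True, False]

def toy_cost(mask):
    return sum(m != t for m, t in zip(mask, TARGET))

best_mask, best_cost = simulated_annealing(toy_cost, len(TARGET))
```

The algorithm only tracks the best state visited; as noted in the text, nothing in the procedure certifies that this state is the global optimum.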
2.4 Approaches to establish structure-property relationships using multivariate methods
“There are known unknowns; that is to say, there are things that we now know we don’t know.
But there are also unknown unknowns – there are things we do not know we don’t know.”
∼ Donald Rumsfeld, United States Secretary of Defense (2002)
Once the molecular descriptors have been calculated and reduced to a subset of optimal descriptors, the problem lies in building a model that best correlates the structure of the molecule with the desired property (Dudek et al., 2006). A wide range of mapping functions can be employed, both linear and non-linear. Linear models predict the property as a linear function of the molecular descriptors; in general, they are easily interpretable and accurate for small datasets of similar compounds when the molecular descriptors have been selected for the given property. Non-linear models predict the property as a non-linear function of the molecular descriptors; in general, they are more accurate, especially for large and diverse datasets, but they are more complex and harder to interpret. Complex non-linear models may also fall prey to over-fitting (poor generalization to unseen compounds during testing). In the framework of supervised learning, another important division of the methods is based on the nature of the desired property: (1) classification tasks, which approximate a discrete-valued function to map a pattern into an M-dimensional decision space, where M is the number of categories or classes, and (2) regression tasks, which approximate a real-valued target function to map a pattern into a continuous space. Furthermore, two main strategies can be followed to predict properties of new compounds: (1) eager or model-based learning, in which a model is built using a training set and then applied to all unseen cases to make predictions, and (2) lazy or instance-based learning, in which each test instance is considered individually and information is extracted from the training set specifically for the prediction of that instance. The main advantage of lazy learning is that it makes the most of the information about a test instance. It is impossible to give a detailed overview of all existing methods, but a general overview of the most widely used methods in QSPR/QSAR is given below.
2.4.1 Instance-based Learning Approaches
Instance-based methods construct local approximations to the modelled function that apply in the neighbourhood of the new query instance. They thus describe a complex target function as a collection of less complex local approximations based on the distance between instances. These algorithms have several advantages: they are simple but robust learning algorithms, can tolerate noise and irrelevant attributes, can represent both probabilistic and overlapping concepts, and naturally exploit inter-attribute relationships. Their main drawback is that, because all processing is delayed until a new classification/prediction is required, significant processing is needed at prediction time. Furthermore, the instances must be represented in a way that allows the distance between them to be calculated.
2.4.1.1 k-Nearest Neighbours
k-Nearest Neighbours (k-NN) is a simple method for classification or prediction that converges to the optimal prediction error as the amount of training data increases (Itskowitz & Tropsha, 2005). The training phase of the algorithm consists only of storing the feature vectors and the property values or classes of the training samples. For a given test compound, the method analyses its k nearest neighbouring compounds from the training set and predicts the property based on the similarity principle, by majority voting among the neighbours (or, for a continuous property, by averaging their values), according to equation 2.1, where Nk(x) is the neighbourhood defined by the k closest observations in the training set (Eklund et al., 2014). The method is very sensitive to the metric used to map the compounds in the feature space and to the training compounds available.
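$$ f(x) = \frac{1}{k} \sum_{x_i \in N_k(x)} y_i \tag{2.1} $$

where y_i is the observed property value (or class indicator) of the training compound x_i.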
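The k-NN prediction can be sketched in a few lines of plain Python. This is an illustrative sketch only: the Euclidean metric, the function names and the toy descriptor values are assumptions, and any distance appropriate to the descriptor space could be substituted.

```python
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def k_nearest(train_x, train_y, query, k, distance=euclidean):
    """Return the property values of the k training compounds closest to query."""
    ranked = sorted(zip(train_x, train_y),
                    key=lambda pair: distance(pair[0], query))
    return [y for _, y in ranked[:k]]

def knn_regress(train_x, train_y, query, k):
    # Continuous property: average over the neighbourhood N_k(x).
    neighbours = k_nearest(train_x, train_y, query, k)
    return sum(neighbours) / len(neighbours)

def knn_classify(train_x, train_y, query, k):
    # Discrete property: majority vote over the neighbourhood N_k(x).
    neighbours = k_nearest(train_x, train_y, query, k)
    return Counter(neighbours).most_common(1)[0][0]

# Hypothetical toy training set: two descriptors per compound.
X = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]]
Y_CLASS = ["inactive", "inactive", "inactive", "active", "active", "active"]
Y_VALUE = [1.0, 1.2, 1.1, 3.0, 3.2, 3.1]

predicted_class = knn_classify(X, Y_CLASS, [0.15, 0.15], k=3)
predicted_value = knn_regress(X, Y_VALUE, [0.15, 0.15], k=3)
```

Note that "training" is just storing X and the property values; all the work (the distance computations and the sort) happens at prediction time, which is the cost of lazy learning mentioned earlier.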
2.4.2 Model-based Learning Approaches
Model-based approaches (Eklund et al., 2014), on the other hand, represent what has been learned in a quantitative computational model that describes a mapping or transformation between a set of features and responses, and that is richer than the language used to describe the data. Learning methods of this kind construct explicit generalizations of the training cases, rather than allowing generalization to flow implicitly from a similarity or distance measure.
2.4.2.1 Multiple Linear Regression
Multiple Linear Regression (MLR) models the property as a linear function of all the molecular descriptors, weighted by coefficients that are fitted on the training set (Dudek et al., 2006; Kovdienko et al., 2010). The coefficients are chosen to minimize the sum of squared errors between the observed and predicted values of the property (Eklund et al., 2014). This method is not appropriate when the number of descriptors per compound is large, in particular when it approaches or exceeds the number of compounds or when the descriptors are strongly correlated, because the coefficients then become unstable or are no longer uniquely determined.
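For reference, the least-squares criterion can be written out as follows. The notation is an assumption introduced here, not taken from the cited works: n compounds, p descriptors, x_{ij} the value of descriptor j for compound i, and X the n × (p+1) descriptor matrix with a leading column of ones for the intercept b_0.

```latex
\hat{y}_i = b_0 + \sum_{j=1}^{p} b_j x_{ij},
\qquad
\hat{b} = \arg\min_{b} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
        = (X^{\top} X)^{-1} X^{\top} y
```

The closed-form solution requires X^T X to be invertible. When p + 1 > n, or when descriptors are collinear, X^T X is singular or nearly so, which is the formal reason MLR breaks down for large descriptor sets.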
2.4.2.2 Partial Least Squares
Partial Least Squares (PLS) is a linear regression method that overcomes the difficulty MLR has in dealing with a large number of descriptors per compound (Dudek et al., 2006; Wold et al., 2001). The method assumes that the property is influenced by a relatively small number of latent independent variables. These latent variables are linear combinations of the original variables, obtained as already explained in section 2.3.1.2, and are then used as the input of a regression model. What distinguishes PLS from principal component regression is that, in PLS, the features are weighted by the strength of their univariate effect on the output variable in the construction of each latent feature (Eklund et al., 2014).
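The weighting by univariate effect can be made concrete for the first latent variable of single-response PLS. The notation below is an assumption for illustration: X is the mean-centred descriptor matrix, y the mean-centred property vector, w_1 the weight vector, t_1 the first latent variable (score) and c_1 its regression coefficient.

```latex
w_1 = \frac{X^{\top} y}{\lVert X^{\top} y \rVert},
\qquad
t_1 = X w_1,
\qquad
c_1 = \frac{t_1^{\top} y}{t_1^{\top} t_1}
```

Each element of X^T y is proportional to the covariance between one descriptor and the property, so descriptors with a stronger univariate relationship to y receive larger weights. In principal component regression, by contrast, the first weight vector is the leading eigenvector of X^T X and does not involve y at all. Further components are obtained in the same way after removing the contribution of t_1 from X.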
2.4.2.3 Artificial Neural Networks
Artificial Neural Networks (ANNs) are a non-linear method for classification or prediction inspired by the parallel architecture of biological neural networks (Abraham, 2005; Dudek et al., 2006; Eklund et al., 2014). An ANN consists of an interconnected group of artificial neurons, in which weighted connections modulate the effect of the associated molecular descriptors and each neuron produces its output through a transfer function. The learning capability of the ANN is achieved by adjusting the weights in accordance with the chosen learning algorithm. In supervised learning, an input vector of molecular descriptors is presented together with a set of desired property responses, one for each neuron at the output layer. A forward step is performed, and the discrepancies between the desired and actual property values for each neuron in the output layer are computed and used to determine the weight changes in the network according to the learning rule.
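The forward step and the error-driven weight update can be sketched with a very small network. Everything specific here is an assumption chosen for illustration: one hidden layer of sigmoid neurons, a single linear output neuron, a squared-error loss, plain gradient descent (back-propagation) as the learning rule, and a toy dataset in which the "property" is the product of two descriptors.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

class TinyNetwork:
    """One hidden layer of sigmoid neurons and a single linear output neuron."""

    def __init__(self, n_inputs, n_hidden, seed=0):
        rng = random.Random(seed)
        # One weight per descriptor plus a bias (last entry) for each hidden neuron.
        self.w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs + 1)]
                         for _ in range(n_hidden)]
        self.w_out = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]

    def forward(self, x):
        hidden = [sigmoid(sum(w * xi for w, xi in zip(ws[:-1], x)) + ws[-1])
                  for ws in self.w_hidden]
        output = sum(w * h for w, h in zip(self.w_out[:-1], hidden)) + self.w_out[-1]
        return hidden, output

    def train_step(self, x, target, lr=0.1):
        hidden, output = self.forward(x)        # forward step
        error = output - target                 # discrepancy at the output layer
        # Gradient of 0.5 * error**2 with respect to each weight.
        for j, h in enumerate(hidden):
            delta_h = error * self.w_out[j] * h * (1.0 - h)
            for i, xi in enumerate(x):
                self.w_hidden[j][i] -= lr * delta_h * xi
            self.w_hidden[j][-1] -= lr * delta_h
            self.w_out[j] -= lr * error * h
        self.w_out[-1] -= lr * error

def mean_squared_error(net, data):
    return sum((net.forward(x)[1] - y) ** 2 for x, y in data) / len(data)

# Hypothetical toy data: two descriptors, property = x1 * x2 (non-linear).
data = [([a / 4, b / 4], (a / 4) * (b / 4)) for a in range(5) for b in range(5)]
net = TinyNetwork(n_inputs=2, n_hidden=4)
mse_before = mean_squared_error(net, data)
for epoch in range(500):
    for x, y in data:
        net.train_step(x, y)
mse_after = mean_squared_error(net, data)
```

Comparing `mse_before` with `mse_after` shows the effect of the weight adjustments on the training error; it says nothing about generalization, which would need a held-out set.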
2.4.2.4 Support Vector Machines
Support Vector Machines (SVMs) are non-linear supervised learning algorithms used for a variety of classification and regression problems. Burbidge et al. (2001) published the first studies in which SVMs were tested on QSAR problems; in those studies the methodology outperformed the other machine learning tools tested, in terms of either results or computational efficiency.
Unlike methodologies based on heuristic optimization methods, SVMs are based on the solution of a convex quadratic programming problem, for which the solution reached is guaranteed to be a minimum and is deemed to be unique. SVMs work by identifying the instances in the data (the support vectors) that define a decision hyperplane, or set of hyperplanes, in a high-dimensional space. The hyperplane maximizes the margin after a mathematical transformation of the variable space through a kernel function applied to the support vectors. Kernel functions are usually linear, polynomial, radial basis or sigmoid, and machine learning libraries generally provide implementations of all of these.
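The four kernel families can be written down directly. The sketch below uses plain Python; the parameter names `gamma`, `coef0` and `degree` follow a common library convention and, like the default values, are assumptions rather than anything prescribed by the cited works. The `decision_function` helper shows how a trained SVM would combine kernel evaluations against its support vectors; the coefficients and bias passed to it would come from solving the quadratic programme, which is not shown.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def linear_kernel(a, b):
    return dot(a, b)

def polynomial_kernel(a, b, degree=3, gamma=1.0, coef0=1.0):
    return (gamma * dot(a, b) + coef0) ** degree

def rbf_kernel(a, b, gamma=1.0):
    # Radial basis function: depends only on the distance between a and b.
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-gamma * sq_dist)

def sigmoid_kernel(a, b, gamma=1.0, coef0=0.0):
    return math.tanh(gamma * dot(a, b) + coef0)

def decision_function(support_vectors, dual_coefs, bias, x, kernel):
    """Signed score for x: weighted kernel similarities to the support vectors."""
    return sum(c * kernel(sv, x)
               for sv, c in zip(support_vectors, dual_coefs)) + bias

# Hypothetical two-descriptor example with made-up coefficients.
score = decision_function([[1.0, 0.0], [0.0, 1.0]], [1.0, -1.0], 0.0,
                          [1.0, 0.0], linear_kernel)
```

Only the support vectors enter the decision function, so prediction cost scales with their number rather than with the size of the training set.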
One of the distinctive characteristics of SVMs is the capability of handling a very large number of descriptor variables with minimal over-fitting, which is often a problem with other methodologies such as ANNs. The main disadvantages of SVMs are the lack of transparency of the results, due to their non-parametric nature, and the sensitivity of the algorithm to the choice of kernel parameters (Burges, 1998; Dudek et al., 2006).
2.4.2.5 Random Forests
Random Forests (RFs) are an ensemble method for classification or regression (Breiman, 2001). Ensemble methods are based on the repeated application of a simple classification or regression algorithm to randomly defined subsets of the data, and use a consensus voting procedure to determine the outcome. As their base classification or regression algorithm, RFs use simple decision trees, in which the leaves represent the property/activity value and the branches represent conjunctions of descriptors that describe the structure of the compounds. Each tree is constructed independently of the previous trees, using a different bootstrap sample of the data drawn with replacement, and each node is split using the best split among a subset of predictors randomly chosen at that node.
The basic process of building an RF can be summarized in the following sequence of steps, which is repeated once for every iteration (i = 1, ..., N), where the number of iterations N is specified by the user. Each iteration produces a simple decision tree from a set of variables and instances, and a distinct set of variables and instances is used for each fitted tree. First, a bootstrapping procedure is run on the training dataset, selecting with replacement a set of instances (Υi) with size equal to that of the training set. Second, a small subset of the independent variables (∆i), whose size is specified by the user, is randomly selected from all the available variables. Then a decision tree model DTi = f(Υi,∆i) is fitted to Υi and ∆i. The set of all decision tree models (DTi, where i = 1, ..., N) is a random forest. Prediction consists of applying all the trees to a new dataset and producing a consensus result from the classification or prediction outcomes of the individual decision trees. RFs natively allow for out-of-bag validation, that is, each tree is validated with the instances that were not selected for its training (about one-third of the set), and a global consensus statistic can be produced.
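The steps above can be sketched as follows. This is a deliberately simplified illustration, not Breiman's implementation: the base learners are depth-1 trees (decision stumps) rather than fully grown trees, the descriptor subset is drawn once per tree, and the function names and toy data are assumptions.

```python
import random
from collections import Counter

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def fit_stump(rows, labels, features):
    """Depth-1 tree: choose the (descriptor, threshold) split with fewest errors."""
    best = None
    for f in features:
        values = sorted({r[f] for r in rows})
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2  # midpoint between neighbouring values
            left = [y for r, y in zip(rows, labels) if r[f] <= threshold]
            right = [y for r, y in zip(rows, labels) if r[f] > threshold]
            if not left or not right:
                continue
            l_lab, r_lab = majority(left), majority(right)
            errors = (sum(y != l_lab for y in left)
                      + sum(y != r_lab for y in right))
            if best is None or errors < best[0]:
                best = (errors, f, threshold, l_lab, r_lab)
    if best is None:  # no valid split: predict the majority class everywhere
        m = majority(labels)
        return (features[0], float("inf"), m, m)
    return best[1:]

def predict_stump(stump, row):
    f, threshold, l_lab, r_lab = stump
    return l_lab if row[f] <= threshold else r_lab

def fit_forest(rows, labels, n_trees=25, n_features=1, seed=0):
    rng = random.Random(seed)
    n = len(rows)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]          # bootstrap sample
        feats = rng.sample(range(len(rows[0])), n_features)  # descriptor subset
        stump = fit_stump([rows[i] for i in idx], [labels[i] for i in idx], feats)
        forest.append((stump, set(range(n)) - set(idx)))     # keep out-of-bag ids
    return forest

def predict_forest(forest, row):
    return majority([predict_stump(stump, row) for stump, _ in forest])

def oob_accuracy(forest, rows, labels):
    """Score each instance using only the trees that did not train on it."""
    correct = total = 0
    for i, (row, y) in enumerate(zip(rows, labels)):
        votes = [predict_stump(s, row) for s, oob in forest if i in oob]
        if votes:
            total += 1
            correct += majority(votes) == y
    return correct / total if total else float("nan")

# Hypothetical toy data: two descriptors, binary activity.
ROWS = [[0.1, 0.2], [0.2, 0.1], [0.3, 0.3], [0.2, 0.3],
        [0.8, 0.9], [0.9, 0.7], [0.7, 0.8], [0.9, 0.9]]
LABELS = [0, 0, 0, 0, 1, 1, 1, 1]
forest = fit_forest(ROWS, LABELS)
oob_acc = oob_accuracy(forest, ROWS, LABELS)
```

The out-of-bag score is obtained without holding out any data, because each instance is scored only by the trees whose bootstrap sample missed it.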
The generalization ability of this method depends on the strength of the individual trees in the forest and on the correlation between them. The RF algorithm has several characteristics that make it suitable for QSPR/QSAR datasets (Breiman, 2001; Statnikov & Aliferis, 2008): a) it can be used when there are more variables than observations; b) it has good predictive performance even when noisy variables are present; c) it is not very sensitive to the algorithm parameters, so there is little need to tune the default parameters to achieve good performance; d) because it combines a large number of simple models, it largely reduces the problems caused by over-fitting; e) it can handle a mixture of categorical and continuous descriptors; f) it returns measures of descriptor importance; g) high-quality, free implementations of the method are available.
Furthermore, there is no need for a separate cross-validation, as the generalization error is estimated internally from the out-of-bag instances, given that each tree is constructed using a different bootstrap sample of the original data.