1.7. In silico Methods in Drug Discovery
1.7.1. Quantitative Structure-Activity Relationships (QSAR)
1.7.1.2. QSAR Model Development and Validation
1.7.1.2.2. Validation of QSAR Models
The best-fitting models are not necessarily the best ones for prediction. Only a stable and predictive model can be usefully interpreted for its mechanistic meaning, even though such interpretation is not always easy or feasible (Gramatica, 2011). Applying statistical techniques in this context amounts to ‘statistical learning’ from data, and the resulting models can then be used for prediction. Much effort has been devoted to validating QSAR models. Validation has usually been judged in terms of a model’s statistical fit; more recently, the focus has turned to the use of an external test set (Cronin, 2010).
Various strategies can be used to validate QSAR models. According to Wold and Eriksson (1995), the most important are: 1. an internal validation set or a standard cross-validation method; 2. external validation, in which the dataset is split into a training set for model development and a validation set for evaluating the predictive ability of the model; 3. blind external validation, in which the model is applied to a new external set; and 4. data randomisation, or Y-scrambling, to verify the absence of chance correlation between the dependent variable and the descriptors.
The general idea of V-fold cross-validation is to divide the overall sample into V subgroups (folds). The subgroups are removed from the training set one at a time to serve as the internal test set, and the model is redeveloped each time on the remaining compounds (V – 1 folds). For each modelling run, an index of predictive validity is computed for the subgroup that was left out, and the results of the V replications are averaged to yield a single measure of the stability of the model. V-fold cross-validation is used in various analytical procedures to avoid overfitting (Burden, 1989), and it is especially useful when the dataset is not large enough to allow external validation of the model. The leave-one-out (LOO) method is the special case in which V equals the number of compounds, so that each compound is left out in turn. The outcome of the procedure is the cross-validated R2 (q2), which may be regarded as a criterion of both the robustness and the predictive ability of the model. The robustness of the LOO procedure has been debated recently (Kubinyi et al., 1998; Golbraikh et al., 2003).
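As an illustration of the procedure described above, the following minimal Python sketch computes q2 by V-fold cross-validation. It assumes an ordinary least-squares model and the NumPy library, neither of which is prescribed by the cited sources, and the function names (fit_ols, predict_ols, q2_vfold) are hypothetical. Here q2 is taken as 1 - PRESS/SS, where PRESS pools the squared prediction errors of all left-out compounds and SS is the total sum of squares about the mean activity; pooling the errors, rather than averaging a per-fold index, is one common convention and is an assumption of this sketch.

```python
import numpy as np


def fit_ols(X, y):
    """Least-squares coefficients for y ~ intercept + X (assumed model type)."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def predict_ols(coef, X):
    """Predictions of the fitted linear model for the rows of X."""
    A = np.column_stack([np.ones(len(X)), X])
    return A @ coef


def q2_vfold(X, y, n_folds, seed=0):
    """Cross-validated R2 (q2) = 1 - PRESS / SS over n_folds folds."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))          # shuffle compound indices
    folds = np.array_split(order, n_folds)   # V roughly equal subgroups
    press = 0.0
    for held_out in folds:
        train = np.setdiff1d(order, held_out)            # remaining V - 1 folds
        coef = fit_ols(X[train], y[train])               # refit without the fold
        residuals = y[held_out] - predict_ols(coef, X[held_out])
        press += float(residuals @ residuals)            # pooled squared errors
    ss_total = float(((y - y.mean()) ** 2).sum())
    return 1.0 - press / ss_total


# Synthetic illustration only (not data from any cited study).
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(30, 2))
y_demo = 2.0 * X_demo[:, 0] - X_demo[:, 1] + rng.normal(scale=0.3, size=30)
q2_5fold = q2_vfold(X_demo, y_demo, n_folds=5)
q2_loo = q2_vfold(X_demo, y_demo, n_folds=len(y_demo))  # leave-one-out case
```

Setting n_folds equal to the number of compounds reproduces the LOO case, so the same function covers both variants.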
Y-randomization is widely used in the validation of QSARs, often alongside cross-validation. It consists of repeating the model calculation procedure with randomized activities and then assessing the probability of the resulting statistics (Golbraikh et al., 2003).
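The Y-randomization procedure can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the protocol of the cited authors: the model is assumed to be ordinary least squares fitted with NumPy, the statistic compared is the fitted R2, and the function names (r2_ols, y_randomization) and the number of trials are hypothetical choices. The probability assessment is done here with a simple empirical p-value, namely the fraction of scrambled models whose R2 is at least as high as that of the real model.

```python
import numpy as np


def r2_ols(X, y):
    """R2 of an ordinary least-squares fit of y on X (with intercept)."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    residuals = y - A @ coef
    ss_total = float(((y - y.mean()) ** 2).sum())
    return 1.0 - float(residuals @ residuals) / ss_total


def y_randomization(X, y, n_trials=200, seed=0):
    """Compare the real R2 with R2 values obtained after scrambling y."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    r2_true = r2_ols(X, y)
    # Refit the same model type on randomly permuted activities.
    r2_scrambled = np.array(
        [r2_ols(X, rng.permutation(y)) for _ in range(n_trials)]
    )
    # Empirical p-value; the +1 terms keep it from being exactly zero.
    p_value = (1 + int(np.sum(r2_scrambled >= r2_true))) / (n_trials + 1)
    return r2_true, r2_scrambled, p_value


# Synthetic illustration only (not data from any cited study).
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(40, 2))
y_demo = 1.5 * X_demo[:, 0] + X_demo[:, 1] + rng.normal(scale=0.2, size=40)
r2_true, r2_scrambled, p_value = y_randomization(X_demo, y_demo)
```

A real model whose R2 is not clearly separated from the scrambled values would suggest a chance correlation between the activities and the descriptors.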
A more robust approach is external validation, in which the dataset is split into a training set, used for model development, and a validation set, used to evaluate the predictive ability of the model. The split is made before any model is built, so that the validation set remains external and is not involved at any stage of model development. There are different methods for splitting the data into training and validation sets. It has been suggested that the split should ensure that all representative points of the validation set are close to training set compounds in the multidimensional descriptor space, and that the representative points of the training set are distributed within the whole area occupied by the entire dataset (Golbraikh and Tropsha, 2002). For a homogeneous dataset, the division can be made by randomly allocating a fixed proportion of the compounds to the validation set. To ensure that the training and validation sets cover similar activity ranges, the data can be ranked according to the magnitude of the biological response, and every third or fourth chemical removed to form the validation set (Sharifi and Ghafourian, 2014). Other selection methods are based on relevant physicochemical descriptors, for example through multivariate design; this results in a test series of compounds in which all major structural and chemical properties are systematically varied at the same time (Eriksson et al., 2003). Another method that can ensure a similar distribution of training and validation set data is K-means cluster-based division into training and prediction sets (Leonard and Roy, 2008).
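The activity-ranked division described above can be sketched in a few lines of Python. The function name ranked_split, the step parameter and the activity values are hypothetical and serve only as an illustration; the sketch simply ranks the compounds by biological response and assigns every step-th compound to the validation set.

```python
def ranked_split(activities, step=3):
    """Rank compounds by activity; every `step`-th one goes to validation.

    Returns two lists of indices into `activities`: (training, validation).
    """
    # Indices of the compounds sorted by increasing biological response.
    ranked = sorted(range(len(activities)), key=lambda i: activities[i])
    validation = ranked[step - 1::step]       # every step-th ranked compound
    held_out = set(validation)
    training = [i for i in ranked if i not in held_out]
    return training, validation


# Made-up activity values, for illustration only.
activities = [5.1, 6.3, 4.2, 7.8, 5.9, 6.7, 4.9, 7.1, 5.5]
training, validation = ranked_split(activities, step=3)
# training   -> [2, 6, 8, 4, 5, 7]
# validation -> [0, 1, 3]
```

With this indexing the least active compound always stays in the training set, but the most active one may fall into the validation set (as compound 3 does here); if the validation activities are required to lie strictly within the training range, the starting offset would need to be adjusted.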
1.7.1.2.2.1. Applicability Domain
It is usually noted that a QSAR is applicable only to compounds that are similar to the training set compounds (Katritzky et al., 2001). Structurally limited training sets, which arise when the dataset is small or its chemical diversity is low, restrict the use of QSAR models for future predictions (Dimitrov et al., 2005). Good performance on the training set does not guarantee that a model will be predictive for validation set or external compounds (Stouch et al., 2003). In other words, QSAR models are sometimes not applicable to new compounds, and conditions therefore need to be set for their applicability (Eriksson et al., 2003). This is particularly important in light of the increasing number of so-called global QSAR models, which may be built on small datasets of low diversity (Weaver and Gleeson, 2008) or on poorly homogeneous training sets that contain partially overlapping clusters of compounds, e.g. several classes of chemical compounds or chemotypes (Eriksson et al., 2003). Defining a model’s applicability domain is therefore essential in order to determine the space of chemical structures that can be predicted reliably.
According to Weaver and Gleeson (2008), the domain of applicability is an important concept in QSAR that allows one to estimate the uncertainty in the prediction for a particular molecule based on how similar it is to the compounds used to build the model. In practice, various methods are available for determining the range of applicability of QSAR models. For example, Dimitrov et al. (2005) used a stepwise approach to determine the applicability domain of QSAR models, based on the physicochemical properties of the training sets of toxicity and skin sensitization datasets. The method involves four stages to account for the diversity and complexity of the QSAR models. First, the range of variation of the physicochemical properties of the training set compounds is specified. Second, the structural similarity between chemicals that are correctly predicted by the model is assessed. Third, the domain is defined on the basis of a mechanistic understanding of the modelled phenomenon. Finally, if metabolic activation of chemicals is part of the (Q)SAR model, the reliability of the simulated metabolism is taken into account in assessing the reliability of predictions (Dimitrov et al., 2005).
Sahigara et al. (2012) reviewed the applicability domain methods and classified them into: 1. range-based and geometric methods; 2. distance-based methods; 3. probability density distribution-based methods; and 4. other approaches, which may include decision trees, decision forests and stepwise approaches such as the method suggested by Dimitrov et al. (2005). Range-based methods are the simplest; they may use a ‘bounding box’ defined by the maximum and minimum values of each descriptor used to build the model, or of the principal components obtained from PCA (Netzeva et al., 2005). In distance-based methods, the distance between an individual molecule and a defined point within the descriptor space of the training data is first computed using a common distance measure, e.g. Euclidean distance. A threshold, which is a user-defined parameter, is then applied to separate out the compounds that fall outside the domain of applicability (Xu and Gao, 2003). As a distance-based method, the k-nearest-neighbour method can be used to measure similarity by calculating the distance between the compound and its nearest neighbour compound in the training set (Xu and Gao, 2003). Probability density distribution-based methods are some of the most advanced approaches for defining the applicability domain, as they are able to identify internal empty regions within the data.
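The range-based and distance-based checks described above can be illustrated with the following minimal Python sketch. It is not the implementation of any of the cited authors: it assumes NumPy, descriptors that have already been scaled to comparable ranges, Euclidean distance to the single nearest training compound, and an arbitrary user-chosen threshold. The function names (in_bounding_box, nearest_neighbour_distance, in_distance_domain) and the toy descriptor values are hypothetical.

```python
import numpy as np


def in_bounding_box(X_train, X_query):
    """Range-based check: every descriptor within the training min/max."""
    X_train = np.asarray(X_train, dtype=float)
    X_query = np.asarray(X_query, dtype=float)
    lo, hi = X_train.min(axis=0), X_train.max(axis=0)
    return np.all((X_query >= lo) & (X_query <= hi), axis=1)


def nearest_neighbour_distance(X_train, X_query):
    """Euclidean distance from each query to its nearest training compound."""
    X_train = np.asarray(X_train, dtype=float)
    X_query = np.asarray(X_query, dtype=float)
    # Pairwise differences, shape (n_query, n_train, n_descriptors).
    diffs = X_query[:, None, :] - X_train[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=2)).min(axis=1)


def in_distance_domain(X_train, X_query, threshold):
    """Distance-based check against a user-defined threshold."""
    return nearest_neighbour_distance(X_train, X_query) <= threshold


# Toy descriptor space: training compounds at the corners of a unit square.
X_train = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
X_query = [[0.5, 0.5], [2.0, 2.0], [0.9, 0.1]]

box_flags = in_bounding_box(X_train, X_query)
# box_flags  -> [True, False, True]
dist_flags = in_distance_domain(X_train, X_query, threshold=0.5)
# dist_flags -> [False, False, True]
```

In this toy example the first query compound lies inside the bounding box but far from every training compound, so the two checks disagree; this is the kind of internal empty region that a simple range-based method cannot detect.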