3.4 Conclusions
4.2.3 Filtering approaches for Dataset 1
In this study, both global and local models (as described in section 2.3.5, Chapter 2) were created and compared for the prediction of solute coefficients. The five solute coefficients of the HSM were predicted as functions of molecular descriptors for their subsequent use in retention prediction. A popular version of the GA-PLS algorithm, originally written by Leardi [23] in Matlab (The Mathworks Inc., Natick, MA, USA), was modified and used for descriptor selection and for modelling the solute coefficients. To cope with the variability of the results arising from the intrinsically random selection performed by the genetic algorithm, the GA-PLS modelling
was repeated five times and the results were averaged. This part of the work was performed for the 88 compounds in Dataset 1 on the GL Inertsil ODS-3 column with different sources of molecular descriptors (Dragon, VolSurf+ and the combined descriptors).
Leave-one-out (LOO) filtering: in this filtering method, one compound was taken out as the target compound and the remaining compounds were used as the training set for QSRR modelling. In this way, each target compound has its own local model for the prediction of the five solute coefficients, created using information from all the other compounds in the dataset. Different sources of molecular descriptors were investigated and compared.
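The LOO filtering described above can be sketched as follows. This is a minimal illustration only: the compound names are placeholders, and the model-fitting step (GA-PLS in this work) is deliberately left out, so the sketch shows only how each target is paired with its own training set.

```python
from typing import Iterator, List, Sequence, Tuple


def leave_one_out_splits(
    compounds: Sequence[str],
) -> Iterator[Tuple[str, List[str]]]:
    """Yield (target, training_set) pairs, one pair per compound."""
    for i, target in enumerate(compounds):
        # Every compound except the target goes into the training set.
        training = [c for j, c in enumerate(compounds) if j != i]
        yield target, training


# Placeholder names standing in for the 88 compounds of Dataset 1.
compounds = [f"compound_{i:02d}" for i in range(1, 89)]
splits = list(leave_one_out_splits(compounds))

first_target, first_training = splits[0]
print(first_target, len(first_training))
```

Each of the 88 splits would then be passed to the modelling routine, giving one local model per target compound.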
The Global approach: rather than building a model for each compound, another option for QSRR modelling, which is easy to interpret, is to build just one model for all the compounds using a randomly allocated training set, while the rest of the compounds in the database are treated as an external test set. In this work, a D-optimal algorithm suggested by Todeschini et al. [24] was used to allocate compounds into a training set (70%) and a test set (30%). This distance-based selection approach ensures homogeneous sampling from a database, leading to a uniform distribution of compounds between the resulting subsets [24]. Compounds in the training set were used to build QSRR models through GA-PLS, and compounds in the test set were employed to evaluate the predictive ability of the constructed QSRR models. Again, different sources of molecular descriptors were investigated and compared.
Local Compound Type (LCT) filtering: another approach for QSRR modelling involves the classification of compounds. Instead of building a local model for each compound separately, or one model to describe the whole dataset, a model was built for each group of compounds falling within the same class. For LCT, compounds were clustered according to their type (bases, acids, and neutrals), so that compounds belonging to the same type were placed in the same cluster. Each cluster was then divided, as for the global model, into a training set and a test set. A QSRR model was then derived for each cluster for the prediction of solute coefficients, and the sources of molecular descriptors were compared.
Local Second Dominant Interaction (LSDI) filtering: in Wilson’s work [25], the HSM solute coefficients were experimentally generated based on the classification of compounds according to the interaction between compounds and stationary phase. “Ideal compounds” for which the retention was determined entirely by the hydrophobic interaction were used to yield the hydrophobicity coefficient [25]. Then, other solute coefficients were generated using a group of compounds that have been clustered according to their secondary dominant interaction after the hydrophobic interaction. Thus, compounds in Wilson's study were
allocated to different clusters, each corresponding to one of the terms in the HSM [25]. Following the subtraction of the effect of hydrophobicity, retention for compounds in each cluster was assumed to be predominantly influenced by the type of interaction linked to that cluster. Five clusters were identified, namely η' (hydrophobicity only), σ' (steric bulk), β' (hydrogen bonding basicity), α' (hydrogen bonding acidity), and κ' (charged compound) clusters, containing 25, 21, 4, 16 and 7 compounds, respectively [25]. Compounds for which retention appeared to be substantially influenced by more than one type of interaction (excluding hydrophobicity) were not assigned to any cluster by Wilson and co-workers, but have been allocated into a separate cluster (cluster 6) in the present study. After compound classification using this approach, and separation of each cluster into training and test sets, a QSRR model was built for each cluster, the solute coefficients were predicted, and the sources of molecular descriptors were compared.
4.2.4 Filtering approaches for the combined dataset
The prediction of ported solute coefficients and retention times was also performed on a combined dataset which contained 148 compounds (88 compounds from Dataset 1 and 60 compounds from Dataset 2). Again, both global and local QSRR models were built through PLS equations using a Matlab platform. This part of the study was performed for the retention data of the combined 148 compounds on a common column with molecular descriptors generated using VolSurf+ only.
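As a reminder of how predicted solute coefficients translate into retention, the sketch below evaluates the HSM in its commonly cited form, log α = η'H − σ'S* + β'A + α'B + κ'C, where α is the retention factor relative to a reference solute. The class and function names, and the conversion to retention time via t_R = t0(1 + k), are illustrative assumptions rather than the implementation used in this work, and the numerical values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SoluteCoefficients:
    eta: float    # η', hydrophobicity
    sigma: float  # σ', steric bulk
    beta: float   # β', hydrogen-bond basicity
    alpha: float  # α', hydrogen-bond acidity
    kappa: float  # κ', charge


@dataclass
class ColumnCoefficients:
    H: float  # hydrophobicity
    S: float  # steric resistance (S*)
    A: float  # hydrogen-bond acidity
    B: float  # hydrogen-bond basicity
    C: float  # cation-exchange capacity


def hsm_log_alpha(s: SoluteCoefficients, c: ColumnCoefficients) -> float:
    """log(k / k_ref) from the HSM; note the negative steric term."""
    return (s.eta * c.H - s.sigma * c.S + s.beta * c.A
            + s.alpha * c.B + s.kappa * c.C)


def retention_factor(log_alpha: float, k_ref: float) -> float:
    """Retention factor k of the solute, given k of the reference solute."""
    return k_ref * 10 ** log_alpha


def retention_time(k: float, t0: float) -> float:
    """Retention time from the retention factor and the column dead time."""
    return t0 * (1.0 + k)


# Hypothetical values, for illustration only.
solute = SoluteCoefficients(eta=-0.5, sigma=0.1, beta=0.02, alpha=0.3, kappa=0.0)
column = ColumnCoefficients(H=1.0, S=0.02, A=-0.1, B=0.01, C=0.1)
k = retention_factor(hsm_log_alpha(solute, column), k_ref=5.0)
print(round(retention_time(k, t0=1.2), 3))
```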
Local Tanimoto Similarity (LTS) and Local Log D (LLD) filtering: previous results have shown that acceptable retention prediction can be obtained when the compounds in a training set have a sufficient level of similarity to the target [26]. In the LTS method, filtering was based on the Tanimoto Similarity (TS) index: the dataset was sorted according to the compounds' pairwise TS indices in relation to the target. Then, the top five compounds with pairwise TS indices of at least 0.5 were used as a training set to derive a separate QSRR model [26, 27]. If five compounds with a TS index of at least 0.5 could not be found, the compound was not modelled.
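The LTS selection rule can be illustrated with a short sketch. It assumes that each compound is represented by a set of "on" fingerprint bits; the fingerprint type, the function names, and the toy data are hypothetical and are not taken from this work.

```python
from typing import Dict, FrozenSet, List, Optional


def tanimoto(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Tanimoto index: shared bits divided by bits present in either set."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def lts_training_set(
    target: str,
    fingerprints: Dict[str, FrozenSet[int]],
    n: int = 5,
    threshold: float = 0.5,
) -> Optional[List[str]]:
    """Return the n compounds most similar to the target, or None if
    fewer than n compounds reach the similarity threshold."""
    scored = [
        (tanimoto(fingerprints[target], fp), name)
        for name, fp in fingerprints.items()
        if name != target
    ]
    similar = sorted((s for s in scored if s[0] >= threshold), reverse=True)
    if len(similar) < n:
        return None  # the target is not modelled
    return [name for _, name in similar[:n]]


# Toy fingerprints, for illustration only.
fps = {
    "T": frozenset({1, 2, 3, 4}),
    "A": frozenset({1, 2, 3, 4}),
    "B": frozenset({1, 2, 3}),
    "C": frozenset({1, 2, 3, 4, 5}),
    "D": frozenset({1, 2}),
    "E": frozenset({3, 4, 5, 6}),
    "F": frozenset({1, 2, 4, 7}),
}
print(lts_training_set("T", fps))  # five neighbours reach the threshold
print(lts_training_set("E", fps))  # too few similar compounds
```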
In the LLD method, compounds were sorted based on the ratio of their log D value to the log D of the target. Then, the top five compounds with a log D ratio of less than 1.1 were used as the training set [26]. Other conditions were the same as for the LTS approach. It is worth pointing out that, with either the LTS or the LLD approach, each compound has its own separate model for the prediction of solute coefficients.
The Global approach: this approach used all the compounds in the dataset to build one global model without any compound classification for the retention prediction of all
compounds [27]. Here, global PLS models (using 126 or 34 VolSurf+ descriptors, referred to as G126 and G34 below) were derived as the benchmark to gauge the improvements that the implementation of GA-PLS could bring to the final models. Before the modelling process, a D-optimal approach was used to split the combined dataset into a training set and an external test set. Then, a global QSRR model was built to predict solute coefficients, and retention was predicted by fitting the predicted solute coefficients and their complementary column coefficients into the HSM.
Local Compound Type (LCT) filtering: the LCT filter described above was also applied to the 148 combined compounds based on their chemical nature, giving three clusters containing 13 acids, 12 bases, and 123 neutrals, respectively. Approximately 70% of the compounds in each cluster were then selected using a D-optimal approach for modelling, and the remaining compounds were used for the external validation of the corresponding model.
Local Second Dominant Interaction (LSDI) filtering: as previously described, six clusters were obtained for the 88 compounds from Dataset 1 using this filtering approach. For modelling the combined 148 compounds, the β' cluster, representing compound hydrogen-bond basicity, was expanded to seven compounds by adding three more compounds with the same property from Dataset 2. As before, 70% of the compounds in each cluster were allocated to the training set using a D-optimal algorithm, while the rest of the compounds in each cluster were taken as test sets. The applicability of this approach for predicting retention time for new test compounds was also demonstrated by using Dataset 2 as an external test set. To allocate new test compounds to the corresponding cluster so that the correct model could be used, TS searching was introduced. The structural similarity of each compound in Dataset 2 was investigated against the training compounds (Dataset 1) in each cluster, with the aim of finding one training compound with a pairwise Tanimoto structural similarity of at least 0.5. If such a similar compound was found, the target compound was assigned to the same LSDI cluster as the compound with the greatest pairwise similarity. In total, 28 compounds out of 57 in Dataset 2 were assigned to clusters as follows: ' cluster (19 compounds), σ' cluster (6 compounds) and cluster 6 (3 compounds). The other 29 compounds in Dataset 2 were excluded since their pairwise TS indices were less than 0.5 when calculated against every compound in Dataset 1.
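The cluster assignment by TS searching can be sketched as a nearest-neighbour lookup with a similarity cut-off. As before, fingerprints are assumed to be sets of "on" bits, and all names and data are hypothetical.

```python
from typing import Dict, FrozenSet, Optional, Tuple


def tanimoto(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Tanimoto index: shared bits divided by bits present in either set."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def assign_cluster(
    target_fp: FrozenSet[int],
    training: Dict[str, Tuple[str, FrozenSet[int]]],
    threshold: float = 0.5,
) -> Optional[str]:
    """Return the cluster of the most similar training compound, or None
    if no training compound reaches the similarity threshold."""
    best_cluster, best_sim = None, 0.0
    for _name, (cluster, fp) in training.items():
        sim = tanimoto(target_fp, fp)
        if sim > best_sim:
            best_cluster, best_sim = cluster, sim
    return best_cluster if best_sim >= threshold else None


# Toy training compounds: name -> (cluster label, fingerprint).
training = {
    "t1": ("sigma", frozenset({1, 2, 3})),
    "t2": ("cluster 6", frozenset({7, 8, 9, 10})),
}
print(assign_cluster(frozenset({1, 2, 3, 4}), training))
print(assign_cluster(frozenset({7, 8, 20, 21, 22}), training))
```

A compound for which the function returns None corresponds to one of the excluded compounds described above.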
4.2.5 Statistics
The coefficient of determination (R²), the slope of the regression with no forced intercept and the root-mean-squared error of prediction (RMSEP) were used to evaluate model fitness, with the requirement for the slope to be within the range of 0.85 to 1.15 [28]. The percentage root-mean-square error of prediction (RMSEP%) of retention time for the test set
was measured to externally validate the accuracy of the GA-PLS models generated from the training set. The equations for the RMSEP and the RMSEP% were detailed in section 2.3.6, Chapter 2 (Eq. 2.2 and Eq. 2.3).
The predictive ability of the models was evaluated by inspecting the Regression Error Characteristic (REC) curves, obtained by plotting the prediction error range against the percentage of data points predicted within that range [29, 30]. The null model, which can be regarded as the baseline model, was obtained by using the mean of the dependent variable (response) as a naïve predicted value for all compounds. REC curves were used to show the differences between regression models and have the advantage that the ranking of models is independent of the error measure used [30].
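A REC curve and its null-model baseline can be computed as in the following sketch. It assumes the absolute prediction error as the error measure and uses made-up retention values; the function names are illustrative and do not reproduce the routines used in this work.

```python
import math
from typing import List, Sequence, Tuple


def rmsep(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Root-mean-squared error of prediction."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def rec_curve(
    observed: Sequence[float], predicted: Sequence[float]
) -> List[Tuple[float, float]]:
    """Return (tolerance, fraction of points with |error| <= tolerance)."""
    errors = [abs(o - p) for o, p in zip(observed, predicted)]
    n = len(errors)
    return [
        (tol, sum(e <= tol for e in errors) / n)
        for tol in sorted(set(errors))
    ]


def null_model(observed: Sequence[float]) -> List[float]:
    """Baseline: predict the mean response for every compound."""
    mean = sum(observed) / len(observed)
    return [mean] * len(observed)


# Made-up observed and predicted values, for illustration only.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.5, 3.0, 3.0]

model_curve = rec_curve(observed, predicted)
null_curve = rec_curve(observed, null_model(observed))
print(model_curve)
print(null_curve)
```

A model whose curve rises faster than the null curve predicts more compounds within any given error tolerance than the naïve mean predictor.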
In addition, the overall performance of all generated models was further compared using the sum of ranking differences (SRD) approach [31-33], in which the parameters for each model were compared to a series of reference values and each model was ranked according to the size of the difference between its parameters and the reference values. The rankings were also compared to a confidence interval generated using randomly ranked numbers [32, 34]. This ordering method provides a simple way to evaluate the models by comparing their SRD values (the closer the SRD value is to zero, the better the approach) [31].
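The core SRD calculation can be sketched as below: the values of a model and of the reference are each converted to ranks, and the absolute rank differences are summed. Ties are given average ranks here, which is an assumption; the comparison against randomly ranked numbers is not included, and the data are made up.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1 (smallest); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def srd(model_values: Sequence[float], reference: Sequence[float]) -> float:
    """Sum of absolute differences between model and reference ranks."""
    model_ranks = average_ranks(model_values)
    reference_ranks = average_ranks(reference)
    return sum(abs(m - r) for m, r in zip(model_ranks, reference_ranks))


# Made-up values: one reference and two candidate models.
reference = [1.0, 2.0, 3.0, 4.0]
same_order = [10.0, 20.0, 30.0, 40.0]
reversed_order = [40.0, 30.0, 20.0, 10.0]

print(srd(same_order, reference))
print(srd(reversed_order, reference))
```

A model that orders the objects exactly as the reference does obtains an SRD of zero, while a fully reversed ordering gives the largest value for that number of objects.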