3.2.1 Database
The three retention datasets used in the present study were described in section 2.1, Chapter 2, including the names of the compounds (Tables 2.1, 2.3 and 2.4 in Chapter 2), the retention information (Tables 2.5, 2.6 and 2.7 in Chapter 2), the characteristics of the columns (section 2.1 and Table 2.1 in Chapter 2), and the chromatographic conditions used. Retention predictions were performed for the compounds in Dataset 1 on ten columns (columns 1 to 10), the compounds in Dataset 2 on five columns (columns 11 to 15), and the compounds in Dataset 3 on a Zorbax Eclipse Plus C18 column (column 16).
3.2.2 Calculation of molecular descriptors
In this study, Dragon descriptors were calculated and employed. The Dragon 6.0 software [40] can calculate more than 4000 molecular descriptors, comprising constitutional, topological, geometrical, electrostatic, physical, and quantum chemical descriptors [41]. The molecular descriptors were calculated as detailed in section 2.3.2, Chapter 2. The resulting descriptor sets were used to build predictive models for the experimental chromatographic retention data. In total, 1448 descriptors were calculated and exported for each compound in the three datasets.
3.2.3 Similarity ranking
In the present study, several similarity-ranking filters were applied to the datasets to generate suitable training sets. Each compound in each dataset was used in turn as a ‘target compound’. Its retention time was then predicted using models built from a subset of the other compounds in the dataset, selected by applying various filters. The filters were Tanimoto similarity (based on the similarity of chemical structure), physico-chemical properties such as log D and log P, chromatographic similarity reflected by the ratio of retention factors (k-ratio), and a dual filter using Tanimoto or log D as the primary filter and k-ratio as the secondary filter.
Tanimoto similarity: in the first modelling approach, filtering was performed based on the Tanimoto Similarity (TS) index. One compound (the target compound) was left out of the dataset and the rest sorted based on their pairwise TS index in relation to the target. Then the top ten compounds were used as a training set to derive a separate QSRR model for retention factors. If ten compounds could not be found in the training set, the target compound was not modelled. The derived models were then used to predict the retention factor for the target and this process was repeated for each of the compounds in the dataset. The TS-values were calculated using JChem for Excel (ChemAxon, Budapest, Hungary).
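The Tanimoto ranking step can be illustrated with a minimal Python sketch. It assumes that each compound is represented by a fingerprint stored as a set of 'on' bit positions; the function names and toy fingerprints are hypothetical and are not taken from the study, in which the TS values were obtained from JChem for Excel.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto index of two fingerprints given as sets of 'on' bit positions."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0


def top_n_similar(target, fingerprints, n=10):
    """Names of the n compounds most similar to `target`, or None if fewer than n exist."""
    scored = [
        (name, tanimoto(fingerprints[target], fp))
        for name, fp in fingerprints.items()
        if name != target  # leave the target compound out
    ]
    if len(scored) < n:
        return None  # the target compound is not modelled
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in scored[:n]]


# Hypothetical toy fingerprints (not data from the study)
toy = {"A": {1, 2, 3}, "B": {1, 2, 3, 4}, "C": {1, 9}, "D": {7, 8}}
training = top_n_similar("A", toy, n=2)
```

In this toy example, n is reduced to 2 only so that the small dictionary yields a training set; the default of ten matches the text.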
Physico-chemical parameter similarity (represented by log D and log P): some representative descriptors such as log D and log P can be used as filters to derive training sets. In the log D approach, filtering was performed based on the log D values. One compound (the target compound) was left out of the dataset and the rest were sorted based on the difference in log D relative to the target. The largest difference allowed for filtering was 0.2. Similarly, for the log P approach, compounds in the training set were selected based on the ratio of their log P to that of the target compound (with the ratio always > 1), and the largest ratio allowed was 1.2. Finally, the selected compounds were used as a training set to derive a QSRR model for the retention factor of the target compound. The minimum number of compounds in the training set was five; if five compounds within the respective cut-off could not be found, the target compound was not modelled. The log D and log P values at the pH of the RPLC mobile phase (pH = 2.8) were calculated using InstantJChem (ChemAxon).
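The two physico-chemical filters can be sketched as follows. This is an illustration under stated assumptions rather than the routine used in the study: the cut-offs are treated as inclusive, the log P ratio is taken as the larger magnitude over the smaller, and pairs with a zero or opposite-sign log P are simply treated as dissimilar. All function names and values are hypothetical.

```python
def logd_filter(target, logd, max_diff=0.2, min_size=5):
    """Compounds whose log D differs from the target's by at most max_diff."""
    ref = logd[target]
    selected = [name for name, value in logd.items()
                if name != target and abs(value - ref) <= max_diff]
    return selected if len(selected) >= min_size else None  # None: not modelled


def logp_ratio(a, b):
    """Larger over smaller magnitude, so the ratio is never below 1."""
    if a * b <= 0:
        return float("inf")  # assumption: zero or opposite sign counts as dissimilar
    return max(abs(a), abs(b)) / min(abs(a), abs(b))


def logp_filter(target, logp, max_ratio=1.2, min_size=5):
    """Compounds whose log P ratio to the target is at most max_ratio."""
    ref = logp[target]
    selected = [name for name, value in logp.items()
                if name != target and logp_ratio(value, ref) <= max_ratio]
    return selected if len(selected) >= min_size else None


# Hypothetical values (not data from the study)
logd = {"T": 1.00, "a": 1.05, "b": 0.90, "c": 1.15, "d": 0.85, "e": 1.10, "f": 2.00}
logp = {"T": 2.0, "a": 2.2, "b": 1.8, "c": 3.0, "d": -1.0}
logd_set = logd_filter("T", logd)
logp_set = logp_filter("T", logp, min_size=2)
```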
Chromatographic similarity (represented by k-ratio index): to yield accurate predictive QSRR models, a practically useful similarity index should correspond to the chromatographic similarity between compounds [5]. Therefore, the ratio of retention factors was also considered as a filter in the present study, although it cannot be applied in practice. For k-ratio filtering, the compounds in the database were ranked according to the ratio of each compound’s retention factor k to the k-value of the reference compound (with the k-ratio always > 1). The training set was then built using a k-ratio threshold (k-ratio < 1.5 in this study) to construct predictive QSRR models. The minimum number of compounds in the training set was five.
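A minimal Python sketch of the k-ratio filter is given below. It assumes that all retention factors are positive and that the ratio is formed as the larger k over the smaller, which keeps it at or above 1. The function names and retention factors are hypothetical.

```python
def k_ratio(k_a, k_b):
    """Ratio of two positive retention factors, larger over smaller."""
    return max(k_a, k_b) / min(k_a, k_b)


def k_ratio_filter(target, k, max_ratio=1.5, min_size=5):
    """Compounds whose k-ratio to the target is below max_ratio."""
    ref = k[target]
    selected = [name for name, value in k.items()
                if name != target and k_ratio(value, ref) < max_ratio]
    return selected if len(selected) >= min_size else None  # None: too few compounds


# Hypothetical retention factors (not data from the study)
k = {"T": 2.0, "a": 2.5, "b": 1.5, "c": 2.9, "d": 1.4, "e": 3.1, "f": 2.0, "g": 1.9}
training = k_ratio_filter("T", k)
```

Because the filter needs the retention factor of the target itself, a sketch like this is only usable for benchmarking on compounds whose retention has already been measured.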
Dual filter: the k-ratio approach cannot be applied in practice because the retention of the target compound is unknown. The proposed k-ratio filter is therefore useful only as a benchmark. However, the retention of the training set compounds is known and can be used as a secondary filter after the initial application of a Tanimoto or log D (or log P) filter. Therefore, a secondary k-ratio filter was applied to training sets that had been selected using Tanimoto, log D, or log P as the primary filter. The rationale is first to select a training set based on a primary filter (such as Tanimoto or log D), then to scrutinise the retention times of the training set compounds and remove any compounds whose retention times differ widely from the rest.
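One possible form of the secondary filter is sketched below in Python. The text does not state how widely differing retention is detected, so the sketch makes two loudly labelled assumptions: the reference value is the median k of the primary training set, and the same k-ratio threshold of 1.5 is reused. The minimum set size after filtering is likewise a hypothetical parameter.

```python
from statistics import median


def k_ratio(k_a, k_b):
    """Ratio of two positive retention factors, larger over smaller."""
    return max(k_a, k_b) / min(k_a, k_b)


def secondary_k_filter(primary_set, k, max_ratio=1.5, min_size=5):
    """Drop compounds whose k is far from the median k of the primary set.

    Assumption: the median of the primary training set stands in for the
    unknown retention factor of the target compound.
    """
    ref = median(k[name] for name in primary_set)
    kept = [name for name in primary_set if k_ratio(k[name], ref) < max_ratio]
    return kept if len(kept) >= min_size else None


# Hypothetical primary training set and retention factors
primary = ["a", "b", "c", "d", "e"]
k = {"a": 2.0, "b": 2.2, "c": 1.9, "d": 6.0, "e": 2.4}
refined = secondary_k_filter(primary, k, min_size=3)
```

Here compound "d" is removed because its retention factor is far from that of the other training compounds, while the retention of the target is never used.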
3.2.4 QSRR modelling
In the present study, the QSRR models were obtained via a PLS regression in combination with a GA as the variable selection method [5, 35]. The parameters of the GA were detailed in section 2.3.3, Chapter 2. The similarity ranking, descriptor selection, and QSRR modelling were performed in an automated fashion using Matlab software. To enable this, the original GA-PLS Matlab routines from Leardi were modified [34].
3.2.5 Statistics
The coefficient of determination (R²), the slope of the regression with no forced intercept, and the mean absolute prediction error (MAE) were used to evaluate model fitness (as detailed in section 2.3.6, Chapter 2), with the requirement that the slope be within the range of 0.85 to 1.15 [42]. The R² between the predicted and the experimental retention factors was calculated by constructing the corresponding scatter plot and performing a linear regression in Excel.
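For illustration, the three statistics can be computed with a short Python sketch. It assumes that the predicted values are regressed on the experimental values by ordinary least squares with a free intercept, for which R² equals the squared Pearson correlation; the study itself used Excel, and the function name and example values are hypothetical.

```python
def regression_stats(experimental, predicted):
    """R2, slope (free intercept) and MAE for predicted vs experimental values."""
    n = len(experimental)
    mean_x = sum(experimental) / n
    mean_y = sum(predicted) / n
    sxx = sum((x - mean_x) ** 2 for x in experimental)
    syy = sum((y - mean_y) ** 2 for y in predicted)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(experimental, predicted))
    slope = sxy / sxx                 # least-squares slope, intercept not forced
    r2 = sxy ** 2 / (sxx * syy)       # squared Pearson correlation
    mae = sum(abs(y - x) for x, y in zip(experimental, predicted)) / n
    return {"r2": r2, "slope": slope, "mae": mae,
            "slope_ok": 0.85 <= slope <= 1.15}


# Hypothetical retention factors (not data from the study)
stats = regression_stats([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8])
```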