
1.4.1 Theory and background

QSRR, as the name suggests, relate variations in compound structure to variations in retention, and represent a powerful tool in chromatography [10, 89]. More formally, QSRR relates the variation in one response variable (Y-variable) to the variation in several descriptors (X-variables), for predictive or explanatory purposes. The Y-variable is treated as the dependent variable and the X-variables as independent variables. In QSRR, the Y-variable is therefore generally related to the chromatographic retention of solutes, while the X-variables encode the molecular structure of the solutes [89, 90]. Thus, in chromatography, the principal aim of QSRR is to predict retention data from molecular structure. QSRR has been applied to the characterisation of columns by quantitative comparison of separation properties, and to the interpretation of retention mechanisms under various chromatographic conditions (stationary phase, mobile phase, etc.) [72, 91, 92]. QSRR can also be used to predict the retention of solutes or to identify analytes [67, 92]. In chromatography, a typical QSRR study comprises several steps: the compilation of a retention database of compounds with known chemical structures, the calculation of molecular descriptors for each structure, descriptor selection, QSRR model building, and validation [93, 94]. A scheme of the QSRR methodology is shown in Figure 1.5.

Figure 1.5. Scheme of the QSRR methodology in chromatography.

1.4.2 Molecular descriptors

There are several common ways to represent structures [95], including whole-molecule 1D descriptors (sometimes known as 0D), 2D descriptors, and 3D descriptors. 1D descriptors express simple chemical information about a solute, such as its molecular weight or the number of oxygen atoms in its structure, whereas 2D descriptors are computed from the chemical structure of the solute when it is represented by a connection table or a molecular graph. 3D descriptors provide information about the 3D arrangement of structural features and about general molecular surfaces and volumes [95-97].
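As a minimal sketch of what 1D descriptors look like in practice, the following Python snippet derives the molecular weight and the number of oxygen atoms from a molecular formula. The function names, the descriptor labels (`MW`, `nO`, `nAtoms`) and the small table of atomic weights are illustrative assumptions, not the output format of Dragon or VolSurf+, and the parser only handles simple formulas without brackets.

```python
import re

# Rounded standard atomic weights (g/mol) for a few common elements.
# This table is an illustrative assumption; extend it as needed.
ATOMIC_WEIGHTS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999}


def parse_formula(formula):
    """Return {element: count} for a simple formula such as 'C8H10N4O2'."""
    counts = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + (int(number) if number else 1)
    return counts


def descriptors_1d(formula):
    """Compute a few whole-molecule (1D) descriptors from the formula alone."""
    counts = parse_formula(formula)
    molecular_weight = sum(ATOMIC_WEIGHTS[el] * n for el, n in counts.items())
    return {
        "MW": molecular_weight,          # molecular weight
        "nO": counts.get("O", 0),        # number of oxygen atoms
        "nAtoms": sum(counts.values()),  # total atom count
    }


caffeine = descriptors_1d("C8H10N4O2")
print(caffeine)
```

Because these values depend only on the formula, isomers receive identical 1D descriptors; 2D and 3D descriptors are needed to distinguish them.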

One of the crucial problems in QSRR modelling is how to represent the molecular structure. Molecular descriptors that encode chemical structures are usually classified as physico-chemical, quantum-chemical, topological, etc. [95, 96]. One advantage of physico-chemical descriptors is that they are generally strongly related to the retention of solutes. However, they are often not available or have relatively large errors [89, 90, 93]. Quantum-chemical descriptors provide insights into the mechanism of chromatographic retention at the molecular level [89, 93], but their correlation with the retention of solutes is often weak and their calculation is time-consuming. Topological descriptors are easily generated with present computing tools, but they are not necessarily related to retention phenomena [10, 89].

Software such as Dragon and VolSurf+ is widely used to generate molecular descriptors based solely on the chemical structures of the compounds [98-100]. The generated descriptors have been used in QSRR, quantitative structure-property relationship (QSPR) and quantitative structure-activity relationship (QSAR) studies, as well as for similarity analysis and high-throughput screening of molecule databases [101, 102]. Typically, over 4000 molecular descriptors can be generated using Dragon 6.0 software [98, 103]. The 29 categories of Dragon molecular descriptors are detailed in Table 1.1.

Table 1.1. The categories of molecular descriptors from Dragon and VolSurf+ software

Block ID   Dragon                         VolSurf+
1          Constitutional descriptors     Size and shape descriptors
2          Ring descriptors               Descriptors of hydrophilic regions
3          Topological indices            Descriptors of hydrophobic regions
4          Walk and path counts           INTEraction enerGY (= INTEGY) moments
5          Connectivity indices           Descriptors of H-bond donor / acceptor regions
6          Information indices            Mixed descriptors
7          2D matrix-based descriptors    Charge state descriptors
8          2D autocorrelations            3D pharmacophoric descriptors
9          Burden eigenvalues             ADME model descriptors
10         P_VSA-like descriptors
11         ETA indices
12         Edge adjacency indices
13         Geometrical descriptors
14         3D matrix-based descriptors
15         3D autocorrelations
16         RDF descriptors
17         3D-MoRSE descriptors
18         WHIM descriptors
19         GETAWAY descriptors
20         Randic molecular profiles
21         Functional group counts
22         Atom-centred fragments
23         Atom-type E-state indices
24         CATS 2D
25         2D Atom Pairs
26         3D Atom Pairs
27         Charge descriptors
28         Molecular properties
29         Drug-like indices


Unlike Dragon, which calculates a very large number of molecular descriptors, VolSurf+ generates only 128 descriptors for the compounds of interest [99, 100, 104]. VolSurf+ produces and explores the physico-chemical property space of a molecule (or library of molecules) starting from 3D maps of interaction energies between the molecule and chemical probes (GRID-based Molecular Interaction Fields, or MIFs) [99, 105, 106]. One advantage of VolSurf+ is that it compresses the information present in the 3D maps into numerical descriptors optimised for ADME (absorption, distribution, metabolism, and excretion) models and virtual screening, making them simple to understand and easy to interpret [105, 107]. These 128 molecular descriptors can be classified into nine categories [100, 108], which are also listed in Table 1.1. In QSRR modelling, chemometric methods are commonly used to identify the subset of molecular descriptors with the strongest ability to predict retention times and to build the mathematical relationships [109, 110].

1.4.3 Feature selection and regression analysis

The objective of variable selection in QSRR modelling is to choose, from among a large number of generated molecular descriptors, the smallest number that still gives a valid prediction of retention times [38, 67, 72, 92]. Many variable selection methods have been developed, and proper feature selection is key to building successful QSRR models. Feature selection is important because some variables in a given data set may be redundant, irrelevant, or represent noise [38, 72]. A good feature selection method helps to avoid overfitting, reduces the model dimensions, and improves model performance [109, 111]. Many feature selection methods, such as genetic algorithms (GA) and artificial neural networks (ANN) [112, 113], combined with multiple linear regression (MLR) [114] or partial least squares (PLS) [71, 111, 115], have received intensive attention for building the final model in QSRR studies.
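To make the idea of redundant or uninformative variables concrete, the sketch below applies a simple unsupervised pre-filter: it drops descriptors that are constant across all compounds and descriptors that are almost perfectly correlated with one already kept. This is not a genetic algorithm or any of the methods cited above, only an illustration of descriptor pruning; the function names, the 0.95 threshold and the toy descriptor values are assumptions.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def filter_descriptors(columns, max_abs_r=0.95):
    """Return the names of descriptors kept after removing constant and
    highly inter-correlated columns. `columns` maps name -> list of values."""
    kept = []
    for name, values in columns.items():
        if max(values) == min(values):
            continue  # constant descriptor carries no information
        if any(abs(pearson(values, columns[k])) > max_abs_r for k in kept):
            continue  # redundant with a descriptor that is already kept
        kept.append(name)
    return kept


# Toy descriptor table for four hypothetical compounds.
descriptors = {
    "MW": [100.0, 150.0, 200.0, 250.0],
    "nC": [6.0, 9.0, 12.0, 15.0],    # perfectly correlated with MW
    "nRing": [1.0, 1.0, 1.0, 1.0],   # constant across the set
    "logP": [1.2, 0.4, 2.9, 1.1],
}
selected = filter_descriptors(descriptors)
print(selected)
```

A filter of this kind ignores the retention data entirely, so it would normally be followed by a supervised selection step that judges descriptors by their contribution to the prediction.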

MLR is a statistical tool commonly used in QSRR studies and has been widely applied to the selection of molecular descriptors for the construction of QSRR models [10, 89]. With the significant increase in the number of molecular descriptors that can be computed, newer chemometric modelling techniques have been introduced to QSRR modelling in order to manage the greater number of descriptors [109, 116, 117]. PLS is a linear, multiple regression method that has been used frequently in chemometric and multivariate calibration studies. PLS is particularly useful for handling databases with a large number of variables compared to the number of objects, and in the presence of collinear, redundant, and noisy variables [71, 72, 91]. A PLS model can be expressed as


$$ y = a_1 \mathrm{LV}_1 + a_2 \mathrm{LV}_2 + \cdots + a_m \mathrm{LV}_m \qquad (1.1) $$

where $y$ is the dependent variable, $a_1, a_2, \ldots, a_m$ are the regression coefficients, and $\mathrm{LV}_i$ is the $i$-th latent variable. As can be seen from Eq. 1.1, PLS summarises the variation in the independent variables into a small set of linear, orthogonal, and latent variables (LVs) by maximising the covariance between descriptors and the dependent variable [111, 115, 118]. In addition, over-fitting in the models can be minimised by optimising the number of LVs.
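The structure of Eq. 1.1 can be illustrated with a small, dependency-free sketch of single-response PLS (PLS1) fitted by the NIPALS algorithm. It is a teaching sketch under stated assumptions, not the implementation used in this work: the descriptor matrix and retention values are invented, the data are mean-centred but not scaled, and only the fit to the training data is shown. In practice an established library routine (for example `PLSRegression` in scikit-learn) would normally be used.

```python
def pls1_fit(X, y, n_lv):
    """Fit PLS1 by NIPALS. X is a list of rows, y a list of responses.
    Returns (scores, coefs, y_mean): scores[k] is the k-th latent variable
    and coefs[k] the matching coefficient a_k of Eq. 1.1 (for centred y)."""
    n, p = len(X), len(X[0])
    col_means = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    E = [[row[j] - col_means[j] for j in range(p)] for row in X]  # centred X
    f = [v - y_mean for v in y]                                   # centred y
    scores, coefs = [], []
    for _ in range(n_lv):
        # Weight vector: direction in X with maximal covariance with y.
        w = [sum(E[i][j] * f[i] for i in range(n)) for j in range(p)]
        norm = sum(v * v for v in w) ** 0.5
        if norm == 0.0:
            break  # nothing left in X that covaries with y
        w = [v / norm for v in w]
        t = [sum(E[i][j] * w[j] for j in range(p)) for i in range(n)]  # LV scores
        tt = sum(v * v for v in t)
        loading = [sum(E[i][j] * t[i] for i in range(n)) / tt for j in range(p)]
        a = sum(f[i] * t[i] for i in range(n)) / tt
        # Deflate X and y so that the next LV is orthogonal to this one.
        E = [[E[i][j] - t[i] * loading[j] for j in range(p)] for i in range(n)]
        f = [f[i] - a * t[i] for i in range(n)]
        scores.append(t)
        coefs.append(a)
    return scores, coefs, y_mean


def sse(y, scores, coefs, y_mean):
    """Sum of squared residuals when y is rebuilt from the given LVs."""
    total = 0.0
    for i, observed in enumerate(y):
        fitted = y_mean + sum(a * t[i] for a, t in zip(coefs, scores))
        total += (observed - fitted) ** 2
    return total


# Invented data: 5 compounds, 3 descriptors (the first two nearly collinear).
X = [[1.0, 2.1, 0.5], [2.0, 3.9, 1.5], [3.0, 6.2, 0.7],
     [4.0, 7.8, 2.0], [5.0, 10.1, 1.1]]
y = [1.1, 2.3, 2.8, 4.4, 4.9]

scores, coefs, y_mean = pls1_fit(X, y, n_lv=2)
sse_1lv = sse(y, scores[:1], coefs[:1], y_mean)
sse_2lv = sse(y, scores, coefs, y_mean)
print(sse_1lv, sse_2lv)
```

Adding LVs can only lower the training error, which is why the number of LVs is usually chosen by cross-validation or on a held-out set rather than from the fit to the training data.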

1.4.4 Model validation

In QSRR modelling, a training set is used to build QSRR models, and a test set is needed for validation. For this purpose, the measured retention data of the test compounds are compared with the retention data predicted by the derived QSRR models [67, 72, 73, 102]. The statistical reliability of the QSRR models needs to be validated, and this can be done in several ways. The coefficient of determination (R²), the slope of the regression with no forced intercept, the mean absolute error (MAE) and the root-mean-square error of prediction (RMSEP) are commonly used to evaluate the goodness of fit and the predictive ability of the constructed QSRR models [72, 73, 91]. Additionally, the percentage root-mean-square error of prediction (RMSEP%) of retention time for the test set is frequently reported for external validation of the accuracy of QSRR models generated from the training sets.
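The validation statistics named above can be computed as in the following sketch. Several definitions are assumptions made for illustration, since they vary between studies: R² is taken as the squared Pearson correlation between observed and predicted values, the slope is that of the least-squares line of predicted against observed with a free intercept, and RMSEP% is RMSEP expressed as a percentage of the mean observed retention time. The retention times are invented.

```python
import math


def validation_metrics(observed, predicted):
    """Return R2, slope, MAE, RMSEP and RMSEP% for a test set."""
    n = len(observed)
    mean_obs = sum(observed) / n
    mean_pred = sum(predicted) / n
    residuals = [p - o for o, p in zip(observed, predicted)]

    mae = sum(abs(r) for r in residuals) / n
    rmsep = math.sqrt(sum(r * r for r in residuals) / n)

    # Least-squares line: predicted = slope * observed + intercept.
    sxx = sum((o - mean_obs) ** 2 for o in observed)
    syy = sum((p - mean_pred) ** 2 for p in predicted)
    sxy = sum((o - mean_obs) * (p - mean_pred)
              for o, p in zip(observed, predicted))
    slope = sxy / sxx
    r2 = sxy ** 2 / (sxx * syy)  # squared Pearson correlation (assumed definition)

    return {
        "R2": r2,
        "slope": slope,
        "MAE": mae,
        "RMSEP": rmsep,
        "RMSEP%": 100.0 * rmsep / mean_obs,  # assumed normalisation by the mean
    }


# Invented retention times (min) for four test compounds.
observed = [2.0, 4.0, 6.0, 8.0]
predicted = [2.5, 3.5, 6.5, 7.5]
metrics = validation_metrics(observed, predicted)
print(metrics)
```

A slope close to 1 together with a high R² indicates that the predictions track the measurements without systematic compression or expansion, while MAE and RMSEP express the typical size of the error in the units of retention.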

1.4.5 QSRR accuracy

In many cases, the precision and accuracy of QSRR models are low, but the models may still be useful for the interpretation of retention mechanisms, the optimisation of the separation of complex mixtures, or the preparation of experimental designs [10, 89, 90, 93]. The predictive accuracy of QSRR models can be influenced by a number of factors: (i) the feature selection method employed to choose the most informative descriptors, (ii) the modelling approach used to build the QSRR models, (iii) the model validation approach utilised, (iv) the number of molecular descriptors incorporated into the QSRR models, (v) the geometry optimisation method used, (vi) the size of the dataset employed in the study, and (vii) the degree of diversity or similarity of the molecular structures or characteristics.

In terms of the modelling approach, building QSRR models on classified subsets of compounds may provide greater predictive ability than building a single model from a diverse dataset [119]. In that work [119], compounds in a database were classified on the basis of log D profile similarity, and the subset-specific models performed better than a QSRR model built without compound classification. Another example is the work of Muteki and co-workers [120], who used a compound-classification-based QSRR methodology to improve retention time predictability compared with global models.

1.4.6 Molecular similarity

The concept of molecular similarity rests on the observation that structurally similar molecules are more likely to exhibit similar properties [121, 122]. This has led to increasing interest in predicting the properties of compounds on the basis of molecular similarity [123, 124]. Using this concept, a training set made up of structurally similar compounds can be assembled, and such a set is likely to give better predictions than a diverse training set. The degree of structural similarity between two compounds can be calculated with chemometric tools, yielding a similarity coefficient [125].

The Tanimoto coefficient is the most commonly used similarity measure for compounds and appears to be the gold standard for computing the fingerprint-based similarity used in QSRR or QSAR modelling [74, 91, 122, 125]. The Tanimoto coefficient for molecules A and B can be calculated using Eq. 1.2:

$$ S_{A,B} = \frac{c}{a + b - c} \qquad (1.2) $$

where a and b are the numbers of bits set in the fingerprints of A and B, respectively, and c is the number of bits set in both fingerprints. The Tanimoto coefficient takes values between zero and unity, with 0 corresponding to no bits in common and 1 to identical fingerprints [121, 126]. In this thesis, the Tanimoto similarity was employed as a basic filter to select compounds structurally similar to the target compound, which then form the training set used for the subsequent construction of the QSRR models.
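A minimal sketch of Eq. 1.2 and of its use as a similarity filter is given below. Fingerprints are represented as Python sets holding the positions of the bits that are set; the compound names, bit positions and the 0.5 threshold are invented for illustration, and returning 0 for two empty fingerprints is an assumed convention.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient (Eq. 1.2) for two fingerprints given as sets
    of on-bit positions."""
    a = len(bits_a)           # bits set in A
    b = len(bits_b)           # bits set in B
    c = len(bits_a & bits_b)  # bits set in both A and B
    if a + b - c == 0:
        return 0.0            # both fingerprints empty (assumed convention)
    return c / (a + b - c)


def select_similar(target_bits, library, threshold=0.5):
    """Return (name, similarity) pairs for library compounds whose Tanimoto
    similarity to the target is at least `threshold`, most similar first."""
    hits = [(name, tanimoto(target_bits, bits)) for name, bits in library.items()]
    hits = [hit for hit in hits if hit[1] >= threshold]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Invented fingerprints for a target and three library compounds.
target = {1, 4, 7, 9}
library = {
    "cpd_A": {1, 4, 7, 9},   # identical fingerprint
    "cpd_B": {1, 4, 7, 12},  # three of four bits shared
    "cpd_C": {2, 3, 5},      # no bits shared
}
training_set = select_similar(target, library, threshold=0.5)
print(training_set)
```

Raising the threshold gives a smaller but more homogeneous training set, so the choice trades structural similarity against the number of compounds available for model building.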