5. Effect of P-gp Binding on Biliary Excretion
5.3. Results
5.3.2. Prediction of Biliary Excretion Using Predicted P-gp Binding Values
The log Ki values predicted by the six models reported in section 5.3.1 were used as independent variables, along with the molecular descriptors, for the prediction of biliary excretion (log BE%). These were log Ki (RT), log Ki (CHAID), log Ki (I-tree), log Ki (BT), log Ki (RF), and log Ki (MARS). Models for log BE% were developed using stepwise regression analysis, C&RT, CHAID, boosted trees, random forest and MARS. The results of these analyses are summarised in Table 5.6. As this table shows, none of the predicted log Ki values was picked as a significant factor in the estimation of biliary excretion by C&RT, CHAID, stepwise regression analysis (eight parameters), Chi square feature selection, MARS feature selection (based on GCV error) or the 20 most important features by random forest; the only exception was the selected BT model. As a result, the multiple linear regression model was the same as MLR (1) (section 4.3.1), and the regression tree and random forest models were those reported in section 4.3 (RT (1) and RF (1)).
Model MAE for training set MAE for validation set
RT (2) 0.398 0.543
CHAID (1) 0.471 0.603
I-tree (3) 0.690 0.706
BT (3) 0.316 0.568
RF (2) 0.501 0.618
MARS (1) 0.487 0.577
Table 5.6. Summary of model development for log BE% using molecular descriptors and predicted log Ki values
In this study, in addition to the methods investigated in Chapter 4, the CHAID and MARS methods were also used for model development. The resulting CHAID model (CHAID (2) in Table 5.6) did not pick any predicted log Ki parameter. This CHAID model is presented in Figure 5.8.
Figure 5.8 shows that hydrophilic volume (vsurf_W4) is the dominant variable of this tree (node 1), with a binary classification. According to this model, compounds with large hydrophilic volumes are excreted in higher quantities through bile. The other descriptors of CHAID (2) show a similar trend to the C&RT models presented in Chapter 4 for biliary excretion. For example, hydrophilic compounds with higher acid/base ionisation have higher biliary excretion (node 6), especially if they are non-lead-like (node 12). Even compounds with small hydrophilic volumes can have considerable biliary excretion if they are non-lead-like (node 4). The high biliary excretion of non-lead-like compounds is in agreement with the results in section 5.3.1, which indicated non-lead-like compounds to be suitable P-gp substrates, thereby aiding their excretion by the efflux system. The prediction accuracy of the CHAID (2) model is reasonably good (see Table 5.7). The risk estimate and standard error are 0.322 for the training set and 0.254 for the validation set.
Method Predicted log Ki parameter picked Resulting Model
Stepwise regression none MLR (1)
C&RT none RT (1)
RF none RF (1)
CHAID none CHAID (2)
BT Log Ki (MARS), Log Ki (RF) BT (4)
MARS none MARS (2)
Figure 5.8. CHAID (2) developed using the training set with the descriptors selected by the CHAID algorithm
Table 5.7. Error of biliary excretion (log BE%) prediction by the selected models
As seen in Table 5.6, the log Ki values predicted by the MARS (1) and RF (2) models (log Ki (MARS) and log Ki (RF)) were two of the most important features in the boosted trees analysis for the prediction of biliary excretion. The selected BT model (BT (4)) has similar prediction accuracy to the BT models without P-gp information (compare the BT (1) and BT (2) models in Table 4.5 with BT (4) in Table 5.7). Lipophilicity parameters (LogD (6.5), LogD (7.4)), shape indexes (Kier2, Kier3 and Kier A3)
Model MAE for training set MAE for validation set
BT (4) 0.339 0.416
CHAID (2) 0.432 0.359
MARS (2) 0.438 0.428
MARS (3) 0.436 0.442
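The MAE columns above report mean absolute error, i.e. the average of |observed - predicted| log BE% over the compounds in each set. A minimal sketch of that calculation is given here for orientation; the function name and the four value pairs are hypothetical and are not data from this study.

```python
def mean_absolute_error(observed, predicted):
    """Return the mean of |observed - predicted| over paired values."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    if not observed:
        raise ValueError("at least one pair of values is required")
    total = sum(abs(o - p) for o, p in zip(observed, predicted))
    return total / len(observed)


# Hypothetical log BE% values, for illustration only (not from the thesis).
observed = [1.0, 0.5, 1.5, 2.0]
predicted = [1.2, 0.1, 1.5, 1.6]

mae = mean_absolute_error(observed, predicted)
print(f"MAE = {mae:.3f}")
```

Applying the same function separately to the training and validation compounds would give the two columns of the table.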
CHAID graph for log BE%
Num. of non-terminal nodes: 7, Num. of terminal nodes: 8
Nodes:
ID=1 N=168 Mu=1.04 Var=0.58
ID=2 N=83 Mu=0.60 Var=0.59
ID=5 N=72 Mu=0.52 Var=0.57
ID=3 N=85 Mu=1.48 Var=0.23
ID=6 N=59 Mu=1.63 Var=0.14
ID=10 N=58 Mu=1.65 Var=0.11
ID=12 N=44 Mu=1.70 Var=0.05
ID=4 N=11 Mu=1.13 Var=0.43
ID=8 N=71 Mu=0.55 Var=0.53
ID=9 N=1 Mu=-1.31 Var=0.00
ID=14 N=42 Mu=1.73 Var=0.04
ID=15 N=2 Mu=1.09 Var=0.12
ID=13 N=14 Mu=1.50 Var=0.25
ID=11 N=1 Mu=0.049 Var=0.00
ID=7 N=26 Mu=1.16 Var=0.26
Split variables and thresholds:
vsurf_W4: <= 172.37, > 172.37
lip_druglike: <= 0, > 0
SHBint5: <= 35.731, > 35.731
fU: <= 0.0009, > 0.0009
SssssC: <= 0, > 0
opr_leadlike: <= 0, > 0
SHBint3: <= 34.714, > 34.714
and Volsurf descriptors indicating hydrophilic ratio (vsurf_CW2 and vsurf_CW4) were amongst the top 15 descriptors of the BT (4) model. The optimal number of trees for this model was 156 (Figure 5.9). The statistical parameters of this boosted trees model are reported in Table 5.7.
Figure 5.9. Average squared error of log BE% against the number of trees in the boosted trees model BT (4) for the training and internal test sets
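The text does not state the criterion by which the optimal number of trees was chosen. A common convention, assumed here, is to take the tree count at which the average squared error on the internal test set is lowest. The sketch below illustrates that rule on a hypothetical error curve; the function name and the error values are illustrative only and do not reproduce Figure 5.9.

```python
def optimal_number_of_trees(test_errors):
    """Return the 1-based tree count with the smallest test-set error.

    test_errors[i] is the average squared error on the internal test set
    after i + 1 trees have been added to the boosted model.
    """
    if not test_errors:
        raise ValueError("test_errors must not be empty")
    best_index = min(range(len(test_errors)), key=test_errors.__getitem__)
    return best_index + 1


# Hypothetical error curve: the error falls, reaches a minimum, then rises
# again as further trees start to overfit the training set.
test_errors = [0.60, 0.48, 0.41, 0.39, 0.40, 0.43]

best = optimal_number_of_trees(test_errors)
print(f"Optimal number of trees: {best}")
```

Under this assumption, the training-set curve is used only for comparison, since training error generally keeps decreasing as trees are added.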
MARS models were developed using a number of descriptor sets, as explained in the methods section. The best MARS model was MARS (2), which used the features selected by the Chi square feature selection method (Table 5.8). The second best model was MARS (3), in which the predicted log Ki values (from the RF model) were used as independent variables in addition to the Chi square feature predictors. According to MARS (2) and (3), increasing the number of sulphur atoms up to two will increase biliary excretion, with no further increase observed with more sulphur atoms. All the remaining molecular descriptors of MARS (2) are Volsurf descriptors of hydrophilic volume and hydrogen bond donor capacity measured at different
energy levels. The MARS (3) equation in Table 5.9 indicates that weaker P-gp binders (compounds with higher predicted log Ki values) will have reduced log BE%. In MARS (3), in addition to Volsurf (vsurf) variables similar to those in MARS (2), Lipinski's lead-like compounds have been indicated to have lower biliary excretion, which is a similar pattern to that observed with P-gp binding.
Table 5.8. The selected MARS (2) model (Feature selection)
Log BE% = -3.14 + 4.99*max(0, vsurf_HB3-8.58) - 3.74*max(0, 9.12-vsurf_W2) + 1.63*max(0, vsurf_W4-1.49) + 3.21*max(0, vsurf_W2-1.24) - 1.99*max(0, 2.00-a_nS) - 1.17*max(0, vsurf_W3-8.07) + 8.547*max(0, 8.07-vsurf_W3) - 1.14*max(0, vsurf_HB4-1.96)
N = 168 GCV error = 0.398 Mean residual = 0.000 SD(residual) = 0.573
Table 5.9. The selected MARS (3) model (Feature selection and RF predictor)
Log BE% = 8.270 - 1.240*max(0, vsurf_HB4-2.67) + 2.867*max(0, vsurf_HB3-8.58) + 5.52*max(0, 8.58-vsurf_HB3) - 3.98*max(0, vsurf_W2-9.12) + 6.88*max(0, vsurf_W4-1.49) + 3.33*max(0, vsurf_W2-1.24) - 1.59*max(0, 2.00-a_nS) - 5.70*max(0, log Ki(RF)-1.90) - 3.66*max(0, lip_druglike-0.00)
N = 168 GCV error = 0.397 Mean residual = 0.000 SD(residual) = 0.565
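Each term in the MARS equations above is a hinge function, max(0, x), which is zero on one side of its knot and linear on the other. The sketch below evaluates two single terms in isolation to show how the interpretations in the text follow from the printed coefficients: the a_nS term of MARS (2) in Table 5.8 and the log Ki(RF) term of MARS (3) in Table 5.9. The function names are hypothetical, and the input values are chosen purely for illustration.

```python
def hinge(x):
    """MARS basis function: max(0, x)."""
    return max(0.0, x)


def sulphur_term(a_nS):
    """a_nS term of MARS (2) as printed in Table 5.8."""
    return -1.99 * hinge(2.00 - a_nS)


def log_ki_term(log_ki_rf):
    """log Ki(RF) term of MARS (3) as printed in Table 5.9."""
    return -5.70 * hinge(log_ki_rf - 1.90)


# Contribution of the sulphur term for 0, 1, 2 and 3 sulphur atoms.
sulphur_contributions = [sulphur_term(n) for n in range(4)]

# Contribution of the log Ki(RF) term below and above the knot at 1.90.
ki_below_knot = log_ki_term(1.50)
ki_above_knot = log_ki_term(2.90)

for n, c in enumerate(sulphur_contributions):
    print(f"a_nS = {n}: contribution to log BE% = {c:.2f}")
print(f"log Ki(RF) = 1.50: contribution = {ki_below_knot:.2f}")
print(f"log Ki(RF) = 2.90: contribution = {ki_above_knot:.2f}")
```

The sulphur term carries a penalty that shrinks as a_nS rises from zero to two and vanishes thereafter, which corresponds to biliary excretion increasing up to two sulphur atoms with no further gain. The log Ki(RF) term is inactive below 1.90 and increasingly negative above it, consistent with weaker predicted P-gp binders having lower predicted log BE%.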