Prediction errors analysis - Web-based Information Systems and Tools

Quantitative Structure-Property Relationship Modelling

3.5 Discussion

3.5.1.2 Prediction errors analysis

The experimental values of the enthalpy of formation in the gas phase (kJ/mol) were compared with the values predicted for the independent validation set and represented in a scatter plot, with an RMSE of 48.64 kJ/mol and a Q² of 0.9607 (Figure 3.16a). Most data points are concentrated around the line of equality between the experimental and predicted values of the property (the 45-degree line); therefore, the relationship between them is strong. The distance of each symbol from the 45-degree line corresponds to its deviation from the related experimental value. The regression line indicates that the model generally predicts values close to equality, with a small deviation showing that it tends to predict values smaller than the observed ones.

The prediction errors obtained for the independent validation set were then further analysed and are represented in Figure 3.16b. Consistent with the scatter plot, the model predicts the enthalpy of formation with a left bias (smaller values than expected), and the most probable error is 4.10 kJ/mol. The compounds with the highest errors are the alkynes, probably because this type of compound is over-represented in the validation set: it contains 12 alkynes, whereas the training set, which is more than 3.5 times larger, contains only 4. This under-representation in the training set may therefore be affecting the selection of descriptors to represent this type of compound and its relationship with the property of interest. Removing the two alkynes with the highest prediction errors (hexa-2,4-diyne and hex-1-ynylbenzene) decreases the RMSE by about 11.6% to 42.99 kJ/mol and increases Q² to 0.9684, which indicates that this type of compound is not well represented in the training set. Another class of hydrocarbons with high errors is the polycyclic compounds. Although the experimental confidence in these values is lower than for the rest of the dataset, their complex structures and conformations may be the cause of the greater difficulty in establishing a relationship between their representation and the property of interest.

Figure 3.16: a) Plot of experimental versus predicted values of the enthalpy of formation in the gas phase (kJ/mol) using the independent validation set. b) Density plot of the differences between the observed and predicted values using the independent validation set. The structures of the compounds with the most extreme prediction errors are indicated: the positive errors correspond to compounds with triple bonds (hexa-2,4-diyne and hex-1-ynylbenzene) and the negative errors correspond to compounds with more than one cycle (coronene and bicyclo[4.4.1]undeca-2,4,7,9-tetraene).
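The RMSE, the Q² and the relative RMSE reduction quoted above can be reproduced with a few lines of code. The following is only a minimal sketch: the function names and the toy data are illustrative assumptions, and Q² is assumed here to be 1 - SS_res/SS_tot with the total sum of squares taken around the mean of the observed validation values (other Q² variants use the training-set mean instead).

```python
import math


def rmse(observed, predicted):
    """Root mean squared error between two equal-length sequences."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def q2(observed, predicted):
    """Predictive squared correlation, 1 - SS_res / SS_tot.

    Assumption: SS_tot is computed around the mean of the observed
    validation values; this is one of several Q2 definitions in use.
    """
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def relative_reduction(before, after):
    """Percentage by which `after` is lower than `before`."""
    return 100.0 * (before - after) / before


# Toy values in kJ/mol, made up for illustration (not data from this work).
observed = [-120.0, -85.5, -40.2, 10.0, 95.3, 210.8]
predicted = [-115.2, -90.1, -35.0, 2.5, 80.0, 190.4]

print("RMSE:", round(rmse(observed, predicted), 2))
print("Q2:  ", round(q2(observed, predicted), 4))

# RMSE before and after removing the two alkynes, as quoted in the text.
print("RMSE reduction (%):", round(relative_reduction(48.64, 42.99), 1))  # 11.6
```

The last line uses only the two RMSE values reported in the text, so it checks the stated reduction of about 11.6% directly.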

Summarizing, the feature selection step yields lower prediction errors (RMSE = 34.10) with a small number of variables (89). Compared with the model that uses all the available descriptors (1485), the 89-variable model achieved an RMSE 23% lower. These reduced errors are relevant in thermochemistry, where they have significant chemical and economic importance.

3.5.2 Case A2 - Predicting Enthalpy of formation and phase change for ThermInfo’s dataset

3.5.2.1 Selected chemical descriptors

The most important descriptors, selected using the variable importance calculated by random forests (RFs), were analysed individually for each property and are listed in Appendix C.2. For every property there are 30 to 50 variables with a substantially higher mean importance; beyond that point the importance decreases asymptotically. Figure 3.17 presents the 20 most important variables for each property, grouped into general classes of molecular descriptors. In general, several descriptors are common to the top 20 most important variables of each property. Additionally, to predict the Standard Molar Enthalpy of Formation (crystalline, gas and liquid phases) the most important variables belong mainly to a class of descriptors that consider the contributions of molar refractivity, partial charges, E-state indices, LogP and surface area, while to predict the Standard Molar Enthalpy of Phase Change (fusion, vaporization and sublimation) the most important variables belong mainly to classes of descriptors that are derived from the constitution and topology of the molecule. It is clear that many descriptors are common to all physical phases within each group of properties (enthalpy of formation and enthalpy of phase change), especially between the gas and liquid phases.

Figure 3.17: List of the 20 most important variables by classes of descriptors for each property (Standard Molar Enthalpy of Formation: crystalline (crys), liquid (liq) and gas phases; Standard Molar Enthalpy of Phase Change: fusion (phasecl), vaporization (phaselg) and sublimation (phasecg)) in case-study A2. More information about the descriptors and their meaning can be found at https://code.google.com/p/rdkit/wiki/DescriptorsInTheRDKit and http://openbabel.org/docs/dev/Fingerprints/intro.html.
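The grouping of the top-ranked variables into descriptor classes can be sketched as follows. This is only an illustration under stated assumptions: the importance values, the class labels and the helper name `top_variables_by_class` are hypothetical, and the importances would in practice come from the fitted random forest rather than from a hand-written dictionary.

```python
from collections import Counter


def top_variables_by_class(importances, descriptor_class, k=20):
    """Return the k variables with the highest importance and a count per class.

    importances:      dict mapping variable name -> mean importance
    descriptor_class: dict mapping variable name -> class label
    """
    ranked = sorted(importances, key=importances.get, reverse=True)
    top = ranked[:k]
    return top, Counter(descriptor_class[name] for name in top)


# Hypothetical importances and class labels, for illustration only.
importances = {
    "SMR_VSA1": 0.30,
    "PEOE_VSA2": 0.25,
    "SlogP_VSA3": 0.20,
    "EState_VSA4": 0.10,
    "Chi0": 0.05,
    "NumRotatableBonds": 0.02,
}
descriptor_class = {
    "SMR_VSA1": "molar refractivity",
    "PEOE_VSA2": "partial charges",
    "SlogP_VSA3": "LogP",
    "EState_VSA4": "E-state indices",
    "Chi0": "topological",
    "NumRotatableBonds": "constitutional",
}

top, counts = top_variables_by_class(importances, descriptor_class, k=3)
print(top)
print(dict(counts))
```

Comparing the resulting counters across properties is one simple way to see which descriptor classes dominate each group of properties.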

3.5.2.2 Prediction errors analysis

The prediction errors obtained for each property using the independent validation set were analysed and are represented in Figure 3.18. Appendix C.2 provides a detailed table of the predictive results for the testing set obtained for all properties of case-study A2 using the best model (selected based on the cross-validated training results). As in the previous case, the models predict each property with a small bias: the most probable error is 1.12 for a) the crystalline phase, -1.76 for b) the gas phase, 1.60 for c) the liquid phase, -1.70 for d) fusion, 0.25 for e) vaporization and 0.05 for f) sublimation. The compounds with the highest errors were also analysed and are indicated in Figure 3.18. Once again, the highest errors mostly correspond to compounds with triple bonds or more than one cycle. For small structures with rigid conformations, such as spiropentane, it was also more difficult to establish a relationship between their representation and the property of interest.

Figure 3.18: Density plots of the differences between the observed and predicted values for all properties in the testing sets. The structures of the compounds with the most extreme prediction errors are indicated: the positive errors correspond to the compounds Sulphonylbismethane, Spiropentane, 1,2-Butadiene, Carbonic acid diphenyl ester, Hexanedinitrile and Cyclotetradecane (using the order of the plots) and the negative errors correspond to the compounds 1,3,5-Triazine-2,4,6(1H,3H,5H)-trione, (Z)-2-Butenenitrile, 11-Decylheneicosane, (E)-1-Methyl-4-(1-propenylsulphonyl)benzene, 4-Chlorophenol and Octadecanoic acid (using the order of the plots).
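The "most probable error" read from a density plot is the location of the peak of the estimated error density. A minimal sketch of how such a value could be obtained is given below. It assumes a Gaussian kernel density estimate with the normal-reference bandwidth h = 1.06 · sd · n^(-1/5) and a simple grid search for the peak; the estimator and bandwidth actually used for the figures are not stated in the text, and the error values are made up.

```python
import math


def normal_reference_bandwidth(sample):
    """Rule-of-thumb bandwidth h = 1.06 * sd * n^(-1/5) (assumed here)."""
    n = len(sample)
    mean = sum(sample) / n
    sd = math.sqrt(sum((s - mean) ** 2 for s in sample) / (n - 1))
    return 1.06 * sd * n ** (-1.0 / 5.0)


def gaussian_kde(sample, bandwidth):
    """Return a function that evaluates the Gaussian kernel density estimate."""
    norm = 1.0 / (len(sample) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in sample
        )

    return density


def most_probable_error(errors, grid_points=501):
    """Grid location where the estimated density of the errors is highest."""
    density = gaussian_kde(errors, normal_reference_bandwidth(errors))
    lo, hi = min(errors), max(errors)
    step = (hi - lo) / (grid_points - 1)
    grid = [lo + i * step for i in range(grid_points)]
    return max(grid, key=density)


# Made-up errors (observed minus predicted), for illustration only.
errors = [-12.0, -3.5, 0.8, 1.5, 2.0, 2.2, 3.1, 4.0, 6.5, 25.0]
print("most probable error:", round(most_probable_error(errors), 2))
```

Because the peak depends on the bandwidth, values obtained this way should be read as approximate locations rather than exact statistics.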

Summarizing, the feature selection step yields lower prediction errors with a smaller number of variables. The number of variables in the model with all the available descriptors (1168) can be reduced by about 80%, improving the predictive results by 2 to 40%. These reduced errors are relevant in thermochemistry, where they have significant chemical and economic importance.

3.5.3 Case D - NIEHS-NCATS-UNC DREAM Toxicogenetics Challenge

One of the main insights gained during this challenge is that models are only as good as the data they are based on. A key limitation to the subsequent use of the produced models is therefore that the set of compounds used to develop the relationship should be similar to the compounds for which predictions are desired. The results obtained for the training set (using 10-fold cross-validation) and for the test set are significantly different. However, given that the distribution of the test set was biased (as presented in Figure A.7 of Appendix A), we cannot draw conclusions about the predictive performance of the produced models, nor about whether they are applicable to a real-world situation. Either the training set or the test set is not representative of the real-world distribution, and in that case the priors of the real-world distribution should be known in order to develop predictive models. To test this hypothesis, we merged the training and testing sets of this challenge and randomly sampled the merged set again into training and test sets of the same size. The distributions of the median EC10 in the new datasets are clearly more similar, as presented in Figure 3.19. The RMSE obtained for the new randomly sampled training set is 0.6371 and for the testing set is 0.7146, values that are also clearly closer to each other.


Figure 3.19: Density plot showing the distribution and variation of the median EC10 in the randomly sampled training and test sets.
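The merge-and-resample check described above can be sketched in a few lines. This is an illustration only: the function name `resplit`, the seeds and the synthetic values are assumptions, and "the same size" is interpreted here as keeping the original training and test set sizes.

```python
import random
import statistics


def resplit(train, test, seed=0):
    """Pool both sets, shuffle, and cut at the original training-set size."""
    pooled = list(train) + list(test)
    random.Random(seed).shuffle(pooled)
    return pooled[: len(train)], pooled[len(train):]


# Synthetic stand-in for median EC10 values; the test set is shifted on
# purpose to mimic a biased test distribution.
rng = random.Random(42)
train = [rng.gauss(1.0, 0.5) for _ in range(100)]
test = [rng.gauss(2.0, 0.5) for _ in range(60)]

new_train, new_test = resplit(train, test, seed=1)

gap_before = abs(statistics.median(train) - statistics.median(test))
gap_after = abs(statistics.median(new_train) - statistics.median(new_test))
print("median gap before resampling:", round(gap_before, 3))
print("median gap after resampling: ", round(gap_after, 3))
```

After resampling, both sets are drawn from the same pooled population, so their distributions (and the errors of models trained and tested on them) are expected to be more alike than in the original split.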

Regarding the objective that we proposed, minimizing the RMSE, the predictive performance of our models ranked in positions 1 and 2 among 229 participants organized in 24 teams, showing the ability of the models to cope with the defined purpose. However, because the testing set and the training set do not show the same statistical properties (meaning that they belong to different populations, in the statistical sense), the correlation coefficients and the RMSE values obtained for the two sets appear only very weakly correlated. Accordingly, we cannot conclude that the models produced are the best ones to use in a real-world scenario for cytotoxicity prediction, even though the approach used was able to give solutions that ranked very competitively among all the other submissions (Figure 3.20).

Figure 3.20: Final scoring for the NIEHS-NCATS-UNC DREAM toxicogenetics challenge. The submissions were evaluated on a final test set of 50 held-out compounds based on the ability of teams to predict the distribution of log(EC10) values for each compound in the population, in terms of median log(EC10) values and interquantile (q95-q05) distance. The performance of each submission was assessed using Pearson correlation (PC), Spearman correlation (SC) and Root Mean Squared Error (RMSE).
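The three scoring metrics named in the caption can be computed as in the sketch below. It is a plain standard-library illustration, not the challenge's official scoring code: the function names and example values are assumptions, and Spearman correlation is implemented as the Pearson correlation of average ranks.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average rank of the tied block
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def rmse(observed, predicted):
    """Root mean squared error."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


# Made-up median log(EC10) values for four compounds, for illustration only.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 4.0, 9.0, 16.0]

print("PC:  ", round(pearson(observed, predicted), 4))
print("SC:  ", round(spearman(observed, predicted), 4))
print("RMSE:", round(rmse(observed, predicted), 4))
```

In this toy case the predictions preserve the ordering of the observations, so the rank correlation is perfect even though the Pearson correlation is below 1 and the RMSE is large, which illustrates why the three metrics can rank submissions differently.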

3.5.4 Case G - Blood-Brain Barrier (BBB) Penetration

In document Machine learning methods for quantitative structure-property relationship modeling (Page 141-148)