Quantitative structure–property relationship studies for the prediction of the vapor pressure of volatile organic compounds

(1)

JSCS–5273 Original scientific paper

1405

Quantitative structure–property relationship studies for the

prediction of the vapor pressure of volatile organic compounds

MOUNIA ZINE1_{, AMEL BOUAKKADIA}1,2_{, LEILA LOURICI}3 and DJELLOUL MESSADI1_*

1_{Environmental and Food Safety Laboratory, Badji Mokhtar-Annaba University, BP. 12,}

23000 Annaba, Algeria, 2_{Abbes Laghrour University, Faculty of sciences and technology,}

Khenchela, Algeria and 3_{Chadeli Ben Djedid University, BP. 73, 3600 El Taref, Algeria}

(Received 6 March, accepted 13 June 2019)

Abstract:A theoretical model (QSPR) using multiple linear regression analysis for predicting the vapor pressure (pv) of volatile organic compounds (VOCs) has been developed. A series of 51 compounds were analyzed by multiple lin-ear regression analysis. First, the data set was separated arbitrarily into a train-ing set (39 chemicals) and a test set (12 chemicals) for statistical external valid-ation. A four-dimensional model was developed using as independent variables theoretical descriptors derived from Dragon software when applying the GA (genetic algorithm)–VSS (variable subset selection) procedure. The obtained model was used to predict the vapor pressure of the test set compounds, and an agreement between experimental and predicted values was verified. This model, with high statistical significance (R2_{= 0.9090,}_Q2

LOO = 0.8748, Q2ext = 0.8307,

s = 0.24), could be used adequately for the prediction and description of the log p_v value of other VOCs. The applicability domain of MLR model was investigated using a William‘s plot to detect outliers and outsides compounds. Keywords:molecular descriptors; VOCs; log pv; multiple linear regression.

INTRODUCTION

Volatile organic compounds (VOCs) are molecules which can contain H and

C atoms but also other elements such as O, N, Cl, F, P, S,... and metals and/or

metalloids, and which are almost entirely in the vapor state under normal

condit-ions of temperature and pressure. They include 210 species and 23 large families.

These compounds can be of natural origin (terpenes) but very often they are

con-taminants mainly from human activity.

1

The sectors of activities that are more strongly transmitters of VOCs are road

transport, industry, agriculture, and the tertiary sector. Other air pollutants that

could be cited are biological contaminants (bacteria, pollen, fungi), physical

(2)

taminants (metals, particles, dust, radioactivity), and chemical

contaminants that

are part of the VOCs represented by gases (CO, O

₃

, NO

_x

, SO

₂

, fluorocarbons),

dioxins and furans.

1

Experience is a direct way to obtain activity data for organic compounds,

which have many shortcomings, such as the need for large test organisms, high

costs, long time duration, and value differences measured between different

res-earchers. Consequently, it would be impossible to test experimentally the activity

values for all organic compounds.

As new compounds are emerging, other difficulties will also arise.

There-fore, it is necessary to use theoretical methods to waiver the disadvantages of the

experiment and to predict the data of compounds exactly.

With the rapid development of computer science and theoretical quantum

chemical studies, quantum chemical parameters of compounds can speedily and

precisely obtained by computation. These structural parameters along with the

introduction of the quantitative structure–activity relationship (QSAR) models

can increase the interpretability and predict the activity of new organic

com-pounds.

2

Quantitative structure–property relationships (QSPR) have gained wide

attention in the area of separation science recently. These models are based on

the relationship between structures and property

of compounds.

3

The vapor pressure of different compounds can be predicted from their

for-mula and even unknown compounds can be identified using this method. In

gen-eral, QSPR models attempt to predict the vapor pressure of a molecule by

char-acterizing it with a series of molecular descriptors. These models can effectively

be used for the prediction of molecular structures and determination of vapor

pressure.

The aim of this study was to find a statistical model for the prediction of

vapor pressures of some volatiles organic compounds

.

The model was validated

by dividing the data set arbitrarily into training (39 compounds) and test (12

compounds) sets.

(3)

EXPERIMENTAL Data set

The pv experimental values of 51 selected, structurally heterogeneous VOCs were col-lected from previous works,4_{and converted to log}_p

v values.

Molecular descriptors generation

The structures of the molecules were drawn using Hyperchem 6.03 software.5_{The final} geometries were obtained with the semi empirical PM3method. All calculations were realized at the RHF (restricted Hartree–Fock) level with non configuration interaction. The molecular structures were optimized using the Polak–Ribiere algorithm and a gradient norm limit of 0.01 kcal Ǻ-1_mol-1_{. The resulting geometry was transferred into the software Dragon version 5.5 to} calculate 1600 descriptors of the type geometrical and GETAWAY (Geometry, Topology and Atoms Weighted AssemblY).6

Descriptors with constant or near constant values inside each group were discarded. For each pair of correlated descriptors (with correlation coefficient r ≥ 0.95), the one showing the highest pair correlation with the other descriptors was excluded. The genetic algorithm (GA)7 was considered superior to other methods of the variable selection techniques. Thus, variable selection was performed on the training set using GA in the MobyDigs version of Todeschini8 by maximizing the cross-validated explained variance Q2

LOO.

Model development and validation

Models with four variables were performed by the software MOBYDYGS.8_The good-ness of fit of the calculated models were assessed by means of the multiple determination coefficients, R2_{, and the standard deviation error in calculation (}_SDEC_):

(

)

2

i i

1

1 _ˆ

=



n −

i

SDEC y y

n (1)

Cross validation techniques allow the assessment of the internal predictivity (Q2 LMO cross validation; bootstrap) in addition to the robustness of the model (Q2

LOO cross valid-ation). Cross validation methods consist in leaving out a given number of compounds from the training set and rebuilding the model, which is then used to predict the compounds left out. This procedure is repeated for all compounds of the training set, obtaining a prediction for everyone. If each compound is taken away one at a time, the cross validation procedure is called the leave-one-out technique (LOO), otherwise the leave-more-out technique (LMO).

The LOO or LMOcorrelation coefficient, generally indicated with Q2_{, was computed by} evaluating the accuracy of the prediction of these “test” compounds:

2 / 1 2 2 1 ˆ ( ) 1 1 ( ) = = − = − = − −



n

i i i i n i i y y PRESS Q TSS y y (2)

(4)

=

SDEP PRESS n (3)

A value Q2_{> 0.5 is generally regarded as a good result and}_Q2_{> 0.9 as excellent.}9,10 However, studies11,12_{have indicated that while}_Q2_{is a necessary condition for high predictive} power a model, is not sufficient. To avoid overestimating the predictive power of the model the LMO procedure (repeated 5000 times, with 4 objects left out at each step) was also per-formed (Q2

L(4)O).

In the bootstrap validation technique K n-dimensional groups are generated by a ran-domly repeated selection of n-objects from the original data set. The model obtained on the first selected objects is used to predict the values for the excluded sample, and then Q2_is cal-culated for each model. The bootstrapping was repeated 8000 times for each validated model. By using the selected model, the values of the response for the test objects are calculated and the quality of these predictions is defined in terms of Q2

ext, which is defined as:

ext tr / ext ext 1 2 ext tr 2 tr tr 1 ˆ

( )² /

/ 1 1 / ( ) / = = − = − = − −



n

i i i i

n

i i

y y n

PRESS n Q

TSS n

y y n

(4)

Here next and ntr are the number of objects in the external set (or left out by bootstrap) and the number of training set objects, respectively. The data set was divided arbitrary into a training set (39 objects) used to develop the QSAR models and a validation set (12 objects), used only for statistical external validation. Other useful parameters are R2_{, calculated for the} validation chemicals by applying the model developed from the training set, and external standard deviation error of the prediction (SDEP_ext), which is defined as:

( )

ext ₂ ext ext 1 1 =

= n



i−

i

SDEP y y

n (5)

where the sum runs over the test set objects (n_ext). According to Golbraikh and Tropsha,12_a QSPR model is successful if it satisfies several criteria as follows:

2

CVext>0.5

R (6)

R2_{> 0.6} ₍₇₎

(

r²−r² / ² 0.1 or ²0

)

r <

(

r −r²’ / ² 0.10

)

r < (8)

0.85 < K < 1.15 or 0.85 <K’ < 1.15 (9) Here:

(

)

(

)

(

)

2

(

)

2

= − − − −



     

i i i

i i

y y y y

r

y y y y

(10) 0 2 2 ( ) ² 1 ( ) − = − −



r i i i y y r

y y (11)

(

)

(

)

02 i 2 0 2

(5)

(

)

( )

'

2 =



 

i i

i

y y k

y (13)

(

)

( )

2 =





i i

i

y y k

y (14)

where r is the correlation coefficient between the calculated and experimental values in the test set; r²0 (calculated versus observed values) and r02 (observed versus calculated values) are the coefficients of determination; k and k’ are the slopes of regression lines through the origin of the calculated versus observed and observed versus calculated, respectively yr0

i,yr0i; are defined as yr0

i = kyi and yr0i = k’yi and the summations runs over the test set.

The applicability domain (AD) was discussed by the Williams plot8,9_{of jacknified} resi-duals versus leverages (hat diagonal values (hi). The jacknifed residuals (or studentized resi-duals) are the standardized cross-validated residuals. Each residual is divided by its standard deviation, which is calculated without the ith_{observation. The leverage (}_h

i) value of a chem-ical in the original variable space is defined as:

h_i = x_i(XT_X₎-1_x

iT (i = 1,….,n) (15)

where xi is the descriptor row-vector of the query compound and X is the n(p+1) matrix of p

model parameter values for n training set compounds. The superscript T refers to the trans-pose of the matrix/vector. The warning leverage value (h*) is defined as 3(m+1)/n. When h value of a compound is lower than h*, the probability of accordance between predicted and actual values is as high as that for the compounds in the training set. A chemical with hi > h*

will reinforce the model if the chemical is in the training set. But such a chemical in the valid-ation set and its predicted data may be unreliable. However, this chemical may not appear to be an outlier because its residual may be low. Thus the leverage and the jacknified residual should be combined for the characterization of the AD.

In this stage, linear QSPR model was developed and evaluated to predict the log pv of the compounds. The study we conducted consists of the multiple linear regressions (MLR) avail-able in the MobyDygssoftware.

RESULTS AND DISCUSSION

In order to predict log

p

v

, application of the GA–VSS lead to several good

models for the prediction based on different sets of molecular descriptors.

The best model obtained using 39 compounds is a four-dimensional model

(X0sol, SpPosA_H2, GATS2e and Hy) with a high predictive power.

The multiple linear regression model (MLR) is given by:

log p

_v

= 11.0490 – 0.4602X0sol – 12.3322SpPosA_H2 +

+ 1.1372GATS2e – 1.2333Hy

(16)

Statistical parameters for the model with: n

tr

= 39

n

ext

= 12 are:

R² = 0.9009Q²

LOO

= 0.8748Q²

Boot

= 0.8555F = 77. 25Q

2ext

= 0.8307

(6)

The reported fitting and validation parameters have, as expected, high values

indicating that the model has a very good predictive performance and the

des-criptors involved in it well describe the vapor pressure.

The high absolute t-values shown in Table I express that the regression

coef-ficients of the descriptors involved in the MLR model are significantly larger

than the standard deviation.

The t-probability of a descriptor can describe the

sta-tistical significance when combined together within an overall collective QSPR

model (descriptors interactions).

TABLE I. Characteristics of the selected descriptors in the best GA/MLR model

Descriptor Coefficient regressed Standard error coefficient t t-Probability VIF

Constant 11.0490 0.6288 17.57 0.000 –

X0sol –0.4602 0.0460 –9.81 0.000 1.720

SpPosA_H2 –12.3320 1.2100 –10.19 0.000 1.528

GATS2e 1.1372 0.1423 7.99 0.000 1.074

Hy –1.2333 0.1275 –9.68 0.000 1.944

Descriptors with t-probability values below 0.05 (95 % confidence) are

usu-ally considered statisticusu-ally significant in a particular model, which means that

their influence on the response variable is not merely by chance.

13

_{A smaller}

t-probability suggests a more significant descriptor.

The t-probability values of

the four descriptors are very small, indicating that all of them are highly

signific-ant descriptors.

The VIF is uniform and equal to 1.00 if there is no linear correlation between

a given variable and rest of the variables in the regression equations. Higher

values of VIF in Table I indicate a more serious multi-co linearity problem.

Models would not be accepted if they contain descriptors with VIFs above a

value of five.

14

2

1

1 =

−

j

VIF

R

(17)

where

R

_j2

_{is the squared correlation coefficient between the}

_j

th

_coefficient

regressed against all the other descriptors in the model.

15

Applicability domain

On analyzing the domain of the applicability of the model from a Williams

plot, all residuals were located within the range of three standard deviations, and

there was no influential compound both for the training or the prediction set, Fig.

1, which means that the model has a good external predictivity.

A diagram of the statistical coefficients

Q

2

_and

_R

2

_{is presented Fig. 2 to}

(7)

the log p

v

are smaller than those of the real models; the Q

2

values are lower (<10

%), as for the major part are the

R

2

_{values (<30 %). This ensures that the}

estab-lished model has a real base, and is not due arbitrarity.

Fig. 1. Williams plot: jackknifed resi-duals and leverages.

Fig. 2. Randomization test associated to the previous QSPR model.

This technique, plot of cross-validation log

p

_v

versus experimental log

p

_v

values, shown in Fig. 3, ensures the robustness of a QSPR model (see also Tables

S-I and S-II of the Supplementary material to this paper).

16

Descriptor contributions analysis

Based on a previously described procedure,

17,18

_{the relative contributions of}

the four descriptors to the model were determined. It should be noted that the

difference in contribution between two descriptors used in the model indicates

that all the descriptors are essential to generate the predictive model (Fig. 4).

(8)

Fig. 3. Cross-validation versus exp-erimental log pv for the entire data set.

Fig. 4. Relative contributions of the selected descriptors to the MLR model.

CONCLUSIONS

Multiple linear regressions were used to construct a quantitative structure

property relation model of 51 VOCs for their vapor pressures binding propriety.

The model was developed by a genetic algorithm selection of theoretical

mole-cular descriptors from among a wide set obtained with several softwares. The

data set were separated randomly into two subsets of 39 elements for training and

12 for external validation. The MLR

model

has good stability, robustness and

predictivity. The chemical applicability domain of the studied MLR model and

the reliability of the predictions were verified by the leverage approach. The

equation obtained could be used successfully to estimate the vapor pressures for

new compounds or for other compounds the experimental values of which are

unknown.

SUPPLEMENTARY MATERIAL

(9)

И З В О Д

ПРОУЧАВАЊАКВАНТИТАТИВНИХРЕЛАЦИЈАСТРУКТУРЕИРЕАКТИВНОСТИЗА

ПРЕДВИЂАЊЕНАПОНАПАРЕИСПАРЉИВИХОРГАНСКИХЈЕДИЊЕЊА

MOUNIA ZINE1

, AMEL BOUAKKADIA1,2

, LEILA LOURICI3

и DJELLOUL MESSADI1 1

Environmental and Food Safety Laboratory, Badji Mokhtar-Annaba University, BP. 12, 23000 Annaba, Algeria, 2_{Abbes Laghrour University, Faculty of sciences and technology, Khenchela, Algeria}_и3_{Chadeli Ben}

djedid University, BP.73, 3600 El taref, Algeria

Развијенјетеоријскимодел (QSPR) коришћењемвишеструкелинеарнерегресионеана -лизезапредвиђањенапонапаре (pv/Pa) испарљивихорганскихједињења (VOC), засеријуод

51 једињења. Напочеткујескупподатакапроизвољноподељенускупзапроучавање (39 једињења) искупзапроверу (12 једињења) заекстернустатистичкувалидацију. Развијенје четвородимензионалнимоделкористећикаонезависневаријаблетеоријскеописникеизве -денеизсофтвера Dragon применом GA (генетскиалгоритам)–VSS (variable subset selection) процедуре. Добијенимодел јеупотребљензапредвиђање напонапаре једињењаскупаза проверу, ипотврђена јесагласност измеђуексперименталнихипретсказаних вредности. Овај модел, са високомстатистичком значајношћу (R2 = 0,9090, Q2_Loo = 0,8748, Q2_Ext = 0,8307, s = 0,24), можебитиадекватноупотребљензапредвиђањеиописивање log (pv/Pa)

вредности других VOC. Област применљивости MLR модела испитивана је коришћењем

Виљемсовогграфа (William‘s plot) даседетектујуједињењакојасенеуклапајуумодел.

(Примљено 6. марта, прихваћено 13. јуна 2019)

REFERENCES

1. V. Cirimele, M. Etter, M. Villian, P. Kintez, Ann. Toxicol. Anal. 20 (2008) 67 (https://doi.org/10.1051/ata/2009002)

2. S. Chtita, R. Hmamouchi, M. Larif, M. Ghamali, M. Bouachrine, T. Lakhlifi, J. Taibah Univ. Sci. 10 (2016) 868 (https://doi.org/10.1016/j.jtusci.2015.04.007)

3. J. Akbar, S. Iqbal, F. Batool, A. Karim , K. W. Chan, Int. J. Mol. Sci. 13 (2012) 15387 (https://doi.org/10.3390/ijms131115387)

4. D. Mackay, W. Y. Shiu, K. C. Ma, S. C. Lee, Handbook of physical-chemical properties and environmental fate for organic chemicals,2nd_{ed., CRC Press Inc., Boca Raton, FL,} 2006 (https://doi.org/10.1201/9781420044393)

5. HyperChem 6.03 Package, Hypercube, Inc., Gainesville, FL, 1999; software available at: http://www.hyper.com

6. Talete Srl. Dragon for Windows(Software for Molecular Descriptor Calculation) Version 5.5,Milano, 2007, software available at http://www.talete.mi.it/

7. R. Leardi, R. Boggia, M. Terrile, J. Chemom. 6 (1992) 267 (https://doi.org/10.1002/cem.1180060506)

8. R. Todeschini, D. Ballabio, V. Consonni, A. Mauri, M. Pavan, MOBYDIGS, Software for Multilinear Regression Analysis and Variable Subset Selection by Genetic Algorithm. Release 1.1 for windows, Milano, 2009 (http://www.talete.mi.it/)

9. L. Eriksson, J. Jaworska, A. Worth, M. McCronin, R. M. Dowell, P. Gramatica, Environ. Health. Perspect. 111 (2003) 1361 (https://doi.org/10.1289/ehp.5758)

10. A. Tropscha, P. Gramatica, V. K. Gombar, QSARComb. Sci. 22 (2003) 70 (https://doi.org/10.1002/qsar.200390007)

11. H. Kubinyi, F. A. Hamprecht, T. Mietzner, J. Med. Chem. 41 (1998) 2553 (https://doi.org/10.1021/jm970732a)

(10)

13. L. F. Ramsey, W. D. Schafer, The Statistical Sleuth,Wadsworth Publishing Company, Belmont, CA, 1997 (https://ir.library.oregonstate.edu/downloads/j3860c13r)

14. A. J. Holder, D. M. Yourtee, D. A. White, A. G. Galaros, R. J. Smith Chain, J. Comput. Aided. Mol. Des. 17 (2003) 223 (https://doi.org/10.1023/A:1025382226037)

15. S. Chatterjee, A. S. Hadi, B. Price, Analysis of collinear data. In: Regression analysis by example, 4th_{ed., S. Chatterjee, A. S. Hadi, Eds., Wiley, New York, 2006, pp.. 221–258} (https://doi.org/10.1002/0470055464.ch9)

16. A. Tropsha, A. Golbraikh, Curr. Pharm. Des. 13 (2007) 3494 (https://doi.org/10.2174/138161207782794257)

17. W. Li, Y. Tang, Y. L. Zheng, Z. B. Qiu, Bioorg. Med. Chem. 14 (2006) 601 (https://doi.org/10.1016/j.bmc.2005.08.052)