PREDICTION OF PLASMA PROTEIN BINDING AFFINITY BY SUPPORT VECTOR MACHINE AND ARTIFICIAL NEURAL NETWORK


www.wjpr.net Vol 3, Issue 9, 2014. 432


Abhishek Singh Chauhan1*, Utkarsh Raj1, Pritish K. Varadwaj1

1 Indian Institute of Information Technology, Allahabad, India.

ABSTRACT

Plasma protein binding (PPB) plays a major role in pharmacokinetics. In this work, we predict drug distribution using statistical learning. A drug that binds to plasma protein with high affinity has low drug distribution, whereas a drug that binds with low affinity has high drug distribution; our interest is in high drug distribution. We collected 715 drugs with drug distribution data and divided them into high-PPB (HPPB) and low-PPB (LPPB) drugs using two statistical learning methods: the support vector machine (SVM) and the artificial neural network (ANN). The central idea is to classify the data into two classes, HPPB compounds and LPPB compounds. In machine learning, the selection of features (molecular descriptors) is a very important step. We used the TSAR, DRAGON and Schrodinger software packages, which operate on molecular file formats such as SDF, to extract 295 descriptors, and then selected 95 descriptors through different feature selection methods. Given new PPB-related test data, our predictive model assigns each compound to the high-PPB or low-PPB class. The ANN gave higher accuracy than the SVM.

KEYWORDS: ANN (Artificial Neural Network), SVM (Support Vector Machine), LPPB (Low Plasma Protein Binding), HPPB (High Plasma Protein Binding), FFBP (Feed Forward Back Propagation).

Volume 3, Issue 9, 432-441. Research Article. ISSN 2277-7105

Article received on 24 August 2014, revised on 17 Sept 2014, accepted on 12 Oct 2014.

*Correspondence for author: Dr. Abhishek Singh Chauhan, Indian Institute of Information Technology, Allahabad, India.

INTRODUCTION

A drug is a complex set of molecules. It is not distributed in the body in its compound form: after being absorbed, it must be broken down into its simplest form, which then spreads to all parts of the body through the blood. After assimilation and distribution, the remaining waste part of the medicine is excreted from the body. ADME plays a major role in pharmacokinetics: A stands for absorption, D for distribution, M for metabolism and E for excretion. ADME properties are pharmacokinetic properties. Pharmacokinetics and pharmacodynamics both relate to drugs: pharmacokinetics is the effect of the body on the drug, and pharmacodynamics is the effect of the drug on the body. ADME databases provide up-to-date and comprehensive data for structurally diverse compounds linked with known ADME properties, such as oral bioavailability, enzyme metabolism, induction and inhibition, plasma protein binding, transport and the blood-brain barrier. Among these properties we selected plasma protein binding (PPB).

Absorption is the process by which a drug passes from the site of administration into the bloodstream. If a drug is absorbed poorly, it is given by a different route, such as inhalation or intravenous injection.

Distribution is the part of pharmacokinetics that describes the reversible transport of a drug. The compound moves into different organs and muscles via the bloodstream. Distribution is the process by which a drug transfers from the intravascular space to the extravascular space. These spaces are described mathematically in terms of the volume of distribution:

Volume of distribution = Dose of drug / Drug concentration in plasma
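As a small worked illustration of the equation above, the following Python sketch computes the volume of distribution for one dose at two plasma concentrations. The function name, the units (mg and mg/L) and all numbers are hypothetical and are not taken from the paper.

```python
def volume_of_distribution(dose_mg: float, plasma_conc_mg_per_l: float) -> float:
    """Apparent volume of distribution in litres: dose / plasma concentration."""
    if plasma_conc_mg_per_l <= 0:
        raise ValueError("plasma concentration must be positive")
    return dose_mg / plasma_conc_mg_per_l


# Hypothetical numbers: the same 500 mg dose at two plasma concentrations.
vd_high_conc = volume_of_distribution(500.0, 50.0)  # more drug stays in plasma
vd_low_conc = volume_of_distribution(500.0, 5.0)    # more drug has left plasma
print(vd_high_conc, vd_low_conc)
```

A higher plasma concentration for the same dose gives a smaller volume of distribution, and a lower plasma concentration gives a larger one.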

Metabolism is the sum of anabolism and catabolism. Anabolism is the biosynthesis of complex molecules; catabolism is the breakdown of complex molecules.

Excretion is the process by which non-useful materials and waste metabolites are removed.

We selected the distribution property of drugs for our research work. When a drug enters the blood circulation, it is distributed to the body's tissues. The distribution is normally uneven because of differences in tissue binding, blood perfusion, permeability of cell membranes and regional pH. Proteins play a major role in drug distribution: the interaction of a drug with tissue protein and with plasma protein changes its distribution. Protein binding has a major impact on a drug's pharmacokinetics (PK) and pharmacodynamics (PD). In blood a drug exists in two forms, bound and unbound: part of the drug is bound to plasma protein and the remainder is unbound. The unbound fraction of the drug undergoes metabolism. The interaction of a drug with plasma or serum protein is a saturable, reversible process, which is an important factor in assessing the PD [...] plasma. The less bound a drug is, the more efficiently it can traverse cell membranes or diffuse. Drugs bind to blood proteins such as human serum albumin, glycoprotein, lipoprotein, and alpha, beta and gamma globulins. High plasma protein binding keeps the drug in the blood compartment, which results in a lower volume of distribution. Low plasma protein binding leaves more drug free to move into tissue, which results in a higher volume of distribution.

METHODS AND MATERIALS

Data collection and Data pre-Processing

To develop a drug distribution prediction model, the data set should be limited to drugs with drug distribution data, that is, drugs with known PPB. We chose plasma protein binding data because low plasma protein binding (LPPB) drugs show high drug distribution and high plasma protein binding (HPPB) drugs show low drug distribution. The data were downloaded from http://www.pkdb.ifsc.usp.br/. The data set consists of 715 compounds. Class labels were defined as '0' for LPPB drugs and '1' for HPPB drugs. The whole data set of 715 drugs was then separated into two sets using the 75:25 principle. The training data include 200 drugs, of which 100 are HPPB ligands and 100 are LPPB ligands. The remaining drugs are treated as test data.
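The balanced training set described above (100 compounds per class, the rest held out) could be drawn as in the following Python sketch. It is only an illustration: the paper does not say how the 200 training drugs were chosen, so the random sampling, the seed, the function name `balanced_split` and the class sizes in the example label vector are all assumptions.

```python
import random


def balanced_split(labels, per_class=100, seed=0):
    """Return (train_idx, test_idx): per_class items of each class go to training."""
    rng = random.Random(seed)
    train = []
    for cls in (0, 1):  # 0 = LPPB, 1 = HPPB
        members = [i for i, y in enumerate(labels) if y == cls]
        if len(members) < per_class:
            raise ValueError(f"class {cls} has only {len(members)} compounds")
        train.extend(rng.sample(members, per_class))
    train_set = set(train)
    test = [i for i in range(len(labels)) if i not in train_set]
    return sorted(train), test


# Hypothetical label vector for 715 compounds (the class sizes are made up).
labels = [1] * 400 + [0] * 315
train_idx, test_idx = balanced_split(labels)
print(len(train_idx), len(test_idx))
```

Sampling each class separately guarantees the 100/100 balance in the training set regardless of how unbalanced the full data set is.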

Calculation of Descriptor Values

The drug data set obtained from the literature was in SDF format. We used three software packages, DRAGON, TSAR and Schrodinger, to calculate descriptor values for the data set. With these three packages we were able to calculate 295 descriptors.

Descriptor Selection

The difficulty of a classification problem increases when the data are associated with a huge number of descriptors. The dimensionality of the problem is the number of features, and a larger number of descriptors means a higher-dimensional space, which makes the optimization harder. Only relevant descriptors should be used in the algorithm, and descriptor selection is performed to find them. Highly similar and irrelevant descriptors are identified, and redundant descriptors are removed to improve the performance of the classifier. Significant descriptors can be selected using either an automated or a manual approach. Automated [...]

Relation between different Descriptors

We calculated 295 features with the software. We then checked whether any two descriptors were correlated; when two descriptors were correlated, we deleted the correlated descriptor from the data set. We used the MATLAB correlation coefficient function for descriptor selection, with 0.6 as the threshold: descriptors with a correlation coefficient above 0.6 were discarded. After this filtering we obtained 95 relevant descriptors. The MATLAB function is R = corrcoef(X), where X is the input matrix and R is the matrix of correlation coefficients between its columns (the descriptors).
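The correlation filter described above can be sketched in plain Python as follows. This is not the authors' MATLAB code. It assumes a greedy rule (a descriptor is kept only if its absolute Pearson correlation with every descriptor already kept is at most the threshold), and the descriptor names and values are invented; the paper does not state which descriptor of a correlated pair was dropped or whether the absolute value was used.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # constant column: treated as uncorrelated (an assumption)
    return sxy / sqrt(sxx * syy)


def filter_correlated(columns, threshold=0.6):
    """Greedy filter: keep a column only if |r| <= threshold with all kept columns."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Invented descriptor columns, one value per compound.
columns = {
    "d1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "d2": [2.1, 3.9, 6.2, 8.0, 9.9],    # nearly 2 * d1, so highly correlated
    "d3": [1.0, -1.0, 0.0, -1.0, 1.0],  # uncorrelated with d1
}
print(filter_correlated(columns))
```

With a greedy rule the result depends on the order of the columns, so a different ordering of the 295 descriptors could yield a different retained subset.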

The 95 calculated descriptors are as follows:

Kier ChiV6 (path) index; Molecular mass; Inertia moment 3 size; Square reciprocal distance sum (SRDS); TCI6; Lipole y component; Kier Chi6 (ring) index; Inertia moment 1 length; Total lipole; Log P; Molecular refractivity; Kier ChiV6 (ring) index; Ellipsoidal volume (whole molecule); Lipole x component; ZM2V; Kier ChiV3 (cluster) index; Connectivity chi-1 [Randic connectivity] (CIX1); Maximal electrotopological positive variation (MEPV); MTCI2; RTI; Average valence connectivity index chi-0 (AVX0); Average connectivity index chi-5 (ACIX5); Narumi harmonic topological (NHT); Kier benzene-likeliness index (KBLI); BTI; Kier flexibility (KF); Sum of E-state indices; Solvation connectivity index chi-2 (SCIX2); Solvation connectivity index chi-1 (SCIX1); TCI4; Connectivity index chi-0 (CIX0); Mol-MW; Second Mohar (SM); Whete; JHETP; GMTV; Radial centric (RC); Average connectivity index chi-3 (ACIX3); Maximal electrotopological negative variation (MENV); JHETV; Mean square distance Balaban (MSDB); Path/walk 2 Randic shape index (RSIPW2); Eccentric (DECC); Average vertex distance degree (AVDD); MTCI6; Mean distance degree deviation (MDDD); Average valence connectivity index chi-1 (AVX1); Average connectivity index chi-0 (ACIX0); Average eccentricity (AECC); MR1; Valence connectivity index chi-5 (VX5); Topological charge index of order 3 (TCI3); Kier Hall; Reciprocal hyper-distance-path index (RHDPI); Path/walk 3 Randic shape index (RSIPW3); Average connectivity index chi-2 (ACIX2); JHETM; Solvation connectivity index chi-3 (SCIX3); Narumi geometric topological (NGT); MTCI1; Balaban distance connectivity index (J); KAMS2; Path/walk 4 Randic shape index (RSIPW4); Gutman molecular topological (GMT); Path/walk 5 Randic shape index (RSIPW5); Total structure connectivity (TSC); KAMS3; TCI5; Average connectivity index chi-4 (ACIX4); Valence connectivity index chi-4 (VX4); Global topological charge (GTC); KAMS1; Narumi simple topological (NST); Average valence connectivity index chi-2 (AVX2); Valence connectivity index chi-1 (VX1); Solvation connectivity index chi-5 (SCIX5); XU; Mean topological charge index of order 3 (MTCI3); Average valence connectivity index chi-3 (AVX3); Average valence connectivity index chi-5 (AVX5); Modified Randic connectivity (MRC); JHETZ; Average valence connectivity index chi-4 (AVX4); Valence connectivity index chi-0 (VX0); E-state topological parameter (ETP); Mean Wiener (MW); Average connectivity index chi-1 (ACIX1); MTCI4; WHETP; JHETE; Topological charge index of order 2 (TCI2).

RESULTS AND DISCUSSION

ANN and SVM classifiers are used for the ADME prediction. We used a data set of 715 drugs, each described by 95 descriptors. Training and test data were prepared for classification: the training set contains 200 compounds and the rest of the data are treated as test data. Classes are defined as '0' for LPPB (the negative class) and '1' for HPPB (the positive class). The performance of the ANN and SVM classifiers is compared using quantitative measures: specificity, sensitivity, precision, Matthews correlation coefficient (MCC), accuracy and Youden's index.

Artificial Neural Network Results

The ANN was implemented using the nprtool and nntool tools available in MATLAB. In nntool, feed-forward backpropagation (FFBP) was used to train the network with the mean squared error function. In nprtool, scaled conjugate gradient backpropagation (trainscg) was used. The number of hidden layers (HL) and neurons was determined by trial and error. In nntool, the network contains 2 layers, the number of neurons in the layer is 10, and the tansig transfer function is used. The network was built on the training data of 200 drugs. The best validation performance was found to be 0.362 at 20 epochs. The network was then applied to the test data to find out accuracy, [...]
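To make the network description above concrete, here is a minimal Python sketch of a single forward pass through a two-layer feed-forward network with 95 inputs, 10 tansig hidden neurons and a mean squared error measure. It is not the authors' MATLAB network: the random weights, the tansig output neuron, the four invented samples and all names are assumptions, and no training (backpropagation) is performed.

```python
import math
import random


def tansig(n):
    """Hyperbolic tangent sigmoid: 2 / (1 + exp(-2n)) - 1, equal to tanh(n)."""
    return 2.0 / (1.0 + math.exp(-2.0 * n)) - 1.0


def forward(x, w_hidden, b_hidden, w_out, b_out):
    """One forward pass: tansig hidden layer, then a single tansig output neuron."""
    hidden = [tansig(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w_hidden, b_hidden)]
    return tansig(sum(w * h for w, h in zip(w_out, hidden)) + b_out)


def mse(targets, outputs):
    """Mean squared error between target labels and network outputs."""
    return sum((t - o) ** 2 for t, o in zip(targets, outputs)) / len(targets)


rng = random.Random(0)
n_inputs, n_hidden = 95, 10  # 95 descriptors, 10 hidden neurons
w_hidden = [[rng.uniform(-0.1, 0.1) for _ in range(n_inputs)] for _ in range(n_hidden)]
b_hidden = [0.0] * n_hidden
w_out = [rng.uniform(-0.1, 0.1) for _ in range(n_hidden)]
b_out = 0.0

# Four invented compounds with random descriptor values and made-up labels.
samples = [[rng.random() for _ in range(n_inputs)] for _ in range(4)]
targets = [0, 1, 0, 1]
outputs = [forward(x, w_hidden, b_hidden, w_out, b_out) for x in samples]
print(mse(targets, outputs))
```

Training would adjust the weights to reduce this error; the sketch only shows how the transfer function and the error measure fit together.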

Result of Nprtool

In nprtool the data are divided into three parts: training, validation and testing. 70% of the data are used for training, so 140 of the 200 drugs are selected as training data. 15% of the data (30 drugs) are used for validation and 15% (30 drugs) for testing. The performance of the network is 0.339 at 33 epochs. The number of hidden neurons is 20. The confusion matrix gives the accuracy of the network. A confusion matrix was built for the training data.
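The 70/15/15 division described above can be reproduced with a short Python sketch. The shuffling, the seed and the function name `split_indices` are assumptions made for illustration, not details of how nprtool assigns samples internally.

```python
import random


def split_indices(n, percents=(70, 15, 15), seed=0):
    """Shuffle range(n) and cut it into training, validation and testing parts."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = n * percents[0] // 100
    n_val = n * percents[1] // 100
    # Whatever remains after training and validation goes to testing.
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]


train, val, test = split_indices(200)
print(len(train), len(val), len(test))
```

For 200 drugs this gives 140 training, 30 validation and 30 testing samples, matching the counts reported in the text.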

Confusion Matrix Result

Result of Training Data

True Negative (TN) = 72, True Positive (TP) = 63, False Positive (FP) = 3, False Negative (FN) = 2, Accuracy of network for training = 96.4%, Overall accuracy of network = 95.5%, Sensitivity = 95%, Specificity = 96%, Precision = 96%, Youden index = 0.91, MCC = 0.9283, Positive predictive value = 96%, Negative predictive value = 95%.

Result of Test Data

TN = 10, TP = 19, FP = 0, FN = 1, Accuracy of network for test data = 96.7%, Overall accuracy of network = 95.5%, Sensitivity = 95%, Specificity = 100%, Precision = 100%, Youden index = 0.95, MCC = 0.9293, Positive predictive value = 100%, Negative predictive value [...]

Result of SVM

TN = 92, TP = 88, FP = 11, FN = 9, Training data accuracy = 90%, Sensitivity = 90.72%, Specificity = 89.32%, Precision = 88.89%, Youden index = 0.8004, Positive predictive value = 88.89%, Negative predictive value = 91.08%, ROC area = 0.9.
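The performance measures reported in this section all follow from the four confusion matrix counts. The Python sketch below uses the standard textbook definitions (the paper does not print its formulas, so these are assumed) and applies them to the SVM counts given above; the function name is hypothetical.

```python
from math import sqrt


def binary_metrics(tp, tn, fp, fn):
    """Standard two-class measures; assumes no denominator is zero."""
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    ppv = tp / (tp + fp)           # positive predictive value (precision)
    npv = tn / (tn + fn)           # negative predictive value
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    youden = sensitivity + specificity - 1
    mcc = (tp * tn - fp * fn) / sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "youden": youden,
        "mcc": mcc,
    }


# SVM counts reported above: TN = 92, TP = 88, FP = 11, FN = 9.
m = binary_metrics(tp=88, tn=92, fp=11, fn=9)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

Note that precision and positive predictive value are the same quantity, TP / (TP + FP).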

Support Vector Machine Classification Results by WEKA

The SVM classifier was built using the 95 descriptors. Classification was done with SMO (sequential minimal optimization), which is available among the classifiers in the Classify panel of WEKA. SMO was run on the training data of 200 compounds with 10-fold cross-validation.
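The 10-fold cross-validation scheme mentioned above can be illustrated with the following Python sketch, which only generates the index partitions. It is a simplified stand-in, not WEKA's implementation: the folds here are contiguous and unshuffled, and no stratification by class is attempted.

```python
def k_fold_indices(n, k=10):
    """Yield (train_idx, test_idx) for k contiguous folds over range(n)."""
    # The first n % k folds get one extra item so every sample is used once.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size


folds = list(k_fold_indices(200, k=10))
print(len(folds), len(folds[0][0]), len(folds[0][1]))
```

With 200 training compounds, each of the 10 rounds trains on 180 compounds and evaluates on the remaining 20, so every compound is used for evaluation exactly once.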

Training of SVM Classifier for Classification Using MATLAB Function

svmtrain is a function available in MATLAB. It takes two inputs, the training data and the group vector. Group means the target data, which hold the class labels. The structure SVMStruct returned by svmtrain is an input to svmclassify.

SVMStruct = svmtrain(Training, Group)

Group = svmclassify(SVMStruct, Sample)

Here Sample means the test data. svmclassify returns the predicted class of each test compound.

>> SVMStruct = svmtrain(tr_in, tr_tr)

>> Group = svmclassify(SVMStruct, ts_in)

Group = 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1

Classification Using LibSVM

LibSVM gave a test data accuracy of 50%. The ANN therefore gave better test data accuracy than the SVM.

Conclusion and Future Work

We constructed two prediction models for drug distribution, one based on ANN and one on SVM. 95 descriptors were used in the prediction models and 200 compounds were used for drug distribution prediction. Classification into HPPB and LPPB classes was first done on the 200 drugs; then 50 drugs were taken as test data. The ANN gave 95% accuracy on the training data and 86% on the test data. The SVM gave 90% accuracy on the training data and 80% on the test data. These models were built to predict drug distribution by statistical learning as a substitute for time-consuming experiments. The ANN performed better than the SVM. Different software packages were used for the prediction of HPPB and LPPB classes. LPPB means low plasma protein binding, which results in high drug distribution, and HPPB means high plasma protein binding, which results in low drug distribution. Drugs were divided into [...]

Descriptor selection is a very difficult job. We calculated 295 descriptors for classification, but this number is too large, and the calculated descriptors include some that are similar or irrelevant. Algorithms were therefore applied to select the relevant descriptors, and the data associated with the 95 selected descriptors were used for classification. All results are significant. Given a new drug, our predictive model can tell which class, LPPB or HPPB, it belongs to, and hence whether the drug has high or low drug distribution. This allows potential drugs to be identified. ADME prediction is done by statistical learning, which avoids experimental work. These machine learning techniques save both time and cost.
