www.wjpr.net Vol 3, Issue 9, 2014. 432
PREDICTION OF PLASMA PROTEIN BINDING AFFINITY BY
SUPPORT VECTOR MACHINE AND ARTIFICIAL NEURAL
NETWORK
Abhishek Singh Chauhan1*, Utkarsh Raj1, Pritish K. Varadwaj1
1Indian Institute of Information Technology, Allahabad, India.
ABSTRACT
Plasma protein binding (PPB) plays a major role in pharmacokinetics. In this work, we
predict drug distribution using statistical learning. A drug that binds to plasma protein
with high affinity has low drug distribution, whereas a drug that binds with low affinity
has high drug distribution. Our interest is in high drug distribution. We divided a data
set of 715 drugs with drug distribution data into high plasma protein binding (HPPB) and
low plasma protein binding (LPPB) drugs using two statistical learning methods, SVM and
ANN. The central idea is to classify the data into two classes: HPPB compounds and LPPB
compounds. In machine learning, feature selection, here the selection of molecular
descriptors, is a very important step. We used the TSAR, DRAGON and SCHRODINGER software
packages, which operate on molecular file formats such as SDF, to calculate 295
descriptors. We then selected 95 descriptors using feature selection methods. Given new
PPB-related data or test data, our predictive model assigns each compound to the HPPB or
LPPB class. ANN gave higher accuracy than SVM.
KEYWORDS: ANN (Artificial Neural Network), SVM (Support Vector Machine), LPPB
(Low Plasma Protein Binding), HPPB (High Plasma Protein Binding), FFBP (Feed Forward
Back propagation).
Volume 3, Issue 9, 432-441. Research Article ISSN 2277–7105
Article Received on 24 August 2014, Revised on 17 Sept 2014, Accepted on 12 Oct 2014
*Correspondence for Author: Dr. Abhishek Singh Chauhan, Indian Institute of Information
Technology, Allahabad, India.
INTRODUCTION
A drug is a complex set of molecules. A drug is not distributed through the body in its
administered form; after being absorbed it must be broken down into a simpler form, which
is then carried to all parts of the body by the blood. After assimilation and
distribution, the remaining waste portion of the medicine is excreted from the body. ADME
plays a major role in pharmacokinetics: A stands for absorption, D for distribution, M for
metabolism and E for excretion. ADME is a set of pharmacokinetic properties.
Pharmacokinetics and pharmacodynamics both concern drugs: pharmacokinetics is the effect
of the body on the drug, and pharmacodynamics is the effect of the drug on the body. ADME
databases provide up-to-date and comprehensive data for structurally diverse compounds
associated with known ADME properties, such as oral bioavailability, enzyme metabolism,
induction and inhibition, plasma protein binding, transport and blood-brain barrier
penetration. Among these properties we selected plasma protein binding (PPB).
Absorption is the process by which a drug passes from the site of administration into the
blood stream. If a drug is absorbed poorly, it is given by a different route, such as
inhalation or intravenous injection.
Distribution is the part of pharmacokinetics that describes the reversible transport of a
drug. The compound moves into different organs and muscles via the blood stream.
Distribution is the process by which a drug transfers from the intravascular space to the
extravascular space. These spaces are described mathematically in terms of the volume of
distribution:
Volume of distribution = Dose of drug / Drug concentration (in plasma)
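As a worked illustration of the equation above, the following minimal Python sketch computes the apparent volume of distribution for two hypothetical cases. The function name, the units (mg and mg/L) and the dose and concentration values are assumptions chosen for illustration; they are not data from this study.

```python
def volume_of_distribution(dose_mg: float, plasma_conc_mg_per_l: float) -> float:
    """Apparent volume of distribution in litres: dose / plasma concentration."""
    if plasma_conc_mg_per_l <= 0:
        raise ValueError("plasma concentration must be positive")
    return dose_mg / plasma_conc_mg_per_l


# Hypothetical numbers, for illustration only.
# Same dose; more drug retained in plasma gives a smaller volume of distribution.
vd_high_plasma_conc = volume_of_distribution(dose_mg=100.0, plasma_conc_mg_per_l=20.0)
vd_low_plasma_conc = volume_of_distribution(dose_mg=100.0, plasma_conc_mg_per_l=2.0)

print(vd_high_plasma_conc, vd_low_plasma_conc)
```

For the same dose, the case with the higher plasma concentration yields the smaller volume of distribution, which is the relationship between high plasma protein binding and low drug distribution described in this paper.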
Metabolism is the sum of anabolism and catabolism. Anabolism is the biosynthesis of
complex molecules; catabolism is the breakdown of complex molecules.
Excretion is the process by which non-useful materials and waste metabolites are removed.
We selected the distribution property of drugs for our research work. When a drug enters
the blood circulation, it is distributed to the body's tissues. The distribution is
normally uneven because of differences in tissue binding, blood perfusion, permeability of
cell membranes and regional pH. Proteins play a major role in drug distribution: the
interaction of a drug with tissue protein and with plasma protein changes its
distribution. Protein binding has a major impact on a drug's PK and PD. In blood a drug
exists in two forms, bound and unbound: part of the drug is bound to plasma protein and
the remainder is unbound. The unbound fraction of the drug undergoes metabolism. The
binding of a drug to plasma or serum protein is a saturable, reversible process, which is
an important factor in assessing the PD ... plasma. The less a drug is bound, the more
efficiently it can traverse cell membranes or diffuse. Drugs bind to blood proteins such
as human serum albumin, glycoprotein, alpha, beta and gamma globulins and lipoprotein.
High plasma protein binding keeps the drug in the blood compartment, which results in a
lower volume of distribution. Low plasma protein binding leaves more drug free to move
into tissue, which results in a high volume of distribution.
METHODS AND MATERIALS
Data collection and Data pre-Processing
To develop a drug distribution prediction model, the data set should be limited to drugs
with drug distribution data, that is, drugs with known PPB. We used plasma protein binding
data because LPPB drugs show high drug distribution and HPPB drugs show low drug
distribution. The data were downloaded from http://www.pkdb.ifsc.usp.br/. The data set
consists of 715 compounds. Class labels were defined as '0' for low plasma protein binding
drugs and '1' for high plasma protein binding drugs. The whole data set of 715 drugs was
then separated into two sets using the 75:25 principle. The training data include 200
drugs, of which 100 are HPPB ligands and 100 are LPPB ligands. The remaining drugs are
treated as test data.
Calculation of Descriptor Values
The drug data set obtained from the literature was in SDF format. We used three software
packages, DRAGON, TSAR and Schrodinger, to calculate descriptor values for the data set.
With these three packages we were able to calculate 295 descriptors.
Descriptor Selection
A classification problem becomes more difficult when the data are associated with a huge
number of descriptors. The dimensionality of the problem is given by the number of
features, and a larger number of descriptors means a higher-dimensional space, which makes
the optimization harder. Only relevant descriptors should be used in the algorithm, and
descriptor selection is performed to find them. Only these relevant descriptors are
retained in the data set. Highly similar (redundant) and irrelevant descriptors are
identified, and redundant descriptors are removed to improve classifier performance.
Significant descriptors can be selected either by an automated approach or manually.
Automated ...
Relation between different Descriptors
We calculated 295 features with the software and checked whether any two descriptors were
correlated. When two descriptors were correlated, one of them was deleted from the data
set. We used the MATLAB correlation coefficient function for descriptor selection, with
0.6 as the threshold: a descriptor with a correlation coefficient above 0.6 was discarded.
After this step we were left with 95 relevant descriptors. The MATLAB function is
R = corrcoef(X), where X is the input matrix and R is the matrix of correlation
coefficients between the descriptors (columns) of X.
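The correlation-based filtering described above can be sketched in plain Python as follows. This is a minimal illustration and not the MATLAB procedure actually used: the greedy keep-first ordering, the use of the absolute value of the coefficient, the treatment of constant columns, and the toy descriptor values d1 to d3 are all assumptions.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # assumption: a constant column is treated as uncorrelated
    return sxy / sqrt(sxx * syy)


def filter_correlated(columns, threshold=0.6):
    """Greedy filter: keep a descriptor only if |r| <= threshold with every kept one."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Toy descriptor columns (hypothetical values, one entry per compound).
descriptors = {
    "d1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "d2": [2.0, 4.1, 6.0, 8.2, 10.0],  # nearly a multiple of d1, so redundant
    "d3": [5.0, 1.0, 4.0, 2.0, 3.0],
}
selected = filter_correlated(descriptors, threshold=0.6)
print(selected)
```

With this kind of greedy filter the result depends on the order in which descriptors are visited, so a different column order can retain a different subset.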
The 95 calculated descriptors are as follows:
Kier ChiV6 (path) index; Molecular mass; Inertia moment 3 size; Square reciprocal distance
sum (SRDS); TCI6; Lipole y component; Kier Chi6 (ring) index; Inertia moment 1 length;
Total lipole; Log P; Molecular refractivity; Kier ChiV6 (ring) index; Ellipsoidal volume
(whole molecule); Lipole x component; ZM2V; Kier ChiV3 (cluster) index; Connectivity chi-1
[Randic connectivity] (CIX1); Maximal electrotopological positive variation (MEPV); MTCI2;
RTI; Average valence connectivity index chi-0 (AVX0); Average connectivity index chi-5
(ACIX5); Narumi harmonic topological (NHT); Kier benzene-likeliness index (KBLI); BTI;
Kier flexibility (KF); Sum of E-state indices; Solvation connectivity index chi-2 (SCIX2);
Solvation connectivity index chi-1 (SCIX1); TCI4; Connectivity index chi-0 (CIX0); Mol-MW;
Second Mohar (SM); Whete; JHETP; GMTV; Radial centric (RC); Average connectivity index
chi-3 (ACIX3); Maximal electrotopological negative variation (MENV); JHETV; Mean square
distance Balaban (MSDB); Path/walk 2-Randic shape index (RSIPW2); Eccentric (DECC);
Average vertex distance degree (AVDD); MTCI6; Mean distance degree deviation (MDDD);
Average valence connectivity index chi-1 (AVX1); Average connectivity index chi-0 (ACIX0);
Average eccentricity (AECC); MR1; Valence connectivity index chi-5 (VX5); Topological
charge index of order 3 (TCI3); Kier hall; Reciprocal hyper-distance-path index (RHDPI);
Path/walk 3-Randic shape index (RSIPW3); Average connectivity index chi-2 (ACIX2); JHETM;
Solvation connectivity index chi-3 (SCIX3); Narumi geometric topological (NGT); MTCI1;
Balaban distance connectivity index (J); KAMS2; Path/walk 4-Randic shape index (RSIPW4);
Gutman molecular topological (GMT); Path/walk 5-Randic shape index (RSIPW5); Total
structure connectivity (TSC); KAMS3; TCI5; Average connectivity index chi-4 (ACIX4);
Valence connectivity index chi-4 (VX4); Global topological charge (GTC); KAMS1; Narumi
simple topological (NST); Average valence connectivity index chi-2 (AVX2); Valence
connectivity index chi-1 (VX1); Solvation connectivity index chi-5 (SCIX5); XU; Mean
topological charge index of order 3 (MTCI3); Average valence connectivity index chi-3
(AVX3); Average valence connectivity index chi-5 (AVX5); Modified Randic connectivity
(MRC); JHETZ; Average valence connectivity index chi-4 (AVX4); Valence connectivity index
chi-0 (VX0); E-state topological parameter (ETP); Mean Wiener (MW); Average connectivity
index chi-1 (ACIX1); MTCI4; WHETP; JHETE; Topological charge index of order 2 (TCI2).
RESULTS AND DISCUSSION
ANN and SVM classifiers were used for ADME prediction. We used a data set of 715 drugs,
each described by 95 descriptors. Training data and test data were prepared for
classification: the training set contains 200 compounds, and the rest of the data are
treated as test data. Classes were defined as '0' for LPPB and '1' for HPPB, where '0' is
the negative class and '1' is the positive class. The performance of the ANN and SVM
classifiers was compared using quantitative measures: specificity, sensitivity, precision,
Matthews correlation coefficient (MCC), accuracy and Youden's index.
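The performance measures listed above all follow from the four confusion matrix counts. The sketch below uses their standard textbook definitions; the function name and dictionary keys are illustrative, and the example counts are the ones reported under "Result of Test Data" for the ANN. It assumes that both classes are present, so that none of the ratio denominators is zero.

```python
from math import sqrt


def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Standard binary classification measures from confusion matrix counts."""
    sensitivity = tp / (tp + fn)          # true positive rate
    specificity = tn / (tn + fp)          # true negative rate
    precision = tp / (tp + fp)            # positive predictive value
    npv = tn / (tn + fn)                  # negative predictive value
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    youden = sensitivity + specificity - 1
    mcc_denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_denominator if mcc_denominator else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "npv": npv,
        "accuracy": accuracy,
        "youden": youden,
        "mcc": mcc,
    }


# Counts as reported for the ANN test data: TN = 10, TP = 19, FP = 0, FN = 1.
metrics = classification_metrics(tp=19, tn=10, fp=0, fn=1)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Precision and positive predictive value are the same quantity, which is why the function returns them under a single key.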
Artificial Neural Network Results
The ANN was implemented using the nprtool and nntool tools available in MATLAB. In nntool,
feed-forward backpropagation (FFBP) was used to train the network with the mean squared
error performance function; in nprtool, scaled conjugate gradient backpropagation
(trainscg) was used. The number of hidden layers (HL) and neurons was determined by trial
and error. In nntool, the network contains 2 layers, the number of neurons in the layer is
10, and the tansig transfer function is used. The network was built and trained on the
training data of 200 drugs. The best validation performance was found to be 0.362 at 20
epochs. The network was then applied to the test data to find out accuracy, ...
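To make the network structure concrete, here is a minimal Python sketch of a single forward pass through a feed-forward network with 95 inputs, 10 hidden neurons and tansig activations, together with the mean squared error. It is not the MATLAB network used in this work: the weights are random placeholders, training by backpropagation is not shown, and using tansig on the output neuron as well as the hidden layer is an assumption.

```python
import math
import random


def tansig(n: float) -> float:
    """Hyperbolic tangent sigmoid, 2 / (1 + exp(-2n)) - 1, which equals tanh(n)."""
    return math.tanh(n)


def forward(x, w_hidden, b_hidden, w_out, b_out):
    """Forward pass: inputs -> hidden layer (tansig) -> single output (tansig)."""
    hidden = [
        tansig(sum(w * xi for w, xi in zip(weights, x)) + bias)
        for weights, bias in zip(w_hidden, b_hidden)
    ]
    return tansig(sum(w * h for w, h in zip(w_out, hidden)) + b_out)


def mean_squared_error(targets, outputs):
    """Mean squared error between target values and network outputs."""
    return sum((t - o) ** 2 for t, o in zip(targets, outputs)) / len(targets)


# Sizes follow the text (95 descriptors, 10 hidden neurons).
# Weights are random placeholders, not trained values.
rng = random.Random(0)
n_inputs, n_hidden = 95, 10
w_hidden = [[rng.uniform(-0.1, 0.1) for _ in range(n_inputs)] for _ in range(n_hidden)]
b_hidden = [0.0] * n_hidden
w_out = [rng.uniform(-0.1, 0.1) for _ in range(n_hidden)]
b_out = 0.0

descriptor_vector = [rng.random() for _ in range(n_inputs)]  # hypothetical compound
output = forward(descriptor_vector, w_hidden, b_hidden, w_out, b_out)
error = mean_squared_error([1.0], [output])
print(output, error)
```

Backpropagation would adjust w_hidden, b_hidden, w_out and b_out to reduce this error over the training set.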
Result of Nprtool
In nprtool the data are divided into three parts: training, validation and testing. 70% of
the data are used for training, so 140 of the 200 drugs are selected as training data. 15%
of the data (30 drugs) are used for validation, and 15% (30 drugs) for testing. The
network performance is 0.339 at 33 epochs, with 20 hidden neurons. The confusion matrix
gives the accuracy of the network. A confusion matrix was built for the training data.
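The 70/15/15 division described above can be reproduced with a few lines of Python. This is only a sketch of the idea: the function name and the fixed seed are illustrative, and assigning samples by a random shuffle is an assumption about how the division is made.

```python
import random


def split_indices(n_samples: int, train_frac: float = 0.70,
                  val_frac: float = 0.15, seed: int = 0):
    """Shuffle sample indices and cut them into training, validation and test parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = round(n_samples * train_frac)
    n_val = round(n_samples * val_frac)
    train = indices[:n_train]
    validation = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # whatever remains after the first two parts
    return train, validation, test


train_idx, val_idx, test_idx = split_indices(200)
print(len(train_idx), len(val_idx), len(test_idx))
```

For 200 drugs this gives parts of 140, 30 and 30 compounds, matching the counts stated in the text.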
Confusion Matrix Result
Result of Training Data
True Negative (TN) = 72, True Positive (TP) = 63, False Positive (FP) = 3, False Negative
(FN) = 2, Accuracy of network for training = 96.4%, Overall accuracy of network = 95.5%,
Sensitivity = 95%, Specificity = 96%, Precision = 96%, Youden index = 0.91, MCC = 0.9283,
Positive predictive value = 96%, Negative predictive value = 95%.
Result of Test Data
TN = 10, TP = 19, FP = 0, FN = 1, Accuracy of network for test data = 96.7%, Overall
accuracy of network = 95.5%, Sensitivity = 95%, Specificity = 100%, Precision = 100%,
Youden index = 0.95, MCC = 0.9293, Positive predictive value = 100%, Negative predictive
...
Result of SVM
TN = 92, TP = 88, FP = 11, FN = 9, Training data accuracy = 90%, Sensitivity = 90.72%,
Specificity = 89.32%, Precision = 88.89%, Youden index = 0.8004, Positive predictive
value = 88.89%, Negative predictive value = 91.08%, ROC area = 0.9.
Support Vector Machine Classification Results by WEKA
The SVM classifier was built using the 95 descriptors. Classification was performed with
SMO, which is available under the Classify tab of WEKA. Sequential minimal optimization
(SMO) was run on the training data with 10-fold cross-validation. 200 compounds were used
as training data.
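The 10-fold cross-validation scheme mentioned above can be sketched in Python as follows. This illustrates only how the 200 training compounds are partitioned into folds; it is not WEKA's implementation, whose fold assignment may differ (for example by stratifying on class). The function name and seed are illustrative.

```python
import random


def k_fold_indices(n_samples: int, k: int = 10, seed: int = 0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal-sized folds
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_indices(200, k=10))
for fold_number, (train, test) in enumerate(splits, start=1):
    print(fold_number, len(train), len(test))
```

Each compound appears in exactly one test fold, so every compound is predicted once by a model that did not see it during training.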
Training of SVM Classifier for Classification Using MATLAB Function
svmtrain is a MATLAB function that takes two inputs: the training data and the group
(target) data, which holds the class labels. The structure SVMStruct returned by svmtrain
is the input to svmclassify.
SVMStruct = svmtrain(Training, Target)
Group = svmclassify(SVMStruct, Sample)
Here Sample means the test data. svmclassify returns the classified test data.
>> SVMStruct = svmtrain(tr_in, tr_tr)
>> Group = svmclassify(SVMStruct, ts_in)
Group = 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1
Classification Using LibSVM
LibSVM gave an accuracy of 50% on the test data. ANN gave better test-data accuracy than
SVM.
CONCLUSION AND FUTURE WORK
We constructed two prediction models for drug distribution, one using ANN and one using
SVM. 95 descriptors were used for the prediction models, and 200 compounds were used for
training. Classification into the HPPB and LPPB classes was first performed on these 200
drugs; 50 drugs were then taken as test data. ANN gave 95% accuracy on the training data
and 86% accuracy on the test data. SVM gave 90% accuracy on the training data and 80%
accuracy on the test data. These models were built to predict drug distribution by
statistical learning as a substitute for time-consuming experiments. ANN performed better
than SVM. Different software packages were used for the prediction of the HPPB and LPPB
classes. LPPB means low plasma protein binding, which results in high drug distribution,
and HPPB means high plasma protein binding, which results in low drug distribution. Drugs
were divided into ...
Descriptor selection is a very difficult task. We calculated 295 descriptors for
classification, but this number is too large, and the calculated descriptors include some
that are similar or irrelevant. Feature selection algorithms were applied to select the
relevant descriptors, and the data associated with the resulting 95 descriptors were used
for classification. All results are significant. Given a new drug, our predictive model
can tell whether it belongs to the LPPB or HPPB class, and hence whether it has high or
low drug distribution. In this way potential drugs can be identified. ADME prediction is
done here by statistical learning, which reduces the need for experimental work. These
machine learning techniques save both time and cost.