Computational Biology and Chemistry
journal homepage: www.elsevier.com/locate/compbiolchem
Research Article
Development of a sugar-binding residue prediction system from protein sequences using support vector machine
Masaki Banno a, Yusuke Komiyama b, Wei Cao a, Yuya Oku a, Kokoro Ueki a, Kazuya Sumikoshi a, Shugo Nakamura a, Tohru Terada a, Kentaro Shimizu a,∗
a Graduate School of Agricultural and Life Sciences, The University of Tokyo, 1-1-1 Yayoi, Bunkyo-Ward, Tokyo 113-8657, Japan
b Digital Content and Media Sciences Research Division, National Institute of Informatics, 2-1-2 Hitotsubashi, Chiyoda-Ward, Tokyo 101-8430, Japan
Article info
Article history:
Received 8 March 2016
Received in revised form 5 October 2016
Accepted 23 October 2016
Available online 9 November 2016
Keywords:
Support vector machine
Sugar-binding proteins
Sugar-binding residue prediction
Carbohydrate
Machine learning
Abstract
Several methods have been proposed for protein–sugar binding site prediction using machine learning algorithms. However, they do not effectively learn the diverse properties of binding site residues that arise from the variety of interactions between proteins and sugars. In this study, we classified sugars into acidic and nonacidic sugars and showed that their binding sites have different amino acid occurrence frequencies. Using this result, we developed sugar-binding residue predictors dedicated to the two classes of sugars: an acidic sugar-binding predictor and a nonacidic sugar-binding predictor. We also developed a combination predictor that combines the results of the two predictors. We showed that when a sugar is known to be an acidic sugar, the acidic sugar-binding predictor achieves the best performance, and that when a sugar is known to be a nonacidic sugar or its class is unknown, the combination predictor achieves the best performance. Our method uses only amino acid sequences for prediction. A support vector machine was used as the machine learning algorithm, and the position-specific scoring matrix created by the position-specific iterative basic local alignment search tool was used as the feature vector. We evaluated the performance of the predictors using five-fold cross-validation. We have launched our system as an open source freeware tool on the GitHub repository (https://doi.org/10.5281/zenodo.61513).
© 2016 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
1. Introduction
Interactions between sugar chains and proteins play essential roles in biological processes such as intercellular communication, immunity, and cellular recognition. The methods to empirically analyze such interactions include hemagglutination assays, which are employed in the discovery of novel lectins. In recent years, methods utilizing glycan arrays have been developed as high-throughput solutions, enabling researchers to obtain data on in vitro interactions between multiple sugar chains and proteins (Porter et al., 2010; Blixt et al., 2004; Gabius et al., 2011). Nevertheless, bioinformatics-based prediction approaches can further reduce the time and effort involved in predicting such interactions, providing valuable clues for experimental work. Conventional methods are useful in determining protein–sugar chain interactions or identifying sugar chain recognition sequences. However, they cannot provide information on the binding residues in proteins. Methods such as X-ray crystallography and nuclear magnetic resonance have primarily been used to identify these binding residues. However, such techniques pose numerous challenges because they are generally cost- and labor-intensive. Moreover, the high mobility of sugar chains renders the determination of their tertiary structures difficult (DeMarco and Woods, 2008). As partial solutions to such challenges, bioinformatics-based techniques have been attracting attention.
∗ Corresponding author.
E-mail address: [email protected] (K. Shimizu).
Docking simulation is a prediction method for sugar-binding residues based on their tertiary structures. To implement this method, many protein–ligand docking programs (Morris et al., 2009; Jones et al., 1995, 1997; Biesiada et al., 2011; Forli et al., 2016; Grinter and Zou, 2014) and molecular simulations are often employed. In a previous study involving sugar chain-binding residues, the heparin-binding residues were predicted in an interleukin on the basis of its protein structure (DeMarco and Woods, 2008). The candidate residues were narrowed down via repeated docking with heparin monosaccharides and disaccharides. Then, the heparin hexasaccharides were docked to the remaining candidates to predict the heparin-binding residues in the interleukin.
Another study has used machine learning to predict glucose-binding residues from the tertiary structures of proteins. It employed a learning model with a support vector machine (SVM), which used the occurrence rates of atoms appearing in the proximity of glucose-binding residues as the feature values (McDonald and Thornton, 1994). Tsai et al. (2012) developed a sugar-binding site prediction method based on three-dimensional probability density maps representing the distributions of 36 non-covalent interacting atom types around protein surfaces. The method reported by Zhao et al. (2014) uses a structural alignment program, SPalign, and binding affinity scores according to a knowledge-based potential. All of these methods rely on the tertiary structure of the target protein for the prediction of the binding residues, thus requiring the determination of the protein structure. The amino acid sequence of a protein is much easier to obtain than its tertiary structure. Thus, sequence-based prediction is preferable for high-throughput analyses such as genome-wide and glycan array analyses.
http://dx.doi.org/10.1016/j.compbiolchem.2016.10.009
1476-9271/© 2016 The Authors. Published by Elsevier Ltd.
Some attempts have been made to build software applications capable of learning such features so that they can predict sugar-binding residues from amino acid sequences alone. Malik and Ahmad developed a machine learning-based method using neural networks. They constructed a prediction program using the position-specific scoring matrices (PSSMs) derived from the residue frequency and multiple alignments of 40 sugar-binding proteins and 18 galactose-binding proteins as the feature values. The performance of the program was evaluated by leave-one-out cross-validation (CV) (Malik and Ahmad, 2007). Their results show that the prediction program performs more effectively when applied to a dataset of galactose-binding proteins than when it is trained on all sugar-binding proteins. Nassif et al. (2009) also developed a glucose-binding site prediction method. This method uses spatial features of binding pockets and amino acid and chemical features such as charge, polarity, mobility, and hydrophobicity as determinant features of a binding site. Recently, a mannose-binding site prediction program has been developed; it uses the composition profile of patterns as sequence features (Agarwal et al., 2011).
In the present study, we aimed at high-performance prediction by grouping the sugar-binding proteins according to the characteristics of their binding residues and designing a predictor dedicated to each group. We analyzed the characteristics of the binding residues by clustering the sugars according to the residue composition at the binding sites, and thereby classified the sugars into different classes. Individual predictors for each sugar class made the learning of the propensities of the binding residues more effective. This, in turn, resulted in improved prediction performance. Furthermore, our method uses only amino acid sequences for prediction. SVM was employed because it is one of the representative techniques for classifying data into two categories with high generalization ability. The SVM takes as input the PSSM values around a target residue as feature values. It can improve the prediction capability further by extensive incorporation of the nature of homologous proteins coupled with sugar class-specific learning.
2. Materials and methods
2.1. Search for sugar-binding proteins in the Protein Data Bank
We targeted the sugars that frequently occur in vivo, namely aldoses and ketoses, and their derivatives in which the hydroxy group is oxidized or substituted with a methyl group, sulfonic group, phosphate group, acetyl group, amine group, or acetylamide group. Fig. 1 illustrates the procedure for constructing the dataset used for prediction.
With sugar-binding residues defined as the residues within 4 Å of a sugar molecule, we performed an exhaustive search of the Protein Data Bank (PDB) for proteins with at least one sugar-binding residue. This study focused on noncovalent interactions between sugars and proteins and not on glycosylation sites at which sugars are covalently bonded with proteins. Therefore, the residues within a distance of 1.5 Å from a sugar molecule, as well as the residues adjacent to a covalently bonded sugar molecule, were excluded from the search.
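The distance criteria above can be illustrated with a minimal sketch. It assumes that the distance between a residue and a sugar is the smallest atom-to-atom distance, which the text does not state explicitly; the function names and the toy coordinates are hypothetical, and the additional exclusion of residues adjacent to a covalently bonded sugar is not implemented here.

```python
import math

BINDING_CUTOFF = 4.0   # Å, binding-residue cutoff given in the text
COVALENT_CUTOFF = 1.5  # Å, contacts this close are treated as covalent


def min_distance(atoms_a, atoms_b):
    """Smallest Euclidean distance between two lists of (x, y, z) coordinates."""
    return min(math.dist(a, b) for a in atoms_a for b in atoms_b)


def find_binding_residues(residues, sugar_atoms):
    """Return IDs of residues within 4 Å of the sugar, skipping covalent contacts.

    residues: dict mapping a residue ID to a list of atom coordinates.
    sugar_atoms: list of atom coordinates of one sugar molecule.
    """
    binding = []
    for res_id, atoms in residues.items():
        d = min_distance(atoms, sugar_atoms)
        if d <= COVALENT_CUTOFF:
            continue  # assumed covalent (e.g. a glycosylation site)
        if d <= BINDING_CUTOFF:
            binding.append(res_id)
    return binding


# Toy coordinates (hypothetical, not taken from any PDB entry).
toy_residues = {
    "ARG10": [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    "ASN20": [(3.5, 1.4, 0.0)],
    "GLY30": [(10.0, 0.0, 0.0)],
}
toy_sugar = [(3.5, 0.0, 0.0)]
print(find_binding_residues(toy_residues, toy_sugar))
```

In this toy case only ARG10 is reported: ASN20 lies within 1.5 Å and is skipped as a covalent contact, and GLY30 is farther than 4 Å.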
Sugars are often covalently attached not only to proteins but also to other types of compounds. Although PDB contains data on various glycolipids formed by binding between sugars and lipids, these were also excluded. Here lipids were defined as compounds registered in LIPID MAPS (Sud et al., 2007), a database of lipid substances. Furthermore, saccharides that are used as cryoprotectants, surfactants, and additives to facilitate crystallization (Shi et al., 1997) were excluded because they do not reflect in vivo protein–sugar interactions.
2.2. Clustering of sugar-binding residues
The basic approach of our study was to improve the prediction accuracy by classifying the sugars into groups and designing the predictors dedicated to each group. To obtain an effective classification, we first performed cluster analysis on the basis of the occurrence frequency of residues at the sugar-binding sites. In the PDB database, we targeted sugars bound to the proteins with 100 or more residues. The group average method was employed as the clustering procedure.
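As an illustration of the group average method, the following sketch performs agglomerative clustering in which the distance between two clusters is the mean pairwise distance between their members. The Euclidean distance between frequency vectors is an assumption (the text does not name the metric), and the three-component toy profiles are invented for the example; they are not values from this study.

```python
import math
from itertools import combinations


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def group_average_clustering(profiles, n_clusters):
    """Agglomerative clustering with group-average linkage.

    profiles: dict mapping a sugar ID to its residue-frequency vector.
    Returns a sorted list of clusters, each a sorted list of sugar IDs.
    """
    clusters = [[name] for name in profiles]

    def linkage(c1, c2):
        # Mean pairwise distance between the members of the two clusters.
        total = sum(euclidean(profiles[a], profiles[b]) for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        # Merge the pair of clusters with the smallest average distance.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return sorted(sorted(c) for c in clusters)


# Invented toy frequencies over three residue types (e.g. Arg, Lys, Trp).
toy_profiles = {
    "SIA": [0.50, 0.30, 0.20],
    "GCU": [0.45, 0.35, 0.20],
    "GLC": [0.10, 0.10, 0.80],
    "MAN": [0.15, 0.10, 0.75],
}
print(group_average_clustering(toy_profiles, 2))
```

Stopping at two clusters mirrors the two-group outcome described in the Results; a full dendrogram would record every merge instead of stopping early.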
2.3. Elimination of redundancy
To reduce the redundancy of the sequences, proteins with a sequence similarity of 30% or above over more than 50% alignment coverage were excluded using BLASTClust (Altschul et al., 1997). As a result of this process, 369 sugar-binding proteins, 136 acidic sugar-binding proteins, and 270 nonacidic sugar-binding proteins were selected (37 proteins had both acidic and nonacidic sugar-binding residues).
2.4. Prediction method for interacting residues
On the basis of the protein sequence information, we searched the nonredundant database of NCBI using the position-specific iterative basic local alignment search tool (PSI-BLAST) (Altschul et al., 1997), thereby collecting homologous sequences. In this PSI-BLAST search, the number of iterations was two, and the E-value threshold of sequence selection for profile creation was 0.001.
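For readers who wish to reproduce a comparable search, the sketch below builds a command line for the `psiblast` program of NCBI BLAST+ with the two settings given above. The text does not say which PSI-BLAST executable or version was used, so the choice of BLAST+ is an assumption, and the file names and the database name `nr` are placeholders.

```python
def psiblast_command(query_fasta, pssm_out, db="nr"):
    """Build an argument list for NCBI BLAST+ psiblast (the command is not run here)."""
    return [
        "psiblast",
        "-query", query_fasta,
        "-db", db,
        "-num_iterations", "2",         # two iterations, as stated in the text
        "-inclusion_ethresh", "0.001",  # E-value threshold for inclusion in the profile
        "-out_ascii_pssm", pssm_out,    # write the resulting PSSM as plain text
    ]


# Placeholder file names for illustration only.
cmd = psiblast_command("query.fasta", "query.pssm")
print(" ".join(cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)` on a machine where BLAST+ and a local copy of the database are installed.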
We developed a system that predicts the sugar-binding residues of proteins on the basis of their amino acid sequences, using a support vector machine (SVM) (Cortes and Vapnik, 1995). An SVM learns a predictive model from the training data using the principle of margin maximization. It owes its high generalization capability to this learning approach. To predict the interacting residues, a PSSM of sugar-binding sites and their sequence-neighbor residues is constructed based on the multiple alignment of the sugar-binding proteins. We extracted w consecutive column vectors (corresponding to w consecutive residues in sequences) from the PSSM and used them as (20 × w)-dimensional feature vectors in the SVM. Using these feature values, the SVM predicted whether the central residues were sugar-binding residues. The value of w is determined by the parameter optimization procedure described in Section 2.5.
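The window extraction can be sketched as follows. The handling of sequence termini is not described in the text; zero-padding is used here purely as an assumption, and the function name is hypothetical.

```python
def window_features(pssm, w):
    """Return one (20 * w)-dimensional feature vector per residue.

    pssm: list of rows, one per residue, each holding 20 scores.
    w: odd window width; the target residue sits at the window centre.
    """
    if w % 2 == 0:
        raise ValueError("w must be odd so that the window has a central residue")
    half = w // 2
    pad = [0.0] * 20  # assumption: positions beyond the termini are zero-filled
    features = []
    for i in range(len(pssm)):
        vec = []
        for j in range(i - half, i + half + 1):
            row = pssm[j] if 0 <= j < len(pssm) else pad
            vec.extend(row)
        features.append(vec)
    return features


# Toy PSSM with three residues; row k is filled with the value k + 1.
toy_pssm = [[float(k)] * 20 for k in range(1, 4)]
feats = window_features(toy_pssm, 3)
print(len(feats), len(feats[0]))
```

With the optimal width reported in Section 2.5 (w = 5), each residue would be described by 100 values.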
The SVM was given the data of the sugar-binding residues from Table S1 as a positive dataset for learning. As a negative dataset, we used all the residues that were 5–25 residues away from the sugar-binding sites in the proteins rather than randomly selected protein residues. There were two reasons for using this negative dataset. One was to discriminate sugar-binding residues from nonsugar-binding residues within a protein. The other was that, since adjacent residues tend to have somewhat similar feature values, giving residues adjacent to binding residues as negative examples might teach the machine to impose penalties on feature values resembling the positive examples. For the evaluation of the predictor performance, all residues in the protein sequences were used as the test set.
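One possible reading of this negative-set rule is sketched below: a residue is taken as a negative example when its sequence separation from the nearest binding residue lies between 5 and 25. Measuring the separation to the nearest binding residue is an assumption, since the text does not spell out how proteins with several binding residues are handled.

```python
def select_negatives(seq_length, binding_positions, min_sep=5, max_sep=25):
    """Return 0-based positions whose separation from the nearest
    binding residue lies in [min_sep, max_sep]."""
    binding = sorted(binding_positions)
    if not binding:
        return []
    negatives = []
    for pos in range(seq_length):
        nearest = min(abs(pos - b) for b in binding)
        if min_sep <= nearest <= max_sep:
            negatives.append(pos)
    return negatives


# Toy sequence of 40 residues with one binding residue at position 10.
toy_negatives = select_negatives(40, [10])
print(toy_negatives)
```

Residues closer than 5 positions to a binding residue are left out of training entirely under this reading, which is consistent with the second reason given above.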
We constructed three types of predictors using three types of training datasets: a sugar-binding residue predictor using the sugar-binding proteins, an acidic sugar-binding residue predictor using the acidic sugar-binding proteins, and a nonacidic sugar-binding residue predictor using the nonacidic sugar-binding proteins. In addition, we constructed a combination predictor by combining the acidic and nonacidic sugar-binding residue predictors. While an SVM-based predictor outputs decision values as discriminant function values, the combination predictor is composed of the linear combination of the discriminant functions output by the two predictors. We constructed the new predictor using the linear combination of the predictors for acidic and nonacidic sugars, as expressed in the following equation:
$$f_{\mathrm{new}}(x) = p \times f_{\mathrm{acid}}(x) + q \times f_{\mathrm{nonacid}}(x) \tag{1}$$
For each data point $x$, $f_{\mathrm{acid}}(x)$ and $f_{\mathrm{nonacid}}(x)$ represent the discriminant functions of the acidic and nonacidic sugar-binding predictors, respectively. A new discriminant function $f_{\mathrm{new}}$ was defined by weighting these functions with the parameters $p$ and $q$. The discriminant function $f$ is defined as
$$f(x) = \mathrm{sgn}\left(\sum_{i=1}^{l} \alpha_i y_i K(x_i, x) + b\right) \tag{2}$$
such that $y_i = f(x_i)$ given $N$ data samples, where $l$ is the number of training records, $y_i \in \{-1, +1\}$ is the label associated with the training data, $b$ is a constant, $x_i$ are the support vectors, and $K$ is the kernel function used to transform the data points.
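Eqs. (1) and (2) can be made concrete with a small sketch. It assumes that the quantities combined in Eq. (1) are the decision values, that is, the argument of sgn in Eq. (2), as suggested by the description of the predictor outputs above, and it uses a Gaussian radial basis function kernel. All numerical inputs in the example are toy values, and the function names are hypothetical.

```python
import math


def rbf_kernel(u, v, gamma):
    """Gaussian RBF kernel K(u, v) = exp(-gamma * ||u - v||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))


def decision_value(x, support_vectors, alphas, labels, b, gamma):
    """Argument of sgn in Eq. (2): sum_i alpha_i * y_i * K(x_i, x) + b."""
    return sum(a * y * rbf_kernel(sv, x, gamma)
               for sv, a, y in zip(support_vectors, alphas, labels)) + b


def combine(f_acid, f_nonacid, p, q):
    """Eq. (1): weighted sum of the outputs of the two predictors."""
    return p * f_acid + q * f_nonacid


# Toy model with a single support vector located at the query point.
toy_value = decision_value([0.0, 0.0], [[0.0, 0.0]], [2.0], [1], b=-0.5, gamma=0.1)
print(toy_value)

# Toy decision values combined with the p and q reported in Section 2.5.
print(combine(0.5, 0.25, p=1.0, q=-0.96))
```

A residue would then be called binding when the combined value exceeds a chosen threshold; sweeping that threshold yields the ROC curves used in the evaluation.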
2.5. Evaluation method
To ensure an unbiased parameter selection, we employed a nested CV method. The nested CV provides an unbiased evaluation of generalization capabilities by selecting model parameters in a cross-validating manner. In the nested CV, the dataset was first divided into k subsets, with one used as the test set and the others as the training set, as is the case with CV. Within this training set, the parameters were determined so that the predictor performance evaluated by CV would be maximized. Thus, we obtained the optimal values of the parameters based on the grid search: w = 5, C = −3, and γ = −10. (C and γ are parameters for a nonlinear SVM with a Gaussian radial basis function kernel.) For the parameters of the discriminant function in Eq. (1), we fixed the value of parameter p at 1 and obtained the optimal value of −0.96 for parameter q based on the grid search. Using the selected parameters, we constructed the predictors through learning on the entire training set. The completed model was then subjected to performance evaluation on the basis of the prediction of the test set. This procedure was repeated until all subsets were used as test sets, and the results were averaged to evaluate the predictor performance.
3. Results
3.1. Characteristics of the interacting sites
3.1.1. Clustering of sugars according to the composition of the residues in the binding sites
Fig. 2A portrays a tree diagram showing the results of hierarchical clustering. The sugars were roughly divided into two groups: with and without an acidic functional group.
Although the sugars with an acidic functional group have various functions, their common feature is a basic residue at the binding site. Sialic acid, for example, is a generic term for the neuraminic acid (i.e., the monosaccharide formed by aldol condensation of pyruvic acid and d-mannosamine) derivatives with amino or hydroxyl substituents. The sialic acids play important roles in cellular recognition. The proteins that bind sialic acid include sialic acid-binding immunoglobulin-type lectins (Siglecs), a subset of the I-type lectins. Arginine is crucial for the recognition of sialic acid by Siglecs (Crocker et al., 2007). Glycosaminoglycans are polysaccharide chains consisting of disaccharide units of a uronic acid (monosaccharide with a carboxyl group) and an amino sugar (monosaccharide with an amino group), and many sulfate groups are covalently attached to their hydroxyl groups via ester bonds. The surface of a glycosaminoglycan has a strong negative charge, and arginine, lysine, and occasionally histidine form ionic bonds at the binding residues (Gandhi and Mancera, 2008). In contrast, the CH/π interactions between the CH group and an aromatic ring are crucial in common interactions between sugars and proteins (Gabius et al., 2011).
The saccharides were divided into two classes depending on the primary mode of their interactions, that is, ionic interaction or nonionic interaction. This property was reflected by the differences in amino acid content of the binding proteins. Taking this into account, we constructed individual binding residue predictors for sugars with and without an acidic functional group and analyzed the differences in their properties. Fig. 2B and C show some examples of the classified monosaccharides, and Table 1 gives detailed information on the 40 sugars in Fig. 2.
Fig. 2. Hierarchical clustering by residue occurrence frequency of sugar-binding residues. (A) Hierarchical clustering based on the group average method. (B) Examples of acidic sugars. (C) Examples of nonacidic sugars.
3.1.2. Residue occurrence frequency of sugar-binding sites
We calculated the occurrence probability for each type of amino acid in sugar-binding sites and in the whole sequences of the sugar-binding proteins (Fig. 3).
Fig. 3A shows the base-2 logarithm of the odds ratios of sugar-binding residues to all the protein residues, calculated for all the sugar-binding proteins, acidic sugar-binding proteins, and nonacidic sugar-binding proteins.
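The quantity plotted in Fig. 3A can be illustrated as follows. The sketch assumes that the odds ratio for an amino acid is its occurrence probability among binding residues divided by its occurrence probability among all residues; this reading follows the preceding sentence but is not spelled out as a formula in the text. The toy sequences are invented.

```python
import math
from collections import Counter


def log2_odds(binding_residues, all_residues):
    """Base-2 log of P(aa | binding site) / P(aa | whole sequence) per amino acid.

    Both arguments are sequences of one-letter amino acid codes.
    Amino acids absent from the binding set are omitted from the result.
    """
    bind_counts = Counter(binding_residues)
    all_counts = Counter(all_residues)
    result = {}
    for aa, count in bind_counts.items():
        if all_counts[aa] == 0:
            continue  # cannot occur when binding residues are a subset of all residues
        p_bind = count / len(binding_residues)
        p_all = all_counts[aa] / len(all_residues)
        result[aa] = math.log2(p_bind / p_all)
    return result


# Invented toy data: 16 residues in total, 4 of which are binding residues.
toy_all = "RRKKWWAAAAAAAAAA"
toy_binding = "RRWK"
toy_odds = log2_odds(toy_binding, toy_all)
print(toy_odds)
```

A positive value marks an amino acid that is enriched at binding sites relative to the whole sequence; here arginine is enriched four-fold, giving a value of 2.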
In every cluster, polar aromatic amino acids such as tryptophan, tyrosine, and histidine as well as arginine played important roles. In the acidic sugar cluster, arginine had a particularly high odds ratio, and polar residues such as lysine, glycine, and serine displayed comparatively higher ratios than in the nonacidic sugar cluster. In contrast, the odds ratio of tryptophan was particularly high in the nonacidic sugar cluster, and the odds ratios of polar residues with an amide, such as glutamine and asparagine, were higher than in the acidic sugar cluster.
It is well established that sugars engage in CH/π interactions with aromatic rings when binding to a protein. This probably explains why the aromatic residues accounted for a large proportion of amino acids in the binding sites. While acidic sugars interact strongly with basic residues such as arginine and lysine, their interactions with amide residues such as asparagine and glutamine are weaker.
Fig. 3B presents the base-2 logarithm of the odds ratios of sugar-binding residues to the protein surface residues, calculated for all the sugar-binding proteins, acidic sugar-binding proteins, and nonacidic sugar-binding proteins. In comparison with the case of all residues, the proportion of hydrophobic residues was high, whereas the occurrence ratio of polar residues was low. This tendency was particularly prominent for phenylalanine and cysteine in the nonacidic cluster.
Phenylalanine is a nonpolar aromatic amino acid, and it can engage in a CH/π interaction with a sugar. The relative scarcity of phenylalanine residues exposed on the surface suggests that a large proportion of these residues were functioning as binding residues.
Cysteine stabilizes the protein folding by forming disulfide bonds. In OS-9 (PDB: 3AIH), a human-derived P-type lectin, the cysteine residues in the binding sites are strongly conserved among the proteins with the same domain. Although disulfide bonds are not directly involved in sugar binding, their formation might contribute to the establishment of binding domains (Satoh et al., 2010). Cysteine also forms coordinate bonds with metallic ligands. In the glucose 1-dehydrogenase (PDB: 2CDB), cysteine is involved in the catalytic reaction by forming a coordinate bond with a Zn2+ ion (Milburn et al., 2006).
Fig. 3. Log odds ratios of occurrence of sugar chain-binding residues. The amino acids are listed in order from the lowest Kyte–Doolittle hydropathy index score. Subfigure (C) represents the occurrence ratios of residues that formed a hydrogen bond with the protein side chain. The occurrence ratios of residues that formed a hydrogen bond with the main chain are plotted (main-chain) on the right-hand side. (A) Log odds ratios of sugar chain-binding residues among all amino acid residues. (B) Log odds ratios of sugar chain-binding residues among the surface residues. (C) Log odds ratios of sugar chain-binding residues among hydrogen bond acceptors.
Table 1
Names of sugars used in clustering analysis in Fig. 2.
Ligand id   Ligand name
BG6         Beta-d-Glucose-6-phosphate
PRP         Alpha-phosphoribosylpyrophosphoric acid
ADA         Alpha-d-galactopyranuronic acid
IDS         2-O-sulfo-alpha-l-idopyranuronic acid
SGN         N,O6-disulfo-glucosamine
G1P         Alpha-d-glucose-1-phosphate
16G         N-acetyl-d-glucosamine-6-phosphate
G6P         Alpha-d-glucose-6-phosphate
R1P         Ribose-1-phosphate
F6P         Fructose-6-phosphate
FBP         Beta-fructose-1,6-diphosphate
GCU         d-Glucuronic acid
SIA         O-sialic acid
BDP         Beta-d-glucopyranuronic acid
KDO         3-Deoxy-d-manno-oct-2-ulosonic acid
M6P         Alpha-d-mannose-6-phosphate
NAA         N-acetyl-d-allosamine
RAM         Alpha-l-rhamnose
ARA         Alpha-l-arabinose
NAG         N-acetyl-d-glucosamine
NDG         2-(Acetylamino)-2-deoxy-A-d-glucopyranose
BMA         Beta-d-mannose
XYP         Beta-d-xylopyranose
XYS         Xylopyranose
AHR         Alpha-l-arabinofuranose
FUL         Beta-l-fucose
GCS         d-Glucosamine
SGC         4-Deoxy-4-thio-beta-d-glucopyranose
GDL         2-Acetamido-2-deoxy-d-glucono-1,5-lactone
RIP         Ribose (pyranose form)
A2G         N-Acetyl-2-deoxy-2-amino-galactose
MAN         Alpha-d-mannose
RIB         Ribose
FUC         Alpha-l-fucose
FRU         Fructose
GLC         Alpha-d-glucose
GLA         Alpha-d-galactose
NGA         N-Acetyl-d-galactosamine
BGC         Beta-d-glucose
GAL         Beta-d-galactose
Fig. 3C shows the base-2 logarithm of the odds ratio of the protein side chains being the binding loci of the hydrogen bond acceptors for the sugar-binding proteins, acidic sugar-binding proteins, and nonacidic sugar-binding proteins. A similar analysis with hydrogen bond acceptors revealed that the number was very small, and all the hydrogen bonds were formed in the main protein chains. In comparison with Fig. 3B, the difference between the two clusters was notable; serine, lysine, and arginine accounted for larger proportions in the acidic sugar cluster, whereas tryptophan, histidine, asparagine, and glutamine were more predominant in the nonacidic sugar cluster.
These results indicated that amino acids such as lysine, arginine, and serine form hydrogen bonds with acidic sugars. In contrast, hydrogen bonds with asparagine, glutamine, and polar aromatic residues were less likely to occur in the acidic sugar-binding proteins than in the proteins binding the nonacidic sugars.
3.2. Performance of sugar-binding residue predictors
We evaluated three predictors, namely the sugar-binding residue predictor, the acidic sugar-binding residue predictor, and the nonacidic sugar-binding residue predictor, on the divided test datasets using five-fold CV, in which the parameters were optimized to give the best prediction performance for each sugar cluster.
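The nested CV procedure of Section 2.5, on which this evaluation rests, can be sketched in a model-agnostic way. The callback `fit_and_score`, the contiguous fold assignment, and the use of five inner folds are assumptions made for illustration; the study trained SVMs and, presumably, split the data by protein, neither of which is reproduced here.

```python
def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds of near-equal size."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def nested_cv(n_samples, param_grid, fit_and_score, k_outer=5, k_inner=5):
    """Average outer-fold score with parameters chosen by an inner CV.

    fit_and_score(train_idx, test_idx, params) must train on train_idx and
    return a score on test_idx (higher is better).
    """
    outer_scores = []
    for test_idx in k_fold_indices(n_samples, k_outer):
        held_out = set(test_idx)
        train_idx = [i for i in range(n_samples) if i not in held_out]

        def inner_score(params):
            # Inner CV uses the outer training set only, never the test fold.
            scores = []
            for fold in k_fold_indices(len(train_idx), k_inner):
                val_idx = [train_idx[i] for i in fold]
                val_set = set(val_idx)
                sub_idx = [i for i in train_idx if i not in val_set]
                scores.append(fit_and_score(sub_idx, val_idx, params))
            return sum(scores) / len(scores)

        best = max(param_grid, key=inner_score)
        # Retrain on the whole outer training set, then score the held-out fold.
        outer_scores.append(fit_and_score(train_idx, test_idx, best))
    return sum(outer_scores) / len(outer_scores)


# Toy scoring function whose optimum is the parameter value 3.
print(nested_cv(20, [1, 3, 5], lambda train, test, p: -abs(p - 3)))
```

Because the parameters are selected without ever seeing the outer test fold, the averaged outer score is not inflated by the grid search.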
3.2.1. Performance evaluation of sugar-binding residue predictors
Table 2
Performance of four sugar-binding residue predictors for sugar-binding proteins.
SVM sugar predictors    Sens.a (%)   Spec.b (%)   AUC     MCC
All sugars c            34.1         92.3         0.754   0.178
Acidic sugars d         30.1         93.9         0.738   0.169
Nonacidic sugars e      38.5         90.4         0.749   0.169
Combination sugars f    31.6         94.1         0.760   0.185
a Sensitivity.
b Specificity.
c Sugar-binding residue predictor.
d Acidic sugar-binding residue predictor.
e Nonacidic sugar-binding residue predictor.
f Combination predictor.
Fig. 4. Receiver–operator characteristic curves of four sugar-binding residue predictors for sugar-binding proteins. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
We evaluated the performance of each predictor using all residues of sugar-binding proteins as the test set. Table 2 lists the sugar-binding site prediction capabilities of each predictor model. Fig. 4 shows the receiver–operator characteristic (ROC) curves drawn on the basis of the evaluation results. An ROC curve is a graphical plot of the false positive rate (the ratio of nonbinding residues falsely predicted as positive, hereafter referred to as FPR) on the x-axis and the true positive rate (the ratio of binding residues correctly predicted as positive, TPR) on the y-axis. The higher the curve is, the better the evaluated predictor is. In this figure, the curves for the combination (green), nonacidic sugar (red), acidic sugar (blue), and all sugar (orange) show the results for the combination predictor, nonacidic sugar-binding residue predictor, acidic sugar-binding residue predictor, and sugar-binding residue predictor, respectively.
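As a worked illustration of the ROC curve described above, the sketch below sweeps the decision threshold over a set of scores and integrates the resulting curve with the trapezoidal rule. The scores and labels are toy values, and the trapezoidal rule is one common way to compute the area; the text does not state how the reported AUC values were computed.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points obtained by lowering the threshold.

    scores: decision values, higher means more likely binding.
    labels: 1 for binding residues, 0 for nonbinding residues.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Residues sharing a score are added in one step so ties are handled.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


toy_points = roc_points([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
print(toy_points)
print(auc(toy_points))
```

In the toy case three of the four binding/nonbinding pairs are ranked correctly, so the area is 0.75.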
As shown in Table 2, the combination model exhibited the largest values for both the Matthews correlation coefficient (MCC) and the area under the ROC curve (AUC). AUC is calculated by factoring in the entire ROC curve, whereas MCC is a local index of performance determined when a certain threshold is selected. As can be seen in Fig. 4, this model was particularly effective in the low FPR range in comparison with all the other predictor models. Given this high performance in the low FPR range, the combination model can be regarded as an effective model for the prediction of sugar-binding residues.
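The threshold-dependent measures reported in the tables follow from the confusion matrix at a chosen threshold. The sketch below uses the standard definitions of sensitivity, specificity, and MCC; the counts in the example are toy values, not results from this study.

```python
import math


def sensitivity(tp, fn):
    """Fraction of binding residues predicted as binding (TPR)."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of nonbinding residues predicted as nonbinding (1 - FPR)."""
    return tn / (tn + fp)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; returns 0.0 when it is undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Toy confusion matrix: 3 true positives, 1 false negative,
# 4 true negatives, 2 false positives.
print(sensitivity(3, 1), specificity(4, 2), mcc(3, 4, 2, 1))
```

Unlike AUC, these three values change when the decision threshold is moved, which is why two predictors with similar AUCs can differ in MCC.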
3.2.2. Performance evaluation of acidic sugar-binding residue prediction
Table 3 lists the acidic sugar-binding residue prediction capabilities of each predictor model, and Fig. 5 shows the ROC curves drawn on the basis of the evaluation results. As shown in Table 3, the sugar-binding residue predictor trained with the dataset of acidic sugar-binding proteins (acidic sugar dataset) exhibited the largest values of both AUC and MCC. In particular, in comparison with the combination model and the sugar-binding model, a significant difference between MCC values was observed, whereas there was no substantial difference between the AUCs. Although the performance of the acidic sugar predictor model was lower than that of the other models in the high FPR ranges, it was higher in the lower FPR ranges. Over the entire ROC range, this resulted in a significant difference between MCCs despite the lack of a substantial difference between AUCs in the performance evaluation.
Table 3
Performance of four sugar-binding residue predictors for acidic sugar-binding proteins.
SVM sugar predictor     Sens.a (%)   Spec.b (%)   AUC     MCC
All sugars c            30.0         95.1         0.783   0.200
Acidic sugars d         39.4         93.5         0.787   0.221
Nonacidic sugars e      34.0         92.6         0.752   0.163
Combination sugars f    29.7         95.8         0.784   0.193
a Sensitivity.
b Specificity.
c Sugar-binding residue predictor.
d Acidic sugar-binding residue predictor.
e Nonacidic sugar-binding residue predictor.
f Combination predictor.
Fig. 5. Receiver–operator characteristic curves of four sugar-binding residue predictors for acidic sugar-binding proteins.
Table 4
Performance of four sugar-binding residue predictors for nonacidic sugar-binding proteins.
SVM sugar predictor     Sens.a (%)   Spec.b (%)   AUC     MCC
All sugars c            30.0         95.1         0.783   0.200
Acidic sugars d         39.4         93.5         0.787   0.221
Nonacidic sugars e      34.0         92.6         0.752   0.163
Combination sugars f    29.7         95.8         0.784   0.193
a Sensitivity.
b Specificity.
c Sugar-binding residue predictor.
d Acidic sugar-binding residue predictor.
e Nonacidic sugar-binding residue predictor.
f Combination predictor.
3.2.3. Performance evaluation of nonacidic sugar-binding residue prediction
Table 4 lists the nonacidic sugar-binding residue prediction capabilities of each predictor model, and Fig. 6 shows the ROC curves drawn on the basis of the evaluation results. The combination model exhibited the largest values for both AUC and MCC. As can be seen in Fig. 6, this model was particularly effective in the FPR range lower than 0.1 in comparison with the other predictor models. In the range above that value, the combination model and the nonacidic sugar-binding protein-based predictor model were nearly the most effective; the predictor trained with the unclassified dataset was less effective.
Fig. 6. Receiver–operator characteristic curves of four sugar-binding residue predictors for nonacidic sugar-binding proteins.
Table 5
Summary of sugar-binding residue predictions.
Test dataset/best predictor      Sens.a (%)   Spec.b (%)   AUC     MCC
All SBPs c/combination           31.6         94.1         0.760   0.185
Acidic SBPs c/acidic             39.4         93.5         0.787   0.221
Nonacidic SBPs c/combination     28.7         95.1         0.752   0.178
a Sensitivity.
b Specificity.
c SBPs: sugar-binding proteins.
3.2.4. Summary of sugar-binding residue predictions
Table 5 summarizes the prediction results of the best predictor for each of the three test datasets. The combination predictor yielded the best performance for the sugar-binding proteins and nonacidic sugar-binding proteins as test datasets. The acidic sugar-binding residue predictor achieved the best performance for acidic sugar-binding proteins as a test dataset. This was probably because the selected parameters were biased toward the large number of nonacidic sugars (accounting for over 70% of the entire dataset), which in turn degraded the prediction capability for the small number of acidic sugar-binding residues. The prediction tendency of both the sugar-binding residue predictor and the combination predictor was similar to that of the nonacidic sugar-binding residue predictor.
In practice, the combination predictor would be the most useful when the target sugars are unknown. If it were apparent in advance that the target sugars possessed an acidic functional group, it would be preferable to employ the acidic sugar-binding residue predictor.
4. Conclusions
Using cluster analysis, we found that the amino acid compositions at the binding sites differed for the acidic sugars and nonacidic sugars. While a high proportion of basic residues were found in the binding sites for the acidic sugars, the amide residues glutamine and asparagine were relatively scarce. In contrast, among the binding residues for the nonacidic sugars, the proportions of glutamine, asparagine, and glutamic and aspartic acids were high, and the basic residue lysine was relatively scarce. We believe that this difference was responsible for dividing the saccharides into two clusters.
Considering these results, we constructed individual predictors for acidic and nonacidic sugar-binding residues and succeeded in improving the prediction capabilities. The combination predictor, incorporating a linear combination of the predictors for acidic and nonacidic sugar binding, showed the best performance in the prediction of sugar-binding residues and nonacidic sugar-binding residues. This result showed the effectiveness of our method of individual learning according to the properties of sugar-binding residues. Our individual learning approach is particularly effective when the difference in the sequence features between the groups is large. Although the performance of our method does not seem to be sufficient, we successfully showed an improvement in performance using the individual learning approach.
We also developed a method to predict whether a given protein is sugar-binding. We built a workflow system combining the predictors of sugar-binding proteins and sugar-binding residues, and launched this system on the Web. The predictors can be found and used on the Galaxy pipeline (Blankenberg et al., 2010) with high flexibility. We have released this product as an open source freeware system on the GitHub repository via the document object identifier of Zenodo (https://doi.org/10.5281/zenodo.61513). Because the performance of the sugar-binding residue predictor was obtained on sugar-binding proteins, highly accurate predictions can be achieved using this combined system. The sugar-binding residue prediction is solely based on amino acid sequences; it is fast enough to be applied to genome-wide predictions.
Acknowledgments
We would like to thank Dr. Wayne Dawson and the members of the Bioinformation Engineering Laboratory for their support and valuable discussions. This work is supported by the Platform for Drug Discovery, Informatics, and Structural Life Science from the Ministry of Education, Culture, Sports, Science, and Technology, Japan.
Appendix A. Supplementary data
Supplementary data associated with this article can be found, in the online version, at http://dx.doi.org/10.1016/j.compbiolchem.2016.10.009.
References
Agarwal, S., Mishra, N.K., Singh, H., Raghava, G.P.S., 2011. Identification of mannose interacting residues using local composition. PloS One 6 (9), e24039, http://dx.doi.org/10.1371/journal.pone.0024039.
Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J., Zhang, Z., Miller, W., Lipman, D.J., 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. http://dx.doi.org/10.1093/nar/25.17.3389.
Biesiada, J., Porollo, A., Velayutham, P., Kouril, M., Meller, J., 2011. Survey of public domain software for docking simulations and virtual screening. Hum. Genom. 5 (5), 497–505, http://dx.doi.org/10.1186/1479-7364-5-5-497.
Blankenberg, D., Von Kuster, G., Coraor, N., Ananda, G., Lazarus, R., Mangan, M., Nekrutenko, A., Taylor, J., 2010. Galaxy: a web-based genome analysis tool for experimentalists. In: Ausubel, F.M., et al. (Eds.), Current Protocols in Molecular Biology. Unit 19.10.1–19.10.21 (Chapter 19), http://dx.doi.org/10.1002/0471142727.mb1910s89.
Blixt, O., Head, S., Mondala, T., Scanlan, C., Huflejt, M.E., Alvarez, R., Bryan, M.C., Fazio, F., Calarese, D., Stevens, J., Razi, N., Stevens, D.J., Skehel, J.J., van Die, I., Burton, D.R., Wilson, I.A., Cummings, R., Bovin, N., Wong, C.H., Paulson, J.C., 2004. Printed covalent glycan array for ligand profiling of diverse glycan binding proteins. Proc. Natl. Acad. Sci. 101 (49), 17033–17038, http://dx.doi.org/10.1073/pnas.0407902101.
Cortes, C., Vapnik, V., 1995. Support-vector networks. Mach. Learn. 20 (3), 273–297, http://dx.doi.org/10.1023/A:1022627411411.
Crocker, P.R., Paulson, J.C., Varki, A., 2007. Siglecs and their roles in the immune system. Nat. Rev. Immunol. 7 (4), 255–266, http://dx.doi.org/10.1038/nri2056.
DeMarco, M.L., Woods, R.J., 2008. Structural glycobiology: a game of snakes and ladders. Glycobiology 18 (6), 426–440, http://dx.doi.org/10.1093/glycob/cwn026.
Forli, S., Huey, R., Pique, M.E., Sanner, M.F., Goodsell, D.S., Olson, A.J., 2016. Computational protein–ligand docking and virtual drug screening with the AutoDock suite. Nat. Protoc. 11 (5), 905–919, http://dx.doi.org/10.1038/nprot.2016.051.
Gabius, H.J., André, S., Jiménez-Barbero, J., Romero, A., Solís, D., 2011. From lectin structure to functional glycomics: principles of the sugar code. Trends Biochem. Sci. 36 (6), 298–313, http://dx.doi.org/10.1016/j.tibs.2011.01.005.
Gandhi, N.S., Mancera, R.L., 2008. The structure of glycosaminoglycans and their interactions with proteins. Chem. Biol. Drug Des. 72 (6), 455–482, http://dx.doi.org/10.1111/j.1747-0285.2008.00741.x.
Grinter, S.Z., Zou, X., 2014. Challenges, applications, and recent advances of protein–ligand docking in structure-based drug design. Molecules (Basel, Switzerland) 19 (7), 10150–10176, http://dx.doi.org/10.3390/molecules190710150.
Jones, G., Willett, P., Glen, R.C., 1995. Molecular recognition of receptor sites using a genetic algorithm with a description of desolvation. J. Mol. Biol. 245 (1), 43–53, http://dx.doi.org/10.1016/S0022-2836(95)80037-9.
Jones, G., Willett, P., Glen, R.C., Leach, A.R., Taylor, R., 1997. Development and validation of a genetic algorithm for flexible docking. J. Mol. Biol. 267 (3), 727–748, http://dx.doi.org/10.1006/jmbi.1996.0897.
Malik, A., Ahmad, S., 2007. Sequence and structural features of carbohydrate binding in proteins and assessment of predictability using a neural network. BMC Struct. Biol. 7, 1, http://dx.doi.org/10.1186/1472-6807-7-1.
McDonald, I.K., Thornton, J.M., 1994. Satisfying hydrogen bonding potential in proteins. J. Mol. Biol. 238 (5), 777–793, http://dx.doi.org/10.1006/jmbi.1994.1334.
Milburn, C.C., Lamble, H.J., Theodossis, A., Bull, S.D., Hough, D.W., Danson, M.J., Taylor, G.L., 2006. The structural basis of substrate promiscuity in glucose dehydrogenase from the hyperthermophilic archaeon Sulfolobus solfataricus. J. Biol. Chem. 281 (21), 14796–14804, http://dx.doi.org/10.1074/jbc.M601334200.
Morris, G.M., Huey, R., Lindstrom, W., Sanner, M.F., Belew, R.K., Goodsell, D.S., Olson, A.J., 2009. AutoDock4 and AutoDockTools4: automated docking with selective receptor flexibility. J. Comput. Chem. 30 (16), 2785–2791, http://dx.doi.org/10.1002/jcc.21256.
Nassif, H., Al-Ali, H., Khuri, S., Keirouz, W., 2009. Prediction of protein–glucose binding sites using support vector machines. Proteins 77 (1), 121–132, http://dx.doi.org/10.1002/prot.22424.
Porter, A., Yue, T., Heeringa, L., Day, S., Suh, E., Haab, B.B., 2010. A motif-based analysis of glycan array data to determine the specificities of glycan-binding proteins. Glycobiology 20 (3), 369–380, http://dx.doi.org/10.1093/glycob/cwp187.
Satoh, T., Chen, Y., Hu, D., Hanashima, S., Yamamoto, K., Yamaguchi, Y., 2010. Structural basis for oligosaccharide recognition of misfolded glycoproteins by OS-9 in ER-associated degradation. Mol. Cell 40 (6), 905–916, http://dx.doi.org/10.1016/j.molcel.2010.11.017.
Shi, W., Dunbar, J., Jayasekera, M.M.K., Viola, R.E., Farber, G.K., 1997. The structure of l-aspartate ammonia-lyase from Escherichia coli. Biochemistry 36 (30), 9136–9144, http://dx.doi.org/10.1021/bi9704515.
Sud, M., Fahy, E., Cotter, D., Brown, A., Dennis, E.A., Glass, C.K., Merrill, A.H., Murphy, R.C., Raetz, C.R.H., Russell, D.W., Subramaniam, S., 2007. LMSD: LIPID MAPS structure database. Nucl. Acids Res. 35 (Database), D527–D532, http://dx.doi.org/10.1093/nar/gkl838.
Tsai, K.C., Jian, J.W., Yang, E.W., Hsu, P.C., Peng, H.P., Chen, C.T., Chen, J.B., Chang, J.Y., Hsu, W.L., Yang, A.S., 2012. Prediction of carbohydrate binding sites on protein surfaces with 3-dimensional probability density distributions of interacting atoms. PloS One 7 (7), e40846, http://dx.doi.org/10.1371/journal.pone.0040846.
Zhao, H., Yang, Y., von Itzstein, M., Zhou, Y., 2014. Carbohydrate-binding protein identification by coupling structural similarity searching with binding affinity prediction. J. Comput. Chem. 35 (30), 2177–2183, http://dx.doi.org/10.1002/jcc.23730.