Development of a sugar-binding residue prediction system from protein sequences using support vector machine

Computational Biology and Chemistry

Journal homepage: www.elsevier.com/locate/compbiolchem

Research Article


Masaki Banno a, Yusuke Komiyama b, Wei Cao a, Yuya Oku a, Kokoro Ueki a, Kazuya Sumikoshi a, Shugo Nakamura a, Tohru Terada a, Kentaro Shimizu a,∗

a Graduate School of Agricultural and Life Sciences, The University of Tokyo, 1-1-1 Yayoi, Bunkyo-Ward, Tokyo 113-8657, Japan

b Digital Content and Media Sciences Research Division, National Institute of Informatics, 2-1-2 Hitotsubashi, Chiyoda-Ward, Tokyo 101-8430, Japan

Article info

Article history:
Received 8 March 2016
Received in revised form 5 October 2016
Accepted 23 October 2016
Available online 9 November 2016

Keywords: Support vector machine; Sugar-binding proteins; Sugar-binding residue prediction; Carbohydrate; Machine learning

Abstract

Several methods have been proposed for protein–sugar binding site prediction using machine learning algorithms. However, they do not effectively learn the varied properties of binding site residues that arise from the diverse interactions between proteins and sugars. In this study, we classified sugars into acidic and nonacidic sugars and showed that their binding sites have different amino acid occurrence frequencies. Using this result, we developed sugar-binding residue predictors dedicated to the two classes of sugars: an acidic sugar-binding predictor and a nonacidic sugar-binding predictor. We also developed a combination predictor, which combines the results of the two predictors. We showed that when a sugar is known to be acidic, the acidic sugar-binding predictor achieves the best performance, and that when a sugar is known to be nonacidic or its class is unknown, the combination predictor achieves the best performance. Our method uses only amino acid sequences for prediction. A support vector machine was used as the machine learning algorithm, and the position-specific scoring matrix created by the position-specific iterative basic local alignment search tool was used as the feature vector. We evaluated the performance of the predictors using five-fold cross-validation. We have launched our system as an open-source freeware tool on the GitHub repository (https://doi.org/10.5281/zenodo.61513).

∗ Corresponding author.
E-mail address: [email protected] (K. Shimizu).
http://dx.doi.org/10.1016/j.compbiolchem.2016.10.009

1. Introduction

Interactions between sugar chains and proteins play essential roles in biological processes such as intercellular communication, immunity, and cellular recognition. Methods for analyzing such interactions empirically include hemagglutination assays, which are employed in the discovery of novel lectins. In recent years, methods utilizing glycan arrays have been developed as high-throughput solutions, enabling researchers to obtain data on in vitro interactions between multiple sugar chains and proteins (Porter et al., 2010; Blixt et al., 2004; Gabius et al., 2011). Nevertheless, bioinformatics-based prediction approaches can further reduce the time and effort involved in predicting such interactions, providing valuable clues for experimental work. Conventional methods are useful in determining protein–sugar chain interactions or identifying sugar chain recognition sequences. However, they cannot provide information on the binding residues in proteins. Methods such as X-ray crystallography and nuclear magnetic resonance have primarily been used to identify these binding residues. However, such techniques pose numerous challenges because they are generally cost- and labor-intensive. Moreover, the high mobility of sugar chains renders the determination of their tertiary structures difficult (DeMarco and Woods, 2008). As partial solutions to such challenges, bioinformatics-based techniques have been attracting attention.

Docking simulation is a prediction method for sugar-binding residues based on their tertiary structures. To implement this method, protein–ligand docking programs (Morris et al., 2009; Jones et al., 1995, 1997; Biesiada et al., 2011; Forli et al., 2016; Grinter and Zou, 2014) and molecular simulations are often employed. In a previous study involving sugar chain-binding residues, the heparin-binding residues in an interleukin were predicted on the basis of its protein structure (DeMarco and Woods, 2008). The candidate residues were narrowed down via repeated docking with heparin monosaccharides and disaccharides. Then, heparin hexasaccharides were docked to the remaining candidates to predict the heparin-binding residues in the interleukin. Another study used machine learning to predict glucose-binding residues from the tertiary structure of proteins. It employed a learning model with a support vector machine (SVM), which used the occurrence rates of atoms appearing in the proximity of glucose-binding residues as the feature values (McDonald and Thornton, 1994). Tsai et al. (2012) developed a sugar-binding site prediction method based on three-dimensional probability density maps, representing the distributions of 36 non-covalent interacting atom types around protein surfaces. The method reported by Zhao et al. (2014) uses a structural alignment program, SPalign, and binding affinity scores according to a knowledge-based potential. All of these methods rely on the tertiary structure of the target protein for the prediction of the binding residues, thus requiring the determination of the protein structure. The amino acid sequence of a protein is much easier to obtain than its tertiary structure. Sequence-based prediction is therefore preferable for high-throughput analyses such as genome-wide and glycan array analyses.

Some attempts have been made to build software applications capable of learning sequence features so that they can predict sugar-binding residues from amino acid sequences alone. Malik and Ahmad developed a machine learning-based method using neural networks. They constructed a prediction program that uses, as feature values, the position-specific scoring matrices (PSSMs) derived from the residue frequencies and multiple alignments of 40 sugar-binding proteins and 18 galactose-binding proteins. The performance of the program was evaluated by leave-one-out cross-validation (CV) (Malik and Ahmad, 2007). Their results show that the prediction program performs more effectively when applied to a dataset of galactose-binding proteins than when trained on all sugar-binding proteins. Nassif et al. (2009) also developed a glucose-binding site prediction method. This method uses spatial features of binding pockets together with amino acid and chemical features such as charge, polarity, mobility, and hydrophobicity as determinant features of a binding site. Recently, a mannose-binding site prediction program has been developed; it uses the composition profile of patterns as sequence features (Agarwal et al., 2011).

In the present study, we attempted high-performance prediction by grouping the sugar-binding proteins according to the characteristics of their binding residues and designing a predictor dedicated to each group. We analyzed the characteristics of the binding residues by clustering the sugars according to the residue composition at the binding sites, and thereby classified the sugars into different classes. Individual predictors for each sugar class made the learning of the propensities of the binding residues more effective. This, in turn, resulted in improved prediction performance. Furthermore, our method uses only amino acid sequences for prediction. SVM was employed because it is one of the representative techniques for classifying data into two categories with high generalization ability. The SVM takes as input the PSSM values around a target residue as feature values. This can improve the prediction capability further by extensively incorporating information on homologous proteins, coupled with sugar class-specific learning.

2. Materials and methods

2.1. Search for sugar-binding proteins in the Protein Data Bank

We targeted the sugars that frequently occur in vivo, namely aldoses and ketoses, and their derivatives in which the hydroxy group is oxidized or substituted with a methyl group, sulfonic group, phosphate group, acetyl group, amine group, or acetylamide group. Fig. 1 illustrates the procedure for constructing the dataset used for prediction.

With sugar-binding residues defined as the residues within 4 Å of the sugar molecule, we performed an exhaustive search of the Protein Data Bank (PDB) for proteins with at least one sugar-binding residue. This study focused on noncovalent interactions between sugars and proteins and not on glycosylation sites, at which sugars are covalently bonded to proteins. Therefore, the residues within 1.5 Å of a sugar molecule, as well as the residues adjacent to a covalently bonded sugar molecule, were excluded from the search.

Sugars are often covalently attached not only to proteins but also to other types of compounds. Although the PDB contains data on various glycolipids formed by binding between sugars and lipids, these were also excluded. Here, lipids were defined as compounds registered in LIPID MAPS (Sud et al., 2007), a database of lipid substances. Furthermore, saccharides that are used as cryoprotectants, surfactants, and additives to facilitate crystallization (Shi et al., 1997) were excluded because they do not engage in in vivo interactions with proteins.
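The distance-based definition of sugar-binding residues in this section can be illustrated with a minimal sketch. It assumes that atom coordinates have already been parsed into plain tuples; the function names, the data layout, and the use of the minimum atom-to-atom distance are illustrative assumptions, not the authors' implementation.

```python
import math

def min_distance(atoms_a, atoms_b):
    """Smallest Euclidean distance between any atom pair of two atom lists."""
    return min(math.dist(a, b) for a in atoms_a for b in atoms_b)

def sugar_binding_residues(residues, sugar_atoms, cutoff=4.0, covalent_cutoff=1.5):
    """Return IDs of residues treated as noncovalent sugar-binding residues.

    residues: dict mapping a residue ID to a list of (x, y, z) atom coordinates
              (hypothetical layout chosen for this sketch).
    A residue is kept when its closest atom lies within `cutoff` angstroms of
    the sugar but farther than `covalent_cutoff`, which is used here as a
    stand-in for excluding covalently bonded (glycosylation) contacts.
    """
    selected = []
    for res_id, atoms in residues.items():
        d = min_distance(atoms, sugar_atoms)
        if covalent_cutoff < d <= cutoff:
            selected.append(res_id)
    return selected
```

With toy coordinates, a residue 3 Å from the sugar would be kept, whereas residues at 1 Å (treated as covalent) or 9 Å (too distant) would be dropped.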

2.2. Clustering of sugar-binding residues

The basic approach of our study was to improve the prediction accuracy by classifying the sugars into groups and designing predictors dedicated to each group. To obtain an effective classification, we first performed cluster analysis on the basis of the occurrence frequency of residues at the sugar-binding sites. In the PDB database, we targeted sugars bound to the proteins with 100 or more residues. The group average method was employed as the clustering procedure.
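As an illustration of the group average method, the following sketch performs agglomerative clustering with average linkage on per-sugar residue-frequency profiles. The Euclidean distance between profiles and the stopping rule (a fixed number of clusters) are assumptions made for this example; the text does not specify them.

```python
import math

def group_average_clustering(profiles, n_clusters=2):
    """Agglomerative clustering with average linkage (group average method).

    profiles: dict mapping a sugar name to a list of residue frequencies.
    Clusters are merged greedily until `n_clusters` remain.
    """
    clusters = [[name] for name in profiles]

    def linkage(c1, c2):
        # Average of all pairwise distances between members of two clusters.
        total = sum(math.dist(profiles[a], profiles[b]) for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        i, j = min(pairs, key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

Recording the order of merges instead of stopping at a fixed cluster count would yield the tree diagram form of the result.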

2.3. Elimination of redundancy

To reduce the redundancy of the sequences, proteins sharing a sequence similarity of 30% or above over more than 50% alignment coverage were excluded using BLASTClust (Altschul et al., 1997). As a result of this process, 369 sugar-binding proteins were selected, comprising 136 acidic sugar-binding proteins and 270 nonacidic sugar-binding proteins (37 proteins had both acidic and nonacidic sugar-binding residues).

2.4. Prediction method for interacting residues

On the basis of the protein sequence information, we searched for homologous sequences using the position-specific iterative basic local alignment search tool (PSI-BLAST) (Altschul et al., 1997) in the nonredundant database of NCBI, thereby collecting homologous sequences. In this PSI-BLAST search, the number of iterations was two, and the E-value threshold of sequence selection for profile creation was 0.001.

We developed a system that predicts the sugar-binding residues of proteins on the basis of their amino acid sequences, using a support vector machine (SVM) (Cortes and Vapnik, 1995). An SVM learns a predictive model from the training data using the principle of margin maximization. It owes its high generalization capability to this learning approach. To predict the interacting residues, a PSSM of sugar-binding sites and their sequence-neighbor residues is constructed based on the multiple alignment of the sugar-binding proteins. We extracted w consecutive column vectors (corresponding to w consecutive residues in sequences) from the PSSM and used them as (20 × w)-dimensional feature vectors in the SVM. Using these feature values, the SVM predicted whether the central residues were sugar-binding residues. The value of w is determined by the parameter optimization procedure described in Section 2.5.
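The window-based feature construction described above can be sketched as follows. The PSSM is assumed to be available as a list of per-residue rows with 20 scores each; zero-padding beyond the sequence termini and the requirement that w be odd are assumptions of this sketch, since the text does not state how terminal residues are handled.

```python
def window_features(pssm, center, w=5):
    """Build a (20 x w)-dimensional feature vector for the residue at `center`.

    pssm: list of rows, one per residue, each holding 20 substitution scores.
    The window spans w consecutive residues centered on the target residue
    (w is assumed to be odd). Positions outside the sequence are filled
    with zeros, which is an assumption of this sketch.
    """
    half = w // 2
    n_cols = len(pssm[0])
    features = []
    for pos in range(center - half, center + half + 1):
        if 0 <= pos < len(pssm):
            features.extend(pssm[pos])
        else:
            features.extend([0] * n_cols)
    return features
```

For w = 5, each residue is thus represented by a 100-dimensional vector that can be passed to any SVM implementation.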

For learning, the SVM was given the data of the sugar-binding residues listed in Table S1 as the positive dataset. As the negative dataset, we used the data of all the residues that were 5–25 residues away from the sugar-binding sites in the proteins rather than randomly selected protein residues. There were two reasons for using this negative dataset. One was to discriminate between the sugar-binding residues and the nonsugar-binding residues within a protein. The other was that, since adjacent residues tend to have somewhat similar feature values, giving residues adjacent to binding residues as negative examples might teach the machine to impose penalties on feature values resembling the positive examples. For the evaluation of the predictor performance, all residues in the protein sequences were used as the test set.

We constructed three types of predictors using three types of training datasets: a sugar-binding residue predictor using the sugar-binding proteins, an acidic sugar-binding residue predictor using the acidic sugar-binding proteins, and a nonacidic sugar-binding residue predictor using the nonacidic sugar-binding proteins. In addition, we constructed a combination predictor by combining the acidic and nonacidic sugar-binding residue predictors. The SVM-based predictors output decision values as discriminant function values, and the combination predictor is a linear combination of the discriminant functions output by the two predictors, as expressed in the following equation:

$$f_{\mathrm{new}}(x) = p \times f_{\mathrm{acid}}(x) + q \times f_{\mathrm{nonacid}}(x) \qquad (1)$$

For each data point $x$, $f_{\mathrm{acid}}(x)$ and $f_{\mathrm{nonacid}}(x)$ represent the discriminant functions of the acidic and nonacidic sugar-binding predictors, respectively. A new discriminant function $f_{\mathrm{new}}$ was defined by weighting these functions with the parameters $p$ and $q$. The discriminant function $f$ is defined as

$$f(x) = \mathrm{sgn}\left(\sum_{i=1}^{l} \alpha_i y_i K(x_i, x) + b\right) \qquad (2)$$

such that $y_i = f(x_i)$ given $N$ data samples, where $l$ is the number of training records, $y_i \in \{-1, +1\}$ is the label associated with the training data, $\alpha_i$ are the learned coefficients, $b$ is a constant, $x_i$ are the support vectors, and $K$ is the kernel function used to transform the data points.
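Eqs. (1) and (2) can be illustrated with a small sketch that computes an SVM decision value with a Gaussian radial basis function kernel and then combines two such values linearly. All function names and the example parameter values are hypothetical; the sketch assumes that the combination operates on the decision values before the sign function is applied.

```python
import math

def rbf_kernel(u, v, gamma):
    """Gaussian radial basis function kernel K(u, v) = exp(-gamma * |u - v|^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))

def decision_value(x, support_vectors, alphas, labels, b, gamma):
    """Argument of sgn(.) in Eq. (2): sum_i alpha_i * y_i * K(x_i, x) + b."""
    return sum(a * y * rbf_kernel(sv, x, gamma)
               for sv, a, y in zip(support_vectors, alphas, labels)) + b

def combined_decision(f_acid, f_nonacid, p=1.0, q=1.0):
    """Eq. (1): weighted sum of the two predictors' decision values."""
    return p * f_acid + q * f_nonacid

def predict_binding(f_acid, f_nonacid, p=1.0, q=1.0, threshold=0.0):
    """Label a residue as binding when the combined value exceeds the threshold."""
    return combined_decision(f_acid, f_nonacid, p, q) > threshold
```

Sweeping `threshold` over the range of combined values gives the operating points from which a ROC curve can be drawn.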

2.5. Evaluation method

To ensure unbiased parameter selection, we employed a nested CV method. Nested CV provides an unbiased evaluation of generalization capability by selecting the model parameters in a cross-validating manner. In the nested CV, the dataset was first divided into k subsets, with one used as the test set and the others as the training set, as is the case with CV. Within this training set, the parameters were determined so that the predictor performance evaluated by CV would be maximized. Thus, we obtained the optimal values of the parameters based on the grid search: w = 5, C = −3, and γ = −10 (C and γ are parameters for a nonlinear SVM with a Gaussian radial basis function kernel). For the parameters of the discriminant function in Eq. (1), we fixed the value of parameter p as 1 and obtained the optimal value −0.96 of parameter q based on the grid search. Using the selected parameters, we constructed the predictors through learning on the entire training set. The completed model was then subjected to performance evaluation on the basis of the prediction of the test set. This procedure was repeated until all subsets had been used as test sets, and the results were averaged to evaluate the predictor performance.

3. Results

3.1. Characteristics of the interacting sites

3.1.1. Clustering of sugars according to the composition of the residues in the binding sites

Fig. 2A portrays a tree diagram showing the results of hierarchical clustering. The sugars were roughly divided into two groups: those with and those without an acidic functional group.

Although the sugars with an acidic functional group have various functions, their common feature is a basic residue at the binding site. Sialic acid, for example, is a generic term for the neuraminic acid (i.e., the monosaccharide formed by aldol condensation of pyruvic acid and d-mannosamine) derivatives with amino or hydroxyl substituents. The sialic acids play important roles in cellular recognition. The proteins that bind sialic acid include sialic acid-binding immunoglobulin-type lectins (Siglecs), a subset of the I-type lectins. Arginine is crucial for the recognition of sialic acid by Siglecs (Crocker et al., 2007). Glycosaminoglycans are polysaccharide chains consisting of disaccharide units of a uronic acid (monosaccharide with a carboxyl group) and an amino sugar (monosaccharide with an amino group), and many sulfate groups are covalently attached to their hydroxyl groups via ester bonds. The surface of a glycosaminoglycan has a strong negative charge, and arginine, lysine, and occasionally histidine form ionic bonds at the binding residues (Gandhi and Mancera, 2008). In contrast, the CH/π interactions between the CH group and an aromatic ring are crucial in common interactions between sugars and proteins (Gabius et al., 2011).

The saccharides were divided into two classes depending on the primary mode of their interaction with proteins, that is, ionic interaction or nonionic interaction. This property was reflected in the differences in the amino acid content of the binding sites. Taking this into account, we constructed individual binding residue predictors for sugars with and without an acidic functional group and analyzed the differences in their properties. Fig. 2B and C show some examples of the classified monosaccharides. Table 1 provides detailed information on the 40 sugars in Fig. 2.

Fig. 2. Hierarchical clustering by residue occurrence frequency of sugar-binding residues. (A) Hierarchical clustering based on the group average method. (B) Examples of acidic sugars. (C) Examples of nonacidic sugars.

3.1.2. Residue occurrence frequency of sugar-binding sites

We calculated the occurrence probability for each type of amino acid in sugar-binding sites and in the whole sequences of the sugar-binding proteins (Fig. 3).

Fig. 3A shows the base-2 logarithm of the odds ratios of sugar-binding residues to all the protein residues calculated for all the sugar-binding proteins, acidic sugar-binding proteins, and nonacidic sugar-binding proteins.
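A minimal sketch of this kind of calculation is given below. It assumes that the odds ratio for an amino acid is the ratio of its occurrence probability among binding residues to its occurrence probability among all residues; the exact formula used for Fig. 3 is not stated in the text, so this definition is an assumption.

```python
import math
from collections import Counter

def log2_odds(binding_residues, all_residues):
    """Base-2 log ratio of occurrence probabilities for each amino acid.

    binding_residues: one-letter codes of the residues at binding sites.
    all_residues: one-letter codes of every residue in the same proteins
                  (it must include the binding residues).
    Returns a dict {amino acid: log2(p_binding / p_all)} for the amino
    acids observed at binding sites.
    """
    bind_counts = Counter(binding_residues)
    all_counts = Counter(all_residues)
    n_bind = len(binding_residues)
    n_all = len(all_residues)
    scores = {}
    for aa, count in bind_counts.items():
        p_bind = count / n_bind
        p_all = all_counts[aa] / n_all
        scores[aa] = math.log2(p_bind / p_all)
    return scores
```

A positive score indicates that the amino acid is enriched at binding sites relative to the whole sequences, and a negative score indicates depletion.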

In every cluster, polar aromatic amino acids such as tryptophan, tyrosine, and histidine, as well as arginine, played important roles. In the acidic sugar cluster, arginine had a particularly high odds ratio, and residues such as lysine, glycine, and serine displayed comparatively higher ratios than in the nonacidic sugar cluster. In contrast, the odds ratio of tryptophan was particularly high in the nonacidic sugar cluster, and the odds ratios of polar residues with an amide, such as glutamine and asparagine, were higher than in the acidic sugar cluster.

It is well established that sugars engage in CH/π interactions with aromatic rings when binding to a protein. This probably explains why the aromatic residues accounted for a large proportion of amino acids in the binding sites. While acidic sugars interact strongly with basic residues such as arginine and lysine, their interactions with amide-containing residues such as asparagine and glutamine are weaker.

Fig. 3B presents the base-2 logarithm of the odds ratios of sugar-binding residues to the protein surface residues calculated for all the sugar-binding proteins, acidic sugar-binding proteins, and nonacidic sugar-binding proteins. In comparison with the case of all residues, the proportion of hydrophobic residues was high, whereas the occurrence ratio of polar residues was low. This tendency was particularly prominent for phenylalanine and cysteine in the nonacidic cluster.

Fig. 3. Log odds ratios of occurrence of sugar chain-binding residues. The amino acids are listed in order from the lowest Kyte–Doolittle hydropathy index score. Subfigure (C) represents the occurrence ratios of residues that formed a hydrogen bond with the protein side chain. The occurrence ratios of residues that formed a hydrogen bond with the main chain are plotted (main-chain) on the right-hand side. (A) Log odds ratios of sugar chain-binding residues among all amino acid residues. (B) Log odds ratios of sugar chain-binding residues among the surface residues. (C) Log odds ratios of sugar chain-binding residues among hydrogen bond acceptors.

Phenylalanine is a nonpolar aromatic amino acid, and it can engage in a CH/π interaction with a sugar. The relative scarcity of phenylalanine residues exposed on the surface suggests that a large proportion of these residues were functioning as binding residues.

Cysteine stabilizes protein folding by forming disulfide bonds. In OS-9 (PDB: 3AIH), a human-derived P-type lectin, the cysteine residues in the binding sites are strongly conserved among the proteins with the same domain. Although disulfide bonds are not directly involved in sugar binding, their formation might contribute to the establishment of binding domains (Satoh et al., 2010). Cysteine also forms coordinate bonds with metallic ligands. In glucose 1-dehydrogenase (PDB: 2CDB), cysteine is involved in the catalytic reaction by forming a coordinate bond with a Zn2+ ion (Milburn et al., 2006).


Table 1
Names of sugars used in clustering analysis in Fig. 2.

Ligand id   Ligand name
BG6         Beta-d-Glucose-6-phosphate
PRP         Alpha-phosphoribosylpyrophosphoric acid
ADA         Alpha-d-galactopyranuronic acid
IDS         2-O-sulfo-alpha-l-idopyranuronic acid
SGN         N,O6-disulfo-glucosamine
G1P         Alpha-d-glucose-1-phosphate
16G         N-acetyl-d-glucosamine-6-phosphate
G6P         Alpha-d-glucose-6-phosphate
R1P         Ribose-1-phosphate
F6P         Fructose-6-phosphate
FBP         Beta-fructose-1,6-diphosphate
GCU         d-Glucuronic acid
SIA         O-sialic acid
BDP         Beta-d-glucopyranuronic acid
KDO         3-Deoxy-d-manno-oct-2-ulosonic acid
M6P         Alpha-d-mannose-6-phosphate
NAA         N-acetyl-d-allosamine
RAM         Alpha-l-rhamnose
ARA         Alpha-l-arabinose
NAG         N-acetyl-d-glucosamine
NDG         2-(Acetylamino)-2-deoxy-A-d-glucopyranose
BMA         Beta-d-mannose
XYP         Beta-d-xylopyranose
XYS         Xylopyranose
AHR         Alpha-l-arabinofuranose
FUL         Beta-l-fucose
GCS         d-Glucosamine
SGC         4-Deoxy-4-thio-beta-d-glucopyranose
GDL         2-Acetamido-2-deoxy-d-glucono-1,5-lactone
RIP         Ribose (pyranose form)
A2G         N-Acetyl-2-deoxy-2-amino-galactose
MAN         Alpha-d-mannose
RIB         Ribose
FUC         Alpha-l-fucose
FRU         Fructose
GLC         Alpha-d-glucose
GLA         Alpha-d-galactose
NGA         N-Acetyl-d-galactosamine
BGC         Beta-d-glucose
GAL         Beta-d-galactose

Fig. 3C shows the base-2 logarithm of the odds ratio of the protein side chains being the binding loci of the hydrogen bond acceptors for the sugar-binding proteins, acidic sugar-binding proteins, and nonacidic sugar-binding proteins. A similar analysis with hydrogen bond acceptors revealed that the number was very small, and all the hydrogen bonds were formed in the main protein chains. In comparison with Fig. 3B, the difference between the two clusters was notable; serine, lysine, and arginine accounted for larger proportions in the acidic sugar cluster, whereas tryptophan, histidine, asparagine, and glutamine were more predominant in the nonacidic sugar cluster.

These results indicated that amino acids such as lysine, arginine, and serine form hydrogen bonds with acidic sugars. In contrast, hydrogen bonds with asparagine, glutamine, and polar aromatic residues were less likely to occur in the acidic sugar-binding proteins than in the proteins binding the nonacidic sugars.

3.2. Performance of sugar-binding residue predictors

We evaluated three predictors, namely the sugar-binding residue predictor, the acidic sugar-binding site predictor, and the nonacidic sugar-binding site predictor, with the divided testing dataset using five-fold CV, wherein the parameters were optimized to give the best prediction performance for each sugar cluster.

3.2.1. Performance evaluation of sugar-binding residue predictors

We evaluated the performance of each predictor using all residues of sugar-binding proteins as the test set. Table 2 lists the sugar-binding site prediction capabilities of each predictor model. Fig. 4 shows the receiver–operator characteristic (ROC) curves drawn on the basis of the evaluation results. An ROC curve is a graphical plot of the false positive rate (ratio of nonbinding residues falsely predicted as positive, hereafter referred to as FPR) on the x-axis, and the true positive rate (ratio of binding residues correctly predicted as positive, TPR) on the y-axis. The higher the curve is, the better the evaluated predictor is. In this figure, the curves for the combination (green), nonacidic sugar (red), acidic sugar (blue), and all sugar (orange) show the results for the combination predictor, nonacidic sugar-binding residue predictor, acidic sugar residue predictor, and sugar-binding residue predictor, respectively.

Table 2
Performance of four sugar-binding residue predictors for sugar-binding proteins.

SVM sugar predictors   Sens.a (%)   Spec.b (%)   AUC     MCC
All sugars c           34.1         92.3         0.754   0.178
Acidic sugars d        30.1         93.9         0.738   0.169
Nonacidic sugars e     38.5         90.4         0.749   0.169
Combination f          31.6         94.1         0.760   0.185

a Sensitivity.
b Specificity.
c Sugar-binding residue predictor.
d Acidic sugar-binding residue predictor.
e Nonacidic sugar-binding residue predictor.
f Combination predictor.

Fig. 4. Receiver–operator characteristic curves of four sugar-binding residue predictors for sugar-binding proteins. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

As shown in Table 2, the combination model exhibited the largest values for both the Matthews correlation coefficient (MCC) score and the area under the ROC curve (AUC). AUC is calculated by factoring in the entire ROC curve, and MCC is a local index of performance determined when a certain threshold is selected. As can be seen in Fig. 4, this model was particularly effective in the low FPR range in comparison with all the other predictor models. Given this high performance in the low FPR range, the combination model can be regarded as an effective model for the prediction of sugar binding.
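The evaluation measures reported in the tables can be computed as in the following sketch, which uses the standard definitions of sensitivity, specificity, MCC, and AUC. The function names are illustrative, and the AUC is computed here with the rank-based (pairwise comparison) formulation, which is equivalent to the area under the ROC curve; the authors' own evaluation code may differ.

```python
import math

def threshold_metrics(labels, predictions):
    """Sensitivity, specificity, and MCC at a single decision threshold.

    labels, predictions: sequences in which 1 marks a binding residue and
    any other value (for example 0 or -1) marks a nonbinding residue.
    """
    tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    fn = sum(1 for y, p in zip(labels, predictions) if y == 1 and p != 1)
    fp = sum(1 for y, p in zip(labels, predictions) if y != 1 and p == 1)
    tn = sum(1 for y, p in zip(labels, predictions) if y != 1 and p != 1)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sensitivity, specificity, mcc

def auc(labels, scores):
    """Probability that a random positive scores higher than a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y != 1]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Because MCC depends on the chosen threshold while AUC summarizes all thresholds, two predictors can have similar AUC values and still differ in MCC.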

3.2.2. Performance evaluation of acidic sugar-binding residue prediction

Table 3 lists the acidic sugar-binding residue prediction capabilities of each predictor model, and Fig. 5 shows the ROC curves drawn on the basis of the evaluation results. As shown in Table 3, the sugar-binding residue predictor trained with the dataset of acidic sugar-binding proteins (acidic sugar dataset) exhibited the largest values of both AUC and MCC. In particular, in comparison with the combination model and the sugar-binding model, a clear difference between MCC values was observed, whereas there was no substantial difference between the AUCs. Although the performance of the acidic sugar predictor model was lower than that of the other models in the high FPR ranges, it was higher in the lower FPR ranges. Over the entire ROC range, this resulted in a clear difference between MCCs despite the lack of substantial difference between AUCs in the performance evaluation.

Table 3
Performance of four sugar-binding residue predictors for acidic sugar-binding proteins.

SVM sugar predictor    Sens.a (%)   Spec.b (%)   AUC     MCC
All sugars c           30.0         95.1         0.783   0.200
Acidic sugars d        39.4         93.5         0.787   0.221
Nonacidic sugars e     34.0         92.6         0.752   0.163
Combination f          29.7         95.8         0.784   0.193

a Sensitivity.
b Specificity.
c Sugar-binding residue predictor.
d Acidic sugar-binding residue predictor.
e Nonacidic sugar-binding residue predictor.
f Combination predictor.

Fig. 5. Receiver–operator characteristic curves of four sugar-binding residue predictors for acidic sugar-binding proteins.

3.2.3. Performance evaluation of nonacidic sugar-binding residue prediction

Table 4 lists the nonacidic sugar-binding residue prediction capabilities of each predictor model, and Fig. 6 shows the ROC curves drawn on the basis of the evaluation results. The combination model exhibited the largest values for both AUC and MCC. As can be seen in Fig. 6, this model was particularly effective in the FPR range lower than 0.1 in comparison with the other predictor models. In the range above that value, the combination model and the nonacidic sugar-binding protein-based predictor model performed almost equally well and were the most effective; the predictor trained with the unclassified dataset was less effective.

Table 4
Performance of four sugar-binding residue predictors for nonacidic sugar-binding proteins.

SVM sugar predictor    Sens.a (%)   Spec.b (%)   AUC     MCC
All sugars c           30.0         95.1         0.783   0.200
Acidic sugars d        39.4         93.5         0.787   0.221
Nonacidic sugars e     34.0         92.6         0.752   0.163
Combination f          29.7         95.8         0.784   0.193

a Sensitivity.
b Specificity.
c Sugar-binding residue predictor.
d Acidic sugar-binding residue predictor.
e Nonacidic sugar-binding residue predictor.
f Combination predictor.

Fig. 6. Receiver–operator characteristic curves of four sugar-binding residue predictors for nonacidic sugar-binding proteins.

3.2.4. Summary of sugar-binding residue predictions

Table 5 summarizes the prediction results for the three test datasets, listing the best predictor for each. The combination predictor yielded the best performance for the sugar-binding proteins and nonacidic sugar-binding proteins as test datasets. The acidic sugar-binding residue predictor achieved the best performance for acidic sugar-binding proteins as a test dataset. This was probably because the parameters selected were biased toward the large number of nonacidic sugars (accounting for over 70% of the entire dataset), which in turn degraded the prediction capabilities for the smaller number of acidic sugar-binding residues. The prediction tendency of both the sugar-binding residue predictor and the combination predictor was similar to that of the nonacidic sugar-binding residue predictor.

Table 5
Summary of sugar-binding residue predictions.

Test dataset / best predictor      Sens.a (%)   Spec.b (%)   AUC     MCC
All SBPs c / combination           31.6         94.1         0.760   0.185
Acidic SBPs c / acidic             39.4         93.5         0.787   0.221
Nonacidic SBPs c / combination     28.7         95.1         0.752   0.178

a Sensitivity.
b Specificity.
c SBPs: sugar-binding proteins.

In practice, the combination predictor would be the most useful when the target sugars are unknown. If it were apparent in advance that the target sugars possessed an acidic functional group, it would be preferable to employ the acidic sugar-binding residue predictor.

4. Conclusions

Using cluster analysis, we found that the amino acid compositions at the binding sites differed between the acidic sugars and the nonacidic sugars. While a high proportion of basic residues was found in the binding sites for the acidic sugars, the amide residues glutamine and asparagine were relatively scarce. In contrast, among the binding residues for the nonacidic sugars, the proportions of glutamine, asparagine, and glutamic and aspartic acids were high, and the basic residue lysine was relatively scarce. We believe that this difference was responsible for dividing the saccharides into two clusters.

Considering these results, we constructed individual predictors for acidic and nonacidic sugar-binding residues and succeeded in improving the prediction capabilities. The combination predictor, incorporating a linear combination of the predictors for acidic and nonacidic sugar binding, showed the best performance in the prediction of sugar-binding residues and nonacidic sugar-binding residues. This result showed the effectiveness of our method of individual learning according to the properties of sugar-binding residues. Our individual learning approach is particularly effective when the difference in the sequence features between the groups is large. Although the performance of our method does not yet seem to be sufficient, we successfully showed an improvement in performance using the individual learning approach.

We also developed a method to predict whether a given protein is sugar-binding. We built a workflow system combining the predictors of sugar-binding proteins and sugar-binding residues, and launched this system on the Web. The predictors can be found and used on the Galaxy pipeline (Blankenberg et al., 2010) with high flexibility. We now provide this product as an open-source freeware system on the GitHub repository, accessible via the digital object identifier issued by Zenodo (https://doi.org/10.5281/zenodo.61513). Since the residue-level performance reported here was obtained on sugar-binding proteins, combining the two predictors in this system is expected to yield accurate predictions. The sugar-binding residue prediction is based solely on amino acid sequences; it is fast enough to be applied to genome-wide predictions.

Acknowledgments

We would like to thank Dr. Wayne Dawson and the members of the Bioinformation Engineering Laboratory for their support and valuable discussions. This work is supported by the Platform for Drug Discovery, Informatics, and Structural Life Science from the Ministry of Education, Culture, Sports, Science, and Technology, Japan.

Appendix A. Supplementary data

Supplementary data associated with this article can be found, in the online version, at http://dx.doi.org/10.1016/j.compbiolchem.2016.10.009.

References

Agarwal,S.,Mishra,N.K.,Singh,H.,Raghava,G.P.S.,2011.Identiﬁcationofmannose interactingresiduesusinglocalcomposition.PloSOne6(9),e24039,http://dx. doi.org/10.1371/journal.pone.0024039.

Altschul,S.F.,Madden,T.L.,Schäffer,A.A.,Zhang,J.,Zhang,Z.,Miller,W.,Lipman, D.J.,1997.GappedBLASTandPSI-BLAST:anewgenerationofproteindatabase searchprograms.,http://dx.doi.org/10.1093/nar/25.17.3389.

Biesiada,J.,Porollo,A.,Velayutham,P.,Kouril,M.,Meller,J.,2011.Surveyofpublic domainsoftwarefordockingsimulationsandvirtualscreening.Hum.Genom.5 (5),497–505,http://dx.doi.org/10.1186/1479-7364-5-5-497.

Blankenberg, D.,Von Kuster, G.,Coraor, N., Ananda,G., Lazarus, R.,Mangan, M.,Nekrutenko,A.,Taylor,J.,2010.Galaxy:aweb-basedgenomeanalysis toolforexperimentalists. In:Ausubel, F.M.,etal. (Eds.),CurrentProtocols inMolecularBiology.,http://dx.doi.org/10.1002/0471142727.mb1910s89,Unit 19.10.1–19.10.21(Chapter19).

Blixt,O.,Head,S.,Mondala,T.,Scanlan,C.,Huﬂejt,M.E.,Alvarez,R.,Bryan,M.C.,Fazio, F.,Calarese,D.,Stevens,J.,Razi,N.,Stevens,D.J.,Skehel,J.J.,vanDie,I.,Burton, D.R.,Wilson,I.A.,Cummings,R.,Bovin,N.,Wong,C.H.,Paulson,J.C.,2004.Printed covalentglycanarrayforligandproﬁlingofdiverseglycanbindingproteins. Proc.Natl.Acad.Sci.101(49),17033–17038,http://dx.doi.org/10.1073/pnas. 0407902101.

Cortes, C., Vapnik, V., 1995. Support-vector networks. Mach. Learn. 20 (3), 273–297, http://dx.doi.org/10.1023/A:1022627411411.

Crocker, P.R., Paulson, J.C., Varki, A., 2007. Siglecs and their roles in the immune system. Nat. Rev. Immunol. 7 (4), 255–266, http://dx.doi.org/10.1038/nri2056.

DeMarco, M.L., Woods, R.J., 2008. Structural glycobiology: a game of snakes and ladders. Glycobiology 18 (6), 426–440, http://dx.doi.org/10.1093/glycob/cwn026.

Forli, S., Huey, R., Pique, M.E., Sanner, M.F., Goodsell, D.S., Olson, A.J., 2016. Computational protein–ligand docking and virtual drug screening with the AutoDock suite. Nat. Protoc. 11 (5), 905–919, http://dx.doi.org/10.1038/nprot.2016.051.

Gabius, H.J., André, S., Jiménez-Barbero, J., Romero, A., Solís, D., 2011. From lectin structure to functional glycomics: principles of the sugar code. Trends Biochem. Sci. 36 (6), 298–313, http://dx.doi.org/10.1016/j.tibs.2011.01.005.

Gandhi, N.S., Mancera, R.L., 2008. The structure of glycosaminoglycans and their interactions with proteins. Chem. Biol. Drug Des. 72 (6), 455–482, http://dx.doi.org/10.1111/j.1747-0285.2008.00741.x.

Grinter, S.Z., Zou, X., 2014. Challenges, applications, and recent advances of protein–ligand docking in structure-based drug design. Molecules (Basel, Switzerland) 19 (7), 10150–10176, http://dx.doi.org/10.3390/molecules190710150.

Jones, G., Willett, P., Glen, R.C., 1995. Molecular recognition of receptor sites using a genetic algorithm with a description of desolvation. J. Mol. Biol. 245 (1), 43–53, http://dx.doi.org/10.1016/S0022-2836(95)80037-9.

Jones, G., Willett, P., Glen, R.C., Leach, A.R., Taylor, R., 1997. Development and validation of a genetic algorithm for flexible docking. J. Mol. Biol. 267 (3), 727–748, http://dx.doi.org/10.1006/jmbi.1996.0897.

Malik, A., Ahmad, S., 2007. Sequence and structural features of carbohydrate binding in proteins and assessment of predictability using a neural network. BMC Struct. Biol. 7, 1, http://dx.doi.org/10.1186/1472-6807-7-1.

McDonald, I.K., Thornton, J.M., 1994. Satisfying hydrogen bonding potential in proteins. J. Mol. Biol. 238 (5), 777–793, http://dx.doi.org/10.1006/jmbi.1994.1334.

Milburn, C.C., Lamble, H.J., Theodossis, A., Bull, S.D., Hough, D.W., Danson, M.J., Taylor, G.L., 2006. The structural basis of substrate promiscuity in glucose dehydrogenase from the hyperthermophilic archaeon Sulfolobus solfataricus. J. Biol. Chem. 281 (21), 14796–14804, http://dx.doi.org/10.1074/jbc.M601334200.

Morris, G.M., Huey, R., Lindstrom, W., Sanner, M.F., Belew, R.K., Goodsell, D.S., Olson, A.J., 2009. AutoDock4 and AutoDockTools4: automated docking with selective receptor flexibility. J. Comput. Chem. 30 (16), 2785–2791, http://dx.doi.org/10.1002/jcc.21256.

Nassif, H., Al-Ali, H., Khuri, S., Keirouz, W., 2009. Prediction of protein–glucose binding sites using support vector machines. Proteins 77 (1), 121–132, http://dx.doi.org/10.1002/prot.22424.

Porter, A., Yue, T., Heeringa, L., Day, S., Suh, E., Haab, B.B., 2010. A motif-based analysis of glycan array data to determine the specificities of glycan-binding proteins. Glycobiology 20 (3), 369–380, http://dx.doi.org/10.1093/glycob/cwp187.

Satoh, T., Chen, Y., Hu, D., Hanashima, S., Yamamoto, K., Yamaguchi, Y., 2010. Structural basis for oligosaccharide recognition of misfolded glycoproteins by OS-9 in ER-associated degradation. Mol. Cell 40 (6), 905–916, http://dx.doi.org/10.1016/j.molcel.2010.11.017.

Shi, W., Dunbar, J., Jayasekera, M.M.K., Viola, R.E., Farber, G.K., 1997. The structure of L-aspartate ammonia-lyase from Escherichia coli. Biochemistry 36 (30), 9136–9144, http://dx.doi.org/10.1021/bi9704515.

Sud, M., Fahy, E., Cotter, D., Brown, A., Dennis, E.A., Glass, C.K., Merrill, A.H., Murphy, R.C., Raetz, C.R.H., Russell, D.W., Subramaniam, S., 2007. LMSD: LIPID MAPS structure database. Nucl. Acids Res. 35 (Database), D527–D532, http://dx.doi.org/10.1093/nar/gkl838.

Tsai, K.C., Jian, J.W., Yang, E.W., Hsu, P.C., Peng, H.P., Chen, C.T., Chen, J.B., Chang, J.Y., Hsu, W.L., Yang, A.S., 2012. Prediction of carbohydrate binding sites on protein surfaces with 3-dimensional probability density distributions of interacting atoms. PLoS One 7 (7), e40846, http://dx.doi.org/10.1371/journal.pone.0040846.

Zhao, H., Yang, Y., von Itzstein, M., Zhou, Y., 2014. Carbohydrate-binding protein identification by coupling structural similarity searching with binding affinity prediction. J. Comput. Chem. 35 (30), 2177–2183, http://dx.doi.org/10.1002/jcc.23730.
