Protein secondary structure prediction by using deep learning method


Citation: Wang, Yangxu, Mao, Hua and Yi, Zhang (2017) Protein secondary structure prediction by using deep learning method. Knowledge-Based Systems, 118. pp. 115-123. ISSN 0950-7051

Published by: Elsevier

URL: http://dx.doi.org/10.1016/j.knosys.2016.11.015

This version was downloaded from Northumbria Research Link: http://nrl.northumbria.ac.uk/39675/


This document may differ from the final, published version of the research and has been made available online in accordance with publisher policies. To read and/or cite from the published version of the research, please visit the publisher’s website (a subscription may be required).

Yangxu Wang, Hua Mao ∗, Zhang Yi

Machine Intelligence Laboratory, College of Computer Science, Sichuan University, Chengdu 610065, People’s Republic of China

Keywords: Deep learning, Secondary structure prediction, Encoder–decoder networks, Recurrent neural networks

Abstract

The prediction of protein structures directly from amino acid sequences is one of the biggest challenges in computational biology. It can be divided into several independent sub-problems, among which protein secondary structure (SS) prediction is fundamental. Many computational methods have been proposed for the SS prediction problem, but few of them model well both the sequence-structure mapping between input protein features and SS and the interactions among residues, both of which are important for SS prediction. In this paper, we propose deep recurrent encoder–decoder networks, called Secondary Structure Recurrent Encoder–Decoder Networks (SSREDNs), to solve the SS prediction problem. A deep architecture and recurrent structures are employed in the SSREDNs to model both the complex nonlinear mapping between input protein features and SS and the mutual interaction among consecutive residues of the protein chain. A series of techniques is also used to refine the model’s performance. The proposed model is applied to the open datasets CullPDB and CB513. Experimental results demonstrate that our method improves both Q3 and Q8 accuracy compared with some publicly available methods. For the Q8 prediction problem, it achieves 68.20% and 73.1% accuracy on the CB513 and CullPDB datasets, respectively, in fewer epochs than the previous state-of-the-art method.

1. Introduction

Discovering a protein’s structure and biological functions is very important for understanding biological processes such as protein-protein interactions [1], protein complex identification [2] and protein structure prediction. Protein structure prediction, which elucidates the complex relationship between a protein sequence and its structure, is one of the most important challenges in computational biology [3]. The most elemental task of protein structure prediction is protein secondary structure (SS) prediction, which aims to discover the structural states of amino acids. SS represents the local conformation of the polypeptide backbone of proteins and provides a bridge that links the primary sequence and the tertiary structure, which is very helpful for many structural and functional analysis tools [4,5].

∗ Corresponding author.
E-mail addresses: [email protected] (Y. Wang), [email protected] (H. Mao), [email protected] (Z. Yi).

Typically, protein secondary structures are either divided into three states (α-helix (H), β-strand (E) and coil region (C)) or further classified into eight fine-grained states (3₁₀-helix (G), α-helix (H), π-helix (I), β-strand (E), β-bridge (B), β-turn (T), high-curvature region (S) and irregular loop (L)). SS prediction is usually evaluated by Q3 and Q8 accuracy, which measure the percentage of residues for which the 3-state or 8-state SS is correctly predicted. Currently, extensive research efforts have been spent on applying computational methods to the Q3 prediction problem, but very few on the more challenging Q8 prediction problem.
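To make the two evaluation measures concrete, the following minimal sketch computes Q8 and Q3 accuracy for a single chain. The 8-to-3 state reduction used here (H, G, I to H; E, B to E; everything else to C) is a commonly used convention and is an assumption of this example, since the text does not state which reduction is applied; the function names and the toy sequences are hypothetical.

```python
# Illustrative sketch: Q8 and Q3 accuracy for one protein chain.
# Assumption: the common reduction H, G, I -> H; E, B -> E; T, S, L -> C.
EIGHT_TO_THREE = {
    "H": "H", "G": "H", "I": "H",
    "E": "E", "B": "E",
    "T": "C", "S": "C", "L": "C",
}


def accuracy(true_states, predicted_states):
    """Percentage of residues whose state is predicted correctly."""
    if len(true_states) != len(predicted_states):
        raise ValueError("sequences must have the same length")
    correct = sum(t == p for t, p in zip(true_states, predicted_states))
    return 100.0 * correct / len(true_states)


def to_three_state(states):
    """Map an 8-state string onto the 3-state alphabet."""
    return "".join(EIGHT_TO_THREE[s] for s in states)


def q8_q3(true_q8, pred_q8):
    """Return (Q8, Q3) accuracy for 8-state label strings."""
    q8 = accuracy(true_q8, pred_q8)
    q3 = accuracy(to_three_state(true_q8), to_three_state(pred_q8))
    return q8, q3


# Toy example (hypothetical labels, not taken from any dataset).
true_q8 = "LHHHHGGTEEEBSL"
pred_q8 = "LHHHHHHTEEEELL"
q8, q3 = q8_q3(true_q8, pred_q8)
print(f"Q8 = {q8:.2f}%, Q3 = {q3:.2f}%")
```

In this toy case every Q8 error stays within the same 3-state class, which illustrates why Q8 is the harder of the two measures.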

The hidden Markov model (HMM) has been applied to the 3-state SS prediction problem [6]. Although an HMM can describe the interactions among adjacent residues, it is very challenging for an HMM to model the complex nonlinear relationship between input protein features and SS. The support vector machine (SVM) [7] can deal with this complex nonlinear mapping, but it is challenging for an SVM to take into consideration the interactions among adjacent residues. To the best of our knowledge, the best Q3 accuracy so far is about 80%, obtained with a 2-stage neural network (NN) method [8]. For the Q8 prediction problem, existing methods [9,10] fail to provide promising results. The problem may be that most of these methods are shallow architectures: it is very difficult for a relatively shallow architecture to model well both the complex sequence-structure relationship between input protein features and SS and the mutual interactions among residues, yet both are important for SS prediction [10,11].

Nowadays, NNs with deep architectures, also called deep neural networks (DNNs), have become some of the most powerful machine learning techniques for pattern recognition [12,13]. With the ability to map unorganized low-level features into high-level latent data representations that are more suitable for a final classification problem [14], DNNs can better model the complex sequence-structure relationship for SS prediction. Cheng [15] proposed a typical deep belief network model for Q3 prediction, in which each layer is a restricted Boltzmann machine (RBM). Another DNN approach is reported in [16], which develops a multi-step iterative deep network model that predicts four different sets of structural properties, including 3-state SS. In order to better capture the mutual interactions among residues, a deep convolutional generative stochastic network is proposed in [17] for the 8-state SS prediction problem; to our knowledge, it may be the best 8-state predictor. However, the performance of this approach is sensitive to the chosen convolution window size, which is difficult to determine. Recurrent neural networks (RNNs) have also been adopted for this problem, as they can employ the contextual information of the input sequence and learn the variable-width dependencies in the protein chain [11]. However, the gradient problem still limits the application of these RNN-based approaches [18].

In this paper, we propose a deep recurrent network, called the Secondary Structure Recurrent Encoder–Decoder Networks (SSREDNs), for both Q3 and Q8 prediction. First, SSREDNs employ the encoder–decoder architecture [19], a typical deep architecture, to model the sequence-structure relationship. This architecture allows its encoder part to encode a given protein sequence into a representation layer, while its decoder part decodes this learned representation into a new feature space. Compared with classical deep architectures, it enhances the model’s learning and representation abilities [19]. Second, unlike [17], which uses deep convolutional networks, we use a newer type of recurrent structure, the gated recurrent unit (GRU) [20], in our networks to model the mutual interactions among residues. It alleviates the gradient problem and can better learn both the adjacent and long-range interactions among residues without the problem of choosing the size of the convolution window. The proposed networks can thus model both the complex nonlinear sequence-structure mapping between input protein features and SS and the mutual interactions among residues. In particular, a stacked auto-encoder (SAE) architecture [21] is also contained in the encoder part of our networks to automatically learn compact and representative features for the input. This is especially helpful for proteins, as we lack the necessary intuition or knowledge for hand-crafting features involving multiple amino acids [22]. A series of techniques is also adopted to refine the performance of the networks, such as dropout [23] and the Adam optimization algorithm [24].

For evaluation purposes, the proposed method is applied to the benchmark datasets CB513 and CullPDB for both Q3 and Q8 prediction. First, SSREDNs outperform the best Q8 predictor [17], especially on some SS types that are difficult to predict. Furthermore, they also outperform most of the publicly available methods for both Q3 and Q8 prediction.

The paper is organized as follows. Section 2 gives an outline of the deep learning architectures and techniques used in this work. In Section 3, the proposed network model and its learning algorithm are introduced. Experimental details are elucidated in Section 4. Finally, conclusions and future work are drawn in Section 5.

2. Preliminary

In this section, a brief introduction to DNNs and the SAE is presented. In addition, the GRUs used to deal with the long-term interactions between residues are described.

2.1. Deep neural networks (DNNs)

DNNs are artificial neural networks with multiple hidden layers between the input and output layers. In DNNs, the deeper layers enable the extraction of more abstract features from lower layers. Although DNNs are typically designed with feed-forward layers, researchers have recently developed DNNs with recurrent structures for applications such as natural language processing [25]. As shown in Fig. 1(a), deep feed-forward networks can represent complex data in a multi-layer network structure. With sufficient layers of learning neurons, they can map unorganized low-level features to a high-level data manifold, which is more suitable for a final classification or regression task.

Fig. 1. (a) Deep feed-forward networks. (b) Deep recurrent networks.

Recurrent units have connections between neural units that form a directed cycle, which allows them to exhibit dynamic temporal behavior over arbitrary sequential inputs, as shown in Fig. 1(b). Deep networks with recurrent structure, also called deep recurrent neural networks (DRNNs), have already been used successfully for complex sequential data analysis, such as speech recognition, image semantic understanding and online handwriting recognition. Although they have the extra capability of capturing the spatial/temporal dependencies within sequences, DRNNs are more difficult to train due to the vanishing-gradient and over-fitting problems [18].

2.2. Stack auto-encoder (SAE)

The classical auto-encoder (AE) [26] is a three-layer neural network (as shown in Fig. 2(a)): an encoder layer maps the input to a representation layer, and a decoder layer reconstructs the input from the representation layer. The first two layers are regarded as an encoder, while the latter two layers are regarded as a decoder. Given an input $x$, and letting $y$ denote the state of the neurons in the representation layer, an AE’s feed-forward process can be formalized as follows:

$$y = \sigma(W_h x + b_h), \tag{1}$$

$$\hat{x} = g(W_h^T y + b_o), \tag{2}$$

where $\hat{x}$ is the reconstructed output, and $W_h$, $b_h$, $W_h^T$ and $b_o$ respectively denote the weight matrix of the encoder, the bias vector of the hidden layer, the weight matrix of the decoder and the bias vector of the output layer. As the reconstruction error is minimized in a well-trained AE, the cost function $J_{AE}$ can be defined as:

$$J_{AE} = L(x, \hat{x}), \tag{3}$$

where $L(\cdot)$ is a function measuring the difference between the original inputs and the reconstructed ones.

Fig. 2. (a) A typical three-layer AE. The input $x$ and the output $\hat{x}$ have the same shape. After training, the weights of both encoder and decoder are fixed, and the encoder part can then be stacked into an SAE. (b) A four-layer stacked AE. It is pre-trained layer by layer as three stacked AEs. The representation learned by the previous AE is used as input for learning the next AE’s representation.

An AE can be trained by the standard back-propagation algorithm [27]. AEs can be further stacked to create a stacked auto-encoder (SAE) [21] (as shown in Fig. 2(b)). In an SAE, the representation learned at the previous level is used as input for learning the next level’s representation, which yields a better representation for deeper networks. It can automatically discover and extract useful features or representations in a layer-wise procedure. The training of an SAE follows the pre-training and fine-tuning paradigm. Automatically learning features is important for protein SS prediction, as prior knowledge for hand-crafting features involving multiple amino acids is barely available. Therefore, an SAE is adopted in SSREDNs to pre-train the feed-forward layers that follow the input layer, so as to learn better features automatically.
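As a rough illustration of Eqs. (1)–(3) and of the stacking idea, the sketch below runs the forward pass of a tied-weight auto-encoder and then feeds its representation into a second encoder. It is a toy example under stated assumptions: the decoder activation g is taken to be the sigmoid, L is taken to be the mean squared error, the weights are arbitrary hand-picked numbers, and no training is performed; all function names are hypothetical.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def encode(W, b_h, x):
    """Eq. (1): y = sigma(W_h x + b_h); W has one row per hidden unit."""
    return [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(W, b_h)]


def decode(W, b_o, y):
    """Eq. (2) with tied weights: x_hat = g(W_h^T y + b_o), g assumed sigmoid."""
    n_in = len(W[0])
    return [sigmoid(sum(W[k][i] * y[k] for k in range(len(W))) + b_o[i])
            for i in range(n_in)]


def reconstruction_error(x, x_hat):
    """Eq. (3) with L chosen (as an assumption) to be the mean squared error."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


# First AE: 4 inputs -> 3 hidden units (toy weights).
W1 = [[0.2, -0.1, 0.4, 0.0], [0.5, 0.3, -0.2, 0.1], [-0.3, 0.2, 0.1, 0.6]]
b1, bo1 = [0.0, 0.1, -0.1], [0.0, 0.0, 0.0, 0.0]
# Second AE: 3 inputs -> 2 hidden units, stacked on the first representation.
W2 = [[0.3, -0.4, 0.2], [0.1, 0.5, -0.6]]
b2 = [0.0, 0.0]

x = [0.9, 0.1, 0.4, 0.7]
y1 = encode(W1, b1, x)            # representation of the first AE
x_hat = decode(W1, bo1, y1)       # reconstruction of the input
j_ae = reconstruction_error(x, x_hat)
y2 = encode(W2, b2, y1)           # input of the second AE is y1 (stacking)
print(j_ae, y2)
```

In layer-wise pre-training, each AE would first be trained to reduce its own reconstruction error before its encoder output is passed on to the next AE.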

2.3. Gated recurrent units (GRUs)

2.3.1. Classical recurrent unit

Recurrent units have connections towards themselves. A classical recurrent unit is shown in Fig. 3(a); its update procedure is formalized in Eq. (4):

$$h_t = g(W_h \cdot [x_t, h_{t-1}]), \tag{4}$$

where $x_t$ is the input at time $t$, $h_t$ and $h_{t-1}$ are the hidden states of the recurrent unit at times $t$ and $t-1$, respectively, and $W_h$ is the weight matrix. The bias is omitted here.

As the nonlinear activation function $g(\cdot)$ acts on the recurrent unit repeatedly over time, it is difficult to train a recurrent network. The gradient tends to either vanish (most of the time) or explode during the training process. This makes gradient-based optimization difficult, and it becomes even more problematic because the effect of long-term dependencies is hidden (being exponentially smaller with respect to sequence length) by the effect of short-term dependencies [20].
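The vanishing effect can be seen numerically in a deliberately simplified setting. The sketch below assumes a scalar recurrent unit with g = tanh and hypothetical scalar weights w (recurrent) and u (input), and tracks the magnitude of the derivative of the current state with respect to the initial state, which by the chain rule is a product of one factor per time step.

```python
import math


def gradient_through_time(w, u, inputs, h0=0.0):
    """Return |d h_t / d h_0| after each step of a scalar tanh recurrent unit."""
    h, grad, history = h0, 1.0, []
    for x in inputs:
        h = math.tanh(w * h + u * x)
        grad *= w * (1.0 - h * h)  # chain rule factor: d h_t / d h_{t-1}
        history.append(abs(grad))
    return history


# Hypothetical weights and a constant toy input over 50 time steps.
norms = gradient_through_time(w=0.9, u=1.0, inputs=[0.5] * 50)
print(norms[0], norms[9], norms[-1])
```

Because each factor has magnitude at most |w| here, the product shrinks at least geometrically with the number of steps, which is the exponential decay described above.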

2.3.2. Gated recurrent unit

Fig. 3. (a) Classical recurrent unit. (b) Gated recurrent unit.

As classical recurrent units suffer from gradient problems during training, a newer type of recurrent unit, the GRU, is used in our work. The GRU was proposed to solve the gradient problem in training recurrent networks by allowing each recurrent unit to capture dependencies of different time scales in an adaptive manner (see Fig. 3(b)). Unlike the classical recurrent unit, the activation $h_t$ of the GRU at time $t$ is a linear interpolation between the previous activation $h_{t-1}$ and the candidate activation $\tilde{h}_t$:

$$h_t = z_t \tilde{h}_t + (1 - z_t) h_{t-1}. \tag{5}$$

Here, $z_t$ denotes the update gate of the GRU. It decides how much the unit updates its activation, or content. This allows the error to back-propagate easily through the unit without vanishing too quickly or blowing up as a result of passing through multiple time steps, thus reducing the difficulty of training recurrent networks. $z_t$ is computed as follows:

$$z_t = \sigma(W_z \cdot [x_t, h_{t-1}]), \tag{6}$$

$$\sigma(x) = \frac{1}{1 + e^{-x}}. \tag{7}$$

The candidate activation $\tilde{h}_t$ is computed as [20]:

$$\tilde{h}_t = \tanh(W \cdot [x_t, r_t \odot h_{t-1}]), \tag{8}$$

which is similar to the traditional recurrent unit except for the inclusion of the $r_t \odot h_{t-1}$ term; $r_t$ is a set of reset gates and $\odot$ is an element-wise multiplication. $r_t$ is computed as follows:

$$r_t = \sigma(W_r \cdot [x_t, h_{t-1}]), \tag{9}$$

$$\sigma(x) = \frac{1}{1 + e^{-x}}. \tag{10}$$

We use $x_t$ to denote the sequential input at time $t$. With GRUs, it is easy for each unit to remember the existence of a specific feature in the input stream over a long series of time steps. Any feature that is decided to be important by the update gate will not be overwritten.
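The GRU update of Eqs. (5)–(9) can be sketched in a few lines of plain Python. This is only an illustration of the equations as written, with biases omitted as in the text; the weight values, the toy input sequence and the helper names are hypothetical, and a practical implementation would rely on a numerical library instead.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * vi for w, vi in zip(row, v)) for row in W]


def gru_step(x_t, h_prev, W_z, W_r, W):
    """One GRU update following Eqs. (5)-(9); biases omitted as in the text."""
    concat = x_t + h_prev                              # [x_t, h_{t-1}]
    z = [sigmoid(a) for a in matvec(W_z, concat)]      # update gate, Eq. (6)
    r = [sigmoid(a) for a in matvec(W_r, concat)]      # reset gate, Eq. (9)
    gated = x_t + [ri * hi for ri, hi in zip(r, h_prev)]
    h_cand = [math.tanh(a) for a in matvec(W, gated)]  # candidate, Eq. (8)
    return [zi * ci + (1.0 - zi) * hi                  # interpolation, Eq. (5)
            for zi, ci, hi in zip(z, h_cand, h_prev)]


# Toy setting: 2 input features, 2 hidden units, so each matrix is 2 x 4.
W_z = [[0.1, -0.2, 0.3, 0.0], [0.0, 0.4, -0.1, 0.2]]
W_r = [[0.2, 0.1, -0.3, 0.1], [-0.1, 0.0, 0.2, 0.3]]
W_c = [[0.5, -0.4, 0.1, 0.2], [0.3, 0.2, -0.2, 0.1]]

h = [0.0, 0.0]
for x_t in [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]:
    h = gru_step(x_t, h, W_z, W_r, W_c)
print(h)
```

When the update gate is close to zero, the new state stays close to the previous one, which is how a unit can carry a feature across many time steps.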

3. The proposed model and algorithm

As the major part of this article, the proposed secondary structure recurrent encoder–decoder networks (SSREDNs) are presented in this section. Their architecture and the algorithm for learning parameters from data are discussed in detail. We use a bidirectional GRU structure to deal with the spatial interaction information in the protein chain. The Adam algorithm is used to learn the network weights from data.

Fig. 4. The proposed secondary structure recurrent encoder–decoder networks.

3.1. Secondary structure recurrent encoder–decoder networks (SSREDNs)

As shown in Fig. 4, the secondary structure recurrent encoder–decoder network (SSREDN) is essentially a deep recurrent network with feed-forward layers and recurrent architectures. It consists of an encoder part, a decoder part and a representation layer. Both the encoder and decoder parts contain feed-forward layers, recurrent layers and special training mechanisms. The encoder part learns a good representation of the input protein feature sequence that reflects both immediate and long-term amino acid dependencies, and the decoder part uses this representation for the final SS prediction, which also takes the spatial dependency into consideration. GRUs are used in the encoder and decoder to learn the amino acid interaction information within the protein chain.

3.1.1. Encoder part

In the encoder part, an SAE is first used to pre-train the first few layers of the networks for better feature extraction. This SAE is trained by an unsupervised layer-wise strategy. Then, the weights from the trained SAE are used to initialize these layers. With this pre-training strategy, the networks obtain more effective features automatically at the beginning of the training process. The encoder part encodes the input sequence into a good representation at the representation layer. Bidirectional GRU layers are used here to capture the context-dependent relationships (interactions among residues).

3.1.2. Decoder part

The decoder part decodes the learned representation of the representation layer into the structure space for SS prediction. Bidirectional GRU recurrent layers are also used to learn additional global dependence information for the final prediction. We use a softmax layer as the final layer to output the probabilities of the eight secondary structure states. Finally, the recurrent encoder–decoder network is trained in a supervised way using the back-propagation learning algorithm.

3.1.3. Training

As shown in Fig. 4, when training, we feed the amino acid features to the network residue by residue along the protein chain, and the encoder generates a stable representation containing the contextual information of the input sequence. We normalize the activation level of the representation layer through an extra output layer. Next, the decoder predicts the Q8 state for the current input residue using the representation from the encoder. As we use recurrent structures to model the sequence-structure relationship, the window size problem of the convolutional method does not exist. At each time step, the network receives one residue from the protein chain as input and then outputs its secondary structure state.

Fig. 5. Bidirectional GRU layer.

3.2. Bidirectional GRU

As the protein sequence contains both forward and reverse order dependency information between adjacent positions on the protein chain, we use bidirectional recurrent layers [28] with GRUs in the networks. A bidirectional GRU (B-GRU) layer is shown in Fig. 5. The B-GRU layer contains two parallel GRU layers that run in opposite directions and are not connected to each other. Sequential information flows through the two layers in opposite orders as time progresses. Consider a sequence of protein data of length $N$ with the networks’ input denoted as $X = (x_1, x_2, \ldots, x_N)$. Each $x_i$, $i \in (1, 2, \ldots, N)$, is a feature vector that contains the feature information of the amino acid at position $i$. The learning algorithm for the B-GRU is shown in Algorithm 1.

The first GRU layer (forward GRU layer) accepts the input from $x_1$ to $x_N$, one element at each time step. The second GRU layer (reverse GRU layer) starts at $x_N$ and accepts the input from $x_N$ down to $x_1$, i.e., in the backward time direction. In this way, when processing a sequence input $X = (x_1, x_2, \ldots, x_N)$, the forward GRU layer captures the information $a_t^{\mathrm{forward}}$ at any time step $t$, $t \in (1, 2, \ldots, N)$, which contains the dependence relationship of the previous $t$ positions, while the reverse GRU layer captures the reversed information $a_{N+1-t}^{\mathrm{reverse}}$, which contains the dependence relationship of the $t$ previous positions in the reverse direction. Therefore, once the B-GRU layer is trained, the forward and backward GRU layers’ outputs $a_t$, $t \in (1, 2, \ldots, N)$, can include the full dependence relationships of the input sequence at any position $t$. Both the forward and backward GRU layers’ outputs are passed upward to the output layer of the B-GRU to form a better representation for the subsequent layers. Furthermore, in the back-propagation process, the forward and backward GRU layers are processed in decreasing and increasing time order, respectively, using the BPTT algorithm [29].

Compared with a generic recurrent layer, the B-GRU can better deal with both the spatial dependencies in the protein sequence and the gradient problems that normally occur during network training, which are usually the biggest challenge for training classical recurrent networks.
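The feed-forward half of the bidirectional layer can be sketched as follows: two independent GRU layers read the chain in opposite directions, and their states are concatenated position by position. This is a minimal illustration rather than the authors' implementation; the GRU cell omits biases, the random toy weights and the helper names (`toy_params`, `run_gru`, `bidirectional_gru`) are hypothetical, and back-propagation is not shown.

```python
import math
import random


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def matvec(W, v):
    return [sum(w * vi for w, vi in zip(row, v)) for row in W]


def gru_step(x_t, h_prev, params):
    """One bias-free GRU update (update gate, reset gate, candidate)."""
    W_z, W_r, W = params
    concat = x_t + h_prev
    z = [sigmoid(a) for a in matvec(W_z, concat)]
    r = [sigmoid(a) for a in matvec(W_r, concat)]
    cand = [math.tanh(a) for a in
            matvec(W, x_t + [ri * hi for ri, hi in zip(r, h_prev)])]
    return [zi * ci + (1.0 - zi) * hi for zi, ci, hi in zip(z, cand, h_prev)]


def run_gru(sequence, params, hidden_size):
    """Run one GRU layer over a sequence, returning the state at every step."""
    h, states = [0.0] * hidden_size, []
    for x_t in sequence:
        h = gru_step(x_t, h, params)
        states.append(h)
    return states


def bidirectional_gru(sequence, fwd_params, rev_params, hidden_size):
    forward = run_gru(sequence, fwd_params, hidden_size)
    # The reverse layer reads x_N ... x_1; flip its outputs back so that
    # index t again refers to position t of the chain.
    reverse = run_gru(sequence[::-1], rev_params, hidden_size)[::-1]
    return [f + b for f, b in zip(forward, reverse)]  # concat per position


def toy_params(hidden_size, input_size, seed):
    """Hypothetical random weights, for illustration only."""
    rng = random.Random(seed)

    def matrix():
        return [[rng.uniform(-0.5, 0.5)
                 for _ in range(input_size + hidden_size)]
                for _ in range(hidden_size)]

    return matrix(), matrix(), matrix()


HIDDEN, FEATURES = 3, 2
fwd = toy_params(HIDDEN, FEATURES, seed=0)
rev = toy_params(HIDDEN, FEATURES, seed=1)
chain = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.2, 0.8]]
outputs = bidirectional_gru(chain, fwd, rev, HIDDEN)
print(len(outputs), len(outputs[0]))
```

Each output vector therefore combines a summary of the residues up to position t with a summary of the residues from position t to the end of the chain.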

Algorithm 1 B-GRU learning algorithm.

Require:
  Input data: X = {x_1, x_2, ..., x_N}: sequence input X with length N
Ensure:
  Feed-forward:
  for t = 1 to N do
    a_t^forward = GRU(x_t)
  end for
  a^forward = {a_1^forward, a_2^forward, ..., a_N^forward}
  for t = N to 1 do
    a_t^reverse = GRU(x_{N+1-t})
  end for
  a^reverse = {a_N^reverse, ..., a_2^reverse, a_1^reverse}
  for all t ∈ (1, 2, ..., N) do
    a_t = concat(a_t^forward, a_{N+1-t}^reverse)
  end for
  a = {a_1, a_2, ..., a_N}
  Back-propagation:
  for all t ∈ (1, 2, ..., N) do
    Backward pass for output layer, storing the back error δ at each time step
  end for
  for t = N to 1 do
    BPTT for reverse GRU layer, using the stored δ from output layer
  end for
  for t = 1 to N do
    BPTT for forward GRU layer, using the stored δ from output layer
  end for

3.3. Training

The network was trained with Adam, the stochastic gradient-based optimization method proposed by Kingma [24]. It combines the advantages of two popular optimization methods: AdaGrad [30], the adaptive gradient algorithm, and RMSProp, which is similar but introduces an additional decay term. Adam computes individual adaptive learning rates for different parameters of the networks from estimates of the first and second moments of the gradients. It has been reported to perform as well as or better than some other optimization methods across a range of hyper-parameter settings.
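The Adam update rule can be sketched on a toy problem as follows. The code follows the standard form of the algorithm (exponential moving averages of the gradient and the squared gradient with bias correction); the quadratic objective, its gradient function and the step size used in the demonstration are hypothetical choices made only for illustration.

```python
import math


def adam_minimize(grad_fn, theta, steps, alpha=0.001, beta1=0.9,
                  beta2=0.999, eps=1e-8):
    """Minimize a function with Adam, given a function returning its gradient."""
    theta = list(theta)
    m = [0.0] * len(theta)  # first-moment (mean) estimate
    v = [0.0] * len(theta)  # second-moment (uncentered variance) estimate
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        for i, g_i in enumerate(g):
            m[i] = beta1 * m[i] + (1.0 - beta1) * g_i
            v[i] = beta2 * v[i] + (1.0 - beta2) * g_i * g_i
            m_hat = m[i] / (1.0 - beta1 ** t)  # bias correction
            v_hat = v[i] / (1.0 - beta2 ** t)
            theta[i] -= alpha * m_hat / (math.sqrt(v_hat) + eps)
    return theta


def grad(theta):
    """Gradient of the toy objective (theta_0 - 3)^2 + (theta_1 + 1)^2."""
    return [2.0 * (theta[0] - 3.0), 2.0 * (theta[1] + 1.0)]


first_step = adam_minimize(grad, [0.0, 0.0], steps=1, alpha=0.05)
theta = adam_minimize(grad, [0.0, 0.0], steps=2000, alpha=0.05)
print(first_step, theta)
```

The first step has magnitude close to the learning rate for every parameter regardless of the gradient scale, which illustrates the per-parameter adaptive step size described above.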

In SSREDNs, the learning of the representation layer is very important for the final performance, because it connects the encoder and the decoder. As is well known, the outputs of SS prediction must form continuous structural states, in which adjacent residues form a uniform structure in space; e.g., a segment in SS state H may consist of several amino acids. A better activation of the representation layer is therefore one in which different input features have similar representations at the hidden representation layer if they have the same output label during the training process. For this purpose, an additional output layer with a mean square error (MSE) cost function is added after the representation layer to constrain its activation level. It forces different input features to obtain approximately equal activations at the representation layer if they have the same label. There are thus two output layers in the training procedure, and the gradients flow from both of them: one updates the whole network, while the other only affects the encoder. After the weights of the networks are trained, we only use the final output layer (red part in Fig. 6). The cost functions of the two output layers are as follows:

$$L_1(x, \theta) = \frac{1}{m} \sum_{i=1}^{m} \sum_{j=1}^{n} (x_{ij} - y_{ij})^2, \tag{11}$$

$$L_2(x, \theta) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{j=1}^{n} \left( y_{ij} \log(P(x_{ij})) + (1 - y_{ij}) \log(1 - P(x_{ij})) \right), \tag{12}$$

where $\theta$ is the parameter set, and $x_{ij}$, $y_{ij}$ denote the $j$-th element of the $i$-th sample of the input set $x$ and the target set $y$, respectively.

Fig. 6. Multi-output training. An additional output layer with a mean square error (MSE) cost is added after the representation layer during the training process. It helps to adjust the activity of the representation layer for the final prediction. It also helps to speed up the training process to some extent.

We use the Adam algorithm [24] to optimize the SSREDNs in the training process. The final cost function $L(x, \theta)$ of SSREDNs is a weighted sum of the costs at the two output layers, where $\theta$ denotes the whole parameter set of the networks and $\gamma$ balances the contributions of the two output layers:

$$L(x, \theta) = L_1(x, \theta) + \gamma L_2(x, \theta). \tag{13}$$
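The two-output cost can be sketched as below. The example treats the first cost as a mean square error on the extra output layer and the second as a cross-entropy on the final output layer, written in its standard binary form, and combines them with the weight γ as in Eq. (13). The function names, the clipping constant and the toy values are assumptions made for illustration.

```python
import math


def mse_cost(outputs, targets):
    """Squared error summed over elements and averaged over the m samples."""
    m = len(outputs)
    return sum((o - t) ** 2
               for out, tgt in zip(outputs, targets)
               for o, t in zip(out, tgt)) / m


def cross_entropy_cost(probs, targets, eps=1e-12):
    """Binary cross-entropy summed over elements, averaged over m samples."""
    m = len(probs)
    total = 0.0
    for p_row, t_row in zip(probs, targets):
        for p, t in zip(p_row, t_row):
            p = min(max(p, eps), 1.0 - eps)  # clip to keep the logs finite
            total += t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return -total / m


def total_cost(aux_outputs, final_probs, targets, gamma=1.0):
    """Weighted sum of the two costs: L = L1 + gamma * L2."""
    return (mse_cost(aux_outputs, targets)
            + gamma * cross_entropy_cost(final_probs, targets))


# Two toy samples with two output units each (one-hot targets).
targets = [[1.0, 0.0], [0.0, 1.0]]
aux_outputs = [[0.5, 0.5], [0.5, 0.5]]   # extra output layer (MSE cost)
final_probs = [[0.5, 0.5], [0.5, 0.5]]   # final output layer (cross-entropy)
loss = total_cost(aux_outputs, final_probs, targets, gamma=1.0)
print(loss)
```

With γ = 1, as used later in the training setup, both output layers contribute equally to the gradient that reaches the encoder.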

4. Experiments

In order to evaluate the performance of the proposed SSREDNs, a series of experiments is performed on the open datasets CullPDB [31] and CB513 [10]. A detailed data description is first provided in Section 4.1. In Section 4.2, the model training setup is presented. First, variations of the network architecture with different layers were tested to refine the model’s performance. Among them, we select the one with the best performance for later comparison with other existing approaches. We also evaluated the influence of the pre-training strategy using the SAE on the final prediction performance. The performance analysis is presented in Section 4.3. It shows that our method converges smoothly with fewer training epochs and improves the sensitivity and precision of almost all of the 8 states on the CullPDB testing set compared with [17], which may be the best 8-state predictor. Our method is also compared with some public methods (SSpro [32], RaptorX [10], PSIPRED [8]) on both the CullPDB and CB513 datasets. It outperforms the others on both Q3 and Q8 accuracy.

4.1. Features and dataset

The major purpose is to predict the SS type of each amino acid of a given protein sequence. We solve both the Q8 and Q3 prediction problems in this paper. We evaluated the proposed SSREDNs for both Q3 and Q8 on two datasets, CullPDB [31] and CB513 [10], which are also used by other related investigations [10]. CullPDB, a large non-homologous dataset (identity less than 30%), contains 6128 protein amino acid sequences labeled with Q8 secondary structure, and has been randomly divided into training (5600), validation (248) and testing (280) sets. A separate evaluation is also performed on the CB513 dataset, which contains 513 proteins, while training on the CullPDB dataset further filtered to remove sequences with more than 25% identity with the CB513 dataset.

Protein sequence profiles with evolutionary information have become a breakthrough for SS prediction. Thus, Position Specific Scoring Matrix (PSSM) features are used here; these are widely used features that can be extracted from protein profiles by Define Secondary Structure of Proteins (DSSP) and Position Specific Iterated Basic Local Alignment Search Tool (PSI-BLAST). In [17], the data used for training contained features and labels in 56 channels (22 for PSSM, 22 for the amino acid sequence, 2 for terminals, 2 for solvent accessibility labels, 8 for secondary structure labels). Each sequence in the training data has a fixed length of 700 amino acids, which is considered to provide a good balance between efficiency and coverage, as the majority of protein chains are shorter than 700 amino acids. When training and testing, shorter sequences (fewer than 700 amino acids) are padded with 0.

Throughout our experiments, only PSSM features (22) and amino acid sequence features (22) were used for the 8-state prediction. We also removed the ‘Noseq’ channel, following the convention of [17]. The input profiles in our experiment thus consist of 50 channels (21 for PSSM, 21 for the amino acid sequence, 8 for secondary structure labels).
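The zero-padding step described above can be sketched as follows. The fixed length of 700 and the 42 input channels (21 PSSM plus 21 amino acid channels) come from the text; the function name, the returned 0/1 mask and the toy feature values are hypothetical additions for illustration, and the mask in particular is an assumption about how padded positions might be excluded from evaluation.

```python
MAX_LEN = 700  # fixed sequence length stated in the text


def pad_sequence(residue_features, max_len=MAX_LEN):
    """Pad a list of per-residue feature vectors with zero vectors.

    Returns the padded sequence and a 0/1 mask marking the real residues.
    """
    n = len(residue_features)
    if n == 0 or n > max_len:
        raise ValueError("sequence must contain between 1 and max_len residues")
    width = len(residue_features[0])
    padded = [list(f) for f in residue_features]
    padded += [[0.0] * width for _ in range(max_len - n)]
    mask = [1] * n + [0] * (max_len - n)
    return padded, mask


# Toy chain of 3 residues with 42 input channels (21 PSSM + 21 amino acid).
chain = [[0.1] * 42 for _ in range(3)]
padded, mask = pad_sequence(chain)
print(len(padded), len(padded[0]), sum(mask))
```

A mask of this kind would let an accuracy computation count only real residues, so that padded positions do not inflate or deflate Q3 and Q8.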

4.2. Training setup

4.2.1. Networks setup

In the encoder and decoder framework, ReLU layers with 50–300 units are used as feed-forward layers. GRU layers always contain 100–300 gated recurrent units. In the B-GRU layer, the outputs of the forward and backward layers are concatenated into a single vector to be used as input to the following layers.

Weights of the SAE were initialized in a greedy layer-wise manner, by training each layer to map its inputs back to themselves. Next, the other weights in each layer are sampled uniformly between −0.05 and 0.05, and biases are initialized at 0. The initial hidden states of the GRU layers ($\tilde{h}$, $h$) are all set to 0 and updated during the training procedure. We use 50% dropout at the B-GRU and GRU layers to avoid the over-fitting problem.

A ReLU activation function [33] is used for all the feed-forward layers other than the output layers. When training, a softmax activation function is used on the final classification layer and a sigmoid activation function is used on the middle extra output layer, as shown in Fig. 6. The learning rate for the SAE stage is 0.05. When training the whole encoder–decoder network, the Adam algorithm, which only requires first-order gradients and little memory, is applied to control the parameter updates with the default settings ($\alpha = 0.001$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\varepsilon = 10^{-8}$). Here, $\alpha$ is the learning rate, $\beta_1$ and $\beta_2$ are the decay parameters of the exponential moving averages of the gradient and the squared gradient, respectively, and $\varepsilon$ is a small constant used to avoid singularities associated with a zero denominator. We simply set $\gamma$ to 1 for an equal contribution of the two output layers in Eq. (13).

All parameters were trained globally by the Adam algorithm with the final cost function in Eq. (13). The maximum number of training epochs is 100 and the batch size is 40 in our experiment. All the training procedures are implemented in Python based on the Theano and Keras libraries. The training procedure was executed on Nvidia Tesla K40 GPUs.

Table 1
Prediction accuracies of SSREDNs with different architectures.

Model                               Q8 accuracy (%)   Segment overlap score
3SAE, 1BGRU, 2FNN, 1GRU             71.84             77.17
5SAE, 1BGRU, 2FNN, 1GRU             70.51             76.70
3SAE, 2BGRU, 2FNN, 2BGRU            72.51             77.95
3SAE, 2BGRU, 1GRU, 2FNN, 1BGRU      72.36             77.64
3SAE, 2BGRU, 2FNN, 1BGRU            73.14             78.20

Table 2
Comparison of the performance with and without pre-training using the SAE.

Model (3SAE, 2BGRU, 2FNN, 1BGRU)    Q8 accuracy (%)   Segment overlap score
Without pre-training                72.13             77.12
Pre-training using SAE              73.14             78.20

4.2.2. Architecture analysis

We tried a set of various network architectures to find a suitable model for the protein SS prediction problem. Their performances are shown in Table 1. The segment overlap (SOV) score, which can be interpreted as SS segment-based accuracy, is also used here to evaluate the performance of the different network architectures. Empirically, we found that a 3-layer SAE is a suitable choice. The B-GRU layers always obtained better results than the standard GRU layers during the training process. This suggests that the bidirectional structure can better capture the interactions of residues in the protein chain, both in the encoder part and in the decoder part. Besides, the recurrent layers also affect the training results (row 4 vs. row 5 in Table 1). The optimal structure (row 5) was found to be: {300-256-128 stacked auto-encoder} – {256 bidirectional GRU layers (0.5 dropout)} ∗ 2 – {256-128 ReLU layers} – {64 ReLU Rep layer} – {128 bidirectional GRU layer (0.5 dropout)} – {256-128 ReLU layers} – {8 sigmoid layer}. Furthermore, an {8 sigmoid layer} with MSE cost function was added on the {64 ReLU Rep layer}. It obtained 73.14% Q8 accuracy and a 78.20% SOV score on the CullPDB testing set (272 sequences).

4.2.3. Stack auto-encoder

Feature learning is a very important process for SS prediction. The discovery of good features may benefit the prediction process and improve the prediction accuracy. An SAE architecture is used after the input layer in our model to extract better features from the input protein features automatically. After pre-training, the SAE part has learned a robust representation of the input features. It is also fine-tuned when training the whole network. To confirm whether this pre-training strategy using the SAE really helps to improve the prediction accuracy, we trained a model on the CullPDB training set with and without the pre-training strategy and tested it on the CullPDB testing set. The result is shown in Table 2. Pre-training improves the final Q8 accuracy from 72.13% to 73.14% and the SOV score from 77.12% to 78.20%.

4.3. Performance

4.3.1. Performance on testing set

The prediction sensitivities, precisions and frequencies for the individual secondary structure states of the CullPDB testing set with 272 sequences are shown in Table 3. Compared with [17], for the four major states, H, E, L and T, our SSREDNs method improves the prediction sensitivity and precision. The improvement in the prediction of state T, i.e., the β-turn prediction, which depends on long-range inter-residue interactions, indicates that the SSREDNs model can learn long-range structure features from the input protein chain sequence. Predictions for the states G, S and B are difficult because of their lower frequencies, and the SSREDNs also make better predictions for them compared with [17] in terms of prediction sensitivity. The lower prediction precisions for states G and B are accounted for by their relative rarity in the training set compared with our test set (3.1 vs 4.1, 0.2 vs 1.1). On the other hand, SSREDNs is essentially a data-driven model, and it can learn the complex interdependencies of residues from a mass of training sequences. The SSREDNs have some limitations in predicting the rare states G, B and I (3₁₀-helix, β-bridge and π-helix) due to their low frequencies (only around 5% in total) in the training sequences. These states will also be a focus of future work.

Table 3

Performance of individual secondary structure states on the CullPDB testing set.

| Sec. | Sensitivity | Precision | Frequency | Description |
|------|-------------|-----------|-----------|-------------|
| H | 94.11/93.5 | 86.96/82.8 | 35.8/35.4 | α-helix |
| E | 84.52/82.3 | 78.07/74.8 | 24.1/21.8 | β-strand |
| L | 64.01/63.3 | 58.24/54.1 | 21.3/18.6 | loop |
| T | 53.97/50.6 | 57.96/54.8 | 10.1/11.1 | β-turn |
| S | 28.29/15.9 | 49.13/42.3 | 5.0/7.9 | bend |
| G | 35.99/13.3 | 40.48/49.6 | 3.4/4.1 | 3₁₀-helix |
| B | 7.4/0.1 | 45.63/50 | 0.2/1.1 | β-bridge |
| I | –/– | –/– | 0/0 | π-helix |

Fig. 7. Model performance during the training process on the CullPDB dataset. (a) The curve converges to a stable state within 20 epochs on the CullPDB training dataset. (b) The whole performance curve of SSREDNs within 100 training epochs.
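The per-state quantities discussed for Table 3 can be computed from paired true and predicted state labels. The following is a minimal sketch using the standard definitions of sensitivity and precision, not the authors' evaluation code. The function name `per_state_metrics` and the toy label strings are illustrative assumptions.

```python
from collections import Counter

STATES = "HELTSGBI"  # the eight secondary structure states listed in Table 3


def per_state_metrics(true_labels, pred_labels):
    """Return {state: (sensitivity, precision, frequency)} as fractions.

    sensitivity = correct predictions / residues truly in the state
    precision   = correct predictions / residues predicted as the state
    frequency   = share of residues truly in the state
    A value is None when its denominator is zero.
    """
    if len(true_labels) != len(pred_labels):
        raise ValueError("label sequences must have equal length")
    true_count = Counter(true_labels)
    pred_count = Counter(pred_labels)
    hits = Counter(t for t, p in zip(true_labels, pred_labels) if t == p)
    total = len(true_labels)
    metrics = {}
    for s in STATES:
        sens = hits[s] / true_count[s] if true_count[s] else None
        prec = hits[s] / pred_count[s] if pred_count[s] else None
        freq = true_count[s] / total if total else 0.0
        metrics[s] = (sens, prec, freq)
    return metrics


# Toy example with made-up labels (not data from the paper).
truth = "HHHHEELLTT"
guess = "HHHEEELLLT"
toy_metrics = per_state_metrics(truth, guess)
```

In this sketch, a state that never occurs and is never predicted has undefined sensitivity and precision, returned here as `None`.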

4.3.2. Convergence

Fig. 7(a) shows the training accuracy for the CullPDB dataset. The SSREDNs show a powerful learning ability: they always converge within 20 training epochs to a prediction accuracy of about 72%, equal to the best result in [17], which requires 300 epochs. Fig. 7(b) shows that the SSREDNs converge smoothly to their optimal result on the validation set within about 70 training epochs, and this optimal model is used for the testing set in our experiments.

Table 4

Q8 accuracy on CullPDB and CB513.

| Model | Q8 accuracy, CB513 | Q8 accuracy, CullPDB |
|-------|--------------------|----------------------|
| GSN [17] | 66.4 | 72.1 |
| The optimal SSREDNs | 68.2 | 73.1 |

Table 5

Q8 and Q3 accuracy on datasets CB513 and CullPDB.

| Method | Q8, CB513 | Q8, CullPDB | Q3, CB513 | Q3, CullPDB |
|--------|-----------|-------------|-----------|-------------|
| SSpro | 63.5 | 66.6 | 78.5 | 79.5 |
| RaptorX-SS8 | 64.9 | 69.7 | 78.3 | 81.2 |
| PSIPRED | – | – | 79.2 | 82.5 |
| Ours | 68.2 | 73.1 | 82.9 | 84.2 |
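Section 4.3.2 describes two steps: finding the epoch at which a given accuracy is first reached, and keeping the model with the best validation accuracy for testing. Both can be sketched as below. This is an illustrative sketch only; the function names and the toy accuracy curve are assumptions and do not reproduce the curves in Fig. 7.

```python
def select_best_epoch(val_accuracy_by_epoch):
    """Return (epoch, accuracy) for the highest validation accuracy.

    Epochs are numbered from 1. Ties are resolved in favour of the
    earliest epoch.
    """
    if not val_accuracy_by_epoch:
        raise ValueError("at least one epoch is required")
    best_epoch, best_acc = 1, val_accuracy_by_epoch[0]
    for epoch, acc in enumerate(val_accuracy_by_epoch, start=1):
        if acc > best_acc:
            best_epoch, best_acc = epoch, acc
    return best_epoch, best_acc


def epochs_to_reach(accuracy_by_epoch, threshold):
    """Return the first epoch (1-based) whose accuracy is >= threshold, or None."""
    for epoch, acc in enumerate(accuracy_by_epoch, start=1):
        if acc >= threshold:
            return epoch
    return None


# Hypothetical validation curve, for illustration only.
toy_curve = [0.55, 0.64, 0.70, 0.72, 0.715, 0.73, 0.728]
best = select_best_epoch(toy_curve)
first_at_72 = epochs_to_reach(toy_curve, 0.72)
```

In practice the model parameters saved at the selected epoch would be the ones evaluated on the testing set.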

4.3.3. Comparison with other methods

As shown in Table 4, we compared the Q8 accuracy of our method with the deep generative stochastic network (GSN) method in [17], which may be the best 8-state predictor. For this validation, we trained a model on the CullPDB dataset filtered to remove sequences having homology with CB513 sequences (more than 25% identity). We also consider only the PSSM features and amino acid sequence features in the testing. With the optimal model, we achieve a Q8 accuracy of 68.2% on CB513 and 73.1% on CullPDB, which outperforms the best result of the GSN model.

With the same network architecture and parameter settings, we also compare our method with the following publicly available programs: PSIPRED for 3-state SS prediction; SSpro and RaptorX for both 8-state and 3-state SS prediction on the CB513 and CullPDB datasets. The SSpro package is used without templates (i.e., not using a solved structure in PDB as a template). All the programs are run with their parameters set according to their respective papers. As listed in Table 5, our method outperforms the others, including the popular PSIPRED on Q3 prediction, and SSpro and RaptorX on both Q3 and Q8 prediction. For Q3 accuracy on CullPDB and CB513 and Q8 accuracy on CullPDB and CB513, we obtain 84.2%, 82.9%, 73.1% and 68.2%, respectively.
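Q8 and Q3 accuracy are per-residue accuracies over eight and three states, respectively. The sketch below computes both from 8-state label strings. It assumes one common 8-to-3 reduction (H, G, I to helix; E, B to strand; all other states to coil), which may differ from the mapping used in this paper. A Q3 figure can also come from a model trained directly on three states rather than from reduced 8-state predictions. The names and toy labels are illustrative.

```python
# Assumed 8-to-3 reduction; other conventions exist.
EIGHT_TO_THREE = {
    "H": "H", "G": "H", "I": "H",   # helix-like states
    "E": "E", "B": "E",             # strand-like states
    "L": "C", "T": "C", "S": "C",   # everything else is treated as coil
}


def accuracy(true_labels, pred_labels):
    """Fraction of positions where the predicted label equals the true label."""
    if len(true_labels) != len(pred_labels):
        raise ValueError("label sequences must have equal length")
    if not true_labels:
        raise ValueError("label sequences must be non-empty")
    correct = sum(t == p for t, p in zip(true_labels, pred_labels))
    return correct / len(true_labels)


def q8_accuracy(true_labels, pred_labels):
    """Per-residue accuracy over the eight states."""
    return accuracy(true_labels, pred_labels)


def q3_accuracy(true_labels, pred_labels):
    """Per-residue accuracy after reducing both label sequences to three states."""
    true3 = [EIGHT_TO_THREE[t] for t in true_labels]
    pred3 = [EIGHT_TO_THREE[p] for p in pred_labels]
    return accuracy(true3, pred3)


# Toy example with made-up labels (not data from the paper).
truth = "HGEBLTSH"
guess = "HHEELLLE"
toy_q8 = q8_accuracy(truth, guess)
toy_q3 = q3_accuracy(truth, guess)
```

Because several 8-state confusions (for example G predicted as H) collapse into the same 3-state class, Q3 computed this way is never lower than Q8 on the same predictions.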

5. Conclusion and future work

In this article, we proposed a deep recurrent encoder–decoder network and employed it to predict the secondary structure of residues in amino acid sequences. The combination of the encoder–decoder architecture and GRUs is well suited to modeling both the sequence-structure relationship between input protein features and SS, and the mutual interactions among residues. It also achieves better performance in both Q3 and Q8 accuracy. To further develop the learning ability of SSREDNs, however, a method must be developed for determining the architecture and parameters of the networks, such as layer type, layer size, optimization method and initialization. This is also a challenge for deep learning and machine learning in general. To further improve the learning ability of the present model, multi-task prediction may be an avenue worth pursuing, such as the prediction of both the secondary structure and the solvent accessible surface area.

Acknowledgment

This work was supported by the National Science Foundation of China (Grant nos. 61432012 and 61402306).

Supplementary material

Supplementary material associated with this article can be found, in the online version, at 10.1016/j.knosys.2016.11.015.

References

[1] X. Luo, Z. Ming, Z. You, S. Li, Y. Xia, H. Leung, Improving network topology-based protein interactome mapping via collaborative filtering, Knowl.-Based Syst. 90 (2015) 23–32.

[2] X. Lei, Y. Ding, H. Fujita, A. Zhang, Identification of dynamic protein complexes based on fruit fly optimization algorithm, Knowl.-Based Syst. 105 (2016) 270–277.

[3] J. Cheng, A.N. Tegge, P. Baldi, Machine learning methods for protein structure prediction, IEEE Rev. Biomed. Eng. 1 (2008) 41–49.

[4] J.K. Myers, T.G. Oas, Preorganized secondary structure as an important determinant of fast protein folding, Nat. Struct. Biol. 8 (6) (2001) 552–558.

[5] Z. Yang, I-Tasser server for protein 3D structure prediction, BMC Bioinform. 9 (3) (2008) 297–315.

[6] Z. Aydin, Y. Altunbasak, M. Borodovsky, Protein secondary structure prediction for a single-sequence using hidden semi-Markov models, BMC Bioinform. 7 (1) (2006) 178.

[7] X.D. Sun, R.B. Huang, Prediction of protein structural classes using support vector machines, Amino Acids 30 (4) (2006) 469–475.

[8] D.T. Jones, Protein secondary structure prediction based on position-specific scoring matrices, J. Mol. Biol. 292 (2) (1999) 195–202.

[9] G. Pollastri, D. Przybylski, B. Rost, P. Baldi, Improving the prediction of protein secondary structure in three and eight classes using recurrent neural networks and profiles, Proteins: Struct. Funct. Bioinform. 47 (2) (2002) 228–235.

[10] Z. Wang, F. Zhao, J. Peng, J. Xu, Protein 8-class secondary structure prediction using conditional neural fields, Proteomics 11 (19) (2011) 3786–3792.

[11] S. Babaei, A. Geranmayeh, S.A. Seyyedsalehi, Protein secondary structure prediction using modular reciprocal bidirectional recurrent neural networks, Comput. Methods Programs Biomed. 100 (3) (2010) 237–247.

[12] R. Salakhutdinov, G. Hinton, Deep Boltzmann machines, J. Mach. Learn. Res. 5 (2) (2009) 1967–2006.

[13] Y. Bengio, G. Mesnil, Y. Dauphin, S. Rifai, Better mixing via deep representations, in: Proceedings of International Conference on Machine Learning (ICML), 2012, pp. 552–560.

[14] G.E. Hinton, R.R. Salakhutdinov, Reducing the dimensionality of data with neural networks, Science 313 (5786) (2006) 504–507.

[15] M. Spencer, J. Eickholt, J. Cheng, A deep learning network approach to ab initio protein secondary structure prediction, IEEE/ACM Trans. Comput. Biol. Bioinform. 12 (1) (2015) 103–112.

[16] R. Heffernan, K. Paliwal, J. Lyons, A. Dehzangi, A. Sharma, J. Wang, A. Sattar, Y. Yang, Y. Zhou, Improving prediction of secondary structure, local backbone angles, and solvent accessible surface area of proteins by iterative deep learning, Sci. Rep. 5 (11476) (2015), doi:10.1038/srep11476.

[17] J. Zhou, O.G. Troyanskaya, Deep supervised and convolutional generative stochastic network for protein secondary structure prediction, in: Proceedings of International Conference on Machine Learning (ICML), 2014, pp. 745–753.

[18] R. Pascanu, T. Mikolov, Y. Bengio, On the difficulty of training recurrent neural networks, in: Proceedings of International Conference on Machine Learning (ICML), 2013, pp. 1310–1318.

[19] K. Cho, B.V. Merrienboer, D. Bahdanau, Y. Bengio, On the properties of neural machine translation: encoder-decoder approaches, arXiv preprint, arXiv:1409.1259 (2014).

[20] J. Chung, C. Gulcehre, K.H. Cho, Y. Bengio, Empirical evaluation of gated recurrent neural networks on sequence modeling, arXiv preprint, arXiv:1412.3555 (2014).

[21] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, P.A. Manzagol, Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion, J. Mach. Learn. Res. 11 (6) (2010) 3371–3408.

[22] C.D. Huang, S.F. Liang, C.T. Lin, R.C. Wu, Machine learning with automatic feature selection for multi-class protein fold classification, J. Inform. Eng. 21 (4) (2005) 711–720.

[23] G.E. Dahl, T.N. Sainath, G.E. Hinton, Improving deep neural networks for LVCSR using rectified linear units and dropout, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013, pp. 8609–8613.

[24] D. Kingma, J. Ba, Adam: a method for stochastic optimization, in: Proceedings of International Conference on Learning Representations (ICLR), 2015, pp. 254–269.

[25] M. Hermans, B. Schrauwen, Training and analysing deep recurrent neural networks, in: Proceedings of Advances in Neural Information Processing Systems (NIPS), 2013, pp. 190–198.

[26] Q. You, Y.J. Zhang, A new training principle for stacked denoising autoencoders, in: Proceedings of International Conference on Image and Graphics, 2013, pp. 384–389.


[27] R. Hecht-Nielsen, Theory of the backpropagation neural network, Neural Netw. 1 (1) (1988) 65–93.

[28] M. Schuster, K.K. Paliwal, Bidirectional recurrent neural networks, IEEE Trans. Signal Process. 45 (11) (1997) 2673–2681.

[29] P.J. Werbos, Backpropagation through time: what it does and how to do it, Proc. IEEE 78 (10) (1990) 1550–1560.

[30] J. Duchi, E. Hazan, Y. Singer, Adaptive subgradient methods for online learning and stochastic optimization, J. Mach. Learn. Res. 12 (7) (2011) 257–269.

[31] W. Guoli, R.L. Dunbrack Jr, PISCES: a protein sequence culling server, Bioinformatics 19 (12) (2003) 1589–1591.

[32] C.N. Magnan, P. Baldi, SSpro/ACCpro 5: almost perfect prediction of protein secondary structure and relative solvent accessibility using profiles, machine learning and structural similarity, Bioinformatics 30 (18) (2014) 2592–2597.

[33] X. Glorot, A. Bordes, Y. Bengio, Deep sparse rectifier neural networks, in: Proceedings of International Conference on Artificial Intelligence and Statistics (AISTATS), 2011, pp. 315–323.
