
Protein secondary structure prediction by using deep learning method


Academic year: 2021


Citation: Wang, Yangxu, Mao, Hua and Yi, Zhang (2017) Protein secondary structure prediction by using deep learning method. Knowledge-Based Systems, 118. pp. 115-123. ISSN 0950-7051

Published by: Elsevier

URL: http://dx.doi.org/10.1016/j.knosys.2016.11.015

This version was downloaded from Northumbria Research Link: http://nrl.northumbria.ac.uk/39675/

Northumbria University has developed Northumbria Research Link (NRL) to enable users to access the University’s research output. Copyright © and moral rights for items on NRL are retained by the individual author(s) and/or other copyright owners. Single copies of full items can be reproduced, displayed or performed, and given to third parties in any format or medium for personal research or study, educational, or not-for-profit purposes without prior permission or charge, provided the authors, title and full bibliographic details are given, as well as a hyperlink and/or URL to the original metadata page. The content must not be changed in any way. Full items must not be sold commercially in any format or medium without formal permission of the copyright holder. The full policy is available online: http://nrl.northumbria.ac.uk/policies.html

This document may differ from the final, published version of the research and has been made available online in accordance with publisher policies. To read and/or cite from the published version of the research, please visit the publisher’s website (a subscription may be required.)


Yangxu Wang, Hua Mao, Zhang Yi

Machine Intelligence Laboratory, College of Computer Science, Sichuan University, Chengdu 610065, People’s Republic of China

Keywords: Deep learning, Secondary structure prediction, Encoder–decoder networks, Recurrent neural networks

Abstract

The prediction of protein structures directly from amino acid sequences is one of the biggest challenges in computational biology. It can be divided into several independent sub-problems, of which protein secondary structure (SS) prediction is fundamental. Many computational methods have been proposed for the SS prediction problem. Few of them can model well both the sequence-structure mapping relationship between input protein features and SS and the interaction relationship among residues, both of which are important for SS prediction. In this paper, we propose deep recurrent encoder–decoder networks called Secondary Structure Recurrent Encoder–Decoder Networks (SSREDNs) to solve this SS prediction problem. A deep architecture and recurrent structures are employed in the SSREDNs to model both the complex nonlinear mapping relationship between input protein features and SS and the mutual interaction among continuous residues of the protein chain. A series of techniques are also used in this paper to refine the model’s performance. The proposed model is applied to the open datasets CullPDB and CB513. Experimental results demonstrate that our method improves both Q3 and Q8 accuracy compared with some publicly available methods. For the Q8 prediction problem, it achieves 68.20% and 73.1% accuracy on the CB513 and CullPDB datasets in fewer epochs, which is better than the previous state-of-the-art method.

1. Introduction

Discovering proteins’ structures and biological functions is very important for understanding their biological processes, such as protein-protein interactions [1], protein complex identification [2] and protein structure prediction. Protein structure prediction, elucidating the complex relationship between a protein sequence and its structure, is one of the most important challenges in computational biology [3]. The most elemental task of protein structure prediction is protein secondary structure (SS) prediction, which aims to discover the structural states of amino acids. SS represents the local conformation of the polypeptide backbone of proteins and provides a bridge that links the primary sequence and the tertiary structure, which is very helpful for many structural and functional analysis tools [4,5].

∗ Corresponding author. E-mail addresses: [email protected] (Y. Wang), [email protected] (H. Mao), [email protected] (Z. Yi).

Typically, protein secondary structures can either be divided into three states (α-helix (H), β-strand (E) and coil region (C)) or be further classified into eight fine-grained states (3₁₀-helix (G), α-helix (H), π-helix (I), β-strand (E), β-bridge (B), β-turn (T), high curvature regions (S) and irregular loop (L)). SS prediction is usually evaluated by Q3 and Q8 accuracy, which measures the percentage of residues for which the 3-state or 8-state SS is correctly predicted. Currently, extensive research efforts have been spent on applying computational methods to address the Q3 prediction problem, but very few on the more challenging Q8 prediction problem.
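As a minimal sketch of the Q3 and Q8 measures described above, the following computes the percentage of correctly predicted residues from two state strings. The helper names are hypothetical, and the 8-to-3 reduction table is the commonly used DSSP-style grouping, which is an assumption here because the text does not state which reduction the authors used.

```python
# Assumed 8-state -> 3-state reduction (common DSSP-style convention,
# not confirmed by the paper).
EIGHT_TO_THREE = {
    "H": "H", "G": "H", "I": "H",   # helices
    "E": "E", "B": "E",             # strand and bridge
    "T": "C", "S": "C", "L": "C",   # turn, bend, loop -> coil
}


def q_accuracy(predicted: str, observed: str) -> float:
    """Percentage of residues whose predicted state equals the observed one."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("sequences must be non-empty and of equal length")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)


def reduce_to_three(states: str) -> str:
    """Map an 8-state string onto the 3-state alphabet H/E/C."""
    return "".join(EIGHT_TO_THREE[s] for s in states)


# Toy example: two residues differ at the 8-state level (G vs H, B vs E),
# but both differences disappear after reduction to three states.
observed = "HHHGTLEEBS"
predicted = "HHHHTLEEES"
q8 = q_accuracy(predicted, observed)
q3 = q_accuracy(reduce_to_three(predicted), reduce_to_three(observed))
print(f"Q8 = {q8:.1f}%, Q3 = {q3:.1f}%")
```

The toy strings illustrate why Q8 is the harder target: errors between closely related states count against Q8 but can vanish under the 3-state grouping.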

The hidden Markov model (HMM) has been applied to the 3-state SS prediction problem [6]. Although an HMM can describe the interactions among adjacent residues, it is very challenging for an HMM to model the complex nonlinear relationship between input protein features and SS. A support vector machine (SVM) [7] can deal with this complex nonlinear mapping, but it is challenging for an SVM to take into consideration the interactions among adjacent residues. To the best of our knowledge, the best Q3 accuracy so far is about 80%, obtained with a 2-stage neural network (NN) method [8]. For the Q8 prediction problem, existing methods [9,10] fail to provide promising results. The problem may be that most of these methods are shallow architectures. Their limitation is that it is very difficult for a relatively shallow architecture to model well both the complex sequence-structure relationship between input protein features and SS and the mutual interaction relationship among residues, although both are important for SS prediction [10,11].

Nowadays, NNs with deep architectures, also called deep neural networks (DNNs), have become the most powerful machine learning techniques for pattern recognition [12,13]. With the ability to map unorganized low-level features into high-level latent data representations, which are more suitable for a final classification problem [14], DNNs can better model the complex sequence-structure relationship for SS prediction. Cheng [15] proposed a typical deep belief network model for Q3 prediction, in which each layer is a restricted Boltzmann machine (RBM). Another DNN approach is reported in [16], which develops a multi-step iterative deep network model that predicts four different sets of structural properties including 3-state SS. In order to better capture the mutual interactions among residues, a deep convolutional generative stochastic network is proposed in [17] for the 8-state SS prediction problem, which may be the best 8-state predictor to our knowledge. However, the performance of this approach is sensitive to the chosen convolution window size, which is difficult to determine. Recurrent neural networks (RNNs) have also been adopted for this problem, as they can employ contextual information of the input sequence and learn the variable-width dependencies in the protein chain [11]. However, the gradient problem still limits the application of these RNN-based approaches [18].

In this paper, we propose a deep recurrent network, called the Secondary Structure Recurrent Encoder–Decoder Networks (SSREDNs), for the challenge of both Q3 and Q8 prediction. First, SSREDNs employ the encoder–decoder architecture [19], a typical deep architecture, to model the sequence-structure relationship. This architecture allows its encoder part to encode a given protein sequence into a representation layer, and its decoder part then decodes this learned representation into a new feature space. Compared with classical deep architectures, it enhances the model’s learning and representation abilities [19]. Second, different from [17], which uses deep convolutional networks, we use a new type of recurrent structure, called gated recurrent units (GRUs) [20], in our networks to model the mutual interaction relationship among residues. It overcomes the gradient problem and can better learn both the adjacent and long-range interactions among residues without the problem of choosing the size of the convolution window. The proposed networks can model both the complex nonlinear sequence-structure mapping between input protein features and SS and the mutual interaction relationship among residues. In particular, a stack auto-encoder (SAE) architecture [21] is also contained in the encoder part of our networks to automatically learn compact and representative features for the input. It is especially helpful for proteins, as we lack the necessary intuition or knowledge for hand-crafting features involving multiple amino acids [22]. A series of techniques are also adopted to refine the performance of the networks, such as the dropout technique [23] and the optimization algorithm named Adam [24].

For evaluation purposes, the proposed method is applied to the benchmark datasets CB513 and CullPDB for both Q3 and Q8 prediction. First, SSREDNs outperform the best Q8 predictor [17], especially on some SS types which are difficult to predict. Furthermore, they also outperform most of the publicly available methods for both Q3 and Q8 prediction.

The paper is organized as follows. Section 2 gives an outline of the deep learning architectures and techniques used in this work. In Section 3, the proposed network model and its learning algorithm are introduced. Experimental details are elucidated in Section 4. Finally, the conclusions and future work are drawn in Section 5.

2. Preliminary

In this section, a brief introduction to DNNs and the SAE is presented. In addition, the GRUs used to deal with the long-term interactions between residues are also described.

2.1. Deep neural networks (DNNs)

DNNs are artificial neural networks with multiple hidden layers between the input and output layers. In DNNs, the deep-seated layers enable the extraction of more abstract features from lower layers. Although DNNs are typically designed with feed-forward layers, researchers have recently developed DNNs with recurrent structures for applications such as natural language processing [25]. As shown in Fig. 1(a), deep feed-forward networks can represent complex data in a multi-layer network structure. With sufficient layers of learning neurons, they can map unorganized low-level features to a high-level data manifold, which is more suitable for a final classification or regression task.

Fig. 1. (a) Deep feed-forward networks. (b) Deep recurrent networks.

Recurrent units have connections between neural units that form a directed cycle, which allows them to exhibit dynamic temporal behavior for arbitrary sequential inputs, as shown in Fig. 1(b). Deep networks with recurrent structure, also called deep recurrent neural networks (DRNNs), have already been successfully used for complex sequential data analysis, such as speech recognition, image semantic understanding and online handwriting recognition. Despite the extra capability of capturing the spatial/temporal dependencies within sequences, DRNNs are more difficult to train due to the vanishing-gradient and over-fitting problems [18].

2.2. Stack auto-encoder (SAE)

The classical auto-encoder (AE) [26] is a three-layer neural network (as shown in Fig. 2(a)): an encoder layer maps the input to a representation layer and a decoder layer reconstructs the input from the representation layer. The first two layers are regarded as an encoder, while the latter two layers are regarded as a decoder. Given input $x$, and letting $y$ denote the state of the neurons in the representation layer, an AE’s feed-forward process can be formalized as follows:

$$y = \sigma(W_h x + b_h), \tag{1}$$

$$\hat{x} = g(W_h^T y + b_o), \tag{2}$$

where $\hat{x}$ is the reconstructed output, and $W_h$, $b_h$, $W_h^T$ and $b_o$ respectively denote the weight matrix of the encoder, the bias vector of the hidden layer, the weight matrix of the decoder and the bias vector of the output layer. As the reconstruction error is minimized in a well-trained AE, the cost function $J_{AE}$ can be defined as:

$$J_{AE} = L(x, \hat{x}), \tag{3}$$

where $L(\cdot)$ is the function measuring the difference between the original inputs and the reconstructed ones.

Fig. 2. (a) A typical three-layer AE. The input $x$ and output $\hat{x}$ have the same shape. After training, the weights of both the encoder and decoder are fixed. The encoder part can then be stacked into an SAE. (b) A four-layer stack AE. It is pre-trained layer by layer as three stacked AEs. The representation learned at the previous AE is used as input for learning the next AE’s representation.
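The feed-forward pass and cost of Eqs. (1)–(3) can be sketched in a few lines of NumPy. This is an illustration under stated assumptions rather than the authors' implementation: the decoder activation $g$ is taken to be a sigmoid, $L$ is taken to be the mean squared error, and the layer sizes and function names are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def ae_forward(x, W_h, b_h, b_o):
    """Tied-weight auto-encoder pass: the decoder reuses W_h transposed."""
    y = sigmoid(W_h @ x + b_h)          # Eq. (1): representation layer
    x_hat = sigmoid(W_h.T @ y + b_o)    # Eq. (2): g assumed to be sigmoid
    return y, x_hat


def reconstruction_cost(x, x_hat):
    """Eq. (3) with L assumed to be the mean squared error."""
    return float(np.mean((x - x_hat) ** 2))


rng = np.random.default_rng(0)
n_in, n_hidden = 42, 16                  # hypothetical layer sizes
W_h = rng.uniform(-0.05, 0.05, size=(n_hidden, n_in))
b_h = np.zeros(n_hidden)
b_o = np.zeros(n_in)

x = rng.random(n_in)                     # one toy input vector in [0, 1)
y, x_hat = ae_forward(x, W_h, b_h, b_o)
cost = reconstruction_cost(x, x_hat)
print(y.shape, x_hat.shape, round(cost, 4))
```

Training would minimize `cost` with respect to `W_h`, `b_h` and `b_o` by back-propagation; only the forward computation is shown here.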

An AE can be trained by the standard back-propagation algorithm [27]. AEs can be further stacked to create a stacked auto-encoder (SAE) [21] (as shown in Fig. 2(b)). In an SAE, the representation learned at the previous level is used as input for learning the next level’s representation, which yields a better representation for deeper networks. It can automatically discover and extract useful features or representations in a layer-wise procedure. The training of an SAE follows the recently proposed pre-training and fine-tuning paradigm. Automatically learning features is important for protein SS prediction, as prior knowledge for hand-crafting features involving multiple amino acids is barely available. Therefore, an SAE is adopted in SSREDNs to pre-train the feed-forward layers that follow the input layer so that better features are learned automatically.

2.3. Gated recurrent units (GRUs)

2.3.1. Classical recurrent unit

Recurrent units have connections towards themselves. A classical recurrent unit is shown in Fig. 3(a); its update procedure is formalized in Eq. (4):

$$h_t = g(W_h \cdot [x_t, h_{t-1}]), \tag{4}$$

where $x_t$ is the input at time $t$, $h_t$ and $h_{t-1}$ are the hidden states of the recurrent unit at times $t$ and $t-1$, respectively, and $W_h$ is the weight matrix. The bias is omitted here.

As the nonlinear activation function $g(\cdot)$ acts on the recurrent unit repeatedly over time, it is difficult to train recurrent networks. The gradient tends to either vanish (most of the time) or approach infinity during the training process. This makes gradient-based optimization difficult, and it becomes even more problematic because the effect of long-term dependencies is hidden (being exponentially smaller with respect to sequence length) by the effect of short-term dependencies [20].

2.3.2. Gated recurrent unit

Fig. 3. (a) Classical recurrent unit. (b) Gated recurrent unit.

As classical recurrent units suffer from gradient problems during training, a new type of recurrent unit called the GRU is used in our work. The GRU was proposed to solve the gradient problem in training recurrent networks by allowing each recurrent unit to capture the dependencies of different time scales in an adaptive manner (see Fig. 3(b)). Unlike the classical recurrent unit, the activation $h_t$ of the GRU at time $t$ is a linear interpolation between the previous activation $h_{t-1}$ and the candidate activation $\tilde{h}_t$:

$$h_t = z_t \tilde{h}_t + (1 - z_t) h_{t-1}. \tag{5}$$

Here, $z_t$ denotes the update gate of the GRU. It decides how much the unit updates its activation or content. This allows the error to easily back-propagate through the unit without vanishing too quickly or blowing up as a result of passing through multiple time steps, thus reducing the difficulty of training the recurrent networks. $z_t$ is computed as follows:

$$z_t = \sigma(W_z \cdot [x_t, h_{t-1}]), \tag{6}$$

$$\sigma(x) = \frac{1}{1 + e^{-x}}. \tag{7}$$

The candidate activation $\tilde{h}_t$ is computed as [20]:

$$\tilde{h}_t = \tanh(W \cdot [x_t, r_t \odot h_{t-1}]), \tag{8}$$

which is similar to the traditional recurrent unit except for the inclusion of the $r_t \odot h_{t-1}$ term; $r_t$ is a set of reset gates and $\odot$ is an element-wise multiplication. $r_t$ is computed as follows:

$$r_t = \sigma(W_r \cdot [x_t, h_{t-1}]), \tag{9}$$

$$\sigma(x) = \frac{1}{1 + e^{-x}}. \tag{10}$$

We use $x_t$ to denote the sequential input at time $t$. With GRUs, it is easy for each unit to remember the existence of a specific feature in the input stream over a long series of time steps. Any feature that is decided to be important by the update gate will not be overwritten.
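One GRU time step, following Eqs. (5)–(10) with biases omitted as in the text, can be sketched as below. The function name, the weight shapes and the toy sizes are illustrative assumptions; each weight matrix acts on the concatenation of the input and a hidden-state vector.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def gru_step(x_t, h_prev, W_z, W_r, W):
    """One GRU update; every weight has shape (n_hidden, n_in + n_hidden)."""
    xh = np.concatenate([x_t, h_prev])
    z_t = sigmoid(W_z @ xh)                                      # Eq. (6): update gate
    r_t = sigmoid(W_r @ xh)                                      # Eq. (9): reset gate
    h_tilde = np.tanh(W @ np.concatenate([x_t, r_t * h_prev]))   # Eq. (8): candidate
    return z_t * h_tilde + (1.0 - z_t) * h_prev                  # Eq. (5): interpolation


rng = np.random.default_rng(0)
n_in, n_hidden = 4, 3                        # hypothetical sizes
shape = (n_hidden, n_in + n_hidden)
W_z, W_r, W = (rng.uniform(-0.05, 0.05, size=shape) for _ in range(3))

h = np.zeros(n_hidden)                       # initial hidden state set to 0
for x_t in rng.random((5, n_in)):            # a toy sequence of five inputs
    h = gru_step(x_t, h, W_z, W_r, W)
print(h)
```

Because Eq. (5) is a convex combination of the previous state and a `tanh` output, the hidden state stays inside (-1, 1) when it starts at zero, which is one way to see why the update does not blow up.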


Fig. 4. The proposed secondary structure recurrent encoder–decoder networks.

3. The proposed model and algorithm

As the major part of this article, the proposed secondary structure recurrent encoder–decoder networks (SSREDNs) are presented in this section. Their architecture and the algorithm for learning parameters from data are discussed in detail. We use a bidirectional GRU structure to deal with the spatial interaction information in the protein chain. The Adam algorithm is used to learn the network weights from data.

3.1. Secondary structure recurrent encoder–decoder networks (SSREDNs)

As shown in Fig. 4, the SSREDNs are essentially a deep recurrent network with feed-forward layers and recurrent architectures. The network consists of an encoder part, a decoder part and a representation layer. Both the encoder and decoder parts contain feed-forward layers, recurrent layers and special training mechanisms. The encoder part learns a good representation of the input protein feature sequence that reflects both immediate and long-term amino acid dependencies, and the decoder part uses the representation for the final SS prediction, which also takes the spatial dependency into consideration. GRUs are used in the encoder and decoder to learn the amino acid interaction information within the protein chain.

3.1.1. Encoder part

In the encoder part, an SAE is first used to pre-train the first few layers of the networks for better feature extraction. This SAE is trained by an unsupervised layer-wise strategy. Then, weights from the trained SAE are used to initialize these layers. By using this pre-training strategy, the networks can obtain more effective features automatically at the beginning of the training process. The encoder part encodes the input sequence into a good representation at the representation layer. Bidirectional GRU layers are used here to capture the context-dependent relationship (interaction among residues).

3.1.2. Decoder part

The decoder part decodes the learned representation of the representation layer into the structure space for SS prediction. Bidirectional GRU recurrent layers are also used to learn additional global dependence information for the final prediction. We use a softmax layer as the final layer to output the probability of the eight secondary structural states. Finally, the recurrent encoder–decoder network is trained in a supervised way using the back-propagation learning algorithm.

3.1.3. Training

As shown in Fig. 4, when training we use the amino acid features as input to the network, residue by residue along the protein chain, and the encoder generates a stable representation containing the contextual information of the input sequence. We normalize the activation level of the representation layer through an extra output layer. Next, the decoder predicts the Q8 state for the current input residue using the representation from the encoder. As we use recurrent structures to model the sequence-structure relationship, the window size problem of the convolutional method does not exist. At each time step, the network receives one residue from the protein chain as input and then outputs its secondary structural state.

Fig. 5. Bidirectional GRU layer.

3.2. Bidirectional GRU

As the protein sequence contains both forward and reverse order dependency information between adjacent positions on the protein chain, we use bidirectional recurrent layers [28] with GRUs in the networks. A bidirectional GRU (B-GRU) layer is shown in Fig. 5. The B-GRU layer contains two reversed parallel GRU layers which are not connected to each other. Sequential information flows through the two layers in reversed order as time progresses. Consider a sequence of protein data of length $N$ with the networks’ input denoted as $X = (x_1, x_2, \ldots, x_N)$. Each $x_i$, $i \in (1, 2, \ldots, N)$, is a feature vector that contains the feature information of the amino acid at position $i$. The learning algorithm for the B-GRU is shown in Algorithm 1.

The first GRU layer (forward GRU layer) accepts the input from $x_1$ to $x_N$, one at each time step. The second GRU layer (reversed GRU layer) starts at $x_N$ and accepts the input from $x_N$ down to $x_1$, i.e., in the backward time direction. In this way, when processing a sequence input $X = (x_1, x_2, \ldots, x_N)$, the forward GRU layer captures the information $a_t^{forward}$ at any time step $t$, $t \in (1, 2, \ldots, N)$, which contains the dependence relationship of the previous $t$ positions, while the reverse GRU layer captures the reversed information $a_{N+1-t}^{reverse}$, which contains the dependence relationship of the $t$ previous positions in the reverse direction. Therefore, once the B-GRU layer is trained, the forward and backward GRU layers’ outputs $a_t$, $t \in (1, 2, \ldots, N)$, may include the full dependence relationships for its input sequence at any position $t$. Both the forward and backward GRU layers’ outputs pass upward to the output layer of the B-GRU to form a better representation for the subsequent layers. Furthermore, in the back-propagation process, the forward and backward GRU layers are processed in decreasing and increasing time order, respectively, by using the BPTT algorithm [29].

Compared with a generic recurrent layer, the B-GRU can better deal with both the spatial dependencies in the protein sequence and the gradient problems that normally occur during network training, which are usually the biggest challenge for training classical recurrent networks.
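The feed-forward half of a bidirectional layer can be sketched as two independent passes whose per-position outputs are concatenated. In this hedged illustration the helper names are hypothetical and `toy_step` is a deliberately simple stand-in for a GRU cell, so that the example stays self-contained; any function with the signature `step(x_t, h_prev)` could be substituted.

```python
import numpy as np


def run_layer(xs, step, hidden_size):
    """Run a recurrent step function over xs in the given order."""
    h = np.zeros(hidden_size)
    outputs = []
    for x_t in xs:
        h = step(x_t, h)
        outputs.append(h)
    return outputs


def bidirectional(xs, forward_step, backward_step, hidden_size):
    """Concatenate forward and backward states position by position."""
    a_forward = run_layer(xs, forward_step, hidden_size)
    # Process the sequence from x_N down to x_1, then flip the outputs back
    # so that a_reverse[t] is aligned with position t.
    a_reverse = run_layer(xs[::-1], backward_step, hidden_size)[::-1]
    return [np.concatenate([f, b]) for f, b in zip(a_forward, a_reverse)]


def toy_step(x_t, h_prev):
    """Stand-in for a GRU cell (assumption): a plain tanh recurrence."""
    return np.tanh(0.5 * h_prev + x_t.mean())


xs = np.arange(12.0).reshape(4, 3) / 10.0    # toy sequence: N = 4, 3 features
out = bidirectional(xs, toy_step, toy_step, hidden_size=2)
print(len(out), out[0].shape)
```

At position 1 the forward half has seen only $x_1$ while the backward half has seen the whole chain, and the reverse holds at position $N$; the concatenated vector therefore carries context from both directions at every position.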

3.3. Training

Algorithm 1 B-GRU learning algorithm.
Require:
    Input data: X = {x_1, x_2, ..., x_N}: sequence input X with length N
Ensure:
    Feed-forward:
    for t = 1 to N do
        a_t^forward = GRU(x_t)
        a^forward = {a_1^forward, a_2^forward, ..., a_N^forward}
    end for
    for t = N to 1 do
        a_t^reverse = GRU(x_{N+1−t})
        a^reverse = {a_N^reverse, ..., a_2^reverse, a_1^reverse}
    end for
    for all t ∈ (1, 2, ..., N) do
        a_t = concat(a_t^forward, a_{N+1−t}^reverse)
        a = {a_1, a_2, ..., a_N}
    end for
    Back-propagation:
    for all t ∈ (1, 2, ..., N) do
        Backward pass for output layer, storing the back error δ at each time step
    end for
    for t = N to 1 do
        BPTT for reverse GRU layer, using the stored δ from output layer
    end for
    for t = 1 to N do
        BPTT for forward GRU layer, using the stored δ from output layer
    end for

The network was trained with Adam, the stochastic gradient-based optimization method proposed by Kingma [24]. It combines the advantages of two recently popular optimization methods: AdaGrad [30], the adaptive gradient algorithm, and RMSProp, which is similar but introduces an additional decay term. Adam computes individual adaptive learning rates for different parameters of the networks from estimates of the first and second moments of the gradients. It has been shown that Adam performs equal to or better than some other optimization methods, regardless of hyper-parameter setting.
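The first- and second-moment estimates mentioned above can be made concrete with a single Adam update, written here in plain NumPy as a sketch of the published algorithm rather than of the authors' Theano/Keras code. The function name and the toy objective $f(\theta) = \theta^2$ are illustrative assumptions; the default hyper-parameters are the standard Adam defaults.

```python
import numpy as np


def adam_step(theta, grad, m, v, t,
              alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update at (1-based) step t; returns new theta, m, v."""
    m = beta1 * m + (1.0 - beta1) * grad          # first-moment moving average
    v = beta2 * v + (1.0 - beta2) * grad ** 2     # second-moment moving average
    m_hat = m / (1.0 - beta1 ** t)                # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v


# Toy objective f(theta) = theta^2 with gradient 2 * theta.
theta = np.array([1.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for t in range(1, 101):
    grad = 2.0 * theta
    theta, m, v = adam_step(theta, grad, m, v, t)
print(theta)
```

Each parameter receives its own effective step size `alpha * m_hat / sqrt(v_hat)`, which is what "individual adaptive learning rates" refers to; on this toy problem every step is at most about `alpha`, so 100 steps move `theta` by no more than roughly 0.1.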

In SSREDNs, the learning of the representation layer is very important for the final performance, because it connects both the encoder and the decoder. As we know, the outputs of SS prediction must be continuous structural states, in which adjacent residues form a uniform structure in space; e.g., a segment in SS state H may consist of several amino acids. A better activation of the representation layer is therefore one in which different input features have similar representations at the hidden representation layer if they have the same output label during the training process. For this purpose, an additional output layer with a Mean Square Error cost function is added after the representation layer to constrain its activation level. It forces different input features to obtain approximately equal activations at the representation layer if they have the same label. So there are actually two output layers in the training procedure, and the gradients flow from both output layers. One updates the whole network, while the other only affects the encoder. After the weights of the networks are trained, we only use the final output layer (red part in Fig. 6). The cost functions of the two output layers are as follows:

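$$L_1(x, \theta) = \frac{1}{m} \sum_{i=1}^{m} \sum_{j=1}^{n} \left( x_{ij} - y_{ij} \right)^2, \tag{11}$$

$$L_2(x, \theta) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{j=1}^{n} \left( y_{ij} \log\left(P(x_{ij})\right) + (1 - y_{ij}) \log\left(1 - P(x_{ij})\right) \right), \tag{12}$$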


Fig. 6. Multi-output training. An additional output layer with a Mean Square Error (MSE) cost is added after the representation layer during the training process. It helps to adjust the activity of the representation layer for the final prediction. It also helps to speed up the training process to some extent.

where $\theta$ is the parameter set, and $x_{ij}$, $y_{ij}$ denote the $j$-th element of the $i$-th sample of the input set $x$ and the target set $y$, respectively. We use the Adam algorithm [24] to optimize the SSREDNs in the training process. Moreover, the final cost function $L(x, \theta)$ of SSREDNs is a sum of the costs at the two output layers. $\theta$ denotes the whole parameter set of the networks and we use $\gamma$ to balance the contribution of the two output layers.

$$L(x, \theta) = L_1(x, \theta) + \gamma L_2(x, \theta) \tag{13}$$
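The two-output cost of Eq. (13) can be sketched numerically as follows. This is an illustration under assumptions: Eq. (12) is taken to be the standard binary cross-entropy, the function and argument names are hypothetical, and the arrays are toy values with $m = 2$ samples and $n = 2$ outputs.

```python
import numpy as np


def mse_cost(out, y):
    """Eq. (11): squared error summed over outputs, averaged over m samples."""
    return float(np.sum((out - y) ** 2) / out.shape[0])


def cross_entropy_cost(p, y, eps=1e-12):
    """Eq. (12), assumed to be binary cross-entropy averaged over m samples."""
    p = np.clip(p, eps, 1.0 - eps)           # guard against log(0)
    terms = y * np.log(p) + (1.0 - y) * np.log(1.0 - p)
    return float(-np.sum(terms) / p.shape[0])


def total_cost(rep_out, final_out, y, gamma=1.0):
    """Eq. (13): MSE at the extra output plus gamma times the final cost."""
    return mse_cost(rep_out, y) + gamma * cross_entropy_cost(final_out, y)


y = np.array([[1.0, 0.0], [0.0, 1.0]])       # one-hot targets
rep_out = y.copy()                           # extra output already matches labels
final_out = np.full_like(y, 0.5)             # final output is maximally unsure
loss = total_cost(rep_out, final_out, y, gamma=1.0)
print(round(loss, 6))
```

With `gamma = 1` both output layers contribute equally; lowering it would weaken the influence of the final classification cost relative to the representation-layer constraint.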

4. Experiments

In order to evaluate the performance of the proposed SSREDNs, a series of experiments are performed on the open datasets CullPDB [31] and CB513 [10]. A detailed data description is first provided in Section 4.1. In Section 4.2, the model training setup is presented. First, variations of network architectures with different layers were tested to refine the model’s performance. Among them, we select the one with the best performance for later comparison with other existing approaches. We have also evaluated the influence of the pre-training strategy using the SAE on the final prediction performance. The performance analysis is presented in Section 4.3. It shows that our method converges smoothly with fewer training epochs and improves the sensitivity and precision of almost all of the 8 states on the CullPDB testing set compared with [17], which may be the best 8-state predictor. Our method is also compared with some public methods (SSpro [32], RaptorX [10], PSIPRED [8]) on both the CullPDB and CB513 datasets. It outperforms the others on both Q3 and Q8 accuracies.

4.1. Features and dataset

The major purpose is to predict the SS type for each amino acid of a given protein sequence. We solve both the Q8 and Q3 prediction problems in this paper. We evaluated the proposed SSREDNs for both Q3 and Q8 on two datasets, CullPDB [31] and CB513 [10], which are also used by other related investigations [10]. CullPDB, a large non-homologous dataset (identity less than 30%), contains 6128 protein amino acid sequences labeled with Q8 secondary structure, and has been randomly divided into training (5600), validation (248), and testing (280) sets. A separate evaluation is also performed on the CB513 dataset, which contains 513 proteins, while training on a CullPDB dataset further filtered to remove sequences with more than 25% identity with the CB513 dataset.

Protein sequence profiles with evolutionary information have become a breakthrough for SS prediction. Thus, Position Specific Scoring Matrix (PSSM) features are used here, which are widely used features that can be extracted from protein profiles by Define Secondary Structure of Proteins (DSSP) and Position Specific Iterated Basic Local Alignment Search Tool (PSI-BLAST). In [17], the data used for training contained features and labels in 56 channels (22 for PSSM, 22 for amino acid sequence, 2 for terminals, 2 for solvent accessibility labels, 8 for secondary structure labels). Each training sequence includes 700 amino acids. This length is considered to provide a good balance between efficiency and coverage, as the majority of protein chains are shorter than 700 amino acids. When training and testing, shorter sequences (fewer than 700 amino acids) are padded with 0.

Throughout our experiments, only PSSM features (22) and amino acid sequence features (22) were used for the 8-state prediction. We also removed the ‘Noseq’ channel as a convention [17]. So the input profiles in our experiment consist of 50 channels (21 for PSSM, 21 for the amino acid sequence, 8 for secondary structure labels).
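The zero-padding of shorter chains to 700 residues can be sketched as below. The 700-residue length and the 42 input channels (21 PSSM plus 21 amino acid) come from the text; the function name, the boolean mask that marks real residues, and the choice to raise an error for longer chains are assumptions added for illustration.

```python
import numpy as np

MAX_LEN = 700          # sequence length used in the text
N_INPUT_CHANNELS = 42  # 21 PSSM + 21 amino acid sequence channels


def pad_features(features, max_len=MAX_LEN):
    """Zero-pad a (length, channels) feature array to (max_len, channels).

    Also returns a boolean mask (an assumption, not described in the paper)
    marking which positions hold real residues.
    """
    length, channels = features.shape
    if length > max_len:
        raise ValueError(f"sequence of length {length} exceeds {max_len}")
    padded = np.zeros((max_len, channels), dtype=features.dtype)
    padded[:length] = features
    mask = np.zeros(max_len, dtype=bool)
    mask[:length] = True
    return padded, mask


# Toy chain of 5 residues with constant features.
chain = np.ones((5, N_INPUT_CHANNELS))
padded, mask = pad_features(chain)
print(padded.shape, int(mask.sum()))
```

A mask of this kind is one common way to keep padded positions out of the accuracy computation, since an all-zero position has no true secondary structure label.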

4.2. Training setup

4.2.1. Networks setup

In the encoder and decoder framework, ReLU layers with 50–300 units are used as feed-forward layers. GRU layers always contain 100–300 gated recurrent units. In the B-GRU layer, the outputs from the forward and backward layers are concatenated into a single vector to be used as input to the following layers.

Weights of the SAE were initialized in a greedy layer-wise manner, in which each layer maps its inputs back to themselves. Next, the other weights in each layer are sampled uniformly between −0.05 and 0.05 and biases are initialized at 0. The initial hidden states of the GRU layers ($\tilde{h}$, $h$) are all set to 0 and updated during the training procedure. We use 50% dropout at the B-GRU and GRU layers to avoid the over-fitting problem.

A ReLU activation function [33] is used for all the feed-forward layers other than the output layer. When training, a softmax activation function is used on the final classification layer and a sigmoid activation function is used on the middle extra output layer, as shown in Fig. 6. The learning rate for the SAE stage is 0.05. When training the whole encoder–decoder network, the Adam algorithm, which only requires first-order gradients and little memory, is applied to control the parameter updates with the default settings ($\alpha = 0.001$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\varepsilon = 10^{-8}$). Here, $\alpha$ is the learning rate, $\beta_1$ and $\beta_2$ are parameters for the exponential moving averages of the gradient and the squared gradient, respectively, and $\varepsilon$ is a small parameter used to avoid singularities associated with a zero denominator. We simply set $\gamma$ to 1 for an equal contribution from the two output layers in Eq. (13).

All parameters were trained globally by the Adam algorithm with the final cost function in Eq. (13). The maximum number of training epochs is 100 and the batch size is 40 in our experiment. All the training procedures are implemented in Python based on the Theano and Keras libraries. The training procedure was executed on Nvidia Tesla K40 GPUs.


Table 1
Prediction accuracies for SSREDNs with different architectures.

| Model | Q8 accuracy (%) | Segment overlap score |
|---|---|---|
| 3SAE, 1BGRU, 2FNN, 1GRU | 71.84 | 77.17 |
| 5SAE, 1BGRU, 2FNN, 1GRU | 70.51 | 76.70 |
| 3SAE, 2BGRU, 2FNN, 2BGRU | 72.51 | 77.95 |
| 3SAE, 2BGRU, 1GRU, 2FNN, 1BGRU | 72.36 | 77.64 |
| 3SAE, 2BGRU, 2FNN, 1BGRU | 73.14 | 78.20 |

Table 2
Comparison of the performance with and without pre-training using the SAE.

| Model (3SAE, 2BGRU, 2FNN, 1BGRU) | Q8 accuracy | Segment overlap score |
|---|---|---|
| Without pre-train | 72.13 | 77.12 |
| Pre-train using SAE | 73.14 | 78.20 |

4.2.2. Architecture analysis

We tried a set of various network architectures to find a suitable model for the protein SS prediction problem. Their performances are shown in Table 1. The segment of overlap (SOV) score, which can be interpreted as SS segment-based accuracy, is also used here to evaluate the performance of different network architectures. Empirically, we found that a 3-layer SAE is a suitable choice. The B-GRU layers always obtain better results than the standard GRU layers during the training process. This verifies that the bidirectional structure can better capture the interaction relations of residues in the protein chain in both the encoder part and the decoder part. Besides, the recurrent layer also affects the training results (row 4 vs 5 in Table 1). The optimal structure (in row 5) was found to be: {300-256-128 Stacked Auto-encoder} – {256 Bidirectional GRU layers (0.5 dropout)} ∗ 2 – {256-128 ReLU layers} – {64 ReLU Rep layer} – {128 Bidirectional GRU layer (0.5 dropout)} – {256-128 ReLU layers} – {8 sigmoid layer}. Furthermore, an {8 sigmoid layer} with MSE cost function was added on the {64 ReLU Rep layer}. It obtained 73.14% Q8 accuracy and a 78.20% SOV score on the CullPDB testing (272) set.

4.2.3. Stack auto-encoder

Feature learning is a very important process for SS prediction. The discovery of good features may benefit the prediction process and improve the prediction accuracy. An SAE architecture is used after the input layer in our model to extract better features from the input protein features automatically. After pre-training, the SAE part has learned a robust representation from the input features. It is also fine-tuned when training the whole network. To confirm whether this pre-training strategy using the SAE is really helpful for improving the prediction accuracy, we trained a model on the CullPDB training set with and without the pre-training strategy and tested it on the CullPDB testing set. The result is shown in Table 2. Pre-training improves the final Q8 accuracy from 72.13% to 73.14% and the SOV score from 77.12% to 78.20%.

4.3. Performance

4.3.1. Performance on testing set

The prediction sensitivities, precisions and frequencies for individual secondary structure states of the CullPDB testing set with 272 sequences are shown in Table 3. Compared with [17], for the four major states, H, E, L and T, our SSREDNs method improves the prediction sensitivity and precision. The improvement for the prediction of state T, i.e., the β turn prediction, which depends on long-range inter-residue interactions, indicates that the SSREDNs model can learn long-range structure features from the input protein chain sequence. Predictions for the states G, S and B are difficult because of their low frequencies, and the SSREDNs also makes better predictions for them compared with [17] in terms of prediction sensitivity. The lower prediction precisions for states G and B are accounted for by their relative rarity in the training set compared with our test set (3.1 vs 4.1, 0.2 vs 1.1). On the other hand, SSREDNs is essentially a data-driven model, and it can learn the complex interdependencies of residues from a large number of training sequences. The SSREDNs has some limitations in predicting the rare states G, B and I (3₁₀-helix, β bridge and π helix) due to their low frequencies (only around 5% in total) in the training sequences. These states will also be a focus of future work.

Table 3

Performance of individual secondary structure states on the CullPDB testing set.

| Sec. | Sensitivity | Precision | Frequency | Description |
|------|-------------|-----------|-----------|-------------|
| H | 94.11/93.5 | 86.96/82.8 | 35.8/35.4 | α-helix |
| E | 84.52/82.3 | 78.07/74.8 | 24.1/21.8 | β-strand |
| L | 64.01/63.3 | 58.24/54.1 | 21.3/18.6 | loop |
| T | 53.97/50.6 | 57.96/54.8 | 10.1/11.1 | β turn |
| S | 28.29/15.9 | 49.13/42.3 | 5.0/7.9 | bend |
| G | 35.99/13.3 | 40.48/49.6 | 3.4/4.1 | 3₁₀-helix |
| B | 7.4/0.1 | 45.63/50 | 0.2/1.1 | β bridge |
| I | –/– | –/– | 0/0 | π helix |

Fig. 7. Model performance during the training process on the CullPDB dataset. (a) The curve converges to a stable state within 20 epochs on the CullPDB training dataset. (b) The whole performance curve of SSREDNs within 100 training epochs.
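For readers who want to reproduce per-state figures of the kind reported in Table 3, the following sketch computes sensitivity, precision and frequency for each state from a pair of label strings. It uses the standard definitions (sensitivity = TP / true count, precision = TP / predicted count), which are assumed to match those used for Table 3; the two label strings are made-up toy data and `per_state_metrics` is a hypothetical helper name.

```python
from collections import Counter

def per_state_metrics(true_states, pred_states):
    """Per-state sensitivity, precision and frequency for aligned label strings."""
    assert len(true_states) == len(pred_states)
    tp, true_count, pred_count = Counter(), Counter(), Counter()
    for t, p in zip(true_states, pred_states):
        true_count[t] += 1
        pred_count[p] += 1
        if t == p:
            tp[t] += 1
    n = len(true_states)
    metrics = {}
    for s in sorted(set(true_count) | set(pred_count)):
        metrics[s] = {
            # None marks an undefined ratio (state never observed / never predicted).
            "sensitivity": tp[s] / true_count[s] if true_count[s] else None,
            "precision": tp[s] / pred_count[s] if pred_count[s] else None,
            "frequency": true_count[s] / n,
        }
    return metrics

# Toy labels (not real data): the single G residue is never predicted.
m = per_state_metrics("HHHHEELLTG", "HHHEEELLTT")
for state, values in m.items():
    print(state, values)
```

The toy example also shows why rare states are hard to score: a state that is never predicted has zero sensitivity and an undefined precision.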

4.3.2. Convergence

Fig. 7(a) shows the training accuracy for the CullPDB dataset. The SSREDNs show a powerful learning ability, always converging within 20 training epochs to a prediction accuracy of about 72%, equal to the best result in [17], which requires 300 epochs. Fig. 7(b) shows that SSREDNs smoothly converges to its optimal result on the validation set within about 70 training epochs, and this optimal model is used for the testing set in our experiments.

Table 4

Q8 accuracy on CullPDB and CB513.

| Model | Q8 accuracy, CB513 | Q8 accuracy, CullPDB |
|-------|--------------------|----------------------|
| GSN [17] | 66.4 | 72.1 |
| The optimal SSREDNs | 68.2 | 73.1 |

Table 5

Q8 and Q3 accuracy on datasets CB513 and CullPDB.

| Method | Q8, CB513 | Q8, CullPDB | Q3, CB513 | Q3, CullPDB |
|--------|-----------|-------------|-----------|-------------|
| SSpro | 63.5 | 66.6 | 78.5 | 79.5 |
| RaptorX-SS8 | 64.9 | 69.7 | 78.3 | 81.2 |
| PSIPRED | – | – | 79.2 | 82.5 |
| Ours | 68.2 | 73.1 | 82.9 | 84.2 |
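The model-selection step described in the convergence discussion, keeping the model from the epoch with the best validation result, can be sketched as follows. The accuracy history is invented for illustration and `best_epoch` is a hypothetical helper; the paper does not specify how ties or checkpoints were handled.

```python
def best_epoch(val_accuracy):
    """Return (1-based epoch, accuracy) of the best validation result.

    Ties are resolved in favour of the earliest epoch.
    """
    best_idx = max(range(len(val_accuracy)), key=lambda i: val_accuracy[i])
    return best_idx + 1, val_accuracy[best_idx]

# Invented validation accuracies, one value per training epoch.
history = [0.61, 0.68, 0.71, 0.72, 0.719, 0.724, 0.723]
epoch, acc = best_epoch(history)
print(epoch, acc)
```

In practice the weights saved at that epoch would then be evaluated once on the held-out testing set.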

4.3.3. Comparison with other methods

As shown in Table 4, we compared the Q8 accuracy of our method with the deep generative stochastic network (GSN) method in [17], which may be the best 8-state predictor. For this validation, we trained a model on a version of the CullPDB dataset that was filtered to remove sequences having homology with CB513 sequences (more than 25% identity). We also consider only the PSSM features and amino acid sequence features in the testing. With the optimal model, we achieve a Q8 accuracy of 68.2% on CB513 and 73.1% on CullPDB, which outperforms the best result of the GSN model.
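The homology filtering described above amounts to dropping every training sequence that is more than 25% identical to any test sequence. The sketch below shows only that filtering logic. The `naive_identity` function is a toy stand-in that compares residues position by position; real pipelines compute identity from a sequence alignment, and the tool used for the paper's filtering is not stated in this section. All sequences are made up.

```python
def naive_identity(a, b):
    """Toy identity: matching positions divided by the longer length.

    This is NOT an alignment-based identity; it only serves the example.
    """
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))

def filter_homologs(train, test, identity, threshold=0.25):
    """Keep training sequences whose identity to every test sequence is <= threshold."""
    return [s for s in train if all(identity(s, t) <= threshold for t in test)]

test_set = ["ACDEFGHK"]
train_set = ["ACDEFGHK", "ACWWWWWW", "ACDWWWWW", "WWWWWWWW"]
kept = filter_homologs(train_set, test_set, naive_identity)
print(kept)
```

Passing the identity function as an argument keeps the filter independent of how identity is actually measured.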

With the same network architecture and parameter settings, we also compare our method with the following publicly available programs: PSIPRED for 3-state SS prediction, and SSpro and RaptorX for both 8-state and 3-state SS prediction, on the CB513 and CullPDB datasets. The SSpro package is used without templates (i.e., not using a solved structure in PDB as a template). All the programs are run with their parameters set according to their respective papers. As listed in Table 5, our method outperforms the others, including the popular PSIPRED on Q3 prediction, and SSpro and RaptorX on both Q3 and Q8 prediction. On CullPDB and CB513, we obtain Q3 accuracies of 84.2% and 82.9% and Q8 accuracies of 73.1% and 68.2%, respectively.
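The Q8 and Q3 scores compared here are per-residue accuracies over eight and three states. The sketch below computes both from 8-state label strings. The 8-to-3 reduction used (H, G, I to helix; E, B to strand; everything else to coil) is one common convention and is an assumption, since other reductions exist and this section does not say which one was applied. The label strings are toy data.

```python
# One common 8-state to 3-state reduction (assumed, not confirmed by the paper).
Q8_TO_Q3 = {"H": "H", "G": "H", "I": "H",
            "E": "E", "B": "E",
            "T": "C", "S": "C", "L": "C"}

def accuracy(true_states, pred_states):
    """Percentage of positions where the predicted state equals the true state."""
    assert len(true_states) == len(pred_states)
    correct = sum(t == p for t, p in zip(true_states, pred_states))
    return 100.0 * correct / len(true_states)

def to_q3(states):
    """Map an 8-state label string to its 3-state reduction."""
    return "".join(Q8_TO_Q3[s] for s in states)

true8 = "HHHHGGEEBTSL"
pred8 = "HHHGGGEEETLL"

q8 = accuracy(true8, pred8)
q3 = accuracy(to_q3(true8), to_q3(pred8))
print(q8, q3)
```

Because confusions within a reduced class (for example H vs. G) stop counting as errors, Q3 is never lower than Q8 for the same predictions, which is consistent with the gap between the two columns of Table 5.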

5. Conclusion and future work

In this article, we proposed a deep recurrent encoder–decoder network and employed it to predict the secondary structure of residues in amino acid sequences. The combination of the encoder–decoder architecture and GRUs is well suited to modeling both the sequence-structure relationship between input protein features and SS, and the mutual interactions among residues. It also achieves better performance on both Q3 and Q8 accuracy. To further develop the learning ability of SSREDNs, however, a method must be developed for determining the architecture and parameters of the networks, such as layer type, layer size, optimization method and initialization. This is also a challenge for deep learning and machine learning in general. To further improve the learning ability of the present model, multi-task prediction may be an avenue worth pursuing, such as predicting both the secondary structure and the solvent accessible surface area.

Acknowledgment

This work was supported by the National Science Foundation of China (Grant nos. 61432012 and 61402306).

Supplementary material

Supplementary material associated with this article can be found, in the online version, at 10.1016/j.knosys.2016.11.015.

References

[1] X. Luo, Z. Ming, Z. You, S. Li, Y. Xia, H. Leung, Improving network topology-based protein interactome mapping via collaborative filtering, Knowl.-Based Syst. 90 (2015) 23–32.

[2] X. Lei, Y. Ding, H. Fujita, A. Zhang, Identification of dynamic protein complexes based on fruit fly optimization algorithm, Knowl.-Based Syst. 105 (2016) 270–277.

[3] J. Cheng, A.N. Tegge, P. Baldi, Machine learning methods for protein structure prediction, IEEE Rev. Biomed. Eng. 1 (2008) 41–49.

[4] J.K. Myers, T.G. Oas, Preorganized secondary structure as an important determinant of fast protein folding, Nat. Struct. Biol. 8 (6) (2001) 552–558.

[5] Z. Yang, I-Tasser server for protein 3D structure prediction, BMC Bioinform. 9 (3) (2008) 297–315.

[6] Z. Aydin, Y. Altunbasak, M. Borodovsky, Protein secondary structure prediction for a single-sequence using hidden semi-Markov models, BMC Bioinform. 7 (1) (2006) 178.

[7] X.D. Sun, R.B. Huang, Prediction of protein structural classes using support vector machines, Amino Acids 30 (4) (2006) 469–475.

[8] D.T. Jones, Protein secondary structure prediction based on position-specific scoring matrices, J. Mol. Biol. 292 (2) (1999) 195–202.

[9] G. Pollastri, D. Przybylski, B. Rost, P. Baldi, Improving the prediction of protein secondary structure in three and eight classes using recurrent neural networks and profiles, Proteins: Struct. Funct. Bioinform. 47 (2) (2002) 228–235.

[10] Z. Wang, F. Zhao, J. Peng, J. Xu, Protein 8-class secondary structure prediction using conditional neural fields, Proteomics 11 (19) (2011) 3786–3792.

[11] S. Babaei, A. Geranmayeh, S.A. Seyyedsalehi, Protein secondary structure prediction using modular reciprocal bidirectional recurrent neural networks, Comput. Methods Programs Biomed. 100 (3) (2010) 237–247.

[12] R. Salakhutdinov, G. Hinton, Deep Boltzmann machines, J. Mach. Learn. Res. 5 (2) (2009) 1967–2006.

[13] Y. Bengio, G. Mesnil, Y. Dauphin, S. Rifai, Better mixing via deep representations, in: Proceedings of International Conference on Machine Learning (ICML), 2012, pp. 552–560.

[14] G.E. Hinton, R.R. Salakhutdinov, Reducing the dimensionality of data with neural networks, Science 313 (5786) (2006) 504–507.

[15] M. Spencer, J. Eickholt, J. Cheng, A deep learning network approach to ab initio protein secondary structure prediction, IEEE/ACM Trans. Comput. Biol. Bioinform. 12 (1) (2015) 103–112.

[16] R. Heffernan, K. Paliwal, J. Lyons, A. Dehzangi, A. Sharma, J. Wang, A. Sattar, Y. Yang, Y. Zhou, Improving prediction of secondary structure, local backbone angles, and solvent accessible surface area of proteins by iterative deep learning, Sci. Rep. 5 (11476) (2015), doi:10.1038/srep11476.

[17] J. Zhou, O.G. Troyanskaya, Deep supervised and convolutional generative stochastic network for protein secondary structure prediction, in: Proceedings of International Conference on Machine Learning (ICML), 2014, pp. 745–753.

[18] R. Pascanu, T. Mikolov, Y. Bengio, On the difficulty of training recurrent neural networks, in: Proceedings of International Conference on Machine Learning (ICML), 2013, pp. 1310–1318.

[19] K. Cho, B.V. Merrienboer, D. Bahdanau, Y. Bengio, On the properties of neural machine translation: encoder-decoder approaches, arXiv preprint, arXiv:1409.1259 (2014).

[20] J. Chung, C. Gulcehre, K.H. Cho, Y. Bengio, Empirical evaluation of gated recurrent neural networks on sequence modeling, arXiv preprint, arXiv:1412.3555 (2014).

[21] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, P.A. Manzagol, Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion, J. Mach. Learn. Res. 11 (6) (2010) 3371–3408.

[22] C.D. Huang, S.F. Liang, C.T. Lin, R.C. Wu, Machine learning with automatic feature selection for multi-class protein fold classification, J. Inform. Eng. 21 (4) (2005) 711–720.

[23] G.E. Dahl, T.N. Sainath, G.E. Hinton, Improving deep neural networks for LVCSR using rectified linear units and dropout, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013, pp. 8609–8613.

[24] D. Kingma, J. Ba, Adam: a method for stochastic optimization, in: Proceedings of International Conference on Learning Representations (ICLR), 2015, pp. 254–269.

[25] M. Hermans, B. Schrauwen, Training and analysing deep recurrent neural networks, in: Proceedings of Advances in Neural Information Processing Systems (NIPS), 2013, pp. 190–198.

[26] Q. You, Y.J. Zhang, A new training principle for stacked denoising autoencoders, in: Proceedings of International Conference on Image and Graphics, 2013, pp. 384–389.

[27] R. Hecht-Nielsen, Theory of the backpropagation neural network, Neural Netw. 1 (1) (1988) 65–93.

[28] M. Schuster, K.K. Paliwal, Bidirectional recurrent neural networks, IEEE Trans. Signal Process. 45 (11) (1997) 2673–2681.

[29] P.J. Werbos, Backpropagation through time: what it does and how to do it, Proc. IEEE 78 (10) (1990) 1550–1560.

[30] J. Duchi, E. Hazan, Y. Singer, Adaptive subgradient methods for online learning and stochastic optimization, J. Mach. Learn. Res. 12 (7) (2011) 257–269.

[31] W. Guoli, R.L. Dunbrack Jr, Pisces: a protein sequence culling server, Bioinformatics 19 (12) (2003) 1589–1591.

[32] C.N. Magnan, P. Baldi, Sspro/accpro 5: almost perfect prediction of protein secondary structure and relative solvent accessibility using profiles, machine learning and structural similarity, Bioinformatics 30 (18) (2014) 2592–2597.

[33] X. Glorot, A. Bordes, Y. Bengio, Deep sparse rectifier neural networks, in: Proceedings of International Conference on Artificial Intelligence and Statistics (AISTATS), 2011, pp. 315–323.
