Are grammatical representations useful for learning from biological sequence data?— a case study

Muggleton, SH, Bryant, CH, Srinivasan, A, Whittaker, A, Topp, S and Rawlings, C

http://dx.doi.org/10.1089/106652701753216512

Type

Article

URL

This version is available at: http://usir.salford.ac.uk/33399/

Published Date

2001

S.H. Muggleton†, C.H. Bryant‡
Department of Computer Science, University of York, York YO10 5DD, UK.

A. Srinivasan
Oxford University Computing Laboratory, Wolfson Building, Parks Road, Oxford OX1 3QD, UK.

A. Whittaker§, S. Topp, C. Rawlings¶
SmithKline Beecham, New Frontiers, Science Park, Third Avenue, Harlow, Essex CM19 5AW, UK.

Keywords: Bioinformatics, Machine Learning, Inductive Logic Programming, Cost Function, Grammatical Inference

* A brief account of part of this work was published in the proceedings of the seventeenth international conference on Machine Learning (Stanford, 29 June - 2 July 2000).

† Current address: Department of Computing, Imperial College of Science, Technology and Medicine, 180 Queen's Gate, London SW7 2BZ.

‡ Address correspondence to: School of Computer and Mathematical Sciences, The Robert Gordon University, Aberdeen, AB25 1HG, Scotland, UK.

§ Current address: Psygnosis Ltd., 37, Kentish Town Road, London NW1 8NX, UK.


This paper investigates whether Chomsky-like grammar representations are useful for learning cost-effective, comprehensible predictors of members of biological sequence families. The Inductive Logic Programming (ILP) Bayesian approach to learning from positive examples is used to generate a grammar for recognising a class of proteins known as human neuropeptide precursors (NPPs). Collectively, five of the co-authors of this paper have extensive expertise on NPPs and general bioinformatics methods. Their motivation for generating a NPP grammar was that none of the existing bioinformatics methods could provide sufficient cost-savings during the search for new NPPs. Prior to this project, experienced specialists at SmithKline Beecham had tried for many months to hand-code such a grammar but without success. Our best predictor makes the search for novel NPPs more than 100 times more efficient than randomly selecting proteins for synthesis and testing them for biological activity. As far as these authors are aware, this is both the first biological grammar learnt using ILP and the first real-world scientific application of the ILP Bayesian approach to learning from positive examples.


This paper attempts to answer, by way of a case-study, the question of whether grammatical representations are useful for learning from biological sequence data. We address the question with experimental results that significantly contradict the following null hypothesis.

Null hypothesis: The most cost-effective, comprehensible multi-strategy predictors of human neuropeptide precursors do not employ a context-free definite-clause-grammar.

Multi-strategy learning (Michalski & Wnek, 1997) aims at integrating multiple strategies in a single learning system, where strategies may be inferential (e.g. induction, deduction etc) or computational. Computational strategy is defined by the representational system and the computational method used in the learning system (e.g. decision tree learning, neural network learning etc).

A grammar for a language tells us whether a sentence is properly formed. Noam Chomsky, a founder of formal language theory, provided an initial classification of language types. Those readers requiring an introduction to formal grammars or this hierarchy are referred to (Linz, 1996).

We obtain results which significantly contradict the null hypothesis as follows. A grammar is generated for a particular class of biological sequences. A group of features is derived from this grammar. Other groups of features are derived using other learning strategies. Amalgams of these groups are formed. A recognition model is generated for each amalgam using C4.5 and C4.5rules. The results significantly contradict the null hypothesis

because:

1. the best performance achieved using any of the models which include grammar-derived features is higher than the best performance achieved using any of the models which do not include the grammar-derived features;

more comprehensible than the best `non-grammar' model.

Performance is measured using a new cost function, Relative Advantage (RA). Appendix A defines RA and explains why it is used in preference to other performance measures. A method of estimating the RA of a recognition model is presented which subsequently allows the statistical significance of the difference between the RA of two models to be gauged.

The domain of the case study is the recognition of a class of proteins known as human neuropeptide precursors (NPPs). These proteins have considerable therapeutic potential and are of widespread interest in the pharmaceutical industry (see Section 3). Our best multi-strategy predictor of NPPs employs a context-free definite-clause-grammar.

An Inductive Logic Programming (ILP) (Muggleton & Raedt, 1994) system is used to generate a grammar for NPPs. As far as these authors are aware, this is the first attempt to generate a grammar for a biological domain using ILP. ILP is the area of Artificial Intelligence which deals with the induction of hypothesised predicate definitions from examples and background knowledge. Logic programs are used as a single representation for examples, background knowledge and hypotheses. For a recent overview of ILP issues and results see (Muggleton, 1999).


Research in the biological and medical sciences is being transformed by the volume of data coming from projects which will reveal the entire genetic code (genome sequence) of Homo sapiens as well as other organisms. Once complete, these projects should help us understand the genetic basis of human disease. The growth in the volume of data and improvements in software for interpreting this information have increased interest in the use of computational methods for identifying genes involved in human disease (Rawlings & Searls, 1997). Knowing the genes implicated in a disease identifies the proteins that they code for and possibly suggests the biochemical processes that may be influencing the development of the disease. This information is crucial for the generation of the experimental reagents needed for the development of new drugs and explains the widespread investment by the biotechnology and pharmaceutical industries in bioinformatics staff and technologies (Lyall, 1996; Spence, 1998).


information processing infrastructure in both commercial and academic research sectors. Current state of the art computational sequence interpretation methods are not capable of solving the sequence to function problem for all new proteins, and the development of new techniques is still required.

A significant challenge in the analysis and interpretation of genetic sequence data is therefore the accurate recognition of patterns within the data that are diagnostic for known structural or functional features within the protein. The language of genes is written in a simple alphabet {A, C, G, T} representing the four DNA base codes. The language of proteins uses a twenty character alphabet {A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y} representing amino acid residues. These residues are encoded in genes by successive DNA base triplets. At their simplest, these patterns can be described as regular expressions. Many features can be described through the use of regular expressions and a database PROSITE (Bairoch et al., 1997) is available in which these patterns are curated. A PROSITE pattern such as:

[AC]-x-V-x(4)-{ED}

is translated as:

[A or C]-any-V-any-any-any-any-{any but E or D}

A more extensive example of a PROSITE pattern is that for the family of proteins called short-chain dehydrogenases (enzymes involved in cell metabolism). The PROSITE pattern that includes two perfectly conserved residues, a tyrosine (Y) and a lysine (K), is:

[LIVSPADNK]-x(12)-Y-[PSTAGNCV]-[STAGNQCIVM]-[STAGC]-K-{PC}-[SAGFYR]-[LIVMSTAGD]-x(2)-[LIVMFYW]-x(3)-[LIVMFYWGAPTHQ]-[GSACQRHM]
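The translation of a PROSITE pattern into a regular expression can be sketched in a few lines of Python. This is a minimal illustration and not part of the original study. It assumes only the subset of PROSITE syntax that appears in the two patterns above: single residues, x for any residue, [..] for allowed residues, {..} for excluded residues, and (n) or (n,m) repeat counts. The function name prosite_to_regex is our own.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern (subset only) into a Python regex."""
    parts = []
    for element in pattern.split("-"):
        # Split an element such as "x(4)" into its core "x" and repeat count "4".
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, lo, hi = m.group(1), m.group(2), m.group(3)
        if core == "x":
            regex = "."                       # any residue
        elif core.startswith("{") and core.endswith("}"):
            regex = "[^" + core[1:-1] + "]"   # any residue except those listed
        else:
            regex = core                      # "[AC]" or a single residue such as "V"
        if lo is not None:
            regex += "{" + lo + ("," + hi if hi else "") + "}"
        parts.append(regex)
    return "".join(parts)

short_pattern = prosite_to_regex("[AC]-x-V-x(4)-{ED}")
print(short_pattern)
```

Under these assumptions the short pattern becomes [AC].V.{4}[^ED], which can be passed directly to re.search to scan a protein sequence written in the twenty-letter alphabet.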


plemented with well-established methods for training models from examples. Training regimes generally (but not always) require that the sequences in the training set be arranged so that those regions of the sequence that have been conserved through evolution are aligned in the same column. The accurate multiple alignment of biological sequences has been the subject of much research and discussion (Doolittle, 1996), and is considered by many to be a solved problem. However, it relies on the assumption that the sequences to be aligned show some homology at the sequence level with each other. Complex biological signals also require complex models and it is often the case that considerable expertise is required in the selection of the optimal neural network architecture or hidden Markov model before training can take place.

A general linguistic approach to representing the structure and function of genes and proteins has intrinsic appeal as an alternative approach to probabilistic methods because of the declarative and hierarchical nature of grammars. Searls (Searls, 1993) has undertaken the most thorough analysis of the linguistic classification of genetic grammars starting with the Definite Clause Grammar (DCG). Searls proposes a String Variable Grammar (SVG) extension to a DCG to provide features necessary for representing higher-order interactions among genetic sequence elements found in nucleic acids such as non-linear features found in RNA pseudoknots and other secondary structures formed as a result of internal nucleic acid base-pairing.


Neuropeptides are an important group of short proteins that act as neurotransmitters mediating the passage of signals within the central nervous system (CNS) and between the CNS and the rest of the body. The term neuropeptide was first introduced in 1971 by D. de Wied (Klavdieva, 1995) to describe fragments of hormones that produced behavioural changes when injected, but lacked the activity of the intact hormone. More recently the term has been accepted to cover peptides united by a number of common features, including their tissue expression (brain, nervous tissue, secretory cells from organs such as gut, heart, lungs, placenta etc), metabolism, secretion, biosynthesis and high potency (Klavdieva, 1995).

Drug molecules work by interacting with target sites within the body. These sites commonly are protein molecules, either enzymes or receptors. By interaction with these protein molecules, drugs can modulate their actions and generally suppress undesirable biochemical reactions. Neuropeptides exert their biological actions through binding as ligands to specific receptors. The term ligand is used for molecules which bind to the target site. (A ligand might be highly active against the target, but not a `drug', because of a lack of other required properties such as metabolic stability or safety.)

Active research has increased the number of mammalian neuropeptides from about 18 in 1978 to more than 80 by 1999. However, despite all these efforts, the biology of many of these neuropeptides as well as their interactions with their receptors remain to be elucidated. The receptors of some neuropeptides have not yet been identified and there are some orphan receptors with as yet unidentified ligands. The in-vitro pairing of a novel receptor with its ligand is a critical first step in understanding the mechanism of disease. Thus novel neuropeptides and their orphan receptors have considerable therapeutic potential and are of widespread interest in the pharmaceutical industry.


grammatic representation of several other precursors is shown in Figure 2. Precursors may contain either a single neuropeptide, multiple copies of the same neuropeptide or several different neuropeptides. These can occur consecutively in the precursor or can be separated by large stretches of filler peptide which is believed to play a purely structural role. Neuropeptide precursors contain a short prefix of residues called a signal peptide of about 20-30 amino acids (aa) in length. The known precursors range in length from 70 to 600 aa, and the cleaved peptides range from 3-200 aa. It is this huge variation in length, sequence and internal organisation that makes neuropeptide precursors difficult to use when searching for novel remote homologues using sequence database searching methods (e.g. BLAST). They also confound typical multiple sequence alignment methods used to identify conserved features among functionally related sequences.

Put Fig2 here

Many proteins are cleaved and trimmed after synthesis. For example, digestive enzymes are synthesised as inactive precursors that can be stored safely in the pancreas. After being released into the intestine, these precursors become activated by cleavage. Neuropeptide precursors undergo this `splitting' process. The signal peptide targets the protein for secretion through a cell membrane where it is then cleaved from the precursor. The remainder of the precursor is further cleaved to release the neuropeptide. Within the precursor, the locations of the cleavages of the signal sequence and the neuropeptides are referred to as cleavage sites.


This section describes an experiment whose results significantly contradict the null hypothesis (see Section 1). The section begins by describing the materials (data, background knowledge and machine learning systems) used in the experiment. This is followed by an account of the three steps of the experimental method. Finally the section ends with the presentation and analysis of the results.

4.1 Materials

4.1.1 Data

The data was taken from the SWISS-PROT database (Bairoch & Apweiler, 2000). SWISS-PROT is an annotated protein sequence database established in 1986 and maintained, with collaborators, by the Department of Medical Biochemistry of the University of Geneva. It can be accessed at

http://www.expasy.ch/sprot/sprot-top.html.

Our data-set comprises a subset of positives i.e. known NPPs and a subset of randomly-selected sequences. It is not possible to generate a large, unbiased set of negative examples because there will be proteins which have yet to be recognised scientifically as a NPP. The characteristics of the two subsets of sequences are as follows.

Positives This subset contains all of the 44 known NPP sequences that were in SWISS-PROT in Spring 1997, the time the data-set was prepared (see Table 8). The SWISS-PROT identifiers of these 44 sequences are listed in Tables 1 and 2.

Put Tab1 and Tab2 here


These sequences are unrelated by sequence homology to the remaining 34.

Randoms This subset contains all of the 3910 full length human sequences in SWISS-PROT in Spring 1997. 1000 of the 3910 randoms were reserved for the test-set.

The data-set is available at

ftp://ftp.cs.york.ac.uk/pub/aig/Datasets/neuropeps/

4.1.2 Machine Learning Systems

The propositional learning was performed using the decision-tree learner C4.5 (Release 8) in conjunction with the companion program C4.5rules that constructs rules from a tree built by C4.5 (Quinlan, 1993). The grammar learning was performed using CProgol (Muggleton, 1995) version 4.4 which is available from

ftp://ftp.cs.york.ac.uk/pub/MLGROUP/progol4.4.

4.1.3 Background Knowledge

During both the generation of the grammar using CProgol and the generation of propositional rule-sets using C4.5 and C4.5rules we adopt the background information used in (Muggleton et al., 1992) to describe physical and chemical properties of the amino acids (see Table 3).

Put Tab3 here

4.2 Method

The method may be summarised as follows:

1. A grammar is generated for NPP sequences using CProgol (see Section 4.2.1).

amalgam using C4.5 and C4.5rules and its performance is measured using Mean RA. This is a new cost function which is described in Appendix A. The null-hypothesis (see Section 1) is then tested by comparing the Mean RA achieved from the various amalgams. (See Section 4.2.3).

4. A hidden Markov model (HMM) is generated for NPP sequences and its Mean RA is measured (see Section 4.2.4).

4.2.1 Grammar Generation

A NPP grammar contains rules that describe legal neuropeptide precursors. Figure 3 shows an incomplete example of such a grammar, written as a Prolog program. This section describes how production rules for signal peptides and neuropeptide starts, middle-sections and ends were generated using CProgol. These were used to complete the context-free definite-clause-grammar structure shown in Figure 3. The start and end represent cleavage sites and the middle-section represents the mature neuropeptide i.e. what remains after cleavage has taken place.

Put Fig3 here

The production rules to be learnt by CProgol contain dyadic predicates of the form p(X,Y), which denote that property p begins the sequence X and is followed by a sequence Y. To learn such rules from the training-set, CProgol was provided with the following extensional definitions:

Precursor data. Using details of the start and finishing positions for signal peptides and neuropeptides it was possible to generate examples of non-terminals as below:

signalpep(S,[]) where S is a list of precursor residues constituting the signal peptide.

neuropeptide. The starting residue for S is the first residue after the end of the sequence for start/2 above. The end of S was taken to be 3 residues from the last position of the neuropeptide.

end(S,[]) where S is a list of 3 precursor residues constituting the end of a neuropeptide. S commences with the residue after the end of the sequence for middle/2 above.

Random data. One random example for each of signalpep/2, start/2, middle/2 and end/2 was generated from each sequence in the set of random sequences. Random examples are distinguished by the prefix *.

*signalpep(S,[]) where S is a list of residues starting at the first position in the sequence. The length of S is obtained from drawing randomly from the distribution of signal peptide lengths of NPPs in the training data.

*start(S,[]) where S is a pair of sequence residues. The starting residue is randomly chosen, and is ensured not to conflict with the sequence chosen for *signalpep/2 above.

*middle(S,[]) where S is a list of residues starting after the end of the sequence for *start/2 above. The length of S is obtained by drawing randomly from the lengths of neuropeptide middle-sections of precursors in the training data.

*end(S,[]) where S is a list of 3 residues starting at the end of the sequence terminating the definition of *middle/2 above.
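The generation of the four random examples from one random sequence can be sketched as follows. This is an illustrative reading of the description above, not the procedure actually used. In particular we assume that "not to conflict" means the start pair may not overlap the chosen signal peptide, that lengths are drawn uniformly from the observed training lengths, and that a sequence too short to hold all four pieces is skipped. The function and parameter names are hypothetical.

```python
import random

def random_examples(seq, signal_lengths, middle_lengths, rng):
    """Cut (signalpep, start, middle, end) slices out of one random sequence.

    signal_lengths and middle_lengths are the lengths observed for NPPs in the
    training data (assumption: sampled uniformly from these lists).
    Returns None if the sequence is too short to hold all four pieces.
    """
    sig_len = rng.choice(signal_lengths)
    mid_len = rng.choice(middle_lengths)
    # Last position at which the start pair can begin and still leave room
    # for the 2-residue start, the middle-section and the 3-residue end.
    latest_start = len(seq) - (2 + mid_len + 3)
    if latest_start < sig_len:
        return None
    # Assumption: the start pair must begin after the signal peptide slice.
    start_pos = rng.randint(sig_len, latest_start)
    mid_pos = start_pos + 2
    end_pos = mid_pos + mid_len
    return (seq[:sig_len],             # *signalpep: from the first position
            seq[start_pos:mid_pos],    # *start: a pair of residues
            seq[mid_pos:end_pos],      # *middle: follows *start
            seq[end_pos:end_pos + 3])  # *end: 3 residues after *middle
```

Each returned slice would then be written out as a *-prefixed fact such as *signalpep(S,[]).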

CProgol was provided with definitions of the non-terminals star/2 and run/3 (see Table 14). star/2 represents some sequence of unnamed residues whose length is not specified. run/3 represents a run of residues which share a specified property.
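The reading of p(X,Y) as "p begins X and is followed by Y" can be mimicked outside Prolog by functions that take a sequence X and return every possible remainder Y. The sketch below is only an analogy under our own assumptions. The actual definitions of star/2 and run/3 are those in Table 14, the POSITIVE set is an illustrative stand-in for a property from Table 3, and run is assumed here to consume a maximal non-empty run.

```python
POSITIVE = set("KR")  # assumption: illustrative property set, not taken from Table 3

def terminal(residue):
    """p(X, Y): X begins with `residue`; Y is the rest of X."""
    def parse(x):
        return [x[1:]] if x[:1] == residue else []
    return parse

def star(x):
    """Analogue of star/2: skip any number of unnamed residues."""
    return [x[i:] for i in range(len(x) + 1)]

def run(prop):
    """Analogue of run/3: consume a run of residues sharing property `prop`."""
    def parse(x):
        i = 0
        while i < len(x) and x[i] in prop:
            i += 1
        return [x[i:]] if i > 0 else []  # assumption: maximal, non-empty run
    return parse

def seq(*parsers):
    """Chain parsers so that each one starts where the previous one stopped."""
    def parse(x):
        remainders = [x]
        for p in parsers:
            remainders = [y for r in remainders for y in p(r)]
        return remainders
    return parse

# A rule body "star, K, R" applied to a short sequence: what is left after KR?
print(seq(star, terminal("K"), terminal("R"))("AAKRGG"))
```

A sequence is accepted by a rule when the final list of remainders is non-empty, which corresponds to the Prolog goal succeeding for some Y.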


important: KR, GKR and GRR. The subsequences KR and GKR are established proteolytic cleavage sites found in NPPs; GRR is a relatively common alternative cleavage site to GKR. Pilot experiments suggested that the following patterns may be significant:

K, positive;

positive, positive;

Y, very hydrophobic;

hydrophilic, a gap of some residues, M, negative;

HP;

WMDF.

All these subsequences and patterns were coded as Prolog predicates and included as background knowledge (see Table 14).

Other pilot experiments on the training data showed that the accuracy of CProgol's grammar was higher with certain restrictions on the length of NPPs, signal peptides and neuropeptides. Specific constraints were obtained by progressively checking the following: all lengths less than the mean length on training data; lengths that are within 1, 2, ... standard deviations of the mean on training data. This resulted in the following additional restrictions: (1) NPP lengths not to exceed 200 residues; (2) signal peptide lengths to be between 19 and 29 residues; and (3) middle-sections of neuropeptides lengths to be between 4 and 52 residues. These constraints only affect the values of features derived from the grammar. They do not constrain the value of the sequence length feature described at the end of Section 4.2.2.
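As a small illustration of the length restrictions, the sketch below lists candidate ranges of the form "within k standard deviations of the mean" and checks the three final restrictions. It is not the code used in the pilot experiments. The function names are hypothetical, the ranges are assumed to be symmetric about the mean, and the bounds 19-29 and 4-52 are assumed to be inclusive.

```python
from statistics import mean, stdev

def candidate_ranges(lengths, max_k=3):
    """Ranges within 1, 2, ..., max_k standard deviations of the mean length."""
    m, s = mean(lengths), stdev(lengths)
    return [(m - k * s, m + k * s) for k in range(1, max_k + 1)]

def satisfies_length_restrictions(npp_len, signal_len, middle_len):
    """The three restrictions adopted for the grammar-derived features."""
    return (npp_len <= 200               # (1) NPP length not to exceed 200
            and 19 <= signal_len <= 29   # (2) signal peptide length
            and 4 <= middle_len <= 52)   # (3) neuropeptide middle-section length
```

Because the restrictions only affect the grammar-derived features, a sequence that fails this check would still keep its true value for the separate sequence length feature.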


The grammar features Predictions about a NPP sequence can be made by parsing it using the NPP grammar. The values of the features shown in Table 4 were obtained by such parses. Note that whenever the grammar predicts that a sequence is not a NPP, all of the features are assigned the value zero.

Put Tab4 here

The SIGNALP features Each feature in this group is a summary of the result of using the SIGNALP program on a sequence. The SIGNALP program (Nielsen et al., 1997) represents the pre-eminent automated method for predicting the presence and location of N-terminal signal peptides. SIGNALP is available on the web at

http://www.cbs.dtu.dk/services/SignalP.

The technique used combines the predictions of two different neural network groups: one that recognises cleavage sites, and the other that identifies signal peptides. When provided with a sequence of N-terminal residues, the following are reported as summaries: (a) C scores: which consist of the maximum value of the score from the cleavage-recogniser, the position in the sequence where this value is achieved, and a nominal `y' or `n' denoting the answer to whether a cleavage site is present; (b) S scores: the corresponding values from the signal-peptide recogniser; (c) Y scores: a score that combines the C and S scores; and (d) Mean scores: a mean of the S-score and S-conclusions from the N-terminal end to the predicted cleavage site. For the experiments here, SIGNALP was provided with 50 amino acids from the N-terminal for each sequence. The summaries were extracted and represented by 11 features shown in Table 5.

Put Tab5 here


erties shown in Table 3.

The sequence length feature This feature is the length of the sequence. In the remainder of this paper this feature will be referred to as length.

4.2.3 Propositional Learning

The training and test data sets for C4.5 were prepared as follows.

1. Recall from Section 4.1.1 that our data comprises 44 positives and 3910 randoms. 40 of the 44 positives occur in the set of 3910 randoms. As C4.5 is designed to learn from a set of positives and a set of negatives, these 40 positives were removed from the set of randoms. Of the 40 positives which are in the set of randoms, 10 are in the test-set. Hence the set of (3910 - 40) sequences were split into a training-set of (2910 - 30 = 2880) and a test-set of (1000 - 10 = 990).

2. Values of the features were generated for each training and test sequence. Each sequence was represented by a data vector comprised of these feature values and 1 class value (`1' to denote a NPP and `0' otherwise).

3. Finally to ensure that there were as many `1' sequences as `0' sequences a training-set of 2880 NPPs was obtained by sampling with replacement. Thus the training data-set input to C4.5 comprised (2 × 2880) examples. (No re-adjusting was done on the test data.)
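The bookkeeping in steps 1 and 3 can be checked with a short sketch. The counts are those given in the text; the helper name oversample_positives, the placeholder identifiers and the use of uniform sampling are our own assumptions about how sampling with replacement might be carried out.

```python
import random

def oversample_positives(positives, n_target, rng):
    """Sample with replacement until there are n_target positive examples."""
    return [rng.choice(positives) for _ in range(n_target)]

n_randoms, n_test_randoms = 3910, 1000
positives_in_randoms, positives_in_test = 40, 10

# Step 1: remove the known positives from the randoms to obtain the '0' class.
n_train_neg = (n_randoms - n_test_randoms) - (positives_in_randoms - positives_in_test)
n_test_neg = n_test_randoms - positives_in_test

# Step 3: balance the classes in the training-set only.
train_positive_ids = [f"npp_{i}" for i in range(34)]  # placeholder identifiers
rng = random.Random(0)
balanced_positives = oversample_positives(train_positive_ids, n_train_neg, rng)
n_training_examples = len(balanced_positives) + n_train_neg

print(n_train_neg, n_test_neg, n_training_examples)
```

The test-set is left unbalanced, so its class proportions remain those of the 10 positives and 990 other sequences.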

Amalgams of the feature groups described in the previous section were formed. The amalgams are listed in Table 6. The following procedure was followed for each one:

1. training and test sets were prepared as described above;

2. a decision tree was generated from the training-set using C4.5;

rule-set on the test-set;

5. Mean RA was estimated as described in Appendix A.3.

The default settings of C4.5 and C4.5rules were used.

The contradiction of the null hypothesis was then attempted by testing whether:

the Mean RA of the best model which includes grammar-derived features was higher than the best performance achieved using any of the models which do not include the grammar-derived features;

such an increase was statistically significant (estimates of Mean RA were compared using the statistical method described in Appendix A.4);

the best model which includes grammar-derived features was sufficiently more comprehensible than the best `non-grammar' model.

4.2.4 Hidden Markov Model Comparison

A HMM for NPPs was generated and tested using HMMER version 2.1.1 (Eddy, 1998) which is available from http://hmmer.wustl.edu/.

CLUSTAL W (1.8) (Thompson et al., 1994) was used to align the positive sequences in the training-set. The hmmbuild program of HMMER was then used to generate a HMM from the CLUSTAL W alignment. The resulting HMM and the hmmsearch program of HMMER were then used to search for NPPs in the test-set. The default settings of CLUSTAL W and HMMER were used. The Mean RA of the HMM was estimated based on the predictions on the test-set.

4.3 Results and Analysis

Table 13 shows the grammar that was generated using CProgol. The grammar is very rich in terms of non-terminals. All but one of the non-terminals

mar. The grammar also includes star/2, run/3 and three of the six non-terminals which represent the patterns mentioned in Section 4.2.1. The non-terminals yvh and pp, which represent the patterns Y, very hydrophobic and positive, positive, both appear twice. The third non-terminal which appears is hmn, which corresponds to the pattern

hydrophilic, a gap of some residues, M, negative.

Table 6 shows that predictive accuracy is not a good measure of performance for this domain because it does not discriminate well between the amalgams: despite covering varying numbers of (the rare) positives, all the models are awarded a similar (high) score by predictive accuracy because they all exclude most of the abundant negatives.

Put Tab6 here

Table 6 shows the Mean RA for both the hidden Markov model and for each amalgam of feature groups. The Mean RA of the HMM is zero. The highest Mean RA (107.7) was achieved by one of the grammar amalgams, namely the `Proportions + Length + SIGNALP + Grammar' amalgam. The best Mean RA achieved by any of the amalgams which do not include the grammar-derived features was the 49.0 attained by the `Proportions + Length' amalgam.

The $\sum_{M=57}^{90} RA$ for the `Proportions + Length + SIGNALP + Grammar' amalgam was 3661.376. The $\sum_{M=57}^{90} RA$ for the `Proportions + Length' amalgam was 1666.733. For the amalgams `Proportions + Length + SIGNALP + Grammar' and `Proportions + Length', $\hat{\mu}_D = 1994.643$ and $\hat{\sigma}_D / \sqrt{n} = 2.081$. This difference is statistically significant: substituting these values of $\hat{\mu}_D$ and $\hat{\sigma}_D / \sqrt{n}$ into Equation 15 shows that $p(d < 0)$ is well below 0.0001.
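The reported figures can be cross-checked with simple arithmetic. The sketch below is our own consistency check, not the estimation procedure of Appendix A. It assumes that Mean RA is the reported sum divided by the 34 values of M from 57 to 90, which reproduces the 107.7 and 49.0 quoted above, and it compares the difference of the sums with the second reported quantity (2.081) without evaluating Equation 15 itself.

```python
sum_ra_grammar = 3661.376      # 'Proportions + Length + SIGNALP + Grammar'
sum_ra_non_grammar = 1666.733  # 'Proportions + Length'
n_values_of_m = 90 - 57 + 1    # M runs from 57 to 90 inclusive

# Assumption: Mean RA is the sum averaged over the values of M.
mean_ra_grammar = sum_ra_grammar / n_values_of_m
mean_ra_non_grammar = sum_ra_non_grammar / n_values_of_m

# The difference of the two sums matches the reported value 1994.643.
difference = sum_ra_grammar - sum_ra_non_grammar

# Ratio of the difference to the reported 2.081 (roughly 958).
ratio = difference / 2.081

print(round(mean_ra_grammar, 3), round(mean_ra_non_grammar, 3), round(difference, 3))
```

A ratio of this size is consistent with the statement that the probability of the true difference being negative is far below 0.0001.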


pected to be a novel NPP. This can be seen from the following simple calculation. Rearranging Equation 2 gives Pr(NPP | Rec) = RA × Pr(NPP). Substituting in the Mean RA for the best recognition model gives Pr(NPP | Rec) = 107.688 × (90/79449) = 1/8.2. Multiplying 1/8.2 by the proportion of NPPs in SWISS-PROT which are novel (33/90) gives approximately 1/22.
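The calculation above can be reproduced directly. The sketch uses only the figures quoted in the text (Mean RA of 107.688, an estimated 90 NPPs among 79,449 SWISS-PROT sequences, of which 33 are novel); the variable names are ours.

```python
mean_ra = 107.688        # Mean RA of the best recognition model
n_sequences = 79449      # sequences in SWISS-PROT (May 1999)
n_npps_estimated = 90    # estimated number of NPPs in SWISS-PROT
n_novel = 33             # NPPs in SWISS-PROT which are novel

# Pr(NPP | Rec) = RA x Pr(NPP)
p_npp_given_rec = mean_ra * (n_npps_estimated / n_sequences)

# Probability that a recognised sequence is a novel NPP.
p_novel_given_rec = p_npp_given_rec * (n_novel / n_npps_estimated)

print(f"1/{1 / p_npp_given_rec:.1f}", f"1/{1 / p_novel_given_rec:.0f}")
```

For comparison, random sampling would find a novel NPP with probability 33/79449, about 1 in 2400, so the ratio between the two is again the Mean RA.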

Appendix C lists the complete rule-sets for the amalgams `Proportions + Length + SIGNALP + Grammar' and `Proportions + Length'. The rules that were generated from the `grammar amalgam' suggest that the NPP grammar is useful for learning from NPP sequence data. Nine of the 25 rules include a grammar-derived feature. These rules refer to a variety of the grammar-derived features:

whether the grammar predicts the existence of a neuropeptide start (e.g. see Rule 14 in Figure 5);

the first residue in the neuropeptide (e.g. see Rules 20 and 21 in Figures 5 and 6 respectively);

the position of the first residue in the neuropeptide (e.g. see Rule 6 in Figure 5);

the property of the third from last residue in the neuropeptide (e.g. see Rule 17 in Figure 5);

the length of the signal peptide.

Our method did not try to remove the potential redundancy between values of some of the SIGNALP features and grammar features. The results listed in Table 6 justify this. It is of interest to note that the grammar-derived length of signal peptide is used frequently by C4.5 (see Rules 2, 9, 23 and 24), despite the availability of similar features derived from SIGNALP (see Table 5).


Aim The aim of the second experiment is to demonstrate that overfitting of the NPP sequences did not inadvertently occur during the generation of grammar in the first experiment (see Section 4.2.1). The data-set used in the first experiment contained all of the 44 known NPP sequences in SWISS-PROT in Spring 1997. The second experiment utilises 13 additional NPP sequences which had been added to SWISS-PROT by May 1999 (see Table 8). None of these 13 additional NPP sequences were used for training or testing in the first experiment. Indeed the identifiers of these 13 sequences were not known to C.H.B. (who performed the second experiment) until after the results of the first experiment were published (Muggleton et al., 2000).

Data Two test-sets are used in this experiment.

1. The test-set used in the first experiment i.e. the one which contains 10 of the 44 original NPPs and the 990 `other' sequences.

2. A new test set comprising the 990 `other' sequences from the test-set used in the first experiment and 10 of the 13 additional NPP sequences which had been added to SWISS-PROT by May 1999. The SWISS-PROT identifiers of the additional 13 sequences are listed in Table 7 and the sequences are available at

ftp://ftp.cs.york.ac.uk/pub/aig/Datasets/neuropeps/.

Three of the 13 additional NPP sequences (P10092, P07492 and Q00072) were not used as they are homologues of NPPs in the original training set of 34 NPPs.

and the new test-set.


This paper has shown that the most cost-effective, comprehensible multi-strategy predictor of human neuropeptide precursors does employ a context-free definite-clause-grammar.

The ILP Bayesian approach to learning from positive examples was used to generate a grammar for recognising a class of proteins known as human neuropeptide precursors (NPPs). Collectively, five of the co-authors of this paper have extensive expertise on NPPs and general bioinformatics methods. Their motivation for generating a NPP grammar was that none of the existing bioinformatics methods could provide sufficient cost-savings during the search for new NPPs. Prior to this project, experienced specialists at SmithKline Beecham had tried for many months to hand-code such a grammar but without success. Our best predictor makes the search for novel NPPs more than 100 times more efficient than randomly selecting proteins for synthesis and testing them for biological activity (see Figure 4). As far as these authors are aware, this is both the first attempt to learn a biological grammar using ILP and the first real-world scientific application of the ILP Bayesian approach to learning from positive examples.

Put Fig4 here

We first published that our best predictor delivers more than a hundred-fold cost-saving in the proceedings of the seventeenth international conference on Machine Learning (Muggleton et al., 2000). Since then we have obtained further evidence to support our conclusion that we have developed a NPP recognition method which would provide cost-savings during a search for novel NPPs. We have shown, using NPP sequences which had not been used previously on this project, that overfitting of the original set of NPP sequences did not occur during the generation of the grammar.


of the predicates npp/2 and neuropeptide/2 shown in Figure 3. Experiments with a more flexible grammar should therefore form the subject of future work.

In our opinion, the best `non-grammar' recognition model does not provide any biological insight. However the best recognition model which includes grammar-derived features is broadly comprehensible and contains some intriguing associations that may warrant further analysis. This model is being evaluated as an extension to existing methods used in SmithKline Beecham for the selection of potential neuropeptides for use in experiments to help elucidate the biological functions of G-protein coupled receptors. It is clear however, that the rules of the model are not an optimal representation of sequence data and residue properties. A more intuitive (e.g. graphically oriented, sequence centred) display of the meaning of these rules would be required to build tools that the experts in the field would find acceptable.

The new cost function presented in this paper, Relative Advantage (RA), may be used to measure performance of a recognition model for any domain where

1. the proportion of positives in the set of examples is very small.

2. there is no guarantee that all positives can be identified as such. In such domains, the proportion of positive examples in the population is not known and a large, unbiased set of negatives cannot be identified with complete confidence.

3. there is no benchmark recognition method.


We would like to thank I.S. Gloger, M. Lawrence and R.B. Russell for sharing their knowledge of NPPs and sequence homology. We would like to thank J. Cussens for his help with Section A, M. Turcotte for his advice on HMM software and D. Michie for his comments on this work.


NPPs are identified either through purely biological means or by screening genomic or protein sequence databases for likely NPPs, followed by biological evaluation. If we wish to go beyond using sequence homology to find new members of the (generally small) NPP families, we need a recognition model for NPPs in general. However if this recognition model is poor then it may not be much better than random sampling of sequence databases (e.g. SWISS-PROT) and the cost-benefit of any experimental evaluation of NPPs found by such a procedure would be prohibitively small.

In developing a general recognition model for human NPPs, we are faced with three significant obstacles.

1. The number of known NPPs in the public domain databases of protein sequence (e.g. SWISS-PROT) is very small in proportion to the total number of sequences. When we developed our method of estimating RA (May 1999), SWISS-PROT contained 79,449 sequences, of which some 57 could definitely be identified as human NPPs.

2. There is no guarantee that all the human NPPs in SWISS-PROT have been properly identified. We estimate there may, in fact, be up to 90 NPPs in SWISS-PROT.

3. There is no benchmark method for NPP recognition that can be used to compare any new methods. We must therefore compare our recognition model with random sampling to evaluate success.

This domain requires a performance measure which addresses all of these issues.

Table 8 summarises how some of the properties of SWISS-PROT changed [...] PROT.

A.1 Limitations of Existing Performance Measures

For domains in which positives are rare, predictive accuracy, as it is normally measured in Machine Learning (assuming equal misclassification costs):

• gives a poor estimate of the performance of a recognition model. For instance, if a learner induces a very specific model for such a domain, the predictive accuracy of the model may be very high despite the number of true positives being very small or even zero.

• does not discriminate well between models which exclude most of the (abundant) negatives but cover varying numbers of (the rare) positives. (This was illustrated earlier in this paper; see Table 6.)
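The first point can be illustrated with a small calculation. The sketch below is not from the paper: it takes the SWISS-PROT figures quoted in this appendix (79,449 sequences, 57 known NPPs), treats the known NPPs as the only positives (an assumption), and compares a model that never predicts NPP with a purely hypothetical model that recovers 40 NPPs at the price of 100 false alarms.

```python
def predictive_accuracy(tp, fp, fn, tn):
    """Fraction of instances classified correctly (equal misclassification costs)."""
    return (tp + tn) / (tp + fp + fn + tn)

# Figures quoted in this appendix for SWISS-PROT in May 1999.
TOTAL_SEQUENCES = 79_449
KNOWN_NPPS = 57  # assumption: treated here as the complete set of positives

# A degenerate model that never predicts "NPP": zero true positives.
never_npp = predictive_accuracy(
    tp=0, fp=0, fn=KNOWN_NPPS, tn=TOTAL_SEQUENCES - KNOWN_NPPS
)

# Hypothetical model (illustrative counts only): 40 NPPs found, 100 false alarms.
useful = predictive_accuracy(
    tp=40, fp=100, fn=KNOWN_NPPS - 40, tn=TOTAL_SEQUENCES - KNOWN_NPPS - 100
)

print(f"never-NPP model accuracy:    {never_npp:.4%}")
print(f"hypothetical model accuracy: {useful:.4%}")
```

Under these assumptions the model that finds no NPPs at all scores the higher accuracy, which is the weakness the text describes.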

For domains in which there is no benchmark recognition method that can be used to compare any new methods, Lift (Ling & Li, 1998) is not the appropriate measure of performance because it does not quantify the reduction in cost in using the predictor versus random sampling. Furthermore, in their paper, Ling & Li gave no explanation of how to assess the significance of the difference between the Lift of two models. ROC curves (or Lorentz diagrams) (Provost & Fawcett, 1998) also do not quantify the reduction in cost in using the predictor versus random sampling.

Therefore we define a relative advantage (RA) function which predicts the reduction in cost in using the model versus random sampling. In contrast to other performance measures, RA is meaningful and relevant to experts in the domain.

A.2 Definition of RA

in cost in using the model versus random sampling.

$$RA = \frac{A}{B} \tag{1}$$

where

A = the expected cost of finding one NPP by repeated independent random sampling from SWISS-PROT and performing a laboratory analysis of each protein.

B = the expected cost of finding one NPP by repeated independent random sampling from SWISS-PROT and analysing only those proteins which are predicted by the learned model to be a NPP.

RA can be defined in terms of probability as follows. Let

C = the cost of testing the biological activity of one protein via wet-experiments in the laboratory;

NPP = Sequence is a NPP;

Rec = Model recognises sequence as a NPP.

Equation 1 can now be rewritten as:

$$RA = \frac{C/\Pr(NPP)}{C/\Pr(NPP \mid Rec)} = \frac{\Pr(NPP \mid Rec)}{\Pr(NPP)} \tag{2}$$

Let testing the model on test data yield the 2×2 contingency table shown in Table 9 with the cells $n_1$, $n_2$, $n_3$ and $n_4$. Let $n = n_1 + n_2 + n_3 + n_4$ be the number of instances in the test-set. Note that the random set of sequences referred to in the right-hand column may include some NPP sequences. Table 10 shows an estimate of the contingency table that would be obtained if it were possible to identify and remove all the positives from the set of randoms. If the proportion of NPPs in the test-set was known to be the same as the proportion of NPPs in the database then we could estimate $\Pr(NPP)$ to be $(n_1 + n_3)/n$ and $\Pr(NPP \mid Rec)$ to be $n_1/(n_1 + n_2)$ [...] in the test-set and database.

In order to derive a formula for estimating RA given both a set of positives and a set of randoms, we estimate $\Pr(NPP)$ and $\Pr(NPP \mid Rec)$ as follows. Let $S$ be the total number of sequences in the database, of which $M$ are NPPs.

$$\Pr(NPP) = \frac{\text{no. of NPPs in the database}}{\text{no. of sequences in the database}} = M/S \tag{3}$$

$$\Pr(NPP \mid Rec) = \frac{N_{\text{db NPP recog}}}{N_{\text{db seq pred pos}}} \tag{4}$$

where $N_{\text{db NPP recog}}$ is the number of NPPs in the database which are recognised by the model and $N_{\text{db seq pred pos}}$ is the number of sequences in the database which the model predicts to be NPP.

Table 11 shows the expected result of using the learned recognition model on the entire SWISS-PROT database. Note that the factor $(1 - \delta)$ does not appear as it cancels out. From Equation 4 and Table 11 it follows that:

$$\Pr(NPP \mid Rec) \simeq \frac{\frac{n_1}{n_1+n_3} M}{\frac{n_1}{n_1+n_3} M + \frac{n_2}{n_2+n_4}(S - M)} = \frac{M p_1}{M p_1 + (S - M) p_2} \tag{5}$$

where $p_1 = n_1/(n_1 + n_3)$ and $p_2 = n_2/(n_2 + n_4)$. Substituting Equations 3 and 5 into Equation 2 gives

$$RA = \frac{(M p_1)/(M p_1 + (S - M) p_2)}{M/S} = \frac{S p_1}{M p_1 + (S - M) p_2} = \frac{S p_1}{S p_2 + M(p_1 - p_2)} \tag{6}$$
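As a check on the algebra, Equation 6 can be compared with the ratio in Equation 2 computed directly from Equations 3 and 5. The sketch below is illustrative only: the function names and the values of `p1` and `p2` are assumptions, while `S` and `M` use the SWISS-PROT figures quoted in this appendix.

```python
def relative_advantage(S, M, p1, p2):
    """Equation 6: RA = S*p1 / (S*p2 + M*(p1 - p2))."""
    return (S * p1) / (S * p2 + M * (p1 - p2))


def relative_advantage_from_definition(S, M, p1, p2):
    """Equation 2, using Equation 3 for Pr(NPP) and Equation 5 for Pr(NPP | Rec)."""
    pr_npp = M / S                                # Equation 3
    recognised_npps = M * p1                      # expected NPPs the model recognises
    predicted_positive = M * p1 + (S - M) * p2    # expected sequences predicted NPP
    pr_npp_given_rec = recognised_npps / predicted_positive  # Equation 5
    return pr_npp_given_rec / pr_npp


S, M = 79_449, 57      # figures quoted in the text (May 1999)
p1, p2 = 0.5, 0.01     # hypothetical rates, for illustration only

print(relative_advantage(S, M, p1, p2))
print(relative_advantage_from_definition(S, M, p1, p2))
```

When `p1 == p2` the model selects NPPs and non-NPPs at the same rate, and Equation 6 reduces to RA = 1, i.e. no advantage over random sampling.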

A.3 Estimating Relative Advantage

In the following, Relative Advantage over the entire population is represented by RA in capital letters whereas Relative Advantage over a sample is denoted by lower case, i.e. ra. As the value of $M$ is not known, we estimate $\sum_{M=57}^{90} RA$. The lower limit of $M$ is equal to the number of known NPPs in SWISS-PROT. The upper limit of $M$ is the most probable number of NPPs in SWISS-PROT, i.e. a total of the known NPPs and those proteins which have yet to be scientifically recognised as a NPP.

$$\sum_{M=57}^{90} RA \simeq S p_1 \int_{M=57}^{91} \frac{1}{(p_1 - p_2) M + S p_2}\,\partial M = \frac{S p_1}{p_1 - p_2}\Big[\ln\big((p_1 - p_2) M + S p_2\big) + k\Big]_{57}^{91} = \frac{S p_1}{p_1 - p_2} \ln \frac{91 (p_1 - p_2) + S p_2}{57 (p_1 - p_2) + S p_2} \tag{7}$$

We estimate $\sum_{M=57}^{90} RA$ by summing an estimate of the $\sum_{M=57}^{90} RA$ for each instance in the test-set as follows, where $n$ is the number of instances in the test-set. This method has the advantage that it allows the significance of the difference between the RA of two models to be gauged (see Section A.4).

$$\sum_{k=1}^{n} \sum_{M=57}^{90} ra_k \tag{8}$$
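The integral approximation in Equation 7 can be checked numerically against the explicit sum over $M = 57, \ldots, 90$. This is a sketch under stated assumptions: the function names and the values of `p1` and `p2` are hypothetical, and the special cases for `p1 == 0` and `p1 == p2` are added here only to avoid division by zero; the paper does not discuss them.

```python
import math


def summed_ra_exact(S, p1, p2, m_low=57, m_high=90):
    """Explicit sum of Equation 6 over M = m_low..m_high."""
    return sum(S * p1 / ((p1 - p2) * M + S * p2) for M in range(m_low, m_high + 1))


def summed_ra_integral(S, p1, p2, m_low=57, m_high=90):
    """Closed form of Equation 7: integral from m_low to m_high + 1."""
    if p1 == 0:
        return 0.0  # numerator S*p1 vanishes
    if p1 == p2:
        # Integrand is the constant S*p1 / (S*p2) = 1 over an interval of width 34.
        return float(m_high + 1 - m_low)
    lower = (p1 - p2) * m_low + S * p2
    upper = (p1 - p2) * (m_high + 1) + S * p2
    return S * p1 / (p1 - p2) * math.log(upper / lower)


S = 79_449           # SWISS-PROT size quoted in the text
p1, p2 = 0.5, 0.01   # hypothetical rates, for illustration only

exact = summed_ra_exact(S, p1, p2)
approx = summed_ra_integral(S, p1, p2)
print(exact, approx, abs(exact - approx) / exact)
```

Because $M$ is tiny relative to $S$, the integrand changes slowly over the range and the two values agree closely for rates of this kind.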

From Equation 8 and the contingency table it follows that:

$$\sum_{M=57}^{90} ra = \frac{1}{n} \sum_{i=1}^{4} \left( n_i \sum_{M=57}^{90} ra_i \right) \tag{9}$$

Each $\sum_{M=57}^{90} ra_i$ is estimated by substituting $p_1 = \frac{a}{a+c}$ and $p_2 = \frac{b}{b+d}$ into Equation 7. The values of $a$, $b$, $c$ and $d$ are determined by three steps.

1. Whatever the value of $i$, $a$, $b$, $c$ and $d$ are initially given the values of the corresponding counts/frequencies in the contingency table for the test-set (see Table 9).

2. Each one of $a$, $b$, $c$ and $d$ is decremented, providing that the value before subtraction is greater than 1.

   We do not decrement when the value before subtraction is zero because this can result in $p_1$ or $p_2$ [...] is one because this can cause $p_1$ or $p_2$ to have the value zero, which in turn has a highly disproportionate effect on the value of $\sum_{M=57}^{90} ra_i$.

3. The value of either $a$, $b$, $c$ or $d$ is incremented to reflect the classification of an instance in the cell $n_i$.

For instance, if $i = 2$ and all the counts in the contingency table are greater than one then $a = n_1 - 1$; $b = n_2$; $c = n_3 - 1$; $d = n_4 - 1$.
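The three steps can be written as a short function. This is a minimal sketch of the procedure as described above; the function name and argument order are assumptions.

```python
def adjusted_counts(n1, n2, n3, n4, i):
    """Return (a, b, c, d) for an instance that falls in cell n_i, i in 1..4."""
    # Step 1: start from the test-set contingency table counts.
    counts = [n1, n2, n3, n4]
    # Step 2: decrement each count, but only if it is greater than 1.
    counts = [x - 1 if x > 1 else x for x in counts]
    # Step 3: increment the count of the cell the instance belongs to.
    counts[i - 1] += 1
    return tuple(counts)


# The worked example from the text: i = 2 and every count greater than one.
print(adjusted_counts(10, 20, 30, 40, 2))  # (9, 20, 29, 39)
```

For the instance's own cell the decrement and the increment cancel, which is why $b = n_2$ in the example while the other three counts drop by one.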

Note that Steps 1 and 2 assign the same prior probability to each instance because the effect of each step is not dependent upon which cell the current instance belongs to. Therefore this method of estimating $\sum_{M=57}^{90} RA$ has the properties of a) producing identically distributed random variables representing the outcome for each instance; b) having a sample mean which approaches the population mean in the limit and c) having a relatively small sample variance.

The final step of our method for estimating RA is to take the mean of the summed values.

$$\text{Mean } RA = \frac{\sum_{M=57}^{90} ra_i}{90 - (57 - 1)} = \frac{\sum_{M=57}^{90} ra_i}{34} \tag{10}$$
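Putting Equations 7, 9 and 10 together with the three-step adjustment gives the whole estimator. The sketch below is one possible reading of the procedure, not the authors' implementation: the function names and the contingency counts in the usage lines are hypothetical, and the guards for empty columns and for `p1 == p2` are assumptions added to keep the code total.

```python
import math

M_LOW, M_HIGH = 57, 90  # range of M used throughout Section A.3


def summed_ra(S, p1, p2):
    """Closed form of Equation 7 (with guards that are not in the paper)."""
    if p1 == 0:
        return 0.0
    if p1 == p2:
        return float(M_HIGH + 1 - M_LOW)
    lower = (p1 - p2) * M_LOW + S * p2
    upper = (p1 - p2) * (M_HIGH + 1) + S * p2
    return S * p1 / (p1 - p2) * math.log(upper / lower)


def summed_ra_for_cell(cells, i, S):
    """Summed ra for an instance in cell index i (0-based) of (n1, n2, n3, n4)."""
    adjusted = [x - 1 if x > 1 else x for x in cells]  # steps 1 and 2
    adjusted[i] += 1                                   # step 3
    a, b, c, d = adjusted
    p1 = a / (a + c) if a + c else 0.0
    p2 = b / (b + d) if b + d else 0.0
    return summed_ra(S, p1, p2)


def mean_ra(cells, S):
    """Equation 9 (weighted average over cells) followed by Equation 10."""
    n = sum(cells)
    total = sum(n_i * summed_ra_for_cell(cells, i, S) for i, n_i in enumerate(cells)) / n
    return total / (M_HIGH - (M_LOW - 1))


S = 79_449                                  # SWISS-PROT size quoted in the text
print(mean_ra([20, 5, 10, 965], S))         # hypothetical counts: a discriminating model
print(mean_ra([10, 100, 10, 100], S))       # hypothetical counts: a model close to random
```

With the second set of counts the model selects NPPs and randoms at the same rate, so the estimate comes out close to 1, as expected for a predictor with no advantage over random sampling.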

A.4 Assessing the significance of the difference between the RA of two models

Next we develop a method for assessing the significance of the difference between the RA of two models. This method tackles a problem which is similar to that posed by the third question in Dietterich's taxonomy of statistical questions in Machine Learning (Dietterich, 1998). That is, how to choose between classifiers for a single application domain in which the amount of available data is sufficient to allow some of it to be set aside for evaluating classifiers. However, a new method is needed because of the fundamental differences between Relative Advantage and predictive accuracy.

We compare the performance of two recognition models, $H_1$ and $H_2$, by comparing their $\sum_{M=57}^{90} RA$ values. Let $d$ be the difference in $\sum_{M=57}^{90} RA$ over the entire population, i.e. for all the proteins in SWISS-PROT, and $\hat{d}$ be the observed difference on the test-set.

$$d = \sum_{M=57}^{90} RA_{H_1} - \sum_{M=57}^{90} RA_{H_2} \tag{11}$$

$$\hat{d} = \sum_{M=57}^{90} ra_{H_1} - \sum_{M=57}^{90} ra_{H_2} \tag{12}$$

$\hat{d}$ is an unbiased estimator for the true difference because it is calculated using an independent test-set. To determine whether the observed difference is statistically significant we address the following question. What is the probability that $\sum_{M=57}^{90} RA_{H_1} > \sum_{M=57}^{90} RA_{H_2}$, given the observed difference, $\hat{d}$?

If $D$ is a random variable representing the outcome of estimating $d$ by random sampling then, according to the Central Limit Theorem, $\hat{\mu}_D$ is normally distributed in the limit. It has an estimated mean $\hat{d}$ and has an estimated variance of $\hat{\sigma}^2_D / n$. The variance of a random variable, $X$, is $\sigma^2_X = E(X^2) - (E(X))^2$. Therefore, since $D$ is a random variable:

$$\hat{\sigma}^2_D = \hat{\mu}_{D^2} - \hat{\mu}^2_D \tag{13}$$

We calculate $\hat{\mu}_{D^2}$ as follows. Let testing the model on test data yield the 4×4 contingency table shown in Table 12 with the cells $n_{i,j}$.

$$\hat{\mu}_{D^2} = \frac{1}{n} \sum_{i=1}^{4} \sum_{j=1}^{4} \left( n_{i,j} \left( \sum_{M=57}^{90} ra_i - \sum_{M=57}^{90} ra_j \right)^2 \right) \tag{14}$$

Given that $p\left(\sum_{M=57}^{90} RA_{H_1} > \sum_{M=57}^{90} RA_{H_2}\right) = p\left(\sum_{M=57}^{90} RA_{H_1} - \sum_{M=57}^{90} RA_{H_2} > 0\right)$ we evaluate our null hypothesis by estimating $p(d < 0)$ using the Central Limit Theorem.

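$$\int_{x=-\infty}^{0} \Pr(d = x)\,dx = \int_{x=-\infty}^{0} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} dx \tag{15}$$

where $\mu = \hat{\mu}_D$ and $\sigma = \hat{\sigma}_D / \sqrt{n}$.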
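Equations 12 to 15 can be combined into a small routine that returns the estimated $p(d < 0)$. This is a sketch under stated assumptions rather than the authors' code: the inputs `ra_h1[i]` and `ra_h2[j]` stand for the summed ra values of cell $i$ of $H_1$ and cell $j$ of $H_2$, the estimated mean is written as a weighted average over the 4×4 table (which equals Equation 12 when each model's cell counts are the row and column totals), the normal integral is evaluated with the error function, and the handling of zero variance is an added convention.

```python
import math


def prob_d_below_zero(n_ij, ra_h1, ra_h2):
    """Estimate p(d < 0) from a 4x4 table of counts and per-cell summed ra values."""
    n = sum(sum(row) for row in n_ij)
    cells = [(i, j) for i in range(4) for j in range(4)]
    # Estimated mean of D: the observed difference d-hat (Equation 12).
    mean_d = sum(n_ij[i][j] * (ra_h1[i] - ra_h2[j]) for i, j in cells) / n
    # Estimated mean of D squared (Equation 14).
    mean_d2 = sum(n_ij[i][j] * (ra_h1[i] - ra_h2[j]) ** 2 for i, j in cells) / n
    # Equation 13, clamped at zero to absorb floating-point rounding.
    var_d = max(mean_d2 - mean_d ** 2, 0.0)
    sigma = math.sqrt(var_d / n)
    if sigma == 0.0:
        # Degenerate case (convention chosen for this sketch, not from the paper).
        return 0.5 if mean_d == 0 else (1.0 if mean_d < 0 else 0.0)
    # Equation 15: normal CDF at 0 with mean mean_d and standard deviation sigma.
    z = (0.0 - mean_d) / sigma
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Hypothetical inputs, for illustration only.
table = [[10] * 4 for _ in range(4)]
h1 = [60.0, 55.0, 50.0, 58.0]
h2 = [20.0, 25.0, 18.0, 22.0]
print(prob_d_below_zero(table, h1, h2))
```

A value near zero indicates that the observed advantage of $H_1$ over $H_2$ is unlikely to be due to sampling alone; swapping the two models gives the complementary probability.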


Table 13 shows the production rules generated by CProgol. The rules comply with Prolog syntax. signal(X,Y) is true if there is a signal peptide at the beginning of the sequence X, and it is followed by a sequence Y. The other dyadic predicates are defined similarly. Non-terminals and terminals which appear on the right hand side of the production rules listed in Table 13 are defined in Table 14. Table 14 shows the Prolog code representing the background knowledge input to Progol. The production rules, when taken together with the partial grammar shown in Figure 3, form a grammar for NPP sequences.
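The dyadic reading "X starts with the pattern and Y is what remains" can be mimicked outside Prolog. The following Python sketch is only an analogy for how such difference-list rules consume a sequence: each parser maps an input string to every possible remainder. The helper names `residue`, `prop` and `chain` are assumptions; the two example rules and the very hydrophobic residue set are taken from Table 13 and the property table.

```python
from typing import Callable, Iterator

Parser = Callable[[str], Iterator[str]]  # input X -> each possible remainder Y

VERY_HYDROPHOBIC = set("afgilmv")  # A,F,G,I,L,M,V from the property table


def residue(ch: str) -> Parser:
    """Analogue of a([a|T],T): consume one specific residue."""
    def parse(seq: str) -> Iterator[str]:
        if seq[:1] == ch:
            yield seq[1:]
    return parse


def prop(members: set) -> Parser:
    """Analogue of a property predicate: consume one residue with the property."""
    def parse(seq: str) -> Iterator[str]:
        if seq and seq[0] in members:
            yield seq[1:]
    return parse


def star(seq: str) -> Iterator[str]:
    """Analogue of star/2: skip zero or more residues of any kind."""
    for k in range(len(seq) + 1):
        yield seq[k:]


def chain(*parsers: Parser) -> Parser:
    """Thread the remainder of each parser into the next, like a clause body."""
    def parse(seq: str) -> Iterator[str]:
        if not parsers:
            yield seq
            return
        for remainder in parsers[0](seq):
            yield from chain(*parsers[1:])(remainder)
    return parse


# start(A,B) :- a(A,C), veryhydrophobic(C,B).
start_rule = chain(residue("a"), prop(VERY_HYDROPHOBIC))
# middle(A,B) :- c(A,C), star(C,D), c(D,B).
middle_rule = chain(residue("c"), star, residue("c"))

print(list(start_rule("avkr")))     # ['kr']
print(list(middle_rule("cxxcyc")))  # ['yc', '']
```

The generators play the role of Prolog backtracking: `star` offers every possible split point, and later parsers accept or reject each one.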

B.1 Defining a Hypothesis Language for Progol

A Hypothesis Language for Progol is defined by:

• mode and type declarations which state the forms that atoms in hypotheses may take (see Sections B.1.1 and B.1.2);

• prune declarations which further restrict the form of hypotheses (see Section B.1.3);

• the maximum number of layers of variables introduced by atoms in the body of induced clauses from variables in the head of the clauses;

• the maximum number of literals in the body of induced clauses.

The hypothesis language used in the experiment is defined by Tables 15, 16 and 17.

B.1.1 Mode Declarations

:- modeb(Recallnumber, Bodyliteraltemplate)?

Headtemplate and Bodyliteraltemplate are templates of predicates and take the form predicate(ts1, ts2, ...), where ts is a term specification. Each term specification comprises two parts: a mode and a type. Types are described in Section B.1.2. The three possible modes are:

+ This indicates that the term is an input. That is, in all calls to this predicate, the term will be bound to a value.

- This indicates the term is an output.

# This indicates that a constant should appear in this term.

Recallnumber refers to the determinacy of the predicate template, that is it specifies the maximum number of times a call to the predicate can succeed for a given set of input variables. Hence for determinate predicate templates it is set to one and for indeterminate predicate templates to values greater than one. If the User specifies a Recallnumber to be * then Progol assigns a default value of 100 to it.

B.1.2 Type Declarations

Types that are included in mode declarations may be unary predicates defined in the background knowledge. All but one of the mode declarations listed in Table 15 refer to the type rlist/1; this is a predicate whose definition is listed in Table 14. Progol type-checks a constant by executing a query in which the predicate corresponds to the type and the term is instantiated to the constant. If the query succeeds then Progol accepts that the constant is of the correct type.

B.1.3 Prune Declarations

when Head is instantiated to the literal in the head of the proposed clause and Body is instantiated to the proposed body of the clause.

B.2 The Time and Space Complexity of Progol

The Progol algorithm, as analysed in (Muggleton, 1995), has time and space complexity which increases linearly in both the number of examples and the number of clauses in the learned theory. However, the scaling constants involved vary depending on the size of the hypothesis space searched for each clause. This is controlled using a number of parameters, including a clause length bound and a proof depth bound.


The rule-set that was generated from the 'Proportions + Length + SIGNALP + Grammar' amalgam is shown in Figures 5 and 6. Each box contains a rule as it was output by C4.5rules together with the English translation which is shown in italics. The percentage in square brackets refers to the predicted accuracy of the corresponding rule. Each column of rules is tried in turn. Within each column, each rule is tried in order of appearance. The last rule is the 'default' rule which is used if none of the other rules apply.
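The "first matching rule wins, otherwise the default applies" behaviour can be sketched as follows. This is an illustration, not the output of C4.5rules: the two rules are copied from Figure 5, the dictionary-based feature representation is an assumption, and the default class passed in the usage lines is a hypothetical placeholder because the actual default rule is not reproduced in this section.

```python
def classify(features, rules, default_class):
    """Return the class of the first rule whose conditions all hold, else the default."""
    for conditions, label in rules:
        if all(test(features) for test in conditions):
            return label
    return default_class


# Two rules copied from Figure 5; the other rules are omitted in this sketch.
rules = [
    # sigpymax <= 0.457 -> class 0
    ([lambda f: f["sigpymax"] <= 0.457], 0),
    # Rule 33: length > 267 and proportionl > 0.142268 -> class 0
    ([lambda f: f["length"] > 267, lambda f: f["proportionl"] > 0.142268], 0),
]

HYPOTHETICAL_DEFAULT = 1  # placeholder only; the real default rule is not shown here

low_y_score = {"sigpymax": 0.30, "length": 100, "proportionl": 0.10}
no_rule_fires = {"sigpymax": 0.90, "length": 100, "proportionl": 0.20}

print(classify(low_y_score, rules, HYPOTHETICAL_DEFAULT))    # 0
print(classify(no_rule_fires, rules, HYPOTHETICAL_DEFAULT))  # 1
```

Because evaluation stops at the first match, the order in which the rules are listed is part of the model.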

The rule-set that was generated from the 'Proportions + Length' amalgam is shown in Figures 7, 8 and 9.

Abe, N., & Mamitsuka, H. (1997). Predicting protein secondary structure using stochastic tree grammars. Machine Learning, 29, 275-301.

Bairoch, A., & Apweiler, R. (2000). The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res, 28, 45-48.

Bairoch, A., Bucher, P., & Hofman, K. (1997). The PROSITE database, its status in 1997. Nucleic Acids Research, 25, 217-221.

Bakalkin, G., Rakhmaninova, A., Akparov, V., Volodin, A., Ovchinnikov, V., & Sarkisyan, R. (1991). Amino acid sequence pattern in the regulatory peptides. International Journal of Peptide Protein Research, 38, 505-510.

Baldi, P., & Brunak, S. (1998). Bioinformatics: the machine learning approach. Cambridge, MA: MIT Press.

Brazma, A., Jonassen, I., Eidhammer, I., & Gilbert, D. (1998). Approaches to the automatic discovery of patterns in biosequences. Journal of Computational Biology, 5, 279-305.

Claros, M., Brunak, S., & von Heijne, G. (1997). Prediction of N-terminal sorting signals. Current Opinion in Structural Biology, 7, 394-398.

Devi, L. (1991). Consensus sequence for processing of peptide precursors at monobasic sites. FEBS, 280, 189-194.

Dietterich, T. G. (1998). Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 10, 1895-1924.

Doolittle, R. (Ed.). (1996). Computer methods for macromolecular sequence analysis. New York: Academic Press.

[...] USA. Version 2.1.1 edition.

Henikoff, J., & Henikoff, S. (1996). Blocks database and its applications. Methods in Enzymology, 266, 88-105.

Hillier, L., Lennon, G., Becker, M., Bonaldo, F., Chiapelli, B., Chissoe, S., Dietrich, N., DuBuque, T., Favello, A., Gish, W., Hawkins, M., Hultman, M., Kucaba, T., Lacy, M., Le, M., Le, N., Mardis, E., Moore, B., Morris, M., Parsons, J., Prange, C., Rifkin, L., Rohlfing, T., Schellenberg, K., Soares, M., Tan, F., Thierry-Mieg, J., Trevaskis, E., Underwood, K., Wohldman, P., Waterston, R., Wilson, R., & Marra, M. (1996). Generation and analysis of 280,000 human expressed sequence tags. Genome Research, 6, 807-828.

Klavdieva, M. (1995). The history of neuropeptides. Frontiers in Neuroendocrinology, 16, 293-321.

Ling, C., & Li, C. (1998). Data mining for direct marketing: Problems and solutions. Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining (pp. 73-79). New York City.

Linz, P. (1996). An introduction to formal languages and automata. Lexington, Massachusetts: D.C. Heath and Company.

Lyall, A. (1996). Bioinformatics in the pharmaceutical industry. Trends In Biotechnology, 14, 308-312.

Michalski, R., & Wnek, J. (1997). Guest editors' introduction. Machine Learning, 27, 205-208.

Muggleton, S. (1995). Inverse entailment and Progol. New Generation Computing, 13, 245-286.

[...] challenge. Artificial Intelligence, 114, 283-296.

Muggleton, S., Bryant, C., & Srinivasan, A. (2000). Learning Chomsky-like grammars for biological sequence families. Proceedings of the Seventeenth International Conference on Machine Learning (pp. 631-638). Stanford University, USA: San Francisco, CA: Morgan Kaufmann.

Muggleton, S., King, R., & Sternberg, M. (1992). Protein secondary structure prediction using logic-based machine learning. Protein Engineering, 5, 647-657.

Muggleton, S., & Raedt, L. D. (1994). Inductive logic programming: Theory and methods. Journal of Logic Programming, 19,20, 629-679.

Nielsen, H., Brunak, S., & von Heijne, G. (1999). Machine learning approaches for the prediction of signal peptides and other protein sorting signals. Protein Engineering, 12, 3-9.

Nielsen, H., Engelbrecht, J., Brunak, S., & von Heijne, G. (1997). Identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites. Protein Engineering, 10, 1-6.

Pennisi, E. (1999). Human genome - Academic sequencers challenge Celera in sprint to the finish. Science, 283, 1822-1823.

Provost, F., & Fawcett, T. (1998). Analysis and visualization of classifier performance: Comparison under imprecise class and cost distributions. Proceedings of the Third International Conference on Artificial Intelligence (pp. 706-713). Menlo Park, CA: AAAI Press.

Quinlan, J. (1993). C4.5: Programs for machine learning. San Mateo, CA: Morgan Kaufmann.

[...] Boileau, G., & Cohen, P. (1995). Role of amino-acid-sequences flanking dibasic cleavage sites in precursor proteolytic processing - the importance of the first residue C-terminal of the cleavage site. European Journal of Biochemistry, 227, 707-714.

Rholam, M., Nicolas, P., & Cohen, P. (1986). Precursors for peptide-hormones share common secondary structures forming features at the proteolytic processing sites. FEBS Letters, 207, 1-6.

Sakakibara, Y. (1997). Recent advances of grammatical inference. Theoretical Computer Science, 185, 15-45.

Searls, D. (1993). The computational linguistics of biological sequences. Artificial Intelligence in Molecular Biology. California, USA: AAAI Press.

Searls, D. (1997). Linguistic approaches to biological sequences [review]. Computer Applications in the Biosciences, 13, 333-344.

Sonnhammer, E., Eddy, S., Birney, E., Bateman, A., & Durbin, R. (1998). PFAM: multiple sequence alignments and HMM-profiles of protein domains. Nucleic Acids Research, 26, 320-322.

Sonnhammer, E., Eddy, S., & Durbin, R. (1997). PFAM: a comprehensive database of protein domain families based on seed alignments. Proteins, 28, 405-420.

Spence, P. (1998). Obtaining value from the human genome - a challenge for the pharmaceutical industry [review]. Drug Discovery Today, 3, 179-188.

Physicochemical Property    Amino acids with property
Hydrophobic                 H,W,Y,F,M,L,I,V,C,A,G,T,K
Very hydrophobic            A,F,G,I,L,M,V
Hydrophilic                 S,E,Q,R,D,N
Electropositive             R,K,H
Electronegative             D,E
Neutral                     A,C,F,G,I,L,M,N,P,Q,S,T,V,W,Y
Large                       Q,E,R,K,H,W,Y,F,M,L,I
Small                       P,V,C,A,G,T,S,N,D
Tiny                        A,G,S
Polar                       Y,T,S,N,D,E,Q,R,K,H,W
Aliphatic                   L,I,V
Aromatic                    H,W,Y,F

Feature              Description
grampred             A boolean which indicates whether the grammar predicts a sequence to be a NPP or not.
gramsigl             Length of the signal peptide.
gramnpl              Length of the neuropeptide.
gramfirst            Position of the first residue in the neuropeptide.
gramlast             Position of the last residue in the neuropeptide.
gramnpstartfirst     The first residue in the neuropeptide or one of its properties.
gramnpstartsecond    The second residue in neuropeptide or one of its properties.
gramnpendfirst       The first residue/property/star in the body of the end rule.
gramnpendsecond      The second residue/property/star in the body of the end rule.

Feature        Description
sigpcmax       Maximum SIGNALP C score
sigpcmaxpos    Position where maximum SIGNALP C score is achieved
sigpcconcl     SIGNALP C score conclusion ('y' or 'n')
sigpymax       Maximum Y score reported by SIGNALP
sigpymaxpos    Position where maximum SIGNALP Y score is achieved
sigpyconcl     SIGNALP Y score conclusion ('y' or 'n')
sigpsmax       Maximum SIGNALP S score
sigpsmaxpos    Position where maximum SIGNALP S score is achieved
sigpsconcl     SIGNALP S score conclusion ('y' or 'n')

the decision trees generated from the amalgams of the feature groups. Mean RA was estimated using the method described in Section A.3.

Predictor                        Mean RA    Predictive Accuracy (%)
Hidden Markov Model              0          99.0 ± 0.3
Only props                       0          96.7 ± 0.6
Only Length                      1.6        91.8 ± 0.9
Only SignalP                     11.7       98.1 ± 0.4
Only Grammar                     10.8       97.0 ± 0.5
Props+Length                     49.0       98.6 ± 0.4
Props+SignalP                    15.0       98.3 ± 0.4
Props+Grammar                    31.7       98.2 ± 0.4
SignalP+Grammar                  0          98.6 ± 0.4
Length+Grammar                   0          96.2 ± 0.6
Length+SignalP                   34.4       98.7 ± 0.4
Length+SignalP+Grammar           0          98.0 ± 0.4
Props+Length+SignalP             29.2       98.7 ± 0.4
Props+Length+Grammar             33.2       98.5 ± 0.4
Props+SignalP+Grammar            15.0       98.3 ± 0.4
Props+Length+SignalP+Grammar     107.7      99.0

added to SWISS-PROT by May 1999.

1999.

                       Spring 1997    May 1999
Number of sequences    64,000         79,449

are labelled by the sets NPP sequences, Random sequences, $H$ (Hypothesis predictions) and $\bar{H}$ (complement of $H$). The cells of the matrix represent the cardinalities of the corresponding intersections of these sets. $n_1 + n_2 + n_3 + n_4 = n$, where $n$ is the number of instances in the test-set.

             Set of test NPP sequences    Set of test Random sequences
$H$          $n_1$                        $n_2$
$\bar{H}$    $n_3$                        $n_4$

The axes of the 2×2 matrix are labelled by the sets NPP sequences, Negative sequences, $H$ (Hypothesis predictions) and $\bar{H}$ (complement of $H$). The cells of the matrix represent the cardinalities of the corresponding intersections of these sets. $\delta = M/S$ where $S$ is the total number of sequences in the entire SWISS-PROT database, of which $M$ are NPPs.

             Set of test NPP sequences    Set of test Negative sequences
$H$          $n_1$                        $n_2 (1 - \delta)$
$\bar{H}$    $n_3$                        $n_4$

are labelled by the sets NPP sequences, Random sequences, $H$ (Hypothesis predictions) and $\bar{H}$ (complement of $H$). The total of the counts/frequencies in the four cells = $S$, where $S$ is the total number of sequences in the SWISS-PROT database.

             NPP sequences in SWISS-PROT    Negative sequences in SWISS-PROT
$H$          $\frac{n_1}{n_1+n_3} M$        $\frac{n_2}{n_2+n_4} (S - M)$
$\bar{H}$    $\frac{n_3}{n_1+n_3} M$        $\frac{n_4}{n_2+n_4} (S - M)$

by the cells of the 2×2 contingency table for $H_1$. The columns of the 4×4 matrix are labelled by the cells of the 2×2 contingency table for $H_2$. The cells of the 4×4 matrix represent the cardinalities of the corresponding intersections of these sets. $\sum_{i=1}^{4} \sum_{j=1}^{4} n_{i,j}$ [...]

sigpep(A,B) :- g(A,C), star(C,D), s(D,B).
sigpep(A,B) :- m(A,C), star(C,D), hydrophilic(D,E), tiny(E,B).
sigpep(A,B) :- hydrophobic(A,C), star(C,D), w(D,E), hydrobacc(E,B).
sigpep(A,B) :- large(A,C), run(C,D,hydrophobic), star(D,E), t(E,B).
sigpep(A,B) :- m(A,C), star(C,D), t(D,E), neutral(E,F), small(F,B).
sigpep(A,B) :- m(A,C), star(C,D), veryhydrophobic(D,E), positive(E,F), tiny(F,B).
sigpep(A,B) :- hydrophobic(A,C), run(C,D,hydrophobic), star(D,E), f(E,F), hydrophobic(F,B).
sigpep(A,B) :- hydrophobic(A,C), star(C,D), h(D,E), hydrophobic(E,F), tiny(F,B).
sigpep(A,B) :- hydrophobic(A,C), star(C,D), v(D,E), hydrophobic(E,F), neutral(F,B).
sigpep(A,B) :- large(A,C), star(C,D), a(D,E), hydrophobic(E,F), small(F,B).
sigpep(A,B) :- large(A,C), star(C,D), s(D,E), neutral(E,F), small(F,B).

start(A,B) :- a(A,C), veryhydrophobic(C,B).
start(A,B) :- d(A,C), t(C,B).
start(A,B) :- g(A,C), v(C,B).
start(A,B) :- h(A,C), r(C,B).
start(A,B) :- k(A,C), r(C,B).
start(A,B) :- l(A,C), r(C,B).
start(A,B) :- q(A,C), g(C,B).
start(A,B) :- s(A,C), l(C,B).
start(A,B) :- w(A,C), q(C,B).
start(A,B) :- hydrophilic(A,C), a(C,B).
start(A,B) :- hydrophilic(A,C), hydrophilic(C,B).
start(A,B) :- positive(A,C), k(C,B).
start(A,B) :- small(A,C), r(C,B).

middle(A,B) :- yvh(A,C), star(C,D), large(D,E), large(E,B).
middle(A,B) :- positive(A,C), star(C,D), neutral(D,E), large(E,F), large(F,B).
middle(A,B) :- hydrobacc(A,C), star(C,D), hydrophobic(D,E), neutral(E,F), aromatic(F,B).
middle(A,B) :- hydrobacc(A,C), yvh(C,D), star(D,B).
middle(A,B) :- small(A,C), star(C,D), p(D,E), large(E,F), large(F,B).
middle(A,B) :- y(A,C), star(C,D), g(D,E), hydrophobic(E,B).
middle(A,B) :- hydrobacc(A,C), star(C,D), k(D,E), neutral(E,F), small(F,B).
middle(A,B) :- small(A,C), star(C,D), l(D,E), m(E,B).
middle(A,B) :- small(A,C), star(C,D), f(D,E), hydrophobic(E,F), aliphatic(F,B).
middle(A,B) :- tiny(A,C), star(C,D), m(D,B).
middle(A,B) :- q(A,C), star(C,D), positive(D,E), neutral(E,F), neutral(F,B).
middle(A,B) :- hydrophobic(A,C), star(C,D), m(D,E), hydrophilic(E,F), neutral(F,B).
middle(A,B) :- e(A,C), star(C,D), i(D,B).
middle(A,B) :- q(A,C), star(C,D), l(D,B).
middle(A,B) :- aromatic(A,C), star(C,D), v(D,E), neutral(E,F), hydrobdon(F,B).
middle(A,B) :- aromatic(A,C), star(C,D), a(D,E), e(E,B).
middle(A,B) :- c(A,C), star(C,D), c(D,B).
middle(A,B) :- y(A,C), star(C,D), hydrobdon(D,E), hydrobdon(E,B).
middle(A,B) :- hmn(A,C), star(C,D), d(D,B).
middle(A,B) :- tiny(A,C), star(C,D), l(D,E), hydrobdon(E,F), hydrobdon(F,B).
middle(A,B) :- neutral(A,C), star(C,D), veryhydrophobic(D,E), negative(E,F), aromatic(F,B).
middle(A,B) :- h(A,C), star(C,D), veryhydrophobic(D,E), neutral(E,B).
middle(A,B) :- h(A,C), star(C,D), positive(D,E), neutral(E,F), hydrobdon(F,B).
middle(A,B) :- hydrophilic(A,C), star(C,D), e(D,E), small(E,B).
middle(A,B) :- hydrobdon(A,C), star(C,D), g(D,E), hydrophobic(E,F), neutral(F,B).
middle(A,B) :- hydrophobic(A,C), star(C,D), n(D,E), neutral(E,F), large(F,B).
middle(A,B) :- hydrophobic(A,C), star(C,D), a(D,E), f(E,B).
middle(A,B) :- hydrobdon(A,C), star(C,D), negative(D,E), aromatic(E,B).
middle(A,B) :- hydrobacc(A,C), star(C,D), r(D,E), hydrophobic(E,B).
middle(A,B) :- aromatic(A,C), star(C,D), a(D,E), veryhydrophobic(E,F), large(F,B).
middle(A,B) :- tiny(A,C), star(C,D), r(D,E), tiny(E,B).

end(A,B) :- pp(A,C), d(C,B).
end(A,B) :- pp(A,C), large(C,B).
end(A,B) :- e(A,C), l(C,D), s(D,B).
end(A,B) :- e(A,C), v(C,D), v(D,B).
end(A,B) :- g(A,C), positive(C,D), hydrobdon(D,B).
end(A,B) :- q(A,C), a(C,D), g(D,B).
end(A,B) :- r(A,C), tiny(C,D), hydrobacc(D,B).
end(A,B) :- t(A,C), neutral(C,D), hydrobacc(D,B).
end(A,B) :- positive(A,C), r(C,D), small(D,B).
end(A,B) :- positive(A,C), r(C,D), hydrobacc(D,B).
end(A,B) :- large(A,C), l(C,D), v(D,B).
end(A,B) :- small(A,C), hydrophobic(C,D), positive(D,B).
end(A,B) :- tiny(A,C), star(C,D), r(D,B).

unary predicates representing the properties shown in Table 3 are not shown here for reasons of space.

rlist([]).
rlist([R|T]) :- res(R), rlist(T).
res(a). res(b). res(c). ... res(z).

any([_|S],S).   % residue of any type or property

kr(A,C) :- k(A,B), r(B,C).
kp(A,C) :- k(A,B), positive(B,C).
pp(A,C) :- positive(A,B), positive(B,C).
gkr(A,D) :- g(A,B), k(B,C), r(C,D).
grr(A,D) :- g(A,B), r(B,C), r(C,D).
yvh(A,C) :- y(A,B), veryhydrophobic(B,C).
hp(B,C) :- h(B,C), p(C,D).
hmn(A,E) :- hydrophilic(A,B), star(B,C), m(C,D), negative(D,E).
wmdf(A,B) :- w(A,C), m(C,D), d(E,F), f(F,B), end(B).

veryhydrophobic([R|T],T) :- veryhydrophobic(R).
small([R|T],T) :- small(R).
hydrophobic([R|T],T) :- hydrophobic(R).
tiny([R|T],T) :- tiny(R).
hydrophilic([R|T],T) :- hydrophilic(R).
polar([R|T],T) :- polar(R).
positive([R|T],T) :- positive(R).
aliphatic([R|T],T) :- aliphatic(R).
negative([R|T],T) :- negative(R).
aromatic([R|T],T) :- aromatic(R).
neutral([R|T],T) :- neutral(R).
hydrobdon([R|T],T) :- hydrobdon(R).
large([R|T],T) :- large(R).
hydrobacc([R|T],T) :- hydrobacc(R).

star(S,S).
star([_|S],T) :- star(S,T).

a([a|T],T). b([b|T],T). c([c|T],T). ... z([z|T],T).

run([X|S],T,P) :- prop(P), docall(P,X), run(S,T,P).
run([X,Y|S],S,P) :- prop(P), docall(P,X), docall(P,Y).
docall(P,X) :- Call =.. [P,X], Call.

TARGET PREDICATE is either sigpep, start, middle or end.

:- modeh(1,TARGET_PREDICATE(+rlist,-rlist))?

:- modeb(1,yvh(+rlist,-rlist))?
:- modeb(*,hmn(+rlist,-rlist))?
:- modeb(1,hp(+rlist,-rlist))?
:- modeb(1,wmdf(+rlist,-rlist))?

:- modeb(1,a(+rlist,-rlist))?
:- modeb(1,b(+rlist,-rlist))?
...
:- modeb(1,z(+rlist,-rlist))?

:- modeb(1,hydrophobic(+rlist,-rlist))?
:- modeb(1,small(+rlist,-rlist))?
:- modeb(1,very_hydrophobic(+rlist,-rlist))?
:- modeb(1,tiny(+rlist,-rlist))?
:- modeb(1,hydrophilic(+rlist,-rlist))?
:- modeb(1,tiny(+rlist,-rlist))?
:- modeb(1,positive(+rlist,-rlist))?
:- modeb(1,aliphatic(+rlist,-rlist))?
:- modeb(1,negative(+rlist,-rlist))?
:- modeb(1,aromatic(+rlist,-rlist))?
:- modeb(1,neutral(+rlist,-rlist))?
:- modeb(1,hydro_b_don(+rlist,-rlist))?
:- modeb(1,large(+rlist,-rlist))?
:- modeb(1,hydro_b_acc(+rlist,-rlist))?

% The next five mode declarations were only used when generating rules for the ends.

:- modeb(1,pp(+rlist,-rlist))?
:- modeb(1,gkr(+rlist,-rlist))?
:- modeb(1,kp(+rlist,-rlist))?
:- modeb(*,run(+rlist,-rlist,#prop))?
:- modeb(1,kr(+rlist,-rlist))?

% The next mode declaration was only used when generating rules for signals, middles and ends.

TARGET PREDICATE is either sigpep, start, middle or end.

% No star(X,X) in body
prune(_,Body) :- in(star(A,B),Body), A==B.

% Body must form variable chain from head
prune(Head,Body) :- Head =.. [_,U,_], not(chain(U,Body)).

% No star(X,Y),star(Y,Z) in body
prune(_,Body) :- suffix(Body,Suffix),
    (Suffix=(star(_,_),(star(_,_),_))
    ;Suffix=(star(_,_),(star(_,_)))).

% The following prune was not used when generating the rules for the starts or middles.
% No run(X,Y),run(Y,Z) in body
prune(_,Body) :- suffix(Body,Suffix),
    (Suffix=(run(_,_,P),(run(_,_,P),_))
    ;Suffix=(run(_,_,P),(run(_,_,P)))).

% The following prune was not used when generating the start rules.
prune(_,star(_,_)).

:- TARGET_PREDICATE(x,y).   % Not allowed everything a NPP

chain(U,true).
chain(U,A) :- A =.. [_,V,_|_], U==V.
chain(U,(A,B)) :- A =.. [_,V,W|_], U==V, chain(W,B).

suffix(S,S).

          pos    inflate    i    c    nodes    v    h            r            s
signal    yes    100000     6    5    4000     0    100000000    100000000    100000000
start     yes    100000     6    5    4000     0    100000000    100000000    100000000
middle    yes    100000     6    5    1000     0    200          400          100000000
end       yes    100000     3    3    4000     0    100000000    100000000    100000000

pos      The posonly setting. When this is set to yes CProgol adopts the ILP Bayesian approach to learning from positive examples.
inflate  Controls the specificity of clauses obtained.
i        An upper bound on the number of layers of variables introduced by atoms in the body of induced clauses from variables in the head of the clauses.
c        An upper bound on the number of literals in the body of induced clauses.
nodes    An upper bound on the nodes to be explored by CProgol when searching for a consistent clause.
v        The verbosity of the output.
h        A depth bound on the theorem prover.
r        An upper bound on the number of resolutions beyond which the whole proof fails, i.e. backtracking does not occur.

[Figure: a neuropeptide precursor sequence drawn from the N terminus (residue 1) to residue 485, with the regions labelled signal peptide, angiotensin I and angiotensin II.]

npp(A,B) :- signal(A,C), star(C,D), neuro_peptide(D,E), star(E,B).
signal(A,C) :- ...
neuro_peptide(D,E) :- start(D,F), middle(F,G), end(G,E).
start(D,F) :- ...
middle(F,G) :- ...
end(G,E) :- ...

[Diagram: an example sequence annotated with the variables A, C, D, F, G, E and B, which mark the boundaries of the segments matched by signal, star, start, middle and end.]

[Figure: Using our best recognition model as a filter. Panel labels: SWISS-PROT, Random Sample, Filter using Model. Ratios shown for NPPs and novel NPPs: 1 : 22, 1 : 8 and 1 : 883.]

sigpymax<=0.457 ->class0[99.9%]

AsequenceisnotaNPPifthe maximumY scorereportedby SIGNALP0:457.

Rule31: length>267 proportionl<=0.141717 proportionpolar<=0.285366 ->class0[99.9%]

A sequence is not a NPP if it is more than 267 residues long and the proportion of its residueswhichare:-1)leucine (L) is 0:141717 2) polar is 0:285366.

Rule14:

gramnpstartfirst=0 proportionh>0.00588235 proportioni>0.0141343 proportionneutral<=0.793103 proportiontiny>0.208253 ->class0[99.9%]

A sequence is not a NPP if the grammar predicts that a neuropeptide start is not present and the proportion of residuesinthesequencewhich are: 1) histidine (H) is > 0:00588235;2)isoleucine(I)is > 0:0141343; 3) not surroun-ded by an electrostatic charge is 0:793103; 4) tiny is > 0:208253.

Rule29: sigpcmaxpos>29 proportionr>0.047043 ->class0[99.8%]

AsequenceisnotaNPPifthe positionwheremaximum SIG-NALP C score is achieved is > 29 and the proportion of itsresidueswhichareArginine (R)is>0:047043

Rule11:

proportiong>0.040555 proportionr>0.047043 proportionhydrophobic>0.584906 proportiontiny<=0.208253 ->class0[99.7%]

A sequence is not a NPP if the proportion of its residues which are:- 1) glycine (G) is > 0:040555; 2) arginine (R) is > 0:047043; 3) hydrophobic is > 0:584906; 4) tiny is 0:208253.

sigpcmaxpos<=29

proportionhydrophobic>0.636591 proportiontiny<=0.301205 ->class0[99.7%]

AsequenceisnotaNPPifthe position wheremaximum SIG-NALP C score is achieved is 29and theproportionofits residues which are:-1) hydro-phobic is > 0:636591; 2) tiny 0:301205.

Rule5:

proportiontiny<=0.176282 ->class0[99.7%]

AsequenceisnotaNPPifthe proportionofitsresidueswhich aretinyis0:176282. Rule4:

length<=267 proportionr<=0.047043 ->class0[99.6%]

AsequenceisnotaNPPifthe number of residues in the se-quenceis 267 and the pro-portion of its residues which arearginine(R)is0:047043. Rule9:

gramsigl<=25 proportiong>0.040555 proportionq>0.0395683 proportionr>0.047043 proportiontiny<=0.208253 ->class0[99.6%]

A sequence is not a NPP if the grammar predictsthat the length of the signal peptideis 25and theproportionofits residueswhichare:-1)glycine (G)is >0:040555; 2) glutam-ine(Q)is>0:0395683;3) ar-ginine (R) is > 0:047043; 4) tinyis0:208253.

Rule26:

proportioni>0.00882353 proportionhydrophobic<=0.636591 proportionneutral>0.793103 ->class0[99.5%]

A sequence is not a NPP if the proportion of its residues whichare:-1)isoleucine(I)is > 0:00882353; 2) hydrophobic is0:636591;3)not surroun-ded by an electrostatic charge is>0:793103.

gramfirst <= 55 proportiong <= 0.040555 -> class 0 [99.5%]

A sequence is not a NPP if the grammar predicts that the position of the first residue in the neuropeptide is less than or equal to 55 residues from the N-terminal and the proportion of residues in the sequence which are glycine is ≤ 0.040555.

Rule 23:

gramsigl <= 27 proportionp > 0.0873016 proportionneutral <= 0.793103 -> class 0 [99.4%]

A sequence is not a NPP if the grammar predicts that the length of the signal peptide is ≤ 27 and the proportion of residues in the sequence which are:- 1) proline (P) is > 0.0873016; 2) not surrounded by an electrostatic charge is ≤ 0.793103.

Rule 33:

length > 267 proportionl > 0.142268 -> class 0 [98.4%]

A sequence is not a NPP if it is more than 267 residues long and the proportion of its residues which are leucine (L) is > 0.142268.
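Rule 33 depends only on the sequence itself, so it can be checked without the grammar or SIGNALP. The following is a minimal sketch under that reading. The function name is hypothetical, and treating an empty sequence as not matching is an assumption made here for safety rather than something the rule states.

```python
def rule33_fires(sequence: str) -> bool:
    """Return True when Rule 33 predicts class 0 (not a NPP).

    The rule fires when the sequence has more than 267 residues and
    the proportion of leucine (L) exceeds 0.142268.
    """
    length = len(sequence)
    if length == 0:
        # Assumption: an empty sequence cannot satisfy the rule.
        return False
    proportion_l = sequence.upper().count("L") / length
    return length > 267 and proportion_l > 0.142268
```

A 300-residue sequence with 100 leucines satisfies both conditions, whereas a 200-residue sequence fails the length condition regardless of its composition.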

Rule 17:

gramnpendfirst = small proportionveryhydrophobic > 0.435146 -> class 0 [96.8%]

A sequence is not a NPP if the grammar predicts that the third from last residue in the neuropeptide is small and the proportion of residues in the sequence which are very hydrophobic is > 0.435146.

Rule 21:

gramnpstartfirst = s -> class 0 [95.8%]


[Figure: diagram of a neuropeptide precursor sequence, with residue positions marked from 1 to 485 and the regions labelled signal peptide, angiotensin I and angiotensin II.]

[Figure: schematic of a sequence with regions labelled A to G, including the signal and the start, middle and end.]

[Figure: Using our best recognition model as a filter. Labels: random sample; sample using filter.]
