Are grammatical representations useful for learning from biological sequence data?— a case study

S.H. Muggleton† C.H. Bryant‡

Department of Computer Science, University of York, York YO10 5DD, UK.

A. Srinivasan

Oxford University Computing Laboratory, Wolfson Building, Parks Road, Oxford OX1 3QD, UK.

A. Whittaker§ S. Topp C. Rawlings¶

SmithKline Beecham, New Frontiers Science Park, Third Avenue, Harlow, Essex CM19 5AW, UK.

Keywords: Bioinformatics, Machine Learning, Inductive Logic Programming, Cost Function, Grammatical Inference

A brief account of part of this work was published in the proceedings of the Seventeenth International Conference on Machine Learning (Stanford, 29 June - 2 July 2000).

† Current address: Department of Computing, Imperial College of Science, Technology and Medicine, 180 Queen's Gate, London SW7 2BZ.

‡ Address correspondence to: School of Computer and Mathematical Sciences, The Robert Gordon University, Aberdeen, AB25 1HG, Scotland, UK.

§ Current address: Psygnosis Ltd., 37 Kentish Town Road, London NW1 8NX, UK.

¶

This paper investigates whether Chomsky-like grammar representations are useful for learning cost-effective, comprehensible predictors of members of biological sequence families. The Inductive Logic Programming (ILP) Bayesian approach to learning from positive examples is used to generate a grammar for recognising a class of proteins known as human neuropeptide precursors (NPPs). Collectively, five of the co-authors of this paper have extensive expertise on NPPs and general bioinformatics methods. Their motivation for generating an NPP grammar was that none of the existing bioinformatics methods could provide sufficient cost-savings during the search for new NPPs. Prior to this project, experienced specialists at SmithKline Beecham had tried for many months to hand-code such a grammar but without success. Our best predictor makes the search for novel NPPs more than 100 times more efficient than randomly selecting proteins for synthesis and testing them for biological activity. As far as these authors are aware, this is both the first biological grammar learnt using ILP and the first real-world scientific application of the ILP Bayesian approach to learning from positive examples.

A group of features is derived from this grammar. Other groups of features of NPPs are derived using other learning strategies. Amalgams of these groups are formed. A recognition model is generated for each amalgam using C4.5 and C4.5rules and its performance is measured using both predictive accuracy and a new cost function, Relative Advantage (RA). The highest RA was achieved by a model which includes grammar-derived features. This RA is significantly higher than the best RA achieved without the use of the grammar-derived features. Predictive accuracy is not a good measure of performance for this domain because it does not discriminate well between NPP recognition models: despite covering varying numbers of (the rare) positives, all the models are awarded a similar (high) score by predictive accuracy because they all exclude most of the abundant negatives.

This paper attempts to answer, by way of a case-study, the question of whether grammatical representations are useful for learning from biological sequence data. We address the question with experimental results that significantly contradict the following null hypothesis.

Null hypothesis: The most cost-effective, comprehensible multi-strategy predictors of human neuropeptide precursors do not employ a context-free definite-clause-grammar.

Multi-strategy learning (Michalski & Wnek, 1997) aims at integrating multiple strategies in a single learning system, where strategies may be inferential (e.g. induction, deduction etc.) or computational. A computational strategy is defined by the representational system and the computational method used in the learning system (e.g. decision tree learning, neural network learning etc.).

A grammar for a language tells us whether a sentence is properly formed. Noam Chomsky, a founder of formal language theory, provided an initial classification of language types. Those readers requiring an introduction to formal grammars or this hierarchy are referred to (Linz, 1996).

We obtain results which significantly contradict the null hypothesis as follows. A grammar is generated for a particular class of biological sequences. A group of features is derived from this grammar. Other groups of features are derived using other learning strategies. Amalgams of these groups are formed. A recognition model is generated for each amalgam using C4.5 and C4.5rules. The results significantly contradict the null hypothesis because:

1. the best performance achieved using any of the models which include grammar-derived features is higher than the best performance achieved using any of the models which do not include the grammar-derived features;

[...] more comprehensible than the best `non-grammar' model.

Performance is measured using a new cost function, Relative Advantage (RA). Appendix A defines RA and explains why it is used in preference to other performance measures. A method of estimating the RA of a recognition model is presented which subsequently allows the statistical significance of the difference between the RA of two models to be gauged.

The domain of the case study is the recognition of a class of proteins known as human neuropeptide precursors (NPPs). These proteins have considerable therapeutic potential and are of widespread interest in the pharmaceutical industry (see Section 3). Our best multi-strategy predictor of NPPs employs a context-free definite-clause-grammar.

An Inductive Logic Programming (ILP) (Muggleton & Raedt, 1994) system is used to generate a grammar for NPPs. As far as these authors are aware, this is the first attempt to generate a grammar for a biological domain using ILP. ILP is the area of Artificial Intelligence which deals with the induction of hypothesised predicate definitions from examples and background knowledge. Logic programs are used as a single representation for examples, background knowledge and hypotheses. For a recent overview of ILP issues and results see (Muggleton, 1999).

Most ILP systems require both a set of positive examples of the concept to be learnt and a set of negative examples. However, it is not possible to identify a large, unbiased set of negative examples of NPPs with certainty because there will be proteins which have yet to be recognised scientifically as an NPP. Therefore advantage was taken of the ILP Bayesian approach to learning from positive examples (Muggleton, 1996). This approach does not require a set of negative examples. It is able to learn a concept from a set of positive examples and a set of examples sampled at random. As far as these authors are aware, this is the [...]

information in molecular biology and previous techniques for learning from it, including grammatical inference. This section reviews grammatical inference from the viewpoint of learning cost-effective, comprehensible predictors of members of biological sequence families. Those readers interested in previous theoretical results and applications to natural language are referred to (Sakakibara, 1997) and the references therein. Section 3 introduces neuropeptide precursor recognition, the domain of the case-study. Sections 4 and 5 detail the experimental materials, methods and results. Section 6 is the Discussion. Appendix A describes the new cost function Relative Advantage (RA). Appendix B includes the production rules generated by CProgol. Appendix C includes our best [...]

Research in the biological and medical sciences is being transformed by the volume of data coming from projects which will reveal the entire genetic code (genome sequence) of Homo sapiens as well as other organisms. Once complete, these projects should help us understand the genetic basis of human disease. The growth in the volume of data and improvements in software for interpreting this information have increased interest in the use of computational methods for identifying genes involved in human disease (Rawlings & Searls, 1997). Knowing the genes implicated in a disease identifies the proteins that they code for and possibly suggests the biochemical processes that may be influencing the development of the disease. This information is crucial for the generation of the experimental reagents needed for the development of new drugs and explains the widespread investment by the biotechnology and pharmaceutical industries in bioinformatics staff and technologies (Lyall, 1996; Spence, 1998).

Recent announcements indicate a commitment to complete sequencing of the entire human genome during the year 2000, through accelerated international funding of public research (Pennisi, 1999). This initiative has promoted the development of technology to generate raw (uninterpreted) gene sequence data at a rate of at least 1 Gigabase of new DNA per year. Once the full human genome sequence is available, it will not be long before the nucleotide sequence of every human gene is known. The amino acid sequences of the proteins encoded by these genes can then be deduced. Current estimates of the number of genes in the human genome vary greatly, but tend to average around 60,000. If different, but similar biological sequences are believed to have arisen by evolution from a common ancestor, the proteins or genes are said to be homologous to each other. For a significant portion of these deduced protein sequences, a function can be inferred due to the homology of the sequence to another known protein. However, given the large numbers of new protein sequences expected to be solved in the near future there will be a great many for which no clear homologues of [...]

information processing infrastructure in both commercial and academic research sectors. Current state-of-the-art computational sequence interpretation methods are not capable of solving the sequence-to-function problem for all new proteins, and the development of new techniques is still required.

A significant challenge in the analysis and interpretation of genetic sequence data is therefore the accurate recognition of patterns within the data that are diagnostic for known structural or functional features within the protein. The language of genes is written in a simple alphabet {A, C, G, T} representing the four DNA base codes. The language of proteins uses a twenty character alphabet {A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y} representing amino acid residues. These residues are encoded in genes by successive DNA base triplets. At their simplest, these patterns can be described as regular expressions. Many features can be described through the use of regular expressions and a database, PROSITE (Bairoch et al., 1997), is available in which these patterns are curated. A PROSITE pattern such as:

[AC]-x-V-x(4)-{ED}

is translated as:

[A or C]-any-V-any-any-any-any-{any but E or D}

A more extensive example of a PROSITE pattern is that for the family of proteins called short-chain dehydrogenases (enzymes involved in cell metabolism). The PROSITE pattern that includes two perfectly conserved residues, a tyrosine (Y) and a lysine (K), is:

[LIVSPADNK]-x(12)-Y-[PSTAGNCV]-[STAGNQCIVM]-[STAGC]-K-{PC}-[SAGFYR]-[LIVMSTAGD]-x(2)-[LIVMFYW]-x(3)-[LIVMFYWGAPTHQ]-[GSACQRHM]
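As a rough illustration of how such patterns map onto regular expressions, the following Python sketch translates a PROSITE-style pattern into the syntax of Python's `re` module. The function name `prosite_to_regex` is hypothetical, and the sketch assumes only the pattern elements that appear in the two examples above (single residues, `x`, `[...]`, `{...}` and repeat counts); it is not the PROSITE tooling itself.

```python
import re

# One pattern element: x, a residue, [allowed] or {forbidden}, with an optional repeat count.
_ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression string."""
    pieces = []
    for element in pattern.split("-"):
        match = _ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = match.groups()
        if core == "x":
            regex = "."                          # any residue
        elif core.startswith("{"):
            regex = "[^" + core[1:-1] + "]"      # any residue except those listed
        else:
            regex = core                         # a single residue or an [allowed] class
        if low is not None:
            regex += "{" + low + ("," + high if high else "") + "}"
        pieces.append(regex)
    return "".join(pieces)


regex = prosite_to_regex("[AC]-x-V-x(4)-{ED}")
print(regex)
print(bool(re.fullmatch(regex, "AGVAAAAK")))
```

The translation is purely syntactic, which reflects the point made in the text: a pattern of this kind fixes the position of every element and cannot express the variable internal organisation discussed later for neuropeptide precursors.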

There are, however, limitations inherent in the use of simple regular expressions as a representation of biological sequence patterns. In recent years attention has shifted towards both the use of neural network approaches (see (Baldi & Brunak, 1998)) and to probabilistic models, in particular hidden Markov models (see (Durbin et al., 1998)). Both these methods directly address the [...]

plemented with well-established methods for training models from examples. Training regimes generally (but not always) require that the sequences in the training set be arranged so that those regions of the sequence that have been conserved through evolution are aligned in the same column. The accurate multiple alignment of biological sequences has been the subject of much research and discussion (Doolittle, 1996), and is considered by many to be a solved problem. However, it relies on the assumption that the sequences to be aligned show some homology at the sequence level with each other. Complex biological signals also require complex models and it is often the case that considerable expertise is required in the selection of the optimal neural network architecture or hidden Markov model before training can take place.

A general linguistic approach to representing the structure and function of genes and proteins has intrinsic appeal as an alternative approach to probabilistic methods because of the declarative and hierarchical nature of grammars. Searls (Searls, 1993) has undertaken the most thorough analysis of the linguistic classification of genetic grammars starting with the Definite Clause Grammar (DCG). Searls proposes a String Variable Grammar (SVG) extension to a DCG to provide features necessary for representing higher-order interactions among genetic sequence elements found in nucleic acids, such as non-linear features found in RNA pseudoknots and other secondary structures formed as a result of internal nucleic acid base-pairing.

While linguistic methods have provided some interesting results in the recognition of complex biological signals (Searls, 1997), general methods for learning new grammars from example sentences are much less developed. Brazma has reviewed the development of methods for the automatic discovery of biological patterns (Brazma et al., 1998), much of which has taken place in the context of building databases of sequence motifs such as PROSITE (Bairoch et al., 1997), BLOCKS (Henikoff & Henikoff, 1996) and PFAM (Sonnhammer et al., 1997; Sonnhammer et al., 1998). Abe & Mamitsuka proposed a method for predicting [...]

We considered it valuable to investigate the application of Inductive Logic Programming methods to the discovery of a language that would describe a particularly interesting class of sequences: neuropeptide precursor proteins (NPPs). They are highly variable in length and undergo specific enzymatic degradation (proteolysis) before the biologically active short peptides (neuropeptides) are released. Unlike enzymes or structural proteins, NPPs tend to show almost no overall sequence similarity with the exception of some evidence for common ancestry within certain groups. Roughly, the 50 human neuropeptide precursors currently known contain about 140 known cleaved peptides. These peptides belong to at least 40 different neuropeptide families, with only 1-5 members per family. PROSITE motifs do exist for some of the more highly populated families, but these are based on the cleaved peptide and not the precursor. These motifs could be used to discover new members of a neuropeptide family, but they will never detect novel families of neuropeptides. It is believed that there are a great many more novel subgroups yet to be discovered. This confounds pattern discovery methods that rely on multiple sequence alignment and recognition of biological conservation. As a consequence NPPs pose a particular challenge in [...]

Neuropeptides are an important group of short proteins that act as neurotransmitters mediating the passage of signals within the central nervous system (CNS) and between the CNS and the rest of the body. The term neuropeptide was first introduced in 1971 by D. de Wied (Klavdieva, 1995) to describe fragments of hormones that produced behavioural changes when injected, but lacked the activity of the intact hormone. More recently the term has been accepted to cover peptides united by a number of common features, including their tissue expression (brain, nervous tissue, secretory cells from organs such as gut, heart, lungs, placenta etc.), metabolism, secretion, biosynthesis and high potency (Klavdieva, 1995).

Drug molecules work by interacting with target sites within the body. These sites commonly are protein molecules, either enzymes or receptors. By interaction with these protein molecules, drugs can modulate their actions and generally suppress undesirable biochemical reactions. Neuropeptides exert their biological actions through binding as ligands to specific receptors. The term ligand is used for molecules which bind to the target site. (A ligand might be highly active against the target, but not a `drug', because of a lack of other required properties such as metabolic stability or safety.)

Active research has increased the number of mammalian neuropeptides from about 18 in 1978 to more than 80 by 1999. However, despite all these efforts, the biology of many of these neuropeptides as well as their interactions with their receptors remain to be elucidated. The receptors of some neuropeptides have not yet been identified and there are some orphan receptors with as yet unidentified ligands. The in-vitro pairing of a novel receptor with its ligand is a critical first step in understanding the mechanism of disease. Thus novel neuropeptides and their orphan receptors have considerable therapeutic potential and are of widespread interest in the pharmaceutical industry.

Neuropeptides are subsequences of neuropeptide precursor sequences. An example of a neuropeptide precursor is shown in Figure 1. This neuropeptide [...] grammatic representation of several other precursors is shown in Figure 2. Precursors may contain either a single neuropeptide, multiple copies of the same neuropeptide or several different neuropeptides. These can occur consecutively in the precursor or can be separated by large stretches of filler peptide which is believed to play a purely structural role. Neuropeptide precursors contain a short prefix of residues called a signal peptide of about 20-30 amino acids (aa) in length. The known precursors range in length from 70 to 600 aa, and the cleaved peptides range from 3 to 200 aa. It is this huge variation in length, sequence and internal organisation that makes neuropeptide precursors difficult to use when searching for novel remote homologues using sequence database searching methods (e.g. BLAST). They also confound typical multiple sequence alignment methods used to identify conserved features among functionally related sequences.

Put Fig2 here

Many proteins are cleaved and trimmed after synthesis. For example, digestive enzymes are synthesised as inactive precursors that can be stored safely in the pancreas. After being released into the intestine, these precursors become activated by cleavage. Neuropeptide precursors undergo this `splitting' process. The signal peptide targets the protein for secretion through a cell membrane where it is then cleaved from the precursor. The remainder of the precursor is further cleaved to release the neuropeptide. Within the precursor, the locations of the cleavages of the signal sequence and the neuropeptides are referred to as cleavage sites.

To our knowledge there has been no previous attempt to computationally predict or recognise neuropeptides. Several investigators have statistically analysed various features of neuropeptides and their precursors (Devi, 1991; Rholam et al., 1995; Rholam et al., 1986; Bakalkin et al., 1991). Other investigators have developed methods for the identification of protein sorting signals and the prediction of their cleavage sites (Claros et al., 1997; Nielsen et al., 1999; Nielsen et al., 1997). However, the recognition of signal peptides only solves part [...]

SmithK-centred on the use of regular expression searching of translated Expressed Sequence Tag (EST) databases of gene fragments. The results of this approach were limited by the rigidity of the regular expressions, the high frequency of errors in EST sequences (Hillier et al., 1996), and their relatively short length. The ultimate goal of this project will be to differentiate novel neuropeptides from the wealth of other new sequences produced by the completion of the human genome sequencing project. This is a task that current sequence analysis [...]

This section describes an experiment whose results significantly contradict the null hypothesis (see Section 1). The section begins by describing the materials (data, background knowledge and machine learning systems) used in the experiment. This is followed by an account of the three steps of the experimental method. Finally the section ends with the presentation and analysis of the results.

4.1 Materials

4.1.1 Data

The data was taken from the SWISS-PROT database (Bairoch & Apweiler, 2000). SWISS-PROT is an annotated protein sequence database established in 1986 and maintained, with collaborators, by the Department of Medical Biochemistry of the University of Geneva. It can be accessed at

http://www.expasy.ch/sprot/sprot-top.html.

Our data-set comprises a subset of positives, i.e. known NPPs, and a subset of randomly-selected sequences. It is not possible to generate a large, unbiased set of negative examples because there will be proteins which have yet to be recognised scientifically as an NPP. The characteristics of the two subsets of sequences are as follows.

Positives This subset contains all of the 44 known NPP sequences that were in SWISS-PROT in Spring 1997, the time the data-set was prepared (see Table 8). The SWISS-PROT identifiers of these 44 sequences are listed in Tables 1 and 2.

Put Tab1 and Tab2 here

All but three of these are human proteins. The three non-human sequences were included as the human equivalent had not been discovered and they were considered to be important examples. They are expected to be very closely related to the human and are possibly a reasonable model [...]

These sequences are unrelated by sequence homology to the remaining 34.

Randoms This subset contains all of the 3910 full length human sequences in SWISS-PROT in Spring 1997.

1000 of the 3910 randoms were reserved for the test-set.

The data-set is available at

ftp://ftp.cs.york.ac.uk/pub/aig/Datasets/neuropeps/

4.1.2 Machine Learning Systems

The propositional learning was performed using the decision-tree learner C4.5 (Release 8) in conjunction with the companion program C4.5rules that constructs rules from a tree built by C4.5 (Quinlan, 1993). The grammar learning was performed using CProgol (Muggleton, 1995) version 4.4 which is available from

ftp://ftp.cs.york.ac.uk/pub/MLGROUP/progol4.4.

4.1.3 Background Knowledge

During both the generation of the grammar using CProgol and the generation of propositional rule-sets using C4.5 and C4.5rules we adopt the background information used in (Muggleton et al., 1992) to describe physical and chemical properties of the amino acids (see Table 3).

Put Tab3 here

4.2 Method

The method may be summarised as follows:

1. A grammar is generated for NPP sequences using CProgol (see Section 4.2.1).

2. A group of features is derived from this grammar. Other groups of features [...]

amalgam using C4.5 and C4.5rules and its performance is measured using Mean RA. This is a new cost function which is described in Appendix A. The null hypothesis (see Section 1) is then tested by comparing the Mean RA achieved from the various amalgams (see Section 4.2.3).

4. A hidden Markov model (HMM) is generated for NPP sequences and its Mean RA is measured (see Section 4.2.4).

4.2.1 Grammar Generation

An NPP grammar contains rules that describe legal neuropeptide precursors. Figure 3 shows an incomplete example of such a grammar, written as a Prolog program. This section describes how production rules for signal peptides and neuropeptide starts, middle-sections and ends were generated using CProgol. These were used to complete the context-free definite-clause-grammar structure shown in Figure 3. The start and end represent cleavage sites and the middle-section represents the mature neuropeptide, i.e. what remains after cleavage has taken place.

Put Fig3 here

The production rules to be learnt by CProgol contain dyadic predicates of the form p(X,Y), which denote that property p begins the sequence X and is followed by a sequence Y. To learn such rules from the training-set, CProgol was provided with the following extensional definitions:

Precursor data. Using details of the start and finishing positions for signal peptides and neuropeptides it was possible to generate examples of non-terminals as below:

signalpep(S,[]) where S is a list of precursor residues constituting the signal peptide.

start(S,[]) where S is a list of residues constituting the start of a [...]

neuropeptide. The starting residue for S is the first residue after the end of the sequence for start/2 above. The end of S was taken to be 3 residues from the last position of the neuropeptide.

end(S,[]) where S is a list of 3 precursor residues constituting the end of a neuropeptide. S commences with the residue after the end of the sequence for middle/2 above.

Random data. One random example for each of signalpep/2, start/2, middle/2 and end/2 was generated from each sequence in the set of random sequences. Random examples are distinguished by the prefix *.

*signalpep(S,[]) where S is a list of residues starting at the first position in the sequence. The length of S is obtained by drawing randomly from the distribution of signal peptide lengths of NPPs in the training data.

*start(S,[]) where S is a pair of sequence residues. The starting residue is randomly chosen, and is ensured not to conflict with the sequence chosen for *signalpep/2 above.

*middle(S,[]) where S is a list of residues starting after the end of the sequence for *start/2 above. The length of S is obtained by drawing randomly from the lengths of neuropeptide middle-sections of precursors in the training data.

*end(S,[]) where S is a list of 3 residues starting at the end of the sequence terminating the definition of *middle/2 above.
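The construction of the four random examples can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `random_examples` is hypothetical, the two length lists stand in for the lengths observed in the training data, and "not to conflict" is read here as meaning that the start may not overlap the signal peptide.

```python
import random


def random_examples(sequence, signal_lengths, middle_lengths, rng):
    """Cut one random signalpep/start/middle/end example out of a sequence.

    Returns None when the sequence is too short for the lengths drawn.
    """
    sig_len = rng.choice(signal_lengths)   # drawn from observed signal peptide lengths
    mid_len = rng.choice(middle_lengths)   # drawn from observed middle-section lengths
    start_len, end_len = 2, 3              # start is a pair of residues, end is 3 residues
    # Assumption: the start must lie after the signal peptide, and
    # start + middle + end must fit inside the sequence.
    latest_start = len(sequence) - (start_len + mid_len + end_len)
    if latest_start < sig_len:
        return None
    s = rng.randint(sig_len, latest_start)  # randint includes both end points
    m = s + start_len
    e = m + mid_len
    return {
        "signalpep": sequence[:sig_len],
        "start": sequence[s:m],
        "middle": sequence[m:e],
        "end": sequence[e:e + end_len],
    }


rng = random.Random(0)
sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQV"
print(random_examples(sequence, [20, 25], [5, 8], rng))
```

Each of the four returned strings would then be written out as a residue list for the corresponding starred predicate.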

CProgol was provided with definitions of the non-terminals star/2 and run/3 (see Table 14). star/2 represents some sequence of unnamed residues whose length is not specified. run/3 represents a run of residues which share a specified property.
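To make the p(X,Y) reading concrete, the following Python sketch mimics such non-terminals as functions that take a sequence X and return every possible remainder Y. It is an analogue written for illustration only: the actual Prolog definitions are those of Table 14, and the property set `POSITIVE`, the helper `then` and the choice that a run contains at least one residue are all assumptions made here.

```python
POSITIVE = set("KR")  # illustrative property set, not taken from Table 3


def star(seq):
    """star/2 analogue: skip any number of residues, including none."""
    return [seq[i:] for i in range(len(seq) + 1)]


def run(prop, seq):
    """run/3 analogue: consume one or more leading residues that all have `prop`."""
    remainders = []
    for i, residue in enumerate(seq):
        if residue not in prop:
            break
        remainders.append(seq[i + 1:])
    return remainders


def then(first, second):
    """Sequence two non-terminals: feed every remainder of `first` into `second`."""
    return lambda seq: [rest2 for rest1 in first(seq) for rest2 in second(rest1)]


# A gap of unspecified length followed by a run of positive residues.
gap_then_positive_run = then(star, lambda s: run(POSITIVE, s))
print(gap_then_positive_run("AAKR"))
```

A sequence is accepted by a rule when the empty remainder is among the results, which corresponds to the second argument [] in the examples given to CProgol.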

The grammar-based approach presents a powerful method for describing [...]

important: KR, GKR and GRR. The subsequences KR and GKR are established proteolytic cleavage sites found in NPPs; GRR is a relatively common alternative cleavage site to GKR. Pilot experiments suggested that the following patterns may be significant:

K, positive;

positive, positive;

Y, very hydrophobic;

hydrophilic, a gap of some residues, M, negative;

HP;

WMDF.

All these subsequences and patterns were coded as Prolog predicates and included as background knowledge (see Table 14).

Other pilot experiments on the training data showed that the accuracy of CProgol's grammar was higher with certain restrictions on the length of NPPs, signal peptides and neuropeptides. Specific constraints were obtained by progressively checking the following: all lengths less than the mean length on training data; lengths that are within 1, 2, ... standard deviations of the mean on training data. This resulted in the following additional restrictions: (1) NPP lengths not to exceed 200 residues; (2) signal peptide lengths to be between 19 and 29 residues; and (3) lengths of middle-sections of neuropeptides to be between 4 and 52 residues. These constraints only affect the values of features derived from the grammar. They do not constrain the value of the sequence length feature described at the end of Section 4.2.2.
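The three restrictions amount to a simple filter, which might be written as below. The function name is hypothetical and the bounds are assumed to be inclusive, which the text does not state explicitly.

```python
def satisfies_length_constraints(npp_length, signal_length, middle_lengths):
    """Check the three length restrictions, treating all bounds as inclusive (an assumption)."""
    return (
        npp_length <= 200                                  # (1) NPP length
        and 19 <= signal_length <= 29                      # (2) signal peptide length
        and all(4 <= m <= 52 for m in middle_lengths)      # (3) each middle-section length
    )


print(satisfies_length_constraints(150, 22, [10, 30]))
print(satisfies_length_constraints(250, 22, [10]))
```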

Appendix B shows the mode and type declarations, settings and prune predicates that were needed to enable CProgol to complete the grammar shown in [...]

The grammar features Predictions about an NPP sequence can be made by parsing it using the NPP grammar. The values of the features shown in Table 4 were obtained by such parses. Note that whenever the grammar predicts that a sequence is not an NPP, all of the features are assigned the value zero.

Put Tab4 here

The SIGNALP features Each feature in this group is a summary of the result of using the SIGNALP program on a sequence. The SIGNALP program (Nielsen et al., 1997) represents the pre-eminent automated method for predicting the presence and location of N-terminal signal peptides. SIGNALP is available on the web at

http://www.cbs.dtu.dk/services/SignalP.

The technique used combines the predictions of two different neural network groups: one that recognises cleavage sites, and the other that identifies signal peptides. When provided with a sequence of N-terminal residues, the following are reported as summaries: (a) C scores: which consist of the maximum value of the score from the cleavage-recogniser, the position in the sequence where this value is achieved, and a nominal `y' or `n' denoting the answer to whether a cleavage site is present; (b) S scores: the corresponding values from the signal-peptide recogniser; (c) Y scores: a score that combines the C and S scores; and (d) Mean scores: a mean of the S-score and S-conclusions from the N-terminal end to the predicted cleavage site. For the experiments here, SIGNALP was provided with 50 amino acids from the N-terminal for each sequence. The summaries were extracted and represented by 11 features shown in Table 5.

Put Tab5 here

The proportions features Each feature in this group is a proportion of the number of residues in a given sequence which either are a specific [...] properties shown in Table 3.
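A minimal sketch of how such proportions could be computed is given below. It assumes that each feature is the fraction of residues that either are a given amino acid or have a given property; the two property sets are placeholders chosen for illustration and are not the contents of Table 3, and the function name is hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Placeholder property sets for illustration only; Table 3 defines the real ones.
PROPERTIES = {"positive": set("KR"), "negative": set("DE")}


def proportion_features(sequence):
    """Return the proportion of residues per amino acid and per property."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    counts = Counter(sequence)
    total = len(sequence)
    features = {aa: counts[aa] / total for aa in AMINO_ACIDS}
    for name, members in PROPERTIES.items():
        features[name] = sum(counts[aa] for aa in members) / total
    return features


features = proportion_features("KKDA")
print(features["K"], features["positive"], features["negative"])
```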

The sequence length feature This feature is the length of the sequence. In the remainder of this paper this feature will be referred to as length.

4.2.3 Propositional Learning

The training and test data sets for C4.5 were prepared as follows.

1. Recall from Section 4.1.1 that our data comprises 44 positives and 3910 randoms. 40 of the 44 positives occur in the set of 3910 randoms. As C4.5 is designed to learn from a set of positives and a set of negatives, these 40 positives were removed from the set of randoms. Of the 40 positives which are in the set of randoms, 10 are in the test-set. Hence the set of (3910 - 40) sequences were split into a training-set of (2910 - 30 = 2880) and a test-set of (1000 - 10 = 990).

2. Values of the features were generated for each training and test sequence. Each sequence was represented by a data vector comprised of these feature values and 1 class value (`1' to denote an NPP and `0' otherwise).

3. Finally, to ensure that there were as many `1' sequences as `0' sequences, a training-set of 2880 NPPs was obtained by sampling with replacement. Thus the training data-set input to C4.5 comprised (2 × 2880) examples. (No re-adjusting was done on the test data.)
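The bookkeeping in steps 1 and 3 can be checked with a few lines of Python. The counts are those quoted above; the helper `oversample`, the placeholder identifiers and the figure of 34 training positives (inferred from the phrase "the remaining 34" in Section 4.1.1) are assumptions made for this sketch.

```python
import random

n_randoms, n_positives_in_randoms = 3910, 40
n_test_randoms, n_positives_in_test = 1000, 10

# Step 1: remove the known positives from the randoms, separately for each split.
n_train_negatives = (n_randoms - n_test_randoms) - (n_positives_in_randoms - n_positives_in_test)
n_test_negatives = n_test_randoms - n_positives_in_test


def oversample(items, target_size, rng):
    """Sample with replacement until `target_size` items have been drawn."""
    return [rng.choice(items) for _ in range(target_size)]


# Step 3: balance the classes by resampling the positives (placeholder identifiers).
train_positives = [f"npp_{i}" for i in range(34)]
balanced_positives = oversample(train_positives, n_train_negatives, random.Random(0))
training_set_size = len(balanced_positives) + n_train_negatives

print(n_train_negatives, n_test_negatives, training_set_size)
```

Because only the training data are rebalanced, the test-set keeps the natural rarity of positives, which is what makes the choice of performance measure matter later on.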

Amalgams of the feature groups described in the previous section were formed. The amalgams are listed in Table 6. The following procedure was followed for each one:

1. training and test sets were prepared as described above;

2. a decision tree was generated from the training-set using C4.5;

[...] rule-set on the test-set;

5. Mean RA was estimated as described in Appendix A.3.

The default settings of C4.5 and C4.5rules were used.

The contradiction of the null hypothesis was then attempted by testing whether:

the Mean RA of the best model which includes grammar-derived features was higher than the best performance achieved using any of the models which do not include the grammar-derived features;

such an increase was statistically significant. Estimates of Mean RA were compared using the statistical method described in Appendix A.4;

the best model which includes grammar-derived features was sufficiently more comprehensible than the best `non-grammar' model.

4.2.4 Hidden Markov Model Comparison

An HMM for NPPs was generated and tested using HMMER version 2.1.1 (Eddy, 1998) which is available from http://hmmer.wustl.edu/.

CLUSTAL W (1.8) (Thompson et al., 1994) was used to align the positive sequences in the training-set. The hmmbuild program of HMMER was then used to generate an HMM from the CLUSTAL W alignment. The resulting HMM and the hmmsearch program of HMMER were then used to search for NPPs in the test-set. The default settings of CLUSTAL W and HMMER were used. The Mean RA of the HMM was estimated based on the predictions on the test-set.

4.3 Results and Analysis

Table 13 shows the grammar that was generated using CProgol. The grammar is very rich in terms of non-terminals. All but one of the non-terminals [...] grammar. The grammar also includes star/2, run/3 and three of the six non-terminals which represent the patterns mentioned in Section 4.2.1. The non-terminals yvh and pp, which represent the patterns Y, very hydrophobic and positive, positive, both appear twice. The third non-terminal which appears is hmn, which corresponds to the pattern

hydrophilic, a gap of some residues, M, negative.

Table 6 shows that predictive accuracy is not a good measure of performance for this domain because it does not discriminate well between the amalgams: despite covering varying numbers of (the rare) positives, all the models are awarded a similar (high) score by predictive accuracy because they all exclude most of the abundant negatives.

Put Tab6 here
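A small numerical illustration may help to show why accuracy saturates when positives are rare. The counts below are hypothetical and are not taken from Table 6: a test-set of 10 positives and 990 negatives is assumed, and two imaginary models are compared that make the same few false alarms but find very different numbers of positives.

```python
def accuracy(true_pos, false_pos, n_pos, n_neg):
    """Fraction of all test sequences that are classified correctly."""
    true_neg = n_neg - false_pos
    return (true_pos + true_neg) / (n_pos + n_neg)


n_pos, n_neg = 10, 990  # hypothetical test-set composition

# Two imaginary models with the same false alarms but different coverage of positives.
weak = accuracy(true_pos=2, false_pos=5, n_pos=n_pos, n_neg=n_neg)
strong = accuracy(true_pos=8, false_pos=5, n_pos=n_pos, n_neg=n_neg)

print(f"weak model accuracy:   {weak:.3f}")
print(f"strong model accuracy: {strong:.3f}")
```

Under these assumed counts the two accuracies differ by less than one percentage point, although the second model finds four times as many positives.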

Table 6 shows the Mean RA for both the hidden Markov model and for each amalgam of feature groups. The Mean RA of the HMM is zero. The highest Mean RA (107.7) was achieved by one of the grammar amalgams, namely the `Proportions + Length + SIGNALP + Grammar' amalgam. The best Mean RA achieved by any of the amalgams which do not include the grammar-derived features was the 49.0 attained by the `Proportions + Length' amalgam.

The $\sum_{M=57}^{90} R_A$ for the `Proportions + Length + SIGNALP + Grammar' amalgam was 3661.376. The $\sum_{M=57}^{90} R_A$ for the `Proportions + Length' amalgam was 1666.733. For the amalgams `Proportions + Length + SIGNALP + Grammar' and `Proportions + Length', $\hat{D} = 1994.643$ and $\hat{\sigma}_{D}/\sqrt{n} = 2.081$. This difference is statistically significant: substituting these values of $\hat{D}$ and $\hat{\sigma}_{D}/\sqrt{n}$ into Equation 15 shows that $p(d<0)$ is well below 0.0001.

If one searches for a NPP by randomly selecting sequences from SWISS-PROT for synthesis and subsequent biological testing then, at most, only one in every 2408 sequences tested is expected to be a novel NPP. This follows from the fact that the number of sequences in SWISS-PROT = 79449, the most probable number of novel NPPs in SWISS-PROT = 90 − 57 = 33 (see Table 8) and 33/79449 = 1/2408. Using our best recognition model as a filter makes

pected to be a novel NPP. This can be seen from the following simple calculation. Rearranging Equation 2 gives $\Pr(NPP \mid Rec) = RA \times \Pr(NPP)$. Substituting in the Mean RA for the best recognition model gives $\Pr(NPP \mid Rec) = 107.688 \times (90/79449) = 1/8.2$. Multiplying 1/8.2 by the proportion of NPPs in SWISS-PROT which are novel (33/90) gives approximately 1/22.
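The arithmetic behind these three rates can be checked with a few lines of Python. This is only a sketch of the calculation described in the text; the variable names are illustrative and every number is taken from the paragraph above.

```python
# Figures quoted in the text (SWISS-PROT as of May 1999).
S = 79449           # number of sequences in SWISS-PROT
M_UPPER = 90        # most probable number of NPPs in SWISS-PROT
M_KNOWN = 57        # NPPs already known
MEAN_RA = 107.688   # Mean RA of the best recognition model

novel = M_UPPER - M_KNOWN                      # NPPs still to be discovered
random_rate = novel / S                        # chance a random sequence is a novel NPP
pr_npp_given_rec = MEAN_RA * (M_UPPER / S)     # Equation 2 rearranged
filtered_rate = pr_npp_given_rec * (novel / M_UPPER)

print(f"random sampling : 1 in {1 / random_rate:.0f}")
print(f"Pr(NPP | Rec)   : 1 in {1 / pr_npp_given_rec:.1f}")
print(f"with the filter : 1 in {1 / filtered_rate:.1f}")
```

The ratio of the filtered rate to the random rate is the Mean RA itself, which is why the improvement is more than a hundred-fold.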

Appendix C lists the complete rule-sets for the amalgams `Proportions + Length + SIGNALP + Grammar' and `Proportions + Length'. The rules that were generated from the `grammar amalgam' suggest that the NPP grammar is useful for learning from NPP sequence data. Nine of the 25 rules include a grammar-derived feature. These rules refer to a variety of the grammar-derived features:

• whether the grammar predicts the existence of a neuropeptide start (e.g. see Rule 14 in Figure 5);

• the first residue in the neuropeptide (e.g. see Rules 20 and 21 in Figures 5 and 6 respectively);

• the position of the first residue in the neuropeptide (e.g. see Rule 6 in Figure 5);

• the property of the third from last residue in the neuropeptide (e.g. see Rule 17 in Figure 5);

• the length of the signal peptide.

Our method did not try to remove the potential redundancy between values of some of the SIGNALP features and grammar features. The results listed in Table 6 justify this. It is of interest to note that the grammar-derived length of signal peptide is used frequently by C4.5 (see Rules 2, 9, 23 and 24), despite the availability of similar features derived from SIGNALP (see Table 5).

Finally, it should be noted that two of the rules refer only to grammar-derived

Aim  The aim of the second experiment is to demonstrate that overfitting of the NPP sequences did not inadvertently occur during the generation of the grammar in the first experiment (see Section 4.2.1). The data-set used in the first experiment contained all of the 44 known NPP sequences in SWISS-PROT in Spring 1997. The second experiment utilises 13 additional NPP sequences which had been added to SWISS-PROT by May 1999 (see Table 8). None of these 13 additional NPP sequences were used for training or testing in the first experiment. Indeed the identifiers of these 13 sequences were not known to C.H.B. (who performed the second experiment) until after the results of the first experiment were published (Muggleton et al., 2000).

Data  Two test-sets are used in this experiment.

1. The test-set used in the first experiment, i.e. the one which contains 10 of the 44 original NPPs and the 990 `other' sequences.

2. A new test-set comprising the 990 `other' sequences from the test-set used in the first experiment and 10 of the 13 additional NPP sequences which had been added to SWISS-PROT by May 1999. The SWISS-PROT identifiers of the additional 13 sequences are listed in Table 7 and the sequences are available at ftp://ftp.cs.york.ac.uk/pub/aig/Datasets/neuropeps/. Three of the 13 additional NPP sequences (P10092, P07492 and Q00072) were not used as they are homologues of NPPs in the original training set of 34 NPPs.

Method  The Mean RA of the grammar on both the original test-set and the new test-set was measured. Note that it was the Mean RA of the grammar generated by CProgol that was measured as opposed to one of the decision trees generated by C4.5 and C4.5rules. Mean RA was estimated using the

and the new test-set.

Conclusion  Overfitting of the original set of NPP sequences did not occur during the generation of the grammar: the performance of the grammar on the new and original data-sets is the same. This result provides further evidence to support our conclusion that we have developed a NPP recognition method which would provide cost-savings during a search for novel

This paper has shown that the most cost-effective, comprehensible multi-strategy predictor of human neuropeptide precursors does employ a context-free definite-clause-grammar.

The ILP Bayesian approach to learning from positive examples was used to generate a grammar for recognising a class of proteins known as human neuropeptide precursors (NPPs). Collectively, five of the co-authors of this paper have extensive expertise on NPPs and general bioinformatics methods. Their motivation for generating a NPP grammar was that none of the existing bioinformatics methods could provide sufficient cost-savings during the search for new NPPs. Prior to this project experienced specialists at SmithKline Beecham had tried for many months to hand-code such a grammar but without success. Our best predictor makes the search for novel NPPs more than 100 times more efficient than randomly selecting proteins for synthesis and testing them for biological activity (see Figure 4). As far as these authors are aware, this is both the first attempt to learn a biological grammar using ILP and the first real-world scientific application of the ILP Bayesian approach to learning from positive examples.

We first published that our best predictor delivers more than a hundred-fold cost-saving in the proceedings of the Seventeenth International Conference on Machine Learning (Muggleton et al., 2000). Since then we have obtained further evidence to support our conclusion that we have developed a NPP recognition method which would provide cost-savings during a search for novel NPPs. We have shown, using NPP sequences which had not been used previously on this project, that overfitting of the original set of NPP sequences did not occur during the generation of the grammar.

A shortcoming of the NPP grammar generated is that it will not recognise all NPPs because it implies that 1) both start and end cleavage sites are compulsory; 2) a mature neuropeptide cannot be adjacent to the signal peptide or

of the predicates npp/2 and neuropeptide/2 shown in Figure 3. Experiments with a more flexible grammar should therefore form the subject of future work.

In our opinion, the best `non-grammar' recognition model does not provide any biological insight. However the best recognition model which includes grammar-derived features is broadly comprehensible and contains some intriguing associations that may warrant further analysis. This model is being evaluated as an extension to existing methods used in SmithKline Beecham for the selection of potential neuropeptides for use in experiments to help elucidate the biological functions of G-protein coupled receptors. It is clear, however, that the rules of the model are not an optimal representation of sequence data and residue properties. A more intuitive (e.g. graphically oriented, sequence centred) display of the meaning of these rules would be required to build tools that the experts in the field would find acceptable.

The new cost function presented in this paper, Relative Advantage (RA), may be used to measure performance of a recognition model for any domain where

1. the proportion of positives in the set of examples is very small.

2. there is no guarantee that all positives can be identified as such. In such domains, the proportion of positive examples in the population is not known and a large, unbiased set of negatives cannot be identified with complete confidence.

3. there is no benchmark recognition method.

In Appendix A we have developed a general method for assessing the significance of the difference between RA values obtained in comparative trials. RA is estimated by summing the estimate of performance on each test-set instance. The method uses a) identically distributed random variables representing the outcome for each instance; b) a sample mean which approaches the population

The co-authors from SmithKline Beecham identified the problem addressed by the case-study, prepared the data-set, provided expertise on NPPs and general bioinformatics methods and also helped to write the domain-specific aspects of the paper. S.H.M. developed CProgol and the method of generating a NPP grammar. A.S. developed the method of using C4.5 to learn propositional rules from features derived from SIGNALP, proportions and sequence length. C.H.B. devised the set of grammar features, performed the experiments described in this paper and prepared all the sections of this paper except those on the domain. C.H.B. and S.H.M. developed the method for assessing the significance of the difference between the RA of two models and an associated method estimating

We would like to thank I.S. Gloger, M. Lawrence and R.B. Russell for sharing their knowledge of NPPs and sequence homology. We would like to thank J. Cussens for his help with Section A, M. Turcotte for his advice on HMM software and D. Michie for his comments on this work.

This research was conducted as part of a project entitled `Using Machine Learning to Discover Diagnostic Sequence Motifs' supported by a grant from SmithKline Beecham. During part of his time spent writing this article C.H.B. was supported by the EPSRC grant (GR/M56067) entitled `Closed Loop Machine Learning'. A.S. holds a Nuffield Trust Research Fellowship at Green

NPPs are identified either through purely biological means or by screening genomic or protein sequence databases for likely NPPs, followed by biological evaluation. If we wish to go beyond using sequence homology to find new members of the (generally small) NPP families, we need a recognition model for NPPs in general. However if this recognition model is poor then it may not be much better than random sampling of sequence databases (e.g. SWISS-PROT) and the cost-benefit of any experimental evaluation of NPPs found by such a procedure would be prohibitively small.

In developing a general recognition model for human NPPs, we are faced with three significant obstacles.

1. The number of known NPPs in the public domain databases of protein sequence (e.g. SWISS-PROT) is very small in proportion to the total number of sequences. When we developed our method of estimating RA (May 1999), SWISS-PROT contained 79,449 sequences, of which some 57 could definitely be identified as human NPPs.

2. There is no guarantee that all the human NPPs in SWISS-PROT have been properly identified. We estimate there may, in fact, be up to 90 NPPs in SWISS-PROT.

3. There is no benchmark method for NPP recognition that can be used to compare any new methods. We must therefore compare our recognition model with random sampling to evaluate success.

This domain requires a performance measure which addresses all of these issues.

Table 8 summarises how some of the properties of SWISS-PROT changed over the duration of the experiments described in this paper. All the RA measurements in this paper are based on the properties as they stood at May 99.

PROT.

A.1 Limitations of Existing Performance Measures

For domains in which positives are rare, predictive accuracy, as it is normally measured in Machine Learning (assuming equal misclassification costs):

• gives a poor estimate of the performance of a recognition model. For instance, if a learner induces a very specific model for such a domain, the predictive accuracy of the model may be very high despite the number of true positives being very small or even zero.

• does not discriminate well between models which exclude most of the (abundant) negatives but cover varying numbers of (the rare) positives. (This was illustrated earlier in this paper; see Table 6.)
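The first point can be made concrete with a small Python sketch. The test-set size (10 NPPs and 990 `other' sequences) is the one used in this paper; the counts for the second model are hypothetical and chosen only for illustration.

```python
def accuracy(tp, fp, fn, tn):
    """Predictive accuracy under equal misclassification costs."""
    return (tp + tn) / (tp + fp + fn + tn)

# A model that rejects every sequence finds no NPP at all.
reject_all = accuracy(tp=0, fp=0, fn=10, tn=990)

# Hypothetical model that recovers half of the NPPs at the cost of 10 false alarms.
finds_half = accuracy(tp=5, fp=10, fn=5, tn=980)

print(reject_all, finds_half)
```

Under these assumed counts the model that never recognises a NPP obtains the higher accuracy, even though it is useless as a filter for laboratory testing.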

For domains in which there is no benchmark recognition method that can be used to compare any new methods, Lift (Ling & Li, 1998) is not the appropriate measure of performance because it does not quantify the reduction in cost in using the predictor versus random sampling. Furthermore, in their paper, Ling & Li gave no explanation of how to assess the significance of the difference between the Lift of two models. ROC curves (or Lorentz diagrams) (Provost & Fawcett, 1998) also do not quantify the reduction in cost in using the predictor versus random sampling.

Therefore we define a relative advantage (RA) function which predicts the reduction in cost in using the model versus random sampling. In contrast to other performance measures, RA is meaningful and relevant to experts in the domain.

A.2 Definition of RA

In the following, `the model' refers to the learned recognition model for

in cost in using the model versus random sampling.

\[ RA = \frac{A}{B} \qquad (1) \]

where

A = the expected cost of finding one NPP by repeated independent random sampling from SWISS-PROT and performing a laboratory analysis of each protein.

B = the expected cost of finding one NPP by repeated independent random sampling from SWISS-PROT and analysing only those proteins which are predicted by the learned model to be a NPP.

RA can be defined in terms of probability as follows. Let

C = the cost of testing the biological activity of one protein via wet-experiments in the laboratory;

NPP = Sequence is a NPP;

Rec = Model recognises sequence as a NPP.

Equation 1 can now be rewritten as:

\[ RA = \frac{C/\Pr(NPP)}{C/\Pr(NPP \mid Rec)} = \frac{\Pr(NPP \mid Rec)}{\Pr(NPP)} \qquad (2) \]

Let testing the model on test data yield the 2 × 2 contingency table shown in Table 9 with the cells $n_1$, $n_2$, $n_3$ and $n_4$. Let $n = n_1 + n_2 + n_3 + n_4$ be the number of instances in the test-set. Note that the random set of sequences referred to in the right-hand column may include some NPP sequences. Table 10 shows an estimate of the contingency table that would be obtained if it were possible to identify and remove all the positives from the set of randoms. If the proportion of NPPs in the test-set was known to be the same as the proportion of NPPs in the database then we could estimate Pr(NPP) to be $(n_1 + n_3)/n$ and Pr(NPP | Rec) to be n1/(n1 + n2

in the test-set and database.

In order to derive a formula for estimating RA given both a set of positives and a set of randoms, we estimate Pr(NPP) and Pr(NPP | Rec) as follows. Let $S$ be the total number of sequences in the database, of which $M$ are NPPs.

\[ \Pr(NPP) = \frac{\text{no. of NPPs in the database}}{\text{no. of sequences in the database}} = M/S \qquad (3) \]

\[ \Pr(NPP \mid Rec) = \frac{N_{\mathrm{dbNPP\_recog}}}{N_{\mathrm{dbseq\_pred\_pos}}} \qquad (4) \]

where $N_{\mathrm{dbNPP\_recog}}$ is the number of NPPs in db which are recognised by the model and $N_{\mathrm{dbseq\_pred\_pos}}$ is the number of sequences in db which the model predicts to be NPP.

Table 11 shows the expected result of using the learned recognition model on the entire SWISS-PROT database. Note that the factor $(1 - \delta)$ does not appear as it cancels out. From Equation 4 and Table 11 it follows that:

\[ \Pr(NPP \mid Rec) \simeq \frac{\frac{n_1}{n_1+n_3} M}{\frac{n_1}{n_1+n_3} M + \frac{n_2}{n_2+n_4}(S - M)} = \frac{M p_1}{M p_1 + (S - M) p_2} \qquad (5) \]

where $p_1 = n_1/(n_1 + n_3)$ and $p_2 = n_2/(n_2 + n_4)$. Substituting Equations 3 and 5 into Equation 2 gives

\[ RA = \frac{(M p_1)/(M p_1 + (S - M) p_2)}{M/S} = \frac{S p_1}{M p_1 + (S - M) p_2} = \frac{S p_1}{S p_2 + M(p_1 - p_2)} \qquad (6) \]
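Equation 6 translates directly into code. The following Python sketch is illustrative only: the function names and the example counts are assumptions, while the formulas are those of Equations 2, 3, 5 and 6. It also checks that Equation 6 agrees with the ratio of Equation 5 to Equation 3.

```python
def relative_advantage(n1, n2, n3, n4, S, M):
    """Equation 6: RA = S*p1 / (S*p2 + M*(p1 - p2))."""
    p1 = n1 / (n1 + n3)   # fraction of test NPPs recognised by the model
    p2 = n2 / (n2 + n4)   # fraction of test randoms recognised by the model
    return S * p1 / (S * p2 + M * (p1 - p2))

def relative_advantage_via_eq2(n1, n2, n3, n4, S, M):
    """Equation 2 evaluated with Equations 3 and 5, as a cross-check."""
    p1 = n1 / (n1 + n3)
    p2 = n2 / (n2 + n4)
    pr_npp = M / S                                        # Equation 3
    pr_npp_given_rec = M * p1 / (M * p1 + (S - M) * p2)   # Equation 5
    return pr_npp_given_rec / pr_npp

# Hypothetical contingency table; S and M are the May 1999 figures from the text.
ra = relative_advantage(n1=5, n2=10, n3=5, n4=980, S=79449, M=57)
print(ra)
```

A model that recognises NPPs and randoms at the same rate ($p_1 = p_2$) has RA equal to 1, i.e. it is no better than random sampling. The sketch assumes the model recognises at least one test sequence; otherwise the denominator is zero.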

A.3 Estimating Relative Advantage

In the following Relative Advantage over the entire population is represented by RA in capital letters whereas Relative Advantage over a sample is denoted by lower case, i.e. ra. As the value of $M$ is not known, we estimate $\sum_{M=57}^{90} RA$.

is equal to the number of known NPPs in SWISS-PROT. The upper limit of $M$ is the most probable number of NPPs in SWISS-PROT, i.e. a total of the known NPPs and those proteins which have yet to be scientifically recognised as a NPP.

\[ \sum_{M=57}^{90} RA \simeq S p_1 \int_{M=57}^{91} \frac{1}{(p_1 - p_2) M + S p_2} \, \partial M = \frac{S p_1}{(p_1 - p_2)} \Big[ \ln((p_1 - p_2) M + S p_2) + k \Big]_{57}^{91} = \frac{S p_1}{(p_1 - p_2)} \ln \frac{91 (p_1 - p_2) + S p_2}{57 (p_1 - p_2) + S p_2} \qquad (7) \]

We estimate $\sum_{M=57}^{90} RA$ by summing an estimate of the $\sum_{M=57}^{90} RA$ for each instance in the test-set as follows, where $n$ is the number of instances in the test-set. This method has the advantage that it allows the significance of the difference between the RA of two models to be gauged (see Section A.4).

\[ \sum_{k=1}^{n} \sum_{M=57}^{90} ra_k \qquad (8) \]

From Equation 8 and the contingency table it follows that:

\[ \sum_{M=57}^{90} ra = \frac{1}{n} \sum_{i=1}^{4} \left( n_i \sum_{M=57}^{90} ra_i \right) \qquad (9) \]

Each $\sum_{M=57}^{90} ra_i$ is estimated by substituting $p_1 = \frac{a}{a+c}$ and $p_2 = \frac{b}{b+d}$ into Equation 7. The values of $a$, $b$, $c$ and $d$ are determined by three steps.

1. Whatever the $i$ value, $a$, $b$, $c$ and $d$ are initially given the values of the corresponding counts/frequencies in the contingency table for the test-set (see Table 9).

2. Each one of $a$, $b$, $c$ and $d$ is decremented providing that the value before subtraction is greater than 1.

We do not decrement when the value before subtraction is zero because this can result in $p_1$ or $p_2$

is one because this can cause $p_1$ or $p_2$ to have the value zero, which in turn has a highly disproportionate effect on the value of $\sum_{M=57}^{90} ra_i$.

3. The value of either $a$, $b$, $c$ or $d$ is incremented to reflect the classification of an instance in the cell $n_i$.

For instance, if $i = 2$ and all the counts in the contingency table are greater than one then $a = n_1 - 1$; $b = n_2$; $c = n_3 - 1$; $d = n_4 - 1$.

Note that Steps 1 and 2 assign the same prior probability to each instance because the effect of each step is not dependent upon which cell the current instance belongs to. Therefore this method of estimating $\sum_{M=57}^{90} RA$ has the properties of a) producing identically distributed random variables representing the outcome for each instance; b) having a sample mean which approaches the population mean in the limit and c) having a relatively small sample variance.

The final step of our method for estimating RA is to take the mean of the summed values.

\[ \text{Mean } RA = \frac{\sum_{M=57}^{90} ra_i}{90 - (57 - 1)} = \frac{\sum_{M=57}^{90} ra_i}{34} \qquad (10) \]
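The estimation procedure of this section (Equation 7, Steps 1 to 3, and Equations 9 and 10) can be sketched in Python as follows. This is one reading of the description above, not the authors' implementation: the function names, the handling of the degenerate cases $p_1 = 0$ and $p_1 = p_2$, and the example counts are all assumptions.

```python
import math

S = 79449             # SWISS-PROT size in May 1999 (from the text)
M_LO, M_HI = 57, 90   # bounds on M used in the text

def summed_ra(p1, p2, s=S):
    """Closed form of Equation 7 for the sum of RA over M = 57..90."""
    if p1 == 0:
        return 0.0                       # assumption: no recognised NPPs, no advantage
    diff = p1 - p2
    if diff == 0:
        return float(M_HI + 1 - M_LO)    # RA is 1 for every M, so the integral is 34
    upper = (M_HI + 1) * diff + s * p2
    lower = M_LO * diff + s * p2
    return s * p1 / diff * math.log(upper / lower)

def cell_counts(n, i):
    """Steps 1-3: copy the table, decrement counts above 1, add the instance to cell i."""
    counts = [c - 1 if c > 1 else c for c in n]   # n = [n1, n2, n3, n4]
    counts[i] += 1
    return counts                                 # [a, b, c, d]

def mean_ra(n):
    """Equations 9 and 10 for a 2x2 table n = [n1, n2, n3, n4]."""
    total = 0.0
    for i, n_i in enumerate(n):
        if n_i == 0:
            continue                              # empty cells carry zero weight
        a, b, c, d = cell_counts(n, i)
        total += n_i * summed_ra(a / (a + c), b / (b + d))
    return total / sum(n) / (M_HI - (M_LO - 1))

example = mean_ra([5, 10, 5, 980])   # hypothetical counts, for illustration only
print(example)
```

With `i = 1` (the second cell, zero-based) `cell_counts` reproduces the worked example in the text: $a = n_1 - 1$, $b = n_2$, $c = n_3 - 1$, $d = n_4 - 1$. The sketch assumes the test-set contains both NPPs and randoms, so that $a + c$ and $b + d$ are non-zero.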

A.4 Assessing the significance of the difference between the RA of two models

Next we develop a method for assessing the significance of the difference between the RA of two models. This method tackles a problem which is similar to that posed by the third question in Dietterich's taxonomy of statistical questions in Machine Learning (Dietterich, 1998). That is, how to choose between classifiers for a single application domain in which the amount of available data is sufficient to allow some of it to be set aside for evaluating classifiers. However a new method is needed because of the fundamental differences between Relative Advantage and predictive accuracy.

We compare the performance of two recognition models, $H_1$ and $H_2$, by comparing their $\sum_{M=57}^{90} RA$ values. Let $d$ be the difference in $\sum_{M=57}^{90}$

over the entire population, i.e. for all the proteins in SWISS-PROT, and $\hat{d}$ be the observed difference on the test-set.

\[ d = \sum_{M=57}^{90} RA_{H_1} - \sum_{M=57}^{90} RA_{H_2} \qquad (11) \]

\[ \hat{d} = \sum_{M=57}^{90} ra_{H_1} - \sum_{M=57}^{90} ra_{H_2} \qquad (12) \]

$\hat{d}$ is an unbiased estimator for the true difference because it is calculated using an independent test-set. To determine whether the observed difference is statistically significant we address the following question. What is the probability that $\sum_{M=57}^{90} RA_{H_1} > \sum_{M=57}^{90} RA_{H_2}$, given the observed difference, $\hat{d}$?

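If $D$ is a random variable representing the outcome of estimating $d$ by random sampling then, according to the Central Limit Theorem, $\hat{\mu}_D$ is normally distributed in the limit. It has an estimated mean $\hat{d}$ and has an estimated variance of $\hat{\sigma}^2_D / n$. The variance of a random variable, $X$, is $\sigma^2_X = E((X)^2) - (E(X))^2$. Therefore, since $D$ is a random variable:

\[ \hat{\sigma}^2_D = \hat{\mu}_{D^2} - \hat{\mu}^2_D \qquad (13) \]

We calculate $\hat{\mu}_{D^2}$ as follows. Let testing the model on test data yield the 4 × 4 contingency table shown in Table 12 with the cells $n_{i,j}$.

\[ \hat{\mu}_{D^2} = \frac{1}{n} \sum_{i=1}^{4} \sum_{j=1}^{4} \left( n_{i,j} \left( \sum_{M=57}^{90} ra_i - \sum_{M=57}^{90} ra_j \right)^2 \right) \qquad (14) \]

Given that $p(\sum_{M=57}^{90} RA_{H_1} > \sum_{M=57}^{90} RA_{H_2}) = p(\sum_{M=57}^{90} RA_{H_1} - \sum_{M=57}^{90} RA_{H_2} > 0)$ we evaluate our null hypothesis by estimating $p(d < 0)$ using the Central Limit Theorem.

\[ \int_{x=-\infty}^{0} \Pr(d = x)\, dx = \int_{x=-\infty}^{0} \frac{1}{\sqrt{2 \pi \sigma^2}} \, e^{-\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^2} dx \qquad (15) \]

where $\mu = \hat{\mu}_D$ and $\sigma = \hat{\sigma}_D / \sqrt{n}$.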
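A minimal Python sketch of this test is given below. It is an interpretation of Equations 13 to 15 rather than the authors' code: the function names are invented, `ra_h1[i]` and `ra_h2[j]` stand for the per-cell $\sum_{M=57}^{90} ra$ values of the two models, and $\hat{\mu}_D$ is taken to be the weighted mean of the per-instance differences, which is how Equations 9 and 12 are read here.

```python
import math

def difference_stats(n_ij, ra_h1, ra_h2):
    """Return (mu_D, sigma_D / sqrt(n)) from the 4x4 table of Table 12."""
    n = sum(sum(row) for row in n_ij)
    mu_d = 0.0    # mean difference (Equations 9 and 12)
    mu_d2 = 0.0   # mean squared difference (Equation 14)
    for i, row in enumerate(n_ij):
        for j, count in enumerate(row):
            diff = ra_h1[i] - ra_h2[j]
            mu_d += count * diff
            mu_d2 += count * diff ** 2
    mu_d /= n
    mu_d2 /= n
    var_d = max(mu_d2 - mu_d ** 2, 0.0)   # Equation 13, clipped against rounding error
    return mu_d, math.sqrt(var_d / n)

def p_d_below_zero(mu, sigma):
    """Equation 15: area of a normal N(mu, sigma^2) density below zero."""
    return 0.5 * math.erfc(mu / (sigma * math.sqrt(2)))

# Values reported in Section 4.3 for the two best amalgams.
p = p_d_below_zero(1994.643, 2.081)
print(p < 0.0001)
```

`p_d_below_zero` requires a strictly positive `sigma`; when every instance yields the same difference the standard error is zero and the integral is not defined.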


Table 13 shows the production rules generated by CProgol. The rules comply with Prolog syntax. signal(X,Y) is true if there is a signal peptide at the beginning of the sequence X, and it is followed by a sequence Y. The other dyadic predicates are defined similarly. Non-terminals and terminals which appear on the right hand side of the production rules listed in Table 13 are defined in Table 14. Table 14 shows the Prolog code representing the background knowledge input to Progol. The production rules, when taken together with the partial grammar shown in Figure 3, form a grammar for NPP sequences.

B.1 Defining a Hypothesis Language for Progol

A Hypothesis Language for Progol is defined by:

• mode and type declarations which state the forms that atoms in hypotheses may take (see Sections B.1.1 and B.1.2);

• prune declarations which further restrict the form of hypotheses (see Section B.1.3);

• the maximum number of layers of variables introduced by atoms in the body of induced clauses from variables in the head of the clauses;

• the maximum number of literals in the body of induced clauses.

The hypothesis language used in the experiment is defined by Tables 15, 16 and 17.

B.1.1 Mode Declarations

The mode declarations state the mode of call for those predicates that can appear in a hypothesis induced by Progol. There are two types of mode declarations, as shown below. The first describes the form of literals that may appear in the head of clauses induced by Progol and the second describes those that

:- modeb(Recallnumber, Bodyliteraltemplate)?

Headtemplate and Bodyliteraltemplate are templates of predicates and take the form predicate(ts1, ts2, ...), where ts is a term specification. Each term specification comprises two parts: a mode and a type. Types are described in Section B.1.2. The three possible modes are:

+  This indicates that the term is an input. That is, in all calls to this predicate, the term will be bound to a value.

-  This indicates the term is an output.

#  This indicates that a constant should appear in this term.

Recallnumber refers to the determinacy of the predicate template, that is, it specifies the maximum number of times a call to the predicate can succeed for a given set of input variables. Hence for determinate predicate templates it is set to one and for indeterminate predicate templates to values greater than one. If the user specifies a Recallnumber to be * then Progol assigns a default value of 100 to it.

B.1.2 Type Declarations

Types that are included in mode declarations may be unary predicates defined in the background knowledge. All but one of the mode declarations listed in Table 15 refer to the type rlist/1; this is a predicate whose definition is listed in Table 14. Progol type-checks a constant by executing a query in which the predicate corresponds to the type and the term is instantiated to the constant. If the query succeeds then Progol accepts that the constant is of the correct type.

B.1.3 Prune Declarations

Prune declarations are used to prevent Progol considering specified forms of

when Head is instantiated to the literal in the head of the proposed clause and Body is instantiated to the proposed body of the clause.

B.2 The Time and Space Complexity of Progol

The Progol algorithm, as analysed in (Muggleton, 1995), has time and space complexity which increases linearly in both the number of examples and the number of clauses in the learned theory. However, the scaling constants involved vary depending on the size of the hypothesis space searched for each clause. This is controlled using a number of parameters, including a clause length bound and a proof depth bound.

The rule-set that was generated from the `Proportions + Length + SIGNALP + Grammar' amalgam is shown in Figures 5 and 6. Each box contains a rule as it was output by C4.5rules together with the English translation which is shown in italics. The percentage in square brackets refers to the predicted accuracy of the corresponding rule. Each column of rules is tried in turn. Within each column, each rule is tried in order of appearance. The last rule is the `default' rule which is used if none of the other rules apply.

The rule-set that was generated from the `Proportions + Length' amalgam is shown in Figures 7, 8 and 9.

Abe, N., & Mamitsuka, H. (1997). Predicting protein secondary structure using stochastic tree grammars. Machine Learning, 29, 275–301.

Bairoch, A., & Apweiler, R. (2000). The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Research, 28, 45–48.

Bairoch, A., Bucher, P., & Hofman, K. (1997). The PROSITE database, its status in 1997. Nucleic Acids Research, 25, 217–221.

Bakalkin, G., Rakhmaninova, A., Akparov, V., Volodin, A., Ovchinnikov, V., & Sarkisyan, R. (1991). Amino acid sequence pattern in the regulatory peptides. International Journal of Peptide Protein Research, 38, 505–510.

Baldi, P., & Brunak, S. (1998). Bioinformatics: the machine learning approach. Cambridge, MA: MIT Press.

Brazma, A., Jonassen, I., Eidhammer, I., & Gilbert, D. (1998). Approaches to the automatic discovery of patterns in biosequences. Journal of Computational Biology, 5, 279–305.

Claros, M., Brunak, S., & von Heijne, G. (1997). Prediction of N-terminal sorting signals. Current Opinion in Structural Biology, 7, 394–398.

Devi, L. (1991). Consensus sequence for processing of peptide precursors at monobasic sites. FEBS, 280, 189–194.

Dietterich, T. G. (1998). Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 10, 1895–1924.

Doolittle, R. (Ed.). (1996). Computer methods for macromolecular sequence analysis. New York: Academic Press.

Durbin, R., Eddy, S., Krogh, A., & Mitchison, G. (1998). Biological sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge:

USA. Version 2.1.1 edition.

Henikoff, J., & Henikoff, S. (1996). Blocks database and its applications. Methods in Enzymology, 266, 88–105.

Hillier, L., Lennon, G., Becker, M., Bonaldo, F., Chiapelli, B., Chissoe, S., Dietrich, N., DuBuque, T., Favello, A., Gish, W., Hawkins, M., Hultman, M., Kucaba, T., Lacy, M., Le, M., Le, N., Mardis, E., Moore, B., Morris, M., Parsons, J., Prange, C., Rifkin, L., Rohlfing, T., Schellenberg, K., Soares, M., Tan, F., Thierry-Mieg, J., Trevaskis, E., Underwood, K., Wohldman, P., Waterston, R., Wilson, R., & Marra, M. (1996). Generation and analysis of 280,000 human expressed sequence tags. Genome Research, 6, 807–828.

Klavdieva, M. (1995). The history of neuropeptides. Frontiers in Neuroendocrinology, 16, 293–321.

Ling, C., & Li, C. (1998). Data mining for direct marketing: Problems and solutions. Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining (pp. 73–79). New York City.

Linz, P. (1996). An introduction to formal languages and automata. Lexington, Massachusetts: D.C. Heath and Company.

Lyall, A. (1996). Bioinformatics in the pharmaceutical industry. Trends In Biotechnology, 14, 308–312.

Michalski, R., & Wnek, J. (1997). Guest editors' introduction. Machine Learning, 27, 205–208.

Muggleton, S. (1995). Inverse entailment and Progol. New Generation Computing, 13, 245–286.

Muggleton, S. (1996). Learning from positive data. Proceedings of the Sixth International Workshop on Inductive Logic Programming (pp. 358–376).

challenge. Artificial Intelligence, 114, 283–296.

Muggleton, S., Bryant, C., & Srinivasan, A. (2000). Learning Chomsky-like grammars for biological sequence families. Proceedings of the Seventeenth International Conference on Machine Learning (pp. 631–638). Stanford University, USA: San Francisco, CA: Morgan Kaufmann.

Muggleton, S., King, R., & Sternberg, M. (1992). Protein secondary structure prediction using logic-based machine learning. Protein Engineering, 5, 647–657.

Muggleton, S., & Raedt, L. D. (1994). Inductive logic programming: Theory and methods. Journal of Logic Programming, 19,20, 629–679.

Nielsen, H., Brunak, S., & von Heijne, G. (1999). Machine learning approaches for the prediction of signal peptides and other protein sorting signals. Protein Engineering, 12, 3–9.

Nielsen, H., Engelbrecht, J., Brunak, S., & von Heijne, G. (1997). Identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites. Protein Engineering, 10, 1–6.

Pennisi, E. (1999). Human genome - Academic sequencers challenge Celera in sprint to the finish. Science, 283, 1822–1823.

Provost, F., & Fawcett, T. (1998). Analysis and visualization of classifier performance: Comparison under imprecise class and cost distributions. Proceedings of the Third International Conference on Artificial Intelligence (pp. 706–713). Menlo Park, CA: AAAI Press.

Quinlan, J. (1993). C4.5: Programs for machine learning. San Mateo, CA: Morgan Kaufmann.

Rawlings, C., & Searls, D. (1997). Computational gene discovery and human

Boileau, G., & Cohen, P. (1995). Role of amino-acid-sequences flanking dibasic cleavage sites in precursor proteolytic processing – the importance of the first residue C-terminal of the cleavage site. European Journal of Biochemistry, 227, 707–714.

Rholam, M., Nicolas, P., & Cohen, P. (1986). Precursors for peptide-hormones share common secondary structures forming features at the proteolytic processing sites. FEBS Letters, 207, 1–6.

Sakakibara, Y. (1997). Recent advances of grammatical inference. Theoretical Computer Science, 185, 15–45.

Searls, D. (1993). The computational linguistics of biological sequences. Artificial Intelligence in Molecular Biology. California, USA: AAAI Press.

Searls, D. (1997). Linguistic approaches to biological sequences [review]. Computer Applications in the Biosciences, 13, 333–344.

Sonnhammer, E., Eddy, S., Birney, E., Bateman, A., & Durbin, R. (1998). PFAM: multiple sequence alignments and HMM-profiles of protein domains. Nucleic Acids Research, 26, 320–322.

Sonnhammer, E., Eddy, S., & Durbin, R. (1997). PFAM: a comprehensive database of protein domain families based on seed alignments. Proteins, 28, 405–420.

Spence, P. (1998). Obtaining value from the human genome – a challenge for the pharmaceutical industry [review]. Drug Discovery Today, 3, 179–188.

Thompson, J., Higgins, D., & Gibson, T. (1994). CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic

P01019 P01042 P01156 P01178 P01185 P01189 P01210 P01213 P01258 P01270 P01275

P01279 P01282 P01286 P01298 P01303 P05060 P05305 P06881 P07491 P08858 P08949

P10082 P10645 P12272 P14138 P16860 P18509 P20366 P20382 P20800 P21591 P22466

Physicochemical Property   Amino acids with property
Hydrophobic                H,W,Y,F,M,L,I,V,C,A,G,T,K
Very hydrophobic           A,F,G,I,L,M,V
Hydrophilic                S,E,Q,R,D,N
Electropositive            R,K,H
Electronegative            D,E
Neutral                    A,C,F,G,I,L,M,N,P,Q,S,T,V,W,Y
Large                      Q,E,R,K,H,W,Y,F,M,L,I
Small                      P,V,C,A,G,T,S,N,D
Tiny                       A,G,S
Polar                      Y,T,S,N,D,E,Q,R,K,H,W
Aliphatic                  L,I,V
Aromatic                   H,W,Y,F
Hydrogen donor             W,Y,H,T,K,C,S,N,Q,R

Feature             Description
grampred            A boolean which indicates whether the grammar predicts a sequence to be a NPP or not.
gramsigl            Length of the signal peptide.
gramnpl             Length of the neuropeptide.
gramfirst           Position of the first residue in the neuropeptide.
gramlast            Position of the last residue in the neuropeptide.
gramnpstartfirst    The first residue in the neuropeptide or one of its properties.
gramnpstartsecond   The second residue in the neuropeptide or one of its properties.
gramnpendfirst      The first residue/property/star in the body of the end rule.
gramnpendsecond     The second residue/property/star in the body of the end rule.

Feature       Description
sigpcmax      Maximum SIGNALP C score
sigpcmaxpos   Position where maximum SIGNALP C score is achieved
sigpcconcl    SIGNALP C score conclusion (`y' or `n')
sigpymax      Maximum Y score reported by SIGNALP
sigpymaxpos   Position where maximum SIGNALP Y score is achieved
sigpyconcl    SIGNALP Y score conclusion (`y' or `n')
sigpsmax      Maximum SIGNALP S score
sigpsmaxpos   Position where maximum SIGNALP S score is achieved
sigpsconcl    SIGNALP S score conclusion (`y' or `n')
sigpsmean     SIGNALP mean of S scores to cleavage site

the decision trees generated from the amalgams of the feature groups. Mean RA was estimated using the method described in Section A.3.

Predictor                        Mean RA   Predictive Accuracy (%)
Hidden Markov Model                  0       99.0 ± 0.3
Only props                           0       96.7 ± 0.6
Only Length                          1.6     91.8 ± 0.9
Only SignalP                        11.7     98.1 ± 0.4
Only Grammar                        10.8     97.0 ± 0.5
Props+Length                        49.0     98.6 ± 0.4
Props+SignalP                       15.0     98.3 ± 0.4
Props+Grammar                       31.7     98.2 ± 0.4
SignalP+Grammar                      0       98.6 ± 0.4
Length+Grammar                       0       96.2 ± 0.6
Length+SignalP                      34.4     98.7 ± 0.4
Length+SignalP+Grammar               0       98.0 ± 0.4
Props+Length+SignalP                29.2     98.7 ± 0.4
Props+Length+Grammar                33.2     98.5 ± 0.4
Props+SignalP+Grammar               15.0     98.3 ± 0.4
Props+Length+SignalP+Grammar       107.7     99.0 ±

added to SWISS-PROT by May 1999.

P10092 P23435 O00230 P09681 O43555 P07492 P48645 P01138 P20783 P02818

1999.

                              Spring 1997   May 1999
Number of sequences              64,000       79,449
Number of known human NPPs         44           57

are labelled by the sets NPP sequences, Random sequences, $H$ (Hypothesis predictions) and $\bar{H}$ (complement of $H$). The cells of the matrix represent the cardinalities of the corresponding intersections of these sets. $n_1 + n_2 + n_3 + n_4 = n$, where $n$ is the number of instances in the test-set.

            Set of test NPP sequences   Set of test Random sequences
H                    n_1                          n_2
H̄                    n_3                          n_4

The axes of the 2 × 2 matrix are labelled by the sets NPP sequences, Negative sequences, $H$ (Hypothesis predictions) and $\bar{H}$ (complement of $H$). The cells of the matrix represent the cardinalities of the corresponding intersections of these sets. $\delta = M/S$ where $S$ is the total number of sequences in the entire SWISS-PROT database, of which $M$ are NPPs.

            Set of test NPP sequences   Set of test Negative sequences
H                    n_1                       n_2 (1 − δ)
H̄                    n_3                       n_4 (1 − δ)

are labelled by the sets NPP sequences, Random sequences, $H$ (Hypothesis predictions) and $\bar{H}$ (complement of $H$). The total of the counts/frequencies in the four cells = $S$, where $S$ is the total number of sequences in the SWISS-PROT database.

            NPP sequences in SWISS-PROT   Negative sequences in SWISS-PROT
H             (n_1 / (n_1 + n_3)) M          (n_2 / (n_2 + n_4)) (S − M)
H̄             (n_3 / (n_1 + n_3)) M          (n_4 / (n_2 + n_4)) (S − M)

by the cells of the $2 \times 2$ contingency table for $H_1$. The columns of the $4 \times 4$ matrix are labelled by the cells of the $2 \times 2$ contingency table for $H_2$. The cells of the $4 \times 4$ matrix represent the cardinalities of the corresponding intersections of these sets. $\sum_{i=1}^{4} \sum_{j=1}^{4} n_{i,j} = n$, where $n$ is the number of instances in the test-set.

         $n_1$       $n_2$       $n_3$       $n_4$
$n_1$    $n_{1,1}$   $n_{1,2}$   $n_{1,3}$   $n_{1,4}$
$n_2$    $n_{2,1}$   $n_{2,2}$   $n_{2,3}$   $n_{2,4}$
$n_3$    $n_{3,1}$   $n_{3,2}$   $n_{3,3}$   $n_{3,4}$
$n_4$    $n_{4,1}$   $n_{4,2}$   $n_{4,3}$   $n_{4,4}$
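As a minimal sketch of how such a $4 \times 4$ matrix could be filled in, the code below cross-tabulates the predictions of two hypotheses on the same test-set. The helper names `cell_index` and `cross_tabulate` and the toy data are hypothetical. The sketch also assumes that both hypotheses are evaluated on identical instances with identical class labels, and that cells are numbered $n_1$ to $n_4$ as in the $2 \times 2$ tables.

```python
def cell_index(is_npp, predicted_positive):
    """Return which 2x2 cell (1..4, matching n1..n4) an instance falls in."""
    if predicted_positive:
        return 1 if is_npp else 2
    return 3 if is_npp else 4

def cross_tabulate(labels, preds_h1, preds_h2):
    """Count instances by (cell under H1, cell under H2); rows are H1, columns H2."""
    table = [[0] * 4 for _ in range(4)]
    for is_npp, p1, p2 in zip(labels, preds_h1, preds_h2):
        i = cell_index(is_npp, p1)
        j = cell_index(is_npp, p2)
        table[i - 1][j - 1] += 1
    return table

# Toy data, for illustration only: three NPPs followed by three non-NPPs.
labels   = [True, True, True, False, False, False]
preds_h1 = [True, True, False, True, False, False]
preds_h2 = [True, False, False, False, False, True]

table = cross_tabulate(labels, preds_h1, preds_h2)
n = sum(sum(row) for row in table)
```

Under the shared-label assumption, an instance cannot be an NPP for one hypothesis and a non-NPP for the other, so entries such as `table[0][1]` (cell $n_{1,2}$) are always zero in this sketch.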


sigpep(A,B) :- g(A,C), star(C,D), s(D,B).
sigpep(A,B) :- m(A,C), star(C,D), hydrophilic(D,E), tiny(E,B).
sigpep(A,B) :- hydrophobic(A,C), star(C,D), w(D,E), hydrobacc(E,B).
sigpep(A,B) :- large(A,C), run(C,D,hydrophobic), star(D,E), t(E,B).
sigpep(A,B) :- m(A,C), star(C,D), t(D,E), neutral(E,F), small(F,B).
sigpep(A,B) :- m(A,C), star(C,D), veryhydrophobic(D,E), positive(E,F), tiny(F,B).
sigpep(A,B) :- hydrophobic(A,C), run(C,D,hydrophobic), star(D,E), f(E,F), hydrophobic(F,B).
sigpep(A,B) :- hydrophobic(A,C), star(C,D), h(D,E), hydrophobic(E,F), tiny(F,B).
sigpep(A,B) :- hydrophobic(A,C), star(C,D), v(D,E), hydrophobic(E,F), neutral(F,B).
sigpep(A,B) :- large(A,C), star(C,D), a(D,E), hydrophobic(E,F), small(F,B).
sigpep(A,B) :- large(A,C), star(C,D), s(D,E), neutral(E,F), small(F,B).

start(A,B) :- a(A,C), veryhydrophobic(C,B).
start(A,B) :- d(A,C), t(C,B).
start(A,B) :- g(A,C), v(C,B).
start(A,B) :- h(A,C), r(C,B).
start(A,B) :- k(A,C), r(C,B).
start(A,B) :- l(A,C), r(C,B).
start(A,B) :- q(A,C), g(C,B).
start(A,B) :- s(A,C), l(C,B).
start(A,B) :- w(A,C), q(C,B).
start(A,B) :- hydrophilic(A,C), a(C,B).
start(A,B) :- hydrophilic(A,C), hydrophilic(C,B).
start(A,B) :- positive(A,C), k(C,B).
start(A,B) :- small(A,C), r(C,B).
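To make the reading of these clauses concrete, here is a minimal Python sketch of the eight start clauses whose bodies contain only single-residue predicates. It assumes the usual difference-list convention, in which p(A,B) holds when p consumes a prefix of the sequence A and leaves the remainder B. The clauses that use property predicates (veryhydrophobic, hydrophilic, positive, small) are omitted because their definitions do not appear in this listing. The helper names `terminal`, `sequence` and `alternatives` are hypothetical, and residues are written in lower case to match the predicate names.

```python
# Residue pairs taken from the single-letter start clauses above.
START_PAIRS = ["dt", "gv", "hr", "kr", "lr", "qg", "sl", "wq"]

def terminal(letter):
    """Parser that consumes one specific residue from the front of a sequence."""
    def parse(seq):
        return [seq[1:]] if seq[:1] == letter else []
    return parse

def sequence(*parsers):
    """Parser that applies each parser in turn, threading the remainder through."""
    def parse(seq):
        remainders = [seq]
        for parser in parsers:
            remainders = [rest for r in remainders for rest in parser(r)]
        return remainders
    return parse

def alternatives(*parsers):
    """Parser that succeeds if any alternative succeeds (one clause per alternative)."""
    def parse(seq):
        return [rest for parser in parsers for rest in parser(seq)]
    return parse

# start(A,B): every remainder B that is left after matching a prefix of A.
start = alternatives(*(sequence(terminal(a), terminal(b)) for a, b in START_PAIRS))

matched = start("krsa")    # consumes "kr", leaving "sa"
unmatched = start("aaaa")  # no single-letter start clause applies
```

Each parser returns a list of possible remainders, mirroring the way a Prolog clause can succeed with more than one binding for B; an empty list corresponds to failure.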

middle(A,B) :- yvh(A,C), star(C,D), large(D,E), large(E,B).
middle(A,B) :- positive(A,C), star(C,D), neutral(D,E), large(E,F), large(F,B).
middle(A,B) :- hydrobacc(A,C), star(C,D), hydrophobic(D,E), neutral(E,F), aromatic(F,B).
middle(A,B) :- hydrobacc(A,C), yvh(C,D), star(D,B).
middle(A,B) :- small(A,C), star(C,D), p(D,E), large(E,F), large(F,B).
middle(A,B) :- y(A,C), star(C,D), g(D,E), hydrophobic(E,B).
middle(A,B) :- hydrobacc(A,C), star(C,D), k(D,E), neutral(E,F), small(F,B).
middle(A,B) :- small(A,C), star(C,D), l(D,E), m(E,B).
middle(A,B) :- small(A,C), star(C,D), f(D,E), hydrophobic(E,F), aliphatic(F,B).
middle(A,B) :- tiny(A,C), star(C,D), m(D,B).
middle(A,B) :- q(A,C), star(C,D), positive(D,E), neutral(E,F), neutral(F,B).
middle(A,B) :- hydrophobic(A,C), star(C,D), m(D,E), hydrophilic(E,F), neutral(F,B).
middle(A,B) :- e(A,C), star(C,D), i(D,B).
middle(A,B) :- q(A,C), star(C,D), l(D,B).
middle(A,B) :- aromatic(A,C), star(C,D), v(D,E), neutral(E,F), hydrobdon(F,B).
middle(A,B) :- aromatic(A,C), star(C,D), a(D,E), e(E,B).
middle(A,B) :- c(A,C), star(C,D), c(D,B).
middle(A,B) :- y(A,C), star(C,D), hydrobdon(D,E), hydrobdon(E,B).