learning from biological sequence data? { a case study S.H.Muggleton y C.H. Bryant z
DepartmentofComputerScience,UniversityofYork,YorkYO105DD,UK.
A.Srinivasan
OxfordUniversityComputingLaboratory,WolfsonBuilding,ParksRoad,OxfordOX13QD,UK.
A.Whittaker x
S.Topp C.Rawlings
{
SmithKlineBeecham,NewFrontiers,SciencePark,ThirdAvenue,Harlow,EssexCM195AW,UK.
Keywords: Bioinformatics, Machine Learning, Inductive Logic
Program-ming,CostFunction,GrammaticalInference
Abrief account of part of thiswork was published inthe proceedings of seventeenth
internationalconferenceonMachineLearning(Stanford,29June-2July2000).
y
Currentaddress:DepartmentofComputing,ImperialCollegeofScience,Technologyand
Medicine,180Queen'sGate,LondonSW72BZ.
z
Addresscorrespondenceto:SchoolofComputerandMathematicalSciences,TheRobert
GordonUniversity,Aberdeen,AB251HG,Scotland,UK.
x
Currentaddress:PsygnosisLtd.,37,KentishTownRoad,LondonNW18NX,UK.
{
This paper investigates whether Chomsky-like grammar
representa-tions are useful for learning cost-eective, comprehensible predictors of
membersof biological sequencefamilies. TheInductiveLogic
Program-ming(ILP)Bayesianapproachtolearningfrompositiveexamplesisused
to generatea grammarfor recognising aclass ofproteins knownas
hu-manneuropeptideprecursors(NPPs). Collectively,veoftheco-authors
ofthispaper,haveextensiveexpertiseonNPPsandgeneral
bioinformat-ics methods. Theirmotivationfor generatingaNPPgrammarwas that
noneoftheexistingbioinformaticsmethodscouldprovidesuÆcient
cost-savingsduringthesearchfornewNPPs. Priortothisprojectexperienced
specialists at SmithKlineBeecham had triedfor many monthsto
hand-code such a grammar but without success. Our best predictor makes
thesearchfor novelNPPsmore than100 times moreeÆcientthan
randomlyselecting proteins forsynthesisand testingthemfor biological
activity.Asfarastheseauthorsareaware,thisisboththerstbiological
grammarlearntusingILPandtherstreal-worldscienticapplicationof
theILPBayesianapproachtolearningfrompositiveexamples.
A groupof features is derivedfrom this grammar. Other groups of
features ofNPPsarederivedusingotherlearning strategies. Amalgams
of these groups are formed. A recognition model is generated for each
amalgamusingC4.5 andC4.5rules anditsperformance is measured
us-ing both predictive accuracyand a newcost function, Relative
Advant-age (R A). The highest R A was achieved by a model which includes
grammar-derivedfeatures. ThisR Aissignicantlyhigherthanthebest
R Aachievedwithouttheuseofthegrammar-derivedfeatures. Predictive
accuracyis notagoodmeasure ofperformance for thisdomainbecause
it doesnot discriminate well between NPP recognitionmodels: despite
coveringvaryingnumbersof(therare)positives,allthemodelsare
awar-dedasimilar(high)scorebypredictiveaccuracybecausetheyallexclude
Thispaperattemptstoanswer,bywayofacase-study,thequestionofwhether
grammatical representations are useful for learning from biological sequence
data. Weaddressthequestionwithexperimentalresultsthatsignicantly
con-tradictthefollowingnullhypothesis.
Nullhypothesis: Themostcost-eective,comprehensiblemulti-strategy
pre-dictors of human neuropeptide precursors do not employ acontext-free
denite-clause-grammar.
Multi-strategylearning(Michalski&Wnek,1997)aimsat integrating
mul-tiplestrategiesin asinglelearningsystem, wherestrategiesmaybe inferential
(e.g. induction, deduction etc) or computational. Computational strategy is
dened by therepresentationalsystemandthecomputational methodused in
thelearningsystem(e.g.decisiontreelearning,neuralnetworklearningetc).
A grammar for alanguagetells us whether a sentence is properly formed.
NoamChomsky, afounderofformal languagetheory, providedaninitial
clas-sicationoflanguagetypes. Those readersrequiringanintroductiontoformal
grammarsorthishierarchyarereferredto (Linz,1996).
We obtainresultswhich signicantly contradict thenullhypothesis as
fol-lows. A grammaris generatedfor aparticular classofbiologicalsequences. A
groupof features is derivedfrom this grammar. Other groupsof features are
derivedusing otherlearningstrategies. Amalgamsofthese groupsareformed.
Arecognitionmodelis generatedfor each amalgam usingC4.5 andC4.5rules.
Theresultssignicantlycontradictthenullhypothesis
because:-1. the best performance achieved using any of the models which include
grammar-derived features is higher than the best performance achieved
using anyof the models which do notinclude thegrammar-derived
fea-tures;
morecomprehensiblethanthebest `non-grammar'model.
Performance is measured using a new cost function, Relative Advantage
(RA). Appendix A denes RA and explains why it is used in preference to
otherperformancemeasures. A method ofestimating theRA ofarecognition
modelis presentedwhich subsequentlyallowsthestatistical signicanceof the
dierencebetweentheRAoftwomodelstobegauged.
Thedomainofthecasestudyistherecognitionofaclassofproteinsknown
ashuman neuropeptide precursors (NPPs). These proteins haveconsiderable
therapeuticpotentialand areof widespreadinterestin thepharmaceutical
in-dustry (see Section 3). Our best multi-strategy predictor of NPPsemploys a
context-freedenite-clause-grammar.
AnInductiveLogicProgramming(ILP)(Muggleton&Raedt,1994)system
is used to generatea grammar for NPPs. As far as these authors are aware,
this is the rst attempt to generate a grammar for abiological domain using
ILP. ILPistheareaofArticialIntelligencewhich dealswith theinductionof
hypothesised predicate denitions from examples and background knowledge.
Logic programs are used as a single representation for examples, background
knowledgeandhypotheses. ForarecentoverviewofILPissuesandresultssee
(Muggleton,1999).
MostILPsystemsrequirebothasetofpositiveexamplesoftheconceptto
belearntandaset ofnegativeexamples. Howeveritisnotpossibletoidentify
alarge,unbiasedsetofnegativeexamplesofNPPswithcertaintybecausethere
willbeproteinswhichhaveyettoberecognisedscienticallyasaNPP.Therefore
advantage was taken of the ILP Bayesianapproach to learningfrom positive
examples(Muggleton,1996). Thisapproachdoesnotrequireaset ofnegative
examples. Itisabletolearnaconceptfromasetofpositiveexamplesandaset
ofexamplessampledat random. As farastheseauthors areaware, thisis the
informationinmolecularbiologyandprevioustechniquesforlearningfromit,
in-cludinggrammaticalinference. Thissectionreviewsgrammaticalinferencefrom
theviewpointof learningcost-eective, comprehensiblepredictorsof members
ofbiologicalsequencefamilies. Thosereadersinterestedin previoustheoretical
resultsand applicationsto naturallanguagearereferredto (Sakakibara, 1997)
andthereferencestherein. Section 3introducesneuropeptideprecursor
recog-nition,thedomain of thecase-study. Sections 4and5detailtheexperimental
materials, methods and results. Section 6is the Discussion. Appendix A
de-scribesthenew costfunction RelativeAdvantage(R A). Appendix Bincludes
the production rules generated by CProgol. Appendix C includes our best
Research in the biological and medical sciences is being transformed by the
volume ofdata comingfrom projectswhich will reveal theentire geneticcode
(genomesequence)ofHomosapiensaswellasotherorganisms. Oncecomplete,
these projects should help us understand the genetic basis of human disease.
Thegrowthinthevolumeofdataandimprovementsinsoftwareforinterpreting
thisinformationhasincreasedinterestintheuseofcomputationalmethodsfor
identifyinggenesinvolvedin humandisease(Rawlings&Searls,1997).
Know-ingthe genesimplicated in adiseaseidenties theproteinsthat theycodefor
and possibly suggests the biochemical processes that may be inuencing the
developmentofthedisease. Thisinformationiscrucialforthegenerationofthe
experimental reagents needed for the development of new drugs and explains
thewidespreadinvestmentbythebiotechnologyandpharmaceuticalindustries
inbioinformaticsstaandtechnologies(Lyall,1996;Spence,1998).
Recentannouncementsindicateacommitmenttocompletesequencingofthe
entire human genome during the year 2000, through acceleratedinternational
funding of public research (Pennisi, 1999). This initiative has promoted the
developmentoftechnologytogenerateraw(uninterpreted)genesequencedata
at a rate of at least 1Gigabase of new DNA peryear. Once the full human
genomesequenceisavailable,itwillnotbelongbeforethenucleotidesequenceof
everyhumangeneisknown. Theaminoacidsequencesoftheproteinsencoded
bythesegenescanthenbededuced. Currentestimatesofthenumberofgenesin
thehumangenomevarygreatly,buttendtoaveragearound60,000. Ifdierent,
but similar biologicalsequences are believed to havearisenby evolution from
acommon ancestor, the proteins orgenes are said to behomologous to each
other. For asignicantportionofthese deduced proteinsequences, afunction
canbeinferredduetothehomologyofthesequencetoanotherknownprotein.
However,giventhelargenumbersofnewproteinsequencesexpectedtobesolved
inthenearfuture therewill beagreatmanyforwhich noclearhomologuesof
informationprocessinginfrastructureinbothcommercialandacademicresearch
sectors.Currentstateoftheartcomputationalsequenceinterpretationmethods
arenotcapableofsolvingthesequencetofunctionproblemforallnewproteins,
andthedevelopmentofnewtechniquesisstillrequired.
Asignicantchallengeintheanalysisandinterpretationofgeneticsequence
datais thereforetheaccurate recognitionof patternswithin the datathat are
diagnosticforknownstructural orfunctional features within theprotein. The
languageofgenesiswritteninasimplealphabetfA, C, G, Tgrepresentingthe
fourDNAbasecodes. Thelanguageofproteinsusesatwentycharacteralphabet
fA, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Yg
repres-entingamino acidresidues. These residuesareencodedin genes bysuccessive
DNAbasetriplets. Attheirsimplest,thesepatternscanbedescribedasregular
expressions. Manyfeaturescanbedescribedthroughtheuseofregular
expres-sions and a database PROSITE (Bairoch et al., 1997) is available in which
thesepatternsarecurated. APROSITEpatternsuchas:
[AC]-x-V-x(4)-fEDg
istranslatedas:
[A or C]-any-V-any-any-any-any-fany but E or Dg
AmoreextensiveexampleofaPROSITEpatternisthat forthefamilyof
pro-teinscalledshort-chaindehydrogenases(enzymesinvolvedincellmetabolism).
The PROSITE pattern that includes two perfectly conserved residues, a
tyr-osine(Y)andalysine(K) is:
[LIVSPADNK]-x(12)-Y-[PSTAGNCV]-[STAGNQCIVM]-[STAGC]-K-fPCg-[SAGFY-R]-[LIVMSTAGD]-x(2)-[LIVMFYW]-x(3)-[LIVMFYWGAPTHQ]-[GSACQRHM]
Thereare,however,limitationsinherentintheuseofsimpleregular
expres-sionsas arepresentationofbiologicalsequencepatterns. Inrecentyears
atten-tionhasshiftedtowardsboththeuseofneuralnetworkapproaches(see(Baldi
& Brunak, 1998)) and to probabilistic models, in particular hidden Markov
models (see (Durbin et al., 1998)). Both these methods directly address the
plemented with well-established methods for training models from examples.
Training regimes generally(but not always)requirethat thesequences in the
trainingset bearrangedso that those regionsof the sequencethat havebeen
conservedthroughevolutionarealignedinthesamecolumn. Theaccurate
mul-tiple alignment of biological sequenceshas been thesubjectof much research
and discussion (Doolittle, 1996)), and is considered by many to be a solved
problem. However,itrelies ontheassumptionthat thesequencestobealigned
showsomehomologyatthesequencelevelwitheachother. Complexbiological
signalsalso requirecomplex models and it is often thecase that considerable
expertiseisrequiredintheselectionoftheoptimalneuralnetworkarchitecture
orhiddenMarkovmodelbefore trainingcantakeplace.
A generallinguisticapproachto representingthestructure andfunction of
genesandproteinshasintrinsicappealasanalternativeapproachto
probabil-isticmethods becauseof thedeclarative andhierarchicalnature of grammars.
Searls(Searls,1993)hasundertakenthemostthoroughanalysisofthelinguistic
classicationof geneticgrammars startingwith the Denite Clause Grammar
(DCG).SearlsproposesaStringVariableGrammar(SVG)extensiontoaDCG
to providefeatures necessaryfor representinghigher-orderinteractions among
genetic sequence elements found in nucleic acids such as non-linear features
foundin RNA pseudoknots and other secondarystructures formedas aresult
ofinternalnucleicacidbase-pairing.
Whilelinguisticmethodshaveprovidedsomeinterestingresultsinthe
recog-nitionofcomplexbiologicalsignals(Searls,1997)generalmethodsforlearning
newgrammars from example sentences are much lessdeveloped. Brazma has
reviewedthedevelopmentofmethods fortheautomaticdiscoveryofbiological
patterns(Brazmaet al.,1998)muchofwhichhastakenplaceinthecontextof
buildingdatabasesofsequencemotifssuchasPROSITE(Bairochet al.,1997),
BLOCKS (Heniko &Heniko, 1996) and PFAM (Sonnhammer et al., 1997;
Sonnhammeret al.,1998). Abe&Mamitsukaproposedamethodforpredicting
Weconsidered it valuableto investigatethe application of InductiveLogic
Programmingmethodstothediscoveryofalanguagethatwoulddescribea
par-ticularlyinterestingclassofsequences{neuropeptideprecursorproteins(NPP).
Theyarehighly variableinlengthandundergospecic enzymaticdegradation
(proteolysis) before the biologically active short peptides (neuropeptides) are
released. Unlikeenzymesorstructuralproteins,NPPstend toshowalmostno
overallsequencesimilaritywiththeexceptionofsomeevidenceforcommon
an-cestrywithin certain groups. Roughly, the50 humanneuropeptideprecursors
currently known contain about 140 known cleaved peptides. These peptides
belongtoatleast40dierentneuropeptidefamilies,withonly1-5membersper
family. Prositemotifsdoexist forsomeofthemorehighlypopulatedfamilies,
but these arebased onthecleavedpeptideand notthe precursor. These
mo-tifscouldbeusedtodiscovernewmembersofaneuropeptidefamily,but they
willneverdetectnovelfamiliesofneuropeptides. Itisbelievedthat therearea
greatmanymorenovelsubgroupsyettobediscovered. Thisconfoundspattern
discoverymethodsthatrelyonmultiple sequencealignmentandrecognitionof
biologicalconservation. As aconsequenceNPPsposeaparticularchallenge in
Neuropeptides are an important group of short proteins that act as
neuro-transmittersmediatingthepassageofsignalswithinthecentralnervoussystem
(CNS)and betweentheCNSandtherestofthebody. Thetermneuropeptide
was rstintroduced in1971byD. deWeid (Klavdieva,1995)to describe
frag-mentsofhormonesthatproducedbehaviouralchangeswheninjected,butlacked
theactivityofthe intacthormone. Morerecentlythe termhasbeenaccepted
tocoverpeptides united by anumberofcommon features, includingtheir
tis-sue expression(brain, nervous tissue, secretorycells from organssuch asgut,
heart,lungs,placentaetc)metabolism,secretion,biosynthesisandhighpotency
(Klavdieva,1995).
Drugmoleculesworkbyinteractingwithtargetsiteswithinthebody. These
sitescommonlyareproteinmolecules,eitherenzymesorreceptors. By
interac-tion with these protein molecules, drugscan modulate their actions and
gen-erally suppress undesirable biochemical reactions. Neuropeptides exert their
biological actions through binding as ligands to specic receptors. The term
ligandis usedformolecules which bindto thetarget site. (Aligand mightbe
highly active against the target, but not a `drug', because of alack of other
requiredpropertiessuchasmetabolicstabilityorsafety.)
Activeresearchhasincreasedthenumberofmammalianneuropeptidesfrom
about18in1978tomorethan80by1999. However,despitealltheseeorts,the
biologyof many ofthese neuropeptides aswellastheirinteractions withtheir
receptorsremaintobeelucidated. Thereceptorsofsomeneuropeptideshavenot
yetbeenidentiedandtherearesomeorphanreceptorswithasyetunidentied
ligands. The in-vitro pairing of a novel receptor with its ligand is a critical
rststepinunderstandingthemechanismofdisease. Thusnovelneuropeptides
and theirorphan receptors haveconsiderabletherapeutic potential andare of
widespreadinterestinthepharmaceuticalindustry.
Neuropeptides are subsequences of neuropeptide precursor sequences. An
exampleof aneuropeptide precursoris shown in Figure 1. This neuropeptide Put
grammatic representation of several other precursors is shown in Figure 2.
Precursorsmaycontaineitherasingleneuropeptide,multiplecopiesofthesame Put
Fig2
here
neuropeptideorseveraldierentneuropeptides. Thesecanoccurconsecutively
in the precursororcanbe separatedbylarge stretchesof ller peptidewhich
is believed to play a purely structural role. Neuropeptide precursors contain
a short prex of residues called a signal peptide of about 20{30 amino acids
(aa)in length. Theknownprecursors rangein lengthfrom 70 to 600aa, and
the cleavedpeptides range from 3-200 aa. It is this hugevariation in length,
sequenceandinternalorganisationthatmakesneuropeptideprecursorsdiÆcult
to use when searching for novel remote homologues using sequence database
searchingmethods(e.g. BLAST).Theyalsoconfoundtypicalmultiplesequence
alignment methods used to identify conservedfeatures among functionally
re-latedsequences.
Many proteins are cleaved and trimmed after synthesis. Forexample,
di-gestiveenzymesaresynthesisedasinactiveprecursorsthatcanbestoredsafely
inthepancreas. Afterbeingreleasedintotheintestine,theseprecursorsbecome
activatedbycleavage. Neuropeptideprecursorsundergothis`splitting'process.
The signalpeptide targets the protein for secretionthrough a cell membrane
whereitis thencleavedfrom theprecursor. Theremainderoftheprecursoris
furthercleavedtoreleasetheneuropeptide. Withintheprecursor,thelocation
ofthecleavagesofthesignalsequenceandtheneuropeptidesarereferredtoas
cleavage sites.
Toour knowledge there has beenno previous attempt to computationally
predictorrecogniseneuropeptides. Severalinvestigatorshavestatistically
ana-lysedvariousfeaturesofneuropeptidesandtheirprecursors(Devi,1991;Rholam
et al., 1995; Rholam et al., 1986; Bakalkin et al., 1991). Other
investigat-orshavedevelopedmethodsfortheidenticationofproteinsortingsignalsand
thepredictionof theircleavagesites(Claroset al.,1997;Nielsenet al.,1999;
Nielsenet al.,1997). Howevertherecognitionofsignalpeptidesonlysolvespart
SmithK-centredonthe useof regularexpression searchingof translatedExpressed
Se-quence Tag (EST)databases of gene fragments. The results of this approach
were limited by the rigidity of the regular expressions, the high frequency of
errorsinESTsequences(Hillieret al.,1996),andtheirrelativelyshortlength.
The ultimate goal of this project will be to dierentiate novel neuropeptides
fromthewealthof othernewsequencesproducedbythecompletionofthe
hu-mangenome sequencingproject. This isataskthat currentsequenceanalysis
Thissectiondescribesanexperimentwhose resultssignicantlycontradict the
null hypothesis (see Section 1). The section begins by describing the
mater-ials (data, background knowledge and machine learning systems) used in the
experiment. This is followed by an account of the three steps of the
experi-mentalmethod. Finallythesectionends withthepresentationand analysisof
theresults.
4.1 Materials
4.1.1 Data
The data was taken from the SWISS-PROT database (Bairoch & Apweiler,
2000). SWISS-PROTisanannotatedproteinsequencedatabaseestablishedin
1986 and maintained, with collaborators, by the Department of Medical
Bio-chemistryoftheUniversityofGeneva. Itcanbeaccessedat
http://www.expasy.ch/sprot/sprot-top.html.
Ourdata-setcomprisesasubsetof positivesi.e. knownNPPsand asubset
ofrandomly-selectedsequences. Itisnotpossibletogeneratealarge,unbiased
set of negative examples because there will be proteinswhich have yet to be
recognised scientically as a NPP. The characteristics of the two subsets of
sequencesareasfollows.
Positives This subsetcontains allofthe 44knownNPPsequences that were
inSWISS-PROTin Spring1997,thetimethedata-setwasprepared(see
Table8). TheSWISS-PROTidentiersofthese44sequencesarelistedin
Tables1and2. Put Tab1 and Tab2 here
All but three of these are human proteins. The three non-human
se-quences were included asthe humanequivalenthad notbeendiscovered
andtheywereconsideredtobeimportantexamples. Theyareexpectedto
beverycloselyrelatedtothehumanandarepossiblyareasonablemodel
Thesesequencesareunrelatedbysequencehomologytotheremaining34.
Randoms Thissubsetcontainsallofthe3910fulllengthhumansequencesin
SWISS-PROTinSpring 1997.
1000ofthe3910randomswerereservedforthetest-set.
Thedata-setisavailableat
ftp://ftp.cs.york.ac.uk/pub/aig/Datasets/neuropeps/
4.1.2 MachineLearning Systems
Thepropositional learningwasperformedusing the decision-treelearnerC4.5
(Release 8) in conjunction with the companion program C4.5rules that
con-structsrulesfromatreebuiltbyC4.5(Quinlan,1993). Thegrammarlearning
was performed usingCProgol(Muggleton, 1995)version4.4whichis available
from
ftp://ftp.cs.york.ac.uk/pub/MLGROUP/progol4.4.
4.1.3 Background Knowledge
Duringboththegeneration ofthegrammarusing CProgolandthegeneration
of propositional rule-sets using C4.5 and C4.5rules we adopt the background
informationusedin(Muggletonet al.,1992)todescribephysicalandchemical
propertiesoftheaminoacids(seeTable3). Put
Tab3
here
4.2 Method
Themethod maybesummarisedas
follows:-1. AgrammarisgeneratedforNPPsequencesusingCProgol(seeSection4.2.1).
2. Agroupoffeaturesisderivedfromthisgrammar. Othergroupsoffeatures
amalgam using C4.5 and C4.5rulesand its performance is measured
us-ing Mean R A. This is a new cost function which is described in
Ap-pendixA. Thenull-hypothesis(seeSection1)isthentestedbycomparing
theMeanR Aachievedfromthevariousamalgams. (See Section4.2.3).
4. A hiddenMarkovmodel(HMM) isgenerated forNPPsequencesandits
MeanR Aismeasured(seeSection4.2.4).
4.2.1 Grammar Generation
A NPP grammar contains rules that describe legal neuropeptide precursors.
Figure3showsanincompleteexampleof suchagrammar,written asaProlog Put
Fig3
here
program. This section describeshowproduction rules for signalpeptides and
neuropeptide starts, middle-sections and ends were generated using CProgol.
Thesewereusedtocompletethecontext-freedenite-clause-grammarstructure
shownin Figure3. Thestartand endrepresentcleavagesites andthe
middle-sectionrepresentsthematureneuropeptidei.e.whatremainsaftercleavagehas
takenplace.
TheproductionrulestobelearntbyCProgolcontainsdyadicpredicatesof
the form p(X,Y), which denote that property p began the sequence X and is
followedbyasequenceY.Tolearnsuchrulesfromthethetraining-set,CProgol
wasprovidedwiththefollowingextensionaldenitions:
Precursor data. Using details of the start and nishing positions for signal
peptidesand neuropeptidesit waspossibletogenerate examplesof
non-terminalsasbelow:
signalpep(S,[]) whereSis alistofprecursorresiduesconstituting the
signalpeptide.
start(S,[]) where S is a list of residues constituting the start of a
neuropeptide. ThestartingresidueforSistherstresidueafter the
end of thesequencefor start/2 above. Theend of Swastaken to
be3residuesfrom thelast position oftheneuropeptide.
end(S,[]) whereSisalistof3precursorresiduesconstitutingtheendof
aneuropeptide. Scommences with theresidue after theend of the
sequenceformiddle/2above.
Randomdata. Onerandomexampleforeachofsignalpep/2,start/2,middle/2
and end/2 was generated from each sequence in the set of randoms
se-quences. Randomexamplesaredistinguishedbytheprex*.
*signalpep(S,[]) where S is a list of residuesstarting at the rst
po-sition in the sequence. The length of S is obtained from drawing
randomlyfromthedistributionofsignalpeptidelengthsofNPPsin
thetrainingdata.
*start(S,[]) whereSisapairofsequenceresidues. Thestartingresidue
israndomlychosen,andisensurednotto conictwiththesequence
chosenfor*signalpep/2above.
*middle(S,[]) where Sisalist ofresiduesstartingafter theendof the
sequencefor*start/2above. ThelengthofSisobtainedbydrawing
randomlyfromthelengthofneuropeptidemiddle-sectionsof
precurs-orsinthetrainingdata.
*end(S,[]) where S is a list of 3 residues starting at the end of the
sequenceterminatingthedenitionof*middle/2above.
CProgol was provided with denitions of the non-terminals star/2 and
run/3(see Table14). star/2represents somesequence of unnamed residues
whoselengthisnotspecied. run/3representsarunofresidueswhich sharea
speciedproperty.
The grammar-based approach presents a powerful method for describing
important: KR;GKRandGRR.ThesubsequencesKRandGKRareestablished
pro-teolyticcleavage sites found in NPPs; GRR is a relatively commonalternative
cleavage site to GKR. Pilot experiments suggested that the following patterns
maybesignicant:
K,positive;
positive,positive;
Y,veryhydrophobic;
hydrophilic,agap ofsomeresidues,M,negative;
HP;
WMDF.
All these subsequencesand patterns were coded as Prolog predicates and
includedasbackgroundknowledge(seeTable14).
Other pilot experiments onthe training data showedthat theaccuracy of
CProgol'sgrammarwashigherwithcertainrestrictionsonthelengthofNPPs,
signalpeptidesand neuropeptides. Specic constraintswere obtainedby
pro-gressivelycheckingthefollowing: alllengthslessthanthemeanlengthon
train-ing data; lengths that are within 1;2;::: standard deviation of the mean on
trainingdata. This resulted in the following additionalrestrictions: (1) NPP
lengthsnottoexceed200residues;(2)signalpeptidelengthstobebetween19
and29residues;and(3)middle-sectionsofneuropeptideslengthstobebetween
4and52 residues. These constraintsonlyaect thevalues offeatures derived
from the grammar. They do not constrain the value of the sequence length
featuredescribedattheendofSection 4.2.2.
AppendixBshowsthemodeandtypedeclarations,settingsandprune
pre-dicatesthatwereneededtoenableCProgoltocompletethegrammarshownin
The grammar features PredictionsaboutaNPPsequencecanbemadeby
parsingit usingtheNPPgrammar. Thevaluesof thefeatures shown in
Table 4wereobtainedbysuchparses. Notethat wheneverthegrammar Put
Tab4
here
predictsthat asequenceisnotaNPP,allofthefeaturesareassigned the
valuezero.
The SIGNALP features Eachfeatureinthisgroupisasummaryofthe
res-ult of using theSIGNALP program on asequence. The SIGNALP
pro-gram(Nielsenet al.,1997)representsthepre-eminentautomatedmethod
for predicting the presence and location of N-terminal signal peptides.
SIGNALPisavailableonthewebat
http://www.cbs.dtu.dk/services/SignalP.
Thetechniqueusedcombines thepredictionsoftwodierentneural
net-works groups { one that recognises cleavage sites, and the other that
identiessignalpeptides. Whenprovidedwith asequence ofN-terminal
residues, the following are reported as summaries: (a) C scores: which
consist of the maximumvalue of the scorefrom the cleavage-recogniser,
theposition in thesequencewhere this valueisachieved,and anominal
`y'or`n'denotingtheanswertowhether acleavagesiteispresent;(b)S
scores: thecorrespondingvaluesfromthesignal-peptiderecogniser;(c)Y
scores: ascorethatcombinestheCandSscores;and(d)Mean scores: a
meanoftheS-scoreandS-conclusionsfromtheN-terminalendtothe
pre-dicted cleavagesite. Forthe experiments here, SIGNALP was provided
with 50 amino acids from the N-terminal for each sequence. The
sum-marieswere extracted andrepresented by11 features shownin Table 5.
Put
Tab5
here
The proportions features Eachfeature in this groupisaproportion of the
numberofresiduesin agivensequencewhich eitherareaspecic
ertiesshownin Table3.
The sequence length feature Thisfeatureisthelengthofthesequence. In
theremainderofthis paperthis featurewillbereferredtoaslength.
4.2.3 PropositionalLearning
ThetrainingandtestdatasetsforC4.5werepreparedasfollows.
1. Recall from Section 4.1.1that ourdata comprises44 positivesand 3910
randoms. 40ofthe44positivesoccurinthesetof3910randoms. AsC4.5
is designed to learnfrom a set of positivesand a set of negatives, these
40 positiveswere removed from theset of randoms. Of the40 positives
which are in theset of randoms,10 are in thetest-set. Hence theset of
(3910 40)sequenceswere splitintoatraining-setof(2910 30=2880)
andatest-setof(1000 10=990).
2. Valuesofthefeaturesweregeneratedforeachtrainingandtestsequence.
Eachsequencewasrepresentedbyadatavectorcomprisedofthesefeature
valuesand1classvalue(`1'todenote aNPPand`0'otherwise).
3. Finallyto ensure that there were asmany`1' sequencesas`0'sequences
atraining-setof2880NPPswasobtainedbysamplingwith replacement.
Thusthetrainingdata-setinputtoC4.5comprised(22880)examples.
(Nore-adjustingwasdoneonthetestdata.)
Amalgams of the feature groups described in the previous section were
formed. The amalgams are listed in Table 6. The following procedure was
followedforeach
one:-1. trainingand testsetswerepreparedasdescribedabove;
2. adecisiontreewasgeneratedfromthetraining-setusingC4.5;
rule-setonthetest-set;
5. MeanR Awasestimated asdescribedinAppendix A.3.
Thedefault settingsofC4.5andC4.5ruleswereused.
The contradiction of the null hypothesis was then attempted by testing
theMeanR Aofthebestmodelwhichincludesgrammar-derivedfeatures
was higherthan the best performance achievedusing anyof the models
which donotinclude thegrammar-derivedfeatures.
suchanincreasewasstatisticallysignicant. EstimatesofMeanR Awere
comparedusingthestatisticalmethod describedinAppendix A.4.
thebest model whichincludes grammar-derivedfeatures wassuÆciently
morecomprehensiblethanthebest `non-grammar'model.
4.2.4 Hidden Markov Model Comparison
AHMMforNPPswasgeneratedandtestedusingHMMER 1
version2.1.1(Eddy,
1998)whichisavailable fromhttp://hmmer.wustl.edu/.
CLUSTALW(1.8)(Thompsonet al., 1994)wasusedto alignthepositive
sequencesinthetraining-set. ThehmmbuildprogramofHMMERwasthenused
togenerateaHMMfromtheCLUSTALWalignment. TheresultingHMMand
thehmmsearchprogram ofHMMERwerethen usedtosearchforNPPsin the
test-set. The default settings of CLUSTAL W and HMMER were used. The
MeanR AoftheHMMwasestimatedbasedonthepredictionsonthetest-set.
4.3 Results and Analysis
Table 13 shows the grammar that wasgenerated using CProgol. The
gram-mar is veryrich in terms of non-terminals. All but one of the non-terminals
1
mar. The grammar also includes star/2, run/3 and three of the six
non-terminalswhich represent the patterns mentioned in Section 4.2.1. The
non-terminalsyvhand pp, which representthepatterns Y,veryhydrophobicand
positive,positive,bothappeartwice. Thethirdnon-terminalwhichappears
ishmn,whichcorrespondstothepattern
hydrophilic,agap ofsomeresidues,M,negative.
Table6showsthatpredictiveaccuracyisnotagoodmeasureofperformance Put
Tab6
here
for this domain because it does notdiscriminate well betweenthe amalgams:
despite covering varying numbers of (the rare) positives, all the models are
awardedasimilar(high)scorebypredictiveaccuracybecausetheyallexclude
mostoftheabundantnegatives.
Table6showstheMeanR AforboththehiddenMarkovmodelandforeach
amalgamof feature groups. TheMean R Aof theHMMis zero. The highest
MeanR A(107.7)wasachievedbyoneofthegrammaramalgams,namely the
`Proportions+Length+SIGNALP+Grammar'amalgam. ThebestMeanR A
achieved by any of the amalgams which do not include the grammar-derived
featureswasthe49.0attainedbythe`Proportions+Length'amalgam.
The P
90
M=57
R Afor the`Proportions+ Length+ SIGNALP+ Grammar'
amalgamwas 3661.376. The P
90
M=57
R Afor the`Proportions+Length'
amal-gamwas1666.733. For theamalgams `Proportions +Length +SIGNALP +
Grammar'and `Proportions +Length', ^
D =1994:643 and ^ D = p n = 2:081.
This dierenceis statistically signicant: substituting these valuesof ^
D and ^ D = p
nintoEquation15showsthatp(d<0)iswellbelow0.0001.
If one searches for a NPP by randomly selecting sequences from
SWISS-PROTfor synthesisand subsequentbiologicaltesting then, at most,only one
inevery2408sequencestestedisexpectedtobeanovelNPP.Thisfollowsfrom
thefact that number of sequencesin SWISS-PROT =79449, the most
prob-able number of novel NPPs in SWISS-PROT = 90 57 = 33 (see Table 8)
and 33=79449 = 1=2408. Using our best recognitionmodel as a lter makes
pectedtobeanovelNPP.Thiscanbeseenfromthefollowingsimplecalculation.
RearrangingEquation2givesPr(NPP j R ec)=R APr(NPP).
Substitut-ing in the MeanR A for the best recognitionmodel gives Pr(NPP j R ec) =
107:688(90=79449)=1=8:2. Multiplying1/8.2bytheproportionofNPPsin
SWISS-PROTwhicharenovel(33/90)givesapproximately1/22.
Appendix C lists the complete rule-sets for the amalgams `Proportions +
Length+SIGNALP+Grammar'and`Proportions+Length'. Therulesthat
were generated from the `grammaramalgam' suggest that the NPPgrammar
isuseful forlearningfrom NPP sequencedata. Nine of the25 rules include a
grammar-derivedfeature. Theserulesrefertoavarietyofthegrammar-derived
whetherthegrammarpredictstheexistenceofanneuropeptidestart(e.g.
seeRule14in Figure5);
therstresiduein theneuropeptide(e.g.see Rules20 and21in Figures
5and 6respectively);
the position of the rst residue in the neuropeptide (e.g. see Rule 6 in
Figure5);
the property of thethird from last residue in the neuropeptide (e.g. see
Rule17inFigure 5).
thelengthof thesignalpeptide;
Ourmethoddidnottrytoremovethepotentialredundancybetweenvalues
ofsomeoftheSIGNALP featuresand grammarfeatures. Theresultslisted in
Table 6justify this. It is of interest to note that thegrammar-derived length
ofsignalpeptideisusedfrequentlybyC4.5(seeRules2,9,23and24),despite
the availabilityof similarfeaturesderivedfromSIGNALP(seeTable5).
Finally,itshouldbenotedthattwooftherulesreferonlytogrammar-derived
Aim The aim of the second experimentis to demonstrate that overtting of
theNPP sequencesdidnotinadvertentlyoccurduring the generationof
grammar in the rst experiment (see Section 4.2.1). The data-set used
in the rst experimentcontainedall of the44 known NPPsequences in
SWISS-PROT in Spring 1997. The second experiment utilises 13
addi-tional NPP sequences which had been added to SWISS-PROTby May
1999(seeTable8). Noneofthese13additionalNPPsequenceswereused
for training or testing in the rst experiment. Indeed the identiers of
these13sequences werenotknownto C.H.B(whoperformedthesecond
experiment)untilafter theresultsoftherstexperimentwere published
(Muggletonet al.,2000).
Data Twotest-sets areusedinthisexperiment.
1. Thetest-setusedin therstexperimenti.e.theonewhich contains
10ofthe44originalNPPsandthe990'other'sequences.
2. Anewtestsetcomprisingthe990'other'sequencesfromthetest-set
used in the rst experiment and 10 of the 13 additional NPP
se-quenceswhichhadbeenaddedtoSWISS-PROTbyMay1999. The
SWISS-PROTidentiersoftheadditional13sequencesarelistedin
Table7and thesequencesareavailable at
ftp://ftp.cs.york.ac.uk/pub/aig/Datasets/neuropeps/.Three
of the 13additional NPP sequences(P10092, P07492and Q00072)
werenotusedastheyarehomologuesofNPPsintheoriginaltraining
setof34NPPs.
Method TheMeanR Aof thegrammar onboththeoriginal test-setand the
newtest-setwasmeasured. NotethatitwastheMeanR Aofthegrammar
generatedbyCProgolthatwasmeasuredasopposedtooneofthedecision
treesgeneratedbyC4.5andC4.5rules. MeanR Awasestimatedusingthe
andthenewtest-set.
Conclusion OverttingoftheoriginalsetofNPPsequencesdidnotoccur
dur-ing thegeneration ofthe grammar: theperformance ofthe grammaron
the newand original data-setsis the same. Thisresult provides further
evidencetosupport ourconclusionthat wehavedevelopedaNPP
recog-nitionmethodwhichwouldprovidecost-savingsduringasearchfornovel
Thispaperhasshownthatthemostcost-eective,comprehensiblemulti-strategy
predictorofhumanneuropeptideprecursorsdoesemployacontext-free
denite-clause-grammar.
The ILP Bayesian approach to learning from positive examples was used
to generate a grammar for recognising a class of proteins known as human
neuropeptideprecursors(NPPs). Collectively,veoftheco-authorsofthis
pa-per, have extensive expertise on NPPs and general bioinformatics methods.
Their motivation for generating a NPP grammar was that none of the
ex-isting bioinformatics methods could provide suÆcient cost-savings during the
searchfor newNPPs. Priorto this projectexperienced specialistsat
SmithK-line Beecham had tried for many months to hand-code such a grammar but
without success. Our best predictor makes the search for novel NPPsmore
than 100 times more eÆcient than randomly selecting proteins for
syn-thesisand testing themfor biologicalactivity(see Figure 4) . Asfar asthese Put
Fig4
here
authorsareaware,thisis boththerstattempt tolearnabiologicalgrammar
using ILP and the rst real-world scientic application of the ILP Bayesian
approachtolearningfrompositiveexamples.
We rst published that our best predictor delivers more than a
hundred-fold cost-saving in the proceedings of seventeenth international conferenceon
MachineLearning(Muggletonet al.,2000). Sincethenwehaveobtainedfurther
evidenceto supportourconclusion that wehavedevelopedaNPPrecognition
methodwhichwouldprovidecost-savingsduring asearchfor novelNPPs. We
haveshown,using NPPsequenceswhich hadnotbeenusedpreviouslyonthis
project, that overtting of the original set of NPP sequences did not occur
duringthegenerationofthegrammar.
A shortcomingoftheNPPgrammar generatedis thatit willnotrecognise
allNPPsbecauseitimpliesthat 1) bothstartand endcleavagesitesare
com-pulsory;2) amatureneuropeptidecannotbeadjacenttothesignalpeptideor
ofthepredicatesnpp/2andneuropeptide/2showninFigure3. Experiments
withamoreexiblegrammarshouldthereforeformthesubjectoffuturework.
Inouropinion,thebest`non-grammar'recognitionmodel doesnotprovide
any biological insight. However the best recognition model which includes
grammar-derivedfeaturesisbroadlycomprehensibleandcontainssomeintriguing
associationsthatmaywarrantfurtheranalysis. Thismodelisbeingevaluatedas
anextensiontoexistingmethodsusedinSmithKlineBeechamfortheselection
ofpotentialneuropeptidesforuseinexperimentstohelpelucidatethebiological
functions ofG-protein coupledreceptors. It isclear however,that therulesof
themodelarenotanoptimalrepresentationofsequencedataandresidue
prop-erties. A moreintuitive(e.g. graphicallyoriented,sequencecentred)displayof
themeaningofthese ruleswouldberequiredto buildtoolsthat theexperts in
theeld wouldndacceptable.
The new cost function presentedin this paper, RelativeAdvantage (R A),
may be used to measure performance of a recognitionmodel for any domain
where
1. theproportionofpositivesin thesetofexamplesisverysmall.
2. thereis noguaranteethat allpositivescanbeidentiedassuch. Insuch
domains, the proportion of positive examples in the population is not
known and a large, unbiased set of negatives cannot be identied with
completecondence.
3. thereisnobenchmarkrecognitionmethod.
InAppendixAwehavedevelopedageneralmethodforassessingthe
signi-canceofthedierencebetweenR Avaluesobtainedincomparativetrials. R A
isestimatedbysummingtheestimateofperformanceoneachtest-setinstance.
Themethod uses a) identically distributed random variables representing the
outcomeforeachinstance;b)asamplemeanwhichapproachesthepopulation
Theco-authorsfromSmithKlineBeecham identiedtheproblemaddressedby
thecase-study,preparedthedata-set,providedexpertiseonNPPsand general
Bioinformatics methods and also helped to write the domain-specic aspects
ofthepaper. S.H.M.developedCProgoland themethodof generatingaNPP
grammar.A.S.developedthemethodofusingC4.5tolearnpropositionalrules
fromfeaturesderivedfromSIGNALP,proportionsandsequencelength. C.H.B.
devisedthesetofgrammarfeatures,performedtheexperimentsdescribedinthis
paperand preparedallthesections ofthis paperexcept thoseon thedomain.
C.H.B.and S.H.M. developed themethod for assessingthe signicance of the
dierencebetweentheRAoftwomodelsandanassociatedmethodestimating
WewouldliketothankI.S.Gloger,M.LawrenceandR.B.Russellforsharingtheir
knowledgeofNPPsandsequencehomology. WewouldliketothankJ.Cussens
forhis help with Section A, M.Turcottefor his advice on HMM software and
D.Michieforhiscommentsonthiswork.
Thisresearchwasconductedaspartofaprojectedentitled`UsingMachine
Learningto DiscoverDiagnostic Sequence Motifs' supported by a grant from
SmithKlineBeecham. DuringpartofhistimespentwritingthisarticleC.H.B.
was supported bythe EPSRC grant(GR/M56067)entitled`ClosedLoop
Ma-chineLearning'. A.S.holdsaNuÆeldTrustResearchFellowshipatGreen
NPPsare identiedeither throughpurely biologicalmeansorbyscreening
ge-nomic or protein sequence databases for likely NPPs, followed by biological
evaluation. Ifwewishtogobeyondusingsequencehomologytondnew
mem-bers of the (generally small) NPP families, we need a recognition model for
NPPsin general. Howeverifthisrecognitionmodelispoorthenitmaynotbe
muchbetterthanrandomsamplingofsequencedatabases(e.g. SWISS-PROT)
and thecost-benetof anyexperimental evaluation of NPPs foundby such a
procedurewould beprohibitivelysmall.
In developinga general recognition model for human NPPs, we are faced
withthreesignicantobstacles.
1. The number of known NPPs in the public domain databasesof protein
sequence (e.g. SWISS-PROT) is very small in proportion to the total
numberofsequences. Whenwedevelopedourmethod ofestimating RA
(May1999),SWISS-PROTcontained79,449sequences,ofwhichsome57
coulddenitelybeidentied ashumanNPPs.
2. There is no guarantee that all the human NPPs in SWISS-PROT have
been properly identied. We estimate there may, in fact be up to 90
NPPsin SWISS-PROT.
3. There isnobenchmarkmethod forNPPrecognitionthat canbe usedto
compare any newmethods. Wemust therefore compareour recognition
modelwithrandomsampling toevaluatesuccess.
This domain requires aperformance measure which addresses all of these
issues.
Table8summarises howsome of theproperties of SWISS-PROT changed Put
Tab8
here
overthedurationoftheexperimentsdescribedinthispaper. AlltheR A
meas-urementsin this paper are based on the properties asthey stood at May 99.
PROT.
A.1 Limitations of Existing Performance Measures
Fordomains in whichpositivesare rare, predictiveaccuracy, asit isnormally
measuredin MachineLearning(assumingequalmisclassicationcosts):
gives a poor estimate of the performance of a recognition model. For
instance,ifalearnerinducesaveryspecic modelforsuchadomain, the
predictiveaccuracyofthemodelmaybeveryhighdespitethenumberof
truepositivesbeingverysmallorevenzero.
does not discriminate well between models which exclude most of the
(abundant) negatives but cover varying numbersof (the rare) positives.
(Thiswasillustratedearlierinthis paper-see Table6.)
Fordomainsinwhichthereisnobenchmarkrecognitionmethodthatcanbe
usedtocompareanynewmethods,Lift(Ling&Li,1998)isnottheappropriate
measure of performance because it does notquantify the reductionin cost in
usingthepredictorversusrandomsampling. Furthermore,intheirpaper,Ling
& Li gave no explanation of how to assess the signicance of the dierence
betweentheLiftoftwomodels. ROCcurves(orLorentzdiagrams)(Provost&
Fawcett,1998)alsodonotquantifythereductionincostinusingthepredictor
versusrandomsampling.
Therefore wedene a relative advantage (R A) function which predicts the
reductionin cost in using the model versus random sampling. In contrast to
otherperformance measures,R A ismeaningful and relevantto experts in the
domain.
A.2 Denition of RA
Inthefollowing,`themodel'referstothelearnedrecognitionmodelfor
incostinusingthemodelversusrandomsampling. R A= A B (1) where
A = the expected cost of nding one NPPby repeated independent random
samplingfromSWISS-PROTandperformingalaboratoryanalysisofeach
protein.
B = the expected cost of nding oneNPP by repeated independent random
samplingfromSWISS-PROTandanalysingonlythoseproteinswhichare
predictedbythelearnedmodeltobeaNPP.
RAcanbedenedin termsofprobabilityasfollows. Let
C =thecostoftestingthebiologicalactivityofoneproteinviawet-experiments
inthelaboratory;
NPP =SequenceisaNPP;
Rec =ModelrecognisessequenceasaNPP.
Equation1cannowberewrittenas:
R A= C=Pr(NPP) C=Pr(NPP jR ec) = Pr(NPP jR ec) Pr(NPP) (2)
Lettestingthemodelontestdata yieldthe22contingencytable shown
in Table 9with the cellsn
1 , n 2 , n 3 , and n 4 . Let n = n 1 +n 2 +n 3 +n 4 be Put Tab9 here
thenumberofinstancesin thetest-set. Notethattherandomsetofsequences
referredtointheright-handcolumnmayincludesomeNPPsequences. Table10
shows anestimate of the contingency table that would beobtained if it were Put
Tab10
here
possibletoidentify andremoveallthepositivesfromthesetofrandoms. Ifthe
proportionofNPPsinthetest-setwasknowntobethesameastheproportion
ofNPPsin thedatabasethenwecouldestimate Pr(NPP)to be(n
1 +n 3 )=n andPr(NPP jR ec)toben 1 =(n 1 +n 2
inthetest-setanddatabase.
InordertoderiveaformulaforestimatingRAgivenbothasetofpositives
andaset of randoms,weestimatePr(NPP)and Pr(NPP j R ec)asfollows.
LetS bethetotalnumberofsequencesinthedatabase,ofwhichM areNPPs.
Pr(NPP) =
no:of NPPsin thedatabase
no:of sequencesinthedatabase
= M=S (3)
Pr(NPP j R ec)= N
dbNPP recog
N
dbseqpred pos
(4)
where N
dbNPP recog
is the number of NPPs in db which are recognised by
model and N
dbseqpred pos
is the number of sequences in dbwhich the model
predictsto beNPP.
Table11 showsthe expected resultof using the learnedrecognitionmodel Put
Tab11
here
on the entire SWISS-PROT database. Note that the factor (1 Æ) does not
appearasitcancelsout. FromEquation4andTable11itfollowsthat:
Pr(NPP jR ec) ' n1 n1+n3 M n1 n1+n3 M+ n2 n2+n4 (S M) = (Mp 1 )=(Mp 1 +(S M)p 2 ) (5) wherep 1 =n 1 =(n 1 +n 3 )andp 2 =n 2 =(n 2 +n 4
). SubstitutingEquations3and
5intoEquation2gives
R A = (Mp 1 )=(Mp 1 +(S M)p 2 ) M=S = Sp 1 Mp 1 +(S M)p 2 = Sp 1 Sp 2 +M(p 1 p 2 ) (6)
A.3 Estimating Relative Advantage
In thefollowing Relative Advantageover theentire population is represented
byR Aincapital letterswhereasRelativeAdvantageoverasampleisdenoted
bylowercasei.e.ra. AsthevalueofM isnotknown,weestimate P
90
M=57 R A.
is equalto thenumber of known NPPsin SWISS-PROT. The upperlimit of
M is the mostprobable number of NPPsin SWISS-PROT i.e. atotal of the
known NPPsandthose proteinswhich haveyet to bescienticallyrecognised
asaNPP. 90 X M=57 R A ' Sp 1 Z 91 M=57 1 (p 1 p 2 )M+Sp 2 @M = Sp 1 (p 1 p 2 ) ln((p 1 p 2 )M+Sp 2 )+k 91 57 = Sp 1 (p 1 p 2 ) ln 91(p 1 p 2 )+Sp 2 57(p 1 p 2 )+Sp 2 (7) Weestimate P 90 M=57
R Aby summing anestimate of the P
90
M=57
R A for each
instance in the test-set asfollows, where n is the number of instances in the
test-set. This method hasthe advantagethat it allowsthe signicance of the
dierencebetweentheRAoftwomodelstobegauged(seeSectionA.4).
n X k =1 90 X M=57 ra k (8)
FromEquation8andthecontingencytableitfollowsthat:
90 X M=57 ra= 1 n 4 X i=1 n i 90 X M=57 ra i ! (9) Each P 90 M=57 ra i is estimated by substituting p 1 = a a+c and p 2 = b b+d into
Equation7. Thevaluesofa,b,c anddaredeterminedbythree steps.
1. Whatever the i value, a, b, c and d are initially given the valuesof the
correspondingcounts/frequenciesinthecontingencytableforthetest-set
(seeTable9).
2. Eachoneofa,b,candd, isdecrementedprovidingthat thevaluebefore
subtractionisgreaterthan1.
We donot decrementwhen the valuebefore subtraction iszero because
thiscanresultinp
1 orp
2
1 2
is one becausethis can cause p
1 or p
2
to havethe valuezero, which in
turnhasahighly disproportionate eectonthevalueof P 90 M=57 ra i .
3. Thevalueofeither a,b,c ordis incrementedto reecttheclassication
ofaninstanceinthecelln
i .
For instance, ifi =2 and all the counts in the contingency table are greater
thanonethena=n
1 1;b=n 2 ;c=n 3 1;d=n 4 1.
Note thatSteps 1and2assign thesameprior probabilityto each instance
because the eect of each step is not dependent upon which cell the current
instance belongs to. Therefore this method of estimating P
90
M=57
R A has the
propertiesofa)producingidenticallydistributedrandomvariablesrepresenting
theoutcomeforeachinstance; b)havingasamplemeanwhichapproachesthe
populationmeaninthelimitandc)havingarelativelysmallsamplevariance.
Thenal stepof ourmethodfor estimatingR A isto takethe meanof the
summedvalues. MeanR A= P 90 M=57 ra i 90 (57 1) = P 90 M=57 ra i 34 (10)
A.4 Assessing the signicance of the dierence between
the RA of two models
Nextwedevelopamethodforassessingthesignicanceofthedierencebetween
theRA oftwomodels. Thismethod tacklesaproblem whichissimilar tothat
posed bythethird questionin Dietterich's taxonomyofstatisticalquestions in
MachineLearning(Dietterich,1998). Thatis,howtochoosebetweenclassiers
forasingle applicationdomain in which theamountof available data is
suÆ-cient to allowsome of it to beset aside for evaluating classiers. However a
newmethodisneededbecauseofthefundamentaldierencesbetweenRelative
Advantageandpredictiveaccuracy.
We compare the performance of two recognition models, H
1 and H 2 , by comparing their P 90 M=57
R A values. Let d be dierence in P
90
M=57
overthe entirepopulation,i.e. forall theproteinsin SWISS-PROT,andd be
theobserveddierenceonthetest-set.
d= 90 X M=57 R A H 1 90 X M=57 R A H 2 (11) ^ d= 90 X M=57 ra H 1 90 X M=57 ra H 2 (12) ^
dis anunbiasedestimatorforthetruedierencebecauseitiscalculatedusing
anindependenttest-set. Todeterminewhether theobserveddierenceis
stat-istically signicantweaddressthefollowingquestion. Whatis theprobability
that P 90 M=57 R A H1 > P 90 M=57 R A H2
,giventheobserveddierence, ^
d.
IfD isarandomvariablerepresentingtheoutcomeofestimatingd by
ran-domsamplingthen,accordingtotheCentralLimitTheorem,^
D
isnormally
dis-tributedinthelimit. Ithasanestimatedmean ^
dandhasanestimatedvariance
of^ 2
D
=n. Thevarianceof arandomvariable, X, is 2 X =E((X) 2 ) (E(X)) 2 .
Therefore,sinceD isarandomvariable:
^ 2 D =^ D 2 ^ 2 D (13) We calculate ^ D
2 as follows. Lettesting themodel ontest data yield the
44contingencytableshowninTable12withthecellsn
i;j . Put Tab12 here ^ D 2 = 1 n 4 X i=1 4 X j=1 0 @ n i;j 90 X M=57 ra i 90 X M=57 ra j ! 2 1 A (14) Giventhatp( P 90 M=57 R A H 1 > P 90 M=57 R A H 2 )=p( P 90 M=57 R A H 1 P 90 M=57 R A H 2 >
0) we evaluate our null hypothesis by estimating p(d < 0) using the Central
LimitTheorem. Z 0 x= 1 Pr(d=x)dx= Z 0 x= 1 1 p 2 2 e 1 2 ( x ) 2 dx (15) where=^ D and=^ D = p n.
Table13 showsthe productionrules generatedby CProgol. Therulescomply
withPrologsyntax. signal(X;Y)istrueifthere isasignalpeptideat the
be-ginningofthesequenceX,anditisfollowedbyasequenceY. Theotherdyadic
predicatesaredenedsimilarly. Non-terminalsandterminalswhich appearon
the right hand side of the production rules listed in Table 13 are dened in
Table14. Table14 showsthe Prologcode representingthebackground
know-ledge input to Progol. The production rules, when taken together with the
partialgrammarshownin Figure3,form agrammarforNPPsequences.
B.1 Dening a Hypothesis Language for Progol
AHypothesisLanguageforProgolisdenedby:{
mode and type declarations which state the forms that atoms in
hypo-thesesmaytake(seeSectionsB.1.1andB.1.2);
prunedeclarationswhichfurtherrestricttheformofhypotheses(see
Sec-tionB.1.3);
the maximum numberof layers of variables introduced by atoms in the
bodyofinducedclausesfrom variablesintheheadoftheclauses;
themaximumnumberofliteralsinthebodyofinducedclauses.
ThehypothesislanguageusedintheexperimentisdenedbyTables15,16
and17.
B.1.1 Mode Declarations
The mode declarations state the mode of call for those predicates that can
appearinahypothesisinducedbyProgol. Therearetwotypesofmode
declar-ations,asshownbelow. Therstdescribestheformofliteralsthatmayappear
in the headof clauses induced by Progoland the second describes those that
:- modeb(Recallnumber, Bodyliteraltemplate)?
HeadtemplateandBodyliteraltemplatearetemplatesofpredicatesand
take the form predicate(ts1, ts2, ...), where ts is a term specication.
Each term specication comprisestwo parts: a mode and atype. Typesare
describedinSection B.1.2. Thethreepossiblemodesare:{
+ Thisindicatesthatthetermisaninput. Thatis,inallcallstothispredicate,
thetermwillbeboundtoavalue.
{ Thisindicatesthetermisanoutput.
# This indicatesthataconstantshouldappearinthisterm.
Recallnumberreferstothedeterminacy ofthepredicatetemplate, thatis
itspecies the maximumnumberof times acall to thepredicate cansucceed
foragivensetofinputvariables. Hencefordeterminatepredicatetemplatesit
isset to oneand for indeterminatepredicatetemplates to valuesgreaterthan
one. IftheUserspeciesaRecallnumbertobe*thenProgolassignsadefault
valueof100toit.
B.1.2 Type Declarations
Typesthat areincludedin modedeclarationsmaybeunarypredicatesdened
in the background knowledge. All but one of the mode declarations listed in
Table15refertothetyperlist/1;thisisapredicatewhosedenition islisted
onTable14. Progoltype-checksaconstantby executingaqueryin which the
predicatecorrespondsto thetypeandthetermisinstantiatedtotheconstant.
If the query succeeds then Progol accepts that the constant is of the correct
type.
B.1.3 PruneDeclarations
Prune declarations are used to prevent Progol considering specied forms of
whenHeadisinstantiatedto theliteralin theheadoftheproposedclauseand
Bodyisinstantiatedtotheproposedbodyoftheclause.
B.2 The Time and Space Complexity of Progol
The Progolalgorithm, as analysed in (Muggleton, 1995), has time and space
complexity which increases linearly in both the number of examples and the
numberofclausesinthelearnedtheory. However,thescalingconstantsinvolved
varydependingonthesizeofthehypothesisspacesearchedforeachclause. This
iscontrolledusinganumberofparameters,includingaclauselengthboundand
aproofdepthbound. Put
Tables
13,14,15,
16,17
Therule-setthat wasgeneratedfrom the`Proportions+Length+SIGNALP
+Grammar'amalgamisshowninFigures5and6. Eachboxcontainsaruleas
itwasoutputbyC4.5rulestogetherwiththeEnglishtranslationwhichisshown
in italics. The percentagein square brackets refersto the predictedaccuracy
ofthe correspondingrule. Eachcolumn of rulesis tried in turn. Within each
column,eachruleistriedinorderofappearance. Thelastruleisthe`default'
rulewhichisused ifnoneoftheotherrulesapply.
Therule-setthat was generatedfrom the`Proportions+Length'amalgam
isshowninFigures 7,8and9. Put
Figures
5,6,7,
8,9
Abe,N.,&Mamitsuka,H.(1997).Predictingproteinsecondarystructureusing
stochastictreegrammars. Machine Learning,29,275{301.
Bairoch,A., &Apweiler,R.(2000). TheSWISS-PROTproteinsequence
data-baseanditssupplementTrEMBLin 2000. Nucleic Acids Res,28,45{48.
Bairoch, A., Bucher, P., & Hofman, K. (1997). The PROSITE database, its
statusin1997. NucleicAcidsResearch,25, 217{221.
Bakalkin,G.,Rakhmaninova,A.,Akparov,V.,Volodin,A.,Ovchinnikov,V.,&
Sarkisyan,R.(1991).Aminoacidsequencepatternintheregulatorypeptides.
InternationalJournalof Peptide ProteinResearch,38, 505{510.
Baldi,P.,&Brunak,S.(1998).Bioinformatics: the machinelearningapproach.
Cambridge,MA:MITPress.
Brazma,A., Jonassen,I.,Eidhammer,I.,&Gilbert,D. (1998). Approachesto
the automatic discoveryof patterns in biosequences. Journal of
Computa-tionalBiology, 5,279{305.
Claros,M.,Brunak,S.,&vonHeijne,G.(1997).PredictionofN-terminalsorting
signals. Current OpinioninStructuralBiology, 7,394{398.
Devi, L. (1991). Consensus sequence for processing of peptide precursors at
monobasicsites. FEBS,280,189{194.
Dietterich,T.G.(1998).Approximatestatisticaltestsforcomparingsupervised
classicationlearningalgorithms. NeuralComputation, 10,1895{1924.
Doolittle, R. (Ed.). (1996). Computer methods for macromolecular sequence
analysis. NewYork: AcademicPress.
Durbin, R., Eddy, S.,Krogh, A., &Mitchison, G. (1998). Biological sequence
analysis: probabilisticmodelsofproteinsandnucleicacids.Cambridge:
USA.Version2.1.1edition.
Heniko,J.,&Heniko,S.(1996). Blocksdatabaseanditsapplications.
Meth-ods inEnzymology, 266,88{105.
Hillier, L., Lennon, G., Becker, M., Bonaldo, F., Chiapelli, B., Chissoe, S.,
Dietrich, N., DuBuque, T., Favello, A., Gish, W., Hawkins, M., Hultman,
M., Kucaba, T., Lacy, M., Le, M., Le, N., Mardis, E., Moore, B., Morris,
M.,Parsons,J.,Prange,C.,Rifkin,L.,Rohlng,T.,Schellenberg,K.,Soares,
M.,Tan,F.,Thierry-Mieg,J.,Trevaskis,E.,Underwood,K.,Wohldman,P.,
Waterston,R.,Wilson, R., &Marra,M.(1996). Generationand analysisof
280,000humanexpressedsequencetags. Genome Research,6,807{828.
Klavdieva,M. (1995). The historyof neuropeptides. Frontiers in Neuro
endo-crinology,16, 293{321.
Ling, C., & Li, C. (1998). Data mining for direct marketing: Problems and
solutions. Proceedings of the FourthInternational Conference on Knowledge
DiscoveryandDataMining(pp. 73{79). NewYorkCity.
Linz,P.(1996).An introductiontoformallanguagesandautomata. Lexington,
Massachusetts: D.C.HeathandCompany.
Lyall, A. (1996). Bioinformatics in the pharmaceutical industry. Trends In
Biotechnology, 14,308{312.
Michalski,R.,&Wnek,J.(1997). Guesteditors'introduction. Machine
Learn-ing,27,205{208.
Muggleton,S. (1995). Inverse entailment and Progol. New Generation
Com-puting,13, 245{286.
Muggleton,S.(1996).Learningfrompositivedata. Proceedings oftheSixth
In-ternationalWorkshoponInductiveLogicProgramming(pp.358{376).
challenge. Articial Intelligence,114,283{296.
Muggleton, S., Bryant, C., & Srinivasan, A. (2000). Learning chomsky-like
grammars for biological sequence families. Proceedings of the Seventeenth
InternationalConferenceonMachine Learning(pp.631{638). Stanford
Uni-versity,USA:SanFrancisco,CA:MorganKaufmann.
Muggleton,S.,King,R.,&Sternberg,M.(1992). Proteinsecondarystructure
predictionusing logic-basedmachine learning. Protein Engineering, 5,647{
657.
Muggleton,S., & Raedt, L. D. (1994). Inductivelogic programming: Theory
andmethods. JournalofLogicProgramming, 19,20,629{679.
Nielsen,H., Brunak,S.,&vonHeijne, G.(1999). Machine learningapproaches
forthepredictionofsignalpeptidesandotherproteinsortingsignals.Protein
Engineering,12, 3{9.
Nielsen,H.,Engelbrecht,J.,Brunak,S.,&vonHeijne,G.(1997). Identication
ofprokaryoticandeukaryoticsignalpeptidesandpredictionoftheircleavage
sites. Protein Engineering,10,1{6.
Pennisi,E.(1999). Humangenome- Academicsequencerschallenge Celerain
sprintto thenish. Science, 283,1822{1823.
Provost,F., &Fawcett,T.(1998). Analysisand visualizationof classier
per-formance: Comparison under imprecise class and cost distributions.
Pro-ceedings of the ThirdInternational Conference onArticialIntelligence(pp.
706{713).Menlo Park,CA:AAAIPress.
Quinlan, J. (1993). C4.5: Programs for machine learning. San Mateo, CA:
MorganKaufmann.
Rawlings, C., &Searls, D. (1997). Computational gene discoveryand human
Boileau,G., &Cohen, P.(1995). Role of amino-acid-sequencesanking
di-basiccleavage sites in precursor proteolytic processing {the importance of
therst residueC-terminalof the cleavagesite. Euroupean Journal of
Bio-chemistry,227,707{714.
Rholam,M.,Nicolas,P.,&Cohen,P.(1986). Precursorsforpeptide-hormones
sharecommon secondarystructuresforming features at theproteolytic
pro-cessingsites. FEBSLetters,,207,1{6.
Sakakibara,Y. (1997). Recent advancesof grammaticalinference. Theoretical
ComputerScience,185,15{45.
Searls,D.(1993). Thecomputationallinguisticsofbiologicalsequences.
Arti-cialIntelligencein Molecular Biology. California,USA:AAAIPress.
Searls,D. (1997). Linguisticapproachestobiologicalsequences[review].
Com-puterApplications in theBiosciences,13, 333{344.
Sonnhammer, E., Eddy, S., Birney, E., Bateman, A., & Durbin, R. (1998).
PFAM:multiplesequencealignmentsandHMM{prolesofproteindomains.
NucleicAcidsResearch,26, 320{322.
Sonnhammer, E., Eddy, S., & Durbin, R. (1997). PFAM: a comprehensive
databaseof proteindomain families basedonseedalignments. Proteins,28,
405{420.
Spence, P. (1998). Obtainingvalue from thehumangenome {achallengefor
thepharmaceuticalindustry[review]. DrugDiscovery Today, 3,179{188.
Thompson, J., Higgins, D., & Gibson, T. (1994). CLUSTAL W: improving
thesensitivity of progressivemultiple sequence alignment throughsequence
weighting, position-specic gap penaltiesand weightmatrixchoice. Nucleic
P01019P01042P01156P01178P01185P01189P01210P01213P01258P01270P01275
P01279P01282P01286P01298P01303P05060P05305P06881P07491P08858P08949
P10082P10645P12272P14138P16860P18509P20366P20382P20800P21591P22466
PhysicochemicalProperty Aminoacidswithproperty
Hydrophobic H,W,Y,F,M,L,I,V,C,A,G,T,K
Veryhydrophobic A,F,G,I,L,M,V
Hydrophilic S,E,Q,R,D,N Electropositive R,K,H Electronegative D,E Neutral A,C,F,G,I,L,M,N,P,Q,S,T,V,W,Y Large Q,E,R,K,H,W,Y,F,M,L,I Small P,V,C,A,G,T,S,N,D Tiny A,G,S Polar Y,T,S,N,D,E,Q,R,K,H,W Aliphatic L,I,V Aromatic H,W,Y,F
Hydrogendonor W,Y,H,T,K,C,S,N,Q,R
Feature Description
grampred Abooleanwhichindicateswhetherthegrammarpredictsa
sequencetobeaNPPornot.
gramsigl Lengthofthesignalpeptide.
gramnpl Lengthoftheneuropeptide.
gramfirst Positionoftherstresiduein theneuropeptide.
gramlast Positionofthelastresidueintheneuropeptide.
gramnpstartfirst Therstresidueintheneuropeptideoroneofitsproperties.
gramnpstartsecondThesecondresidueinneuropeptideoroneofitsproperties.
gramnpendfirst Therstresidue/property/starinthebodyoftheendrule.
gramnpendsecond The second residue/property/starin the body of the end
rule.
Feature Description
sigpcmax MaximumSIGNALPCscore
sigpcmaxpos PositionwheremaximumSIGNALPCscoreisachieved
sigpcconcl SIGNALPCscoreconclusion(`y'or`n')
sigpymax MaximumYscorereportedbySIGNALP
sigpymaxpos PositionwheremaximumSIGNALPYscoreisachieved
sigpyconcl SIGNALPY scoreconclusion(`y'or`n')
sigpsmax MaximumSIGNALPS score
sigpsmaxpos PositionwheremaximumSIGNALPS scoreisachieved
sigpsconcl SIGNALPS scoreconclusion(`y'or`n')
sigpsmean SIGNALPmeanofSscorestocleavagesite
thedecisiontreesgeneratedfromtheamalgamsofthefeaturegroups.MeanR A
wasestimated usingthemethod describedinSection A.3.
Predictor MeanR A PredictiveAccuracy(%)
HiddenMarkovModel 0 99.0
+ 0.3 Onlyprops 0 96.7 + 0.6 OnlyLength 1.6 91.8 + 0.9 OnlySignalP 11.7 98.1 + 0.4 OnlyGrammar 10.8 97.0 + 0.5 Props+Length 49.0 98.6 + 0.4 Props+SignalP 15.0 98.3 + 0.4 Props+Grammar 31.7 98.2 + 0.4 SignalP+Grammar 0 98.6 + 0.4 Length+Grammar 0 96.2 + 0.6 Length+SignalP 34.4 98.7 + 0.4
Length+SignalP+Grammar 0 98.0
+
0.4
Props+Length+SignalP 29.2 98.7
+
0.4
Props+Length+Grammar 33.2 98.5
+
0.4
Props+SignalP+Grammar 15.0 98.3
+
0.4
Props+Length+SignalP+Grammar 107.7 99.0
+
addedtoSWISS-PROTbyMay1999.
P10092 P23435 O00230 P09681 O43555 P07492 P48645 P01138 P20783 P02818
1999.
Spring1997 May1999
Numberofsequences 64,000 79,449
NumberofknownhumanNPPs 44 57
are labelled by the sets NPP sequences, Random sequences, H (Hypothesis
predictions) and H (complement of H). Thecells of the matrix represent the
cardinalitiesofthecorrespondingintersectionsofthesesets. n
1 +n 2 +n 3 +n 4 =
n,wherenisthenumberofinstancesin thetest-set.
SetoftestNPPsequences SetoftestRandomsequences
H n 1 n 2 H n 3 n 4
Theaxesof the22matrixarelabelledbythesets NPPsequences,Negative
sequences, H (Hypothesis predictions) and H (complement of H). The cells
of the matrix represent the cardinalities of the corresponding intersections of
these sets. Æ = M=S where S is the total number of sequences in the entire
SWISS-PROTdatabase,ofwhichM areNPPs.
SetoftestNPPsequences SetoftestNegativesequences
H n 1 n 2 (1 Æ) H n 3 n 4 (1 Æ)
are labelled by the sets NPP sequences, Random sequences, H (Hypothesis
predictions) andH (complementofH). Thetotal ofthecounts/frequenciesin
thefourcells=S,whereSisthetotalnumberofsequencesintheSWISS-PROT
database.
NPPsequencesinSWISS-PROT Negativesequencesin SWISS-PROT
H n1 n1+n3 M n2 n2+n4 (S M) H n3 n1+n3 M n4 n2+n4 (S M)
bythe cells of the22 contingency table for H
1
. The columns of the 44
matrixarelabelledbythecellsofthe22contingencytableforH
2
. Thecells
ofthe44matrixrepresentthecardinalitiesofthecorrespondingintersections
of these sets. P 4 i=1 P 4 j=1 n i;j
=n, where n is thenumber ofinstances in the
test-set. n 1 n 2 n 3 n 4 n 1 n 1;1 n 1;2 n 1;3 n 1;4 n 2 n 2;1 n 2;2 n 2;3 n 2;4 n 3 n 3;1 n 3;2 n 3;3 n 3;4 n 4 n 4;1 n 4;2 n 4;3 n 4;4
sigpep(A,B):- g(A,C),star(C,D), s(D,B).
sigpep(A,B):- m(A,C),star(C,D), hydrophilic(D,E),tiny(E,B).
sigpep(A,B):- hydrophobic(A,C),star(C,D),w(D,E), hydrobacc(E,B).
sigpep(A,B):- large(A,C), run(C,D,hydrophobic),star(D,E),t(E,B).
sigpep(A,B):- m(A,C), star(C,D),t(D,E),neutral(E,F),small(F,B).
sigpep(A,B):- m(A,C), star(C,D),veryhydrophobic(D,E),positive(E,F),tiny(F,B).
sigpep(A,B):- hydrophobic(A,C),run(C,D,hydrophobic),star(D,E),f(E,F), hydrophobic(F,B).
sigpep(A,B):- hydrophobic(A,C),star(C,D),h(D,E), hydrophobic(E,F),tiny(F,B).
sigpep(A,B):- hydrophobic(A,C),star(C,D),v(D,E), hydrophobic(E,F),neutral(F,B).
sigpep(A,B):- large(A,C), star(C,D),a(D,E),hydrophobic(E,F),small(F,B).
sigpep(A,B):- large(A,C), star(C,D),s(D,E),neutral(E,F),small(F,B).
start(A,B):-a(A,C), veryhydrophobic(C,B).
start(A,B):-d(A,C), t(C,B). start(A,B):-g(A,C), v(C,B). start(A,B):-h(A,C), r(C,B). start(A,B):-k(A,C), r(C,B). start(A,B):-l(A,C), r(C,B). start(A,B):-q(A,C), g(C,B). start(A,B):- s(A,C),l(C,B). start(A,B):- w(A,C),q(C,B).
start(A,B):- hydrophilic(A,C),a(C,B).
start(A,B):- hydrophilic(A,C),hydrophilic(C,B).
start(A,B):- positive(A,C),k(C,B).
start(A,B):- small(A,C),r(C,B).
middle(A,B):- yvh(A,C), star(C,D),large(D,E),large(E,B).
middle(A,B):- positive(A,C),star(C,D), neutral(D,E),large(E,F), large(F,B).
middle(A,B):- hydrobacc(A,C), star(C,D),hydrophobic(D,E),neutral(E,F),aromatic(F,B).
middle(A,B):- hydrobacc(A,C), yvh(C,D), star(D,B).
middle(A,B):- small(A,C), star(C,D),p(D,E),large(E,F), large(F,B).
middle(A,B):- y(A,C), star(C,D),g(D,E),hydrophobic(E,B).
middle(A,B):- hydrobacc(A,C), star(C,D),k(D,E), neutral(E,F),small(F,B).
middle(A,B):- small(A,C), star(C,D),l(D,E),m(E,B).
middle(A,B):- small(A,C), star(C,D),f(D,E),hydrophobic(E,F),aliphatic(F,B).
middle(A,B):- tiny(A,C),star(C,D),m(D,B).
middle(A,B):- q(A,C), star(C,D),positive(D,E),neutral(E,F),neutral(F,B).
middle(A,B):- hydrophobic(A,C),star(C,D),m(D,E), hydrophilic(E,F),neutral(F,B).
middle(A,B):- e(A,C), star(C,D),i(D,B).
middle(A,B):- q(A,C), star(C,D),l(D,B).
middle(A,B):- aromatic(A,C),star(C,D), v(D,E),neutral(E,F),hydrobdon(F,B).
middle(A,B):- aromatic(A,C),star(C,D), a(D,E),e(E,B).
middle(A,B):- c(A,C), star(C,D),c(D,B).
middle(A,B):- y(A,C), star(C,D),hydrobdon(D,E), hydrobdon(E,B).