Rochester Institute of Technology
RIT Scholar Works
Theses
Summer 2005
Isoelectric point prediction from the amino acid sequence of a protein
Matthew Conte
Recommended Citation
Conte, Matthew, "Isoelectric point prediction from the amino acid sequence of a protein" (2005). Thesis. Rochester Institute of Technology. Accessed from
THESIS
ISOELECTRIC POINT PREDICTION FROM THE AMINO ACID SEQUENCE OF A PROTEIN
Submitted by
Matthew Conte
Department of Biological Sciences
In partial fulfillment of the requirements for the Master of Science degree in Bioinformatics at Rochester Institute of Technology
To: Head, Department of Biological Sciences
Rochester Institute of Technology
Department of Biological Sciences
Bioinformatics Program
The undersigned state that ______ (Student Name), ______ (Student Number), a candidate for the Master of Science degree in Bioinformatics, has submitted his/her thesis and has satisfactorily defended it.
This completes the requirements for the Master of Science degree in Bioinformatics at Rochester Institute of Technology.
Thesis committee members:
Name                                Date
Gary R. Skuse (Committee Chair)
Paul A. Craig (Thesis Advisor)
Name Illegible
Douglas P. Merrill

Thesis/Dissertation Author Permission Statement
Title of thesis or dissertation: ______
Name of author: ______
Degree: ______
Program: ______
College: ______
I understand that I must submit a print copy of my thesis or dissertation to the RIT Archives, per current RIT guidelines for the completion of my degree. I hereby grant to the Rochester Institute of Technology and its agents the non-exclusive license to archive and make accessible my thesis or dissertation in whole or in part in all forms of media in perpetuity. I retain all other ownership rights to the copyright of the thesis or dissertation. I also retain the right to use in future works (such as articles or books) all or part of this thesis or dissertation.

Print Reproduction Permission Granted:
I, ______, hereby grant permission to the Rochester Institute of Technology to reproduce my print thesis or dissertation in whole or in part. Any reproduction will not be for commercial use or profit.
Signature of Author: Matthew Conte    Date: ______

Print Reproduction Permission Denied:
I, ______, hereby deny permission to the RIT Library of the Rochester Institute of Technology to reproduce my print thesis or dissertation in whole or in part.
Signature of Author: ______    Date: ______

Inclusion in the RIT Digital Media Library Electronic Thesis & Dissertation (ETD) Archive
I, ______, additionally grant to the Rochester Institute of Technology Digital Media Library (RIT DML) the non-exclusive license to archive and provide electronic access to my thesis or dissertation in whole or in part in all forms of media in perpetuity. I understand that my work, in addition to its bibliographic record and abstract, will be available to the world-wide community of scholars and researchers through the RIT DML. I retain all other ownership rights to the copyright of the thesis or dissertation. I also retain the right to use in future works (such as articles or books) all or part of this thesis or dissertation. I am aware that the Rochester Institute of Technology does not require registration of copyright for ETDs.
I hereby certify that, if appropriate, I have obtained and attached written permission statements from the owners of each third party copyrighted matter to be included in my thesis or dissertation. I certify that the version I submitted is the same as that approved by my committee.
Abstract
Proteins often do not migrate as expected in two dimensional electrophoresis based on their primary sequence. The predicted isoelectric point (pI) frequently does not coincide with experimental pI values obtained in the laboratory. The reasons for these differences led to this study. Initially, 2DE data from the E. coli proteome was collected and formatted. This data set was split into three parts, each consisting of different levels of pI discrepancy (ΔpI). The protein sequence data for each ΔpI subset was run through a pipeline. At each stage of the pipeline the data were analyzed by comparing each of the three ΔpI subsets to one another. The pipeline consisted of a naive approach (considering individual amino acid frequencies), followed by the application of four different alphabets to represent sequences in a simpler way by grouping similar amino acids based on their charge, functional, chemical, and hydrophobic properties. The final step in the pipeline involved investigating the dipeptides of all of these sequences using both the 20 amino acid alphabet and the simplified groupings. An evaluation of the alphabet dipeptide analysis demonstrated the existence of certain dipeptide sequences which correlate well
Table of Contents
1 Introduction
2 Methods
2.1 Forming the data set
2.2 Experimental and predicted pI values
2.3 Extracting useful information from collected subset sequences
2.3.1 Amino acid frequency analysis (naive approach)
2.3.2 Frequency of amino acids (alphabets approach)
2.3.3 Frequency of amino acids (dipeptide approach)
2.3.4 Pipeline workflow
3 Results
3.1 Naive approach
3.2 Alphabets approach
3.2.1 Charge
3.2.2 Chemical
3.2.3 Functional
3.2.4 Hydrophobic
3.3 Dipeptide approach
3.4 Dipeptide threshold
3.5 Dipeptide using alphabets
3.5.1 Charge
3.5.2 Chemical
3.5.3 Functional
3.5.4 Hydrophobic
4 Discussion
5 Conclusions
6 References
Introduction

Two-dimensional gel electrophoresis (2DE) has been an important laboratory technique for the field of proteomics for over two decades. 2DE allows the researcher to separate and identify thousands of proteins from a cellular extract in a single experiment. 2DE is difficult and time consuming, as it is necessary to determine ideal initial conditions, wait for results, and possibly change conditions after that (1). In addition, reproducibility of gels and comparison of 2DE results between separate groups has proved difficult (1). In 2DE, proteins are separated in the first dimension by their isoelectric points (the pH at which the net charge of the protein is zero) and in the second dimension by their molecular weights. The accurate prediction of protein isoelectric point (pI) and molecular weight (MW) using simply the amino acid sequence of the protein would be extremely valuable to researchers who use two-dimensional gel electrophoresis.

Computational procedures for calculating and predicting the pI from the amino acid composition of a protein based on the dissociation constants of the charged groups within the protein have been developed (2-8). The accuracy of these algorithms is limited by the certainty of the values for the dissociation constants and by microenvironmental effects such as charge-charge interactions and post-translational modifications.

To systematically explore the relationship between pI, molecular weight and protein sequence, a data set of proteins was collected and organized from a model

post-phosphorylation which can alter the pI/MW; the presence of these modifications makes pI/MW predictions much more difficult since the modifications in the proteins may cause them to migrate to a position on a 2-D gel that is quite different than what is predicted based solely on the amino acid sequence of the protein. E. coli is also one of the best characterized prokaryotes and much more data beyond simply the protein sequence for each protein is widely available for it.

At this point it is necessary to consider the basic structural features of proteins and the role of individual amino acids in the structure and function of proteins. Figure 1 below shows the structure of the 20 amino acids with side chain structures shown in red (10).

The charge on all proteins arises from some of the amino acid side chains, as well as the carboxy- and amino-termini, some prosthetic groups, and bound ions. Our pI prediction tool (11) is designed to calculate charge based on the side chains and carboxy- and amino-termini. The charge on amino acid side chains depends on the pH of the solution and the pKa of the side chains. It is also affected by the localized environment around a side chain. Our current calculation model uses the following pKa values for ionizable groups on the protein and does not make any adjustments to the pKa values of the side chains regardless of their environment within the protein (Table 1). We also assume that the separation is based on the total charge on the protein, not the
[Figure 1: structural drawings of the 20 amino acids: Arginine (Arg/R), Glutamine (Gln/Q), Phenylalanine (Phe/F), Tyrosine (Tyr/Y), Tryptophan (Trp/W), Lysine (Lys/K), Glycine (Gly/G), Alanine (Ala/A), Histidine (His/H), Serine (Ser/S), Proline (Pro/P), Glutamic Acid (Glu/E), Aspartic Acid (Asp/D), Threonine (Thr/T), Cysteine (Cys/C), Methionine (Met/M), Leucine (Leu/L), Asparagine (Asn/N), Isoleucine (Ile/I), Valine (Val/V)]
Figure 1. Structures of amino acids with side chains shown in red, carboxylate groups in green, and amino groups in blue (10).
The charge on the protein is the sum of the charges on the individual amino acid side chains. However, the charge on individual amino acid side chains can vary when

pKa for glutamic acid is about 4.1. In lysozyme, two glutamic acid residues are in the active site. One is in a polar environment and has a normal pKa value. The other glutamate side chain is in a hydrophobic environment, where a negative charge is energetically unfavorable. Therefore the pKa value for this glutamate side chain increases, which then decreases the extent of the deprotonation of that side chain. This is very important in the mechanism of lysozyme activity, which requires that one of the side chains be charged (deprotonated) and the other be uncharged (protonated) at the same time. In a second example, the serine in the active sites of serine proteases has a much different acid-base behavior than other serines normally found in proteins (9). The normal pKa value for the hydroxyl group on the serine side chain is greater than 15, meaning that this group is not found in an ionized state in most proteins. In serine proteases, the interaction of the active site serine with nearby histidine and aspartate side chains (the so-called catalytic triad) leads to the ionization of the serine hydroxyl group. Meanwhile, the pKa value is reduced from about 15 to a value closer to 7 or 8. This example makes it clear that the microenvironment of an individual amino acid side chain can change its ionization behavior.

Other effects on the pKa of an amino acid side chain can be seen when certain amino acids are positioned next to each other. For example, a typical arginine residue, which is basic, will have a pKa of about 12.5 (Table 1 below) and carry a full +1 charge in the physiological pH range. However, when two of these basic arginine residues are adjacent in a protein sequence, the pKa values will decrease due to repulsion between the arginine side chains, which become less ionized and carry only a fractional positive charge. Table 1 below lists the typical pKa values for ionizable groups in proteins (9).
Group                            Typical pKa
Terminal α-carboxyl group        3.1
Aspartic acid, Glutamic acid     4.1
Histidine                        6.0
Terminal α-amino group           8.0
Cysteine                         8.3
Tyrosine                         10.9
Lysine                           10.8
Arginine                         12.5

Table 1. These are pKa values that are commonly found for these side chains when they are part of a protein. The pKa values for these side chains may be quite different for the free amino acid in solution. pKa values also depend on temperature, ionic strength, and the microenvironment of the ionizable group (9).
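
The fixed-pKa charge model described above can be illustrated with a short sketch. This is not the thesis's prediction tool (11); it is a minimal Python illustration that assumes the Henderson-Hasselbalch relation for each ionizable group, uses only the Table 1 values, and finds the pH of zero net charge by bisection. The function names `net_charge` and `predict_pi` are hypothetical.

```python
# Typical pKa values taken from Table 1 (no microenvironment adjustment).
PKA_POSITIVE = {"H": 6.0, "K": 10.8, "R": 12.5}             # carry +1 when protonated
PKA_NEGATIVE = {"D": 4.1, "E": 4.1, "C": 8.3, "Y": 10.9}    # carry -1 when deprotonated
PKA_N_TERM = 8.0   # terminal alpha-amino group
PKA_C_TERM = 3.1   # terminal alpha-carboxyl group


def net_charge(sequence, ph):
    """Sum the fractional charges of all ionizable groups at the given pH."""
    charge = 1.0 / (1.0 + 10 ** (ph - PKA_N_TERM))      # N-terminus (positive)
    charge -= 1.0 / (1.0 + 10 ** (PKA_C_TERM - ph))     # C-terminus (negative)
    for residue in sequence.upper():
        if residue in PKA_POSITIVE:
            charge += 1.0 / (1.0 + 10 ** (ph - PKA_POSITIVE[residue]))
        elif residue in PKA_NEGATIVE:
            charge -= 1.0 / (1.0 + 10 ** (PKA_NEGATIVE[residue] - ph))
    return charge


def predict_pi(sequence, low=0.0, high=14.0, tolerance=1e-4):
    """Bisect on pH; net charge decreases monotonically as pH rises."""
    while high - low > tolerance:
        mid = (low + high) / 2.0
        if net_charge(sequence, mid) > 0:
            low = mid    # still positively charged: pI lies at higher pH
        else:
            high = mid   # neutral or negative: pI lies at lower pH
    return (low + high) / 2.0
```

Because every residue of a given type receives the same pKa, a sketch like this cannot capture the lysozyme, serine protease, or adjacent-arginine effects described earlier, which is the limitation that motivates the rest of the study.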
As we began to consider the impact of amino acid sequence on the ionization behavior of individual amino acid side chains, the need to create groups of amino acids based on their chemical and physical characteristics, rather than concentrating on each individual amino acid, became apparent. We elected to divide the amino acids into groups based on their chemical, functional, charge, and hydrophobic characteristics. Dividing sets of amino acids into these groups enables us to use smaller alphabets based on these characteristics as opposed to simply using the normal 20 letter amino acid alphabet in our calculations. We used these property groups to rewrite protein sequences into an alternative alphabet that is much smaller than the normal amino acid alphabet of 20 characters (12).

on which amino acids fall under what particular types. The Methods section contains examples of protein sequences that have been translated into these different alphabets.
Alphabet Type (size)   Code   Meaning       Amino Acids with that Code
Charge (3)             A      Negative      D, E
                       C      Positive      H, K, R
                       N      No charge     A, C, F, G, I, L, M, N, P, Q, S, T, V, W, Y
Chemical (8)           A      Acidic        D, E
                       L      Aliphatic     A, G, I, L, V
                       M      Amide         N, Q
                       R      Aromatic      F, W, Y
                       C      Basic         R, H, K
                       H      Hydroxyl      S, T
                       I      Imino         P
                       S      Sulphur       C, M
Functional (4)         A      Acidic        D, E
                       C      Basic         H, K, R
                       H      Hydrophobic   A, F, I, L, M, P, V, W
                       P      Polar         C, G, N, Q, S, T, Y
Hydrophobic (2)        O      Hydrophobic   A, F, I, L, M, P, V, W
                       I      Hydrophilic   C, D, E, G, H, K, N, Q, R, S, T, Y

Table 2. Description of four abbreviated amino acid sequence alphabets: Charge, Chemical, Functional, and Hydrophobic (12). Shown are the new alphabet codes used for each different alphabet, what each code represents in terms of properties of amino acids, and the specific amino acids that are included in each property.
Proteins that have a significant difference between their predicted pI/MW (obtained using similar algorithms as mentioned above) and their experimental pI/MW will be studied. As mentioned before, certain amino acids that occur in a particular

of certain proteins (those with large ΔpI values) that do not occur in the other proteins whose pI values were accurately predicted are important. They may lead to a more accurate prediction of the pI and MW of all proteins from their amino acid compositions.
Methods

Forming the data set

The ExPASy Server's SWISS-2DPAGE database (13) provides extensive 2-D gel information for human, mouse, Arabidopsis thaliana, Dictyostelium discoideum, E. coli, Saccharomyces cerevisiae, and Staphylococcus aureus (N315), which are also cross-referenced in Swiss-Prot. Each protein in the database is collected and annotated from experimental 2-D gels read from reference maps. The database for this project contains 336 proteins of the E. coli proteome characterized by five different research groups (14-18). It was decided that the compilation of pI/MW sets for these proteins should be separated according to each research group since experimental conditions varied among them. The proteins contributed by the Phillips et al. (14), Pasquali et al. (15), and Vanbogelen et al. (16) groups were ignored because these proteins were also characterized by the Tonella et al. (17) and Yan et al. (18) groups. Two sets were created: the first contains 228 of the proteins characterized by Tonella et al. and the second contains 153 of the proteins characterized by Yan et al. The first set was also separated based on the pH range used for isoelectric focusing (pH 4-5, 4.5-5.5, 5-6, 5.5-6.7, 6-9, and 6-11). We concentrated on the Tonella et al. set because it covered more than 70% of the E. coli

We then matched the pI/MW data for each protein with its FASTA sequence. This allows us to compare experimental pI/MW values with predicted pI/MW values.
ExPASy provides its own tool for predicting pI/MW, which requires a list of Swiss-Prot protein IDs as its input (19). We have also developed our own tool that includes a pI/MW prediction, which requires input of sequences in FASTA format, Genbank format, or Protein Data Bank format (11). Both of these prediction tools are based (especially for pI) on a calculation using pKa values of amino acids as described earlier in the introduction and by Bjellqvist et al. (19).

The first step was to retrieve the 2-D gel information for all of these proteins. ExPASy provides a way to get the data from each 2-D gel in a tab delimited format that includes each spot (one protein can have multiple spots on a gel). Having this data in a tab delimited format gave far greater ease of use when later performing any type of analysis on the data (such as comparing experimental pI to predicted pI). The fields contained in these files included: gene name, protein description, SWISS-2DPAGE Serial Number, SWISS-2DPAGE Accession Number, identification method (gel matching, microsequencing, or peptide mass fingerprinting), experimental pI, experimental MW, and references.

A list of Swiss-Prot protein IDs (2DPAGE Accession Number, e.g. P00274) was then made for each of the gels. This list of proteins was then used to retrieve a FASTA file of the proteins from each gel (some proteins were repeated for multiple spots). The Swiss-Prot IDs were submitted to the NCBI tool for retrieving sequences at http://www.ncbi.nlm.nih.gov/entrez/batchentrez.cgi?db=Protein. The sequences were downloaded in FASTA format to be used in our prediction tool. Batch retrieval at NCBI

whatever reason, the initial methionine residue when retrieving in FASTA format. The FASTA file for the set of proteins from each gel was then fed into our tool, where the output can be conveniently recorded to a Microsoft Excel file. However, problems occurred when using the FASTA file from NCBI in our tool since it would order the file based on Genbank accession number and not by Swiss-Prot ID, which was needed to match the tab delimited file for each gel. This was solved by removing the Genbank accession number (leaving just the Swiss-Prot ID) from each protein entry in each respective FASTA file using a simple Perl script. This was facilitated by a few regular expressions, most notably: ":%s/gi|\d*|sp|//" (quotations excluded).

The pI/MW prediction tool at ExPASy (19) was not quite as easy to use since it does not output into a format that can be imported into Excel readily. The output file was edited using the following regular expression: ":%s/\s\s*/\t/g" (quotations excluded), which transformed it into a tab delimited text file, allowing it to be easily manipulated in Excel. Nevertheless the "Compute pI/MW tool" at ExPASy (19) gave strikingly similar results to our tool. Both experimental data sets, derived from the Tonella data (17) and the Yan (DIGE) data (18), were compared with both pI/MW prediction tools and the results can be seen in the Excel files at http://www.rit.edu/~mac3948/E2D/Ecoli/.
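
The two clean-up steps described above (dropping the Genbank identifier from each FASTA header and converting whitespace-separated output to tab-delimited text) can be sketched as follows. This is a Python illustration rather than the Perl script or editor substitutions used in the thesis; the function names and the sample header in the usage note are hypothetical.

```python
import re

# Matches the leading "gi|<number>|sp|" portion of an NCBI-style header.
GI_PREFIX = re.compile(r"gi\|\d*\|sp\|")


def strip_gi_prefix(fasta_text):
    """Remove the Genbank gi prefix from header lines, leaving the Swiss-Prot ID first."""
    cleaned = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            line = GI_PREFIX.sub("", line, count=1)
        cleaned.append(line)
    return "\n".join(cleaned)


def to_tab_delimited(line):
    """Collapse each run of whitespace into a single tab character."""
    return re.sub(r"\s+", "\t", line.strip())
```

For instance, a header such as `>gi|12345|sp|P00274|EXAMPLE` (an invented header built around the accession number quoted above) would become `>P00274|EXAMPLE`, so records can be matched to the tab delimited gel file by Swiss-Prot ID.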
Experimental and predicted pI values

Looking at the compiled data set, it was noticeable that some predicted pI values were far different from experimental pI values. Some proteins differed in predicted pI versus experimental pI by as much as 1.86 pH units (e.g. P06128, Phosphate-binding

pI was exactly the same as the experimental pI (e.g. P06960, Ornithine carbamoyltransferase chain F (OTCase-2), see Appendix A).

To better characterize these discrepancies across all of the proteins, a simple calculation was performed:

$$\Delta pI = pI_{\text{experimental}} - pI_{\text{predicted}} \qquad \text{(Eq. 1)}$$

The difference between experimental pI and predicted pI will be referred to as ΔpI in this paper. The main focus of this project is to identify potential causes of varying ΔpI values.

The data set was then broken down into roughly thirds. The first subset consisted of 60 proteins where the ΔpI value was less than 0.1. Another subset held 58 proteins with ΔpI values greater than 0.3 but less than 0.7 (0.3 < ΔpI < 0.7). The last third was put into a subset of 50 proteins where the ΔpI value was greater than 0.7. Refer to the tables in Appendix A for a list of the proteins in each ΔpI subset.
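
The subset assignment can be sketched in a few lines. The thesis does not state whether the thresholds apply to the signed difference or to its magnitude; the sketch below assumes the magnitude, and the subset labels and function names are hypothetical.

```python
def delta_pi(experimental_pi, predicted_pi):
    """Eq. 1: experimental pI minus predicted pI."""
    return experimental_pi - predicted_pi


def assign_subset(delta):
    """Place a protein in one of the three subsets, or None if it falls in a gap.

    Assumption: thresholds are applied to the absolute value of the difference.
    """
    magnitude = abs(delta)
    if magnitude < 0.1:
        return "low"       # predicted pI close to experimental pI
    if 0.3 < magnitude < 0.7:
        return "medium"
    if magnitude > 0.7:
        return "high"      # large discrepancy
    return None            # between 0.1 and 0.3, or exactly 0.7: not used
```

Leaving the 0.1 to 0.3 range unassigned mirrors the description above, where the three subsets are separated by a gap rather than covering every protein.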
The following sections describe the sequential steps that were performed in the analysis of these data subsets. It starts with a naive approach to handling the data that deals with simply calculating raw frequencies of the 20 amino acids. The next section explains how we used the four different alphabets to analyze the data subsets, still focusing on individual amino acid frequencies. The dipeptide approaches are described next, followed by a final section that summarizes how the whole process flows together.

Extracting useful information from collected subset sequences

Amino acid frequency analysis (the naive approach)

There is a naive approach to finding a significant difference between each of the subsets of ΔpI ranges. This method involves determining the counts of each amino acid between the ΔpI subsets. If a significant difference for any amino acid does exist between any of the ΔpI subsets, then this would be of great interest. It would then be possible to adjust a pI prediction algorithm based on individual amino acid frequency values and predict pI values that are closer to experimental values.

The first step in the naive approach was to start from the list of proteins for each ΔpI subset. As previously described, the batch sequence retrieval at the NCBI was used to obtain a FASTA file that contained each sequence included in each ΔpI subset. A Perl program was then written to count the number of amino acids in each sequence from a FASTA file and calculate the frequency of each, outputting a tab delimited file displaying all of the frequencies for each sequence. The code of this program can be found in Appendix B (aacounts.pl).
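
As an illustration of the counting step, the following Python sketch parses FASTA text and computes per-sequence frequencies of the 20 standard amino acids. It is not the aacounts.pl program from Appendix B; the function names are hypothetical, and ignoring non-standard letters is an assumption of this sketch.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def amino_acid_frequencies(sequence):
    """Fraction of the sequence made up by each of the 20 standard amino acids."""
    counts = Counter(sequence)
    total = sum(counts[aa] for aa in AMINO_ACIDS)  # non-standard letters are ignored
    return {aa: (counts[aa] / total if total else 0.0) for aa in AMINO_ACIDS}
```

Writing one tab-separated row of these frequencies per record would give a table comparable in shape to the tab delimited output described above.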
Another Perl program was written which concatenates each separate sequence into one long sequence. This allows one to look at the amino acid frequencies encompassing each ΔpI subset as a whole instead of protein by protein. The program also makes sure that each protein sequence is kept separate and that the header line of each sequence is removed (see Appendix B, makeComposite.pl), which will be shown to be important shortly when looking at two amino acids that occur one right after the other (see dipeptide approach).
Frequency of amino acids (alphabets approach)

Charge alphabet

A more sophisticated analysis of amino acid frequency can be done if the side chains of the amino acids are used to assign them to four abbreviated amino acid alphabets (Charge, Chemical, Functional, and Hydrophobic). The Charge alphabet (see Table 2) is based on whether the side chain of an amino acid can have a positive or negative charge, or is simply uncharged (neutral). Glutamic Acid (Glu/E) and Aspartic Acid (Asp/D) are the only amino acids that contain the negatively charged carboxyl group (COO-). Therefore, in the Charge alphabet they are grouped together and given the code A. Likewise, Lysine (Lys/K) and Arginine (Arg/R) are amino acids that contain positively charged amino groups (the lysine side chain contains an ε-amino group and arginine has a guanidino group). In the Charge alphabet they are grouped together with the code C. Histidine (His/H) is also grouped into the positively charged amino acid group because protonation of the nitrogen on its side chain occurs easily. The remaining 15 amino acids have side chains which normally do not demonstrate charge behavior in proteins; they are grouped together and given the code N. An example of using the Charge alphabet can be seen below:

ACDEFGH (original sequence)
↓
NNAANNC (Charge alphabet sequence)

Chemical alphabet

The Chemical alphabet incorporates two groupings, acidic and basic, with codes A and C, respectively. These groupings are analogous to the A and C groupings in the Charge alphabet for the same reasons. The Chemical alphabet characterizes the remaining 15 amino acids based on more than their lack of a charge. Asparagine (Asn/N) and Glutamine (Gln/Q) are amino acids that contain an amide (CONH2) and are

/W), and Tyrosine (Tyr/Y) contain aromatic rings (code R). Serine (Ser/S) and Threonine (Thr/T) contain the hydroxyl group (OH) on their side chains (code H). Proline (Pro/P) contains an imino group (>C=NH) on its side chain (code I). Finally, the sulfur containing amino acids Cysteine (Cys/C) and Methionine (Met/M) are grouped together with code S. An example of using the Chemical alphabet can be seen below:

ACDEFGHNPS (original sequence)
↓
LSAARLCMIH (Chemical alphabet sequence)

Functional alphabet

The Functional alphabet again incorporates the A (acidic) and C (basic) groups, as did the Charge and Chemical alphabets. The Functional alphabet divides the remaining amino acids into 2 groups, H (hydrophobic) and P (polar), based on whether the amino acid is hydrophobic (such as Alanine) or polar (such as Cysteine). An example of using the Functional alphabet can be seen below:

ACDEFGH (original sequence)
↓
HPAAHPC (Functional alphabet sequence)

Hydrophobic alphabet

The Hydrophobic alphabet is similar to the latter half of the Functional alphabet. It groups amino acids based only on hydrophobicity. Amino acids that are hydrophilic (such as Cysteine) are given the code I. Amino acids that are hydrophobic (such as Alanine) are given the code O. An example of using the Hydrophobic alphabet can be seen below:

ACDEFGH (original sequence)
↓
OIIIOII (Hydrophobic alphabet sequence)

Perl programs were written that convert normal sequences into each of the four alphabets just described (see charge.pl, chemical.pl, functional.pl, and hydro.pl in Appendix B). The programs also calculate and display the frequency of each alphabetic code that is chosen.
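
The four translations can be sketched as simple lookup tables. The thesis programs use Bio::Tools::OddCodes (12) in Perl; the Python below is only an illustration of the same groupings as described in the text, with hypothetical names (`ALPHABETS`, `translate`, `code_frequencies`).

```python
from collections import Counter

# Code -> member amino acids, following the groupings described in the text.
ALPHABETS = {
    "charge": {"A": "DE", "C": "HKR", "N": "ACFGILMNPQSTVWY"},
    "chemical": {"A": "DE", "L": "AGILV", "M": "NQ", "R": "FWY",
                 "C": "HKR", "H": "ST", "I": "P", "S": "CM"},
    "functional": {"A": "DE", "C": "HKR", "H": "AFILMPVW", "P": "CGNQSTY"},
    "hydrophobic": {"O": "AFILMPVW", "I": "CDEGHKNQRSTY"},
}


def translate(sequence, alphabet):
    """Rewrite a protein sequence in one of the reduced alphabets.

    Raises KeyError for letters outside the 20 standard amino acids.
    """
    groups = ALPHABETS[alphabet]
    table = {aa: code for code, members in groups.items() for aa in members}
    return "".join(table[aa] for aa in sequence.upper())


def code_frequencies(translated):
    """Relative frequency of each code in an already translated sequence."""
    counts = Counter(translated)
    total = len(translated)
    return {code: n / total for code, n in counts.items()} if total else {}
```

For example, `translate("ACDEFGH", "charge")` gives `NNAANNC`, matching the Charge alphabet example above.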
Frequency of amino acids (dipeptide approach)

The problem of abnormal side chain pKa values of certain amino acids affecting the overall charge of a protein had still not been dealt with up to this point. All that had been considered was the sum of a set of strict pKa values for each amino acid, without taking into account any changes that might occur due to certain amino acids being next to other amino acids in sequence. The approach to solving this problem was to examine every "dipeptide" in the three ΔpI subsets. A sequence of length 7 has 6 dipeptides. For example,

Sequence:   ABCABBC
Dipeptides: AB, BC, CA, AB, BB, BC

Dipeptide   Count   Frequency
AB          2       0.333
BC          2       0.333
CA          1       0.167
BB          1       0.167

The frequency at which each dipeptide occurs in a particular sequence is of interest, particularly when they are considered in each ΔpI subset. A Perl program was

dipeptide in the sequences of the FASTA file that is input (see Appendix B: dipeps.pl for dipeptides output in increasing order, or dipepsA.pl for dipeptides output alphabetically from AA ... VV). As was the case earlier with the normal amino acid alphabet, the number of different dipeptides (20 x 20 = 400 for the normal alphabet) became problematic. The same dipeptide technique was applied to sequences after converting them into the Charge, Chemical, Functional, and Hydrophobic alphabets to alleviate this problem.

Combining an entire ΔpI subset of FASTA sequences into one long sequence (using makeComposite.pl, see Appendix B) also became problematic. To count the number of dipeptides in a set of sequences that has been combined into one long sequence, special attention needs to be paid so that the last amino acid in one sequence and the first amino acid in the next sequence are not counted as a dipeptide. The format of the output file from makeComposite.pl handles this problem by replacing each accession line with a blank newline. The other programs can now use this formatted FASTA file so that the dipeptide counts are just as accurate as the naive and alphabet counts.
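
The boundary rule described above can be sketched as follows: dipeptides are counted within each sequence only, and a composite file is split back into sequences at blank lines before counting. This Python is an illustration, not the dipeps.pl or makeComposite.pl code from Appendix B; the function names, and the assumption that a sequence may wrap over several consecutive lines, are mine.

```python
from collections import Counter


def split_composite(text):
    """Split a composite file into sequences, using blank lines as separators."""
    sequences, chunks = [], []
    for line in text.splitlines():
        line = line.strip()
        if line:
            chunks.append(line)       # a sequence may wrap over several lines
        elif chunks:
            sequences.append("".join(chunks))
            chunks = []
    if chunks:
        sequences.append("".join(chunks))
    return sequences


def dipeptide_counts(sequences):
    """Count overlapping residue pairs, never pairing across two sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - 1):
            counts[seq[i:i + 2]] += 1
    return counts


def dipeptide_frequencies(sequences):
    """Relative frequency of each dipeptide over the whole set of sequences."""
    counts = dipeptide_counts(sequences)
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()} if total else {}
```

Applied to the single sequence `ABCABBC`, `dipeptide_counts` yields AB = 2, BC = 2, CA = 1, and BB = 1, in agreement with the worked example above. The same functions work unchanged on sequences already translated into one of the reduced alphabets.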
Pipeline Workflow

So far there have been stages at which the frequency of an amino acid, group of amino acids (coded according to the four alphabets), dipeptide, or grouped dipeptide (coded according to the four alphabets) has been examined. The process of transforming the data to reach each of these stages may appear somewhat confusing. Figure 2 below diagrams how to go from an initial set of FASTA sequences (for each ΔpI subset) to each stage of analysis. The flow in taking the naive approach would go from FASTA sequence to makeComposite.pl to aacounts.pl and then analysis. However, the flow for examining dipeptides with a functional alphabet is more complex. It begins by transferring the FASTA sequence to makeComposite.pl, then to functional.pl, then to dipeps.pl (or dipepsA.pl), followed by analysis. Table 3 below gives a brief description of each program used in this pipeline workflow (for a more detailed description and code of each program see Appendix B).

[Figure 2: workflow diagram connecting the ΔpI subset FASTA file, makeComposite.pl, charge.pl, chemical.pl, functional.pl, hydro.pl, dipeps.pl or dipepsA.pl, aacounts.pl, and analysis]
Figure 2. Workflow diagram that shows how to get to each stage of analysis (naive, alphabets, dipeptides).

Program            Description
aacounts.pl        Counts the number of each amino acid (normal alphabet) in a sequence from a FASTA file and determines the frequency of each. Output is to FASTAfilename.aacounts
charge.pl          Converts the amino acids from the sequences in a FASTA file into a 3-letter alphabet using the charge() method in Bio::Tools::OddCodes (12). It then counts the number of each code for each sequence as well as each frequency.
chemical.pl        Converts the amino acids from the sequences in a FASTA file into an 8-letter alphabet using the chemical() method in Bio::Tools::OddCodes (12). It then counts the number of each code for each sequence as well as each frequency.
dipeps.pl          Counts the number of each different amino acid pair for each sequence in the given FASTA files. It displays each pair in order from highest frequency to lowest.
dipepsA.pl         Counts the number of each different amino acid pair for each sequence in the given FASTA files. It displays each pair in alphabetical order (AA ... VV).
functional.pl      Converts the amino acids from the sequences in a FASTA file into a 4-letter alphabet using the functional() method in Bio::Tools::OddCodes (12). It then counts the number of each code for each sequence as well as each frequency.
hydro.pl           Converts the amino acids from the sequences in a FASTA file into a 2-letter alphabet using the hydrophobic() method in Bio::Tools::OddCodes (12). It then counts the number of each code for each sequence as well as each frequency.
makeComposite.pl   Converts FASTA files of multiple sequences into a single (composite) sequence. This composite sequence is then able to be used with other programs listed here.

Table 3. Description of the programs used in this pipeline workflow. Appendix B
Results

Naive approach

The initial naive approach to analyzing the data set was done to determine the counts of each amino acid (using the normal alphabet) in each ΔpI subset (ΔpI < 0.1; 0.3 < ΔpI < 0.7; ΔpI > 0.7) and compare the relative frequency of occurrence for each amino acid between the ΔpI subsets. A comparison of the frequencies between the ΔpI < 0.1 subset and the 0.3 < ΔpI < 0.7 subset is shown in Figure 3. A similar comparison between the ΔpI < 0.1 subset and the ΔpI > 0.7 subset is displayed in Figure 4.

[Figure 3 plot: "Frequencies of Amino Acids in ΔpI < 0.1 and (0.3 < ΔpI < 0.7)"]
Figure 3. Frequency of Individual Amino Acids in Two ΔpI Subsets. The X axis labels represent the one letter abbreviations of the amino acids. Shown in blue is the ΔpI < 0.1 subset and shown in yellow is the 0.3 < ΔpI < 0.7 subset. The ΔpI < 0.1 subset consists of 60 proteins which comprise 22472 total amino acids. The 0.3 < ΔpI < 0.7 subset consists of 58 proteins which comprise 17906 total amino acids. More information about each individual protein in these ΔpI subsets can be seen in Appendix A.

[Figure 4 plot: "Frequencies of Amino Acids in ΔpI < 0.1 and ΔpI > 0.7"]
Figure 4. Frequency of Individual Amino Acids in Two ΔpI Subsets. The X axis labels represent the one letter abbreviations of the amino acids. Shown in blue is the ΔpI < 0.1 subset and shown in yellow is the ΔpI > 0.7 subset. The ΔpI < 0.1 subset consists of 60 proteins which comprise 22472 total amino acids. The ΔpI > 0.7 subset consists of 50 proteins which comprise 15581 total amino acids. More information about each individual protein in these ΔpI subsets can be seen in Appendix A.
Alphabets approach
-Charge
The nextstep inanalysis wasto convert each oftheAplsubsets intoasequence
thatutilizesthefouralphabets. This decreasesthe size oftheaminoacidalphabet and
reducesthenumberofvariables
being
examined. The differentalphabets aresummarizedin Table 2.
Using
theChargealphabet,a comparisonofthefrequenciesbetweentheApl<0. 1 subset andthe0.3 <Apl<0.7subsetis shownin Figure 5. Again
usingtheChargealphabetasimilar comparisonbetweenthe Apl<0.1 subset andtheApl
Frequencies ofAmino Acids (Charge alphabet) in
Apl<0.1 and (0.3<Apl<
0.7)
Apl< 0.1
?0.3< Apl< 0.7
CAN
Amino Acid (charge alphabet)
Figure 5.
Frequency
ofAmino AcidsUsing
the Charge Alphabet in Two Apl Subsets.Frequencies ofAmino Acids (Charge alphabet) in
Apl<0.1 andApl>0.7
80 70 -. 60 s? 50 > o g 40
|
30 "" 20 10 0 Apl< 0.1 ?Apl> 0.7; CAN AminoAcid (charge alphabet)Figure 6.
Frequency
ofAmino AcidsUsing
theCharge Alphabet in Two Apl Subsets.-Chemical
Using the Chemical alphabet, a comparison of the frequencies between the ΔpI < 0.1 subset and the 0.3 < ΔpI < 0.7 subset is shown in Figure 7. Figure 8 displays the same comparison between the ΔpI < 0.1 subset and the ΔpI > 0.7 subset.
[Chart: Frequencies of Amino Acids (Chemical alphabet) in ΔpI < 0.1 and 0.3 < ΔpI < 0.7. X axis: amino acid (chemical alphabet).]
Figure 7. Frequency of Amino Acids Using the Chemical Alphabet in Two ΔpI Subsets.
[Chart: Frequencies of Amino Acids (Chemical alphabet) in ΔpI < 0.1 and ΔpI > 0.7. X axis: amino acid (chemical alphabet).]
-Functional
Using the Functional alphabet, a comparison of the frequencies between the ΔpI < 0.1 subset and the 0.3 < ΔpI < 0.7 subset is shown in Figure 9. Again using the Functional alphabet, a similar comparison between the ΔpI < 0.1 subset and the ΔpI > 0.7 subset is displayed in Figure 10.
[Chart: Frequencies of Amino Acids (Functional alphabet) in ΔpI < 0.1 and 0.3 < ΔpI < 0.7. X axis: amino acid (functional alphabet).]
Figure 9. Frequency of Amino Acids Using the Functional Alphabet in Two ΔpI Subsets.
[Chart: Frequencies of Amino Acids (Functional alphabet) in ΔpI < 0.1 and ΔpI > 0.7. X axis: amino acid (functional alphabet).]
Figure 10. Frequency of Amino Acids Using the Functional Alphabet in Two ΔpI Subsets.
-Hydrophobic
Using the Hydrophobic alphabet, a comparison of the frequencies between the ΔpI < 0.1 subset and the 0.3 < ΔpI < 0.7 subset is shown in Figure 11. Again using the Hydrophobic alphabet, a similar comparison between the ΔpI < 0.1 subset and the ΔpI > 0.7 subset is displayed in Figure 12.
[Chart: Frequencies of Amino Acids (Hydrophobic alphabet) in ΔpI < 0.1 and 0.3 < ΔpI < 0.7. X axis: amino acid (hydrophobic alphabet), letters I, O.]
Figure 11. Frequency of Amino Acids Using the Hydrophobic Alphabet in Two ΔpI Subsets.
[Chart: Frequencies of Amino Acids (Hydrophobic alphabet) in ΔpI < 0.1 and ΔpI > 0.7. X axis: amino acid (hydrophobic alphabet), letters I, O.]
Figure 12. Frequency of Amino Acids Using the Hydrophobic Alphabet in Two ΔpI Subsets.
Dipeptide approach
Using a more sophisticated method that looks at dipeptides of a sequence gave an [...] is similar to the naive approach in that it just examines dipeptides using the normal amino acid alphabet. This results in up to 400 different dipeptides (there may be slightly fewer than 400 dipeptides in a given subset, owing to the chance that not all possible dipeptides occur). The difference in frequency of every dipeptide between ΔpI subsets was also calculated ("Delta frequency" or "Δ%"). In other words, a Delta % of 100 would mean that a certain dipeptide occurred 2 times as much in one subset compared to another subset. The differences, or "Delta %" values, can be seen in Figure 13 when comparing the ΔpI < 0.1 subset and the 0.3 < ΔpI < 0.7 subset. Figure 14 shows the similar Delta % values when comparing the ΔpI < 0.1 subset and the ΔpI > 0.7 subset. To better explain Figures 13-16, consider the bar indicated by the arrow in Figure 13. This bar represents the 11 times that there was a Δ% value between 100% and 150% when comparing dipeptide frequencies in the two different ΔpI sets.
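The Delta % calculation and the density counts plotted in the figures can be sketched as follows. This is a minimal illustration, not the code used in this study: the function names and toy sequences are hypothetical, the relative change is assumed to be taken with the first subset as the reference, and binning by absolute value is also an assumption. The bin edges are the ones that appear on the axis of Figure 14.

```python
from collections import Counter

def dipeptide_frequencies(sequences):
    """Return {dipeptide: percent of all dipeptides} for a list of sequences."""
    counts = Counter()
    for seq in sequences:
        # Overlapping pairs: a sequence of length n contributes n - 1 dipeptides.
        counts.update(seq[i:i + 2] for i in range(len(seq) - 1))
    total = sum(counts.values())
    return {dp: 100.0 * n / total for dp, n in counts.items()}

def delta_percent(freq_ref, freq_other):
    """Relative change in frequency (in %) from the reference subset to the other.

    Assumed definition: 100 * (f_other - f_ref) / f_ref, so a value of 100
    means the dipeptide is twice as frequent in the other subset.
    Dipeptides absent from either subset are skipped in this sketch.
    """
    shared = freq_ref.keys() & freq_other.keys()
    return {dp: 100.0 * (freq_other[dp] - freq_ref[dp]) / freq_ref[dp]
            for dp in shared}

def density(deltas, edges=(25, 50, 75, 100, 150, 200, 300, 400)):
    """Count how many |Delta %| values fall in each bin [edge, next edge)."""
    uppers = tuple(edges[1:]) + (float("inf"),)
    bins = {f">{edge}": 0 for edge in edges}
    for value in deltas.values():
        for lower, upper in zip(edges, uppers):
            if lower <= abs(value) < upper:
                bins[f">{lower}"] += 1
    return bins

# Toy subsets (hypothetical sequences, not data from Appendix A).
subset_a = ["MKKLLA", "MKAL"]
subset_b = ["MKKKKA", "MKKL"]
deltas = delta_percent(dipeptide_frequencies(subset_a),
                       dipeptide_frequencies(subset_b))
bins = density(deltas)
```

In the toy data the dipeptide KK rises from 12.5% to 50% of all dipeptides, so it lands in the ">300" bin, while dipeptides whose frequency is unchanged fall below the lowest bin edge and are not counted.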
[Chart: Densities of Delta % Values in ΔpI < 0.1 and 0.3 < ΔpI < 0.7 Using a Normal Amino Acid Alphabet.]
Figure 13. Density of Delta % Values of Dipeptides in Two ΔpI Subsets. The ΔpI < 0.1 subset consists of 60 proteins which comprise 22412 total dipeptides. The 0.3 < ΔpI < 0.7 subset consists of 58 proteins which comprise 17848 total dipeptides. More information about each individual protein in these ΔpI subsets can be seen in Appendix A.
[Chart: Densities of Delta % Values in ΔpI < 0.1 and ΔpI > 0.7 Using a Normal Amino Acid Alphabet. X axis: Delta % range (>25, >50, >75, >100, >150, >200, >300, >400).]
Figure 14. Density of Delta % Values of Dipeptides in Two ΔpI Subsets. The ΔpI < 0.1 subset consists of 60 proteins which comprise 22412 total dipeptides. The ΔpI > 0.7 subset consists of 50 proteins which comprise 15531 total dipeptides. More information about each individual protein in these ΔpI subsets can be seen in Appendix A.
Dipeptide Threshold
A similar analysis was performed on the same ΔpI subsets, this time monitoring dipeptides that had a very low frequency (which may change their Delta % value too rapidly; see Discussion for an elaboration). A frequency of occurrence threshold value of 0.1% had to be met for dipeptides. In other words, if a dipeptide occurred too infrequently (under 0.1% of the total number of dipeptides) then it was eliminated. The remaining dipeptides were counted and the Delta % values comparing the ΔpI < 0.1 subset and the 0.3 < ΔpI < 0.7 subset can be seen in Figure 15. Likewise, the comparison for the ΔpI < 0.1 subset and the ΔpI > 0.7 subset can be seen in Figure 16. Dipeptides that were found in the extreme positive or negative ranges of these figures are indicated by the one-letter amino acid codes. For instance, the dipeptide RR (arginine-arginine) in Figure 15 was found much less frequently in the ΔpI < 0.1 dataset than in the 0.3 < ΔpI < 0.7 dataset.
[Chart: Densities of Delta % Values in ΔpI < 0.1 and 0.3 < ΔpI < 0.7 Using a Normal Amino Acid Alphabet (where frequency of dipeptide must be above 0.1). X axis: Delta % range and particular dipeptides, from <-50 to >100.]
Figure 15. Density of Delta % Values of Dipeptides in Two ΔpI Subsets with a Threshold of 0.1%.
[Chart: Densities of Delta % Values in ΔpI < 0.1 and ΔpI > 0.7 Using a Normal Amino Acid Alphabet (where frequency of dipeptide must be above 0.1). X axis: Delta % range and particular dipeptides, from <-50 to >100.]
Figure 16. Density of Delta % Values of Dipeptides in Two ΔpI Subsets with a Threshold of 0.1%.
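The 0.1% frequency-of-occurrence filter used for Figures 15 and 16 can be sketched as follows. This is an illustration under stated assumptions rather than the code used in this study: the function name and the dipeptide counts are hypothetical, and because the text does not say whether the threshold had to be met in one subset or in both, the sketch requires it in both.

```python
from collections import Counter

def passes_threshold(counts_a, counts_b, threshold_percent=0.1):
    """Return the dipeptides whose frequency is at least threshold_percent
    of all dipeptides in BOTH subsets (an assumption, see the lead-in)."""
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    kept = set()
    for dp in counts_a.keys() & counts_b.keys():
        freq_a = 100.0 * counts_a[dp] / total_a
        freq_b = 100.0 * counts_b[dp] / total_b
        if freq_a >= threshold_percent and freq_b >= threshold_percent:
            kept.add(dp)
    return kept

# Hypothetical counts, chosen only so that the totals match the subset sizes
# reported in the caption of Figure 13 (22412 and 17848 dipeptides).
counts_a = Counter({"RR": 30, "WC": 1, "LL": 22381})
counts_b = Counter({"RR": 60, "WC": 9, "LL": 17779})
kept = passes_threshold(counts_a, counts_b)
```

The hypothetical dipeptide WC occurs once in the first subset and nine times in the second, which would produce a very large Delta % from almost no data; the filter removes it while RR and LL are retained.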
Dipeptide using Alphabets
The final step in the analysis was to combine the alphabet and dipeptide approaches. Using the smaller alphabets dramatically reduced and condensed the results as compared to using the normal alphabet, which creates 400 possible dipeptides.
-Charge
Using the Charge alphabet, a comparison of the dipeptide frequencies between the ΔpI < 0.1 subset and the 0.3 < ΔpI < 0.7 subset is shown in Figure 17, as well as the Delta % values for each dipeptide. The same comparison is shown between the ΔpI < 0.1 subset and the ΔpI > 0.7 subset in Figure 18.
[Chart: Comparison of Dipeptides (based on charge characteristic) taken from ΔpI < 0.1 and 0.3 < ΔpI < 0.7. X axis: dipeptide (charge alphabet).]
Figure 17. Frequencies of Charge Alphabet Dipeptides in Two ΔpI Subsets. Shown in blue are the frequencies of each dipeptide in the ΔpI < 0.1 subset and shown in yellow is the difference in frequency for each dipeptide between the ΔpI < 0.1 subset and the 0.3 < ΔpI < 0.7 subset.
[Chart: Comparison of Dipeptides (based on charge characteristic) taken from ΔpI < 0.1 and ΔpI > 0.7. X axis: dipeptide (charge alphabet).]
Figure 18. Frequencies of Charge Alphabet Dipeptides in Two ΔpI Subsets. Shown in blue are the frequencies of each dipeptide in the ΔpI < 0.1 subset and shown in yellow is the difference in frequency for each dipeptide between the ΔpI < 0.1 subset and the ΔpI > 0.7 subset.
-Chemical
Using the Chemical alphabet, a comparison of the dipeptide frequencies between the ΔpI < 0.1 subset and the 0.3 < ΔpI < 0.7 subset is shown in Figure 19, as well as the Delta % values for each dipeptide. The same comparison is shown between the ΔpI < 0.1 subset and the ΔpI > 0.7 subset in Figure 20. The Chemical alphabet with dipeptides was sufficiently large that it was not possible to display all the possible dipeptide combinations in Figures 19 and 20. Instead, only the density values were chosen for display.
[Chart: Densities of Delta % Values in ΔpI < 0.1 and 0.3 < ΔpI < 0.7 Using a Chemical Alphabet.]
[X axis: Delta % range and particular dipeptides, from <-20 to >60. Labeled outlier dipeptides: SS (-28%), AI (-25%), AS (-24%), MS (-22%), IS (-20%), IC (43%), IM (48%), RR (61%); one further label at 48% is illegible.]
Figure 19. Density of Delta % Values of Chemical Alphabet Dipeptides in Two ΔpI Subsets. The ΔpI < 0.1 subset consists of 60 proteins which comprise 22412 total dipeptides. The 0.3 < ΔpI < 0.7 subset consists of 58 proteins which comprise 17848 total dipeptides. More information about each individual protein in these ΔpI subsets can be seen in Appendix A.
[Chart: Densities of Delta % Values in ΔpI < 0.1 and ΔpI > 0.7 Using a Chemical Alphabet. X axis: Delta % range and particular dipeptides, from <-40 to >80.]
Figure 20. Density of Delta % Values of Chemical Alphabet Dipeptides in Two ΔpI Subsets. The ΔpI < 0.1 subset consists of 60 proteins which comprise 22412 total dipeptides. The ΔpI > 0.7 subset consists of 50 proteins which comprise 15531 total dipeptides. More information about each individual protein in these ΔpI subsets can be seen in Appendix A.
-Functional
Using the Functional alphabet, a comparison of the dipeptide frequencies between the ΔpI < 0.1 subset and the 0.3 < ΔpI < 0.7 subset is shown in Figure 21, as well as the Delta % values for each dipeptide. The same comparison is shown between the ΔpI < 0.1 subset and the ΔpI > 0.7 subset in Figure 22.
[Chart: Comparison of dipeptides (based on functional characteristic) taken from ΔpI < 0.1 and 0.3 < ΔpI < 0.7. X axis: dipeptide (functional alphabet).]
Figure 21. Frequencies of Functional Alphabet Dipeptides in Two ΔpI Subsets. Shown in blue are the frequencies of each dipeptide in the ΔpI < 0.1 subset and shown in yellow is the difference in frequency for each dipeptide between the ΔpI < 0.1 subset and the 0.3 < ΔpI < 0.7 subset.
[Chart: Comparison of dipeptides (based on functional characteristic) taken from ΔpI < 0.1 and ΔpI > 0.7. X axis: dipeptide (functional alphabet).]
Figure 22. Frequencies of Functional Alphabet Dipeptides in Two ΔpI Subsets. Shown in blue are the frequencies of each dipeptide in the ΔpI < 0.1 subset and shown in yellow is the difference in frequency for each dipeptide between the ΔpI < 0.1 subset and the ΔpI > 0.7 subset.
-Hydrophobic
Using the Hydrophobic alphabet, a comparison of the dipeptide frequencies between the ΔpI < 0.1 subset and the 0.3 < ΔpI < 0.7 subset is shown in Figure 23, as well as the Delta % values for each dipeptide. The same comparison is shown between the ΔpI < 0.1 subset and the ΔpI > 0.7 subset in Figure 24.
[Chart: Comparison of dipeptides (based on hydrophobic characteristic) taken from ΔpI < 0.1 and 0.3 < ΔpI < 0.7. Legend: % of dipeptide in ΔpI < 0.1; Delta % (ΔpI < 0.1 versus 0.3 < ΔpI < 0.7). X axis: dipeptide (hydrophobicity alphabet).]
Figure 23. Frequencies of Hydrophobic Alphabet Dipeptides in Two ΔpI Subsets. Shown in blue are the frequencies of each dipeptide in the ΔpI < 0.1 subset and shown in yellow is the difference in frequency for each dipeptide between the ΔpI < 0.1 subset and the 0.3 < ΔpI < 0.7 subset.
[Chart: Comparison of dipeptides (based on hydrophobic characteristic) taken from ΔpI < 0.1 and ΔpI > 0.7. Legend: % of dipeptide in ΔpI < 0.1; Delta % (ΔpI < 0.1 versus ΔpI > 0.7). X axis: dipeptide (hydrophobicity alphabet).]
Figure 24. Frequencies of Hydrophobic Alphabet Dipeptides in Two ΔpI Subsets. Shown in blue are the frequencies of each dipeptide in the ΔpI < 0.1 subset and shown in yellow is the difference in frequency for each dipeptide between the ΔpI < 0.1 subset and the ΔpI > 0.7 subset.
Discussion
When exploring the behavior of proteins undergoing isoelectric focusing, there exists a discrepancy between predicted pI values and experimentally determined pI values for a high percentage of those proteins. This comparison of pI values was performed using predictions based on our algorithm (11) or similar algorithms (19) and experimental pI values determined in different laboratory settings (14-18). The size and regular occurrence of these differences justified a close study of the protein sequences in an effort to identify underlying patterns that could contribute to these differences. The question was whether the extracted results contained enough information to predict pI values more accurately. The first key element was having a reliable dataset that was both uniform and robust enough to give meaningful data. A dataset that is too diverse would lead to complications such as the question of how to handle post-translational modifications in predicting pI and MW. Simply finding the frequencies of all dipeptides in all known protein sequences would provide a dataset that is certainly robust enough. Unfortunately, the robustness would be offset by the high level of noise in the data due to the fact that different organisms have different post-translational modifications. A dataset that is too small would not have enough dipeptide information to make sure that the dipeptides that occur in the lowest frequencies are still seen in sufficient abundance to maintain their statistical validity. To overcome both of these hurdles, the search space was limited to proteins in E. coli, since it displays very few post-translational modifications and has a proteome that has been sufficiently documented to do a case study.
In keeping with the theme of having a dataset with as little noise as possible, yet still retaining as much robustness as possible, it was decided that even though well structured 2DE data existed from 5 different groups (14-18), it was probably best to limit the usage of this data to one or two of these groups (17 and 18). Both the Yan et al. (18) and Tonella et al. (19) groups performed large scale 2DE studies on the E. coli proteome. The Tonella (19) group reported that over 70% of the E. coli proteome was covered in their data. Since none of the groups used the same 2DE conditions, it was decided that the data from the Tonella (19) group would be the only data used. The primary justification was to ensure that the experimental pI and MW values were obtained using the same conditions. This in turn would reduce noise as much as possible. In addition, the fact that their data covered over 70% of the E. coli genome held promise for this study.
Once the entire dataset was selected, another decision had to be made about how to separate the data so that clear lines could be seen between proteins that had very small ΔpI values and proteins that had greater ΔpI values. Doing so would make it possible to see if significant sequence differences (at the dipeptide level) between ΔpI subsets existed. It was necessary to break the dataset into a small number of ΔpI subsets. These arbitrary ΔpI cut-off ranges (ΔpI < 0.1; 0.3 < ΔpI < 0.7; ΔpI > 0.7) were chosen in order to separate the data into distinct sets of similar size that could be compared with each other.
There was difficulty in deciding how to separate the entire dataset into these three subsets. One possible approach was to separate the dataset into many smaller subsets based on a larger number of ΔpI ranges. On one hand, doing this might provide [...] relative to adjacent ΔpI ranges. On the other hand, by doing it this way, there is a loss of information at the sequence level due to the smaller number of sequences that would be found in each dataset. This, in turn, would threaten the reliability of our findings. Therefore, the dataset had to be separated into subsets of sufficient robustness. The ΔpI < 0.1 subset consists of 60 proteins which comprise 22472 total amino acids or 22412 total dipeptides. The 0.3 < ΔpI < 0.7 subset consists of 58 proteins which comprise 17906 total amino acids or 17848 total dipeptides. The ΔpI > 0.7 subset consists of 50 proteins which comprise 15581 total amino acids or 15531 total dipeptides. More information about each individual protein in these ΔpI subsets, including ΔpI, a description and SWISS-2DPAGE Accession Number, can be seen in Appendix A.
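The dipeptide totals follow directly from the amino acid totals: a protein of length n contains n - 1 overlapping dipeptides, so each subset has (total amino acids) minus (number of proteins) dipeptides. A quick check of the three reported totals, written as a small sketch with illustrative variable names:

```python
# (number of proteins, total amino acids, reported total dipeptides) per subset
subsets = {
    "ΔpI < 0.1":       (60, 22472, 22412),
    "0.3 < ΔpI < 0.7": (58, 17906, 17848),
    "ΔpI > 0.7":       (50, 15581, 15531),
}

def expected_dipeptides(n_proteins, n_residues):
    """Each protein of length n contributes n - 1 overlapping dipeptides."""
    return n_residues - n_proteins

# True for a subset when the reported dipeptide total matches the expectation.
checks = {
    name: expected_dipeptides(n_proteins, n_residues) == reported
    for name, (n_proteins, n_residues, reported) in subsets.items()
}
```

All three reported dipeptide totals are consistent with this relationship.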
The analytical process is best viewed as a pipeline, as seen in Figure 2 in the Methods section. We began our analysis with the simplest method (naive approach), worked our way to more complicated methods (alphabets approach), and ended with the most complicated methods (dipeptides using alphabets approach). Along this path, the relevance of the data also becomes more complicated, but more interesting at the same time (with a few exceptions).
The naive approach to handling the dataset did not provide any meaningful results. It was quickly apparent that individual amino acid frequencies in a given set of protein sequences did not vary among the three data subsets. In the end, no amino acid frequency characteristics using simply the naive approach were found to be significantly different between the three ΔpI subsets. This can be seen in Figures 3 and 4 when comparing the ΔpI < 0.1 subset with the 0.3 < ΔpI < 0.7 subset and the ΔpI < 0.1 subset [...] yellow frequencies can be seen for any individual amino acid; the values are also nearly identical when Figure 3 and Figure 4 are compared. The lack of a correlation between ΔpI values and the frequency of these individual amino acids showed us that we needed to consider the problem in more depth, that is, more than one amino acid at a time.
To simplify the analysis, the number of variables was reduced by using the four alphabets described in Table 2 at the next stage in the pipeline. Again, the results did not reveal any significant trends that could affect the way that pI is calculated. Figures 5 and 6 (Charge alphabet comparisons), Figures 7 and 8 (Chemical alphabet comparisons), Figures 9 and 10 (Functional alphabet comparisons), and Figures 11 and 12 (Hydrophobic alphabet comparisons) show results very similar to those of the naive approach in Figures 3 and 4. There is no trend of increasing or decreasing frequency with ΔpI for any particular amino acid group when moving between the three datasets.
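Recoding a sequence into a reduced alphabet before counting frequencies can be sketched as follows. The grouping below is a hypothetical stand-in for the Charge alphabet (A for negatively charged, C for positively charged, N for neutral residues); the actual group memberships are those defined in Table 2, and the function names are illustrative.

```python
from collections import Counter

# Hypothetical Charge alphabet; the authoritative grouping is in Table 2.
CHARGE_ALPHABET = {
    "D": "A", "E": "A",            # negatively charged side chains
    "K": "C", "R": "C", "H": "C",  # positively charged (including H is an assumption)
}

def recode(sequence, mapping, default="N"):
    """Translate a 20-letter sequence into a reduced alphabet.

    Residues not listed in the mapping fall into the default (neutral) group.
    """
    return "".join(mapping.get(residue, default) for residue in sequence)

def letter_frequencies(sequences):
    """Percent occurrence of each letter across a list of sequences."""
    counts = Counter("".join(sequences))
    total = sum(counts.values())
    return {letter: 100.0 * n / total for letter, n in counts.items()}

recoded = recode("MDEKRAL", CHARGE_ALPHABET)  # "NAACCNN"
freqs = letter_frequencies([recoded])
```

The same `recode` step can be applied before counting dipeptides, which is what reduces the 400 possible dipeptides of the normal alphabet to 9 for a three-letter alphabet.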
It was expected that more meaningful results would be obtained by analysis of the dipeptide frequencies. All previous pI prediction algorithms (2-8), including ours (11), treat the pKa for each amino acid independently, regardless of its near or distant neighbors. At this point it is instructive to consider the experimental conditions normally employed for isoelectric focusing (IEF). The biological function of proteins requires that they maintain their three dimensional structure intact. However, for IEF, we are interested only in separating the proteins, not observing their biological function. To assure the best separation, reagents such as urea and detergents are added prior to IEF to disrupt any secondary, tertiary or quaternary aspects of protein structure. In these fully denatured proteins, the only significant interactions are expected to occur between amino [...] consideration of the effect of neighboring amino acids on their respective side chain pKa values may prove valuable.
With respect to each alphabet that was used, the discussion will advance from the least significant alphabet dipeptide results to the most significant alphabet dipeptide results. However, the analysis using the normal amino acid alphabet will be discussed first. At first glance Figures 13 and 14 show some very promising results. The Delta % value represents the change in frequency from one ΔpI subset to the ΔpI subset it is compared with. Therefore, Delta % values that are in the 300 and 400 ranges would seem very significant. The problem was that most of the dipeptides that fell into these extreme ranges were dipeptides whose overall frequency was vanishingly small. A dipeptide that occurs only once in one ΔpI subset and multiple times in another ΔpI subset is going to have a very high Delta % value. It would not be wise to rely on such dipeptide frequencies to redesign a pI prediction algorithm. To negotiate through all of the 400 dipeptides in the normal alphabet, the same analysis was run with a threshold frequency of occurrence for dipeptides of 0.1%. In other words, if a dipeptide did not occur at least 0.1% of the time (roughly 22 times in the ΔpI < 0.1 dataset, which contained 22412 dipeptides) it was not used for analysis. The results of this can be seen in Figures 15 and 16. There still exist extreme outliers that have Delta % values in the 100 range, which will later be reanalyzed by comparison with some of the alphabet dipeptide analyses.
The alphabet that showed the least interesting results when using a dipeptide approach was the Hydrophobic alphabet. Comparisons of the ΔpI subsets using the [...] bars and is negligible in each of the 4 dipeptides (no more than a Delta % value of 1.85 was seen in any of the 4 dipeptides).
The Charge alphabet showed slightly more significant results for dipeptide analysis. Delta % values reached into the 30+ range for some dipeptides. The AA dipeptide (negatively charged amino acid followed by negatively charged amino acid; see Table 2 for definitions of all the alphabet codes) had a Delta % of -31.3% going from the ΔpI < 0.1 subset to the ΔpI > 0.7 subset (Figure 18). The Delta % for the AA dipeptide is also large (-18.9%) in the other comparison of the ΔpI < 0.1 subset and the 0.3 < ΔpI < 0.7 subset (Figure 17). However, the frequency of occurrence of this AA dipeptide (as shown in the blue bars) is very low in all three ΔpI subsets. What we would like to see is a large Delta % value accompanied by a large frequency of occurrence for a particular dipeptide. This was not apparent in any of the dipeptides using the Charge alphabet.
Staying with the theme that the most significant results will combine a large Delta % value with a large frequency of occurrence value for dipeptides, the Functional alphabet is considered next. Figures 21 and 22, representing the analysis using the Functional alphabet, show a collection of dipeptides that have both significantly large Delta % values and significantly large frequencies of occurrence: AA, AH, HA, HP, CP, PH, PP.
It was important to refer back to the analysis that was done using dipeptides based on the complete amino acid alphabet. Figures 15 and 16 point out a few extreme dipeptide outliers: KY, YS (Figure 15) and EE, NN, YT (Figure 16). Converting these [...] respectively. These three different dipeptides all map back to extreme outliers from the analysis done using the Functional alphabet (Figures 21 and 22).
Using the Chemical alphabet with dipeptides created data that was sufficiently large that it was not possible to display all the possible dipeptide combinations in Figures 19 and 20. Instead, only the density values were chosen for display. Particular outlier dipeptides are labeled on the top of each column with their respective Delta % values.
The significance of these findings is that they may lead to a more accurate calculation of pI than currently existing methods (11, 19). These data clearly support the idea that the pKa value for an amino acid side chain, even when the protein is fully denatured, depends on the microenvironment created by the nearest neighbors of that amino acid. Using the extreme outlier dipeptides that have been identified from this study of 180 annotated E. coli proteins, it may be possible to adjust the algorithms for calculating pI values. Our algorithm for calculating pI from amino acid sequence (11) could be modified to include the effects of adjacent amino acids on the pKa values used in the calculations. This will be an empirical process whereby the pKa values used in the algorithm will be modified fractionally to see which changes lead to a better correlation between actual and predicted pI values for the two outlier datasets (0.3 < ΔpI < 0.7; and ΔpI > 0.7).
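One way such a neighbor-aware adjustment could be wired into a pI calculation is sketched below. This is not the algorithm of reference (11): the pKa values are approximate, commonly quoted values chosen only for illustration, the `shifts` table of dipeptide-specific pKa offsets is hypothetical, and the net charge is computed with the Henderson-Hasselbalch relationship and solved for zero by bisection.

```python
# Illustrative pKa values only; not those used in reference (11).
PKA_POSITIVE = {"K": 10.5, "R": 12.5, "H": 6.0, "Nterm": 9.0}
PKA_NEGATIVE = {"D": 3.9, "E": 4.1, "C": 8.3, "Y": 10.1, "Cterm": 3.0}

def net_charge(sequence, ph, shifts=None):
    """Net charge of a fully denatured chain at a given pH.

    shifts maps (preceding_residue, residue) to a pKa offset applied to the
    second residue; it is a hypothetical hook for nearest-neighbor effects.
    """
    shifts = shifts or {}
    # Terminal groups: one protonatable amine and one carboxylate per chain.
    charge = 1.0 / (1.0 + 10 ** (ph - PKA_POSITIVE["Nterm"]))
    charge -= 1.0 / (1.0 + 10 ** (PKA_NEGATIVE["Cterm"] - ph))
    for i, residue in enumerate(sequence):
        previous = sequence[i - 1] if i > 0 else None
        offset = shifts.get((previous, residue), 0.0)
        if residue in PKA_POSITIVE:
            pka = PKA_POSITIVE[residue] + offset
            charge += 1.0 / (1.0 + 10 ** (ph - pka))
        elif residue in PKA_NEGATIVE:
            pka = PKA_NEGATIVE[residue] + offset
            charge -= 1.0 / (1.0 + 10 ** (pka - ph))
    return charge

def isoelectric_point(sequence, shifts=None, tolerance=1e-4):
    """Find the pH at which the net charge is zero by bisection on [0, 14]."""
    low, high = 0.0, 14.0
    while high - low > tolerance:
        mid = (low + high) / 2.0
        if net_charge(sequence, mid, shifts) > 0:
            low = mid   # still positively charged: pI lies at a higher pH
        else:
            high = mid
    return (low + high) / 2.0

baseline = isoelectric_point("MKRRDEL")
# Hypothetical offset: lower the pKa of an arginine that follows an arginine.
adjusted = isoelectric_point("MKRRDEL", shifts={("R", "R"): -0.5})
```

In the empirical process described above, the entries of such a table would be varied fractionally and kept only when they improve the agreement between predicted and experimental pI values.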
If the improvement in the accuracy of the pI calculation proves to be worthwhile, there are many future advancements that could be made. The first could be to build a larger dataset to work with and rerun the analysis to compare to the data shown here. Beyond the scope of the E. coli proteome, further data are available from the microbial proteomes on the ExPASy Server. Another step would be to port the analysis over to lower eukaryotic proteomes that contain many more post-translational modifications. A lot would have to be done in terms of predicting or categorizing these post-translational modifications, but doing so may lead to an even more powerful approach to better predicting pI in [...]
Conclusions
A dataset of E. coli proteins was collected and formatted to study the discrepancy that exists between experimental isoelectric point and predicted isoelectric point (ΔpI). This dataset was then split into three parts depending on the magnitude of ΔpI for each protein. Several multi-layered, sequential approaches were taken in reformatting the protein sequence data in an attempt to get a better understanding of what might be causing the varying ΔpI. Each of these stages represented a different part of a pipeline where the data were analyzed by comparing each of the three ΔpI subsets to one another. The pipeline consisted of a naive approach (considering individual amino acid frequencies), followed by the application of four different alphabets to represent sequences in a simpler way by grouping similar amino acids based on their charge, functional, chemical, and hydrophobic properties. The final step in the pipeline involved investigating the dipeptides of all of these sequences using both the 20 amino acid alphabet and the simplified groupings. The alphabet dipeptide approach yielded the most meaningful results, showing that certain dipeptide sequences occur at greatly different frequencies between proteins in the different ΔpI subsets.
Future studies will attempt to show that the results of these dipeptide findings can be used to better predict pI. This will involve modification of our existing pI prediction algorithm to include the effect of adjacent amino acids on side chain pKa values. Using a short list of only the most extreme cases, where a dipeptide differed greatly in frequency from one ΔpI subset to the next, should result in a pI prediction value that is more accurate. Once the pI prediction is improved, the next step would be to [...] them. In addition, similar analyses will be extended to other prokaryotic organisms, and