Isoelectric point prediction from the amino acid sequence of a protein


Rochester Institute of Technology

RIT Scholar Works

Theses

Summer 2005

Isoelectric point prediction from the amino acid sequence of a protein

Matthew Conte

Follow this and additional works at: https://scholarworks.rit.edu/theses

Recommended Citation

Conte, Matthew, "Isoelectric point prediction from the amino acid sequence of a protein" (2005). Thesis. Rochester Institute of Technology. Accessed from

This Thesis is brought to you for free and open access by RIT Scholar Works. It has been accepted for inclusion in Theses by an authorized administrator of RIT Scholar Works. For more information, please contact

THESIS

ISOELECTRIC POINT PREDICTION FROM THE AMINO ACID SEQUENCE OF A PROTEIN

Submitted by

Matthew Conte

Department of Biological Sciences

In partial fulfillment of the requirements for the Master of Science degree in Bioinformatics at Rochester Institute of Technology

To: Head, Department of Biological Sciences

Rochester Institute of Technology
Department of Biological Sciences
Bioinformatics Program

The undersigned state that Matthew Conte (Student Name), ______ (Student Number), a candidate for the Master of Science degree in Bioinformatics, has submitted his/her thesis and has satisfactorily defended it.

This completes the requirements for the Master of Science degree in Bioinformatics at Rochester Institute of Technology.

Thesis committee members:

Gary R. Skuse (Committee Chair)
Paul A. Craig (Thesis Advisor)
Name Illegible
Douglas P. Merrill

Thesis/Dissertation Author Permission Statement

Title of thesis or dissertation: ______

Name of author: Matthew Conte
Degree: MS
Program: Bioinformatics
College: Science

I understand that I must submit a print copy of my thesis or dissertation to the RIT Archives, per current RIT guidelines for the completion of my degree. I hereby grant to the Rochester Institute of Technology and its agents the non-exclusive license to archive and make accessible my thesis or dissertation in whole or in part in all forms of media in perpetuity. I retain all other ownership rights to the copyright of the thesis or dissertation. I also retain the right to use in future works (such as articles or books) all or part of this thesis or dissertation.

Print Reproduction Permission Granted:

I, Matthew Conte, hereby grant permission to the Rochester Institute of Technology to reproduce my print thesis or dissertation in whole or in part. Any reproduction will not be for commercial use or profit.

Signature of Author: Matthew Conte    Date: ______

Print Reproduction Permission Denied:

I, ______, hereby deny permission to the RIT Library of the Rochester Institute of Technology to reproduce my print thesis or dissertation in whole or in part.

Signature of Author: ______    Date: ______

Inclusion in the RIT Digital Media Library Electronic Thesis & Dissertation (ETD) Archive

I, ______, additionally grant to the Rochester Institute of Technology Digital Media Library (RIT DML) the non-exclusive license to archive and provide electronic access to my thesis or dissertation in whole or in part in all forms of media in perpetuity.

I understand that my work, in addition to its bibliographic record and abstract, will be available to the world-wide community of scholars and researchers through the RIT DML. I retain all other ownership rights to the copyright of the thesis or dissertation. I also retain the right to use in future works (such as articles or books) all or part of this thesis or dissertation. I am aware that the Rochester Institute of Technology does not require registration of copyright for ETDs.

I hereby certify that, if appropriate, I have obtained and attached written permission statements from the owners of each third party copyrighted matter to be included in my thesis or dissertation. I certify that the version I submitted is the same as that approved by my committee.

Abstract

Proteins often do not migrate as expected in two-dimensional electrophoresis based on their primary sequence. The predicted isoelectric point (pI) frequently does not coincide with experimental pI values obtained in the laboratory. The reasons for these differences led to this study. Initially, 2DE data from the E. coli proteome were collected and formatted. This data set was split into three parts, each consisting of different levels of pI discrepancy (ΔpI). The protein sequence data for each ΔpI subset were run through a pipeline. At each stage of the pipeline the data were analyzed by comparing each of the three ΔpI subsets to one another. The pipeline consisted of a naive approach (considering individual amino acid frequencies), followed by the application of four different alphabets to represent sequences in a simpler way by grouping similar amino acids based on their charge, functional, chemical, and hydrophobic properties. The final step in the pipeline involved investigating the dipeptides of all of these sequences using both the 20 amino acid alphabet and the simplified groupings. An evaluation of the alphabet dipeptide analysis demonstrated the existence of certain dipeptide sequences which correlate well

Table of Contents

1 Introduction
2 Methods
2.1 Forming the data set
2.2 Experimental and predicted pI values
2.3 Extracting useful information from collected subset sequences
2.3.1 Amino acid frequency analysis (naive approach)
2.3.2 Frequency of amino acids (alphabets approach)
2.3.3 Frequency of amino acids (dipeptide approach)
2.3.4 Pipeline workflow
3 Results
3.1 Naive approach
3.2 Alphabets approach
3.2.1 Charge
3.2.2 Chemical
3.2.3 Functional
3.2.4 Hydrophobic
3.3 Dipeptide approach
3.4 Dipeptide threshold
3.5 Dipeptide using alphabets
3.5.1 Charge
3.5.2 Chemical
3.5.3 Functional
3.5.4 Hydrophobic
4 Discussion
5 Conclusions
6 References

Introduction

Two-dimensional gel electrophoresis (2DE) has been an important laboratory technique for the field of proteomics for over two decades. 2DE allows the researcher to separate and identify thousands of proteins from a cellular extract in a single experiment. 2DE is difficult and time consuming, as it is necessary to determine ideal initial conditions, wait for results, and possibly change conditions after that (1). In addition, reproducibility of gels and comparison of 2DE results between separate groups has proved difficult (1). In 2DE, proteins are separated in the first dimension by their isoelectric points (the pH at which the net charge of the protein is zero) and in the second dimension by their molecular weights. The accurate prediction of protein isoelectric point (pI) and molecular weight (MW) using simply the amino acid sequence of the protein would be extremely valuable to researchers who use two-dimensional gel electrophoresis.

Computational procedures for calculating and predicting the pI from the amino acid composition of a protein based on the dissociation constants of the charged groups within the protein have been developed (2-8). The accuracy of these algorithms is limited by the certainty of the values for the dissociation constants and by microenvironmental effects such as charge-charge interactions and post-translational modifications.

To systematically explore the relationship between pI, molecular weight and protein sequence, a data set of proteins was collected and organized from a model

post-phosphorylation which can alter the pI/MW; the presence of these modifications makes pI/MW predictions much more difficult since the modifications in the proteins may cause them to migrate to a position on a 2-D gel that is quite different than what is predicted based solely on the amino acid sequence of the protein. E. coli is also one of the best characterized prokaryotes and much more data beyond simply the protein sequence for each protein is widely available for it.

At this point it is necessary to consider the basic structural features of proteins and the role of individual amino acids in the structure and function of proteins. Figure 1 below shows the structure of the 20 amino acids with side chain structures shown in red (10).

The charge on all proteins arises from some of the amino acid side chains, as well as the carboxy- and amino-termini, some prosthetic groups, and bound ions. Our pI prediction tool (11) is designed to calculate charge based on the side chains and carboxy- and amino-termini. The charge on amino acid side chains depends on the pH of the solution and the pKa of the side chains. It is also affected by the localized environment around a side chain. Our current calculation model uses the following pKa values for ionizable groups on the protein and does not make any adjustments to the pKa values of the side chains regardless of their environment within the protein (Table 1). We also assume that the separation is based on the total charge on the protein, not the

[Figure 1: structural diagrams of Arginine (Arg/R), Glutamine (Gln/Q), Phenylalanine (Phe/F), Tyrosine (Tyr/Y), Tryptophan (Trp/W), Glycine (Gly/G), Alanine (Ala/A), Histidine (His/H), Serine (Ser/S), Lysine (Lys/K), Glutamic Acid (Glu/E), Aspartic Acid (Asp/D), Threonine (Thr/T), Cysteine (Cys/C), Proline (Pro/P), Methionine (Met/M), Leucine (Leu/L), Asparagine (Asn/N), Isoleucine (Ile/I), and Valine (Val/V).]

Figure 1. Structures of amino acids with side chains shown in red, carboxylate groups in green, and amino groups in blue (10).

The charge on the protein is the sum of the charges on the individual amino acid side chains. However, the charge on individual amino acid side chains can vary when

pKa for glutamic acid is about 4.1. In lysozyme, two glutamic acid residues are in the active site. One is in a polar environment and has a normal pKa value. The other glutamate side chain is in a hydrophobic environment, where a negative charge is energetically unfavorable. Therefore the pKa value for this glutamate side chain increases, which then decreases the extent of the deprotonation of that side chain. This is very important in the mechanism of lysozyme activity, which requires that one of the side chains be charged (deprotonated) and the other be uncharged (protonated) at the same time.

In a second example, the serine in the active sites of serine proteases has a much different acid-base behavior than other serines normally found in proteins (9). The normal pKa value for the hydroxyl group on the serine side chain is greater than 15, meaning that this group is not found in an ionized state in most proteins. In serine proteases, the interaction of the active site serine with nearby histidine and aspartate side chains (the so-called catalytic triad) leads to the ionization of the serine hydroxyl group. Meanwhile, the pKa value is reduced from about 15 to a value closer to 7 or 8. This example makes it clear that the microenvironment of an individual amino acid side chain can change its ionization behavior.

Other effects on the pKa of an amino acid side chain can be seen when certain amino acids are positioned next to each other. For example, a typical arginine residue, which is basic, will have a pKa of about 12.5 (Table 1 below) and carry a full +1 charge in the physiological pH range. However, when two of these basic arginine residues are adjacent in a protein sequence the pKa values will decrease, due to repulsion between the

arginine side chains to become less ionized and carry only a fractional positive charge. Table 1 below lists the typical pKa values for ionizable groups in proteins (9).

Group                           Typical pKa
Terminal α-carboxyl group       3.1
Aspartic acid, Glutamic acid    4.1
Histidine                       6.0
Terminal α-amino group          8.0
Cysteine                        8.3
Tyrosine                        10.9
Lysine                          10.8
Arginine                        12.5

Table 1. These are pKa values that are commonly found for these side chains when they are part of a protein. The pKa values for these side chains may be quite different for the free amino acid in solution. pKa values also depend on temperature, ionic strength, and the microenvironment of the ionizable group (9).
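The kind of calculation described above (summing the charges of the ionizable groups at a given pH and finding the pH where the total is zero) can be sketched in Python. This is a minimal illustration and not the authors' prediction tool: the pKa values are copied from Table 1, while the Henderson-Hasselbalch charge terms, the bisection search, and all function names are assumptions made for the example.

```python
# Illustrative sketch only; names and search method are assumptions.
# pKa values are the "typical" values listed in Table 1.
PKA_POSITIVE = {"H": 6.0, "K": 10.8, "R": 12.5}             # carry +1 when protonated
PKA_NEGATIVE = {"D": 4.1, "E": 4.1, "C": 8.3, "Y": 10.9}    # carry -1 when deprotonated
PKA_N_TERMINUS = 8.0
PKA_C_TERMINUS = 3.1


def net_charge(sequence, ph):
    """Sum the fractional charges of the termini and ionizable side chains."""
    charge = 1.0 / (1.0 + 10 ** (ph - PKA_N_TERMINUS))       # amino terminus
    charge -= 1.0 / (1.0 + 10 ** (PKA_C_TERMINUS - ph))      # carboxy terminus
    for residue in sequence.upper():
        if residue in PKA_POSITIVE:
            charge += 1.0 / (1.0 + 10 ** (ph - PKA_POSITIVE[residue]))
        elif residue in PKA_NEGATIVE:
            charge -= 1.0 / (1.0 + 10 ** (PKA_NEGATIVE[residue] - ph))
    return charge


def predict_pi(sequence, low=0.0, high=14.0, tolerance=1e-4):
    """Bisect on pH until the net charge crosses zero."""
    while high - low > tolerance:
        mid = (low + high) / 2.0
        if net_charge(sequence, mid) > 0:
            low = mid       # still positive, so the pI lies at a higher pH
        else:
            high = mid
    return (low + high) / 2.0


if __name__ == "__main__":
    print(round(predict_pi("ACDEFGHIKLMNPQRSTVWY"), 2))
```

Because every pKa is fixed, a sketch like this cannot capture the microenvironment effects discussed above, which is one reason predicted and experimental pI values can disagree.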

As we began to consider the impact of amino acid sequence on ionization behavior of individual amino acid side chains, the need to create groups of amino acids based on their chemical and physical characteristics rather than concentrating on each individual amino acid became apparent. We elected to divide the amino acids into groups based on their chemical, functional, charge, and hydrophobic characteristics. Dividing sets of amino acids into these groups enables us to use smaller alphabets based on these characteristics as opposed to simply using the normal 20 letter amino acid alphabet in our calculations. We used these property groups to rewrite a protein sequence into an alternative alphabet that is much smaller than the normal amino acid alphabet of 20 characters (12).

on which amino acids fall under what particular types. The Methods section contains examples of protein sequences that have been translated into these different alphabets.

Alphabet Type (size)   Code   Meaning       Amino Acids with that Code
Charge (3)             A      Negative      D, E
                       C      Positive      H, K, R
                       N      No charge     A, C, F, G, I, L, M, N, P, Q, S, T, V, W, Y
Chemical (8)           A      Acidic        D, E
                       L      Aliphatic     A, G, I, L, V
                       M      Amide         N, Q
                       R      Aromatic      F, W, Y
                       C      Basic         R, H, K
                       H      Hydroxyl      S, T
                       I      Imino         P
                       S      Sulphur       C, M
Functional (4)         A      Acidic        D, E
                       C      Basic         H, K, R
                       H      Hydrophobic   A, F, I, L, M, P, V, W
                       P      Polar         C, G, N, Q, S, T, Y
Hydrophobic (2)        I      Hydrophobic   A, F, I, L, M, P, V, W
                       O      Hydrophilic   C, D, E, G, H, K, N, Q, R, S, T, Y

Table 2. Description of four abbreviated amino acid sequence alphabets: Charge, Chemical, Functional, and Hydrophobic (12). Shown are the new alphabet codes used for each different alphabet, what each code represents in terms of properties of amino acids, and the specific amino acids that are included in each property.

Proteins that have a significant difference between their predicted pI/MW (obtained using similar algorithms as mentioned above) and their experimental pI/MW will be studied. As mentioned before, certain amino acids that occur in a particular

of certain proteins (those with large ΔpI values) that do not occur in the other proteins whose pI values were accurately predicted are important. They may lead to a more accurate prediction of the pI and MW of all proteins from their amino acid compositions.

Methods

Forming the data set

The ExPASy Server's SWISS-2DPAGE database (13) provides extensive 2-D gel information for human, mouse, Arabidopsis thaliana, Dictyostelium discoideum, E. coli, Saccharomyces cerevisiae, and Staphylococcus aureus (N315), which are also cross-referenced in Swiss-Prot. Each protein in the database is collected and annotated from experimental 2-D gels read from reference maps. The database for this project contains 336 proteins of the E. coli proteome characterized by five different research groups (14-18). It was decided that the compilation of pI/MW sets for these proteins should be separated according to each research group since experimental conditions varied among them. The proteins contributed by the Phillips et al. (14), Pasquali et al. (15), and Vanbogelen et al. (16) groups were ignored because these proteins were also characterized by the Tonella et al. (17) and Yan et al. (18) groups. Two sets were created: the first contains 228 of the proteins denoted by Tonella et al. and the second contains 153 of the proteins denoted by Yan et al. The first set was also separated based on the pH range used for isoelectric focusing (pH 4-5, 4.5-5.5, 5-6, 5.5-6.7, 6-9, and 6-11). We concentrated on the Tonella et al. set because it covered more than 70% of the E. coli

We then matched the pI/MW data for each protein with its FASTA sequence. This allows us to compare experimental pI/MW values with predicted pI/MW values. ExPASy provides its own tool for predicting pI/MW which requires a list of Swiss-Prot protein IDs as its input of proteins (19). We have also developed our own tool that includes a pI/MW prediction which requires input of FASTA format sequences, Genbank format, or Protein Data Bank format (11). Both of these prediction tools are based (and especially pI for both tools) on a calculation using pKa values of amino acids as described earlier in the introduction and by Bjellqvist et al. (19). The first step was to retrieve the 2-D gel information for all of these proteins. ExPASy provides a way to get the data from each 2-D gel in a tab delimited format that includes each spot (one protein can have multiple spots on a gel). Having this data in a tab delimited format gave a far greater ease of use when later performing any type of analysis on the data (such as comparing experimental pI to predicted pI). The fields contained in these files included: gene name, protein description, SWISS-2DPAGE Serial Number, SWISS-2DPAGE Accession Number, identification method (gel matching, microsequencing, or peptide mass fingerprinting), experimental pI, experimental MW, and references.

A list of Swiss-Prot protein IDs (2DPAGE Accession Number, e.g. P00274) was then made for each of the gels. This list of proteins was then used to retrieve a FASTA file of the proteins from each gel (some proteins were repeated for multiple spots). The Swiss-Prot IDs were submitted to the NCBI tool for retrieving sequences at http://www.ncbi.nlm.nih.gov/entrez/batchentrez.cgi?db=Protein. The sequences were downloaded in FASTA format to be used in our prediction tool. Batch retrieval at NCBI

whatever reason, the initial methionine residue when retrieving in FASTA format. The FASTA file for the set of proteins from each gel was then fed into our tool where the output can be conveniently recorded to a Microsoft Excel file. However, problems occurred when using the FASTA file from NCBI in our tool since it would order the file based on Genbank accession number and not by Swiss-Prot ID, which was needed to match the tab delimited file for each gel. This was solved by removing the Genbank accession number (leaving just the Swiss-Prot ID) from each protein entry in each respective FASTA file using a simple Perl script. This was facilitated by a few regular expressions, most notably: ":%s/gi|\d*|sp|//" (quotations excluded). The pI/MW predict tool at ExPASy (19) was not quite as easy to use since it does not output into a format that can be imported into Excel readily. The output file was edited using the following regular expression: ":%s/\s\s*/\t/g" (quotations excluded), which transformed it into a tab delimited text file, allowing it to be easily manipulated in Excel. Nevertheless the "Compute pI/MW tool" at ExPASy (19) gave strikingly similar results to our tool.
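For readers who want to reproduce the two clean-up steps outside an editor, the following Python sketch shows roughly equivalent substitutions. It is an illustration under assumptions: the function names are hypothetical, the header pattern is anchored to FASTA header lines (the original substitution was not), and the sample header is invented apart from the accession P00274 mentioned above.

```python
import re

# Matches the leading "gi|<digits>|sp|" part of an NCBI-style FASTA header line.
GENBANK_PREFIX = re.compile(r"^>gi\|\d*\|sp\|")


def strip_genbank_prefix(fasta_text):
    """Drop the GenBank identifier so each header starts with the Swiss-Prot ID."""
    return "\n".join(GENBANK_PREFIX.sub(">", line) for line in fasta_text.splitlines())


def whitespace_to_tabs(text):
    """Collapse each run of spaces or tabs within a line into a single tab."""
    return "\n".join(re.sub(r"[ \t]+", "\t", line) for line in text.splitlines())


if __name__ == "__main__":
    # Hypothetical header: only the accession P00274 comes from the text.
    sample = ">gi|0000000|sp|P00274|EXAMPLE illustrative header\nMSDK"
    print(strip_genbank_prefix(sample))
    print(whitespace_to_tabs("P00274   5.10    11675"))
```

Sequence lines are left untouched because the header pattern only matches lines that begin with ">gi|".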

Both experimental data sets derived from the Tonella data (17) and the Yan (DIGE) data (18) were compared with both pI/MW prediction tools and the results can be seen in the Excel files at http://www.rit.edu/~mac3948/E2D/Ecoli/.

Experimental and predicted pI values

Looking at the compiled data set it was noticeable that some predicted pI values were far different from experimental pI values. Some proteins differed in predicted pI versus experimental pI by as much as 1.86 pH units (e.g. P06128, Phosphate-binding

pI was exactly the same as the experimental pI (e.g. P06960, Ornithine carbamoyltransferase chain F (OTCase-2), see Appendix A).

To better characterize these discrepancies across all of the proteins a simple calculation was performed:

$pI_{\text{experimental}} - pI_{\text{predicted}} = \Delta pI$    (Eq. 1)

The difference between experimental pI and predicted pI will be referred to as ΔpI in this paper. The main focus of this project is to identify potential causes of varying ΔpI values.

The data set was then broken down into roughly thirds. The first subset of proteins consisted of 60 proteins where the ΔpI value was less than 0.1. Another subset held 58 proteins of ΔpI values greater than 0.3, but less than 0.7 (0.3 < ΔpI < 0.7). The last third was put into a subset of 50 proteins where the ΔpI value was greater than 0.7. Refer to the tables in Appendix A for a list of the proteins in each ΔpI subset.
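A short Python sketch of Eq. 1 and the subset assignment may make the thresholds easier to follow. The function names and subset labels are hypothetical, and one point is an explicit assumption: the text does not say whether the thresholds apply to the signed ΔpI or to its magnitude, so the sketch bins on the absolute value. Proteins that fall between 0.1 and 0.3 are left unassigned, matching the ranges listed above.

```python
def delta_pi(experimental_pi, predicted_pi):
    """Eq. 1: experimental pI minus predicted pI."""
    return experimental_pi - predicted_pi


def assign_subset(delta):
    """Return a subset label for a delta-pI value, or None if it falls in a gap.

    Assumption: thresholds are applied to the magnitude of delta-pI.
    """
    magnitude = abs(delta)
    if magnitude < 0.1:
        return "low"        # delta-pI < 0.1
    if 0.3 < magnitude < 0.7:
        return "medium"     # 0.3 < delta-pI < 0.7
    if magnitude > 0.7:
        return "high"       # delta-pI > 0.7
    return None


if __name__ == "__main__":
    # Hypothetical pI pairs, used only to exercise the thresholds.
    for experimental, predicted in [(5.0, 5.05), (6.2, 5.7), (8.0, 6.0), (5.0, 5.2)]:
        d = delta_pi(experimental, predicted)
        print(experimental, predicted, round(d, 2), assign_subset(d))
```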

The following sections will provide the sequential steps that were performed on the analysis of these data subsets. It starts with a naive approach to handling the data that deals with simply calculating raw frequencies of the 20 amino acids. The next section explains how we used the four different alphabets to analyze the data subsets, still focusing on individual amino acid frequencies. The dipeptide approaches are described next, followed by a final section that summarizes how the whole process flows together.

Extracting useful information from collected subset sequences

Amino acid frequency analysis (the naive approach)

There is a naive approach to finding a significant difference between each of the subsets of ΔpI ranges. This method involves determining the counts of each amino acid

acid between the ΔpI subsets. If a significant difference for any amino acid does exist between any of the ΔpI subsets, then this would be of great interest. It would then be possible to adjust a pI prediction algorithm based on individual amino acid frequency values and predict pI values that were closer to experimental values.

The first step in going about the naive approach was to start from the list of proteins for each ΔpI subset. As previously described, the batch sequence retrieval at the NCBI was used to obtain a FASTA file that contained each sequence included in each ΔpI subset. A Perl program was then written to count the number of amino acids in each sequence from a FASTA file and calculate the frequency of each, outputting a tab delimited file displaying all of the frequencies for each sequence. The code of this program can be found in Appendix B (aacounts.pl).
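The counting step can be illustrated with a short Python sketch. This is not the aacounts.pl program from Appendix B; the FASTA reader, the function names, and the sample records are assumptions made for the example.

```python
from collections import Counter


def read_fasta(text):
    """Parse FASTA text into a dict mapping each header to its sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            records[header] += line.upper()
    return records


def amino_acid_frequencies(sequence):
    """Return the relative frequency of each residue present in the sequence."""
    counts = Counter(sequence)
    total = sum(counts.values())
    return {residue: counts[residue] / total for residue in sorted(counts)}


if __name__ == "__main__":
    # Hypothetical two-record FASTA input.
    fasta = ">protein_one\nAACD\n>protein_two\nKKRA\n"
    for name, seq in read_fasta(fasta).items():
        row = "\t".join(f"{aa}={freq:.3f}" for aa, freq in amino_acid_frequencies(seq).items())
        print(f"{name}\t{row}")
```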

Another Perl program was written which concatenates each separate sequence into one long sequence. This allows one to look at the amino acid frequencies encompassing each ΔpI subset as a whole instead of protein by protein. The program also makes sure that each protein sequence is kept separate and that the header line of each sequence is removed (see Appendix B, makeComposite.pl), which will be shown to be important shortly when looking at two amino acids that occur one right after the other (see dipeptide approach).

Frequency of amino acids (alphabets approach)

Charge alphabet

A more sophisticated analysis of amino acid frequency can be done if the amino

side chains of the amino acids can be used to assign them to four abbreviated amino acid alphabets (Charge, Chemical, Functional, and Hydrophobic). The Charge alphabet (see Table 2) is based on whether the side chain of an amino acid can have a positive or negative charge, or is simply uncharged (neutral). Glutamic Acid (Glu / E) and Aspartic Acid (Asp / D) are the only amino acids that contain the negatively charged carboxyl group (COO-). Therefore, in the Charge alphabet they are grouped together and given the code A. Likewise, Lysine (Lys / K) and Arginine (Arg / R) are amino acids that contain the positively charged amino groups (the lysine side chain contains an ε-amino group and arginine has a guanidino group). In the Charge alphabet they are grouped together with the code C. Histidine (His / H) is also grouped into the positively charged amino acid group because protonation of the nitrogen on its side chain occurs easily. The remaining 15 amino acids have side chains which normally do not demonstrate charge behavior in proteins; they are grouped together and given the code N. An example of using the Charge alphabet can be seen below:

ACDEFGH (original sequence)
↓
NNAANNC (Charge alphabet sequence)

Chemical alphabet

The Chemical alphabet incorporates two groupings, acidic and basic, with codes A and C, respectively. These groupings are analogous to the A and C groupings in the Charge alphabet for the same reasons. The Chemical alphabet characterizes the remaining 15 amino acids based on more than their lack of a charge. Asparagine (Asn / N) and Glutamine (Gln / Q) are amino acids that contain an amide (CONH2) and are

/ W), and Tyrosine (Tyr / Y) contain aromatic rings (code R). Serine (Ser / S) and Threonine (Thr / T) contain the hydroxyl group (OH) on their side chains (code H). Proline (Pro / P) contains an imino group (>C=NH) on its side chain (code I). Finally, the sulfur containing amino acids Cysteine (Cys / C) and Methionine (Met / M) are grouped together with code S. An example of using the Chemical alphabet can be seen below:

ACDEFGHNPS (original sequence)
↓
LSAARLCMIH (Chemical alphabet sequence)

Functional alphabet

The Functional alphabet again incorporates the A (acidic) and C (basic) groups as did the Charge and Chemical alphabets. The Functional alphabet characterizes the remaining amino acids into 2 groups: H (hydrophobic) and P (polar), based on whether the amino acid is hydrophobic (such as Alanine) or polar (such as Cysteine). An example of using the Functional alphabet can be seen below:

ACDEFGH (original sequence)
↓
HPAAHPC (Functional alphabet sequence)

Hydrophobic alphabet

The Hydrophobic alphabet is similar to the latter half of the Functional alphabet. It groups amino acids based only on hydrophobicity. Amino acids that are hydrophilic (such as Cysteine) are given the code I. Amino acids that are hydrophobic (such as Alanine) are given the code O. An example of using the Hydrophobic alphabet can be seen below:

ACDEFGH (original sequence)
↓
OIIIOII (Hydrophobic alphabet sequence)

Perl programs were written that convert normal sequences into each of the four alphabets just described (see charge.pl, chemical.pl, functional.pl, and hydro.pl in Appendix B). The programs also calculate and display the frequency of each alphabetic code that is chosen.
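The four translations can be sketched with plain Python dictionaries. This is a stand-in for illustration, not the Perl programs in Appendix B and not the Bio::Tools::OddCodes module: the groupings are taken from Table 2, the Hydrophobic codes follow the worked example in the text (O for hydrophobic, I for hydrophilic), and the function names and the "?" placeholder for unrecognized residues are assumptions.

```python
from collections import Counter

# Code -> residues, following the groupings listed in Table 2.
CHARGE = {"A": "DE", "C": "HKR", "N": "ACFGILMNPQSTVWY"}
CHEMICAL = {"A": "DE", "L": "AGILV", "M": "NQ", "R": "FWY",
            "C": "RHK", "H": "ST", "I": "P", "S": "CM"}
FUNCTIONAL = {"A": "DE", "C": "HKR", "H": "AFILMPVW", "P": "CGNQSTY"}
# Assumption: codes as used in the worked example (ACDEFGH -> OIIIOII).
HYDROPHOBIC = {"O": "AFILMPVW", "I": "CDEGHKNQRSTY"}


def translate(sequence, alphabet):
    """Rewrite a protein sequence using one of the reduced alphabets."""
    lookup = {residue: code for code, residues in alphabet.items() for residue in residues}
    # Assumption: residues outside the 20 standard letters become "?".
    return "".join(lookup.get(residue, "?") for residue in sequence.upper())


def code_frequencies(sequence, alphabet):
    """Relative frequency of each alphabet code in the translated sequence."""
    translated = translate(sequence, alphabet)
    counts = Counter(translated)
    return {code: counts[code] / len(translated) for code in sorted(counts)}


if __name__ == "__main__":
    print(translate("ACDEFGH", CHARGE))
    print(translate("ACDEFGHNPS", CHEMICAL))
    print(translate("ACDEFGH", FUNCTIONAL))
    print(translate("ACDEFGH", HYDROPHOBIC))
```

Keeping each alphabet as a code-to-residues mapping mirrors the layout of Table 2, which makes it easy to check the groupings by eye.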

Frequency of amino acids (dipeptide approach)

The problem that certain abnormal pKa values of amino acid side chains affect the overall charge of a protein still had not been dealt with up until this point. All that had been considered was the sum of a set of strict pKa values for each amino acid without taking into account any changes that might occur due to certain amino acids being next to other amino acids in sequence. The approach to solving this problem was to examine every "dipeptide" in the three ΔpI subsets. A sequence of length 7 has 6 dipeptides. For example,

Sequence: ABCABBC

Dipeptides (in order): AB, BC, CA, AB, BB, BC

Dipeptide   Count   Frequency
AB          2       0.333
BC          2       0.333
CA          1       0.167
BB          1       0.167
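The worked example above can be reproduced with a few lines of Python. This is an illustrative sketch with hypothetical function names, not the dipeps.pl program: it slides a window of width two along the sequence, so a sequence of length n yields n - 1 overlapping dipeptides.

```python
from collections import Counter


def dipeptide_counts(sequence):
    """Count every overlapping pair of adjacent residues."""
    return Counter(sequence[i:i + 2] for i in range(len(sequence) - 1))


def dipeptide_frequencies(sequence):
    """Divide each dipeptide count by the total number of dipeptides."""
    counts = dipeptide_counts(sequence)
    total = sum(counts.values())
    return {pair: count / total for pair, count in counts.items()}


if __name__ == "__main__":
    for pair, freq in dipeptide_frequencies("ABCABBC").items():
        print(pair, round(freq, 3))
```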

The frequency at which each dipeptide occurs in a particular sequence is of interest, particularly when they are considered in each ΔpI subset. A Perl program was written to count each dipeptide in the sequences of the FASTA file that is input (see Appendix B: dipeps.pl for dipeptides output in increasing order or dipepsA.pl for dipeptides output alphabetically from AA ... VV). As was the case earlier with the normal amino acid alphabet, the number of different dipeptides (20 x 20 = 400 for the normal alphabet) became problematic. The same dipeptide technique was applied to sequences after converting them into the Charge, Chemical, Functional, and Hydrophobic alphabets to alleviate this problem.

Combining an entire ΔpI subset of FASTA sequences into one long sequence (using makeComposite.pl, see Appendix B) also became problematic. To count the number of dipeptides in a set of sequences that has been combined into one long sequence, special attention needs to be paid so that the last amino acid in one sequence and the first amino acid in the next sequence are not counted as a dipeptide. The format of the output file from makeComposite.pl handles this problem by replacing each accession line with a blank newline. The other programs can now use this formatted FASTA file so that the dipeptide counts are just as accurate as naive and alphabet counts.
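The boundary problem described above can be illustrated in Python. The sketch below is an assumption-laden stand-in for makeComposite.pl, with hypothetical function names: header lines are replaced by blank lines, and the counter treats each blank line as a sequence boundary so that no dipeptide spans two proteins.

```python
from collections import Counter


def make_composite(fasta_text):
    """Replace every FASTA header line with a blank line, keeping sequence lines."""
    lines = ["" if line.startswith(">") else line.strip()
             for line in fasta_text.splitlines()]
    return "\n".join(lines)


def composite_dipeptide_counts(composite_text):
    """Pool dipeptide counts over all sequences without crossing blank-line boundaries."""
    counts = Counter()
    current = []
    # A trailing empty string flushes the final sequence.
    for line in composite_text.splitlines() + [""]:
        if line.strip():
            current.append(line.strip())    # sequence may wrap over several lines
        else:
            sequence = "".join(current)
            counts.update(sequence[i:i + 2] for i in range(len(sequence) - 1))
            current = []
    return counts


if __name__ == "__main__":
    # Hypothetical two-record FASTA input.
    fasta = ">first\nABC\n>second\nABBC\n"
    print(composite_dipeptide_counts(make_composite(fasta)))
```

In this example the pair CA, which would appear if the two sequences were simply joined end to end, is never counted.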

Pipeline Workflow

So far there have been stages at which the frequency of an amino acid, group of amino acids (coded according to the four alphabets), dipeptide, or grouped dipeptide (coded according to the four alphabets) has been examined. The process of transforming the data to reach each of these stages may appear somewhat confusing. Figure 2 below diagrams how to go from an initial set of FASTA sequences (for each ΔpI subset) to each stage of analysis. The flow in taking the naive approach would go from FASTA sequence to makeComposite.pl to aacounts.pl and then analysis. However, the flow for examining dipeptides with a functional alphabet is more complex. It begins by transferring the FASTA sequence to makeComposite.pl to functional.pl to dipeps.pl (or dipepsA.pl) followed by analysis. Table 3 below gives a brief description of each program used in this pipeline workflow (for a more detailed description and code of each program see Appendix B).

[Figure 2: workflow diagram connecting the ΔpI subset FASTA file, makeComposite.pl, charge.pl, chemical.pl, functional.pl, hydro.pl, aacounts.pl, dipeps.pl or dipepsA.pl, and analysis.]

Figure 2. Workflow diagram that shows how to get to each stage of analysis (naive, alphabets, dipeptides).

Program            Description
aacounts.pl        Counts the number of each amino acid (normal alphabet) in a sequence from a FASTA file and determines the frequency of each. Output is to FASTAfilename.aacounts
charge.pl          Converts the amino acids from the sequences in a FASTA file into a 3-letter alphabet using the charge() method in Bio::Tools::OddCodes (12). It then counts the number of each code for each sequence as well as each frequency.
chemical.pl        Converts the amino acids from the sequences in a FASTA file into an 8-letter alphabet using the chemical() method in Bio::Tools::OddCodes (12). It then counts the number of each code for each sequence as well as each frequency.
dipeps.pl          Counts the number of each different amino acid pair for each sequence in the given FASTA files. It displays each pair in order from highest frequency to lowest.
dipepsA.pl         Counts the number of each different amino acid pair for each sequence in the given FASTA files. It displays each pair in alphabetical order (AA ... VV).
functional.pl      Converts the amino acids from the sequences in a FASTA file into a 4-letter alphabet using the functional() method in Bio::Tools::OddCodes (12). It then counts the number of each code for each sequence as well as each frequency.
hydro.pl           Converts the amino acids from the sequences in a FASTA file into a 2-letter alphabet using the hydrophobic() method in Bio::Tools::OddCodes (12). It then counts the number of each code for each sequence as well as each frequency.
makeComposite.pl   Converts FASTA files of multiple sequences into a single (composite) sequence. This composite sequence is then able to be used with other programs listed here.

Table 3. Description of the programs used in this pipeline workflow. Appendix B

Results

Naive approach

The initial naive approach to analyzing the data set was done to determine the counts of each amino acid (using the normal alphabet) in each ΔpI subset (ΔpI < 0.1; 0.3 < ΔpI < 0.7; ΔpI > 0.7) and compare the relative frequency of occurrence for each amino acid between the ΔpI subsets. A comparison of the frequencies between the ΔpI < 0.1 subset and the 0.3 < ΔpI < 0.7 subset is shown in Figure 3. A similar comparison between the ΔpI < 0.1 subset and the ΔpI > 0.7 subset is displayed in Figure 4.

Figure 3. Frequency of Individual Amino Acids in Two ΔpI Subsets. The X axis labels represent the one letter abbreviations of the amino acids. Shown in blue is the ΔpI < 0.1 subset and shown in yellow is the 0.3 < ΔpI < 0.7 subset. The ΔpI < 0.1 subset consists of 60 proteins which comprise 22472 total amino acids. The 0.3 < ΔpI < 0.7 subset consists of 58 proteins which comprise 17906 total amino acids. More information about each individual protein in these ΔpI subsets can be seen in Appendix A.

Figure 4. Frequency of Individual Amino Acids in Two ΔpI Subsets. The X axis labels represent the one letter abbreviations of the amino acids. Shown in blue is the ΔpI < 0.1 subset and shown in yellow is the ΔpI > 0.7 subset. The ΔpI < 0.1 subset consists of 60 proteins which comprise 22472 total amino acids. The ΔpI > 0.7 subset consists of 50 proteins which comprise 15581 total amino acids. More information about each individual protein in these ΔpI subsets can be seen in Appendix A.

Alphabets approach

-Charge

The next step in the analysis was to convert each of the ΔpI subsets into sequences written in each of the four alphabets. This decreases the size of the amino acid alphabet and reduces the number of variables being examined. The different alphabets are summarized in Table 2.
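As an illustration of what such a conversion does, the sketch below recodes a sequence into a three-letter charge alphabet. The codes A (negatively charged), C (positively charged) and N (neutral) follow the usage in this thesis, but the assignment of individual residues to each code is an assumption made for the example; Table 2 and Bio::Tools::OddCodes (12) remain the authoritative definitions.

```python
# Assumed residue groups for the charge alphabet (illustrative only):
# A = negatively charged, C = positively charged, everything else N = neutral.
CHARGE_GROUPS = {"A": "DE", "C": "HKR"}

def to_charge_alphabet(seq):
    """Recode a protein sequence into the reduced alphabet, one code per residue."""
    lookup = {aa: code for code, members in CHARGE_GROUPS.items() for aa in members}
    return "".join(lookup.get(aa, "N") for aa in seq.upper())

print(to_charge_alphabet("MKDELH"))  # NCAANC
```

Because the recoded string has the same length as the original, the frequency counts used in the naive approach can be applied to it unchanged.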

Using the Charge alphabet, a comparison of the frequencies between the ΔpI < 0.1 subset and the 0.3 < ΔpI < 0.7 subset is shown in Figure 5. Again using the Charge alphabet, a similar comparison between the ΔpI < 0.1 subset and the ΔpI > 0.7 subset is displayed in Figure 6.

Frequencies of Amino Acids (Charge alphabet) in ΔpI < 0.1 and 0.3 < ΔpI < 0.7

Legend: ΔpI < 0.1; 0.3 < ΔpI < 0.7. X axis: Amino Acid (charge alphabet): C, A, N.

Figure 5. Frequency of Amino Acids Using the Charge Alphabet in Two ΔpI Subsets.

Frequencies of Amino Acids (Charge alphabet) in ΔpI < 0.1 and ΔpI > 0.7

Legend: ΔpI < 0.1; ΔpI > 0.7. X axis: Amino Acid (charge alphabet): C, A, N.

Figure 6. Frequency of Amino Acids Using the Charge Alphabet in Two ΔpI Subsets.

-Chemical

Using the Chemical alphabet, a comparison of the frequencies between the ΔpI < 0.1 subset and the 0.3 < ΔpI < 0.7 subset is shown in Figure 7. Figure 8 displays the same comparison between the ΔpI < 0.1 subset and the ΔpI > 0.7 subset.

Frequencies of Amino Acids (Chemical alphabet) in ΔpI < 0.1 and 0.3 < ΔpI < 0.7

Legend: ΔpI < 0.1; 0.3 < ΔpI < 0.7. X axis: Amino Acid (chemical alphabet).

Figure 7. Frequency of Amino Acids Using the Chemical Alphabet in Two ΔpI Subsets.

Frequencies of Amino Acids (Chemical alphabet) in ΔpI < 0.1 and ΔpI > 0.7

Legend: ΔpI < 0.1; ΔpI > 0.7. X axis: Amino Acid (chemical alphabet).

Figure 8. Frequency of Amino Acids Using the Chemical Alphabet in Two ΔpI Subsets.

-Functional

Using the Functional alphabet, a comparison of the frequencies between the ΔpI < 0.1 subset and the 0.3 < ΔpI < 0.7 subset is shown in Figure 9. Again using the Functional alphabet, a similar comparison between the ΔpI < 0.1 subset and the ΔpI > 0.7 subset is displayed in Figure 10.

Frequencies of Amino Acids (Functional alphabet) in ΔpI < 0.1 and 0.3 < ΔpI < 0.7

Legend: ΔpI < 0.1; 0.3 < ΔpI < 0.7. X axis: Amino Acid (functional alphabet).

Figure 9. Frequency of Amino Acids Using the Functional Alphabet in Two ΔpI Subsets.

Frequencies of Amino Acids (Functional alphabet) in ΔpI < 0.1 and ΔpI > 0.7

Legend: ΔpI < 0.1; ΔpI > 0.7. X axis: Amino Acid (functional alphabet).

Figure 10. Frequency of Amino Acids Using the Functional Alphabet in Two ΔpI Subsets.

-Hydrophobic

Using the Hydrophobic alphabet, a comparison of the frequencies between the ΔpI < 0.1 subset and the 0.3 < ΔpI < 0.7 subset is shown in Figure 11. Again using the Hydrophobic alphabet, a similar comparison between the ΔpI < 0.1 subset and the ΔpI > 0.7 subset is displayed in Figure 12.

Frequencies of Amino Acids (Hydrophobic alphabet) in ΔpI < 0.1 and 0.3 < ΔpI < 0.7

Legend: ΔpI < 0.1; 0.3 < ΔpI < 0.7. X axis: Amino Acid (hydrophobic alphabet): I, O.

Figure 11. Frequency of Amino Acids Using the Hydrophobic Alphabet in Two ΔpI Subsets.

Frequencies of Amino Acids (Hydrophobic alphabet) in ΔpI < 0.1 and ΔpI > 0.7

Legend: ΔpI < 0.1; ΔpI > 0.7. X axis: Amino Acid (hydrophobic alphabet): I, O.

Figure 12. Frequency of Amino Acids Using the Hydrophobic Alphabet in Two ΔpI Subsets.

Dipeptide approach

Using a more sophisticated method that looks at dipeptides of a sequence gave an [...] is similar to the naive approach in that it just examines dipeptides using the normal amino acid alphabet. This results in up to 400 different dipeptides (there may be slightly fewer than 400 dipeptides in a given subset owing to the chance that not all possible dipeptides may occur). The difference in frequency of every dipeptide between ΔpI subsets was also calculated ("Delta frequency" or "Δ%"). In other words, a Delta % of 100 would mean that a certain dipeptide occurred 2 times as often in one subset compared to another subset. The differences, or "Delta %" values, can be seen in Figure 13 when comparing the ΔpI < 0.1 subset and the 0.3 < ΔpI < 0.7 subset. Figure 14 shows the similar Delta % values when comparing the ΔpI < 0.1 subset and the ΔpI > 0.7 subset. To better explain Figures 13-16, consider the bar indicated by the arrow in Figure 13. This bar represents the 11 dipeptides whose Δ% value fell between 100% and 150% when comparing dipeptide frequencies in the two different ΔpI sets.
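The dipeptide counting and the Delta % comparison can be sketched as below. The text only states that a Delta % of 100 corresponds to a twofold frequency, so the formula used here, 100 × (f_A − f_B) / f_B, and the choice of which subset serves as the reference are assumptions; the function names and toy sequences are likewise hypothetical.

```python
from collections import Counter

def dipeptide_frequencies(sequences):
    """Percent frequency of each overlapping dipeptide, pooled over sequences."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        # A protein of length L contributes L - 1 overlapping dipeptides.
        counts.update(seq[i:i + 2] for i in range(len(seq) - 1))
    total = sum(counts.values())
    return {dp: 100.0 * n / total for dp, n in counts.items()}

def delta_percent(freq_a, freq_b):
    """Relative change of subset A versus subset B, in percent.

    Assumed definition: 100 * (f_A - f_B) / f_B, so +100 means twice as frequent in A.
    Dipeptides absent from B are skipped because the ratio is undefined.
    """
    return {dp: 100.0 * (freq_a.get(dp, 0.0) - fb) / fb
            for dp, fb in freq_b.items() if fb > 0}

# Toy subsets (hypothetical): KK makes up 1/3 of A's dipeptides and 1/6 of B's.
freq_a = dipeptide_frequencies(["KKAK"])
freq_b = dipeptide_frequencies(["KKAAAKA"])
for dp, delta in sorted(delta_percent(freq_a, freq_b).items()):
    print(dp, round(delta, 1))
```

Counting how many dipeptides fall into each Delta % range then gives the kind of density summary plotted in the figures.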

Densities of Delta % Values in ΔpI < 0.1 and 0.3 < ΔpI < 0.7 Using a Normal Amino Acid Alphabet

Figure 13. Density of Delta % Values of Dipeptides in Two ΔpI Subsets. The ΔpI < 0.1 subset consists of 60 proteins which comprise 22412 total dipeptides. The 0.3 < ΔpI < 0.7 subset consists of 58 proteins which comprise 17848 total dipeptides. More information about each individual protein in these ΔpI subsets can be seen in Appendix A.

Densities of Delta % Values in ΔpI < 0.1 and ΔpI > 0.7 Using a Normal Amino Acid Alphabet

X axis: Delta % range (>25, >50, >75, >100, >150, >200, >300, >400).

Figure 14. Density of Delta % Values of Dipeptides in Two ΔpI Subsets. The ΔpI < 0.1 subset consists of 60 proteins which comprise 22412 total dipeptides. The ΔpI > 0.7 subset consists of 50 proteins which comprise 15531 total dipeptides. More information about each individual protein in these ΔpI subsets can be seen in Appendix A.

Dipeptide Threshold

A similar analysis was performed on the same ΔpI subsets, this time screening out dipeptides that had a very low frequency (which may change their Delta % values too rapidly; see Discussion for an elaboration). A frequency of occurrence threshold value of 0.1% had to be met for dipeptides. In other words, if a dipeptide occurred too infrequently (under 0.1% of the total number of dipeptides) then it was eliminated. The remaining dipeptides were counted and the Delta % values comparing the ΔpI < 0.1 subset and the 0.3 < ΔpI < 0.7 subset can be seen in Figure 15. Likewise, the comparison for the ΔpI < 0.1 subset and the ΔpI > 0.7 subset can be seen in Figure 16. Dipeptides that were found in the extreme positive or negative ranges of these figures are indicated by the one letter amino acid codes. For instance, the dipeptide RR (arginine-arginine) in Figure 15 was found much less frequently in the ΔpI < 0.1 dataset than in the 0.3 < ΔpI < 0.7 dataset.

Densities of Delta % Values in ΔpI < 0.1 and 0.3 < ΔpI < 0.7 Using a Normal Amino Acid Alphabet (where frequency of dipeptide must be above 0.1)

X axis: Delta % range and particular dipeptides (<-50, <-40, <-30, <-20, <-10, <0, >0, >10, >20, >30, >40, >50, >60, >75, >100).

Figure 15. Density of Delta % Values of Dipeptides in Two ΔpI Subsets with a Threshold of 0.1%.

Densities of Delta % Values in ΔpI < 0.1 and ΔpI > 0.7 Using a Normal Amino Acid Alphabet (where frequency of dipeptide must be above 0.1)

X axis: Delta % range and particular dipeptides.

Figure 16. Density of Delta % Values of Dipeptides in Two ΔpI Subsets with a Threshold of 0.1%.

Dipeptide using Alphabets

The final step in the analysis was to combine the alphabet and dipeptide approaches. Using the smaller alphabets dramatically reduced and condensed the results as compared to using the normal alphabet, which creates 400 possible dipeptides.

-Charge

Using the Charge alphabet, a comparison of the dipeptide frequencies between the ΔpI < 0.1 subset and the 0.3 < ΔpI < 0.7 subset is shown in Figure 17 as well as the Delta % values for each dipeptide. The same comparison is shown between the ΔpI < 0.1 subset and the ΔpI > 0.7 subset in Figure 18.

Comparison of Dipeptides (based on charge characteristic) taken from ΔpI < 0.1 and 0.3 < ΔpI < 0.7

X axis: Dipeptide (charge alphabet).

Figure 17. Frequencies of Charge Alphabet Dipeptides in Two ΔpI Subsets. Shown in blue are the frequencies of each dipeptide in the ΔpI < 0.1 subset and shown in yellow is the difference in frequency for each dipeptide between the ΔpI < 0.1 subset and the 0.3 < ΔpI < 0.7 subset.

Comparison of Dipeptides (based on charge characteristic) taken from ΔpI < 0.1 and ΔpI > 0.7

X axis: Dipeptide (charge alphabet).

Figure 18. Frequencies of Charge Alphabet Dipeptides in Two ΔpI Subsets. Shown in blue are the frequencies of each dipeptide in the ΔpI < 0.1 subset and shown in yellow is the difference in frequency for each dipeptide between the ΔpI < 0.1 subset and the ΔpI > 0.7 subset.

-Chemical

Using the Chemical alphabet, a comparison of the dipeptide frequencies between the ΔpI < 0.1 subset and the 0.3 < ΔpI < 0.7 subset is shown in Figure 19 as well as the Delta % values for each dipeptide. The same comparison is shown between the ΔpI < 0.1 subset and the ΔpI > 0.7 subset in Figure 20. The Chemical alphabet with dipeptides was sufficiently large that it was not possible to display all the possible dipeptide combinations in Figures 19 and 20. Instead only the density values were chosen to display.

Densities of Delta % Values in ΔpI < 0.1 and 0.3 < ΔpI < 0.7 Using a Chemical Alphabet

X axis: Delta % range and particular dipeptides (<-20, <-10, <0, >0, >10, >20, >30, >40, >50, >60). Labeled outlier dipeptides: SS (-28%), Al (-25%), AS (-24%), MS (-22%), IS (-20%), IC (43%), IM (48%), rt* (48%), RR (61%).

Figure 19. Density of Delta % Values of Chemical Alphabet Dipeptides in Two ΔpI Subsets. The ΔpI < 0.1 subset consists of 60 proteins which comprise 22412 total dipeptides. The 0.3 < ΔpI < 0.7 subset consists of 58 proteins which comprise 17848 total dipeptides. More information about each individual protein in these ΔpI subsets can be seen in Appendix A.

Densities of Delta % Values in ΔpI < 0.1 and ΔpI > 0.7 Using a Chemical Alphabet

X axis: Delta % range and particular dipeptides (<-40, <-30, <-20, <-10, <0, >0, >10, >20, >30, >40, >50, >60, >70, >80).

Figure 20. Density of Delta % Values of Chemical Alphabet Dipeptides in Two ΔpI Subsets. The ΔpI < 0.1 subset consists of 60 proteins which comprise 22412 total dipeptides. The ΔpI > 0.7 subset consists of 50 proteins which comprise 15531 total dipeptides. More information about each individual protein in these ΔpI subsets can be seen in Appendix A.

-Functional

Using the Functional alphabet, a comparison of the dipeptide frequencies between the ΔpI < 0.1 subset and the 0.3 < ΔpI < 0.7 subset is shown in Figure 21 as well as the Delta % values for each dipeptide. The same comparison is shown between the ΔpI < 0.1 subset and the ΔpI > 0.7 subset in Figure 22.

Comparison of dipeptides (based on functional characteristic) taken from ΔpI < 0.1 and 0.3 < ΔpI < 0.7

X axis: Dipeptide (functional alphabet).

Figure 21. Frequencies of Functional Alphabet Dipeptides in Two ΔpI Subsets. Shown in blue are the frequencies of each dipeptide in the ΔpI < 0.1 subset and shown in yellow is the difference in frequency for each dipeptide between the ΔpI < 0.1 subset and the 0.3 < ΔpI < 0.7 subset.

Comparison of dipeptides (based on functional characteristic) taken from ΔpI < 0.1 and ΔpI > 0.7

X axis: Dipeptide (functional alphabet).

Figure 22. Frequencies of Functional Alphabet Dipeptides in Two ΔpI Subsets. Shown in blue are the frequencies of each dipeptide in the ΔpI < 0.1 subset and shown in yellow is the difference in frequency for each dipeptide between the ΔpI < 0.1 subset and the ΔpI > 0.7 subset.

-Hydrophobic

Using the Hydrophobic alphabet, a comparison of the dipeptide frequencies between the ΔpI < 0.1 subset and the 0.3 < ΔpI < 0.7 subset is shown in Figure 23 as well as the Delta % values for each dipeptide. The same comparison is shown between the ΔpI < 0.1 subset and the ΔpI > 0.7 subset in Figure 24.

Comparison of dipeptides (based on hydrophobic characteristic) taken from ΔpI < 0.1 and 0.3 < ΔpI < 0.7

Legend: % of Dipeptide in ΔpI < 0.1; Delta % (ΔpI < 0.1 - 0.3 < ΔpI < 0.7). X axis: Dipeptide (hydrophobicity alphabet).

Figure 23. Frequencies of Hydrophobic Alphabet Dipeptides in Two ΔpI Subsets. Shown in blue are the frequencies of each dipeptide in the ΔpI < 0.1 subset and shown in yellow is the difference in frequency for each dipeptide between the ΔpI < 0.1 subset and the 0.3 < ΔpI < 0.7 subset.

Comparison of dipeptides (based on hydrophobic characteristic) taken from ΔpI < 0.1 and ΔpI > 0.7

Legend: % of Dipeptide in ΔpI < 0.1; Delta % (ΔpI < 0.1 - ΔpI > 0.7). X axis: Dipeptide (hydrophobicity alphabet).

Figure 24. Frequencies of Hydrophobic Alphabet Dipeptides in Two ΔpI Subsets. Shown in blue are the frequencies of each dipeptide in the ΔpI < 0.1 subset and shown in yellow is the difference in frequency for each dipeptide between the ΔpI < 0.1 subset and the ΔpI > 0.7 subset.

Discussion

When exploring the behavior of proteins undergoing isoelectric focusing, there exists a discrepancy between predicted pI values and experimentally determined pI values for a high percentage of those proteins. This comparison of pI values was performed using predictions based on our algorithm (11) or similar algorithms (19) and experimental pI values determined in different laboratory settings (14-18). The size and regular occurrence of these differences justified a close study of the protein sequences in an effort to identify underlying patterns that could contribute to these differences. The question now lay in whether there was enough information in the results that were extracted to be able to more accurately predict pI values using the information obtained.

The first key element was having a reliable dataset that was both uniform and robust enough to give meaningful data. A dataset that is too diverse would lead to complications such as the question of how to handle post-translational modifications in predicting pI and MW. Simply finding the frequencies of all dipeptides in all known protein sequences would provide a dataset that is certainly robust enough. Unfortunately, the robustness would be offset by the high level of noise in the data due to the fact that different organisms have different post-translational modifications. A dataset that is too small would not have enough dipeptide information to make sure that the dipeptides that occur in the lowest frequencies are still seen in sufficient abundance to maintain their statistical validity. To overcome both of these hurdles, the search space was limited only to proteins in E. coli since it displays very few post-translational modifications and has a proteome that has been sufficiently documented to do a case [...]

In keeping with the theme of having a dataset with as little noise as possible, yet still retaining as much robustness as possible, it was decided that even though well structured 2DE data existed from 5 different groups (14-18), it was probably best to limit the usage of this data to one or two of these groups (17 and 18). Both the Yan et al. (18) and Tonella et al. (19) groups performed large scale 2DE studies on the E. coli proteome. The Tonella (19) group boasted over 70% of the E. coli proteome being covered in their data. Since none of the groups used the same 2DE conditions it was decided that the data from the Tonella (19) group would be the only data used. The primary justification was to ensure that the experimental pI and MW values were gained using the same conditions. This in turn would reduce as much noise as possible. In addition, the fact that their data covered over 70% of the E. coli genome held promise for this study.

Once the entire dataset was selected, another decision had to be made about how to separate the data so that clear lines could be seen between proteins that had very small ΔpI values and proteins that had greater ΔpI values. Doing so would make it possible to see if significant sequence differences (at the dipeptide level) between ΔpI subsets existed. It was necessary to break the dataset into a small number of ΔpI subsets. These arbitrary ΔpI cut-off ranges (ΔpI < 0.1; 0.3 < ΔpI < 0.7; ΔpI > 0.7) were chosen in order to separate the data into distinct sets of similar size that could be compared with each other.

There was difficulty in deciding how to separate the entire dataset into these three subsets. One possible approach was to separate the dataset into many smaller sized subsets based on a larger number of ΔpI ranges. On one hand doing this might provide [...] relative to adjacent ΔpI ranges. On the other hand, by doing it this way, there is a loss of information at the sequence level due to the smaller number of sequences that would be found in each dataset. This, in turn, would threaten the reliability of our findings. Therefore, the dataset had to be separated into subsets of sufficient robustness. The ΔpI < 0.1 subset consists of 60 proteins which comprise 22472 total amino acids or 22412 total dipeptides. The 0.3 < ΔpI < 0.7 subset consists of 58 proteins which comprise 17906 total amino acids or 17848 total dipeptides. The ΔpI > 0.7 subset consists of 50 proteins which comprise 15581 total amino acids or 15531 total dipeptides. More information about each individual protein in these ΔpI subsets, including ΔpI, a description and SWISS-2DPAGE Accession Number, can be seen in Appendix A.
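The amino acid and dipeptide totals quoted for each subset are related in a simple way: a protein of length L contains L − 1 overlapping dipeptides, so a subset's dipeptide total should equal its amino acid total minus its number of proteins. The short check below uses only the figures reported above; the dictionary layout and labels are just for illustration.

```python
# (proteins, total amino acids, total dipeptides) as reported for each ΔpI subset
subsets = {
    "dpI < 0.1": (60, 22472, 22412),
    "0.3 < dpI < 0.7": (58, 17906, 17848),
    "dpI > 0.7": (50, 15581, 15531),
}

for name, (n_proteins, n_residues, n_dipeptides) in subsets.items():
    # Each protein of length L has L - 1 overlapping dipeptides.
    expected = n_residues - n_proteins
    print(name, expected, expected == n_dipeptides)
```

All three subsets satisfy this relation, which indicates that dipeptides were counted within each protein and not across protein boundaries.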

The analytical process is best viewed as a pipeline as seen in Figure 2 in the Methods section. We began our analysis with the simplest method (naive approach), worked our way to more complicated methods (alphabets approach), and ended with the most complicated methods (dipeptides using alphabets approach). Along this path, the relevance of the data also becomes more complicated, but more interesting at the same time (with a few exceptions).

The naive approach to handling the dataset did not provide any meaningful results. It was quickly apparent that individual amino acid frequencies in a given set of protein sequences did not vary among the three data subsets. In the end, no amino acid frequency characteristics using simply the naive approach were found to be significantly different between the three ΔpI subsets. This can be seen in Figures 3 and 4 when comparing the ΔpI < 0.1 subset with the 0.3 < ΔpI < 0.7 subset and the ΔpI < 0.1 subset [...] yellow frequencies can be seen for any individual amino acid; the values are also nearly identical when Figure 3 and Figure 4 are compared. The lack of a correlation between ΔpI values and the frequency of these individual amino acids showed us that we needed to consider the problem in more depth: more than one amino acid at a time.

To simplify the analysis, the number of variables was reduced by using the four alphabets described in Table 2 at the next stage in the pipeline. Again, the results did not reveal any significant trends that could affect the way that pI is calculated. Figures 5 and 6 (Charge alphabet comparisons), Figures 7 and 8 (Chemical alphabet comparisons), Figures 9 and 10 (Functional alphabet comparisons), and Figures 11 and 12 (Hydrophobic alphabet comparisons) show very similar results to those of the naive approach in Figures 3 and 4. There is no trend of increase or decrease in frequency for any particular amino acid group when moving between the three ΔpI subsets.

It was expected that more meaningful results would be obtained by analysis of the dipeptide frequencies. All previous pI prediction algorithms (2-8), including ours (11), treat the pKa for each amino acid independently, regardless of its near or distant neighbors. At this point it is instructive to consider the experimental conditions normally employed for isoelectric focusing (IEF). The biological function of proteins requires that they maintain their three dimensional structure intact. However, for IEF, we are interested only in separating the proteins, not observing their biological function. To assure the best separation, reagents such as urea and detergents are added prior to IEF to disrupt any secondary, tertiary or quaternary aspects of protein structure. In these fully denatured proteins, the only significant interactions are expected to occur between amino [...] consideration of the effect of neighboring amino acids on their respective side chain pKa values may prove valuable.

With respect to each alphabet that was used, the discussion will advance from the least significant alphabet dipeptide results to the most significant alphabet dipeptide results. However, the analysis using the normal amino acid alphabet will be discussed first. At first glance Figures 13 and 14 show some very promising results. The Delta % value represents the change in frequency from one ΔpI subset being compared to the next ΔpI subset. Therefore, Delta % values that are in the 300 and 400 ranges would seem very significant. The problem was that most of the dipeptides that fell into these extreme ranges were dipeptides whose overall frequency was vanishingly small. A dipeptide that occurs only once in one ΔpI subset and multiple times in another ΔpI subset is going to have a very high Delta % value. It would not be wise to rely on such dipeptide frequencies to redesign a pI prediction algorithm. To negotiate through all of the 400 dipeptides in the normal alphabet, the same analysis was run with a threshold frequency of occurrence for dipeptides of 0.1%. In other words, if a dipeptide did not occur at least 0.1% of the time (roughly 22 times in the ΔpI < 0.1 dataset, which contained 22412 dipeptides) it was not used for analysis. The results of this can be seen in Figures 15 and 16. There still exist extreme outliers that have Delta % values in the 100 range, which will later be reanalyzed by comparison with some of the alphabet dipeptide analyses.

The alphabet that showed the least interesting results when using a dipeptide approach was the hydrophobic alphabet. Comparisons of the ΔpI subsets using the [...] bars and is very negligible in each of the 4 dipeptides (no more than a Delta % value of 1.85 was seen in any of the 4 dipeptides).
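The 0.1% frequency-of-occurrence threshold applied to the normal-alphabet dipeptides before Figures 15 and 16 amounts to a simple filter on the frequency table. The sketch below is illustrative only: the function name and the example frequencies are hypothetical, and whether the threshold had to be met in one subset or in both subsets being compared is not stated in the text.

```python
def apply_threshold(freq_percent, min_percent=0.1):
    """Keep only dipeptides whose frequency (in percent) is at least min_percent."""
    return {dp: f for dp, f in freq_percent.items() if f >= min_percent}

# Hypothetical frequencies (percent of all dipeptides in a subset), not thesis data.
freqs = {"AA": 1.20, "KY": 0.35, "WC": 0.004}
kept = apply_threshold(freqs)
print(sorted(kept))  # ['AA', 'KY']

# In a subset of 22412 dipeptides, 0.1% corresponds to roughly 22 occurrences.
print(22412 * 0.1 / 100)
```

Filtering before computing Delta % keeps a dipeptide seen once or twice from producing a very large relative change.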

The charge alphabet showed slightly more significant results for dipeptide analysis. Delta % values reached into the 30+ range for some dipeptides. The AA dipeptide (negatively charged amino acid followed by negatively charged amino acid; see Table 2 for definitions of all the alphabet codes) had a Delta % of -31.3% going from the ΔpI < 0.1 subset to the ΔpI > 0.7 subset (Figure 18). The Delta % for the AA dipeptide is also large (-18.9%) in the other comparison of the ΔpI < 0.1 subset and the 0.3 < ΔpI < 0.7 subset (Figure 17). However, the frequency of occurrence of this AA dipeptide (as shown in the blue bars) is very low in all three ΔpI subsets. What we would like to see is a large Delta % value accompanied by a large frequency of occurrence for a particular dipeptide. This was not apparent in any of the dipeptides using the Charge alphabet.

Staying with the theme that the most significant results will combine a large Delta % value with a large frequency of occurrence for dipeptides, the Functional alphabet is considered next. Figures 21 and 22, representing the analysis using the Functional alphabet, show a collection of dipeptides that have both significantly large Delta % values and significantly large frequencies of occurrence: AA, AH, HA, HP, CP, PH, PP.

It was important to refer back to the analysis that was done using dipeptides based on the complete amino acid alphabet. Figures 15 and 16 point out a few extreme dipeptide outliers: KY, YS (Figure 15) and EE, NN, YT (Figure 16). Converting these [...] respectively. These three different dipeptides all map back to extreme outliers from the analysis done using the Functional alphabet (Figures 21 and 22).

Using the Chemical alphabet with dipeptides created a dataset large enough that it was not possible to display all the possible dipeptide combinations in Figures 19 and 20. Instead only the density values were chosen to display. Particular outlier dipeptides are labeled on the top of each column with their respective Delta % values.

The significance of these findings is that they may lead to a more accurate calculation of pI than currently existing methods (11, 19). These data clearly support the idea that the pKa value for an amino acid side chain, even when the protein is fully denatured, depends on the microenvironment created by the nearest neighbors of that amino acid.
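One way to picture a neighbor-dependent pKa is a standard charge-balance pI calculation in which each side chain pKa can be shifted according to the preceding residue. The sketch below is not the algorithm of reference (11): the pKa values are an illustrative set, the `neighbor_shift` table is entirely hypothetical, and only the generic approach (Henderson-Hasselbalch charges summed over ionizable groups, with bisection on pH) is standard.

```python
# Illustrative pKa values (an assumption; reference (11) may use a different set).
PKA_POS = {"K": 10.8, "R": 12.5, "H": 6.5, "Nterm": 8.6}
PKA_NEG = {"D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1, "Cterm": 3.6}

def site_pkas(seq, neighbor_shift=None):
    """List (pKa, sign) for each ionizable group. neighbor_shift is a hypothetical
    table {(preceding_residue, residue): pKa_adjustment}."""
    neighbor_shift = neighbor_shift or {}
    sites = [(PKA_POS["Nterm"], +1), (PKA_NEG["Cterm"], -1)]
    for i, aa in enumerate(seq):
        left = seq[i - 1] if i > 0 else ""
        shift = neighbor_shift.get((left, aa), 0.0)
        if aa in PKA_POS:
            sites.append((PKA_POS[aa] + shift, +1))
        elif aa in PKA_NEG:
            sites.append((PKA_NEG[aa] + shift, -1))
    return sites

def net_charge(sites, ph):
    """Net charge at a given pH from the Henderson-Hasselbalch relation."""
    charge = 0.0
    for pka, sign in sites:
        if sign > 0:
            charge += 1.0 / (1.0 + 10 ** (ph - pka))
        else:
            charge -= 1.0 / (1.0 + 10 ** (pka - ph))
    return charge

def isoelectric_point(seq, neighbor_shift=None):
    """Bisection on pH for the point where the net charge crosses zero."""
    sites = site_pkas(seq.upper(), neighbor_shift)
    lo, hi = 0.0, 14.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if net_charge(sites, mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

base = isoelectric_point("ARRDA")
shifted = isoelectric_point("ARRDA", {("R", "R"): -0.5})  # hypothetical adjustment
print(round(base, 2), round(shifted, 2))
```

Keying the adjustment on the pair (preceding residue, residue) mirrors the dipeptide view taken in this study; a fitted table would have to come from experimental pI data, which this sketch does not attempt.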

Using the extreme outlier dipeptides that have been identified from this study of 180 annotated E. coli proteins, it may be possible to adjust the algorithms for calculating pI values. Our algorithm for calculating pI from amino acid sequence (11) could be modified to include the effects of adjacent amino acids on the pKa values used in the calculations. This will be an empirical process whereby the pKa values used in the algorithm will be modified fractionally to see which changes lead to a better correlation between actual and predicted pI values for the two outlier datasets (0.3 < ΔpI < 0.7; and ΔpI > 0.7).

If the improvement in the accuracy of the pI calculation proves worthwhile, there are many future advancements that could be made. The first could be to build a larger dataset to work with and rerun the analysis to compare to the data shown here. Beyond the scope of the E. coli proteome, further data that are available at the ExPASy Server's [...] microbial proteomes. Another step would be to port the analysis over to lower eukaryotic proteomes that contain many more post-translational modifications. A lot would have to be done in terms of predicting or categorizing these post-translational modifications but in doing so it may lead to an even more powerful approach to better predicting pI in [...]

Conclusions

A dataset of E. coli proteins was collected and formatted to study the discrepancy that exists between experimental isoelectric point and predicted isoelectric point (ΔpI). This dataset was then split into three parts depending on the magnitude of ΔpI for each protein. Several multi-layered, sequential approaches were taken in reformatting the protein sequence data in an attempt to get a better understanding of what might be causing the varying ΔpI. Each of these stages represented a different part of a pipeline where the data were analyzed by comparing each of the three ΔpI subsets to one another.

The pipeline consisted of a naive approach (considering individual amino acid frequencies), followed by the application of four different alphabets to represent sequences in a simpler way by grouping similar amino acids based on their charge, functional, chemical, and hydrophobic properties. The final step in the pipeline involved investigating the dipeptides of all of these sequences using both the 20 amino acid alphabet and the simplified groupings. The alphabet dipeptide approach yielded the most meaningful results, showing that certain dipeptide sequences occur in greatly different frequency between proteins in the different ΔpI subsets.

Future studies will attempt to show that the results of these dipeptide findings can be used to better predict pI. This will involve modification of our existing pI prediction algorithm to include the effect of adjacent amino acids on side chain pKa values. Using a short list of only the most extreme cases, where a dipeptide showed greatly different frequency from one ΔpI subset to the next, should result in a pI prediction value that is more accurate. Once the pI prediction is improved the next step would be to [...] them. In addition, similar analyses will be extended to other prokaryotic organisms, and [...]

References
