Isoelectric point prediction from the amino acid sequence of a protein


Rochester Institute of Technology

RIT Scholar Works

Theses

Summer 2005


Matthew Conte


Recommended Citation

Conte, Matthew, "Isoelectric point prediction from the amino acid sequence of a protein" (2005). Thesis. Rochester Institute of Technology. Accessed from



THESIS

ISOELECTRIC POINT PREDICTION FROM THE AMINO ACID SEQUENCE OF A PROTEIN

Submitted by

Matthew Conte

Department of Biological Sciences

In partial fulfillment of the requirements for the Master of Science degree in Bioinformatics at Rochester Institute of Technology


To: Head, Department of Biological Sciences

Rochester Institute of Technology Department of Biological Sciences Bioinformatics Program

The undersigned state that Matthew Conte (Student Name), ________ (Student Number), a candidate for the Master of Science degree in Bioinformatics, has submitted his/her thesis and has satisfactorily defended it.

This completes the requirements for the Master of Science degree in Bioinformatics at Rochester Institute of Technology.

Thesis committee members:

Name

Gary R. Skuse

(Committee Chair)

Paul A. Craig

(Thesis Advisor)

Name Illegible

Douglas P. Merrill

Date


Thesis/Dissertation Author Permission Statement

Title of thesis or dissertation: ________

Name of author: Matthew Conte

Degree: M.S.

Program: Bioinformatics

College: Science

I understand that I must submit a print copy of my thesis or dissertation to the RIT Archives, per current RIT guidelines for the completion of my degree. I hereby grant to the Rochester Institute of Technology and its agents the non-exclusive license to archive and make accessible my thesis or dissertation in whole or in part in all forms of media in perpetuity. I retain all other ownership rights to the copyright of the thesis or dissertation. I also retain the right to use in future works (such as articles or books) all or part of this thesis or dissertation.

Print Reproduction Permission Granted:

I, Matthew Conte, hereby grant permission to the Rochester Institute of Technology to reproduce my print thesis or dissertation in whole or in part. Any reproduction will not be for commercial use or profit.

Signature of Author:

Matthew Conte

Date: (illegible)

Print Reproduction Permission Denied:

I, ________, hereby deny permission to the RIT Library of the Rochester Institute of Technology to reproduce my print thesis or dissertation in whole or in part.

Signature of Author: _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ Date:

Inclusion in the RIT Digital Media Library Electronic Thesis & Dissertation (ETD) Archive

I, ________, additionally grant to the Rochester Institute of Technology Digital Media Library (RIT DML) the non-exclusive license to archive and provide electronic access to my thesis or dissertation in whole or in part in all forms of media in perpetuity.

I understand that my work, in addition to its bibliographic record and abstract, will be available to the world-wide community of scholars and researchers through the RIT DML. I retain all other ownership rights to the copyright of the thesis or dissertation. I also retain the right to use in future works (such as articles or books) all or part of this thesis or dissertation. I am aware that the Rochester Institute of Technology does not require registration of copyright for ETDs.

I hereby certify that, if appropriate, I have obtained and attached written permission statements from the owners of each third party copyrighted matter to be included in my thesis or dissertation. I certify that the version I submitted is the same as that approved by my committee.


Abstract

Proteins often do not migrate as expected in two-dimensional electrophoresis based on their primary sequence. The predicted isoelectric point (pI) frequently does not coincide with experimental pI values obtained in the laboratory. The reasons for these differences led to this study. Initially, 2DE data from the E. coli proteome was collected and formatted. This data set was split into three parts, each consisting of different levels of pI discrepancy (ΔpI). The protein sequence data for each ΔpI subset was run through a pipeline. At each stage of the pipeline the data were analyzed by comparing each of the three ΔpI subsets to one another. The pipeline consisted of a naive approach (considering individual amino acid frequencies), followed by the application of four different alphabets to represent sequences in a simpler way by grouping similar amino acids based on their charge, functional, chemical, and hydrophobic properties. The final step in the pipeline involved investigating the dipeptides of all of these sequences using both the 20 amino acid alphabet and the simplified groupings. An evaluation of the alphabet dipeptide analysis demonstrated the existence of certain dipeptide sequences which correlate well


Table of Contents

1 Introduction
2 Methods
  2.1 Forming the data set
  2.2 Experimental and predicted pI values
  2.3 Extracting useful information from collected subset sequences
    2.3.1 Amino acid frequency analysis (naive approach)
    2.3.2 Frequency of amino acids (alphabets approach)
    2.3.3 Frequency of amino acids (dipeptide approach)
    2.3.4 Pipeline workflow
3 Results
  3.1 Naive approach
  3.2 Alphabets approach
    3.2.1 Charge
    3.2.2 Chemical
    3.2.3 Functional
    3.2.4 Hydrophobic
  3.3 Dipeptide approach
  3.4 Dipeptide threshold
  3.5 Dipeptide using alphabets
    3.5.1 Charge
    3.5.2 Chemical
    3.5.3 Functional
    3.5.4 Hydrophobic
4 Discussion
5 Conclusions
6 References


Introduction

Two-dimensional gel electrophoresis (2DE) has been an important laboratory technique for the field of proteomics for over two decades. 2DE allows the researcher to separate and identify thousands of proteins from a cellular extract in a single experiment. 2DE is difficult and time consuming, as it is necessary to determine ideal initial conditions, wait for results, and possibly change conditions after that (1). In addition, reproducibility of gels and comparison of 2DE results between separate groups has proved difficult (1). In 2DE, proteins are separated in the first dimension by their isoelectric points (the pH at which the net charge of the protein is zero) and in the second dimension by their molecular weights. The accurate prediction of protein isoelectric point (pI) and molecular weight (MW) using simply the amino acid sequence of the protein would be extremely valuable to researchers who use two-dimensional gel electrophoresis.

Computational procedures for calculating and predicting the pI from the amino acid composition of a protein based on the dissociation constants of the charged groups within the protein have been developed (2-8). The accuracy of these algorithms is limited by the certainty of the values for the dissociation constants and by microenvironmental effects such as charge-charge interactions and post-translational modifications.

To systematically explore the relationship between pI, molecular weight and protein sequence, a data set of proteins was collected and organized from a model


post-phosphorylation which can alter the pI/MW; the presence of these modifications makes pI/MW predictions much more difficult since the modifications in the proteins may cause them to migrate to a position on a 2-D gel that is quite different than what is predicted based solely on the amino acid sequence of the protein. E. coli is also one of the best characterized prokaryotes and much more data beyond simply the protein sequence for each protein is widely available for it.

At this point it is necessary to consider the basic structural features of proteins and the role of individual amino acids in the structure and function of proteins. Figure 1 below shows the structure of the 20 amino acids with side chain structures shown in red (10).

The charge on all proteins arises from some of the amino acid side chains, as well as the carboxy- and amino-termini, some prosthetic groups, and bound ions. Our pI prediction tool (11) is designed to calculate charge based on the side chains and carboxy- and amino-termini. The charge on amino acid side chains depends on the pH of the solution and the pKa of the side chains. It is also affected by the localized environment around a side chain. Our current calculation model uses the following pKa values for ionizable groups on the protein and does not make any adjustments to the pKa values of the side chains regardless of their environment within the protein (Table 1). We also assume that the separation is based on the total charge on the protein, not the


[Figure 1: structural diagrams of the 20 amino acids: Arginine (Arg/R), Glutamine (Gln/Q), Phenylalanine (Phe/F), Tyrosine (Tyr/Y), Tryptophan (Trp/W), Glycine (Gly/G), Alanine (Ala/A), Histidine (His/H), Serine (Ser/S), Lysine (Lys/K), Proline (Pro/P), Glutamic Acid (Glu/E), Aspartic Acid (Asp/D), Threonine (Thr/T), Cysteine (Cys/C), Methionine (Met/M), Leucine (Leu/L), Asparagine (Asn/N), Isoleucine (Ile/I), Valine (Val/V).]

Figure 1. Structures of amino acids with side chains shown in red, carboxylate groups in green, and amino groups in blue (10).

The charge on the protein is the sum of the charges on the individual amino acid side chains. However, the charge on individual amino acid side chains can vary when


pKa for glutamic acid is about 4.1. In lysozyme, two glutamic acid residues are in the active site. One is in a polar environment and has a normal pKa value. The other glutamate side chain is in a hydrophobic environment, where a negative charge is energetically unfavorable. Therefore the pKa value for this glutamate side chain increases, which then decreases the extent of the deprotonation of that side chain. This is very important in the mechanism of lysozyme activity, which requires that one of the side chains be charged (deprotonated) and the other be uncharged (protonated) at the same time.

In a second example, the serine in the active sites of serine proteases has a much different acid-base behavior than other serines normally found in proteins (9). The normal pKa value for the hydroxyl group on the serine side chain is greater than 15, meaning that this group is not found in an ionized state in most proteins. In serine proteases, the interaction of the active site serine with nearby histidine and aspartate side chains (the so-called catalytic triad) leads to the ionization of the serine hydroxyl group. Meanwhile, the pKa value is reduced from about 15 to a value closer to 7 or 8. This example makes it clear that the microenvironment of an individual amino acid side chain can change its ionization behavior.

Other effects on the pKa of an amino acid side chain can be seen when certain amino acids are positioned next to each other. For example, a typical arginine residue, which is basic, will have a pKa of about 12.5 (Table 1 below) and carry a full +1 charge in the physiological pH range. However, when two of these basic arginine residues are adjacent in a protein sequence the pKa values will decrease, due to repulsion between the


arginine side chains to become less ionized and carry only a fractional positive charge.

Table 1 below lists the typical pKa values for ionizable groups in proteins (9).

Group                           Typical pKa
Terminal α-carboxyl group       3.1
Aspartic acid, Glutamic acid    4.1
Histidine                       6.0
Terminal α-amino group          8.0
Cysteine                        8.3
Tyrosine                        10.9
Lysine                          10.8
Arginine                        12.5

Table 1. These are pKa values that are commonly found for these side chains when they are part of a protein. The pKa values for these side chains may be quite different for the free amino acid in solution. pKa values also depend on temperature, ionic strength, and the microenvironment of the ionizable group (9).
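To make the fixed-pKa calculation concrete, the following is a minimal Python sketch of how a net charge and a predicted pI could be computed from the Table 1 values. It is an illustration only and not the prediction tool described in this thesis: the use of the Henderson-Hasselbalch relation for each group, the assumption of exactly one free amino terminus and one free carboxyl terminus, the bisection search, and the names `net_charge` and `predict_pi` are all assumptions made for the example.

```python
# Sketch only: fractional charges from the Table 1 pKa values.
# Assumptions: Henderson-Hasselbalch per group, one N-terminus and one
# C-terminus per protein, no microenvironment corrections.

PKA_POSITIVE = {"nterm": 8.0, "H": 6.0, "K": 10.8, "R": 12.5}
PKA_NEGATIVE = {"cterm": 3.1, "D": 4.1, "E": 4.1, "C": 8.3, "Y": 10.9}


def net_charge(sequence: str, ph: float) -> float:
    """Sum the fractional charges of the termini and ionizable side chains."""
    counts = {"nterm": 1, "cterm": 1}
    for residue in sequence.upper():
        counts[residue] = counts.get(residue, 0) + 1
    charge = 0.0
    for group, pka in PKA_POSITIVE.items():
        # Fraction of a basic group that is protonated (charge +1).
        charge += counts.get(group, 0) / (1.0 + 10 ** (ph - pka))
    for group, pka in PKA_NEGATIVE.items():
        # Fraction of an acidic group that is deprotonated (charge -1).
        charge -= counts.get(group, 0) / (1.0 + 10 ** (pka - ph))
    return charge


def predict_pi(sequence: str, tolerance: float = 0.001) -> float:
    """Bisect on pH until the bracket around zero net charge is narrow."""
    low, high = 0.0, 14.0
    while high - low > tolerance:
        mid = (low + high) / 2.0
        if net_charge(sequence, mid) > 0:
            low = mid   # still positive, so the pI lies at a higher pH
        else:
            high = mid
    return round((low + high) / 2.0, 2)
```

Because the net charge in this model only decreases as the pH rises, a simple bisection is enough to locate the zero crossing. A sequence with no ionizable side chains, such as a single glycine, comes out near the midpoint of the two terminal pKa values.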

As we began to consider the impact of amino acid sequence on ionization behavior of individual amino acid side chains, the need to create groups of amino acids based on their chemical and physical characteristics rather than concentrating on each individual amino acid became apparent. We elected to divide the amino acids into groups based on their chemical, functional, charge, and hydrophobic characteristics. Dividing sets of amino acids into these groups enables us to use smaller alphabets based on these characteristics as opposed to simply using the normal 20-letter amino acid alphabet in our calculations. We used these property groups to rewrite protein sequences into an alternative alphabet that is much smaller than the normal amino acid alphabet of 20 characters (12).


on which amino acids fall under what particular types. The Methods section contains examples of protein sequences that have been translated into these different alphabets.

Alphabet Type (size)   Code   Meaning       Amino Acids with that Code
Charge (3)             A      Negative      D, E
                       C      Positive      H, K, R
                       N      No charge     A, C, F, G, I, L, M, N, P, Q, S, T, V, W, Y
Chemical (8)           A      Acidic        D, E
                       L      Aliphatic     A, G, I, L, V
                       M      Amide         N, Q
                       R      Aromatic      F, W, Y
                       C      Basic         R, H, K
                       H      Hydroxyl      S, T
                       I      Imino         P
                       S      Sulphur       C, M
Functional (4)         A      Acidic        D, E
                       C      Basic         H, K, R
                       H      Hydrophobic   A, F, I, L, M, P, V, W
                       P      Polar         C, G, N, Q, S, T, Y
Hydrophobic (2)        I      Hydrophobic   A, F, I, L, M, P, V, W
                       O      Hydrophilic   C, D, E, G, H, K, N, Q, R, S, T, Y

Table 2. Description of four abbreviated amino acid sequence alphabets: Charge, Chemical, Functional, and Hydrophobic (12). Shown are the new alphabet codes used for each different alphabet, what each code represents in terms of properties of amino acids, and the specific amino acids that are included in each property.

Proteins that have a significant difference between their predicted pI/MW (obtained using similar algorithms as mentioned above) and their experimental pI/MW will be studied. As mentioned before, certain amino acids that occur in a particular


of certain proteins (those with large ΔpI values) that do not occur in the other proteins whose pI values were accurately predicted are important. They may lead to a more accurate prediction of the pI and MW of all proteins from their amino acid compositions.

Methods

Forming the data set

The ExPASy Server's SWISS-2DPAGE database (13) provides extensive 2-D gel information for human, mouse, Arabidopsis thaliana, Dictyostelium discoideum, E. coli, Saccharomyces cerevisiae, and Staphylococcus aureus (N315), which are also cross-referenced in Swiss-Prot. Each protein in the database is collected and annotated from experimental 2-D gels read from reference maps. The database for this project contains 336 proteins of the E. coli proteome characterized by five different research groups (14-18). It was decided that the compilation of pI/MW sets for these proteins should be separated according to each research group since experimental conditions varied among them. The proteins contributed by the Phillips et al. (14), Pasquali et al. (15), and Vanbogelen et al. (16) groups were ignored because these proteins were also characterized by the Tonella et al. (17) and Yan et al. (18) groups. Two sets were created; the first contains 228 of the proteins denoted by Tonella et al. and the second contains 153 of the proteins denoted by Yan et al. The first set was also separated based on the pH range used for isoelectric focusing (pH 4-5, 4.5-5.5, 5-6, 5.5-6.7, 6-9, and 6-11). We concentrated on the Tonella et al. set because it covered more than 70% of the E. coli


We then matched the pI/MW data for each protein with its FASTA sequence. This allows us to compare experimental pI/MW values with predicted pI/MW values. ExPASy provides its own tool for predicting pI/MW which requires a list of Swiss-Prot protein IDs as its input of proteins (19). We have also developed our own tool that includes a pI/MW prediction which requires input of FASTA format sequences, GenBank format, or Protein Data Bank format (11). Both of these prediction tools are based (and especially pI for both tools) on a calculation using pKa values of amino acids as described earlier in the introduction and by Bjellqvist et al. (19). The first step was to retrieve the 2-D gel information for all of these proteins. ExPASy provides a way to get the data from each 2-D gel in a tab delimited format that includes each spot (one protein can have multiple spots on a gel). Having this data in a tab delimited format gave a far greater ease of use when later performing any type of analysis on the data (such as comparing experimental pI to predicted pI). The fields contained in these files included: gene name, protein description, SWISS-2DPAGE Serial Number, SWISS-2DPAGE Accession Number, identification method (gel matching, microsequencing, or peptide mass fingerprinting), experimental pI, experimental MW, and references.

A list of Swiss-Prot protein IDs (2DPAGE Accession Number, e.g. P00274) was then made for each of the gels. This list of proteins was then used to retrieve a FASTA file of the proteins from each gel (some proteins were repeated for multiple spots). The Swiss-Prot IDs were submitted to the NCBI tool for retrieving sequences at http://www.ncbi.nlm.nih.gov/entrez/batchentrez.cgi?db=Protein. The sequences were downloaded in FASTA format to be used in our prediction tool. Batch retrieval at NCBI


whatever reason, the initial methionine residue when retrieving in FASTA format. The FASTA file for the set of proteins from each gel was then fed into our tool, where the output can be conveniently recorded to a Microsoft Excel file. However, problems occurred when using the FASTA file from NCBI in our tool since it would order the file based on GenBank accession number and not by Swiss-Prot ID, which was needed to match the tab delimited file for each gel. This was solved by removing the GenBank accession number (leaving just the Swiss-Prot ID) from each protein entry in each respective FASTA file using a simple Perl script. This was facilitated by a few regular expressions, most notably: ":%s/gi|\d*|sp|//" (quotations excluded). The pI/MW predict tool at ExPASy (19) was not quite as easy to use since it does not output into a format that can be imported into Excel readily. The output file was edited using the following regular expression: ":%s/\s\s*/\t/g" (quotations excluded), which transformed it into a tab delimited text file, allowing it to be easily manipulated in Excel. Nevertheless the "Compute pI/MW tool" at ExPASy (19) gave strikingly similar results to our tool.
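The two clean-up substitutions described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the script used in this work: the header layout `>gi|<number>|sp|<Swiss-Prot ID>|...` is inferred from the first substitution, the second pattern is taken to mean "replace each run of whitespace with one tab", and the function names are hypothetical.

```python
import re

# Assumed header layout: ">gi|<number>|sp|<Swiss-Prot ID>|<entry name> ..."
GI_PREFIX = re.compile(r"gi\|\d*\|sp\|")
# One or more whitespace characters, to be collapsed into a single tab.
WHITESPACE_RUN = re.compile(r"\s\s*")


def strip_gi_prefix(header: str) -> str:
    """Drop the 'gi|<number>|sp|' part so the Swiss-Prot ID leads the header."""
    return GI_PREFIX.sub("", header, count=1)


def to_tab_delimited(line: str) -> str:
    """Collapse each run of whitespace in an output line into one tab.

    Leading and trailing whitespace is trimmed first (a choice made for
    this sketch so that no empty first or last column is produced).
    """
    return WHITESPACE_RUN.sub("\t", line.strip())
```

For a made-up header such as `>gi|12345|sp|P00274|THIO_ECOLI`, `strip_gi_prefix` would leave `>P00274|THIO_ECOLI`, so entries can be matched on the Swiss-Prot ID. The pipe characters must be escaped in Python because an unescaped `|` means alternation there.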

Both experimental data sets derived from the Tonella data (17) and the Yan (DIGE) data (18) were compared with both pI/MW prediction tools and the results can be seen in the Excel files at http://www.rit.edu/~mac3948/E2D/Ecoli/.

Experimental and predicted pI values

Looking at the compiled data set it was noticeable that some predicted pI values were far different from experimental pI values. Some proteins differed in predicted pI versus experimental pI by as much as 1.86 pH units (e.g. P06128, Phosphate-binding


pI was exactly the same as the experimental pI (e.g. P06960, Ornithine carbamoyltransferase chain F (OTCase-2), see Appendix A).

To better characterize these discrepancies across all of the proteins a simple calculation was performed:

Experimental pI - predicted pI = Delta (Δ) pI     (Eq. 1)

The difference in experimental pI and predicted pI will be referred to as ΔpI in this paper. The main focus of this project is to identify potential causes of varying ΔpI values.

The data set was then broken down into roughly thirds. The first subset of proteins consisted of 60 proteins where the ΔpI value was less than 0.1. Another subset held 58 proteins of ΔpI values greater than 0.3, but less than 0.7 (0.3 < ΔpI < 0.7). The last third was put into a subset of 50 proteins where the ΔpI value was greater than 0.7. Refer to the tables in Appendix A for a list of the proteins in each ΔpI subset.
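The subset assignment can be illustrated with a short Python sketch. It assumes that the thresholds are applied to the magnitude of the difference from Eq. 1, which the text does not state explicitly, and the subset labels and function names are hypothetical.

```python
from typing import Optional


def delta_pi(experimental_pi: float, predicted_pi: float) -> float:
    """Eq. 1: experimental pI minus predicted pI."""
    return experimental_pi - predicted_pi


def assign_subset(delta: float) -> Optional[str]:
    """Place a protein in one of the three subsets.

    Assumption: the thresholds apply to the absolute value of the
    difference. Proteins falling between the ranges (for example
    0.1 to 0.3) belong to no subset and yield None.
    """
    magnitude = abs(delta)
    if magnitude < 0.1:
        return "low"
    if 0.3 < magnitude < 0.7:
        return "middle"
    if magnitude > 0.7:
        return "high"
    return None
```

Leaving gaps between the ranges keeps the three subsets clearly separated, which is consistent with the subset sizes quoted above summing to fewer proteins than the full set.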

The following sections will provide the sequential steps that were performed on the analysis of these data subsets. It starts with a naive approach to handling the data that deals with simply calculating raw frequencies of the 20 amino acids. The next section explains how we used the four different alphabets to analyze the data subsets, still focusing on individual amino acid frequencies. The dipeptide approaches are described next, followed by a final section that summarizes how the whole process flows together.

Extracting useful information from collected subset sequences

Amino acid frequency analysis (the naive approach)

There is a naive approach to finding a significant difference between each of the subsets of ΔpI ranges. This method involves determining the counts of each amino acid


acid between the ΔpI subsets. If a significant difference for any amino acid does exist between any of the ΔpI subsets, then this would be of great interest. It would then be possible to adjust a pI prediction algorithm based on individual amino acid frequency values and predict pI values that were closer to experimental values.

The first step in going about the naive approach was to start from the list of proteins for each ΔpI subset. As previously described, the batch sequence retrieval at the NCBI was used to obtain a FASTA file that contained each sequence included in each ΔpI subset. A Perl program was then written to count the number of amino acids in each sequence from a FASTA file and calculate the frequency of each, outputting a tab delimited file displaying all of the frequencies for each sequence. The code of this program can be found in Appendix B - aacounts.pl.
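As an illustration of the counting step, here is a minimal Python sketch of per-sequence amino acid frequencies read from FASTA-formatted lines. It is not the aacounts.pl program from Appendix B; the function names are hypothetical and the parser handles only the basic header-plus-sequence layout.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)  # sequence data may wrap over several lines
    if header is not None:
        yield header, "".join(chunks)


def residue_frequencies(sequence: str) -> Dict[str, float]:
    """Return the fraction of the sequence made up by each residue."""
    counts = Counter(sequence.upper())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {residue: n / total for residue, n in counts.items()}
```

Writing one tab delimited row per sequence, as described above, would then be a matter of joining the frequencies for the 20 residues in a fixed column order.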

Another Perl program was written which concatenates each separate sequence into one long sequence. This allows one to look at the amino acid frequencies encompassing each ΔpI subset as a whole instead of protein by protein. The program also makes sure that each protein sequence is kept separate and that the header line of each sequence is removed (see Appendix B - makeComposite.pl), which will be shown to be important shortly when looking at two amino acids that occur one right after the other (see dipeptide approach).

Frequency of amino acids (alphabets approach)

Charge alphabet

A more sophisticated analysis of amino acid frequency can be done if the amino


side chains of the amino acids can be used to assign them to four abbreviated amino acid alphabets (Charge, Chemical, Functional, and Hydrophobic). The Charge alphabet (see Table 2) is based on whether the side chain of an amino acid can have a positive or negative charge, or is simply uncharged (neutral). Glutamic Acid (Glu / E) and Aspartic Acid (Asp / D) are the only amino acids that contain the negatively charged carboxyl group (COO⁻). Therefore, in the Charge alphabet they are grouped together and given the code A. Likewise, Lysine (Lys / K) and Arginine (Arg / R) are amino acids that contain positively charged groups (the lysine side chain contains an ε-amino group and arginine has a guanidino group). In the Charge alphabet they are grouped together with the code C. Histidine (His / H) is also grouped into the positively charged amino acid group because protonation of the nitrogen on its side chain occurs easily. The remaining 15 amino acids have side chains which normally do not demonstrate charge behavior in proteins; they are grouped together and given the code N. An example of using the Charge alphabet can be seen below:

ACDEFGH (original sequence)
↓
NNAANNC (Charge alphabet sequence)

Chemical alphabet

The Chemical alphabet incorporates two groupings, acidic and basic, with codes A and C, respectively. These groupings are analogous to the A and C groupings in the Charge alphabet for the same reasons. The Chemical alphabet characterizes the remaining 15 amino acids based on more than their lack of a charge. Asparagine (Asn / N) and Glutamine (Gln / Q) are amino acids that contain an amide (CONH2) and are


/ W), and Tyrosine (Tyr / Y) contain aromatic rings (code R). Serine (Ser / S) and Threonine (Thr / T) contain the hydroxyl group (OH) on their side chains (code H). Proline (Pro / P) contains an imino group (>C=NH) on its side chain (code I). Finally, the sulfur-containing amino acids, Cysteine (Cys / C) and Methionine (Met / M), are grouped together with code S. An example of using the Chemical alphabet can be seen below:

ACDEFGHNPS (original sequence)
↓
LSAARLCMIH (Chemical alphabet sequence)

Functional alphabet

The Functional alphabet again incorporates the A (acidic) and C (basic) groups as did the Charge and Chemical alphabets. The Functional alphabet divides the remaining amino acids into 2 groups, H (hydrophobic) and P (polar), based on whether the amino acid is hydrophobic (such as Alanine) or polar (such as Cysteine). An example of using the Functional alphabet can be seen below:

ACDEFGH (original sequence)
↓
HPAAHPC (Functional alphabet sequence)

Hydrophobic alphabet

The Hydrophobic alphabet is similar to the latter half of the Functional alphabet. It groups amino acids based only on hydrophobicity. Amino acids that are hydrophilic (such as Cysteine) are given the code I. Amino acids that are hydrophobic (such as Alanine) are given the code O. An example of using the Hydrophobic alphabet can be seen below:

ACDEFGH (original sequence)
↓
OIIIOII (Hydrophobic alphabet sequence)

Perl programs were written that convert normal sequences into each of the four alphabets just described (see charge.pl, chemical.pl, functional.pl, and hydro.pl in Appendix B). The programs also calculate and display the frequency of each alphabetic code that is chosen.
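The translation into a reduced alphabet can be sketched in a few lines of Python. This is an illustration only, not the Perl programs from Appendix B and not the Bio::Tools::OddCodes implementation: the groupings for the Charge, Chemical, and Functional alphabets are copied from Table 2, the two-letter Hydrophobic alphabet would follow the same pattern, and the names `build_table` and `translate` as well as the use of 'X' for unrecognized letters are assumptions of this sketch.

```python
# Groupings copied from Table 2: {code: residues that receive that code}.
CHARGE = {"A": "DE", "C": "HKR", "N": "ACFGILMNPQSTVWY"}
CHEMICAL = {"A": "DE", "L": "AGILV", "M": "NQ", "R": "FWY",
            "C": "RHK", "H": "ST", "I": "P", "S": "CM"}
FUNCTIONAL = {"A": "DE", "C": "HKR", "H": "AFILMPVW", "P": "CGNQSTY"}


def build_table(groups):
    """Invert {code: residues} into a per-residue lookup table."""
    return {residue: code
            for code, residues in groups.items()
            for residue in residues}


def translate(sequence, groups):
    """Rewrite a sequence in a reduced alphabet.

    Letters outside the 20 standard residues become 'X' (a choice made
    for this sketch).
    """
    table = build_table(groups)
    return "".join(table.get(residue, "X") for residue in sequence.upper())
```

With these tables, `translate("ACDEFGH", CHARGE)` gives NNAANNC and `translate("ACDEFGH", FUNCTIONAL)` gives HPAAHPC, matching the worked examples for those two alphabets. Code frequencies can then be computed on the translated string exactly as for the normal alphabet.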

Frequency of amino acids (dipeptide approach)

The problem of abnormal side chain pKa values of certain amino acids affecting the overall charge of a protein had still not been dealt with up to this point. All that had been considered was the sum of a set of strict pKa values for each amino acid, without taking into account any changes that might occur due to certain amino acids being next to other amino acids in sequence. The approach to solving this problem was to examine every "dipeptide" in the three ΔpI subsets. A sequence of length 7 has 6 dipeptides. For example,

Sequence:    ABCABBC
Dipeptides:  AB, BC, CA, AB, BB, BC

Dipeptide   Count   Frequency
AB          2       0.333
BC          2       0.333
CA          1       0.167
BB          1       0.167

The frequency at which each dipeptide occurs in a particular sequence is of interest, particularly when they are considered in each ΔpI subset. A Perl program was


dipeptide in the sequences of the FASTA file that is input (see Appendix B - dipeps.pl for dipeptides output in increasing order or dipepsA.pl for dipeptides output alphabetically from AA ... VV). As was the case earlier with the normal amino acid alphabet, the number of different dipeptides (20 x 20 = 400 for the normal alphabet) became problematic. The same dipeptide technique was applied to sequences after converting them into the Charge, Chemical, Functional, and Hydrophobic alphabets to alleviate this problem.
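The dipeptide counting itself is a sliding window of width two over the sequence. The following Python sketch illustrates it with the ABCABBC example used above; it is not the dipeps.pl program from Appendix B, and the function names are hypothetical.

```python
from collections import Counter
from typing import Dict


def dipeptide_counts(sequence: str) -> Counter:
    """Count every overlapping pair of adjacent letters in one sequence."""
    return Counter(sequence[i:i + 2] for i in range(len(sequence) - 1))


def dipeptide_frequencies(sequence: str) -> Dict[str, float]:
    """Divide each pair count by the total number of pairs (length - 1)."""
    counts = dipeptide_counts(sequence)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {pair: round(n / total, 3) for pair, n in counts.items()}
```

For the sequence ABCABBC this yields counts of 2 for AB and BC and 1 for CA and BB, with frequencies of 0.333 and 0.167 respectively. The same functions work unchanged on a sequence that has first been rewritten in one of the reduced alphabets, where the number of possible pairs is far smaller than 400.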

Combining an entire ΔpI subset of FASTA sequences into one long sequence (using makeComposite.pl - see Appendix B) also became problematic. To count the number of dipeptides in a set of sequences that has been combined into one long sequence, special attention needs to be paid so that the last amino acid in one sequence and the first amino acid in the next sequence are not counted as a dipeptide. The format of the output file from makeComposite.pl handles this problem by replacing each accession line with a blank newline. The other programs can now use this formatted FASTA file so that the dipeptide counts are just as accurate as naive and alphabet counts.
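The boundary handling can be illustrated with a small Python sketch in which each header line becomes a blank separator and dipeptides are counted within each record only. This is a sketch of the idea, not the makeComposite.pl program from Appendix B; the function names are hypothetical.

```python
from collections import Counter
from typing import Iterable, List


def make_composite(fasta_lines: Iterable[str]) -> List[str]:
    """Replace each header line with a blank line and keep sequence lines."""
    return ["" if line.startswith(">") else line.strip()
            for line in fasta_lines]


def composite_dipeptides(composite_lines: Iterable[str]) -> Counter:
    """Count dipeptides per record so pairs never span a blank separator."""
    counts, current = Counter(), []
    # A trailing blank line flushes the final record.
    for line in list(composite_lines) + [""]:
        if line:
            current.append(line)
            continue
        sequence = "".join(current)  # wrapped lines of one record are joined
        counts.update(sequence[i:i + 2] for i in range(len(sequence) - 1))
        current = []
    return counts
```

In a two-record input where the first sequence ends in C and the second starts with A, the pair CA is not counted, because the blank line between the records ends the sliding window.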

Pipeline Workflow

So far there have been stages at which the frequency of an amino acid, group of amino acids (coded according to the four alphabets), dipeptide, or grouped dipeptide (coded according to the four alphabets) has been examined. The process of transforming the data to reach each of these stages may appear somewhat confusing. Figure 2 below diagrams how to go from an initial set of FASTA sequences (for each ΔpI subset) to each stage of analysis. The flow in taking the naive approach would go from FASTA


sequence to makeComposite.pl to aacounts.pl and then analysis. However, the flow for examining dipeptides with a functional alphabet is more complex. It begins by transferring the FASTA sequence to makeComposite.pl to functional.pl to dipeps.pl (or dipepsA.pl) followed by analysis. Table 3 below gives a brief description of each program used in this pipeline workflow (for a more detailed description and code of each program see Appendix B).

[Figure 2: workflow diagram with the nodes ΔpI subset FASTA file, makeComposite.pl, aacounts.pl, charge.pl, chemical.pl, functional.pl, hydro.pl, dipeps.pl or dipepsA.pl, and analysis.]

Figure 2. Workflow diagram that shows how to get to each stage of analysis (naive, alphabets, dipeptides).


Program            Description
aacounts.pl        Counts the number of each amino acid (normal alphabet) in a sequence from a FASTA file and determines the frequency of each. Output is to FASTAfilename.aacounts
charge.pl          Converts the amino acids from the sequences in a FASTA file into a 3-letter alphabet using the charge() method in Bio::Tools::OddCodes (12). It then counts the number of each code for each sequence as well as each frequency.
chemical.pl        Converts the amino acids from the sequences in a FASTA file into an 8-letter alphabet using the chemical() method in Bio::Tools::OddCodes (12). It then counts the number of each code for each sequence as well as each frequency.
dipeps.pl          Counts the number of each different amino acid pair for each sequence in the given FASTA files. It displays each pair in order from highest frequency to lowest.
dipepsA.pl         Counts the number of each different amino acid pair for each sequence in the given FASTA files. It displays each pair in alphabetical order (AA ... VV).
functional.pl      Converts the amino acids from the sequences in a FASTA file into a 4-letter alphabet using the functional() method in Bio::Tools::OddCodes (12). It then counts the number of each code for each sequence as well as each frequency.
hydro.pl           Converts the amino acids from the sequences in a FASTA file into a 2-letter alphabet using the hydrophobic() method in Bio::Tools::OddCodes (12). It then counts the number of each code for each sequence as well as each frequency.
makeComposite.pl   Converts FASTA files of multiple sequences into a single (composite) sequence. This composite sequence is then able to be used with other programs listed here.

Table 3. Description of the programs used in this pipeline workflow. Appendix B


Results

Naive approach

The initial naive approach to analyzing the data set was done to determine the counts of each amino acid (using the normal alphabet) in each ΔpI subset (ΔpI < 0.1; 0.3 < ΔpI < 0.7; ΔpI > 0.7) and compare the relative frequency of occurrence for each amino acid between the ΔpI subsets. A comparison of the frequencies between the ΔpI < 0.1 subset and the 0.3 < ΔpI < 0.7 subset is shown in Figure 3. A similar comparison between the ΔpI < 0.1 subset and the ΔpI > 0.7 subset is displayed in Figure 4.

[Chart: Frequencies of Amino Acids in ΔpI < 0.1 and (0.3 < ΔpI < 0.7)]

Figure 3. Frequency of Individual Amino Acids in Two ΔpI Subsets. The X axis labels represent the one letter abbreviations of the amino acids. Shown in blue is the ΔpI < 0.1 subset and shown in yellow is the 0.3 < ΔpI < 0.7 subset. The ΔpI < 0.1 subset consists of 60 proteins which comprise 22472 total amino acids. The 0.3 < ΔpI < 0.7 subset consists of 58 proteins which comprise 17906 total amino acids. More information about each individual protein in these ΔpI subsets can be seen in Appendix A.


FrequenciesofAmino Acidsin Apl<_0.1 _andApl>0.7

Figure 4.

_Frequency

ofIndividual Amino Acids in Two Apl Subsets. The Xaxis labelsrepresenttheoneletterabbreviations oftheaminoacids. Shown in blueare

istheApl<0. 1 subsetand showninyellowistheApl>_0.7 subset. The Apl<0. 1 subsetconsists of60proteinswhichcomprise22472 totalamino acids. TheApl> 0.7 subsetconsists of50proteinswhich comprise 15581 totalamino acids. More informationabout eachindividualproteinintheseAplsubsets canbeseenin Appendix A.

Alphabets approach

The next step in analysis was to convert each of the ΔpI subsets into a sequence that utilizes the four alphabets. This decreases the size of the amino acid alphabet and reduces the number of variables being examined. The different alphabets are summarized in Table 2.

-Charge

Using the Charge alphabet, a comparison of the frequencies between the ΔpI < 0.1 subset and the 0.3 < ΔpI < 0.7 subset is shown in Figure 5. Again using the Charge alphabet, a similar comparison between the ΔpI < 0.1 subset and the ΔpI > 0.7 subset is displayed in Figure 6.

[Chart: Frequencies of Amino Acids (Charge alphabet) in ΔpI < 0.1 and (0.3 < ΔpI < 0.7). X axis: Amino Acid (charge alphabet), codes C, A, N.]

Figure 5. Frequency of Amino Acids Using the Charge Alphabet in Two ΔpI Subsets.

[Chart: Frequencies of Amino Acids (Charge alphabet) in ΔpI < 0.1 and ΔpI > 0.7. X axis: Amino Acid (charge alphabet), codes C, A, N.]

Figure 6. Frequency of Amino Acids Using the Charge Alphabet in Two ΔpI Subsets.

-Chemical

Using the Chemical alphabet, a comparison of the frequencies between the ΔpI < 0.1 subset and the 0.3 < ΔpI < 0.7 subset is shown in Figure 7. Figure 8 displays the same comparison between the ΔpI < 0.1 subset and the ΔpI > 0.7 subset.

[Chart: Frequencies of Amino Acids (Chemical alphabet) in ΔpI < 0.1 and (0.3 < ΔpI < 0.7). X axis: Amino Acid (chemical alphabet).]

Figure 7. Frequency of Amino Acids Using the Chemical Alphabet in Two ΔpI Subsets.

[Chart: Frequencies of Amino Acids (Chemical alphabet) in ΔpI < 0.1 and ΔpI > 0.7. X axis: Amino Acid (chemical alphabet).]

-Functional

Using the Functional alphabet, a comparison of the frequencies between the ΔpI < 0.1 subset and the 0.3 < ΔpI < 0.7 subset is shown in Figure 9. Again using the Functional alphabet, a similar comparison between the ΔpI < 0.1 subset and the ΔpI > 0.7 subset is displayed in Figure 10.

[Chart: Frequencies of Amino Acids (Functional alphabet) in ΔpI < 0.1 and (0.3 < ΔpI < 0.7). X axis: Amino Acid (functional alphabet).]

Figure 9. Frequency of Amino Acids Using the Functional Alphabet in Two ΔpI Subsets.

[Chart: Frequencies of Amino Acids (Functional alphabet) in ΔpI < 0.1 and ΔpI > 0.7. X axis: Amino Acid (functional alphabet).]

Figure 10. Frequency of Amino Acids Using the Functional Alphabet in Two ΔpI Subsets.

-Hydrophobic

Using the Hydrophobic alphabet, a comparison of the frequencies between the ΔpI < 0.1 subset and the 0.3 < ΔpI < 0.7 subset is shown in Figure 11. Again using the Hydrophobic alphabet, a similar comparison between the ΔpI < 0.1 subset and the ΔpI > 0.7 subset is displayed in Figure 12.

[Chart: Frequencies of Amino Acids (Hydrophobic alphabet) in ΔpI < 0.1 and (0.3 < ΔpI < 0.7). X axis: Amino Acid (hydrophobic alphabet), codes I, O.]

Figure 11. Frequency of Amino Acids Using the Hydrophobic Alphabet in Two ΔpI Subsets.

[Chart: Frequencies of Amino Acids (Hydrophobic alphabet) in ΔpI < 0.1 and ΔpI > 0.7. X axis: Amino Acid (hydrophobic alphabet), codes I, O.]

Figure 12. Frequency of Amino Acids Using the Hydrophobic Alphabet in Two ΔpI Subsets.

Dipeptide approach

Using a more sophisticated method that looks at dipeptides of a sequence gave an [...] is similar to the naive approach in that it just examines dipeptides using the normal amino acid alphabet. This results in up to 400 different dipeptides (there may be slightly fewer than 400 dipeptides in a given subset, since not every possible dipeptide necessarily occurs). The difference in frequency of every dipeptide between ΔpI subsets was also calculated ("Delta frequency" or "Δ%"). In other words, a Delta % of 100 would mean that a certain dipeptide occurred 2 times as much in one subset compared to another subset. The differences, or "Delta %" values, can be seen in Figure 13 when comparing the ΔpI < 0.1 subset and the 0.3 < ΔpI < 0.7 subset. Figure 14 shows the similar Delta % values when comparing the ΔpI < 0.1 subset and the ΔpI > 0.7 subset. To better explain Figures 13-16, consider the bar indicated by the arrow in Figure 13. This bar represents the 11 times that there was a Δ% value between 100% and 150% when comparing dipeptide frequencies in the two different ΔpI sets.
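A minimal Python sketch of the dipeptide counting and Delta % calculation follows. It is an illustration, not the pipeline code, and the function names are hypothetical. Two assumptions are made: dipeptides are counted as overlapping windows (consistent with each subset containing one fewer dipeptide per protein than amino acids, e.g. 22472 - 60 = 22412), and Delta % is the relative change 100 * (fA - fB) / fB, which agrees with the statement that a Delta % of 100 means twice the frequency. Which subset serves as the reference is also an assumption.

```python
from collections import Counter


def dipeptide_counts(sequences):
    """Count overlapping dipeptides; a sequence of length n yields n - 1 of them."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        counts.update(seq[i:i + 2] for i in range(len(seq) - 1))
    return counts


def to_percent(counts):
    """Convert raw counts to percent of all dipeptides in the subset."""
    total = sum(counts.values())
    return {dp: 100.0 * n / total for dp, n in counts.items()}


def delta_percent(freq_a, freq_b):
    """Assumed definition: relative change of subset A versus subset B, in percent.

    Dipeptides missing from either subset are skipped in this sketch.
    """
    return {dp: 100.0 * (freq_a[dp] - freq_b[dp]) / freq_b[dp]
            for dp in freq_a if dp in freq_b}


# Hypothetical toy subsets, chosen so that KK is twice as frequent in subset A.
freq_a = to_percent(dipeptide_counts(["KKA"]))     # KK 50%, KA 50%
freq_b = to_percent(dipeptide_counts(["KKAAA"]))   # KK 25%, KA 25%, AA 50%
delta = delta_percent(freq_a, freq_b)
print(delta["KK"])
```

Binning the resulting Delta % values into ranges such as 100 to 150 would give the kind of density plot described for Figures 13 and 14.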

[Chart: Densities of Delta % Values in ΔpI < 0.1 and 0.3 < ΔpI < 0.7 Using a Normal Amino Acid Alphabet]

Figure 13. Density of Delta % Values of Dipeptides in Two ΔpI Subsets. The ΔpI < 0.1 subset consists of 60 proteins which comprise 22412 total dipeptides. The 0.3 < ΔpI < 0.7 subset consists of 58 proteins which comprise 17848 total dipeptides. More information about each individual protein in these ΔpI subsets can be seen in Appendix A.

[Chart: Densities of Delta % Values in ΔpI < 0.1 and ΔpI > 0.7 Using a Normal Amino Acid Alphabet. X axis: Delta % range, bins >25, >50, >75, >100, >150, >200, >300, >400.]

Figure 14. Density of Delta % Values of Dipeptides in Two ΔpI Subsets. The ΔpI < 0.1 subset consists of 60 proteins which comprise 22412 total dipeptides. The ΔpI > 0.7 subset consists of 50 proteins which comprise 15531 total dipeptides. More information about each individual protein in these ΔpI subsets can be seen in Appendix A.

Dipeptide Threshold

A similar analysis was performed on the same ΔpI subsets where dipeptides that had a very low frequency (whose Delta % values may change too rapidly, see Discussion for an elaboration) were monitored. A frequency of occurrence threshold value of 0.1% had to be met for dipeptides. In other words, if a dipeptide occurred so infrequently (under 0.1% of the total number of dipeptides) then it was eliminated. The remaining dipeptides were counted and the Delta % values comparing the ΔpI < 0.1 subset and the 0.3 < ΔpI < 0.7 subset can be seen in Figure 15. Likewise, the comparison for the ΔpI < 0.1 subset and the ΔpI > 0.7 subset can be seen in Figure 16. Dipeptides that were found in the extreme positive or negative ranges of these figures are indicated by the one letter amino acid codes. For instance, the dipeptide RR (arginine-arginine) in Figure 15 was found much less frequently in the ΔpI < 0.1 dataset than in the 0.3 < ΔpI < 0.7 [...]

[Chart: Densities of Delta % Values in ΔpI < 0.1 and 0.3 < ΔpI < 0.7 Using a Normal Amino Acid Alphabet (where frequency of dipeptide must be above 0.1). X axis: Delta % range and particular dipeptides, bins <-50, <-40, <-30, <-20, <-10, <0, >0, >10, >20, >30, >40, >50, >60, >75, >100.]

Figure 15. Density of Delta % Values of Dipeptides in Two ΔpI Subsets with a Threshold of 0.1%.

[Chart: Densities of Delta % Values in ΔpI < 0.1 and ΔpI > 0.7 Using a Normal Amino Acid Alphabet (where frequency of dipeptide must be above 0.1). X axis: Delta % range and particular dipeptides.]

Figure 16. Density of Delta % Values of Dipeptides in Two ΔpI Subsets with a Threshold of 0.1%.
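The 0.1% threshold can be sketched in a few lines of Python. This is a hedged illustration rather than the code used for Figures 15 and 16: the function name apply_threshold and the toy counts are hypothetical, and whether a dipeptide had to pass the threshold in one subset or in both is not stated, so the sketch filters a single subset.

```python
def apply_threshold(counts, min_percent=0.1):
    """Keep dipeptides whose share of all dipeptides in the subset is at least min_percent."""
    total = sum(counts.values())
    return {dp: n for dp, n in counts.items()
            if 100.0 * n / total >= min_percent}


# Hypothetical counts that sum to 22412, the size of the ΔpI < 0.1 subset.
# 0.1% of 22412 is 22.412, so 22 occurrences fall just short and 23 pass.
counts = {"LL": 22367, "WC": 22, "CW": 23}
kept = apply_threshold(counts)
print(sorted(kept))
```

With these made-up counts, WC (22 of 22412, about 0.098%) is dropped while CW (23 of 22412, about 0.103%) is kept.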


Dipeptide using Alphabets

The final step in analysis was to combine the alphabet and dipeptide approaches. Using the smaller alphabets dramatically reduced and condensed the results as compared to using the normal alphabet, which creates 400 possible dipeptides.
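Combining the two approaches amounts to recoding each sequence before counting dipeptides. The Python sketch below illustrates this for the Charge alphabet. The codes A (negative), C and N appear in this work, but the residue grouping used here (D and E as A; H, K and R as C; everything else as N) is an assumption based on the usual Bio::Tools::OddCodes convention and should be checked against Table 2. The function names are hypothetical.

```python
from collections import Counter

# Assumed residue grouping for the Charge alphabet:
# A = negatively charged, C = positively charged, N = neutral (the default).
CHARGE = {**dict.fromkeys("DE", "A"), **dict.fromkeys("HKR", "C")}


def to_charge_alphabet(seq):
    """Recode a protein sequence into the 3-letter Charge alphabet."""
    return "".join(CHARGE.get(ch, "N") for ch in seq.upper())


def alphabet_dipeptides(sequences, encode):
    """Count overlapping dipeptides after recoding each sequence with encode()."""
    counts = Counter()
    for seq in sequences:
        coded = encode(seq)
        counts.update(coded[i:i + 2] for i in range(len(coded) - 1))
    return counts


# Hypothetical toy sequence: MDEKR is recoded to NAACC.
counts = alphabet_dipeptides(["MDEKR"], to_charge_alphabet)
print(counts)
```

A 3-letter alphabet yields at most 9 distinct dipeptides, which is why every dipeptide can be shown individually for the smaller alphabets.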

-Charge

Using the Charge alphabet, a comparison of the dipeptide frequencies between the ΔpI < 0.1 subset and the 0.3 < ΔpI < 0.7 subset is shown in Figure 17, as well as the Delta % values for each dipeptide. The same comparison is shown between the ΔpI < 0.1 subset and the ΔpI > 0.7 subset in Figure 18.

[Chart: Comparison of Dipeptides (based on charge characteristic) taken from ΔpI < 0.1 and 0.3 < ΔpI < 0.7. X axis: Dipeptide (charge alphabet).]

Figure 17. Frequencies of Charge Alphabet Dipeptides in Two ΔpI Subsets. Shown in blue are the frequencies of each dipeptide in the ΔpI < 0.1 subset and shown in yellow is the difference in frequency for each dipeptide between the ΔpI < 0.1 subset and the 0.3 < ΔpI < 0.7 subset.

[Chart: Comparison of Dipeptides (based on charge characteristic) taken from ΔpI < 0.1 and ΔpI > 0.7. X axis: Dipeptide (charge alphabet).]

Figure 18. Frequencies of Charge Alphabet Dipeptides in Two ΔpI Subsets. Shown in blue are the frequencies of each dipeptide in the ΔpI < 0.1 subset and shown in yellow is the difference in frequency for each dipeptide between the ΔpI < 0.1 subset and the ΔpI > 0.7 subset.

-Chemical

Using the Chemical alphabet, a comparison of the dipeptide frequencies between the ΔpI < 0.1 subset and the 0.3 < ΔpI < 0.7 subset is shown in Figure 19, as well as the Delta % values for each dipeptide. The same comparison is shown between the ΔpI < 0.1 subset and the ΔpI > 0.7 subset in Figure 20. The Chemical alphabet with dipeptides was sufficiently large that it was not possible to display all the possible dipeptide combinations in Figures 19 and 20. Instead only the density values were chosen to display.

[Chart: Densities of Delta % Values in ΔpI < 0.1 and (0.3 < ΔpI < 0.7) Using a Chemical Alphabet. X axis: Delta % range and particular dipeptides, bins <-20, <-10, <0, >0, >10, >20, >30, >40, >50, >60. Labeled dipeptides: SS(-28%), Al(-25%), AS(-24%), MS(-22%), IS(-20%), IC(43%), IM(48%), rt*(48%), RR(61%).]

Figure 19. Density of Delta % Values of Chemical Alphabet Dipeptides in Two ΔpI Subsets. The ΔpI < 0.1 subset consists of 60 proteins which comprise 22412 total dipeptides. The 0.3 < ΔpI < 0.7 subset consists of 58 proteins which comprise 17848 total dipeptides. More information about each individual protein in these ΔpI subsets can be seen in Appendix A.

[Chart: Densities of Delta % Values in ΔpI < 0.1 and ΔpI > 0.7 Using a Chemical Alphabet. X axis: Delta % range and particular dipeptides, bins <-40, <-30, <-20, <-10, <0, >0, >10, >20, >30, >40, >50, >60, >70, >80.]

Figure 20. Density of Delta % Values of Chemical Alphabet Dipeptides in Two ΔpI Subsets. The ΔpI < 0.1 subset consists of 60 proteins which comprise 22412 total dipeptides. The ΔpI > 0.7 subset consists of 50 proteins which comprise 15531 total dipeptides. More information about each individual protein in these ΔpI subsets can be seen in Appendix A.

-Functional

Using the Functional alphabet, a comparison of the dipeptide frequencies between the ΔpI < 0.1 subset and the 0.3 < ΔpI < 0.7 subset is shown in Figure 21, as well as the Delta % values for each dipeptide. The same comparison is shown between the ΔpI < 0.1 subset and the ΔpI > 0.7 subset in Figure 22.

[Chart: Comparison of dipeptides (based on functional characteristic) taken from ΔpI < 0.1 and 0.3 < ΔpI < 0.7. X axis: Dipeptide (functional alphabet).]

Figure 21. Frequencies of Functional Alphabet Dipeptides in Two ΔpI Subsets. Shown in blue are the frequencies of each dipeptide in the ΔpI < 0.1 subset and shown in yellow is the difference in frequency for each dipeptide between the ΔpI < 0.1 subset and the 0.3 < ΔpI < 0.7 subset.

[Chart: Comparison of dipeptides (based on functional characteristic) taken from ΔpI < 0.1 and ΔpI > 0.7. X axis: Dipeptide (functional alphabet).]

Figure 22. Frequencies of Functional Alphabet Dipeptides in Two ΔpI Subsets. Shown in blue are the frequencies of each dipeptide in the ΔpI < 0.1 subset and shown in yellow is the difference in frequency for each dipeptide between the ΔpI < 0.1 subset and the ΔpI > 0.7 subset.

-Hydrophobic

Using the Hydrophobic alphabet, a comparison of the dipeptide frequencies between the ΔpI < 0.1 subset and the 0.3 < ΔpI < 0.7 subset is shown in Figure 23, as well as the Delta % values for each dipeptide. The same comparison is shown between the ΔpI < 0.1 subset and the ΔpI > 0.7 subset in Figure 24.

[Chart: Comparison of dipeptides (based on hydrophobic characteristic) taken from ΔpI < 0.1 and 0.3 < ΔpI < 0.7. Legend: % of Dipeptide in ΔpI < 0.1; Delta % (ΔpI < 0.1 - 0.3 < ΔpI < 0.7). X axis: Dipeptide (hydrophobicity alphabet).]

Figure 23. Frequencies of Hydrophobic Alphabet Dipeptides in Two ΔpI Subsets. Shown in blue are the frequencies of each dipeptide in the ΔpI < 0.1 subset and shown in yellow is the difference in frequency for each dipeptide between the ΔpI < 0.1 subset and the 0.3 < ΔpI < 0.7 subset.

[Chart: Comparison of dipeptides (based on hydrophobic characteristic) taken from ΔpI < 0.1 and ΔpI > 0.7. Legend: % of Dipeptide in ΔpI < 0.1; Delta % (ΔpI < 0.1 - ΔpI > 0.7). X axis: Dipeptide (hydrophobicity alphabet).]

Figure 24. Frequencies of Hydrophobic Alphabet Dipeptides in Two ΔpI Subsets. Shown in blue are the frequencies of each dipeptide in the ΔpI < 0.1 subset and shown in yellow is the difference in frequency for each dipeptide between the ΔpI < 0.1 subset and the ΔpI > 0.7 subset.

Discussion

When exploring the behavior of proteins undergoing isoelectric focusing, there exists a discrepancy between predicted pI values and experimentally determined pI values for a high percentage of those proteins. This comparison of pI values was performed using predictions based on our algorithm (11) or similar algorithms (19) and experimental pI values determined in different laboratory settings (14-18). The size and regular occurrence of these differences justified a close study of the protein sequences in an effort to identify underlying patterns that could contribute to these differences. The question now lay in whether there was enough information in the results that were extracted to be able to more accurately predict pI values using the information obtained.

The first key element was having a reliable dataset that was both uniform and robust enough to give meaningful data. A dataset that is too diverse would lead to complications such as the question of how to handle post-translational modifications in predicting pI and MW. Simply finding the frequencies of all dipeptides in all known protein sequences would provide a dataset that is certainly robust enough. Unfortunately, the robustness would be offset by the high level of noise in the data due to the fact that different organisms have different post-translational modifications. A dataset that is too small would not have enough dipeptide information to make sure that the dipeptides that occur in the lowest frequencies are still seen in sufficient abundance to maintain their statistical validity. To overcome both of these hurdles, the search space was limited only to proteins in E. coli since it displays very few post-translational modifications and has a proteome that has been sufficiently documented to do a case [...]

In keeping with the theme of having a dataset with as little noise as possible, yet still retaining as much robustness as possible, it was decided that even though well structured 2DE data existed from 5 different groups (14-18), it was probably best to limit the usage of this data to one or two of these groups (17 and 18). Both the Yan et al. (18) and Tonella et al. (19) groups performed large scale 2DE studies on the E. coli proteome. The Tonella (19) group boasted over 70% of the E. coli proteome being covered in their data. Since none of the groups used the same 2DE conditions, it was decided that the data from the Tonella (19) group would be the only data used. The primary justification was to ensure that the experimental pI and MW values were gained using the same conditions. This in turn would reduce as much noise as possible. In addition, the fact that their data covered over 70% of the E. coli proteome held promise for this study.

Once the entire dataset was selected, another decision had to be made about how to separate the data so that clear lines could be seen between proteins that had very small ΔpI values and proteins that had greater ΔpI values. Doing so would make it possible to see if significant sequence differences (at the dipeptide level) between ΔpI subsets existed. It was necessary to break the dataset into a small number of ΔpI subsets. These arbitrary ΔpI cut-off ranges (ΔpI < 0.1; 0.3 < ΔpI < 0.7; ΔpI > 0.7) were chosen in order to separate the data into distinct sets of similar size that could be compared with each other.
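The assignment of a protein to a ΔpI subset can be written as a small Python function. This is only a sketch under two assumptions: that ΔpI is the absolute difference between the experimental and predicted pI, and that proteins falling between the stated ranges (for example 0.1 to 0.3) are left out of all three subsets. The function name and subset labels are hypothetical.

```python
def delta_pi_subset(experimental_pi, predicted_pi):
    """Return the label of the ΔpI subset a protein falls into, or None."""
    delta = abs(experimental_pi - predicted_pi)  # assumed definition of ΔpI
    if delta < 0.1:
        return "dpi_lt_0.1"
    if 0.3 < delta < 0.7:
        return "dpi_0.3_to_0.7"
    if delta > 0.7:
        return "dpi_gt_0.7"
    return None  # in a gap between the stated cut-off ranges


# Hypothetical pI pairs (experimental, predicted), for illustration only.
for exp_pi, pred_pi in [(5.00, 5.05), (5.0, 5.5), (5.0, 6.0), (5.0, 5.2)]:
    print(exp_pi, pred_pi, delta_pi_subset(exp_pi, pred_pi))
```

Leaving gaps between the ranges keeps the subsets clearly separated, at the cost of discarding the proteins that fall in between.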

There was difficulty in deciding how to separate the entire dataset into these three subsets. One possible approach was to separate the dataset into many smaller sized subsets based on a larger number of ΔpI ranges. On one hand, doing this might provide [...] relative to adjacent ΔpI ranges. On the other hand, by doing it this way there is a loss of information at the sequence level due to the smaller number of sequences that would be found in each dataset. This, in turn, would threaten the reliability of our findings. Therefore, the dataset had to be separated into subsets of sufficient robustness. The ΔpI < 0.1 subset consists of 60 proteins which comprise 22472 total amino acids or 22412 total dipeptides. The 0.3 < ΔpI < 0.7 subset consists of 58 proteins which comprise 17906 total amino acids or 17848 total dipeptides. The ΔpI > 0.7 subset consists of 50 proteins which comprise 15581 total amino acids or 15531 total dipeptides. More information about each individual protein in these ΔpI subsets, including ΔpI, a description and SWISS-2DPAGE Accession Number, can be seen in Appendix A.

The analytical process is best viewed as a pipeline, as seen in Figure 2 in the Methods section. We began our analysis with the simplest method (naive approach), worked our way to more complicated methods (alphabets approach), and ended with the most complicated methods (dipeptides using alphabets approach). Along this path, the relevance of the data also becomes more complicated, but more interesting at the same time (with a few exceptions).

The naive approach to handling the dataset did not provide any meaningful results. It was quickly apparent that individual amino acid frequencies in a given set of protein sequences did not vary among the three data subsets. In the end, no amino acid frequency characteristics using simply the naive approach were found to be significantly different between the three ΔpI subsets. This can be seen in Figures 3 and 4 when comparing the ΔpI < 0.1 subset with the 0.3 < ΔpI < 0.7 subset and the ΔpI < 0.1 subset [...] yellow frequencies can be seen for any individual amino acid; the values are also nearly identical when Figure 3 and Figure 4 are compared. The lack of a correlation between ΔpI values and the frequency of these individual amino acids showed us that we needed to consider the problem in more depth: more than one amino acid at a time.

To simplify the analysis, the number of variables was reduced by using the four alphabets described in Table 2 at the next stage in the pipeline. Again, the results did not reveal any significant trends that could affect the way that pI is calculated. Figures 5 and 6 (Charge alphabet comparisons), Figures 7 and 8 (Chemical alphabet comparisons), Figures 9 and 10 (Functional alphabet comparisons), and Figures 11 and 12 (Hydrophobic alphabet comparisons) show very similar results to those of the naive approach in Figures 3 and 4. There is no trend of increase or decrease in ΔpI for any particular amino acid when moving between the three datasets.

It was expected that more meaningful results would be obtained by analysis of the dipeptide frequencies. All previous pI prediction algorithms (2-8), including ours (11), treat the pKa for each amino acid independently, regardless of its near or distant neighbors. At this point it is instructive to consider the experimental conditions normally employed for isoelectric focusing (IEF). The biological function of proteins requires that they maintain their three dimensional structure intact. However, for IEF, we are interested only in separating the proteins, not observing their biological function. To assure the best separation, reagents such as urea and detergents are added prior to IEF to disrupt any secondary, tertiary or quaternary aspects of protein structure. In these fully denatured proteins, the only significant interactions are expected to occur between amino [...] consideration of the effect of neighboring amino acids on their respective side chain pKa values may prove valuable.

With respect to each alphabet that was used, the discussion will advance from the least significant alphabet dipeptide results to the most significant alphabet dipeptide results. However, the analysis using the normal amino acid alphabet will be discussed first. At first glance Figures 13 and 14 show some very promising results. The Delta % value represents the change in frequency from one ΔpI subset being compared to the next ΔpI subset. Therefore, Delta % values that are in the 300 and 400 ranges would seem very significant. The problem was that most of the dipeptides that fell into these extreme ranges were dipeptides whose overall frequency was vanishingly small. A dipeptide that occurs only once in one ΔpI subset and multiple times in another ΔpI subset is going to have a very high Delta % value. It would not be wise to rely on such dipeptide frequencies to redesign a pI prediction algorithm. To negotiate through all of the 400 dipeptides in the normal alphabet, the same analysis was run with a threshold frequency of occurrence for dipeptides of 0.1%. In other words, if a dipeptide did not occur at least 0.1% of the time (or roughly 22 times in the ΔpI < 0.1 dataset, which contained 22412 dipeptides) it was not used for analysis. The results of this can be seen in Figures 15 and 16. There still exist extreme outliers that have Delta % values in the 100 range, which will later be reanalyzed by comparison with some of the alphabet dipeptide analyses.

The alphabet that showed the least interesting results when using a dipeptide approach was the hydrophobic alphabet. Comparisons of the ΔpI subsets using the [...] bars and is very negligible in each of the 4 dipeptides (no more than a Delta % value of 1.85 was seen in any of the 4 dipeptides).

The charge alphabet showed slightly more significant results for dipeptide analysis. Delta % values reached into the 30+ range for some dipeptides. The AA dipeptide (negatively charged amino acid followed by negatively charged amino acid; see Table 2 for definitions of all the alphabet codes) had a Delta % of -31.3% going from the ΔpI < 0.1 subset to the ΔpI > 0.7 subset (Figure 18). The Delta % for the AA dipeptide is also large (-18.9%) in the other comparison of the ΔpI < 0.1 subset and the 0.3 < ΔpI < 0.7 subset (Figure 17). However, the frequency of occurrence of this AA dipeptide (as shown in the blue bars) is very low in all three ΔpI subsets. What we would like to see is a large Delta % value accompanied by a large frequency of occurrence for a particular dipeptide. This was not apparent in any of the dipeptides using the Charge alphabet.

Staying with the theme that the most significant results will combine a large Delta % value with a large frequency of occurrence for dipeptides, the Functional alphabet is considered next. Figures 21 and 22, representing the analysis using the Functional alphabet, show a collection of dipeptides that have both significantly large Delta % values and significantly large frequencies of occurrence: AA, AH, HA, HP, CP, PH, PP.

It was important to refer back to the analysis that was done using dipeptides based on the complete amino acid alphabet. Figures 15 and 16 point out a few extreme dipeptide outliers: KY, YS (Figure 15) and EE, NN, YT (Figure 16). Converting these [...] respectively. These three different dipeptides all map back to extreme outliers from the analysis done using the Functional alphabet (Figures 21 and 22).
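The recoding of the outlier dipeptides into the Functional alphabet can be sketched in Python as follows. The residue grouping below (A = acidic, C = basic, H = hydrophobic, P = polar) is an assumption based on the usual Bio::Tools::OddCodes convention and should be checked against Table 2; the function name is hypothetical.

```python
# Assumed residue grouping for the Functional alphabet:
# A = acidic, C = basic, H = hydrophobic, P = polar.
FUNCTIONAL = {
    **dict.fromkeys("DE", "A"),
    **dict.fromkeys("HKR", "C"),
    **dict.fromkeys("AFILMPVW", "H"),
    **dict.fromkeys("CGNQSTY", "P"),
}


def to_functional(dipeptide):
    """Recode a dipeptide from the 20-letter alphabet into the Functional alphabet."""
    return "".join(FUNCTIONAL[ch] for ch in dipeptide.upper())


# Outlier dipeptides named in the text for Figures 15 and 16.
outliers = ["KY", "YS", "EE", "NN", "YT"]
mapped = {dp: to_functional(dp) for dp in outliers}
print(mapped)
```

Under this assumed grouping the five outliers collapse to three Functional dipeptides (CP, PP and AA), each of which appears in the list of Functional alphabet outliers given above.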

Using the Chemical alphabet with dipeptides created data that was sufficiently large that it was not possible to display all the possible dipeptide combinations in Figures 19 and 20. Instead only the density values were chosen to display. Particular outlier dipeptides are labeled on the top of each column with their respective Delta % values.

The significance of these findings is that it may lead to a more accurate calculation of pI than currently existing methods (11, 19). These data clearly support the idea that the pKa value for an amino acid side chain, even when the protein is fully denatured, depends on the microenvironment created by the nearest neighbors of that amino acid. Using the extreme outlier dipeptides that have been identified from this study of 180 annotated E. coli proteins, it may be possible to adjust the algorithms for calculating pI values. Our algorithm for calculating pI from amino acid sequence (11) could be modified to include the effects of adjacent amino acids on the pKa values used in the calculations. This will be an empirical process whereby the pKa values used in the algorithm will be modified fractionally to see which changes lead to a better correlation between actual and predicted pI values for the two outlier datasets (0.3 < ΔpI < 0.7; and ΔpI > 0.7).
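To make the proposed modification concrete, the Python sketch below shows one way a neighbor-dependent pKa adjustment could be slotted into a generic pI calculation. It is not the algorithm of reference 11. The pKa table uses illustrative round numbers, the net charge follows the standard Henderson-Hasselbalch treatment, the pI is found by bisection, and neighbor_shift, example_shift and the 0.5 unit shift are hypothetical placeholders for values that would have to be fitted empirically.

```python
# Illustrative pKa values only; NOT the set used in reference 11.
PKA_POSITIVE = {"K": 10.5, "R": 12.5, "H": 6.0, "N_TERM": 9.0}
PKA_NEGATIVE = {"D": 3.9, "E": 4.1, "C": 8.3, "Y": 10.1, "C_TERM": 2.3}


def neighbor_shift(prev_res, res, next_res):
    """Hypothetical hook for a fitted pKa adjustment; the default applies none."""
    return 0.0


def example_shift(prev_res, res, next_res):
    """Made-up example: lower the pKa of an Arg that sits next to another Arg."""
    return -0.5 if res == "R" and "R" in (prev_res, next_res) else 0.0


def net_charge(seq, ph, shift=neighbor_shift):
    """Net charge of a fully denatured chain at a given pH."""
    seq = seq.upper()
    charge = 1.0 / (1.0 + 10 ** (ph - PKA_POSITIVE["N_TERM"]))
    charge -= 1.0 / (1.0 + 10 ** (PKA_NEGATIVE["C_TERM"] - ph))
    for i, res in enumerate(seq):
        prev_res = seq[i - 1] if i > 0 else None
        next_res = seq[i + 1] if i + 1 < len(seq) else None
        if res in PKA_POSITIVE:
            pka = PKA_POSITIVE[res] + shift(prev_res, res, next_res)
            charge += 1.0 / (1.0 + 10 ** (ph - pka))   # protonated fraction
        elif res in PKA_NEGATIVE:
            pka = PKA_NEGATIVE[res] + shift(prev_res, res, next_res)
            charge -= 1.0 / (1.0 + 10 ** (pka - ph))   # deprotonated fraction
    return charge


def predict_pi(seq, shift=neighbor_shift, tol=1e-4):
    """Bisect on pH until the net charge changes sign (charge falls as pH rises)."""
    lo, hi = 0.0, 14.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if net_charge(seq, mid, shift) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


print(predict_pi("ARRA"), predict_pi("ARRA", example_shift))
```

In an empirical fit of the kind described above, the shift function would be varied and the predicted pI compared with the experimental values for each protein in the outlier subsets.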

If the improvement in the accuracy of the pI calculation proves to be worthwhile, there are many future advancements that could be made. The first could be to build a larger dataset to work with and rerun the analysis to compare to the data shown here. Beyond the scope of the E. coli proteome, further data that are available at the ExPASy Server's [...] microbial proteomes. Another step would be to port the analysis over to lower eukaryotic proteomes that contain many more post-translational modifications. A lot would have to be done in terms of predicting or categorizing these post-translational modifications, but in doing so it may lead to an even more powerful approach to better predicting pI in [...]

Conclusions

A dataset of E. coli proteins was collected and formatted to study the discrepancy that exists between experimental isoelectric point and predicted isoelectric point (ΔpI). This dataset was then split into three parts depending on the magnitude of ΔpI for each protein. Several multi-layered, sequential approaches were taken in reformatting the protein sequence data in an attempt to get a better understanding of what might be causing the varying ΔpI. Each of these stages represented a different part of a pipeline where the data were analyzed by comparing each of the three ΔpI subsets to one another.

The pipeline consisted of a naive approach (considering individual amino acid frequencies), followed by the application of four different alphabets to represent sequences in a simpler way by grouping similar amino acids based on their charge, functional, chemical, and hydrophobic properties. The final step in the pipeline involved investigating the dipeptides of all of these sequences using both the 20 amino acid alphabet and the simplified groupings. The alphabet dipeptide approach yielded the most meaningful results, showing that certain dipeptide sequences occur at greatly different frequencies between proteins in the different ΔpI subsets.

Future studies will attempt to show that the results of these dipeptide findings can be used to better predict pI. This will involve modification of our existing pI prediction algorithm to include the effect of adjacent amino acids on side chain pKa values. Using a short list of only the most extreme cases where a dipeptide showed greatly different ΔpI from one subset to the next should result in a pI prediction value that is more accurate. Once the pI prediction is improved the next step would be to [...] them. In addition, similar analyses will be extended to other prokaryotic organisms, and [...]
