An Evolutionary Computation Approach to Optimization of Isoelectric Point Prediction in Proteins


Rochester Institute of Technology


2006


Chris Parkin


This Thesis is brought to you for free and open access by the Thesis/Dissertation Collections at RIT Scholar Works. It has been accepted for inclusion in Theses by an authorized administrator of RIT Scholar Works. For more information, please contact RIT Scholar Works.


To: Head, Department of Biological Sciences

Rochester Institute of Technology Department of Biological Sciences Bioinformatics Program

The undersigned state that Christopher Parkin, a candidate for the Master of Science degree in Bioinformatics, has submitted his/her thesis and has satisfactorily defended it.

This completes the requirements for the Master of Science degree in Bioinformatics at Rochester Institute of Technology.

Thesis committee members:

Name

Paul Craig

(Committee Chair)

Paul Craig

(Thesis Advisor)

Illegible Signature

Gary R. Skuse, Ph.D. Director of Bioinformatics

Date


Thesis/Dissertation Author Permission Statement

Title of thesis or dissertation: An Evolutionary Computation Approach to Optimization of Isoelectric Point Prediction in Proteins

Name of author: Christopher Parkin

Degree: Master of Science

Program: Bioinformatics

College: Science

I understand that I must submit a print copy of my thesis or dissertation to the RIT Archives, per current RIT guidelines for the completion of my degree. I hereby grant to the Rochester Institute of Technology and its agents the non-exclusive license to archive and make accessible my thesis or dissertation in whole or in part in all forms of media in perpetuity. I retain all other ownership rights to the copyright of the thesis or dissertation. I also retain the right to use in future works (such as articles or books) all or part of this thesis or dissertation.

Print Reproduction Permission Granted:

I, Christopher Parkin, hereby grant permission to the Rochester Institute of Technology to reproduce my print thesis or dissertation in whole or in part. Any reproduction will not be for commercial use or profit.

Signature of Author: Christopher Parkin Date:

Print Reproduction Permission Denied:

I, ____________, hereby deny permission to the RIT Library of the Rochester Institute of Technology to reproduce my print thesis or dissertation in whole or in part.

Signature of Author: ____________ Date: ____________

Inclusion in the RIT Digital Media Library Electronic Thesis & Dissertation (ETD) Archive

I, Christopher Parkin, additionally grant to the Rochester Institute of Technology Digital Media Library (RIT DML) the non-exclusive license to archive and provide electronic access to my thesis or dissertation in whole or in part in all forms of media in perpetuity.

I understand that my work, in addition to its bibliographic record and abstract, will be available to the world-wide community of scholars and researchers through the RIT DML. I retain all other ownership rights to the copyright of the thesis or dissertation. I also retain the right to use in future works (such as articles or books) all or part of this thesis or dissertation. I am aware that the Rochester Institute of Technology does not require registration of copyright for ETDs.

I hereby certify that, if appropriate, I have obtained and attached written permission statements from the owners of each third party copyrighted matter to be included in my thesis or dissertation. I certify that the version I submitted is the same as that approved by my committee.

An Evolutionary Computation Approach to Optimization of Isoelectric Point Prediction in Proteins

Submitted by

Chris Parkin

Department of Biological Sciences

In partial fulfillment of the requirements for the Master of Science degree in Bioinformatics at Rochester Institute of Technology


Abstract

An Evolutionary Computation Approach to Optimization of Isoelectric Point Prediction in Proteins

by

Christopher Parkin

Master of Science in Bioinformatics

Rochester Institute of Technology

Professor Paul Craig, Chair

Computational biology has attacked the problem of isoelectric point prediction with little success, achieving a rough accuracy level of only 30%. In 2005, Matthew Conte performed a study focused on the relationship between sequence characteristics and isoelectric point prediction accuracy. Results indicated that charges between adjacent amino acids could have a significant impact on the overall predicted pI for the protein. In this study we introduce an evolutionary computation approach aimed at accounting for these problem dipeptides. For each possible dipeptide involving charged amino acids (7 chargeable groups, giving 49 possibilities), the algorithm predicts a pKa value that, when included in the pI prediction algorithm, should result in a more accurate prediction. By accounting for these charged, adjacent amino acids, the pI prediction showed improvements for those proteins with the greatest deviation between experimental and predicted pI value (ΔpI > 0.7). However, these results were not generalized, as the incorporation of these values had the reverse effect on remaining proteins, most notably those from the most accurate data set (ΔpI < 0.1). While this research lays a foundation for improving the pI prediction algorithm, additional exploration remains necessary for an overall accuracy

Using 0.1 < ΔpI < 0.3

Using 0.3 < ΔpI < 0.7

Using ΔpI > 0.7

3.3.9 Average ΔpI Values

4 Discussion

5 Conclusion


Introduction

Two-dimensional gel electrophoresis (2DE) first emerged in 1975 when Dr. Patrick O'Farrell displayed the ability to separate 1,100 polypeptides from Escherichia coli [1]. With the theory and technique being slightly ahead of its time, it was initially practiced by only a handful of scientists around the world. Since then, the emergence of new analytical tools, combined with numerous large-scale, public information databases, has shed a whole new light on this once dormant technique [2]. Today, 2DE remains a leading technique for separation and identification of proteins.

Isoelectric focusing (IEF) is the main focus of this study and makes up the first dimension of 2DE. IEF is a method in which amphoteric molecules are separated in a polyacrylamide gel according to their isoelectric point values [2]. When placed in a pH gradient, a protein will migrate to the position where its net charge is equal to zero. The pH at this position is known as the isoelectric point (pI) value. Isoelectric point is determined by charged groups in the protein, and is often between 3 and 12, with most falling between 4 and 7 [11,12].

Traditional techniques used to form pH gradients involved mixing ampholytes that had been chemically engineered to a certain pKa value [13]. While this method worked efficiently, the pH gradient was extremely difficult to reproduce. Since then, immobilized pH gradients (IPGs) have been introduced. In an IPG, the ampholytes are bound in acrylamide gel, forming a fixed pH gradient and ensuring reproducibility [8,13].

The second dimension of 2DE is a separation by molecular mass. A charge is applied to a buffer that surrounds the gel, attracting the molecules to the opposite end and causing them to migrate. The larger of these molecules travel the slowest and will remain near the top of the gel, while the smaller molecules will travel further and be seen toward the bottom of the gel. After staining, the end result of 2DE is a grid of spots with each spot referring to the location of a protein molecule in the gel (Figure 1). The X value in the grid corresponds to the pI value of that protein, while the Y value corresponds to the distance migrated in the gel.

The application of this technique has proven to be a powerful tool and has provided researchers with a great amount of data [5]. However, the difficulty and time requirements associated with performing and interpreting 2DE correctly have led to the emergence of computational approaches to 2DE [5]. While the benefits associated with simulations are often quite attractive, the limitations placed upon the pI prediction portion of the 2DE simulation have proved to be the Achilles heel of the entire simulation.

Figure 1. Sample output from 2-Dimensional Electrophoresis. Obtained from Swiss 2DPAGE database, protein ID# P16700 [7,10].

The isoelectric point prediction algorithm to be optimized in this study is part of a 2DE simulator that was originally developed at the Rochester Institute of Technology as part of an honors thesis project [3]. This algorithm was implemented to calculate charge, based on side chains of amino acids found in the sequence. The charge on each side chain is a function of the pH and the pKa value of that side chain. The default pKa value used to calculate the charge on each of the amino acid side chains is shown in Table 1.

Amino Acid            Default pKa Value
R - Arginine          12
D - Aspartic Acid     4.05
E - Glutamic Acid     4.45
H - Histidine         5.98
K - Lysine            10
C - Cysteine          9
Y - Tyrosine          10

Table 1. Default pKa values used in original pI prediction algorithm.

Using the values from Table 1 and starting with a pH of 7, the algorithm looks at each individual amino acid in the sequence and computes its charge. Each individual amino acid charge is then added to a running total of charge, resulting in the total charge for the protein. If the total charge is greater than 0.005 or less than -0.005, the pH value used in the calculation is adjusted and the charge calculation is repeated. This cycle continues until the total charge is reported to be between -0.005 and 0.005, practically zero. Finally, the pH value resulting in a net charge of zero on the protein is returned, and considered the pI value for that protein. While the pKa values are heavily relied on in this calculation, variables such as post-translational modifications and charge-charge interactions are left unaccounted for, significantly affecting prediction accuracy [3].
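The iterative calculation described above can be sketched in Java as follows. This is a minimal illustration and not the simulator's actual code: the class and method names are hypothetical, the Henderson-Hasselbalch form of the side-chain charge is assumed, the pH is adjusted by bisection over the range 0 to 14 (the text does not say how the pH is adjusted), and terminal groups are ignored because the text mentions side chains only.

```java
import java.util.Map;

// Hypothetical sketch of the original pI prediction loop; names are illustrative.
final class IsoelectricPointSketch {
    // Default side-chain pKa values taken from Table 1.
    static final Map<Character, Double> DEFAULT_PKA = Map.of(
        'R', 12.0, 'D', 4.05, 'E', 4.45, 'H', 5.98,
        'K', 10.0, 'C', 9.0, 'Y', 10.0);

    // Net charge between -0.005 and 0.005 is treated as zero.
    static final double TOLERANCE = 0.005;

    // Assumed Henderson-Hasselbalch charge of one side chain at a given pH.
    static double sideChainCharge(char residue, double pKa, double pH) {
        boolean basic = residue == 'R' || residue == 'H' || residue == 'K';
        return basic
            ? 1.0 / (1.0 + Math.pow(10.0, pH - pKa))    // basic groups carry +1 when protonated
            : -1.0 / (1.0 + Math.pow(10.0, pKa - pH));  // acidic groups carry -1 when deprotonated
    }

    // Running total of the charge over every chargeable residue in the sequence.
    static double netCharge(String sequence, double pH) {
        double total = 0.0;
        for (char residue : sequence.toCharArray()) {
            Double pKa = DEFAULT_PKA.get(residue);
            if (pKa != null) {
                total += sideChainCharge(residue, pKa, pH);
            }
        }
        return total;
    }

    // Start at pH 7 and adjust (here: by bisection) until the net charge is practically zero.
    static double predictPI(String sequence) {
        double low = 0.0, high = 14.0, pH = 7.0;
        for (int i = 0; i < 100; i++) {
            double charge = netCharge(sequence, pH);
            if (Math.abs(charge) <= TOLERANCE) {
                break;
            }
            if (charge > 0) {
                low = pH;   // still positive: the pI lies at a higher pH
            } else {
                high = pH;  // negative: the pI lies at a lower pH
            }
            pH = (low + high) / 2.0;
        }
        return pH;
    }
}
```

Because the net charge only decreases as the pH rises, bisection is one simple way to home in on the zero crossing; any other monotone search would serve equally well in this sketch.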

In 2005, Matthew Conte performed sequential analyses on numerous proteins from the E. coli proteome, obtained from the Swiss 2DPAGE database [3, 7]. In doing so, he uncovered a correlation between the occurrence of charged dipeptides in the sequence and the level of discrepancy between experimental and predicted pI, known in his study as well as this one as ΔpI [3]. His results showed that the higher the number of charge-charge dipeptides in the sequence, the greater the deviation between actual and predicted pI value for that protein [3].

The pI calculation is based on the pKa values for the amino acid side chains. Based on his results, our hope was to derive new pKa values using a genetic algorithm. As in Conte's work, Escherichia coli was the proteome of choice. E. coli is thought to have a relatively low number of post-translational modifications such as methylation and phosphorylation, and it is one of the most widely studied bacteria in science, making it an ideal subject [3]. Furthermore, experimental isoelectric point data from only one group was used, assuring consistency in lab practices and data submission [3,9].

Now a cornerstone in biology, evolution and the underlying theory of natural selection are credited to Charles Darwin after his research in the mid 19th century [4]. His theory of natural selection proposed that individuals best adapted to their surrounding environment are more likely to survive and mate. Over time, those with the less-favorable traits die out, while favorable traits are passed on, eventually introducing adaptations into the population.

Evolutionary computation models like the genetic algorithm (GA) used in this study loosely follow Darwin's theories. In this case, each individual in the population is a set of pKa values used to accommodate the charge-charge amino acid pairs that normally hurt the accuracy of pI prediction. In each generation, the most well adapted individuals are those that lead to the most accurate pI prediction, and are known as the fittest of the population.

According to evolution, the fittest individuals are those most likely to survive and mate, so the fittest from each generation automatically survive into the next. Over time, simulated processes of mutation, crossover and recombination are applied to each generation, resulting in a population of the best possible individuals. Further details regarding the workings of the GA are given in the Methods section.

Methods

ExPASy 2DPAGE Database

The ExPASy server's Swiss 2DPAGE database (http://ca.expasy.org/ch2d/) contains vast 2DE gel information for human, mouse, Arabidopsis thaliana, Dictyostelium discoideum, Escherichia coli, Saccharomyces cerevisiae, and Staphylococcus aureus (N315) [7]. For each protein in the database, information regarding experimental pI value, molecular weight, experimental methods, references, and a photo of the actual gel run in the experiment is available [7].

Because many groups have contributed to this database, it is not uncommon to find multiple submissions for any one protein. For that reason and to keep experimental practices consistent, only those entries from Tonella were used in this study [9]. For ease of use, the Swiss 2DPAGE allows for the information to be downloaded into a tab delimited text file to be imported to a spreadsheet [14]. The fields available in this file include gene name, description, Swiss 2DPAGE accession number, spot ID, experimental pI, experimental molecular weight, mapping methods, comment topics and a reference to the group carrying out the experiments.

Trimming the Data Set

The initial Tonella data set contained roughly 340 proteins, and it was not uncommon to see up to eight entries for any one protein. Again, duplications are a result of post-translational modifications that cause a change in pI/MW on the protein, leading to a unique spot on the gel. Because most of these duplicate pI values were quite similar (often within 0.01 of one another), an average pI value was taken to represent the protein in the data set, and the remaining duplicates were removed. In the event that drastically different pI values were recorded, only the first entry was saved, and that protein was omitted from training the genetic algorithm later in the study.

After all duplicates were removed from the data set, 170 proteins remained. These were then broken into four groups based on the difference between experimental and predicted pI value, known as ΔpI < 0.1, 0.1 < ΔpI < 0.3, 0.3 < ΔpI < 0.7, and ΔpI > 0.7 (Appendix A). The greatest concern for this study were proteins found in the ΔpI > 0.7 and 0.3 < ΔpI < 0.7 data sets, with expectations that improving those predictions would greatly improve the overall accuracy level for the algorithm. For a complete listing of the proteins used after trimming, see Appendix A.
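The grouping step can be illustrated with a short Java sketch. The class name, method names and group labels are hypothetical, and the handling of values that fall exactly on a boundary (0.1, 0.3 or 0.7) is an assumption, since the text does not specify it.

```java
// Hypothetical sketch: assign a protein to one of the four delta-pI groups.
final class DeltaPiBins {
    // Delta-pI is taken here as the absolute difference between experimental and predicted pI.
    static double deltaPI(double experimentalPI, double predictedPI) {
        return Math.abs(experimentalPI - predictedPI);
    }

    // Boundary values are assigned to the higher group (an assumption of this sketch).
    static String bin(double delta) {
        if (delta < 0.1) {
            return "dPI<0.1";
        }
        if (delta < 0.3) {
            return "0.1<dPI<0.3";
        }
        if (delta < 0.7) {
            return "0.3<dPI<0.7";
        }
        return "dPI>0.7";
    }
}
```

For instance, a protein with an experimental pI of 5.00 and a predicted pI of 5.05 would land in the most accurate group under these assumptions.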

Sequence Gathering

All Swiss 2DPAGE protein entries are cross-linked with the Swiss-Prot database, making it possible to acquire FASTA formatted sequence through the NCBI Batch Entrez search [15]. To use this tool, a simple list of the 170 protein accession numbers was uploaded to the NCBI, which returned all 170 proteins in FASTA format. To easily associate the experimental pI value with the protein sequence, each experimental value was manually entered into the second line of the respective FASTA file of that protein. This resulted in one large FASTA formatted file containing all 170 proteins, complete with accession numbers, experimental pI value and sequence. A Perl script then parsed this file and saved each protein sequence separately, basing the filename on the protein's accession number. Finally, another short program was written to read in all 170 protein sequences and sort them according to the difference between experimental and predicted pI value. All data files used in this research have been saved into a compressed folder and can be obtained at http://www.rit.edu/~cdp3511/thesis/

Training & Testing Data

After this organization was complete, each data set was run through the algorithm in the following manner. First, a folder was made to contain "training data," which contained four proteins chosen from the data set. Each of these proteins was deemed acceptable (only one Swiss 2DPAGE submission per protein) and was automatically read from the directory by the GA, which requires being seeded with two chromosomes, one from the original training run and one from the first testing run. The results from this were then used to seed four new proteins that become the testing set. This process continued until all acceptable proteins from the data set were a part of the training data, giving the best overall pKa values for that set.

The Genetic Algorithm

The genetic algorithm, written in the Java programming language, was the driving force behind this project. As previously indicated, the GA is set up to loosely simulate evolution and follows Charles Darwin's theory of "survival of the fittest". As mentioned, the original prediction algorithm steps through the sequence, looking at one amino acid at a time. In the following sections, the ideas and code behind the algorithm are explained.

The Chromosome

The first step in any GA is to develop an initial population of what are called chromosomes [4]. A chromosome is an object representing the parameters used to optimize the problem at hand. For the purposes of this experiment, a chromosome could be defined as an array of binary integer values that represent pKa values, one for each dipeptide of interest. For example, if we wanted to represent an arginine when it occurs next to another arginine, or an arginine next to an aspartic acid (as they might occur in the protein sequence), the array might hold values such as "0011" or "0110". When converted to integers, these binary strings equal "3" and "6", which would become the respective pKa values associated with "RR" and "RD" in that chromosome.

Each chromosome then holds the entire set of pKa values used to optimize the pI prediction algorithm. The initial population is obtained by using a random number generator, providing a number between 0 and 14 inclusive to represent each pKa value.
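A chromosome of this kind might be encoded as in the following Java sketch. It is an illustration only and not the thesis's Chromosome class: the class name, the residue ordering used to index the dipeptides, and all method names are assumptions.

```java
import java.util.Random;

// Hypothetical sketch of a chromosome: one 4-bit binary string per charged dipeptide.
final class ChromosomeSketch {
    // Assumed ordering of the seven chargeable residues; it fixes each dipeptide's array index.
    static final String RESIDUES = "HKRDECY";
    static final int GENES = RESIDUES.length() * RESIDUES.length(); // 7 x 7 = 49 dipeptides
    static final int MAX_INITIAL_PKA = 14;

    // Encode an integer pKa as a zero-padded 4-bit string, e.g. 3 becomes "0011".
    static String encode(int pKa) {
        String bits = Integer.toBinaryString(pKa);
        return "0".repeat(4 - bits.length()) + bits;
    }

    // Decode a binary string back to its integer pKa, e.g. "0110" becomes 6.
    static int decode(String bits) {
        return Integer.parseInt(bits, 2);
    }

    // Array position of a dipeptide such as R followed by D.
    static int geneIndex(char first, char second) {
        return RESIDUES.indexOf(first) * RESIDUES.length() + RESIDUES.indexOf(second);
    }

    // Random initial chromosome: every gene drawn uniformly from 0 to 14 inclusive.
    static String[] randomChromosome(Random rng) {
        String[] genes = new String[GENES];
        for (int i = 0; i < GENES; i++) {
            genes[i] = encode(rng.nextInt(MAX_INITIAL_PKA + 1));
        }
        return genes;
    }
}
```

Storing each value as a fixed-width bit string keeps later bit-level operations, such as crossover inside a gene or single-bit mutation, straightforward.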

Fitness

After the initial generation is in place, each chromosome is tested for what is known as its "fitness." As noted, each chromosome holds the parameters that are utilized in the prediction algorithm, outputting a predicted pI value. In this experiment, a chromosome's fitness can be defined as the average difference between the experimental and predicted pI value for each protein being tested. Therefore, if we have 100 chromosomes and are testing on a set of 10 proteins, that means that for each generation, the fitness value is calculated 1,000 times. Testing on the entire data set means 100 chromosomes tested on 170 proteins, for 17,000 calculations per generation.

Found in the fitness function, the pI prediction algorithm is simply the original algorithm, modified to look at two amino acids at a time. For example, the original algorithm would see a "K" in the sequence and assign it a pKa value of 10. Instead, the modified algorithm sees the K and then checks the amino acid immediately following. If it is an amino acid with a charged side chain, like arginine for example, the function looks at the current chromosome and extracts the corresponding pKa value for K-R, and assigns it to K. After doing so, the algorithm steps ahead one spot and sees the R, and then repeats the process. The overall fitness then depends on how well the parameters found in the chromosome work, or how close the resulting pI prediction ends up being to the experimental pI.
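The dipeptide lookup and the fitness measure described above might look as follows in Java. This is a hedged sketch: the class and method names are hypothetical, the chromosome is represented as a plain map from dipeptide to evolved pKa for readability, and fitness is assumed to be the mean absolute difference between experimental and predicted pI (lower is better).

```java
import java.util.Map;

// Hypothetical sketch of the modified pKa lookup and the fitness measure.
final class DipeptideFitnessSketch {
    // Default side-chain pKa values from Table 1.
    static final Map<Character, Double> DEFAULT_PKA = Map.of(
        'R', 12.0, 'D', 4.05, 'E', 4.45, 'H', 5.98,
        'K', 10.0, 'C', 9.0, 'Y', 10.0);

    // pKa to use for the chargeable residue at position i: the evolved dipeptide value when
    // the next residue is also chargeable, otherwise the default value.
    static double effectivePka(String sequence, int i, Map<String, Integer> dipeptidePka) {
        char current = sequence.charAt(i);
        if (!DEFAULT_PKA.containsKey(current)) {
            throw new IllegalArgumentException("Residue has no chargeable side chain: " + current);
        }
        if (i + 1 < sequence.length() && DEFAULT_PKA.containsKey(sequence.charAt(i + 1))) {
            Integer evolved = dipeptidePka.get("" + current + sequence.charAt(i + 1));
            if (evolved != null) {
                return evolved.doubleValue();
            }
        }
        return DEFAULT_PKA.get(current);
    }

    // Assumed fitness: average absolute delta-pI over the proteins being tested.
    static double fitness(double[] experimentalPI, double[] predictedPI) {
        double sum = 0.0;
        for (int p = 0; p < experimentalPI.length; p++) {
            sum += Math.abs(experimentalPI[p] - predictedPI[p]);
        }
        return sum / experimentalPI.length;
    }
}
```

With a chromosome that maps K-R to 11, the K in "AKRA" would be assigned 11 instead of its default of 10, while the K in "AKA" would keep the default.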

After all chromosomes in the generation have been assigned a fitness value, they are sorted. The top 5% fittest chromosomes are called "survivors," and are automatically placed in the next generation. Remaining chromosomes are chosen in pairs to represent parents, and they are mated to produce two new offspring.

Tournament Selection

The method by which chromosomes are chosen for mating is known as a "tournament" selection. Many variations of tournament selection exist, with the chosen method mostly being personal preference. In this case, the tournament selection starts out by selecting 4 chromosomes at random, excluding the surviving 5%. From the four selected chromosomes, the two with the best fitness values are mated by crossover. For example, consider the following parent chromosomes:

Parent A = 1010 1100 0011 0101

Parent B = 1111 0000 1100 0011

Now, consider the possible children resulting from a cross of Parent A and Parent B:

Child A = 1010 1000 1100 0011

Child B = 1111 0100 0011 0101

Notice the effects that this crossover had on the second pKa values for these chromosomes. Initially, the second pKa value listed in Parent A had a value of "1100" or 12, while Parent B was "0000" or 0. Following the crossover, Child A has "1000" or 8 while Child B has "0100", or 4.

By implementing this type of crossover, as well as introducing random mutation of individual bits, numerous variations can be quickly introduced into the population, simulating evolution (see the sections on crossover and mutation for more information). The tournament selection repeats, again selecting four chromosomes at random and mating the fittest two, until the new generation contains the desired number of chromosomes (default set to 100 for this experiment).
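One way to sketch this selection step in Java is shown below. The names are hypothetical, the population is assumed to be sorted so that the survivors occupy the first positions, lower fitness values are assumed to be better (fitness being an average ΔpI), and the four entrants are assumed to be distinct; none of these details is confirmed by the text.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical sketch of the four-entrant tournament used to pick two parents.
final class TournamentSketch {
    // fitness[i] is the average delta-pI of chromosome i (lower is fitter). Indices
    // 0 .. survivors-1 are assumed to hold the surviving top 5% and are excluded.
    static int[] selectParents(double[] fitness, int survivors, Random rng) {
        List<Integer> pool = new ArrayList<>();
        for (int i = survivors; i < fitness.length; i++) {
            pool.add(i);
        }
        // Draw four distinct entrants at random from the non-survivors.
        Collections.shuffle(pool, rng);
        List<Integer> entrants = new ArrayList<>(pool.subList(0, 4));
        // Order the entrants from fittest to least fit and keep the best two as parents.
        entrants.sort((a, b) -> Double.compare(fitness[a], fitness[b]));
        return new int[] { entrants.get(0), entrants.get(1) };
    }
}
```

Calling selectParents repeatedly and mating each returned pair would fill the next generation up to the desired population size.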

Crossover

To keep the mating process unbiased, crossover and mutation were both implemented randomly. As mentioned, a crossover requires two parent chromosomes, and results in the creation of two offspring. First, a crossover point is determined using a random number generator. Because a Chromosome object is actually an array of binary strings, this determination must actually be done in two steps:

1. Randomly select an index in the Chromosome array to set the crossover point in. This should be a number from 0-24 inclusive and points to one four bit pKa value.

2. Within the string selected at that index, choose a point to cross over. Each string has a length of four bits.

After the crossover point is selected, the crossover is carried out as previously demonstrated, with the second half of one chromosome added to the first half of the other, and vice versa.
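The two-step crossover can be sketched in Java as follows. The class and method names are hypothetical, and the range from which the bit position is drawn (0 to 3) is an assumption of this sketch.

```java
import java.util.Random;

// Hypothetical sketch of single-point crossover on arrays of 4-bit strings.
final class CrossoverSketch {
    // Cross two parents at gene geneIndex, after bitIndex bits of that gene.
    // Returns { childA, childB }.
    static String[][] crossover(String[] a, String[] b, int geneIndex, int bitIndex) {
        String[] childA = new String[a.length];
        String[] childB = new String[a.length];
        for (int i = 0; i < a.length; i++) {
            if (i < geneIndex) {          // before the crossover gene: copy from own parent
                childA[i] = a[i];
                childB[i] = b[i];
            } else if (i > geneIndex) {   // after the crossover gene: copy from the other parent
                childA[i] = b[i];
                childB[i] = a[i];
            } else {                      // the crossover gene itself is split at bitIndex
                childA[i] = a[i].substring(0, bitIndex) + b[i].substring(bitIndex);
                childB[i] = b[i].substring(0, bitIndex) + a[i].substring(bitIndex);
            }
        }
        return new String[][] { childA, childB };
    }

    // Step 1: pick a random gene; step 2: pick a random bit position inside that gene.
    static String[][] randomCrossover(String[] a, String[] b, Random rng) {
        int geneIndex = rng.nextInt(a.length);
        int bitIndex = rng.nextInt(a[geneIndex].length());
        return crossover(a, b, geneIndex, bitIndex);
    }
}
```

Crossing the earlier Parent A and Parent B example at the second gene (index 1), after its first bit, reproduces Child A and Child B as listed in the tournament selection section.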

Mutation

Mutations are simply another way to introduce variation into the population and occur roughly 5% of the time. Although different from the crossover, they work in a similar manner. After the two random selections are made (a position in the array and a position within that string), the selected bit is simply flipped from 0 to 1 or from 1 to 0. For instance:

Chromosome A before mutation = 1001 1011 0011 0101 0111

If Chromosome A was to be selected for mutation and the second position in the array, third position in that string were selected, the mutation would end up as follows:

Chromosome A after mutation = 1001 1001 0011 0101 0111

The resulting Chromosome has gone from having a pKa value of 11 in the second position to one having a pKa of 9, which could have a significant impact on the overall pI prediction. Mutation of fit chromosomes could have a detrimental effect on overall population fitness. To avoid this problem, mutation rates are kept low, no higher than 5%.
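A single-bit mutation of this kind might be written in Java as below. The names are hypothetical, and applying the 5% rate once per chromosome is an assumption; the text does not say whether the rate applies per chromosome, per gene or per bit.

```java
import java.util.Random;

// Hypothetical sketch of single-bit mutation on an array of 4-bit strings.
final class MutationSketch {
    static final double MUTATION_RATE = 0.05; // "roughly 5% of the time"

    // Return a copy of the chromosome with one bit flipped (indices are zero-based).
    static String[] flipBit(String[] chromosome, int geneIndex, int bitIndex) {
        String[] mutated = chromosome.clone();
        char[] bits = mutated[geneIndex].toCharArray();
        bits[bitIndex] = (bits[bitIndex] == '0') ? '1' : '0';
        mutated[geneIndex] = new String(bits);
        return mutated;
    }

    // With probability MUTATION_RATE, flip one randomly chosen bit; otherwise return the input.
    static String[] maybeMutate(String[] chromosome, Random rng) {
        if (rng.nextDouble() >= MUTATION_RATE) {
            return chromosome;
        }
        int geneIndex = rng.nextInt(chromosome.length);
        int bitIndex = rng.nextInt(chromosome[geneIndex].length());
        return flipBit(chromosome, geneIndex, bitIndex);
    }
}
```

Flipping the third bit of the second gene of Chromosome A (zero-based indices 1 and 2) turns "1011" into "1001", matching the example above.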

In addition to automatically being placed into the next population, the fittest chromosomes are saved after each generation. If after a pre-determined number of generations (always between 50 and 150 in this study), the fittest chromosome has not changed, that fitness is determined to be the best possible under those conditions, and the corresponding pKa values are returned.
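This stopping rule can be sketched as a small Java helper. The class and method names are hypothetical, and "has not changed" is interpreted here as "the best fitness value has not improved", with lower values assumed to be better.

```java
// Hypothetical sketch of the stopping rule: end the run once the best fitness
// has gone a fixed number of generations without improving.
final class StagnationStop {
    private final int patience;   // e.g. between 50 and 150 generations
    private double best = Double.POSITIVE_INFINITY;
    private int unchanged = 0;

    StagnationStop(int patience) {
        this.patience = patience;
    }

    // Record the best fitness of the newest generation (lower is better).
    // Returns true when the run should stop.
    boolean update(double generationBest) {
        if (generationBest < best) {
            best = generationBest;
            unchanged = 0;        // improvement: reset the counter
        } else {
            unchanged++;          // no improvement this generation
        }
        return unchanged >= patience;
    }

    double bestFitness() {
        return best;
    }
}
```

An evolve loop would call update once per generation and return the saved fittest chromosome as soon as it reports true.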


Java Classes

Containing roughly 800 lines of code (comments included), the program consisted of three classes: the Generation class, the Chromosome class, and the Evolve class. See Table 2 for an explanation of each of the three classes and the important functions within those classes.

Class Name: Chromosome.class
Explanation: Used for representing pKa values for the dipeptides in question, a Chromosome object is an array of binary strings used to represent integers. This class is used to perform operations such as:
    Random creation of new chromosomes
    Mating (crossover) and offspring creation
    Fitness determination
    Information gathering from Chromosomes themselves

Class Name: Generation.class
Explanation: The Generation class is a sort of container for the chromosomes in each population. Availability of a Generation object becomes especially useful when passing the surviving chromosomes from one generation to the next. Functionality contained within this class includes:
    Creation of initial, random generation
    Creation of a new generation based on chromosomes from the previous generation (aforementioned tournament selection)
    Utilities for accessing individual chromosomes within the generation
    Sorting by fitness level
    Introduction of mutations
    Utilities for reporting results

Class Name: Evolve.class
Explanation: The smallest class of the three, Evolve is simply used to get the algorithm running and to determine when to end the evolving process. Nearly all actual functionality is borrowed from the other classes, so this class can be thought of as an organizer of the entire process.

Table 2. The list of Java classes comprising the genetic algorithm and their role in the process.

The previous section gives an overall idea of how the genetic algorithm works.

Results

An Example Genetic Algorithm Run

Figure 2 shows the progress made by a genetic algorithm when run on a set of five proteins. This is only meant to display the manner in which the GA arrives at its conclusion, and does not directly correspond to the final results.

[Figure 2 plot: "An Example Run of the Genetic Algorithm". Series: Average ΔpI. X axis: # of Generations.]

Figure 2. This graph shows progress made by the genetic algorithm on a set of five randomly chosen proteins from the ΔpI > 0.7 data set.

The five proteins were selected at random from the ΔpI > 0.7 data set for use in this example. Typical of most GA runs, the algorithm makes quick improvements early in the run, [...] seen here is most likely an indication that the underlying theories behind the GA need to be strengthened.

In this example, the algorithm was allowed to run for 50 generations without any improvement on the top fitness value. Great improvements can be noted for the proteins in this example, as the ΔpI values went from being over 0.7 on average to having an average ΔpI of 0.03. Unfortunately, results like these are uncommon when using a larger number of protein sequences.

Suggested pKa Values

The genetic algorithm was run on four different protein data sets before being run on the complete Escherichia coli data. Each of the four sets corresponded to a different level of discrepancy between experimental and predicted pI values (ΔpI), and the results of these runs are shown below in Table 3.

Dipeptide Pair   Data Set Used in GA

                 ΔpI<0.1   0.1<ΔpI<0.3   0.3<ΔpI<0.7   ΔpI>0.7   Complete

HH 6 12 13 7 1

HK 5 5 3 3 3

HR 1 8 7 10 13

HE 1 11 3 5 3

HD 13 9 7 12 12

HC 9 9 1 11 13

HY 10 8 5 11 12

KH 11 3 6 13 1

KK 5 14 11 1 9

KR 11 12 14 13 12

KE 7 13 1 3 13

KD 14 5 1 10 5

KC 11 14 11 14 14

KY 11 14 5 13 14

RH 12 7 1 1 7


RR 7 11 8 10 5

RE 12 9 9 13 14

RD 9 14 10 10 14

RC 1 1 7 12 9

RY 13 11 5 10 12

DH 5 12 1 13 5

DK 11 1 9 3 1

DR 3 5 5 11 1

DE 3 8 5 3 7

DD 5 1 3 5 5

DC 7 14 3 13 3

DY 5 3 13 3 12

EH 11 1 1 1 1

EK 3 1 2 3 1

ER 5 5 5 3 5

EE 3 3 5 5 7

ED 5 3 6 11 5

EC 13 3 1 1 5

EY 1 10 11 1 1

CH 6 3 3 1 1

CK 11 7 5 3 1

CR 5 9 9 3 3

CE 7 8 1 8 7

CD 3 13 1 10 12

CC 5 1 5 7 10

CY 9 1 3 13 7

YH 13 1 12 13 13

YK 8 13 14 11 12

YR 11 3 1 13 1

YE 12 14 8 1 1

YD 5 9 13 10 10

YC 14 1 13 10 9

YY 12 14 1 13 11

Table 3. pKa values suggested by GA for incorporation into the pI prediction algorithm. Each column shows the values suggested when using the data set indicated.

Each column represents the pKa values suggested by the genetic algorithm when running on a different set of data. For instance, the first column of data represents the fittest chromosome from the GA runs using proteins assigned to the ΔpI < 0.1 data set. When used in the pI prediction algorithm, these dipeptide pKa values resulted in the highest average accuracy level for that group.

At first glance, there are certain aspects of Table 3 that stand out as problem areas. Most notable is the inconsistency when comparing one column to the next. A number of times a value suggested for use from one data set is very distant from that from another data set. For example, the GA suggested a pKa value of 6 for histidine when it occurs next to another histidine in the ΔpI < 0.1 data set. Moving across to the 0.1 < ΔpI < 0.3 column, the value suggested for the same dipeptide pair is much higher, at 12.

In addition, some of the values suggested by the algorithm don't entirely make sense. Aspartic acid has a default value of 4.05, but has suggested pKa values upwards of 13 from the genetic algorithm. A shift of this magnitude seems improbable and is evidence that the fitness function associated with this GA may need alteration. Alone, this information has little to say about how each suggested dipeptide pKa has affected the accuracy of pI prediction. In the following series of graphs, the suggested pKa values from Table 3 are put to the test when the new ΔpI values are compared to those of the original pI prediction. Again, the difference between the original and new algorithms is the incorporation of dipeptide pKa values that were expected to have a positive effect on the overall prediction accuracy. For complete Excel [...]

Effects on pI Prediction

Using ΔpI < 0.1 Data Set

[Figure 3 plot: "Effects of Modified Algorithm on ΔpI < 0.1 Data Set". Series: ΔpI Using Original Algorithm, ΔpI Using Modified Algorithm. X axis: Protein Accession #.]

Figure 3. A comparison of ΔpI before and after the incorporation of dipeptide pKa values into the prediction algorithm for the ΔpI < 0.1 data. Using pKa values suggested by the GA for the ΔpI < 0.1 data set, the pink, jagged line shows ΔpI values when using the modified algorithm. The blue line corresponds to ΔpI values for the same protein set when using the original, unmodified algorithm for prediction.

Further evidence is found in Figure 3, where we see a clear indication that not all proteins were positively affected by the new prediction method. The blue, gradually increasing line represents the ΔpI before addition of dipeptide pKa values and the jagged, pink line shows the new discrepancy levels. While some improvements can be seen (where the pink line dips below the blue), the majority of the results show a negative impact on prediction, especially in those proteins that previously showed a fairly high level of accuracy. The increasing nature of the blue line in Figure 3 and the figures to follow is simple to explain: prior to creating these graphs, the proteins were sorted by the original ΔpI values, which were calculated using the original pI prediction method.

Using 0.1 < ΔpI < 0.3 Data Set

[Figure 4 plot: "Effects of Modified Algorithm on 0.1 < ΔpI < 0.3 Data Set". Series: ΔpI Using Original Algorithm, ΔpI Using Modified Algorithm. X axis: Protein Accession #.]

Figure 4. A comparison of ΔpI before and after the incorporation of dipeptide pKa values into the prediction algorithm for the 0.1 < ΔpI < 0.3 data. Using pKa values suggested by the GA for the 0.1 < ΔpI < 0.3 data set, the pink, jagged line shows ΔpI values when using the modified algorithm. The blue line corresponds to ΔpI values for the same protein set when using the original, unmodified algorithm for prediction.

Similar results are found in Figures 4, 5 and 6, showing both positive and negative impacts on prediction accuracy. However, it becomes clear that the data sets with greater discrepancy levels generally yield a greater overall improvement in prediction accuracy. Consider the comparison of Figures 3 and 6. On only three occasions did the new prediction accuracy decrease for the ΔpI > 0.7 data set (Figure 6), whereas the negative impacts seem to outweigh the positive for the ΔpI < 0.1 data. The theme can also be seen in comparing Figures 3 and 5, where there is an improvement in accuracy on all but four proteins in the 0.3 < ΔpI < 0.7 data set (Figure 5).

Using 0.3 < ΔpI < 0.7 Data Set

[Figure 5 plot: "Effects of Modified Algorithm on 0.3 < ΔpI < 0.7 Data Set". Series: ΔpI Using Original Algorithm, ΔpI Using Modified Algorithm. X axis: Protein Accession #.]

Figure 5. A comparison of ΔpI before and after the incorporation of dipeptide pKa values into the prediction algorithm for the 0.3 < ΔpI < 0.7 data. Using pKa values suggested by the GA for the 0.3 < ΔpI < 0.7 data set, the pink, jagged line shows ΔpI values when using the modified algorithm. The blue line corresponds to ΔpI values for the same protein set when using the original, unmodified algorithm for prediction.

Using ΔpI > 0.7 Data Set

[Figure 6 plot: "Effects of Modified Algorithm on ΔpI > 0.7 Data Set"; x-axis: Protein Accession #; series: ΔpI Using Original Algorithm, ΔpI Using Modified Algorithm]

Figure 6. A comparison of ΔpI before and after the incorporation of dipeptide pKa values into the prediction algorithm for the ΔpI > 0.7 data. Using pKa values suggested by the GA for the ΔpI > 0.7 data set, the pink, jagged line shows ΔpI values when using the modified algorithm. The blue line corresponds to ΔpI values for the same protein set when using the original, unmodified algorithm for prediction.

Using Entire E. coli Data Set

Following the GA runs on each of the four partial data sets, the best chromosome from each run was used to seed one last run on the entire E. coli data, and the pKa values suggested by that run were plugged into the algorithm. The results of this run are shown in Figure 7. Again, it is evident that most improvements came for those proteins with high ΔpI values, while the modified algorithm faltered for the proteins that were originally predicted more accurately.

[Figure 7 plot: "Effects of Modified Algorithm on Complete Data Set"; x-axis: E. coli Protein Data Set; series: ΔpI Using Original Algorithm, ΔpI Using Modified Algorithm]

Figure 7. A comparison of ΔpI before and after the incorporation of dipeptide pKa values into the prediction algorithm for the complete E. coli data. Using pKa values suggested by the GA for the entire data set, the pink, jagged line shows ΔpI values when using the modified algorithm. The blue line corresponds to ΔpI values for the same protein set when using the original, unmodified algorithm for prediction.

Furthermore, the entire data set was used to test the modified algorithm when incorporating pKa values from the ΔpI < 0.1, 0.1 < ΔpI < 0.3, 0.3 < ΔpI < 0.7 and ΔpI > 0.7 data sets, and the corresponding results can be found in Figures 8, 9, 10, and 11. Again, the inaccuracies tend to overshadow the positive effects on pI prediction.

Effects on Entire E. coli Data Set Using Values Predicted in ΔpI < 0.1 Data

[Figure 8 plot: "Using Values Suggested from ΔpI < 0.1 Data"; x-axis: Proteins in E. coli Data Set; series: ΔpI Using Original Algorithm, ΔpI Using Modified Algorithm]

Figure 8. A comparison of ΔpI before and after the incorporation of dipeptide pKa values into the prediction algorithm for the complete E. coli data. Using pKa values suggested by the GA for the ΔpI < 0.1 data set, the pink, jagged line shows ΔpI values when using the modified algorithm. The blue line corresponds to ΔpI values for the same protein set when using the original, unmodified algorithm for prediction.

Effects on Entire E. coli Data Set Using Values Predicted in 0.1 < ΔpI < 0.3 Data

[Figure 9 plot: "Using Values Suggested from 0.1 < ΔpI < 0.3 Data"; x-axis: Proteins in E. coli Data Set; series: ΔpI Using Original Algorithm, ΔpI Using Modified Algorithm]

Figure 9. A comparison of ΔpI before and after the incorporation of dipeptide pKa values into the prediction algorithm for the complete E. coli data. Using pKa values suggested by the GA for the 0.1 < ΔpI < 0.3 data set, the pink, jagged line shows ΔpI values when using the modified algorithm. The blue line corresponds to ΔpI values for the same protein set when using the original, unmodified algorithm for prediction.

Effects on Entire E. coli Data Set Using Values Predicted in 0.3 < ΔpI < 0.7 Data

[Figure 10 plot: "Using Values Suggested from 0.3 < ΔpI < 0.7 Data"; x-axis: Proteins in E. coli Data Set; series: ΔpI Using Original Algorithm, ΔpI Using Modified Algorithm]

Figure 10. A comparison of ΔpI before and after the incorporation of dipeptide pKa values into the prediction algorithm for the complete E. coli data. Using pKa values suggested by the GA for the 0.3 < ΔpI < 0.7 data set, the pink, jagged line shows ΔpI values when using the modified algorithm. The blue line corresponds to ΔpI values for the same protein set when using the original, unmodified algorithm for prediction.

Effects on Entire E. coli Data Set Using Values Predicted in ΔpI > 0.7 Data

[Figure 11 plot: "Effects of Modified Algorithm on Entire Data Set Using Values Suggested from ΔpI > 0.7 Data"; x-axis: Proteins in E. coli Data Set; series: ΔpI Using Original Algorithm, ΔpI Using Modified Algorithm]

Figure 11. A comparison of ΔpI before and after the incorporation of dipeptide pKa values into the prediction algorithm for the complete E. coli data. Using pKa values suggested by the GA for the ΔpI > 0.7 data set, the pink, jagged line shows ΔpI values when using the modified algorithm. The blue line corresponds to ΔpI values for the same protein set when using the original, unmodified algorithm for prediction.

As an alternative method for displaying these results, average ΔpI values for each data set are shown in Table 4. Although the overall accuracy appears to have decreased slightly, with the average ΔpI for the complete set rising from 0.31 to 0.33, the average ΔpI value decreased by about 45% in the 0.3 < ΔpI < 0.7 data set (from 0.5060 to 0.2782), and by roughly 30% for proteins in the ΔpI > 0.7 set. While the problem of prediction accuracy clearly still remains, these results may be a step in the right direction.
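The averages discussed here are means of |actual pI - predicted pI| over a protein set. As a minimal sketch of that arithmetic (the class and method names below are hypothetical and do not come from the thesis code), the average ΔpI and the relative change between two averages could be computed as follows:

```java
/** Illustrative helper; the class and method names are hypothetical, not taken from the thesis code. */
final class DeltaPiSummary {

    /** Mean of |actual - predicted| over paired arrays of pI values. */
    static double averageDeltaPi(double[] actualPi, double[] predictedPi) {
        if (actualPi.length != predictedPi.length || actualPi.length == 0) {
            throw new IllegalArgumentException("arrays must be non-empty and of equal length");
        }
        double total = 0.0;
        for (int i = 0; i < actualPi.length; i++) {
            total += Math.abs(actualPi[i] - predictedPi[i]);
        }
        return total / actualPi.length;
    }

    /** Relative change in average delta-pI; a negative result means the modified average is smaller. */
    static double relativeChange(double originalAvg, double modifiedAvg) {
        return (modifiedAvg - originalAvg) / originalAvg;
    }
}
```

Applied to the 0.3 < ΔpI < 0.7 row of Table 4, relativeChange(0.5060, 0.2782) comes out near -0.45, that is, a reduction of roughly 45% in average ΔpI.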

Average ΔpI Values Before and After

Table 4 shows the average ΔpI values for each data set before and after incorporation of dipeptide pKa values in the prediction algorithm.

Data Set                                     Original Avg. ΔpI   Modified Avg. ΔpI
ΔpI < 0.1                                    0.0455              0.0970
0.1 < ΔpI < 0.3                              0.18                0.17
0.3 < ΔpI < 0.7                              0.5060              0.2782
ΔpI > 0.7                                    1.1148              0.8403
Complete Set                                 0.3069              0.3340
Complete Set Using ΔpI < 0.1 Values          0.3069              0.3583
Complete Set Using 0.1 < ΔpI < 0.3 Values    0.3069              0.3793
Complete Set Using 0.3 < ΔpI < 0.7 Values    0.3069              0.3627
Complete Set Using ΔpI > 0.7 Values

Discussion

Overall, it appears that our learning algorithm wasn't completely effective in improving on isoelectric point prediction in proteins. While one can only speculate as to exactly why the results appeared as they did, one idea was that the training data set was insufficient for the GA to produce reasonable results.

To test this theory, a final experiment was performed using what is known as a "leave-one-out" approach. This approach addresses the training set problem by including all but one protein in the training data. For example, in a data set that contains 170 proteins, the first GA training run included proteins #2-170, while protein #1 was set aside as the testing data. After collecting pKa values from the GA run on the training data, those values were incorporated into the pI prediction algorithm to test their effects on prediction for the testing data, or protein #1. This information, including the experimental pI value, the predicted pI value, and the predicted pI value from the modified algorithm, was then recorded into a table.

After recording the data, protein #1 was put back into the set of proteins and protein #2 was removed and set aside as the testing data. Again the GA was run and results were collected and recorded as they were in the first run. Next, protein #2 was re-introduced into the data set and protein #3 was removed and set aside, and this process was repeated until each of the 170 proteins in the data set had at one time been set aside as testing data, at which point the experiment was complete. The next step was to compare average ΔpI values of the original and modified pI prediction algorithms.
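The leave-one-out procedure described above can be sketched roughly as a loop that holds out one protein at a time. This is only an illustration under stated assumptions: the Protein record, the trainer function (standing in for a full GA run that returns suggested pKa values) and the predictor function (standing in for the modified pI prediction) are hypothetical names, not the thesis implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.function.ToDoubleBiFunction;

/** Hypothetical sketch of a leave-one-out evaluation loop. */
final class LeaveOneOutSketch {

    /** Minimal protein record: an accession label and the experimental pI. */
    static final class Protein {
        final String accession;
        final double experimentalPi;

        Protein(String accession, double experimentalPi) {
            this.accession = accession;
            this.experimentalPi = experimentalPi;
        }
    }

    /**
     * For each protein, train on all the others, predict the held-out protein,
     * and return the average |experimental pI - predicted pI|.
     *
     * trainer:   assumed stand-in for a GA run; maps training proteins to pKa values
     * predictor: assumed stand-in for the modified pI prediction using those pKa values
     */
    static double averageDeltaPi(List<Protein> proteins,
                                 Function<List<Protein>, double[]> trainer,
                                 ToDoubleBiFunction<double[], Protein> predictor) {
        if (proteins.isEmpty()) {
            throw new IllegalArgumentException("no proteins supplied");
        }
        double total = 0.0;
        for (int held = 0; held < proteins.size(); held++) {
            // copy the full set, then set one protein aside as the testing data
            List<Protein> training = new ArrayList<>(proteins);
            Protein testProtein = training.remove(held);

            double[] pKaValues = trainer.apply(training);
            double predictedPi = predictor.applyAsDouble(pKaValues, testProtein);
            total += Math.abs(testProtein.experimentalPi - predictedPi);
        }
        return total / proteins.size();
    }
}
```

Because every protein is held out exactly once, a set of 170 proteins requires 170 separate training runs, which is why this kind of experiment is expensive when each run is a full GA.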

Using the original prediction algorithm, the average ΔpI was 0.31. Using the leave-one-out approach to optimize the pI prediction showed a significant decrease in accuracy, ending up with an average ΔpI of 0.47. While this wasn't the result that was hoped for, it is consistent with results from the previous experiment, where we were unable to improve on overall prediction accuracies for the complete data set.

Vast possibilities exist for expanding on this work in an attempt to significantly improve our pI prediction algorithm. First, cutting down the list of dipeptides in the chromosome might make the genetic algorithm more efficient. Throughout the GA runs, it became clear that chromosomes containing notably different pKa values could oftentimes result in very close fitness values. If those dipeptides were greatly impacting the prediction algorithm, we would expect to see consistent results. Instead, the inconsistencies might indicate that the charges on the side chains of these adjacent amino acids do not affect one another to a large extent, in which case trying to account for them may actually hurt prediction accuracy. A study to identify which dipeptides to drop might include a comparison of sequence characteristics between the positively and negatively affected proteins. By narrowing the search space in this manner, the chances of having a positive effect without the negative repercussions should increase.

Another problem area in this study, and a possibility for further research, might involve limiting how far the suggested pKa values are allowed to deviate from the default. As previously mentioned, some of the pKa values were more than doubled in the modified prediction algorithm. To illustrate this problem, we might consider any randomly chosen dipeptide, like a histidine-aspartic acid combination, for instance.

Histidine has a default pKa of 5.98, but when it occurred next to aspartic acid in the ΔpI > [...] times the H-D dipeptide occurred in this data, which is the smallest set of the four. At the start of the GA runs, all pKa values are randomly generated, so if there were a low occurrence of H-D combinations in any data set, the fitness value wouldn't be affected as much as it is by frequently occurring dipeptides. In turn, this means that outrageous pKa values might not end up being replaced and could survive in the fittest population.
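How strongly a dipeptide's pKa can influence the fitness value depends on how often that dipeptide occurs in the data. As a small illustrative sketch (the class name, method name and example sequence are hypothetical), occurrences of a dipeptide such as H-D in a one-letter amino acid sequence could be counted like this:

```java
/** Hypothetical helper for counting adjacent residue pairs in a one-letter amino acid sequence. */
final class DipeptideCount {

    /** Counts positions i where sequence[i..i+1] equals the given two-letter dipeptide. */
    static int count(String sequence, String dipeptide) {
        if (dipeptide.length() != 2) {
            throw new IllegalArgumentException("a dipeptide must be exactly two residues");
        }
        int occurrences = 0;
        for (int i = 0; i + 1 < sequence.length(); i++) {
            if (sequence.charAt(i) == dipeptide.charAt(0)
                    && sequence.charAt(i + 1) == dipeptide.charAt(1)) {
                occurrences++;
            }
        }
        return occurrences;
    }
}
```

Summing such counts over a data set would show which of the 49 dipeptide positions in the chromosome are rarely exercised, and therefore which suggested pKa values are only weakly constrained by the fitness function.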

This point could be used to explain the results when running on the complete data set. Again, the pKa value for H-D was suggested to be 12. For this example it is important to keep in mind that the GA run using the complete data set was initially seeded with the top chromosomes from each of the four previous runs. This means that the top chromosome from the ΔpI > 0.7 run was used, immediately introducing a pKa value of 12 into the population. Even in the instance that the chromosome wasn't in the top 5% fitness level and didn't survive to the next generation, it's likely that the value of 12 for H-D stayed intact through a series of mating and crossover events. If at some point the chromosome containing that pKa value was in the top 5% of fitness levels, it was automatically moved to the next generation, saving that value for the histidine-aspartic acid pair.
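The step in which the top 5% of chromosomes pass unchanged into the next generation is commonly called elitism. The following is a minimal sketch of that selection step only, assuming (this is an assumption, not stated in the text above) that each chromosome carries an error score where lower is fitter, such as an average ΔpI. The Scored class and elite method are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch of elitist selection: carry the fittest fraction forward unchanged. */
final class ElitismSketch {

    /** A chromosome (binary-encoded genes) paired with its error score; lower is assumed fitter. */
    static final class Scored {
        final String[] genes;
        final double error;

        Scored(String[] genes, double error) {
            this.genes = genes;
            this.error = error;
        }
    }

    /** Returns the fittest `fraction` of the population (at least one member if any exist). */
    static List<Scored> elite(List<Scored> population, double fraction) {
        List<Scored> sorted = new ArrayList<>(population);
        sorted.sort(Comparator.comparingDouble((Scored s) -> s.error));
        int keep = Math.max(1, (int) Math.floor(sorted.size() * fraction));
        return new ArrayList<>(sorted.subList(0, Math.min(keep, sorted.size())));
    }
}
```

Calling elite(population, 0.05) on a population of 40 would return the two lowest-error chromosomes. Any gene value they carry, including an extreme one such as a pKa of 12 for H-D, is preserved exactly, which is the mechanism the paragraph above describes.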

Over time, the fittest chromosomes end up being reproduced more readily, which in this example would mean that the value of 12 for the H-D pair dominates the population even though it might not have a significant effect on prediction accuracy. Eventually, this value is incorporated into the algorithm, and could end up having a negative impact on the prediction. Therefore, limiting how far the pKa values can deviate would decrease the negative impact in cases such as this, so that such values do not overshadow the suggested pKa values that really are having a positive effect.
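One simple way to limit deviation would be to clamp each suggested pKa to a window around its default value. The sketch below is illustrative only: the class and method names are hypothetical, and the window width is a free parameter that this work does not specify.

```java
/** Hypothetical helper that bounds a GA-suggested pKa to a window around the default value. */
final class PkaBounds {

    /**
     * Returns `suggested` if it lies within `maxDeviation` of `defaultPka`;
     * otherwise returns the nearest edge of the allowed window.
     */
    static double clampToDefault(double suggested, double defaultPka, double maxDeviation) {
        if (maxDeviation < 0) {
            throw new IllegalArgumentException("maxDeviation must be non-negative");
        }
        double low = defaultPka - maxDeviation;
        double high = defaultPka + maxDeviation;
        return Math.max(low, Math.min(high, suggested));
    }
}
```

With histidine's default pKa of 5.98 and an assumed window of 2.0 units, a suggested value of 12 would be pulled back to about 7.98, while a suggestion of 6.5 would pass through unchanged.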

A third approach might be to expand on the chargeable groups and introduce uncharged amino acids into the equation. Previous research has shown that N-terminal asparagine had a significant impact on the predicted pI value [5]. Although the means by which this occurs remain unclear, one possibility may be that the hydrophobic, uncharged amino acids interfere with charged, adjacent side chains when in contact with water. Although possibilities are extensive for this type of research, one method of attacking this problem might be to consider events where a hydrophobic amino acid rests between charged side chains. Similar to the research presented here, an evolutionary programming approach could be used in an attempt to find pKa values that remedy the problem.

Conclusion

This thesis work has investigated the possibility of improving isoelectric point prediction by using evolutionary programming to account for charge-charge interactions within the sequence. While an increase in accuracy was seen on a small scale, it was not substantial enough and was overshadowed by decreases in accuracy in other areas. For that reason we cannot say our work has resulted in a better algorithm. However, isoelectric point prediction is a difficult problem that still has much room for investigation. While the results failed to yield evidence of an overall accuracy increase, the information presented here puts us one small step closer to successful pI prediction and provides a genetic algorithm that may prove useful in future studies.

Bibliography

[1] Hamdan, H. and Righetti, P.G. (2005) "Proteomics Today: Protein Assessment and Biomarkers Using Mass Spectrometry, 2D Electrophoresis, and Microarray Technology". Hoboken, NJ, John Wiley and Sons, Inc.

[2] Fichmann, J. and Westermeier, R. (1999) "2-D Protein Gel Electrophoresis: An Overview." Methods in Molecular Biology: Vol. 112 (1-9)

[3] Conte, M. (2005) "Isoelectric Point Prediction From the Amino Acid Sequence of a Protein" submitted as part of a Master's Thesis Project at RIT in 2004

[4] Mitchell, M. (1998) "An Introduction to Genetic Algorithms" The MIT Press, 1999

[5] Cargile, B.J., Talley, D.L., Stephenson, J.L. (2004) "Immobilized pH gradients as a first dimension in shotgun proteomics and analysis of the accuracy of pI predictability of peptides". Electrophoresis 25: 936-945

[6] Horstmann, C.S. (2001) Big Java: Programming and Practice. Wiley, 1st Edition

[7] "SWISS-2DPAGE Two-dimensional polyacrylamide gel electrophoresis database" Found at http://us.expasy.org/ch2d/

[8] Bjellqvist, B., Hughes, G., Pasquali, C., Paquet, N., Ravier, F., Sanchez, J.-C., et al. (1993) "The focusing positions of polypeptides in immobilized pH gradients can be predicted from their amino acid sequences". Electrophoresis 14: 1023-1031.

[9] Tonella L., Hoogland C., Binz P.-A., Appel R.D., Hochstrasser D.F., Sanchez J.-C. "New perspectives in the Escherichia coli proteome investigation". Proteomics 1: 409-423 (2001).

[10] "Compute pI/Mw for Swiss-Prot/TrEMBL entries or a user-entered sequence". Found at http://us.expasy.org/tools/pi_tool.html

[11] Sillero A., Ribeiro, J.M. (1989) "Isoelectric points of proteins: theoretical determination". Analytical Biochemistry 179: 319-325

[12] Righetti, P.G., Caravaggio, T. (1976) "Isoelectric points and molecular weights of proteins". Journal of Chromatography

[13] Cargile, B.J., et al. (2004) "Gel Based Isoelectric Focusing of Peptides and the Utility of Isoelectric Point in Protein Identification." Journal of Proteome Research 3.1 (2004): 112-9.

[14] "Get protein list for a reference map." Found at http://www.expasy.org/cgi-bin/get-ch2d-table.pl

[15] "NCBI Batch Entrez search". Found at http://www.ncbi.nlm.nih.gov/entrez/batchentrez.cgi?db=Protein

Appendix A - Escherichia coli Data Set

Color Codes: ΔpI < 0.1; 0.1 < ΔpI < 0.3; 0.3 < ΔpI < 0.7; ΔpI > 0.7

Protein Actual pI Predicted pI |Actual-Pred|

P0AE08 5.05 5.050048828 4.88E-05

P05055 5.13 5.129943848 5.62E-05

P45578 5.2 5.200439453 4.39E-04

P0AEZ9 5.74 5.7421875 0.0021875

P37689 5.15 5.152587891 0.002587891

P0ABB0 5.81 5.806274414 0.003725586

P0A6L0 5.52 5.514892578 0.005107422

P13029 5.16 5.16583252 0.00583252

P61714 5.19 5.183349609 0.006650391

P23869 5.51 5.502929688 0.007070312

P09030 5.8 5.807983398 0.007983398

P0AEZ3 5.28 5.26965332 0.01034668

P0A7A9 5.06 5.049194336 0.010805664

P0A6P1 5.22 5.234619141 0.014619141

P0AFU8 5.67 5.655029297 0.014970703

P0ABB4 4.95 4.932983398 0.017016602

P0A817 5.1 5.121826172 0.021826172

P0AB71 5.56 5.537109375 0.022890625

P36683 5.24 5.263671875 0.023671875

P0ACU7 5 4.973144531 0.026855469

P0A6E4 5.28 5.252563477 0.027436523

P0A6F5 4.91 4.879150391 0.030849609

P00509 5.53 5.561035156 0.031035156

P39172 5.58 5.611450195 0.031450195

P16703 5.47 5.437988281 0.032011719

P0AE67 4.95 4.915039063 0.034960938

P0ADU2 5.77 5.807128906 0.037128906

P0A9C3 4.9 4.860778809 0.039221191

P0A877 5.38 5.338867188 0.041132812

P0C054 5.63 5.588378906 0.041621094

P62707 5.82 5.861816406 0.041816406

P0A799 5.15 5.107299805 0.042700195

P0AAI9 5.03 4.98425293 0.04574707

P0A6M8 5.21 5.256408691 0.046408691

P26646 5.6 5.648193359 0.048193359

P07004 5.39 5.438842773 0.048842773

P0A6D3 5.34 5.389282227 0.049282227

P0A7Z4 5.04 4.988952637 0.051047363

P0A870 5.08 5.132080078 0.052080078

P63284 5.44 5.383728027 0.056271973

P0A796 5.43 5.487548828 0.057548828

P08312 5.74 5.799438477 0.059438477

P0AGE0 5.41 5.472167969 0.062167969

P0A6G7 5.6 5.537109375 0.062890625

P0A6F9 5.23 5.166259766 0.063740234

P0A7F3 7.01 6.941894531 0.068105469

P0AE18 5.71 5.638793945 0.071206055

P24216 4.96 4.888549805 0.071450195

P09832 5.48 5.551635742 0.071635742

P23721 5.47 5.391845703 0.078154297

P0A850 4.93 4.849884033 0.080115967

P0AB55 5.29 5.208984375 0.081015625

P76149 5.38 5.46105957 0.08105957

P0AG67 4.99 4.908416748 0.081583252

P16659 5.06 5.146606445 0.086606445

P0AA25 4.8 4.711669922 0.088330078

P0A9A9 5.78 5.688354492 0.091645508

P08142 5.22 5.314086914 0.094086914

P18843 5.34 5.434570313 0.094570313

P0AC55 5.78 5.875488281 0.095488281

P05194 5.31 5.213256836 0.096743164

P0A6Y8 4.96 4.863128662 0.096871338

P0A9M5 5.44 5.538818359 0.098818359

P0A6D7 5.18 5.280761719 0.100761719

P0AED0 5.2 5.097900391 0.102099609

P0A9D2 5.76 5.863525391 0.103525391

P0A8G6 5.51 5.615722656 0.105722656

P0AG78 6.49 6.596679688 0.106679687

P0AEQ3 7.32 7.435791016 0.115791016

P60595 5.24 5.359375 0.119375

P0ABU2 5.02 4.900085449 0.119914551

P0A7L0 8.24 8.115966797 0.124033203

P0ABD8 4.78 4.654846191 0.125153809

P0AF03 4.84 4.965454102 0.125454102

P04949 4.7 4.573669434 0.126330566

P68066 4.98 5.106445313 0.126445312

P0A6E6 5.34 5.46875 0.12875

P0A6A3 5.72 5.8515625 0.1315625

P75797 7.32 7.186279297 0.133720703

P29744 4.82 4.683044434 0.136955566

P09029 5.75 5.612304688 0.137695313

P61889 5.49 5.629394531 0.139394531

P00547 5.33 5.472167969 0.142167969

P0A7E1 7.28 7.13671875 0.14328125

P67910 4.98 4.835571289 0.144428711

P0AGD3 5.45 5.595214844 0.145214844

P0A9A6 4.83 4.680480957 0.149519043

P00946 5.16 5.31237793 0.15237793

P12758 5.66 5.82421875 0.16421875

P0A955 5.43 5.595214844 0.165214844

P0A8M0 5.01 5.195739746 0.185739746

P69783 4.95 4.762939453 0.187060547

P0AFC7 5.42 5.612304688 0.192304688

P0A9Q9 5.2 5.393554688 0.193554687

P25553 5.29 5.095336914 0.194663086

P0AEX9 5.23 5.435424805 0.205424805

P28635 4.95 5.156860352 0.206860352

P0A9G6 4.98 5.189758301 0.209758301

P0A6W5 4.95 4.73815918 0.21184082

P39177 6.25 6.037841797 0.212158203

P0A862 5.02 4.800537109 0.219462891

P0A715 6.1 6.323242188 0.223242188

P0AC69 4.96 4.727050781 0.232949219

P0A7N1 8.3 8.065551758 0.234448242

P0ABU5 4.92 4.685180664 0.234819336

P0A7K2 4.87 4.633056641 0.236943359

P0AES9 4.84 5.0859375 0.2459375

P0AD96 5.31 5.561889648 0.251889648


P38489 5.55 5.812255859 0.262255859

P0AEK4 5.33 5.595214844 0.265214844

P0A6N1 5.58 5.314086914 0.265913086

P0A855 7.05 6.78125 0.26875

P46850 5.65 5.928466797 0.278466797

P0A9C5 5 5.282470703 0.282470703

P35340 5.2 5.485839844 0.285839844

P0A9M2 5.38 5.080810547 0.299189453

P04036 5.11 5.46105957 0.35105957

P37902 7.87 7.516113281 0.353886719

P0A940 5.33 4.967163086 0.362836914

P0A763 5.19 5.557617188 0.367617187

P0A8X2 5.2 5.591796875 0.391796875

P63020 4.96 4.568115234 0.391884766

P76290 5.8 5.373046875 0.426953125

P04816 5.08 5.516601563 0.436601562

P0A8Q6 5.4 4.959472656 0.440527344

P0AG82 6.85 7.293945313 0.443945313

P0ABT2 5.27 5.718261719 0.448261719

P09551 5.17 5.622558594 0.452558594

P08200 4.7 5.17565918 0.47565918

P0A8P3 5.46 5.937011719 0.477011719

P23847 5.71 6.196777344 0.486777344

P37329 6.7 7.187988281 0.487988281

P0AEU0 4.99 5.489257813 0.499257812

P0AEE5 5.19 5.697753906 0.507753906

P0AFK9 4.76 5.270507813 0.510507813

P0ADG7 5.49 6.017333984 0.527333984

P16700 6.58 7.128173828 0.548173828

P69441 5 5.567871094 0.567871094

P0AFZ3 5.01 4.428833008 0.581166992

P23843 5.47 6.052368164 0.582368164

P69797 5.17 5.755859375 0.585859375

P0AGE9 5.73 6.321533203 0.591533203

P0A6P9 4.74 5.344848633 0.604848633

P18335 5.19 5.797729492 0.607729492

P0A858 5.01 5.649047852 0.639047852

P31663 5.25 5.926757813 0.676757813

P0C0V0 8.01 7.329833984 0.680166016

P0A879 5.03 5.717407227 0.687407227

P0ADG4 5.71 6.453125 0.743125

P30859 5.07 5.813964844 0.743964844

P00894 8.11 7.352050781 0.757949219

P0AET2 5 5.762695313 0.762695313

P0A910 5.23 5.996826172 0.766826172

P61316 5.52 6.306152344 0.786152344

P0ABK5 4.95 5.843017578 0.903017578

P0ADE8 6.11 5.179931641 0.930068359

P0ADA3 8.84 7.902770996 0.937229004

P0AFL3 8.52 7.567382813 0.952617187

P76002 9.2 8.246704102 0.953295898

P0AFG0 5.4 6.37109375 0.97109375

P0AD59 5.33 6.306152344 0.976152344

P0AEM9 5.19 6.230957031 1.040957031

P0A7R1 5.1 6.186523438 1.086523438

P77348 8.55 7.314453125 1.235546875

P0A9B2 5.32 6.583007813 1.263007812

P33136 8.04 6.608642578 1.431357422

P00811 9.06 7.55456543 1.50543457

P0ADV7 10.3 7.978393555 2.321606445

P68919 10.6 8.2578125 2.3421875

Appendix B - Genetic Algorithm Source Code

Chromosome.java

/*
 * Chromosome.java
 *
 * author: Chris Parkin
 * date: September, 2006
 *
 * Class for construction and manipulation of a chromosome, which contains
 * information regarding pKa values for adjacent amino acid pairs (dipeptides)
 */
public class Chromosome {

    private String[] chromosome; // array representing the chromosome

    // chromosome labels
    private static String[] represented;

    /***************************************************************
     * default constructor (randomly assigned values)
     * each position in the array is assigned a 4-bit binary value;
     * (int)((Math.random() * 14) + 1) yields integers from 1 to 14
     ***************************************************************/
    public Chromosome() {
        represented = new String[] {
            "HH","HK","HR","HE","HD","HC","HY",
            "KH","KK","KR","KE","KD","KC","KY",
            "RH","RK","RR","RE","RD","RC","RY",
            "DH","DK","DR","DE","DD","DC","DY",
            "EH","EK","ER","EE","ED","EC","EY",
            "CH","CK","CR","CE","CD","CC","CY",
            "YH","YK","YR","YE","YD","YC","YY"};

        chromosome = new String[represented.length];
        for (int i = 0; i < represented.length; i++) {
            int intValue = (int)((Math.random() * 14) + 1);
            String binary = Integer.toBinaryString(intValue);
            if (binary.length() < 4) {
                binary = addLeadingZeros(binary, 4 - binary.length());
            }
            chromosome[i] = binary;
        }
    }

    /***************************************************************
     * create an empty chromosome of specific length
     ***************************************************************/
    public Chromosome(int length) {
        chromosome = new String[length];
    }

    /***************************************************************
     * create a chromosome based on an input string array
     ***************************************************************/
    public Chromosome(String[] input) {
        chromosome = input;
    }

    /***************************************************************
     * getLength
     * return the length of the chromosome array
     ***************************************************************/
    public int getLength() {
        return chromosome.length;
    }

    /***************************************************************
     * getValueAt
     * return the current pKa value for the specified array index in binary format
     * @param: arrayIndex - the index corresponding to the represented array
     ***************************************************************/
    public String getValueAt(int arrayIndex) {
        return chromosome[arrayIndex];
    }

    /***************************************************************
     * getIntValueAt
     * return the current pKa value for the specified array index in int form
     * @param: arrayIndex - the index corresponding to the represented array
     ***************************************************************/
    public int getIntValueAt(int arrayIndex) {
        return Integer.parseInt(chromosome[arrayIndex], 2);
    }

    /***************************************************************
     * mutate
     * mutate the chromosome by flipping a randomly chosen bit
     ***************************************************************/
    public void mutate() {
        // randomly choose an index in the chromosome
        int indexToFlip = (int)(Math.random() * chromosome.length);

        // get the binary value to mutate
        String beforeMutation = chromosome[indexToFlip];

        // randomly choose which of the four bits to flip
        int bitToFlip = (int)(Math.random() * 4);

        // convert the string into something that can be edited
        char[] editable = beforeMutation.toCharArray();

        // determine to flip to a 1 or a 0
        if (editable[bitToFlip] == '0') {
            editable[bitToFlip] = '1';
        } else {
            editable[bitToFlip] = '0';
        }

        // create a new, post-mutation string and stick it back in the chromosome
        // (the listing is cut off after "new"; the rest of this method is assumed from the comment above)
        String afterMutation = new String(editable);
        chromosome[indexToFlip] = afterMutation;
    }

    // addLeadingZeros is called above but its definition is not part of the
    // surviving listing; this is an assumed, minimal implementation
    private static String addLeadingZeros(String binary, int count) {
        StringBuilder padded = new StringBuilder();
        for (int i = 0; i < count; i++) {
            padded.append('0');
        }
        return padded.append(binary).toString();
    }
}
