An Evolutionary Computation Approach to Optimization of Isoelectric Point Prediction in Proteins


Rochester Institute of Technology

RIT Scholar Works

Theses

2006


Chris Parkin


Recommended Citation

Parkin, Chris, "An Evolutionary Computation Approach to Optimization of Isoelectric Point Prediction in Proteins" (2006). Thesis. Rochester Institute of Technology. Accessed from

This Thesis is brought to you for free and open access by RIT Scholar Works. It has been accepted for inclusion in Theses by an authorized administrator of RIT Scholar Works. For more information, please contact


To:

Head, Department of Biological Sciences

Rochester Institute of Technology

Department of Biological Sciences

Bioinformatics Program

The undersigned state that Christopher Parkin, a candidate for the Master of Science degree in Bioinformatics, has submitted his/her thesis and has satisfactorily defended it.

This completes the requirements for the Master of Science degree in Bioinformatics at Rochester Institute of Technology.

Thesis committee members:

Name

Paul Craig (Committee Chair)

Paul Craig (Thesis Advisor)

Illegible Signature

Gary R. Skuse, Ph.D., Director of Bioinformatics

Date

Thesis/Dissertation Author Permission Statement

Title of thesis or dissertation: An Evolutionary Computation Approach to Optimization of Isoelectric Point Prediction in Proteins

Name of author: Christopher Parkin

Degree: Master of Science in Bioinformatics

College: Science

I understand that I must submit a print copy of my thesis or dissertation to the RIT Archives, per current RIT guidelines for the completion of my degree. I hereby grant to the Rochester Institute of Technology and its agents the non-exclusive license to archive and make accessible my thesis or dissertation in whole or in part in all forms of media in perpetuity. I retain all other ownership rights to the copyright of the thesis or dissertation. I also retain the right to use in future works (such as articles or books) all or part of this thesis or dissertation.

Print Reproduction Permission Granted:

I, Christopher Parkin, hereby grant permission to the Rochester Institute of Technology to reproduce my print thesis or dissertation in whole or in part. Any reproduction will not be for commercial use or profit.

Signature of Author: Christopher Parkin    Date:

Print Reproduction Permission Denied:

I, ____________________, hereby deny permission to the RIT Library of the Rochester Institute of Technology to reproduce my print thesis or dissertation in whole or in part.

Signature of Author: ____________________ Date: _____

Inclusion in the RIT Digital Media Library Electronic Thesis & Dissertation (ETD) Archive

I, Christopher Parkin, additionally grant to the Rochester Institute of Technology Digital Media Library (RIT DML) the non-exclusive license to archive and provide electronic access to my thesis or dissertation in whole or in part in all forms of media in perpetuity.

I understand that my work, in addition to its bibliographic record and abstract, will be available to the

world-wide community of scholars and researchers through the RIT DML. I retain all other ownership

rights to the copyright of the thesis or dissertation. I also retain the right to use in future works (such as

articles or books) all or part of this thesis or dissertation. I am aware that the Rochester Institute of

Technology does not require registration of copyright for ETDs.

I hereby certify that, if appropriate, I have obtained and attached written permission statements from the

owners of each third party copyrighted matter to be included in my thesis or dissertation. I certify that the

version I submitted is the same as that approved by my committee.

An Evolutionary Computation Approach to Optimization of Isoelectric Point Prediction in Proteins

Submitted by

Chris Parkin

Department of Biological Sciences

In partial fulfillment of the requirements for the Master of Science degree in Bioinformatics at Rochester Institute of Technology

Abstract

An Evolutionary Computation Approach to Optimization of Isoelectric Point Prediction in Proteins

by Christopher Parkin

Master of Science in Bioinformatics

Rochester Institute of Technology

Professor Paul Craig, Chair

Computational biology has attacked the problem of isoelectric point prediction with little success, achieving a rough accuracy level of only 30%. In 2005, Matthew Conte performed a study focused on the relationship between sequence characteristics and isoelectric point prediction accuracy. Results indicated that charges between adjacent amino acids could have a significant impact on the overall predicted pI for the protein. In this study we introduce an evolutionary computation approach aimed at accounting for these problem dipeptides. For each possible dipeptide involving charged amino acids (7 chargeable groups -> 49 possibilities), the algorithm predicts a pKa value that, when included in the pI prediction algorithm, should result in a more accurate prediction. By accounting for these charged, adjacent amino acids, the pI prediction showed improvements for those proteins with the greatest deviation between experimental and predicted pI value (ΔpI>0.7). However, these results were not generalized, as the incorporation of these values had the reverse effect on remaining proteins, most notably those from the most accurate data set (ΔpI<0.1). While this research lays a foundation for improving the pI prediction algorithm, additional exploration remains necessary for an overall accuracy

ExPASy 2DPAGE Database
2.2 Trimming the Data Set
2.3 Training & Testing
2.4 The Genetic Algorithm
2.4.1 The Chromosome
2.4.2 Fitness
2.4.3 Tournament Selection
2.4.4 Crossover
2.4.5 Mutation
2.4.6 Java Classes
3 Results
3.1 An Example GA Run
3.2 Suggested pKa Values
3.3 Effects on pI Prediction
3.3.1 Using ΔpI<0.1 Data
3.3.2 Using 0.1<ΔpI<0.3 Data
3.3.3 Using 0.3<ΔpI<0.7 Data
3.3.4 Using ΔpI>0.7 Data
3.3.4 Using Complete Data Set
3.3.5 Overall Effect Using ΔpI<0.1
3.3.6 Overall Effect Using 0.1<ΔpI<0.3
3.3.7 Overall Effect Using 0.3<ΔpI<0.7
3.3.8 Overall Effect Using ΔpI>0.7
3.3.9 Average ΔpI Values
4 Discussion
5 Conclusion

Introduction

Two-dimensional gel electrophoresis (2DE) first emerged in 1975 when Dr. Patrick O'Farrell displayed the ability to separate 1,100 polypeptides from Escherichia coli [1]. With the theory and technique being slightly ahead of its time, it was initially practiced by only a handful of scientists around the world. Since then, the emergence of new analytical tools, combined with numerous large-scale, public information databases, has shed a whole new light on this once dormant technique [2]. Today, 2DE remains a leading technique for separation and identification of proteins.

Isoelectric focusing (IEF) is the main focus of this study and makes up the first dimension of 2DE. IEF is a method in which amphoteric molecules are separated in a polyacrylamide gel according to their isoelectric point values [2]. When placed in a pH gradient, a protein will migrate to the position where its net charge is equal to zero. The pH at this position is known as the isoelectric point (pI) value. Isoelectric point is determined by charged groups in the protein, and is often between 3 and 12, with most falling between 4 and 7 [11,12].

Traditional techniques used to form pH gradients involved mixing ampholytes that had been chemically engineered to a certain pKa value [13]. While this method worked efficiently, the pH gradient was extremely difficult to reproduce. Since then, immobilized pH gradients (IPGs) have been introduced. In an IPG, the ampholytes are bound in acrylamide gel, forming a fixed pH gradient and ensuring reproducibility [8,13].

The second dimension of 2DE is a separation by molecular mass. A charge is applied to a buffer that surrounds the gel, attracting the molecules to the opposite end and causing them to migrate. The larger of these molecules travel the slowest and will remain near the top of the gel, while the smaller molecules will travel further and be seen toward the bottom of the gel. After staining, the end result of 2DE is a grid of spots with each spot referring to the location of a protein molecule in the gel (Figure 1). The X value in the grid corresponds to the pI value of that protein, while the Y value corresponds to the distance migrated in the gel.

The application of this technique has proven to be a powerful tool and has provided researchers with a great amount of data [5]. However, the difficulty and time requirements associated with performing and interpreting 2DE correctly have led to the emergence of computational approaches to 2DE [5]. While the benefits associated with simulations are often quite attractive, the limitations placed upon the pI prediction portion of the 2DE simulation have proved to be the Achilles heel of the entire simulation.

Figure 1. Sample output from 2-Dimensional Electrophoresis. Obtained from Swiss 2DPAGE database, protein ID# P16700 [7,10].

The isoelectric point prediction algorithm to be optimized in this study is part of a 2DE simulator that was originally developed at the Rochester Institute of Technology as part of an honors thesis project [3]. This algorithm was implemented to calculate charge, based on side chains of amino acids found in the sequence. The charge on each side chain is a function of the

charge on the amino acid side chains is shown in Table 1.

Amino Acid           Default pKa Value
R - Arginine         12
D - Aspartic Acid    4.05
E - Glutamic Acid    4.45
H - Histidine        5.98
K - Lysine           10
C - Cysteine         9
Y - Tyrosine         10

Table 1. Default pKa values used in original pI prediction algorithm.

Using the values from Table 1 and starting with a pH of 7, the algorithm looks at each individual amino acid in the sequence and computes its charge. Each individual amino acid charge is then added to a running total of charge, resulting in total charge for the protein. If the total charge is greater than 0.005 or less than -0.005, the pH value used in the calculation is adjusted and the charge calculation is repeated. This cycle continues until the total charge is reported to be between -0.005 and 0.005, practically zero. Finally, the pH value resulting in a net charge of zero on the protein is returned, and considered the pI value for that protein. While the pKa values are heavily relied on in this calculation, variables such as post-translational modifications and charge-charge interactions are left unaccounted for, significantly affecting prediction accuracy [3].
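The iterative charge calculation described above can be sketched in Java as follows. This is a minimal illustration under stated assumptions, not the simulator's code: the text does not give the charge formula or say how the pH is adjusted, so the Henderson-Hasselbalch expression, the bisection update, the 0 to 14 pH bounds, the iteration cap, and the names `PiSketch`, `netCharge` and `predictPi` are all hypothetical. Only the Table 1 pKa values, the starting pH of 7 and the ±0.005 tolerance come from the description.

```java
import java.util.Map;

class PiSketch {
    // Default side-chain pKa values from Table 1.
    static final Map<Character, Double> PKA = Map.of(
        'R', 12.0, 'D', 4.05, 'E', 4.45, 'H', 5.98,
        'K', 10.0, 'C', 9.0, 'Y', 10.0);

    // Assumption: R, H and K are positive when protonated;
    // D, E, C and Y are negative when deprotonated.
    static boolean isBasic(char aa) {
        return aa == 'R' || aa == 'H' || aa == 'K';
    }

    // Assumption: Henderson-Hasselbalch fractional charge of one side chain.
    static double sideChainCharge(char aa, double pKa, double pH) {
        return isBasic(aa)
            ? 1.0 / (1.0 + Math.pow(10.0, pH - pKa))
            : -1.0 / (1.0 + Math.pow(10.0, pKa - pH));
    }

    // Running total of side-chain charges over the whole sequence.
    static double netCharge(String seq, double pH) {
        double total = 0.0;
        for (char aa : seq.toCharArray()) {
            Double pKa = PKA.get(aa);
            if (pKa != null) total += sideChainCharge(aa, pKa, pH);
        }
        return total;
    }

    // Start at pH 7 and adjust until the net charge lies within +/-0.005.
    // Assumption: the adjustment is a bisection between pH 0 and pH 14.
    static double predictPi(String seq) {
        double lo = 0.0, hi = 14.0, pH = 7.0;
        for (int i = 0; i < 100; i++) {
            double charge = netCharge(seq, pH);
            if (Math.abs(charge) <= 0.005) break;
            if (charge > 0) lo = pH; else hi = pH;  // charge falls as pH rises
            pH = (lo + hi) / 2.0;
        }
        return pH;
    }
}
```

Because the net charge only decreases as pH rises, any monotone search would serve here; bisection is used simply because it is short.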

In 2005, Matthew Conte performed sequential analyses on numerous proteins from the E. coli proteome, obtained from the Swiss 2DPAGE database [3, 7]. In doing so, he uncovered a correlation between the occurrence of charged dipeptides in the sequence and the level of discrepancy between experimental and predicted pI, known in his study as well as this one as ΔpI [3]. His results showed that the higher the number of charge-charge dipeptides in the sequence, the greater the deviation between actual and predicted pI value for that protein [3].

The pI calculation is based on the pKa values for the amino acid side chains. Based on his results, our hope was to derive new pKa values using a genetic algorithm. As in Conte's work, Escherichia coli was the proteome of choice. E. coli is thought to have a relatively low number of post-translational modifications such as methylation and phosphorylation, and it is one of the most widely studied bacteria in science, making it an ideal subject [3]. Furthermore, experimental isoelectric point data from only one group was used, assuring consistency in lab practices and data submission [3,9].

Now a cornerstone in biology, evolution and the underlying theory of natural selection are credited to Charles Darwin after his research in the mid 19th century [4]. His theory of natural selection proposed that individuals best adapted to their surrounding environment are more likely to survive and mate. Over time, those with the less-favorable traits die out, while favorable traits are passed on, eventually introducing adaptations into the population.

Evolutionary computation models like the genetic algorithm (GA) used in this study loosely follow Darwin's theories. In this case, each individual in the population is a set of pKa values used to accommodate the charge-charge amino acid pairs that normally hurt the accuracy of pI prediction. In each generation, the most well adapted individuals are those that lead to the most accurate pI prediction, and are known as the fittest of the population.

According to evolution, the fittest individuals are those most likely to survive and mate, so the fittest from each generation automatically survive into the next. Over time, simulated processes of mutation, crossover and recombination are applied to each generation, resulting in a population of the best possible individuals. Further details regarding the workings of the GA are

Methods

ExPASy 2DPAGE Database

The ExPASy server's Swiss 2DPAGE database (http://ca.expasy.org/ch2d/) contains vast 2DE gel information for human, mouse, Arabidopsis thaliana, Dictyostelium discoideum, Escherichia coli, Saccharomyces cerevisiae, and Staphylococcus aureus (N315) [7]. For each protein in the database, information regarding experimental pI value, molecular weight, experimental methods, references, and a photo of the actual gel run in the experiment is available [7].

Because many groups have contributed to this database, it is not uncommon to find multiple submissions for any one protein. For that reason and to keep experimental practices consistent, only those entries from Tonella were used in this study [9]. For ease of use, the Swiss 2DPAGE allows for the information to be downloaded into a tab delimited text file to be imported to a spreadsheet [14]. The fields available in this file include gene name, description, Swiss 2DPAGE accession number, spot ID, experimental pI, experimental molecular weight, mapping methods, comment topics and a reference to the group carrying out the experiments.

Trimming the Data Set

After obtaining the initial Tonella data set containing roughly 340 proteins, it was not uncommon to see up to eight entries for any one protein. Again, duplications are a result of post-translational modifications that cause a change in pI/MW on the protein, leading to a unique spot on the gel. Because most of these duplicate pI values were quite similar (often within .01 of one another), an average pI value was taken to represent the protein in the data set, and the remaining duplicates removed. In the event that drastically different pI values were recorded, only the first entry was saved, and that protein was omitted from training the genetic algorithm later in the study.

170 proteins remained after all duplicates were removed from the data set, which were then broken into four groups based on the difference between experimental and predicted pI value, known as ΔpI<0.1, 0.1<ΔpI<0.3, 0.3<ΔpI<0.7, and ΔpI>0.7 (Appendix A). The greatest concern for this study were proteins found in the ΔpI>0.7 and 0.3<ΔpI<0.7 data sets, with expectations that improving those predictions would greatly improve the overall accuracy level for the algorithm. For a complete listing of the proteins used after trimming, see Appendix A.
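The grouping into four data sets can be illustrated with a short Java sketch. The class and method names are hypothetical, and two details are assumptions because the text does not state them: ΔpI is taken as the absolute difference between experimental and predicted pI, and a value falling exactly on a boundary (0.1, 0.3 or 0.7) is placed in the higher group.

```java
class DeltaPiBins {
    // Assumption: deltaPI is the absolute difference between the two values.
    static double deltaPi(double experimentalPi, double predictedPi) {
        return Math.abs(experimentalPi - predictedPi);
    }

    // Assigns a protein to one of the four data sets named in the text.
    // Assumption: boundary values go to the higher group.
    static String bin(double experimentalPi, double predictedPi) {
        double delta = deltaPi(experimentalPi, predictedPi);
        if (delta < 0.1) return "dpI<0.1";
        if (delta < 0.3) return "0.1<dpI<0.3";
        if (delta < 0.7) return "0.3<dpI<0.7";
        return "dpI>0.7";
    }
}
```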

Sequence Gathering

All Swiss 2DPAGE protein entries are cross-linked with the Swiss-Prot database, making it possible to acquire FASTA formatted sequence through the NCBI Batch Entrez search [15]. To use this tool, a simple list of the 170 protein accession numbers was uploaded to the NCBI, which returned all 170 proteins in FASTA format. To easily associate the experimental pI value with the protein sequence, each experimental value was manually entered into the second line of the respective FASTA file of that protein. This resulted in one large FASTA formatted file containing all 170 proteins, complete with accession numbers, pI value and sequence. A Perl script then parsed this file and saved each protein sequence separately, basing the file name on the protein's accession number. Finally, another short program was written to read in all 170 protein sequences and sort them according to the difference between experimental and predicted pI value. All data files used in this research have been saved into a compressed folder and can be obtained at http://www.rit.edu/~cdp3511/thesis/

Training & Testing Data

After this organization was complete, each data set was run through the algorithm in the following manner. First, a folder was made to contain "training data," which contained four proteins chosen from the data set. Each of these proteins was deemed acceptable (only one Swiss 2DPAGE submission per protein) and was automatically read from the directory by the GA, which requires directory name as input. The algorithm was then set to run starting with a randomly generated population. After 80 generations on average, the ideal set of pKa values that resulted in the best overall pI prediction for that data set was displayed.

Next, a similar run was carried out on four new proteins, known as "testing" data. This time, the initial population was seeded with a chromosome representing the fittest values from the previous run. Theoretically, if the pKa values found in the previous run lead to an accuracy increase for the training data, they could be expected to make a positive impact on accuracy for similar proteins from the same organism. Again, the results were collected, compared to the original, and the 4 proteins previously known as testing data were added to the training data.

All eight proteins were then run at once, this time being seeded with two chromosomes, one from the original training run and one from the first testing run. The results from this were then used to seed four new proteins that become the testing set. This process continued until all acceptable proteins from the data set were a part of the training data, giving the best overall pKa values for that set.

The Genetic Algorithm

The genetic algorithm, written in the Java programming language, was the driving force behind this project. As previously indicated, the GA is set up to loosely simulate evolution and follows Charles Darwin's theory of "survival of the fittest". As mentioned, the original prediction algorithm steps through the sequence, looking at one amino acid at a time. In the following sections, the ideas and code behind the algorithm are explained.

The Chromosome

The first step in any GA is to develop an initial population of what are called chromosomes [4]. A chromosome is an object representing the parameters used to optimize the problem at hand. For the purposes of this experiment, a chromosome could be defined as an array of binary integer values that represent pKa values, one for each dipeptide of interest. For example, if we wanted to represent an arginine when it occurs next to another arginine, or an arginine next to an aspartic acid (as they might occur in the protein sequence), the array might hold values such as "0011" or "0110". When converted to integers, these binary strings equal "3" and "6", which would become the respective pKa values associated with "RR" and "RD" in that chromosome.

Each chromosome then holds the entire set of pKa values used to optimize the pI prediction algorithm. The initial population is obtained by using a random number generator, providing a number between 0 and 14 inclusive to represent each pKa value.
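As an illustration of this encoding, the Java sketch below converts between pKa integers and four-bit strings and builds one random chromosome. It is not the thesis's Chromosome class; the names `ChromosomeSketch`, `encode`, `decode` and `randomChromosome` are hypothetical, and representing a chromosome as a plain `String[]` is an assumption made for brevity.

```java
import java.util.Random;

class ChromosomeSketch {
    static final int BITS = 4;  // one four-bit string per dipeptide pKa value

    // Encode a pKa value (0..15) as a zero-padded four-bit string.
    static String encode(int pKa) {
        String bits = Integer.toBinaryString(pKa);
        return "0".repeat(BITS - bits.length()) + bits;
    }

    // Decode a binary string back into its integer pKa value.
    static int decode(String bits) {
        return Integer.parseInt(bits, 2);
    }

    // Build one random chromosome with the given number of dipeptide genes,
    // drawing each initial value from 0 to 14 inclusive as the text describes.
    static String[] randomChromosome(int genes, Random rng) {
        String[] chromosome = new String[genes];
        for (int i = 0; i < genes; i++) {
            chromosome[i] = encode(rng.nextInt(15));
        }
        return chromosome;
    }
}
```

With this encoding, `decode("0011")` gives 3 and `decode("0110")` gives 6, matching the example in the text. Four bits can also represent 15, a value the initial random draw never produces but that later bit-level operations could reach.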

Fitness

After the initial generation is in place, each chromosome is tested for what is known as its "fitness." As noted, each chromosome holds the parameters that are utilized in the prediction algorithm, outputting a predicted pI value. In this experiment, a chromosome's fitness can be defined as the average difference between the experimental and predicted pI value for each protein being tested. Therefore, if we have 100 chromosomes and are testing on a set of 10 proteins, that means that for each generation, the pI prediction is carried out 1,000 times. Testing on the entire data set means 100 chromosomes tested on 170 proteins, for 17,000 calculations per generation.
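The fitness definition above amounts to a mean of per-protein discrepancies, which the following Java sketch makes concrete. The names are hypothetical, the predictor is passed in as a function so the block stands alone, and the use of the absolute difference is an assumption (the text says only "average difference"). Under this definition a lower value means a fitter chromosome.

```java
import java.util.function.ToDoubleFunction;

class FitnessSketch {
    // Mean absolute difference between experimental and predicted pI over
    // the proteins being tested. The predictor stands in for the pI
    // prediction algorithm configured with one chromosome's pKa values.
    static double fitness(String[] sequences, double[] experimentalPi,
                          ToDoubleFunction<String> predictor) {
        double sum = 0.0;
        for (int i = 0; i < sequences.length; i++) {
            double predicted = predictor.applyAsDouble(sequences[i]);
            sum += Math.abs(experimentalPi[i] - predicted);
        }
        return sum / sequences.length;
    }
}
```

Each call runs the predictor once per protein, so 100 chromosomes evaluated on 10 proteins give the 1,000 predictions per generation mentioned above.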

Found in the fitness function, the pI prediction algorithm is simply the original algorithm, modified to look at two amino acids at a time. For example, the original algorithm would see a "K" in the sequence and assign it a pKa value of 10. Instead, the modified algorithm sees the K and then checks the amino acid immediately following. If it is an amino acid with a charged side chain, like arginine for example, the function looks at the current chromosome and extracts the corresponding pKa value for K-R, and assigns it to K. After doing so, the algorithm steps ahead one spot and sees the R, and then repeats the process. The overall fitness then depends on how well the parameters found in the chromosome work, or how close the resulting pI prediction ends up being to the experimental pI.
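The dipeptide lookup described here can be sketched in Java as below. The names are hypothetical, the chromosome is shown as a map from dipeptide to integer pKa purely for readability, and falling back to the default pKa when the next residue is uncharged (or when the residue is last in the sequence) is an assumption based on the description of the original algorithm.

```java
import java.util.Map;

class DipeptidePkaSketch {
    // The seven residues with chargeable side chains.
    static final String CHARGED = "HKRDECY";

    // Returns the pKa to use for the charged residue at position i.
    // Precondition: seq.charAt(i) is one of the CHARGED residues.
    static double pKaAt(String seq, int i,
                        Map<String, Integer> chromosome,
                        Map<Character, Double> defaults) {
        char current = seq.charAt(i);
        boolean hasNext = i + 1 < seq.length();
        if (hasNext && CHARGED.indexOf(seq.charAt(i + 1)) >= 0) {
            // Next residue is charged: use the chromosome's dipeptide value.
            String pair = "" + current + seq.charAt(i + 1);
            Integer value = chromosome.get(pair);
            if (value != null) return value;
        }
        // Assumption: otherwise keep the default single-residue pKa.
        return defaults.get(current);
    }
}
```

For the sequence "KRA" with a chromosome value of 11 for "KR", the K is assigned 11, while the R (followed by an uncharged alanine) keeps its default of 12.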

After all chromosomes in the generation have been assigned a fitness value, they are sorted. The top 5% fittest chromosomes are called "survivors," and are automatically placed in the next generation. Remaining chromosomes are chosen in pairs to represent parents, and they are mated to produce two new offspring.

Tournament Selection

The method by which chromosomes are chosen for mating is known as a "tournament" selection. Many variations of tournament selection exist, with the chosen method mostly being personal preference. In this case, the tournament selection starts out by selecting 4 chromosomes at random, excluding the surviving 5%. From the four selected chromosomes, the two with the best fitness values are mated by crossover. For example, consider the following parent chromosomes:

Parent A = 1010 1100 0011 0101

Parent B = 1111 0000 1100 0011

Now, consider the possible children resulting from a cross of Parent A and Parent B:

Child A = 1010 1000 1100 0011

Child B = 1111 0100 0011 0101

Notice the effects that this crossover had on the second pKa values for these chromosomes. Initially, the second pKa value listed in Parent A had a value of "1100" or 12, while Parent B was "0000" or 0. Following the crossover, Child A has "1000" or 8 while Child B has "0100", or 4.

By implementing this type of crossover, as well as introducing random mutation of individual bits, numerous variations can be quickly introduced into the population, simulating evolution (see the sections on crossover and mutation for more information). The tournament selection repeats, again selecting four chromosomes at random and mating the fittest two, until the new generation contains the desired number of chromosomes (default set to 100 for this experiment).
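One round of the tournament described above might look like the following Java sketch. The names are hypothetical and several details are assumptions not stated in the text: the population is already sorted by fitness so the survivors occupy the first indices, the four entrants are distinct, and a lower fitness value (smaller average ΔpI) is better.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Random;

class TournamentSketch {
    // Picks four chromosomes at random from the non-survivors and returns
    // the indices of the two fittest, which become the parents.
    // Precondition: at least four non-survivors exist.
    static int[] selectParents(double[] fitness, int survivors, Random rng) {
        List<Integer> pool = new ArrayList<>();
        for (int i = survivors; i < fitness.length; i++) {
            pool.add(i);  // exclude the surviving top chromosomes
        }
        Collections.shuffle(pool, rng);
        List<Integer> entrants = new ArrayList<>(pool.subList(0, 4));
        // Lower fitness value (smaller average discrepancy) ranks first.
        entrants.sort(Comparator.comparingDouble(i -> fitness[i]));
        return new int[] { entrants.get(0), entrants.get(1) };
    }
}
```

Repeating this call and mating each returned pair until the new generation holds the desired number of chromosomes reproduces the loop described in the text.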

Crossover

To keep the mating process unbiased, crossover and mutation were both implemented randomly. As mentioned, a crossover requires two parent chromosomes, and results in the creation of two offspring. First, a crossover point is determined using a random number generator. Because a Chromosome object is actually an array of binary strings, this determination must actually be done in two steps:

1. Randomly select an index in the Chromosome array to set the crossover point in. This should be a number from 0-24 inclusive and points to one four bit pKa value.

2. Within the string selected at that index, choose a point to crossover. Each string has a

After the crossover point is selected, the crossover is carried out as previously demonstrated, with the second half of one chromosome added to the first half of the other, and vice versa.
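The two-step crossover can be sketched in Java as follows. The names are hypothetical, and because the text describing the range of the second step is incomplete, the choice of a split position from 0 to 3 within the four-bit string is an assumption. The fixed-point method `cross` reproduces the Parent A / Parent B example given earlier when called with array index 1 and bit position 1.

```java
import java.util.Random;

class CrossoverSketch {
    // Single-point crossover: genes before `index` are kept, genes after it
    // are swapped, and the gene at `index` is split after `bit` bits.
    static String[][] cross(String[] a, String[] b, int index, int bit) {
        String[] childA = new String[a.length];
        String[] childB = new String[a.length];
        for (int i = 0; i < a.length; i++) {
            if (i < index) {
                childA[i] = a[i];
                childB[i] = b[i];
            } else if (i > index) {
                childA[i] = b[i];
                childB[i] = a[i];
            } else {
                childA[i] = a[i].substring(0, bit) + b[i].substring(bit);
                childB[i] = b[i].substring(0, bit) + a[i].substring(bit);
            }
        }
        return new String[][] { childA, childB };
    }

    // Step 1: random array index. Step 2: random position in that string.
    static String[][] randomCross(String[] a, String[] b, Random rng) {
        int index = rng.nextInt(a.length);
        int bit = rng.nextInt(a[index].length());  // assumed range 0..3
        return cross(a, b, index, bit);
    }
}
```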

Mutation

Mutations are simply another way to introduce variation into the population and occur roughly 5% of the time. Although different from the crossover, they work in a similar manner. After the two random selections are made, the selection is simply flipped from 0 to 1 or 1 to 0.

For instance:

Chromosome A before mutation = 1001 1011 0011 0101 0111

If Chromosome A was to be selected for mutation and the second position in the array, third position in that string were selected, the mutation would end up as follows:

Chromosome A after mutation = 1001 1001 0011 0101 0111

The resulting Chromosome has gone from having a pKa value of 11 in the second position to one having a pKa of 9, which could have a significant impact on the overall pI prediction. Mutation of fit chromosomes could have a detrimental effect on overall population fitness. To avoid this problem mutation rates are kept low, no higher than 5%.
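A bit-flip mutation of this kind can be sketched in Java as shown below. The names are hypothetical, and applying the 5% rate once per chromosome (rather than per gene or per bit) is an assumption, since the text says only that mutations occur roughly 5% of the time.

```java
import java.util.Random;

class MutationSketch {
    // Flips one bit: gene `index` in the array, position `bit` in its string.
    // Both positions are zero-based; the input array is left unchanged.
    static String[] flipBit(String[] chromosome, int index, int bit) {
        String[] mutated = chromosome.clone();
        char[] bits = mutated[index].toCharArray();
        bits[bit] = (bits[bit] == '0') ? '1' : '0';
        mutated[index] = new String(bits);
        return mutated;
    }

    // Assumption: with probability `rate`, one randomly chosen bit of the
    // chromosome is flipped; otherwise the chromosome is returned as is.
    static String[] maybeMutate(String[] chromosome, double rate, Random rng) {
        if (rng.nextDouble() >= rate) return chromosome;
        int index = rng.nextInt(chromosome.length);
        int bit = rng.nextInt(chromosome[index].length());
        return flipBit(chromosome, index, bit);
    }
}
```

Calling `flipBit` with index 1 and bit 2 on the Chromosome A example (second array position, third string position) turns "1011" into "1001", the change from 11 to 9 described above.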

In addition to automatically being placed into the next population, the fittest chromosomes are saved after each generation. If after a pre-determined number of generations (always between 50 and 150 in this study), the fittest chromosome has not changed, that fitness is determined to be the best possible under those conditions, and the corresponding pKa values are returned.
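The stopping rule can be illustrated with a small Java helper that counts generations without improvement. The class name and its methods are hypothetical, and comparing best fitness values (rather than the chromosomes themselves) to decide whether the fittest chromosome "has not changed" is an assumption.

```java
class StagnationSketch {
    private final int patience;   // generations allowed without improvement
    private double best = Double.POSITIVE_INFINITY;
    private int unchanged = 0;

    StagnationSketch(int patience) {
        this.patience = patience;
    }

    // Records the best (lowest) fitness of the latest generation and
    // returns true once `patience` generations have passed without
    // any improvement on the best value seen so far.
    boolean update(double generationBest) {
        if (generationBest < best) {
            best = generationBest;
            unchanged = 0;
        } else {
            unchanged++;
        }
        return unchanged >= patience;
    }

    double best() {
        return best;
    }
}
```

A driver loop would call `update` once per generation with a `patience` between 50 and 150 and return the saved fittest chromosome's pKa values when it first yields true.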

Java Classes

Containing roughly 800 lines of code (comments included), the program consisted of three classes, the Generation class, the Chromosome class, and the Evolve class. See Table 2 for an explanation of each of the three classes and the important functions within those classes.

Class Name: Chromosome.class

Explanation: Used for representing pKa values for the dipeptides in question, a Chromosome object is an array of binary strings used to represent integers. This class is used to perform operations such as:

    Random creation of new chromosomes
    Mating (crossover) and offspring creation
    Fitness determination
    Information gathering from Chromosomes themselves

Class Name: Generation.class

Explanation: The Generation class is a sort of container for the chromosomes in each population. Availability of a Generation object becomes especially useful when passing the surviving chromosomes from one generation to the next. Functionality contained within this class includes:

    Creation of initial, random generation
    Creation of a new generation based on chromosomes from the previous generation (aforementioned tournament selection)
    Utilities for accessing individual chromosomes within the generation
    Sorting by fitness level
    Introduction of mutations
    Utilities for reporting results

Class Name: Evolve.class

Explanation: The smallest class of the three, Evolve is simply used to get the algorithm running and to determine when to end the evolving process. Mostly all actual functionality is borrowed from the other classes, so this class can be thought of as an organizer of the entire process.

Table 2. The list of Java classes comprising the genetic algorithm and their role in the process.

The previous section gives an overall idea of how the genetic algorithm works. For the

Results

An Example Genetic Algorithm Run

Figure 2 shows the progress made by a genetic algorithm when run on a set of five proteins. This is only meant to display the manner in which the GA arrives at its conclusion, and doesn't directly correspond to the final results.

[Figure: An Example Run of the Genetic Algorithm. Y axis: Average ΔpI; X axis: # of Generations.]

Figure 2. This graph shows progress made by the genetic algorithm on a set of five randomly chosen proteins from the ΔpI>0.7 data set.

The five proteins were selected at random from the ΔpI>0.7 data set for use in this example. Typical of most GA runs, the algorithm makes quick improvements early in the run, and starts to slow as time progresses. When considering the later results, the rapid convergence seen here is most likely an indication that the underlying theories behind the GA need to be strengthened.

In this example, the algorithm was allowed to run for 50 generations without any improvement on the top fitness value. Great improvements can be noted for the proteins in this example, as the ΔpI values went from being over 0.7 on average to having an average ΔpI of 0.03. Unfortunately, results like these are uncommon when using a larger number of protein sequences.

Suggested pKa Values

The genetic algorithm was run on four different protein data sets before being run on the complete Escherichia coli data. Each of the four sets corresponded to a different level of discrepancy between experimental and predicted pI values (ΔpI), and the results of these runs are shown below in Table 3.

Dipeptide Pair   Data Set Used in GA
                 ΔpI<0.1   0.1<ΔpI<0.3   0.3<ΔpI<0.7   ΔpI>0.7   Complete
HH               6         12            13            7         1
HK               5         5             3             3         3
HR               1         8             7             10        13
HE               1         11            3             5         3
HD               13        9             7             12        12
HC               9         9             1             11        13
HY               10        8             5             11        12
KH               11        3             6             13        1
KK               5         14            11            1         9
KR               11        12            14            13        12
KE               7         13            1             3         13
KD               14        5             1             10        5
KC               11        14            11            14        14
KY               11        14            5             13        14
RH               12        7             1             1         7
RK               8         9             14            11        11
RR               7         11            8             10        5
RE               12        9             9             13        14
RD               9         14            10            10        14
RC               1         1             7             12        9
RY               13        11            5             10        12
DH               5         12            1             13        5
DK               11        1             9             3         1
DR               3         5             5             11        1
DE               3         8             5             3         7
DD               5         1             3             5         5
DC               7         14            3             13        3
DY               5         3             13            3         12
EH               11        1             1             1         1
EK               3         1             2             3         1
ER               5         5             5             3         5
EE               3         3             5             5         7
ED               5         3             6             11        5
EC               13        3             1             1         5
EY               1         10            11            1         1
CH               6         3             3             1         1
CK               11        7             5             3         1
CR               5         9             9             3         3
CE               7         8             1             8         7
CD               3         13            1             10        12
CC               5         1             5             7         10
CY               9         1             3             13        7
YH               13        1             12            13        13
YK               8         13            14            11        12
YR               11        3             1             13        1
YE               12        14            8             1         1
YD               5         9             13            10        10
YC               14        1             13            10        9
YY               12        14            1             13        11

Table 3. pKa values suggested by GA for incorporation into the pI prediction algorithm. Each column shows the values suggested when using the data set indicated.

Each column represents the pKa values suggested by the genetic algorithm when running on a different set of data. For instance, the first column of data represents the fittest chromosome from the GA runs using proteins assigned to the ΔpI<0.1 data set. When used in the pI prediction algorithm, these dipeptide pKa values resulted in the highest average accuracy level for that group.

At first glance, there are certain aspects of Table 3 that stand out as problem areas. Most notable is the inconsistency when comparing one column to the next. A number of times a value suggested for use from one data set is very distant from that from another data set. For example, the GA suggested a pKa value of 6 for histidine when it occurs next to another histidine in the ΔpI<0.1 data set. Moving across to the 0.1<ΔpI<0.3 column, the value suggested for the same dipeptide pair is much higher, at 12.

In addition, some of the values suggested by the algorithm don't entirely make sense. Aspartic Acid has a default value of 4.05, but has suggested pKa values upwards of 13 from the genetic algorithm. A shift of this magnitude seems improbable and is evidence that the fitness function associated with this GA may need alteration. Alone, this information has little to say about how each suggested dipeptide pKa has affected the accuracy of pI prediction. In the following series of graphs, the suggested pKa values from Table 3 are put to the test when the new ΔpI values are compared to those of the original pI prediction. Again, the difference between the original and new algorithms is the incorporation of dipeptide pKa values that were expected to have a positive effect on the overall prediction accuracy. For complete excel

Effects on pI Prediction

Using ΔpI < 0.1 Data Set

[Figure: Effects of Modified Algorithm on ΔpI < 0.1 Data Set. Series: ΔpI Using Original Algorithm; ΔpI Using Modified Algorithm. X axis: Protein Accession #.]

Figure 3. A comparison of ΔpI before and after the incorporation of dipeptide pKa values into the prediction algorithm for the ΔpI < 0.1 data. Using pKa values suggested by the GA for the ΔpI<0.1 data set, the pink, jagged line shows ΔpI values when using the modified algorithm. The blue line corresponds to ΔpI values for the same protein set when using the original, unmodified algorithm for prediction.

Further evidence is found in Figure 3, where we see a clear indication that not all proteins were positively affected by the new prediction method. The blue, gradually increasing line represents the ΔpI before addition of dipeptide pKa values, and the jagged, pink line shows the new discrepancy levels. While some improvements can be seen (where the pink line dips below the blue), the majority of the results show a negative impact on prediction, especially in those proteins that previously showed a fairly high level of accuracy. The increasing nature of the blue line in Figure 3 and the figures that follow has a simple explanation: prior to creating these graphs, the proteins were sorted by the original ΔpI values, which were calculated using the original pI prediction method.
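The sorting step described above can be sketched in a few lines of Python. This is only an illustration, not the code used in this work: the `Protein` record and the function names are hypothetical, ΔpI is taken to be |actual pI - predicted pI| as tabulated in Appendix A, and the `modified_pred` numbers are made-up placeholders (only the experimental and original predicted values come from Appendix A).

```python
from typing import NamedTuple


class Protein(NamedTuple):
    accession: str
    experimental_pi: float
    original_pred: float   # pI from the original algorithm (Appendix A)
    modified_pred: float   # pI from the modified algorithm (placeholder values)


def delta_pi(experimental: float, predicted: float) -> float:
    """Absolute discrepancy between experimental and predicted pI."""
    return abs(experimental - predicted)


def plot_order(proteins):
    """Sort proteins by their ORIGINAL delta-pI, ascending.

    Plotting in this order makes the original-algorithm line rise
    gradually, while the modified-algorithm line may jump up and down.
    """
    return sorted(
        proteins,
        key=lambda p: delta_pi(p.experimental_pi, p.original_pred),
    )


# Experimental and original predicted pI are from Appendix A;
# the last column is hypothetical and only there to complete the record.
proteins = [
    Protein("P05055", 5.13, 5.129943848, 5.20),
    Protein("P0AE08", 5.05, 5.050048828, 5.01),
    Protein("P45578", 5.20, 5.200439453, 5.21),
]

ordered = plot_order(proteins)
for p in ordered:
    print(
        p.accession,
        delta_pi(p.experimental_pi, p.original_pred),
        delta_pi(p.experimental_pi, p.modified_pred),
    )
```

Because the sort key uses only the original prediction, any protein whose modified ΔpI falls below its original ΔpI appears as a dip of the second series under the first.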

Using 0.1 < ΔpI < 0.3 Data Set

[Figure 4: line chart titled "Effects of Modified Algorithm on 0.1 < ΔpI < 0.3 Data Set". Series: ΔpI Using Original Algorithm; ΔpI Using Modified Algorithm. X-axis: Protein Accession #.]

Figure 4. A comparison of ΔpI before and after the incorporation of dipeptide pKa values into the prediction algorithm for the 0.1 < ΔpI < 0.3 data. Using pKa values suggested by the GA for the 0.1 < ΔpI < 0.3 data set, the pink, jagged line shows ΔpI values when using the modified algorithm. The blue line corresponds to ΔpI values for the same protein set when using the original, unmodified algorithm for prediction.

Similar results are found in Figures 4, 5 and 6, showing both positive and negative impacts on prediction accuracy. However, it becomes clear that the data sets with greater discrepancy levels generally yield a greater overall improvement in prediction accuracy. Consider the comparison of Figures 3 and 6. On only three occasions did the new prediction accuracy decrease for the ΔpI > 0.7 data set (Figure 6), whereas the negative impacts seem to outweigh the positive for the ΔpI < 0.1 data. The same theme can be seen in comparing Figures 3 and 5, where accuracy improved on all but four proteins in the 0.3 < ΔpI < 0.7 data set (Figure 5).

Using 0.3 < ΔpI < 0.7 Data Set

[Figure 5: line chart titled "Effects of Modified Algorithm on 0.3 < ΔpI < 0.7 Data Set". Series: ΔpI Using Original Algorithm; ΔpI Using Modified Algorithm. X-axis: Protein Accession #.]

Figure 5. A comparison of ΔpI before and after the incorporation of dipeptide pKa values into the prediction algorithm for the 0.3 < ΔpI < 0.7 data. Using pKa values suggested by the GA for the 0.3 < ΔpI < 0.7 data set, the pink, jagged line shows ΔpI values when using the modified algorithm. The blue line corresponds to ΔpI values for the same protein set when using the original, unmodified algorithm for prediction.

Using ΔpI > 0.7 Data Set

[Figure 6: line chart titled "Effects of Modified Algorithm on ΔpI > 0.7 Data Set". Series: ΔpI Using Original Algorithm; ΔpI Using Modified Algorithm. X-axis: Protein Accession #.]

Figure 6. A comparison of ΔpI before and after the incorporation of dipeptide pKa values into the prediction algorithm for the ΔpI > 0.7 data. Using pKa values suggested by the GA for the ΔpI > 0.7 data set, the pink, jagged line shows ΔpI values when using the modified algorithm. The blue line corresponds to ΔpI values for the same protein set when using the original, unmodified algorithm for prediction.

Using Entire E. coli Data Set

Following the GA runs on each of the four partial data sets, the best chromosome from each run was used to seed one last run on the entire E. coli data, and those suggested pKa values were plugged into the algorithm. The results of this run are shown in Figure 7. Again, it is evident that most improvements came for those proteins with high ΔpI values, while the modified algorithm faltered for the more accurate proteins.
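Seeding a final run with the best chromosomes from earlier runs can be illustrated as follows. This is a minimal sketch under stated assumptions rather than the implementation used here: a chromosome is assumed to be a plain list of dipeptide pKa values, the remaining population slots are assumed to be filled with random values in a hypothetical 0 to 14 range, and the names `random_chromosome` and `seeded_population` are invented for the example.

```python
import random


def random_chromosome(n_genes, low=0.0, high=14.0, rng=random):
    """One chromosome: a list of randomly generated pKa values.

    The 0-14 range is an assumption made for this sketch.
    """
    return [rng.uniform(low, high) for _ in range(n_genes)]


def seeded_population(seeds, size, rng=random):
    """Build an initial population that contains every seed chromosome.

    The seeds (for example, the best chromosome from each partial-data
    run) are copied in first; the rest of the population is random.
    """
    if not seeds:
        raise ValueError("at least one seed chromosome is required")
    if len(seeds) > size:
        raise ValueError("population size is smaller than the number of seeds")
    n_genes = len(seeds[0])
    population = [list(seed) for seed in seeds]
    while len(population) < size:
        population.append(random_chromosome(n_genes, rng=rng))
    return population


# Hypothetical two-gene chromosomes standing in for the four best runs.
best_from_runs = [[6.0, 4.5], [12.0, 13.0], [7.2, 5.1], [12.0, 3.9]]
population = seeded_population(best_from_runs, size=20, rng=random.Random(0))
```

One consequence of this design is that any extreme value carried by a seed is present in the population from the first generation onward.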

[Figure 7: line chart titled "Effects of Modified Algorithm on Complete Data Set". Series: ΔpI Using Original Algorithm; ΔpI Using Modified Algorithm. X-axis: E. coli Protein Data Set.]

Figure 7. A comparison of ΔpI before and after the incorporation of dipeptide pKa values into the prediction algorithm for the complete E. coli data. Using pKa values suggested by the GA for the entire data set, the pink, jagged line shows ΔpI values when using the modified algorithm. The blue line corresponds to ΔpI values for the same protein set when using the original, unmodified algorithm for prediction.

Furthermore, the entire data set was used to test the modified algorithm when incorporating pKa values from the ΔpI < 0.1, 0.1 < ΔpI < 0.3, 0.3 < ΔpI < 0.7 and ΔpI > 0.7 data sets, and the corresponding results can be found in Figures 8, 9, 10, and 11. Again, the inaccuracies tend to overshadow the positive effects had on pI prediction.

Effects on Entire E. coli Data Set Using Values Predicted in ΔpI < 0.1 Data

[Figure 8: line chart titled "Effects of Modified Algorithm on Complete Data Set Using Values Suggested from ΔpI < 0.1 Data". Series: ΔpI Using Original Algorithm; ΔpI Using Modified Algorithm. X-axis: Proteins in E. coli Data Set.]

Figure 8. A comparison of ΔpI before and after the incorporation of dipeptide pKa values into the prediction algorithm for the complete E. coli data. Using pKa values suggested by the GA for the ΔpI < 0.1 data set, the pink, jagged line shows ΔpI values when using the modified algorithm. The blue line corresponds to ΔpI values for the same protein set when using the original, unmodified algorithm for prediction.

Effects on Entire E. coli Data Set Using Values Predicted in 0.1 < ΔpI < 0.3 Data

[Figure 9: line chart titled "Effects of Modified Algorithm on Complete Data Set Using Values Suggested from 0.1 < ΔpI < 0.3 Data". Series: ΔpI Using Original Algorithm; ΔpI Using Modified Algorithm. X-axis: Proteins in E. coli Data Set.]

Figure 9. A comparison of ΔpI before and after the incorporation of dipeptide pKa values into the prediction algorithm for the complete E. coli data. Using pKa values suggested by the GA for the 0.1 < ΔpI < 0.3 data set, the pink, jagged line shows ΔpI values when using the modified algorithm. The blue line corresponds to ΔpI values for the same protein set when using the original, unmodified algorithm for prediction.

Effects on Entire E. coli Data Set Using Values Predicted in 0.3 < ΔpI < 0.7 Data

[Figure 10: line chart titled "Effects of Modified Algorithm on Complete Data Set Using Values Suggested from 0.3 < ΔpI < 0.7 Data". Series: ΔpI Using Original Algorithm; ΔpI Using Modified Algorithm. X-axis: Proteins in E. coli Data Set.]

Figure 10. A comparison of ΔpI before and after the incorporation of dipeptide pKa values into the prediction algorithm for the complete E. coli data. Using pKa values suggested by the GA for the 0.3 < ΔpI < 0.7 data set, the pink, jagged line shows ΔpI values when using the modified algorithm. The blue line corresponds to ΔpI values for the same protein set when using the original, unmodified algorithm for prediction.

Effects on Entire E. coli Data Set Using Values Predicted in ΔpI > 0.7 Data

[Figure 11: line chart titled "Effects of Modified Algorithm on Entire Data Set Using Values Suggested from ΔpI > 0.7 Data". Series: ΔpI Using Original Algorithm; ΔpI Using Modified Algorithm. X-axis: Proteins in E. coli Data Set.]

Figure 11. A comparison of ΔpI before and after the incorporation of dipeptide pKa values into the prediction algorithm for the complete E. coli data. Using pKa values suggested by the GA for the ΔpI > 0.7 data set, the pink, jagged line shows ΔpI values when using the modified algorithm. The blue line corresponds to ΔpI values for the same protein set when using the original, unmodified algorithm for prediction.

As an alternative method for displaying these results, average ΔpI values for each data set are shown in Table 4. Although the overall accuracy appears to have decreased slightly, with the average ΔpI rising from 0.31 to 0.33, the average ΔpI value was decreased by about 45% in the 0.3 < ΔpI < 0.7 data set (from 0.5060 to 0.2782), and by roughly 25% for proteins in the ΔpI > 0.7 set (from 1.1148 to 0.8403). While the problem of prediction accuracy clearly still remains, these results may be a step in the right direction.

Average ΔpI Values Before and After

Table 4 shows the average ΔpI values for each data set before and after incorporation of dipeptide pKa values in the prediction algorithm.

Data Set                                     Original Avg. ΔpI   Modified Avg. ΔpI
ΔpI < 0.1                                    0.0455              0.0970
0.1 < ΔpI < 0.3                              0.18                0.17
0.3 < ΔpI < 0.7                              0.5060              0.2782
ΔpI > 0.7                                    1.1148              0.8403
Complete Set                                 0.3069              0.3340
Complete Set Using ΔpI < 0.1 Values          0.3069              0.3583
Complete Set Using 0.1 < ΔpI < 0.3 Values    0.3069              0.3793
Complete Set Using 0.3 < ΔpI < 0.7 Values    0.3069              0.3627
Complete Set Using ΔpI > 0.7 Values          0.3069              0.3810

Table 4. A comparison of average ΔpI values for each data set
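The relative changes implied by Table 4 can be checked with a short calculation. The sketch below uses only the averages listed in the table; the dictionary keys and the helper name `percent_change` are chosen for illustration, and a negative result means the average ΔpI went down (accuracy improved).

```python
# Average delta-pI (original, modified) as listed in Table 4.
table4 = {
    "dpI < 0.1": (0.0455, 0.0970),
    "0.1 < dpI < 0.3": (0.18, 0.17),
    "0.3 < dpI < 0.7": (0.5060, 0.2782),
    "dpI > 0.7": (1.1148, 0.8403),
    "Complete Set": (0.3069, 0.3340),
}


def percent_change(original: float, modified: float) -> float:
    """Relative change of the average delta-pI, in percent."""
    return 100.0 * (modified - original) / original


changes = {name: percent_change(o, m) for name, (o, m) in table4.items()}

for name, change in changes.items():
    print(f"{name:<18} {change:+.1f}%")
```

Read this way, the two high-discrepancy sets improve while the ΔpI < 0.1 set and the complete set get worse, which matches the trend visible in the figures.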

Discussion

Overall, it appears that our learning algorithm wasn't completely effective in improving on isoelectric point prediction in proteins. While one can only speculate as to exactly why the results appeared as they did, one idea was that the training data set was insufficient for the GA to produce reasonable results.

To test this theory, a final experiment was performed that is known as a "leave-one-out" approach. This approach addresses the training set problem by including all but one protein in the training data. For example, in a data set that contains 170 proteins, the first GA training run included proteins #2-170, while protein #1 was set aside as the testing data. After collecting pKa values from the GA run on the training data, those values were incorporated into the pI prediction algorithm to test their effects on prediction for the testing data, or protein #1. This information, including experimental pI value, predicted pI value, and predicted pI value from the modified algorithm, was then recorded into a table.

After recording the data, protein #1 was put back into the set of proteins and protein #2 was removed and set aside as the testing data. Again the GA was run and results were collected and recorded as they were in the first run. Next, protein #2 was re-introduced into the data set and protein #3 was removed and set aside, and this process was repeated over and over. When each of the 170 proteins in the data set had at one time been set aside as testing data, the experiment was complete. The next step was to compare average ΔpI values of the original and modified pI prediction algorithms.
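The leave-one-out procedure described above has a simple loop structure, sketched here in Python. This is an outline under stated assumptions, not the code used in the experiment: `train` stands in for a full GA run that returns suggested pKa values, `predict` stands in for the modified pI prediction algorithm, and both are replaced by toy functions so the example can run on its own.

```python
from statistics import mean


def leave_one_out(proteins, train, predict):
    """Hold out each protein once and record its delta-pI.

    proteins: list of (sequence, experimental_pi) pairs
    train:    function(training_set) -> model (here: GA-suggested pKa values)
    predict:  function(model, sequence) -> predicted pI
    """
    deltas = []
    for i, (sequence, experimental_pi) in enumerate(proteins):
        # Training data: every protein except the one set aside.
        training_set = proteins[:i] + proteins[i + 1:]
        model = train(training_set)
        predicted_pi = predict(model, sequence)
        deltas.append(abs(experimental_pi - predicted_pi))
    return deltas


# Toy stand-ins (assumptions): the "model" is just the mean training pI,
# and the "prediction" returns that mean regardless of sequence.
def toy_train(training_set):
    return mean(pi for _, pi in training_set)


def toy_predict(model, sequence):
    return model


toy_proteins = [("AAA", 5.0), ("BBB", 6.0), ("CCC", 7.0)]
deltas = leave_one_out(toy_proteins, toy_train, toy_predict)
average_delta = mean(deltas)
```

With 170 proteins this loop calls `train` 170 times, so the cost of the experiment is dominated by the GA runs rather than by the bookkeeping.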

Using the original prediction algorithm, the average ΔpI was 0.31. Using the leave-one-out approach to optimize the pI prediction showed a significant decrease in accuracy, ending up with an average of 0.47. While this wasn't the result that was hoped for, it is consistent with results from the previous experiment where we were unable to improve on overall prediction accuracies for the complete data set.

Vast possibilities exist for expanding on this work in an attempt to significantly improve our pI prediction algorithm. First, cutting down the list of dipeptides in the chromosome might make the genetic algorithm more efficient in its results. Throughout the GA runs, it became clear that chromosomes containing notably different pKa values could oftentimes result in very close fitness values. If those dipeptides were greatly impacting the prediction algorithm, we would expect to see consistent results. Instead, the inconsistencies might indicate that the charges on the side chains of these adjacent amino acids do not affect one another to a large extent, in which case trying to account for them may actually hurt prediction accuracy. The study to accomplish this might include a comparison of sequence characteristics between the positively and negatively affected proteins. By narrowing the search space in this manner, the chances of having a positive effect without the negative repercussions should increase.

Another problem area in this study, and a possibility for further research, might involve limiting how far the suggested pKa values are allowed to deviate from the default. As previously mentioned, some of the pKa values were more than doubled in the modified prediction algorithm. To illustrate this problem, we might consider any randomly chosen dipeptide, like a histidine-aspartic acid combination, for instance. Histidine has a default pKa of 5.98, but when it occurred next to aspartic acid in the ΔpI > [...] times the H-D dipeptide occurred in this data, which is the smallest set of the four. At the start of the GA runs, all pKa values are randomly generated, so if there were a low occurrence of H-D combinations in any data set, the fitness value wouldn't be affected as much as it is by highly occurring dipeptides. In turn, this means that outrageous pKa values might not end up being replaced and could survive in the fittest population.
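How often a dipeptide such as H-D occurs in a data set can be counted directly from the sequences, which gives a rough idea of how much weight its pKa value can carry in the fitness calculation. The following sketch is illustrative only: it assumes sequences are one-letter amino acid strings and that a dipeptide is an ordered pair of adjacent residues, and the function names and the example sequences are hypothetical.

```python
from collections import Counter


def dipeptide_counts(sequences):
    """Count every ordered pair of adjacent residues across all sequences."""
    counts = Counter()
    for seq in sequences:
        for first, second in zip(seq, seq[1:]):
            counts[first + second] += 1
    return counts


def rare_dipeptides(counts, pairs, min_count):
    """Return the pairs that occur fewer than min_count times.

    Genes for such pairs barely influence fitness, so their pKa values
    are weakly constrained by the data.
    """
    return [pair for pair in pairs if counts[pair] < min_count]


# Made-up sequences, used only to exercise the functions.
example_sequences = ["MHDHH", "HHK"]
counts = dipeptide_counts(example_sequences)
weakly_constrained = rare_dipeptides(counts, ["HH", "HD", "DE"], min_count=2)
```

A count like this could also support the earlier suggestion of shortening the chromosome by dropping dipeptides that the data cannot meaningfully constrain.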

This point could be used to explain the results when running on the complete data set. Again, the pKa value for H-D was suggested to be 12. For this example it is important to keep in mind that the GA run using the complete data set was initially seeded with the top chromosomes from each of the four previous runs. This means that the top chromosome from the ΔpI > 0.7 run was used, immediately introducing a pKa value of 12 into the population. Even in the instance that the chromosome wasn't in the top 5% fitness level and didn't survive to the next generation, it's likely that the value of 12 for H-D stayed intact through a series of mating and crossover events. If at some point the chromosome containing that pKa value was in the top 5% of fitness levels, it was automatically moved to the next generation, saving that value for the histidine-aspartic acid pair.
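The "top 5% moves on automatically" rule is a form of elitism, and it can be sketched as below. This is a hedged illustration, not the GA used in this work: it assumes that a lower fitness score is better (for instance an average ΔpI), and the function name `select_elites` and the toy fitness function are invented for the example.

```python
def select_elites(population, fitness, percent=5):
    """Copy the best `percent` of chromosomes unchanged.

    Assumes lower fitness is better. At least one chromosome is
    always kept so the best solution found so far is never lost.
    """
    n_elite = max(1, len(population) * percent // 100)
    ranked = sorted(population, key=fitness)
    return [list(chromosome) for chromosome in ranked[:n_elite]]


# Toy population of one-gene chromosomes; the toy fitness rewards
# genes close to 12, echoing the H-D value discussed in the text.
toy_population = [[float(x)] for x in range(40)]


def toy_fitness(chromosome):
    return abs(chromosome[0] - 12.0)


elite = select_elites(toy_population, toy_fitness)
```

Elitism preserves whole chromosomes, so a gene that has little effect on fitness is carried along whenever the rest of its chromosome scores well.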

Over time, the fittest chromosomes end up being reproduced more readily, which in this example would mean that the value of 12 for the H-D pair dominates the population even though it might not have a significant effect on prediction accuracy. Eventually, this value is incorporated into the algorithm, and could end up having a negative impact on the prediction.

Therefore, limiting how far the pKa values can deviate would decrease the negative impact in cases such as this, so that these values do not overshadow the suggested pKa values that really are having a positive effect.
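One simple way to limit the deviation is to clamp each suggested pKa to a window around its default. The sketch below is an assumption-laden illustration: the function name `clamp_pka` and the window of 2.0 pKa units are hypothetical choices, while the default values of 5.98 for histidine and 4.05 for aspartic acid are the ones quoted in the text.

```python
def clamp_pka(suggested: float, default: float, max_shift: float) -> float:
    """Restrict a suggested pKa to [default - max_shift, default + max_shift]."""
    lower = default - max_shift
    upper = default + max_shift
    return min(max(suggested, lower), upper)


HIS_DEFAULT = 5.98   # default histidine pKa quoted in the text
ASP_DEFAULT = 4.05   # default aspartic acid pKa quoted in the text
MAX_SHIFT = 2.0      # hypothetical limit, chosen only for this example

# A suggested value of 12 for histidine is pulled back to the window edge.
his_clamped = clamp_pka(12.0, HIS_DEFAULT, MAX_SHIFT)
# A suggested value of 13 for aspartic acid is treated the same way.
asp_clamped = clamp_pka(13.0, ASP_DEFAULT, MAX_SHIFT)
# Values already inside the window are left untouched.
his_unchanged = clamp_pka(6.0, HIS_DEFAULT, MAX_SHIFT)
```

The same bound could instead be applied when genes are randomly generated or mutated, which would keep out-of-range values from entering the population at all.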

A third approach might be to expand on the chargeable groups and introduce uncharged amino acids into the equation. Previous research has shown that N-terminal asparagine had a significant impact on the predicted pI value [5]. Although the means by which this occurs remain unclear, one possibility may be that the hydrophobic, uncharged amino acids interfere with charged, adjacent side chains when in contact with water. Although possibilities are extensive for this type of research, one method of attacking this problem might be to consider events where a hydrophobic amino acid rests between charged side chains. Similar to the research presented here, an evolutionary programming approach could be used in an attempt to find pKa values that remedy the problem.
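Finding events where a hydrophobic amino acid rests between charged side chains amounts to scanning each sequence for a three-residue pattern. The sketch below shows one way this might be done; the residue groupings are assumptions made for the example (D, E, H, K, R as charged and A, V, L, I, M, F, W as hydrophobic), not classifications taken from this work, and the function name is hypothetical.

```python
# Assumed residue classes, one-letter codes (illustrative only).
CHARGED = set("DEHKR")
HYDROPHOBIC = set("AVLIMFW")


def flanked_hydrophobics(sequence):
    """Find hydrophobic residues with a charged residue on both sides.

    Returns a list of (index, tripeptide) pairs, where index is the
    zero-based position of the hydrophobic residue.
    """
    hits = []
    for i in range(1, len(sequence) - 1):
        if (
            sequence[i] in HYDROPHOBIC
            and sequence[i - 1] in CHARGED
            and sequence[i + 1] in CHARGED
        ):
            hits.append((i, sequence[i - 1:i + 2]))
    return hits


# Made-up sequence used only to exercise the function.
hits = flanked_hydrophobics("MKVDAEH")
```

Each distinct tripeptide found this way could become a gene in a chromosome, in the same way dipeptide pKa values were encoded here.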

Conclusion

This thesis work has investigated the possibility of improving isoelectric point prediction by using evolutionary programming to account for charge-charge interactions within the sequence. While an increase in accuracy was seen on a small scale, it was not substantial enough and was overshadowed by decreases in accuracy in other areas. For that reason we cannot say our work has resulted in a better algorithm. However, isoelectric point prediction is a difficult problem that still has much room for investigation. While the results failed to yield evidence of an overall accuracy increase, the information presented here puts us one small step closer to a successful pI prediction and provides a genetic algorithm that may prove useful in future studies.

Bibliography

[1] Hamdan, H. and Righetti, P.G. (2005) "Proteomics Today: Protein Assessment and Biomarkers Using Mass Spectrometry, 2D Electrophoresis, and Microarray Technology". Hoboken, NJ, John Wiley and Sons, Inc.

[2] Fichmann, J. and Westermeier, R. (1999) "2-D Protein Gel Electrophoresis: An Overview." Methods in Molecular Biology: Vol. 112 (1-9)

[3] Conte, M. (2005) "Isoelectric Point Prediction From the Amino Acid Sequence of a Protein" submitted as part of a Master's Thesis Project at RIT in 2004

[4] Mitchell, M. (1998) "An Introduction to Genetic Algorithms" The MIT Press, 1999

[5] Cargile, B.J., Talley, D.L., Stephenson, J.L. (2004) "Immobilized pH gradients as a first dimension in shotgun proteomics and analysis of the accuracy of pI predictability of peptides". Electrophoresis 25: 936-945

[6] Horstmann, C.S. (2001) Big Java: Programming and Practice. Wiley, 1st Edition

[7] "SWISS-2DPAGE Two-dimensional polyacrylamide gel electrophoresis database" Found at http://us.expasy.org/ch2d/

[8] Bjellqvist, B., Hughes, G., Pasquali, C., Paquet, N., Ravier, F., Sanchez, J.-C., et al. (1993) "The focusing positions of polypeptides in immobilized pH gradients can be predicted from their amino acid sequences". Electrophoresis 14: 1023-1031.

[9] Tonella L., Hoogland C., Binz P.-A., Appel R.D., Hochstrasser D.F., Sanchez J.-C. "New perspectives in the Escherichia coli proteome investigation". Proteomics 1: 409-423 (2001).

[10] "Compute pI/Mw for Swiss-Prot/TrEMBL entries or a user-entered sequence". Found at http://us.expasy.org/tools/pi_tool.html

[11] Sillero A., Ribeiro, J.M. (1989) "Isoelectric points of proteins: theoretical determination". Analytical Biochemistry 179: 319-325

[12] Righetti, P.G., Caravaggio, T. (1976) "Isoelectric points and molecular weights of proteins". Journal of Chromatography

[13] Cargile, B.J., et al. (2004) "Gel Based Isoelectric Focusing of Peptides and the Utility of Isoelectric Point in Protein Identification." Journal of Proteome Research 3.1 (2004): 112-9.

[14] "Get protein list for a reference map." Found at http://www.expasy.org/cgi-bin/get-ch2d-table.pl

[15] "NCBI Batch Entrez search". Found at http://www.ncbi.nlm.nih.gov/entrez/batchentrez.cgi?db=Protein

Appendix A - Escherichia coli Data Set

Color Codes (by |Actual-Pred| range): ΔpI < 0.1; 0.1 < ΔpI < 0.3; 0.3 < ΔpI < 0.7; ΔpI > 0.7

Protein   Actual pI   Predicted      |Actual-Pred|
P0AE08    5.05        5.050048828    4.88E-05
P05055    5.13        5.129943848    5.62E-05
P45578    5.2         5.200439453    4.39E-04
P0AEZ9    5.74        5.7421875      0.0021875
P37689    5.15        5.152587891    0.002587891
P0ABB0    5.81        5.806274414    0.003725586
P0A6L0    5.52        5.514892578    0.005107422
P13029    5.16        5.16583252     0.00583252
P61714    5.19        5.183349609    0.006650391
P23869    5.51        5.502929688    0.007070312
P09030    5.8         5.807983398    0.007983398
P0AEZ3    5.28        5.26965332     0.01034668
P0A7A9    5.06        5.049194336    0.010805664
P0A6P1    5.22        5.234619141    0.014619141
P0AFU8    5.67        5.655029297    0.014970703
P0ABB4    4.95        4.932983398    0.017016602
P0A817    5.1         5.121826172    0.021826172
P0AB71    5.56        5.537109375    0.022890625
P36683    5.24        5.263671875    0.023671875
P0ACU7    5           4.973144531    0.026855469
P0A6E4    5.28        5.252563477    0.027436523
P0A6F5    4.91        4.879150391    0.030849609
P00509    5.53        5.561035156    0.031035156
P39172    5.58        5.611450195    0.031450195
P16703    5.47        5.437988281    0.032011719
P0AE67    4.95        4.915039063    0.034960938
P0ADU2    5.77        5.807128906    0.037128906
P0A9C3    4.9         4.860778809    0.039221191
P0A877    5.38        5.338867188    0.041132812
P0C054    5.63        5.588378906    0.041621094
P62707    5.82        5.861816406    0.041816406
P0A799    5.15        5.107299805    0.042700195
P0AAI9    5.03        4.98425293     0.04574707
P0A6M8    5.21        5.256408691    0.046408691
P26646    5.6         5.648193359    0.048193359
P07004    5.39        5.438842773    0.048842773
P0A6D3    5.34        5.389282227    0.049282227
P0A7Z4    5.04        4.988952637    0.051047363
P0A870    5.08        5.132080078    0.052080078
P63284    5.44        5.383728027    0.056271973
P0A796    5.43        5.487548828    0.057548828
P08312    5.74        5.799438477    0.059438477
P0AGE0    5.41        5.472167969    0.062167969
P0A6G7    5.6         5.537109375    0.062890625
P0A6F9    5.23        5.166259766    0.063740234
P0A7F3    7.01        6.941894531    0.068105469
P0AE18    5.71        5.638793945    0.071206055
P24216    4.96        4.888549805    0.071450195
P09832    5.48        5.551635742    0.071635742
P23721    5.47        5.391845703    0.078154297
P0A850    4.93        4.849884033    0.080115967
P0AB55    5.29        5.208984375    0.081015625
P76149    5.38        5.46105957     0.08105957
P0AG67    4.99        4.908416748    0.081583252
P16659    5.06        5.146606445    0.086606445
P0AA25    4.8         4.711669922    0.088330078
P0A9A9    5.78        5.688354492    0.091645508
P08142    5.22        5.314086914    0.094086914
P18843    5.34        5.434570313    0.094570313
P0AC55    5.78        5.875488281    0.095488281
P05194    5.31        5.213256836    0.096743164
P0A6Y8    4.96        4.863128662    0.096871338
P0A9M5    5.44        5.538818359    0.098818359
P0A6D7    5.18        5.280761719    0.100761719
P0AED0    5.2         5.097900391    0.102099609
P0A9D2    5.76        5.863525391    0.103525391
P06960    5.73        5.625976563    0.104023438
P0A8G6    5.51        5.615722656    0.105722656
P0AG78    6.49        6.596679688    0.106679687
P0AEQ3    7.32        7.435791016    0.115791016
P60595    5.24        5.359375       0.119375
P0ABU2    5.02        4.900085449    0.119914551
P0A7L0    8.24        8.115966797    0.124033203
P0ABD8    4.78        4.654846191    0.125153809
P0AF03    4.84        4.965454102    0.125454102
P04949    4.7         4.573669434    0.126330566
P68066    4.98        5.106445313    0.126445312
P0A6E6    5.34        5.46875        0.12875
P0A6A3    5.72        5.8515625      0.1315625
P75797    7.32        7.186279297    0.133720703
P29744    4.82        4.683044434    0.136955566
P09029    5.75        5.612304688    0.137695313
P61889    5.49        5.629394531    0.139394531
P00547    5.33        5.472167969    0.142167969
P0A7E1    7.28        7.13671875     0.14328125
P67910    4.98        4.835571289    0.144428711
P0AGD3    5.45        5.595214844    0.145214844
P0A9A6    4.83        4.680480957    0.149519043
P00946    5.16        5.31237793     0.15237793
P12758    5.66        5.82421875     0.16421875
P0A955    5.43        5.595214844    0.165214844
P0A8M0    5.01        5.195739746    0.185739746
P69783    4.95        4.762939453    0.187060547
P0AFC7    5.42        5.612304688    0.192304688
P0A9Q9    5.2         5.393554688    0.193554687
P25553    5.29        5.095336914    0.194663086
P0AEX9    5.23        5.435424805    0.205424805
P28635    4.95        5.156860352    0.206860352
P0A9G6    4.98        5.189758301    0.209758301
P0A6W5    4.95        4.73815918     0.21184082
P39177    6.25        6.037841797    0.212158203
P0A862    5.02        4.800537109    0.219462891
P0A715    6.1         6.323242188    0.223242188
P0AC69    4.96        4.727050781    0.232949219
P0A7N1    8.3         8.065551758    0.234448242
P0ABU5    4.92        4.685180664    0.234819336
P0A7K2    4.87        4.633056641    0.236943359
P0AES9    4.84        5.0859375      0.2459375
P0AD96    5.31        5.561889648    0.251889648
P38489    5.55        5.812255859    0.262255859
P0AEK4    5.33        5.595214844    0.265214844
P0A6N1    5.58        5.314086914    0.265913086
P0A855    7.05        6.78125        0.26875
P46850    5.65        5.928466797    0.278466797
P0A9C5    5           5.282470703    0.282470703
P35340    5.2         5.485839844    0.285839844
P0A9M2    5.38        5.080810547    0.299189453
P04036    5.11        5.46105957     0.35105957
P37902    7.87        7.516113281    0.353886719
P0A940    5.33        4.967163086    0.362836914
P0A763    5.19        5.557617188    0.367617187
P0A8X2    5.2         5.591796875    0.391796875
P63020    4.96        4.568115234    0.391884766
P76290    5.8         5.373046875    0.426953125
P04816    5.08        5.516601563    0.436601562
P0A8Q6    5.4         4.959472656    0.440527344
P0AG82    6.85        7.293945313    0.443945313
P0ABT2    5.27        5.718261719    0.448261719
P09551    5.17        5.622558594    0.452558594
P08200    4.7         5.17565918     0.47565918
P0A8P3    5.46        5.937011719    0.477011719
P23847    5.71        6.196777344    0.486777344
P37329    6.7         7.187988281    0.487988281
P0AEU0    4.99        5.489257813    0.499257812
P0AEE5    5.19        5.697753906    0.507753906
P0AFK9    4.76        5.270507813    0.510507813
P0ADG7    5.49        6.017333984    0.527333984
P16700    6.58        7.128173828    0.548173828
P69441    5           5.567871094    0.567871094
P0AFZ3    5.01        4.428833008    0.581166992
P23843    5.47        6.052368164    0.582368164
P69797    5.17        5.755859375    0.585859375
P0AGE9    5.73        6.321533203    0.591533203
P0A6P9    4.74        5.344848633    0.604848633
P18335    5.19        5.797729492    0.607729492
P0A858    5.01        5.649047852    0.639047852
P31663    5.25        5.926757813    0.676757813
P0C0V0    8.01        7.329833984    0.680166016
P0A879    5.03        5.717407227    0.687407227
P0ADG4    5.71        6.453125       0.743125
P30859    5.07        5.813964844    0.743964844
P00894    8.11        7.352050781    0.757949219
P0AET2    5           5.762695313    0.762695313
P0A910    5.23        5.996826172    0.766826172
P61316    5.52        6.306152344    0.786152344
P0ABK5    4.95        5.843017578    0.903017578
P0ADE8    6.11        5.179931641    0.930068359
P0ADA3    8.84        7.902770996    0.937229004
P0AFL3    8.52        7.567382813    0.952617187
P76002    9.2         8.246704102    0.953295898
P0AFG0    5.4         6.37109375     0.97109375
P0AD59    5.33        6.306152344    0.976152344
P0AEM9    5.19        6.230957031    1.040957031
P0A7R1    5.1         6.186523438    1.086523438
P77348    8.55        7.314453125    1.235546875
P0A9B2    5.32        6.583007813    1.263007812
P33136    8.04        6.608642578    1.431357422
P00811    9.06        7.55456543     1.50543457
P0ADV7    10.3        7.978393555    2.321606445
P68919    10.6        8.2578125      2.3421875
