An Evolutionary Computation Approach to Optimization of Isoelectric Point Prediction in Proteins


Rochester Institute of Technology

RIT Scholar Works

Theses

2006

Chris Parkin

Follow this and additional works at: https://scholarworks.rit.edu/theses

Recommended Citation

Parkin, Chris, "An Evolutionary Computation Approach to Optimization of Isoelectric Point Prediction in Proteins" (2006). Thesis. Rochester Institute of Technology. Accessed from

This Thesis is brought to you for free and open access by RIT Scholar Works. It has been accepted for inclusion in Theses by an authorized administrator of RIT Scholar Works. For more information, please contact


To:

Head, Department of Biological Sciences

Rochester Institute of Technology

Department of Biological Sciences

Bioinformatics Program

The undersigned state that Christopher Parkin, a candidate for the Master of Science degree in Bioinformatics, has submitted his/her thesis and has satisfactorily defended it.

This completes the requirements for the Master of Science degree in Bioinformatics at Rochester Institute of Technology.

Thesis committee members:

Name

Paul Craig

(Committee Chair)

Paul Craig

(Thesis Advisor)

Illegible Signature

Illegible Signature

Illegible Signature

Gary R. Skuse, Ph.D. Director of Bioinformatics

Date

475-2532 (voice) [email protected]

Thesis/Dissertation Author Permission Statement

Title of thesis or dissertation: An Evolutionary Computation Approach to Optimization of Isoelectric Point Prediction in Proteins

Name of author: Christopher Parkin

Degree: Master of Science in Bioinformatics

College: Science

I understand that I must submit a print copy of my thesis or dissertation to the RIT Archives, per current RIT guidelines for the completion of my degree. I hereby grant to the Rochester Institute of Technology and its agents the non-exclusive license to archive and make accessible my thesis or dissertation in whole or in part in all forms of media in perpetuity. I retain all other ownership rights to the copyright of the thesis or dissertation. I also retain the right to use in future works (such as articles or books) all or part of this thesis or dissertation.

Print Reproduction Permission Granted:

I, Christopher Parkin, hereby grant permission to the Rochester Institute of Technology to reproduce my print thesis or dissertation in whole or in part. Any reproduction will not be for commercial use or profit.

Signature of Author: Christopher Parkin    Date:

Print Reproduction Permission Denied:

I, ____________, hereby deny permission to the RIT Library of the Rochester Institute of Technology to reproduce my print thesis or dissertation in whole or in part.

Signature of Author: ____________    Date: ____________

Inclusion in the RIT Digital Media Library Electronic Thesis & Dissertation (ETD) Archive

I, Christopher Parkin, additionally grant to the Rochester Institute of Technology Digital Media Library (RIT DML) the non-exclusive license to archive and provide electronic access to my thesis or dissertation in whole or in part in all forms of media in perpetuity.

I understand that my work, in addition to its bibliographic record and abstract, will be available to the world-wide community of scholars and researchers through the RIT DML. I retain all other ownership rights to the copyright of the thesis or dissertation. I also retain the right to use in future works (such as articles or books) all or part of this thesis or dissertation. I am aware that the Rochester Institute of Technology does not require registration of copyright for ETDs.

I hereby certify that, if appropriate, I have obtained and attached written permission statements from the owners of each third party copyrighted matter to be included in my thesis or dissertation. I certify that the version I submitted is the same as that approved by my committee.

An Evolutionary Computation Approach to Optimization of Isoelectric Point Prediction in Proteins

Submitted by

Chris Parkin

Department of Biological Sciences

In partial fulfillment of the requirements for the Master of Science degree in Bioinformatics at Rochester Institute of Technology

Abstract

An Evolutionary Computation Approach to Optimization of Isoelectric Point Prediction in Proteins

by Christopher Parkin

Master of Science in Bioinformatics

Rochester Institute of Technology

Professor Paul Craig, Chair

Computational biology has attacked the problem of isoelectric point prediction with little success, achieving a rough accuracy level of only 30%. In 2005, Matthew Conte performed a study focused on the relationship between sequence characteristics and isoelectric point prediction accuracy. Results indicated that charges between adjacent amino acids could have a significant impact on the overall predicted pI for the protein. In this study we introduce an evolutionary computation approach aimed at accounting for these problem dipeptides. For each possible dipeptide involving charged amino acids (7 chargeable groups -> 49 possibilities), the algorithm predicts a pKa value that, when included in the pI prediction algorithm, should result in a more accurate prediction. By accounting for these charged, adjacent amino acids, the pI prediction showed improvements for those proteins with the greatest deviation between experimental and predicted pI value (ΔpI>0.7). However, these results were not generalized, as the incorporation of these values had the reverse effect on remaining proteins, most notably those from the most accurate data set (ΔpI<0.1). While this research lays a foundation for improving the pI prediction algorithm, additional exploration remains necessary for an overall accuracy

Contents

1 Introduction
2 Methods
  2.1 ExPASy 2DPAGE Database
  2.2 Trimming the Data Set
  2.3 Training & Testing
  2.4 The Genetic Algorithm
    2.4.1 The Chromosome
    2.4.2 Fitness
    2.4.3 Tournament Selection
    2.4.4 Crossover
    2.4.5 Mutation
    2.4.6 Java Classes
3 Results
  3.1 An Example GA Run
  3.2 Suggested pKa Values
  3.3 Effects on pI Prediction
    3.3.1 Using ΔpI<0.1 Data
    3.3.2 Using 0.1<ΔpI<0.3 Data
    3.3.3 Using 0.3<ΔpI<0.7 Data
    3.3.4 Using ΔpI>0.7 Data
    3.3.4 Using Complete Data Set
    3.3.5 Overall Effect Using ΔpI<0.1
    3.3.6 Overall Effect Using 0.1<ΔpI<0.3
    3.3.7 Overall Effect Using 0.3<ΔpI<0.7
    3.3.8 Overall Effect Using ΔpI>0.7
    3.3.9 Average ΔpI Values
4 Discussion
5 Conclusion

Introduction

Two-dimensional gel electrophoresis (2DE) first emerged in 1975 when Dr. Patrick O'Farrell displayed the ability to separate 1,100 polypeptides from Escherichia coli [1]. With the theory and technique being slightly ahead of its time, it was initially practiced by only a handful of scientists around the world. Since then, the emergence of new analytical tools, combined with numerous large-scale, public information databases, has shed a whole new light on this once dormant technique [2]. Today, 2DE remains a leading technique for separation and identification of proteins.

Isoelectric focusing (IEF) is the main focus of this study and makes up the first dimension of 2DE. IEF is a method in which amphoteric molecules are separated in a polyacrylamide gel according to their isoelectric point values [2]. When placed in a pH gradient, a protein will migrate to the position where its net charge is equal to zero. The pH at this position is known as the isoelectric point (pI) value. Isoelectric point is determined by charged groups in the protein, and is often between 3 and 12, with most falling between 4 and 7 [11,12].

Traditional techniques used to form pH gradients involved mixing ampholytes that had been chemically engineered to a certain pKa value [13]. While this method worked efficiently, the pH gradient was extremely difficult to reproduce. Since then, immobilized pH gradients (IPGs) have been introduced. In an IPG, the ampholytes are bound in acrylamide gel, forming a fixed pH gradient and ensuring reproducibility [8,13].

The second dimension of 2DE is a separation by molecular mass. A charge is applied to a buffer that surrounds the gel, attracting the molecules to the opposite end and causing them to migrate. The larger of these molecules travel the slowest and will remain near the top of the gel, while the smaller molecules will travel further and be seen toward the bottom of the gel. After staining, the end result of 2DE is a grid of spots with each spot referring to the location of a protein molecule in the gel (Figure 1). The X value in the grid corresponds to the pI value of that protein, while the Y value corresponds to the distance migrated in the gel.

The application of this technique has proven to be a powerful tool and has provided researchers with a great amount of data [5]. However, the difficulty and time requirements associated with performing and interpreting 2DE correctly have led to the emergence of computational approaches to 2DE [5]. While the benefits associated with simulations are often quite attractive, the limitations placed upon the pI prediction portion of the 2DE simulation have proved to be the Achilles heel of the entire simulation.

Figure 1. Sample output from 2-Dimensional Electrophoresis. Obtained from Swiss 2DPAGE database, protein ID# P16700 [7,10]

The isoelectric point prediction algorithm to be optimized in this study is part of a 2DE simulator that was originally developed at the Rochester Institute of Technology as part of an honor's thesis project [3]. This algorithm was implemented to calculate charge, based on side chains of amino acids found in the sequence. The charge on each side chain is a function of the pH and of the pKa value of that side chain. The default pKa values used to calculate the charge on the amino acid side chains are shown in Table 1.

Amino Acid           Default pKa Value
R - Arginine         12
D - Aspartic Acid    4.05
E - Glutamic Acid    4.45
H - Histidine        5.98
K - Lysine           10
C - Cysteine         9
Y - Tyrosine         10

Table 1. Default pKa values used in original pI prediction algorithm

Using the values from Table 1 and starting with a pH of 7, the algorithm looks at each individual amino acid in the sequence and computes its charge. Each individual amino acid charge is then added to a running total of charge, resulting in total charge for the protein. If the total charge is greater than 0.005 or less than -0.005, the pH value used in the calculation is adjusted and the charge calculation is repeated. This cycle continues until the total charge is reported to be between -0.005 and 0.005, practically zero. Finally, the pH value resulting in a net charge of zero on the protein is returned, and considered the pI value for that protein. While the pKa values are heavily relied on in this calculation, variables such as post-translational modifications and charge-charge interactions are left unaccounted for, significantly affecting prediction accuracy [3].
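The iterative calculation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the simulator's actual code: the charge of each side chain is computed with the standard Henderson-Hasselbalch relation, and the pH is adjusted by bisection between 0 and 14, since the text does not say how the original algorithm adjusts the pH. The function names are hypothetical. Only side chains are counted, as in the description above.

```python
# Default side-chain pKa values taken from Table 1.
DEFAULT_PKA = {"R": 12.0, "D": 4.05, "E": 4.45, "H": 5.98,
               "K": 10.0, "C": 9.0, "Y": 10.0}
# R, H and K carry +1 when protonated; D, E, C and Y carry -1 when deprotonated.
BASIC = {"R", "H", "K"}


def side_chain_charge(residue, ph, pka):
    """Fractional charge of one side chain (Henderson-Hasselbalch; an assumption)."""
    if residue in BASIC:
        return 1.0 / (1.0 + 10 ** (ph - pka))
    return -1.0 / (1.0 + 10 ** (pka - ph))


def net_charge(sequence, ph):
    """Running total of side-chain charges over the whole sequence."""
    return sum(side_chain_charge(aa, ph, DEFAULT_PKA[aa])
               for aa in sequence if aa in DEFAULT_PKA)


def predict_pi(sequence, tolerance=0.005):
    """Start at pH 7 and adjust the pH until |net charge| <= tolerance."""
    low, high, ph = 0.0, 14.0, 7.0
    for _ in range(200):  # bisection converges long before this limit
        charge = net_charge(sequence, ph)
        if abs(charge) <= tolerance:
            break
        if charge > 0:
            low = ph   # protein still positive: the pI lies at a higher pH
        else:
            high = ph  # protein still negative: the pI lies at a lower pH
        ph = (low + high) / 2.0
    return ph
```

Bisection works here because the net charge can only decrease as the pH rises, so the zero crossing is unique.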

In 2005, Matthew Conte performed sequential analyses on numerous proteins from the E. coli proteome, obtained from the Swiss 2DPAGE database [3,7]. In doing so, he uncovered a correlation between the occurrence of charged dipeptides in the sequence and the level of discrepancy between experimental and predicted pI, known in his study as well as this one as ΔpI [3]. His results showed that the higher the number of charge-charge dipeptides in the sequence, the greater the deviation between actual and predicted pI value for that protein [3].

The pI calculation is based on the pKa values for the amino acid side chains. Based on his results, our hope was to derive new pKa values using a genetic algorithm. As in Conte's work, Escherichia coli was the proteome of choice. E. coli is thought to have a relatively low number of post-translational modifications such as methylation and phosphorylation, and it is one of the most widely studied bacteria in science, making it an ideal subject [3]. Furthermore, experimental isoelectric point data from only one group was used, assuring consistency in lab practices and data submission [3,9].

Now a cornerstone in biology, evolution and the underlying theory of natural selection are accredited to Charles Darwin after his research in the mid 19th century [4]. His theory of natural selection proposed that individuals best adapted to their surrounding environment are more likely to survive and mate. Over time, those with the less-favorable traits die out, while favorable traits are passed on, eventually introducing adaptations into the population.

Evolutionary computation models like the genetic algorithm (GA) used in this study loosely follow Darwin's theories. In this case, each individual in the population is a set of pKa values used to accommodate the charge-charge amino acid pairs that normally hurt the accuracy of pI prediction. In each generation, the most well adapted individuals are those that lead to the most accurate pI prediction, and are known as the fittest of the population.

According to evolution, the fittest individuals are those most likely to survive and mate, so the fittest from each generation automatically survive into the next. Over time, simulated processes of mutation, crossover and recombination are applied to each generation, resulting in a population of the best possible individuals. Further details regarding the workings of the GA are given in the Methods section.

Methods

ExPASy 2DPAGE Database

The ExPASy server's Swiss 2DPAGE database (http://ca.expasy.org/ch2d/) contains vast 2DE gel information for human, mouse, Arabidopsis thaliana, Dictyostelium discoideum, Escherichia coli, Saccharomyces cerevisiae, and Staphylococcus aureus (N315) [7]. For each protein in the database, information regarding experimental pI value, molecular weight, experimental methods, references, and a photo of the actual gel run in the experiment is available [7].

Because many groups have contributed to this database, it is not uncommon to find multiple submissions for any one protein. For that reason and to keep experimental practices consistent, only those entries from Tonella were used in this study [9]. For ease of use, the Swiss 2DPAGE allows for the information to be downloaded into a tab delimited text file to be imported to a spreadsheet [14]. The fields available in this file include gene name, description, Swiss 2DPAGE accession number, spot ID, experimental pI, experimental molecular weight, mapping methods, comment topics and a reference to the group carrying out the experiments.

Trimming the Data Set

After obtaining the initial Tonella data set containing roughly 340 proteins, it was not uncommon to see up to eight entries for any one protein. Again, duplications are a result of post-translational modifications that cause a change in pI/MW on the protein, leading to a unique spot on the gel. Because most of these duplicate pI values were quite similar (often within 0.01 of one another), an average pI value was taken to represent the protein in the data set, and the remaining duplicates removed. In the event that drastically different pI values were recorded, only the first entry was saved, and that protein was omitted from training the genetic algorithm later in the study.

170 proteins remained after all duplicates were removed from the data set, which were then broken into four groups based on the difference between experimental and predicted pI value, known as ΔpI<0.1, 0.1<ΔpI<0.3, 0.3<ΔpI<0.7, and ΔpI>0.7 (Appendix A). The greatest concern for this study were proteins found in the ΔpI>0.7 and 0.3<ΔpI<0.7 data sets, with expectations that improving those predictions would greatly improve the overall accuracy level for the algorithm. For a complete listing of the proteins used after trimming, see Appendix A.
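The trimming and grouping steps can be illustrated with a short Python sketch. Several details are assumptions made only for the example: the text gives no numeric cutoff for "drastically different" duplicate values, so the max_spread parameter is hypothetical, and so is the handling of values that fall exactly on a group boundary. The accession names used below are placeholders, not entries from the data set.

```python
from statistics import mean


def merge_duplicates(entries, max_spread=0.05):
    """Collapse duplicate entries per protein.

    entries: list of (accession, experimental_pi) in file order.
    Returns {accession: (pi, usable_for_training)}.
    max_spread is a hypothetical cutoff; the thesis states no exact value.
    """
    grouped = {}
    for accession, pi in entries:
        grouped.setdefault(accession, []).append(pi)
    merged = {}
    for accession, values in grouped.items():
        if max(values) - min(values) <= max_spread:
            merged[accession] = (mean(values), True)    # similar: average them
        else:
            merged[accession] = (values[0], False)      # keep first, skip in training
    return merged


def delta_pi_group(experimental_pi, predicted_pi):
    """Assign a protein to one of the four ΔpI groups."""
    delta = abs(experimental_pi - predicted_pi)
    if delta < 0.1:
        return "ΔpI<0.1"
    if delta < 0.3:
        return "0.1<ΔpI<0.3"
    if delta < 0.7:
        return "0.3<ΔpI<0.7"
    return "ΔpI>0.7"


merged = merge_duplicates([("X1", 5.00), ("X1", 5.01), ("X2", 4.0), ("X2", 6.0)])
```

In this usage example the first placeholder protein is averaged and kept for training, while the second keeps only its first value and is flagged as unusable for training.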

Sequence Gathering

All Swiss 2DPAGE protein entries are cross-linked with the Swiss-Prot database, making it possible to acquire FASTA formatted sequence through the NCBI Batch Entrez search [15]. To use this tool, a simple list of the 170 protein accession numbers was uploaded to the NCBI, which returned all 170 proteins in FASTA format. To easily associate the experimental pI value with the protein sequence, each experimental value was manually entered into the second line of the respective FASTA file of that protein. This resulted in one large FASTA formatted file containing all 170 proteins, complete with accession numbers, experimental pI value and sequence. A Perl script then parsed this file and saved each protein sequence separately, basing the file name on the protein's accession number. Finally, another short program was written to read in all 170 protein sequences and sort them according to the difference between experimental and predicted pI value. All data files used in this research have been saved into a compressed folder and can be obtained at http://www.rit.edu/~cdp3511/thesis/
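As an illustration of the annotated file layout described above (a header line, then the experimental pI on the second line, then the sequence), a parser could look like the following Python sketch. It is not the Perl script used in the study. It assumes that the accession number is the first whitespace-delimited token after ">", which may not match the real NCBI headers, and the records in the usage example are invented placeholders.

```python
def parse_annotated_fasta(text):
    """Parse FASTA-like text whose line after each header holds the experimental pI.

    Returns {accession: {"pi": float, "sequence": str}}.
    Assumption: the accession is the first token of the header line.
    """
    records = {}
    accession = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(">"):
            accession = line[1:].split()[0]
            records[accession] = {"pi": None, "sequence": ""}
        elif accession is not None and records[accession]["pi"] is None:
            records[accession]["pi"] = float(line)   # second line: experimental pI
        elif accession is not None:
            records[accession]["sequence"] += line   # remaining lines: sequence
    return records


example = ">X1 placeholder protein\n5.12\nMKR\nDE\n>X2\n6.40\nHHK\n"
parsed = parse_annotated_fasta(example)
```

Each record could then be written to its own file named after the accession number, as the text describes.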

Training & Testing Data

After this organization was complete, each data set was run through the algorithm in the following manner. First, a folder was made to contain "training data," which contained four proteins chosen from the data set. Each of these proteins was deemed acceptable (only one Swiss 2DPAGE submission per protein) and was automatically read from the directory by the GA, which requires directory name as input. The algorithm was then set to run starting with a randomly generated population. After 80 generations on average, the ideal set of pKa values that resulted in the best overall pI prediction for that data set was displayed.

Next, a similar run was carried out on four new proteins, known as "testing" data. This time, the initial population was seeded with a chromosome representing the fittest values from the previous run. Theoretically, if the pKa values found in the previous run lead to an accuracy increase for the training data, they could be expected to make a positive impact on accuracy for similar proteins from the same organism. Again, the results were collected, compared to the original, and the 4 proteins previously known as testing data were added to the training data.

All eight proteins were then run at once, this time being seeded with two chromosomes, one from the original training run and one from the first testing run. The results from this were then used to seed four new proteins that become the testing set. This process continued until all acceptable proteins from the data set were a part of the training data, giving the best overall pKa values for that set.

The Genetic Algorithm

The genetic algorithm, written in the Java programming language, was the driving force behind this project. As previously indicated, the GA is set up to loosely simulate evolution and follows Charles Darwin's theory of "survival of the fittest". As mentioned, the original prediction algorithm steps through the sequence, looking at one amino acid at a time. In the following sections, the ideas and code behind the algorithm are explained.

The Chromosome

The first step in any GA is to develop an initial population of what are called chromosomes [4]. A chromosome is an object representing the parameters used to optimize the problem at hand. For the purposes of this experiment, a chromosome could be defined as an array of binary integer values that represent pKa values, one for each dipeptide of interest. For example, if we wanted to represent an arginine when it occurs next to another arginine, or an arginine next to an aspartic acid (as they might occur in the protein sequence), the array might hold values such as "0011" or "0110". When converted to integers, these binary strings equal "3" and "6", which would become the respective pKa values associated with "RR" and "RD" in that chromosome.

Each chromosome then holds the entire set of pKa values used to optimize the pI prediction algorithm. The initial population is obtained by using a random number generator, providing a number between 0 and 14 inclusive to represent each pKa value.
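A minimal Python sketch of this encoding is shown below. The thesis implementation is in Java and is not reproduced here; the function names and the ordering of the 49 dipeptides within the array are assumptions chosen for the illustration.

```python
import random
from itertools import product

CHARGED = "HKRDECY"  # the seven chargeable residues; this ordering is an assumption
DIPEPTIDES = ["".join(pair) for pair in product(CHARGED, repeat=2)]  # 49 pairs


def encode(value):
    """Integer pKa -> four-bit binary string, e.g. 3 -> '0011'."""
    return format(value, "04b")


def decode(bits):
    """Four-bit binary string -> integer pKa, e.g. '0110' -> 6."""
    return int(bits, 2)


def random_chromosome(rng):
    """One random pKa between 0 and 14 inclusive for every dipeptide."""
    return [encode(rng.randint(0, 14)) for _ in DIPEPTIDES]


def pka_table(chromosome):
    """Map each dipeptide to the integer pKa stored for it in the chromosome."""
    return {pair: decode(bits) for pair, bits in zip(DIPEPTIDES, chromosome)}


chromosome = random_chromosome(random.Random(0))
table = pka_table(chromosome)
```

Note that four bits can represent values up to 15, so later bit-level operations on a chromosome may produce a value outside the 0 to 14 range used for initialization.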

Fitness

After the initial generation is in place, each chromosome is tested for what is known as its "fitness." As noted, each chromosome holds the parameters that are utilized in the prediction algorithm, outputting a predicted pI value. In this experiment, a chromosome's fitness can be defined as the average difference between the experimental and predicted pI value for each protein being tested. Therefore, if we have 100 chromosomes and are testing on a set of 10 proteins, that means that for each generation, the fitness value is calculated 1,000 times. Testing on the entire data set means 100 chromosomes tested on 170 proteins, for 17,000 calculations per generation.

Found in the fitness function, the pI prediction algorithm is simply the original algorithm, modified to look at two amino acids at a time. For example, the original algorithm would see a "K" in the sequence and assign it a pKa value of 10. Instead, the modified algorithm sees the K and then checks the amino acid immediately following. If it is an amino acid with a charged side chain, like arginine for example, the function looks at the current chromosome and extracts the corresponding pKa value for K-R, and assigns it to K. After doing so, the algorithm steps ahead one spot and sees the R, and then repeats the process. The overall fitness then depends on how well the parameters found in the chromosome work, or how close the resulting pI prediction ends up being to the experimental pI.
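The modified lookup and the resulting fitness value can be sketched as follows. This is a hedged Python illustration, not the Java fitness function from the study: the charge formula (Henderson-Hasselbalch), the bisection search for the pH, and all function names are assumptions, and the dipeptide table is passed in as a plain dictionary such as {"KR": 3}.

```python
# Default side-chain pKa values (Table 1); used when no charged neighbour follows.
DEFAULT_PKA = {"R": 12.0, "D": 4.05, "E": 4.45, "H": 5.98,
               "K": 10.0, "C": 9.0, "Y": 10.0}
BASIC = {"R", "H", "K"}


def effective_pkas(sequence, dipeptide_pka):
    """Pick a pKa for every chargeable residue, looking one residue ahead."""
    chosen = []
    for i, aa in enumerate(sequence):
        if aa not in DEFAULT_PKA:
            continue
        following = sequence[i + 1] if i + 1 < len(sequence) else ""
        if following in DEFAULT_PKA:
            # Charged neighbour: use the chromosome's value for this dipeptide.
            pka = dipeptide_pka.get(aa + following, DEFAULT_PKA[aa])
        else:
            pka = DEFAULT_PKA[aa]
        chosen.append((aa, float(pka)))
    return chosen


def net_charge(residues, ph):
    total = 0.0
    for aa, pka in residues:
        if aa in BASIC:
            total += 1.0 / (1.0 + 10 ** (ph - pka))
        else:
            total -= 1.0 / (1.0 + 10 ** (pka - ph))
    return total


def predict_pi(residues, tolerance=0.005):
    low, high, ph = 0.0, 14.0, 7.0
    for _ in range(200):
        charge = net_charge(residues, ph)
        if abs(charge) <= tolerance:
            break
        if charge > 0:
            low = ph
        else:
            high = ph
        ph = (low + high) / 2.0
    return ph


def fitness(dipeptide_pka, proteins):
    """Average |experimental pI - predicted pI|; lower means fitter.

    proteins: list of (sequence, experimental_pi) pairs.
    """
    errors = [abs(experimental - predict_pi(effective_pkas(seq, dipeptide_pka)))
              for seq, experimental in proteins]
    return sum(errors) / len(errors)
```

With this definition a smaller fitness value is better, so sorting chromosomes in ascending order of fitness puts the fittest first.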

After all chromosomes in the generation have been assigned a fitness value, they are sorted. The top 5% fittest chromosomes are called "survivors," and are automatically placed in the next generation. Remaining chromosomes are chosen in pairs to represent parents, and they are mated to produce two new offspring.

Tournament Selection

The method by which chromosomes are chosen for mating is known as a "tournament" selection. Many variations of tournament selection exist, with the chosen method mostly being personal preference. In this case, the tournament selection starts out by selecting 4 chromosomes at random, excluding the surviving 5%. From the four selected chromosomes, the two with the best fitness values are mated by crossover. For example, consider the following parents:

Parent A = 1010 1100 0011 0101

Parent B = 1111 0000 1100 0011

Now, consider the possible children resulting from a cross of Parent A and Parent B:

Child A = 1010 1000 1100 0011

Child B = 1111 0100 0011 0101

Notice the effects that this crossover had on the second pKa values for these chromosomes. Initially, the second pKa value listed in Parent A had a value of "1100" or 12, while Parent B was "0000" or 0. Following the crossover, Child A has "1000" or 8 while Child B has "0100", or 4. By implementing this type of crossover, as well as introducing random mutation of individual bits, numerous variations can be quickly introduced into the population, simulating evolution (see the sections on crossover and mutation for more information). The tournament selection repeats, again selecting four chromosomes at random and mating the fittest two, until the new generation contains the desired amount of chromosomes (default set to 100 for this experiment).
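The survivor rule and the tournament loop described above might be combined as in the following Python sketch. It is an illustration only: the fitness and mate functions are supplied by the caller, lower fitness is treated as better (average ΔpI), and trimming the last pair of offspring so that the generation has exactly the requested size is an assumption.

```python
import random


def next_generation(population, fitness, mate, rng, size=100,
                    survivor_fraction=0.05, tournament_size=4):
    """Build a new generation from survivors plus tournament-selected offspring.

    fitness(chromosome) -> float, lower is better.
    mate(parent_a, parent_b) -> pair of offspring.
    """
    # Score every chromosome once, then sort so the fittest come first.
    scored = sorted(((fitness(c), c) for c in population), key=lambda pair: pair[0])
    n_survivors = max(1, int(len(scored) * survivor_fraction))
    new_population = [c for _, c in scored[:n_survivors]]
    pool = scored[n_survivors:]          # survivors are excluded from tournaments
    while len(new_population) < size:
        contenders = sorted(rng.sample(pool, tournament_size),
                            key=lambda pair: pair[0])
        (_, parent_a), (_, parent_b) = contenders[:2]   # the two fittest of four
        new_population.extend(mate(parent_a, parent_b))
    return new_population[:size]


# Toy usage: integers stand in for chromosomes and mating returns the parents unchanged.
toy = next_generation(list(range(100)), lambda x: x, lambda a, b: (a, b),
                      random.Random(1))
```

In the toy run the five lowest values survive unchanged, and every other member comes from a tournament among the remaining ninety-five.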

Crossover

To keep the mating process unbiased, crossover and mutation were both implemented randomly. As mentioned, a crossover requires two parent chromosomes, and results in the creation of two offspring. First, a crossover point is determined using a random number generator. Because a Chromosome object is actually an array of binary strings, this determination must actually be done in two steps:

1. Randomly select an index in the Chromosome array to set the crossover point in. This should be a number from 0-24 inclusive and points to one four bit pKa value.

2. Within the string selected at that index, choose a point to crossover. Each string has a

After the crossover point is selected, the crossover is carried out as previously demonstrated, with second half of one chromosome added to the first half of the other, and vice versa.
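The two-step choice of a crossover point can be sketched in Python as below. The helper names are hypothetical, and because the description of the second step is incomplete in the text, the range of allowed cut positions inside a string (here 0 through 4) is an assumption.

```python
import random


def crossover_at(parent_a, parent_b, gene_index, bit_index):
    """Swap everything after the chosen bit of the chosen gene."""
    def splice(head, tail):
        mixed = head[gene_index][:bit_index] + tail[gene_index][bit_index:]
        return head[:gene_index] + [mixed] + tail[gene_index + 1:]
    return splice(parent_a, parent_b), splice(parent_b, parent_a)


def random_crossover(parent_a, parent_b, rng):
    """Step 1: pick a gene; step 2: pick a cut position inside that gene."""
    gene_index = rng.randrange(len(parent_a))
    bit_index = rng.randrange(len(parent_a[gene_index]) + 1)  # assumed range
    return crossover_at(parent_a, parent_b, gene_index, bit_index)


parent_a = ["1010", "1100", "0011", "0101"]
parent_b = ["1111", "0000", "1100", "0011"]
child_a, child_b = crossover_at(parent_a, parent_b, gene_index=1, bit_index=1)
```

With the cut placed after the first bit of the second gene, this reproduces the Parent A and Parent B example given in the tournament selection section.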

Mutation

Mutations are simply another way to introduce variation into the population and occur roughly 5% of the time. Although different from the crossover, they work in a similar manner. After the two random selections are made (an index in the array, then a position within that string), the selected bit is simply flipped from 0 to 1 or 1 to 0. For instance:

Chromosome A before mutation = 1001 1011 0011 0101 0111

If Chromosome A was to be selected for mutation and the second position in the array, third position in that string were selected, the mutation would end up as follows:

Chromosome A after mutation = 1001 1001 0011 0101 0111

The resulting Chromosome has gone from having a pKa value of 11 in the second position to one having a pKa of 9, which could have a significant impact on the overall pI prediction. Mutation of fit chromosomes could have a detrimental effect on overall population fitness. To avoid this problem mutation rates are kept low, no higher than 5%.
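A single-bit mutation of this kind can be sketched as follows in Python. The function names are hypothetical, and applying the 5% rate once per chromosome (rather than per bit) is an assumption, since the text does not say at which level the rate is applied.

```python
import random


def flip_bit(chromosome, gene_index, bit_index):
    """Return a copy of the chromosome with one bit inverted."""
    gene = chromosome[gene_index]
    flipped = "1" if gene[bit_index] == "0" else "0"
    mutated = list(chromosome)
    mutated[gene_index] = gene[:bit_index] + flipped + gene[bit_index + 1:]
    return mutated


def maybe_mutate(chromosome, rng, rate=0.05):
    """With probability `rate`, flip one randomly chosen bit."""
    if rng.random() >= rate:
        return list(chromosome)
    gene_index = rng.randrange(len(chromosome))
    bit_index = rng.randrange(len(chromosome[gene_index]))
    return flip_bit(chromosome, gene_index, bit_index)


before = ["1001", "1011", "0011", "0101", "0111"]
after = flip_bit(before, gene_index=1, bit_index=2)  # second gene, third bit
```

The usage lines repeat the Chromosome A example from the text, using zero-based indices for the second gene and its third bit.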

In addition to automatically being placed into the next population, the fittest chromosomes are saved after each generation. If after a pre-determined number of generations (always between 50 and 150 in this study), the fittest chromosome has not changed, that fitness is determined to be the best possible under those conditions, and the corresponding pKa values are returned.
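The stopping rule described above can be sketched as a small driver loop in Python. This is an assumption-laden illustration: "has not changed" is interpreted as "the best fitness has not improved", the step function that produces the next generation is supplied by the caller, and max_generations is a safety cap that the text does not mention.

```python
def evolve(population, fitness, step, patience=50, max_generations=10000):
    """Run generations until the best fitness stalls for `patience` generations.

    fitness(chromosome) -> float, lower is better.
    step(population) -> next population.
    Returns (best chromosome, its fitness, generations run).
    """
    best = min(population, key=fitness)
    best_score = fitness(best)
    stalled = 0
    generations = 0
    while stalled < patience and generations < max_generations:
        population = step(population)
        generations += 1
        candidate = min(population, key=fitness)
        score = fitness(candidate)
        if score < best_score:
            best, best_score, stalled = candidate, score, 0   # improvement: reset
        else:
            stalled += 1                                      # no improvement
    return best, best_score, generations


# Toy usage: integers stand in for chromosomes and each step moves them toward 0.
result = evolve([5, 9], abs, lambda pop: [max(x - 1, 0) for x in pop], patience=3)
```

In the toy run the best value reaches 0 after five generations and the loop stops three stalled generations later.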

Java Classes

Containing roughly 800 lines of code (comments included), the program consisted of three classes, the Generation class, the Chromosome class, and the Evolve class. See Table 2 for an explanation of each of the three classes and the important functions within those classes.

Class Name: Chromosome.class
Explanation: Used for representing pKa values for the dipeptides in question, a Chromosome object is an array of binary strings used to represent integers. This class is used to perform operations such as:
  - Random creation of new chromosomes
  - Mating (crossover) and offspring creation
  - Fitness determination
  - Information gathering from Chromosomes themselves

Class Name: Generation.class
Explanation: The Generation class is a sort of container for the chromosomes in each population. Availability of a Generation object becomes especially useful when passing the surviving chromosomes from one generation to the next. Functionality contained within this class includes:
  - Creation of initial, random generation
  - Creation of a new generation based on chromosomes from the previous generation (aforementioned tournament selection)
  - Utilities for accessing individual chromosomes within the generation
  - Sorting by fitness level
  - Introduction of mutations
  - Utilities for reporting results

Class Name: Evolve.class
Explanation: The smallest class of the three, Evolve is simply used to get the algorithm running and to determine when to end the evolving process. Almost all actual functionality is borrowed from the other classes, so this class can be thought of as an organizer of the entire process.

Table 2. The list of Java classes comprising the genetic algorithm and their role in the process.

The previous section gives an overall idea of how the genetic algorithm works. For the

Results

An Example Genetic Algorithm Run

Figure 2 shows the progress made by a genetic algorithm when run on a set of five proteins. This is only meant to display the manner in which the GA arrives at its conclusion, and doesn't directly correspond to the final results.

[Figure: An Example Run of the Genetic Algorithm. y-axis: Average ΔpI; x-axis: # of Generations]

Figure 2. This graph shows progress made by the genetic algorithm on a set of five randomly chosen proteins from the ΔpI>0.7 data set

The five proteins were selected at random from the ΔpI>0.7 data set for use in this example. Typical of most GA runs, the algorithm makes quick improvements early in the run, and starts to slow as time progresses. When considering the later results, the rapid convergence seen here is most likely an indication that the underlying theories behind the GA need to be strengthened.

In this example, the algorithm was allowed to run for 50 generations without any improvement on the top fitness value. Great improvements can be noted for the proteins in this example, as the ΔpI values went from being over 0.7 on average to having an average ΔpI of 0.03. Unfortunately, results like these are uncommon when using a larger number of protein sequences.

Suggested pKa Values

The genetic algorithm was run on four different protein data sets before being run on the complete Escherichia coli data. Each of the four sets corresponded to a different level of discrepancy between experimental and predicted pI values (ΔpI), and the results of these runs are shown below in Table 3.

                  Data Set Used in GA
Dipeptide Pair    ΔpI<0.1    0.1<ΔpI<0.3    0.3<ΔpI<0.7    ΔpI>0.7    Complete
HH                6          12             13             7          1
HK                5          5              3              3          3
HR                1          8              7              10         13
HE                1          11             3              5          3
HD                13         9              7              12         12
HC                9          9              1              11         13
HY                10         8              5              11         12
KH                11         3              6              13         1
KK                5          14             11             1          9
KR                11         12             14             13         12
KE                7          13             1              3          13
KD                14         5              1              10         5
KC                11         14             11             14         14
KY                11         14             5              13         14
RH                12         7              1              1          7
RK                8          9              14             11         11
RR                7          11             8              10         5
RE                12         9              9              13         14
RD                9          14             10             10         14
RC                1          1              7              12         9
RY                13         11             5              10         12
DH                5          12             1              13         5
DK                11         1              9              3          1
DR                3          5              5              11         1
DE                3          8              5              3          7
DD                5          1              3              5          5
DC                7          14             3              13         3
DY                5          3              13             3          12
EH                11         1              1              1          1
EK                3          1              2              3          1
ER                5          5              5              3          5
EE                3          3              5              5          7
ED                5          3              6              11         5
EC                13         3              1              1          5
EY                1          10             11             1          1
CH                6          3              3              1          1
CK                11         7              5              3          1
CR                5          9              9              3          3
CE                7          8              1              8          7
CD                3          13             1              10         12
CC                5          1              5              7          10
CY                9          1              3              13         7
YH                13         1              12             13         13
YK                8          13             14             11         12
YR                11         3              1              13         1
YE                12         14             8              1          1
YD                5          9              13             10         10
YC                14         1              13             10         9
YY                12         14             1              13         11

Table 3. pKa values suggested by GA for incorporation into the pI prediction algorithm. Each column shows the values suggested when using the data set indicated.

Each column represents the pKa values suggested by the genetic algorithm when running on a different set of data. For instance, the first column of data represents the fittest chromosome from the GA runs using proteins assigned to the ΔpI<0.1 data set. When used in the pI prediction algorithm, these dipeptide pKa values resulted in the highest average accuracy level for that group.

At first glance, there are certain aspects of Table 3 that stand out as problem areas. Most notable is the inconsistency when comparing one column to the next. A number of times a value suggested for use from one data set is very distant from that from another data set. For example, the GA suggested a pKa value of 6 for histidine when it occurs next to another histidine in the ΔpI<0.1 data set. Moving across to the 0.1<ΔpI<0.3 column, the value suggested for the same dipeptide pair is much higher, at 12.

In addition, some of the values suggested by the algorithm don't entirely make sense. Aspartic Acid has a default value of 4.05, but has suggested pKa values upwards of 13 from the genetic algorithm. A shift of this magnitude seems improbable and is evidence that the fitness function associated with this GA may need alteration. Alone, this information has little to say about how each suggested dipeptide pKa has affected the accuracy of pI prediction. In the following series of graphs, the suggested pKa values from Table 3 are put to the test when the new ΔpI values are compared to those of the original pI prediction. Again, the difference between the original and new algorithms is the incorporation of dipeptide pKa values that were expected to have a positive effect on the overall prediction accuracy. For complete excel

Effects on pI Prediction

Using ΔpI<0.1 Data Set

[Figure: Effects of Modified Algorithm on ΔpI < 0.1 Data Set. Series: ΔpI Using Original Algorithm, ΔpI Using Modified Algorithm; x-axis: Protein Accession #]

Figure 3. A comparison of ΔpI before and after the incorporation of dipeptide pKa values into the prediction algorithm for the ΔpI<0.1 data. Using pKa values suggested by the GA for the ΔpI<0.1 data set, the pink, jagged line shows ΔpI values when using the modified algorithm. The blue line corresponds to ΔpI values for the same protein set when using the original, unmodified algorithm for prediction.

Further evidence is found in Figure 3, where we see a clear indication that not all proteins were positively affected by the new prediction method. The blue, gradually increasing line represents the ΔpI before addition of dipeptide pKa values, and the jagged, pink line shows the new discrepancy levels. While some improvements can be seen (where the pink line dips below the blue), the majority of the results show a negative impact on prediction, especially in those proteins that previously showed a fairly high level of accuracy. The increasing nature of the blue line in Figure 3 and the figures to follow has a simple explanation: prior to creating these graphs, the proteins were sorted by the original ΔpI values, which were calculated using the original pI prediction method.
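The sorting step described above can be sketched in a few lines of Python. This is only an illustration, not the code used in this work: it assumes ΔpI is the absolute difference between the experimental and predicted pI (as in the |Actual-Pred| column of Appendix A), the names `ProteinResult`, `delta_pi` and `rows_for_plot` are hypothetical, and the "modified" predictions in the sample data are made-up placeholders. The experimental and original predicted values for the three sample proteins are taken from Appendix A.

```python
from typing import List, NamedTuple, Tuple


class ProteinResult(NamedTuple):
    accession: str
    experimental_pi: float
    original_pi: float  # prediction from the unmodified algorithm
    modified_pi: float  # prediction using GA-suggested dipeptide pKa values


def delta_pi(experimental: float, predicted: float) -> float:
    """Assumed definition: absolute difference between actual and predicted pI."""
    return abs(experimental - predicted)


def rows_for_plot(results: List[ProteinResult]) -> List[Tuple[str, float, float]]:
    """Return (accession, original delta pI, modified delta pI), sorted by the original value."""
    rows = [
        (r.accession,
         delta_pi(r.experimental_pi, r.original_pi),
         delta_pi(r.experimental_pi, r.modified_pi))
        for r in results
    ]
    return sorted(rows, key=lambda row: row[1])


# Experimental and original values from Appendix A; modified values are invented.
sample = [
    ProteinResult("P68919", 10.6, 8.2578125, 9.0),
    ProteinResult("P0AE08", 5.05, 5.050048828, 5.12),
    ProteinResult("P0A6D7", 5.18, 5.280761719, 5.25),
]
rows = rows_for_plot(sample)
# Proteins where the modified curve dips below the original one.
improved = [acc for acc, before, after in rows if after < before]
```

Because the rows are ordered by the original ΔpI, the first series rises monotonically while the second is free to jump above or below it, which is the shape described for the figures.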

Using 0.1 < ΔpI < 0.3 Data Set

[Figure 4 chart: "Effects of Modified Algorithm on 0.1 < ΔpI < 0.3 Data Set". Series: ΔpI Using Original Algorithm; ΔpI Using Modified Algorithm. X-axis: Protein Accession #.]

Figure 4. A comparison of ΔpI before and after the incorporation of dipeptide pKa values into the prediction algorithm for the 0.1 < ΔpI < 0.3 data. Using pKa values suggested by the GA for the 0.1 < ΔpI < 0.3 data set, the pink, jagged line shows ΔpI values when using the modified algorithm. The blue line corresponds to ΔpI values for the same protein set when using the

Similar results are found in Figures 4, 5 and 6, showing both positive and negative impacts on prediction accuracy. However, it becomes clear that the data sets with greater discrepancy levels generally yield a greater overall improvement in prediction accuracy. Consider the comparison of Figures 3 and 6. On only three occasions did the new prediction accuracy decrease for the ΔpI > 0.7 data set (Figure 6), whereas the negative impacts seem to outweigh the positive for the ΔpI < 0.1 data. The same theme can be seen in comparing Figures 3 and 5, where accuracy improves on all but four proteins in the 0.3 < ΔpI < 0.7 data set (Figure 5).

Using 0.3 < ΔpI < 0.7 Data Set

[Figure 5 chart: "Effects of Modified Algorithm on 0.3 < ΔpI < 0.7 Data Set". Series: ΔpI Using Original Algorithm; ΔpI Using Modified Algorithm. X-axis: Protein Accession #.]

Figure 5. A comparison of ΔpI before and after the incorporation of dipeptide pKa values into the prediction algorithm for the 0.3 < ΔpI < 0.7 data. Using pKa values suggested by the GA for the 0.3 < ΔpI < 0.7 data set, the pink, jagged line shows ΔpI values when using the modified algorithm. The blue line corresponds to ΔpI values for the same protein set when using the original, unmodified algorithm for prediction.

Using ΔpI > 0.7 Data Set

[Figure 6 chart: "Effects of Modified Algorithm on ΔpI > 0.7 Data Set". Series: ΔpI Using Original Algorithm; ΔpI Using Modified Algorithm. X-axis: Protein Accession #.]

Figure 6. A comparison of ΔpI before and after the incorporation of dipeptide pKa values into the prediction algorithm for the ΔpI > 0.7 data. Using pKa values suggested by the GA for the ΔpI > 0.7 data set, the pink, jagged line shows ΔpI values when using the modified algorithm. The blue line corresponds to ΔpI values for the same protein set when using the original,

Using Entire E. coli Data Set

Following the GA runs on each of the four partial data sets, the best chromosome from each run was used to seed one last run on the entire E. coli data, and those suggested pKa values were plugged into the algorithm. The results of this run are shown in Figure 7. Again, it is evident that most improvements came for those proteins with high ΔpI values, while the modified algorithm faltered for the more accurate proteins.

[Figure 7 chart: "Effects of Modified Algorithm on Complete Data Set". Series: ΔpI Using Original Algorithm; ΔpI Using Modified Algorithm. X-axis: E. coli Protein Data Set.]

Figure 7. A comparison of ΔpI before and after the incorporation of dipeptide pKa values into the prediction algorithm for the complete E. coli data. Using pKa values suggested by the GA for the entire data set, the pink, jagged line shows ΔpI values when using the modified algorithm. The blue line corresponds to ΔpI values for the same protein set when using the original, unmodified algorithm for prediction.

Furthermore, the entire data set was used to test the modified algorithm when incorporating pKa values from the ΔpI < 0.1, 0.1 < ΔpI < 0.3, 0.3 < ΔpI < 0.7 and ΔpI > 0.7 data sets, and the corresponding results can be found in Figures 8, 9, 10, and 11. Again, the inaccuracies tend to overshadow the positive effects on pI prediction.

Effects on Entire E. coli Data Set Using Values Predicted in ΔpI < 0.1 Data

[Figure 8 chart: "Effects of Modified Algorithm on Complete Data Set Using Values Suggested from ΔpI < 0.1 Data". Series: ΔpI Using Original Algorithm; ΔpI Using Modified Algorithm. X-axis: Proteins in E. coli Data Set.]

Figure 8. A comparison of ΔpI before and after the incorporation of dipeptide pKa values into the prediction algorithm for the complete E. coli data. Using pKa values suggested by the GA for the ΔpI < 0.1 data set, the pink, jagged line shows ΔpI values when using the modified algorithm. The blue line corresponds to ΔpI values for the same protein set when using the

Effects on Entire E. coli Data Set Using Values Predicted in 0.1 < ΔpI < 0.3 Data

[Figure 9 chart: "Effects of Modified Algorithm on Complete Data Set Using Values Suggested from 0.1 < ΔpI < 0.3 Data". Series: ΔpI Using Original Algorithm; ΔpI Using Modified Algorithm. X-axis: Proteins in E. coli Data Set.]

Figure 9. A comparison of ΔpI before and after the incorporation of dipeptide pKa values into the prediction algorithm for the complete E. coli data. Using pKa values suggested by the GA for the 0.1 < ΔpI < 0.3 data set, the pink, jagged line shows ΔpI values when using the modified algorithm. The blue line corresponds to ΔpI values for the same protein set when using the original, unmodified algorithm for prediction.

Effects on Entire E. coli Data Set Using Values Predicted in 0.3 < ΔpI < 0.7 Data

[Figure 10 chart: "Effects of Modified Algorithm on Complete Data Set Using Values Suggested from 0.3 < ΔpI < 0.7 Data". Series: ΔpI Using Original Algorithm; ΔpI Using Modified Algorithm. X-axis: Proteins in E. coli Data Set.]

Figure 10. A comparison of ΔpI before and after the incorporation of dipeptide pKa values into the prediction algorithm for the complete E. coli data. Using pKa values suggested by the GA for the 0.3 < ΔpI < 0.7 data set, the pink, jagged line shows ΔpI values when using the modified algorithm. The blue line corresponds to ΔpI values for the same protein set when using the

Effects on Entire E. coli Data Set Using Values Predicted in ΔpI > 0.7 Data

[Figure 11 chart: "Effects of Modified Algorithm on Entire Data Set Using Values Suggested from ΔpI > 0.7 Data". Series: ΔpI Using Original Algorithm; ΔpI Using Modified Algorithm. X-axis: Proteins in E. coli Data Set.]

Figure 11. A comparison of ΔpI before and after the incorporation of dipeptide pKa values into the prediction algorithm for the complete E. coli data. Using pKa values suggested by the GA for the ΔpI > 0.7 data set, the pink, jagged line shows ΔpI values when using the modified algorithm. The blue line corresponds to ΔpI values for the same protein set when using the original, unmodified algorithm for prediction.

As an alternative method for displaying these results, average ΔpI values for each data set are shown in Table 4. Although the overall accuracy appears to have decreased slightly, with the average ΔpI rising from 0.31 to 0.33, the average ΔpI value was decreased by about 45% in the 0.3 < ΔpI < 0.7 data set (0.5060 to 0.2782), and by roughly 25% for proteins in the ΔpI > 0.7 set (1.1148 to 0.8403). While the problem of prediction accuracy clearly still remains, these results may be a step in the right direction.

Average ΔpI Values Before and After

Table 4 shows the average ΔpI values for each data set before and after incorporation of dipeptide pKa values in the prediction algorithm.

Data Set                                     Original Avg. ΔpI   Modified Avg. ΔpI
ΔpI < 0.1                                    0.0455              0.0970
0.1 < ΔpI < 0.3                              0.18                0.17
0.3 < ΔpI < 0.7                              0.5060              0.2782
ΔpI > 0.7                                    1.1148              0.8403
Complete Set                                 0.3069              0.3340
Complete Set Using ΔpI < 0.1 Values          0.3069              0.3583
Complete Set Using 0.1 < ΔpI < 0.3 Values    0.3069              0.3793
Complete Set Using 0.3 < ΔpI < 0.7 Values    0.3069              0.3627
Complete Set Using ΔpI > 0.7 Values          0.3069              0.3810

Table 4. A comparison of average ΔpI values for each data set
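To make the size of each change in Table 4 easier to read, the relative change in average ΔpI can be computed directly from the tabulated values. The snippet below is a small worked example rather than part of the original analysis; the dictionary keys and the helper name `percent_change` are illustrative, and a negative result means the average ΔpI went down (prediction improved).

```python
# (original average delta pI, modified average delta pI), copied from Table 4
table4 = {
    "dpI < 0.1": (0.0455, 0.0970),
    "0.3 < dpI < 0.7": (0.5060, 0.2782),
    "dpI > 0.7": (1.1148, 0.8403),
    "Complete Set": (0.3069, 0.3340),
}


def percent_change(original: float, modified: float) -> float:
    """Relative change in average delta pI, in percent (negative = more accurate)."""
    return 100.0 * (modified - original) / original


changes = {name: percent_change(orig, mod) for name, (orig, mod) in table4.items()}

for name, change in changes.items():
    print(f"{name:>16}: {change:+.1f}%")
```

On these numbers the two high-discrepancy sets come out roughly 45% and 25% lower, while the ΔpI < 0.1 set more than doubles and the complete set rises by a little under 9%.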


Discussion

Overall, it appears that our learning algorithm wasn't completely effective in improving on isoelectric point prediction in proteins. While one can only speculate as to exactly why the results appeared as they did, one idea was that the training data set was insufficient for the GA to produce reasonable results.

To test this theory, a final experiment was performed that is known as a "leave-one-out" approach. This approach addresses the training set problem by including all but one protein in the training data. For example, in a data set that contains 170 proteins, the first GA training run included proteins #2-170, while protein #1 was set aside as the testing data. After collecting pKa values from the GA run on the training data, those values were incorporated into the pI prediction algorithm to test their effects on prediction for the testing data, or protein #1. This information, including experimental pI value, predicted pI value, and predicted pI value from the modified algorithm, was then recorded into a table.

After recording the data, protein #1 was put back into the set of proteins and protein #2 was removed and set aside as the testing data. Again the GA was run and results were collected and recorded as they were in the first run. Next, protein #2 was re-introduced into the data set and protein #3 was removed and set aside, and this process was repeated over and over. When each of the 170 proteins in the data set had at one time been set aside as testing data, the experiment was complete. The next step was to compare average ΔpI values of the original and modified pI prediction algorithms.
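The leave-one-out procedure described above can be sketched as a simple loop. The sketch below is not the code used for the experiment: the GA run and both pI predictors are passed in as functions, and the `toy_*` functions and the four sample proteins are hypothetical stand-ins chosen only so the example runs on its own.

```python
from typing import Callable, Dict, List, Tuple

Protein = Tuple[str, str, float]  # (accession, sequence, experimental pI)
PkaTable = Dict[str, float]       # parameters returned by one training run
Record = Tuple[str, float, float, float]


def leave_one_out(
    proteins: List[Protein],
    train: Callable[[List[Protein]], PkaTable],
    predict_original: Callable[[str], float],
    predict_modified: Callable[[str, PkaTable], float],
) -> List[Record]:
    """Hold out each protein once, train on the rest, and record both predictions."""
    records = []
    for i, (accession, sequence, experimental) in enumerate(proteins):
        training = proteins[:i] + proteins[i + 1:]  # every protein except the held-out one
        pkas = train(training)
        records.append((accession, experimental,
                        predict_original(sequence),
                        predict_modified(sequence, pkas)))
    return records


def average_delta_pi(records: List[Record], column: int) -> float:
    """Mean |experimental - predicted|; column 2 = original, column 3 = modified."""
    return sum(abs(r[1] - r[column]) for r in records) / len(records)


# --- hypothetical stand-ins, for demonstration only ---
def toy_predict(sequence: str) -> float:
    return 7.0 + 0.5 * (sequence.count("K") - sequence.count("D"))


def toy_train(training: List[Protein]) -> PkaTable:
    errors = [exp - toy_predict(seq) for _, seq, exp in training]
    return {"bias": sum(errors) / len(errors)}  # stands in for a full GA run


def toy_predict_modified(sequence: str, pkas: PkaTable) -> float:
    return toy_predict(sequence) + pkas["bias"]


proteins = [("A", "KKD", 7.8), ("B", "DDK", 6.9), ("C", "KDA", 7.2), ("D", "AAA", 7.1)]
records = leave_one_out(proteins, toy_train, toy_predict, toy_predict_modified)
original_avg = average_delta_pi(records, 2)
modified_avg = average_delta_pi(records, 3)
```

Each protein is scored only with parameters that were fit without it, so comparing `original_avg` with `modified_avg` measures how well the suggested values generalize rather than how well they fit the training set.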


Using the original prediction algorithm, the average ΔpI was 0.31. Using the leave-one-out approach to optimize the pI prediction showed a significant decrease in accuracy, ending up with an average ΔpI of 0.47. While this wasn't the result that was hoped for, it is consistent with results from the previous experiment, where we were unable to improve on overall prediction accuracies for the complete data set.

Vast possibilities exist for expanding on this work in an attempt to significantly improve our pI prediction algorithm. First, cutting down the list of dipeptides in the chromosome might make the genetic algorithm more efficient in its results. Throughout the GA runs, it became clear that chromosomes containing notably different pKa values could oftentimes result in very close fitness values. If those dipeptides were greatly impacting the prediction algorithm, we would expect to see consistent results. Instead, the inconsistencies might indicate that the charges on the side chains of these adjacent amino acids do not affect one another to a large extent, in which case trying to account for them may actually hurt prediction accuracy. A study to accomplish this might include a comparison of sequence characteristics between the positively and negatively affected proteins. By narrowing the search space in this manner, the chances of having a positive effect without the negative repercussions should increase.

Another problem area in this study, and a possibility for further research, might involve limiting how far the suggested pKa values are allowed to deviate from the default. As previously mentioned, some of the pKa values were more than doubled in the modified prediction algorithm. To illustrate this problem, we might consider any randomly chosen dipeptide, like a histidine-aspartic acid combination, for instance.

Histidine has a default pKa of 5.98, but when it occurred next to aspartic acid in the ΔpI >

times the H-D dipeptide occurred in this data, which is the smallest set of the four. At the start of the GA runs, all pKa values are randomly generated, so if there were a low occurrence of H-D combinations in any data set, the fitness value wouldn't be affected as much as it is by frequently occurring dipeptides. In turn, this means that outrageous pKa values might not end up being replaced and could survive in the fittest population.

This point could be used to explain the results when running on the complete data set. Again, the pKa value for H-D was suggested to be 12. For this example it is important to keep in mind that the GA run using the complete data set was initially seeded with the top chromosomes from each of the four previous runs. This means that the top chromosome from the ΔpI > 0.7 run was used, immediately introducing a pKa value of 12 into the population. Even in the instance that the chromosome wasn't in the top 5% fitness level and didn't survive to the next generation, it's likely that the value of 12 for H-D stayed intact through a series of mating and crossover events. If at some point the chromosome containing that pKa value was in the top 5% of fitness levels, it was automatically moved to the next generation, saving that value for the histidine-aspartic acid pair.

Over time, the fittest chromosomes end up being reproduced more readily, which in this example would mean that the value of 12 for the H-D pair dominates the population even though it might not have a significant effect on prediction accuracy. Eventually, this value is incorporated into the algorithm, and could end up having a negative impact on the prediction.
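The survival mechanism described here can be illustrated with a minimal sketch of one generation step. It is a simplified stand-in for the GA used in this work, under several assumptions: chromosomes are plain lists of pKa values, lower fitness is better, parents are drawn uniformly at random, and single-point crossover is used. Only the 5% elite fraction comes from the text; the function name `next_generation` and the toy fitness, which deliberately ignores gene 0 to mimic a rarely occurring dipeptide such as H-D, are hypothetical.

```python
import random
from typing import Callable, List, Optional

Chromosome = List[float]


def next_generation(
    population: List[Chromosome],
    fitness: Callable[[Chromosome], float],
    elite_fraction: float = 0.05,
    rng: Optional[random.Random] = None,
) -> List[Chromosome]:
    """Copy the fittest chromosomes unchanged, then fill up with crossover children."""
    rng = rng or random.Random()
    ranked = sorted(population, key=fitness)  # assumption: lower fitness is better
    n_elite = max(1, int(len(ranked) * elite_fraction))
    new_population = [list(c) for c in ranked[:n_elite]]
    while len(new_population) < len(population):
        parent_a, parent_b = rng.sample(ranked, 2)
        cut = rng.randrange(1, len(parent_a))  # single-point crossover
        new_population.append(list(parent_a[:cut]) + list(parent_b[cut:]))
    return new_population


rng = random.Random(0)
population = [[rng.uniform(1.0, 14.0) for _ in range(4)] for _ in range(19)]
population.append([12.0, 4.0, 6.0, 10.0])  # seeded chromosome carrying pKa 12 in gene 0
target = [4.0, 6.0, 10.0]


def toy_fitness(chromosome: Chromosome) -> float:
    # Gene 0 is ignored: a rare dipeptide barely moves the fitness value.
    return sum(abs(g - t) for g, t in zip(chromosome[1:], target))


new_population = next_generation(population, toy_fitness, rng=rng)
```

Because the toy fitness never looks at gene 0, the seeded chromosome is ranked purely on its other genes and is copied forward with its value of 12 untouched, and crossover children can inherit that value as well.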

Therefore, limiting how far the pKa values can deviate would decrease the negative impact in cases such as this, so that such values might not overshadow the suggested pKa values that really are having a positive effect.
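One way such a limit could be imposed is to clamp every suggested value to a window around its default before it is scored or used. The sketch below is a suggestion only, not something implemented in this work; the helper name `clamp_pka` and the window of 2.0 pKa units are hypothetical, while the default values of 4.05 and 5.98 are the ones quoted in the text.

```python
def clamp_pka(suggested: float, default: float, max_shift: float) -> float:
    """Restrict a suggested pKa to the range default +/- max_shift."""
    low, high = default - max_shift, default + max_shift
    return max(low, min(high, suggested))


MAX_SHIFT = 2.0  # hypothetical limit, in pKa units

# Aspartic acid (default 4.05) with a GA suggestion of 13
asp_limited = clamp_pka(13.0, 4.05, MAX_SHIFT)
# Histidine next to aspartic acid (default 5.98) with a GA suggestion of 12
his_limited = clamp_pka(12.0, 5.98, MAX_SHIFT)
# A suggestion already inside the window is left unchanged
unchanged = clamp_pka(4.5, 4.05, MAX_SHIFT)
```

Applying the clamp when chromosomes are generated and after each crossover or mutation would keep the search inside chemically plausible values; how wide the window should be is an open question.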

A third approach might be to expand on the chargeable groups and introduce uncharged amino acids into the equation. Previous research has shown that N-terminal asparagine had a significant impact on the predicted pI value [5]. Although the means by which this occurs remain unclear, one possibility may be that the hydrophobic, uncharged amino acids interfere with charged, adjacent side chains when in contact with water. Although possibilities are extensive for this type of research, one method of attacking this problem might be to consider events where a hydrophobic amino acid rests between charged side chains. Similar to the research presented here, an evolutionary programming approach could be used in an attempt to find pKa values that remedy the problem.

Conclusion

This thesis work has investigated the possibility of improving isoelectric point prediction by using evolutionary programming to account for charge-charge interactions within the sequence. While an increase in accuracy was seen on a small scale, it was not substantial enough and was overshadowed by decreases in accuracy in other areas. For that reason we cannot say our work has resulted in a better algorithm. However, isoelectric point prediction is a difficult problem that still has much room for investigation. While the results failed to yield evidence of an overall accuracy increase, the information presented here puts us one small step closer to a successful pI prediction and provides a genetic algorithm that may prove useful in future studies.

Bibliography

[1] Hamdan, H. and Righetti, P.G. (2005) "Proteomics Today: Protein Assessment and Biomarkers Using Mass Spectrometry, 2D Electrophoresis, and Microarray Technology". Hoboken, NJ, John Wiley and Sons, Inc.

[2] Fichmann, J. and Westermeier, R. (1999) "2-D Protein Gel Electrophoresis: An Overview." Methods in Molecular Biology: Vol. 112 (1-9)

[3] Conte, M. (2005) "Isoelectric Point Prediction From the Amino Acid Sequence of a Protein" submitted as part of a Master's Thesis Project at RIT in 2004

[4] Mitchell, M. (1998) "An Introduction to Genetic Algorithms" The MIT Press, 1999

[5] Cargile, B.J., Talley, D.L., Stephenson, J.L. (2004) "Immobilized pH gradients as a first dimension in shotgun proteomics and analysis of the accuracy of pI predictability of peptides". Electrophoresis 25: 936-945

[6] Horstmann, C.S. (2001) Big Java: Programming and Practice. Wiley, 1st Edition

[7] "SWISS-2DPAGE Two-dimensional polyacrylamide gel electrophoresis database". Found at http://us.expasy.org/ch2d/

[8] Bjellqvist, B., Hughes, G., Pasquali, C., Paquet, N., Ravier, F., Sanchez, J.-C., et al. (1993) "The focusing positions of polypeptides in immobilized pH gradients can be predicted from their amino acid sequences". Electrophoresis 14: 1023-1031.

[9] Tonella L., Hoogland C., Binz P.-A., Appel R.D., Hochstrasser D.F., Sanchez J.-C. "New perspectives in the Escherichia coli proteome investigation". Proteomics 1: 409-423 (2001).

[10] "Compute pI/Mw for Swiss-Prot/TrEMBL entries or a user-entered sequence". Found at http://us.expasy.org/tools/pi_tool.html

[11] Sillero A., Ribeiro, J.M. (1989) "Isoelectric points of proteins: theoretical determination." Analytical Biochemistry 179: 319-325

[12] Righetti, P.G., Caravaggio, T. (1976) "Isoelectric points and molecular weights of proteins." Journal of Chromatography

[13] Cargile, B.J., et al. (2004) "Gel Based Isoelectric Focusing of Peptides and the Utility of Isoelectric Point in Protein Identification." Journal of Proteome Research 3.1 (2004): 112-9.

[14] "Get protein list for a reference map." Found at http://www.expasy.org/cgi-bin/get-ch2d-table.pl

[15] "NCBI Batch Entrez search". Found at http://www.ncbi.nlm.nih.gov/entrez/batchentrez.cgi?db=Protein

Appendix A - Escherichia coli Data Set

Color Codes: ΔpI < 0.1; 0.1 < ΔpI < 0.3; 0.3 < ΔpI < 0.7; ΔpI > 0.7

Protein   Actual pI   Predicted     |Actual-Pred|
P0AE08    5.05        5.050048828   4.88E-05
P05055    5.13        5.129943848   5.62E-05
P45578    5.2         5.200439453   4.39E-04
P0AEZ9    5.74        5.7421875     0.0021875
P37689    5.15        5.152587891   0.002587891
P0ABB0    5.81        5.806274414   0.003725586
P0A6L0    5.52        5.514892578   0.005107422
P13029    5.16        5.16583252    0.00583252
P61714    5.19        5.183349609   0.006650391
P23869    5.51        5.502929688   0.007070312
P09030    5.8         5.807983398   0.007983398
P0AEZ3    5.28        5.26965332    0.01034668
P0A7A9    5.06        5.049194336   0.010805664
P0A6P1    5.22        5.234619141   0.014619141
P0AFU8    5.67        5.655029297   0.014970703
P0ABB4    4.95        4.932983398   0.017016602
P0A817    5.1         5.121826172   0.021826172
P0AB71    5.56        5.537109375   0.022890625
P36683    5.24        5.263671875   0.023671875
P0ACU7    5           4.973144531   0.026855469
P0A6E4    5.28        5.252563477   0.027436523
P0A6F5    4.91        4.879150391   0.030849609
P00509    5.53        5.561035156   0.031035156
P39172    5.58        5.611450195   0.031450195
P16703    5.47        5.437988281   0.032011719
P0AE67    4.95        4.915039063   0.034960938
P0ADU2    5.77        5.807128906   0.037128906
P0A9C3    4.9         4.860778809   0.039221191
P0A877    5.38        5.338867188   0.041132812
P0C054    5.63        5.588378906   0.041621094
P62707    5.82        5.861816406   0.041816406
P0A799    5.15        5.107299805   0.042700195
P0AAI9    5.03        4.98425293    0.04574707
P0A6M8    5.21        5.256408691   0.046408691
P26646    5.6         5.648193359   0.048193359
P07004    5.39        5.438842773   0.048842773
P0A6D3    5.34        5.389282227   0.049282227
P0A7Z4    5.04        4.988952637   0.051047363
P0A870    5.08        5.132080078   0.052080078
P63284    5.44        5.383728027   0.056271973
P0A796    5.43        5.487548828   0.057548828
P08312    5.74        5.799438477   0.059438477
P0AGE0    5.41        5.472167969   0.062167969
P0A6G7    5.6         5.537109375   0.062890625
P0A6F9    5.23        5.166259766   0.063740234
P0A7F3    7.01        6.941894531   0.068105469
P0AE18    5.71        5.638793945   0.071206055
P24216    4.96        4.888549805   0.071450195
P09832    5.48        5.551635742   0.071635742
P23721    5.47        5.391845703   0.078154297
P0A850    4.93        4.849884033   0.080115967
P0AB55    5.29        5.208984375   0.081015625
P76149    5.38        5.46105957    0.08105957
P0AG67    4.99        4.908416748   0.081583252
P16659    5.06        5.146606445   0.086606445
P0AA25    4.8         4.711669922   0.088330078
P0A9A9    5.78        5.688354492   0.091645508
P08142    5.22        5.314086914   0.094086914
P18843    5.34        5.434570313   0.094570313
P0AC55    5.78        5.875488281   0.095488281
P05194    5.31        5.213256836   0.096743164
P0A6Y8    4.96        4.863128662   0.096871338
P0A9M5    5.44        5.538818359   0.098818359
P0A6D7    5.18        5.280761719   0.100761719
P0AED0    5.2         5.097900391   0.102099609
P0A9D2    5.76        5.863525391   0.103525391
P06960    5.73        5.625976563   0.104023438
P0A8G6    5.51        5.615722656   0.105722656
P0AG78    6.49        6.596679688   0.106679687
P0AEQ3    7.32        7.435791016   0.115791016
P60595    5.24        5.359375      0.119375
P0ABU2    5.02        4.900085449   0.119914551
P0A7L0    8.24        8.115966797   0.124033203
P0ABD8    4.78        4.654846191   0.125153809
P0AF03    4.84        4.965454102   0.125454102
P04949    4.7         4.573669434   0.126330566
P68066    4.98        5.106445313   0.126445312
P0A6E6    5.34        5.46875       0.12875
P0A6A3    5.72        5.8515625     0.1315625
P75797    7.32        7.186279297   0.133720703
P29744    4.82        4.683044434   0.136955566
P09029    5.75        5.612304688   0.137695313
P61889    5.49        5.629394531   0.139394531
P00547    5.33        5.472167969   0.142167969
P0A7E1    7.28        7.13671875    0.14328125
P67910    4.98        4.835571289   0.144428711
P0AGD3    5.45        5.595214844   0.145214844
P0A9A6    4.83        4.680480957   0.149519043
P00946    5.16        5.31237793    0.15237793
P12758    5.66        5.82421875    0.16421875
P0A955    5.43        5.595214844   0.165214844
P0A8M0    5.01        5.195739746   0.185739746
P69783    4.95        4.762939453   0.187060547
P0AFC7    5.42        5.612304688   0.192304688
P0A9Q9    5.2         5.393554688   0.193554687
P25553    5.29        5.095336914   0.194663086
P0AEX9    5.23        5.435424805   0.205424805
P28635    4.95        5.156860352   0.206860352
P0A9G6    4.98        5.189758301   0.209758301
P0A6W5    4.95        4.73815918    0.21184082
P39177    6.25        6.037841797   0.212158203
P0A862    5.02        4.800537109   0.219462891
P0A715    6.1         6.323242188   0.223242188
P0AC69    4.96        4.727050781   0.232949219
P0A7N1    8.3         8.065551758   0.234448242
P0ABU5    4.92        4.685180664   0.234819336
P0A7K2    4.87        4.633056641   0.236943359
P0AES9    4.84        5.0859375     0.2459375
P0AD96    5.31        5.561889648   0.251889648
P38489    5.55        5.812255859   0.262255859
P0AEK4    5.33        5.595214844   0.265214844
P0A6N1    5.58        5.314086914   0.265913086
P0A855    7.05        6.78125       0.26875
P46850    5.65        5.928466797   0.278466797
P0A9C5    5           5.282470703   0.282470703
P35340    5.2         5.485839844   0.285839844
P0A9M2    5.38        5.080810547   0.299189453
P04036    5.11        5.46105957    0.35105957
P37902    7.87        7.516113281   0.353886719
P0A940    5.33        4.967163086   0.362836914
P0A763    5.19        5.557617188   0.367617187
P0A8X2    5.2         5.591796875   0.391796875
P63020    4.96        4.568115234   0.391884766
P76290    5.8         5.373046875   0.426953125
P04816    5.08        5.516601563   0.436601562
P0A8Q6    5.4         4.959472656   0.440527344
P0AG82    6.85        7.293945313   0.443945313
P0ABT2    5.27        5.718261719   0.448261719
P09551    5.17        5.622558594   0.452558594
P08200    4.7         5.17565918    0.47565918
P0A8P3    5.46        5.937011719   0.477011719
P23847    5.71        6.196777344   0.486777344
P37329    6.7         7.187988281   0.487988281
P0AEU0    4.99        5.489257813   0.499257812
P0AEE5    5.19        5.697753906   0.507753906
P0AFK9    4.76        5.270507813   0.510507813
P0ADG7    5.49        6.017333984   0.527333984
P16700    6.58        7.128173828   0.548173828
P69441    5           5.567871094   0.567871094
P0AFZ3    5.01        4.428833008   0.581166992
P23843    5.47        6.052368164   0.582368164
P69797    5.17        5.755859375   0.585859375
P0AGE9    5.73        6.321533203   0.591533203
P0A6P9    4.74        5.344848633   0.604848633
P18335    5.19        5.797729492   0.607729492
P0A858    5.01        5.649047852   0.639047852
P31663    5.25        5.926757813   0.676757813
P0C0V0    8.01        7.329833984   0.680166016
P0A879    5.03        5.717407227   0.687407227
P0ADG4    5.71        6.453125      0.743125
P30859    5.07        5.813964844   0.743964844
P00894    8.11        7.352050781   0.757949219
P0AET2    5           5.762695313   0.762695313
P0A910    5.23        5.996826172   0.766826172
P61316    5.52        6.306152344   0.786152344
P0ABK5    4.95        5.843017578   0.903017578
P0ADE8    6.11        5.179931641   0.930068359
P0ADA3    8.84        7.902770996   0.937229004
P0AFL3    8.52        7.567382813   0.952617187
P76002    9.2         8.246704102   0.953295898
P0AFG0    5.4         6.37109375    0.97109375
P0AD59    5.33        6.306152344   0.976152344
P0AEM9    5.19        6.230957031   1.040957031
P0A7R1    5.1         6.186523438   1.086523438
P77348    8.55        7.314453125   1.235546875
P0A9B2    5.32        6.583007813   1.263007812
P33136    8.04        6.608642578   1.431357422
P00811    9.06        7.55456543    1.50543457
P0ADV7    10.3        7.978393555   2.321606445
P68919    10.6        8.2578125     2.3421875
