
Correlation Approach to Identify Coding Regions in DNA Sequences

S. M. Ossadnik,* S. V. Buldyrev,* A. L. Goldberger,‡ S. Havlin,*§ R. N. Mantegna,* C.-K. Peng,*‡ M. Simons,‡¶ and H. E. Stanley*

*Center for Polymer Studies and Department of Physics, Boston University, Boston, Massachusetts 02215 USA, ‡Cardiovascular Division, Harvard Medical School, Beth Israel Hospital, Boston, Massachusetts 02215 USA, §Department of Physics, Bar Ilan University, Ramat Gan, Israel, and ¶Department of Biology, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139 USA

ABSTRACT Recently, it was observed that noncoding regions of DNA sequences possess long-range power-law correlations, whereas coding regions typically display only short-range correlations. We develop an algorithm based on this finding that enables investigators to perform a statistical analysis on long DNA sequences to locate possible coding regions. The algorithm is particularly successful in predicting the location of lengthy coding regions. For example, for the complete genome of yeast chromosome III (315,344 nucleotides), at least 82% of the predictions correspond to putative coding regions; the algorithm correctly identified all coding regions larger than 3000 nucleotides, 92% of coding regions between 2000 and 3000 nucleotides long, and 79% of coding regions between 1000 and 2000 nucleotides. The predictive ability of this new algorithm supports the claim that there is a fundamental difference in the correlation property between coding and noncoding sequences. This algorithm, which is not species-dependent, can be implemented with other techniques for rapidly and accurately locating relatively long coding regions in genomic sequences.

INTRODUCTION

One of the major problems facing researchers working with long genomic DNA sequences is the need for a rapid and accurate method of identifying coding regions. Currently, a typical search for a coding region involves scanning the DNA sequence for the presence of an open reading frame (longer than a certain arbitrarily defined length) for both orientations and for all possible frame-shift positions. The identified open reading frames are then searched for canonical intron splice sites and for the existence of cDNA or protein matches by using appropriate databases. These methods are labor-intensive and require considerable operator participation. In contrast, an ideal technique would be fast and accurate and require only minimal operator input.

Recently, a multiple sensor neural network approach was developed by Uberbacher and Mural (1991) to locate protein-coding regions. Their approach involves calculating the values of a group of seven sensor algorithms over a window of 99 consecutive bp. A neural network training procedure is then performed on a training set of human DNA sequences for optimizing the weights of the different sensor algorithms. This approach has been used to detect coding regions in human DNA with good predictive power. However, because most of those sensor algorithms are species-sensitive, the parameters need to be adjusted for other organisms (especially nonmammalian DNA sequences). Therefore, an algorithm based on a more general principle that can be applied across the entire phylogenetic spectrum without modification would be desirable.

Received for publication 22 February 1994 and in final form 18 April 1994. Address reprint requests to H. Eugene Stanley, Center for Polymer Studies, Department of Physics, 590 Commonwealth Avenue, Boston, MA 02215. Tel.: 617-353-2617; Fax: 617-353-3783; E-mail: hes@buphyk.bu.edu.

© 1994 by the Biophysical Society 0006-3495/94/07/64/07 $2.00

We have developed such a tool for rapid identification of DNA coding elements based on our observation of the existence of power-law long-range correlations in noncoding, but not in coding, sequences (Peng et al., 1992). The key general concept underlying this new technique, which we call the "coding sequence finder" (CSF) algorithm, is to "drag" an observation box along the DNA sequence and to measure continuously the "signal" from a device that quantifies the degree of power-law long-range correlation. Noncoding regions are typically characterized by a correlation that is long-range in that it decays not exponentially but, rather, as a power law. On the other hand, coding regions typically display only short-range correlations, which decay exponentially.

We test the CSF algorithm on a variety of long DNA sequences, including the recently sequenced yeast chromosome III, which comprises 315,344 bp (Oliver et al., 1992). The algorithm is found to work well when the coding regions are moderately large (over 1000 bp in length). We also confirm its accuracy on long, artificially generated "control" sequences composed of known coding and known noncoding subsequences.

POWER-LAW LONG-RANGE CORRELATIONS

To quantify the correlation properties of a DNA sequence, it is convenient to introduce a graphical or "landscape" representation, termed a DNA walk (Peng et al., 1992). For the conventional one-dimensional random walk model (Montroll and Shlesinger, 1984), a walker moves either "up" [$u(i) = +1$] or "down" [$u(i) = -1$] one unit length for each step $i$ of the walk. For the case of an uncorrelated walk, the direction of each step is independent of the previous steps. For the case of a correlated random walk, the direction of each step depends on the history ("memory") of the walker.


Coding DNA Sequences

One possible choice for the DNA walk can be defined as follows: the walker steps "up" [$u(i) = +1$] if a pyrimidine (C or T) occurs at position a linear distance $i$ along the DNA chain, whereas the walker steps "down" [$u(i) = -1$] if a purine (A or G) occurs at position $i$. Other definitions are discussed in Discussion and Summary. A key question is whether such a walk displays only short-range correlations (as in an $n$-step Markov chain) or power-law long-range correlations (as in critical phenomena and other scale-free "fractal" phenomena).

The DNA walk provides a graphical representation for any DNA sequence and permits the degree of correlation in the base pair sequence to be visualized directly. To quantify this correlation, one calculates the "net displacement," $y(\ell)$, of the walker after $\ell$ steps, which is the sum of the unit steps $u(i)$ for each step $i$. Thus,

$$y(\ell) \equiv \sum_{i=1}^{\ell} u(i).$$
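The purine-pyrimidine walk and its net displacement can be sketched in a few lines of Python. This is only an illustration of the definitions above; the names `STEP` and `dna_walk` are hypothetical and do not come from the paper.

```python
from itertools import accumulate

# Purine-pyrimidine rule from the text: pyrimidines (C, T) step up,
# purines (A, G) step down.
STEP = {"C": +1, "T": +1, "A": -1, "G": -1}

def dna_walk(sequence):
    """Return the net displacements y(1), ..., y(N) of the DNA walk."""
    steps = [STEP[base] for base in sequence.upper()]
    return list(accumulate(steps))  # running sum of the unit steps u(i)

print(dna_walk("CTAGGC"))  # [1, 2, 1, 0, -1, 0]
```

The running sum is the "landscape" referred to in the text: a stretch rich in pyrimidines climbs, and a stretch rich in purines descends.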

One difficulty in analyzing DNA sequences by the random walk method is that DNA sequences are highly heterogeneous. Thus, the problem of how to distinguish "patchiness" from truly fractal (scale-invariant) behavior needs to be addressed (Karlin and Brendel, 1993). In Peng et al. (1992), a "min-max" method was proposed to take into account the nucleotide heterogeneity and changes in strand bias. A potential drawback of this method is that it requires the investigator to judge how many local maxima and minima of a landscape to utilize in the analysis. Recently, we presented a new method, "detrended fluctuation analysis" (DFA), that is independent of investigator input and permits the detection of power-law long-range correlations embedded in a patchy landscape, and also avoids the spurious detection of apparent power-law long-range correlations that are an artifact of nucleotide patchiness (Peng et al., 1994).

The DFA method is carried out as follows: first, we divide the entire sequence of length $N$ into $N/\ell$ nonoverlapping boxes, each containing $\ell$ nucleotides, and define the "local trend" in each box to be the ordinate of a linear least-squares fit for the DNA walk displacement in that box. Next we define the "detrended walk," denoted by $y_\ell(n)$, as the difference between the original walk $y(n)$ and the local trend. We calculate the variance about the local trend for each box and calculate the average of these variances over all the boxes of size $\ell$, denoted $F_d^2(\ell)$. Thus,

$$F_d^2(\ell) \equiv \frac{1}{N} \sum_{n=1}^{N} y_\ell^2(n). \qquad (1)$$

It was shown (Peng et al., 1994) that the calculation of $F_d(\ell)$ can clearly distinguish two different types of behavior: (i) $F_d(\ell) \sim \ell^{1/2}$ for patchy but otherwise uncorrelated (or only short-range correlated) sequences, and (ii)

$$F_d(\ell) \sim \ell^{\alpha} \qquad (2)$$

with $\alpha \neq 1/2$, if there is no characteristic length for the correlations.

Typical data for $F_d(\ell)$ are linear on double logarithmic plots, confirming that, indeed, $F_d(\ell) \sim \ell^{\alpha}$. A least-squares fit of such data produces a straight line with slope $\alpha$. It was observed that for coding sequences, $\alpha \approx 1/2$, whereas for noncoding sequences, $\alpha$ is substantially larger than 1/2 (Peng et al., 1992, 1994).
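A minimal pure-Python sketch of the DFA calculation and of the log-log fit for $\alpha$ may help make the procedure concrete. It is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, nucleotides left over after the last complete box are simply ignored, and the average is therefore taken over the detrended points actually used rather than over all $N$.

```python
import math
import random

def linear_fit(xs, ys):
    """Least-squares slope and intercept of ys against xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def dfa_fluctuation(y, ell):
    """F_d(ell): RMS deviation of the walk y from its per-box linear trend."""
    n_boxes = len(y) // ell          # trailing partial box is dropped (assumption)
    xs = list(range(ell))
    total = 0.0
    for b in range(n_boxes):
        box = y[b * ell:(b + 1) * ell]
        slope, intercept = linear_fit(xs, box)   # local trend in this box
        total += sum((v - (slope * x + intercept)) ** 2 for x, v in zip(xs, box))
    return math.sqrt(total / (n_boxes * ell))

def scaling_exponent(y, box_sizes):
    """Slope alpha of log F_d(ell) versus log ell."""
    log_l = [math.log(l) for l in box_sizes]
    log_f = [math.log(dfa_fluctuation(y, l)) for l in box_sizes]
    return linear_fit(log_l, log_f)[0]

# Demo on an uncorrelated walk (synthetic data, fixed seed): alpha should be near 1/2.
rng = random.Random(0)
y, total = [], 0
for _ in range(20000):
    total += rng.choice((-1, 1))
    y.append(total)
alpha = scaling_exponent(y, [8, 16, 32, 64])
print(round(alpha, 2))
```

For an uncorrelated walk the printed exponent should fall close to 0.5, which is the behavior the text associates with coding sequences; a sequence with power-law long-range correlations would give a larger slope.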

CODING SEQUENCE FINDER (CSF) ALGORITHM

The focus of the CSF algorithm is the calculation of the correlation exponent $\alpha$ for different subregions of the DNA sequence. If $\alpha$, measured from a subregion, is close to 0.5, this subregion is more likely to belong to the coding part of the sequence, which is in accord with our finding that the coding sequences do not have power-law long-range correlations. If, on the other hand, the value of $\alpha$ for a region is much larger than 0.5, then this region is more likely to belong to the noncoding part of the sequence. Note, however, that $\alpha$ cannot be calculated for a single nucleotide. Instead, the exponent $\alpha$, defined by the behavior of the fluctuation $F_d(\ell)$, can be calculated only for a subsequence of nucleotides with length $w \gg \ell$.

Therefore, we have devised the following 6-step procedure:

Step 1. Calculate $F_d(\ell)$ for the subsequence (window of size $w$) from nucleotide $n - w/2$ to nucleotide $n + w/2$, for a continuous sequence of positions $n$ ranging from the first nucleotide ($n = w/2$) to the last ($n = N - w/2$), where $N$ is the total number of bp.

Step 2. Construct a log-log plot of $F_d(\ell)$ vs. $\ell$. The exponent $\alpha \equiv \alpha(n)$ is estimated from the slope of the plot. To calculate the slope, we make a linear regression fit for the data in the range from $\ell_1$ to $\ell_2$. Thus, the local value of $\alpha(n)$ is a function of window size $w$ and fitting range $[\ell_1, \ell_2]$.
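The window scan of Steps 1 and 2 can be sketched as follows. The sketch only shows how window centres and subsequences are enumerated; the per-window estimator is passed in as a function, and the demo uses a trivial stand-in (pyrimidine fraction) rather than a real DFA exponent. All names are hypothetical, and the window is taken as exactly $w$ nucleotides, $[n - w/2, n + w/2)$, which is an assumption about the boundary convention.

```python
def window_centres(n_total, w, step=1):
    """Centres n = w/2, ..., N - w/2 for windows of size w (w assumed even)."""
    half = w // 2
    return range(half, n_total - half + 1, step)

def sliding_estimate(sequence, w, estimator, step=1):
    """Apply `estimator` to each window [n - w/2, n + w/2); return (n, value) pairs."""
    half = w // 2
    return [(n, estimator(sequence[n - half:n + half]))
            for n in window_centres(len(sequence), w, step)]

# Stand-in estimator for the demo only; a real scan would return the DFA
# exponent alpha fitted over [ell_1, ell_2] for each window.
def pyrimidine_fraction(window):
    return sum(base in "CT" for base in window) / len(window)

print(sliding_estimate("AAAACCCC", 4, pyrimidine_fraction))
# [(2, 0.0), (3, 0.25), (4, 0.5), (5, 0.75), (6, 1.0)]
```

A `step` larger than 1 trades positional resolution for speed, since neighbouring windows overlap almost entirely and give strongly correlated values.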

Step 3. Select an appropriate window size $w$ and fitting range $[\ell_1, \ell_2]$. The lower cutoff value $\ell_1$ is chosen such that $\alpha$ is not affected by the short-range (Markovian) correlations. Although we prefer to have very large $\ell_2$, we must take $\ell_2$ much smaller than $w$, because the error of estimation of $\alpha$ rapidly increases when $\ell$ approaches $w$. The ratio $w/\ell_2$ represents the number of statistically independent measurements by which the value $F_d(\ell)$ is obtained. The error of $\alpha$ is, therefore, inversely proportional to the square root of this ratio. Indeed, we have shown rigorously (Peng et al., 1993) that the SD $\sigma_\alpha$ of the value of $\alpha$ can be calculated by the formula

$$\sigma_\alpha = C \sqrt{\ell_2 / w}, \qquad (3)$$

where $C$ is a coefficient that is close to 0.1. Our selection criterion for $w$ and $\ell_2$ is that the SD or "error" $\sigma_\alpha$ must be much smaller than the difference of $\alpha$ values between coding and noncoding sequences, i.e., the signal-to-noise ratio must be as large as possible.

Our unpublished observations, based on sampling over a wide range of the phylogenetic spectrum, reveal that the average value of $\alpha$ for coding regions obtained by DFA for the fitting range $\ell_1 = 10$, $\ell_2 = 100$ is 0.51, whereas for noncoding regions it is 0.59. Therefore, we choose $w \geq 10\,\ell_2$, which from Eq. 3 gives $\sigma_\alpha \approx 0.03$, an error considerably smaller than the excursions in $\alpha$ between coding and noncoding regions, $0.59 - 0.51 = 0.08$.
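The error estimate can be checked numerically. The sketch below assumes that the error formula has the form $\sigma_\alpha = C\sqrt{\ell_2/w}$, which is what the surrounding text implies (error inversely proportional to the square root of $w/\ell_2$, with $C$ close to 0.1); the function name is hypothetical.

```python
import math

def alpha_error(w, ell2, c=0.1):
    """SD of alpha, assuming sigma = C * sqrt(ell2 / w) with C close to 0.1."""
    return c * math.sqrt(ell2 / w)

sigma = alpha_error(w=1000, ell2=100)   # smallest window allowed by w >= 10 * ell2
signal = 0.59 - 0.51                    # mean noncoding alpha minus mean coding alpha
print(round(sigma, 3), round(signal / sigma, 1))  # 0.032 2.5
```

With $w = 10\,\ell_2$ the coding/noncoding difference is about 2.5 times the estimated error, and the ratio grows as the square root of the window size.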


Furthermore, there is a trade-off in our choice of parameters: by increasing the window size and the fitting range, one increases the accuracy of the value of $\alpha$ but decreases the accuracy of locating this value along the sequence.

Step 4. Smooth out the resulting function $\alpha(n)$. The function $\alpha(n)$ is a rather irregular oscillatory function with many minima and maxima. Two factors contribute to this irregular spatial fluctuation: (i) alternating coding and noncoding regions have different exponent $\alpha$ (this is the "signal" that we want); and (ii) the error in estimating $\alpha$ from a finite subsequence (this is the "noise" that we do not want). Therefore, our goal is not to smooth arbitrarily but, rather, only to smooth in such a fashion as to minimize the effect of (ii). The two effects are distinguishable, because the fluctuations that are more likely caused by the noise are "high-frequency" compared with the fluctuations caused by alternation of coding and noncoding regions. For this reason, a simple low-pass filter (Press et al., 1991) is quite effective. Alternatively, we may simply average together $\alpha(n)$ for several nearby values of $n$. Our preliminary calculations show that both smoothing procedures give similar results.
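The second smoothing option, averaging $\alpha(n)$ over several nearby values of $n$, can be sketched as a centred moving average. The half-width `k` and the shrinking of the averaging window at the two ends are assumptions made for the illustration; the paper does not specify them.

```python
def moving_average(values, k):
    """Average each value with its neighbours within k positions on either side."""
    out = []
    for i in range(len(values)):
        lo, hi = max(0, i - k), min(len(values), i + k + 1)  # clip at the ends
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out

print(moving_average([0, 3, 0, 3, 0], 1))  # [1.5, 1.0, 2.0, 1.0, 1.5]
```

A larger `k` suppresses more of the high-frequency noise but also blurs short coding regions, which mirrors the trade-off already noted for the window size.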

Step 5. Compare the $\alpha(n)$ function with locations of known coding regions. The smoothed function $\alpha(n)$ usually has minima of about 0.5, which correspond to the local absence of power-law long-range correlations (see Fig. 1). Indeed, comparing the function $\alpha(n)$ for the sequence of yeast chromosome III (for which many of the coding regions are known), we can see that minima of $\alpha(n)$ correspond remarkably well to the positions of putative coding regions (identified genes or open reading frames), whereas intergenomic sequences usually correspond to the local maxima of $\alpha(n)$.

[Figure 1: four panels (a-d), yeast chromosome III (315,344 nt); vertical axis $\alpha$ (0.4 to 0.7), horizontal axis nucleotide position (30,000 to 60,000).]

FIGURE 1 (a) Analysis of a section of yeast chromosome III using the sliding box CSF algorithm. The value of the long-range correlation exponent $\alpha$ is shown as a function of position along the DNA chain. In this figure, the results for about 10% of the DNA are shown (from base pair 30,000 to base pair 60,000). Shown as vertical bars are the putative genes and open reading frames; denoted by the letter "G" are those genes that have been more firmly identified (March 1993 version of GenBank). Note that the local value of $\alpha$ typically displays minima where genes are suspected, whereas between the genes $\alpha$ displays maxima. This behavior corresponds to the fact that the DNA sequences of coding regions lack power-law long-range correlations ($\alpha = 0.5$ in the idealized limit), whereas the DNA sequences in between coding regions possess power-law long-range correlations ($\alpha \approx 0.6$). Parameter values: $w = 800$, $\ell_1 = 8$, $\ell_2 = 64$. (b) The solid curve is the same as in part a, whereas the dotted curve is the same analysis applied after 0.5% of the bp in the same sequence have been randomly mutated. (c) The solid curve is the same as in part a, whereas the dotted curve is the same analysis applied after 0.5% of the bp in the same sequence have been randomly shifted to a randomly chosen position. (d) The solid curve is the same as in part a, whereas the dotted curve is the same analysis applied after both the operations of parts b and c have been carried out.

Step 6. The above procedure (steps 1-5) outlines the basic CSF algorithm. Step 6 demonstrates how the CSF algorithm can be combined with other local criteria for more precise identification of coding sequences. For example, for yeast chromosome III, where the coding regions are typically uninterrupted by introns, one can incorporate information about open reading frames and stop codons. A reading frame is one of three possible ways of dividing sequences on each of two complementary DNA strands into subsequent codons. An open reading frame is a reading frame without the stop signals TAG, TGA, and TAA. Therefore, to predict the actual boundaries of coding regions from our calculated function $\alpha(n)$, we can carry out the following additional procedure:

(a) Find all local minima of $\alpha(n)$.

(b) Identify the six open reading frames (three on each of the complementary DNA strands).

(c) Define the longest open reading frame corresponding to a minimum value of $\alpha$. Long open reading frames without power-law long-range correlations are very likely to be actual coding regions.
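Sub-steps (a) to (c) can be sketched as below. This is a simplified illustration, not the authors' code: an open reading frame is taken to be any maximal run of stop-free codons (no start codon is required), "corresponding to a minimum" is interpreted as "containing the position of the minimum", and all function names are hypothetical.

```python
STOPS = {"TAG", "TGA", "TAA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def local_minima(alpha):
    """Indices i where alpha[i] is strictly below both neighbours."""
    return [i for i in range(1, len(alpha) - 1)
            if alpha[i] < alpha[i - 1] and alpha[i] < alpha[i + 1]]

def stop_free_runs(strand, frame):
    """(start, end) spans of consecutive stop-free codons in one reading frame."""
    runs, start = [], frame
    for pos in range(frame, len(strand) - 2, 3):
        if strand[pos:pos + 3] in STOPS:
            if pos > start:
                runs.append((start, pos))
            start = pos + 3                      # next run begins after the stop
    end = frame + 3 * ((len(strand) - frame) // 3)  # end of last complete codon
    if end > start:
        runs.append((start, end))
    return runs

def longest_orf_covering(sequence, position):
    """Longest stop-free run, over all six frames, that contains `position`."""
    n, best = len(sequence), None
    reverse = sequence.translate(COMPLEMENT)[::-1]  # reverse-complement strand
    for frame in range(3):
        spans = stop_free_runs(sequence, frame)
        # Map reverse-strand spans back to forward-strand coordinates.
        spans += [(n - e, n - s) for s, e in stop_free_runs(reverse, frame)]
        for s, e in spans:
            if s <= position < e and (best is None or e - s > best[1] - best[0]):
                best = (s, e)
    return best

print(local_minima([0.6, 0.5, 0.6, 0.55, 0.7]))   # [1, 3]
print(stop_free_runs("ATGAAATAGCCC", 0))          # [(0, 6), (9, 12)]
```

In a full pipeline, each index returned by `local_minima` would be converted to a nucleotide position and passed to `longest_orf_covering` to propose the boundaries of a candidate coding region.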

EVALUATION OF THE CSF ALGORITHM

Test for yeast chromosome III

To characterize quantitatively the goodness of our algorithm, we consider the relative positions of local minima, maxima, and the boundaries of coding regions.

For example, the outcome of the CSF algorithm (steps 1-5) for the test case of yeast chromosome III, using the parameter choices $w = 800$, $\ell_1 = 16$, $\ell_2 = 64$, can be characterized by the following table:

* Total number of putative coding regions known from work of others (Oliver et al., 1992): 218.

* Fraction of the 315,344 bp belonging to putative coding regions: $p = 0.66$.

* Number of minima in $\alpha(n)$: 176.

* Number of such minima belonging to putative coding regions (true positives): 138.

* Number of false positives: 38.

Thus, of 176 minima, all but 38 correspond to putative coding regions. A key statistical test of the CSF algorithm is to demonstrate that the apparently striking agreement between the putative coding regions and the dips in $\alpha(n)$ is not simply a result of random coincidence. Therefore, we assume the contrary, i.e., that the dips are occurring at random. Then, because there are 176 minima in our $\alpha(n)$ plot, $176 \times p = 176 \times 0.66 = 116$ of the minima should lie inside putative coding regions, and $176 \times (1 - p) = 176 \times 0.34 = 60$ of the minima should lie outside putative coding regions. The SD for the above estimation (assuming that these 176 minima are occurring at random) is given by the formula $\sigma = \sqrt{176 \times p \times (1 - p)}$. Hence, in the present case, we would expect $\sigma = \sqrt{176 \times 0.66 \times 0.34} = 6.3$. The actual number of false positives is 38, three SD smaller than the expected value 60. The probability of obtaining this result if the minima did not correspond to the coding regions, therefore, is the chance of finding a signal 3 SD from the expected value, or 0.0014.
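The arithmetic of this significance test can be reproduced directly. The sketch below uses only the numbers quoted in the text and a one-sided Gaussian tail as the approximation to the binomial; the variable and function names are hypothetical.

```python
import math

n_minima, p, false_positives = 176, 0.66, 38

expected_outside = n_minima * (1 - p)          # minima expected outside coding regions
sd = math.sqrt(n_minima * p * (1 - p))         # binomial standard deviation
z = (expected_outside - false_positives) / sd  # deviation in units of SD

def upper_tail(z_score):
    """One-sided Gaussian tail probability P(Z > z_score)."""
    return 0.5 * math.erfc(z_score / math.sqrt(2))

print(round(expected_outside), round(sd, 1), round(z, 1))  # 60 6.3 3.5
print(upper_tail(3.0))
```

A deviation of exactly three SD corresponds to a one-sided tail probability of roughly 0.0013 to 0.0014, in line with the quoted value. The unrounded deviation computed here is closer to 3.5 SD, so the quoted probability is, if anything, conservative.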

The combination of the CSF algorithm (based on global criteria of power-law long-range correlations) with local criteria (stop codons; see step 6 above) is very successful in identifying the precise boundaries of long coding regions. It enables us to identify correctly in yeast chromosome III 100% of the putative coding regions with more than 3000 nucleotides, 92% of coding regions between 2000 and 3000 nucleotides long, and 79% of coding regions between 1000 and 2000 nucleotides long.

The "false positives" identified by the CSF algorithm might actually indicate the presence of former coding material, such as pseudogenes, jumping genes, and retroviral inserts. For example, for yeast chromosome III, we found a clear minimum in $\alpha(n)$ near the position $n = 149200$, a region that is known to contain primarily noncoding sequences. We submitted the sequence from nucleotide position 149120 to nucleotide position 149401 to the experimental GENINFO BLAST (Altschul et al., 1990) network at the National Center for Biotechnology Information, which indicated a remarkably high similarity score of the submitted sequence to the jumping gene known as retroelement Ty4-476.

We measured "false negatives" by focusing on a subset of all open reading frames, those of more than 1000 bp. We define false negatives to be the absence of an unambiguous pronounced minimum in $\alpha(n)$. We find that the CSF algorithm fails to locate only 12% (3/25) of the known genes, but fails to locate 27% (16/60) of the open reading frames that are not known to be genes. Thus, the CSF algorithm is much more successful on the known genes than on putative genes.

Robustness of the CSF algorithm with respect to sequencing errors

We performed three tests designed to test the robustness of the CSF algorithm with respect to sequencing errors, by intentionally "mutating" a small fraction of the bp: 0.5%. The CSF algorithm would be most useful if it could still identify most of the coding regions. Fig. 1 b shows the same region as shown in Fig. 1 a. The solid curve is identical to the solid curve in Fig. 1 a, namely, the CSF algorithm result for the long-range correlation exponent $\alpha$ as a function of position in yeast chromosome III. The dotted curve is the same analysis applied to an "artificially mutated" yeast chromosome III in which 0.5% of the bp were randomly substituted by a different bp. There is almost no difference between the two curves.

Another test is to randomly shift the reading frame, as shown in Fig. 1 c. The solid curve is the same as in Fig. 1 a. The dotted curve is the same analysis applied to a mutated chromosome III, in which 0.5% of the bp were randomly cut out from one place and inserted in another randomly selected locus of the chromosome in such a way as to shift the reading frame. There is almost no change.

Finally, we show in Fig. 1 d what happens when both operations of Fig. 1, b and c, have been carried out, so that in fact twice as many (1%) of the bp are mutated in total.
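The two perturbations used in these robustness tests can be sketched as follows. The exact randomization scheme is not given in the text, so the sampling details, the fixed seeds, and the function names are assumptions made for illustration.

```python
import random

def substitute_random_bases(sequence, fraction=0.005, seed=0):
    """Replace a random `fraction` of positions with a different nucleotide."""
    rng = random.Random(seed)
    bases = list(sequence)
    n_mut = round(fraction * len(bases))
    for i in rng.sample(range(len(bases)), n_mut):   # distinct positions
        bases[i] = rng.choice([b for b in "ACGT" if b != bases[i]])
    return "".join(bases)

def relocate_random_bases(sequence, fraction=0.005, seed=0):
    """Cut a random `fraction` of bases out and reinsert each at a random locus."""
    rng = random.Random(seed)
    bases = list(sequence)
    for _ in range(round(fraction * len(bases))):
        base = bases.pop(rng.randrange(len(bases)))         # deletion shifts the frame
        bases.insert(rng.randrange(len(bases) + 1), base)   # insertion shifts it back
    return "".join(bases)

original = "ACGT" * 500
mutated = substitute_random_bases(original)
shifted = relocate_random_bases(original)
print(sum(a != b for a, b in zip(original, mutated)))  # 10
```

Applying both functions in turn gives the combined test of Fig. 1 d, and rerunning the exponent scan on the perturbed sequence allows the solid and dotted curves to be compared.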

Parameter optimization

We note that the accuracy of the algorithm depends importantly on the window size parameter. Fig. 2 shows the dependence on window size of the fraction of minima that occur in coding regions for yeast chromosome III. Reducing the window size increases the number of true positives (the sensitivity of the algorithm) but increases the number of "false positives" and "false negatives." This is the reason the algorithm in its present form is challenged for finding genes in mammalian sequences, which are highly fragmented by introns. Although the average size of an exon in mammalian DNA is only about 186 bp (Watson et al., 1992) (close to the lower threshold of applicability of our algorithm), there are many mammalian exons larger than the average that our method should detect readily.

[Figure 2: sensitivity (vertical axis, 0.65 to 0.85) versus window size (horizontal axis, 0 to 3000).]

FIGURE 2 Dependence of sensitivity of CSF algorithm on window size, $w$, for yeast chromosome III. Sensitivity here is defined as the percentage of the minima of $\alpha$ that lie within putative coding regions (see Fig. 1). Window size is defined in the text (Step 1 of the algorithm). The solid circles show the results for yeast chromosome III. For window sizes less than 600, the fitting range was chosen to be $\ell_1 = 8$, $\ell_2 = 32$. For window sizes larger than or equal to 600, we chose $\ell_2 = 64$. The vertical arrow indicates the optimal nucleotide window size. The dashed line is the expected sensitivity of the algorithm based on random coincidence (see text).

We have also studied how the fluctuations of the DNA landscapes created by other rules of mapping can be used for detecting coding regions, as well as various two-dimensional DNA walks (Berthelsen et al., 1992). In the generalized definition of a one-dimensional DNA walk, one can assign different values $S_A$, $S_C$, $S_G$, or $S_T$ to an increment of the $i$th step $u(i)$ depending on which nucleotide A, C, G, or T occurs on the $i$th place. For example, we studied correlations of one nucleotide with itself; in this case, one can assign $u(i) = +1$ if nucleotide A occurs on the $i$th place and $u(i) = -1$ otherwise (in case of C, G, or T). Similarly, we studied correlations of pairs of nucleotides, such as the purine-pyrimidine rule we used above. Except for the definition of $u(i)$, the rest of the analysis remains the same as for the original purine-pyrimidine rule. Our calculations show that the original binary purine-pyrimidine rule (Peng et al., 1992) is the most robust one for detecting coding regions.
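The generalized mapping amounts to swapping the table of increments while leaving the rest of the analysis untouched. A minimal sketch, with hypothetical names for the function and the two rule tables:

```python
from itertools import accumulate

def generalized_walk(sequence, increments):
    """DNA walk whose i-th step is increments[base], e.g. the values S_A, ..., S_T."""
    return list(accumulate(increments[base] for base in sequence.upper()))

# Two of the rules mentioned in the text.
PURINE_PYRIMIDINE = {"C": +1, "T": +1, "A": -1, "G": -1}
A_VERSUS_REST = {"A": +1, "C": -1, "G": -1, "T": -1}

print(generalized_walk("ACAT", A_VERSUS_REST))      # [1, 0, 1, 0]
print(generalized_walk("ACAT", PURINE_PYRIMIDINE))  # [-1, 0, -1, 0]
```

Because only the increment table changes, the same fluctuation analysis and window scan can be rerun for each rule and the resulting sensitivities compared.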

Test of the CSF algorithm on other long genomic sequences

We also applied the CSF algorithm (steps 1-5) to four additional long genomic sequences and observed comparable predictability as for yeast chromosome III. The sequences were: liverwort Marchantia polymorpha chloroplast genome (GenBank name: CHMPXX, 121,024 bp; 59% coding regions, CSF sensitivity = 74% with window size $w = 1000$); tobacco chloroplast genome DNA (CHNTXX, 155,844 bp; 53% coding regions, CSF sensitivity = 72% with window size $w = 1200$); rice complete chloroplast genome (CHOSXX, 134,525 bp; 50% coding regions, CSF sensitivity = 68% with window size $w = 2200$); and Epstein-Barr virus (EBV genome, 172,281 bp; 71% coding, CSF sensitivity = 90% with window size $w = 1400$).

Test of the CSF algorithm on control sequences

Not all "coding regions" for yeast chromosome III and the other genomic sequences we tested are confirmed (in fact, they are termed "putative coding regions" (Oliver et al., 1992)). To obtain additional evidence about the reliability of the CSF algorithm, we analyzed "control sequences" that contain only firmly identified coding and noncoding regions. To this end, we have selected from GenBank 40 known coding sequences (including exons and cDNA sequences) and 39 known noncoding sequences (including introns and intergenomic sequences). These samples (total length 80,000 bp) represent a wide phylogenetic spectrum (including sequences from human, chicken, tobacco, bacterial, and viral DNA). The selection criteria for these sequences are: (i) they are all of length greater than 500 bp, and (ii) the percentage of coding and noncoding material approximates that of yeast chromosome III.

Next we "assembled" an artificial nucleotide sequence ("Type I controls") by randomly splicing together coding and noncoding sequences (in an alternating fashion) from the two sample pools. We then applied the CSF algorithm to this control sequence and computed the number of minima inside and outside the known coding regions. We found that for window size $w = 800$, almost 90% of the minima coincided with coding regions. The percentage of correct identifications decreased to 60% with increases or decreases in $w$, which is comparable with the results obtained for the actual yeast chromosome III sequence (Fig. 3).

[Figure 3: sensitivity (vertical axis, 0.60 to 1.00) versus window size (horizontal axis, 0 to 1500).]

FIGURE 3 Dependence of sensitivity of CSF algorithm on window size, $w$, for control sequences. Sensitivity is defined in the Fig. 2 caption. The solid circles show the results for a Type I control sequence, and the open circles show the averaged results for three Type II control sequences (see text). The error bars show the SD. The horizontal arrow on the left indicates the percentage of coding regions in the Type I control sequence (66%). We started our analysis for window size 100 and increased the window size up to 1200. For larger window sizes, the total number of minima decreases (down to 15), thus the statistical error increases. For small window sizes (~100), the signal is very noisy, so that the detection rate is about the value expected for a random signal, i.e., 66%. With increasing window sizes, the sensitivity of the CSF algorithm increases. The maximum sensitivity of the CSF algorithm for detecting coding regions (90%) for both classes of control sequences is obtained with window size 800.

This test confirms that for coding and noncoding sequences of length larger than 500 bp, the CSF algorithm is highly accurate. It also illustrates another generic feature of the CSF algorithm, i.e., that it can, in principle, be applied to DNA sequences of very different organisms because the underlying mechanism for detecting coding sequences is the same.

Finally, we tested the CSF algorithm on a second type of control sequences ("Type II controls") constructed as follows: for the artificial chromosome sequence described above, we replaced each coding part by an uncorrelated computer-generated sequence of random letters A, C, G, and T. We also replaced each noncoding sequence by a computer-generated sequence of letters with "built-in" power-law long-range correlations having correlation parameter $\alpha$ of 0.62, using the method in Peng et al. (1993). We then calculated the percentage of correct positives for several independent realizations of such a sequence, and we computed the SE of this value. The result shows 90% sensitivity for $w = 800$.

When we compare Fig. 3 with Fig. 2, we find that our highest sensitivity for the artificial chromosome sequence is 90%, whereas for the yeast chromosome III sequence, we achieved a sensitivity of 82% with optimum parameter selection. It is not surprising that the results for the artificial chromosome sequence are somewhat better: (i) we eliminated the problem of the putative coding regions, and (ii) we only considered coding and noncoding regions larger than 500 nucleotides.

We note that the value of the exponent $\alpha$ measured over [...] previously described quantities such as the length distribution of tandem nucleotide repeats (Uberbacher and Mural, 1991). This is not surprising because the tandem repeats of length more than 10 may contribute to the value of $\alpha$ calculated for these small fitting ranges. However, tandem repeats by themselves do not fully account for the power-law long-range correlation we observed. Furthermore, because power-law long-range correlations with $\alpha > 0.5$ generate a type of persistence (one nucleotide is more likely to be followed by another of the same class), tandem repeats are more likely to be found in correlated rather than uncorrelated sequences.

DISCUSSION AND SUMMARY

The results of the CSF analysis presented in this study are of interest for two primary reasons:

First, these results provide the most compelling evidence to date confirming the claim that noncoding sequences typically possess long-range power-law correlations whereas coding sequences do not. The initial report (Peng et al., 1992) describing long-range (scale-invariant) correlations only in noncoding DNA sequences generated contradictory responses (Karlin and Brendel, 1993; Li and Kaneko, 1992; Munson et al., 1992; Grosberg et al., 1993; Nee, 1992; Chatzidimitriou-Dreismann and Larhammar, 1993; Voss, 1992; Prabhu and Claverie, 1992). Although some reports supported this finding (Li and Kaneko, 1992; Munson et al., 1992; Grosberg et al., 1993), it has also been challenged on two fronts: (i) by those claiming that no DNA sequences possess power-law long-range correlations (Karlin and Brendel, 1993; Nee, 1992; Chatzidimitriou-Dreismann and Larhammar, 1993; Prabhu and Claverie, 1992), and (ii) by those claiming that introns and exons both contain power-law long-range correlations (Voss, 1992). The data presented above and graphically displayed in Figs. 1-3 unambiguously confirm that there is a systematic correspondence between lower values of the scaling exponent $\alpha$ and coding sequences, and between higher values of $\alpha$ and noncoding sequences. Furthermore, these results apply in a statistically significant way both to the entire yeast chromosome III and other long genomic sequences as well as to control sequences constructed by alternating known coding and noncoding sequences of variable lengths. These findings, along with a recent re-analysis of the patchiness of DNA sequences (Peng et al., 1994), disprove the contention of Karlin and Brendel (1993) that power-law long-range correlations are simply an artifact of the heterogeneous (mosaic) structure of DNA. Furthermore, the results of the CSF analysis contradict Voss' (1992) report that power-law long-range correlations

are found in both coding and noncoding sequences.

We also note the recent study by Prabhu and Claverie (1992) claiming that their analysis of the putative coding regions of the yeast chromosome III produced a wide range of exponent values, some larger than 0.5. Thus, they too failed to find a statistical difference, based on the correlation exponent, between coding and noncoding regions. In contrast, the CSF analysis does demonstrate statistically significant agreement between dips in $\alpha(n)$ and the presence of putative coding regions for yeast chromosome III. This apparent discrepancy results from the fact that Prabhu and Claverie (1992) as well as Karlin and Brendel (1993) and Voss (1992) did not account for the presence of long regions of strand bias in coding sequences. As recently reported (Peng et al., 1994), the detrended fluctuation analysis (DFA) method (used in the CSF algorithm) can successfully distinguish true power-law long-range correlations (e.g., those in noncoding sequences) from spurious correlations due to DNA "patchiness."

The fact that the value of $\alpha$ for coding regions is close to that of random uncorrelated sequences might be relevant to the theory of protein folding and is consistent with the recent claim of Shakhnovich and Gutin (1990) that the amino acid sequences of biologically active proteins are also statistically similar to uncorrelated random sequences. In contrast, the correlated properties of noncoding sequences might be related to evolutionary mechanisms involving nucleotide deletion and insertion (Buldyrev et al., 1993a, b).

Second, we show how the new algorithm based on these biologic differences in correlation properties can be used to screen long DNA sequences to identify coding and noncoding regions. The CSF algorithm is able to detect relatively long coding regions with a high degree of reliability. Its advantages include speed, simplicity of use, and operator independence. Furthermore, because it is based primarily on global statistical measurements, it is not affected by the particular species examined or by sequencing errors. Its major limitations relate to the requirement for a relatively large window (>800 bp) and the inability to precisely locate intron/exon boundaries. Given these limitations, the optimal application of the CSF algorithm might be to rapidly scan large genomic sequences, to identify any potential coding sites, and then to apply standard coding-sequence finding tools to a detailed analysis of these selected areas. Indeed, identification of even a single putative exon would imply the nearby location of a gene that can then be searched for with conventional techniques.
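
The scanning strategy described above can be sketched as follows, assuming that a local exponent has already been measured in windows placed at regular intervals along the sequence. Windows whose exponent dips below a cutoff are flagged, and flagged windows that overlap or touch are merged into candidate regions. The function name candidate_regions, the step size, and the cutoff of 0.55 are hypothetical choices made for illustration; the 800-bp default only reflects the lower bound on the window size quoted in the text.

```python
def candidate_regions(alphas, step, window=800, cutoff=0.55):
    """Merge windows with a low local exponent into candidate coding regions.

    alphas[i] is the exponent measured in the window that starts at
    nucleotide i * step and spans `window` nucleotides.
    Returns a list of (start, end) intervals, with end exclusive.
    The cutoff value is an illustrative assumption, not a published threshold.
    """
    regions = []
    for i, alpha in enumerate(alphas):
        if alpha >= cutoff:
            continue  # no dip in this window
        start, end = i * step, i * step + window
        if regions and start <= regions[-1][1]:
            regions[-1][1] = end           # overlaps or touches the last region
        else:
            regions.append([start, end])   # open a new candidate region
    return [(s, e) for s, e in regions]
```

The intervals returned in this way are coarse by construction, since each one is at least a full window long. They are meant as a shortlist for a more detailed local analysis rather than as estimates of intron/exon boundaries.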

The CSF algorithm is particularly attractive because it can be applied to sequences from organisms across the phylogenetic spectrum. Furthermore, because it is based on a global statistical measurement, it is not affected by local mutation or lab sequencing errors. On the other hand, its global statistical nature, as emphasized above, limits its ability to locate precisely the boundaries of coding and noncoding regions. Therefore, in its present form, CSF can be used in concert with other algorithms (see "Step 6" in our procedure) that apply local property measurements.
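
As one hedged example of a local measurement that could follow a global scan, the sketch below searches a candidate interval for its longest open reading frame, the conventional criterion mentioned in the Introduction. It is not a description of "Step 6" itself. The function name longest_orf, the restriction to the forward strand, and the requirement of an ATG start codon are assumptions made for brevity.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(seq, start, end):
    """Longest ATG...stop open reading frame lying inside seq[start:end].

    Only the three forward-strand frames are scanned (an assumption of this
    sketch). Returns (orf_start, orf_end) in coordinates of the full
    sequence, end exclusive and including the stop codon, or None.
    """
    region = seq[start:end].upper()
    best = None
    for frame in range(3):
        orf_start = None
        for i in range(frame, len(region) - 2, 3):
            codon = region[i:i + 3]
            if orf_start is None and codon == "ATG":
                orf_start = i                      # first start codon in this run
            elif orf_start is not None and codon in STOP_CODONS:
                if best is None or (i + 3 - orf_start) > (best[1] - best[0]):
                    best = (orf_start, i + 3)
                orf_start = None                   # look for the next ORF
    if best is None:
        return None
    return (start + best[0], start + best[1])
```

In practice the search interval would be taken somewhat wider than the flagged region, because the global scan does not resolve region boundaries precisely.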

We wish to thank C. R. Cantor, C. DeLisi, and T. M. Sanders for valuable discussions, and an anonymous referee for having suggested the calculations that led to Fig. 1 b-d.

Partial support was provided to C.-K. Peng by the National Institutes of Health/National Institute of Mental Health, to A. L. Goldberger by the G. Harold and Leila Y. Mathers Charitable Foundation, the National Heart, Lung and Blood Institute, and the National Aeronautics and Space Administration, to M. Simons by the American Heart Association, and to S. V. Buldyrev, S. Havlin, R. N. Mantegna, S. M. Ossadnik, and H. E. Stanley by the National Science Foundation and Office of Naval Research.

REFERENCES

Altschul, S. F., W. Gish, W. Miller, E. W. Myers, and D. J. Lipman. 1990. Basic local alignment search tool. J. Mol. Biol. 215:403-410.

Berthelsen, C. L., J. A. Glazier, and M. H. Skolnick. 1992. Global fractal dimension of human DNA sequences treated as pseudorandom walks. Phys. Rev. A. 45:8902-8913.

Buldyrev, S. V., A. L. Goldberger, S. Havlin, C.-K. Peng, H. E. Stanley, M. H. R. Stanley, and M. Simons. 1993a. Fractal landscapes and molecular evolution: modeling the myosin heavy chain gene family. Biophys. J. 65:2673-2679.

Buldyrev, S. V., A. L. Goldberger, S. Havlin, C.-K. Peng, M. Simons, and H. E. Stanley. 1993b. Generalized Levy walk model for DNA nucleotide sequences. Phys. Rev. E. 47:4514-4523.

Chatzidimitriou-Dreismann, C. A., and D. Larhammar. 1993. Long-range correlations in DNA. Nature. 361:212-213.

Grosberg, A. Yu., Y. Rabin, S. Havlin, and A. Nir. 1993. Self-similarity in DNA structure. Europhys. Lett. 23:373-378.

Karlin, S., and V. Brendel. 1993. Patchiness and correlations in DNA sequences. Science. 259:677-680.

Li, W., and K. Kaneko. 1992. Long-range correlations and partial 1/f^α spectrum in a noncoding DNA sequence. Europhys. Lett. 17:655-658.

Montroll, E. W., and M. F. Shlesinger. 1984. The wonderful world of random walks. In Nonequilibrium Phenomena II: From Stochastics To Hydrodynamics. J. L. Lebowitz and E. W. Montroll, editors. North-Holland, Amsterdam. 1-121.

Munson, P. J., R. C. Taylor, and G. S. Michaels. 1992. Long range DNA correlations extend over entire chromosome. Nature. 360:636-636.

Nee, S. 1992. Uncorrelated DNA walks. Nature. 357:450-450.

Oliver, S. G., et al. 1992. The complete DNA sequence of yeast chromosome III. Nature. 357:38-46.

Peng, C.-K., S. V. Buldyrev, A. L. Goldberger, S. Havlin, F. Sciortino, M. Simons, and H. E. Stanley. 1992. Long-range correlations in nucleotide sequences. Nature. 356:168-170.

Peng, C.-K., S. V. Buldyrev, A. L. Goldberger, S. Havlin, M. Simons, and H. E. Stanley. 1993. Finite size effects on long-range correlations: implications for analyzing DNA sequences. Phys. Rev. E. 47:3730-3733.

Peng, C.-K., S. V. Buldyrev, S. Havlin, M. Simons, H. E. Stanley, and A. L. Goldberger. 1994. On the mosaic organization of DNA sequences. Phys. Rev. E. 49:1685-1689.

Prabhu, V. V., and J.-M. Claverie. 1992. Correlations in intronless DNA. Nature. 359:782-782.

Press, W. H., B. P. Flannery, S. A. Teukolsky, and W. T. Vetterling. 1991. Numerical Recipes in C: The Art of Scientific Computing. Cambridge University Press, Cambridge.

Shakhnovich, E. I., and A. M. Gutin. 1990. Implication of thermodynamics of protein folding for evolution of primary sequences. Nature. 346:773-775.

Uberbacher, E. C., and R. J. Mural. 1991. Locating protein-coding regions in human DNA sequences by a multiple sensor-neural network approach. Proc. Natl. Acad. Sci. USA. 88:11261-11265.

Voss, R. 1992. Evolution of long-range fractal correlations and 1/f noise in DNA base sequences. Phys. Rev. Lett. 68:3805-3808.

Watson, J. D., M. Gilman, J. Witkowski, and M. Zoller. 1992. Recombinant DNA. Scientific American Books, New York.