The Analysis and Identification of Protein-coding Sequences for Yeast Using a Free Energy Model


ABSTRACT

CHUANHUA XING. The Analysis and Identification of Protein-coding Sequences for Yeast Using a Free Energy Model. (Under direction of Dr. Donald L. Bitzer and Dr. Winser E. Alexander.)


The Analysis and Identification of Protein-coding Sequences

for Yeast Using a Free Energy Model

by

Chuanhua Xing

A dissertation submitted to the Graduate Faculty of North Carolina State University

in partial fulfillment of the requirements for the Degree of

Doctor of Philosophy

Electrical and Computer Engineering

Raleigh, North Carolina May 2007

APPROVED BY

Dr. Wesley E. Snyder, Member of Advisory Committee

Dr. Steffen Heber, Member of Advisory Committee

Dr. Mladen A. Vouk, Member of Advisory Committee

Dr. Anne-Marie Stomp, Member of Advisory Committee


DEDICATION

To my husband Yong Wang,

to my parents Houhai Xing and Liqin Guan,

to my sisters Chuanli Xing and Chuanbo Xing,

and to my brother Chuanlong Xing and his family,

Yan Wang, Jialin Xing, and Jiajia Xing.


BIOGRAPHY

Chuanhua Xing was born in Heilongjiang, China, on October 4, 1976. Early in her life, her family inspired her interest in science and encouraged her perseverance. She developed a strong interest in learning, especially in mathematics, natural science, and literature. She received numerous awards for ranking in the top 1% of students.

She obtained her Bachelor’s degree in Electrical Engineering at Heilongjiang University in July 1998. She served as the Head of her class, the President of the Student Association for the department, and the President of the Association of Science and Technology during her undergraduate study. She participated in many activities and clubs, including sports, the fashion model team, and literature competitions.

She received many awards including the outstanding student award, the outstanding

leadership award, the excellent literature award, and others.

After graduation, she joined China Telecom at Heilongjiang in 1998. She worked in the core technical department, where she participated in the development of innovative technologies and macro-projects for information collection and distribution and for communication system construction. She served as the technical supervisor in 1999, supervising technical progress and providing technical consultation.

Chuanhua enrolled at NC State University, USA, in August 2001 for her graduate study. She received the Master of Science degree in Electrical Engineering from North Carolina State University in December 2002. She has worked on the analysis of protein-coding genes using a free energy model since Fall 2003. Her paper “Free Energy Based Analysis of the Coding Region of Saccharomyces cerevisiae” was awarded Second Prize in the student paper contest at “The Technology4Life on Biotech & Bioinfo Conference, IEEE”, Oct. 2004. She and her collaborators have developed effective computational methods that analyze and identify protein-coding sequences with high performance by utilizing biological mechanisms, instead of analysis based on the DNA sequences alone.

She became a member of IEEE in 2001 and of the IEEE EMB Society in 2006. She has served as a member of the administrative committee of the NCSU Chemistry-Biology Interface / RNA Chemistry-Biology Group at NC State since 2005. She volunteered as the web master for the High Performance DSP Laboratory (HiPerDSP lab) from 2003 to 2006. She also has interests in many sports and has benefited from experiences such as the Intramural Flag Football Team and the Intramural Flag Softball Team of the ECE Graduate Students’ Association, NC State University, in 2005 and 2006.


ACKNOWLEDGEMENTS

This dissertation would not have been possible without the help, support, and guidance of many wonderful people, including my Advisory Committee, my friends, and my family.

I would like to express my deepest and most sincere gratitude and appreciation to my advisors, Dr. Winser E. Alexander and Dr. Donald L. Bitzer, for their guidance, encouragement, and support throughout my Ph.D. study. I am indebted to Dr. Alexander, who advised and influenced me as an excellent educator. I deeply appreciate his encouragement and guidance through the difficulties; he has guided me to become an independent and successful researcher. I am also indebted to Dr. Bitzer, who is the beacon for my research. His active thought, his spirit, and his attitude toward research will always be valuable treasures to guide my life. Working with Dr. Bitzer on a topic I am passionate about has been an unforgettable experience. I would also like to thank the other committee members, Dr. Mladen A. Vouk, Dr. Anne-Marie Stomp, Dr. Steffen Heber, and Dr. Wesley E. Snyder, for their many beneficial suggestions and consistent support through the years.

I would like to convey my gratitude to all my colleagues, officemates, and friends,

especially, Cranos Williams, Senanu Ocloo, Josh Starmer, Lalit Ponnala, Ramsey

Hourani, Gary Charles, Youngsoo Kim, Jeff Ligon, Ruben Lobo, Evan Ernst, Robert

Snyder, Scott Vu, Sandeep Hattangady, Maitrik Diwan, Treshauna Wright, Li Li, and

Ying Zhu.

Finally, I would like to thank my parents for their unfailing support. I will always appreciate my husband, who walked with me through these years. I thank him for his love and his solid support.


List of Tables

4.1 Datasets for S. cerevisiae and S. pombe.

4.2 The results for selecting the training set size.

4.3 Accuracy of the synchronization based coding-region identification algorithm on different coding/non-coding subsets. Se. is the abbreviation of sensitivity, and Sp. is the abbreviation of specificity. The rows with “*” are the results that are from Gao et al. [53].

4.4 Comparison of several measures used DFT from Kotlar et al., 2003 [83].

5.1 The site statistical results for the identification of the branch sites, where BE in the first row refers to binding energy.

5.2 The site statistical results for the identification of the 5’ splice sites.

5.3 The site statistical results for the identification of the branch sites, where

List of Figures

1.1 RNA processing and RNA splicing during transcription [158]
1.2 Two steps of splicing mechanism [158]
1.3 Spliceosome cycle [158]
1.4 U1 and 5’ splice site base pairing [158]
1.5 U2 and the branch site base pairing [158]
1.6 U6 and 5’ splice site base pairing [158]
1.7 Intron-bridging interactions [158]
1.8 Illustration of 3 frames in one direction.

2.1 Illustration of free energy calculation between a short sequence and a long sequence.

3.1 The binding energy around the start codon for yeast chromosome II.
3.2 Downstream binding energy for yeast chromosome II.
3.3 Power spectra of coding region for yeast chromosome II.
3.4 Power spectra of non-coding region for yeast chromosome II.
3.5 Power spectrum density for a protein-coding gene with the dominant one-third frequency component
3.6 Power spectrum density for a protein-coding gene with the weak one-third frequency component
3.7 Power spectrum density for an intron with the non-dominant one-third frequency component
3.8 The relation of SNR distribution with the length of exons
3.9 The relation of SNR distribution with the length of introns
3.10 The polar plot for a protein-coding gene
3.11 The polar plot for an intron

4.2 The plot of the angle vs. position for a representative protein-coding gene.
4.3 The histogram of the terminal angles for 2000 protein-coding sequences from the training set, where the outliers were eliminated.
4.4 The histogram of the terminal angles for 2000 non-coding sequences from the training set.
4.5 Analysis of angle variations.
4.6 Illustration of three definitions for angle variation.
4.7 The histogram of the positions for a given angle variation over 2000 randomly selected protein-coding sequences from dataset sc-Coding.
4.8 The maximum tolerable positions vs. angle variations for three definitions
4.9 The plot of the amplitude vs. position for a representative protein-coding gene
4.10 Two groups of amplitude rate plots for the protein-coding sequences and the non-coding sequences.
4.11 Sensitivity for species S. cerevisiae over the different length intervals of sequences.
4.12 Specificity for species S. cerevisiae over the different length intervals of sequences.
4.13 Sensitivity for species S. pombe for the different length range of sequences.
4.14 Specificity for species S. pombe for the different length range of sequences.
4.15 The histogram of the terminal angles for 215 ORFs
4.16 The histogram of the terminal angles for 215 ORFs including introns

5.1 Binding energies for an example gene using the mask 3’-ATGATTG-5’.
5.2 The polar plots for three sequences that decided their 3’ splice sites by


Chapter 1

Introduction

Biological organisms are highly organized, intricately regulated molecular assemblages. As such, they are information-rich systems. From an engineering perspective, this means that they should provide information-encoded signals. Therefore, it is reasonable to ask whether signal processing analysis can be used to detect, extract, and decode the information of biological systems. An assumption as to the nature of the encoded signal is required to begin investigating the utility of the signal processing approach.

The concept of free energy can be used to describe the interactions of molecules.

If an interaction is favored, then by convention it is accompanied by a change in free energy from a higher (more positive) value in the starting state to a lower (more negative) value after the interaction has occurred (the final

state). If a biological process consists of molecular interactions along a time or space

(position) continuum, a variable free energy pattern could be produced in which the

Extracting the information in these signals should be possible using signal processing

techniques. I used signal processing approaches in my dissertation to develop new

tools to identify DNA sequences that encode protein coding information based on

these ideas.

At the molecular level, biological processes are assemblies of many chemical reactions. These reactions are controlled by enzymes, which are catalytic proteins. Proteins are essential parts of all living organisms and participate in every process within cells. In addition to enzymes, there are other proteins that have structural or mechanical functions. The blueprints for all these proteins are encoded in DNA sequences called genes. The diversity of biological organisms and of their responses to the environment results from the exact library of genes present in an individual (genotype) and the regulation of gene utilization in response to environmental stimuli (gene expression). Two fundamental problems of biological research are 1) to identify the genes and 2) to determine what regulates gene expression, in order to understand the information content in molecular processes. The focus of my thesis research is the utilization of free energy signals and signal processing analysis to identify protein-coding regions in genes and splice sites in pre-mRNA molecules.

In this chapter, the basics of DNA and its relationship to the production of proteins from protein-coding genes are outlined. We then review current methods for the identification of protein-coding regions and their boundaries,

the selection of the model based on the underlying chemical and biological processes rather than statistical and sequence-based abstractions. The rest of the chapter is then oriented toward investigating the model and its potential contributions for distinguishing protein-coding sequences from non-coding sequences and identifying splice sites.

1.1 Fundamentals of Gene Structure

This section provides an overview of fundamental DNA concepts and introduces the basic terminology and mechanisms involved in protein-coding gene recognition. The material presented here can be found in standard molecular genetics textbooks [79, 91, 145, 158].

1.1.1 The Basic Chemical Structure of Genes

The nucleic acid polymers DNA (deoxyribonucleic acid) and RNA (ribonucleic acid), and the amino acid polymers called proteins, are three types of macromolecules. Although the structures of these three macromolecules vary, each is a chain built from a small number of monomer types, and the linear sequence of monomers is the defining characteristic of each polymer.

Deoxyribonucleic acid (DNA) is a polymeric macromolecule that encodes each cell’s genetic information. The monomer units of the DNA and RNA polymers are nucleotides, each consisting of a sugar, a nitrogen-containing base attached to the sugar, and a phosphate molecule. Four different types of nucleotides are found in DNA, differing only in the nitrogenous base, and one-letter abbreviations are used as shorthand for the four bases: A for adenine, G for guanine, C for cytosine, and T for thymine. In RNA, uracil (U) replaces T. “Nucleotide” and “base” will be used interchangeably in this work.

DNA is a linear (single stranded) polymer of nucleotides whose most stable state

is in the form of a double stranded helix. The DNA backbone consists of an

alternating sugar-phosphate sequence, with the base pairs stacked perpendicular to the

DNA backbone. The deoxyribose sugars (hydrogen instead of hydroxyl at 2’ carbon)

are joined at both the 3’-hydroxyl and 5’-hydroxyl groups to phosphate groups in

“phosphodiester” bonds. In the double stranded molecule, two DNA polymers are

aligned in an anti-parallel structure; one strand is oriented 3’-5’, the other 5’-3’. By

convention, we call the strand with the direction from 5’ to 3’ the plus strand (or

coding strand, Crick strand, sense strand), and the other with the direction from 3’

to 5’ the minus strand (or template strand, Watson strand, anti-sense strand). Most

frequently, RNA exists as a single stranded structure. Some short portions of RNA

are hydrogen bonded, which results in a complex, secondary structure.

Bases are known to interact through Watson-Crick binding, or base pairing. In

Watson-Crick binding, two bases may form either two or three hydrogen bonds. Such

a base pair is called complementary. A forms two hydrogen bonds with T, called A-T

base pair; C forms three hydrogen bonds with G, called C-G base pair. In RNA, A pairs with U in place of T, again through two hydrogen bonds.
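The pairing rules and the anti-parallel strand arrangement described above can be illustrated with a short Python sketch. It is only an illustration: the function names and the six-base fragment are hypothetical and do not come from the dissertation.

```python
# Watson-Crick partners for DNA; in RNA, U takes the place of T.
DNA_PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}

# Hydrogen bonds contributed by each base in a Watson-Crick pair
# (A-T and A-U: two bonds, C-G: three bonds), as stated in the text.
H_BONDS = {"A": 2, "T": 2, "U": 2, "C": 3, "G": 3}


def reverse_complement(strand: str) -> str:
    """Return the anti-parallel partner strand, written 5' to 3'."""
    return "".join(DNA_PAIR[base] for base in reversed(strand))


def hydrogen_bonds(strand: str) -> int:
    """Count hydrogen bonds between a strand and its full complement."""
    return sum(H_BONDS[base] for base in strand)


plus = "ATGTAC"                    # hypothetical plus-strand fragment, 5'->3'
minus = reverse_complement(plus)   # partner strand, also written 5'->3'
print(minus, hydrogen_bonds(plus))
```

Because the two strands run in opposite directions, the partner strand is obtained by complementing each base and reversing the order, so that both strands are written 5' to 3'.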

Proteins consist of linear sequences of amino acids. Twenty amino acids are used

to make proteins. The information specifying the amino acid sequences of a cell’s total

protein complement is found in the coding sequences of the genes. The Genetic Code

is the set of rules that assigns each amino acid a three nucleotide codeword (codon).

In the Genetic Code there are 64 possible 3-nucleotide codons in which three are

used to terminate protein synthesis and the remaining 61 are used to specify the 20

amino acids. This results in redundancy in that from one to six codons can be used

to specify the same amino acid.
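The codon counts quoted above can be checked by enumerating the standard genetic code. The sketch below uses the common compact encoding of the standard code table (DNA alphabet, codons enumerated in T, C, A, G order); the variable names are illustrative choices, not part of the original text.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code with codons enumerated in TCAG order;
# one-letter amino acid codes, '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Codons that terminate protein synthesis.
stop_codons = sorted(c for c, aa in CODON_TABLE.items() if aa == "*")

# Number of codons specifying each amino acid (the redundancy of the code).
degeneracy = Counter(aa for aa in CODON_TABLE.values() if aa != "*")

print(len(CODON_TABLE), stop_codons, len(degeneracy))
print(min(degeneracy.values()), max(degeneracy.values()))
```

The `degeneracy` counter makes the redundancy explicit: methionine and tryptophan have a single codon each, while leucine, serine, and arginine have six.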

1.1.2 Gene Expression

Gene expression is the process by which a gene’s DNA sequence is converted into the

functional protein structures of the cell. Gene expression is a highly regulated, multi-step process which has two basic activities. First, the transcription process produces a copy of the gene’s coding region in the form of an RNA molecule (the messenger RNA or mRNA). Second, in the translation process, the mRNA molecule is assembled into the ribosomal complex, where its linear sequence of RNA codons is translated into the linear sequence of amino acids, resulting in protein synthesis.

Post-transcriptional Processing: Pre-mRNA Splicing

The transcription process is shown in Figure 1.1. A DNA sequence is transcribed


process during the post-transcriptional process for eukaryotes.

A DNA sequence is transcribed into pre-mRNA, or nuclear RNA (nRNA), which for eukaryotic genes consists of interspersed exons and introns. A region of the DNA sequence that codes for a section of the final functional transcript (mRNA) is called an “exon”. The regions between exons are called introns. The boundaries between exons and introns are called splice sites. During a process called RNA splicing, or pre-mRNA splicing, the splice sites are recognized and the introns are removed from the pre-mRNA. The “messenger RNA”, or mRNA, is the result of pre-mRNA splicing.

The splice sites must be precisely recognized with a very low failure rate for

proteins to be made correctly. A large portion of my thesis research is focused on investigating the mechanism of the pre-mRNA splicing process, using free energy signals that could encode the information used in splice site recognition.


The pre-mRNA is bound to a complex of protein and RNA molecules called the spliceosome during the splicing process. The typical spliceosome-mediated splicing process is shown in Figure 1.2. Each intron is believed to have three recognition sequence regions for splicing: a) the branch point, characterized by the presence of an A nucleotide within the intron, b) the 5’ splice site at the beginning of the intron, and c) the 3’ splice site at the end of the intron. In step 1 of the splicing process, the branch point A is recognized and its 2’-OH attacks the phosphorus atom at the 5’ exon/intron boundary (5’ splice site or donor site), cleaving the bond. This forms a lariat structure (shown in the middle of Figure 1.2) which holds exon 1 ready for ligation (rebonding) to exon 2. In the second step, the 3’ end of exon 1 attacks the phosphorus atom at the 3’ intron/exon boundary (3’ splice site or acceptor site), releasing the intron (in lariat form) and splicing the two exons together.

Figure 1.2: Two steps of the splicing mechanism [158]

spliceosome is a large complex consisting of five small nuclear ribonucleoproteins (snRNPs), whose RNAs are referred to as U1, U2, U4, U5 and U6, and many non-snRNP proteins [59, 96, 107, 111, 131]. Base pairings between the snRNAs and certain regions of the intron are an important feature of this process.

Figure 1.3: Spliceosome cycle [158]

A brief summary of the spliceosome cycle in the pre-mRNA splicing process is provided below. An excellent review of splicing is given in [158].

Step 1: (a) U1 binds with the pre-mRNA (the splicing substrate) at the 5’ intron/exon

boundary (5’ splice site or donor site). The first complex is called the

commitment complex (CC), and consists of the pre-mRNA plus U1 and

perhaps other substances. The base pairing between 5’ splice site and U1

Figure 1.4: U1 and 5’ splice site base pairing [158]. The 5’ splice site sequence 5’-AGGUAAGU-3’ pairs with the U1 snRNA sequence 3’-UCCAUUCA-5’.

(b) Next, U2 joins and binds with the branch site, with help from ATP, to form

the A complex. Genetic analysis shows that the base pairings between

these sequences are essential for splicing [158] as shown in Figure 1.5.

Figure 1.5: U2 and the branch site base pairing [158]

(c) Subsequently, U4-U6 and U5 join to form the B1 complex. The U4/U6

complex binds with U5. Following complex formation, U4 dissociates from

U6 to allow three actions. 1) U6 displaces U1 from the 5’-splice site in an

ATP-dependent reaction that activates the spliceosome as shown in Figure

1.6, 2) U1 and U4 exit the spliceosome, and 3) U6 base pairs with U2.

Step one is then completed, resulting in separation at the 5’ exon/intron

boundary (5’ splice site or donor site) and formation of the lariat of intron

[158]. The activated spliceosome is also known as the B2 complex [158].


Figure 1.6: U6 and 5’ splice site base pairing [158]

formation of the lariat splicing intermediate with both held in the C1

complex [158].

Step 2: (a) With energy from a second molecule of ATP, the second splicing step occurs

joining both exons and removing the lariat-shaped intron with all held in

the C2 complex [158].

In the second step of splicing, the 3’ splice site is recognized, two exons

join, and the lariat intron is cut off [158].

A protein factor involved in the 3’ splice site recognition is U2AF (U2

associated factor), a multi-subunit protein consisting of 35 and 65 kD

subunits. The 65 kD subunit binds to the polypyrimidine tract near and upstream

of the 3’-splice site, and the 35 kD subunit binds AG at the 3’-splice site

[158]. SF1 (BBP, branchpoint binding protein) bridges between U2AF at

branchpoint (branch site) near the 3’-end of the intron and U1 snRNP at

the 5’-end of the intron, as shown in Figure 1.7. It is believed that SF1

helps to define the intron and brings the two ends of the intron together for

splicing. SF1 (BBP) recognizes the branch site UACUAAC [158]. Later,


branch point sequence (BPS) via the U2 snRNA, and with the U2AF65

C-terminal domain.

Figure 1.7: Intron-bridging interactions [158]

(b) In the next step, the spliced and mature mRNA forms leaving the intron

bound to the I complex [158].

(c) Finally, the I complex dissociates into its component snRNPs. snRNPs can

be recycled. The intron lariat intermediate is debranched and degraded

[158].
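The base pairing between U1 snRNA and the 5’ splice site in Step 1(a) can be illustrated by sliding the U1 segment of Figure 1.4 along a pre-mRNA and counting Watson-Crick pairs at each offset. This is a minimal sketch under stated assumptions: it counts paired positions only, which is a crude stand-in for, and not an implementation of, the free energy calculation used in this work. The function names and the toy pre-mRNA fragment are hypothetical.

```python
# Watson-Crick pairs for RNA (G-U wobble pairs are deliberately ignored here).
RNA_PAIR = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}

SPLICE_SITE_5 = "AGGUAAGU"  # 5' splice site sequence, 5'->3' (Figure 1.4)
U1_MASK = "UCCAUUCA"        # U1 snRNA segment, written 3'->5' (Figure 1.4)


def paired_positions(window: str, mask: str) -> int:
    """Count positions where window (5'->3') pairs with mask (3'->5')."""
    return sum((w, m) in RNA_PAIR for w, m in zip(window, mask))


def scan(pre_mrna: str, mask: str) -> list:
    """Slide the mask along the pre-mRNA; return the pair count per offset."""
    n = len(mask)
    return [paired_positions(pre_mrna[i:i + n], mask)
            for i in range(len(pre_mrna) - n + 1)]


toy = "CCAUCAGGUAAGUAUCC"   # hypothetical pre-mRNA fragment
scores = scan(toy, U1_MASK)
best = max(range(len(scores)), key=scores.__getitem__)
print(best, scores[best])
```

Writing the mask 3'->5' lets the two sequences be compared position by position without reversing either one, which mirrors how the pairing is drawn in Figure 1.4.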

Translation

After transcription, the mRNA (messenger RNA) is transported out of the nucleus into the cytoplasm of a eukaryotic cell. It is then translated into proteins (translation) [158].

Although all of the details of the interaction between the 18S rRNA and the mRNA

are not yet defined, it is clear that the 18S rRNA of the 40S ribosomal subunit is

involved in translation initiation, elongation, and termination. A scanning model


in eukaryotic mRNAs. This model postulates that the 43S initiation complex, that is,

the 40S ribosomal subunit carrying the charged initiator methionine tRNA (transfer

RNA) and initiation factors, binds to factors associated with the cap structure at the

5’ end of mRNA and then scans down the 5’ untranslated region (UTR) to locate the

start codon AUG, resulting in the correct positioning of the start codon in the donor

or P-site of the ribosomal complex [85]. Once the initiation site is recognized, the 60S

ribosomal subunit joins to form the 80S complex, and translation begins. Translation

continues until a stop codon enters the A-site of the ribosomal complex. The release

factor binding triggers the release of the polypeptide chain and dissociation of the

ribosomal complex from the mRNA, leaving both the 40S and 60S subunits ready for

another round of translation.
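The scanning model can be caricatured in a few lines of Python: scan from the 5’ end to the first AUG, then read codons until a stop codon is met. This is a deliberately simplified sketch. It assumes a strict first-AUG rule and ignores sequence context, initiation factors, and the cap structure; the function name and the example mRNA are hypothetical.

```python
STOP_CODONS = {"UAA", "UAG", "UGA"}


def scan_and_read(mrna: str):
    """Return (start_index, codons, stop_codon) under a first-AUG rule.

    Returns None if no AUG is found. stop_codon is None if the reading
    frame runs off the end of the sequence without meeting a stop codon.
    """
    start = mrna.find("AUG")          # scan from the 5' end
    if start == -1:
        return None
    codons = []
    for i in range(start, len(mrna) - 2, 3):
        codon = mrna[i:i + 3]
        if codon in STOP_CODONS:      # stop codon reached: translation ends
            return start, codons, codon
        codons.append(codon)
    return start, codons, None


mrna = "GGCAUGUACCGCUACGAAUAAGG"   # hypothetical 5' UTR + ORF + 3' tail
print(scan_and_read(mrna))
```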

We are interested in the regions of the DNA sequence that are encoded into proteins. These regions are called coding regions, or protein-coding regions. The other regions, which are not encoded into proteins, are called non-coding regions. It is therefore important to determine at which nucleotide translation starts and where it stops. The stretch between these two points is called an open reading frame.

It is important to determine the correct open reading frame (ORF) for a gene sequence. Every region of DNA has six possible reading frames, three in each direction, because the ribosome reads mRNA in groups of three nucleotides and shifts by three nucleotides at a time. The three reading frames in one direction are illustrated in Figure 1.8: Frame 1 starts with the “A”, Frame 2 with the “T”, and Frame 3 with the “G”. The longest ORF is in Frame 1. When the frame is the same as the one set by the start codon, it is “synchronized” [115]; Frame 1 is in “synchronization”. Frameshifts happen when the ribosome does not move in steps of three nucleotides. When the frame is one base behind the start codon, it is called a -1 frameshift; when it is one base ahead of the start codon, it is called a +1 frameshift. Thus Frame 2 is a +1 frameshift, and Frame 3 is a -1 frameshift.

Frame 1: ATG TAC CGC TAC GAA TAA
Frame 2: TGT ACC GCT ACG AAT AA
Frame 3: GTA CCG CTA CGA ATA A

Figure 1.8: Illustration of 3 frames in one direction.
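The frames of Figure 1.8 can be reproduced with a short Python sketch that also generates the three frames of the reverse direction. The function names and the frame labels (F1 to F3 for forward, R1 to R3 for reverse) are illustrative choices and not notation from this work.

```python
def codons_in_frame(seq: str, offset: int) -> list:
    """Split seq into triplets starting at offset; a short tail is kept."""
    return [seq[i:i + 3] for i in range(offset, len(seq), 3)]


def reverse_complement(seq: str) -> str:
    """Return the complementary strand, written 5' to 3'."""
    pair = {"A": "T", "T": "A", "C": "G", "G": "C"}
    return "".join(pair[b] for b in reversed(seq))


def six_frames(seq: str) -> dict:
    """Return the three forward (F) and three reverse (R) reading frames."""
    rev = reverse_complement(seq)
    frames = {f"F{k + 1}": codons_in_frame(seq, k) for k in range(3)}
    frames.update({f"R{k + 1}": codons_in_frame(rev, k) for k in range(3)})
    return frames


seq = "ATGTACCGCTACGAATAA"   # the sequence shown in Figure 1.8
for name, codons in six_frames(seq).items():
    print(name, " ".join(codons))
```

Frames F1, F2, and F3 correspond to Frames 1, 2, and 3 of the figure; the reverse frames are read from the complementary strand.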

The reading frame determines which amino acids will be encoded by a gene. Typically only one reading frame is used in translating a gene (in eukaryotes),

and this is often the longest open reading frame. Once the open reading frame

is known, the DNA sequence can be translated into its corresponding amino acid

sequence. An open reading frame starts with a start codon (ATG) in most species

1.2 Methods Review

1.2.1 Review of Methods for Coding Regions Analysis

Coding Measures

Many new methods for finding distinctive features of protein-coding regions have been

proposed in the past two decades (see reviews by Fickett, 1996 [50]; Claverie, 1997

[28]; Mathé et al., 2002 [98]). These methods are based on different measures for

discriminating protein-coding regions from non-coding regions.

Most coding measures are based on statistical features in protein-coding regions,

which are not present in non-coding regions. Some examples include differences in

codon usage [143], hexamer counts [29, 48, 51], codon position asymmetry [49],

autocorrelations, nucleotide frequencies [12, 14, 49, 133], entropy [3], and the mono- and diamino acid usage values [103]. The base compositional bias in coding sequences has also been exploited as a coding measure [130]. Periodicities, especially

the period-3 feature of a nucleotide sequence in the coding regions, have been used

widely [5, 25, 49, 62, 69, 83, 135, 149, 153, 165]. A detailed analysis for the various

coding measures has been given by Fickett and Tung [51].

Highly expressed genes often exhibit a non-random bias in codon usage, referred

to as “codon bias”. Codon bias can serve as one of the variables to determine how

likely the transcription and translation of an open reading frame (ORF) into a protein

product is. Codon bias can also be helpful for finding the codon sequences that are


“codon adaptation index” (CAI) as a quantitative way of predicting the expression

level of a gene. Karlin et al. (1998) [76] adopted the “codon usage” as an alternative

quantitative indicator. Bennetzen et al. (1982) [10] used the “codon bias index”

(CBI) to predict gene expression level. However, the codon-based expression models

are still based on rather qualitative assumptions about gene expression including

codon composition of only a limited set of highly expressed genes [128]. In 2000,

Zhang et al. [168] proposed the YZ score and suggested it as a complement to CBI or CAI. The YZ score is based on the Z curve theory of DNA sequences [168] and reflects the codingness of an ORF or a fragment of a DNA sequence. In 2000 and 2001, Karlin et al. [74, 75] built models for predicting gene expression levels using codon usage differences on a somewhat broader set of highly expressed genes. In 2003, Jansen et al. [72] revisited the CAI and codon usage formalisms, calculated new parameters from larger sets of genes with improved expression data from yeast, and predicted the expression levels from the codon compositions of genes using an alternative linear model.
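To make the idea of a codon-based index concrete, the sketch below computes the codon adaptation index in its usual textbook form: each codon receives a relative adaptiveness w, equal to its count in a reference set divided by the count of the most frequent synonymous codon, and the CAI of a gene is the geometric mean of w over its codons. This is a hedged illustration and not the procedure of any of the cited studies. The reference "genes" are hypothetical toy strings, and the pseudocount of 0.5 for unseen codons is an assumption.

```python
import math
from collections import Counter
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def codons(seq: str) -> list:
    """Split a coding sequence into complete triplets."""
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def relative_adaptiveness(reference_genes, pseudocount=0.5):
    """w[codon] = count / (max count among its synonymous codons)."""
    counts = Counter(c for gene in reference_genes for c in codons(gene))
    weights = {}
    for codon, aa in CODE.items():
        family = [c for c, a in CODE.items() if a == aa]
        if aa == "*" or len(family) == 1:
            continue  # skip stop codons and single-codon amino acids (Met, Trp)
        best = max(counts[c] or pseudocount for c in family)
        weights[codon] = (counts[codon] or pseudocount) / best
    return weights


def cai(gene: str, weights: dict) -> float:
    """Geometric mean of w over the scored codons of the gene."""
    logs = [math.log(weights[c]) for c in codons(gene) if c in weights]
    if not logs:
        raise ValueError("gene contains no scored codons")
    return math.exp(sum(logs) / len(logs))


# Hypothetical reference set: GCT preferred over GCC (Ala), AAG over AAA (Lys).
reference = ["GCTGCTGCTGCC", "AAGAAGAAGAAA"]
w = relative_adaptiveness(reference)
print(cai("GCTAAG", w), cai("GCCAAA", w))
```

A gene built only from the preferred codons of the reference set scores 1, and the score falls as rarer synonymous codons are used.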

Algorithms

The algorithms used to identify genes usually employ one or more coding measures

mentioned previously. Each coding measure captures a distinctive feature of coding sequences that gives it some ability to distinguish protein-coding from non-coding sequences [18, 41, 108, 170]. Other measures are based on signals of the gene


of measures have been proposed for gene prediction. Some examples include artificial

neural networks [48, 89, 137, 154, 164], and linguistic methods [34, 97, 126].

Other methods, for example Markov chains, have also received wide attention for protein-coding gene analysis. Different variations and improvements over conventional Markov models have been implemented in gene classification algorithms [13, 38, 45, 92]. Moreover, the coding potential of a DNA sequence (the potential of a DNA sequence to encode a protein) is attributed to the local structures present in the codons, which are believed to be more informative than the global properties in distinguishing coding and non-coding sequences [83]. Therefore, many methods are based on local information analysis. For example, Kotlar et al. (2003) [83] proposed a new discriminating feature based on the arguments of the Discrete Fourier Transform (DFT) for locating short genes and exons, and Kulkarni et al. (2005) [87] employed information about the local Hölder exponents to distinguish coding sequences from non-coding sequences using a binary support vector classification algorithm.

Multifractal analysis has been exploited for the analysis of coding regions of DNA

sequences (reviewed in [87]). Zhou et al. (2005) [43] used the global features

obtained from multifractal analysis of nucleotide sequences to distinguish coding from

non-coding sequences. Similar principles have also been adopted to investigate

long-range correlations in the DNA sequence. Peng et al. (1992) [40] used a DNA walk

model to discover the presence of long-range correlations in non-coding sequences.

(1992) [110] further proved that such correlations also exist in coding sequences. Arneodo et al. (1998) [37] and Audit et al. (2001) [39] have shown the presence of long-range power law correlations in eukaryotic coding sequences. Meanwhile, the importance of the periodicities of a given DNA sequence for determining the protein-coding regions has attracted wide attention, and was addressed by Fickett (1992). These periodicities, especially period-3, have been used as discriminant features in several studies of gene prediction [5, 25, 51, 62, 135, 149]. Silverman and Linsker (1986) [135] used the periodic patterns in the coding sequences revealed by the Fourier transform to distinguish protein-coding sequences from non-coding sequences. Kotlar et al. (2003) [83] employed the Discrete Fourier Transform (DFT) as a tool for studying periodicities. Gao et al. (2005) [53] used a “deviation” from the sequence’s fractal “background” to distinguish the protein-coding sequences from non-coding sequences. Instead of signals from nucleotide characters, the period-3 signals in protein-coding regions were investigated using the DFT of the free energy from rRNA and mRNA interactions during translation in prokaryotic genes [115, 116].
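The period-3 measure computed from nucleotide characters can be sketched as follows: each base is mapped to a 0/1 indicator sequence, and the squared magnitudes of the four Fourier coefficients at frequency 1/3 are summed. This is the generic character-based form of the measure, shown only for illustration; it is not the free energy signal of [115, 116]. The function name and the two test sequences are hypothetical.

```python
import cmath


def period3_power(seq: str) -> float:
    """Sum over A, C, G, T of |X_b|^2 at frequency 1/3.

    X_b = sum_n u_b[n] * exp(-2*pi*i*n/3), where u_b[n] is 1 if seq[n] is
    base b and 0 otherwise. When len(seq) is a multiple of 3, this is the
    DFT coefficient at index k = N/3.
    """
    total = 0.0
    for base in "ACGT":
        coeff = sum(cmath.exp(-2j * cmath.pi * n / 3)
                    for n, s in enumerate(seq) if s == base)
        total += abs(coeff) ** 2
    return total


periodic = "ATG" * 30      # hypothetical sequence with perfect period 3
non_periodic = "AT" * 45   # same length, period 2 only
print(period3_power(periodic), period3_power(non_periodic))
```

In practice the peak at frequency 1/3 is compared with the average spectral power of the sequence, so that sequences of different lengths and compositions can be compared.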

1.2.2 Review of Methods for Splice Sites Identification

Annotation of genomic sequences in prokaryotes is substantially easier because the

coding regions are contiguous. In eukaryotes, the difficulty in identifying coding

sequences is increased because, for many genes, the coding regions (the major regions

of exons) are non-contiguous, separated by intervening non-coding regions (introns).

influenced by their accuracy at determining the precise boundaries of exons and introns, since the splice signals are probably the most critical signals for accurate exon prediction.

Site conservations and site dependencies are two main concerns in splice site identification problems. Our knowledge of splice site identification and intron excision during the pre-mRNA splicing process indicates that the splice sites exhibit strong features which facilitate the specific interactions between splice sites on the pre-mRNA and the spliceosome complex. Such features are manifest as the site conservations and site dependencies. Splice site sequence conservation can be seen, for example, in the “GT” conservation at the 1st and 2nd positions of the 5’ splice site. Other potential or hidden signals may be found by comparison with other DNA sequences. The site dependencies, which involve both adjacent and non-adjacent positions, define the site variation. The site dependencies are the other major concern for many programs used to improve the prediction accuracy. Studies have shown that there are strong dependencies between non-adjacent as well as adjacent positions around splice sites. Almost three out of four of all pairs of positions exhibit significant dependence in donor sites [17]. The site conservations and dependencies are not independent of each other, and they may be considered at the same time, combined to different degrees, in one program. We will first review the research concerning dependencies, which starts from the methods for site features. The review of methods about features selection and classification

Algorithms Concerning Site Dependencies

Several statistical models for donor and acceptor splice sites have been constructed in

the past 20 years [7, 18, 19, 127, 141, 166, 172]. One of the earliest and most

influential models is the weight matrix model (WMM) [141] that uses the position-specific

compositional biases in splice sites. The WMM weights can be optimized using a

neural network method [16] that was developed for NetPlantGene [60], NetGene2

[150], and NNSplice [112]. It is a component of SpliceMachine [32] as well. Another

method, called the weight array model (WAM) [142, 172], was developed to describe

the dependencies between adjacent base positions by the inhomogeneous first-order

Markov chain (1MC) model, and was later applied in the VEIL [61] and

MORGAN [121] software programs. WMM and WAM are important components in gene

prediction systems such as GeneSplicer [109] and GenScan [18]. Markov models have

been used as well [120, 172].
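
To make the weight matrix idea concrete, the following minimal sketch estimates position-specific base probabilities from a handful of aligned sites and scores a candidate by its log-odds against a uniform background. The function names, the pseudocount, the uniform background of 0.25, and the four toy "donor" strings are all assumptions invented for illustration; they are not data or code from the cited programs.

```python
import math

def train_wmm(sites, pseudocount=1.0):
    """Estimate position-specific base probabilities from aligned, equal-length sites."""
    matrix = []
    for pos in range(len(sites[0])):
        counts = {b: pseudocount for b in "ACGT"}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        matrix.append({b: counts[b] / total for b in "ACGT"})
    return matrix

def wmm_score(matrix, candidate, background=0.25):
    """Log-odds score of a candidate site; positions are treated as independent."""
    return sum(math.log(matrix[i][b] / background) for i, b in enumerate(candidate))

# Made-up aligned sites, for illustration only.
toy_donors = ["AGGTAAGT", "AGGTGAGT", "CGGTAAGT", "AGGTAAGA"]
wmm = train_wmm(toy_donors)
print(wmm_score(wmm, "AGGTAAGT"), wmm_score(wmm, "AGCCAAGT"))
```

Because each position contributes independently to the score, a model of this form cannot express the adjacent or non-adjacent dependencies discussed in this section; that limitation is what the WAM and the later models address.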

Statistically significant dependencies between base positions in the splice sites

have been investigated recently [2, 8, 17, 18, 19, 30, 47]. One explanation for the

observed dependencies between splice site positions is the interactions between the

structure of small nuclear RNPs (snRNPs) and the splice site region of the pre-mRNA

during spliceosome assembly [99].

Several new algorithms have been proposed to improve predictive power by

considering the dependencies between splice site positions. Some examples include the

maximal dependence decomposition method (MDD) [18], quadratic discriminant analysis

combined with Bahadur expansion [7]. The latter two models have attempted to analyze

splicing sites with pairwise correlations.

The maximal dependence decomposition (MDD) model has been used to capture

the dependencies of splice sites in contrast to using only the dependencies between

the adjacent positions in Genscan [18]. The MDD method is basically a decision tree

bifurcating at the most influential residuals. Since branching occurs only when

statistically significant residuals are detected, this method can suppress the increase of

model parameters compared to higher-order Markov models even with higher-order

dependencies [7]. Such models did not indicate significant improvement compared

with the simpler models with only dependencies between adjacent positions.

However, the combination of the basic statistical models, such as WMM, WAM and MDD,

with other signal/content sensors and/or rule-based filtering may improve the

modelling. The MDD model in GeneSplicer [109] combined with two second-order Markov

chain (2MC) models and a local maximal score filter has been used to characterize

splice sites in eukaryotic mRNA. The comparison of GeneSplicer to other splice site

predictors, such as NetPlantGene [60], NetGene2 [16, 60], HSPL [65, 138], NNSplice

[112], GENIO [93, 94], SpliceView [114], etc., indicates that the performance of

GeneSplicer is comparable with the best predictors for both human and Arabidopsis data.

A Bayes network model [19] has been developed by computing the correlations

between all residuals, finding the maximum spanning tree by linking positions of

high correlations, and computing the conditional probability for each linked position.

Markov model. Castelo et al. (2004) [21] adopted Bayesian networks, exploiting

the results [22, 21] that permit reducing the bias in the estimation problem as the

sample size increases. Chen et al. (2005) [26] developed a dependency graph model

and its derivatives for splice sites prediction, and made an attempt to fully capture

the intrinsic interdependency between base positions in a splice site.

Zhang et al. (2003) [171] generalized the diversity increment method and

combined it with the quadratic discriminant analysis [170] (called IDQD, increment of

diversity combined with quadratic discriminant analysis) to identify and predict the

splice sites. The diversity increment model was based on the diversity measure that

can synthesize different types of information, the splicing signals, the compositional

and base-correlating features of exons and introns. It employed two types of methods,

intrinsic and extrinsic. The comparison of compositional features and the base

dependencies at adjacent or non-adjacent positions of two sequences (for example, one

sequence before exon/intron boundary and one sequence after exon/intron boundary,

or one standard set of exons or introns and another set of sequence whose property

is to be predicted, etc.) can be integrated automatically in the diversity increment.

Algorithms for Site Features Selection and Classification

Feature selection has been used in splice sites analysis and identification. Traditional

Feature Subset Selection (FSS) methods are sequential and are based on a greedy

heuristic [80]. Genetic algorithms (GAs) are the more advanced methods that use

domains [86, 134, 156]. Instead of using the crossover and mutation operators to create

the new population, a more statistical approach is used to estimate the distribution

of the parameters from a selected group of individuals and then derive the new

population from the estimated distribution. Estimation of distribution algorithms (EDAs)

have been shown to outperform the standard genetic algorithms in problems with multiple

dependencies among parameters, requiring fewer fitness evaluations to obtain good solutions.

EDAs were first used for feature subset selection by Inza et al. (1999) [68], and their

applications to FSS in large scale domains produced good results [4]. Cantú-Paz (2002)

[20] compared several EDAs with the simple GA for small scale domains (at most

35 features) using a Naive Bayes classifier (for example, WMM), and concluded that

the EDAs with their complicated dependency learning system are not significantly

better than the GA with the simple compact. The EDA-UMDA (EDA with the

Univariate Marginal Distribution Algorithm as the estimation algorithm) approach

was suggested to be very similar to the compact GA [58] or to a GA with uniform

crossover. Saeys et al. (2003) [119] used a simple EDA as a wrapper for feature subset

selection for splice site prediction with equal or higher relevance than the traditional

sequential methods, resulting in a better classification of the splice sites.

Recent approaches based on discriminant functions such as Winnow [27] or the

support vector machine (SVM) [31, 139, 146] showed significant improvements in

prediction performance compared to previously used systems such as NetGene2 [150],

SPL, SplicePredictor [155] and GeneSplicer [109]. The feature distributional

feature selection by distributional clustering of words via the information bottleneck

method [148]. Degroeve et al. (2002) [31] combined the SVM with feature subset

selection in a wrapper algorithm (SVM) [15, 156] for splice site prediction. An

extensive overview and comparison of splice site recognition methods was given in Mathé et

al. (2002) [98] and Zhang (2002) [169].

1.3 Some Discussion of Existing Methods

1.3.1 Analysis and Recognition of Protein-coding Regions

Current algorithms for recognizing protein-coding regions are based on various

coding measures that find the differences between protein-coding regions and non-coding

regions, as reviewed in Section 1.2.1. These coding measures have investigated the

statistical behaviors of protein-coding regions from various aspects of DNA

sequence features. Some measured the statistical features of nucleotides, such as

nucleotide frequencies and hexamer counts. Some measured the statistical features

of codons, such as differences in codon usage, codon adaptation index, and codon

bias index. Some combined several features of protein-coding regions, such as

the Z curve that measures the features of purine/pyrimidine nature of nucleotides,

amino/keto nature of nucleotides, and strong/weak hydrogen bonding property of

nucleotides. Some measured the global information of sequences such as correlation,

multifractal features, and periodicities.

incorporated the underlying mechanism(s). Highly expressed genes exhibit strong codon

bias for particular codons in many bacteria and small eukaryotes. One suggested

explanation is that there appears to be a relationship between tRNA abundance and

codon bias [67, 76, 128]. The explanations for most of the other statistical behaviors

remain unknown. Although some biological mechanisms of recognizing the

protein-coding regions have been investigated, as briefly reviewed for the translation

process in Section 1.1.2, the programs for identifying protein-coding regions have rarely

considered the biological processes in which these sequences participate, or given a

biological explanation for the revealed statistical behaviors.

1.3.2 Identification of Splice Sites

Various complex and sophisticated methods have been investigated statistically for

increasing the identification accuracy of the splice sites, as discussed in Section 1.2.2.

We have summarized these investigations from two main aspects, site conservations

and dependencies. Although the observation of the pre-mRNA splicing process

suggests that the splicing process exhibits strong features of site conservations [55] and

site dependencies [100], which facilitate the specific interactions between splice sites on

pre-mRNA and spliceosome complex, these methods have rarely utilized pre-mRNA

1.4 Derivation of a New Approach

We propose an approach for identifying the protein-coding sequences and the splice

sites by measuring their biological processes using free energy. An important issue in

today’s research is identifying the protein-coding regions and locating their exact

boundaries, especially the splice sites, for eukaryotic genes [36, 138]. We assume that a

variable free energy pattern could arise from the hybridization of a short RNA

sequence and a DNA sequence. We can measure this biological hybridization

using the binding free energy. We can then analyze the variable energy pattern as a

non-random signal using signal processing techniques, and extract the information to

identify the protein-coding sequences and the splice sites.

1.4.1 Identification of Protein-coding Sequences

In prokaryotes, free energy based calculations for Watson-Crick hybridization between

the 3’-terminal, single stranded, 13 nucleotides of the 16S rRNA sequence (16S tail)

and mRNAs have been used to analyze the translation process [90, 104, 116, 115, 147].

The literature suggests that 16S rRNA has a role in reading frame synchronization

throughout the translation process [140]. Putting these two ideas together, we

wondered if, during translation, the 16S tail could repeatedly hybridize to the mRNA,

resulting in a free energy signal. If this could occur, the signal could encode

information to control the reading frame. Rosnick (2001) [115] computed the free energy scores

in kcal/mol between the 3’ tail of 16S rRNA of E. coli K-12 and an ensemble of

In his algorithm, the 16S tail was overlaid on each mRNA, starting 50 nucleotides

upstream of the start codon, and a free energy value was calculated. The 16S tail

was then moved downstream one nucleotide, and the free energy was calculated for this

new alignment. This process was repeated until the 16S tail was beyond the gene stop

codon. Using this method, Rosnick obtained one free energy sequence for each of the

2000 genes. The free energy sequences were then averaged to extract the ensemble

behavior for the genes of E. coli. This ensemble free energy signal showed

period-3 behavior in the coding region. Mishra et al. [105] extended Rosnick’s work to a

number of eubacterial and archaeal species of prokaryotes, and found the same period-3

behavior in gene coding regions. Ponnala et al. (2006) [116] used a procedure similar

to that of Rosnick, and obtained free energy signals for gene sequences. Ponnala et

al. (2006) showed that a periodic energetic pattern of frequency 1/3 exists in the

majority of coding regions of 12 eubacterial species, but not in the noncoding regions

that encode the 16S and 23S rRNAs.

Extending these observations to genes in eukaryotic organisms, we investigate the

interactions between the 3’ tail end of the 18S rRNA and the underlying mRNA using

the free-energy calculations. The success of this approach, i.e. prokaryotic mRNA

sequences have a period-3 encoded signal that can be revealed through hybridization

with the 16S tail, prompts the idea that the free energy signal could be useful for

identifying protein coding regions. This idea is supported in part by the work of

Hagenbuchle et al. (1978) [56] and Sargan et al. (1982) [123] who suggested a

initiation codon of mRNA. This raises the question as to whether the 18S rRNA tail

in eukaryotic genes, like the 16S rRNA tail in prokaryotic genes reviewed above,

has a synchronization role with the reading frame throughout the translation process.

Therefore, the first task of my thesis work is to explore whether the protein-coding

sequences for yeast include period-3 signals like prokaryotic genes, and whether we

can use the period-3 signals to identify the coding sequences.

1.4.2 Identification of the Splice Sites

The splice sites are believed to be the most important functional sites for gene

identification [120]. If they could be reliably detected from the genomic DNA, the difficulty

in identifying the coding regions would be greatly reduced. We address the derivation

of our approach for the identification of the splice sites in this section.

As addressed above, the statistical methods endeavored to improve their

identification performance from two main aspects, site conservations and site dependencies.

Although the observation of the pre-mRNA splicing process suggests that the splicing

process exhibits strong features of site conservations [55] and site dependencies [100],

methods have seldom utilized this mechanism for the identification of the splice

sites.

Mechanistic studies of splice site recognition involve interactions between the

pre-mRNA sequence and the conserved sequences in the snRNAs (small nuclear RNAs),

which are complexed with the snRNP components of the spliceosome [55]. This has

interactions could be a useful approach for identifying the rules of splice site recognition

[114].

A program that is capable of measuring the biological interactions of snRNAs

and pre-mRNA may capture the best combination of the site conservations and

dependencies. Such a program may also reveal features that contribute to splice

site identification. Garland and Aalberts [54] used a free energy approach to explore

the mechanism of donor site recognition by U1 during the pre-mRNA splicing

process, and to identify the donor sites for a set of 65 human genes. They linked two

sequences into one sequence, and fed it into Mfold to locate the donor site with the

minimum free energy. In my thesis, I expand this approach to develop an improved

program that calculates the free energy specifically between a short sequence and

a DNA sequence, and that is able to identify all three splice sites. The free energy

period-3 signal described above may also assist the identification of the splice sites by finding

the consistent period-3 signal on the coding regions across the splice sites. Therefore,

the second task of my thesis work is to identify the splice sites, the main boundaries

of protein-coding/non-coding regions.

Thus, my thesis research focuses on the utilization of free energy signals to

measure the biological interactions for: 1) identifying the protein-coding/non-coding

sequences, i.e. the annotation problem of identifying open reading frames (ORFs) in

genomic sequence, and 2) a closely related problem of eukaryotic sequences, finding

boundaries) in eukaryotic gene sequences.

1.5 Outline

Chapter 2 describes two basic algorithms used in our work. Chapter 3 presents the

discovery of the periodic signal in yeast sequences, and the selection of an algorithm for

coding/non-coding sequence analysis. Chapter 4 provides a model of

protein-coding/non-coding sequence identification based on the periodic signal discovered

in Chapter 3. Chapter 5 discusses a model to pin-point the exact boundaries of exons

and introns, the splice sites. The summary of the signal discovery, the methods, the

Chapter 2

Two Basic Algorithms

Two basic algorithms, free energy calculation and cumulative sinusoidal wave, are

described in detail in this chapter to avoid repetition in the later chapters.

At the end of this chapter, we discuss the relation of the cumulative sinusoidal wave to the

conventional minimum mean-squared error estimation.

2.1 Algorithm for Free Energy Calculation

From the introduction of our approach in Chapter 1, we propose to utilize the

biological processes to analyze and identify protein-coding sequences and the splice sites.

We then use the free energy to measure the two sequences’ interactions during the

biological processes. In this section, we give the description of free energy calculation

for hybridization between two sequences, a short sequence and a long gene sequence,

in which the short sequence is sliding on the long gene sequence. The short sequence

The algorithm assumes that the alignment between two sequences adjusts to form

the most stable secondary structure during their interactions [100]. In the first

alignment of the algorithm, the terminal 3’-nucleotide of the decoder is aligned with the

terminal 5’-nucleotide of the gene sequence. The algorithm then determines the

secondary hybridized structure with the most negative, i.e. most stable, free energy, then

calculates that free energy, $\Delta G^\circ$, using a dynamic programming algorithm [136, 147].

This is illustrated on the top half of Figure 2.1. 3’-ATTACTAG-5’ was the decoder

in this example. A stable double-helical structure can occur when $\Delta G^\circ$ is less than

zero. The optimal binding energy $\Delta G^\circ$ is recorded as $e(0)$ for the first alignment.

[Figure 2.1: Illustration of free energy calculation between a short sequence and a long sequence. Top panel, "One calculation": find the optimal secondary structure, then calculate the free energy e0. Bottom panel, "The calculations on a Sequence": the short sequence ATTACTAG is shifted along the long sequence, giving e0, e1, ..., en.]

The short sequence is moved downstream (in the 3’ direction) along the gene

sequence one nucleotide at a time from the start codon to the stop codon, generating

a series of alignments. For each alignment, a free energy value is calculated. We then

calculate a free energy sequence, or a free energy vector, for all the alignments over

energy sequence is then recorded as E = [e(0), e(1), ..., e(n)].
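
The sliding scan that produces the free energy sequence can be sketched as follows. This is only a toy under loudly stated assumptions: the per-pair energies in `PAIR_ENERGY` are hypothetical placeholders rather than published nearest-neighbor parameters, the alignment is gap-free (no dynamic programming search for the optimal secondary structure, no loop penalties), and only fully overlapping alignments are scanned. The function names are likewise invented for the example.

```python
# Hypothetical per-pair energies in kcal/mol; NOT published thermodynamic parameters.
PAIR_ENERGY = {("A", "U"): -1.0, ("U", "A"): -1.0,
               ("G", "C"): -2.0, ("C", "G"): -2.0,
               ("G", "U"): -0.5, ("U", "G"): -0.5}

def toy_binding_energy(decoder_3to5, window_5to3):
    """Sum pair energies for a gap-free, position-by-position antiparallel alignment."""
    return sum(PAIR_ENERGY.get((d, m), 0.0)
               for d, m in zip(decoder_3to5, window_5to3))

def free_energy_sequence(decoder_3to5, mrna_5to3):
    """Slide the decoder one nucleotide at a time; return [e(0), e(1), ..., e(n)]."""
    width = len(decoder_3to5)
    return [toy_binding_energy(decoder_3to5, mrna_5to3[i:i + width])
            for i in range(len(mrna_5to3) - width + 1)]

# Decoder written 3'->5', mRNA written 5'->3', so index j pairs with index j.
print(free_energy_sequence("UAC", "AUGAUG"))
```

Replacing `toy_binding_energy` with a routine that searches for the most stable hybridized structure would bring the sketch closer to the procedure described in this section; the scan loop itself would not change.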

A number of researchers have used free energy as a metric for studying the

interactions of sequences [38, 63, 90, 106, 113, 115, 118, 125, 147]. RNAcofold [63, 118] is

a program that uses a linker sequence to join the two sequences into a single strand of

RNA prior to folding. Garland and Aalberts (2004) [54] used a similar approach by

linking two sequences into one sequence and then feeding it into MFOLD [173, 174]

to calculate the free energy between the two sequences. RNAhybrid [113] is a tool for

finding the minimum free energy hybridization of a long and a short RNA.

When considering the free energy between two sequences, several types have been

described, as given in Equation 2.1.

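\[
\Delta G^\circ = \Delta G^\circ_{\text{init}} + \Delta G^\circ_{\text{stack}} + \Delta G^\circ_{\text{Iloop}} + \Delta G^\circ_{\text{Bloop}} \tag{2.1}
\]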

In this formula, $\Delta G^\circ_{\text{init}}$ is the amount of free energy required to initiate a helix

between the two strands of RNA [52]; $\Delta G^\circ_{\text{stack}}$ is the free energy associated with the

stacked bases in a double-stranded, hybridized sequence; $\Delta G^\circ_{\text{Iloop}}$ is the internal loop

penalty [174]; $\Delta G^\circ_{\text{Bloop}}$ is the bulge loop penalty [174]. Dangling 5’ or 3’ ends [11] are

not considered in my algorithm because of ambiguities regarding what constitutes a

dangling end on either the mRNA sequences or the 5’ end of the decoder sequence, for

example the 18S rRNA tail. We obtained free energy parameter values for

Watson-Crick binding from Xia et al. (1998) [161], G/U mismatches from Mathews et al.

2.2 Algorithm of Cumulative Sinusoidal Wave

The “cumulative sinusoidal wave” algorithm is the method we used to extract the

synchronization signal from a free energy sequence, and is described in this section.

The “synchronization signal” is defined as the period-3 signal associated with the

authentic reading frame.

2.2.1 Mathematical Theory

We can express the period-3 signal using a sinusoidal wave in the time domain

(position in nucleotides in our work):

\[
y(n) = C_0 + M \sin(w_0 n + \phi) + z(n), \tag{2.2}
\]

where $C_0$ is a constant, $M$ is the amplitude, $\phi$ is the constant phase shift, $n = 0, 1, 2, \ldots$, and $z(n)$ is the noise at position $n$.

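If the period is $N = 3$ nucleotides,

\[
\sin(w_0 n + \phi) = \sin(w_0 (n + N) + \phi) = \sin(w_0 (n + 3) + \phi)
\]

\[
\therefore\; 3 w_0 = 2\pi \;\Rightarrow\; w_0 = 2\pi/3.
\]

Then Equation 2.2 can be written as

\[
y(n) = C_0 + M \sin\left(\frac{2\pi}{3} n + \phi\right) + z(n). \tag{2.3}
\]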

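Equation 2.3 can be further written as 3 different expressions:

\[
\begin{aligned}
y(3k) &= C_0 + M \sin(\phi) + z(3k), && \text{when } n = 3k;\\
y(3k+1) &= C_0 + M \sin\left(\frac{2\pi}{3} + \phi\right) + z(3k+1), && \text{when } n = 3k+1;\\
y(3k+2) &= C_0 + M \sin\left(\frac{4\pi}{3} + \phi\right) + z(3k+2), && \text{when } n = 3k+2,
\end{aligned}
\]

where $k$ is a non-negative integer.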

When the signal is embedded in strong background noise, we need a technique

to emphasize the signal relative to the noise. To do this, we utilized a cumulative

calculation that sums the signal separately at each of the three positions modulo 3 over the first $k$ codons.

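\[
\begin{aligned}
A(k) &= \sum_{i=0}^{k-1} y(3i) = kC_0 + kM \sin(\phi) + \sum_{i=0}^{k-1} z(3i);\\
B(k) &= \sum_{i=0}^{k-1} y(3i+1) = kC_0 + kM \sin\left(\frac{2\pi}{3} + \phi\right) + \sum_{i=0}^{k-1} z(3i+1);\\
C(k) &= \sum_{i=0}^{k-1} y(3i+2) = kC_0 + kM \sin\left(\frac{4\pi}{3} + \phi\right) + \sum_{i=0}^{k-1} z(3i+2).
\end{aligned} \tag{2.4}
\]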

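The average (DC term) is subtracted from $A(k)$, $B(k)$, and $C(k)$:

\[
\begin{aligned}
a(k) &= A(k) - C_{DC};\\
b(k) &= B(k) - C_{DC};\\
c(k) &= C(k) - C_{DC},
\end{aligned} \tag{2.5}
\]

\[
C_{DC} = \frac{A(k) + B(k) + C(k)}{3} = kC_0 + \frac{\sum_{i=0}^{3k-1} z(i)}{3}.
\]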

The removal of the DC term is based on the mathematical fact that we can always

three points equals zero. For a case of a real free energy sequence, the DC term is

non-zero because the free energy values are non-positive.
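
The reason the sinusoidal terms drop out of the DC term can be checked with the angle-addition formula. The short derivation below is added as a worked step and uses only the symbols of Equation 2.4:

\[
\begin{aligned}
\sin(\phi) + \sin\left(\phi + \frac{2\pi}{3}\right) + \sin\left(\phi + \frac{4\pi}{3}\right)
&= \sin(\phi)\left(1 + \cos\frac{2\pi}{3} + \cos\frac{4\pi}{3}\right) + \cos(\phi)\left(\sin\frac{2\pi}{3} + \sin\frac{4\pi}{3}\right)\\
&= \sin(\phi)\left(1 - \frac{1}{2} - \frac{1}{2}\right) + \cos(\phi)\left(\frac{\sqrt{3}}{2} - \frac{\sqrt{3}}{2}\right) = 0.
\end{aligned}
\]

Consequently $A(k) + B(k) + C(k)$ contains no term in $M$, only the constant contribution $3kC_0$ and the accumulated noise.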

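Then, we assigned $a(k)$, $b(k)$, and $c(k)$ to expressions corresponding to a sinusoidal

wave function:

\[
\begin{aligned}
a(k) &= kM \sin(\phi) + \sum_{i=0}^{k-1} z(3i) - \frac{\sum_{i=0}^{3k-1} z(i)}{3};\\
b(k) &= kM \sin\left(\phi + \frac{2\pi}{3}\right) + \sum_{i=0}^{k-1} z(3i+1) - \frac{\sum_{i=0}^{3k-1} z(i)}{3};\\
c(k) &= kM \sin\left(\phi + \frac{4\pi}{3}\right) + \sum_{i=0}^{k-1} z(3i+2) - \frac{\sum_{i=0}^{3k-1} z(i)}{3}.
\end{aligned} \tag{2.6}
\]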

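If $z$ is random noise, $\sum_{i=0}^{k-1} z(3i) \simeq \sum_{i=0}^{k-1} z(3i+1) \simeq \sum_{i=0}^{k-1} z(3i+2)$ when $k$ is a large number. Then

\[
\begin{aligned}
\sum_{i=0}^{k-1} z(3i) - \frac{\sum_{i=0}^{3k-1} z(i)}{3} &\to 0,\\
\sum_{i=0}^{k-1} z(3i+1) - \frac{\sum_{i=0}^{3k-1} z(i)}{3} &\to 0,\\
\sum_{i=0}^{k-1} z(3i+2) - \frac{\sum_{i=0}^{3k-1} z(i)}{3} &\to 0
\end{aligned}
\]

(more precisely, these noise terms become negligible compared with the $kM$ terms, which grow linearly in $k$).

Therefore, we can observe that $a(k)$, $b(k)$, and $c(k)$ converge to linear functions of $k$

that have a trend of linearly increasing amplitude and constant phase, when $k$ is large

enough.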

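\[
\begin{aligned}
a(k) &\simeq kM \sin(\phi);\\
b(k) &\simeq kM \sin\left(\phi + \frac{2\pi}{3}\right);\\
c(k) &\simeq kM \sin\left(\phi + \frac{4\pi}{3}\right).
\end{aligned} \tag{2.7}
\]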
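
The cumulative sums and the DC removal of Equation sets 2.4 and 2.5 can be checked numerically. The sketch below applies them to a noise-free toy signal, for which the three components should equal $kM\sin(\cdot)$ exactly up to rounding; the function name and the values chosen for $C_0$, $M$, and $\phi$ are assumptions made purely for illustration.

```python
import math

def cumulative_components(y, k):
    """Return a(k), b(k), c(k): per-frame sums over the first k codons minus their mean."""
    A = sum(y[3 * i] for i in range(k))
    B = sum(y[3 * i + 1] for i in range(k))
    C = sum(y[3 * i + 2] for i in range(k))
    dc = (A + B + C) / 3.0          # the DC term C_DC
    return A - dc, B - dc, C - dc

# Noise-free toy signal; C0, M, PHI are arbitrary illustrative values.
C0, M, PHI = -2.0, 0.5, 0.3
signal = [C0 + M * math.sin(2 * math.pi * n / 3 + PHI) for n in range(300)]
a, b, c = cumulative_components(signal, k=100)
print(a, 100 * M * math.sin(PHI))
```

Adding a noise term to `signal` and repeating the comparison for increasing `k` is a simple way to see how quickly the linear trend emerges for a given noise level.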

2.2.2 Application to a Real Free Energy Signal

We can always fit a sinusoid with period 3 to the equally spaced 3 points, no matter

whether a free energy sequence is period-3 or not. However, when the free energy

sequence is period-3, its cumulative amplitude will be linearly increasing, and its

cumulative phase will tend to be constant, as discussed above. This becomes the

basis to test whether a given sequence is a protein-coding sequence or not. We describe

the application of the method of cumulative sinusoidal wave on the real free energy

sequences as follows.

For a real free energy sequence, we first summed up the free energy sequence

modulo 3 over the first 3k nucleotides (k codons), which resembles the DNA walk

[40]. Then, we fitted a sinusoidal wave to the three accumulated free energy values.

We performed the same procedure as given in Equation sets 2.4 and 2.5.

\[
\begin{aligned}
A'(k) &= \sum_{i=0}^{k-1} e(3i);\\
B'(k) &= \sum_{i=0}^{k-1} e(3i+1);\\
C'(k) &= \sum_{i=0}^{k-1} e(3i+2),
\end{aligned} \tag{2.8}
\]

\[
\begin{aligned}
a'(k) &= A'(k) - C'_{DC}(k);\\
b'(k) &= B'(k) - C'_{DC}(k);\\
c'(k) &= C'(k) - C'_{DC}(k),
\end{aligned} \tag{2.9}
\]

\[
C'_{DC}(k) = \frac{A'(k) + B'(k) + C'(k)}{3} = \frac{\sum_{i=0}^{3k-1} e(i)}{3}.
\]

calculated the amplitude and phase by solving two of the following three equations.

\[
\begin{aligned}
a'(k) &= M(k) \sin(\phi(k));\\
b'(k) &= M(k) \sin\left(\phi(k) + \frac{2\pi}{3}\right);\\
c'(k) &= M(k) \sin\left(\phi(k) + \frac{4\pi}{3}\right).
\end{aligned} \tag{2.10}
\]

Comparing Equation set 2.10 with Equation set 2.7, we can determine that

$M(k) \to kM$ and $\phi(k) \to \phi$ as $k$ becomes large, only if a dominant period-3

signal exists in the free energy sequence. In other words, the amplitude is linearly

increasing and the phase is constant for a free energy sequence with a dominant

period-3 signal, when $k$ is large enough.

The cumulative calculation compensates for the small signal to noise ratio (SNR)

due to performing the computations for any single codon. In essence, the use of the

accumulated free energy sequence is a noise reduction technique.
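
One possible way to carry out the amplitude and phase calculation is sketched below. It relies on the identity $b'(k) - c'(k) = \sqrt{3}\,M(k)\cos(\phi(k))$, which follows from the three sinusoid expressions, so that $M(k)$ and $\phi(k)$ can be read off with `hypot` and `atan2`. This particular way of solving the equations, the function names, and the synthetic test sequence are assumptions of this sketch, not necessarily the procedure used in this work.

```python
import math

def amplitude_phase(a, b, c):
    """Solve a = M sin(phi), b = M sin(phi + 2pi/3), c = M sin(phi + 4pi/3)."""
    m_sin = a                           # M sin(phi)
    m_cos = (b - c) / math.sqrt(3.0)    # M cos(phi), since b - c = sqrt(3) M cos(phi)
    return math.hypot(m_sin, m_cos), math.atan2(m_sin, m_cos)

def cumulative_track(e):
    """Return [(k, M(k), phi(k))] for k = 1 .. len(e) // 3 codons."""
    A = B = C = 0.0
    track = []
    for k in range(1, len(e) // 3 + 1):
        A += e[3 * k - 3]
        B += e[3 * k - 2]
        C += e[3 * k - 1]
        dc = (A + B + C) / 3.0          # cumulative DC term
        M_k, phi_k = amplitude_phase(A - dc, B - dc, C - dc)
        track.append((k, M_k, phi_k))
    return track

# Synthetic, noise-free "free energy" values (illustrative numbers only).
energies = [-3.0 + 0.4 * math.sin(2 * math.pi * n / 3 + 1.0) for n in range(90)]
track = cumulative_track(energies)
print(track[-1])
```

For a sequence with a dominant period-3 component, plotting `M(k)` against `k` should give an approximately straight line and `phi(k)` should settle to a constant, which is the behavior described in this section.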

2.3 Some Discussion with Conventional Method

In this section, we evaluate two methods of estimating the bias, the amplitude, and

the phase for the waveform with a period of three: the conventional minimum

mean-squared error and the cumulative sinusoidal wave. We first describe the conventional

method of minimum mean-squared error. Then we give some discussion of the method

of minimum mean-squared error and the method of cumulative sinusoidal wave [162,

2.3.1 Minimum Mean-squared Error

A noise-free, period-3 signal can be written as

\[
x(n) = C_0 + M \sin\left(\frac{2\pi}{3} n + \phi\right).
\]

We seek to find the values of the parameters $C_0$, $M$, and $\phi$ that best represent the

data values, denoted as $y(n)$, in the minimum mean-squared error sense. The error is

\[
e(n) = x(n) - y(n).
\]

That is, we seek to minimize the mean squared error

\[
E = \frac{1}{N} \sum_{n=1}^{N} \left( C_0 + M \sin\left(\frac{2\pi}{3} n + \phi\right) - y(n) \right)^2.
\]

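Surprisingly, this non-linear minimization problem has a closed-form solution. First,

we calculate the three derivatives:

\[
\frac{\partial E}{\partial C_0} = \frac{2}{N} \sum_{n=1}^{N} \left( C_0 + M \sin\left(\frac{2\pi}{3} n + \phi\right) - y(n) \right) \tag{2.11}
\]

\[
\frac{\partial E}{\partial M} = \frac{2}{N} \sum_{n=1}^{N} \left( C_0 + M \sin\left(\frac{2\pi}{3} n + \phi\right) - y(n) \right) \sin\left(\frac{2\pi}{3} n + \phi\right) \tag{2.12}
\]

\[
\frac{\partial E}{\partial \phi} = \frac{2M}{N} \sum_{n=1}^{N} \left( C_0 + M \sin\left(\frac{2\pi}{3} n + \phi\right) - y(n) \right) \cos\left(\frac{2\pi}{3} n + \phi\right) \tag{2.13}
\]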
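
As a sanity check on the three derivatives, the sketch below evaluates the mean squared error and its analytic gradient, so that they can be compared against central finite differences. The $\phi$-derivative carries the amplitude factor that arises from differentiating $M\sin(\cdot)$ with respect to $\phi$. The function names, the toy data, and the evaluation point are assumptions chosen only for this illustration.

```python
import math

def mse(c0, m, phi, y):
    """Mean squared error E for data y(1..N) against C0 + M sin(2*pi*n/3 + phi)."""
    n_pts = len(y)
    return sum((c0 + m * math.sin(2 * math.pi * n / 3 + phi) - y[n - 1]) ** 2
               for n in range(1, n_pts + 1)) / n_pts

def mse_gradient(c0, m, phi, y):
    """Analytic derivatives (dE/dC0, dE/dM, dE/dphi)."""
    n_pts = len(y)
    g_c0 = g_m = g_phi = 0.0
    for n in range(1, n_pts + 1):
        theta = 2 * math.pi * n / 3 + phi
        resid = c0 + m * math.sin(theta) - y[n - 1]
        g_c0 += resid
        g_m += resid * math.sin(theta)
        g_phi += resid * m * math.cos(theta)   # note the factor M
    scale = 2.0 / n_pts
    return scale * g_c0, scale * g_m, scale * g_phi

# Deterministic toy data: a period-3 sinusoid plus a small fixed perturbation.
data = [-2.0 + 0.7 * math.sin(2 * math.pi * n / 3 + 0.4) + 0.1 * ((7 * n) % 5 - 2)
        for n in range(1, 31)]
print(mse_gradient(-1.5, 0.5, 0.1, data))
```

Comparing each component with `(mse(p + h) - mse(p - h)) / (2 * h)` for a small step `h` gives a quick numerical confirmation of the expressions before they are set to zero.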

Setting the first expression in Equation 2.11
