ABSTRACT
CHUANHUA XING. The Analysis and Identification of Protein-coding Sequences for Yeast Using a Free Energy Model. (Under direction of Dr. Donald L. Bitzer and Dr. Winser E. Alexander.)
The Analysis and Identification of Protein-coding Sequences
for Yeast Using a Free Energy Model
by
Chuanhua Xing
A dissertation submitted to the Graduate Faculty of North Carolina State University
in partial fulfillment of the requirements for the Degree of
Doctor of Philosophy
Electrical and Computer Engineering
Raleigh, North Carolina May 2007
APPROVED BY
Dr. Wesley E. Snyder Dr. Steffen Heber
Member of Advisory Committee Member of Advisory Committee
Dr. Mladen A. Vouk Dr. Anne-Marie Stomp
Member of Advisory Committee Member of Advisory Committee
DEDICATION
To my husband Yong Wang,
to my parents Houhai Xing and Liqin Guan,
to my sisters Chuanli Xing and Chuanbo Xing,
and to my brother Chuanlong Xing and his family,
Yan Wang, Jialin Xing, and Jiajia Xing.
BIOGRAPHY
Chuanhua Xing was born in Heilongjiang, China on October 4, 1976. From an early
age, her family inspired in her an interest in science and the perseverance to pursue
it. She developed a strong interest in learning, especially in mathematics, natural
science, and literature. She received numerous awards for being in the top 1% of
students.
She obtained her Bachelor’s degree in Electrical Engineering at Heilongjiang
University in July 1998. During her undergraduate study, she served as the Head of
her class, the President of the Student Association for the department, and the
President of the Association of Science and Technology. She participated in many
activities and clubs, including sports, the fashion model team, and literature
competitions. She received many awards, including the outstanding student award, the
outstanding leadership award, the excellent literature award, and others.
After graduation, she joined China Telecom at Heilongjiang in 1998. She worked
in the core technical department, where she participated in the development of
innovative technologies and macro-projects for information collection and
distribution and for communication system construction. She served as the technical
supervisor in 1999, overseeing technical progress and providing technical consultation.
Chuanhua enrolled at NC State University, USA in August 2001 for her graduate
study. She received the Master of Science degree in Electrical Engineering from
North Carolina State University in December 2002. She has worked on the analysis
of protein-coding genes using a free energy model since Fall 2003. Her paper
“Free Energy Based Analysis of the Coding Region of Saccharomyces
cerevisiae” was awarded Second Prize in the student paper contest at “The
Technology4Life on Biotech & Bioinfo Conference, IEEE”, Oct. 2004. She and her
collaborators have developed effective computational methods to analyze and
identify protein-coding sequences with high performance by utilizing biological
mechanisms, instead of analysis based on the DNA sequences alone.
She became a member of IEEE in 2001 and of the IEEE EMB Society in 2006. She
has served as a member of the administrative committee of the NCSU
Chemistry-Biology Interface / RNA Group at NC State since 2005. She
volunteered as the webmaster for the High Performance DSP Laboratory
(HiPerDSP lab) from 2003 to 2006. She also has interests in many sports and has
benefited from experiences such as the Intramural Flag Football Team and the
Intramural Flag Softball Team of the ECE Graduate Students’ Association, NC State
University, in 2005 and 2006.
ACKNOWLEDGEMENTS
This dissertation would not have been possible without the help, support, and guidance
of many wonderful people, including my Advisory Committee, my friends, and my family.
I would like to express my deepest and most sincere gratitude and appreciation to my
advisors, Dr. Winser E. Alexander and Dr. Donald L. Bitzer, for their guidance,
encouragement, and support throughout my Ph.D. study. I am indebted to Dr.
Alexander, who advised and influenced me as an excellent educator. I deeply appreciate
his encouragement and guidance through the difficulties; he has guided me to be an
independent and successful researcher. I am also indebted to Dr. Bitzer, who is the
beacon for my research. His active thought, his spirit, and his attitude to research
will always be valuable treasures to guide my life. Working with Dr. Bitzer on a
topic I am passionate about has been an unforgettable experience. I would also like
to thank the other committee members, Dr. Mladen A. Vouk, Dr. Anne-Marie Stomp,
Dr. Steffen Heber, and Dr. Wesley E. Snyder, for their many helpful suggestions and
their consistent support through the years.
I would like to convey my gratitude to all my colleagues, officemates, and friends,
especially, Cranos Williams, Senanu Ocloo, Josh Starmer, Lalit Ponnala, Ramsey
Hourani, Gary Charles, Youngsoo Kim, Jeff Ligon, Ruben Lobo, Evan Ernst, Robert
Snyder, Scott Vu, Sandeep Hattangady, Maitrik Diwan, Treshauna Wright, Li Li, and
Ying Zhu.
Finally, I would like to thank my parents for their unfailing support. I will always
appreciate my husband, who walked with me through these years; I thank him for his
love and his steadfast support.
Contents
List of Tables
List of Figures
1 Introduction
1.1 Fundamentals of Gene Structure
1.1.1 The Basic Chemical Structure of Genes
1.1.2 Gene Expression
1.2 Methods Review
1.2.1 Review of Methods for Coding Regions Analysis
1.2.2 Review of Methods for Splice Sites Identification
1.3 Some Discussion of Existing Methods
1.3.1 Analysis and Recognition of Protein-coding Regions
1.3.2 Identification of Splice Sites
1.4 Derivation of A New Approach
1.4.1 Identification of Protein-coding Sequences
1.4.2 Identification of the Splice Sites
1.5 Outline
2 Two Basic Algorithms
2.1 Algorithm for Free Energy Calculation
2.2 Algorithm of Cumulative Sinusoidal Wave
2.2.1 Mathematical Theory
2.2.2 Application to a Real Free Energy Signal
2.3 Some Discussion with Conventional Method
2.3.1 Minimum Mean-squared Error
3 Period-3 Signal Discovery in Protein-coding Regions for Yeast
3.1 Period-3 Signal in an Ensemble of Genes
3.1.1 Introduction
3.1.2 Data and Assumption
3.1.3 Methods
3.1.4 Results
3.1.5 Conclusion
3.2 Periodic-3 Signals in Individual Genes
3.2.1 Introduction
3.2.2 Methods and Material
3.2.3 PSD Analysis
3.2.4 Cumulative Sinusoidal Wave Method
3.2.5 Summary and Contribution
4 Identification of Protein-coding and Non-coding Sequences
4.1 Introduction
4.2 Data and Approach
4.2.1 Data Selection
4.2.2 The Observation of the Amplitude and the Phase in the Large Sample Set
4.2.3 Identification Standards and Model Construction
4.2.4 Training Size Analysis
4.3 Results
4.3.1 Five Fold Cross-validation Tests
4.3.2 The Relation of Terminal Angle with Reading Frame
4.3.3 False Negative Analysis
4.4 Discussion
5 Identification of Splice Sites
5.1 Introduction
5.2 Methods and Experimental Data
5.2.1 Branch Site Identification
5.2.2 5’ Splice Site Identification
5.2.3 3’ Splice Site Identification
5.3 Results
5.3.1 Identification of Branch Sites
5.3.3 Identification of 3’ Splice Sites
5.4 Discussion
6 Summary, Contribution, and Future Research
6.1 Summary
6.1.1 Identification of the Protein-coding Genes
6.1.2 Identification of the Splice Sites
6.2 Contribution
6.2.1 Identification of the Protein-coding Genes
6.2.2 Identification of the Splice Sites
6.3 Broader Impact
6.4 Future Research
6.4.1 Identification of the Protein-coding Genes
6.4.2 Identification of the Splice Sites
List of Tables
4.1 Datasets for S. cerevisiae and S. pombe.
4.2 The results for selecting the training set size.
4.3 Accuracy of the synchronization based coding-region identification algorithm on different coding/non-coding subsets. Se. is the abbreviation of sensitivity, and Sp. is the abbreviation of specificity. The rows with “*” are the results that are from Gao et al. [53].
4.4 Comparison of several measures used DFT from Kotlar et al., 2003 [83].
5.1 The site statistical results for the identification of the branch sites, where BE in the first row refers to binding energy.
5.2 The site statistical results for the identification of the 5’ splice sites.
5.3 The site statistical results for the identification of the branch sites, where
List of Figures
1.1 RNA processing and RNA splicing during transcription [158]
1.2 Two steps of splicing mechanism [158]
1.3 Spliceosome cycle [158]
1.4 U1 and 5’ splice site base pairing [158]
1.5 U2 and the branch site base pairing [158]
1.6 U6 and 5’ splice site base pairing [158]
1.7 Intron-bridging interactions [158]
1.8 Illustration of 3 frames in one direction.
2.1 Illustration of free energy calculation between a short sequence and a long sequence.
3.1 The binding energy around the start codon for yeast chromosome II.
3.2 Downstream binding energy for yeast chromosome II.
3.3 Power spectra of coding region for yeast chromosome II.
3.4 Power spectra of non-coding region for yeast chromosome II.
3.5 Power spectrum density for a protein-coding gene with the dominant one-third frequency component
3.6 Power spectrum density for a protein-coding gene with the weak one-third frequency component
3.7 Power spectrum density for an intron with the non-dominant one-third frequency component
3.8 The relation of SNR distribution with the length of exons
3.9 The relation of SNR distribution with the length of introns
3.10 The polar plot for a protein-coding gene
3.11 The polar plot for an intron
4.2 The plot of the angle vs. position for a representative protein-coding gene.
4.3 The histogram of the terminal angles for 2000 protein-coding sequences from the training set, where the outliers were eliminated.
4.4 The histogram of the terminal angles for 2000 non-coding sequences from the training set.
4.5 Analysis of angle variations.
4.6 Illustration of three definitions for angle variation.
4.7 The histogram of the positions for a given angle variation over 2000 randomly selected protein-coding sequences from dataset sc-Coding.
4.8 The maximum tolerable positions vs. angle variations for three definitions
4.9 The plot of the amplitude vs. position for a representative protein-coding gene
4.10 Two groups of amplitude rate plots for the protein-coding sequences and the non-coding sequences.
4.11 Sensitivity for specie S. cerevisiae over the different length intervals of sequences.
4.12 Specificity for specie S. cerevisiae over the different length intervals of sequences.
4.13 Sensitivity for specie S. pombe for the different length range of sequences.
4.14 Specificity for specie S. pombe for the different length range of sequences.
4.15 The histogram of the terminal angles for 215 ORFs
4.16 The histogram of the terminal angles for 215 ORFs including introns
5.1 Binding energies for an example gene using the mask “3’-ATGATTG-5’”.
5.2 The polar plots for three sequences that decided their 3’ splice sites by
Chapter 1
Introduction
Biological organisms are highly organized, intricately regulated molecular
assemblages. As such, they are information-rich systems. From an engineering
perspective, this means that they should provide information-encoded signals. It is
therefore reasonable to ask whether signal processing analysis can be used to detect,
extract, and decode the information of biological systems. An assumption as to the
nature of the encoded signal is required to begin investigating the utility of the
signal processing approach.
The concept of free energy can be used to describe the interactions of molecules.
If an interaction is favored, by convention we indicate that the interaction is
accompanied by a change in free energy from a higher (more positive) value in the
starting state to a lower (more negative) value after the interaction has occurred
(the final state). If a biological process consists of molecular interactions along a
time or space (position) continuum, a variable free energy pattern could be produced
in which the [...] Extracting the information in these signals should be possible
using signal processing techniques. Based on these ideas, I used signal processing
approaches in my dissertation to develop new tools to identify DNA sequences that
carry protein-coding information.
At the molecular level, biological processes are assemblies of many chemical
reactions. The chemical reactions are controlled by enzymes, which are catalytic
proteins. Proteins are essential parts of all living organisms and participate in every
process within cells. In addition to enzymes, there are other proteins that have
structural or mechanical functions. The blueprints for all these proteins are encoded
in the DNA sequences called genes. The diversity of biological organisms and of their
responses to the environment results from the exact library of genes present in an
individual (genotype) and the regulation of gene utilization in response to
environmental stimuli (gene expression). Two fundamental problems of biological
research are: 1) to identify the genes and 2) to determine what regulates gene
expression, in order to understand the information content in molecular processes. The
focus of my thesis research is the utilization of free energy signals and signal
processing analysis to identify protein-coding regions in genes and splice sites in
pre-mRNA molecules.
In this chapter, the basics of DNA and their relationship to the production of
proteins from protein-coding genes are outlined. We then review current methods for
identifying protein-coding regions and their boundaries, and discuss the selection of
a model based on the underlying chemical and biological processes rather than on
statistical and sequence-based abstractions. The rest of the chapter is
then oriented toward investigating the model and its potential contributions to
distinguishing protein-coding sequences from non-coding sequences and identifying
splice sites.
1.1 Fundamentals of Gene Structure
This section provides an overview of fundamental DNA knowledge and introduces
the basic terminology and mechanisms involved in protein-coding gene
recognition. The material presented here can be found in standard molecular genetics
textbooks [79, 91, 145, 158].
1.1.1 The Basic Chemical Structure of Genes
The nucleic acid polymers DNA (deoxyribonucleic acid) and RNA (ribonucleic acid),
and the amino acid polymers called proteins are three macro-molecules. Although the
structures of these three macro-molecules vary, each type of macromolecule is a chain
of a small number of monomers whose linear sequences are the defining characteristic
of each polymer.
Deoxyribonucleic acid (DNA) is a polymeric macro-molecule that encodes each
cell’s genetic information. The monomer units of the DNA and RNA polymers are
nucleotides, each consisting of a sugar, a nitrogen-containing base attached to the
sugar, and a phosphate molecule. Four different types of nucleotides are found in
DNA, differing only in the nitrogenous base, and the four bases are given one-letter
abbreviations as shorthand: A is for adenine; G is for guanine; C is for cytosine;
T is for thymine. Uracil (U) is substituted for T in RNA.
“Nucleotide” and “base” will be used interchangeably in this work.
DNA is a linear (single stranded) polymer of nucleotides whose most stable state
is in the form of a double stranded helix. The DNA backbone consists of an
alternating sugar-phosphate sequence, with the basepairs stacked perpendicular to the
DNA backbone. The deoxyribose sugars (hydrogen instead of hydroxyl at 2’ carbon)
are joined at both the 3’-hydroxyl and 5’-hydroxyl groups to phosphate groups in
“phosphodiester” bonds. In the double stranded molecule, two DNA polymers are
aligned in an anti-parallel structure; one strand is oriented 3’-5’, the other 5’-3’. By
convention, we call the strand with the direction from 5’ to 3’ the plus strand (or
coding strand, Crick strand, sense strand), and the other with the direction from 3’
to 5’ the minus strand (or template strand, Watson strand, anti-sense strand). Most
frequently, RNA exists as a single stranded structure. Some short portions of RNA
are hydrogen bonded, which results in a complex, secondary structure.
Bases are known to interact through Watson-Crick binding, or base pairing. In
Watson-Crick binding, two bases form either two or three hydrogen bonds. Two
bases that pair in this way are called complementary. A forms two hydrogen bonds
with T, giving the A-T base pair; C forms three hydrogen bonds with G, giving the
C-G base pair. In RNA, A pairs with U through two hydrogen bonds.
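The pairing rules above can be illustrated with a short Python sketch. The helper names (`reverse_complement`, `hydrogen_bonds`) and the example sequence are illustrative assumptions rather than part of the methods developed in this work; the code only builds the anti-parallel complementary strand and counts hydrogen bonds, using two per A-T pair and three per C-G pair.

```python
# Watson-Crick partners for DNA bases.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

# Hydrogen bonds contributed by each base pair: two for A-T, three for C-G.
H_BONDS = {"A": 2, "T": 2, "C": 3, "G": 3}


def reverse_complement(strand: str) -> str:
    """Return the anti-parallel complementary strand, written 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(strand))


def hydrogen_bonds(strand: str) -> int:
    """Count hydrogen bonds in the duplex formed by a strand and its complement."""
    return sum(H_BONDS[base] for base in strand)


if __name__ == "__main__":
    plus_strand = "ATGTACCGC"  # hypothetical example sequence, 5' to 3'
    print(reverse_complement(plus_strand))  # GCGGTACAT
    print(hydrogen_bonds(plus_strand))      # 23
```

Writing the complement in reverse order reflects the anti-parallel alignment of the two strands, so both strings read 5’ to 3’.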
Proteins consist of linear sequences of amino acids. Twenty amino acids are used
to make proteins. The information specifying the amino acid sequences of a cell’s total
protein complement is found in the coding sequences of the genes. The Genetic Code
is the set of rules that assigns each amino acid a three nucleotide codeword (codon).
In the Genetic Code there are 64 possible 3-nucleotide codons in which three are
used to terminate protein synthesis and the remaining 61 are used to specify the 20
amino acids. This results in redundancy in that from one to six codons can be used
to specify the same amino acid.
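The codon counts quoted above can be checked with a small Python sketch. It assumes the standard genetic code written in the compact ordering T, C, A, G for each codon position; the variable names are illustrative only.

```python
from collections import Counter
from itertools import product

# Standard genetic code: codons enumerated with bases ordered T, C, A, G at
# each position; "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
GENETIC_CODE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Codons that terminate protein synthesis.
stop_codons = sorted(c for c, aa in GENETIC_CODE.items() if aa == "*")

# Number of synonymous codons for each of the amino acids (stops excluded).
degeneracy = Counter(aa for aa in GENETIC_CODE.values() if aa != "*")

if __name__ == "__main__":
    print(len(GENETIC_CODE))         # 64 possible codons
    print(stop_codons)               # ['TAA', 'TAG', 'TGA']
    print(len(degeneracy))           # 20 amino acids
    print(sum(degeneracy.values()))  # 61 sense codons
    print(min(degeneracy.values()), max(degeneracy.values()))  # 1 6
```

Methionine and tryptophan are the amino acids with a single codon, while leucine, serine, and arginine each have six.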
1.1.2 Gene Expression
Gene expression is the process by which a gene’s DNA sequence is converted into the
functional protein structures of the cell. Gene expression is a highly regulated,
multi-step process which has two basic activities. First, the transcription process produces
a copy of the gene’s coding region in the form of a RNA molecule (the messenger
RNA or mRNA). In the translation process, the mRNA molecule is assembled into
the ribosomal complex where its linear sequence of RNA codons is translated into the
linear sequence of amino acids resulting in protein synthesis.
Post-transcriptional Processing: Pre-mRNA Splicing
The transcription process is shown in Figure 1.1. A DNA sequence is transcribed
process during the post-transcriptional process for eukaryotes.
A DNA sequence is transcribed into pre-mRNA, or nuclear RNA (nRNA), which for
eukaryotic genes consists of interspersed exons and introns. A region of the DNA
sequence that codes for a section of the final functional transcript (mRNA) is called
an “exon”. The regions between exons are called “introns”. The boundaries between
exons and introns are called splice sites. During a process called RNA splicing, or
pre-mRNA splicing, the splice sites are recognized and the introns are removed from
the pre-mRNA. The “messenger RNA”, or mRNA, is the result of pre-mRNA splicing.
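As a minimal sketch of what splicing does at the sequence level, the following Python function removes introns whose boundaries are already known and joins the remaining exons. The function name, the half-open coordinate convention, and the toy sequences are assumptions made for illustration; locating the splice sites, which is the actual problem addressed in this work, is not attempted here.

```python
def splice(pre_mrna: str, introns: list[tuple[int, int]]) -> str:
    """Remove introns, given as half-open (start, end) index pairs, and join the exons."""
    exons = []
    position = 0
    for start, end in sorted(introns):
        exons.append(pre_mrna[position:start])  # exon preceding this intron
        position = end                          # continue after the intron
    exons.append(pre_mrna[position:])           # final exon
    return "".join(exons)


if __name__ == "__main__":
    # Hypothetical toy transcript: two exons separated by one intron.
    exon1, intron, exon2 = "AUGGCC", "GUAAGUUUACUAACUUUUAG", "GAAUAA"
    pre_mrna = exon1 + intron + exon2
    intron_span = (len(exon1), len(exon1) + len(intron))
    print(splice(pre_mrna, [intron_span]))  # AUGGCCGAAUAA
```

An error of even one nucleotide in an intron boundary changes the joined sequence and shifts every codon downstream of the junction.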
The splice sites must be precisely recognized with a very low failure rate for
proteins to be made correctly. A large portion of my thesis research is focused on
investigating the mechanism of the pre-mRNA splicing process using free energy,
which could encode the information used in splice site recognition.
The pre-mRNA is bound to a complex of protein and RNA molecules called the
spliceosome during the splicing process. The typical spliceosome-mediated splicing
process is shown in Figure 1.2. Each intron is believed to have three recognition
sequence regions for splicing: a) the branch point, characterized by the presence of
an A nucleotide in the middle of an intron, b) the 5’-splice site at the beginning of
an intron, and c) the 3’-splice site at the end of an intron. In step 1 of the splicing
process, the 2’-OH of the branch point A is recognized and attacks the phosphorus
atom at the 5’ exon/intron boundary (5’ splice site or donor site), cleaving the
bond. This forms a lariat structure (shown in the middle of Figure 1.2) and holds
exon 1 ready for ligation (rebonding) to exon 2. In the second step, the 3’ end of
exon 1 attacks the phosphorus atom at the 3’ intron/exon boundary (3’ splice site
or acceptor site), releasing the intron (in lariat form) and splicing the two exons
together.
Figure 1.2: Two steps of splicing mechanism [158]
The spliceosome is a large complex consisting of five small nuclear ribonucleoproteins
(snRNPs), whose RNAs are referred to as U1, U2, U4, U5 and U6, and many
non-snRNP proteins [59, 96, 107, 111, 131]. Base pairings between the snRNAs and
certain regions of the intron are an important feature of
this process.
Figure 1.3: Spliceosome cycle [158]
A brief summary of the spliceosome cycle involved in the pre-mRNA splicing process is
provided below. An excellent review of splicing is given in [158].
Step 1: (a) U1 binds with the pre-mRNA (the splicing substrate) at the 5’ exon/intron
boundary (5’ splice site or donor site). The first complex is called the
commitment complex (CC), and consists of the pre-mRNA plus U1 and
perhaps other substances. The base pairing between the 5’ splice site and U1
is shown in Figure 1.4.
Figure 1.4: U1 and 5’ splice site base pairing [158] (5’ splice site: 5’-AGGUAAGU-3’; U1 snRNA: 3’-UCCAUUCA-5’)
(b) Next, U2 joins and binds with the branch site, with help from ATP, to form
the A complex. Genetic analysis shows that the base pairings between
these sequences are essential for splicing [158] as shown in Figure 1.5.
Figure 1.5: U2 and the branch site base pairing [158]
(c) Subsequently, U4-U6 and U5 join to form the B1 complex. The U4/U6
complex binds with U5. Following complex formation, U4 dissociates from
U6 to allow three actions. 1) U6 displaces U1 from the 5’-splice site in an
ATP-dependent reaction that activates the spliceosome as shown in Figure
1.6, 2) U1 and U4 exit the spliceosome, and 3) U6 base pairs with U2.
Step one is then completed, resulting in separation at the 5’ exon/intron
boundary (5’ splice site or donor site) and formation of the lariat of intron
[158]. The activated spliceosome is also known as the B2 complex [158].
Figure 1.6: U6 and 5’ splice site base pairing [158]
formation of the lariat splicing intermediate with both held in the C1
complex [158].
Step 2: (a) With energy from a second molecule of ATP, the second splicing step occurs
joining both exons and removing the lariat-shaped intron with all held in
the C2 complex [158].
In the second step of splicing, the 3’ splice site is recognized, two exons
join, and the lariat intron is cut off [158].
A protein factor involved in the 3’ splice site recognition is U2AF (U2
associated factor), a multi-subunit protein consisting of 35 and 65 kD
subunits. The 65 kD subunit binds to the polypyrimidine tract near and upstream
of the 3’-splice site, and the 35 kD subunit binds AG at the 3’-splice site
[158]. SF1 (BBP, branchpoint binding protein) bridges between U2AF at
branchpoint (branch site) near the 3’-end of the intron and U1 snRNP at
the 5’-end of the intron, as shown in Figure 1.7. It is believed that SF1
helps to define the intron and brings the two ends of the intron together for
splicing. SF1 (BBP) recognizes the branch site UACUAAC [158]. Later,
branch point sequence (BPS) via the U2 snRNA, and with the U2AF65
C-terminal domain.
Figure 1.7: Intron-bridging interactions [158]
(b) In the next step, the spliced and mature mRNA forms leaving the intron
bound to the I complex [158].
(c) Finally, the I complex dissociates into its component snRNPs. snRNPs can
be recycled. The intron lariat intermediate is debranched and degraded
[158].
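The role of base pairing between an snRNA and the intron, as in Step 1(a), can be illustrated with a short Python sketch that counts Watson-Crick pairs between two aligned, anti-parallel RNA segments. The sequences are those read from Figure 1.4, with the orientation assumed to be 5’ to 3’ for the splice site and 3’ to 5’ for the U1 segment; the function name is hypothetical, and counting pairs is only a crude stand-in for the free energy calculation developed later in this work.

```python
# Watson-Crick pairs in RNA (U replaces T).
RNA_PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}


def count_base_pairs(strand_5_to_3: str, partner_3_to_5: str) -> int:
    """Count Watson-Crick pairs between two aligned anti-parallel RNA segments."""
    if len(strand_5_to_3) != len(partner_3_to_5):
        raise ValueError("segments must have the same length")
    return sum(pair in RNA_PAIRS for pair in zip(strand_5_to_3, partner_3_to_5))


if __name__ == "__main__":
    splice_site = "AGGUAAGU"  # 5' splice site, assumed written 5' to 3'
    u1_segment = "UCCAUUCA"   # U1 snRNA segment, assumed written 3' to 5'
    print(count_base_pairs(splice_site, u1_segment))  # 8
```

All eight positions pair in this consensus case; a splice site that departs from the consensus would pair at fewer positions.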
Translation
After transcription, the mRNA (messenger RNA) is transported out of the nucleus
into the cytoplasm of a eukaryotic cell, where it is translated into protein
(translation) [158].
Although all of the details of the interaction between the 18S rRNA and the mRNA
are not yet defined, it is clear that the 18S rRNA of the 40S ribosomal subunit is
involved in translation initiation, elongation, and termination. A scanning model
in eukaryotic mRNAs. This model postulates that the 43S initiation complex, that is,
the 40S ribosomal subunit carrying the charged initiator methionine tRNA (transfer
RNA) and initiation factors, binds to factors associated with the cap structure at the
5’ end of mRNA and then scans down the 5’ untranslated region (UTR) to locate the
start codon AUG, resulting in the correct positioning of the start codon in the donor
or P-site of the ribosomal complex [85]. Once the initiation site is recognized, the 60S
ribosomal subunit joins to form the 80S complex, and translation begins. Translation
continues until a stop codon enters the A-site of the ribosomal complex. The release
factor binding triggers the release of the polypeptide chain and dissociation of the
ribosomal complex from the mRNA, leaving both the 40S and 60S subunits ready for
another round of translation.
We are interested in the regions of the DNA sequence that are translated into
proteins. These regions are called coding regions, or protein-coding regions. The
other regions, which are not translated into proteins, are called non-coding regions.
It is therefore important to determine at which nucleotide translation starts and
where it stops. The region delimited in this way is called an open reading frame.
It is important to determine the correct open reading frame (ORF) for a gene
sequence. Every region of DNA has six possible reading frames, three in the forward
and three in the reverse direction, because the ribosome reads mRNA in groups of
three nucleotides and shifts by three nucleotides at a time. The three reading frames
in the forward direction are illustrated in Figure 1.8: Frame 1 starts with the “A”,
Frame 2 with the “T”, and Frame 3 with the “G”. The longest ORF is in Frame 1. When
the frame is the same as that of the start codon, it is “synchronized” [115]; Frame
1 is in “synchronization”. Frameshifts happen when the ribosome does not move in
steps of three nucleotides. When the frame is one base behind the start codon, it is
called a -1 frameshift; when it is one base ahead of the start codon, it is called a
+1 frameshift. Thus Frame 2 is a +1 frameshift, and Frame 3 is a -1 frameshift.
Frame 1: ATG TAC CGC TAC GAA TAA
Frame 2: TGT ACC GCT ACG AAT AA
Frame 3: GTA CCG CTA CGA ATA A
Figure 1.8: Illustration of 3 frames in one direction.
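The frames of Figure 1.8 can be reproduced with a few lines of Python. This is a minimal sketch; the function names and the dictionary keys are illustrative assumptions, and the three reverse frames are obtained by reading the reverse complement of the same sequence.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def codons_in_frame(sequence: str, offset: int) -> list[str]:
    """Split a sequence into triplets starting at the given offset (0, 1, or 2)."""
    frame = sequence[offset:]
    return [frame[i:i + 3] for i in range(0, len(frame), 3)]


def six_frames(sequence: str) -> dict[tuple[str, int], list[str]]:
    """Return the three forward and three reverse reading frames, numbered 1 to 3."""
    reverse = "".join(COMPLEMENT[base] for base in reversed(sequence))
    frames = {}
    for offset in range(3):
        frames[("forward", offset + 1)] = codons_in_frame(sequence, offset)
        frames[("reverse", offset + 1)] = codons_in_frame(reverse, offset)
    return frames


if __name__ == "__main__":
    dna = "ATGTACCGCTACGAATAA"  # the sequence shown in Figure 1.8
    for number in (1, 2, 3):
        print(f"Frame {number}:", " ".join(six_frames(dna)[("forward", number)]))
```

Trailing groups of one or two nucleotides are kept, as in the figure, so that the shift between frames remains visible.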
The reading frame determines which amino acids will be encoded by a
gene. Typically only one reading frame is used in translating a gene (in eukaryotes),
and this is often the longest open reading frame. Once the open reading frame
is known, the DNA sequence can be translated into its corresponding amino acid
sequence. An open reading frame starts with a start codon (ATG) in most species
and ends with a stop codon.
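A simple way to see how an open reading frame is delimited is to scan each forward frame for an ATG followed, in the same frame, by a stop codon. The Python sketch below does this under the assumption of the standard stop codons TAA, TAG, and TGA; the function name and the zero-based frame numbering are illustrative choices, and the scan is not the identification method proposed in this work.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # stop codons of the standard genetic code


def find_orfs(sequence: str) -> list[tuple[int, int, int]]:
    """Return (start, end, frame) for each ATG...stop ORF on the forward strand.

    Indices are zero-based, `end` is exclusive and includes the stop codon,
    and `frame` is 0, 1, or 2 (Frame 1 in the text corresponds to frame 0).
    """
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(sequence) - 2, 3):
            codon = sequence[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                orfs.append((start, i + 3, frame))
                start = None                   # look for the next ORF in this frame
    return orfs


if __name__ == "__main__":
    dna = "ATGTACCGCTACGAATAA"  # the sequence shown in Figure 1.8
    orfs = find_orfs(dna)
    print(orfs)                                    # [(0, 18, 0)]
    longest = max(orfs, key=lambda o: o[1] - o[0])
    print(dna[longest[0]:longest[1]])              # ATGTACCGCTACGAATAA
```

For the example sequence, only the frame that begins at the start codon contains a complete ORF, which matches the observation that the longest ORF is in Frame 1.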
1.2 Methods Review
1.2.1 Review of Methods for Coding Regions Analysis
Coding Measures
Many new methods for finding distinctive features of protein-coding regions have been
proposed in the past two decades (see reviews by Fickett, 1996 [50]; Claverie, 1997
[28]; Mathé et al., 2002 [98]). These methods are based on different measures for
discriminating protein-coding regions from non-coding regions.
Most coding measures are based on statistical features of protein-coding regions
that are not present in non-coding regions. Some examples include differences in
codon usage [143], hexamer counts [29, 48, 51], codon position asymmetry [49],
autocorrelations, nucleotide frequencies [12, 14, 49, 133], entropy [3], and the
mono- and di-amino acid usage values [103]. The base compositional bias in coding
sequences has also been exploited as a coding measure [130]. Periodicities, especially
the period-3 feature of a nucleotide sequence in the coding regions, have been used
widely [5, 25, 49, 62, 69, 83, 135, 149, 153, 165]. A detailed analysis of the various
coding measures has been given by Fickett and Tung [51].
Highly expressed genes often exhibit a non-random bias in codon usage, referred
to as “codon bias”. Codon bias can serve as one of the variables for estimating how
likely it is that an open reading frame (ORF) is transcribed and translated into a
protein product. Codon bias can also be helpful for finding the codon sequences that
are [...] “codon adaptation index” (CAI) as a quantitative way of predicting the
expression level of a gene. Karlin et al. (1998) [76] adopted the “codon usage” as an
alternative quantitative indicator. Bennetzen et al. (1982) [10] used the “codon bias
index” (CBI) to predict gene expression level. However, the codon-based expression
models are still based on rather qualitative assumptions about gene expression,
including the codon composition of only a limited set of highly expressed genes [128].
In 2000, Zhang et al. [168] proposed the YZ score and suggested it as a complement
to CBI or CAI. The YZ score is based on the Z curve theory of DNA sequences [168] and
reflects the codingness of an ORF or a fragment of a DNA sequence. In 2000 and
2001, Karlin et al. [74, 75] built models for predicting gene expression levels using
codon usage differences on a somewhat broader set of highly expressed genes. In
2003, Jansen et al. [72] recapped the CAI and codon usage formalisms, calculated
new parameters with larger sets of genes and improved expression data from
yeast, and predicted expression levels from the codon compositions of
genes using an alternative linear model.
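To make the idea of a codon-usage measure concrete, the sketch below computes a codon adaptation index in the form in which it is commonly defined: the geometric mean of the relative adaptiveness of each codon in a gene, where relative adaptiveness is the codon's count in a reference set divided by the count of the most frequent synonymous codon. The function names and the toy reference set are assumptions for illustration; codons absent from the reference set are simply skipped, which is a simplification, and the code does not reproduce the parameterizations of the studies cited above.

```python
import math
from collections import Counter
from itertools import product

# Standard genetic code, bases ordered T, C, A, G; "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
GENETIC_CODE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def codons(sequence: str) -> list[str]:
    """Split an in-frame coding sequence into complete codons."""
    return [sequence[i:i + 3] for i in range(0, len(sequence) - 2, 3)]


def relative_adaptiveness(reference_genes: list[str]) -> dict[str, float]:
    """Weight of each codon relative to the most used codon for the same amino acid."""
    counts = Counter(c for gene in reference_genes for c in codons(gene))
    families: dict[str, list[str]] = {}
    for codon, aa in GENETIC_CODE.items():
        families.setdefault(aa, []).append(codon)
    weights = {}
    for aa, family in families.items():
        if aa == "*" or len(family) == 1:
            continue  # stop codons and single-codon amino acids carry no choice
        best = max(counts[c] for c in family)
        for c in family:
            if counts[c] > 0:  # simplification: unseen codons get no weight
                weights[c] = counts[c] / best
    return weights


def cai(gene: str, weights: dict[str, float]) -> float:
    """Geometric mean of the weights of the gene's codons that have a weight."""
    logs = [math.log(weights[c]) for c in codons(gene) if c in weights]
    if not logs:
        raise ValueError("no codon of the gene has a weight in the reference set")
    return math.exp(sum(logs) / len(logs))


if __name__ == "__main__":
    reference = ["GCTGCTGCTGCC"]  # hypothetical reference set: GCT x3, GCC x1
    w = relative_adaptiveness(reference)
    print(cai("GCTGCT", w))  # 1.0
    print(cai("GCTGCC", w))  # about 0.577
```

A gene built only from the preferred codons of the reference set scores 1.0, and each less preferred codon lowers the score.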
Algorithms
The algorithms used to identify genes usually employ one or more of the coding
measures mentioned previously. Each of these coding measures incorporates a
characteristic feature of the coding sequence that gives it some ability to
distinguish protein-coding from non-coding sequences [18, 41, 108, 170]. Other
measures are based on signals of the gene [...]
of measures have been proposed for gene prediction. Some examples include artificial
neural networks [48, 89, 137, 154, 164], and linguistic methods [34, 97, 126].
Other methods, for example Markov chains, have also received wide attention
for the analysis of protein-coding genes. Different variations and improvements over
conventional Markov models have been implemented in gene classification algorithms
[13, 38, 45, 92]. Moreover, the coding potential of a DNA sequence (the potential of
a DNA sequence to encode a protein) is attributed to the local structures present
in the codons, which are believed to be more informative than the global properties
in distinguishing coding and non-coding sequences [83]. Therefore, many methods
are based on local information analysis. Some examples include Kotlar et al. (2003)
[83] and Kulkarni et al. (2005) [87]. Kotlar et al. (2003) [83] proposed a new
discriminating feature based on the arguments of the Discrete Fourier Transform (DFT)
for locating short genes and exons. Kulkarni et al. (2005) [87] employed information
about the local Hölder exponents to distinguish coding sequences from non-coding
sequences using a binary support vector classification algorithm.
Multifractal analysis has been exploited for the analysis of coding regions of DNA
sequences (reviewed in [87]). Zhou et al. (2005) [43] used the global features
ob-tained from multifractal analysis of nucleotide sequences to distinguish coding from
non-coding sequences. Similar principles have also been adopted to investigate
long-range correlations in the DNA sequence. Peng et al. (1992) [40] used a DNA walk
model to discover the presence of long-range correlations in non-coding sequences.
(1992) [110] further proved that such correlations also exist in coding sequences.
Arneodo et al. (1998) [37] and Audit et al. (2001) [39] have recently shown the presence
of long-range power law correlations in eukaryotic coding sequences. Meanwhile, the
importance of the periodicities of a given DNA sequence for determining the protein-coding
regions has attracted wide attention, and was addressed by Fickett (1992). These
periodicities, especially period-3, have been used as discriminant features in several
studies of gene prediction [5, 25, 51, 62, 135, 149]. Silverman and Linsker (1986)
[135] used the periodic patterns in the coding sequences revealed by the Fourier
transform to distinguish protein-coding sequences from non-coding sequences. Kotlar et
al. (2003) [83] employed the Discrete Fourier Transform (DFT) as a tool for studying
periodicities. Gao et al. (2005) [53] used a “deviation” from the sequence’s fractal
“background” to distinguish the protein-coding sequences from non-coding sequences.
Instead of signals from nucleotide characters, the period-3 signals in protein-coding
regions were investigated in prokaryotic genes by applying the DFT to the free energy of
rRNA-mRNA interactions during translation [115, 116].
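As a rough illustration of the period-3 measures reviewed above, the following minimal sketch computes the DFT power at frequency 1/3 from the four nucleotide indicator sequences. The function name `period3_power`, the normalization by sequence length, and the two toy sequences are assumptions made only for this example; it is not the exact measure of any cited work.

```python
import cmath

def period3_power(seq):
    """Sum over A, C, G, T of |DFT coefficient at frequency 1/3|^2 of the
    0/1 indicator sequence of each base, divided by the sequence length.
    (Hypothetical helper for illustration.)"""
    n = len(seq)
    total = 0.0
    for base in "ACGT":
        # DFT coefficient at frequency 1/3 of the indicator sequence of this base
        coeff = sum(cmath.exp(-2j * cmath.pi * i / 3)
                    for i, ch in enumerate(seq) if ch == base)
        total += abs(coeff) ** 2
    return total / n

periodic = "ATG" * 30   # toy sequence with a perfect period of 3
uniform = "A" * 90      # toy sequence with no period-3 component

print(period3_power(periodic), period3_power(uniform))
```

A sequence with a strong codon-position bias gives a large value, while a sequence without period-3 structure gives a value near zero.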
1.2.2
Review of Methods for Splice Sites Identification
Annotation of genomic sequences in prokaryotes is substantially easier because the
coding regions are contiguous. In eukaryotes, the difficulty in identifying coding
sequences is increased because, for many genes, the coding regions (the major regions
of exons) are non-contiguous, separated by intervening non-coding regions (introns).
influenced by their accuracy at determining the precise boundaries of exons and
introns, since the splice signals are probably the most critical signals for accurate exon
prediction.
Site conservations and site dependencies are two main concerns in splice site
identification problems. Our knowledge of splice site identification and intron excision
during the pre-mRNA splicing process indicates that the splice sites exhibit strong
features which facilitate the specific interactions between splice sites on the pre-mRNA
and the spliceosome complex. Such features are manifest as the site conservations and
site dependencies. Splice site sequence conservation can be seen in, for example, the “GT”
conservation at the 1st and 2nd positions of the 5’ splice site. Other potential or hidden
signals may be found by comparison with other DNA sequences. The site
dependencies, which include adjacent and non-adjacent positions, define the site variation.
The site dependencies are the other major concern for many programs used to
improve the prediction accuracy. Studies have shown that there are strong dependencies
between non-adjacent as well as adjacent positions around splice sites. Almost three
out of four of all base pairs exhibit significant dependence in donor sites [17]. The site
conservations and dependencies are not independent of each other, and they may be
considered at the same time, at different levels of collaboration, in one program. We
will first review the research concerning dependencies which starts from the methods
for site features. The review of methods about features selection and classification
Algorithms Concerning Site Dependencies
Several statistical models for donor and acceptor splice sites have been constructed in
the past 20 years [7, 18, 19, 127, 141, 166, 172]. One of the earliest and most
influential models is the weight matrix model (WMM) [141], which uses the position-specific
compositional biases in splice sites. The WMM weights can be optimized using a
neural network method [16] that was developed for NetPlantGene [60], NetGene2
[150], and NNSplice [112]. It is a component of SpliceMachine [32] as well. Another
method, called the weight array model (WAM) [142, 172], was developed to describe
the dependencies between adjacent base positions by an inhomogeneous first-order
Markov chain (1MC) model, and was later applied in the VEIL [61] and
MORGAN [121] software programs. WMM and WAM are important components in gene
prediction systems such as GeneSplicer [109] and GenScan [18]. Markov models have
been used as well [120, 172].
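To make the weight matrix idea concrete, the sketch below estimates position-specific base probabilities from a handful of aligned sites and scores a candidate window by its log-odds against a uniform background. The function names, the pseudocount, the uniform background, and the four toy sites are all assumptions for illustration and are not taken from any of the cited programs.

```python
import math

def train_wmm(sites, pseudocount=1.0):
    """Estimate position-specific base probabilities from aligned sites."""
    length = len(sites[0])
    model = []
    for pos in range(length):
        counts = {b: pseudocount for b in "ACGT"}
        for s in sites:
            counts[s[pos]] += 1
        total = sum(counts.values())
        model.append({b: counts[b] / total for b in "ACGT"})
    return model

def wmm_score(model, window, background=0.25):
    """Log-odds score of a window under the WMM versus a uniform background."""
    return sum(math.log2(model[i][ch] / background)
               for i, ch in enumerate(window))

# Toy donor-like sites (hypothetical, not real data); "GT" is conserved
# at 0-based positions 3 and 4.
toy_sites = ["CAGGTAAGT", "AAGGTGAGT", "CAGGTAAGA", "TAGGTGAGT"]
model = train_wmm(toy_sites)

print(wmm_score(model, "CAGGTAAGT"), wmm_score(model, "CAGCCAAGT"))
```

A WAM differs in that each position's probability is conditioned on the base at the preceding position, so the per-position dictionaries would be indexed by a pair of bases instead of a single base.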
Statistically significant dependencies between base positions in the splice sites
have been investigated recently [2, 8, 17, 18, 19, 30, 47]. One explanation for the
observed dependencies between splice site positions is the interactions between the
structure of small nuclear RNPs (snRNPs) and the splice site region of the pre-mRNA
during spliceosome assembly [99].
Several new algorithms have been proposed to improve predictive power by
considering the dependencies between splice site positions. Some examples include the
maximal dependence decomposition method (MDD) [18], quadratic discriminant analysis
combined with Bahadur expansion [7]. The latter two models have attempted to analyze
splice sites with pairwise correlations.
The maximal dependence decomposition (MDD) model has been used to capture
the dependencies of splice sites in contrast to using only the dependencies between
the adjacent positions in Genscan [18]. The MDD method is basically a decision tree
bifurcating at the most influential residuals. Since branching occurs only when
statistically significant residuals are detected, this method can suppress the increase of
model parameters compared to higher-order Markov models even with higher-order
dependencies [7]. Such models did not indicate significant improvement compared
with the simpler models with only dependencies between adjacent positions.
However, the combination of the basic statistical models, such as WMM, WAM and MDD,
with other signal/content sensors and/or rule-based filtering may improve the
modelling. The MDD model in GeneSplicer [109] combined with two second-order Markov
chain (2MC) models and a local maximal score filter has been used to characterize
splice sites in eukaryotic mRNA. The comparison of GeneSplicer to other splice site
predictors, such as NetPlantGene [60], NetGene2 [16, 60], HSPL [65, 138], NNSplice
[112], GENIO [93, 94], SpliceView [114], etc., indicates that the performance of
GeneSplicer is comparable with the best predictors for both human and Arabidopsis data.
A Bayes network model [19] has been developed by computing the correlations
between all residuals, finding the maximum spanning tree by linking positions of
high correlations, and computing the conditional probability for each linked position.
Markov model. Castelo et al. (2004) [21] adopted Bayesian networks, exploiting
the results [22, 21] that permit reducing the bias in the estimation problem as the
sample size increases. Chen et al. (2005) [26] developed a dependency graph model
and its derivatives for splice site prediction, and attempted to fully capture
the intrinsic interdependency between base positions in a splice site.
Zhang et al. (2003) [171] generalized the diversity increment method and
combined it with quadratic discriminant analysis [170] (called IDQD, increment of
diversity combined with quadratic discriminant analysis) to identify and predict the
splice sites. The diversity increment model was based on the diversity measure that
can synthesize different types of information: the splicing signals and the compositional
and base-correlating features of exons and introns. It employed two types of methods,
intrinsic and extrinsic. The comparison of compositional features and the base
dependencies at adjacent or non-adjacent positions of two sequences (for example, one
sequence before the exon/intron boundary and one sequence after the exon/intron boundary,
or one standard set of exons or introns and another set of sequences whose property
is to be predicted, etc.) can be integrated automatically in the diversity increment.
Algorithms for Site Features Selection and Classification
Feature selection has been used in splice site analysis and identification. Traditional
Feature Subset Selection (FSS) methods are sequential and are based on a greedy
heuristic [80]. Genetic algorithms (GAs) are the more advanced methods that use
domains [86, 134, 156]. Instead of using the crossover and mutation operators to create
the new population, a more statistical approach is used to estimate the distribution
of the parameters from a selected group of individuals and then derive the new
population from the estimated distribution. Estimation of distribution algorithms (EDAs)
have proven to outperform the standard genetic algorithms in that, when there are multiple
dependencies among parameters, they require fewer fitness evaluations to obtain good solutions.
EDAs were first used for feature subset selection by Inza et al. (1999) [68], and their
applications to FSS in large scale domains produced good results [4]. Cantú-Paz (2002)
[20] compared several EDAs with the simple GA for small scale domains (at most
35 features) using a Naive Bayes classifier (for example, WMM), and concluded that
the EDAs with their complicated dependency learning system are not significantly
better than the GA with the simple compact. The EDA-UMDA (EDA with the
Univariate Marginal Distribution Algorithm as the estimation algorithm) approach
was suggested to be very similar to the compact GA [58] or to a GA with uniform
crossover. Saeys et al. (2003) [119] used a simple EDA as a wrapper for feature subset
selection for splice site prediction with equal or higher relevance than the traditional
sequential methods, resulting in a better classification of the splice sites.
Recent approaches based on discriminant functions such as Winnow [27] or the
support vector machine (SVM) [31, 139, 146] showed significant improvements in
prediction performance compared to previously used systems such as NetGene2 [150],
SPL, SplicePredictor [155] and GeneSplicer [109]. The feature distributional
feature selection by distributional clustering of words via the information bottleneck
method [148]. Degroeve et al. (2002) [31] combined the SVM with feature subset
selection in a wrapper algorithm (SVM) [15, 156] for splice site prediction. An
extensive overview and comparison of splice site recognition methods are given in Mathé et
al. (2002) [98] and Zhang (2002) [169].
1.3
Some Discussion of Existing Methods
1.3.1
Analysis and Recognition of Protein-coding Regions
Current algorithms for recognizing protein-coding regions are based on various
coding measures that find the differences between protein-coding regions and non-coding
regions, as reviewed in Section 1.2.1. These coding measures have investigated the
statistical behaviors of protein-coding regions from many aspects of DNA
sequence features. Some measured the statistical features of nucleotides, such as
nucleotide frequencies and hexamer counts. Some measured the statistical features
of codons, such as differences in codon usage, codon adaptation index, and codon
bias index. Some combined several features of protein-coding regions, such as
the Z curve that measures the purine/pyrimidine nature of nucleotides,
the amino/keto nature of nucleotides, and the strong/weak hydrogen bonding property of
nucleotides. Some measured the global information of sequences such as correlation,
multifractal features, and periodicities.
incorporated the underlying mechanism(s). Highly expressed genes exhibit strong codon
bias for particular codons in many bacteria and small eukaryotes. One suggested
explanation is that there appears to be a relationship between tRNA abundance and
codon bias [67, 76, 128]. The explanations for most of the other statistical behaviors
remain unknown. Although some biological mechanisms of recognizing the
protein-coding regions have been investigated, as briefly reviewed for the translation
process in Section 1.1.2, the programs for identifying protein-coding regions have rarely
considered the biological processes in which these sequences participate, or given a
biological explanation for the revealed statistical behaviors.
1.3.2
Identification of Splice Sites
Various complex and sophisticated methods have been investigated statistically for
increasing the identification accuracy of the splice sites, as discussed in Section 1.2.2.
We have summarized these investigations from two main aspects, site conservations
and dependencies. Although the observation of the pre-mRNA splicing process
suggests that the splicing process exhibits strong features of site conservations [55] and
site dependencies [100], which facilitate the specific interactions between splice sites on
pre-mRNA and spliceosome complex, these methods have rarely utilized pre-mRNA
1.4
Derivation of A New Approach
We propose an approach for identifying the protein-coding sequences and the splice
sites by measuring their biological processes using free energy. An important issue in
today’s research is identifying the protein-coding regions and locating their exact
boundaries, especially the splice sites, for eukaryotic genes [36, 138]. We assume that a
variable free energy pattern could arise from the hybridization of a short RNA
sequence and a DNA sequence. We can measure this biological hybridization
using the binding free energy. We can then analyze the variable energy pattern as a
non-random signal using signal processing techniques, and extract the information to
identify the protein-coding sequences and the splice sites.
1.4.1
Identification of Protein-coding Sequences
In prokaryotes, free energy based calculations for Watson-Crick hybridization between
the 3’-terminal, single stranded, 13 nucleotides of the 16S rRNA sequence (16S tail)
and mRNAs have been used to analyze the translation process [90, 104, 116, 115, 147].
The literature suggests that 16S rRNA has a role in reading frame synchronization
throughout the translation process [140]. Putting these two ideas together, we
wondered if, during translation, the 16S tail could repeatedly hybridize to the mRNA
resulting in a free energy signal. If this could occur, the signal could encode
information to control reading frame. Rosnick (2001) [115] computed the free energy scores
in kcal/mol between the 3’ tail of 16S rRNA of E. coli K-12 and an ensemble of
In his algorithm, the 16S tail was overlaid on each mRNA, starting 50 nucleotides
upstream of the start codon, and a free energy value was calculated. The 16S tail
was then moved downstream one nucleotide, and the free energy calculation was done for this
new alignment. This process was repeated until the 16S tail was beyond the gene stop
codon. Using this method, Rosnick obtained one free energy sequence for each of the
2000 genes. The free energy sequences were then averaged to extract the ensemble
behavior for the genes of E. coli. This ensemble free energy signal showed
period-3 behavior in the coding region. Mishra et al. [105] extended Rosnick’s work to a
number of eubacterial and archaeal species of prokaryotes, and found the same period-3
behavior in gene coding regions. Ponnala et al. (2006) [116] used a procedure similar
to that of Rosnick, and obtained free energy signals for gene sequences. Ponnala et
al. (2006) showed that a periodic energetic pattern of frequency 1/3 exists in the
majority of coding regions of 12 eubacterial species, but not in the noncoding regions
that encode the 16S and 23S rRNAs.
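The ensemble averaging step described above can be sketched as follows. How free energy sequences of different lengths were combined is not stated here, so this sketch assumes that all sequences are aligned at their first position and that each position is averaged over the sequences long enough to reach it. The function name `ensemble_average` and the toy values are hypothetical.

```python
def ensemble_average(signals):
    """Average per-position values over all signals that reach that position.

    Assumes every signal starts at the same reference point (for example,
    a fixed offset upstream of the start codon)."""
    longest = max(len(s) for s in signals)
    averaged = []
    for pos in range(longest):
        values = [s[pos] for s in signals if pos < len(s)]
        averaged.append(sum(values) / len(values))
    return averaged

# Toy free energy sequences in arbitrary units (not real data).
toy_signals = [[-1.0, -2.0, -3.0], [-3.0, -4.0]]
print(ensemble_average(toy_signals))
```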
Extending these observations to genes in eukaryotic organisms, we investigate the
interactions between the 3’ tail end of the 18S rRNA and the underlying mRNA using
free energy calculations. The success of this approach in prokaryotes, i.e., the finding that prokaryotic mRNA
sequences have a period-3 encoded signal that can be revealed through hybridization
with the 16S tail, prompts the idea that the free energy signal could be useful for
identifying protein-coding regions. This idea is supported in part by the work of
Hagenbuchle et al. (1978) [56] and Sargan et al. (1982) [123] who suggested a
initiation codon of mRNA. This raises the question of whether the 18S rRNA tail
in eukaryotic genes, like the 16S rRNA tail in prokaryotic genes,
has a synchronization role with the reading frame throughout the translation process.
Therefore, the first task of my thesis work is to explore whether the protein-coding
sequences for yeast include period-3 signals like prokaryotic genes, and whether we
can use the period-3 signals to identify the coding sequences.
1.4.2
Identification of the Splice Sites
The splice sites are believed to be the most important functional sites for gene
identification [120]. If they could be reliably detected from the genomic DNA, the difficulty
in identifying the coding regions would be greatly reduced. We address the derivation
of our approach for the identification of the splice sites in this section.
As addressed above, the statistical methods have endeavored to improve their
identification performance from two main aspects, site conservations and site dependencies.
Although the observation of the pre-mRNA splicing process suggests that the splicing
process exhibits strong features of site conservations [55] and site dependencies [100],
methods have seldom utilized this mechanism for the identification of the splice
sites.
Mechanistic studies of splice site recognition involve interactions between the
pre-mRNA sequence and the conserved sequences in the snRNAs (small nuclear RNAs),
which are complexed with the snRNP components of the spliceosome [55]. This has
interactions could be a useful approach for identifying the rules of splice site recognition
[114].
A program that is capable of measuring the biological interactions of snRNAs
and pre-mRNA may capture the best combination of the site conservations and
dependencies. Such a program may also reveal features that contribute to splice
site identification. Garland and Aalberts [54] used a free energy approach to explore
the mechanism of donor site recognition by U1 during the pre-mRNA splicing
process and to identify the donor sites for a set of 65 human genes. They linked two
sequences into one sequence, and fed it into Mfold to locate the donor site with the
minimum free energy. In my thesis, I expand this approach to develop an improved
program that calculates the free energy specifically between a short sequence and
a DNA sequence, and that is able to identify all three splice sites. The free
energy period-3 signal described above may also assist the identification of the splice sites by finding
the consistent period-3 signal on the coding regions across the splice sites. Therefore,
the second task of my thesis work is to identify the splice sites, the main boundaries
of protein-coding/non-coding regions.
Thus, my thesis research focuses on the utilization of free energy signals to
measure the biological interactions for: 1) identifying the protein-coding/non-coding
sequences, i.e., the annotation problem of identifying open reading frames (ORFs) in
genomic sequence, and 2) a closely related problem of eukaryotic sequences, finding
boundaries) in eukaryotic gene sequences.
1.5
Outline
Chapter 2 describes two basic algorithms used in our work. Chapter 3 presents the
discovery of the periodic signal in yeast sequences, and the selection of the algorithm for
coding/non-coding sequence analysis. Chapter 4 provides a model for
protein-coding/non-coding sequence identification based on the periodic signal discovered
in Chapter 3. Chapter 5 discusses a model to pinpoint the exact boundaries of exons
and introns, the splice sites. The summary of the signal discovery, the methods, the
Chapter 2
Two Basic Algorithms
Two basic algorithms, free energy calculation and the cumulative sinusoidal wave, are
described in detail in this chapter to avoid repetition in the later chapters.
At the end of this chapter, we discuss the relation of the cumulative sinusoidal wave to the
conventional minimum mean-squared error estimation.
2.1
Algorithm for Free Energy Calculation
As introduced in Chapter 1, we propose to utilize the
biological processes to analyze and identify protein-coding sequences and the splice sites.
We use the free energy to measure the interactions of the two sequences during the
biological processes. In this section, we describe the free energy calculation
for hybridization between two sequences, a short sequence and a long gene sequence,
in which the short sequence slides along the long gene sequence. The short sequence
The algorithm assumes that the alignment between two sequences adjusts to form
the most stable secondary structure during their interactions [100]. In the first
alignment of the algorithm, the terminal 3’-nucleotide of the decoder is aligned with the
terminal 5’-nucleotide of the gene sequence. The algorithm then determines the
secondary hybridized structure with the most negative, i.e., most stable, free energy, and
calculates that free energy, ΔG°, using a dynamic programming algorithm [136, 147].
This is illustrated in the top half of Figure 2.1, where 3’-ATTACTAG-5’ is the decoder.
A stable double-helical structure can occur when ΔG° is less than
zero. The optimal binding energy ΔG° is recorded as e(0) for the first alignment.
[Figure 2.1 panels: "One calculation" (find the optimal secondary structure, then calculate the free energy e0 for the decoder ATTACTAG) and "The calculations on a sequence" (the decoder slides along the gene sequence, giving e0, e1, ..., en).]
Figure 2.1: Illustration of free energy calculation between a short sequence and a long
sequence.
The short sequence is moved downstream (in the 3’ direction) along the gene
sequence one nucleotide at a time from the start codon to the stop codon, generating
a series of alignments. For each alignment, a free energy value is calculated. We then
calculate a free energy sequence, or a free energy vector, for all the alignments over
energy sequence is then recorded as E = [e(0), e(1), ..., e(n)].
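The sliding calculation can be sketched as below. This is a deliberately simplified stand-in: the dynamic programming search for the most stable structure is replaced by a toy ungapped score that counts adjacent complementary pairs, and `PLACEHOLDER_STACK_ENERGY` is a made-up value, not a measured thermodynamic parameter. The function names and the toy gene are likewise assumptions for illustration.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}
PLACEHOLDER_STACK_ENERGY = -1.0  # hypothetical value per stacked pair, NOT a measured parameter

def toy_energy(decoder_3to5, window_5to3):
    """Toy ungapped score: one placeholder unit for each pair of adjacent
    complementary positions. Stands in for the dynamic programming energy."""
    paired = [COMPLEMENT[d] == w for d, w in zip(decoder_3to5, window_5to3)]
    stacks = sum(1 for i in range(len(paired) - 1) if paired[i] and paired[i + 1])
    return PLACEHOLDER_STACK_ENERGY * stacks

def free_energy_sequence(decoder_3to5, gene_5to3):
    """Slide the decoder one nucleotide at a time and record E = [e(0), e(1), ...]."""
    m = len(decoder_3to5)
    return [toy_energy(decoder_3to5, gene_5to3[i:i + m])
            for i in range(len(gene_5to3) - m + 1)]

decoder = "ATTACTAG"      # written 3'->5', as in the example decoder above
gene = "ATGTAATGATCCGA"   # hypothetical toy gene, written 5'->3'
E = free_energy_sequence(decoder, gene)
print(E)
```

In this toy gene the fully complementary site starts at position 3, so `E[3]` is the most negative entry. A realistic implementation would replace `toy_energy` with the structure search and parameter tables cited in this section.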
A number of researchers have used free energy as a metric for studying the
interactions of sequences [38, 63, 90, 106, 113, 115, 118, 125, 147]. RNAcofold [63, 118] is
a program that uses a linker sequence to join the two sequences into a single strand of
RNA prior to folding. Garland and Aalberts (2004) [54] used a similar approach by
linking two sequences into one sequence and then feeding it into MFOLD [173, 174]
to calculate the free energy between the two sequences. RNAhybrid [113] is a tool for
finding the minimum free energy hybridization of a long and a short RNA.
When considering the free energy between two sequences, several types of contribution have been
described, as given in Equation 2.1.
\[ \Delta G^{\circ} = \Delta G^{\circ}_{\mathrm{init}} + \Delta G^{\circ}_{\mathrm{stack}} + \Delta G^{\circ}_{\mathrm{Iloop}} + \Delta G^{\circ}_{\mathrm{Bloop}} \tag{2.1} \]
In this formula, $\Delta G^{\circ}_{\mathrm{init}}$ is the amount of free energy required to initiate a helix
between the two strands of RNA [52]; $\Delta G^{\circ}_{\mathrm{stack}}$ is the free energy associated with the
stacked bases in a double-stranded, hybridized sequence; $\Delta G^{\circ}_{\mathrm{Iloop}}$ is the internal loop
penalty [174]; and $\Delta G^{\circ}_{\mathrm{Bloop}}$ is the bulge loop penalty [174]. Dangling 5’ or 3’ ends [11] are
not considered in my algorithm because of ambiguities regarding what constitutes a
dangling end on either the mRNA sequences or the 5’ end of the decoder sequence, for
example the 18S rRNA tail. We obtained free energy parameter values for
Watson-Crick binding from Xia et al. (1998) [161], G/U mismatches from Mathews et al.
2.2
Algorithm of Cumulative Sinusoidal Wave
The “cumulative sinusoidal wave” algorithm is the method we used to extract the
synchronization signal from a free energy sequence, and is described in this section.
The “synchronization signal” is defined as the period-3 signal associated with the
authentic reading frame.
2.2.1
Mathematical Theory
We can express the period-3 signal using a sinusoidal wave in the time domain
(position in nucleotides in our work):
\[ y(n) = C_0 + M\sin(w_0 n + \phi) + z(n), \tag{2.2} \]
where $C_0$ is a constant, $M$ is the amplitude, $\phi$ is the constant phase shift,
$n = 0, 1, 2, \ldots$, and $z(n)$ is the noise at position $n$.
If the period is $N = 3$ nucleotides,
\[ \sin(w_0 n + \phi) = \sin(w_0 (n + N) + \phi) = \sin(w_0 (n + 3) + \phi) \]
\[ \therefore\; 3 w_0 = 2\pi \;\Rightarrow\; w_0 = 2\pi/3. \]
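Then Equation 2.2 can be written as
\[ y(n) = C_0 + M\sin\left(\frac{2\pi}{3}n + \phi\right) + z(n). \tag{2.3} \]
Equation 2.3 can be further written as three different expressions:
\[ y(3k) = C_0 + M\sin(\phi) + z(3k), \quad \text{when } n = 3k; \]
\[ y(3k+1) = C_0 + M\sin\left(\frac{2\pi}{3} + \phi\right) + z(3k+1), \quad \text{when } n = 3k+1; \]
\[ y(3k+2) = C_0 + M\sin\left(\frac{4\pi}{3} + \phi\right) + z(3k+2), \quad \text{when } n = 3k+2, \]
where $k$ is a non-negative integer.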
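As a short check that may help with the steps that follow, the three sinusoidal terms in these expressions sum to zero for any phase $\phi$. Using the sum-to-product identity $\sin A + \sin B = 2\sin\frac{A+B}{2}\cos\frac{A-B}{2}$ on the last two terms:
\[ \sin\phi + \sin\left(\phi + \frac{2\pi}{3}\right) + \sin\left(\phi + \frac{4\pi}{3}\right) = \sin\phi + 2\sin(\phi + \pi)\cos\left(\frac{\pi}{3}\right) = \sin\phi - \sin\phi = 0. \]
Consequently, in the noise-free case the average of $y(3k)$, $y(3k+1)$, and $y(3k+2)$ equals the constant $C_0$.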
When the signal is embedded in strong background noise, we need a technique
to emphasize the signal relative to the noise. To do this, we utilized a cumulative
calculation that sums up the free energy modulo 3 over the first k codons.
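\[ A(k) = \sum_{i=0}^{k-1} y(3i) = kC_0 + kM\sin(\phi) + \sum_{i=0}^{k-1} z(3i); \tag{2.4} \]
\[ B(k) = \sum_{i=0}^{k-1} y(3i+1) = kC_0 + kM\sin\left(\frac{2\pi}{3} + \phi\right) + \sum_{i=0}^{k-1} z(3i+1); \]
\[ C(k) = \sum_{i=0}^{k-1} y(3i+2) = kC_0 + kM\sin\left(\frac{4\pi}{3} + \phi\right) + \sum_{i=0}^{k-1} z(3i+2). \]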
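The average (DC term) is subtracted from A(k), B(k), and C(k):
\[ a(k) = A(k) - C_{DC}; \tag{2.5} \]
\[ b(k) = B(k) - C_{DC}; \]
\[ c(k) = C(k) - C_{DC}, \]
where
\[ C_{DC} = \frac{A(k) + B(k) + C(k)}{3} = kC_0 + \frac{1}{3}\sum_{i=0}^{3k-1} z(i). \]
The removal of the DC term is based on the mathematical fact that we can always
fit a sinusoid with period 3 to three equally spaced points, and that the sum of the sinusoid values at the
three points equals zero. For a case of a real free energy sequence, the DC term is
non-zero because the free energy values are non-positive.
Then, we assigned a(k), b(k), and c(k) to expressions corresponding to a sinusoidal
wave function:
\[ a(k) = kM\sin(\phi) + \sum_{i=0}^{k-1} z(3i) - \frac{1}{3}\sum_{i=0}^{3k-1} z(i); \tag{2.6} \]
\[ b(k) = kM\sin\left(\phi + \frac{2\pi}{3}\right) + \sum_{i=0}^{k-1} z(3i+1) - \frac{1}{3}\sum_{i=0}^{3k-1} z(i); \]
\[ c(k) = kM\sin\left(\phi + \frac{4\pi}{3}\right) + \sum_{i=0}^{k-1} z(3i+2) - \frac{1}{3}\sum_{i=0}^{3k-1} z(i). \]
If $z$ is random noise, $\sum_{i=0}^{k-1} z(3i) \simeq \sum_{i=0}^{k-1} z(3i+1) \simeq \sum_{i=0}^{k-1} z(3i+2)$ when $k$ becomes
a large number. Then
\[ \sum_{i=0}^{k-1} z(3i) - \frac{1}{3}\sum_{i=0}^{3k-1} z(i) \to 0, \]
\[ \sum_{i=0}^{k-1} z(3i+1) - \frac{1}{3}\sum_{i=0}^{3k-1} z(i) \to 0, \]
\[ \sum_{i=0}^{k-1} z(3i+2) - \frac{1}{3}\sum_{i=0}^{3k-1} z(i) \to 0. \]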
Therefore, we can observe that a(k), b(k), and c(k) converge to linear functions
that have a trend of linearly increasing amplitude and constant phase, when k is large
enough:
\[ a(k) \simeq kM\sin(\phi); \tag{2.7} \]
\[ b(k) \simeq kM\sin\left(\phi + \frac{2\pi}{3}\right); \]
\[ c(k) \simeq kM\sin\left(\phi + \frac{4\pi}{3}\right). \]
2.2.2
Application to a Real Free Energy Signal
We can always fit a sinusoid with period 3 to the equally spaced 3 points, no matter
whether a free energy sequence is period-3 or not. However, when the free energy
sequence is period-3, its cumulative amplitude will be linearly increasing, and its
cumulative phase will tend to be constant, as discussed above. This becomes the
basis for testing whether a given sequence is a protein-coding sequence. We describe
the application of the cumulative sinusoidal wave method to real free energy
sequences as follows.
For a real free energy sequence, we first summed up the free energy sequence
modulo 3 over the first 3k nucleotides (k codons), which resembles the DNA walk
[40]. Then, we fitted a sinusoidal wave to the three accumulated free energy values.
We performed the same procedure as given in Equation sets 2.4 and 2.5:
\[ A'(k) = \sum_{i=0}^{k-1} e(3i); \tag{2.8} \]
\[ B'(k) = \sum_{i=0}^{k-1} e(3i+1); \]
\[ C'(k) = \sum_{i=0}^{k-1} e(3i+2), \]
\[ a'(k) = A'(k) - C'_{DC}(k); \tag{2.9} \]
\[ b'(k) = B'(k) - C'_{DC}(k); \]
\[ c'(k) = C'(k) - C'_{DC}(k), \]
\[ C'_{DC}(k) = \frac{A'(k) + B'(k) + C'(k)}{3} = \frac{1}{3}\sum_{i=0}^{3k-1} e(i). \]
calculated the amplitude and phase by solving two of the following three equations.
\[ a'(k) = M(k)\sin(\phi(k)); \tag{2.10} \]
\[ b'(k) = M(k)\sin\left(\phi(k) + \frac{2\pi}{3}\right); \]
\[ c'(k) = M(k)\sin\left(\phi(k) + \frac{4\pi}{3}\right). \]
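One way to carry out this solution, shown here as a worked step and not necessarily the exact route taken in this work, uses the first two equations, with $a'(k)$ and $b'(k)$ denoting the DC-removed sums of Equation set 2.9. Expanding the second equation with the angle-sum formula gives
\[ b'(k) = M(k)\left(-\tfrac{1}{2}\sin\phi(k) + \tfrac{\sqrt{3}}{2}\cos\phi(k)\right) \;\Rightarrow\; M(k)\cos\phi(k) = \frac{a'(k) + 2b'(k)}{\sqrt{3}}, \]
so that
\[ M(k) = \sqrt{a'(k)^2 + \frac{\left(a'(k) + 2b'(k)\right)^2}{3}}, \qquad \phi(k) = \operatorname{atan2}\left(a'(k),\; \frac{a'(k) + 2b'(k)}{\sqrt{3}}\right). \]
Because the three DC-removed sums add to zero, the third equation is then satisfied automatically.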
Comparing Equation set 2.10 with Equation set 2.7, we can determine that
M(k) → kM and φ(k) → φ as k becomes large, only if a dominant period-3
signal exists in the free energy sequence. In other words, the amplitude increases
linearly and the phase is constant for a free energy sequence with a dominant
period-3 signal, when k is large enough.
The cumulative calculation compensates for the low signal-to-noise ratio (SNR)
of the computation for any single codon. In essence, the use of the
accumulated free energy sequence is a noise reduction technique.
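The whole procedure of this section can be sketched in a few lines. The sketch accumulates the three frame sums, removes the DC term, and solves the first two sinusoid equations for M(k) and φ(k) in closed form. The function name `cumulative_sinusoid` and the synthetic test signal (bias, amplitude, phase, and Gaussian noise level) are assumptions chosen only for illustration; they are not data from this work.

```python
import math
import random

def cumulative_sinusoid(e):
    """Return [(M(k), phi(k))] for k = 1 .. len(e)//3 from a free energy sequence e."""
    A = B = C = 0.0
    out = []
    for k in range(1, len(e) // 3 + 1):
        i = 3 * (k - 1)
        # Accumulate the three frame sums over the first k codons.
        A += e[i]
        B += e[i + 1]
        C += e[i + 2]
        # Remove the DC term (mean of the three accumulated sums).
        dc = (A + B + C) / 3.0
        a = A - dc
        b = B - dc
        # Solve a = M sin(phi), b = M sin(phi + 2*pi/3) for M and phi.
        m_cos = (a + 2.0 * b) / math.sqrt(3.0)   # equals M cos(phi)
        out.append((math.hypot(a, m_cos), math.atan2(a, m_cos)))
    return out

# Synthetic signal (hypothetical values): bias -2, amplitude 0.5, phase 1.0,
# plus zero-mean Gaussian noise with standard deviation 0.3.
random.seed(0)
true_M, true_phi = 0.5, 1.0
signal = [-2.0 + true_M * math.sin(2 * math.pi * n / 3 + true_phi)
          + random.gauss(0.0, 0.3) for n in range(3 * 400)]
fit = cumulative_sinusoid(signal)
M_last, phi_last = fit[-1]
print(M_last / 400, phi_last)
```

For independent zero-mean noise the accumulated noise grows roughly like the square root of k while the signal term grows like k, which is why M(k)/k and φ(k) settle near the true amplitude and phase as more codons are accumulated.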
2.3
Some Discussion with Conventional Method
In this section, we evaluate two methods for estimating the bias, the amplitude, and
the phase of a waveform with a period of three: the conventional minimum
mean-squared error method and the cumulative sinusoidal wave method. We first describe the conventional
minimum mean-squared error method. Then we discuss the
minimum mean-squared error method and the cumulative sinusoidal wave method [162,
2.3.1
Minimum Mean-squared Error
A noise-free, period-3 signal can be written as
\[ x(n) = C_0 + M\sin\left(\frac{2\pi}{3}n + \phi\right). \]
We seek to find the values of the parameters $C_0$, $M$, and $\phi$ that best represent the
data values, denoted as $y(n)$, in the minimum mean-squared error sense. The error is
\[ e(n) = x(n) - y(n). \]
That is, we seek to minimize the mean squared error
\[ E = \frac{1}{N}\sum_{n=1}^{N}\left(C_0 + M\sin\left(\frac{2\pi}{3}n + \phi\right) - y(n)\right)^2. \]
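Surprisingly, this non-linear minimization problem has a closed-form solution. First,
we calculate the three derivatives:
\[ \frac{\partial E}{\partial C_0} = \frac{2}{N}\sum_{n=1}^{N}\left(C_0 + M\sin\left(\frac{2\pi}{3}n + \phi\right) - y(n)\right) \tag{2.11} \]
\[ \frac{\partial E}{\partial M} = \frac{2}{N}\sum_{n=1}^{N}\left(C_0 + M\sin\left(\frac{2\pi}{3}n + \phi\right) - y(n)\right)\sin\left(\frac{2\pi}{3}n + \phi\right) \tag{2.12} \]
\[ \frac{\partial E}{\partial \phi} = \frac{2M}{N}\sum_{n=1}^{N}\left(C_0 + M\sin\left(\frac{2\pi}{3}n + \phi\right) - y(n)\right)\cos\left(\frac{2\pi}{3}n + \phi\right) \tag{2.13} \]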
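One way to see why a closed form exists, offered as a sketch and not necessarily the derivation used in this thesis, is to write M sin(2πn/3 + φ) = α sin(2πn/3) + β cos(2πn/3) with α = M cos φ and β = M sin φ. The error is then quadratic in (C0, α, β), and when N is a multiple of 3 the three basis sequences are mutually orthogonal over n = 1..N, so each coefficient is a simple projection. The function name `fit_period3` and the restriction to N divisible by 3 are assumptions of this sketch.

```python
import math

def fit_period3(y):
    """Least-squares fit of C0 + M*sin(2*pi*n/3 + phi) to y(1), ..., y(N).

    Assumes N is a positive multiple of 3, so that the constant, sine and
    cosine basis sequences are orthogonal over n = 1..N."""
    N = len(y)
    if N == 0 or N % 3 != 0:
        raise ValueError("this sketch assumes N is a positive multiple of 3")
    w = 2 * math.pi / 3
    c0 = sum(y) / N
    # Projections onto sin and cos; each basis sequence has squared norm N/2.
    alpha = 2.0 / N * sum(v * math.sin(w * n) for n, v in enumerate(y, start=1))  # M cos(phi)
    beta = 2.0 / N * sum(v * math.cos(w * n) for n, v in enumerate(y, start=1))   # M sin(phi)
    return c0, math.hypot(alpha, beta), math.atan2(beta, alpha)

# Noise-free check with hypothetical parameters.
w = 2 * math.pi / 3
y = [-2.0 + 0.5 * math.sin(w * n + 1.0) for n in range(1, 31)]
c0_hat, M_hat, phi_hat = fit_period3(y)
print(c0_hat, M_hat, phi_hat)
```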
Setting the first expression in Equation 2.11