Full Search Methods 710
Dynamic programming and branch-and-bound 710
Local Optimization 710
The downhill simplex method 711
The steepest descent method 711
The conjugate gradient method 714
Methods using second derivatives 714
Thermodynamic Simulation and Global Optimization 715
Monte Carlo and genetic algorithms 716
Molecular dynamics 718
Simulated annealing 719
Summary 719
Further Reading 719
List of Symbols 721
Glossary 734
Index 751
Contents
2ZIP method, 453–4, 455F
3₁₀-helices, 435
defining, for prediction algorithms, 464–5
3D-Coffee, 203
3DEE library, 574
3Djigsaw, 563
3D-PSSM, 533, 534F
3′ end, 6, 12T, 17–18
3-patterns, 217
5′ end, 6, 12T, 14, 19
–10 motif, 16
16S RNA sequences, 249
evolutionary model selection, 254F, 255, 256T
phylogenetic analysis, 249, 251, 255–8, 257F, 258F
–35 motif, 16
123D+ program, 535–6, 536F
α/β-fold proteins, 421F, 423F, 573F, 574
α-helices, 33–5, 413F
amino acid preferences, 37
Chou–Fasman propensities, 474–5, 474F, 475F
coiled coil formation, 451, 451F
defining, for prediction algorithms, 464–5, 466–7
hydrogen bonding, 34, 35F
length distributions, 467, 468F
prediction, 413–14, 428–9, 429F
see also secondary structure prediction
based on residue propensities, 477–8
neural network methods, 501, 501F
transmembrane proteins, 438, 439–48
sequence–structure correlations, 487–8, 487F
transmembrane proteins see transmembrane helices
turns, hairpins and loops connecting, 36–7
α-lactalbumin, 538–9, 539F
βαβ repeat, 40B
β-barrels, transmembrane see transmembrane β-barrels
β-bulges, 463, 465
β-lactamase family, 573F
β-meander, 40B
β-sheets, 34–6, 36F
defining, for prediction algorithms, 465, 465F
transmembrane proteins, 436
types, 35–6
β-Spider, 466, 467F
β-strands, 34–6, 36F, 413F
amino acid preferences, 37
Chou–Fasman propensities, 474–5, 474F
defining, for prediction algorithms, 465–6, 466–7
distortions, 463
length distributions, 467, 468F
prediction, 413–14, 428–9, 429F
see also secondary structure prediction
based on residue propensities, 477–8
transmembrane proteins, 448–51, 450F
variability, 467, 467F
turns, hairpins and loops connecting, 36–7
β-turns, 36, 37F, 413F
Chou–Fasman propensities, 475, 476T
defining, for prediction algorithms, 465
prediction, 413–14, 478, 503
π-helices, 435
defining, for prediction algorithms, 464–5
φ angles see under torsion angles
ψ angles see under torsion angles
A
A (accepted point mutation matrix), 120
AACC, 214–15, 214F
AAINDEX, 84
AAindex, 476
AAT program, 331T, 332T, 335, 336
ab initio approach, modeling protein structure, 522, 523B
accepted mutations, 84
accepted point mutation matrix (A), 120
acceptor splice sites, 18F, 380F, 392
acetolactate synthase (ALS) family, 259B, 262
activators, 16–17
adaptive systems, 667–8
additive trees, 228–9, 229F, 230
adenosine (A), 6, 6F
affine gap penalty, 127, 128, 133–4, 139
Affymetrix GeneChip® arrays, 602
Akaike information criterion (AIC), 253–5
ALDH10 gene, 324–5 annotation, 351–2 exon prediction
accuracy, 345, 345–6
different programs, 331–2, 331T, 333–4, 334F, 335, 336
experimental results compared, 327, 328F
using related organisms, 336–7 gene structure, 327B
interspecies comparisons, 353, 353F, 354F
pathway approach to identifying, 348, 349–50F
promoter prediction, 341, 341T start codon, 327, 330F
alignment, sequence see sequence alignment
Alix, Alain, 475
all α-fold proteins, 421F, 422F, 573F, 574
Note: Entries which are simply page numbers refer to the main text. Other entries have the following abbreviations immediately after the page number: B, box; F, figure; FD, flow diagram; MM, mind map; T, table.
INDEX
all β-fold proteins, 421F, 422F, 573F, 574
alternative splicing, 19, 380–1 Alu elements, 337B
Alzheimer’s disease, 491 AMAS program, 93 AMBER program, 526, 701
amino acid(s) (residues), 11, 27–33 chemical structure, 28F
conservation, to identify binding sites, 586–7, 587F
conservation values (Zpred), 426, 427F, 428F, 429T
hydrophobicity scales, 437–8, 450, 475, 477T
peptide bonds, 29–33, 31F
physicochemical properties, 28–9, 28T, 30F
amino acid propensities, 37, 472–85, 472FD
see also Chou–Fasman propensities averaged over sequence windows,
476–9
derivation and calculations, 473–6 nearby sequence effects, 479–84,
480F
amino acid sequences, 13, 25, 29 see also protein sequences evolutionary conservation, 38 short segments with structural
correlations, 487–8, 487F amino acid side chains, 28F
modeling, 547–8, 548F, 558–9, 561 physicochemical properties, 28–9,
30F
torsion angles (χ1, χ2, etc), 547, 548F
amino (N) terminus, 29
amphipathic helix, 439–41
amyloidogenic proteins, 486, 487, 491–2, 492F, 493F
analogous enzymes, 244, 244F
analysis of covariance (ANCOVA), 659
analysis of variance (ANOVA), 659
ancestral states, 226
anchor points, 546, 546F
Anfinsen, Christian, 412, 412F
annotation, 357
automated, 64–5 database, 53
data errors or omissions, 64 gene, 348–52
genome see genome annotation manual, 65
ANOLEA program, 550–1, 551T antibiotic synthesis, 643B antibodies, 381, 555B
modeling, 555–6B anticoding strand, 11 anticodons, 13–14, 14F
antigen-binding site, 555–6B antigens, 555B
antisense strand, 11 apoptotic pathway, 681F
approximate correlation coefficient (AC), 366B
Arabidopsis thaliana, 328, 330B
gene duplications, 241B
Rha1 gene prediction, 393F
splice sites, 380F, 396
vs rice, 335B
Archaea, 21, 21F
horizontal gene transfer, 246F, 247
sequenced genomes, 324T
architecture
database, 45
network, 676, 677F
Argos, Patrick, 171
ArrayExpress, 58, 606, 611
ArrayExpress Data Warehouse, 58
arrhythmia, cardiac, modeling, 677, 678F
ATG start codons see start codons atomic charges, 704
atomic mean force potential (AMFP), 551
AUG codon, 13, 19, 367
AU (approximately unbiased) method, 309
average conditional probability (ACP), 366B
B
backbone (protein), 29, 32 models, 39, 39F
back-propagation method, 497B backward algorithm, 190–1 bacteria, 21, 21F
see also Escherichia coli; prokaryotes 16S RNA, 249
horizontal gene transfer, 246F, 247 sequenced genomes, 324T balanced training, 498B Baldi, Pierre, 191 BAliBase, 92, 93F
balloting probabilities, 501 Barton, Geoff, 206
base-pairing, 7–9, 8F RNA, 456
wobble, 14 bases, 5–7, 6F
base sequences see nucleotide sequences
Baum–Welch expectation
maximization algorithm, 191–3 Bayesian information criterion (BIC),
254–5
Bayesian methods, 697–8
dealing with lack of replicates, 657B
phylogenetic tree reconstruction, 250, 251T, 253, 306–7
Bayes’ theorem, 697–8 Benjamini, Yoav, 659
Berkeley Drosophila Genome Project (BDGP), 340, 341T
Betaturns method, 503 biased mutation pressure, 239 biclustering, 649–50, 650F
bidirectional recurrent neural network (BRNN), 504, 505F
Bifidobacterium longum, 348, 350F bifurcating (branching) pattern, 226–7 binding sites, protein see protein
binding sites
biochemical pathways see metabolic pathways
BioEdit program, 260 bioinformatics, 3
protein structure and, 37–9, 38FD BioModels Database, 692
Biomolecular Interaction Network (BIND), 58, 671, 673F bistable switches, 688–9, 689F BLAST program, 95–6
algorithmic approximations, 141 comparing nucleotide with protein
sequences, 150–3
Conserved Domain Database (CDD) search, 99F, 100
dealing with low-complexity regions, 101–2
E-values, 98–100, 99F, 156 gapped method, 147–50, 178T GenScan modification using, 397 restriction of matrix coverage, 140 suffix trees, 141–3
use of finite-state automata, 147–50, 147F, 148F versions available, 95–7
whole genome alignments, 157–9 blastx program, 96, 97, 150, 343 BLAT program, 158
BLOCKS database, 58
Dirichlet mixture from, 174–5, 174F
searching, 105–7, 106F
substitution matrices from, 122 BLOSUM matrices, 83F, 84
alignment scoring, 82 derivation, 122–5, 123F, 124F selection, 84, 85
summary score measures, 125F, 126 Blundell, Tom, 532
Boltzmann factor, 706 bond angle energy, 703 bond energy, 702
bonding terms, 525–6, 701, 702–4, 702F
Bonferroni correction, 658
bootstrap analysis, 310B
assessing tree topology, 309–10 comparing tree topologies, 233–4,
233F
comparing two or more trees, 311 parametric, 310B
practical example, 258, 259F bootstrap interior branch test, 310 bottom-up approach, modeling
biological systems, 674–6, 676F bovine spongiform encephalopathy
(BSE), 37B, 101B
branch-and-bound method, 288, 710 branches, 226, 227F
branch length calculations, 293–7, 295F, 296F
assessing reliability, 309–10
parsimony methods, 299–300
branch swapping techniques, 289–91, 290F
BRCA2, 78, 79F
Brenner, Steven, 480
Brudno, Michael, 209
Bryant, David, 296, 296F
BTPRED method, 503
Bucher weight matrix method, 383–4, 384F
Burset, Moises, 365–6B, 392B BVSPS program, 551T
C
C2-like domain, Dictyostelia, 535–7, 536F, 537F
Cα atoms, 28, 28F, 29, 417
analysis of geometry, for prediction algorithms, 466, 466F
torsion angles see under torsion angles
Cα models, 39, 39F
Caenorhabditis elegans, 399
CAFASP (Critical Assessment of Fully Automated Structure Prediction), 419, 554–6
cAMP PK see cyclic AMP-dependent protein kinase
canonical ensemble, 718 Cantor, Charles, 271 capping, RNA, 18
cap signal (initiator signal, Inr), 389 Bucher weight matrix, 383, 384,
384F
GenScan prediction method, 385, 385F
NNPP prediction method, 385–6, 386F
carboxy (C) terminus, 29 Casadio, Rita, 479–80
cascade-correlation neural network, 503–4
CASP (Critical Assessment of Structure Prediction), 419, 554–6
CATH database, 531, 574
causal dependencies, 668 Cbl protein, 575–80, 576F
CCAAT box, detection algorithms, 383, 384–5
CDK10 gene, 324–5 DNA sequence, 326–7B
exon prediction, 329F, 330–1, 332T, 336–7
translation of predicted exons, 344F cDNA (complementary DNA)
exon prediction using, 397 gene-prediction programs using,
334, 335 microarrays, 602 sequence databases, 56 Celera, 376B
cell-division cycle, 688–9
Cell Markup Language (CellML), 692 CellML Model Repository, 692 cellular modeling
heart, 685T
international projects, 668 programs, 691–2, 691F
CE (Combinatorial Extension) method, 576–7, 578F
central dogma, 10–14, 10F, 10FD centroid, 711
centroid method, hierarchical clustering, 640, 641F chaining, 144–6
chameleon sequences, 37B, 488 CHAOS algorithm, 209
CHARMM program, 526, 701 ChiClust program, 617, 618–19 ChiMap program, 618–20, 619F chloroplasts, 22, 292B
Chou, Peter, 472
Chou–Fasman propensities, 414, 415F, 472, 474–6
applied to GOR, 483
calculated values, 474F, 476T
measures of accuracy, 424T
nearest-neighbor methods, 489
periodic variation, 474–5, 475F
transmembrane helices, 475–6, 478F
window sizes, 477–8
chromatography, 600, 623
chromosomes, 10, 21–2
rearrangements, 248
Churchill, Gary, 275
chymosin B, 486, 487F, 490F
chymotrypsin, 243–4, 244F
CINEMA program, 93
cis conformation, 32, 33F
clades, 256
Cladist program, 608–9, 609F
cladogram, 228, 229F ClustalW, 90, 91–2
progressive alignment method, 205 scoring scheme, 201–2, 201F, 202F vs other alignment methods, 92,
93F
cluster analysis, 625–64, 626MM data preparation, 626–33, 627F,
627FD
defining distances, 633–7, 634FD, 636F
evaluating validity of clusters, 650–1
hierarchical see hierarchical clustering
hydrophobic (HCA), 110–11, 110F sequence alignment, 90–1, 90F, 126 clustering methods
see also specific methods comparison between, 643B gene expression microarray data,
606–11, 611F
identifying expression patterns, 637–51, 637FD
phylogenetic tree construction, 276–9, 277FD
protein expression data, 615–17, 617F, 618F
Clusters of Orthologous Groups (COG) database, 103, 243, 245B
CMISS modeling tool, 692 COACH method, 195, 203 coding, 11, 12–13 coding strand, 11–12 codon-pairs see dicodons codons, 13
see also start codons; stop codons frequency of occurrence, 367, 367F genetic code, 12T
mutation rates at different, 238–9, 238F
statistics, use by ORPHEUS, 372–3 co-expressed genes or proteins, 600,
638
COFFEE scoring system, 200, 203, 204F
COG (Clusters of Orthologous Groups) database, 103, 243, 245B
Cohen, Stanley, 643B coiled coils, 413, 435 geometry, 451, 451F
prediction, 451–4, 452FD, 478–9, 510, 510F
COILS program, 452–3, 454F, 478–9 collagen, 452
common evolutionary ancestor, measuring likelihood, 117–19 comparative modeling see homology
modeling
COMPASS method, 195
complementary DNA see cDNA complementary DNA strands, 7–8 complete linkage clustering, 640,
641F complexity
see also low-complexity regions biological systems, 684–5 compositional, 151–2B COMPOSER program, 546, 553–4 compositional complexity, 151–2B concatamers, 605
condensation reaction, 29, 31F condensed trees, 233–4, 233F conditioned reconstruction, 292B confidence index, 432
conformation, 27, 41
see also quaternary conformation energies, 524–9, 524FD
side chains, 547–8
conformational flexible docking, 590
conformers, 547
conjugate gradient method, 528, 713F, 714
conjugate prior, 698 consensus features, 234
consensus method, pattern or motif creation, 105
consensus sequences, 16 consensus trees, 234–5, 234F, 291 Conserved Domain Database (CDD)
search, 99F, 100 CONSOLV program, 593 ConSurf program, 587, 587F
contact capacity potential (CCP), 533, 707–8, 708F
context strings, 371
control circuits, biological systems, 680, 680F
convergent evolution, 74–5, 75B, 243–4, 244F
cooperativity, 701
COPASI modeling tool, 692 Corbin, Kendall, 270
CorePromoter program, 340, 341T, 388, 389F
core promoters, 17, 319 see also promoter prediction detection of binding signals, 339,
381–9
models designed to locate, 383–7 Cost, Scott, 489, 491
covalent bonds, 32B, 33B energetics, 525–6, 701, 702–4 CPHmodels, 554, 563
creatine kinase, 42F, 43
Creutzfeldt–Jakob disease (CJD), 101, 101B
variant (vCJD), 101B Crick, Francis, 7
Critical Assessment of Fully Automated Structure Prediction (CAFASP), 419, 554–6
Critical Assessment of Structure Prediction (CASP), 419, 554–6 Crooks, Gavin, 480
C terminus, 29
Cy5/Cy3 label gene expression microarrays, 602–3, 603F
cyclic AMP-dependent protein kinase (cAMP PK)
inserting gaps, 86, 86F
local and global alignment, 89, 89F
multiple alignment, 91–2, 92F cytochrome c oxidase I, 249 cytosine (C), 6, 6F
D
Dali library, 574
DALI program, 578–9, 579F
Darwinian concept of evolution, 235 DAS (Distributed Annotation System),
348–51, 351F
DAS (dense alignment surface) program, 442F, 444–5, 445F, 447 data, 53
checking for consistency, 63–4 derived (secondary), 53–4 log transformation, 629–30, 630F normalization, 627–31, 628F, 630F primary, 53–4
quality, 61–6, 62FD
database management system (DBMS), 48
Database of Interacting Proteins (DIP), 58
databases, 45–66, 46MM access to, 52
categories (by content), 55–61, 56F
centers, 55
content of entries, 53 data quality, 61–6, 62FD distributed, 48, 52
entry identifiers/version numbers, 65–6
first computerized, 48, 48F flat-file, 47, 47F, 48–9 links between, 52, 53 looking for, 55–61 nonredundancy, 62–3 ontologies, 54–5, 54F relational, 48, 49–50, 49F structure, 46–52, 47FD
for systems biology, 671–2, 675T training and test, 416–17 types, 52–5, 53FD, 55FD updating, 65–6
data classification, 637–8, 638F see also sample classification secondary structure prediction,
510–14, 511FD
data warehouses, 48, 51F, 52 Davies, Graham P., 420B Dayhoff, Margaret, 82, 119 Dayhoff mutation data matrices
(MDMs) see PAM matrices dbEST, 56, 321B
DEAD-box motif, 420B decision trees
detection of functional RNA molecules, 361–3, 363F sample classification, 661 splice site prediction, 394 DEFINE, 417
degenerate (genetic code), 13
degrees of freedom (df), 654, 655
deletions
accounting for, in sequence alignment, 85–7
alignment scoring schemes, 117, 126–7
homology modeling, 542, 545–6, 545F
threading and, 532, 537 denatured proteins, 42 dendrograms, 636, 636F
gene expression data, 606F, 607, 607F, 608
hierarchical cluster analysis, 639, 640, 640F, 641F
dense alignment surface (DAS) program, 442F, 444–5, 445F, 447 deoxyribonucleic acid see DNA deoxyribonucleotides, 6 deoxyribose, 5–6
DESTRUCT method, 503–4, 505F deterministic finite-state automaton,
147F, 148–50 diagonals
DIALIGN method, 92, 207–9, 208F
FASTA scoring, 95
labeling of matrix, 144F, 145 restricting matrix coverage to,
139–41, 139F, 140F
DIALIGN program, 92, 93F, 207–9 DIAL program, 575, 576, 576F dichotomous (branching) pattern,
226–7
dicodons (hexamers), 328, 367
exon prediction using, 390
gene detection methods using, 368–72
promoter prediction using, 387–8
Dictyostelia, C2-like domain, 535–7, 536F, 537F
dielectric constant, 704
differential equations, modeling biological systems, 680–3, 682F digital differential display (DDD),
605–6, 605F
dihedral angles see torsion angles dihydrofolate reductase (DHFR)
ligand docking, 592, 592F
pocket identification, 585–6, 586F dimers, 43
directed acyclic graph (DAG), 512 directional information, 423, 482 Dirichlet distribution densities, 174 Dirichlet mixture, 174–5, 174F, 176F discriminant analysis
see also linear discriminant analysis;
quadratic discriminant analysis gene prediction, 340, 388, 389F,
396–7
sample classification, 661 secondary structure prediction,
512–13 distance, 81
see also evolutionary distance;
p-distance
definitions for cluster analysis, 633–7, 634FD, 636F
phylogenetic tree reconstruction, 249–50, 251, 251T
distance correction, 236
Distributed Annotation System (DAS), 348–51, 351F
distributed databases, 48, 52 divergent evolution, 75B
divide-and-conquer method (multiple alignment), 91, 91F
vs other alignment methods, 92, 93F
DNA, 4
central dogma concept, 10, 10F, 10FD
complementary see cDNA
double helix formation, 7–9, 8F
mutations see mutations
noncoding see junk DNA
strands, 7–9, 8F, 11–12
structure, 5–9, 5FD, 8F
transcription see transcription
DNA gyrases (GyrA and GyrB), 249
DNA microarrays, 9, 600, 601–4
basic principle, 602
databases see microarray databases data clustering methods, 606–10,
643B
data sharing and integration, 606 gene expression studies, 602–4,
603F
principal component analysis of data, 618
two-color, 602–3, 603F
uses of clustered data, 610–11, 611F
DNA polymerase, 8 DNA repeats, 22B
see also repeat sequences detection, 152B
exclusion from analysis, 319–21 DNA replication, 8, 8F
DNA sequence databases, 56, 57F nomenclature for base uncertainty,
63, 63T DNA sequences
alignment scoring matrices, 124F, 125
detecting homology, 75–6 gene prediction from see gene
prediction
multiple alignments, 92 nucleotide bias, 275–6
phylogenetic tree reconstruction, 249
preliminary examination, 318–22, 319FD
searching with, 97 docking, 587–93, 588FD
accounting for water molecules, 592–3
conformational flexible, 590 fragment, 591
scoring functions, 590 simple strategies, 588
specialized programs, 588–92, 592F DOCK program, 590–1
domains protein, 41
see also multidomain proteins families, 259B
identifying, 574–6, 576F shuffling, 570
taxonomic, 21
donor splice sites, 18F, 380F, 392 dot-plots, 77–8, 77F, 79F
low-complexity regions, 101–2, 102F
double dynamic programming, 534 downhill simplex method, 711, 712F downstream sequences, 16
d-patterns, 217
drawhca program, 110F, 111 drug design, rational, 588, 589B DSC method, 512–13
DSSP program, 417
defining secondary structures, 464–6, 465F, 465T, 467, 467F length distributions of secondary
structures, 467, 468F
nearby sequence effects, 479–80, 480F
duplication
chromosome and genome, 248 gene see gene duplication sequence, 158F, 245
Durbin, Richard, 363 DUST program, 152B
dynamic programming algorithms double, 534
gene model, 399, 402F global–local, 533
pairwise alignment, 86–7 database searching, 95–7 discarding intermediate
calculations, 138B
extension to multiple alignment, 198
function optimization, 710 local and suboptimal, 135–9 optimal global, 129–35
principles and methods, 127–41, 128FD
time methods, 139–41, 139F, 140F
Sankoff algorithm for weighted parsimony, 300–2, 301F threading, 533–4, 534F
E
E-Cell Project, 668
EcoCyc database, 671, 673F EcoKI restriction enzyme, 420B EcoParse gene model, 375F, 376–7 Eddy, Sean, 293, 362, 363
edges see branches Efron, Bradley, 310B
EGFR see epidermal growth factor receptor
eigensamples, 633
Eisenberg hydrophobicity scale, 450
Elber, Ron, 532
electronic resonance, 31
electrostatic interactions, 33B, 704 EMAP modeling tool, 692
emergent properties, 669 emissions, 179, 181–2 eMOTIF, 213–15, 214F
end state, 179, 180, 182–3, 183F energies
free see free energy molecular, 700–8
potential see potential energy energy gradient, 528
energy minima, global, 524, 528–9 energy minimization, 527–8, 528F applied to homology modeling,
548, 559–60 Ensembl, 103, 403
enthalpy see potential energy entropy, 695–7
component of free energy, 525 relative, 125F, 126, 697 Shannon, 695–6
enzymes, 40
analogous, 244, 244F
convergent evolution, 243–4, 244F phylogenetic analysis, 259–63 simulation modeling, 690F, 691–2,
691F
epidermal growth factor receptor (EGFR), 436, 436B
mitogen-activated protein kinase system, 683F
pathway modeling, 681, 682F, 690 epitope, 555B
ergodic systems, 717, 718–19 errors
random, 627–8
systematic, 625, 627–8
type I, 653, 658
types and rates, 657–8
Erwinia carotovora, 262
Escherichia coli, 21, 378
detection of tRNA genes, 320–1, 320F
EcoCyc database, 671, 673F
EcoParse gene model, 375F, 376–7
engineered OROlac promoter, 676, 676F
gene classification by codon usage, 370
GeneMark.hmm gene model, 375–6
genome segment annotations, 322, 323F
heat shock response, 680, 680F length distributions of
coding/noncoding regions, 374F, 375
promoters, 339–40
pyruvate formate-lyase, 467F pyruvate kinase, 480F robustness, 684 start codons, 366F, 367 ESPript, 93
ESTs see expressed sequence tags ESyPred3D, 554, 563, 563T Euclidean distance, 634–5, 636F Eukarya see eukaryotes
eukaryotes, 14, 21–2, 21F control of translation, 19
exon prediction see exon prediction gene detection, 323–37, 323FD, 360 finding correct start codon, 327,
330F
homology searching, 322 with only query sequence,
327–32
with query sequence and gene model, 332–4
sequence features used, 377–81, 378FD
series of steps, 346T
using correct reading frame, 325–7, 325T, 328F, 329F using gene control signals,
381–9, 382FD
using gene model and sequence similarity, 334–6
using genomes of related organisms, 336–7
variety of approaches, 324–5 vs methods used in prokaryotes,
377–9
gene models, 397–9, 398FD gene structure, 319, 325F intron prediction see intron
prediction
mRNA modifications, 18–19 origins, 292B
promoter prediction, 339, 340–2 indefinite nature of results, 341,
341T
online methods, 340–1 theoretical basis, 381–9 regulation of transcription, 15,
17–18, 17F
splice site detection see splice sites, detection
tRNA gene detection, 362–3 Eukaryotic Promoter Database (EPD),
339, 340
European Bioinformatics Institute (EMBL-EBI), 52, 55, 606 databases, 55–6, 60 E-values, 98
cut-off thresholds, 98–100, 99F, 101F
PSSM construction, 176 statistical significance, 156 EVA program, 551T
evolution, 5, 20–3, 20FD aiding sequence analysis, 38 basic concepts of molecular,
235–48, 235FD
convergent, 74–5, 75B, 243–4, 244F Darwinian concept, 235
divergent, 75B gene level, 239–47 genome level, 247–8
minimum see minimum evolution nucleotide level, 236–9
evolutionary clustering algorithms, 646–7, 646F
evolutionary distance, 81, 199, 224–5 see also p-distance
additive phylogenetic trees, 228, 229F
calculation, 268–76, 269F evaluating tree topologies using,
293–7
PAM matrices and, 84 sources of errors, 277
tree construction, 251–2, 276–9, 277FD
evolutionary history
phylogenetic trees see phylogenetic trees
recovering, 223–64, 224MM evolutionary models
practical application, 251–5, 253T selection of appropriate, 253–5,
254F, 256T
sequence alignment, 117–19 theoretical basis, 268–76 time-reversible, 302
evolutionary trace method, identifying binding sites, 586–7, 587F exclusive classification, 637–8, 638F exon prediction, 319, 323–37
assessing accuracy, 343–6, 343F, 344F, 392B
with only query sequence, 327–32
with query sequence and gene model, 332–4
theoretical basis, 379–81, 389–97, 391FD
using correct reading frame, 325–7, 325T, 328F, 329F, 391–2
using gene model and sequence similarity, 334–6
using general sequence properties, 390–2
using genomes of related organisms, 336–7
using homology searches, 397 variety of approaches, 324–5 exons, 18, 18F, 19
initial and terminal, detection, 390, 396–7
length distributions, 379, 379F translating predicted, 343, 344F use of term, 379–80
ExPASy program, 345, 412, 620 expectation maximization (EM), 191,
216
expectation values see E-values expected number of offspring (EO),
209
expected score, 119, 126 see also E-values
explicit state duration hidden Markov model (HMM), 374
expressed (genes), 11 see also gene expression
expressed sequence tags (ESTs), 321B databases, 56, 103
digital differential display (DDD), 605–6, 605F
exon prediction using, 397
gene-prediction methods using, 334–5
expression level ratios, 628–30, 629F, 630F
in different samples, 652
log transformation, 629–30, 630F eXtensible Markup Language (XML),
50–1
external nodes, 226, 227F
extracellular matrix (ECM), modeling tumor invasion, 677, 677F Extreme Pathways, 678
extreme-value distribution, 97–8, 155–6, 155F
extrinsic classification, 638
extrinsic gene detection methods, 361, 368FD
eye, gene expression patterns, 607F, 608
F
false discovery error rate (FDR), 658, 659
false negatives
in gene prediction, 365B in sequence analysis, 212 false positives
in gene prediction, 365B in sequence analysis, 212 statistical tests, 653
families, protein see protein families family-wise error rate (FWER), 658,
659
Fano definition of mutual information, 481
Fasman, Gerald, 472 FASTA program, 95
algorithmic approximations, 141 chaining, 144–6
comparing nucleotide with protein sequences, 150–3
database searching method, 143, 144–6, 145F
E-values, 98, 100, 101F, 156
restriction of matrix coverage, 140
versions available, 95–6, 96T
whole genome alignments, 157–9
fast Fourier transform (FFT), 206
FATCAT program, 579–80, 580F
feedback control, 680, 680F
feedforward control, 680, 680F
Felsenstein, Joseph, 253, 275
Felsenstein 81 (F81) model, 253, 253T, 254F, 256T
Felsenstein zone (long-branch attraction), 292, 308–9, 309F Ferrell, J.E., 689F
FGENESH program, 332, 333–4, 334F comparative results, 331T, 332T,
333F
rice genome prediction, 335B
fibrin, 451–2
fibrous proteins, 41, 435 fields (database), 46–7
fingerprints, multiple motif, 109 finite-state automata (FSA), 147–50,
147F, 148F
vs hidden Markov models, 147, 179, 180–1
FirstEF, 332, 396–7
Fitch algorithm see post-order traversal
Fitch–Margoliash method, 250, 251T
evaluating tree topologies, 293, 297
generating single trees, 279–80, 280F, 281F
vs neighbor-joining, 282, 284F, 285 fitness, 235
evolutionary clustering, 646–7, 646F
flavin adenine dinucleotide (FAD), 259B, 260, 261F, 262
flavodoxin family, 573F Fletcher–Reeves formula, 714 Flicker program, 614, 620, 620F Flux Balance Analysis (FBA), 678 FoldIndex method, 513
folding, protein see protein folding folding funnel, 525
fold recognition see threading folds, protein see protein folds force fields, 522, 524–9, 701–5
additive, 701 class I and II, 702 nonadditive, 701 forward algorithm, 190
fractional alignment difference, 269 frameshift, 150
Franklin, Rosalind, 7, 7F free energy
folded proteins, 41–2
RNA secondary structures, 456, 457–8
surface, molecular systems, 525, 525F
free insertion modules (FIMs), 184–5 fructose-1,6-bisphosphate aldolases
(FBPAs), 569F, 570, 570F FSSP database, 574, 578–9 Fuchs, Patrick, 475
FUGUE program, 532, 535–6, 536F fully resolved trees, 227
function (protein and gene), 40–1 see also structure–function
relationships
conservation, 568–74, 568FD evolution, 242, 243–4 genome annotation, 400–3 orthologs, 239, 243 patterns and, 109–11
phylogenetic trees for predicting, 262
protein folding and, 40–1, 41F using orthologs to predict, 245 functional homology, 569–70, 569F,
570F
function optimization see optimization, function FunSiteP algorithm, 340, 341, 341T fusion
gene, 72 genome, 292B
G
Gamma distance (correction), 239, 269F, 270
Gamma distribution (Γ), 269F, 270
evolutionary model variation, 253T, 254F
gap extension penalty (GEP), 85, 127 gap insertion operator, 210–11, 211F gap opening penalty (GOP), 127, 202,
202F
gap penalties, 85–6, 87, 126–7 global alignments, 131F, 132–5,
132F, 134F
local alignments, 137 manual adjustment, 93
multiple alignments, 202, 205, 206 position-specific scoring matrices,
170, 177
suboptimal alignments, 137F, 139 gaps, 74
inserting, 85–7
in multiple alignments, 204, 205F scoring, 126–7
Garnier, J, 422
Gaussian distributions see normal distributions
GAZE program, 399, 402F
GC box, detection algorithms, 383, 384–5
GC content
bacterial genomes, 238F, 239 evolutionary models and, 273 promoter prediction using, 386,
387F
regions of different (isochores), 275, 378
GenBank, 55–6, 102–3 flat-file format, 47, 47F sample extract, 57F gene(s), 5, 10–11
evolution, 239–47
families see protein families function see function fusion, 72
nested, 399 nonfunctional, 242 overlapping, 12, 12F, 360 prokaryotic vs eukaryotic, 377–9
structure and control, 14–20, 15FD, 318–19
structure in eukaryotes, 319, 324 GeneBee program, 457F, 458
GeneBuilder program, 331T, 332T, 335, 336
GeneCluster2 program, 608 gene duplication, 73, 239–42, 242F
acetolactate synthase (ALS), 262, 263F
effects on phylogenetic analyses, 245
identified from synonymous mutations, 241B
phylogenetic trees, 226, 231F structure–function relationships,
570
use for rooting trees, 292–3 gene expression, 11
co-expression, 600 databases, 58
digital differential display (DDD), 605–6, 605F
microarrays, 602–4, 603F see also DNA microarrays patterns, 638, 639F
SAGE method, 604–5, 604F sample classification, 659–62,
660FD
uses of clustered data, 610–11, 611F
gene expression analysis, 599–600, 600MM, 601–11, 601FD clustering methods see clustering
methods
data preparation for, 626–33, 627F, 627FD
statistics, 652–9 gene loss, 242–3, 243F
effects on phylogenetic analyses, 245
GeneMark algorithm, 328–9, 368–70 comparative results, 331–2, 331T,
332T
GeneMark.hmm algorithm, 373–6, 374F
gene models, eukaryotic, 397–9, 398FD
Gene Ontology, 54, 348 gene ontology
evaluating validity of clusters, 651 genome annotation, 348–52, 402 gene prediction (detection), 317–46,
318MM
assessing accuracy, 342–6, 342FD at exon level, 343, 344F, 392B at nucleotide level, 343, 343F,
365–6B
at protein level, 343–6, 345F eukaryotes see under eukaryotes
evaluation and reevaluation of methods, 405
exon prediction see exon prediction further analysis, 399–405, 400FD intrinsic and extrinsic methods,
361, 368FD
intron prediction see intron prediction
potential for errors, 65
preliminary steps, 318–22, 319FD prokaryotes see under prokaryotes promoter region, 338–42, 381–9 splice site detection see splice sites,
detection
theoretical basis, 357–99, 358MM general time-reversible model (GTR or
REV), 253T, 255, 262
general transcription initiation factors, 17
see also transcription factors generation, 209
GeneSplicer program, 394–5 genetic algorithms
cluster analysis, 646–7, 646F docking, 591–2, 592F
function optimization, 709, 716–18, 716F
multiple sequence alignment (SAGA), 209–11, 210F, 211F genetic code, 11, 12–13, 12T
degeneracy, 13
genetic distance, 224–5, 232F see also evolutionary distance gene (phylogenetic) trees, 226, 230,
231F
combined with species trees, 243, 244F
reconstruction example, 259–63, 261F, 263F
GeneWalker program, 331T, 332T, 335–6
GeneWise program, 345–6 Genie program, 329F, 386 Geno3D program, 554, 563, 563T genome(s), 4, 10
comparisons see genome sequence alignments
completely sequenced, 71 databases, 56, 103 evolution, 247–8 fusion, 292B
identifying features, 317–54, 318MM
known prokaryotic, 324T problems of defining, 23B genome annotation, 65, 399–405
see also gene prediction comparing genomes to check
accuracy, 353–4, 353F, 354F, 403–5, 403F, 404F
E. coli segment, 322, 323F evaluation and reevaluation, 405 functional, 400–3
pathway information aiding, 348, 349–50F
pipeline approach, 319
practical aspects, 346–52, 347FD quality of information used, 403 role of gene ontology, 348–52, 402 theoretical basis, 357–9, 358MM Genome Browser, 352, 352F GenomeNet, 84
GenomeScan, 397
genome sequence alignments to verify annotation, 353–4, 353F,
354F, 403–5, 403F, 404F whole genomes, 156–9, 157FD genome sequences
excluding noncoding regions, 319–21
gene prediction from see gene prediction
preliminary examination, 318–22, 319FD
splitting, 319 genome sequencing, 71
multiple genomes, 376B shotgun procedure, 376B genomic imprinting, 7 genomics
functional, 600
role in systems biology, 668 structural, 569
GenScan program, 334
comparative results, 331T, 332T, 336 exon detection, 390
promoter detection, 385, 385F splice site prediction, 394, 395F,
396
transcription stop signal detection, 389
translation start site detection, 389 use of gene models, 398–9, 401F use of homology searches, 397 GenTHREADER, 532–3, 534–5, 535F,
536F
GEPASI, 691–2, 691F
GES (Goldman, Engelman and Steitz) hydrophobicity scale, 438, 475, 477T
Gibbs program, 215–17 Gleevec®, 593
GLIMMER program, 323, 371–2 global alignments, 88–9, 89F
large genome sequences, 352F, 353 optimal, 128, 129–35, 129F, 130F,
131F
score significance, 154
time saving methods of deriving, 139–41, 139F, 140F
global–local dynamic programming, 533
globular proteins, 41
length distributions of secondary structures, 467, 468F
secondary structure prediction, 509
secondary structures, 463 gluconeogenesis pathway, 348,
349–50F
glycolytic pathway, 671, 672F E. coli, 673F
interactions, 673F modularity, 686F, 687F
glycosylphosphatidylinositol (GPI) anchors, 513–14, 513F Godzik, Adam, 491
Gojobori, Takashi, 240B GOLD program, 591–2, 592F
GOR methods, 414, 422–5, 425F, 472–3
accuracy, 422, 423, 424T, 484
derivation, 480–4, 482F
version III, 483, 484F
version IV, 423–5, 427F, 483
version V, 423–5, 425–6, 426F, 483
Gotoh, Osamu, 206
GPI-SOM method, 513–14, 513F G-protein-coupled receptors, 436,
436B
GrailEXP program, 331T, 332T, 334–5, 336
Grail program, 323, 386, 387F, 389, 399
greedy alignment methods, 199 greedy permutation encoding method,
646–7
Greek Key structure, 40B GRID program, 591 GRIN program, 591 Grishin, Nick, 466
growth factors, 616–17, 617F guanine (G), 6, 6F
guide tree, 90, 199–200 construction, 204–6, 205F
multiple alignment from, 206, 206F pattern discovery, 214
Guigo, Roderic, 365–6B, 392B
Gumbel extreme-value distribution see extreme-value distribution
H
HβP method, 491–2, 492F, 493F
Haemophilus influenzae, 371
hairpins, 36–7
harmonic approximation, 526, 702–3, 702F
hashing, 95
theoretical basis, 143–6 whole genome sequences, 158
heart
cellular modeling, 685T modeling of function, 677, 678F heat shock response, E. coli, 680, 680F helical wheels, 439F, 440–1, 448 helices, 435
see also 3₁₀-helices; α-helices; π-helices; transmembrane helices
helix tails, 441
hemagglutinin, 34, 486, 486F hemoglobin, 43, 43F
Henikoff, Steven and Jorja, 122, 171F heptads, 451, 451F, 510
Hessian, 714–15
hexamers (hexanucleotides) see dicodons
HHsearch, 195F, 196
hidden layers, 431, 431F, 494, 499 hidden Markov models (HMMs), 166,
179, 179FD
with duration, or explicit state duration, 374–6
EcoParse gene model, 375F, 376–7 exon prediction, 328, 332
GAZE gene model, 402F
GeneMark.hmm algorithm, 374–6, 374F
genome annotation, 359 GenScan gene model, 399, 401F multiple sequence alignments, 200,
203–4
profile see profile hidden Markov models
secondary structure prediction, 504–10, 506FD