Appendix C: Function Optimization - Understanding Bioinformatics

Full Search Methods 710

Dynamic programming and branch-and-bound 710

Local Optimization 710

The downhill simplex method 711

The steepest descent method 711

The conjugate gradient method 714

Methods using second derivatives 714

Thermodynamic Simulation and Global Optimization 715

Monte Carlo and genetic algorithms 716

Molecular dynamics 718

Simulated annealing 719

Summary 719

A

A (accepted point mutation matrix), 120

AACC, 214–15, 214F

AAINDEX, 84

AAindex, 476

AAT program, 331T, 332T, 335, 336

ab initio approach, modeling protein structure, 522, 523B

accepted mutations, 84

accepted point mutation matrix (A), 120

acceptor splice sites, 18F, 380F, 392

acetolactate synthase (ALS) family, 259B, 262

activators, 16–17

adaptive systems, 667–8

additive trees, 228–9, 229F, 230

adenosine (A), 6, 6F

affine gap penalty, 127, 128, 133–4, 139

Affymetrix GeneChip® arrays, 602

Akaike information criterion (AIC), 253–5

ALDH10 gene, 324–5 annotation, 351–2 exon prediction

accuracy, 345, 345–6

different programs, 331–2, 331T, 333–4, 334F, 335, 336

experimental results compared, 327, 328F

using related organisms, 336–7 gene structure, 327B

interspecies comparisons, 353, 353F, 354F

pathway approach to identifying, 348, 349–50F

promoter prediction, 341, 341T start codon, 327, 330F

alignment, sequence see sequence alignment

Alix, Alain, 475

all α-fold proteins, 421F, 422F, 573F, 574

Note: Entries which are simply page numbers refer to the main text. Other entries have the following abbreviations immediately after the page number: B, box; F, figure; FD, flow diagram; MM, mind map; T, table.

INDEX

all β-fold proteins, 421F, 422F, 573F, 574

alternative splicing, 19, 380–1 Alu elements, 337B

Alzheimer’s disease, 491 AMAS program, 93 AMBER program, 526, 701

amino acid(s) (residues), 11, 27–33 chemical structure, 28F

conservation, to identify binding sites, 586–7, 587F

conservation values (Zpred), 426, 427F, 428F, 429T

hydrophobicity scales, 437–8, 450, 475, 477T

peptide bonds, 29–33, 31F

physicochemical properties, 28–9, 28T, 30F

amino acid propensities, 37, 472–85, 472FD

see also Chou–Fasman propensities averaged over sequence windows,

476–9

derivation and calculations, 473–6 nearby sequence effects, 479–84,

480F

amino acid sequences, 13, 25, 29 see also protein sequences evolutionary conservation, 38 short segments with structural

correlations, 487–8, 487F amino acid side chains, 28F

modeling, 547–8, 548F, 558–9, 561 physicochemical properties, 28–9,

30F

torsion angles (χ₁, χ₂, etc.), 547, 548F

amino (N) terminus, 29 amphipathic helix, 439–41 amyloidogenic proteins, 486, 487,

491–2, 492F, 493F analogous enzymes, 244, 244F analysis of covariance (ANCOVA), 659 analysis of variance (ANOVA), 659 ancestral states, 226

anchor points, 546, 546F Anfinsen, Christian, 412, 412F annotation, 357

automated, 64–5 database, 53

data errors or omissions, 64 gene, 348–52

genome see genome annotation manual, 65

ANOLEA program, 550–1, 551T antibiotic synthesis, 643B antibodies, 381, 555B

modeling, 555–6B anticoding strand, 11 anticodons, 13–14, 14F

antigen-binding site, 555–6B antigens, 555B

antisense strand, 11 apoptotic pathway, 681F

approximate correlation coefficient (AC), 366B

Arabidopsis thaliana, 328, 330B gene duplications, 241B Rha1 gene prediction, 393F splice sites, 380F, 396 vs rice, 335B

Archaea, 21, 21F

horizontal gene transfer, 246F, 247 sequenced genomes, 324T architecture

database, 45 network, 676, 677F Argos, Patrick, 171 ArrayExpress, 58, 606, 611 ArrayExpress Data Warehouse, 58 arrhythmia, cardiac, modeling, 677,

678F

ATG start codons see start codons atomic charges, 704

atomic mean force potential (AMFP), 551

AUG codon, 13, 19, 367

AU (approximately unbiased) method, 309

average conditional probability (ACP), 366B

B

backbone (protein), 29, 32 models, 39, 39F

back-propagation method, 497B backward algorithm, 190–1 bacteria, 21, 21F

see also Escherichia coli; prokaryotes 16S RNA, 249

horizontal gene transfer, 246F, 247 sequenced genomes, 324T balanced training, 498B Baldi, Pierre, 191 BAliBase, 92, 93F

balloting probabilities, 501 Barton, Geoff, 206

base-pairing, 7–9, 8F RNA, 456

wobble, 14 bases, 5–7, 6F

base sequences see nucleotide sequences

Baum–Welch expectation

maximization algorithm, 191–3 Bayesian information criterion (BIC),

254–5

Bayesian methods, 697–8

dealing with lack of replicates, 657B

phylogenetic tree reconstruction, 250, 251T, 253, 306–7

Bayes’ theorem, 697–8 Benjamini, Yoav, 659

Berkeley Drosophila Genome Project (BDGP), 340, 341T

Betaturns method, 503 biased mutation pressure, 239 biclustering, 649–50, 650F

bidirectional recurrent neural network (BRNN), 504, 505F

Bifidobacterium longum, 348, 350F bifurcating (branching) pattern, 226–7 binding sites, protein see protein

binding sites

biochemical pathways see metabolic pathways

BioEdit program, 260 bioinformatics, 3

protein structure and, 37–9, 38FD BioModels Database, 692

Biomolecular Interaction Network (BIND), 58, 671, 673F bistable switches, 688–9, 689F BLAST program, 95–6

algorithmic approximations, 141 comparing nucleotide with protein

sequences, 150–3

Conserved Domain Database (CDD) search, 99F, 100

dealing with low-complexity regions, 101–2

E-values, 98–100, 99F, 156 gapped method, 147–50, 178T GenScan modification using, 397 restriction of matrix coverage, 140 suffix trees, 141–3

use of finite-state automata, 147–50, 147F, 148F versions available, 95–7

whole genome alignments, 157–9 blastx program, 96, 97, 150, 343 BLAT program, 158

BLOCKS database, 58

Dirichlet mixture from, 174–5, 174F

searching, 105–7, 106F

substitution matrices from, 122 BLOSUM matrices, 83F, 84

alignment scoring, 82 derivation, 122–5, 123F, 124F selection, 84, 85

summary score measures, 125F, 126 Blundell, Tom, 532

Boltzmann factor, 706 bond angle energy, 703 bond energy, 702

bonding terms, 525–6, 701, 702–4, 702F

Bonferroni correction, 658

bootstrap analysis, 310B

assessing tree topology, 309–10 comparing tree topologies, 233–4,

233F

comparing two or more trees, 311 parametric, 310B

practical example, 258, 259F bootstrap interior branch test, 310 bottom-up approach, modeling

biological systems, 674–6, 676F bovine spongiform encephalopathy

(BSE), 37B, 101B

branch-and-bound method, 288, 710 branches, 226, 227F

branch length calculations, 293–7, 295F, 296F

assessing reliability, 309–10 parsimony methods, 299–300 branch swapping techniques, 289–91,

290F BRCA2, 78, 79F Brenner, Steven, 480 Brudno, Michael, 209 Bryant, David, 296, 296F BTPRED method, 503

Bucher weight matrix method, 383–4, 384F

Burset, Moises, 365–6B, 392B BVSPS program, 551T

C

C2-like domain, Dictyostelia, 535–7, 536F, 537F

Cα atoms, 28, 28F, 29, 417

analysis of geometry, for prediction algorithms, 466, 466F

torsion angles see under torsion angles

Cα models, 39, 39F

Caenorhabditis elegans, 399

CAFASP (Critical Assessment of Fully Automated Structure Prediction), 419, 554–6

cAMP PK see cyclic AMP-dependent protein kinase

canonical ensemble, 718 Cantor, Charles, 271 capping, RNA, 18

cap signal (initiator signal, Inr), 389 Bucher weight matrix, 383, 384,

384F

GenScan prediction method, 385, 385F

NNPP prediction method, 385–6, 386F

carboxy (C) terminus, 29 Casadio, Rita, 479–80

cascade-correlation neural network, 503–4

CASP (Critical Assessment of Structure Prediction), 419, 554–6

CATH database, 531, 574

causal dependencies, 668 Cbl protein, 575–80, 576F

CCAAT box, detection algorithms, 383, 384–5

CDK10 gene, 324–5 DNA sequence, 326–7B

exon prediction, 329F, 330–1, 332T, 336–7

translation of predicted exons, 344F cDNA (complementary DNA)

exon prediction using, 397 gene-prediction programs using,

334, 335 microarrays, 602 sequence databases, 56 Celera, 376B

cell-division cycle, 688–9

Cell Markup Language (CellML), 692 CellML Model Repository, 692 cellular modeling

heart, 685T

international projects, 668 programs, 691–2, 691F

CE (Combinatorial Extension) method, 576–7, 578F

central dogma, 10–14, 10F, 10FD centroid, 711

centroid method, hierarchical clustering, 640, 641F chaining, 144–6

chameleon sequences, 37B, 488 CHAOS algorithm, 209

CHARMM program, 526, 701 ChiClust program, 617, 618–19 ChiMap program, 618–20, 619F chloroplasts, 22, 292B

Chou, Peter, 472

Chou–Fasman propensities, 414, 415F, 472, 474–6

applied to GOR, 483

calculated values, 474F, 476T measures of accuracy, 424T nearest-neighbor methods, 489 periodic variation, 474–5, 475F transmembrane helices, 475–6,

478F

window sizes, 477–8 chromatography, 600, 623 chromosomes, 10, 21–2

rearrangements, 248 Churchill, Gary, 275 chymosin B, 486, 487F, 490F chymotrypsin, 243–4, 244F CINEMA program, 93 cis conformation, 32, 33F clades, 256

Cladist program, 608–9, 609F

cladogram, 228, 229F ClustalW, 90, 91–2

progressive alignment method, 205 scoring scheme, 201–2, 201F, 202F vs other alignment methods, 92,

93F

cluster analysis, 625–64, 626MM data preparation, 626–33, 627F,

627FD

defining distances, 633–7, 634FD, 636F

evaluating validity of clusters, 650–1

hierarchical see hierarchical clustering

hydrophobic (HCA), 110–11, 110F sequence alignment, 90–1, 90F, 126 clustering methods

see also specific methods comparison between, 643B gene expression microarray data,

606–11, 611F

identifying expression patterns, 637–51, 637FD

phylogenetic tree construction, 276–9, 277FD

protein expression data, 615–17, 617F, 618F

Clusters of Orthologous Groups (COG) database, 103, 243, 245B

CMISS modeling tool, 692 COACH method, 195, 203 coding, 11, 12–13 coding strand, 11–12 codon-pairs see dicodons codons, 13

see also start codons; stop codons frequency of occurrence, 367, 367F genetic code, 12T

mutation rates at different, 238–9, 238F

statistics, use by ORPHEUS, 372–3 co-expressed genes or proteins, 600,

638

COFFEE scoring system, 200, 203, 204F

COG (Clusters of Orthologous Groups) database, 103, 243, 245B

Cohen, Stanley, 643B coiled coils, 413, 435 geometry, 451, 451F

prediction, 451–4, 452FD, 478–9, 510, 510F

COILS program, 452–3, 454F, 478–9 collagen, 452

common evolutionary ancestor, measuring likelihood, 117–19 comparative modeling see homology

modeling

COMPASS method, 195

complementary DNA see cDNA complementary DNA strands, 7–8 complete linkage clustering, 640,

641F complexity

see also low-complexity regions biological systems, 684–5 compositional, 151–2B COMPOSER program, 546, 553–4 compositional complexity, 151–2B concatamers, 605

condensation reaction, 29, 31F condensed trees, 233–4, 233F conditioned reconstruction, 292B confidence index, 432

conformation, 27, 41

see also quaternary conformation energies, 524–9, 524FD

side chains, 547–8

conformational flexible docking, 590

conformers, 547

conjugate gradient method, 528, 713F, 714

conjugate prior, 698 consensus features, 234

consensus method, pattern or motif creation, 105

consensus sequences, 16 consensus trees, 234–5, 234F, 291 Conserved Domain Database (CDD)

search, 99F, 100 CONSOLV program, 593 ConSurf program, 587, 587F

contact capacity potential (CCP), 533, 707–8, 708F

context strings, 371

control circuits, biological systems, 680, 680F

convergent evolution, 74–5, 75B, 243–4, 244F

cooperativity, 701

COPASI modeling tool, 692 Corbin, Kendall, 270

CorePromoter program, 340, 341T, 388, 389F

core promoters, 17, 319 see also promoter prediction detection of binding signals, 339,

381–9

models designed to locate, 383–7 Cost, Scott, 489, 491

covalent bonds, 32B, 33B energetics, 525–6, 701, 702–4 CPHmodels, 554, 563

creatine kinase, 42F, 43

Creutzfeldt–Jakob disease (CJD), 101, 101B

variant (vCJD), 101B Crick, Francis, 7

Critical Assessment of Fully Automated Structure Prediction (CAFASP), 419, 554–6

Critical Assessment of Structure Prediction (CASP), 419, 554–6 Crooks, Gavin, 480

C terminus, 29

Cy5/Cy3 label gene expression microarrays, 602–3, 603F cyclic AMP-dependent protein kinase

(cAMP PK)

inserting gaps, 86, 86F

local and global alignment, 89, 89F

multiple alignment, 91–2, 92F cytochrome c oxidase I, 249 cytosine (C), 6, 6F

D

Dali library, 574

DALI program, 578–9, 579F

Darwinian concept of evolution, 235 DAS (Distributed Annotation System),

348–51, 351F

DAS (dense alignment surface) program, 442F, 444–5, 445F, 447 data, 53

checking for consistency, 63–4 derived (secondary), 53–4 log transformation, 629–30, 630F normalization, 627–31, 628F, 630F primary, 53–4

quality, 61–6, 62FD

database management system (DBMS), 48

Database of Interacting Proteins (DIP), 58

databases, 45–66, 46MM access to, 52

categories (by content), 55–61, 56F

centers, 55

content of entries, 53 data quality, 61–6, 62FD distributed, 48, 52

entry identifiers/version numbers, 65–6

first computerized, 48, 48F flat-file, 47, 47F, 48–9 links between, 52, 53 looking for, 55–61 nonredundancy, 62–3 ontologies, 54–5, 54F relational, 48, 49–50, 49F structure, 46–52, 47FD

for systems biology, 671–2, 675T training and test, 416–17 types, 52–5, 53FD, 55FD updating, 65–6

data classification, 637–8, 638F see also sample classification secondary structure prediction,

510–14, 511FD

data warehouses, 48, 51F, 52 Davies, Graham P., 420B Dayhoff, Margaret, 82, 119 Dayhoff mutation data matrices

(MDMs) see PAM matrices dbEST, 56, 321B

DEAD-box motif, 420B decision trees

detection of functional RNA molecules, 361–3, 363F sample classification, 661 splice site prediction, 394 DEFINE, 417

degenerate (genetic code), 13 degrees of freedom (df ), 654, 655 deletions

accounting for, in sequence alignment, 85–7

alignment scoring schemes, 117, 126–7

homology modeling, 542, 545–6, 545F

threading and, 532, 537 denatured proteins, 42 dendrograms, 636, 636F

gene expression data, 606F, 607, 607F, 608

hierarchical cluster analysis, 639, 640, 640F, 641F

dense alignment surface (DAS) program, 442F, 444–5, 445F, 447 deoxyribonucleic acid see DNA deoxyribonucleotides, 6 deoxyribose, 5–6

DESTRUCT method, 503–4, 505F deterministic finite-state automaton,

147F, 148–50 diagonals

DIALIGN method, 92, 207–9, 208F

FASTA scoring, 95

labeling of matrix, 144F, 145 restricting matrix coverage to,

139–41, 139F, 140F

DIALIGN program, 92, 93F, 207–9 DIAL program, 575, 576, 576F dichotomous (branching) pattern,

226–7

dicodons (hexamers), 328, 367 exon prediction using, 390 gene detection methods using,

368–72

promoter prediction using, 387–8

Dictyostelia, C2-like domain, 535–7, 536F, 537F

dielectric constant, 704

differential equations, modeling biological systems, 680–3, 682F digital differential display (DDD),

605–6, 605F

dihedral angles see torsion angles dihydrofolate reductase (DHFR)

ligand docking, 592, 592F

pocket identification, 585–6, 586F dimers, 43

directed acyclic graph (DAG), 512 directional information, 423, 482 Dirichlet distribution densities, 174 Dirichlet mixture, 174–5, 174F, 176F discriminant analysis

see also linear discriminant analysis;

quadratic discriminant analysis gene prediction, 340, 388, 389F,

396–7

sample classification, 661 secondary structure prediction,

512–13 distance, 81

see also evolutionary distance;

p-distance

definitions for cluster analysis, 633–7, 634FD, 636F

phylogenetic tree reconstruction, 249–50, 251, 251T

distance correction, 236

Distributed Annotation System (DAS), 348–51, 351F

distributed databases, 48, 52 divergent evolution, 75B

divide-and-conquer method (multiple alignment), 91, 91F

vs other alignment methods, 92, 93F

DNA, 4

central dogma concept, 10, 10F, 10FD

complementary see cDNA double helix formation, 7–9, 8F mutations see mutations noncoding see junk DNA strands, 7–9, 8F, 11–12 structure, 5–9, 5FD, 8F transcription see transcription DNA gyrases (GyrA and GyrB), 249 DNA microarrays, 9, 600, 601–4

basic principle, 602

databases see microarray databases data clustering methods, 606–10,

643B

data sharing and integration, 606 gene expression studies, 602–4,

603F

principal component analysis of data, 618

two-color, 602–3, 603F

uses of clustered data, 610–11, 611F

DNA polymerase, 8 DNA repeats, 22B

see also repeat sequences detection, 152B

exclusion from analysis, 319–21 DNA replication, 8, 8F

DNA sequence databases, 56, 57F nomenclature for base uncertainty,

63, 63T DNA sequences

alignment scoring matrices, 124F, 125

detecting homology, 75–6 gene prediction from see gene

prediction

multiple alignments, 92 nucleotide bias, 275–6

phylogenetic tree reconstruction, 249

preliminary examination, 318–22, 319FD

searching with, 97 docking, 587–93, 588FD

accounting for water molecules, 592–3

conformational flexible, 590 fragment, 591

scoring functions, 590 simple strategies, 588

specialized programs, 588–92, 592F DOCK program, 590–1

domains protein, 41

see also multidomain proteins families, 259B

identifying, 574–6, 576F shuffling, 570

taxonomic, 21

donor splice sites, 18F, 380F, 392 dot-plots, 77–8, 77F, 79F

low-complexity regions, 101–2, 102F

double dynamic programming, 534 downhill simplex method, 711, 712F downstream sequences, 16

d-patterns, 217

drawhca program, 110F, 111 drug design, rational, 588, 589B DSC method, 512–13

DSSP program, 417

defining secondary structures, 464–6, 465F, 465T, 467, 467F length distributions of secondary

structures, 467, 468F

nearby sequence effects, 479–80, 480F

duplication

chromosome and genome, 248 gene see gene duplication sequence, 158F, 245

Durbin, Richard, 363 DUST program, 152B

dynamic programming algorithms double, 534

gene model, 399, 402F global–local, 533

pairwise alignment, 86–7 database searching, 95–7 discarding intermediate

calculations, 138B

extension to multiple alignment, 198

function optimization, 710 local and suboptimal, 135–9 optimal global, 129–35

principles and methods, 127–41, 128FD

time methods, 139–41, 139F, 140F

Sankoff algorithm for weighted parsimony, 300–2, 301F threading, 533–4, 534F

E

E-Cell Project, 668

EcoCyc database, 671, 673F EcoKI restriction enzyme, 420B EcoParse gene model, 375F, 376–7 Eddy, Sean, 293, 362, 363

edges see branches Efron, Bradley, 310B

EGFR see epidermal growth factor receptor

eigensamples, 633

Eisenberg hydrophobicity scale, 450

Elber, Ron, 532

electronic resonance, 31

electrostatic interactions, 33B, 704 EMAP modeling tool, 692

emergent properties, 669 emissions, 179, 181–2 eMOTIF, 213–15, 214F

end state, 179, 180, 182–3, 183F energies

free see free energy molecular, 700–8

potential see potential energy energy gradient, 528

energy minima, global, 524, 528–9 energy minimization, 527–8, 528F applied to homology modeling,

548, 559–60 Ensembl, 103, 403

enthalpy see potential energy entropy, 695–7

component of free energy, 525 relative, 125F, 126, 697 Shannon, 695–6

enzymes, 40

analogous, 244, 244F

convergent evolution, 243–4, 244F phylogenetic analysis, 259–63 simulation modeling, 690F, 691–2,

691F

epidermal growth factor receptor (EGFR), 436, 436B

mitogen-activated protein kinase system, 683F

pathway modeling, 681, 682F, 690 epitope, 555B

ergodic systems, 717, 718–19 errors

random, 627–8 systematic, 625, 627–8 type I, 653, 658 types and rates, 657–8 Erwinia carotovora, 262 Escherichia coli, 21, 378

detection of tRNA genes, 320–1, 320F

EcoCyc database, 671, 673F

EcoParse gene model, 375F, 376–7

engineered O_RO_lac promoter, 676, 676F

gene classification by codon usage, 370

GeneMark.hmm gene model, 375–6

genome segment annotations, 322, 323F

heat shock response, 680, 680F length distributions of

coding/noncoding regions, 374F, 375

promoters, 339–40

pyruvate formate-lyase, 467F pyruvate kinase, 480F robustness, 684 start codons, 366F, 367 ESPript, 93

ESTs see expressed sequence tags ESyPred3D, 554, 563, 563T Euclidean distance, 634–5, 636F Eukarya see eukaryotes

eukaryotes, 14, 21–2, 21F control of translation, 19

exon prediction see exon prediction gene detection, 323–37, 323FD, 360 finding correct start codon, 327,

330F

homology searching, 322 with only query sequence,

327–32

with query sequence and gene model, 332–4

sequence features used, 377–81, 378FD

series of steps, 346T

using correct reading frame, 325–7, 325T, 328F, 329F using gene control signals,

381–9, 382FD

using gene model and sequence similarity, 334–6

using genomes of related organisms, 336–7

variety of approaches, 324–5 vs methods used in prokaryotes,

377–9

gene models, 397–9, 398FD gene structure, 319, 325F intron prediction see intron

prediction

mRNA modifications, 18–19 origins, 292B

promoter prediction, 339, 340–2 indefinite nature of results, 341,

341T

online methods, 340–1 theoretical basis, 381–9 regulation of transcription, 15,

17–18, 17F

splice site detection see splice sites, detection

tRNA gene detection, 362–3 Eukaryotic Promoter Database (EPD),

339, 340

European Bioinformatics Institute (EMBL-EBI), 52, 55, 606 databases, 55–6, 60 E-values, 98

cut-off thresholds, 98–100, 99F, 101F

PSSM construction, 176 statistical significance, 156 EVA program, 551T

evolution, 5, 20–3, 20FD aiding sequence analysis, 38 basic concepts of molecular,

235–48, 235FD

convergent, 74–5, 75B, 243–4, 244F Darwinian concept, 235

divergent, 75B gene level, 239–47 genome level, 247–8

minimum see minimum evolution nucleotide level, 236–9

evolutionary clustering algorithms, 646–7, 646F

evolutionary distance, 81, 199, 224–5 see also p-distance

additive phylogenetic trees, 228, 229F

calculation, 268–76, 269F evaluating tree topologies using,

293–7

PAM matrices and, 84 sources of errors, 277

tree construction, 251–2, 276–9, 277FD

evolutionary history

phylogenetic trees see phylogenetic trees

recovering, 223–64, 224MM evolutionary models

practical application, 251–5, 253T selection of appropriate, 253–5,

254F, 256T

sequence alignment, 117–19 theoretical basis, 268–76 time-reversible, 302

evolutionary trace method, identifying binding sites, 586–7, 587F exclusive classification, 637–8, 638F exon prediction, 319, 323–37

assessing accuracy, 343–6, 343F, 344F, 392B

with only query sequence, 327–32

with query sequence and gene model, 332–4

theoretical basis, 379–81, 389–97, 391FD

using correct reading frame, 325–7, 325T, 328F, 329F, 391–2

using gene model and sequence similarity, 334–6

using general sequence properties, 390–2

using genomes of related organisms, 336–7

using homology searches, 397 variety of approaches, 324–5 exons, 18, 18F, 19

initial and terminal, detection, 390, 396–7

length distributions, 379, 379F translating predicted, 343, 344F use of term, 379–80

ExPASy program, 345, 412, 620 expectation maximization (EM), 191,

216

expectation values see E-values expected number of offspring (EO),

209

expected score, 119, 126 see also E-values

explicit state duration hidden Markov model (HMM), 374

expressed (genes), 11 see also gene expression

expressed sequence tags (ESTs), 321B databases, 56, 103

digital differential display (DDD), 605–6, 605F

exon prediction using, 397

gene-prediction methods using, 334–5

expression level ratios, 628–30, 629F, 630F

in different samples, 652

log transformation, 629–30, 630F eXtensible Markup Language (XML),

50–1

external nodes, 226, 227F

extracellular matrix (ECM), modeling tumor invasion, 677, 677F Extreme Pathways, 678

extreme-value distribution, 97–8, 155–6, 155F

extrinsic classification, 638

extrinsic gene detection methods, 361, 368FD

eye, gene expression patterns, 607F, 608

F

false discovery error rate (FDR), 658, 659

false negatives

in gene prediction, 365B in sequence analysis, 212 false positives

in gene prediction, 365B in sequence analysis, 212 statistical tests, 653

families, protein see protein families family-wise error rate (FWER), 658,

659

Fano definition of mutual information, 481

Fasman, Gerald, 472 FASTA program, 95

algorithmic approximations, 141 chaining, 144–6

comparing nucleotide with protein sequences, 150–3

database searching method, 143, 144–6, 145F

E-values, 98, 100, 101F, 156 restriction of matrix coverage, 140 versions available, 95–6, 96T whole genome alignments, 157–9 fast Fourier transform (FFT), 206 FATCAT program, 579–80, 580F feedback control, 680, 680F feedforward control, 680, 680F Felsenstein, Joseph, 253, 275

Felsenstein 81 (F81) model, 253, 253T, 254F, 256T

Felsenstein zone (long-branch attraction), 292, 308–9, 309F Ferrell, J.E., 689F

FGENESH program, 332, 333–4, 334F comparative results, 331T, 332T,

333F

rice genome prediction, 335B

fibrin, 451–2

fibrous proteins, 41, 435 fields (database), 46–7

fingerprints, multiple motif, 109 finite-state automata (FSA), 147–50,

147F, 148F

vs hidden Markov models, 147, 179, 180–1

FirstEF, 332, 396–7

Fitch algorithm see post-order traversal Fitch–Margoliash method, 250, 251T

evaluating tree topologies, 293, 297

generating single trees, 279–80, 280F, 281F

vs neighbor-joining, 282, 284F, 285 fitness, 235

evolutionary clustering, 646–7, 646F

flavin adenine dinucleotide (FAD), 259B, 260, 261F, 262

flavodoxin family, 573F Fletcher–Reeves formula, 714 Flicker program, 614, 620, 620F Flux Balance Analysis (FBA), 678 FoldIndex method, 513

folding, protein see protein folding folding funnel, 525

fold recognition see threading folds, protein see protein folds force fields, 522, 524–9, 701–5

additive, 701 class I and II, 702 nonadditive, 701 forward algorithm, 190

fractional alignment difference, 269 frameshift, 150

Franklin, Rosalind, 7, 7F free energy

folded proteins, 41–2

RNA secondary structures, 456, 457–8

surface, molecular systems, 525, 525F

free insertion modules (FIMs), 184–5 fructose-1,6-bisphosphate aldolases

(FBPAs), 569F, 570, 570F FSSP database, 574, 578–9 Fuchs, Patrick, 475

FUGUE program, 532, 535–6, 536F fully resolved trees, 227

function (protein and gene), 40–1 see also structure–function

relationships

conservation, 568–74, 568FD evolution, 242, 243–4 genome annotation, 400–3 orthologs, 239, 243 patterns and, 109–11

phylogenetic trees for predicting, 262

protein folding and, 40–1, 41F using orthologs to predict, 245 functional homology, 569–70, 569F,

570F

function optimization see optimization, function FunSiteP algorithm, 340, 341, 341T fusion

gene, 72 genome, 292B

G

Gamma distance (correction), 239, 269F, 270

Gamma distribution (Γ), 269F, 270

evolutionary model variation, 253T, 254F

gap extension penalty (GEP), 85, 127 gap insertion operator, 210–11, 211F gap opening penalty (GOP), 127, 202,

202F

gap penalties, 85–6, 87, 126–7 global alignments, 131F, 132–5,

132F, 134F

local alignments, 137 manual adjustment, 93

multiple alignments, 202, 205, 206 position-specific scoring matrices,

170, 177

suboptimal alignments, 137F, 139 gaps, 74

inserting, 85–7

in multiple alignments, 204, 205F scoring, 126–7

Garnier, J, 422

Gaussian distributions see normal distributions

GAZE program, 399, 402F

GC box, detection algorithms, 383, 384–5

GC content

bacterial genomes, 238F, 239 evolutionary models and, 273 promoter prediction using, 386,

387F

regions of different (isochores), 275, 378

GenBank, 55–6, 102–3 flat-file format, 47, 47F sample extract, 57F gene(s), 5, 10–11

evolution, 239–47

families see protein families function see function fusion, 72

nested, 399 nonfunctional, 242 overlapping, 12, 12F, 360 prokaryotic vs eukaryotic, 377–9

structure and control, 14–20, 15FD, 318–19

structure in eukaryotes, 319, 324 GeneBee program, 457F, 458

GeneBuilder program, 331T, 332T, 335, 336

GeneCluster2 program, 608 gene duplication, 73, 239–42, 242F

acetolactate synthase (ALS), 262, 263F

effects on phylogenetic analyses, 245

identified from synonymous mutations, 241B

phylogenetic trees, 226, 231F structure–function relationships,

570

use for rooting trees, 292–3 gene expression, 11

co-expression, 600 databases, 58

digital differential display (DDD), 605–6, 605F

microarrays, 602–4, 603F see also DNA microarrays patterns, 638, 639F

SAGE method, 604–5, 604F sample classification, 659–62,

660FD

uses of clustered data, 610–11, 611F

gene expression analysis, 599–600, 600MM, 601–11, 601FD clustering methods see clustering

methods

data preparation for, 626–33, 627F, 627FD

statistics, 652–9 gene loss, 242–3, 243F

effects on phylogenetic analyses, 245

GeneMark algorithm, 328–9, 368–70 comparative results, 331–2, 331T,

332T

GeneMark.hmm algorithm, 373–6, 374F

gene models, eukaryotic, 397–9, 398FD

Gene Ontology, 54, 348 gene ontology

evaluating validity of clusters, 651 genome annotation, 348–52, 402 gene prediction (detection), 317–46,

318MM

assessing accuracy, 342–6, 342FD at exon level, 343, 344F, 392B at nucleotide level, 343, 343F,

365–6B

at protein level, 343–6, 345F eukaryotes see under eukaryotes

evaluation and reevaluation of methods, 405

exon prediction see exon prediction further analysis, 399–405, 400FD intrinsic and extrinsic methods,

361, 368FD

intron prediction see intron prediction

potential for errors, 65

preliminary steps, 318–22, 319FD prokaryotes see under prokaryotes promoter region, 338–42, 381–9 splice site detection see splice sites,

detection

theoretical basis, 357–99, 358MM general time-reversible model (GTR or

REV), 253T, 255, 262

general transcription initiation factors, 17

see also transcription factors generation, 209

GeneSplicer program, 394–5 genetic algorithms

cluster analysis, 646–7, 646F docking, 591–2, 592F

function optimization, 709, 716–18, 716F

multiple sequence alignment (SAGA), 209–11, 210F, 211F genetic code, 11, 12–13, 12T

degeneracy, 13

genetic distance, 224–5, 232F see also evolutionary distance gene (phylogenetic) trees, 226, 230,

231F

combined with species trees, 243, 244F

reconstruction example, 259–63, 261F, 263F

GeneWalker program, 331T, 332T, 335–6

GeneWise program, 345–6 Genie program, 329F, 386 Geno3D program, 554, 563, 563T genome(s), 4, 10

comparisons see genome sequence alignments

completely sequenced, 71 databases, 56, 103 evolution, 247–8 fusion, 292B

identifying features, 317–54, 318MM

known prokaryotic, 324T problems of defining, 23B genome annotation, 65, 399–405

see also gene prediction comparing genomes to check

accuracy, 353–4, 353F, 354F, 403–5, 403F, 404F

E. coli segment, 322, 323F evaluation and reevaluation, 405 functional, 400–3

pathway information aiding, 348, 349–50F

pipeline approach, 319

practical aspects, 346–52, 347FD quality of information used, 403 role of gene ontology, 348–52, 402 theoretical basis, 357–9, 358MM Genome Browser, 352, 352F GenomeNet, 84

GenomeScan, 397

genome sequence alignments to verify annotation, 353–4, 353F,

354F, 403–5, 403F, 404F whole genomes, 156–9, 157FD genome sequences

excluding noncoding regions, 319–21

gene prediction from see gene prediction

preliminary examination, 318–22, 319FD

splitting, 319 genome sequencing, 71

multiple genomes, 376B shotgun procedure, 376B genomic imprinting, 7 genomics

functional, 600

role in systems biology, 668 structural, 569

GenScan program, 334

comparative results, 331T, 332T, 336 exon detection, 390

promoter detection, 385, 385F splice site prediction, 394, 395F,

396

transcription stop signal detection, 389

translation start site detection, 389 use of gene models, 398–9, 401F use of homology searches, 397 GenTHREADER, 532–3, 534–5, 535F,

536F

GEPASI, 691–2, 691F

GES (Goldman, Engelman and Steitz) hydrophobicity scale, 438, 475, 477T

Gibbs program, 215–17 Gleevec®, 593

GLIMMER program, 323, 371–2 global alignments, 88–9, 89F

large genome sequences, 352F, 353 optimal, 128, 129–35, 129F, 130F,

131F

score significance, 154

time saving methods of deriving, 139–41, 139F, 140F

global–local dynamic programming, 533

globular proteins, 41

length distributions of secondary structures, 467, 468F

secondary structure prediction, 509

secondary structures, 463 gluconeogenesis pathway, 348,

349–50F

glycolytic pathway, 671, 672F E. coli, 673F

interactions, 673F modularity, 686F, 687F

glycosylphosphatidylinositol (GPI) anchors, 513–14, 513F Godzik, Adam, 491

Gojobori, Takashi, 240B GOLD program, 591–2, 592F

GOR methods, 414, 422–5, 425F, 472–3 accuracy, 422, 423, 424T, 484 derivation, 480–4, 482F version III, 483, 484F version IV, 423–5, 427F, 483 version V, 423–5, 425–6, 426F, 483 Gotoh, Osamu, 206

GPI-SOM method, 513–14, 513F G-protein-coupled receptors, 436,

436B

GrailEXP program, 331T, 332T, 334–5, 336

Grail program, 323, 386, 387F, 389, 399

greedy alignment methods, 199 greedy permutation encoding method,

646–7

Greek Key structure, 40B GRID program, 591 GRIN program, 591 Grishin, Nick, 466

growth factors, 616–17, 617F guanine (G), 6, 6F

guide tree, 90, 199–200 construction, 204–6, 205F

multiple alignment from, 206, 206F pattern discovery, 214

Guigo, Roderic, 365–6B, 392B

Gumbel extreme-value distribution see extreme-value distribution

H

HβP method, 491–2, 492F, 493F

Haemophilus influenzae, 371

hairpins, 36–7

harmonic approximation, 526, 702–3, 702F

hashing, 95

theoretical basis, 143–6 whole genome sequences, 158

heart

cellular modeling, 685T modeling of function, 677, 678F heat shock response, E. coli, 680, 680F helical wheels, 439F, 440–1, 448 helices, 435

see also 3₁₀-helices; α-helices; π-helices; transmembrane helices

helix tails, 441

hemagglutinin, 34, 486, 486F hemoglobin, 43, 43F

Henikoff, Steven and Jorja, 122, 171F heptads, 451, 451F, 510

Hessian, 714–15

hexamers (hexanucleotides) see dicodons

HHsearch, 195F, 196

hidden layers, 431, 431F, 494, 499 hidden Markov models (HMMs), 166,

179, 179FD

with duration, or explicit state duration, 374–6

EcoParse gene model, 375F, 376–7 exon prediction, 328, 332

GAZE gene model, 402F

GeneMark.hmm algorithm, 374–6, 374F

genome annotation, 359 GenScan gene model, 399, 401F multiple sequence alignments, 200,

203–4

profile see profile hidden Markov models

secondary structure prediction, 504–10, 506FD
